What patterns exist in the available telecom company revenue data?
Context
The purpose of this analysis is to identify a model that fits the available historical revenue data so that revenue can be extrapolated into the future. The provided set is a time series of daily revenue, in millions USD, spanning two years of company operation.
Data Preparation
The telecommunications revenue data (WGU, 2022) is loaded into R from a copy on local storage, along with packages that are useful for the analysis. The ggplot2 and ggfortify packages support time series visualization, while the ARIMA modeling functionality comes from the astsa package. The data takes the form of a straightforward dataframe of 2 columns (day and dollar revenue) and 731 observations, which is converted into a base R ts object and plotted.
suppressMessages(library(astsa)) #acf2(), sarima(), and sarima.for() for ARIMA modeling
suppressMessages(library(ggplot2)) #visualization essentials
suppressMessages(library(ggfortify)) #autoplot() support for ts objects
#import telecommunications daily revenue
csv <- read.csv("/Volumes/EXT128/WGU/Advanced Analytics/time_series/teleco_time_series.csv")
#convert the revenue column to a base R time series object
data <- ts(csv[2], frequency = 1)
#plot the initial dataset
autoplot(data, main = "Telecom Revenue", xlab = "Days in Operation", ylab = "Revenue, Millions USD")
To analyze the efficacy of a proposed model, we keep the initial 90% of daily observations for training and test on the remaining 10%. The result of this split is visualized below:
#train/test split
train_proportion <- 0.90
train_size <- floor(train_proportion*nrow(data))
train <- window(data,start=1,end=train_size)
test <- window(data,start=train_size+1,end=nrow(data))
#plot split data
autoplot(cbind(train, test), facets = FALSE, main = "Train/Test Split Revenue",
         xlab = "Days in Operation", ylab = "Revenue, Millions USD") +
  labs(colour = "Group") +
  theme(legend.position = c(0.9, 0.175))
Model Selection
The full and partial autocorrelation function plots are used to investigate the properties of the daily time series and decide on parameters. An ACF uses all of the properties of the time series, including trend and seasonality, to describe the strength of the relationship between the current observation and prior observations (lags). In other words, at lag k we quantify the correlation between observations that are separated by k time steps. There is an obvious trend in the raw data, so we expect a strong pattern in the ACF.
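As an illustrative sketch of this definition (not part of the original analysis), the lag-k autocorrelation can be approximated by correlating the series with a lagged copy of itself; note that acf() centers on the overall mean and divides by n rather than n - k, so its values differ slightly:
#sketch: lag-k autocorrelation by hand, approximating acf(x, plot = FALSE)$acf[k + 1]
x <- as.numeric(train)
k <- 1
cor(x[-(1:k)], x[1:(length(x) - k)])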
The partial variant, PACF, performs a similar calculation to the ACF in that it returns the correlation between observations that are k time steps apart. However, the PACF controls for the effect of intermediate lags. The effect is to report the correlation of lag k that is not explained by the shorter lags, and to reveal relationships that are obscured by the ACF.
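To make "controls for the effect of intermediate lags" concrete, here is a minimal sketch (an illustration, not part of the original analysis): the lag-2 partial autocorrelation is approximately the coefficient on the lag-2 term when the series is regressed on its first two lags.
#sketch: PACF at lag 2 via regression on lags 1 and 2
x <- as.numeric(train)
n <- length(x)
fit <- lm(x[3:n] ~ x[2:(n - 1)] + x[1:(n - 2)])
coef(fit)[3] #compare with pacf(x, plot = FALSE)$acf[2]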
#plot raw ACF
acf2(train)
This plot reveals the strong correlation between successive measures that we expect from the trending data. First-order differencing is applied to remove the trend, and the autocorrelations are plotted again:
#detrend with first-order differencing
train_diff <- diff(train)
#plot ACF/PACF of the differenced training data
acf2(train_diff)
These plots reveal that, for the differenced data, there is a strong negative correlation at lag 1 and a significant positive correlation at lag 2 before the ACF decays to insignificance. The trend characteristic in the differenced training set is completely removed. The plots suggest parameter values for the ARIMA(p,d,q) model:
‘p’ is the order of the autoregressive term, which refers to the number of lags to utilize in predicting the next data point. The PACF contains a significant spike at lag 1, so we assign p = 1.
‘d’ is the order of differencing used to make the time series stationary. The raw data was differenced once to detrend it, so d = 1 (see the stationarity check sketched after this list).
‘q’ is the order of the moving average term, which relates the prediction to the error terms of previous observations. The first 2 lags contain significant spikes in the ACF, and thus q = 2.
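As a supplementary check on the choice of d (an assumption beyond the original write-up; it requires the tseries package), an augmented Dickey-Fuller test can be applied to the raw and differenced training data. The trending raw series should fail to reject non-stationarity, while the differenced series should reject it:
#optional stationarity check; assumes the tseries package is installed
library(tseries)
adf.test(as.numeric(train)) #raw series: expect a large p-value
adf.test(as.numeric(diff(train))) #differenced series: expect p < 0.05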
The next section features the code for implementing an ARIMA(1,1,2) model with the astsa package. The decomposition of the series consists of a trend component and an error component, both of which are plotted below.
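For an explicit trend-plus-error split, rather than the differencing-based view used below, one hedged option is a centered moving average; the 30-day window here is an arbitrary assumption, not part of the original analysis:
#sketch: explicit trend + error decomposition via a centered moving average
trend <- stats::filter(train, rep(1/30, 30), sides = 2) #30-day window is an assumption
error <- train - trend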
#create data frame from differenced time series
td_num <- data.frame(day = 1:length(as.numeric(train_diff)), measure = as.numeric(train_diff))
#plot differenced series with a zero reference line
ggplot(td_num, aes(x = day, y = measure)) +
  geom_hline(yintercept = 0, size = 2) +
  geom_line(alpha = 0.5) +
  ggtitle("Differenced Training Series") +
  xlab("Day") + ylab("Difference")
#plot error distribution
ggplot() + aes(td_num$measure) +
  geom_histogram(bins = 30, color = "black", fill = "white") +
  ggtitle("Error Distribution") +
  xlab(NULL) + ylab(NULL)
The errors around the horizontal reference line are normally distributed, and the lack of a trend in the first-order differenced training data is also visible. It is now prudent to fit an ARIMA model using the insights gained in this section.
Model Implementation
Consider an ARIMA(1,1,2) model for this dataset. The model evaluation plots that are automatically generated by the sarima() call are shown.
#model object for evaluation
mod <- sarima(train,1,1,2)
mod
The residuals of the fitted model are not autocorrelated and are normally distributed. Also, the Ljung-Box statistic p-values are all well above the 0.05 significance level, which suggests an acceptable model.
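The same conclusion can be checked by hand with base R's Box.test() on the residuals of the fitted model; this sketch assumes the sarima() return value exposes the fit as mod$fit, with fitdf = p + q = 3 adjusting the degrees of freedom and the lag choice of 10 being arbitrary:
#manual Ljung-Box check on the model residuals
Box.test(resid(mod$fit), lag = 10, type = "Ljung-Box", fitdf = 3)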
Forecasting and Evaluation
Forecast ahead for the length of the testing interval with the ARIMA(1,1,2) model. Only the plot output is shown below.
#forecast ahead for 74 days
forc <- sarima.for(train, n.ahead = length(test),1,1,2,main="Forecast 74 Days",xlab="Days in Operation")
forc
The forecast length was chosen to match the length of the test series so that it could be evaluated for accuracy. This spans day 658 to day 731. Taking the point forecast plus or minus 1.96 standard errors, the 95% prediction interval for day 731, the final day, is from
forc$pred[length(forc$pred)]-1.96*forc$se[length(forc$se)]
## [1] 8.848428
to
forc$pred[length(forc$pred)]+1.96*forc$se[length(forc$se)]
## [1] 20.93031
million dollars. The mean absolute percentage error (MAPE) metric is used to assess the errors in the forecast for all 74 predictions.
#create comparison vectors
pred_vector <- forc$pred
train_vector <- as.numeric(train)
test_vector <- as.numeric(test)
#MAPE
mean(abs((test_vector-pred_vector)/test_vector)) * 100
## [1] 9.204432
The mean absolute percentage error across all 74 predictions was therefore 9.2%. Here these predictions are compared to the testing set in the context of the whole available series:
#plot prediction data
sarima.for(train, n.ahead = length(test), 1, 1, 2, plot.all = TRUE,
           main = "Revenue Series with Predictions on Test Data",
           ylab = "Revenue", xlab = "Days in Operation")
#add true values
lines(test)
The red line indicates the predicted values for the final 10% of observations, whereas the black line shows the raw series.
References
WGU Curriculum. (2022). Telecommunications Company Revenue Time Series. Western Governors University. https://access.wgu.edu/ASP3/aap/content/s8dj4edjf84hd8fcvn3d.zip. Accessed Oct. 2, 2022.
Prabhakaran, Selva. (2021, August 22). ARIMA Model – Complete Guide to Time Series Forecasting in Python. MachineLearningPlus. https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/. Accessed Oct. 2, 2022.