R: Revenue Projection || ARIMA Time Series Modeling


What patterns exist in the available telecom company revenue data?

Context

The purpose of this analysis is to identify a model that fits the available historical revenue data, for the purpose of extrapolating revenue into the future. The set provided is a time series that presents revenue in millions USD and the associated day during two years of company operation. 

The principal assumption of most time series analysis problems is that the data set is stationary. This means that the mean and variance are consistent over long periods. Time series that exhibit trends or seasonality are not stationary, but transformation may be performed on the set so that we observe stationary data and perform modeling on it. 

ARIMA(p,d,q) refers to Auto-Regressive Integrated Moving Average, a class of forecasting models that may be applied to data that is or can be made to be stationary. This type of model is one of the most commonly used for econometrics applications, and it combines the following three modeling techniques (Prabhakaran, 2022):

  • Auto-Regressive refers to a type of model that takes into account the correlation between observed data points and the ones prior. These are known as lagged variables.
    • The p parameter in ARIMA(p,d,q) tells the model to explain a given data point as a linear combination of the p past observations. This is the order of the auto-regressive model.
    • The partial auto-correlation function (PACF) plot is useful for determining an optimal value for p
  • Integrated refers to the transformation of the dataset into a stationary form through differencing. The modeling is then done not on the actual time series, but on a series of differences between successive observations.
    • The d parameter refers to the number of times the differencing process is applied to the raw data to make it stationary, also known as the differencing order. 
  • Moving Average is a technique that mitigates the effect of short-term fluctuations (noise) in the time series. The previous errors on observations of the an autoregressive model are used to inform the current observation.
    • The q parameter is the order of the moving average model and represents the number of prior error terms that should be considered.
    • The full auto-correlation function (ACF) plot is useful for determining an optimal value for p.

Autocorrelation refers to the assumption that a data value at one point in time is related to its value at previous point. This underpins the analysis of time series since modeling used for prediction must rely on currently available data in order to forecast. Autocorrelation can be expressed as a value between -1 and 1, where 1 represents strong positive correlation between an index and a lagged observation of that index. The degree of correlation between consecutive time measures or otherwise affects the parameters of an ARIMA model. In a later section the series properties will be investigated, providing justification for the selection of parameters (p,d,q).

Data Preparation

The (WGU, 2022) telecommunications revenue data is initialized in R with a copy on local storage, along with packages that are useful for the analysis. All of the packages used herein are associated with visualizing time series, as the functionality for ARIMA models is included in base R. The project is realized in the form of a straightforward dataframe of 2 columns (day and dollar revenue) and 731 observations, converted into a proprietary R ts object and plotted in a straightforward way.

suppressMessages(library(astsa)) #time series extension for ggplot
suppressMessages(library(ggplot2)) #visualization essentials suppressMessages(library(ggfortify)) #additional plotting features

#import telecommunications daily revenue

csv <- read.csv("/Volumes/EXT128/WGU/Advanced Analytics/time_series/teleco_time_series.csv")

#convert

data<-ts(csv[2], frequency=1) #realize initial dataset autoplot(data,main="Telecom Revenue",xlab="Days in Operation",ylab="Revenue, Millions USD")

To analyze the efficacy of a proposed model, we will keep the initial 90% of daily observations for training and test on the remaining 10%. The result of this split is visualized:

#train/test split
train_proportion <- 0.90
train_size <- floor(train_proportion*nrow(data))
train <- window(data,start=1,end=train_size)
test <- window(data,start=train_size+1,end=nrow(data))
#plot split data
autoplot(cbind(train, test), facets=FALSE,main="Train/Test Split Revenue",xlab="Days in Operation",ylab="Revenue, Millions USD") + labs(colour = "Group") + theme(legend.position = c(0.9, 0.175))

Model Selection

Model Implementation

Consider an ARIMA(1,1,2) for this dataset. The model evaluation plots which are automatically generated from the sarima() call are shown.

#model object for evaluation

mod <- sarima(train,1,1,2)

mod

The residuals of the fitted model are not autocorrelated and are normally distributed. Also, the Ljung-Box statistic p-values are all well above the critical value which suggests an acceptable model. 

Forecasting and Evaluation

Forecast ahead for the length of the testing interval with an ARIMA(1,1,2) model. The plotting output only is shown below. 

#forecast ahead for 74 days

forc <- sarima.for(train, n.ahead = length(test),1,1,2,main="Forecast 74 Days",xlab="Days in Operation")

forc


The forecast length was chosen to match the length of the test series so that it could be evaluated for accuracy. This is from day 658 to 731. The 95% prediction interval for day 731, the final day, is from

forc$pred[length(forc$pred)]-1.96*forc$se[length(forc$se)]
## [1] 8.848428

to

forc$pred[length(forc$pred)]+1.96*forc$se[length(forc$se)]
## [1] 20.93031

million dollars. The mean absolute percentage error (MAPE) metric is used to assess the errors in the forecast for all 74 predictions. 

#create comparison vectors
pred_vector <- forc$pred
train_vector <- as.numeric(train)
test_vector <- as.numeric(test)
#MAPE
mean(abs((test_vector-pred_vector)/test_vector)) * 100
## [1] 9.204432

Then the mean absolute error for all predictions was 9.2%. Here these predictions are compared to the testing set in the context of the whole available series:

#plot prediction data
sarima.for(train, n.ahead = length(test), 1, 1, 2, plot.all=TRUE, main="Revenue Series with Predictions on Test Data",ylab="Revenue",xlab="Days in Operation")
#add true values lines(test)

The red line indicates the predicted values for the final 10% of observations whereas the black line is raw.

Based on an acceptable statistical model, we have visualized with 95% certainty the trajectory of company revenue. Note that within this prediction interval lies the possibility of a reversal in the overall trend.

References 

WGU Curriculum. Telecommunications Company Revenue Time Series. Western Governor's University. 2022. https://access.wgu.edu/ASP3/aap/content/s8dj4edjf84hd8fcvn3d.zip. Accessed Oct. 2, 2022. 

Prabhakaran, Selva. ARIMA Model – Complete Guide to Time Series Forecasting in Python. MachineLearning+/ 22 August, 2021. https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ Accessed Oct. 2, 2022.