What patterns exist in the available telecom company revenue data?
Context
The purpose of this analysis is to identify a model that fits the available historical revenue data so that revenue can be extrapolated into the future. The provided set is a time series of daily revenue, in millions USD, spanning two years of company operation.
Data Preparation
The telecommunications revenue data (WGU, 2022) is loaded into R from a copy on local storage, along with packages that are useful for the analysis. The ggplot2 and ggfortify packages support time series visualization, while the ARIMA modeling functionality comes from the astsa package. The data takes the form of a straightforward dataframe of 2 columns (day and dollar revenue) and 731 observations, which is converted into a base R ts object and plotted.
suppressMessages(library(astsa)) #acf2(), sarima(), and sarima.for() for ARIMA modeling
suppressMessages(library(ggplot2)) #visualization essentials
suppressMessages(library(ggfortify)) #autoplot() support for ts objects
#import telecommunications daily revenue
csv <- read.csv("/Volumes/EXT128/WGU/Advanced Analytics/time_series/teleco_time_series.csv")
#convert the revenue column to a base R time series object
data <- ts(csv[2], frequency = 1)
#plot the initial dataset
autoplot(data, main = "Telecom Revenue", xlab = "Days in Operation", ylab = "Revenue, Millions USD")
To analyze the efficacy of a proposed model, we keep the initial 90% of daily observations for training and test on the remaining 10%. The result of this split is visualized below:
#train/test split
train_proportion <- 0.90
train_size <- floor(train_proportion*nrow(data))
train <- window(data,start=1,end=train_size)
test <- window(data,start=train_size+1,end=nrow(data))
#plot split data
autoplot(cbind(train, test), facets = FALSE, main = "Train/Test Split Revenue",
         xlab = "Days in Operation", ylab = "Revenue, Millions USD") +
  labs(colour = "Group") +
  theme(legend.position = c(0.9, 0.175))
Model Selection
The full and partial autocorrelation function plots are used to investigate the properties of the daily time series and decide on parameters. An ACF uses all of the properties of the time series, including trend and seasonality, to describe the strength of the relationship between the current observation and prior observations (lags). In other words, at lag k we quantify the correlation between observations that are separated by k time steps. There is an obvious trend in the raw data, so we expect a strong pattern in the ACF.
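As an illustrative sketch of this definition (not part of the original analysis), the lag-k autocorrelation can be approximated by correlating the series with a lagged copy of itself; note that acf() centers on the overall mean and divides by n rather than n - k, so its values differ slightly:
#sketch: lag-k autocorrelation by hand, approximating acf(x, plot = FALSE)$acf[k + 1]
x <- as.numeric(train)
k <- 1
cor(x[-(1:k)], x[1:(length(x) - k)])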
The partial variant, PACF, performs a similar calculation to the ACF in that it returns the correlation between observations that are k time steps apart. However, the PACF controls for the effect of intermediate lags. The effect is to report the correlation of lag k that is not explained by the shorter lags, and to reveal relationships that are obscured by the ACF.
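To make "controls for the effect of intermediate lags" concrete, here is a minimal sketch (an illustration, not part of the original analysis): the lag-2 partial autocorrelation is approximately the coefficient on the lag-2 term when the series is regressed on its first two lags.
#sketch: PACF at lag 2 via regression on lags 1 and 2
x <- as.numeric(train)
n <- length(x)
fit <- lm(x[3:n] ~ x[2:(n - 1)] + x[1:(n - 2)])
coef(fit)[3] #compare with pacf(x, plot = FALSE)$acf[2]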
#plot raw ACF
acf2(train)
This plot reveals the strong correlation between successive measures that we expect from the trending data. First-order differencing is applied to remove the trend, and the autocorrelations are plotted again:
#detrend with first-order differencing
train_diff <- diff(train)
#plot ACF/PACF of the differenced training data
acf2(train_diff)
These plots reveal that, for the differenced data, there is a strong negative correlation at lag 1 and a significant positive correlation at lag 2 before the ACF decays to insignificance. The trend characteristic in the differenced training set is completely removed. The plots suggest parameter values for the ARIMA(p,d,q) model:
‘p’ is the order of the autoregressive term, which refers to the number of lags to utilize in predicting the next data point. The PACF contains a significant spike at lag 1, so we assign p = 1.
‘d’ is the order of differencing used to make the time series stationary. The raw data was differenced once to detrend it, so d = 1 (see the stationarity check sketched after this list).
‘q’ is the order of the moving average term, which relates the prediction to the error terms of previous observations. The first 2 lags contain significant spikes in the ACF, and thus q = 2.
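As a supplementary check on the choice of d (an assumption beyond the original write-up; it requires the tseries package), an augmented Dickey-Fuller test can be applied to the raw and differenced training data. The trending raw series should fail to reject non-stationarity, while the differenced series should reject it:
#optional stationarity check; assumes the tseries package is installed
library(tseries)
adf.test(as.numeric(train)) #raw series: expect a large p-value
adf.test(as.numeric(diff(train))) #differenced series: expect p < 0.05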
The next section features the code for implementing an ARIMA(1,1,2) model with the astsa package. The decomposition of the series consists of a trend component and an error component, both of which are plotted below.
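For an explicit trend-plus-error split, rather than the differencing-based view used below, one hedged option is a centered moving average; the 30-day window here is an arbitrary assumption, not part of the original analysis:
#sketch: explicit trend + error decomposition via a centered moving average
trend <- stats::filter(train, rep(1/30, 30), sides = 2) #30-day window is an assumption
error <- train - trend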
#create data frame from differenced time series
td_num <- data.frame(day = 1:length(as.numeric(train_diff)), measure = as.numeric(train_diff))
#plot differenced series with a zero reference line
ggplot(td_num, aes(x = day, y = measure)) +
  geom_hline(yintercept = 0, size = 2) +
  geom_line(alpha = 0.5) +
  ggtitle("Differenced Training Series") +
  xlab("Day") + ylab("Difference")
#plot error distribution
ggplot() + aes(td_num$measure) +
  geom_histogram(bins = 30, color = "black", fill = "white") +
  ggtitle("Error Distribution") +
  xlab(NULL) + ylab(NULL)
The errors around the horizontal reference line are normally distributed, and the lack of a trend in the first-order differenced training data is also visible. It is now prudent to fit an ARIMA model using the insights gained in this section.
Model Implementation
Consider an ARIMA(1,1,2) model for this dataset. The model evaluation plots that are automatically generated by the sarima() call are shown.
#model object for evaluation
mod <- sarima(train,1,1,2)
mod
The residuals of the fitted model are not autocorrelated and are normally distributed. Also, the Ljung-Box statistic p-values are all well above the 0.05 significance level, which suggests an acceptable model.
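The same conclusion can be checked by hand with base R's Box.test() on the residuals of the fitted model; this sketch assumes the sarima() return value exposes the fit as mod$fit, with fitdf = p + q = 3 adjusting the degrees of freedom and the lag choice of 10 being arbitrary:
#manual Ljung-Box check on the model residuals
Box.test(resid(mod$fit), lag = 10, type = "Ljung-Box", fitdf = 3)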
Forecasting and Evaluation
Forecast ahead for the length of the testing interval with the ARIMA(1,1,2) model. Only the plot output is shown below.
#forecast ahead for 74 days
forc <- sarima.for(train, n.ahead = length(test),1,1,2,main="Forecast 74 Days",xlab="Days in Operation")
forc
The forecast length was chosen to match the length of the test series so that it could be evaluated for accuracy. This spans day 658 to day 731. Taking the point forecast plus or minus 1.96 standard errors, the 95% prediction interval for day 731, the final day, is from
forc$pred[length(forc$pred)]-1.96*forc$se[length(forc$se)]
## [1] 8.848428
to
forc$pred[length(forc$pred)]+1.96*forc$se[length(forc$se)]
## [1] 20.93031
million dollars. The mean absolute percentage error (MAPE) metric is used to assess the errors in the forecast for all 74 predictions.
#create comparison vectors
pred_vector <- forc$pred
train_vector <- as.numeric(train)
test_vector <- as.numeric(test)
#MAPE
mean(abs((test_vector-pred_vector)/test_vector)) * 100
## [1] 9.204432
The mean absolute percentage error across all 74 predictions was therefore 9.2%. Here these predictions are compared to the testing set in the context of the whole available series:
#plot prediction data
sarima.for(train, n.ahead = length(test), 1, 1, 2, plot.all = TRUE,
           main = "Revenue Series with Predictions on Test Data",
           ylab = "Revenue", xlab = "Days in Operation")
#add true values
lines(test)
The red line indicates the predicted values for the final 10% of observations, whereas the black line shows the raw series.
References
WGU Curriculum. (2022). Telecommunications Company Revenue Time Series. Western Governors University. https://access.wgu.edu/ASP3/aap/content/s8dj4edjf84hd8fcvn3d.zip. Accessed Oct. 2, 2022.
Prabhakaran, Selva. (2021, August 22). ARIMA Model – Complete Guide to Time Series Forecasting in Python. MachineLearningPlus. https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/. Accessed Oct. 2, 2022.