Time Series Forecasting
Extended Project
(Shoes Sales)
By - Parijat Dev
1. Executive Summary
2. Data details
Q1 - Read the data as an appropriate Time Series data and plot the data
Q2 - Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
Data Description
Q3 - Split the data into training and test. The test data should start in 1991.
Q4 - Build various exponential smoothing models on the training data and evaluate the model using RMSE on the test data. Other models such as regression, naïve forecast models, simple average models etc. should also be built on the training data and check the performance on the test data using RMSE.
Q5 - Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
Q6 - Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
Q7 - Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this model on the test data using RMSE.
Q8 - Build a table with all the models built along with their corresponding parameters and the respective RMSE values on the test data.
Q9 - Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
Q10 - Comment on the model thus built and report your findings and suggest the measures that the company should be taking for future sales.
1. Executive Summary
You are an analyst at the IJK shoe company and you are expected to forecast the sales of
pairs of shoes for the 12 months following the end of the data. Shoe sales data have been
provided from January 1980 to July 1995.
2. Data details
Figure 1- Shoe Sales Time Series Plot
The data set contains two columns: the first column shows the month and year, and the
second column records the corresponding Production Quantity.
Q1 Read the data as an appropriate Time Series
data and plot the data
I have imported the data series. As we can observe, each entry has a YearMonth value with it,
which is not really a data point but an index for the sales entry. So in reality the dataset has a
single column that contains the quantity of shoes sold in that particular month. While reading
the dataset I passed arguments so that the first column, which is the date column, is parsed,
and so that the result is treated as a one-column series through squeeze. The data starts in
January 1980 and runs until July 1995, so there are 187 entries in total.
I have also loaded the dataset with no arguments (and hence without parsing the dates). In
that case a time stamp has to be provided manually, so I removed the YearMonth variable and
added a time stamp to the dataset myself. I have plotted the time series below.
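The two ways of loading the data described above can be sketched as follows. This is a minimal illustration, not the exact code used in the project: the inline CSV text, the column name `Shoe_Sales`, the date format and the values are all assumptions made so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file: column names, date format
# and values are made up for illustration only.
csv_text = (
    "YearMonth,Shoe_Sales\n"
    "1980-01-01,85\n"
    "1980-02-01,89\n"
    "1980-03-01,109\n"
)

# Option 1: parse the first column as dates, use it as the index,
# then squeeze the one-column frame into a Series.
series = pd.read_csv(
    io.StringIO(csv_text), parse_dates=True, index_col=0
).squeeze("columns")

# Option 2: read with no arguments, then drop YearMonth and attach a
# month-start time stamp by hand.
raw = pd.read_csv(io.StringIO(csv_text))
raw.index = pd.date_range(start="1980-01-01", periods=len(raw), freq="MS")
manual = raw.drop(columns="YearMonth").squeeze("columns")

# For the full dataset, 187 monthly stamps starting in January 1980
# end in July 1995.
full_index = pd.date_range(start="1980-01-01", periods=187, freq="MS")
print(full_index[0], full_index[-1])
```

Both options end with the same one-column series indexed by month, which is the shape the later steps rely on.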
Figure 2- Shoe Sales Time Series Plot
As we can observe from the above plot, shoe sales moved in an upward direction. A
seasonality element is also visible in the graph. We will explore the trend and seasonality
further during decomposition, where we will get a more detailed view of these two
factors.
Q2- Perform appropriate Exploratory Data Analysis
to understand the data and also perform
decomposition.
Data Description
Figure 3 - Description of the Dataset
As we can see from the above, the shoe sales time series data look skewed.
The standard deviation is high, which is consistent with the significant difference between
the Min and Max values. The difference between the mean and the median also points to
skewness. As mentioned earlier, there are 187 records in total in the
dataset.
Yearly Boxplot
Figure 4 - Yearly Boxplot of the Dataset
The data shows an upward and then a downward trend: shoe sales grew significantly until
1987, and from 1987 onwards the growth stalled and the sales numbers started to decline.
The highest shoe sales occurred in 1987. The year 1984
is the year with the lowest variation in sales.
Monthly Boxplot
Figure 4 - Monthly Boxplot of the Dataset
As we can observe, more sales happen in the second half of the year than in
the first half. The highest sales occur in December.
Monthly Sales Across Years
Figure 5 - Monthly sales line chart across Years
December and November are the months that deliver the maximum sales throughout the years.
The above chart makes the seasonality element clearly visible.
Decomposition
Figure 6 - Multiplicative Decomposition of the data
From the above decomposition chart it is clear that both trend and seasonality are present in
the dataset. The residuals are minimal in the multiplicative decomposition.
Q3- Split the data into training and test. The test
data should start in 1991.
I have split the time series dataset into Train and Test datasets below. The question
specifies that the Test data should start in 1991.
Figure 7 - Test and Training Datasets
I have also confirmed that the Train dataset ends in 1990 and that the Test dataset
starts in 1991 by using the head and tail functions on the Train and Test datasets. As we can
observe, the Train data frame has 132 observations and the Test data frame has
55 observations.
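The split and the checks described above can be sketched as follows. The series values are placeholders; only the monthly index from January 1980 to July 1995 mirrors the dataset.

```python
import numpy as np
import pandas as pd

# Placeholder values on the same monthly index as the dataset
# (January 1980 to July 1995, 187 observations).
index = pd.date_range(start="1980-01-01", periods=187, freq="MS")
sales = pd.Series(np.arange(187, dtype=float), index=index)

# Everything before 1991 is training data; 1991 onwards is test data.
train = sales[sales.index.year < 1991]
test = sales[sales.index.year >= 1991]

# Confirm the boundaries with tail and head, and check the sizes.
print(train.tail(1).index[0], test.head(1).index[0])
print(len(train), len(test))
```

The sizes follow from the calendar: 11 full years give 132 training months, and 4 full years plus 7 months give 55 test months.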
I have also plotted the Train and Test data below:
Figure 8 - Test and Training Datasets
We can observe the training and test data in the above plot: the blue part of the plot depicts the
Train dataset (January 1980 to December 1990), and the orange part depicts the Test
dataset (January 1991 to July 1995).
Q4- Build various exponential smoothing models on the training
data and evaluate the model using RMSE on the test data. Other
models such as regression, naïve forecast models, simple
average models etc. should also be built on the training data and
check the performance on the test data using RMSE.
In this section I run the various available models on the time series dataset, starting
with the Linear Regression model.
4.1 Linear Regression
Figure 9 - Test and Training Data for linear regression
Following are the results from a Linear Regression model on the dataset:
Figure 10 - Test and Training Data for linear regression
For Regression on Time forecast on the Test Data,
RMSE = 266.28
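A regression on time, scored with RMSE on the test period, can be sketched as below. The numbers are a toy example chosen to be easy to check by hand, and `rmse` is a helper defined here for illustration; neither comes from the project itself.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical toy data: a perfectly linear series.
train = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
test = np.array([20.0, 22.0, 24.0])

# The regressor is simply the time step: 1..n for train, n+1.. for test.
train_time = np.arange(1, len(train) + 1)
test_time = np.arange(len(train) + 1, len(train) + len(test) + 1)

# Fit a straight line on the training period only.
slope, intercept = np.polyfit(train_time, train, deg=1)

# Extrapolate the line over the test period and score it.
forecast = intercept + slope * test_time
print(slope, intercept, rmse(test, forecast))
```

A straight line cannot follow seasonality or a change in trend direction, which is one plausible reason for a high test RMSE on a series like this one.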
4.2 Naive Model
Figure 11 - Test and Training Data for Naive Model
For the Naive model forecast on the Test Data, RMSE = 245.1
4.3 Simple Average Model
Figure 12 - Test and Training Data for Simple Average
The extracts of Training data for the Simple Average Model can be seen below:
Figure 13 - Extract of Training data
For Simple Average Model, RMSE = 63.985
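The naive and simple average forecasts differ only in the constant they project over the test period. A minimal sketch on made-up numbers (not the shoe sales data), with a hypothetical `rmse` helper:

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical toy data.
train = np.array([10.0, 20.0, 30.0])
test = np.array([28.0, 32.0])

# Naive forecast: repeat the last training observation.
naive_forecast = np.full(len(test), train[-1])

# Simple average forecast: repeat the mean of the training data.
average_forecast = np.full(len(test), train.mean())

naive_rmse = rmse(test, naive_forecast)
average_rmse = rmse(test, average_forecast)
print(naive_rmse, average_rmse)
```

Which of the two does better depends on whether the test period sits closer to the last training value or to the long-run training mean.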
4.4 Moving Average Model
Figure 14 - Dataset for the Moving Average
Results from Moving Average
Figure 15: Moving Average Model Outcome
For 2 point Moving Average Model forecast on the Testing Data, RMSE = 45.949
For 4 point Moving Average Model forecast on the Testing Data, RMSE = 57.873
For 6 point Moving Average Model forecast on the Testing Data, RMSE = 63.457
For 9 point Moving Average Model forecast on the Testing Data, RMSE = 67.724
I have applied 2, 4, 6 and 9-point trailing averages on the dataset.
As we can observe from the above plots, all of the trailing average plots show prediction values
below the actual train and test data, and the 9 point trailing average shows the lowest
predictions of all. The prediction closest to the actual data comes from the 2 point trailing
moving average model. This observation is corroborated by the RMSE scores of the
moving average models, which rise steadily from 45.949 (2 point) to 67.724 (9 point).
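Trailing averages of the four window sizes can be sketched with pandas `rolling`. The series here is a made-up run of numbers, used only to show the mechanics.

```python
import pandas as pd

# Hypothetical toy series: 10, 20, ..., 120.
sales = pd.Series([float(v) for v in range(10, 130, 10)])

# A k-point trailing average at time t is the mean of the k most recent
# observations up to and including t; the first k-1 values are undefined.
trailing = {window: sales.rolling(window).mean() for window in (2, 4, 6, 9)}

for window, values in trailing.items():
    print(window, values.dropna().iloc[0])
```

If the trailing average is computed over the full series, each value in the test period is built from actual test observations. Its RMSE is then not directly comparable with models that forecast the whole test period from the training data alone, which is worth keeping in mind when ranking the models.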
4.5 Simple Exponential Smoothing
Figure 15: Moving Average Model Outcome
Figure 16: SES Parameters
Following is the result from running a SES Model on the dataset:
Figure 17: SES Model Outcome
For the Simple Exponential Smoothing Model with Alpha = 0.605, the forecast on the Test Data
has RMSE = 196.405
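The recursion behind simple exponential smoothing can be sketched in a few lines. This is a hand-written illustration on toy numbers with an arbitrary alpha of 0.5; a library implementation typically estimates alpha and the initial level from the data, so its results will differ from this sketch.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the final smoothed level after one pass over the data.

    The level is initialised with the first observation, which is one
    common choice and an assumption of this sketch.
    """
    level = values[0]
    for value in values[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical toy data and smoothing parameter.
train = [10.0, 20.0, 30.0]
final_level = simple_exponential_smoothing(train, alpha=0.5)

# SES produces a flat forecast: every future step equals the last level.
horizon = 3
forecast = [final_level] * horizon
print(forecast)
```

Because the forecast is flat, SES behaves much like a smoothed naive forecast over a long test period, and it cannot follow trend or seasonality.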
4.6 Double Exponential Smoothing
Figure 18: DES Parameters
Following is the result from running a SES & DES Model on the dataset:
Figure 19: Smoothing Models Outcome
Double Exponential Smoothing Model forecast on the Test Data, RMSE = 288.547
4.7 Triple Exponential Smoothing
Figure 20: TES Parameters
Following is the result from running a SES, DES & TES Model on the dataset:
Figure 21: Smoothing Models Outcome
Triple Exponential Smoothing Model forecast on the Test Data, RMSE = 128.993
The summarized performance of the models run on the Shoe Sales datasets can be seen
below:
Figure 22: Performance metrics of Different models
Q5- Check for the stationarity of the data on which the model is
being built on using appropriate statistical tests and also mention
the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary.
Check the new data for stationarity and comment. Note:
Stationarity should be checked at alpha = 0.05.
5.1 Dickey Fuller test on the stationarity of the data
DF test statistic is -1.717
DF test p-value is 0.4222
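The hypotheses of the test are:
H0: the series has a unit root, i.e. it is non-stationary.
H1: the series has no unit root, i.e. it is stationary.
As the p-value (0.4222) is more than 0.05, we fail to reject the null hypothesis, so the data is
not stationary. We can try different methods to make the data stationary.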
5.2 Data Stationarity by Differencing
Figure 23: New Data after Differencing it by 1
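ADF test on the new data after differencing
DF test statistic is -3.479
DF test p-value is 0.0085
As the p-value is less than 0.05, the null hypothesis of non-stationarity is rejected and the
differenced data is stationary.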
5.3 Data Stationarity by Differencing and Log Transformation
Figure 24: New Data after Differencing it by 1 and Log Transformation
ADF test on the new data after log transformation and differencing
DF test statistic is -3.479
DF test p-value is 0.0086
As the p-value is less than 0.05, the data has become stationary.
Q6 - Build an automated version of the ARIMA/SARIMA model in
which the parameters are selected using the lowest Akaike
Information Criteria (AIC) on the training data and evaluate this
model on the test data using RMSE.
6.1 ARIMA
Figure 24: Best AIC
Figure 25: ARIMA Results
Figure 26: ARIMA Model Performance
The Root Mean Squared Error of ARIMA forecasts is 175.196
6.2 SARIMA
Figure 27: Auto SARIMA Performance
Figure 28: Best AIC values for SARIMA
Diagnostic Plot
Figure 29: SARIMA Diagnostic Plot
The Root Mean Squared Error of SARIMA forecasts is 69.03
Q7 - Build ARIMA/SARIMA models based on the cut-off
points of ACF and PACF on the training data and evaluate
this model on the test data using RMSE.
Figure 30: ACF Parameters
Figure 31: PACF Parameters
Q 8- Build a table with all the models built along with their
corresponding parameters and the respective RMSE
values on the test data.
Figure 32: Model Performances
As we can see from the above table, the best performance (lowest test RMSE, 45.949) is that
of the 2 point Trailing Average model.
Q9- Based on the model-building exercise, build the
most optimum model(s) on the complete data and
predict 12 months into the future with appropriate
confidence intervals/bands.
Figure 33: Prediction of 12 months ahead
12 Months predictions
Figure 34: Prediction of 12 months ahead
Q10 - Comment on the model thus built and report your
findings and suggest the measures that the company
should be taking for future sales.
The dataset contains 187 entries across 2 variables.
The data has outliers present.
There are more sales in the second half of the year than in the first half.
December records the highest sales.
We can see that the sales of shoes have decreased drastically over a few years.
The company can achieve more sales than forecasted if it focuses on product innovation
and applies marketing strategies.
The decrease in the sales figures over the years suggests that there has been a significant rise in
competition and that other brands are providing much better shoes than the company at much
better prices.
To remain competitive in the market, the company needs to implement multiple strategies.