
Time Series Forecasting Project (Shoe Sales)

This document describes analyzing a time series dataset of shoe sales to build forecasting models. It includes splitting the data into training and test sets, exploring the data through decomposition and plots, building linear regression, naive, simple average and exponential smoothing models on the training data and evaluating them on the test set using RMSE.

Uploaded by

PARIJAT DEV
Copyright
© © All Rights Reserved

Time Series Forecasting

Extended Project
(Shoes Sales)
By - Parijat Dev

1. Executive Summary
2. Data details
Q1 Read the data as an appropriate Time Series data and plot the data
Q2- Perform appropriate Exploratory Data Analysis to understand the data and also
perform decomposition.
Data Description
Q3- Split the data into training and test. The test data should start in 1991.
Q4- Build various exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naïve forecast
models, simple average models etc. should also be built on the training data and check
the performance on the test data using RMSE.
Q5- Check for the stationarity of the data on which the model is being built on using
appropriate statistical tests and also mention the hypothesis for the statistical test. If the
data is found to be non-stationary, take appropriate steps to make it stationary. Check the
new data for stationarity and comment. Note: Stationarity should be checked at alpha =
0.05.
Q6- Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and
evaluate this model on the test data using RMSE.
Q7 - Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.
Q8- Build a table with all the models built along with their corresponding parameters and
the respective RMSE values on the test data.
Q9- Based on the model-building exercise, build the most optimum model(s) on the
complete data and predict 12 months into the future with appropriate confidence
intervals/bands.
Q10 - Comment on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.

1. Executive Summary
You are an analyst at the IJK shoe company and you are expected to forecast the sales of
pairs of shoes for the 12 months following the end of the data. Monthly shoe sales data
have been given to you from January 1980 to July 1995.

2. Data details

Figure 1- Shoe Sales Time Series Plot

The dataset contains two columns: the first column gives the month and year, and the
second column gives the corresponding quantity recorded for that month.

Q1 Read the data as an appropriate Time Series data and plot the data
I have imported the data series. As we can observe, each entry has a YearMonth value with it,
which is not really a data point but an index for the sales entry. So in reality the dataset has a
single column that contains the quantity of shoes sold in that particular month. While reading
the dataset, I passed arguments so that the first column, which is the date column, is parsed as
a date, and so that the result is treated as a one-column series through squeeze. The data starts
in January 1980 and runs till July 1995, so there are 187 entries in total.
I have also loaded the dataset with no arguments (and hence without parsing the dates). In that
case the time stamp has to be provided manually, so I removed the YearMonth variable and
added a time stamp to the dataset myself. I have plotted the time series below.
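
The two ways of reading the data described above can be sketched as follows. This is a minimal illustration, not the code used in this report: the column names `YearMonth` and `Shoe_Sales`, the date format, and the in-memory CSV with random values are all assumptions standing in for the real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: 187 monthly rows, Jan 1980 to Jul 1995.
# Column names, date format and values are assumptions for illustration only.
months = pd.date_range(start="1980-01-01", periods=187, freq="MS")
rng = np.random.default_rng(0)
csv_text = pd.DataFrame({
    "YearMonth": months.strftime("%Y-%m-%d"),
    "Shoe_Sales": rng.integers(80, 600, size=len(months)),
}).to_csv(index=False)

# Option 1: parse the date column while reading, use it as the index,
# and squeeze the one-column frame into a Series.
sales = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["YearMonth"], index_col="YearMonth"
).squeeze("columns")

# Option 2: read with no arguments, then drop YearMonth and attach a time stamp manually.
raw = pd.read_csv(io.StringIO(csv_text))
raw["Time_Stamp"] = pd.date_range(start="1980-01-01", periods=len(raw), freq="MS")
sales_manual = raw.drop(columns="YearMonth").set_index("Time_Stamp")["Shoe_Sales"]

print(len(sales), sales.index.min().date(), sales.index.max().date())
```

Both options end with the same 187 monthly values indexed by date; the second is only needed when the date column cannot be parsed directly.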

Figure 2- Shoe Sales Time Series Plot

As we can observe from the above plot, shoe sales follow an upward trend, and a seasonal
pattern is also visible in the graph. We will explore the trend and seasonality further during
decomposition, which gives a much more detailed view of these two factors.

Q2- Perform appropriate Exploratory Data Analysis
to understand the data and also perform
decomposition.
Data Description

Figure 3 - Description of the Dataset

As we can see from the above, the shoe sales time series data appear to be skewed.
The standard deviation of the series is high, since there is a significant difference between
the Min and the Max. The mean and the median also differ, which is consistent with this
skewness. As mentioned earlier, there are 187 records in total in the dataset.

Yearly Boxplot

Figure 4 - Yearly Boxplot of the Dataset

As we can see, the data shows both an upward and a downward trend. Till 1987 shoe sales
showed significant growth; from 1987 onwards the growth was hampered and the sales numbers
started to decline. The highest shoe sales happened in 1987. The year 1984 is the year with
the lowest variation in sales.

Monthly Boxplot

Figure 4 - Monthly Boxplot of the Dataset

As we can observe, more sales happen in the second half of the year than in the
first half. The highest sales happen in December.

Monthly Sales Across Years

Figure 5 - Monthly sales line chart across Years

December and November are the months that bring in the maximum sales throughout the years.
From the above chart it is visible that a seasonality element is present in the data.

Decomposition

Figure 6 - Multiplicative Decomposition of the data

From the above decomposition chart it is clear that both trend and seasonality are present in
the dataset. The residuals are minimal in the multiplicative decomposition.

Q3- Split the data into training and test. The test
data should start in 1991.
I have split the time series dataset into Train and Test datasets below. The question
specifies that the Test data should start in 1991.

Figure 7 - Test and Training Datasets

I have also confirmed that the Train dataset indeed ends in 1990, and the Test dataset indeed
starts in 1991, by using the Head and Tail functions on the Train and Test datasets. As we can
observe, the size of the Train data frame is 132 observations and that of the Test data frame is
55 observations.
I have also plotted the Train and Test data frames below:

Figure 8 - Test and Training Datasets

We can observe the training and test data in the above plot: the blue part of the plot depicts the
Train dataset (January 1980 to December 1990), and the orange part of the plot depicts the Test
dataset (January 1991 to July 1995).
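
The split and the head/tail check can be sketched as below. The series here is a placeholder with dummy values on the same monthly index; only the dates and counts matter for the illustration.

```python
import numpy as np
import pandas as pd

# Placeholder series (assumption): dummy values on the Jan 1980 to Jul 1995 monthly index.
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
sales = pd.Series(np.arange(187, dtype=float), index=idx, name="Shoe_Sales")

# Everything before 1991 is training data; 1991 onwards is test data.
train = sales[sales.index.year < 1991]
test = sales[sales.index.year >= 1991]

# Confirm the boundary with tail/head, and the sizes of the two sets.
print(train.tail(1).index[0].date(), test.head(1).index[0].date())  # 1990-12-01 1991-01-01
print(len(train), len(test))  # 132 55
```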

Q4- Build various exponential smoothing models on the training
data and evaluate the model using RMSE on the test data. Other
models such as regression, naïve forecast models, simple
average models etc. should also be built on the training data and
check the performance on the test data using RMSE.
In this section I will run the various available models on the time series dataset. Let’s kick off
the analysis with the Linear Regression model.

4.1 Linear Regression

Figure 9 - Test and Training Data for linear regression

The following are the results from a Linear Regression model on the dataset:

Figure 10 - Test and Training Data for linear regression

For the Regression on Time forecast on the Test Data, RMSE = 266.28
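
A regression-on-time forecast and its RMSE can be sketched as follows. This is an illustration on synthetic data, assuming the regressor is a simple time counter 1, 2, 3, ... over the months; the `rmse` helper and the use of `np.polyfit` are choices made for this sketch, not necessarily what was used to obtain the figure above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Synthetic monthly values (assumption): linear trend plus noise, 132 train + 55 test.
rng = np.random.default_rng(0)
y = 200 + 0.8 * np.arange(187) + rng.normal(0, 10, 187)
y_train, y_test = y[:132], y[132:]

# Time counter used as the only regressor.
t_train = np.arange(1, 133)
t_test = np.arange(133, 188)

# Fit y = intercept + slope * t on the training period, then extrapolate.
slope, intercept = np.polyfit(t_train, y_train, deg=1)
pred = intercept + slope * t_test

print("Regression on time, test RMSE:", round(rmse(y_test, pred), 3))
```

Because the fitted line is extrapolated unchanged over the whole test period, this model does poorly whenever the trend changes direction after the training window.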

4.2 Naive Model

Figure 11 - Test and Training Data for Naive Model

For the Naive model forecast on the Test Data, RMSE = 245.1

4.3 Simple Average Model

Figure 12 - Test and Training Data for Simple Average

The extracts of Training data for the Simple Average Model can be seen below:

Figure 13 - Extract of Training data

For Simple Average Model, RMSE = 63.985
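
The naive and simple average forecasts from 4.2 and 4.3 can be sketched like this, again on a synthetic series: the naive model repeats the last training value over the whole test period, and the simple average model repeats the training mean.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Synthetic stand-in series (assumption), split at the 1990/1991 boundary.
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(200 + 0.8 * np.arange(187) + rng.normal(0, 10, 187), index=idx)
train, test = sales.loc[:"1990-12"], sales.loc["1991-01":]

# Naive forecast: every test month is predicted as the last observed training value.
naive_pred = pd.Series(train.iloc[-1], index=test.index)

# Simple average forecast: every test month is predicted as the training mean.
mean_pred = pd.Series(train.mean(), index=test.index)

print("Naive RMSE:", round(rmse(test, naive_pred), 3))
print("Simple average RMSE:", round(rmse(test, mean_pred), 3))
```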

4.4 Moving Average Model

Figure 13 - Dataset for the moving Average

Results from Moving Average

Figure 15: Moving Average Model Outcome

For 2 point Moving Average Model forecast on the Testing Data, RMSE = 45.949
For 4 point Moving Average Model forecast on the Testing Data, RMSE = 57.873
For 6 point Moving Average Model forecast on the Testing Data, RMSE = 63.457
For 9 point Moving Average Model forecast on the Testing Data, RMSE = 67.724

I have applied 2, 4, 6 and 9-point trailing averages on the dataset.

As we can observe from the above plots, all of the trailing average plots show prediction values
below the actual train and test data, and the 9 point trailing average shows the lowest
predictions of all the plots. The closest prediction to the actual data is given by the 2 point
trailing moving average model. This observation is corroborated by the RMSE scores for each of
these moving average models.
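
Trailing averages of this kind are commonly computed with a rolling mean over the whole series and then scored on the test period, as sketched below on synthetic data. Whether the figures above were produced exactly this way is an assumption. Note that a rolling mean computed like this uses the observed values inside the test period, so it behaves more like a smoothing of the actual series than a multi-step-ahead forecast, which is one reason short windows score so well.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Synthetic stand-in series (assumption).
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(
    200 + 0.8 * np.arange(187)
    + 40 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
    + rng.normal(0, 10, 187),
    index=idx,
)

# Trailing (backward-looking) averages over 2, 4, 6 and 9 points.
windows = (2, 4, 6, 9)
ma = pd.DataFrame({"Sales": sales})
for w in windows:
    ma[f"Trailing_{w}"] = sales.rolling(window=w).mean()

# Score each trailing average on the test period only (1991 onwards).
test_ma = ma.loc["1991-01":]
scores = {w: rmse(test_ma["Sales"], test_ma[f"Trailing_{w}"]) for w in windows}
for w, score in scores.items():
    print(f"{w} point trailing average RMSE: {score:.3f}")
```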

4.5 Simple Exponential Smoothing

Figure 15: Moving Average Model Outcome


Figure 16: SES Parameters

The following is the result from running an SES model on the dataset:

Figure 17: SES Model Outcome

For the Alpha = 0.605 Simple Exponential Smoothing Model forecast on the Test Data,
RMSE = 196.405
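
The mechanics of simple exponential smoothing can be sketched by hand. The level is updated as level = alpha * observation + (1 - alpha) * previous level, and the forecast for every future month is the final level. The value alpha = 0.605 is taken from the result above; initialising the level at the first observation and the synthetic data are assumptions of this sketch, so it will not reproduce the reported RMSE.

```python
import numpy as np

def ses_final_level(y, alpha):
    """Run the SES recursion over y and return the final smoothed level."""
    level = float(y[0])  # assumption: initial level = first observation
    for obs in y[1:]:
        level = alpha * float(obs) + (1 - alpha) * level
    return level

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Synthetic values (assumption): 132 training months followed by 55 test months.
rng = np.random.default_rng(0)
y = 200 + 0.8 * np.arange(187) + rng.normal(0, 10, 187)
y_train, y_test = y[:132], y[132:]

# SES has no trend or seasonal term, so the forecast is a flat line at the last level.
alpha = 0.605
flat_forecast = np.full(len(y_test), ses_final_level(y_train, alpha))

print("SES test RMSE:", round(rmse(y_test, flat_forecast), 3))
```

The flat forecast explains why SES performs poorly on a series with trend and seasonality over a long test horizon.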

4.6 Double Exponential Smoothing

Figure 18: DES Parameters

The following is the result from running the SES and DES models on the dataset:

Figure 19: Smoothing Models Outcome

For the Double Exponential Smoothing Model forecast on the Test Data, RMSE = 288.547

4.7 Triple Exponential Smoothing

Figure 20: TES Parameters

The following is the result from running the SES, DES and TES models on the dataset:

Figure 21: Smoothing Models Outcome

For the Triple Exponential Smoothing Model forecast on the Test Data, RMSE = 128.993

The summarized performance of the models run on the Shoe Sales datasets can be seen
below:

Figure 22: Performance metrics of Different models

Q5- Check for the stationarity of the data on which the model is
being built on using appropriate statistical tests and also mention
the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary.
Check the new data for stationarity and comment. Note:
Stationarity should be checked at alpha = 0.05.
5.1 Dickey Fuller test on the stationarity of the data
DF test statistic is -1.717
DF test p-value is 0.4222

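The null hypothesis of the Dickey Fuller test is that the series has a unit root (it is
non-stationary); the alternative hypothesis is that the series is stationary. As the p-value is
more than 0.05, we fail to reject the null hypothesis, so the data is not stationary. We can try
different methods to make the data stationary.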

5.2 Data Stationarity by Differencing

Figure 23: New Data after Differencing it by 1


ADF test on the new data after differencing
DF test statistic is -3.479
DF test p-value is 0.0085
As the p-value is less than 0.05, the differenced data is stationary.

5.3 Data Stationarity by Differencing and Log Transformation

Figure 24: New Data after Differencing it by 1 and Log Transformation

ADF test on the new data after differencing and log transformation
DF test statistic is -3.479
DF test p-value is 0.0086
As the p-value is less than 0.05, the data has become stationary.

Q6- Build an automated version of the ARIMA/SARIMA model in
which the parameters are selected using the lowest Akaike
Information Criteria (AIC) on the training data and evaluate this
model on the test data using RMSE.

6.1 ARIMA

Figure 24: Best AIC

Figure 25: ARIMA Results

Figure 26: ARIMA Model Performance

The Root Mean Squared Error of ARIMA forecasts is 175.196

6.2 SARIMA

Figure 27: Auto SARIMA Performance

Figure 28: Best AIC values for SARIMA

Diagnostic Plot

Figure 29: SARIMA Diagnostic Plot

The Root Mean Squared Error of SARIMA forecasts is 69.03

Q7 - Build ARIMA/SARIMA models based on the cut-off
points of ACF and PACF on the training data and evaluate
this model on the test data using RMSE.

Figure 30: ACF Parameters

Figure 31: PACF Parameters

Q8- Build a table with all the models built along with their
corresponding parameters and the respective RMSE
values on the test data.

Figure 32: Model Performances

As we can see from the above table, the best-performing model is the 2 point
Trailing Average, which has the lowest RMSE on the test data (45.949).

Q9- Based on the model-building exercise, build the
most optimum model(s) on the complete data and
predict 12 months into the future with appropriate
confidence intervals/bands.

Figure 33: Prediction of 12 months ahead

12 Months predictions

Figure 34: Prediction of 12 months ahead

Q10 - Comment on the model thus built and report your
findings and suggest the measures that the company
should be taking for future sales.
The dataset contains 187 entries across 2 variables.
The data has outliers present.
There are more sales in the second half of the year than in the first half.
December records the highest sales.
We can see that the sales of shoes have decreased drastically over the last few years.
The company can achieve more sales than forecasted if it focuses on innovating its
products and applying marketing strategies.
The decrease in the sales figures over the years suggests that competition may have risen
significantly, with other brands providing better shoes than the company at better
prices.
To remain competitive in the market, the company needs to implement multiple strategies.
