Time Series Forecasting Project Report
PROJECT REPORT
DSBA
NAME : SREEVATHSAN S S
BATCH : PGPDSBA ONLINE APRIL_B 2021
PROBLEM:
For this assignment, sales data for different types of wine in the 20th century is to be analyzed. Both series come from the same company but cover different wines. As an analyst at ABC Estate Wines, you are tasked to analyze and forecast wine sales in the 20th century.
Dataset : Sparkling
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Check for the stationarity of the data on which the model is being built on using appropriate statistical
tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
5. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using
the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data
using RMSE.
6. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.
7. Build a table with all the models built along with their corresponding parameters and the respective
RMSE values on the test data.
8. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands.
9. Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales.
Dataset : Rose
10. Read the data as an appropriate Time Series data and plot the data.
11. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
12. Split the data into training and test. The test data should start in 1991.
13. Check for the stationarity of the data on which the model is being built on using appropriate statistical
tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
14. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using
the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data
using RMSE.
15. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.
16. Build a table with all the models built along with their corresponding parameters and the respective
RMSE values on the test data.
17. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands.
18. Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales.
Sparkling:
Data Dictionary:
YearMonth – Month and year of sales
Sparkling – Number of units of Sparkling wine sold
1. Read the data as an appropriate Time Series data and plot the data.
We will read the data as time-series data by parsing the 'YearMonth' column as dates and setting 'YearMonth' as the index.
We will now plot the data to see how the sales values vary over time.
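As a sketch of this step: the read can be reproduced as below. Only the column names (YearMonth, Sparkling) come from the data dictionary; the three rows here are an illustrative miniature standing in for the actual Sparkling CSV.

```python
import io
import pandas as pd

# Hypothetical miniature of the dataset (illustrative values); the real input is
# the Sparkling CSV supplied with the assignment.
csv_text = "YearMonth,Sparkling\n1980-01,1686\n1980-02,1591\n1980-03,2304\n"

# Read as a time series: parse YearMonth into dates and use it as the index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["YearMonth"], index_col="YearMonth")
df = df.asfreq("MS")  # declare the monthly frequency for downstream models

print(df.head())
# df.plot() then draws sales against time, as in the report
```

Setting the frequency explicitly means later models (decomposition, SARIMA) do not have to infer it.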
• The dataset contains monthly Sparkling wine sales from Jan-1980 to Jul-1995
• It has a total of 187 observations
• There are no missing values in this dataset
• The average monthly sales value is around 2402.41 and the median is around 1874, which implies the data is right-skewed
Observations:
1. The yearly sales trend is almost constant throughout the 16 years; however, the variance of the monthly sales values within each year widens after 1984
2. Almost every year has at least one positive outlier
3. The monthly box plot clearly shows that sales are lower and roughly constant until June, after which an increasing trend is observed, with the highest sales recorded in December
4. The month-wise comparison plot also shows that, across all years, sales are highest in December followed by November
5. Clear seasonality is visible in this dataset
Decomposition of Data:
We will decompose the data to segregate the Trend, Seasonality and Residual components.
The individual components and their plots are shown below.
Fig 6: Decomposition graph of time series
3. Split the data into training and test. The test data should start in 1991.
We have split the data into training and test sets. The training data runs from January 1980 to December 1990 and the test data from January 1991 to July 1995.
The training set has 132 records and the test set has 55.
We have displayed the last 5 records of the training data followed by the first 5 records of the test data.
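The calendar-based split can be sketched as follows. The values are synthetic, but the date range and the resulting record counts match the split described above.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the Sparkling series (Jan 1980 - Jul 1995, 187 points).
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
series = pd.Series(np.random.default_rng(0).integers(1000, 7000, len(idx)), index=idx)

# Split on the calendar: training up to Dec 1990, test from Jan 1991 onwards.
train = series[:"1990-12"]
test = series["1991-01":]

print(len(train), len(test))  # 132 and 55
```

Splitting by date (rather than a random shuffle) preserves the temporal order, which is essential for time-series evaluation.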
4. Build various exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naive forecast
models, simple average models etc. should also be built on the training data and check
the performance on the test data using RMSE.
For the 2-point Moving Average forecast on the training data, the RMSE is 813.401
For the 4-point Moving Average forecast on the training data, the RMSE is 1156.590
For the 6-point Moving Average forecast on the training data, the RMSE is 1283.927
For the 9-point Moving Average forecast on the training data, the RMSE is 1346.278
Before we go on to build the various Exponential Smoothing models, let us plot all the models
[only the most optimum Moving Average model (one with least RMSE) is plotted] and
compare the Time Series plots.
The plot below showcases the various models evaluated on the test data.
The higher the alpha value, the more weight is given to recent observations: the assumption is that what happened recently will happen again.
We have run a loop with different alpha values to understand which particular value works best
for alpha on the test set. Below are the top 5 𝛼 values with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.216) as well as the 𝛼 with the
least test RMSE values (0.1).
Method 6 – Double Exponential Smoothing (Holt's Model)
Two parameters 𝛼 and 𝛽 are estimated in this model. Level and Trend are accounted for in this
model. This particular Time Series seems to have a Seasonality as well. Let us see how Holt's
Model behaves in such a scenario.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.400 and 𝛽 to be 0.072.
We have run the model by setting different alpha and beta values.
We have run a loop with different alpha and beta values to understand which particular value
combination works best on the test set. Below are the top 5 𝛼 and 𝛽 value combinations with
the least test RMSE values.
Now we will go ahead and plot the graph with the auto-optimized 𝛼 (0.400) and 𝛽 (0.072) as well as the 𝛼 and 𝛽 combination with the least test RMSE.
Test RMSE: 1778.564670 Test MAPE: 85.874037
We have run a loop with different alpha, beta and gamma values to understand which
particular value combination works best on the test set. Below are the top 5 𝛼, 𝛽 and 𝛾 value
combinations with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.111), 𝛽(0.049) and 𝛾(0.395)
as well as the 𝛼, 𝛽 and 𝛾 with the least test RMSE values (0.4, 0.3 and 0.1).
5. Check for the stationarity of the data on which the model is being built using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
Hypotheses:
H0: The data is not stationary
H1: The data is stationary
We have checked the stationarity of the data using the Augmented Dickey-Fuller test. From the figure below we can infer that at the 5% significance level we cannot reject the null hypothesis, and hence the time series is not stationary.
Since the series is not stationary, we have taken the first-order difference and re-checked the stationarity. At alpha = 0.05 we can reject the null hypothesis, as the p-value is almost 0 and less than 0.05; hence the series is indeed stationary after first-order differencing.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criterion (AIC) on the training data and evaluate this model on the test data using RMSE.
ARIMA Model:
This model requires values for the parameters p, d and q. The best values of these parameters can be selected based on the lowest AIC among the candidate models. We therefore build models over the parameter combinations mentioned below: p and q range from 0 to 4 (Python range(0, 5)) and d takes the values 1 and 2.
We have built ARIMA models over this grid and sorted the results by the Akaike Information Criterion (AIC). The lowest AIC on the training data (2213.509213) is obtained for the parameters (2,1,2).
Below are the results of applying the best parameters identified – ARIMA(2,1,2). Both the lag and error terms are significant.
Fig. 8 ARIMA(2,1,2) Result
The test RMSE for ARIMA(2,1,2) is 1299.980869.
SARIMA
The SARIMA model requires six parameters: p, d, q and P, D, Q. We have built SARIMA models accounting for seasonality over the range 0 to 2 and selected the combination with the lowest AIC on the training data (1054.718055) – SARIMA(0, 1, 1)x(1, 0, 1, 12).
Below are the results of applying the best parameters identified – SARIMA(0, 1, 1)x(1, 0, 1, 12).
Fig. 9 SARIMA(0,1,1)x(1,0,1,12) Result
The test RMSE for SARIMA(0,1,1)x(1,0,1,12) is 603.649011; compared to ARIMA(2,1,2) it is much lower, owing to the seasonality present in the dataset.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training
data and evaluate this model on the test data using RMSE.
We plot the autocorrelation and partial autocorrelation functions on the whole data. From the ACF we determine q and Q, and from the PACF we determine p and P, based on the significance level.
The ACF, plotted using statsmodels as shown in Figure 10, has lags up to the 3rd within the significant region, hence q can be taken as 3.
Similarly, the 2nd seasonal lag lies in the significant region, hence Q is taken as 2.
Fig.10 ACF plot
The partial autocorrelation function, plotted using statsmodels as shown in Figure 11, suggests p = 3; and since every seasonal lag is significant, P can be taken as 1.
All the required values have been read off the plots:
p = 3, q = 3, P = 1, Q = 2
An ARIMA model has been built with parameters p=3, d=1, q=3; its results are shown in Figure 12. All AR and MA terms are significant in this model.
Fig 12. ARIMA(3,1,3) Result
The test RMSE of ARIMA(3,1,3) is 1228.4889, slightly lower than ARIMA(2,1,2) but much higher than SARIMA(0,1,1)x(1,0,1,12).
A SARIMA model has been built with parameters p=3, d=1, q=3, P=1, D=0, Q=2; its results are shown in Figure 13.
The test RMSE of SARIMA(3,1,3)x(1,0,2,12) is 623.9257, lower than both ARIMA models but slightly higher than SARIMA(0,1,1)x(1,0,1,12).
We have also built an auto-ARIMA model using the pmdarima package in Python, with p and q ranging from 0 to 3 and d = 1.
Figure 14 shows the result of the model built with pmdarima.
Fig 14. ARIMA(2,1,3) Result
The test RMSE of this pmdarima model is 1300.1634, which is higher than the RMSE of the other models.
Similarly, we have built a SARIMA model using pmdarima, with p and q ranging from 0 to 4, P and Q starting from 0, and a seasonal period of 12.
Figure 15 shows the result of SARIMA(3, 1, 0)x(1, 0, 1, 12) built with pmdarima.
The test RMSE of SARIMA(3, 1, 0)x(1, 0, 1, 12) is 899.7035, lower than all the ARIMA models but slightly higher than the other two SARIMA models.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
We have built a table (Fig 16) with the test RMSE values of all the models built so far. Out of the 6 models, SARIMA(0,1,1)x(1,0,1,12) has the least RMSE. Hence we finalise this model as the optimum model to forecast 12 months of data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.
Based on the above 6 models, we have finalised SARIMA(0,1,1)x(1,0,1,12) since it has the least test RMSE.
We have built the SARIMA(0,1,1)x(1,0,1,12) model on the full dataset; its results are shown in Figure 17.
Fig 17. SARIMA(0,1,1)x(1,0,1,12) Result
The RMSE of this model on the full dataset is 519.0809.
We have forecast 12 months of values, from Aug 1995 to Jul 1996; the forecast values are listed in the table below.
We have also plotted the forecast of SARIMA(0,1,1)x(1,0,1,12) along with the original series, as shown in Fig 18.
Fig 18. Original & Forecasted value of the model SARIMA (0,1,1)x(1,0,1,12)
10. Comment on the model thus built and report your findings and suggest the measures that the company should be taking for future sales.
Rose:
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
The dataset contains monthly Rose wine sales from January 1980 to July 1995. There are two null values, which we have interpolated using the linear interpolation method.
We have looked at the summary statistics (mean, standard deviation and other measures) of the given data. The mean monthly Rose wine sales over the period is 90.39 and the median is 86, which indicates slight right skewness.
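The interpolation step can be sketched on a hypothetical fragment with two missing months; the values below are illustrative, not taken from the Rose data.

```python
import numpy as np
import pandas as pd

# Hypothetical Rose-like fragment with two missing months.
idx = pd.date_range("1994-01-01", periods=6, freq="MS")
rose = pd.Series([45.0, np.nan, np.nan, 48.0, 52.0, 44.0], index=idx)

# Linear interpolation fills each gap on a straight line between its neighbours.
filled = rose.interpolate(method="linear")
print(filled)
```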
We also plotted the data against time and studied the pattern. (Figure 1)
In the figures below we studied the distribution of the data and confirmed its skewness. Most of the data values lie between 30 and 190. (Figure 2)
Figure 2: Histogram & Density of time series data - rose wine sales
We also plotted the yearly box plot, from which we can clearly see that Rose wine sales are highest in 1980 and 1981 and have gradually decreased over the years. We also notice some outliers, which are negligible in size and hence left untreated. (Figure 3)
We have also plotted the monthly sales data across the years and confirmed that, in every year, December records the highest sales. (Figure 6)
Figure 6: Yearly Line plot
We have also decomposed the time series to examine the Trend, Seasonality and Residual components. The individual components and their plots are shown below. (Figure 7)
The decomposition graph shows strong month-on-month seasonality. The series is additive because the seasonal variation does not increase over time. The trend shows a decreasing pattern from 1981 onwards.
Figure 7: Decomposition of data
3. Split the data into training and test. The test data should start in 1991.
We have split the data into training and test sets. The training data runs from January 1980 to December 1990 and the test data from January 1991 to July 1995.
The training set has 132 records and the test set has 55.
We have displayed the last 5 records of the training data followed by the first 5 records of the test data.
Figure 8: Rose wine sales – Split into Test and Train data
4. Build various exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naive forecast
models, simple average models etc. should also be built on the training data and check
the performance on the test data using RMSE.
For this particular simple average method, we will forecast by using the average of the training values.
Test RMSE: 53.460570 Test MAPE: 110.587957
Model 4 – Moving Average
For the moving average model, we are going to calculate rolling means (or moving averages) for
different intervals. The best interval can be determined by the maximum accuracy (or the
minimum error).
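The moving-average comparison can be sketched as below; the series is synthetic and the windows match the 2/4/6/9-point models reported for the wine data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the train/test split used in the report.
rng = np.random.default_rng(8)
series = pd.Series(90 + rng.normal(0, 10, 100))
train, test = series[:80], series[80:]

rmses = {}
for window in (2, 4, 6, 9):
    # Trailing mean over the last `window` observations, shifted one step so
    # each forecast uses only past values.
    ma = series.rolling(window).mean().shift(1)
    rmses[window] = float(np.sqrt(np.mean((test - ma[test.index]) ** 2)))

for window, rmse in rmses.items():
    print(f"{window}-point MA RMSE: {rmse:.3f}")
```

The window with the lowest RMSE is the one carried forward to the comparison plot.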
Before we go on to build the various Exponential Smoothing models, let us plot all the models
[only the most optimum Moving Average model (one with least RMSE) is plotted] and
compare the Time Series plots.
Model 5 – Simple Exponential Smoothing
In the Simple Exponential Smoothing Model, only the level of the Time Series is accounted for.
Here, we can see that the data has both trend and seasonality. This particular Simple
Exponential Smoothing model is built only to showcase how Simple Exponential Smoothing
models are built in Python.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.216.
Test RMSE: 36.796242 Test MAPE: 75.909219
The higher the alpha value, the more weight is given to recent observations: the assumption is that what happened recently will happen again.
We have run a loop with different alpha values to understand which particular value works best
for alpha on the test set. Below are the top 5 𝛼 values with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.216) as well as the 𝛼 with the
least test RMSE values (0.1).
Method 6 – Double Exponential Smoothing (Holt's Model)
Two parameters 𝛼 and 𝛽 are estimated in this model. Level and Trend are accounted for in this
model. This particular Time Series seems to have a Seasonality as well. Let us see how Holt's
Model behaves in such a scenario.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.400 and 𝛽 to be 0.072.
We have run the model by setting different alpha and beta values.
We have run a loop with different alpha and beta values to understand which particular value
combination works best on the test set. Below are the top 5 𝛼 and 𝛽 value combinations with
the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.400) and 𝛽(0.072) as well as
the 𝛼 and 𝛽 with the least test RMSE values (0.1 and 0.1).
Model 7 – Triple Exponential Smoothing (Holt - Winter's Model)
Three parameters 𝛼, 𝛽 and 𝛾 are estimated in this model. Level, Trend and Seasonality are
accounted for in this model. This particular Time Series looks to have trend as well as
seasonality, so Holt-Winter's model theoretically seems to be a correct fit. Let us see how the
model behaves.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.111, 𝛽 to be 0.049 and 𝛾 to
be 0.395.
We have run the model by setting different alpha, beta and gamma values.
We have run a loop with different alpha, beta and gamma values to understand which
particular value combination works best on the test set. Below are the top 5 𝛼, 𝛽 and 𝛾 value
combinations with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.111), 𝛽(0.049) and 𝛾(0.395)
as well as the 𝛼, 𝛽 and 𝛾 with the least test RMSE values (0.4, 0.3 and 0.1).
5. Check for the stationarity of the data on which the model is being built on using
appropriate statistical tests and also mention the hypothesis for the statistical test. If the
data is found to be non-stationary, take appropriate steps to make it stationary. Check the
new data for stationarity and comment. Note: Stationarity should be checked at alpha =
0.05.
Hypotheses:
H0: The data is not stationary
H1: The data is stationary
We have checked the stationarity of the data using the Augmented Dickey-Fuller test. From the figure below we can infer that at the 5% significance level we cannot reject the null hypothesis, and hence the time series is not stationary.
We have then taken the first-order difference and re-checked the stationarity. At alpha = 0.05 we can reject the null hypothesis, and hence the differenced series is indeed stationary.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and
evaluate this model on the test data using RMSE.
We have built ARIMA models for p and q ranging from 0 to 4 (Python range(0, 5)) and sorted the results by the Akaike Information Criterion (AIC). The lowest AIC on the training data (1274.695172) is obtained for the parameters (2,1,3).
Below are the results of applying the best parameters identified – ARIMA(2,1,3). From these results, the error terms at the 1-period and 3-period lags are slightly insignificant.
We have also built SARIMA models accounting for seasonality over the range 0 to 2 and selected the combination with the lowest AIC on the training data (1054.718055) – SARIMA(1, 1, 1)x(1, 0, 1, 12).
Below are the results of applying the best parameters identified – SARIMA(1, 1, 1)x(1, 0, 1, 12).
We have calculated the test RMSE for the ARIMA and SARIMA models: 36.813755 for ARIMA(2,1,3) and 21.703017 for SARIMA(1, 1, 1)x(1, 0, 1, 12).
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.
We have plotted the autocorrelation and partial autocorrelation functions on the whole data. From Figure 9 we take q as 3; in the case of Q, all seasonal lags are significant, so Q is assumed to be 2.
Figure 9: Autocorrelation function plot
From Figure 10, p is taken as 5 and P as 3, since the plot appears significant at those lags.
We build the ARIMA model with parameters (5,1,3), based on the ACF and PACF plots. Below are the results. We note that the 1-period and 4-period lags are slightly insignificant.
We build the SARIMA model with parameters (5,1,3) x (3,0,2,12), based on the ACF and PACF plots. It shows that only the error terms at the 2-period lag and the 12-period (1st seasonal) lag are significant.
We have calculated the RMSE values for the ARIMA (5,1,3) and SARIMA (5,1,3) x (3,0,2,12) models.
We have also built an ARIMA model using the pmdarima function in Python for the range 0 to 3. Below is the result.
We have also built a SARIMA model using pmdarima, with ranges 0 to 4 for the trend terms and 0 to 5 for the seasonal terms. Below is the result.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
We have calculated the RMSE values for the different models; the best model is the pmdarima SARIMA (1, 1, 2) x (1,0,1,12) with an RMSE of 14.562001.
9. Based on the model-building exercise, build the most optimum model(s) on the complete
data and predict 12 months into the future with appropriate confidence intervals/bands.
We build the SARIMA (1, 1, 2) x (1,0,1,12) model, which has the lowest RMSE, on the whole dataset. Below is the result.
We have forecast the next 12 months – August 1995 to July 1996 – using the best model, SARIMA (1, 1, 2) x (1,0,1,12), and also calculated the RMSE on the full period.
Figure 11: Rose wine sales (original data – Jan 1980 to Jul 1995, Forecast – Aug 1995 to Jul
1996)
10. Comment on the model thus built and report your findings and suggest the measures that
the company should be taking for future sales.