Time Series Forecasting Project Report

The document describes a time series forecasting project that analyzes and forecasts wine sales data for two types of wine, Sparkling and Rose, from 1980 to 1995. For the Sparkling wine data, the author reads in the data, performs exploratory data analysis including decomposition, splits the data into training and test sets, checks for stationarity, builds ARIMA/SARIMA models using AIC and ACF/PACF cut-offs, evaluates the models on test data, selects the optimal model, refits it on the complete data to forecast 12 months into the future, and comments on the findings. The same process is repeated for the Rose wine data. Various other time series models, such as exponential smoothing, regression and moving averages, are also built and compared on test RMSE.


TIME SERIES FORECASTING

PROJECT REPORT

DSBA

NAME : SREEVATHSAN S S
BATCH : PGPDSBA ONLINE APRIL_B 2021
PROBLEM:
For this assignment, data on sales of different types of wine in the 20th century is to be analyzed. Both series come from the same company but are for different wines. As an analyst at ABC Estate Wines, you are tasked to analyze and forecast wine sales in the 20th century.
Dataset : Sparkling
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Check the stationarity of the data on which the model is being built, using appropriate statistical
tests, and state the hypotheses for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
5. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using
the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data
using RMSE.
6. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.
7. Build a table with all the models built along with their corresponding parameters and the respective
RMSE values on the test data.
8. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands.
9. Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales.

Dataset : Rose
10. Read the data as an appropriate Time Series data and plot the data.
11. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
12. Split the data into training and test. The test data should start in 1991.
13. Check the stationarity of the data on which the model is being built, using appropriate statistical
tests, and state the hypotheses for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
14. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using
the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data
using RMSE.
15. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.
16. Build a table with all the models built along with their corresponding parameters and the respective
RMSE values on the test data.
17. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands.
18. Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales.
Sparkling:
Data Dictionary:
Year Month – month and year of the sale
Sparkling – number of units of Sparkling wine sold
1. Read the data as an appropriate Time Series data and plot the data.
We read the data as time-stamped data by passing parse_dates for the ‘YearMonth’ column and setting
‘YearMonth’ as the index field.
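A minimal sketch of this loading step, assuming a file along the lines of "Sparkling.csv" with the two columns described in the data dictionary; a few illustrative rows stand in for the real file here so the snippet is self-contained:

```python
import io

import pandas as pd

# Illustrative rows standing in for the real CSV file.
sample = io.StringIO(
    "YearMonth,Sparkling\n"
    "1980-01,1686\n"
    "1980-02,1591\n"
    "1980-03,2304\n"
)

# parse_dates converts the YearMonth strings into timestamps;
# index_col makes that column the index, so pandas treats the
# DataFrame as a time series.
df = pd.read_csv(sample, parse_dates=["YearMonth"], index_col="YearMonth")
df = df.asfreq("MS")  # declare an explicit monthly (month-start) frequency

print(df.index.freqstr)  # MS
```

Declaring the frequency with `asfreq` keeps downstream statsmodels calls from having to infer it.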

We now plot the data to visualize the sales values against time.

Fig 1. Monthly Sales Values


2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.

Description of the Data:

• The dataset contains monthly Sparkling wine sales values from Jan 1980 to Jul 1995
• It has a total of 187 observations
• There are no missing values in this dataset
• The average monthly sales value is around 2402.41 and the median is around 1874, which
implies the data is right-skewed

Exploratory Analysis through different Plots


Fig 2. Yearly Box plot

Fig3. Monthly Box Plot

Fig 4. Month plot of average sales for each month across years


Fig 5. Month wise comparison Plot

Observations:
1. The yearly sales trend is almost constant throughout the 16 years; however, the variance of the
monthly sales values within each year widens after 1984
2. Almost every year has at least one positive outlier
3. The monthly box plot clearly shows that sales are lower and roughly constant until June, after
which an increasing trend is observed, with the highest sales recorded in December
4. The month-wise comparison plot also shows that, across all years, sales are higher in December
followed by November
5. Clear seasonality is visible in this data set

Decomposition of Data:

We decompose the data to segregate trend, seasonality and residuals.
The individual components and their plots are shown below.
Fig 6: Decomposition graph of time series

• The decomposition graph shows strong yearly seasonality
• The series is treated as additive because the seasonal variation does not change much over time
• A linear model might not work well, since the trend does not show a clear pattern

3. Split the data into training and test. The test data should start in 1991.
We have split the data into training and test sets. The training data runs from January 1980 to
December 1990 and the test data from January 1991 to July 1995.
The training set has 132 records and the test set has 55.
We have displayed the last 5 records of training data followed by first 5 records of testing data.

Fig. 7 Train and Test Data Split
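The date-based split can be sketched with label slicing on the DatetimeIndex; a dummy frame with the same monthly index (Jan 1980 to Jul 1995, 187 rows) stands in for the real data:

```python
import numpy as np
import pandas as pd

# Dummy frame with the same monthly index as the report's data.
idx = pd.date_range("1980-01", "1995-07", freq="MS")
df = pd.DataFrame({"Sparkling": np.arange(len(idx))}, index=idx)

train = df.loc[:"1990-12"]   # Jan 1980 – Dec 1990
test = df.loc["1991-01":]    # Jan 1991 – Jul 1995

print(len(train), len(test))  # 132 55
```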


4. Build various exponential smoothing models on the training data and evaluate the
models using RMSE on the test data.
Other models such as regression, naive forecast and simple average models
should also be built on the training data, and their performance checked on the test data
using RMSE.
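The RMSE and MAPE metrics used throughout this section can be computed with small helpers like these (a sketch, not the notebook's exact code):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

print(rmse([100, 200], [110, 190]))          # 10.0
print(round(mape([100, 200], [110, 190]), 2))  # 7.5
```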

Model 1 – Linear Regression


For this linear regression, we regress the ‘Sparkling’ variable against the order of occurrence
(the time index). The data was transformed accordingly before fitting the regression.

Test RMSE 1275.659913 Test MAPE: 38.700848


Model 2 – Naive Approach
For the naive model, the forecast for every future period is simply the last observed value of the
training data: the prediction for tomorrow is the same as today, and so is the prediction for every
day after that.

Test RMSE 3864.279352 Test MAPE: 201.327650
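A minimal sketch of the naive forecast described above:

```python
def naive_forecast(train, horizon):
    # Every future point equals the last observed training value.
    return [train[-1]] * horizon

print(naive_forecast([10, 12, 15], 4))  # [15, 15, 15, 15]
```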


Model 3 – Simple Average
For this particular simple average method, we will forecast by using the average of the training
values.

Test RMSE: 1275.081804 Test MAPE: 39.157336


Model 4 – Moving Average
For the moving average model, we are going to calculate rolling means (or moving averages) for
different intervals. The best interval can be determined by the maximum accuracy (or the
minimum error).

For the 2 point Moving Average Model forecast on the Training Data, RMSE is 813.401
For the 4 point Moving Average Model forecast on the Training Data, RMSE is 1156.590
For the 6 point Moving Average Model forecast on the Training Data, RMSE is 1283.927
For the 9 point Moving Average Model forecast on the Training Data, RMSE is 1346.278
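The rolling means behind these models can be sketched with pandas' rolling window (illustrative values):

```python
import pandas as pd

# Trailing 2-point moving average; the first value is NaN because the
# window is not yet full.
sales = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
ma2 = sales.rolling(window=2).mean()

print(ma2.tolist()[1:])  # [11.0, 13.0, 15.0, 17.0, 19.0]
```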

Before we go on to build the various exponential smoothing models, let us plot all the models
[only the most optimal Moving Average model (the one with the least RMSE) is plotted] and
compare the time series plots.
The plot below showcases the various model forecasts against the test data.

Model 5 – Simple Exponential Smoothing


In the Simple Exponential Smoothing Model, only the level of the Time Series is accounted for.
Here, we can see that the data has both trend and seasonality. This particular Simple
Exponential Smoothing model is built only to showcase how Simple Exponential Smoothing
models are built in Python.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.216.
For 𝛼 = 0.216, Test RMSE: 1275.081823 Test MAPE: 39.157523

We have run the model by setting different alpha values.

The higher the alpha value, the more weight is given to recent observations: in effect, what
happened recently is expected to happen again.

We have run a loop with different alpha values to understand which particular value works best
for alpha on the test set. Below are the top 5 𝛼 values with the least test RMSE values.

Now we will go ahead and plot the graph with auto predicted 𝛼 (0.216) as well as the 𝛼 with the
least test RMSE values (0.1).
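The report uses statsmodels' SimpleExpSmoothing for this model; the level recursion behind it can also be written out by hand as an illustrative sketch:

```python
def ses_forecast(series, alpha):
    """Return the flat forecast produced by simple exponential smoothing."""
    level = series[0]
    for y in series[1:]:
        # New level: weighted blend of the latest observation and the
        # previous level; a larger alpha tracks recent data more closely.
        level = alpha * y + (1 - alpha) * level
    return level

print(ses_forecast([10.0, 12.0, 11.0, 13.0], 0.5))  # 12.0
```

Because only the level is modelled, the same value is forecast for every future period, which is why SES struggles on a trending, seasonal series like this one.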
Model 6 – Double Exponential Smoothing (Holt's Model)
Two parameters 𝛼 and 𝛽 are estimated in this model. Level and Trend are accounted for in this
model. This particular Time Series seems to have a Seasonality as well. Let us see how Holt's
Model behaves in such a scenario.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.400 and 𝛽 to be 0.072.
We have run the model by setting different alpha and beta values.

We have run a loop with different alpha and beta values to understand which particular value
combination works best on the test set. Below are the top 5 𝛼 and 𝛽 value combinations with
the least test RMSE values.

Now we will go ahead and plot the graph with the auto-optimized 𝛼 (0.400) and 𝛽 (0.072) as well as
the 𝛼 and 𝛽 combination with the least test RMSE.
Test RMSE: 1778.564670 Test MAPE: 85.874037

Model 7 – Triple Exponential Smoothing (Holt - Winter's Model)


Three parameters 𝛼, 𝛽 and 𝛾 are estimated in this model. Level, Trend and Seasonality are
accounted for in this model. This particular Time Series looks to have trend as well as
seasonality, so Holt-Winter's model theoretically seems to be a correct fit. Let us see how the
model behaves.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.111, 𝛽 to be 0.049 and 𝛾 to
be 0.395.
We have run the model by setting different alpha, beta and gamma values.

We have run a loop with different alpha, beta and gamma values to understand which
particular value combination works best on the test set. Below are the top 5 𝛼, 𝛽 and 𝛾 value
combinations with the least test RMSE values.

Now we will go ahead and plot the graph with auto predicted 𝛼 (0.111), 𝛽(0.049) and 𝛾(0.395)
as well as the 𝛼, 𝛽 and 𝛾 with the least test RMSE values (0.4, 0.3 and 0.1).
5. Check the stationarity of the data on which the model is being built, using appropriate
statistical tests, and state the hypotheses for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for
stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
Hypotheses:
H0: The data is not stationary
H1: The data is stationary

We checked the stationarity of the data using the augmented Dickey-Fuller (ADF) test. From the figure
below we can infer that at the 5% significance level we cannot reject the null hypothesis, and hence
the time series is not stationary.

Since the data is not stationary, we took the first-order difference and re-checked stationarity. At
alpha = 0.05 we can reject the null hypothesis, as the p-value is almost 0 and less than 0.05; hence
the series is indeed stationary after differencing once.

6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.

ARIMA Model :

This model requires values for p, d and q, and the best values of these parameters can be
chosen by the lowest AIC among the candidate models. We therefore built models over the
parameter grid below: p and q range from 0 to 4, and d from 1 to 2. We sorted the model
results by AIC; the lowest AIC (2213.509213) on the training data corresponds to the
parameters (2,1,2).

Below are the results of fitting the best parameters identified – ARIMA (2,1,2). Both the AR lags and
the MA (error) terms are significant.
Fig. 8 ARIMA(2,1,2) Result

The test RMSE calculated for ARIMA (2,1,2) is 1299.980869

SARIMA

To build a SARIMA model we require six parameters: p, d, q and P, D, Q. We built SARIMA
models considering seasonality over the range 0 to 2 and selected the one with the lowest AIC
(1054.718055) on the training data – SARIMA(0, 1, 1)x(1, 0, 1, 12)

Below are the results applying the lowest parameters identified – SARIMA(0, 1, 1)x(1, 0, 1, 12).
Fig. 9 SARIMA(0,1,1)x(1,0,1,12) Result

The test RMSE for SARIMA (0,1,1)x(1,0,1,12) is 603.649011. Compared to ARIMA (2,1,2) it has a
much lower RMSE, which is due to the seasonality present in the dataset.

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training
data and evaluate these models on the test data using RMSE.

We plot the autocorrelation and partial autocorrelation functions on the whole data. From the
autocorrelation we determine the values of q and Q, and from the partial autocorrelation the
values of p and P, based on the significance level.

The autocorrelation has been plotted using statsmodels as shown in Figure 10; the 3rd lag lies in
the significant region, so the value of q can be taken as 3.

Similarly, the 2nd seasonal lag lies in the significant region, so we take the value of Q as 2.
Fig.10 ACF plot

The partial autocorrelation has been plotted using statsmodels as shown in Figure 11; from this
plot we can take p as 3, and since every seasonal lag is significant, P can be taken as 1.

Fig.11 PACF plot

All the required values have been read off the plots:
p = 3, q = 3, P = 1, Q = 2

Based on these values we have built the ARIMA/SARIMA models.

The ARIMA model has been built with the parameters p=3, d=1, q=3 and its results are shown in Figure 12.
All AR and MA terms are significant in this model.
Fig 12. ARIMA (3,1,3) Result

The test RMSE of ARIMA (3,1,3) is 1228.4889, slightly lower than that of ARIMA (2,1,2) but much
higher than that of SARIMA (0,1,1)x(1,0,1,12).

The SARIMA model has been built with the parameters p=3, d=1, q=3, P=1, D=0, Q=2 and its results
are shown in Figure 13.

AR lag 3 and seasonal MA lag 24 are not significant in this model.


Fig 13. SARIMA(3,1,3)x(1,0,2,12) Result

The test RMSE for SARIMA (3,1,3)x(1,0,2,12) is 623.9257, lower than both ARIMA models but
slightly higher than SARIMA (0,1,1)x(1,0,1,12).

We have also built an auto-ARIMA model using the pmdarima package in Python, with p and q
ranging from 0 to 3 and d fixed at 1.
Figure 14 shows the result of the model built with pmdarima.
Fig 14. ARIMA(2,1,3) Result

The test RMSE of this pmdarima-built model is 1300.1634, which is higher than the RMSE of the
other models.

Similarly, we have built a SARIMA model using pmdarima with p and q ranging from 0 to 4, P and Q
starting from 0, and a seasonal period of 12.
Figure 15 shows the result of the SARIMA(3, 1, 0)x(1, 0, 1, 12) model built through pmdarima.

Fig 15. SARIMA(3, 1, 0)x(1, 0, 1, 12) Result

The test RMSE for SARIMA(3, 1, 0)x(1, 0, 1, 12) is 899.7035, lower than all the ARIMA models but
slightly higher than the other two SARIMA models.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.

We have built a table (Fig 16) with the test RMSE values of all the models built so far. Of the 6
models, SARIMA (0,1,1)x(1,0,1,12) has the least RMSE. Hence we finalize this model as the optimal
model to forecast the next 12 months.

Fig 16. Test RMSE value table

9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.

Based on the above 6 models we have finalized SARIMA (0,1,1)x(1,0,1,12), since it has the least
test RMSE.
We have refit SARIMA (0,1,1)x(1,0,1,12) on the full dataset and its results are shown in
Figure 17.
Fig 17. SARIMA (0,1,1)x(1,0,1,12) Result
The RMSE of the SARIMA (0,1,1)x(1,0,1,12) model on the full dataset is 519.0809

We have forecasted 12 monthly values, from Aug 1995 to Jul 1996; the forecasted values are listed
in the table below.
We have also plotted the forecasted values of the SARIMA (0,1,1)x(1,0,1,12) model along with the
original values, as shown in Fig 18.

Fig 18. Original & Forecasted value of the model SARIMA (0,1,1)x(1,0,1,12)
10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.

• The final model suggested is SARIMA (0,1,1)x(1,0,1,12)
• The RMSE of the selected model on the test data is 603.6490
• The RMSE of the selected model on the full dataset is 519.0809
• To predict the August 1995 value, the model needs the error term of July 1995 (since p = 0
and q = 1) and the August 1994 value along with its error term (since P = Q = 1).
• Generalizing the above, one year of data is significant for predicting the future.
• The forecast shows an immediate increase in sales, which is due to seasonality, and the
forecasted year's overall sales are slightly better than the current year's
• The sales values of the past few years suggest the trend is almost consistent, with the
potential to increase sales in the upcoming year
Problem Statement:
For this assignment, data on sales of different types of wine in the 20th century is to be
analysed. Both series come from the same company but are for different wines. As an analyst at ABC
Estate Wines, you are tasked to analyse and forecast wine sales in the 20th century.
1. Read the data as an appropriate Time Series data and plot the data.
We have read the data and converted it into a monthly time series, parsing the "YearMonth" field
with parse_dates and using it as the index.

2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.

The dataset contains monthly Rose wine sales from January 1980 to July 1995. There are
two null values, which we have interpolated using the linear interpolation method.
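The linear interpolation of the missing values can be sketched with pandas (illustrative values):

```python
import numpy as np
import pandas as pd

# Two gaps filled by connecting their neighbours with straight lines.
s = pd.Series([50.0, np.nan, 70.0, 80.0, np.nan, 100.0])
filled = s.interpolate(method="linear")

print(filled.tolist())  # [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
```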

We have looked into the statistics and have identified different statistical measures like mean,
standard deviation and other measures on the given data. The mean monthly rose wine sales over
the period is 90.39 and median is 86. It shows slight skewness towards the right.
We also plotted the data against the time and studied the pattern.(Figure 1)

Figure 1: Time series plot of rose wine sales data

In the figures below we studied the distribution of the data and confirmed its skewness. Most of
the data values lie between 30 and 190. (Figure 2)

Figure 2: Histogram & Density of time series data - rose wine sales
We also plotted the yearly box plot. From it we can clearly see that Rose wine sales were highest
in 1980 and 1981 and have gradually decreased over the years. We also see a few outliers, which
are negligible in size and hence were left untreated. (Figure 3)

Figure 3: Yearly Box Plot


We also plotted the monthly box plot. From it, we can clearly see that Rose wine sales are highest
in December, with a slight increasing pattern through the year. (Figure 4)

Figure 4: Monthly Box Plot


We plotted the month plot, in which the red line shows the mean sales of each month. We can
infer that December consistently experienced high sales compared to other months over the
given period. (Figure 5)
Figure 5: Monthly Plot

We have also plotted the monthly sales data across the years and confirmed that, in every year,
December sales are by far the highest. (Figure 6)
Figure 6: Yearly Line plot

We have also decomposed the time series data to check for the components of Trend,
Seasonality and Residuals. The individual components and their plots are indicated
below.(Figure 7)
The decomposition graph shows strong month-on-month seasonality. The series is additive because
the seasonal variation does not increase as we move across time. The trend shows a decreasing
pattern from 1981.
Figure 7: Decomposition of data

3. Split the data into training and test. The test data should start in 1991.

We have split the data into training and test sets. The training data runs from January 1980 to
December 1990 and the test data from January 1991 to July 1995.
The training set has 132 records and the test set has 55.
We have displayed the last 5 records of the training data followed by the first 5 records of the test data.
Figure 8: Rose wine sales – Split into Test and Train data

4. Build various exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naive forecast
models, simple average models etc. should also be built on the training data and check
the performance on the test data using RMSE.

Model 1 – Linear Regression


For this linear regression, we regress the "Rose" variable against the order of occurrence
(the time index). The data was transformed accordingly before fitting the regression.
Test RMSE: 54.28611 Test MAPE: 111.119236

Model 2 – Naive Approach


For the naive model, the forecast for every future period is simply the last observed value of the
training data: the prediction for tomorrow is the same as today, and so is the prediction for every
day after that.
Test RMSE: 79.718773 Test MAPE: 164.846275

Model 3 – Simple Average

For this particular simple average method, we will forecast by using the average of the training values.
Test RMSE: 53.460570 Test MAPE: 110.587957
Model 4 – Moving Average
For the moving average model, we are going to calculate rolling means (or moving averages) for
different intervals. The best interval can be determined by the maximum accuracy (or the
minimum error).

2-point Moving Average - Test RMSE: 556.725 Test MAPE: 12.85


4-point Moving Average - Test RMSE: 687.182 Test MAPE: 15.51
6-point Moving Average - Test RMSE: 710.514 Test MAPE: 16.64
9-point Moving Average - Test RMSE: 735.890 Test MAPE: 16.61
According to the above we can see that the 2-point moving average is the best one to go with.

Before we go on to build the various Exponential Smoothing models, let us plot all the models
[only the most optimum Moving Average model (one with least RMSE) is plotted] and
compare the Time Series plots.
Model 5 – Simple Exponential Smoothing
In the Simple Exponential Smoothing Model, only the level of the Time Series is accounted for.
Here, we can see that the data has both trend and seasonality. This particular Simple
Exponential Smoothing model is built only to showcase how Simple Exponential Smoothing
models are built in Python.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.216.
Test RMSE: 36.796242 Test MAPE: 75.909219

We have run the model by setting different alpha values.

The higher the alpha value, the more weight is given to recent observations: in effect, what
happened recently is expected to happen again.

We have run a loop with different alpha values to understand which particular value works best
for alpha on the test set. Below are the top 5 𝛼 values with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.216) as well as the 𝛼 with the
least test RMSE values (0.1).
Model 6 – Double Exponential Smoothing (Holt's Model)
Two parameters 𝛼 and 𝛽 are estimated in this model. Level and Trend are accounted for in this
model. This particular Time Series seems to have a Seasonality as well. Let us see how Holt's
Model behaves in such a scenario.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.400 and 𝛽 to be 0.072.

We have run the model by setting different alpha and beta values.

We have run a loop with different alpha and beta values to understand which particular value
combination works best on the test set. Below are the top 5 𝛼 and 𝛽 value combinations with
the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.400) and 𝛽(0.072) as well as
the 𝛼 and 𝛽 with the least test RMSE values (0.1 and 0.1).
Model 7 – Triple Exponential Smoothing (Holt - Winter's Model)
Three parameters 𝛼, 𝛽 and 𝛾 are estimated in this model. Level, Trend and Seasonality are
accounted for in this model. This particular Time Series looks to have trend as well as
seasonality, so Holt-Winter's model theoretically seems to be a correct fit. Let us see how the
model behaves.
For this dataset Python has optimized the smoothing level 𝛼 to be 0.111, 𝛽 to be 0.049 and 𝛾 to
be 0.395.

We have run the model by setting different alpha, beta and gamma values.

We have run a loop with different alpha, beta and gamma values to understand which
particular value combination works best on the test set. Below are the top 5 𝛼, 𝛽 and 𝛾 value
combinations with the least test RMSE values.
Now we will go ahead and plot the graph with auto predicted 𝛼 (0.111), 𝛽(0.049) and 𝛾(0.395)
as well as the 𝛼, 𝛽 and 𝛾 with the least test RMSE values (0.4, 0.3 and 0.1).
5. Check the stationarity of the data on which the model is being built, using appropriate
statistical tests, and state the hypotheses for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for
stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.

Hypotheses:
H0: The data is not stationary
H1: The data is stationary

We checked the stationarity of the data using the augmented Dickey-Fuller (ADF) test. From the
figure below we can infer that at the 5% significance level we cannot reject the null hypothesis,
and hence the time series is not stationary.

We have taken a difference of order 1 and re-checked stationarity. At alpha = 0.05 we can reject
the null hypothesis, and hence the differenced series is indeed stationary.

6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and
evaluate this model on the test data using RMSE.

We have built ARIMA models with p and q ranging from 0 to 4, sorting the results by the lowest
Akaike Information Criteria (AIC). At the lowest AIC (1274.695172) on the training data, the
parameters are (2,1,3).
Below are the results of fitting the best parameters identified – ARIMA (2,1,3). From the
results, the lag-1 and lag-3 error terms are slightly insignificant.

We have also built SARIMA models considering seasonality over the range 0 to 2 and selected the
one with the lowest AIC (1054.718055) on the training data – SARIMA(1, 1, 1)x(1, 0, 1, 12).

Below are the results of fitting the best parameters identified – SARIMA(1, 1, 1)x(1, 0, 1, 12).
We have calculated the test RMSE for the ARIMA and SARIMA models: 36.813755 for
ARIMA (2,1,3) and 21.703017 for SARIMA(1, 1, 1)x(1, 0, 1, 12).

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.

We have plotted the autocorrelation and partial autocorrelation function plots on the whole data.
From Figure 9 we take q as 3; for Q, all seasonal lags appear significant, so Q is assumed
to be 2.
Figure 9: Autocorrelation function plot

From Figure 10, the value of p is taken as 5 and P as 3, since the lags at these values appear
significant.

Figure 10: Partial Autocorrelation function plot

We build the ARIMA model with parameters (5,1,3) based on the ACF and PACF plots. Below are the
results; the lag-1 and lag-4 terms are slightly insignificant.
We build the SARIMA model with parameters (5,1,3) x (3,0,2,12) based on the same plots. In this
model, only the error terms at lag 2 and at lag 12 (the first seasonal lag) are significant.
We have calculated RMSE values for ARIMA (5,1,3) and SARIMA (5,1,3) x (3,0,2,12) models

We have also built an ARIMA model with the pmdarima function in Python, with parameters ranging
from 0 to 3. Below is the result.
We have also built a SARIMA model with pmdarima, with the trend parameters ranging from 0 to 4
and the seasonal parameters from 0 to 5. Below is the result.

8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.

We have calculated the RMSE values for the different models; the best is the pmdarima-selected
SARIMA (1, 1, 2) x (1,0,1,12), with a test RMSE of 14.562001.
9. Based on the model-building exercise, build the most optimum model(s) on the complete
data and predict 12 months into the future with appropriate confidence intervals/bands.

We fit the SARIMA (1, 1, 2) x (1,0,1,12) model, which has the lowest test RMSE, on the whole
dataset. Below is the result.

We have forecasted the next 12 months – August 1995 to July 1996 – with the best model,
SARIMA (1, 1, 2) x (1,0,1,12). We have also calculated the RMSE for the full period.
Figure 11: Rose wine sales (original data – Jan 1980 to Jul 1995, Forecast – Aug 1995 to Jul
1996)

10. Comment on the model thus built and report your findings and suggest the measures that
the company should be taking for future sales.

• The final model suggested is SARIMA (1, 1, 2) x (1,0,1,12)
• The RMSE of the selected model on the test data is 14.562001
• The RMSE of the selected model on the full dataset is 33.6399
• To predict the August 1995 value, the model needs the July 1995 value and the error terms
of July and June 1995 (since p = 1 and q = 2), along with the August 1994 value and its
error term (since P = Q = 1).
• Generalizing the above, one year of data is significant for predicting the future.
• The forecast shows an immediate increase in sales, which is due to seasonality; however,
an overall decreasing trend is observed, suggesting management should take steps to
increase sales
