0% found this document useful (0 votes)
15 views24 pages

Time-Series-Forecast-A-Comprehensive-Guide - Jupyter Notebook

This document serves as a comprehensive guide to Time Series analysis, aimed at beginners, covering both theoretical concepts and practical code implementation using Python. It discusses key topics such as seasonality, trends, stationarity, and forecasting methods like ARIMA, while also providing a dataset for hands-on practice. The document is a work in progress, encouraging community feedback and contributions.

Uploaded by

Teo Chee Kiat
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views24 pages

Time-Series-Forecast-A-Comprehensive-Guide - Jupyter Notebook

This document serves as a comprehensive guide to Time Series analysis, aimed at beginners, covering both theoretical concepts and practical code implementation using Python. It discusses key topics such as seasonality, trends, stationarity, and forecasting methods like ARIMA, while also providing a dataset for hands-on practice. The document is a work in progress, encouraging community feedback and contributions.

Uploaded by

Teo Chee Kiat
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

In [1]:  # This Python 3 environment comes with many helpful analytics libraries installed

# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python


# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/dataset-superstore-20152018/Dataset- Superstore (2015-2018).csv

Prefaces: ¶
This kernel is prepared to be a container of many broad topics in the field of Time Series.

My motive is to make this the ultimate reference to Times Series analysis for beginners.

This kernel is a work in progress so every time you see on your home feed and open it, you will find some new updated content.

If there is any suggestion or any specific topic you would like me to cover, kindly mention that in the comments.

If you like my work, please upvote(press the like button) this kernel so it looks more relevant and meaningful to the community.

Thank you and together keep up the momentum!

TABLE OF CONTENT:

I.TIME SERIES: THEORY


1. INTRODUCTION

1.1. What is Time Series

1.2 What makes Time Series Special

2. BASIC CONCEPTS OF TIME SERIES

2.1. Seasonality

2.2. Trend

2.3. Cyclic

2.4. Random

3. STATIONATY vs NON-STATIONARY

3.1. What is Stationary

3.2. How to make a time series stationary

3.3. Why make a non-stationary series stationary before forecasting

4. FORECASTING THE TIME SERIES

II.TIME SERIES: CODE IMPLEMENTATION


1. Import Neccessary Library

2. Loading Dataset

3. Data Processing

4. Importing with indexing with time series data

5. Data Visualization

6. Check The Stationary the Dataset

7. Make a Time Series Stationary

8. Time series forecasting with ARIMA

8.1 Train-test Split

8.2 Hyperparameters of ARIMA model p,d,q using auto_arima

I.TIME SERIES: THEORY

1. INTRODUCTION:

1.1. What is Times Series:


Time series is a sequence of observations recorded at regular time intervals. Depending on the frequency of observations, a time series may typically be hourly, daily,
weekly, monthly, quarterly and annual.

Here time is the independent variable while the dependent variable might be

Stock market data


Sales data of companies
Data from the sensors of smart devices
The measure of electrical energy generated in the powerhouse.

Time series forecasting is basically the machine learning modeling for Time Series data (years, days, hours…etc.)for predicting future values using Time Series
modeling.

To gain some useful insights from time-series data, you have to decompose the time series and look for some basic components such as trend, seasonality, cyclic
behaviour, and irregular fluctuations. Based on some of these behaviours, we are deciding on which model to choose for time series modelling.

1.2. What makes Time Series Special?


Time Series is different from a regular regression problem in two points:

It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t apply in this case.
most Time Series have some form of seasonality trends, i.e. variations specific to a particular time frame. For example, if you see the sales of a woolen jacket over
time, you will invariably find higher sales in winter seasons.

Because of the inherent properties of a Time Series, there are various steps involved in analyzing it.

Let’s get a better understanding by exploring somw Basic concept of Time Series

2. BASIC CONCEPTS OF TIME- SERIES

2.1. Seasonality:
A data pattern that repeats itself at regular intervals is called Seasonality. Seasonal patterns can be very useful in scenarios like predicting network traffic, road traffic,
sales patterns of certain commodities that have high sales in certain seasons, etc.

Seasonal data with a slightly increasing trend.

2.2. Trend:
A long-term increasing or decreasing pattern in the data points indicates a trend. It could be linear/non-linear. For example, global temperature is at an increasing trend
due to global warming.

Global temperature through the years with an increasing trend.

2.3. Cyclic:
2.4. Random:
We know that data cannot be perfect, and we always need to provide leeway for some noise.

3. STATIONARY

3.1 What is stationary?

In the most intuitive sense, stationarity means that the statistical properties of a process generating a time series do not change over time. In other words all its statistical
properties (mean,variance, standard deviation) remain constant over time.

If you keenly observe the above images you can find the difference between the two plots. In stationary time series the mean, variance, and standard deviation of the
observed value over time are almost constant whereas in non-stationary time series this is not the case.

There are a lot of statistical theories to explore stationary series than non-stationary series.

In practice we can assume the series to be stationary if it has constant statistical properties over time and these properties can be:

• Constant mean

• Constant variance

• An auto co-variance that does not depend on time.

3.2 How to make a time series stationary?

You can make series stationary by:

Differencing the Series (once or more)


Take the log of the series
Take the nth root of the series
Combination of the above

The most common and convenient method to stationarize the series is by differencing the series at least once until it becomes approximately stationary.

So what is differencing? If Y_t is the value at time ‘t’, then the first difference of Y = Yt – Yt-1. In simpler terms, differencing the series is nothing but subtracting the next
value by the current value. If the first difference doesn’t make a series stationary, you can go for the second differencing. And so on.

For example, consider the following series: [1, 5, 2, 12, 20]

First differencing gives: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]

Second differencing gives: [-3-4, -10-3, 8-10] = [-7, -13, -2]

3.3 Why make a non-stationary series stationary before forecasting?

The stationarity of a series can be established by looking at the plot of the series.

Another method is to split the series into 2 or more contiguous parts and computing the summary statistics like the mean, variance and the autocorrelation. If the stats
are quite different, then the series is not likely to be stationary.

Nevertheless, you need a method to quantitatively determine if a given series is stationary or not. This can be done using statistical tests called ‘Unit Root Tests’. There
are multiple implementations of Unit Root tests like:

Augmented Dickey Fuller test (ADH Test)


Kwiatkowski-Phillips-Schmidt-Shin – KPSS test (trend stationary)
Philips Perron test (PP Test)
The most commonly used is the ADF test,In this test, First we consider the null hypothesis: the time series is non- stationary. The result from the rest will contain the
test statistic and critical value for different confidence levels. The idea is to have Test statistics less than critical value, in this case we can reject the null hypothesis and
say that this Time series is indeed stationary

4. FORCASTING A TIME SERIES:


Now that we have made the Time series stationary, let’s make models on the time series using differencing because it is easy to add the error , trend and seasonality
back into predicted values .

We will use statistical modelling method called ARIMA to forecast the data where there are dependencies in the values.

Auto Regressive Integrated Moving Average(ARIMA) — It is like a liner regression equation where the predictors depend on parameters (p,d,q) of the ARIMA model
.These three parameters account for seasonality, trend, and noise in data.

We can dive into this part more intensively in the Code Implementation section.

REFERENCE:
https://www.simplilearn.com/tutorials/python-tutorial/time-series-analysis-in-python#what_is_time_series_analysis (https://www.simplilearn.com/tutorials/python-
tutorial/time-series-analysis-in-python#what_is_time_series_analysis)

https://medium.com/@stallonejacob/time-series-forecast-a-basic-introduction-using-python-414fcb963000 (https://medium.com/@stallonejacob/time-series-forecast-a-
basic-introduction-using-python-414fcb963000)

https://www.analyticsvidhya.com/blog/2021/07/time-series-analysis-a-beginner-friendly-guide/#h2_2 (https://www.analyticsvidhya.com/blog/2021/07/time-series-analysis-
a-beginner-friendly-guide/#h2_2)

https://www.machinelearningplus.com/time-series/time-series-analysis-python/ (https://www.machinelearningplus.com/time-series/time-series-analysis-python/)

https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/ (https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-
python/)

https://machinelearningmastery.com/time-series-forecasting-with-prophet-in-python/ (https://machinelearningmastery.com/time-series-forecasting-with-prophet-in-
python/)

https://towardsdatascience.com/an-end-to-end-project-on-time-series-analysis-and-forecasting-with-python-4835e6bf050b (https://towardsdatascience.com/an-end-to-
end-project-on-time-series-analysis-and-forecasting-with-python-4835e6bf050b)

https://medium.com/coders-camp/10-machine-learning-projects-on-time-series-forecasting-ee0368420ccd (https://medium.com/coders-camp/10-machine-learning-
projects-on-time-series-forecasting-ee0368420ccd)

https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322 (https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322)

II.TIME SERIES: CODE IMPLEMENTATION


Here we are going to take the dataset containing a daily Sales and Profit of a Superstore in 4 year from 2015 to 2018

1. Importing Necessary libraries


In [2]:  from dateutil.parser import parse
import itertools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rcParams.update({'figure.figsize':(10,7),'figure.dpi':120})

2. Loading Dataset
In [3]:  df=pd.read_csv('../input/dataset-superstore-20152018/Dataset- Superstore (2015-2018).csv')
df

Out[3]: Row Order Order Ship Customer Customer Postal Product Sub-
Ship Date Segment Country City ... Region Category
ID ID Date Mode ID Name Code ID Category

CA-
Second United FUR-BO-
0 1 2016- 2016/11/08 2016/11/11 CG-12520 Claire Gute Consumer Henderson ... 42420 South Furniture Bookcases
Class States 10001798
152156

CA-
Second United FUR-CH-
1 2 2016- 2016/11/08 2016/11/11 CG-12520 Claire Gute Consumer Henderson ... 42420 South Furniture Chairs
Class States 10000454
152156

CA-
Second Darrin Van United OFF-LA- Office
2 3 2016- 2016/06/12 2016/06/16 DV-13045 Corporate Los Angeles ... 90036 West Labels
Class Huff States 10000240 Supplies
138688

US-
Standard Sean United Fort FUR-TA-
3 4 2015- 2015/10/11 2015/10/18 SO-20335 Consumer ... 33311 South Furniture Tables
Class O'Donnell States Lauderdale 10000577
108966

US-
Standard Sean United Fort OFF-ST- Office
4 5 2015- 2015/10/11 2015/10/18 SO-20335 Consumer ... 33311 South Storage
Class O'Donnell States Lauderdale 10000760 Supplies
108966

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ..

CA-
Second Tom United FUR-FU-
9989 9990 2014- 2014/01/21 2014/01/23 TB-21400 Consumer Miami ... 33180 South Furniture Furnishings
Class Boeckenhauer States 10001889
110422

CA-
Standard United FUR-FU-
9990 9991 2017- 2017/02/26 2017/03/03 DB-13060 Dave Brooks Consumer Costa Mesa ... 92627 West Furniture Furnishings
Class States 10000747
121258

CA-
Standard United TEC-PH-
9991 9992 2017- 2017/02/26 2017/03/03 DB-13060 Dave Brooks Consumer Costa Mesa ... 92627 West Technology Phones
Class States 10003645
121258

CA-
Standard United OFF-PA- Office
9992 9993 2017- 2017/02/26 2017/03/03 DB-13060 Dave Brooks Consumer Costa Mesa ... 92627 West Paper
Class States 10004041 Supplies
121258

CA-
Second United OFF-AP- Office
9993 9994 2017- 2017/05/04 2017/05/09 CC-12220 Chris Cortes Consumer Westminster ... 92683 West Appliances
Class States 10002684 Supplies
119914

9994 rows × 21 columns

We will take a look of the 'categories' variable to see what kind of product the store is selling:

In [4]:  df['Category'].value_counts()

Out[4]: Office Supplies 6026


Furniture 2121
Technology 1847
Name: Category, dtype: int64

There are several Categories in the Superstore sale data, we will start from time series analysis and
forcasting foe the 'Office Supplies' sales:
In [5]:  OS= df.loc[df['Category']=='Office Supplies']
OS.head(5)

Out[5]: Row Order Order Ship Customer Customer Postal Product Sub- Pro
Ship Date Segment Country City ... Region Category
ID ID Date Mode ID Name Code ID Category N

Adhe
CA-
Second Darrin United Los OFF-LA- Office Add
2 3 2016- 2016/06/12 2016/06/16 DV-13045 Corporate ... 90036 West Labels
Class Van Huff States Angeles 10000240 Supplies Labe
138688
Typew

US- Eldon Fo
Standard Sean United Fort OFF-ST- Office
4 5 2015- 2015/10/11 2015/10/18 SO-20335 Consumer ... 33311 South Storage Roll
Class O'Donnell States Lauderdale 10000760 Supplies
108966 Sy

CA-
Standard Brosina United Los OFF-AR- Office
6 7 2014- 2014/06/09 2014/06/14 BH-11710 Consumer ... 90032 West Art Newel
Class Hoffman States Angeles 10002833 Supplies
115812

DXL A
CA-
Standard Brosina United Los OFF-BI- Office View Bin
8 9 2014- 2014/06/09 2014/06/14 BH-11710 Consumer ... 90032 West Binders
Class Hoffman States Angeles 10003910 Supplies with Loc
115812
Rings by

B
CA-
Standard Brosina United Los OFF-AP- Office F5C206V
9 10 2014- 2014/06/09 2014/06/14 BH-11710 Consumer ... 90032 West Appliances
Class Hoffman States Angeles 10002892 Supplies 6O
115812
S

5 rows × 21 columns
We have a four year of Office Supplies data:

In [6]:  print('Starting date:',OS['Order Date'].min())


print('Ending date:',OS['Order Date'].max())

Starting date: 2014/01/03


Ending date: 2017/12/30

3. Data Processing

In this process, we will removing unrelevant variables, handling missing data, aggregate sales by date.

Our focus in this kernel is the Sale of Office Supplier over the time series. Therefore, we will skip only two columns:Oder
Date and Sales

In [7]:  # Drop unrelevant variables:


cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Posta


OS.drop(cols, axis=1, inplace= True)
OS

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4913: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus


-a-copy (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
errors=errors,

Out[7]: Order Date Sales

2 2016/06/12 14.620

4 2015/10/11 22.368

6 2014/06/09 7.280

8 2014/06/09 18.504

9 2014/06/09 114.900

... ... ...

9982 2016/09/22 35.560

9984 2015/05/17 31.500

9985 2015/05/17 55.600

9992 2017/02/26 29.600

9993 2017/05/04 243.160

6026 rows × 2 columns

In [8]:  ### Check out missing values:


OS.isnull().sum()

Out[8]: Order Date 0


Sales 0
dtype: int64

There are no missing values, so we move to the next step.

Aggregate sum of Office Supplies by date:

In [9]:  OS= OS.groupby('Order Date')['Sales'].sum().reset_index()


OS.head()

Out[9]: Order Date Sales

0 2014/01/03 16.448

1 2014/01/04 288.060

2 2014/01/05 19.536

3 2014/01/06 685.340

4 2014/01/07 10.430

https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Time%20Series%20Forecastings.ipynb (https://github.com/susanli2016/Machine-Learning-
with-Python/blob/master/Time%20Series%20Forecastings.ipynb)

4. Import Data with indexing time series data


In [10]:  OS['Order Date'] = pd.to_datetime(df['Order Date'])
OS= OS.set_index('Order Date')
OS

Out[10]: Sales

Order Date

2016-11-08 16.448

2016-11-08 288.060

2016-06-12 19.536

2015-10-11 685.340

2015-10-11 10.430

... ...

2014-09-29 814.594

2014-09-29 13.248

2014-09-29 1091.244

2015-04-04 282.440

2015-04-04 299.724

1148 rows × 1 columns

5. Data Visualization
In [11]:  OS['Sales'].plot()
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.title('Total sale over years')
plt.show()

The above is quite busy to interpret, we should use the resample function the time series data by Month and use the
averages monthly values

In [12]:  #create new DataFrame


monthly_OS = pd.DataFrame()

monthly_OS['Sales'] = OS['Sales'].resample('MS').mean()
In [13]:  #plot weekly sales data
plt.plot(monthly_OS.index, monthly_OS.Sales, linewidth=3)

Out[13]: [<matplotlib.lines.Line2D at 0x7962ef5694d0>]

Since all values are positive, you can show this on both sides of the Y axis to emphasize the
growth.
In [14]:  x= monthly_OS.index
y1= monthly_OS['Sales'].values

fig, ax = plt.subplots(1, 1, figsize=(16,5), dpi= 120)
plt.fill_between(x, y1=y1, y2=-y1, alpha=0.5, linewidth=2, color='seagreen')
plt.ylim(-800, 800)
plt.title('Sales (Two Side View)', fontsize=16)
plt.hlines(y=0, xmin=np.min(monthly_OS.index), xmax=np.max(monthly_OS.index), linewidth=.5)
plt.show()

We can nicely visualize the trend and how it varies each year in a nice year-wise boxplot.

Likewise, we can do a month-wise boxplot to visualize the monthly distributions.

Boxplot of Month-wise (Seasonal) and Year-wise (trend) Distribution

We can group the data at seasonal intervals and see how the values are distributed within a given year or month and how it compares over time.

In [15]:  OS['year'] = [d.year for d in OS.index]


OS['month'] = [d.strftime('%b') for d in OS.index]
years= OS['year'].unique()
years

Out[15]: array([2016, 2015, 2014, 2017])


In [16]:  # Draw Plot
fig, axes = plt.subplots(1, 2, figsize=(20,7), dpi= 80)
sns.boxplot(x='year', y='Sales', data=OS, ax=axes[0])
sns.boxplot(x='month', y='Sales', data=OS.loc[~OS.year.isin([2014,2917]), :])

# Set Title
axes[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize=18);
axes[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize=18)
plt.show()

In [17]:  from pylab import rcParams


rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(monthly_OS['Sales'], model='additive')
fig = decomposition.plot()
plt.show()

The plots show the data is seasonality

6. Check Stationary of the Dataset


Stationarity is defined using very strict criterion. However, for practical purposes we can assume the series to be stationary if it has constant statistical properties over
time, ie. the following:

constant mean
constant variance
an autocovariance that does not depend on time.

Formally, we check the stationary using the following:

Plotting Rolling Statistic: we can plot the moving average or moving variance and see if it varies with time.
Dickey- Fuller Test: The test results comprise of a Test Statistic and some Critical Values for difference confidence levels. If the ‘Test Statistic’ is less than the
‘Critical Value’, we can reject the null hypothesis and say that the series is stationary.

First, we will plot the Rolling Statistics Plot

In [18]:  #Determing rolling statistics


moving_avg = monthly_OS.rolling(12).mean()
moving_std= monthly_OS.rolling(12).std()
In [19]:  #Plot rolling statistics:
orig = plt.plot(monthly_OS, color='blue',label='Original')
mean = plt.plot(moving_avg, color='red', label='Rolling Mean')
std = plt.plot(moving_std, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

Now, we will conduct the Dickey-Fuller test:

In [20]:  from statsmodels.tsa.stattools import adfuller


print ('Results of Dickey-Fuller Test:')
dftest = adfuller(monthly_OS, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print (dfoutput)

Results of Dickey-Fuller Test:


Test Statistic -1.630238
p-value 0.467366
#Lags Used 4.000000
Number of Observations Used 43.000000
Critical Value (1%) -3.592504
Critical Value (5%) -2.931550
Critical Value (10%) -2.604066
dtype: float64

Here’s how to interpret the most important values in the output:

Test statistic: -1.630238

P-value: 0.467366

Since the p-value is not less than .05, we fail to reject the null hypothesis.

This means the time series is non-stationary.

In other words, it has some time-dependent structure and does not have constant variance over time.

7. Make a Time Series Stationary


There are several method to make a time series stationary:

Take a log transform


Moving average
Exponentially weighted moving average
Difference
Decomposition

Some might work well in this case and others might not. But the idea is to get a hang of all the methods and not focus on just the problem at hand.

Let's started!
a) Log Transform:
In [21]:  do= pd.read_csv('../input/dataset-superstore-20152018/Dataset- Superstore (2015-2018).csv')
store= do.loc[do['Category']=='Office Supplies']
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Posta
store.drop(cols, axis=1, inplace=True)
store

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4913: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus


-a-copy (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
errors=errors,

Out[21]: Order Date Sales

2 2016/06/12 14.620

4 2015/10/11 22.368

6 2014/06/09 7.280

8 2014/06/09 18.504

9 2014/06/09 114.900

... ... ...

9982 2016/09/22 35.560

9984 2015/05/17 31.500

9985 2015/05/17 55.600

9992 2017/02/26 29.600

9993 2017/05/04 243.160

6026 rows × 2 columns

In [22]:  store = store.groupby('Order Date')['Sales'].sum().reset_index()


store

Out[22]: Order Date Sales

0 2014/01/03 16.448

1 2014/01/04 288.060

2 2014/01/05 19.536

3 2014/01/06 685.340

4 2014/01/07 10.430

... ... ...

1143 2017/12/26 814.594

1144 2017/12/27 13.248

1145 2017/12/28 1091.244

1146 2017/12/29 282.440

1147 2017/12/30 299.724

1148 rows × 2 columns

In [23]:  store = store.set_index('Order Date')


store.index

Out[23]: Index(['2014/01/03', '2014/01/04', '2014/01/05', '2014/01/06', '2014/01/07',


'2014/01/09', '2014/01/10', '2014/01/13', '2014/01/16', '2014/01/18',
...
'2017/12/21', '2017/12/22', '2017/12/23', '2017/12/24', '2017/12/25',
'2017/12/26', '2017/12/27', '2017/12/28', '2017/12/29', '2017/12/30'],
dtype='object', name='Order Date', length=1148)

In [24]:  #create new DataFrame:


store.index = pd.to_datetime(store.index)

y = store['Sales'].resample('MS').mean()

In [25]:  ## Lets take a log transform here for simplicity:
ts_log = np.log(y)
plt.plot(ts_log)

Out[25]: [<matplotlib.lines.Line2D at 0x7962eee8a290>]

In this case, we can see the plot is not a forward trend in the data. So, take a log transform is not a solution to make a time-series stationary

b) Moving Average:

In this approach, we take average of ‘k’ consecutive values depending on the frequency of time series. Here we can take the average over the past 1 year, i.e. last 12
values.

Pandas has specific functions defined for determining rolling statistics.

In [26]:  moving_avg = ts_log.rolling(12).mean()


plt.plot(ts_log)
plt.plot(moving_avg, color='red')

Out[26]: [<matplotlib.lines.Line2D at 0x7962eee13850>]

The red line shows the rolling mean.

Lets subtract this from the original series.

Note that since we are taking average of last 12 values, rolling mean is not defined for first 11 values. This can be observed as:
In [27]:  ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(12)

Out[27]: Order Date


2014-01-01 NaN
2014-02-01 NaN
2014-03-01 NaN
2014-04-01 NaN
2014-05-01 NaN
2014-06-01 NaN
2014-07-01 NaN
2014-08-01 NaN
2014-09-01 NaN
2014-10-01 NaN
2014-11-01 NaN
2014-12-01 0.464652
Freq: MS, Name: Sales, dtype: float64

Notice the first 11 being Nan.

Lets drop these NaN values and check the plots to test stationarity.

In [28]:  def test_stationarity(timeseries):


#Determing rolling statistics
rolmean = timeseries.rolling(12).mean()
rolstd = timeseries.rolling(12).std()

#Plot rolling statistics:
orig = plt.plot(timeseries, color='blue',label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
print ('Results of Dickey-Fuller Test:')

#Perform Dickey-Fuller test:


dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print (dfoutput)

In [29]:  ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)

Results of Dickey-Fuller Test:


Test Statistic -5.626996
p-value 0.000001
#Lags Used 0.000000
Number of Observations Used 36.000000
Critical Value (1%) -3.626652
Critical Value (5%) -2.945951
Critical Value (10%) -2.611671
dtype: float64

This looks like a much better series.

The rolling values appear to be varying slightly but there is no specific trend.

Also, the test statistic is smaller than the 1% critical values so we can say with 99% confidence that this is a stationary series.

c) Exponentially weighted moving average:

However, a drawback in this particular approach is that the time-period has to be strictly defined.

So we take a ‘weighted moving average’ where more recent values are given a higher weight. There can be many technique for assigning weights.

A popular one is exponentially weighted moving average where weights are assigned to all the previous values with a decay factor.
This can be implemented in Pandas as:

In [30]:  expwighted_avg = ts_log.ewm(halflife=12).mean()



plt.plot(ts_log)
plt.plot(expwighted_avg, color='red')

Out[30]: [<matplotlib.lines.Line2D at 0x7962eec664d0>]

Note that here the parameter ‘halflife’ is used to define the amount of exponential decay. This is just an assumption here and would depend largely on the business
domain.

Other parameters like span and center of mass can also be used to define decay which are discussed in the link shared above.

Now, let’s remove this from series and check stationarity:

In [31]:  ts_log_ewma_diff = ts_log - expwighted_avg


test_stationarity(ts_log_ewma_diff)

Results of Dickey-Fuller Test:


Test Statistic -2.967356
p-value 0.038057
#Lags Used 3.000000
Number of Observations Used 44.000000
Critical Value (1%) -3.588573
Critical Value (5%) -2.929886
Critical Value (10%) -2.603185
dtype: float64

Since the p-value is less than .05, we fail to acept the null hypothesis: this time-series is stationary

d) Differencing:

One of the most common methods of dealing with both trend and seasonality is differencing.

In this technique, we take the difference of the observation at a particular instant with that at the previous instant.

This mostly works well in improving stationarity.


In [32]:  ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

Out[32]: [<matplotlib.lines.Line2D at 0x7962ee31d350>]

In [33]:  ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

Results of Dickey-Fuller Test:


Test Statistic -4.771865
p-value 0.000062
#Lags Used 9.000000
Number of Observations Used 37.000000
Critical Value (1%) -3.620918
Critical Value (5%) -2.943539
Critical Value (10%) -2.610400
dtype: float64

We can see that the mean and std variations have small variations with time.

Also, the Dickey-Fuller test statistic is less than the 1% critical value, thus the TS is stationary with 99% confidence.

e) Decomposing:

In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned.
In [34]:  from pylab import rcParams
rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(ts_log, model='additive')
fig = decomposition.plot()
plt.show()

Here we can see that the trend, seasonality are separated out from data and we can model the residuals.

Lets check stationarity of residuals:

In [35]:  from statsmodels.tsa.seasonal import seasonal_decompose


decomposition = seasonal_decompose(ts_log)
residual = decomposition.resid

ts_log_decompose = residual
ts_log_decompose.dropna(inplace=True)
test_stationarity(ts_log_decompose)

Results of Dickey-Fuller Test:


Test Statistic -4.901459
p-value 0.000035
#Lags Used 4.000000
Number of Observations Used 31.000000
Critical Value (1%) -3.661429
Critical Value (5%) -2.960525
Critical Value (10%) -2.619319
dtype: float64

The Dickey-Fuller test statistic is significantly lower than the 1% critical value.

So this TS is very close to stationary.

8. Time Series Forecasting with ARIMA


We are going to apply one of the most commonly used method for time-series forecasting, known as ARIMA, which stands for Autoregressive Integrated Moving
Average.

ARIMA stands for Auto-Regressive Integrated Moving Averages.


The ARIMA forecasting for a stationary time series is nothing but a linear (like a linear regression) equation. The predictors depend on the parameters (p,d,q) of the
ARIMA model:

Number of AR (Auto-Regressive) terms (p): AR terms are just lags of dependent variable.
Number of MA (Moving Average) terms (q): MA terms are lagged forecast errors in prediction equation.
Number of Differences (d): These are the number of nonseasonal differences, i.e. in this case we took the first order difference. So either we can pass that
variable and put d=0 or pass the original variable and put d=1 Both will generate same results

There are 3 way to define p,q,d:

ACF and PACF plot


Auto_arima
Loops

In this project I will use auto_arima function to decide p,q,d

auto_arima() uses a stepwise approach to search multiple combinations of p,d,q parameters and chooses the best model that has the least AIC.

I will split the train and test set, apply autoarima to decide p,q,d. Then get the predicted value for test set,plot the train,test, prect data and then evaluate the forcast
accuracy.

If the result from forecast accuracy doesn't support the ARIMA model, we should choose different method to forecast the data. One suggestion is Seasonal ARIMA
model called SARIMAX

Let's get started!

8.1. Train Test Split:


In [36]:  ​

train= y[:40]
test= y[40:]




In [37]:  !pip install pmdarima
import pmdarima

from pmdarima import auto_arima

Collecting pmdarima
Downloading pmdarima-2.0.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (1.8 MB)
|████████████████████████████████| 1.8 MB 11.3 MB/s
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (1.26.7)
Requirement already satisfied: scipy>=1.3.2 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (1.7.3)
Collecting numpy>=1.21.2
Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
|████████████████████████████████| 15.7 MB 49.4 MB/s
Requirement already satisfied: Cython!=0.29.18,!=0.29.31,>=0.29 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (0.29.28)
Collecting statsmodels>=0.13.2
Downloading statsmodels-0.13.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
|████████████████████████████████| 9.9 MB 47.1 MB/s
Requirement already satisfied: scikit-learn>=0.22 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (1.0.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (1.1.0)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (59.5.0)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.7/site-packages (from pmdarima) (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->pmdarima) (2.8.
2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->pmdarima) (2021.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.22->pmdarima)
(3.0.0)
Requirement already satisfied: packaging>=21.3 in /opt/conda/lib/python3.7/site-packages (from statsmodels>=0.13.2->pmdarima) (21.
3)
Requirement already satisfied: patsy>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from statsmodels>=0.13.2->pmdarima) (0.5.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=21.3->statsmodel
s>=0.13.2->pmdarima) (3.0.6)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from patsy>=0.5.2->statsmodels>=0.13.2->pmdarima) (1.
16.0)
Installing collected packages: numpy, statsmodels, pmdarima
Attempting uninstall: numpy
Found existing installation: numpy 1.20.3
Uninstalling numpy-1.20.3:
Successfully uninstalled numpy-1.20.3
Attempting uninstall: statsmodels
Found existing installation: statsmodels 0.13.1
Uninstalling statsmodels-0.13.1:
Successfully uninstalled statsmodels-0.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the so
urce of the following dependency conflicts.
tensorflow-io 0.21.0 requires tensorflow-io-gcs-filesystem==0.21.0, which is not installed.
explainable-ai-sdk 1.3.2 requires xai-image-widget, which is not installed.
beatrix-jupyterlab 3.1.6 requires google-cloud-bigquery-storage, which is not installed.
thinc 8.0.15 requires typing-extensions<4.0.0.0,>=3.7.4.1; python_version < "3.8", but you have typing-extensions 4.1.1 which is in
compatible.
tfx-bsl 1.5.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.15.0 which is incompatible.
tfx-bsl 1.5.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.6 which is incompatible.
tfx-bsl 1.5.0 requires pyarrow<6,>=1, but you have pyarrow 6.0.1 which is incompatible.
tfx-bsl 1.5.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<3,>=1.15.2, but you have tensorflow 2.6.2
which is incompatible.
tensorflow 2.6.2 requires numpy~=1.19.2, but you have numpy 1.21.6 which is incompatible.
tensorflow 2.6.2 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.6.2 requires typing-extensions~=3.7.4, but you have typing-extensions 4.1.1 which is incompatible.
tensorflow 2.6.2 requires wrapt~=1.12.1, but you have wrapt 1.13.3 which is incompatible.
tensorflow-transform 1.5.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.15.0 which is incompatible.
tensorflow-transform 1.5.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.6 which is incompatible.
tensorflow-transform 1.5.0 requires pyarrow<6,>=1, but you have pyarrow 6.0.1 which is incompatible.
tensorflow-transform 1.5.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<2.8,>=1.15.2, but you have t
ensorflow 2.6.2 which is incompatible.
tensorflow-serving-api 2.7.0 requires tensorflow<3,>=2.7.0, but you have tensorflow 2.6.2 which is incompatible.
spacy 3.2.3 requires typing-extensions<4.0.0.0,>=3.7.4; python_version < "3.8", but you have typing-extensions 4.1.1 which is incom
patible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.5.1 which is incompatible.
numba 0.54.1 requires numpy<1.21,>=1.17, but you have numpy 1.21.6 which is incompatible.
arviz 0.11.4 requires typing-extensions<4,>=3.7.4.3, but you have typing-extensions 4.1.1 which is incompatible.
apache-beam 2.34.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.4 which is incompatible.
apache-beam 2.34.0 requires httplib2<0.20.0,>=0.8, but you have httplib2 0.20.2 which is incompatible.
apache-beam 2.34.0 requires numpy<1.21.0,>=1.14.3, but you have numpy 1.21.6 which is incompatible.
apache-beam 2.34.0 requires pyarrow<6.0.0,>=0.15.1, but you have pyarrow 6.0.1 which is incompatible.
apache-beam 2.34.0 requires typing-extensions<4,>=3.7.0, but you have typing-extensions 4.1.1 which is incompatible.
Successfully installed numpy-1.21.6 pmdarima-2.0.3 statsmodels-0.13.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager.
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv (https://pip.pypa.io/warnings/venv)
8.2 Hyperparameters of ARIMA model p,d,q using auto_arima
In [38]:  auto_arima(train, test='adf',seasonal=True, trace=True, error_action='ignore', suppress_warnings=True)

Performing stepwise search to minimize aic


ARIMA(2,2,2)(0,0,0)[0] : AIC=inf, Time=0.33 sec
ARIMA(0,2,0)(0,0,0)[0] : AIC=592.497, Time=0.04 sec
ARIMA(1,2,0)(0,0,0)[0] : AIC=576.965, Time=0.04 sec
ARIMA(0,2,1)(0,0,0)[0] : AIC=inf, Time=0.08 sec
ARIMA(2,2,0)(0,0,0)[0] : AIC=559.020, Time=0.09 sec
ARIMA(3,2,0)(0,0,0)[0] : AIC=554.143, Time=0.13 sec
ARIMA(4,2,0)(0,0,0)[0] : AIC=555.183, Time=0.20 sec
ARIMA(3,2,1)(0,0,0)[0] : AIC=inf, Time=0.25 sec
ARIMA(2,2,1)(0,0,0)[0] : AIC=inf, Time=0.14 sec
ARIMA(4,2,1)(0,0,0)[0] : AIC=inf, Time=0.28 sec
ARIMA(3,2,0)(0,0,0)[0] intercept : AIC=555.965, Time=0.23 sec

Best model: ARIMA(3,2,0)(0,0,0)[0]


Total fit time: 1.836 seconds

Out[38]: ARIMA(order=(3, 2, 0), scoring_args={}, suppress_warnings=True,


with_intercept=False)

In [39]:  ​
auto_arima(y,test='adf', # use adftest to find optimal 'd'
# maximum p and q
# frequency of series
# let model determine 'd'
seasonal=True, # No Seasonality


trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)

Performing stepwise search to minimize aic


ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=inf, Time=0.32 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=682.234, Time=0.01 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=673.518, Time=0.05 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=inf, Time=0.06 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=680.349, Time=0.02 sec
ARIMA(2,1,0)(0,0,0)[0] intercept : AIC=666.940, Time=0.03 sec
ARIMA(3,1,0)(0,0,0)[0] intercept : AIC=668.928, Time=0.08 sec
ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=668.928, Time=0.09 sec
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=inf, Time=0.10 sec
ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=670.928, Time=0.13 sec
ARIMA(2,1,0)(0,0,0)[0] : AIC=665.748, Time=0.04 sec
ARIMA(1,1,0)(0,0,0)[0] : AIC=671.902, Time=0.23 sec
ARIMA(3,1,0)(0,0,0)[0] : AIC=667.718, Time=0.11 sec
ARIMA(2,1,1)(0,0,0)[0] : AIC=667.709, Time=0.13 sec
ARIMA(1,1,1)(0,0,0)[0] : AIC=666.599, Time=0.06 sec
ARIMA(3,1,1)(0,0,0)[0] : AIC=668.151, Time=0.28 sec

Best model: ARIMA(2,1,0)(0,0,0)[0]


Total fit time: 1.778 seconds

Out[39]: ARIMA(order=(2, 1, 0), scoring_args={}, suppress_warnings=True,


with_intercept=False)
8.3 Build ARIMA model:
In [40]:  from statsmodels.tsa.arima.model import ARIMA
model=ARIMA(train, order=(1,1,1)).fit()
model.summary()

Out[40]:
SARIMAX Results

Dep. Variable: Sales No. Observations: 40

Model: ARIMA(1, 1, 1) Log Likelihood -273.088

Date: Thu, 06 Jul 2023 AIC 552.176

Time: 15:19:31 BIC 557.167

Sample: 01-01-2014 HQIC 553.967

- 04-01-2017

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 0.1424 0.249 0.573 0.567 -0.345 0.630

ma.L1 -0.9118 0.143 -6.398 0.000 -1.191 -0.632

sigma2 6.796e+04 1.48e+04 4.607 0.000 3.9e+04 9.69e+04

Ljung-Box (L1) (Q): 0.28 Jarque-Bera (JB): 1.34

Prob(Q): 0.60 Prob(JB): 0.51

Heteroskedasticity (H): 0.84 Skew: 0.45

Prob(H) (two-sided): 0.76 Kurtosis: 3.16

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Predict test dataset:


In [41]:  pred= model.predict(start=len(train), end=(len(y)-1),dynamic=True)
pred

Out[41]: 2017-05-01 609.236644


2017-06-01 616.500443
2017-07-01 617.534859
2017-08-01 617.682167
2017-09-01 617.703144
2017-10-01 617.706132
2017-11-01 617.706557
2017-12-01 617.706618
Freq: MS, Name: predicted_mean, dtype: float64

In [42]:  test

Out[42]: Order Date


2017-05-01 508.776444
2017-06-01 650.463038
2017-07-01 393.902615
2017-08-01 1156.148154
2017-09-01 1139.137250
2017-10-01 886.045846
2017-11-01 1124.012036
2017-12-01 1049.549724
Freq: MS, Name: Sales, dtype: float64

8.4. ARIMA- Model Evaluation: using MAPE

Mean Absolute Percentage Error (MAPE): MAPE is defined as the percentage of the average of absolute difference between forecasted values and true values, divided
by true value.

In [43]:  from sklearn.metrics import mean_absolute_percentage_error



mape= mean_absolute_percentage_error(test, pred)

print('MAPE: %f' %mape)

MAPE: 0.363205

The lower the MAPE, the better the model is. Our model have considerably high MAPE.

Let's plot the predict value to see what can be the reason for the low accuracy
8.5. Plot prediction for test value:
In [44]:  train.plot(legend=True, label='Train', figsize=(10,6))

test.plot(legend=True, label= 'Test')

pred.plot(legend=True, label='ARIMA prediction')

Out[44]: <AxesSubplot:xlabel='Order Date'>


The problem of ARIMA prediction is it doesnt show the season trend of data.

The solution for this problem is using Seasonal ARIMA called SARIMAX function!

9. FORECAST DATA USING SARIMAX model:


In [45]:  import statsmodels.api as sm
model1=sm.tsa.statespace.SARIMAX(train,order=(1, 1, 1),seasonal_order=(1,1,1,12))

results=model1.fit()
results.summary()

/opt/conda/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py:868: UserWarning: Too few observations to estimate sta


rting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
' zeros.' % warning_description)
This problem is unconstrained.

RUNNING THE L-BFGS-B CODE

* * *

Machine precision = 2.220D-16


N = 5 M = 10

At X0 0 variables are exactly at the bounds

At iterate 0 f= 4.79193D+00 |proj g|= 7.99194D-02

At iterate 5 f= 4.75725D+00 |proj g|= 2.20420D-02

At iterate 10 f= 4.75546D+00 |proj g|= 7.72465D-04

At iterate 15 f= 4.75499D+00 |proj g|= 5.67592D-03

At iterate 20 f= 4.73881D+00 |proj g|= 5.05180D-02

At iterate 25 f= 4.73188D+00 |proj g|= 7.83322D-03

At iterate 30 f= 4.73136D+00 |proj g|= 7.12500D-04

At iterate 35 f= 4.73129D+00 |proj g|= 6.85744D-04

At iterate 40 f= 4.73127D+00 |proj g|= 3.92378D-04

At iterate 45 f= 4.73108D+00 |proj g|= 6.41897D-04

At iterate 50 f= 4.73107D+00 |proj g|= 3.38067D-06

* * *

Tit = total number of iterations


Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value

* * *

N Tit Tnf Tnint Skip Nact Projg F


5 50 57 1 0 0 3.381D-06 4.731D+00
F = 4.7310740331402190

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

/opt/conda/lib/python3.7/site-packages/statsmodels/base/model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to


converge. Check mle_retvals
ConvergenceWarning)
Out[45]:
SARIMAX Results

Dep. Variable: Sales No. Observations: 40

Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -189.243

Date: Thu, 06 Jul 2023 AIC 388.486

Time: 15:19:38 BIC 394.965

Sample: 01-01-2014 HQIC 390.413

- 04-01-2017

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 0.2075 0.300 0.693 0.489 -0.380 0.795

ma.L1 -0.9253 0.326 -2.839 0.005 -1.564 -0.286

ar.S.L12 0.2317 0.758 0.305 0.760 -1.255 1.718

ma.S.L12 -0.9971 0.465 -2.143 0.032 -1.909 -0.085

sigma2 4.794e+04 9.81e-06 4.89e+09 0.000 4.79e+04 4.79e+04

Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 3.39

Prob(Q): 0.98 Prob(JB): 0.18

Heteroskedasticity (H): 2.18 Skew: 0.63

Prob(H) (two-sided): 0.26 Kurtosis: 4.20

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.67e+26. Standard errors may be unstable.

In [46]:  pre=results.predict(start= len(train), end= (len(y)-1),dynamic=True)


pre

Out[46]: 2017-05-01 524.245522


2017-06-01 636.979984
2017-07-01 686.349438
2017-08-01 583.601864
2017-09-01 997.210164
2017-10-01 638.870800
2017-11-01 957.407646
2017-12-01 1119.324792
Freq: MS, Name: predicted_mean, dtype: float64

Plot Forecast Data:


In [47]:  train.plot(legend=True, label='Train', figsize=(10,6))

test.plot(legend=True, label= 'Test')

pre.plot(legend=True, label='SARIMAX prediction')

Out[47]: <AxesSubplot:xlabel='Order Date'>


9.1 SARIMAX MODEL EVALUATION: MAPE
In [48]:  mape= mean_absolute_percentage_error(test, pre)

print('MAPE1: %f' %mape)

MAPE1: 0.238381

The model is improved!!! We can coutinue use SARIMAX to forecast the Office Supplier's Sale

9.2. PREDICT FUTURE SALE WITH ARIMAX:


In [49]:  future_sale= results.predict(start= len(y), end=(len(y)+12))
future_sale

Out[49]: 2018-01-01 733.865582


2018-02-01 467.398664
2018-03-01 714.996393
2018-04-01 671.129843
2018-05-01 602.604323
2018-06-01 738.389077
2018-07-01 762.834934
2018-08-01 692.288054
2018-09-01 1089.011560
2018-10-01 682.953021
2018-11-01 1049.907246
2018-12-01 1117.481887
2019-01-01 733.084608
Freq: MS, Name: predicted_mean, dtype: float64

In [50]:  y.plot(legend=True, label='Current Sale', figsize=(10,6))



future_sale.plot(legend= True, label='Future Sale')

Out[50]: <AxesSubplot:xlabel='Order Date'>

THE END
Thank you for spending time checking my kernel.

Please leave comment and like this kernel if you think it's helpful.

Thank you!

In [ ]:  ​

You might also like