Time-Series-Forecast-A-Comprehensive-Guide - Jupyter Notebook
Preface:
This kernel is intended to be a container for many broad topics in the field of time series.
My motive is to make this the ultimate reference on time series analysis for beginners.
This kernel is a work in progress, so every time you see it on your home feed and open it, you will find some new updated content.
If you have any suggestion or a specific topic you would like me to cover, kindly mention it in the comments.
If you like my work, please upvote (press the like button) this kernel so it looks more relevant and meaningful to the community.
TABLE OF CONTENTS:
1. INTRODUCTION
2.1. Seasonality
2.2. Trend
2.3. Cyclic
2.4. Random
3. STATIONARY vs NON-STATIONARY
Code Implementation:
2. Loading Dataset
3. Data Processing
5. Data Visualization
1. INTRODUCTION:
Time series forecasting is essentially machine learning modeling on time series data (years, days, hours, etc.) to predict future values.
Here time is the independent variable, while the dependent variable is the quantity we want to predict, such as sales or temperature.
To gain useful insights from time-series data, you have to decompose the time series and look for basic components such as trend, seasonality, cyclic behaviour, and irregular fluctuations. Based on these behaviours, we decide which model to choose for time series modelling.
Time series data is time dependent, so the basic assumption of a linear regression model, that the observations are independent, doesn't apply in this case.
Most time series have some form of seasonality, i.e. variations specific to a particular time frame. For example, if you look at the sales of woolen jackets over time, you will invariably find higher sales in the winter season.
Because of the inherent properties of a Time Series, there are various steps involved in analyzing it.
Let's get a better understanding by exploring some basic concepts of time series.
2.1. Seasonality:
A data pattern that repeats itself at regular intervals is called Seasonality. Seasonal patterns can be very useful in scenarios like predicting network traffic, road traffic,
sales patterns of certain commodities that have high sales in certain seasons, etc.
2.2. Trend:
A long-term increasing or decreasing pattern in the data points indicates a trend. It can be linear or non-linear. For example, global temperature shows an increasing trend due to global warming.
2.3. Cyclic:
A cyclic pattern rises and falls without a fixed, calendar-bound period: unlike seasonality, the length of a cycle is not constant. Business cycles in economic data are a typical example.
2.4. Random:
We know that data cannot be perfect, and we always need to allow for some noise: the random, irregular fluctuations that remain after trend, seasonality, and cyclic behaviour are accounted for.
3. STATIONARY vs NON-STATIONARY
In the most intuitive sense, stationarity means that the statistical properties of the process generating a time series do not change over time. In other words, its statistical properties (mean, variance, standard deviation) remain constant over time.
If you keenly observe the plots above, you can see the difference between the two: in a stationary time series the mean, variance, and standard deviation of the observed values are almost constant over time, whereas in a non-stationary time series this is not the case.
There is much more statistical theory available for stationary series than for non-stationary series.
In practice, we can assume a series to be stationary if it has constant statistical properties over time, namely:
• constant mean
• constant variance
• an autocovariance that does not depend on time
The most common and convenient method to stationarize a series is differencing: difference the series at least once until it becomes approximately stationary.
So what is differencing? If Y_t is the value at time t, then the first difference of Y is Y_t - Y_{t-1}. In simpler terms, differencing the series is nothing but subtracting the previous value from the current value. If the first difference doesn't make the series stationary, you can go for the second difference, and so on.
For example, consider the series [1, 5, 2, 12, 20]. First differencing gives: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]
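Differencing is a one-liner in pandas; a minimal sketch using the example series above:

import pandas as pd

s = pd.Series([1, 5, 2, 12, 20])
print(s.diff().tolist())         # [nan, 4.0, -3.0, 10.0, 8.0] -- the first value has no predecessor
print(s.diff().diff().tolist())  # second differencing: difference the already-differenced series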
The stationarity of a series can often be judged by looking at a plot of the series.
Another method is to split the series into two or more contiguous parts and compute summary statistics like the mean, variance, and autocorrelation for each part. If the statistics are quite different, the series is not likely to be stationary.
Nevertheless, you need a method to quantitatively determine whether a given series is stationary. This can be done using statistical tests called 'Unit Root Tests'. There are multiple implementations of unit root tests, such as the Augmented Dickey-Fuller (ADF) test used later in this kernel and the KPSS test.
We will use a statistical modelling method called ARIMA to forecast data where there are dependencies among the values.
Auto-Regressive Integrated Moving Average (ARIMA) is like a linear regression equation where the predictors depend on the parameters (p, d, q) of the ARIMA model. These three parameters account for the autoregressive structure (p), the trend removed by differencing (d), and the noise (q) in the data.
We can dive into this part more intensively in the Code Implementation section.
REFERENCE:
https://www.simplilearn.com/tutorials/python-tutorial/time-series-analysis-in-python#what_is_time_series_analysis
https://medium.com/@stallonejacob/time-series-forecast-a-basic-introduction-using-python-414fcb963000
https://www.analyticsvidhya.com/blog/2021/07/time-series-analysis-a-beginner-friendly-guide/#h2_2
https://www.machinelearningplus.com/time-series/time-series-analysis-python/
https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
https://machinelearningmastery.com/time-series-forecasting-with-prophet-in-python/
https://towardsdatascience.com/an-end-to-end-project-on-time-series-analysis-and-forecasting-with-python-4835e6bf050b
https://medium.com/coders-camp/10-machine-learning-projects-on-time-series-forecasting-ee0368420ccd
https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322
2. Loading Dataset
In [3]: import pandas as pd                 # these imports are assumed; the original import cell is not shown in this export
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv('../input/dataset-superstore-20152018/Dataset- Superstore (2015-2018).csv')
df
Out[3]: the full dataset, 9,994 order lines × 21 columns, including Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region, Category, Sub-Category, Product ID, Product Name, and Sales.
We will take a look at the 'Category' variable to see what kind of products the store is selling:
In [4]: df['Category'].value_counts()
There are several categories in the Superstore sales data; we will start with time series analysis and forecasting for the 'Office Supplies' sales:
In [5]: OS= df.loc[df['Category']=='Office Supplies']
OS.head(5)
Out[5]: the first 5 rows of the Office Supplies subset (5 rows × 21 columns), covering sub-categories such as Labels, Storage, Art, Binders, and Appliances.
We have four years of Office Supplies data:
3. Data Processing
In this process, we will remove irrelevant variables, handle missing data, and aggregate sales by date.
Our focus in this kernel is the sales of Office Supplies over time. Therefore, we will keep only two columns: Order Date and Sales.
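The processing cell itself is missing from this export; below is a minimal sketch of those steps (the variable name OS follows the earlier cells, the exact code is my assumption):

# keep only the two columns we need and parse the dates
OS = df.loc[df['Category'] == 'Office Supplies', ['Order Date', 'Sales']].copy()
OS['Order Date'] = pd.to_datetime(OS['Order Date'])
OS.isnull().sum()                  # check for missing data

# aggregate sales of orders placed on the same day, indexed by date
OS = OS.groupby('Order Date')[['Sales']].sum().sort_index()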
   Order Date    Sales
2  2016/06/12   14.620
4  2015/10/11   22.368
6  2014/06/09    7.280
8  2014/06/09   18.504
9  2014/06/09  114.900

After aggregating sales by date:

   Order Date    Sales
0  2014/01/03   16.448
1  2014/01/04  288.060
2  2014/01/05   19.536
3  2014/01/06  685.340
4  2014/01/07   10.430
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Time%20Series%20Forecastings.ipynb
Out[10]: the Sales column indexed by Order Date.
5. Data Visualization
In [11]: OS['Sales'].plot()
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.title('Total sale over years')
plt.show()
The plot above is too busy to interpret, so we resample the time series by month and use the average monthly values, as sketched below.
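The resampling cell is not shown in this export; a sketch consistent with how monthly_OS is used below ('MS' is month-start frequency):

monthly_OS = OS.resample('MS').mean()   # average daily sales within each month
monthly_OS['Sales'].plot(figsize=(12, 5))
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.title('Average monthly sales')
plt.show()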
Since all values are positive, you can show this on both sides of the Y axis to emphasize the
growth.
In [14]: x = monthly_OS.index
y1 = monthly_OS['Sales'].values

# mirror the series around y = 0 to emphasize growth
fig, ax = plt.subplots(1, 1, figsize=(16, 5), dpi=120)
plt.fill_between(x, y1=y1, y2=-y1, alpha=0.5, linewidth=2, color='seagreen')
plt.ylim(-800, 800)
plt.title('Sales (Two Side View)', fontsize=16)
plt.hlines(y=0, xmin=np.min(monthly_OS.index), xmax=np.max(monthly_OS.index), linewidth=.5)
plt.show()
We can visualize the trend and how it varies each year with a year-wise boxplot.
More generally, we can group the data at seasonal intervals and see how the values are distributed within a given year or month and how they compare over time, as in the sketch below.
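A sketch of such year-wise and month-wise boxplots (assuming seaborn is available; the helper column names are mine):

import seaborn as sns

box = monthly_OS.copy()
box['year'] = box.index.year
box['month'] = box.index.strftime('%b')

fig, axes = plt.subplots(1, 2, figsize=(16, 5), dpi=80)
sns.boxplot(x='year', y='Sales', data=box, ax=axes[0])   # year-wise distribution: trend
sns.boxplot(x='month', y='Sales', data=box, ax=axes[1])  # month-wise distribution: seasonality
axes[0].set_title('Year-wise Box Plot (Trend)')
axes[1].set_title('Month-wise Box Plot (Seasonality)')
plt.show()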
Recall that we can assume a series to be stationary if it has:
• constant mean
• constant variance
• an autocovariance that does not depend on time
There are two common ways to check this:
• Plotting Rolling Statistics: we can plot the moving average or moving variance and see whether it varies with time.
• Dickey-Fuller Test: the test results comprise a Test Statistic and Critical Values for different confidence levels. If the Test Statistic is less than the Critical Value, we can reject the null hypothesis and say that the series is stationary.
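The helper test_stationarity used in the rest of this kernel is not defined anywhere in this export; a sketch of the usual implementation (rolling statistics plus statsmodels' adfuller; the window of 12 is an assumption), whose test produces the p-value reported below:

from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries, window=12):
    # plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(timeseries.rolling(window).mean(), color='red', label='Rolling Mean')
    plt.plot(timeseries.rolling(window).std(), color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries.dropna(), autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value',
                                             '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

test_stationarity(monthly_OS['Sales'])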
P-value: 0.467366
Since the p-value is not less than .05, we fail to reject the null hypothesis.
In other words, the series is non-stationary: it has some time-dependent structure, and its statistical properties are not constant over time.
Some methods might work well in this case and others might not, but the idea is to get the hang of all the methods rather than focusing only on the problem at hand.
Let's get started!
a) Log Transform:
In [21]: do = pd.read_csv('../input/dataset-superstore-20152018/Dataset- Superstore (2015-2018).csv')
store = do.loc[do['Category'] == 'Office Supplies'].copy()   # .copy() avoids pandas' SettingWithCopyWarning
# drop everything except Order Date and Sales; the tail of this list was cut off
# in the export, so the names after 'Postal Code' follow the dataset's schema
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment',
        'Country', 'City', 'State', 'Postal Code', 'Region', 'Category', 'Sub-Category',
        'Product ID', 'Product Name', 'Quantity', 'Discount', 'Profit']
store.drop(cols, axis=1, inplace=True)
store
   Order Date    Sales
2  2016/06/12   14.620
4  2015/10/11   22.368
6  2014/06/09    7.280
8  2014/06/09   18.504
9  2014/06/09  114.900

After aggregating sales by date:

   Order Date    Sales
0  2014/01/03   16.448
1  2014/01/04  288.060
2  2014/01/05   19.536
3  2014/01/06  685.340
4  2014/01/07   10.430
In this case, we can see the plot does not show a straightforward trend, so a log transform alone is not enough to make the time series stationary.
b) Moving Average:
In this approach, we take the average of 'k' consecutive values, depending on the frequency of the time series. Here we take the average over the past year, i.e. the last 12 values, as sketched below.
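The cell computing moving_avg is not shown; a sketch (a 12-month rolling mean of the logged series):

moving_avg = ts_log.rolling(12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')   # red: 12-month moving average
plt.show()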
Note that since we are taking the average of the last 12 values, the rolling mean is not defined for the first 11 values. This can be observed as:
In [27]: ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(12)
Let's drop these NaN values and check the plots to test stationarity.
In [29]: ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)
The rolling values appear to vary slightly, but there is no specific trend.
Also, the test statistic is smaller than the 1% critical value, so we can say with 99% confidence that this is a stationary series.
However, a drawback in this particular approach is that the time-period has to be strictly defined.
c) Exponentially Weighted Moving Average:
To avoid a strictly defined window, we take a 'weighted moving average' where more recent values are given a higher weight. There are many techniques for assigning weights.
A popular one is the exponentially weighted moving average, where weights are assigned to all the previous values with a decay factor.
This can be implemented in Pandas as:
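A sketch (the variable names are mine; halflife=12 matches the note below):

expweighted_avg = ts_log.ewm(halflife=12).mean()
ts_log_ewma_diff = ts_log - expweighted_avg   # subtract the weighted trend estimate
test_stationarity(ts_log_ewma_diff)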
Note that here the parameter ‘halflife’ is used to define the amount of exponential decay. This is just an assumption here and would depend largely on the business
domain.
Other parameters like span and center of mass can also be used to define decay which are discussed in the link shared above.
Since the p-value is less than .05, we reject the null hypothesis: this time series is stationary.
d) Differencing:
One of the most common methods of dealing with both trend and seasonality is differencing.
In this technique, we take the difference of the observation at a particular instant with that at the previous instant.
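The differencing cell itself is missing; a sketch consistent with the name ts_log_diff used below:

ts_log_diff = ts_log - ts_log.shift()   # first-order difference of the logged series
plt.plot(ts_log_diff)
plt.show()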
In [33]: ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)
We can see that the rolling mean and standard deviation vary only slightly with time.
Also, the Dickey-Fuller test statistic is less than the 1% critical value, so the TS is stationary with 99% confidence.
e) Decomposing:
In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned.
In [34]: from pylab import rcParams
rcParams['figure.figsize'] = 18, 8
decomposition = sm.tsa.seasonal_decompose(ts_log, model='additive')
fig = decomposition.plot()
plt.show()
Here we can see that the trend and seasonality are separated out from the data, and we can model the residuals.
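The cell testing the residual component is not shown; a sketch using the decomposition above:

ts_log_decompose = decomposition.resid.dropna()   # the residual (noise) component
test_stationarity(ts_log_decompose)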
The Dickey-Fuller test statistic is significantly lower than the 1% critical value.
Number of AR (Auto-Regressive) terms (p): AR terms are just lags of the dependent variable.
Number of MA (Moving Average) terms (q): MA terms are lagged forecast errors in the prediction equation.
Number of Differences (d): the number of nonseasonal differences. In this case we took the first-order difference, so we can either pass the differenced variable and put d=0, or pass the original variable and put d=1; both will generate the same results.
auto_arima() uses a stepwise approach to search multiple combinations of the p, d, q parameters and chooses the best model, i.e. the one with the lowest AIC.
I will split the data into train and test sets and apply auto_arima to decide p, d, q. Then I will get the predicted values for the test set, plot the train, test, and predicted data, and evaluate the forecast accuracy, as sketched below.
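A sketch of the split and a plain ARIMA fit (the 12-month holdout and the (1, 1, 1) order are placeholder assumptions; the original cells are not shown):

train = monthly_OS['Sales'][:-12]   # hold out the final 12 months
test = monthly_OS['Sales'][-12:]

from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(train, order=(1, 1, 1))   # placeholder order; use the auto_arima pick instead
pred = model.fit().predict(start=len(train), end=len(train) + len(test) - 1)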
If the forecast accuracy doesn't support the ARIMA model, we should choose a different method to forecast the data. One suggestion is the seasonal ARIMA model, called SARIMAX.
!pip install pmdarima
Successfully installed numpy-1.21.6 pmdarima-2.0.3 statsmodels-0.13.5
8.2 Hyperparameters of ARIMA model p,d,q using auto_arima
In [38]: from pmdarima import auto_arima   # import assumed; the original import cell is not shown
auto_arima(train, test='adf', seasonal=True, trace=True, error_action='ignore', suppress_warnings=True)
In [39]: # y is the series being modelled (e.g. the monthly sales series)
auto_arima(y, test='adf',          # use adftest to find optimal 'd'
           max_p=3, max_q=3,       # maximum p and q (caps assumed)
           m=12,                   # frequency of series (monthly)
           d=None,                 # let model determine 'd'
           seasonal=True,          # fit seasonal terms
           trace=True,
           error_action='ignore',
           suppress_warnings=True,
           stepwise=True)
Out[40]: SARIMAX Results (model summary table; sample ends 04-01-2017)
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [42]: test
Mean Absolute Percentage Error (MAPE): MAPE is the average of the absolute differences between the forecast values and the true values, each divided by the true value, expressed as a percentage.
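As a small sketch of the metric (test and pred are the held-out series and the ARIMA forecast from above):

def mape(actual, forecast):
    # mean absolute percentage error, as a fraction (multiply by 100 for percent)
    return (abs(forecast - actual) / abs(actual)).mean()

print('MAPE: %f' % mape(test, pred))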
MAPE: 0.363205
The lower the MAPE, the better the model. Our model has a considerably high MAPE.
Let's plot the predicted values to see what might be the reason for the low accuracy.
8.5. Plot prediction for test value:
In [44]: train.plot(legend=True, label='Train', figsize=(10,6))
test.plot(legend=True, label= 'Test')
pred.plot(legend=True, label='ARIMA prediction')
The solution to this problem is the seasonal ARIMA model, available as the SARIMAX function!
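A sketch of such a fit with statsmodels (the orders are placeholders, not the ones actually selected; pred1 is my name for the seasonal forecast scored below):

model = sm.tsa.statespace.SARIMAX(train,
                                  order=(1, 1, 1),               # (p, d, q): placeholder values
                                  seasonal_order=(1, 1, 1, 12))  # (P, D, Q, s): placeholder, monthly season
results = model.fit()
pred1 = results.predict(start=len(train), end=len(train) + len(test) - 1)
print('MAPE1: %f' % mape(test, pred1))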
SARIMAX Results (model summary table; sample ends 04-01-2017)
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.67e+26. Standard errors may be unstable.
MAPE1: 0.238381
The model has improved! We can continue using SARIMAX to forecast the Office Supplies sales.
THE END
Thank you for spending time checking my kernel.
Please leave a comment and like this kernel if you think it's helpful.
Thank you!