Week 10 Intro Forecasting
Import Modules
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read data
Let's use the US retail employment example again.
In [4]: us_retail_df = pd.read_csv('us_retail_employment.csv')
In [5]: us_retail_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 357 non-null int64
1 Month 357 non-null int64
2 Day 357 non-null int64
3 Employed 357 non-null float64
dtypes: float64(1), int64(3)
memory usage: 11.3 KB
In [6]: us_retail_df.head()
Prepare data
We need to create the datetime object column and then separate the Employed
column into its own Series.
In [7]: us_retail_df['date_dt'] = pd.to_datetime( us_retail_df.loc[:, ['Year', 'Month', 'Day']] )
In [8]: us_retail_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 357 non-null int64
1 Month 357 non-null int64
2 Day 357 non-null int64
3 Employed 357 non-null float64
4 date_dt 357 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 14.1 KB
In [9]: us_retail_df.head()
plt.show()
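The cells that build retail_series and ready_series are only partially shown in this export. A minimal sketch of the likely steps (they are spread across the next few cells): pull Employed into its own Series, re-index it by date_dt, and assign a monthly-start frequency.

# Sketch of the likely steps (the exact cells are not all shown).
retail_series = us_retail_df.Employed.copy()

# Re-index the Series with the datetime column (this is what Out[15] displays).
retail_series.index = us_retail_df.date_dt

# Assign an explicit monthly-start ('MS') frequency (this is what Out[18] displays).
ready_series = retail_series.asfreq('MS')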
In [12]: retail_series
Out[12]: 0 13255.8
1 12966.3
2 12938.2
3 13012.3
4 13108.3
...
352 15691.6
353 15775.5
354 15785.9
355 15749.5
356 15611.3
Name: Employed, Length: 357, dtype: float64
In [13]: retail_series.index
In [15]: retail_series
Out[15]: date_dt
1990-01-01 13255.8
1990-02-01 12966.3
1990-03-01 12938.2
1990-04-01 13012.3
1990-05-01 13108.3
...
2019-05-01 15691.6
2019-06-01 15775.5
2019-07-01 15785.9
2019-08-01 15749.5
2019-09-01 15611.3
Name: Employed, Length: 357, dtype: float64
In [16]: retail_series.index
In [18]: ready_series
Out[18]: date_dt
1990-01-01 13255.8
1990-02-01 12966.3
1990-03-01 12938.2
1990-04-01 13012.3
1990-05-01 13108.3
...
2019-05-01 15691.6
2019-06-01 15775.5
2019-07-01 15785.9
2019-08-01 15749.5
2019-09-01 15611.3
Freq: MS, Name: Employed, Length: 357, dtype: float64
plt.show()
Split data
Let's split the data into dedicated training and test sets. This way we can get some idea
of how well the forecasting methods are working.
However, the goal of time series forecasting is to forecast the future. Therefore, we
should NEVER randomly split time series data. Instead, we should force the hold-out
test set to always be in the future!!!!
Let's first check the number of unique years in the data.
In [20]: us_retail_df.Year.value_counts().sort_index()
Out[20]: Year
1990 12
1991 12
1992 12
1993 12
1994 12
1995 12
1996 12
1997 12
1998 12
1999 12
2000 12
2001 12
2002 12
2003 12
2004 12
2005 12
2006 12
2007 12
2008 12
2009 12
2010 12
2011 12
2012 12
2013 12
2014 12
2015 12
2016 12
2017 12
2018 12
2019 9
Name: count, dtype: int64
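The cell that produced Out[21] below is not shown. A minimal sketch, assuming the training set keeps everything through December 2016 and the hold-out test set starts in January 2017 (test_series is a hypothetical name for the hold-out piece):

# TRAINING set: all months through the end of 2016 (27 years x 12 months = 324 values).
train_series = ready_series[ ready_series.index < '2017-01-01' ].copy()

# HOLD-OUT test set: everything from January 2017 forward.
test_series = ready_series[ ready_series.index >= '2017-01-01' ].copy()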
Out[21]: date_dt
1990-01-01 13255.8
1990-02-01 12966.3
1990-03-01 12938.2
1990-04-01 13012.3
1990-05-01 13108.3
...
2016-08-01 15864.6
2016-09-01 15750.3
2016-10-01 15899.5
2016-11-01 16260.2
2016-12-01 16394.3
Freq: MS, Name: Employed, Length: 324, dtype: float64
Visualize the TRAINING set and the HOLD-OUT future test set.
In [24]: fig, ax = plt.subplots(figsize=(15, 6))
ax.legend()
plt.show()
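The plotting statements inside the In[24] cell are cut off in this export. A minimal sketch of one way to draw both pieces on shared axes (the label text, and the test_series name from the split sketch above, are assumptions):

fig, ax = plt.subplots(figsize=(15, 6))

# Sketch: overlay the training series and the hold-out test series.
train_series.plot(ax=ax, label='training')
test_series.plot(ax=ax, label='hold-out test')

ax.set_ylabel('Employed')
ax.legend()
plt.show()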
If we remove the "ALL" series...then there will be a gap between the training and test
series.
In [25]: fig, ax = plt.subplots(figsize=(15, 6))
ax.legend()
plt.show()
Simple Forecasting
The two simplest forecasting methods:
Average all historical measurements - all future forecasts equal the AVERAGE
Use the most recent (last) observation as the forecast -> Naive method
The average or MEAN method is easy to calculate...
In [26]: train_series.mean()
Out[26]: 14623.75277777778
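The cell producing Out[27] is not shown; presumably it grabs the LAST training observation, which is what the Naive method will use:

# Last (most recent) observation in the training set.
train_series.iloc[-1]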
Out[27]: 16394.3
The Naive method literally uses the LAST observation as the forecast.
In [28]: train_series
Out[28]: date_dt
1990-01-01 13255.8
1990-02-01 12966.3
1990-03-01 12938.2
1990-04-01 13012.3
1990-05-01 13108.3
...
2016-08-01 15864.6
2016-09-01 15750.3
2016-10-01 15899.5
2016-11-01 16260.2
2016-12-01 16394.3
Freq: MS, Name: Employed, Length: 324, dtype: float64
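The cells that create the my_forecasts DataFrame (In[29] through In[31]) are not shown. A minimal sketch, assuming it starts from the observed hold-out values (test_series is the hypothetical name from the split sketch above):

# Start the forecast comparison DataFrame from the observed hold-out values.
my_forecasts = test_series.to_frame(name='observed')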
In [32]: my_forecasts.head()
Out[32]: observed
date_dt
2017-01-01 15854.4
2017-02-01 15627.9
2017-03-01 15635.0
2017-04-01 15686.6
2017-05-01 15759.5
Forecast using the AVERAGE or MEAN method.
In [35]: my_forecasts['AVERAGE'] = train_series.mean()
In [36]: my_forecasts.head()
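The cell that adds the Naive column (between In[36] and In[38]) is not shown; presumably:

# Naive forecast: every future month is forecast with the last training observation.
my_forecasts['Naive'] = train_series.iloc[-1]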
In [38]: my_forecasts.head()
plt.show()
Let's use Pandas plotting and matplotlib plotting to show the training set, the forecasts,
and the test set in a single plot. Neither approach captures the repeating patterns
associated with the hold-out test set. However, the Naive method is at least "in the right
ballpark" compared to the AVERAGE method in this example.
In [40]: fig, ax = plt.subplots( figsize=(15, 6) )
ax.legend()
plt.show()
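The body of the In[40] cell is cut off in this export. A minimal sketch of one way to overlay everything on a single matplotlib Axes (the label text is an assumption):

fig, ax = plt.subplots(figsize=(15, 6))

# Sketch: training data, hold-out observations, and the two simple forecasts.
train_series.plot(ax=ax, label='training')
my_forecasts.observed.plot(ax=ax, label='hold-out test')
my_forecasts.AVERAGE.plot(ax=ax, style='--', label='AVERAGE forecast')
my_forecasts.Naive.plot(ax=ax, style='--', label='Naive forecast')

ax.legend()
plt.show()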
We know from our exploration...that there is a SEASONAL pattern present in this data
set!!!!
We can modify our simple forecasts to account for the seasonality by using: SEASONAL
NAIVE forecasting!!!!
Seasonal Naive corresponds to using the last or most recent season as the forecast for
all future seasons.
Future forecasts for May will correspond to the most recently observed (last) value for
May, while future forecasts for October will be the last October value. Therefore, not all
seasonal (month, in this case) forecasts are the same. The seasonal (monthly) variation
is preserved based on the last year in the training data.
The last year in the training data is 2016:
In [42]: us_retail_df.loc[ us_retail_df.Year == 2016 ]
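The cells that assemble the Seasonal Naive forecast are only partially shown. A minimal sketch of the idea, using hypothetical helper names, is below; the following cells suggest the notebook itself does this with .dt accessors and a merge on Month.

# Last training year (2016): one value per month.
last_year = train_series[ train_series.index >= '2016-01-01' ]
month_to_2016_value = dict(zip(last_year.index.month, last_year.values))

# Seasonal Naive: forecast each hold-out month with the 2016 value for that same month.
my_forecasts['Seasonal_Naive'] = [ month_to_2016_value[m] for m in my_forecasts.index.month ]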
In [45]: my_forecasts_b.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_dt 33 non-null datetime64[ns]
1 observed 33 non-null float64
2 AVERAGE 33 non-null float64
3 Naive 33 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.2 KB
Let's extract the Date Time components of Year and Month from the date_dt column.
In [47]: my_forecasts_b['Year'] = my_forecasts_b.date_dt.dt.year
In [48]: my_forecasts_b.head()
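The companion cell that extracts the month (In[49]) is not shown; presumably:

my_forecasts_b['Month'] = my_forecasts_b.date_dt.dt.month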
In [50]: my_forecasts_b.head()
In [54]: my_forecasts_c.head()
In [55]: my_forecasts_c.set_index('date_dt').head()
In [59]: my_forecasts_d.head()
ax.legend()
plt.show()
Combine simple forecast with Time Series Decomposition
This approach uses a time series decomposition method to enable a simple forecaster,
which must then be re-seasonalized. Let's use the STL decomposition for this example.
In [61]: from statsmodels.tsa.seasonal import STL
plt.show()
We will use the Naive method...but apply the Naive logic to the seasonally adjusted
data. Thus, we will use the last or most recent seasonally adjusted value.
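The cells that fit the STL decomposition and seasonally adjust the training data are not all shown. A minimal sketch, assuming the names that appear in the later cells (train_stl_fit, df_stl_train):

# Fit STL to the monthly training series.
train_stl_fit = STL(train_series, period=12).fit()

# Seasonally adjusted training data: observed minus the estimated seasonal component.
df_stl_train = pd.DataFrame({'observed': train_series,
                             'season': train_stl_fit.seasonal,
                             'trend': train_stl_fit.trend})
df_stl_train['seasonal_adjust'] = df_stl_train.observed - df_stl_train.season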
In [66]: df_stl_train.seasonal_adjust.iloc[-1]
Out[66]: 15913.226856066618
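The cell producing Out[67] is not shown; presumably it selects the last training year of the estimated seasonal component:

train_stl_fit.seasonal[ train_stl_fit.seasonal.index >= '2016-01-01' ]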
Out[67]: date_dt
2016-01-01 -104.057251
2016-02-01 -279.689973
2016-03-01 -216.257000
2016-04-01 -153.775629
2016-05-01 -64.140218
2016-06-01 25.390812
2016-07-01 34.305993
2016-08-01 -1.915552
2016-09-01 -134.848905
2016-10-01 15.664630
2016-11-01 364.508477
2016-12-01 481.073144
Freq: MS, Name: season, dtype: float64
ADD the seasonally adjusted Naive value to the most recent year's Seasonal
component!!!
In [68]: train_stl_fit.seasonal[ train_stl_fit.seasonal.index >= '2016-01-01' ] + df_stl_train.seasonal_adjust.iloc[-1]
Out[68]: date_dt
2016-01-01 15809.169605
2016-02-01 15633.536883
2016-03-01 15696.969856
2016-04-01 15759.451227
2016-05-01 15849.086638
2016-06-01 15938.617668
2016-07-01 15947.532849
2016-08-01 15911.311304
2016-09-01 15778.377951
2016-10-01 15928.891486
2016-11-01 16277.735333
2016-12-01 16394.300000
Freq: MS, Name: season, dtype: float64
In [71]: df_reseason_naive_forecast
In [73]: df_reseason_naive_forecast
Merge the above forecasts with the larger hold-out test forecast DataFrame.
In [74]: df_reseason_naive_forecast.loc[:, ['Month', 'season']].rename(columns={'seas
how='left').\
copy()
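The In[74] cell above is cut off in this export. A minimal sketch of the idea, using hypothetical names for the renamed column and the result (the later cells refer to my_forecasts_e): attach the re-seasonalized Naive forecast to the hold-out forecast DataFrame by matching on Month.

# Hypothetical reconstruction of the truncated merge cell.
my_forecasts_e = my_forecasts_d.merge(
    df_reseason_naive_forecast.loc[:, ['Month', 'season']]
        .rename(columns={'season': 'STL_Reseason_Naive'}),   # hypothetical column name
    on='Month',
    how='left',
).copy()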
In [77]: my_forecasts_e.head()
In [79]: my_forecasts_f.head()
ax.legend()
plt.show()
Model selection
You have seen 4 different forecasting methods in this report. Three of the methods
involve NO parameters or model "fitting": summary statistics are used as the forecast!
These three methods are the foundation for all other forecasting methods. The fourth
method is your first "advanced" approach because it combines the decomposition
approach from visualizing and exploring the time series with a simple forecasting
procedure, which enables capturing more advanced patterns. The simple forecasting
methods were executed using Pandas attributes, methods, and functions, but there are
multiple ways to execute these simple strategies.
The 4 methods were visually compared on the hold-out test set. The AVERAGE or MEAN
method clearly does not capture the hold-out test set behavior. However, it is visually
difficult to tell which method is better between the Seasonal Naive and the STL Reseason
Naive approach. Let's quantify the performance on the hold-out test set by calculating a
performance metric appropriate for regression problems. The cells below calculate
the RMSE for each of the 4 forecasting methods. As shown by the values displayed to
the screen, the Seasonal Naive method has the lowest RMSE on the hold-out test set.
Thus, the Seasonal Naive method outperforms the STL Reseasoning approach!
In [80]: np.sqrt( ( ( my_forecasts_f.observed - my_forecasts_f.Seasonal_Naive )**2 ).mean() )
Out[80]: 81.74907839347037
Out[81]: 1190.6051791210214
Out[82]: 632.3527036537291
Out[83]: 100.99782463931113
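The In[] cells for Out[81] through Out[83] are not shown; presumably they repeat the same RMSE pattern for the other forecast columns. A sketch (column names other than Seasonal_Naive are assumptions based on the earlier cells):

# RMSE on the hold-out test set for each forecasting method (column names assumed).
for col in ['AVERAGE', 'Naive', 'Seasonal_Naive', 'STL_Reseason_Naive']:
    rmse = np.sqrt( ( ( my_forecasts_f.observed - my_forecasts_f[col] )**2 ).mean() )
    print(col, rmse)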
Conclusion
Three of the four methods involve zero parameters. They are summary statistics which
are VERY easy to interpret and describe. You should always include these simple
methods as benchmarks to compare against more complex time series forecasting
methods.