Time Series Forecasting: Predicting Monthly Beer Production
The aim of this project is to apply forecasting algorithms to find the most accurate prediction of monthly
Australian beer production for the years 1996-2000. Although the dataset provided covers
decades (1956-1995), it contains only two columns: the time frame and the beer production. Intuition
tells us that several factors can influence beer production in a particular country,
including temperature, price, and advertising, as well as national and international economic and
political factors such as a shortage of brewing ingredients, inflation, or a government's intention
to reduce per capita alcohol consumption.
It is reasonable to assume that taking all of these factors into account would yield models with
considerably higher predictive power.
Approach:
Time Series Forecasting
Time series forecasting is a specialized area of predictive analytics that focuses on predicting
future data points in a time-ordered sequence. In time series data, observations are recorded at
regular intervals over time, and the goal is to use historical data to make informed predictions
about future values in the sequence. This approach is critical in various fields, including finance,
economics, meteorology, and more, where understanding and predicting trends and patterns
over time are essential for decision-making.
Time series data often exhibits distinct components, such as trends, seasonality, and random
noise. Trends represent long-term movements in the data, while seasonality involves regular,
repeating patterns linked to specific time intervals (e.g., daily, weekly, or annually). Forecasting
methods take these components into account to provide accurate predictions.
The methods for time series forecasting vary in complexity, from simple techniques like moving
averages and exponential smoothing to more advanced models like ARIMA and machine learning
algorithms. These methods use historical data to make predictions about future values. The
accuracy of these forecasts can be assessed using various metrics, including Mean Absolute
Error (MAE) and Mean Squared Error (MSE).
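As an illustration of these two metrics (not part of the original notebook; the actual and forecast arrays below are hypothetical placeholders), a forecast can be scored against held-out observations in a few lines of NumPy:

import numpy as np

# Hypothetical actual and forecast values for a four-month hold-out period
actual = np.array([140.0, 152.0, 169.0, 181.0])
forecast = np.array([138.5, 155.0, 165.0, 184.0])

errors = actual - forecast
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
print(f'MAE: {mae:.2f}, MSE: {mse:.2f}')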
Time series forecasting is crucial for making decisions based on historical data trends and
patterns. It plays a pivotal role in applications such as stock price predictions, GDP forecasts,
weather predictions, inventory management, and energy production. As data collection and
analysis techniques advance, including the use of machine learning and deep learning, the
accuracy of time series forecasting continues to improve, enabling better-informed decisions
across various industries.
     Month  Monthly beer production
0  1956-01                     93.2
1  1956-02                     96.0
2  1956-03                     95.2
3  1956-04                     77.1
4  1956-05                     70.9
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 476 entries, 0 to 475
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month 476 non-null datetime64[ns]
1 Monthly beer production 476 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 7.6 KB
Out[7]: 476
In [8]: # To apply a time series model, the date stamps have to be the index of the data frame:
df.set_index('Month', inplace=True)
In [9]: df.head()
            Monthly beer production
Month
1956-01-01                     93.2
1956-02-01                     96.0
1956-03-01                     95.2
1956-04-01                     77.1
1956-05-01                     70.9
In [9]: # Visualising how the monthly beer production has varied over the years:
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = 15, 12
df.plot()
plt.show()
There seems to be a clear upward trend and strong seasonality, with production peaking at the end of each year.
Seasonal Decomposition
Seasonal decomposition, as implemented in methods like seasonal_decompose in time series analysis, is a process used to
decompose a time series into its fundamental components: trend, seasonality, and residual (or noise). These components provide
valuable insights into the underlying patterns and variations within the data. The decomposition process helps analysts better
understand and model time series data, which is often critical for forecasting and decision-making.
The key difference between additive and multiplicative seasonal decomposition lies in how the seasonality component is modeled. In
an additive decomposition, seasonality is treated as a fixed, constant pattern, where the seasonal fluctuations are added to the level of
the time series. In contrast, in a multiplicative decomposition, seasonality is treated as a proportional, relative pattern, meaning the
seasonal fluctuations are multiplied by the level of the time series. Additive decomposition is suitable when the magnitude of seasonal
fluctuations remains relatively constant over time, while the multiplicative approach is more appropriate when the magnitude of
seasonality changes with the level of the data. The choice between additive and multiplicative decomposition depends on the specific
characteristics of the time series data being analyzed.
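In plain notation (these are the standard decomposition formulas, not taken from the notebook), the two models express an observation y_t in terms of the trend T_t, the seasonal component S_t, and the residual R_t as:

Additive: y_t = T_t + S_t + R_t
Multiplicative: y_t = T_t * S_t * R_t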
In [10]: # Decomposing the time series data by using additive method:
from statsmodels.tsa.seasonal import seasonal_decompose
decompose_additive = seasonal_decompose(df['Monthly beer production'], model='additive',period=12)
decompose_additive.plot()
plt.show()
In [12]: # Decomposing the time series data by using multiplicative method:
from statsmodels.tsa.seasonal import seasonal_decompose
decompose_multiplicative = seasonal_decompose(df['Monthly beer production'], model='multiplicative', period=12)
decompose_multiplicative.plot()
plt.show()
This confirms the presence of both trend and seasonality in the data.
Next, the Durbin-Watson test is used to detect autocorrelation in time series data. Autocorrelation is the correlation between a time
series and a lagged version of itself. In time series forecasting, it is important to identify and address autocorrelation because it can
violate the independence assumption of many forecasting models. The Durbin-Watson statistic ranges from 0 to 4: values close to 2
suggest no autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative
autocorrelation. Detecting and addressing autocorrelation is essential to building accurate time series forecasting models.
In [11]: # Checking the autocorrelation of the data using the Durbin-Watson test
import statsmodels.api as sm
sm.stats.durbin_watson(df['Monthly beer production'])
Out[11]: 0.019486494992529867
A statistic this close to 0 indicates strong positive autocorrelation in the raw series.
The Augmented Dickey-Fuller (ADF) test is used to determine if a time series is stationary. In time series forecasting, stationarity is a
crucial assumption because many forecasting methods work best with stationary data. The ADF test helps assess whether
differencing the data (i.e., subtracting consecutive observations) is necessary to make it stationary. If the test suggests non-
stationarity, differencing can be applied to make the data suitable for forecasting models.
In [12]: # The time series data needs to be stationary before building a Time Series model.
# This will be tested by using the Augmented Dickey-Fuller Test.
from statsmodels.tsa.stattools import adfuller
def adf_check(timeseries):
    result = adfuller(timeseries)
    print('Augmented Dickey-Fuller Test to check whether the data is stationary:')
    labels = ['ADF Test Statistic', 'P Value', 'No. of lags', 'No. of observations']
    for value, label in zip(result, labels):
        print(f'{label}: {value}')
    if result[1] <= 0.05:
        print('Strong evidence against the null hypothesis; the data is stationary.')
    else:
        print('Weak evidence against the null hypothesis; the data is non-stationary.')
In [14]: # Now to make the time series stationary, performing first-order differencing in order to remove the trend from the data:
df['First_order_diff'] = df['Monthly beer production'] - df['Monthly beer production'].shift(1)
In [15]: df.head()
            Monthly beer production  First_order_diff
Month
1956-01-01                     93.2               NaN
1956-02-01                     96.0               2.8
1956-03-01                     95.2              -0.8
1956-04-01                     77.1             -18.1
1956-05-01                     70.9              -6.2
In [17]: # The attempt was successful and the data is now free of trend; next, removing seasonality from the data.
# As the pattern repeats and peaks at the end of every year, applying a lag of 12 months:
df['First_order_seasonal_diff'] = df['Monthly beer production'] - df['Monthly beer production'].shift(12)
In [18]: # Now after removing the seasonality from the data checking if it is stationary:
adf_check(df['First_order_seasonal_diff'].dropna())
Now, in order to fit a SARIMAX model, we need to know the optimum values of p, d, q (p: order of the autoregressive part, d: order of
differencing to be applied, q: order of the moving-average part) and P, D, Q (P: order of the seasonal autoregressive part, D: order of
seasonal differencing to be applied, Q: order of the seasonal moving-average part).
Auto-correlation (ACF) and partial auto-correlation (PACF) are two fundamental concepts in time series forecasting and analysis.
1. Auto-correlation (ACF): Auto-correlation, often denoted as ACF, is a measure of the correlation between a time series and a
lagged version of itself. In other words, it quantifies how each data point in a time series is related to its previous values at various
lags. ACF is a fundamental tool for identifying the presence of seasonality and trend patterns in a time series. It helps in
understanding how past observations influence the current observation.
ACF is calculated for various lags, and the resulting plot, called the ACF plot or correlogram, shows the correlation coefficients at
different lags. If there is a significant spike in the ACF at a specific lag, it suggests a relationship between the current value and the
value at that lag.
2. Partial Auto-correlation (PACF): Partial auto-correlation, often denoted as PACF, is a measure of the correlation between a data
point and a lagged version of itself, after accounting for the contributions of intermediate lags. In other words, PACF measures the
direct relationship between a data point and a lag, removing the influence of the shorter lags in between. PACF helps in identifying
the order of an autoregressive (AR) model, which is a common component in time series forecasting.
PACF is used to distinguish between genuine relationships with specific lags and indirect relationships caused by shorter lags. By
examining the PACF plot, you can identify the number of lags to include in an AR model. Significant spikes in the PACF plot at specific
lags indicate the order of the AR model.
In summary, ACF and PACF are tools for understanding the temporal relationships within a time series. ACF helps identify overall
patterns, while PACF helps identify direct relationships between a data point and specific lags, aiding in model selection and
forecasting in time series analysis.
In [22]: # Finding the value of p (the PACF - partial autocorrelation plot - is used to find the optimum value of p)
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(df['First_order_diff'].dropna())
plt.show()
On the PACF plot, the significant spikes after lag 0 are counted until the bars settle within the approximate -0.2 to +0.2 band; in this
case the value of p is 2.
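The value of q is read from the corresponding ACF plot. The cell producing it is not reproduced in the export above; a minimal sketch, assuming the same differenced series, would be:

from statsmodels.graphics.tsaplots import plot_acf

# ACF plot of the first-order differenced series, used to choose q
plot_acf(df['First_order_diff'].dropna())
plt.show()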
Reading the ACF plot the same way - counting the significant spikes after lag 0 until the bars settle within roughly -0.2 to +0.2 - gives a
value of q of 4.
A SARIMAX (Seasonal Autoregressive Integrated Moving Average with Exogenous Variables) model is significant for time series
forecasting when the data exhibits seasonality and may be influenced by external factors. It combines ARIMA (AutoRegressive
Integrated Moving Average) modeling with seasonal components and the inclusion of exogenous variables. The key parameters
include the order of differencing (d), autoregressive order (p), moving average order (q), seasonal differencing (D), seasonal
autoregressive order (P), seasonal moving average order (Q), and the periodicity (s) of the seasonality.
SARIMAX is best used for forecasting time series data with clear seasonal patterns and when external factors, such as economic
indicators or weather data, have an impact on the series. It is a versatile model that can handle complex time series data, making it
valuable in various domains, including finance, economics, and demand forecasting.
In [36]: # Now that the initial pdq values are found, fitting a time series model (SARIMAX):
model = sm.tsa.statespace.SARIMAX(df['Monthly beer production'], order=(0,1,5), seasonal_order=(3,1,3,12))
result = model.fit()
print(result.summary())
SARIMAX Results
==================================================================================================
Dep. Variable: Monthly beer production No. Observations: 476
Model: SARIMAX(0, 1, 5)x(3, 1, [1, 2, 3], 12) Log Likelihood -1682.287
Date: Sat, 14 Oct 2023 AIC 3388.574
Time: 18:30:43 BIC 3438.227
Sample: 01-01-1956 HQIC 3408.121
- 08-01-1995
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ma.L1 -1.0103 0.039 -25.647 0.000 -1.087 -0.933
ma.L2 -0.0522 0.058 -0.899 0.369 -0.166 0.062
ma.L3 0.1156 0.061 1.907 0.056 -0.003 0.234
ma.L4 -0.0449 0.054 -0.831 0.406 -0.151 0.061
ma.L5 0.1500 0.041 3.690 0.000 0.070 0.230
ar.S.L12 0.7545 0.072 10.535 0.000 0.614 0.895
ar.S.L24 -0.8869 0.064 -13.871 0.000 -1.012 -0.762
ar.S.L36 -0.1081 0.056 -1.914 0.056 -0.219 0.003
ma.S.L12 -1.5920 0.080 -19.814 0.000 -1.750 -1.435
ma.S.L24 1.5397 0.130 11.879 0.000 1.286 1.794
ma.S.L36 -0.7237 0.082 -8.830 0.000 -0.884 -0.563
sigma2 76.5508 4.986 15.353 0.000 66.779 86.323
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 79.70
Prob(Q): 0.97 Prob(JB): 0.00
Heteroskedasticity (H): 3.53 Skew: -0.39
Prob(H) (two-sided): 0.00 Kurtosis: 4.88
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
AIC (Akaike Information Criterion) in the SARIMAX results summary is a measure of the model's goodness of fit while penalizing for
model complexity. Lower AIC values are better. The optimal values of AIC are the ones that correspond to the SARIMAX model with
the best trade-off between goodness of fit and simplicity. AIC is affected by the model's parameters, including the orders of differencing
(d, D), autoregressive (p, P), and moving average (q, Q) components, as well as the choice of seasonality. The model with the lowest
AIC is typically preferred for forecasting.
BIC (Bayesian Information Criterion) in the SARIMAX results summary serves the same purpose but applies a heavier penalty for model
complexity than AIC. Again, lower values are better, and it is affected by the same parameters: the orders of differencing (d, D),
autoregressive (p, P), and moving average (q, Q) components, as well as the choice of seasonality. Like AIC, the model with the
lowest BIC is generally preferred for forecasting.
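As an illustration of this model-selection logic (not from the original notebook; the candidate orders below are arbitrary examples), several (p, d, q)(P, D, Q, s) combinations can be compared on AIC with a small grid search:

import itertools
import statsmodels.api as sm

# Candidate non-seasonal and seasonal orders (arbitrary examples)
pdq_candidates = [(0, 1, 4), (0, 1, 5), (2, 1, 4)]
seasonal_candidates = [(1, 1, 1, 12), (3, 1, 3, 12)]

best = None
for order, seasonal_order in itertools.product(pdq_candidates, seasonal_candidates):
    model = sm.tsa.statespace.SARIMAX(df['Monthly beer production'],
                                      order=order, seasonal_order=seasonal_order)
    res = model.fit(disp=False)
    print(order, seasonal_order, 'AIC:', round(res.aic, 1))
    if best is None or res.aic < best[0]:
        best = (res.aic, order, seasonal_order)

print('Lowest AIC:', best)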
This suggests that the model has captured the pattern well and can now be used to forecast future values in time.
In [38]: # Now predicting future values in time beyond the last observation (1995-08)
from pandas.tseries.offsets import DateOffset
In [39]: # Predicting the monthly production of beer in megaliters for the next 64 months, i.e. up to 2000-12
future_dates = [df.index[-1] + DateOffset(months = x) for x in range(1, 65)]
In [40]: future_dates
The forecast values generated for these future dates (the Forecast_1 column) are:
Forecast_1
1995-09-01 129.201730
1995-10-01 164.717849
1995-11-01 190.498579
1995-12-01 180.611712
1996-01-01 150.036754
1996-02-01 138.649104
1996-03-01 148.419637
1996-04-01 136.291654
1996-05-01 145.038382
1996-06-01 119.225425
1996-07-01 133.319343
1996-08-01 138.687390
1996-09-01 130.435667
1996-10-01 172.889229
1996-11-01 181.990136
1996-12-01 181.786655
1997-01-01 152.244477
1997-02-01 136.617234
1997-03-01 146.481100
1997-04-01 140.697603
1997-05-01 138.117874
1997-06-01 119.088996
1997-07-01 135.822438
1997-08-01 134.390670
1997-09-01 133.560464
1997-10-01 169.509651
1997-11-01 171.332673
1997-12-01 183.894687
1998-01-01 148.647043
1998-02-01 133.846982
1998-03-01 150.079997
1998-04-01 141.105383
1998-05-01 128.636869
1998-06-01 122.169779
1998-07-01 133.270201
1998-08-01 133.876571
1998-09-01 136.885463
1998-10-01 160.129298
1998-11-01 169.495371
1998-12-01 184.650552
1999-01-01 141.350342
1999-02-01 131.949531
1999-03-01 153.578572
1999-04-01 135.177633
1999-05-01 126.943275
1999-06-01 124.457053
1999-07-01 126.253276
1999-08-01 137.524013
1999-09-01 135.166138
1999-10-01 153.843010
1999-11-01 177.158305
1999-12-01 181.901088
2000-01-01 137.474124
2000-02-01 131.871581
2000-03-01 151.912753
2000-04-01 128.544292
2000-05-01 133.499525
2000-06-01 122.142023
2000-07-01 121.629207
2000-08-01 139.873329
2000-09-01 129.259091
2000-10-01 156.462143
2000-11-01 184.398451
2000-12-01 177.605366
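The cell that actually builds and plots this Forecast_1 column is not reproduced above. A minimal sketch of how such a forecast is typically obtained, assuming the fitted result object and the future_dates list from the cells above, would be:

import pandas as pd

# Empty rows for the 64 future months, appended to the original frame
# (an assumption about the missing cell, not a reproduction of it)
future_df = pd.DataFrame(index=future_dates, columns=df.columns)
forecast_df = pd.concat([df, future_df])

# Out-of-sample predictions from the fitted SARIMAX result
forecast_df['Forecast_1'] = result.predict(start=len(df), end=len(forecast_df) - 1)

# Plot the historical series against the forecast
forecast_df[['Monthly beer production', 'Forecast_1']].plot()
plt.show()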