Module - 3 Time Series Analysis
Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow’s weather report to future years of climate change. Examples
of time series analysis in action include:
● Weather data
● Rainfall measurements
● Temperature readings
● Heart rate monitoring (EKG)
● Brain monitoring (EEG)
● Quarterly sales
● Stock prices
● Automated stock trading
● Industry forecasts
● Interest rates
Time Series Analysis Types
Because time series analysis includes many categories and variations of data, analysts sometimes
must build complex models. However, no model can account for all variances, and no single
model generalizes to every sample. Models that are too complex, or that try to do too many
things, can lead to a lack of fit. Models that underfit or overfit fail to distinguish between random
error and true relationships, leaving the analysis skewed and the forecasts incorrect.
Data classification
Further, time series data can be classified into two main categories:
● Stock time series data means measuring attributes at a certain point in time, like a static
snapshot of the information as it was.
● Flow time series data means measuring the activity of the attributes over a certain
period; each observation is part of a total and makes up a portion of the overall result.
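The stock vs. flow distinction can be sketched with pandas resampling; the hourly sales series below is synthetic, invented purely for illustration:

```python
import pandas as pd

# Hourly sales readings (synthetic): a flow is activity over time,
# a stock is a snapshot of the level at a point in time.
idx = pd.date_range("2023-01-01", periods=48, freq="h")
sales = pd.Series(10.0, index=idx)

# Flow view: total activity per day (sum over the period).
daily_flow = sales.resample("D").sum()

# Stock view: the last observed level of each day (a snapshot).
daily_stock = sales.resample("D").last()

print(daily_flow.iloc[0])   # 240.0 (24 hourly readings of 10.0)
print(daily_stock.iloc[0])  # 10.0 (the level at day's end)
```

Summing preserves the "portion of the total" character of flow data, while taking the last reading gives the static-snapshot character of stock data.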
Data variations
In time series data, variations can occur sporadically throughout the data:
● Functional analysis can pick out the patterns and relationships within the data to identify
notable events.
● Trend analysis means determining consistent movement in a certain direction. There are
two types of trends: deterministic, where we can find the underlying cause, and
stochastic, which is random and unexplainable.
● Seasonal variation describes events that occur at specific and regular intervals during the
course of a year.
● Serial dependence occurs when data points close together in time tend to be related.
Time series analysis and forecasting models must define the types of data relevant to answering
the business question. Once analysts have chosen the relevant data they want to analyze, they
choose what types of analysis and techniques are the best fit.
● Time series data is data that is recorded over consistent intervals of time.
● Cross-sectional data consists of several variables recorded at the same time.
● Pooled data is a combination of both time series data and cross-sectional data.
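A minimal sketch of the three data shapes, using small DataFrames invented for illustration:

```python
import pandas as pd

# Time series data: one entity recorded over consistent time intervals.
ts = pd.DataFrame({"year": [2020, 2021, 2022],
                   "sales": [100, 110, 125]})

# Cross-sectional data: several entities recorded at the same time.
cs = pd.DataFrame({"store": ["A", "B", "C"],
                   "sales_2022": [125, 90, 140]})

# Pooled (panel) data: several entities, each recorded over time.
pooled = pd.DataFrame({"store": ["A", "A", "B", "B"],
                       "year": [2021, 2022, 2021, 2022],
                       "sales": [110, 125, 80, 90]})
print(pooled.shape)  # (4, 3)
```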
Time Series Analysis Models and Techniques
Just as there are many types and models, there are also a variety of methods to study data. Here
are the three most common.
● Box-Jenkins ARIMA models: These univariate models are used to better understand a
single time-dependent variable, such as temperature over time, and to predict future data
points of variables. These models work on the assumption that the data is stationary.
Analysts have to account for and remove as many differences and seasonalities in past
data points as they can. Thankfully, the ARIMA model includes terms to account for
moving averages, seasonal difference operators, and autoregressive terms within the
model.
● Box-Jenkins Multivariate Models: Multivariate models are used to analyze more than
one time-dependent variable, such as temperature and humidity, over time.
● Holt-Winters Method: The Holt-Winters method is an exponential smoothing technique.
It is designed to predict outcomes, provided that the data points include seasonality.
A time series is simply a sequence of data points that occur in successive order over a given
period of time.
Objectives:
● To understand how a time series works and what factors affect a given variable (or variables) at
different points in time.
● Time series analysis provides insights into the features of the given dataset that change over
time.
● It supports deriving predictions of the future values of the time series variable.
● Assumption: There is one and only one assumption, stationarity, which means that shifting the
origin of time does not affect the statistical properties of the process.
How to analyze Time Series?
● Trend
● Seasonality
● Cyclical
● Irregularity/random
● Trend: There is no fixed interval; any divergence within the given dataset forms a
continuous timeline. The trend can be negative, positive, or null.
● Seasonality: Regular, fixed-interval shifts occur within the dataset on a continuous timeline.
The pattern may look like a bell curve or a sawtooth.
● Cyclical: There is no fixed interval; there is uncertainty in the movement and its pattern.
● Irregularity: Unexpected situations, events, or scenarios produce spikes in a short time span.
Data Types of Time Series
When discussing time series data types and their influence, there are two major types:
● Stationary
● Non- Stationary
6.1 Stationary: A stationary dataset should follow the thumb rules below, without having the
trend, seasonality, cyclical, or irregularity components of a time series:
● The MEAN of the series should be constant throughout the data during the analysis
● The VARIANCE should be constant with respect to the time frame
● The COVARIANCE between two observations should depend only on the lag between them,
not on the point in time at which it is measured
During the TSA model-preparation workflow, we must assess whether the given dataset is
stationary or not, using statistical tests and plots.
Statistical tests: Two common tests are available to check whether a dataset is stationary, the
Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
How do we convert a non-stationary series into a stationary one for effective time series
modeling? There are three major methods available for this conversion:
● Detrending
● Differencing
● Transformation
8.1 Detrending: This involves removing the trend effects from the given dataset and showing
only the differences in values from the trend. It makes cyclical patterns easier to identify.
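A minimal detrending sketch, assuming a synthetic series with a linear trend: fit a straight line and subtract it, leaving only the deviations from the trend.

```python
import numpy as np

# Synthetic series: a linear trend plus a cyclical component.
t = np.arange(100, dtype=float)
y = 2.0 + 0.5 * t + np.sin(t)

# Fit a straight line to the series and subtract it.
slope, intercept = np.polyfit(t, y, 1)
detrended = y - (slope * t + intercept)
print(round(float(detrended.mean()), 6))  # residuals average out to ~0
```

After detrending, the cyclical `sin(t)` component dominates the residual series and is much easier to spot.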
8.2 Differencing: This is a simple transformation of the series into a new time series, which we
use to remove the series' dependence on time and to stabilize the mean of the time series, so
trend and seasonality are reduced during this transformation.
Y't = Yt - Yt-1
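First differencing can be sketched with pandas' .diff(); the five-point series is invented for illustration:

```python
import pandas as pd

y = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

# First difference: Y't = Yt - Yt-1; the first value is undefined (NaN).
diff = y.diff().dropna()
print(diff.tolist())  # [2.0, 3.0, -1.0, 4.0]
```

Note that each differencing pass shortens the series by one observation, which is why the first value is dropped.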
Auto-Correlation Function (ACF): ACF indicates how similar a value within a given time series
is to its previous values. In other words, it measures the degree of similarity between a given
time series and a lagged version of that time series at the different intervals we observed.
The Python statsmodels library calculates autocorrelation. This is used to identify trends in the
given dataset and the influence of formerly observed values on the currently observed values.
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

plot_acf(df_temperature)
plt.show()
plot_acf(df_temperature, lags=30)
plt.show()
Observation: The previous temperature influences the current temperature, but the significance
of that influence decreases as the lag grows, with slight increases at regular time intervals, as the
visualization above shows.
Remember that both ACF and PACF require stationary time series for analysis.
Auto-Regressive model
This is a simple model that predicts future performance based on past performance. It is mainly
used for forecasting when there is some correlation between values in a given time series and the
values that precede and succeed them.
An AR model is a linear regression model that uses lagged variables as input. The linear
regression model can be easily built using the scikit-learn library by indicating the inputs to use.
The statsmodels library provides autoregression-specific functions, where you specify an
appropriate lag value and train the model; this is provided in the AutoReg class, which yields
results in a few simple steps:
● p = the number of past values (lags)
● Yt = a function of the different past values
● Ert = the error at time t
● C = the intercept
#import libraries
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from math import sqrt
# load csv as dataset
# series = read_csv('daily-min-temperatures.csv', header=0, index_col=0,
#                   parse_dates=True, squeeze=True)
# split dataset for test and training
X = df_temperature.values
train, test = X[1:len(X)-7], X[len(X)-7:]
# train autoregression
model = AutoReg(train, lags=20)
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)
# Predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
OUTPUT
predicted=15.893972, expected=16.275000
predicted=15.917959, expected=16.600000
predicted=15.812741, expected=16.475000
predicted=15.787555, expected=16.375000
predicted=16.023780, expected=16.283333
predicted=15.940271, expected=16.525000
predicted=15.831538, expected=16.758333
Test RMSE: 0.617
Observation: Expected (blue) against predicted (red). The forecast looks good through day 4,
with deviation appearing on the 6th day.
Based on the frequency of observation, a time series can be classified into different categories.
Time series forecasting is widely used in manufacturing companies, as it drives primary business
planning, procurement, and production activities. Any forecast errors will ripple through the
supply chain or any business framework.
Time series forecasting can be broadly classified into two categories.
ARIMA, short for 'Auto-Regressive Integrated Moving Average', is a class of models that
'explains' a given time series based on its own previous values, that is, its lags and the lagged
forecast errors, so that the resulting equation can be utilized to forecast future values.
An ARIMA model is characterized by three terms: p, d, and q, where:
The 'p' is the order of the 'AR' (Auto-Regressive) term: the number of lags of Y to be utilized as
predictors. The 'q' is the order of the 'MA' (Moving Average) term: the number of lagged
forecast errors that should be used in the ARIMA model. The 'd' is the order of differencing: the
number of times the series must be differenced to become stationary.
If a Time Series has seasonal patterns, we have to add seasonal terms, and it becomes SARIMA,
short for 'Seasonal ARIMA'.
A Pure AR (Auto-Regressive only) Model is a model which relies only on its own lags. Hence,
we can also conclude that it is a function of the 'lags of Yt':
Yt = α + β1·Yt-1 + β2·Yt-2 + … + βp·Yt-p + ϵt
where Yt-1 is the lag1 of the series, β1 is the coefficient of lag1, α is the intercept term that is
calculated by the model, and ϵt is the error at time t.
Similarly, a Pure MA (Moving Average only) model is a model where Yt relies only on the
lagged forecast errors:
Yt = α + ϵt + φ1·ϵt-1 + φ2·ϵt-2 + … + φq·ϵt-q
where the error terms ϵt, ϵt-1, … are the errors from the autoregressive equations of the
corresponding lags.
Thus, we have concluded Auto-Regressive (AR) and Moving Average (MA) models,
respectively.
The equation of an ARIMA model: an ARIMA model is one where the time series was
differenced at least once to make it stationary and the Auto-Regressive (AR) and Moving
Average (MA) terms are combined. Hence, we get the following equation, with Y't denoting the
differenced series:
Y't = α + β1·Y't-1 + … + βp·Y't-p + ϵt + φ1·ϵt-1 + … + φq·ϵt-q
ARMA: This is a combination of the Auto-Regressive and Moving Average models for
forecasting. The model represents a weakly stationary stochastic process in terms of two
polynomials, one for the Auto-Regressive part and one for the Moving Average part.
ARMA is best for predicting stationary series, so ARIMA came in, since it supports
non-stationary as well as stationary series.
● AR ==> uses the past values to predict the future
● MA ==> uses the past error terms in the given series to predict the future
● I ==> uses differencing of the observations to make the data stationary
AR+I+MA= ARIMA
Understand the Signature of ARIMA
Step 4: Difference the log-transformed series to make it stationary in both mean and variance
Step 5: Plot ACF & PACF, and identify the potential AR and MA model
Step 7: Forecast/Predict the value, using the best fit ARIMA model
Step 8: Plot ACF & PACF for residuals of the ARIMA model, and ensure no more information is
left.
Implementation of ARIMA
We have already discussed steps 1-5; let's focus on the rest here.
results_ARIMA.forecast(3)[0]
Output
array([16.47648941, 16.48621826, 16.49594711])
results_ARIMA.plot_predict(start=200)
plt.show()
Process flow
Finding the order of differencing 'd' in the ARIMA Model
The primary purpose of differencing in the ARIMA model is to make the Time Series stationary.
Output :
Augmented Dickey-Fuller Statistic: -2.464240
p-value: 0.124419
It is necessary to check whether the series is stationary. If it is not, we difference it; otherwise, d
is zero.
The Augmented Dickey-Fuller (ADF) test's null hypothesis is that the time series is not
stationary. Thus, if the ADF test's p-value is less than the significance level (0.05), we reject the
null hypothesis and infer that the time series is stationary. Here, however, the p-value (0.124) is
greater than the significance level, so we cannot reject the null hypothesis. Therefore, we
difference the series and check the autocorrelation plot, as shown below.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
# plot the first-order differenced series and its PACF
fig, axes = plt.subplots(1, 2)
axes[0].plot(df.value.diff())
axes[1].set(ylim=(0, 5))
plot_pacf(df.value.diff().dropna(), ax=axes[1])
plt.show()
Output:
Explanation:
As a result, we can observe that the PACF lag 1 is well above the significance line. Lag 2 also
appears substantial, just managing to cross the significance limit (blue region). However, we will
be conservative and tentatively fix p as 1.
We then use the ACF plot to find the number of Moving Average (MA) terms. A Moving
Average (MA) term is, theoretically, the error of the lagged forecast.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

fig, axes = plt.subplots(1, 2)
plot_acf(mydata.value.diff().dropna(), ax=axes[1])
plt.show()
Output:
Explanation:
In the above example, we have imported the required libraries, modules, and datasets. We have
then plotted the graphs to represent the First Order Differencing and its Autocorrelation. As a
result, we can observe that some lags are pretty above the line of significance. So, let us fix q as
2, tentatively. We can also use the simpler model in case of any doubt that adequately
demonstrates the Y.
Example:
from statsmodels.tsa.arima_model import ARIMA

# fit an ARIMA model of order (1, 1, 2)
mymodel = ARIMA(mydata.value, order=(1, 1, 2))
modelfit = mymodel.fit(disp=0)
print(modelfit.summary())
(Model summary output: a coefficients table with 'coef', standard-error, z, and 'P>|z|' columns
for the AR and MA terms, followed by a table of the AR/MA roots.)
Explanation:
In the above example, we have imported the ARIMA module from statsmodels and created an
ARIMA model of order (1, 1, 2). We have then printed the summary of the model. As we can
observe, the overview of the model reveals a lot of detail. The middle table is the table of
coefficients, where the 'coef' values act as the weights of the related terms. We can also notice
that the MA2 term's coefficient is close to zero and the p-value in the 'P > |z|' column is highly
insignificant; the p-value should ideally be less than 0.05 for the corresponding term to be
significant.