A time series is a collection of observations of well-defined data items obtained
through repeated measurements over time. For example, measuring the value of retail
sales each month of the year would comprise a time series. This is because sales
revenue is well defined, and consistently measured at equally spaced intervals. Data
collected irregularly or only once are not time series.
Time series data is everywhere since time is a constituent of everything that is
observable. As our world gets increasingly instrumented, sensors and systems are
constantly emitting a relentless stream of time series data. Such data has numerous
applications across various industries.
Let’s put this in context through some examples. Time series data can be useful for a
variety of scenarios:
● Tracking daily, hourly, or weekly weather data
● Tracking changes in application performance
● Medical devices to visualize vitals in real-time
Let’s look at some of these examples in more detail.
The above graph shows the quarterly (that is, four times yearly) observations of the
earnings of Johnson & Johnson corporation from 1960 to 1980 recorded at equally
spaced time intervals. There are 84 observations: one for each quarter over 21 years.
These values are not independent of one another; each value depends on previous values.
Generally, the study of time-series data involves two fundamental questions: what
happened (description), and what will happen next (forecasting)?
For the Johnson & Johnson data, you might ask:
● Are Johnson & Johnson’s earnings changing over time?
● Are there quarterly effects, with earnings rising and falling regularly throughout the year?
● Can you then forecast what future earnings will be?
Why are time series different?
● Because data points in time series are collected at adjacent time periods, they are dependent on previous observations. In the language of probability, these are dependent random variables. This is one of the features that distinguishes time-series data from other kinds of data.
● The statistical characteristics of time series data often violate the assumptions of conventional statistical methods. For example, logistic regression, random forests, and most other standard algorithms assume the samples are independent of each other. Because of this, analyzing time-series data requires its own set of tools and methods.
● Two major sources of variation in time series data are:
○ Trend
○ Seasonality
Trend :
The trend shows a general direction of the time series data over a long period of time.
A trend can be increasing (upward), decreasing (downward), or horizontal (stationary).
In the Johnson & Johnson quarterly earnings, we can see an upward trend.
Seasonality :
The seasonal component tells if there is a regularly repeating pattern of highs and lows
related to calendar time such as seasons, quarters, months, days of the week, and so
on. Some examples include an increase in water consumption in summer due to hot
weather conditions, or an increase in the number of airline passengers during holidays
each year.
What is a Stochastic Process?
A stochastic process is a collection of random variables $\{X_t\}$ (not necessarily independent), where the index $t$ takes values in an ordered set that corresponds to moments in time. A time series is a realization of the stochastic process; a realization is one particular function of time, different from the others. The process is characterized by the joint probability distribution of the random variables $X_1, X_2, \dots, X_T$ for any value of $T$.
Obtaining the probability distributions of the process is possible in some situations, for example with climatic variables, where we can assume that each year a realization of the same process is observed, or with processes that can be reproduced in a laboratory.
Nevertheless, in many situations of interest, such as with economic or social variables,
we can only observe one realization of the process.
For example, if we observe the series of yearly growth in the wealth of a country it is
not possible to go back in time to generate another realization. The stochastic process
exists conceptually, but it is not possible to obtain successive samples or independent
realizations of it.
Stationarity :
To tackle the above problem we introduce the concept of the stationary stochastic
process. There are two types of stationary processes:
1. Strong sense stationarity
2. Wide sense stationarity
Strong sense stationarity :
Strong stationarity requires the shift-invariance (in time) of the stochastic process. This
means the joint probability distribution is unchanged if you shift the data in time. For example, the distribution of $(X_t, X_{t+1}, X_{t+2}, \dots)$ is the same as that of $(X_{t+h}, X_{t+h+1}, X_{t+h+2}, \dots)$ for any $h$.
It’s a very strong condition, since to verify it we would need the joint distributions of every set of variables in the process. This is generally hard to check, so we work with a weaker notion: wide sense stationarity.
Before turning to wide sense stationarity, we need to understand the mean and the autocovariance.
Mean of the process:
It is the expected value of the stochastic process $X_t$:
$$\mu_t = E(X_t)$$
Autocovariance:
Autocovariance (“auto” means self) is the covariance between the same time series at two different time points $t_1$ and $t_2$. It is denoted as
$$R_X(t_1, t_2) = E\big[(X_{t_1} - \mu_{t_1})(X_{t_2} - \mu_{t_2})\big]$$
Wide sense stationarity :
The conditions for wide sense stationarity are
1. $E(X_t) = \mu$, a constant that does not depend on $t$
2. $R_X(t_1, t_2) = R_X(t_1 - t_2) = R_X(t_2 - t_1)$, i.e. the autocovariance depends only on the lag
In other words, a wide sense stationary process has the same mean at all time points, and the covariance between the values at any two time points $t$ and $t - k$ depends only on $k$, the difference between the two times, and not on the location of the points along the time axis. For example, the covariance of $X_1$ and $X_5$ should be the same as that of $X_3$ and $X_7$.
Testing Stationarity:
Before modeling, we need to check whether a given series is wide sense stationary. To do so we compute:
● The sample mean for each λ:
$$\mu = \frac{1}{N - \lambda} \sum_{i=\lambda}^{N-1} X_i$$
where N is the total number of samples and λ is the chosen starting point.
Let’s understand this with an example. Suppose we have the following values:
X0: 11
X1: 13
X2: 14
X3: 12
X4: 14
First, take λ = 0; the mean is then the sum of all samples divided by the number of samples:
Mean = (11 + 13 + 14 + 12 + 14)/5 = 12.8
Then increase λ to 1; the mean is
$$\mu = \frac{1}{5-1} \sum_{i=1}^{4} X_i = (13 + 14 + 12 + 14)/4 = 13.25$$
Then increase λ to 2; the mean is
$$\mu = \frac{1}{5-2} \sum_{i=2}^{4} X_i = (14 + 12 + 14)/3 \approx 13.33$$
This should be repeated up to λ = N − 1. For the series to be stationary we should obtain approximately the same mean for every λ, which is not the case in the example above.
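As a quick illustration, here is a minimal Python sketch (NumPy assumed) that reproduces the calculation above by recomputing the sample mean for each starting point λ:

```python
import numpy as np

x = np.array([11, 13, 14, 12, 14])   # X0 ... X4 from the table above
N = len(x)

# Sample mean computed from index lam (lambda) to the end of the series
for lam in range(N):
    mean_lam = x[lam:].sum() / (N - lam)
    print(f"lambda = {lam}: mean = {mean_lam:.2f}")
```

The first three printed values, 12.80, 13.25, and 13.33, match the hand calculation above.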
Next, compute the sample autocovariance for each λ, which is
$$R_X(\tau) = \frac{1}{N - \lambda} \sum_{i=\lambda}^{N-1-\tau} (X_i - \mu)(X_{i+\tau} - \mu)$$
The procedure is the same: pick a value of λ and compute the autocovariance between the two shifted variables. In the end, you should not see large variation over λ.
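A small companion sketch for the autocovariance check, using the same toy series; the sum is truncated so that the index $i + \tau$ stays inside the series:

```python
import numpy as np

def sample_autocovariance(x, lam, tau):
    """Sample autocovariance R_X(tau), using data from index lam onward."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    mu = x[lam:].mean()
    # Pair X_i with X_{i+tau}; i runs from lam up to N - tau - 1
    total = sum((x[i] - mu) * (x[i + tau] - mu) for i in range(lam, N - tau))
    return total / (N - lam)

x = [11, 13, 14, 12, 14]
for lam in (0, 1, 2):
    print(f"lambda = {lam}: R_X(1) = {sample_autocovariance(x, lam, tau=1):.3f}")
```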
Let’s understand this with an example.
Here we have generated 100 samples, shown in the first figure. The first step in checking stationarity is to examine the empirical mean: as we can see in figure 2, the mean computed at different values of λ is constant, so one of the conditions is satisfied.
The next step is to check the autocovariance for every difference $t_1 - t_2 = \tau$ (the symbol is called tau). The value of tau runs from 1 up to the total number of data points. In the third figure, the autocovariance has been calculated for tau = 5 and different values of λ; it is also constant.
We need to check every value of tau, and if the condition is violated for any of them, the series is not stationary. Figure 4 is not the same for different values of λ, so that series is not a stationary time series.
Detrend :
Detrending a time series is the process of removing the trend from a non-stationary time series. If a time series contains only a trend component, the detrended series is stationary, while the series with the trend is non-stationary. A stationary time series oscillates about a horizontal line. If a series does not have a trend, or we remove the trend successfully, the series is said to be trend stationary. Eliminating the trend component may be thought of as rotating the trend line to a horizontal position.
For example, the graph below shows the consumer price index. We can see the increasing trend, shown in orange. Once we remove the trend, the graph looks like the one that follows, with no remaining trend in the data. Now we need to check whether the detrended series is stationary.
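As a rough sketch of detrending (the series here is synthetic, standing in for the consumer price index), one common approach is to fit a straight line and subtract it:

```python
import numpy as np

t = np.arange(200)
series = 0.5 * t + np.random.normal(scale=3.0, size=200)   # upward-trending toy series

# Fit a linear trend and remove it
slope, intercept = np.polyfit(t, series, deg=1)
trend = slope * t + intercept
detrended = series - trend    # now oscillates around the horizontal line at zero
```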
Similarly, deseasonalizing removes the seasonality from the data to help make it stationary.
Models of time series
White Noise:
Consider a time series $W_t$. It is white noise if its elements have:
1. A mean of zero, i.e. $E(W_t) = 0$
2. Zero autocovariance between two distinct random variables: for two values $W_{t_1}$ and $W_{t_2}$ with $t_1 \neq t_2$ the autocovariance is zero, while if $t_1 = t_2$ the autocovariance equals the variance.
$$\text{Autocovariance} = E(W_{t_1} W_{t_2}) = \sigma^2 \, \delta(t_1 - t_2)$$
where $\delta(t_1 - t_2) = 1$ if $t_1 = t_2$ and $0$ otherwise.
An example of white noise is tossing a coin at time 𝑡. It is white noise because it
doesn’t correlate with previous trials.
The graph below is white noise with a mean of 0.
White noise is a series that is not predictable, as it is a sequence of random numbers. If you build a model and its residuals (the differences between predicted and actual values) look like white noise, then you have extracted all the structure the model could capture. Conversely, if there are visible patterns in the residuals, a better model exists for your dataset.
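A minimal sketch of white noise in Python, generated as independent Gaussian draws; the sample mean and the lag-1 autocovariance should both be close to zero:

```python
import numpy as np

np.random.seed(0)
w = np.random.normal(loc=0.0, scale=1.0, size=500)   # Gaussian white noise

print(w.mean())                       # close to 0
print(np.cov(w[:-1], w[1:])[0, 1])    # lag-1 autocovariance, close to 0
```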
Random Walk:
A random walk is another time series model where the current observation is equal to
the previous observation with a random step up or down. It is formally defined as
𝑋𝑡 = 𝑋𝑡−1 + 𝑊𝑡
● 𝑋𝑡 is the current value
● 𝑋𝑡−1is the previous value
● 𝑊𝑡 is white noise
Just like white noise, a random walk cannot be forecast beyond its current value: the step taken at each time is unpredictable from the past. When the term is applied to the stock market, it means that short-run changes in stock prices are unpredictable.
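Because each value is the previous value plus a white-noise step, a random walk can be sketched as the cumulative sum of white noise:

```python
import numpy as np

np.random.seed(0)
w = np.random.normal(size=500)   # white noise steps W_t
x = np.cumsum(w)                 # X_t = X_{t-1} + W_t, starting from X_0 = W_0
```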
Autoregressive Models:
An autoregressive model is when a value from a time series is regressed on previous
values from that same time series.
$$X_t = \sum_{i=1}^{p} a_i X_{t-i} + w_t$$
● 𝑋𝑡 is the current value
● 𝑋𝑡−𝑖 are the previous values
● 𝑊𝑡 is white noise
In this autoregressive model, the response variable from the previous time periods becomes the predictor, and the errors follow the usual assumptions made about errors in a simple linear regression model. The order p of an autoregression is the number of previous values in the series that are used to predict the value at the present time.
So, if the model is a first-order autoregression, written as AR(1), the equation is:
$$X_t = a_1 X_{t-1} + w_t$$
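A minimal simulation sketch of an AR(1) process with an assumed coefficient a1 = 0.7 (any value with |a1| < 1 gives a stationary series):

```python
import numpy as np

np.random.seed(0)
a1, n = 0.7, 500
w = np.random.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a1 * x[t - 1] + w[t]   # X_t = a1 * X_{t-1} + W_t
```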
AR(p) Model :
The AR(p) process is given by,
$$X_t = \sum_{i=1}^{p} a_i X_{t-i} + w_t$$
To determine whether the process is stationary, we look at the roots of the polynomial $1 - a_1 z - a_2 z^2 - \dots - a_p z^p$. For a well-defined (stationary) AR process, every root must lie outside the unit disc, i.e. have modulus greater than 1; if any root has modulus less than or equal to 1, the process is not stationary.
For example, if we have a random walk $X_t = X_{t-1} + W_t$, the polynomial is
$$P(z) = 1 - z$$
Setting $P(z) = 0$ gives $z = 1$, which does not lie outside the unit disc, so the condition is violated.
If we write the Random walk equation differently,
$$X_t - X_{t-1} = W_t$$
then the differenced series is stationary, because it is white noise and white noise is a stationary process. The quantity $X_t - X_{t-1}$ is called the first difference, and differencing is used to turn a non-stationary time series into a stationary one.
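The root condition is easy to check numerically. Below is a small sketch (the coefficient values are just examples) that builds the polynomial 1 − a1 z − … − ap z^p and reports the moduli of its roots:

```python
import numpy as np

def ar_poly_roots(a):
    """Roots of 1 - a1*z - a2*z^2 - ... - ap*z^p (np.roots wants highest power first)."""
    coeffs = [-c for c in reversed(a)] + [1.0]
    return np.roots(coeffs)

print(np.abs(ar_poly_roots([0.5])))   # AR(1) with a1 = 0.5 -> root at 2.0, outside the unit disc
print(np.abs(ar_poly_roots([1.0])))   # random walk -> root at 1.0, on the unit circle, not stationary
```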
The Autocovariance of the AR(p) process is given by,
$$R_X(\tau) = \sum_{i=1}^{p} a_i R_X(\tau - i) + \sigma^2 \delta(\tau)$$
One of the properties of AR models is that the autocovariance decays exponentially, which you can verify by iterating the formula above.
In the image above, the autocovariance function of the AR(1) model decays exponentially, and the same holds for AR(2) and AR(3). For AR(4) the autocovariance function also decays exponentially, though with more fluctuation before it ultimately dies out. This exponential decay of the ACF is a characteristic property of AR models.
Moving Average Models (MA(q)) :
A Moving Average model is when a value from a time series is regressed on past errors
from that same time series. A moving average term in a time series model is a past
error (multiplied by a coefficient). So the equation is
$$X_t = \sum_{i=0}^{q} b_i w_{t-i}$$
● 𝑋𝑡 is the current value
● 𝑊𝑡 is white noise
● 𝑏𝑖 are coefficients
● q is the order: how many past errors are considered
The autocovariance for the MA process is
$$R_X(\tau) = \sigma^2 \sum_{j=0}^{q} b_j\, b_{j-\tau}$$
(with the convention that $b_k = 0$ for $k$ outside $0, \dots, q$).
The autocovariance of an MA(q) process is therefore 0 for τ greater than q. For example, if we have an MA(1) model, the only nonzero value in the ACF beyond lag 0 is at lag 1; all other autocorrelations are 0. Thus a sample ACF with a significant autocorrelation only at lag 1 is an indicator of a possible MA(1) model.
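The cut-off can be checked empirically. The sketch below simulates an MA(1) series with an assumed coefficient and computes the sample ACF with statsmodels:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

np.random.seed(0)
b1 = 0.6                              # assumed MA(1) coefficient
w = np.random.normal(size=2000)
x = w[1:] + b1 * w[:-1]               # X_t = W_t + b1 * W_{t-1}

print(acf(x, nlags=5))                # lag 1 is clearly nonzero, lags >= 2 are near 0
```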
Autoregressive moving average ( ARMA (p,q ) ):
ARMA is a forecasting model in which both autoregression (AR) and moving average (MA) methods are applied to the time series. It accounts for past values as well as past errors. We can write it as
$$X_t = \sum_{i=1}^{p} a_i X_{t-i} + \sum_{j=0}^{q} b_j w_{t-j}$$
So the first part of the equation corresponds to the AR process and the second part of
the equation corresponds to the MA process.
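A short simulation sketch using statsmodels' ArmaProcess (the coefficients are illustrative); note that the library expects lag-polynomial coefficients, so X_t = 0.5 X_{t-1} + W_t + 0.3 W_{t-1} is passed as (1, −0.5) and (1, 0.3):

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

ar = np.array([1, -0.5])   # lag polynomial (1 - 0.5 L) for the AR part
ma = np.array([1, 0.3])    # lag polynomial (1 + 0.3 L) for the MA part
x = ArmaProcess(ar, ma).generate_sample(nsample=500)
```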
Integrated Process:
For AR, MA, and ARMA modeling we require the given time series to be stationary. In most cases we do not have a stationary time series, so we need to convert the non-stationary series into a stationary one; this conversion step is what the “integrated” part of the model name refers to.
To make time-series stationary one such approach is differencing. Differencing can
help stabilize the mean of the time series by removing changes in the level of a time
series, and so eliminating (or reducing) trend and seasonality. Differencing is performed
by subtracting the previous observation from the current observation. The number of
times that differencing is performed is called the difference order.
For example,
1. First-order: $X_t - X_{t-1}$
2. Second-order: $X_t - 2X_{t-1} + X_{t-2}$
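Both orders of differencing are one-liners with NumPy, shown here on a tiny illustrative series:

```python
import numpy as np

x = np.array([3.0, 5.0, 9.0, 10.0, 14.0])

first_diff = np.diff(x, n=1)    # X_t - X_{t-1}
second_diff = np.diff(x, n=2)   # X_t - 2*X_{t-1} + X_{t-2} (differencing applied twice)
```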
ARIMA:
ARIMA is the combination of AR, MA, and the integrated step. Let’s understand each component of ARIMA in detail:
● The “AR” in ARIMA stands for autoregression indicating that the outcome of the
model depends on the past values.
● The “I” stands for integrated to make stationary.
● The “MA” stands for moving average model, indicating that the outcome of the
model depends on past errors.
The ARIMA model takes in three parameters:
1. p is the order of the AR term.
2. q is the order of the MA term.
3. d is the number of times differencing is applied to make the series stationary.
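As a minimal end-to-end sketch, an ARIMA model with an assumed order (p, d, q) = (1, 1, 1) can be fit with statsmodels; in practice p, d, and q come from the analysis described above:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)
series = np.cumsum(np.random.normal(size=200))   # a non-stationary, random-walk-like series

model = ARIMA(series, order=(1, 1, 1))           # order (p, d, q) is an assumption here
result = model.fit()
print(result.summary())
print(result.forecast(steps=5))                  # forecast the next five values
```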
Learning Time Series
AR(p) :
So far we have understood what AR is and how it works. Just to recap we know that
every value is regressed on past p values. Now we are going to understand how to
estimate the coefficients.
Let’s say we have a time series $X_0, X_1, \dots, X_N$ and assume we know the value of p, so the equation of the model is:
$$X_t = \sum_{i=1}^{p} a_i X_{t-i} + w_t$$
In the equation we know the $X_t$’s; the unknowns are the $a_i$’s. To estimate them, we define
$$X = (x_{p+1}, x_{p+2}, x_{p+3}, \dots, x_N)'$$
The vector $X$ takes the values from index $p+1$ to $N$. Consider $p$ equal to 3, i.e. the current value depends on the past 3 values. For $x_1$, $x_2$, and $x_3$ there are not 3 past values available, so we cannot write the equation for them; for $x_4$, however, we have the 3 past values $x_3$, $x_2$, and $x_1$. This is the reason the vector starts at index $p + 1$. We can treat this vector as the dependent variable.
Let a be the vector of coefficients that the model wants to learn,
𝑎 = (𝑎1, 𝑎2, 𝑎3,.......... 𝑎𝑝)'
Now we define the matrix $A$, whose rows hold the $p$ previous values used to predict each entry of $X$:
$$A = \begin{pmatrix} x_p & x_{p-1} & \cdots & x_1 \\ x_{p+1} & x_p & \cdots & x_2 \\ \vdots & \vdots & & \vdots \\ x_{N-1} & x_{N-2} & \cdots & x_{N-p} \end{pmatrix}$$
In the matrix, each row acts as a set of regressors. If we multiply the first row of the matrix $A$ by the vector $a$, we get
$$x_p a_1 + x_{p-1} a_2 + x_{p-2} a_3 + \dots$$
which is used for predicting $x_{p+1}$.
We can think of this as a regression problem in which $X$ is the response vector and $A \cdot a$ gives the fitted values. We need to find the vector $a$ that minimizes the error; it is a least squares problem, and we are finding the best fit just as in regression. The only difference is that in ordinary regression the observations are independent, whereas in a time series the values are dependent. So we can write it as
$$\hat{a} = \arg\min_{a} \lVert X - A\, a \rVert^2$$
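Here is a small sketch of this least squares estimation with NumPy: it builds the vector X and the matrix A exactly as above and solves for a; the AR(2) coefficients used to simulate the test series are arbitrary choices:

```python
import numpy as np

def fit_ar_least_squares(x, p):
    """Estimate AR(p) coefficients a = (a1, ..., ap) by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = x[p:]                                                       # responses x_{p+1}, ..., x_N
    A = np.column_stack([x[p - i:N - i] for i in range(1, p + 1)])  # row t holds x_{t-1}, ..., x_{t-p}
    a, *_ = np.linalg.lstsq(A, X, rcond=None)
    return a

# Sanity check on a simulated AR(2) with known (assumed) coefficients
np.random.seed(0)
n, a_true = 2000, np.array([0.6, -0.3])
w = np.random.normal(size=n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = a_true[0] * x[t - 1] + a_true[1] * x[t - 2] + w[t]

print(fit_ar_least_squares(x, p=2))   # should be close to [0.6, -0.3]
```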
Order Estimation :
The order estimation is one of the most important tasks in time series. There are many
ways to estimate the order, some of them are
1. Split the data, build multiple models with different orders, and compare their errors on the held-out portion.
2. Add a penalty for model complexity. Models are scored both on their performance on the training dataset and on their complexity.
a. Model Performance: How well a candidate model has performed on the
training dataset.
b. Model Complexity: How complicated the trained candidate model is
after training.
Some of the techniques are AIC (Akaike Information Criterion), MDL (Minimum
Description Length). To use AIC for model selection, we simply choose the model
giving the smallest AIC over the set of models considered.
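A minimal sketch of AIC-based order selection with statsmodels' AutoReg, on a simulated AR(2) series (the coefficients are assumptions); the order with the smallest AIC is kept:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

np.random.seed(0)
n = 1000
w = np.random.normal(size=n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + w[t]   # AR(2) test series

# Fit AR(p) for several candidate orders and keep the one with the smallest AIC
aics = {p: AutoReg(x, lags=p).fit().aic for p in range(1, 9)}
best_p = min(aics, key=aics.get)
print(best_p)   # expected to be close to 2
```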
There is also a quick way of determining the order using ACF plots. As we know, for an MA process the ACF becomes zero once τ is greater than q, so the ACF can be used to determine the order of an MA model. For an AR process, however, the ACF decays exponentially, so we cannot read the order off the plot. For that we introduce another concept, the PACF.
PACF (Partial Autocorrelation Function) :
A partial correlation is a conditional correlation: the correlation between two variables after taking into account the values of some other set of variables. For instance, consider a regression context in which y is the response variable and $x_1$, $x_2$, and $x_3$ are predictor variables. The partial correlation between y and $x_3$ is the correlation between those two variables determined while taking into account how both y and $x_3$ are related to $x_1$ and $x_2$.
For example, suppose we want to find the partial autocovariance between $X_t$ and $X_{t+k}$, conditioning on the values in between:
$$X_t \;\big|\; X_{t+1}, X_{t+2}, X_{t+3}, \dots, X_{t+k-1} \;\big|\; X_{t+k}$$
We subtract from $X_t$ its projection onto $X_{t+1}, \dots, X_{t+k-1}$ (the intermediate values), and similarly we subtract from $X_{t+k}$ its projection (the regression) onto those same values. Writing these projections as $P_t$ and $P_{t+k}$, the partial autocovariance is
$$\gamma(k) = E\big[(X_t - P_t)(X_{t+k} - P_{t+k})\big]$$
The PACF gives the partial correlation between time periods $t$ and $t - k$ without taking into account the time lags in between. For example, today’s stock price may depend on the stock price from 3 days ago without being directly influenced by yesterday’s closing price. We therefore consider only the time lags having a direct impact on the future time period, neglecting the intermediate lags between the two time slots $t$ and $t - k$.
So for AR models the PACF becomes zero once the lag is greater than p, and we can use the PACF to estimate the order p.
For AR(p): $\gamma(k) = 0 \;\; \forall \; k > p$
Let’s look at some examples of determining the order p for AR models.
The graph in the image above corresponds to an AR(3) model: as the equation shows, the past three values are used to predict the current value. The ACF plot decays exponentially, so from that plot we cannot tell the order p, but in the PACF plot we can see a cut-off after 3 lags, with nothing significant beyond. So using the PACF plot we can say it is an AR(3) process.
Let’s take a look at another example.
The graph in the image above corresponds to an MA(2) model: as the equation shows, the past two errors are used to predict the current value. The PACF plot decays exponentially, so from that plot we cannot tell the order q, but in the ACF plot we can see a cut-off after 2 lags, with nothing significant beyond. So using the ACF plot we can say it is an MA(2) process.
ACF vs PACF:
             ACF                 PACF
AR(p)        Decays              Zero for h > p
MA(q)        Zero for h > q      Decays
ARMA(p, q)   Decays              Decays
To conclude: for AR models we should look at the PACF plot, and for MA models at the ACF plot. For ARMA models both the ACF and PACF decay exponentially, so the plots give only a rough idea; to choose the orders we need techniques such as AIC.
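In practice the two plots are usually produced with statsmodels; a short sketch on a simulated AR(2) series (assumed coefficients) is shown below, where the PACF cut-off after lag 2 suggests p = 2:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(0)
n = 1000
w = np.random.normal(size=n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + w[t]   # AR(2) test series

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(x, lags=20, ax=axes[0])    # decays gradually for an AR process
plot_pacf(x, lags=20, ax=axes[1])   # cuts off after lag 2
plt.show()
```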