
Intro to Time Series

Introduction
 A time series is a set of observations, each one recorded at a specific time (e.g., annual GDP of a country, sales figures).

 A discrete time series is one in which the set of time points at which observations are made is a discrete set (e.g., all of the above, including irregularly spaced data).

 A continuous time series is obtained when observations are made continuously over some time interval (e.g., an ECG trace).

 Forecasting is estimating how the sequence of observations will continue into the future (e.g., forecasting of major economic variables such as GDP, unemployment, inflation, exchange rates, production, and consumption).

 Forecasting is very difficult, since it's about the future! (e.g., forecasts of daily cases of COVID-19)
Time Series Data
 A time series is a sequence of observations over time. What distinguishes it from other statistical analyses is the explicit recognition of the importance of the order in which the observations are made. Also, unlike many other problems where observations are independent, time series observations are most often dependent.

 Why do we need special models for time series data?

 Prediction of the future based on knowledge of the past (most important).
 To control the process producing the series.
 To have a description of the salient features of the series.

 Applications of time series forecasting

 Economic planning
 Sales forecasting
 Inventory (stock) control
 Exchange rate forecasting
 Etc.
Use of Time Series Data
 To develop forecasting models
 What will the rate of inflation be next year?

 To estimate dynamic causal effects

 If the central bank increases the interest rate now, what will be the effect on the rates of inflation and unemployment in 3 months? In 12 months?

 What is the effect over time on consumption of electronic goods of a hike in the excise duty?

 Time-dependent analysis

 Rates of inflation and unemployment in a country can be observed only over time!
A Forecasting Problem: India / U.S. Foreign
Exchange Rate (EXINUS)
 Source: FRED Economic Data (shaded areas indicate US recessions)
 Units: Indian Rupees to One U.S. Dollar, Not Seasonally Adjusted
 Frequency: Monthly (Averages of daily figures)
Forecasting: Assumptions
 Time series Forecasting: Data collected at regular intervals of time (e.g.,
Weather and Electricity Forecasting).
 Assumptions: (a) Historical information is available;
(b) Past patterns will continue in the future.
Time Series Components
 Trend (𝑇𝑡 ) : pattern exists when there is a long-term increase or decrease in the data.

 Seasonal (𝑆𝑡 ) : pattern exists when a series is influenced by seasonal factors (e.g., the
quarter of the year, the month, or day of the week).

 Cyclic (𝐶𝑡 ) : pattern exists when data exhibit rises and falls that are not of fixed period
(duration usually of at least 2 years).
 Decomposition: 𝑌𝑡 = 𝑓(𝑇𝑡, 𝑆𝑡, 𝐶𝑡, 𝐼𝑡), where 𝑌𝑡 is the data at period t and 𝐼𝑡 is the irregular component at period t.
 Additive decomposition: 𝑌𝑡 = 𝑇𝑡 + 𝑆𝑡 + 𝐶𝑡 + 𝐼𝑡
 Multiplicative decomposition: 𝑌𝑡 = 𝑇𝑡 ∗ 𝑆𝑡 ∗ 𝐶𝑡 ∗ 𝐼𝑡

 A stationary series is roughly horizontal, has constant variance, and shows no patterns predictable in the long term.
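
As an illustration, both decompositions can be computed with base R's decompose(); a minimal sketch, using the AirPassengers monthly series bundled with R (decompose() folds any cyclic component into the trend and random parts):

# Sketch: additive vs. multiplicative decomposition in base R
data(AirPassengers)
dec_add  <- decompose(AirPassengers, type = "additive")        # Yt = Tt + St + It
dec_mult <- decompose(AirPassengers, type = "multiplicative")  # Yt = Tt * St * It
plot(dec_mult)   # panels: observed, trend, seasonal, random (irregular)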
Auto Regression Analysis
 Regression analysis for time-ordered data is known as Auto-Regression
Analysis

 Time series data are data collected on the same observational unit at multiple
time periods

Example: the Indian rate of price inflation.
Modeling with Time Series Data
 Correlation over time
 Serial correlation, also called autocorrelation
 Calculating standard error

 To estimate dynamic causal effects

 Under what conditions can dynamic effects be estimated?
 How can they be estimated?

 Forecasting model: can we predict the trend at a given time, say 2017?

 Forecasting models are built on regression models.
Some Notations and Concepts
 𝑌𝑡 = Value of Y in a period t
 Data set [Y1, Y2, … YT-1, YT]: T observations on the time series random
variable Y
 Assumptions
 We consider only consecutive, evenly spaced observations
 For example, monthly, 2000-2015, no missing months

 A time series 𝑌𝑡 is stationary if its probability distribution does not change over
time, that is, if the joint distribution of (Yi+1, Yi+2, …, 𝑌𝑖+𝑇 ) does not depend on i.

 Stationarity implies that history is relevant; in other words, stationarity requires the future to be like the past (in a probabilistic sense).

 Auto-regression analysis assumes that 𝑌𝑡 is stationary.
Some Notations and Concepts
 Autocorrelation
 The correlation of a series with its own lagged values is called autocorrelation
(also called serial correlation)

Definition: j-th autocorrelation

The j-th autocorrelation, denoted by 𝜌𝑗, is defined as

𝜌𝑗 = Cov(𝑌𝑡, 𝑌𝑡−𝑗) / (𝜎𝑌𝑡 𝜎𝑌𝑡−𝑗)

where Cov(𝑌𝑡, 𝑌𝑡−𝑗) is the j-th autocovariance.
 For the given data, say ρ1 = 0.84.
 This implies that the Dollars-per-Pound series is highly serially correlated.
 Similarly, we can determine ρ2, ρ3, etc., and hence run different regression analyses.
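
A small sketch of computing such a lag-j autocorrelation in R, for any numeric series x (note that acf() uses a slightly different, biased estimator than a plain correlation of the lagged pairs, so the values differ a little):

# Sketch: lag-j autocorrelation from the definition vs. R's acf()
set.seed(1)
x <- cumsum(rnorm(100))             # a toy, serially correlated series
j <- 1
n <- length(x)
cor(x[(j + 1):n], x[1:(n - j)])     # correlation of the lagged pairs
acf(x, lag.max = 3, plot = FALSE)   # acf() normalizes by the lag-0 variance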


Auto-Regression Model for Forecasting
 A natural starting point for forecasting model is to use past values of Y, that is,
Yt-1, Yt-2, … to predict Yt

 An autoregression is a regression model in which Yt is regressed against its own lagged values.

 The number of lags used as regressors is called the order of the autoregression.

 In a first-order autoregression, denoted AR(1), Yt is regressed against Yt-1.
 In a p-th order autoregression, denoted AR(p), Yt is regressed against Yt-1, Yt-2, …, Yt-p.
p-th Order AutoRegression Model
Definition: p-th order autoregression model

 AR(p): 𝑌𝑡 = 𝛽0 + 𝛽1𝑌𝑡−1 + 𝛽2𝑌𝑡−2 + ⋯ + 𝛽𝑝𝑌𝑡−𝑝 + 𝜀𝑡

 For example, AR(1) is 𝑌𝑡 = 𝛽0 + 𝛽1𝑌𝑡−1 + 𝜀𝑡

 The task in AR analysis is to derive the 'best' possible values for 𝛽𝑖 given a time series 𝑌𝑡.
Computing AR Coefficients
 A number of techniques are known for computing the AR coefficients.
 The most common method is the least squares method (LSM).
 The LSM is based upon the Yule-Walker equations.

 Here, ri (i = 1, 2, 3, …, p−1) denotes the i-th autocorrelation coefficient.

 β0 can be chosen empirically, and is usually taken as zero.
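
In R, Yule-Walker estimation is available directly through base R's ar(); a minimal sketch on a simulated AR(2) series:

# Sketch: AR coefficients via the Yule-Walker equations in base R
set.seed(2)
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 300)  # simulated AR(2) series
fit <- ar(y, order.max = 5, method = "yule-walker")      # solves the Yule-Walker equations
fit$order   # order selected by AIC
fit$ar      # estimated AR coefficients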
AutoRegressive Integrated Moving Average (ARIMA) Model
 The ARIMA model, introduced by Box and Jenkins (1976), is a linear regression model for tracking linear tendencies in stationary time series data.

 AR: autoregressive (lagged observations as inputs); I: integrated (differencing to make the series stationary); MA: moving average (lagged errors as inputs).

 The model is expressed as ARIMA(𝑝, 𝑑, 𝑞), where 𝑝, 𝑑, and 𝑞 are integer parameters that determine the structure of the model.

 More precisely, 𝑝 and 𝑞 are the orders of the AR and MA parts respectively, and the parameter d is the level of differencing applied to the data.

 The mathematical expression of the ARIMA model is as follows:

𝑦𝑡 = 𝜃0 + 𝜙1𝑦𝑡−1 + 𝜙2𝑦𝑡−2 + ⋯ + 𝜙𝑝𝑦𝑡−𝑝 + 𝜀𝑡 − 𝜃1𝜀𝑡−1 − 𝜃2𝜀𝑡−2 − ⋯ − 𝜃𝑞𝜀𝑡−𝑞

where 𝑦𝑡 is the actual value, 𝜀𝑡 is the random error at time t, and 𝜙𝑖 and 𝜃𝑗 are the coefficients of the model.

 It is assumed that the random error 𝜀𝑡 = 𝑦𝑡 − 𝑦̂𝑡 (actual minus predicted) has zero mean and constant variance, and satisfies the i.i.d. condition.

 Three basic Steps: Model identification, Parameter Estimation, and Diagnostic Checking.
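
A compact sketch of how these three steps map onto R's forecast package (checkresiduals() is one convenient way to run the diagnostic step):

# Sketch: the three Box-Jenkins steps with the forecast package
library(forecast)
y <- AirPassengers             # any ts object
fit <- auto.arima(y)           # 1) model identification: selects p, d, q
summary(fit)                   # 2) parameter estimation (maximum likelihood)
checkresiduals(fit)            # 3) diagnostic checking: Ljung-Box test, residual ACF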
Differencing in ARIMA Model
ARIMA model
ACF / PACF Plots
ACF / PACF Plots : Example
Forecast Evaluation
Performance metrics such as mean absolute error (MAE), root mean square error (RMSE), and mean absolute percent error (MAPE) are used to evaluate the performance of different forecasting models:

𝑅𝑀𝑆𝐸 = √[(1/𝑛) Σᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̂ᵢ)²]

𝑀𝐴𝐸 = (1/𝑛) Σᵢ₌₁ⁿ |𝑦ᵢ − 𝑦̂ᵢ|

𝑀𝐴𝑃𝐸 = (1/𝑛) Σᵢ₌₁ⁿ |𝑦ᵢ − 𝑦̂ᵢ| / 𝑦ᵢ

where 𝑦ᵢ is the actual output, 𝑦̂ᵢ is the predicted output, and n denotes the number of data points.
By definition, the lower the value of these performance metrics, the better is the
performance of the concerned forecasting model.
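
These metrics are straightforward to compute directly in R; a sketch with toy actual/predicted vectors:

# Sketch: MAE, RMSE, and MAPE for actuals y and predictions yhat (toy values)
y    <- c(100, 102,  98, 105)
yhat <- c( 99, 104, 100, 103)
mae  <- mean(abs(y - yhat))
rmse <- sqrt(mean((y - yhat)^2))
mape <- mean(abs(y - yhat) / y) * 100
c(MAE = mae, RMSE = rmse, MAPE = mape)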
Time Series Analysis using R
Time Series Plot:
The graphical representation of time series data, with time on the x-axis and the data on the y-axis.
A plot of the data over time.

Example
The demand for a commodity E15 for the last 20 months, from April 2012 to November 2013, is given in the E15demand.csv file. Draw the time series plot.
Month Demand Month Demand
1 139 11 193
2 137 12 207
3 174 13 218
4 142 14 229
5 141 15 225
6 162 16 204
7 180 17 227
8 164 18 223
9 171 19 242
10 206 20 239
Reading data to R
mydata <- read.csv("E15demand.csv")
E15 = ts(mydata$Demand, start = c(2012,4), frequency = 12)
E15
plot(E15, type = "b")

For quarterly data, frequency = 4


For monthly data, frequency = 12
Reading data to R
E15 = ts(mydata$Demand)
E15
plot(E15, type = "b")
Trend:
A long-term increase or decrease in the data.
Example: The data on the yearly average of Indian GDP during 1993 to 2003.
Year GDP
1993 94.43
1994 100.00
1995 107.25
1996 115.13
1997 124.16
1998 130.11
1999 138.57
2000 146.97
2001 153.40
2002 162.28
2003 168.73
Seasonal Pattern:
The time series data exhibiting rises and falls influenced by seasonal factors
Example: The data on monthly sales of branded jackets

Month Sales Month Sales Month Sales Month Sales


Jan-02 164 Jan-03 147 Jan-04 139 Jan-05 151
Feb-02 148 Feb-03 133 Feb-04 143 Feb-05 134
Mar-02 152 Mar-03 163 Mar-04 150 Mar-05 164
Apr-02 144 Apr-03 150 Apr-04 154 Apr-05 126
May-02 155 May-03 129 May-04 137 May-05 131
Jun-02 125 Jun-03 131 Jun-04 129 Jun-05 125
Jul-02 153 Jul-03 145 Jul-04 128 Jul-05 127
Aug-02 146 Aug-03 137 Aug-04 140 Aug-05 143
Sep-02 138 Sep-03 138 Sep-04 143 Sep-05 143
Oct-02 190 Oct-03 168 Oct-04 151 Oct-05 160
Nov-02 192 Nov-03 176 Nov-04 177 Nov-05 190
Dec-02 192 Dec-03 188 Dec-04 184 Dec-05 182
Trend and Seasonal Patterns Combined
The time series data may include a combination of trend and seasonal patterns
Example: The data on monthly sales of an aircraft component is given below:

Month Sales Month Sales Month Sales


1 742 21 1341 41 1274
2 697 22 1296 42 1422
3 776 23 1066 43 1486
4 898 24 901 44 1555
5 1030 25 896 45 1604
6 1107 26 793 46 1600
7 1165 27 885 47 1403
8 1216 28 1055 48 1209
9 1208 29 1204 49 1030
10 1131 30 1326 50 1032
11 971 31 1303 51 1126
12 783 32 1436 52 1285
13 741 33 1473 53 1468
14 700 34 1453 54 1637
15 774 35 1170 55 1611
16 932 36 1023 56 1608
17 1099 37 951 57 1528
18 1223 38 861 58 1420
19 1290 39 938 59 1119
20 1349 40 1109 60 1013
Stationary Series:
A series free from trend and seasonal patterns
A series that exhibits only random fluctuations around the mean

Tests for stationarity: unit root tests

Augmented Dickey-Fuller (ADF) test:
Checks whether any specific patterns exist in the series
H0: data is non-stationary
H1: data is stationary
A small p-value suggests the data is stationary.

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test:
Another test for stationarity.
Checks especially for the existence of a trend in the data set
H0: data is stationary
H1: data is non-stationary
A large p-value suggests the data is stationary.
Check stationarity of the data
Example: The data on daily shipments is given in shipment.csv. Check whether the data is stationary.

Day Shipments Day Shipments


1 99 13 101
2 103 14 111
3 92 15 94
4 100 16 101
5 99 17 104
6 99 18 99
7 103 19 94
8 101 20 110
9 100 21 108
10 100 22 102
11 102 23 100
12 101 24 98
R code
mydata <- read.csv("shipment.csv")
shipments = ts(mydata$Shipments)
plot(shipments, type = "b")
Testing whether a series is stationary: unit root tests in R
ADF Test

R Code
install.packages("tseries")
library("tseries")
adf.test(shipments)

Statistic Value
Dickey-Fuller -3.2471
P value 0.09901

Since the p-value = 0.099 < 0.1, the data is stationary at the 10% significance level.
Testing whether a series is stationary: unit root tests in R

KPSS test

R Code
kpss.test(shipments)

Statistic Value
KPSS Level 0.24322
P value > 0.1

Since the p-value > 0.1, the data is stationary at the 10% level of significance.
Differencing: A method for making series stationary
A differenced series is the series of differences between each observation 𝑌𝑡 and the previous observation 𝑌𝑡−1:
𝑌𝑡′ = 𝑌𝑡 − 𝑌𝑡−1

A series with trend can be made stationary with 1st differencing
A series with seasonality can be made stationary with seasonal differencing (see the sketch below)
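
A minimal sketch of both kinds of differencing with base R's diff(), assuming a monthly series (lag = 12 gives the seasonal difference for monthly data):

# Sketch: first differencing vs. seasonal differencing on a monthly series
y   <- AirPassengers
d1  <- diff(y, differences = 1)   # Yt' = Yt - Yt-1, removes trend
d12 <- diff(y, lag = 12)          # Yt' = Yt - Yt-12, removes monthly seasonality
plot(d12, type = "b")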

Example: Is it possible to make the GDP data given in GDP.csv stationary?
R Code
GDP <- read.csv("GDP.csv")
mydata = ts(GDP$GDP)
plot(mydata, type = "b")
kpss.test(mydata)

KPSS Statistic 0.48402
P value 0.04527

Conclusion
The series has a linear trend
The KPSS test (p-value < 0.05) shows the data is not stationary

Identify the number of differences required

R Code
install.packages("forecast")
library(forecast)
ndiffs(mydata)

Differencing required is 1
Yt' = Yt − Yt-1

mydiffdata = diff(mydata, differences = 1)
plot(mydiffdata, type = "b")
adf.test(mydiffdata)
kpss.test(mydiffdata)
Results after 1st-order differencing:

Test Statistic P value


ADF -5.0229 < 0.01
KPSS 0.20905 >0.1

Conclusion: Series became stationary after 1st order differencing


Single Exponential Smoothing:
Gives more weight to recent values compared to older values
More efficient for stationary data without any seasonality or trend

Single Exponential Smoothing: Methodology

Let y1, y2, …, yt be the values; then the estimate of yt+1 is
St+1 = α yt + (1 − α) St
where 0 ≤ α ≤ 1 and S1 = y1
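
The recursion is easy to implement directly; a sketch (ses_fit is an illustrative name; run on the ad revenue values from the example below with α = 0.1285076, it reproduces the fitted values shown later):

# Sketch: single exponential smoothing by direct recursion
# S1 = y1, St+1 = alpha*yt + (1 - alpha)*St
ses_fit <- function(y, alpha) {
  s <- numeric(length(y) + 1)
  s[1] <- y[1]
  for (t in seq_along(y)) s[t + 1] <- alpha * y[t] + (1 - alpha) * s[t]
  s   # s[t + 1] is the forecast of y[t + 1] made at time t
}
ses_fit(c(9, 8, 9, 12, 9, 12, 11, 7, 13, 9, 11, 10), alpha = 0.1285076)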
Example: The data on ad revenue from an advertising agency for the last 12 months is given in Amount.csv. Forecast the ad revenue from the agency for the next month using the single exponential smoothing method with the best value of α.

Month Amount Month Amount


1 9 7 11
2 8 8 7
3 9 9 13
4 12 10 9
5 9 11 11
6 12 12 10

R code
Reading and plotting the data
mydata <- read.csv("Amount.csv")
amount = ts(mydata$Amount)
plot(amount, type ="b")

R code
Checking whether series is stationary
library(tseries)
adf.test(amount)
kpss.test(amount)

Test Statistic P value


ADF -2.3285 0.4472
KPSS 0.24038 >0.1

The KPSS test (p-value > 0.1) indicates that the series is stationary; note that the ADF test (p-value = 0.4472) by itself does not reject non-stationarity, which is common with a series this short.

R code
Fitting the model
mymodel = HoltWinters(amount, beta = FALSE, gamma = FALSE)
mymodel

Smoothing parameter value


alpha 0.1285076
R code

Actual Vs Fitted plot


plot(mymodel)

R code

Computing predicted values and residuals (errors)


pred = fitted(mymodel)
res = residuals(mymodel)
outputdata = cbind(amount, pred[,1], res)
write.csv(outputdata, "amount_outputdata.csv")
Actual vs. predicted values and residuals:

Month Actual Predicted Error


1 9
2 8 9 -1
3 9 8.8715 0.12851
4 12 8.8880 3.11199
5 9 9.2879 -0.2879
6 12 9.2509 2.74908
7 11 9.6042 1.3958
8 7 9.7836 -2.7836
9 13 9.4259 3.57414
10 9 9.8852 -0.8852
11 11 9.7714 1.22859
12 10 9.9293 0.0707
Model diagnostics

Residual = Actual – Predicted


Mean Absolute Error: MAE
Root Mean Square Error: RMSE
Mean Absolute Percentage Error: MAPE

Model diagnostics – R Code

abs_res = abs(res)
res_sq = res^2
pae = abs_res/ amount

Model diagnostics

Month Absolute Error Error Squares Absolute Error / Actual

2 1.0000 1.0000 0.1250
3 0.1285 0.0165 0.0143
4 3.1120 9.6845 0.2593
5 0.2879 0.0829 0.0320
6 2.7491 7.5574 0.2291
7 1.3958 1.9483 0.1269
8 2.7836 7.7483 0.3977
9 3.5741 12.7745 0.2749
10 0.8852 0.7835 0.0984
11 1.2286 1.5094 0.1117
12 0.0707 0.0050 0.0071
Model diagnostics
Statistic Description R Code Value
ME Average residuals mean(res) 0.6638322
MAE Average of absolute residuals mean(abs_res) 1.565
MSE Average of residual squares mse = mean(res_sq) 3.919
RMSE Square root of MSE sqrt(mse) 1.980
MAPE Average of absolute error / actual mean(pae)*100 15.23%

Criteria

MAPE < 10% is reasonably good


MAPE < 5 % is very good
Model diagnostics - Normality of errors with zero mean
R Code
qqnorm(res)
qqline(res)
shapiro.test(res)
mean(res)

Statistic (w) P value


0.962 0.7963

Error Mean 0.6638


Model diagnostics – Normal Q – Q plot
Forecast and Prediction Interval
Prediction interval: predicted value ± z √MSE
where z determines the width of the prediction interval

Prediction Interval z
90% 1.645
95% 1.960
99% 2.576

Forecasted value St+1 = α yt + (1 − α) St
Forecasted value S13 = α y12 + (1 − α) S12
Forecasted value S13 = 0.1285076 × 10 + (1 − 0.1285076) × 9.9293 = 9.9383
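
Putting these numbers together in R (a sketch; the result is close to, but not identical with, the 95% interval forecast() reports below, since forecast() estimates the forecast variance from the fitted model):

# Sketch: prediction interval = point forecast +/- z * sqrt(MSE)
fc_point <- 9.9383   # S13 from the recursion above
mse      <- 3.919    # MSE from the diagnostics table
z        <- 1.960    # 95% interval
c(lower = fc_point - z * sqrt(mse), upper = fc_point + z * sqrt(mse))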
Forecast
R Code
library(forecast)
fc = forecast(mymodel, h = 1)
fc
plot(fc)

Month Forecast 80% Lower 80% Upper 95% Lower 95% Upper
13 9.938382 7.431552 12.44521 6.104517 13.77225
Forecast Plot
TIME SERIES MODELING

General form of a linear model

y is modeled in terms of the x's:
y = a + b1x1 + b2x2 + ⋯ + bkxk
Step 1: Check the correlation between y and the x's
y should be correlated with some of the x's

Time series model

Generally there will not be any x's
Hence the patterns in the y series are explored
y is modeled in terms of previous values of y:
yt = a + b1yt-1 + b2yt-2 + ⋯
Step 1: Check the correlation between yt and yt-1, etc.
The correlation between y and previous values of y is called autocorrelation
Example: Check the auto correlation up to 3 lags in GDP data

Year GDP(yt) yt-1 yt-2 yt-3
1993 94.43
1994 100 94.43
1995 107.3 100 94.43
1996 115.1 107.3 100 94.43
1997 124.2 115.1 107.3 100
1998 130.1 124.2 115.1 107.3
1999 138.6 130.1 124.2 115.1
2000 147 138.6 130.1 124.2
2001 153.4 147 138.6 130.1
2002 162.3 153.4 147 138.6
2003 168.7 162.3 153.4 147

Lag Variables Auto Correlation
1 yt vs yt-1 0.9985
2 yt vs yt-2 0.9984
3 yt vs yt-3 0.9981
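
Each entry in the autocorrelation column is just a Pearson correlation of the series with its lagged copy; a sketch for lag 1, with the GDP figures typed in from the table above:

# Sketch: lag-1 autocorrelation as a plain correlation of lagged pairs
gdp <- c(94.43, 100, 107.3, 115.1, 124.2, 130.1,
         138.6, 147, 153.4, 162.3, 168.7)
n <- length(gdp)
cor(gdp[2:n], gdp[1:(n - 1)])   # yt vs yt-1, ~0.9985 as in the table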
Example: Check the auto correlation up to 3 lags in GDP data

R Code
mydata <- read.csv("Trens_GDP.csv")
GDP <- ts(mydata$GDP, start = 1993, end = 2003)
acf(GDP, 3)
acf(GDP)
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Widely used and very effective modeling approach


Proposed by George Box and Gwilym Jenkins
Also known as Box – Jenkins model or ARIMA(p,d,q)
where
p: number of auto regressive (AR) terms
q: number of moving average (MA) terms
d: level of differencing
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

General Form
yt = c + φ1yt-1 + φ2yt-2 + ⋯ + θ1et-1 + θ2et-2 + ⋯
where
c: constant
φ1, φ2, …, θ1, θ2, … are model parameters
et-1 = yt-1 − st-1; the et are called errors or residuals
st-1: predicted value for the (t−1)-th observation (yt-1)
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 1:
Draw time series plot and check for trend, seasonality, etc

Step 2:
Draw Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) graphs to identify the autocorrelation structure of the series

Step 3:
Check whether the series is stationary using unit root tests (ADF test, KPSS test)
If the series is non-stationary, apply differencing or transform the series
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 4:
Identify the model using ACF and PACF or automatically
The best model is one which minimizes AIC or BIC or both

Step 5:
Estimate the model parameters using maximum likelihood method (MLE)
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 6:
Do model diagnostic checks
The errors or residuals should be white noise and should not be autocorrelated
Run a portmanteau test such as the Ljung-Box test. If the p-value > 0.05, then there is no autocorrelation in the residuals and the residuals are purely white noise.
The model is a good fit
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

 Example: The number of visitors to a web page is given in Visits.csv. Develop a model to predict the daily number of visitors.
SL No. Data SL No. Data
1 259 16 416
2 310 17 248
3 268 18 314
4 379 19 351
5 275 20 417
6 102 21 276
7 139 22 164
8 60 23 120
9 93 24 379
10 45 25 277
11 101 26 208
12 161 27 361
13 288 28 289
14 372 29 138
15 291 30 206
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 1: Read and plot the series


mydata <- read.csv("Visits.csv")
mydata <- ts(mydata$Data)
plot(mydata, type = "b")
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 2: Descriptive Statistics


summary(mydata)

Statistic Value
Minimum 45
Quartile 1 144.5
Median 271.5
Mean 243.6
Quartile 3 313

Maximum 417
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 3: Check whether the series is stationary


library(tseries)
adf.test(mydata)
kpss.test(mydata)
library(forecast)   # for ndiffs()
ndiffs(mydata)

Test Statistic P value


ADF -2.494 0.3829
KPSS 0.15007 > 0.1

The KPSS test indicates that the series is stationary, and ndiffs confirms that no differencing is needed (the ADF p-value of 0.3829 by itself does not reject non-stationarity).

Number of differences required = 0
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))
Step 4: Draw ACF & PACF Graphs
acf(mydata)
pacf(mydata)

Potential Models
ARMA(1,0), since the PACF at lag 1 crosses the 95% confidence bounds
ARMA(0,1), since the ACF at lag 1 crosses the 95% confidence bounds
ARMA(1,1), since both the ACF and PACF at lag 1 cross the 95% confidence bounds
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 5: Identification of model automatically


library(forecast)
mymodel = auto.arima(mydata)
mymodel

Model Log likelihood AIC BIC


ARIMA(1,0,0) -178.31 362.62 366.82

Model Parameters Value


Intercept 242.8594
AR1 0.5064
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))
Step 6: Identification of model manually
arima(mydata, c(0,0,1))
arima(mydata, c(1,0,0))
arima(mydata, c(1,0,1))

Model Log likelihood AIC

p=0,q=1 ARIMA(0,0,1) -179.07 364.15

p=1,q=0 ARIMA(1,0,0) -178.31 362.62

p=1,q=1 ARIMA(1,0,1) -178.31 364.62

Conclusion:
The best model, which minimizes both AIC and BIC, is p=1, q=0, i.e., ARIMA(1,0,0)
This is the same model identified automatically
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 7: Estimation of parameters

ARIMA(1,0,0) Parameters Value Std Error


Intercept 242.8594 32.8552
AR1 0.5064 0.1520

The fitted model is: 𝑌𝑡 = 𝜇 + 𝜙(𝑌𝑡−1 − 𝜇) + 𝜀𝑡 with 𝜇 = 242.8594 and 𝜙 = 0.5064 (the 'intercept' reported by R's arima() is the series mean 𝜇).
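
As a check on this parameterization, the one-step-ahead forecast can be reproduced by hand (a sketch; y30 = 206 is the last observation in Visits.csv, and the result matches the first forecast in Step 9 below):

# Sketch: one-step-ahead AR(1) forecast by hand, mean-form parameterization
mu   <- 242.8594   # 'intercept' reported by arima() is the series mean
phi  <- 0.5064
y_30 <- 206        # last observation in Visits.csv
mu + phi * (y_30 - mu)   # ~224.195, matching forecast(mymodel) for period 31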


Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics


summary(mymodel)

Statistic Description Value


ME Residual average -0.3470709
MAE Average of absolute residuals 76.90398
RMSE Root mean square of residuals 91.81328
MAPE Mean absolute percent error 47.78088
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics


pred = fitted(mymodel)
res = residuals(mymodel)

Normality check on Residuals


qqnorm(res)
qqline(res)
shapiro.test(res)
hist(res, col = "grey")
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics

Normality check on Residuals : Normal Q – Q Plot


Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics

Normality check on Residuals: Histogram of Residuals


Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics

Normality check on Residuals: Shapiro Wilk Normality test

Statistic p value
0.96445 0.4004

Since p > 0.05, the residuals are consistent with normality


Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics

Checking autocorrelation among residuals: ACF of residuals

None of the autocorrelation values exceeds the 95% confidence bounds

The residuals are not autocorrelated
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model Diagnostics

Tests for checking auto correlation among residuals

Ljung-Box Test

Tests whether the residuals are independent, i.e., not autocorrelated

If the p-value ≥ 0.05, then the residuals are not autocorrelated and are independent
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 8: Model diagnostics


Ljung & Box Test
Box.test(res, lag = 15, type = "Ljung-Box")

Test Lag Statistic df p value

Ljung & Box 15 6.5528 15 0.9689

Since the p-value ≥ 0.05, the residuals are not autocorrelated
The residuals are white noise
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

Step 9: Forecasting upcoming values


fc = forecast(mymodel, h = 3)
fc

Period Point Forecast 80% Lower 80% Upper 95% Lower 95% Upper
31 224.1953 102.40201 345.9885 37.92856 410.4620
32 233.4086 96.89144 369.9258 24.62361 442.1936
33 238.0739 98.03062 378.1172 23.89618 452.2516
Auto Regressive Integrated Moving Average Models (ARIMA (p,d,q))

 Exercise 1: The data on sales of an electromagnetic component is given in Sales.csv. Develop a forecasting model.

Period Data Period Data


1 4737 16 4405
2 5117 17 4595
3 5091 18 5045
4 3468 19 5700
5 4320 20 5716
6 3825 21 5138
7 3673 22 5010
8 3694 23 5353
9 3708 24 6074
10 3333 25 5031
11 3367 26 5648
12 3614 27 5506
13 3362 28 4230
14 3655 29 4827
15 3963 30 3885
Cheatsheet
References

 Read online: https://otexts.com/fpp3/

 A recent survey paper: https://arxiv.org/abs/2010.05079
