
Time Series Analysis

Course notes for STAT 479

Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1

April 20, 2021


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
Contents

Preface

1 Time Series: Overview
1.1 Types of Noise
1.1.1 White Noise
1.1.2 Autoregressive
1.1.3 Moving Average
1.1.4 Markov Processes and Martingales
1.2 Properties of Time Series
1.2.1 Autocovariance
1.2.2 Cross-Covariance
1.2.3 Stationarity
1.3 Estimation
1.3.1 Estimating the mean
1.3.2 Estimating the autocovariance
1.3.3 Detecting White Noise

2 Statistical Models for Time Series
2.1 Regression
2.1.1 Linear Regression in Brief
2.1.2 Linear Regression for Time Series
2.2 Smoothing
2.3 ARIMA Models for Time Series
2.3.1 Autoregressive Processes
2.3.2 Moving Average Process
2.3.3 Auto Regressive Moving Average Processes
2.3.4 ARIMA
2.4 Testing for Stationarity and Autocorrelation
2.4.1 Box-Pierce and Ljung-Box Tests
2.4.2 Durbin–Watson Test
2.4.3 Breusch–Godfrey test
2.4.4 Augmented Dickey-Fuller Test
2.4.5 Phillips–Perron test
2.5 Autocorrelation and Partial Autocorrelation
2.5.1 ACF for AR(p)
2.5.2 ACF for MA(q)
2.5.3 PACF for AR(p)
2.5.4 PACF for MA(1)

3 Estimation and Forecasting
3.1 The AR process
3.1.1 Estimation for AR processes
3.1.2 Forecasting for AR processes
3.2 The ARMA Process
3.2.1 Estimation for ARMA processes
3.2.2 Forecasting for ARMA processes
3.3 Seasonal ARIMA
3.3.1 Seasonal Autoregressive Processes
3.3.2 Seasonal ARMA Processes

4 Analysis in the Frequency Domain
4.1 Periodic Processes
4.1.1 Regression, Estimation, and the FFT
4.2 Spectral Distribution and Density
4.2.1 Filtering and ARMA
4.3 Spectral Statistics
4.3.1 Spectral ANOVA
4.3.2 Large Sample Behaviour
4.3.3 Banding, Tapering, Smoothing, and more
4.3.4 Parametric Estimation
4.4 Filtering
Preface

The understanding’s in your mind. You only have to find it. But, time—Time, the creature said, is the simplest thing there is.

Time is the Simplest Thing
Clifford D Simak (1961)

The following are lecture notes originally produced for an undergraduate course on time series at the University of Alberta in the winter of 2020. The aim of these notes is to introduce the main topics, applications, and mathematical underpinnings of time series analysis.
These notes were produced by consolidating two main sources: the textbook of Shumway and Stoffer, Time Series Analysis and Its Applications, and past course notes produced by Dr. Doug Weins, also at the University of Alberta.

Adam B Kashlak
Edmonton, Canada
January 2020

Chapter 1

Time Series: Overview

Introduction
In this chapter we consider different types of time series processes that we may
encounter in practice. The main difference between time series and other areas of
statistics like linear regression is that the noise or errors can be correlated. This
arises from the fact that time implies causality; the past predicts the future. Thus,
we no longer live in the independent and identically distributed setting of most other
areas of statistics.
This chapter also reintroduces notions of covariance and correlation in the con-
text of time series, which become autocovariance and autocorrelation. The critical
property of stationarity is defined, which allows us to estimate such autocovariances
and autocorrelations from a given time series dataset.

1.1 Types of Noise


When one is first introduced to the realm of statistics, the data on hand is treated as
independent and identically distributed observations from some population. That
is, the noise or errors or randomness present in the data is treated as a collection
of iid random variables—typically mean zero Gaussians. Time series data breaks
from the iid setting as causality becomes a key notion: the effects of random errors
in the past are present in future observations as well. Thus, we consider many types
of noise that can occur in real data.

1.1.1 White Noise


Let wt denote the white noise process. This is a random variable indexed by time t
such that

$$\mathbb{E}w_t = 0 \quad\text{and}\quad \operatorname{Var}(w_t) = \sigma^2 \;\;\forall t \in [0, T], \quad\text{and}\quad \operatorname{cov}(w_t, w_s) = 0 \;\;\forall t \neq s.$$

That is, $w_t$ and $w_s$ are uncorrelated but not necessarily independent. This can be strengthened to iid noise if uncorrelated is replaced with independent, and further strengthened to Gaussian white noise where every $w_t \sim N(0, \sigma^2)$.
The intuition behind the term white noise comes from signal processing, where a signal is white if it contains all possible frequencies. Furthermore, the white noise process will be used to generate all of the subsequent processes.

1.1.2 Autoregressive
The autoregressive (AR) process is a natural way to encode causality into the white
noise process, that is, demonstrate how the past influences the future. The general
formula is
$$X_t = \sum_{i=1}^{p} \theta_i X_{t-i} + w_t,$$
which says that the past observation $X_{t-i}$ contributes $\theta_i \in \mathbb{R}$ to the present observation $X_t$.
For example, if p = 1 and θ1 = 1, then we have the process

Xt = Xt−1 + wt ,

which is also an example of a Markov process and a martingale, to be discussed below. This process could model, say, the price of a commodity where the previous price $X_{t-1}$ is the best guess for the current price $X_t$ plus or minus some noise $w_t$.
The AR process with p = 1 and θ1 = 1 can be thought of as a random walk. An
interesting and useful extension of this process is the random walk with drift, which
is
$$X_t = a + X_{t-1} + w_t$$
for some $a \neq 0$. In this case, the process increases (or decreases) by the fixed amount
a at each time step. But there is also the addition of white noise wt at each step.
Hence, one could try to estimate the drift term a from historical data in order to
ascertain if Xt (say the price of some commodity) is increasing or decreasing or
remaining constant up to random noise.

Example 1.1.1. In Figure 1.1, we have four examples of autoregressive processes where $w_t$ is Gaussian white noise.

1. The white noise process: $X_t = w_t$.

2. Random walk: $X_t = X_{t-1} + w_t$.

3. AR(2): $X_t = X_{t-1} - 0.2 X_{t-2} + w_t$.

4. AR(3): $X_t = X_{t-1} - 0.2 X_{t-2} + 0.18 X_{t-3} + w_t$.

Based solely on the plots, the random walk and the chosen AR(3) process look sim-
ilar. Likewise, it may be hard to immediately identify that the top left plot is white
noise while the bottom left is an AR(2) process.
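The four processes above can be simulated directly in R. The following is a minimal sketch, assuming Gaussian white noise; the sample sizes, seed, and plotting layout are arbitrary choices and not taken from the notes.

```r
# Simulate the four processes of Example 1.1.1 with Gaussian white noise.
set.seed(479)                                    # arbitrary seed
wn  <- rnorm(200)                                # 1. white noise
rw  <- cumsum(rnorm(1000))                       # 2. random walk: X_t = X_{t-1} + w_t
ar2 <- arima.sim(model = list(ar = c(1, -0.2)), n = 1000)        # 3. AR(2)
ar3 <- arima.sim(model = list(ar = c(1, -0.2, 0.18)), n = 1000)  # 4. AR(3)

par(mfrow = c(2, 2))
plot.ts(wn,  main = "White Noise")
plot.ts(rw,  main = "Random Walk")
plot.ts(ar2, main = "AR 2")
plot.ts(ar3, main = "AR 3")
```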

1.1.3 Moving Average


The moving average (MA) process is a smoother type of noise than the white noise
process. It can be expressed by the formula
$$X_t = w_t + \sum_{j=1}^{q} \phi_j w_{t-j}$$

for $\phi_j \in \mathbb{R}$. Compared to the above AR formula, the MA formula averages over the noise terms $w_t$ as opposed to the observed values $X_t$. It can be thought of as ripple effects in the process. That is, if there is a shock to the process $w_{t-1}$, then its effects are still felt at time $t$ through the term $\phi_1 w_{t-1}$.
Alternatively, this model can be written as an overall average like
$$X_t = \sum_{j=-q/2}^{q/2} \phi_j w_{t+j}.$$
In this way, we can consider the MA process as a weighted average of the white noise process. A simple example is $X_t = (w_{t-1} + w_t + w_{t+1})/3$.
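A small R sketch of generating moving-average noise from white noise; the coefficients and window length below are illustrative choices, not values used in the notes.

```r
# Moving-average noise built from a single white noise sequence.
set.seed(1)
w <- rnorm(200)

# One-sided MA(2): X_t = w_t + 0.5 w_{t-1} + 0.3 w_{t-2} (coefficients are illustrative)
x_onesided <- stats::filter(w, filter = c(1, 0.5, 0.3), method = "convolution", sides = 1)

# Symmetric three-point average: X_t = (w_{t-1} + w_t + w_{t+1}) / 3
x_symmetric <- stats::filter(w, filter = rep(1/3, 3), method = "convolution", sides = 2)

plot.ts(cbind(w, x_onesided, x_symmetric), main = "White noise and moving averages")
```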

1.1.4 Markov Processes and Martingales


Three very useful tools in probability theory that have been extensively studied
are Markov processes, martingales, and the Gaussian process. We briefly introduce
them here, noting that there has been extensive research on each topic.1
A Markov process is one where conditioning on the entire history of the process
is equivalent to just conditioning on the most recent time point. More precisely, Xt
is Markov if
E (Xt |Xt−1 , . . . , X1 ) = E (Xt |Xt−1 ) .
For example, the AR process with p = 1 is Markov as the present value Xt only
depends on the past via Xt−1 and no other Xt−i . Note that people have also
studied order p Markov processes where the present depends on the p most recent time points, but the above definition is the most commonly used.
A martingale is often defined as a fair game, one where the expected winnings/losses are zero. That is, it is a stochastic process where the conditional expectation is equal to the most recent observation, i.e.
$$\mathbb{E}(X_t \mid X_{t-1}, \ldots, X_1) = X_{t-1}.$$
1
See, for example, the two volume series Diffusions, Markov Processes, and Martingales by
Rogers and Williams.


Figure 1.1: Examples of the white noise process and autoregressive versions of that
white noise process of order 1, 2, and 3.


Figure 1.2: Examples of the white noise process and moving averaged versions of
that white noise process averaged over windows of length 3, 9, and 21.

The AR process with p = 1 and θ1 = 1 is an example of a martingale. Note that
supermartingales and submartingales have also been studied where the above = is
replaced by a ≤ or ≥, respectively.
As the normal distribution lends itself elegantly to other areas of statistics, so
does it to time series. The Gaussian process is a generalization of the multivariate
normal distribution. It is a stochastic process Xt where for any finite collection of
time points {t1 , . . . , tk }, the random vector (Xt1 , . . . , Xtk ) is multivariate normal.
Much like the multivariate normal distribution, the Gaussian process can be defined
by its mean $\mu_t$ and covariance $\Sigma_{s,t}$ where
$$\mu_t = \mathbb{E}X_t \quad\text{and}\quad \Sigma_{s,t} = \operatorname{cov}(X_s, X_t).$$
Many time series fall under the category of linear processes. Given a white noise process $w_t$, a linear process is defined as
$$X_t = \mu + \sum_{j=-\infty}^{\infty} \theta_j w_{t-j},$$
which is to say that every $X_t$ is a linear combination of the terms in the white noise process with some mean $\mu$ added on. Here, we require $\sum_j \theta_j^2 < \infty$ in order for the process to have a finite variance. However, as we are generally interested in modelling causal processes in time—i.e. the past predicts the future and not vice versa—we can instead consider the more restricted definition
$$X_t = \mu + \sum_{j=0}^{\infty} \theta_j w_{t-j}.$$

Example 1.1.2 (The AR(1) Process). We revisit the AR(1) process, $X_t = \theta X_{t-1} + w_t$, and by using this recursive definition and assuming we can extend the series infinitely into the past, we can rewrite it as
$$X_t = \theta(\theta X_{t-2} + w_{t-1}) + w_t = \ldots = \sum_{j=0}^{\infty} \theta^j w_{t-j}.$$
Infinite series are limits—i.e. $\sum_{j=0}^{\infty} \theta^j w_{t-j} = \lim_{N\to\infty} \sum_{j=0}^{N} \theta^j w_{t-j}$. Hence, this sum may not converge in any meaningful way. Let $S_N(\theta) = \sum_{j=0}^{N} \theta^j w_{t-j}$; then
$$\mathbb{E}S_N = 0 \quad\text{and}\quad \operatorname{Var}(S_N) = \sigma^2 \sum_{j=0}^{N} \theta^{2j} = \sigma^2\left(\frac{1 - \theta^{2N+2}}{1 - \theta^2}\right).$$
Thus, if $|\theta| < 1$, then $\operatorname{Var}(S_N(\theta)) \to \sigma^2/(1 - \theta^2)$, and if $w_t$ is Gaussian noise, then
$$S_N(\theta) \xrightarrow{d} N\bigl(0, \sigma^2/(1 - \theta^2)\bigr).$$
In the case of the random walk, which is $\theta = 1$, the series does not converge, but by the central limit theorem we have
$$N^{-1/2} S_N(1) \xrightarrow{d} N(0, 1).$$

1.2 Properties of Time Series
1.2.1 Autocovariance
As a time series Xt can be thought of as a single entity, the covariance between two
time points is referred to as the autocovariance and is defined as

KX (s, t) = cov(Xt , Xs ).

The notation K comes from treating the autocovariance as a kernel function for an
integral transform.2 Note that the autocovariance function is symmetric, K(s, t) =
K(t, s), and positive (semi) definite in the sense that for any finite collection of time
points {t1 , . . . , tk }, we have a k × k matrix with i, jth entry K(ti , tj ) and this matrix
is positive (semi) definite.
Similar to the multivariate setting, we can normalize the autocovariance into an autocorrelation by
$$\rho(s, t) = \frac{K_X(s, t)}{\sqrt{K_X(s, s)\,K_X(t, t)}}.$$

Example 1.2.1 (AR(1) with drift). For $w_t$ a white noise process with variance $\sigma^2$, consider the AR(1) with drift process
$$X_t = a + \theta X_{t-1} + w_t$$
for some real $a$ and $\theta$. We can use the recursive definition to get that
$$X_t = a + \theta(a + \theta X_{t-2} + w_{t-1}) + w_t = (1 + \theta)a + \theta^2 X_{t-2} + \theta w_{t-1} + w_t.$$
This can be repeated $m$ times to get
$$X_t = a\sum_{j=0}^{m}\theta^j + \theta^{m+1} X_{t-m-1} + \sum_{j=0}^{m}\theta^j w_{t-j}.$$
Then, assuming this process has an infinite past and that $|\theta| < 1$, we can take $m$ to infinity to get
$$X_t = \frac{a}{1 - \theta} + \sum_{j=0}^{\infty}\theta^j w_{t-j},$$
which happens to be a linear process. The mean can now be quickly calculated to be $\mathbb{E}X_t = a/(1 - \theta)$ as $\mathbb{E}w_t = 0$ for all $t$. Furthermore, the autocovariance is
$$K_X(s, t) = \operatorname{cov}(X_s, X_t) = \operatorname{cov}\Bigl(\sum_{j=0}^{\infty}\theta^j w_{s-j}, \sum_{i=0}^{\infty}\theta^i w_{t-i}\Bigr) = \mathbb{E}\Bigl[\sum_{j,i=0}^{\infty}\theta^{i+j} w_{s-j} w_{t-i}\Bigr]$$
$$= \sigma^2 \sum_{j,i=0}^{\infty}\theta^{i+j}\,\mathbb{1}[s - j = t - i] = \sigma^2\theta^{|s-t|}\sum_{j=0}^{\infty}\theta^{2j} = \frac{\sigma^2\theta^{|s-t|}}{1 - \theta^2}.$$
This implies that the variance is $\sigma^2/(1 - \theta^2)$. Note that this process is a weakly stationary process, which will be defined below.

2For example, $g(s) = \int f(t) K(s, t)\,dt$.

1.2.2 Cross-Covariance
The cross covariance is similar to the auto covariance, but applies to multivariate
time series. More simply, if we have two time series Xt and Yt , then we can consider

KXY (t, s) = cov (Xt , Ys )

and the cross-correlation


$$\rho_{XY}(t, s) = \frac{K_{XY}(t, s)}{\sqrt{K_X(t, t)\,K_Y(s, s)}}.$$

1.2.3 Stationarity
In a broad sense, stationarity implies some property of the time series is invariant
to shifts in time. There are two such notions we will consider.

Definition 1.2.2 (Weak Stationarity). A process Xt is said to be weakly stationary


if its mean and autocovariance are invariant to time shifts. That is, for any r > 0,

(Mean) : EXt = EXt+r = µ


(Autocovariance) : KX (t, s) = KX (t + r, s + r)

Definition 1.2.3 (Strong Stationarity). A process Xt is said to be strongly station-


ary if its joint distribution function is invariant to time shifts. That is, for any
r > 0, and any finite collection of time points t1 , . . . , tk ,

F (Xt1 , . . . , Xtk ) = F (Xt1 +r , . . . , Xtk +r )

where F () is the joint CDF of the k random variables. That is, F (Xt1 , . . . , Xtk ) =
P (Xt1 ≤ xt1 , . . . , Xtk ≤ xtk ) .

For a weakly (and thus for a strongly) stationary process Xt , we have that the
autocovariance function is

KX (s, t) = KX (s − t, 0) = KX (τ )

for some τ being the difference between two time points (the time lag). Hence, if
a process is weakly stationary, the autocovariance can be treated as a univariate
function.
Furthermore, this univariate function is both symmetric and bounded in τ . This
can be seen by noting, for symmetry, that

KX (τ ) = KX ((τ + t) − t) = KX (τ + t, t)
= KX (t, τ + t) = KX (t − (τ + t)) = KX (−τ ),

and, for boundedness, that KX (0) = Var (Xt ) for all t and by applying the Cauchy-
Schwarz inequality that, for any r,

KX (0)2 = Var (Xt ) Var (Xt+r ) ≥ cov (Xt , Xt+r )2 = KX (r)2 .

This implies that |KX (τ )| ≤ KX (0).


Stationarity also helps in the statistical context with estimation. Specifically,
we cannot estimate the autocovariance for a single non-stationary series, but can
estimate it for a stationary series. With that in mind, often we can modify a time
series to make it stationary.

Example 1.2.4. Consider the time series

Xt = a + bt + Yt

where Yt is a mean zero stationary process. The mean µt = a + bt is a function of


t. However, subtracting off this linear trend leaves

Xt − a − bt = Yt ,

which is stationary.
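A minimal R sketch of Example 1.2.4: simulate a stationary series plus a linear trend and remove the trend by regression. The intercept, slope, and AR(1) parameter below are arbitrary illustrative values.

```r
# Example 1.2.4: X_t = a + b t + Y_t with Y_t stationary; detrend by regression.
set.seed(42)
t <- 1:500
Y <- arima.sim(model = list(ar = 0.5), n = 500)  # mean zero stationary process
X <- 2 + 0.05 * t + Y                            # X_t = a + b t + Y_t

fit       <- lm(X ~ t)                           # estimate the linear trend a + b t
detrended <- residuals(fit)                      # approximately recovers Y_t

par(mfrow = c(1, 2))
plot.ts(X, main = "Trend plus stationary noise")
plot.ts(detrended, main = "After removing the trend")
```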

As with cross-covariance, we can consider joint stationarity of two series Xt and


Yt when dealing with multiple time series at once.

Definition 1.2.5 (Joint Stationarity). The processes Xt and Yt are said to be jointly
stationary if both are individually stationary and also if the cross covariance function
is also stationary—i.e. KXY (t, s) = KXY (t + r, s + r) for any r.

1.3 Estimation
Estimating parameters in a time series model is harder than it is in standard statis-
tics where we often assume that the observations are iid. Now, we are faced with a
single sequence of points X1 , X2 , . . . , XT which are not iid. To estimate the mean
and autocovariance we require the process to be weakly stationary. If it isn’t, then,
for example, every Xt will have its own mean and we cannot estimate it.
Note as indicated above, we will consider time series with T total observations
observed at equally spaced intervals. If the observations are irregularly spaced, more
work has to be done.

1.3.1 Estimating the mean


For a weakly stationary process, $\mathbb{E}X_t = \mu$ for all $t = 1, \ldots, T$, so we can consider the usual sample average, $\bar X = T^{-1}\sum_{t=1}^{T} X_t$, as an estimator for $\mu$. Specifically, due to the linearity of expectation, we have $\mathbb{E}\bar X = \mu$. However, as these $X_t$ are not uncorrelated, the variance calculation is a bit more involved than usual.
$$\operatorname{Var}(\bar X) = \operatorname{Var}\Bigl(\frac{1}{T}\sum_{t=1}^{T} X_t\Bigr) = \frac{1}{T^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\operatorname{cov}(X_t, X_s) = \frac{1}{T^2}\sum_{t=1}^{T}\sum_{s=1}^{T} K_X(|t - s|) = \frac{1}{T}K_X(0) + \frac{2}{T}\sum_{z=1}^{T-1}\Bigl(1 - \frac{z}{T}\Bigr)K_X(z).$$

Note that in the uncorrelated case KX (0) = σ 2 and KX (z) = 0 for z > 0. Thus, the
formula reduces to the usual σ 2 /T .

1.3.2 Estimating the autocovariance


For a weakly stationary process, we can define the sample autocovariance function to be
$$\hat K_X(h) = \frac{1}{T}\sum_{t=1}^{T-h}(X_{t+h} - \bar X)(X_t - \bar X).$$

As h gets bigger, the number of terms in the sum decreases giving less accurate
estimation. Similarly, the sample autocorrelation function is defined to be $\hat\rho(h) = \hat K_X(h)/\hat K_X(0)$,

and the sample cross-covariance and cross-correlation are
$$\hat K_{XY}(h) = \frac{1}{T}\sum_{t=1}^{T-h}(X_{t+h} - \bar X)(Y_t - \bar Y) \quad\text{and}\quad \hat\rho_{XY}(h) = \frac{\hat K_{XY}(h)}{\sqrt{\hat K_X(0)\,\hat K_Y(0)}}.$$

Examples of estimated autocovariance functions are in Figure 1.3 for the white noise
process, the random walk, the moving average process with a window of length 9,
and the autoregressive process Xt = Xt−1 − 0.2Xt−2 + wt . However, we note that
the random walk and this AR(2) process are not stationary. Thus, even though we
can compute the autocovariance in R with the acf function, we must consider its
validity.
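In R, the sample autocorrelation and autocovariance can be computed with the acf function; the sketch below re-simulates two of the earlier example series (with arbitrary seeds and lengths) and mirrors the idea of Figure 1.3.

```r
# Sample autocorrelation and autocovariance via acf(), as in Figure 1.3.
set.seed(479)
wn  <- rnorm(200)
ar2 <- arima.sim(model = list(ar = c(1, -0.2)), n = 1000)

acf(wn,  lag.max = 15)                       # autocorrelation of white noise
acf(ar2, lag.max = 15)                       # autocorrelation of the AR(2) series
acf(ar2, lag.max = 15, type = "covariance")  # sample autocovariance instead
```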
The sample autocovariance is defined in such a way to make the function positive semi-definite. This ensures that estimates of variances will never be negative as, for any real vector $a = (a_1, \ldots, a_T) \in \mathbb{R}^T$, the variance estimate for $a \cdot X = \sum_{t=1}^{T} a_t X_t$ is
$$\sum_{t=1}^{T}\sum_{s=1}^{T} a_t a_s \hat K(|t - s|),$$
which must be non-negative.

1.3.3 Detecting White Noise


A main goal of time series analysis is to transform the data into a white noise
process. That is, we aim to identify trends and patterns in the process. Once those
have been removed, what remains is random noise. Hence, we need to determine if
a process has been transformed into a white noise process.
One way to do this is to look at the estimated autocorrelation function for a
given time series as displayed in Figure 1.3. Note that for the white noise process,
we see a spike at lag 0 referring to an estimate of the variance of the process $\hat K_X(0)$, which, in this case, is 1. At the remaining lags, $\hat K_X(h)$ is small for $h \neq 0$. The question is: what does small mean?
In the plots of Figure 1.3, R includes blue dashed lines at the value $2/\sqrt{T}$. This is because for $w_t$ iid mean zero white noise with variance $\mathbb{E}w_t^2 = \sigma^2$ and finite fourth moment, $\mathbb{E}w_t^4 < \infty$, we have a sample autocovariance of approximately
$$\hat K_w(h) \approx \frac{1}{T}\sum_{t=1}^{T-h} w_{t+h} w_t.$$

Figure 1.3: Estimated autocorrelations from the processes from Figures 1.1 and 1.2
using the acf function in R.

For $h \neq 0$, this has zero mean as $\mathbb{E}(w_t w_s) = 0$ for $s \neq t$, and has a variance of
$$\operatorname{Var}\Bigl(\frac{1}{T}\sum_{t=1}^{T-h} w_{t+h} w_t\Bigr) = \frac{1}{T^2}\sum_{s,t=1}^{T-h}\mathbb{E}[w_{s+h} w_s w_{t+h} w_t] = \frac{1}{T^2}\sum_{t=1}^{T-h}\mathbb{E}[w_{t+h}^2 w_t^2] = \frac{T - h}{T^2}\,\sigma^4 \approx \frac{\sigma^4}{T}$$
for large $T$. Thus, the variance of the autocorrelation is approximately $1/T$, and we would expect the values $\hat\rho_X(h)$ to be within two standard deviations of the mean, or $\pm 2/\sqrt{T}$, when $X_t$ is a white noise process. Hence, we can examine the autocorrelations to determine whether or not the process looks like white noise.
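A short sketch of this diagnostic in R: compare the sample autocorrelations at positive lags against the approximate bounds $\pm 2/\sqrt{T}$. The simulated series and the number of lags are arbitrary choices.

```r
# Compare sample autocorrelations at lags h > 0 with the bounds +/- 2 / sqrt(T).
set.seed(7)
x <- rnorm(500)       # a series to diagnose; white noise here, so most lags should pass
T <- length(x)

rho_hat <- acf(x, lag.max = 20, plot = FALSE)$acf[-1]  # drop the lag-0 value
bound   <- 2 / sqrt(T)

data.frame(lag = 1:20, rho_hat = round(rho_hat, 3), outside = abs(rho_hat) > bound)
```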

Chapter 2

Statistical Models for Time Series

Introduction
Two main goals that traverse much of statistics are fitting models to data and using
such models for prediction or forecasting. In this chapter, we consider classic linear
regression (briefly) to discuss its uses and its shortcomings with respect to time series
data. Then, we discuss more sophisticated models for time series of the ARIMA
family of models.
As mentioned in the previous chapter, we need to transform a time series $X_t$ into a time series $Y_t$ that is stationary. This is because stationarity allows us to do estimation from a single series. The hard part is to determine how to extract such a stationary $Y_t$ from the series $X_t$. Before continuing, we need some definitions.
Definition 2.0.1 (Backshift Operator). The backshift operator B acts on time se-
ries by BXt = Xt−1 . This can be iterated to get B k Xt = Xt−k . This can also be
inverted to get the forward shift operator B −1 Xt = Xt+1 . Thus, B −1 B = BB −1 = I
the identity operator.
Definition 2.0.2 (Difference Operator). The difference operator $\nabla$ acts on time series by $\nabla X_t = X_t - X_{t-1}$. Note that $\nabla X_t = (1 - B)X_t$. This operator can also be iterated to get
$$\nabla^k X_t = (1 - B)^k X_t = \sum_{i=0}^{k} (-1)^i \binom{k}{i} X_{t-i}.$$
For example, the second difference operator is $\nabla^2 X_t = X_t - 2X_{t-1} + X_{t-2}$.
Differencing a series can be used to remove periodic trends. Fun fact: using the gamma function, we can also consider fractional differencing for $\kappa \in \mathbb{R}^+$, which is
$$\nabla^\kappa X_t = \sum_{i \ge 0} (-1)^i \frac{\Gamma(\kappa + 1)}{i!\,\Gamma(\kappa - i + 1)} X_{t-i}.$$
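In R, the difference operator is implemented by diff; a brief sketch, with an arbitrary simulated series:

```r
# diff() implements (1 - B)^k X_t through the "differences" argument,
# and "lag" gives differences like X_t - X_{t-12}.
x <- cumsum(rnorm(100))          # a random walk, so first differences are white noise

d1  <- diff(x)                   # first difference:  X_t - X_{t-1}
d2  <- diff(x, differences = 2)  # second difference: X_t - 2 X_{t-1} + X_{t-2}
d12 <- diff(x, lag = 12)         # lag-12 difference, useful for monthly seasonality

all.equal(d2, diff(diff(x)))     # iterating diff() agrees with differences = 2
```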

2.1 Regression
2.1.1 Linear Regression in Brief
In linear regression, we consider modelling a response variable Xt by some indepen-
dent variables or predictors zt,1 , . . . , zt,p as
Xt = β0 + β1 zt,1 + . . . + βp zt,p + εt
where the βi are unknown fixed parameters and the εt are iid mean zero uncorrelated
errors. Written in vector-matrix form, we have X = Zβ + ε where X ∈ RT and
$Z \in \mathbb{R}^{T \times (p+1)}$. Given this setup, the Gauss–Markov theorem tells us that the least squares estimator, $\hat\beta = (Z^T Z)^{-1} Z^T X$, is the minimum variance linear unbiased estimator for $\beta$.

2.1.2 Linear Regression for Time Series


The big challenge for time series is that the error process εt may not be uncorrelated
white noise, but in fact, a correlated process. Thus, we consider the time series model
Xt = µt + Yt (2.1.1)
where µt is a deterministic process such as µt = β0 + β1 zt,1 + . . . + βp zt,p and where
Yt is a stationary stochastic process. Starting with an observed series Xt , this leaves
us with the goal of identifying the deterministic trend µt and the stochastic piece
Yt .
When diagnosing the fit of the linear regression model to the data, we often
consider the vector of residuals $r = X - Z\hat\beta$. Plotting fitted values against
the residuals, we would expect in linear regression to have residuals that look like
random noise. However, with time series data, the residuals will often not be white
noise but some other stochastic process.
Example 2.1.1 (Prescription Price Data). The TSA package in R contains a dataset
called prescrip, which charts monthly US average prescription drug costs for 1986
to 1992. Looking at the time series in Figure 2.1, we notice that there is a steady
increase over the time span of the data. A quadratic regression model to this data
yielded a fitted model
$$(\text{cost}) = 505000 - 510.5(\text{time}) + 0.13(\text{time})^2$$
where (time) ranges from 1986.6 to 1992.17. The F-statistic of 1612 with degrees
of freedom (2, 65) is very significant (p-value < 2 × 10−16 ), and the R summary for
the fitted model is
Estimate Std. Error t value Pr(> |t|)
(Intercept) 505000 1.344e+05 3.757 0.0004
time -510.5 1.351e+02 -3.778 0.0003
time2 0.129 3.396e-02 3.799 0.0003


Figure 2.1: Plotted time series of prescription drug costs with fitted regression
line (top left). The residuals for the series plotted (bottom left). The estimated
autocorrelation functions for the original series (top right) and the residual series
(bottom right).

By removing this trend and focusing on the residuals, we see the bottom two
plots from Figure 2.1. Here, the estimated autocorrelation does not seem as extreme.
However, there are still some periodic patterns that emerge and that will need to be
dealt with.
To the residual process, we can fit a linear model using sines and cosines, which
gives the fitted model

$$(\text{residuals}) = 0.02 - 0.09\sin(2\pi(\text{time})) - 0.65\cos(2\pi(\text{time})).$$

The F-statistic of 33.36 with degrees of freedom (2, 65) is also very significant (p-
value ≈ 10−10 ), and the R summary for the fitted model is as follows.

Estimate Std. Error t value Pr(> |t|)
(Intercept) 0.02 0.06 0.34 0.732
sin -0.09 0.08 -1.07 0.290
cos -0.65 0.08 -8.07 2.24e-11

In Figure 2.2, we have the residuals for this trigonometric model as well as the
estimated autocorrelation. In the ACF plot, we see a large spike at lag = 0.1.
Hence, considering the first difference operator applied to this process, we have the
bottom two plots in Figure 2.2. Now the estimated autocorrelation is looking much
more like white noise.

Regression with time series on time series


Note also that we can consider regression of one series with respect to another. For
example, given two time series Xt and Yt , we could fit a model

Xt = β̂0 + β̂1 Yt .

We could also consider fitting a model with a fixed lag h

Xt = β̂0 + β̂1 Yt−h .

As a hypothetical example, Xt could be monthly rainfall and Yt−1 could be the


average temperature of the previous month.
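A hedged sketch of such a lagged regression in R; the rainfall and temperature series below are hypothetical simulated stand-ins, not data referenced in the notes.

```r
# Regressing one series on a lagged copy of another: X_t = b0 + b1 Y_{t-1} + error.
# "rainfall" and "temperature" are hypothetical simulated series.
set.seed(3)
n           <- 120
temperature <- 10 + 8 * sin(2 * pi * (1:n) / 12) + rnorm(n)
temp_lag1   <- c(NA, temperature[-n])                 # Y_{t-1}, aligned with time t
rainfall    <- 50 + 2 * temp_lag1 + rnorm(n, sd = 5)  # depends on last month's temperature

fit <- lm(rainfall ~ temp_lag1)                       # lm() drops the initial NA row
summary(fit)$coefficients
```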

2.2 Smoothing
Smoothing methods are a large class of statistical tools that can be applied to noisy
data. While much theoretical work has gone into understanding these methods, we
will just present the main idea briefly.

Moving Average Smoothing


For a time series $X_t$ for $t = 1, \ldots, T$ and coefficients $\theta_{-r}, \ldots, \theta_0, \ldots, \theta_r$ with $\sum_{j=-r}^{r} \theta_j = 1$, we can define a new process
$$M_t = \sum_{j=-r}^{r} \theta_j X_{t-j}$$
for $t = r+1, \ldots, T-r$. The simplest example is to set $\theta_j = (2r+1)^{-1}$, which just computes the sample average over a window of length $2r+1$. This can be performed easily in R with the filter function.
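A minimal sketch of the moving average smoother via stats::filter; the window length (r = 5) and the simulated series are arbitrary choices.

```r
# Moving average smoothing with equal weights over a window of length 2r + 1.
set.seed(10)
x <- cumsum(rnorm(300)) + rnorm(300, sd = 2)   # a noisy series to smooth

r     <- 5
theta <- rep(1 / (2 * r + 1), 2 * r + 1)       # weights summing to one
M     <- stats::filter(x, filter = theta, method = "convolution", sides = 2)

plot.ts(x, col = "grey")
lines(M, lwd = 2)                              # the smoothed series M_t
```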


Figure 2.2: The trig model fitted to the residuals (top left). The residuals for the
trig model (top right). The ACF for the trig residual model (middle). The first
differences of the trig residuals (bottom left). The ACF for the first differences of
the trig residuals (bottom right).

Kernel Smoothing

A kernel function $\kappa_h(x, x_0)$ is a non-negative function that is decreasing in $|x - x_0|$ and has a bandwidth parameter $h$. We also require that $\int_{\mathbb{R}} \kappa_h(x, x_0)\,dx < \infty$ for all $x_0 \in \mathbb{R}$. Examples include
Gaussian: $\kappa_h(x, x_0) = \exp(-|x - x_0|^2 / 2h^2)$
Triangular: $\kappa_h(x, x_0) = (1 - |x - x_0|/h)_+$
Epanechnikov: $\kappa_h(x, x_0) = (1 - |x - x_0|^2/h^2)_+$

where the notation (. . .)+ means take the positive part and set the rest to zero.
In the time series context, we can use a kernel to construct weights to be used
in the previously mentioned moving average smoother. Specifically, for some r ∈ N,
we can define θi for i = −r, . . . , 0, . . . , r as
$$\theta_i = \kappa_h(i, 0)\Big/\sum_{j=-r}^{r} \kappa_h(j, 0).$$

Kernel-based methods also occur in probability density estimation (the kernel den-
sity estimator) and in linear regression (Nadaraya-Watson) as well as others. Note
further that kernel based estimators typically are biased estimators and as the band-
width h increases, the bias increases while the variance decreases. As a result, much
research has gone into bandwidth selection. This can be implemented in R via the
ksmooth function.
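A short sketch of kernel smoothing with ksmooth, using the Gaussian ("normal") kernel from the list above; the bandwidth values and simulated series are illustrative.

```r
# Kernel smoothing with ksmooth(); larger bandwidths give more bias, less variance.
set.seed(11)
t <- 1:300
x <- sin(2 * pi * t / 100) + rnorm(300, sd = 0.4)

sm_small <- ksmooth(t, x, kernel = "normal", bandwidth = 5)
sm_large <- ksmooth(t, x, kernel = "normal", bandwidth = 40)

plot(t, x, col = "grey")
lines(sm_small, lwd = 2)           # wigglier fit
lines(sm_large, lwd = 2, lty = 2)  # smoother fit
```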

Lowess
Lowess is an acronym for locally weighted scatterplot smoothing. This method
combines nearest neighbours and weighted linear regression. Effectively, it takes
a window of data points, say Xt−r , . . . , Xt+r , and applies a low degree polynomial
regression to it. This can be implemented in R by the function lowess. The R
manual page states that lowess “is defined by a complex algorithm”.

Cubic Spline Smoothing


As discussed in the context of linear regression, fitting polynomials to data can be
a powerful tool. However, it is typically unwise to fit one high-degree polynomial to
your data. Instead, spline models split the T data points into k pieces by defining
the partition
1 = t1 < t2 < . . . < tk+1 = T.

These points are referred to as the knots. Then, a separate polynomial—typically
cubic but could have another degree instead—is fit to each subinterval of approxi-
mately T /k data points. That is, we fit a linear model
$$M_t^{(i)} = \beta_{i,0} + \beta_{i,1} t + \beta_{i,2} t^2 + \beta_{i,3} t^3$$

to the ith interval. As a result, this can be written as a least squares estimation
problem where the M̂t are the fitted values that minimize
$$\sum_{t=1}^{T}(X_t - M_t)^2.$$

Here, if we assume $t = 1, \ldots, T$ and the $f_i(t)$ are spline basis functions, then the design matrix for the regression is $F$ with entries $F_{t,i} = f_i(t)$. Thus, we can write $\hat M = F\hat\beta = F(F^T F)^{-1} F^T X$.
If we want to upgrade our spline model into a smoothing spline we fit the same
If we want to upgrade our spline model into a smoothing spline we fit the same
polynomial but with a penalty term to enforce more smoothing. That is, we would
find the M̂t that minimizes
$$\sum_{t=1}^{T}(X_t - M_t)^2 + \lambda\int (M_s'')^2\,ds$$

where λ ≥ 0 is a smoothing parameter. Note that this minimization problem is taken


over all possible twice continuously differentiable functions Mt , which is referred to
as the Sobolev space H 2 . When λ = 0, so smoothing is applied and we simply have
a least squares estimator for our data. Taking λ → ∞ imposes that the second
derivative of Mt must be zero. Hence, we have a straight line fit to our data in the
limit.
Using the notation from before, we can also work out an explicit solution remi-
niscent of ridge regression.1 From the penalty term, we can define the matrix Ω to
have entries
$$\Omega_{i,j} = \int f_i''(s) f_j''(s)\,ds.$$
Then, we have
$$\int (M_s'')^2\,ds = \beta^T \Omega \beta,$$
and consequently, we are finding the coefficients $\hat\beta$ that minimize
$$\|X - F\beta\|_2^2 + \lambda \beta^T \Omega \beta,$$
which is $\hat\beta = (F^T F + \lambda\Omega)^{-1} F^T X$.
1
Recall that the ridge estimator for the model Y = Xβ + ε is β̂ = (X T X + λI)−1 X T Y .
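A minimal sketch of a smoothing spline fit via smooth.spline; note that the function controls the penalty through its own smoothing parameter (spar, or cross-validation) rather than $\lambda$ directly, and the values below are illustrative.

```r
# Smoothing spline fits at different amounts of smoothing.
set.seed(12)
t <- 1:200
x <- 0.02 * t + sin(2 * pi * t / 50) + rnorm(200, sd = 0.5)

fit_cv     <- smooth.spline(t, x)               # smoothing chosen by cross-validation
fit_rough  <- smooth.spline(t, x, spar = 0.3)   # small penalty: wiggly fit
fit_smooth <- smooth.spline(t, x, spar = 1.1)   # large penalty: close to a straight line

plot(t, x, col = "grey")
lines(fit_cv, lwd = 2)
lines(fit_rough, lty = 2)
lines(fit_smooth, lty = 3)
```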

2.3 ARIMA Models for Time Series
In the previous section, we considered regression models like

Xt = f (t) + wt

where f (t) is deterministic and wt is white noise. The goal was to estimate the
deterministic piece by some fˆ(t). But such an approach cannot handle time series
models like the AR(1) process Xt = Xt−1 + wt . In this section, we take a closer look
at fitting such time series models to observed data.

2.3.1 Autoregressive Processes


We reintroduce the autoregressive process formally as

Definition 2.3.1 (Autoregressive Process). The time series $X_t$ is an AR(p) process if $X_t$ has zero mean and if we can write it as
$$X_t = w_t + \sum_{i=1}^{p} \phi_i X_{t-i}$$
where $w_t$ is white noise with variance $\sigma^2$ and $\phi_1, \ldots, \phi_p \in \mathbb{R}$ are constants with $\phi_p \neq 0$. Using the backshift operator $B$, we can write the AR(p) process as $\Phi(B)X_t = w_t$ where
$$\Phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i.$$

We will refer to Φ(B) as the autoregressive operator.

Note that we choose $X_t$ to have zero mean for convenience. If $X_t$ has mean $\mu \neq 0$, then we can rewrite $X_t = \tilde X_t + \mu$ where $\tilde X_t$ is mean zero to get
$$\tilde X_t + \mu = w_t + \sum_{i=1}^{p} \phi_i (\tilde X_{t-i} + \mu)$$
$$\tilde X_t = -\mu\Bigl(1 - \sum_{i=1}^{p} \phi_i\Bigr) + w_t + \sum_{i=1}^{p} \phi_i \tilde X_{t-i}.$$
That is, we can rewrite $X_t$ as a mean zero AR(p) process with an added constant.
A major aspect to consider when analyzing AR processes is causality. We have
seen that the AR(1) process with |φ1 | < 1 has a causal representation as a linear
process. Specifically,
$$X_t = \phi_1 X_{t-1} + w_t = \sum_{i=0}^{\infty} \phi_1^i w_{t-i}.$$

This process was also shown to be stationary. However, when φ1 = 1, we have a
random walk which is not stationary. Writing the random walk as a linear process
gives a series that does not converge.
Similarly, we can consider the setting of an AR(1) process with |φ1 | > 1, which
will grow exponentially fast. However, we can still write this process in the form of
a non-causal linear process.

If $X_{t+1} = \phi_1 X_t + w_{t+1}$, then $X_t = \phi_1^{-1} X_{t+1} - \phi_1^{-1} w_{t+1}$. Continuing in this fashion and noting that $|\phi_1^{-1}| < 1$, we have that
$$X_t = \phi_1^{-1}(\phi_1^{-1} X_{t+2} - \phi_1^{-1} w_{t+2}) - \phi_1^{-1} w_{t+1} = \phi_1^{-2} X_{t+2} - \phi_1^{-2} w_{t+2} - \phi_1^{-1} w_{t+1} = \ldots = -\sum_{i=1}^{\infty} \phi_1^{-i} w_{t+i},$$
which is a linear process with reverse causality.2 Using this linear process representation, we can compute the stationary autocovariance to be
$$K_X(\tau) = \mathbb{E}\Bigl[\sum_{i,j=1}^{\infty} \phi_1^{-(i+j)} w_{t+\tau+i} w_{t+j}\Bigr] = \sigma^2 \sum_{i,j=1}^{\infty} \phi_1^{-(i+j)}\,\mathbb{1}[\tau + i = j] = \sigma^2 \phi_1^{-|\tau|} \sum_{i=1}^{\infty} \phi_1^{-2i} = \frac{\sigma^2 \phi_1^{-|\tau|} \phi_1^{-2}}{1 - \phi_1^{-2}}$$
where the extra $\phi_1^{-2}$ comes from the fact that the above sums begin at 1 instead of at 0.
If we strengthen the white noise process wt to be iid Gaussian with variance σ 2 ,
then we have that the process Xt is Gaussian. Consequently, it is completely char-
acterized by its mean—which is zero—and its autocovariance above. Now consider
the causal AR(1) process
Yt = φ−1 Yt−1 + vt
where vt is iid Gaussian white noise with variance σ 2 φ−2 . This is a mean zero
process with autocovariance
−|τ |
σ 2 φ1 φ−2
1
KY (τ ) = .
1 − φ−2
1

Thus, these two processes are stochastically equivalent—i.e. for any finite collection
of time points t1 , . . . , tk , the vectors (Xt1 , . . . , Xtk ) and (Yt1 , . . . , Ytk ) are equal in
distribution. Thus the non-causal AR(1) process with |φ| > 1 has an equivalent
causal representation.
2
Note: the summation starts from i = 1 instead of i = 0.

We can extend this idea to general AR(p) processes in order to rewrite a recur-
sively defined AR(p) process as stationary linear process. For some AR operator,
we have the general form
Φ(B)Xt = wt .
If the operator Φ(B) is invertible, then we can simply write the linear process form

Xt = Φ−1 (B)wt .

But then we have to determine if the inverse operator exists and what its form is.
Reconsidering the AR(1) process above, we write it as $(1 - \phi_1 B)X_t = w_t$. Considering the complex polynomial $\Phi(z) = 1 - \phi_1 z$ for $z \in \mathbb{C}$, we note that
$$\Phi^{-1}(z) = \frac{1}{1 - \phi_1 z} = 1 + \sum_{j=1}^{\infty} \phi_1^j z^j,$$
which has a radius of convergence of $|\phi_1^{-1}|$. In the case that $|\phi_1| < 1$, we can use this to quickly write
$$X_t = \Bigl(1 + \sum_{j=1}^{\infty} \phi_1^j B^j\Bigr) w_t.$$
For the general AR(p) process, consider the complex polynomial $\Phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p$ and recall that this can be factored over $\mathbb{C}$ into $\Phi(z) = \phi_p(z - r_1)\cdots(z - r_p)$ where $r_1, \ldots, r_p$ are the roots.3 Then, noting that $(-1)^p \phi_p \prod_{j=1}^{p} r_j = 1$, we can write
$$\Phi^{-1}(z) = \frac{1}{(1 - r_1^{-1} z)\cdots(1 - r_p^{-1} z)}.$$
Now, assuming further that all of the roots satisfy $|r_i| > 1$, we can write $\Phi(B)X_t = w_t$ as a causal linear process
$$X_t = \Phi^{-1}(B) w_t = \prod_{j=1}^{p}\Bigl(1 + \sum_{i=1}^{\infty} r_j^{-i} B^i\Bigr) w_t.$$

2.3.2 Moving Average Process


Next, we reintroduce the moving average process formally as
Definition 2.3.2 (Moving Average Process). The time series $X_t$ is an MA(q) process if $X_t$ has zero mean and if we can write it as
$$X_t = w_t + \sum_{j=1}^{q} \theta_j w_{t-j}$$
where $w_t$ is white noise with variance $\sigma^2$ and $\theta_1, \ldots, \theta_q \in \mathbb{R}$ are constants with $\theta_q \neq 0$. Using the backshift operator $B$, we can write the MA(q) process as $X_t = \Theta(B)w_t$ where
$$\Theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^j.$$

3From the fundamental theorem of algebra, https://en.wikipedia.org/wiki/Fundamental_theorem_of_algebra

We will refer to Θ(B) as the moving average operator.

Unlike for autoregressive processes, we already have the MA(q) process written
in the form of a linear process. Hence, it will be stationary for any choice of the θj .
However, similar to how we were able to find a causal AR process that is equivalent
to a non-causal one, there is a uniqueness problem with the MA process that needs
to be addressed.
For simplicity, consider the MA(1) process $X_t = w_t + \theta_1 w_{t-1}$ with $w_t$ white noise with variance $\sigma^2$. This has mean zero and stationary autocovariance
$$K_X(\tau) = \begin{cases} (1 + \theta_1^2)\sigma^2 & \text{for } \tau = 0 \\ \theta_1 \sigma^2 & \text{for } \tau = 1 \\ 0 & \text{for } \tau \ge 2 \end{cases}$$

Alternatively, we note that the process $Y_t = v_t + \theta_1^{-1} v_{t-1}$, with $v_t$ white noise with variance $\theta_1^2 \sigma^2$, is also mean zero with the same autocovariance as $X_t$. Hence, if the white noise processes are Gaussian, then $X_t$ and $Y_t$ are stochastically equivalent. This can certainly cause trouble in a statistical context: if we were to estimate the parameters of the MA(1) model, which parameters would we be estimating?
To choose a specific representation for the MA process, we consider which one is
invertible. That is, which process can be written as a causal AR process for white
noise in terms of Xt ? Starting with the general form, we have Xt = Θ(B)wt . If Θ(B)
is invertible, then we can write wt = Θ−1 (B)Xt . Using the above MA(1) process as
an example, we can express the white noise process as
$$w_t = \sum_{i=0}^{\infty} (-1)^i \theta_1^i X_{t-i} \quad\text{or}\quad v_t = \sum_{i=0}^{\infty} (-1)^i \theta_1^{-i} Y_{t-i}.$$

Thus, as only one of θ1 and θ1−1 can be less than 1 in magnitude, only one of the
above series is convergent in the mean squared sense. That process will be the
invertible one. Note that wt is equal in distribution to θ1−1 vt . Note also that if
θ1 = 1 then we do not have invertibility.

2.3.3 Auto Regressive Moving Average Processes


Now we can combine the AR and MA processes into the autoregressive moving average (ARMA) process, which is defined as follows.
Definition 2.3.3 (Autoregressive Moving Average Process). The time series $X_t$ is an ARMA(p, q) process if $X_t$ has zero mean and if we can write it as
$$X_t = w_t + \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{j=1}^{q} \theta_j w_{t-j}$$
where $w_t$ is white noise with variance $\sigma^2$ and $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q \in \mathbb{R}$ are constants with $\phi_p \neq 0$ and $\theta_q \neq 0$. Using the backshift operator $B$, we can succinctly write this process as $\Phi(B)X_t = \Theta(B)w_t$ where, as before,
$$\Phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i, \quad\text{and}\quad \Theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^j.$$

The first thing to note is that, similar to the introduction of the AR process, we assume in the definition that $X_t$ has zero mean. If instead it has a mean $\mu \neq 0$, we can subtract off the mean to get
$$\Phi(B)(X_t - \mu) = \Theta(B)w_t$$
$$\Phi(B)X_t = \mu\Bigl(1 - \sum_{i=1}^{p} \phi_i\Bigr) + \Theta(B)w_t$$
and consider the mean zero process.


The second thing to note is that the model is not unique as written. That is, for
some invertible operator η(B), we can consider the equivalent process

η(B)Φ(B)Xt = η(B)Θ(B)wt .

For example, we can consider the white noise process Xt = wt and, for some |θ| < 1,
the equivalent process

(1 − θB)Xt = (1 − θB)wt
Xt = θXt−1 − θwt−1 + wt .

This may look like a more complex ARMA process, but is in fact just white noise in
disguise. To address this problem, we only want to consider AR and MA operators
that are relatively prime. That is, for z ∈ C, we want the polynomials

Φ(z) = 1 − φ1 z − . . . − φp z p , and
Θ(z) = 1 + θ1 z + . . . + θq z q

to not have any common roots. In the case that Θ is invertible, we can write the
ARMA process as
$$\frac{\Phi(B)}{\Theta(B)} X_t = w_t.$$

Thus, in this form, we see that common factors in Φ and Θ can be cancelled out.
When we write Xt in this way, it is said to be invertible if we have

$$w_t = \frac{\Phi(B)}{\Theta(B)} X_t = X_t + \sum_{j=1}^{\infty} \pi_j X_{t-j}$$
where $\sum_{j=1}^{\infty} |\pi_j| < \infty$. Hence, returning to the previous discussion of the MA processes, we want to write out the process as a convergent series. Considering the MA polynomial $\Theta(z)$ for $z \in \mathbb{C}$, the ARMA process $X_t$ is invertible if and only if all of the roots of $\Theta(z)$ lie outside of the unit disk $D = \{z : |z| \le 1\}$.
Similarly, we can write the ARMA process in the form of a stationary linear process
$$X_t = \frac{\Theta(B)}{\Phi(B)} w_t.$$
However, this process may not be causal. A causal process, as discussed before, can be written as
$$X_t = w_t + \sum_{j=1}^{\infty} \psi_j w_{t-j}$$
with $\sum_{j=1}^{\infty} |\psi_j| < \infty$. A necessary and sufficient condition for causality in an ARMA
process is to have an autoregressive polynomial Φ(z) such that all of its roots lie
outside of the unit disk D = {z : |z| ≤ 1}.
In summary, an ARMA process Xt is
1. causal if r1 , . . . , rp , the roots of Φ(z) are such that |ri | > 1;

2. invertible if r1 , . . . , rq , the roots of Θ(z) are such that |ri | > 1.


Note that the proof of invertibility is mostly identical to that for causality, except focusing on Θ instead of Φ.

Proof of Causality. Let the roots of Φ(z) be r1 , . . . , rp . First, assume that the roots
are all outside of the unit disk, and without loss of generality, are ordered so that
1 < |r1 | ≤ . . . ≤ |rp |. Then, let |r1 | = 1 + ε for some ε > 0. This implies that Φ−1 (z)
exists and has a power series expansion
$$\Phi^{-1}(z) = \sum_{j=0}^{\infty} a_j z^j$$

with a radius of convergence of |z| < 1 + ε.


If we choose a $\delta$ such that $0 < \delta < \varepsilon$, then the point $z = 1 + \delta$ lies within the radius of convergence, so
$$\Phi^{-1}(1 + \delta) = \sum_{j=0}^{\infty} a_j (1 + \delta)^j < \infty.$$

As this series converges, we know that there exists a constant $c > 0$ such that $|a_j(1+\delta)^j| < c$ for all $j$. Hence, $|a_j| < c(1+\delta)^{-j}$. Thus,
$$\sum_{j=0}^{\infty} |a_j| < c \sum_{j=0}^{\infty} (1+\delta)^{-j} < \infty,$$
and the sequence of $a_j$ is absolutely summable. This implies that for the ARMA process $\Phi(B)X_t = \Theta(B)w_t$, we can write
$$X_t = \Phi^{-1}(B)\Theta(B)w_t = w_t + \sum_{j=1}^{\infty} \psi_j w_{t-j}.$$
Since the $a_j$ are absolutely summable, so are the coefficients $\psi_j$. Thus, we have that $X_t$ is a causal process.
For the reverse direction, assume that $X_t$, defined by $\Phi(B)X_t = \Theta(B)w_t$, is a causal process. That is, we can write
$$X_t = w_t + \sum_{j=1}^{\infty} \psi_j w_{t-j} \quad\text{and}\quad \sum_{j=1}^{\infty} |\psi_j| < \infty.$$

As a result, we can write $X_t = \Psi(B)w_t$ and also $\Phi(B)X_t = \Theta(B)w_t$. Equating the two right hand expressions, we have
$$\Theta(B)w_t = \Phi(B)\Psi(B)w_t.$$
Writing the complex power series $\Phi(z)\Psi(z) = \sum_{j=0}^{\infty} a_j z^j$, we know that this series has a radius of convergence of at least $|z| \le 1$, as $\Psi$ does and $\Phi$ is a finite polynomial. Hence, it makes sense to write
$$\sum_{j=1}^{q} \theta_j w_{t-j} = \sum_{j=1}^{\infty} a_j w_{t-j}.$$
If we consider computing the covariance of each sum with wt−i for i = 1, 2, 3, . . ., we


get a sequence of equations θj = aj for j ≤ q and aj = 0 for j > q. That is, we can
equate matching coefficients in the two series. Thus, Θ(z) = Φ(z)Ψ(z) for |z| ≤ 1.
We know that none of the roots of the polynomial Ψ can lie on or within the
unit disk. Hence, if there exists a z0 ∈ D such that Φ(z0 ) = 0, then Θ(z0 ) = 0 and
the two polynomials have a common root. As they are assumed to have no common
factors, this implies that all roots of Φ lie outside of the unit disk.

2.3.4 ARIMA
Often, we do not have an ARMA process but an ARMA process with some determin-
istic trend. Thus, the process is not stationary, but often can be transformed into a

stationary process via the differencing operator. For example, if Xt is a stationary
process, and we have Yt defined by

Yt = β0 + β1 t + Xt ,

then by applying the first difference operator, we have

∇Yt = β1 + Xt − Xt−1 ,

which is stationary. This motivates the following definition.

Definition 2.3.4 (Autoregressive Integrated Moving Average Process). The time series $X_t$ is an ARIMA(p, d, q) process if the $d$th difference process,
$$\nabla^d X_t = (1 - B)^d X_t,$$
is an ARMA(p, q) process. We can write it in terms of the backshift operator as $\Phi(B)(1 - B)^d X_t = \Theta(B)w_t$ where, as before,
$$\Phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i, \quad\text{and}\quad \Theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^j.$$

As before, we assume in the definition that $\nabla^d X_t$ has zero mean. If instead it has a mean $\mu \neq 0$, we write $\Phi(B)(1 - B)^d X_t = \mu\bigl(1 - \sum_{i=1}^{p}\phi_i\bigr) + \Theta(B)w_t$. For example, if we have $X_t = \beta_0 + \beta_1 t + \phi X_{t-1} + w_t$ for $\beta_0, \beta_1 \in \mathbb{R}$ and $|\phi| < 1$, then
$$\nabla X_t = X_t - X_{t-1} = [\beta_0 + \beta_1 t + \phi X_{t-1} + w_t] - [\beta_0 + \beta_1(t-1) + \phi X_{t-2} + w_{t-1}] = \beta_1 + \phi\nabla X_{t-1} + w_t - w_{t-1}.$$
This is an ARMA(1, 1) process with non-zero mean that can be written as
$$(1 - \phi B)\nabla X_t = \beta_1 + (1 - B)w_t.$$
Note that the mean is not $\beta_1$ but in fact $\beta_1/(1 - \phi)$.
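A small R sketch of the example above: simulate $X_t = \beta_0 + \beta_1 t + \phi X_{t-1} + w_t$ and check that the first difference behaves like a stationary ARMA(1, 1) with mean $\beta_1/(1-\phi)$. The parameter values are arbitrary.

```r
# X_t = beta0 + beta1 * t + phi * X_{t-1} + w_t; its first difference is ARMA(1, 1).
set.seed(13)
n <- 500; beta0 <- 1; beta1 <- 0.5; phi <- 0.7   # arbitrary parameter values
w <- rnorm(n)
X <- numeric(n)
X[1] <- beta0 + beta1 + w[1]
for (t in 2:n) X[t] <- beta0 + beta1 * t + phi * X[t - 1] + w[t]

dX <- diff(X)      # (1 - phi B) dX_t = beta1 + (1 - B) w_t
mean(dX)           # roughly beta1 / (1 - phi)
acf(dX)            # stationary-looking autocorrelations after differencing
```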

2.4 Testing for Stationarity and Autocorrelation


When presented with time series data, we often want to know if the series is sta-
tionary as it stands or after applying some difference operators. Similarly, we may
be interested in knowing if there are any significant autocorrelations at various lags.
These questions motivate a large collection of statistical tests.

2.4.1 Box-Pierce and Ljung-Box Tests
The R function Box.test() in the stats package performs both the Box-Pierce
and Ljung-Box tests.
For a stationary time series $X_t$, we denote the estimated autocorrelation at lag $h$ by $\hat\rho_X(h)$. If we assume that the true autocorrelations are zero—i.e. $\rho_X(h) = 0$ for all $h \neq 0$—then we have white noise. Instead of visually looking at a plot of the autocorrelation, we can use the Box-Pierce test to test for non-zero correlations by combining the estimated autocorrelations at lags $1, \ldots, h$ for some user chosen value $h$. We aim to test the hypotheses
$$H_0: \rho_X(1) = \ldots = \rho_X(h) = 0, \qquad H_1: \exists j \text{ s.t. } \rho_X(j) \neq 0.$$
Under $H_0$, we have that $\sqrt{n}\,\hat\rho_X(j)$ is approximately $N(0, 1)$ for $j = 1, \ldots, h$. The test statistic for the Box-Pierce test is
$$Q_{BP} = n\sum_{j=1}^{h}\hat\rho_X(j)^2,$$
which will be approximately $\chi^2(h)$ under $H_0$. Recall, however, that it becomes harder to estimate $\hat\rho_X$ at large lags, especially if the data size is small. Hence, $h$ should not be set too large in practice.
Another version of this test is the Ljung-Box test, which has a similar form to the Box-Pierce test and the same approximate $\chi^2(h)$ distribution. The alternative form is supposed to give a distribution under $H_0$ that is closer to the desired $\chi^2(h)$. The test statistic is
$$Q_{LB} = n(n+2)\sum_{j=1}^{h}\frac{\hat\rho_X(j)^2}{n - j}.$$

In the function Box.test(), there is a fitdf argument. The point of this


argument is to reduce the chi-squared degrees of freedom in the case that you are
fitting a model first. In particular, if you first fit an ARMA(p,q) model to Xt
and then apply Box.test to the residual process, you should reduce the number
of degrees of freedom by p + q. In this case, we require h > p + q to be able to
perform these tests. Ljung and Box study QLB in a 1978 research article4 and look
at the first and second moments of their statistic for how closely it coincides with
the χ2 (h) distribution.
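Both tests are available through Box.test in R; a minimal sketch with a simulated AR(1) series (the lag and model choices are arbitrary):

```r
# Box-Pierce and Ljung-Box tests; lag = 10 is an arbitrary choice of h.
set.seed(14)
x <- arima.sim(model = list(ar = 0.5), n = 300)

Box.test(x, lag = 10, type = "Box-Pierce")   # should reject: x is autocorrelated
Box.test(x, lag = 10, type = "Ljung-Box")

# On residuals of a fitted ARMA(p, q), reduce the degrees of freedom by p + q.
fit <- arima(x, order = c(1, 0, 0))
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
```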

2.4.2 Durbin–Watson Test


As mentioned above, the Box-Pierce and Ljung-Box tests can be applied to the
residuals of an ARMA model with the goal of determining if there are non-zero
4
Ljung, Greta M., and George EP Box. “On a measure of lack of fit in time series models.”
Biometrika 65, no. 2 (1978): 297-303.

autocorrelations among the residuals. Similarly, the Durbin-Watson test tests for
autocorrelations of order 1 among the residuals of a linear model.
Considering the linear model
$$X_t = \beta_0 + \beta_1 t + \ldots + \beta_p t^p + r_t$$
with $t = 1, \ldots, T$, we can compute the least squares estimator $\hat\beta$ and then compute the residuals $\hat r_t = X_t - \langle\hat\beta, (1, t, \ldots, t^p)\rangle$. The Durbin–Watson test assumes the following model for the residuals:
$$\hat r_t = \rho \hat r_{t-1} + w_t$$
where $w_t$ is white noise. Then, it tests the hypotheses $H_0: \rho = 0$, $H_1: \rho \neq 0$. It does this by computing the test statistic
$$Q_{DW} = \frac{\sum_{t=2}^{T} (\hat r_t - \hat r_{t-1})^2}{\sum_{t=1}^{T} \hat r_t^2}.$$

If this test statistic is close to zero, it implies that r̂t and r̂t−1 are close in value
indicating a strong positive autocorrelation of order 1. In contrast, if the test statis-
tic is large (close to the max of 4), then it indicates that there is a strong negative
autocorrelation of order 1. Otherwise, a test statistic near 2 indicates no autocorre-
lation of order 1. In the R function, dwtest() in the lmtest package, p-values are
computed for this statistic. The documentation claims that “under the assumption of normally distributed [errors], the null distribution of the Durbin-Watson statistic is the distribution of a linear combination of chi-squared variables.” Furthermore, for
large sample sizes, this code apparently switches to a normal approximation for the
p-value computation.
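A short sketch of the test via lmtest::dwtest, assuming the lmtest package is installed; the simulated trend-plus-AR(1)-error data are illustrative.

```r
# Durbin-Watson test on residuals of a linear model with AR(1) errors.
library(lmtest)   # assumes lmtest is installed

set.seed(15)
t <- 1:200
e <- arima.sim(model = list(ar = 0.6), n = 200)   # positively autocorrelated errors
x <- 2 + 0.1 * t + e

fit <- lm(x ~ t)
dwtest(fit)   # a statistic well below 2 suggests positive order-1 autocorrelation
```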

2.4.3 Breusch–Godfrey test


The Breusch–Godfrey test is similar to the Durbin-Watson test in the sense that it applies to the residuals of a linear model, but it can test for higher order AR(p) autocorrelations rather than just AR(1). The $R^2$ value is computed for a regression model, being the ratio of the regression sum of squares to the total sum of squares. Then $nR^2$ is compared to a $\chi^2(p)$ distribution where $p$ is the order of the AR process to be tested. Alternatively, the documentation for the
R implementation, bgtest in the lmtest package, says that the user can use the
Chi-Squared distribution or can switch to the F distribution if desired.
Note that there is also a Breusch–Pagan test, which tests for heteroskedasticity
in the residuals of a linear model. An R implementation can be found in bptest in
the lmtest package.

2.4.4 Augmented Dickey-Fuller Test
Switching away from testing for non-zero autocorrelations, we now consider testing
for stationarity or non-stationarity of a time series. These tests are often referred
to as unit root tests, because—recalling the previous sections—if the autoregressive
operator Φ has a unit root, then the process is not stationary. Hence, these tests
aim to determine whether or not a unit root exists based on some observed data.
The Dickey-Fuller test performs such a unit root test for AR(1) models. In this case, the null hypothesis is that $\Phi(z)$ has a root with $|r| = 1$, and the alternative is that $|r| > 1$. If
$$X_t = \phi X_{t-1} + w_t$$
then the first difference can be written as
$$\nabla X_t = (\phi - 1)X_{t-1} + w_t.$$
Denoting $\phi' = \phi - 1$, we want to test the null $H_0: \phi' = 0$ against the alternative $H_1: \phi' \neq 0$. This is done by estimating $\hat\phi'$ and the standard error of $\hat\phi'$. This null hypothesis is equivalent to testing $H_0: \phi = 1$, or that the polynomial $\Phi(z) = 1 - \phi z$ has a unit root.
The Augmented Dickey-Fuller test extends this idea to AR(p) models. If we have
$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + w_t,$$
then the first difference can be written as
$$\nabla X_t = \phi_1' X_{t-1} + \sum_{i=1}^{p-1} \phi_{i+1}' \nabla X_{t-i} + w_t$$
where the coefficients are $\phi_1' = \sum_{j=1}^{p} \phi_j - 1$ and $\phi_i' = -\sum_{j=i}^{p} \phi_j$ for $i > 1$. Thus, if 1 is a root—i.e. if $\Phi(1) = 0$—then $\phi_1' = 0$. Hence, we can perform
a similar test to the Dickey-Fuller test above.
In R, the Augmented Dickey-Fuller test is implemented in the function adf.test()
in the tseries package. In this version of the test, a constant and a linear term are
first assumed and the residual process is run through the above test. That is, we
consider the model
$$X_t = \beta_0 + \beta_1 t + \sum_{i=1}^{p} \phi_i X_{t-i} + w_t$$

with a deterministic linear trend. The R function also requires the user to choose
how many lags to use when estimating the parameters φ̂.
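A minimal sketch using tseries::adf.test, assuming the tseries package is installed; the simulated series and the number of lags k are arbitrary choices.

```r
# Unit root testing with adf.test(); k is the number of lags used in the regression.
library(tseries)   # assumes tseries is installed

set.seed(16)
stationary_x <- arima.sim(model = list(ar = 0.5), n = 300)
random_walk  <- cumsum(rnorm(300))

adf.test(stationary_x, k = 4)   # small p-value: reject the unit-root null
adf.test(random_walk, k = 4)    # large p-value: cannot reject the unit root
```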

2.4.5 Phillips–Perron test
An alternative to the Augmented Dickey-Fuller test is the Phillips-Perron test. The
set up is the same, but the test is designed to be more robust to deviations in
the assumptions. In R, this test can be performed by the function pp.test() in
the tseries package. The test statistic is more complicated, and the p-value is
computed via a table of values and linear interpolation. The documentation also
points out that the Newey-West estimator is used for the variance, which is a robust
estimator of the covariance in linear regression when the classic assumptions of homoscedastic, uncorrelated errors are violated.

2.5 Autocorrelation and Partial Autocorrelation


Given some time series data, we often wish to diagnose the type of time series process
that produced the data. Two tools we can use are the estimated autocorrelation
function and the estimated partial autocorrelation function.
First, we recall the notion of partial correlation outside of the time series con-
text. Given two random variables X and Y , we may compute the correlation
corr(X, Y ), which measures the linearity between the two variables. That is, the
closer to 1 the magnitude of the correlation is, the closer X and Y are to being
linearly dependent. Often in statistics, we are reminded that correlation does not
imply causation. In fact, given two correlated random variables, there may be a
third random variable Z influencing both of them. Hence, consider iid observations
(X1 , Y1 , Z1 ), . . . , (Xn , Yn , Zn ). We can define the partial correlation between X and
Y given Z to be the correlation of the residuals of X and Y each regressed on Z.
That is, let
X̂i = α̂0 + α̂1 Zi and Ŷi = β̂0 + β̂1 Zi
be the ith fitted values for X and Y , respectively. Then, the partial correlation is
corr(X − X̂, Y − Ŷ )
where X − X̂ and Y − Ŷ are the residuals for X and Y , respectively.
The idea of partial correlation is to remove the dependency of the confounding
variable Z. Hence, if corr(X − X̂, Y − Ŷ ) = 0, then X and Y are said to be
conditionally uncorrelated—or conditionally independent in the case that X, Y ,
and Z are jointly normal. A hypothetical example: if X is the price of a subject's
car and Y is the price of a subject's house, we may expect X and Y to be positively
correlated. However, conditioning on Z, the subject's income, the house value and
car value may be conditionally uncorrelated. Note that in the case that the random
variables are jointly normal, we have that
corr(X − X̂, Y − Ŷ) = corr(X, Y | Z) = E[(X − EX)(Y − EY) | Z] / sqrt( E[(X − EX)² | Z] E[(Y − EY)² | Z] ).
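A small simulated sketch of this idea follows; the variables and coefficients are made up purely for illustration.

# Partial correlation as the correlation of residuals after regressing out Z
set.seed(2)
n <- 500
Z <- rnorm(n)                  # confounder, e.g. income
X <- 1 + 2.0 * Z + rnorm(n)    # e.g. car value, driven by Z
Y <- 3 + 1.5 * Z + rnorm(n)    # e.g. house value, driven by Z
cor(X, Y)                                  # strongly positive
cor(resid(lm(X ~ Z)), resid(lm(Y ~ Z)))    # near zero: conditionally uncorrelated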
In the time series context we define

Definition 2.5.1 (Partial Autocorrelation). Let Xt be a stationary process, then
the partial autocorrelation at lag h is

ϕX(h) = ρX(1) if h = 1,   and   ϕX(h) = corr(Xt+h − X̂t+h, Xt − X̂t) if h > 1,

where X̂t+h and X̂t are the result of regressing each respective term on all of the
intermediate terms. That is,
X̂t+h = β1 Xt+h−1 + . . . + βh−1 Xt+1
X̂t = β1 Xt+1 + . . . + βh−1 Xt+h−1
where the intercept term is excluded as Xt is assumed to be mean zero. Note that
due to stationarity of Xt and the symmetry of the autocorrelation function, the β’s
above are the same coefficients. Lastly, if Xt is a Gaussian process, we can write
ϕX (h) = corr(Xt+h , Xt |Xt+h−1 , . . . , Xt+1 )
for h > 1.

2.5.1 ACF for AR(p)


To begin, we consider the causal mean zero AR(1) process Xt = φXt−1 + wt where
|φ| < 1. Then, we can multiply by Xt−h and note that
E (Xt Xt−h ) = E (φXt−1 Xt−h ) + E (wt Xt−h )
KX (h) = φKX (h − 1) + 0.
Therefore, the autocovariance is defined by a first order difference equation
f (h) = φf (h − 1).
As the characteristic polynomial is 1 − φz with root z1 = φ^{−1}, we can solve this difference
equation to get the solution f(h) = c z1^{−h} for some constant c corresponding
to the initial condition f(0) = c. This can be checked by plugging the solution
into the equation to get

c (φ^{−1})^{−h} = cφ (φ^{−1})^{−h+1}.

Revisiting the autocorrelation, we have KX(0) = Var(Xt), so ρX(h) follows the
same difference equation with c = 1. Hence, if we have an AR(1) process as above,
ρX(h) = z1^{−h} = φ^h for h ≥ 0.
For the causal AR(2) process, we have Xt = φ1 Xt−1 + φ2 Xt−2 + wt . Proceeding
as before, we note that
E (Xt Xt−h ) = E (φ1 Xt−1 Xt−h ) + E (φ2 Xt−2 Xt−h ) + E (wt Xt−h )
KX (h) = φ1 KX (h − 1) + φ2 KX (h − 2).

The roots of the characteristic polynomial will tell us about the behaviour of the
process Xt . For 1 − φ1 z − φ2 z 2 , we denote the two roots as z1 and z2 . Recall that
|zi| > 1 as we assume Xt is causal. There are three possible settings to consider (for more details, see a textbook on difference equations):
1. If z1 ≠ z2 and the roots are real, then we have the solution to the second order
difference equation ρ(h) = c1 z1^{−h} + c2 z2^{−h}, where c1 and c2 are two constants such that c1 + c2 = 1.

2. If z1 = z2 (and hence necessarily real), then we have ρ(h) = z1^{−h} (c1 + c2 h).

3. If z1 = z̄2 are complex conjugate roots, then, writing c1 = |c1| e^{−ib} and z1^{−h} = |z1|^{−h} e^{−iθh},

   ρ(h) = c1 z1^{−h} + c̄1 z̄1^{−h} = |c1||z1|^{−h} ( e^{−ib} e^{−iθh} + e^{ib} e^{iθh} ) = 2|c1||z1|^{−h} cos(θh + b).

In all three cases, we have the autocorrelation ρ(h) decaying exponentially to zero.
The rate of decay depends on the magnitude of the roots. Furthermore, if the roots
are complex, then there is periodic behaviour in the process.
This can be extended to the AR(p) process where we have a pth order difference
equation for ρ. The resulting solution will look like
ρ(h) = z1^{−h} f1(h) + . . . + zr^{−h} fr(h)

where z1, . . . , zr are the unique roots with multiplicities m1, . . . , mr such that Σ_{i=1}^r mi = p, and where fi(h) is a polynomial in h of degree mi − 1.

2.5.2 ACF for MA(q)


The exposition of the autocorrelation for the general MA(q) process is much simpler
relative to the previous discussion of the AR(p) process. Let Xt = Σ_{j=0}^q θj wt−j with
θ0 = 1. Then,

KX(h) = σ² Σ_{j=0}^{q−h} θj θj+h

for h = 0, . . . , q and KX(h) = 0 otherwise. Noting that the variance is KX(0) =
σ²(1 + θ1² + . . . + θq²), we have an autocorrelation of

ρX(h) = ( Σ_{j=0}^{q−h} θj θj+h ) / ( Σ_{j=0}^q θj² )   for h ≤ q.

Thus, unlike the AR process, the autocorrelation for the MA(q) process is zero for
h > q. Thus, it can be used to identify the order of the process.

2.5.3 PACF for AR(p)
To introduce why the partial correlation is of interest, we first consider the causal
AR(1) process Xt = φXt−1 + wt . From before, we saw that

corr(Xt , Xt−2 ) = corr(φXt−1 + wt , Xt−2 )


= corr(φ2 Xt−2 + φwt−1 + wt , Xt−2 )
= corr(φ2 Xt−2 , Xt−2 ) + corr(φwt−1 , Xt−2 ) + corr(wt , Xt−2 )
= φ2 + 0 + 0.

In contrast, if we consider

corr(Xt − φXt−1 , Xt−2 − φXt−1 ) = corr(wt , Xt−2 − φXt−1 ) = 0.

Hence when taking Xt−1 into account, the autocorrelation between Xt and Xt−2 is
zero.
To properly consider the partial autocorrelation, we need to compute the least
squares estimator X̂t for Xt based on previous time points. For example, to compute
ϕ(2), we take X̂t = β̂Xt−1 where β̂ is chosen to minimize

E(Xt − X̂t)² = E(Xt − βXt−1)²
            = E(Xt²) − 2βE(Xt Xt−1) + β²E(Xt−1²)
            = (1 + β²)KX(0) − 2βKX(1).
 

By taking the derivative with respect to β, we can find the critical point β̂ =
KX(1)/KX(0). Similarly, for X̂t−2 = β̂Xt−1, we have

E(Xt−2 − βXt−1)² = E(Xt−2²) − 2βE(Xt−2 Xt−1) + β²E(Xt−1²)
                 = (1 + β²)KX(0) − 2βKX(1)

as before. In the case of the AR(1) model, we have β̂ = φ. Thus, we have from
before that ϕ(1) = φ and ϕ(2) = 0 and, in fact, ϕ(h) = 0 for h ≥ 2.
For the general AR(p) process, Xt = wt + Σ_{i=1}^p φi Xt−i, we have a similar set
up. For lags h > p, if we assume for now that the least squares estimator is

X̂t = φ1 Xt−1 + . . . + φp Xt−p,

then we get a similar calculation as above. Namely that

corr(Xt − X̂t , Xt−h − X̂t−h ) = corr(wt , Xt−h − X̂t−h ) = 0.

In the case that the lag is less than or equal to p, we need to determine how to
estimate the coefficients βi before computing the PACF.

2.5.4 PACF for MA(1)
For the invertible MA(1) model Xt = wt + θwt−1, that is with |θ| < 1, we can
write it as a convergent infinite series

Xt = wt − Σ_{i=1}^∞ (−θ)^i Xt−i

in terms of the Xt−i. Then, applying similar tricks as above gives a least squares
estimator X̂t = β̂Xt−1 with β̂ = KX(1)/KX(0). In the case of the MA(1)
process, we have β̂ = θ/(1 + θ²). Hence, taking σ² = 1 for simplicity,
 
cov( Xt − θXt−1/(1 + θ²), Xt−2 − θXt−1/(1 + θ²) )
  = KX(2) − (2θ/(1 + θ²)) KX(1) + (θ/(1 + θ²))² KX(0) = −θ²/(1 + θ²).

Also, the variance is

Var( Xt − θXt−1/(1 + θ²) ) = KX(0) ( 1 + (θ/(1 + θ²))² ) − (2θ/(1 + θ²)) KX(1)
  = 1 + θ² + θ²/(1 + θ²) − 2θ²/(1 + θ²) = (1 + θ² + θ⁴)/(1 + θ²).

Thus, the partial autocorrelation at lag 2 is ϕ(2) = −θ²/(1 + θ² + θ⁴), while ϕ(1) = ρX(1) = θ/(1 + θ²). This can be
extended to larger lags to show that the partial autocorrelation for the
MA(1) process decreases in magnitude but does not vanish as the lag increases.
Hence, we have the following table:

AR(p) MA(q)
ACF decreases geometrically = zero for lags > q
PACF = zero for lags > p decreases geometrically

This means that we can use the ACF and PACF to try to understand the behaviour
of a time series process.
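As a quick sketch of how this table is used in practice (simulated examples, not course data):

# Diagnosing AR versus MA behaviour with the sample ACF and PACF
set.seed(3)
x_ar <- arima.sim(model = list(ar = 0.7), n = 200)   # AR(1)
x_ma <- arima.sim(model = list(ma = 0.7), n = 200)   # MA(1)
acf(x_ar); pacf(x_ar)   # ACF decays geometrically; PACF roughly cuts off after lag 1
acf(x_ma); pacf(x_ma)   # ACF roughly cuts off after lag 1; PACF decays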

Chapter 3

Estimation and Forecasting

Introduction
Thus far, we have considered many types of time series models, but have performed
little in the way of actual statistics. In this chapter, we consider two main goals of
time series models: estimating parameters and forecasting/predicting. For the first
topic, we will consider different methods for estimating parameters as well as model
selection methods to determine the best fit to the data. For the second part, we
consider the task of prediction in time series.
For a time series observed at X1 , . . . , XT , we may want to fit a causal invertible
ARMA(p,q) process,
Xt = wt + Σ_{i=1}^p φi Xt−i + Σ_{j=1}^q θj wt−j,

to the data by estimating the coefficients φ̂i and θ̂j .

3.1 The AR process


3.1.1 Estimation for AR processes
There are two approaches to estimating the parameters of the AR(p) process that we
will consider: Using (1) the Yule-Walker equations or (2) the maximum likelihood
estimator. More methods are available in the R function ar().

The Yule-Walker Estimator


We first consider the causal AR(p) process and its autocovariance function, assuming
that the mean is EXt = 0. In practice, we can estimate the mean by
X̄ = T^{−1} Σ_{t=1}^T Xt and then consider the centred time series Xt − X̄. For estimating
the variance σ² of the white noise process, we write

Xt = wt + φ1 Xt−1 + . . . + φp Xt−p
KX (0) = cov (Xt , Xt ) = cov (Xt , wt + φ1 Xt−1 + . . . + φp Xt−p )
= σ 2 + φ1 KX (1) + . . . + φp KX (p)
σ 2 = KX (0) − φ1 KX (1) − . . . − φp KX (p).

Hence, we can use the estimates for the autocovariance to estimate σ 2 .

σ̂² = K̂X(0) − φ̂1 K̂X(1) − . . . − φ̂p K̂X(p).

However, we need estimators for the parameters φi . To estimate these φi , we can


consider more equations based on the autocovariance at lags 1 through p.

KX (1) = cov (Xt−1 , Xt ) = φ1 KX (0) + φ2 KX (1) + . . . + φp KX (p − 1)


KX (2) = cov (Xt−2 , Xt ) = φ1 KX (1) + φ2 KX (0) + . . . + φp KX (p − 2)
..
.
KX (p) = cov (Xt−p , Xt ) = φ1 KX (p − 1) + φ2 KX (p − 2) + . . . + φp KX (0)

Here, we have p linear equations with p unknowns, which can be written as K = Γφ
where

K = ( KX(1), . . . , KX(p) )ᵀ,   φ = ( φ1, . . . , φp )ᵀ,

and Γ is the p × p matrix with (i, j)th entry KX(i − j), i.e.

Γ = [ KX(0)      KX(1)      · · ·  KX(p − 1)
      KX(1)      KX(0)      · · ·  KX(p − 2)
      ...        ...        . . .  ...
      KX(p − 1)  KX(p − 2)  · · ·  KX(0)   ].

We can also write σ 2 = KX (0) − φT K. This system of p + 1 equations is known


as the Yule-Walker Equations. We can solve for the coefficients φ = Γ−1 K as the
matrix Γ is positive definite. As a result, σ 2 = KX (0) − K T Γ−1 K, because Γ is
symmetric.
We can use the estimator for the autocovariance from Chapter 1 to get a data
driven estimate for the parameters for this time series:

φ̂ = Γ̂^{−1} K̂,   and   σ̂² = K̂X(0) − K̂ᵀ Γ̂^{−1} K̂.

These estimators can be shown to converge in distribution to a multivariate normal


distribution.
Theorem 3.1.1 (Asymptotic Normality for Yule-Walker). Given φ̂ and σ̂² as above
for a causal AR(p) process, we have, as T → ∞,

√T (φ̂ − φ) →d N(0, σ² Γ^{−1})   and   σ̂² →P σ².

Corollary 3.1.1 (Asymptotic Normality for PACF). For the causal AR(p) process,
as T → ∞,
√T ϕ̂(h) →d N(0, 1)

for lags h > p.
In the standard stats package in R, the function ar() fits an autoregressive
model to time series data, which can implement many ways to estimate the param-
eters, but defaults to the Yule-Walker equations. To demonstrate it, we can use the
arima.sim() function to simulate T = 100 data points from the AR(1) process
Xt = 0.7Xt−1 + wt .
Using the Yule-Walker equations, we get φ̂1 = 0.73. Note that the ar() function
fits models for AR(1) up to AR(20) and then chooses the best with respect to AIC.
In the case of data from the AR(3) process
Xt = 0.7Xt−1 − 0.3Xt−3 + wt
we get the fitted model
Xt = 0.752Xt−1 − 0.002Xt−2 − 0.285Xt−3 .
Plots of these two processes with the estimated ACF and PACF are displayed in
Figure 3.1.
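A sketch of the kind of simulation just described is below; the seed is arbitrary, so the fitted coefficients will differ slightly from the numbers quoted above.

# Simulating AR processes and fitting them via the Yule-Walker equations with ar()
set.seed(479)
x1 <- arima.sim(model = list(ar = 0.7), n = 100)
ar(x1, method = "yule-walker")                   # order chosen by AIC; phi-hat near 0.7

x3 <- arima.sim(model = list(ar = c(0.7, 0, -0.3)), n = 100)
ar(x3, order.max = 20, method = "yule-walker")   # should typically select an order near 3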

Maximum Likelihood Estimation


Given that wt is Gaussian white noise, we can write down the likelihood and solve
for the maximum likelihood estimator. There is actually more than one MLE approach
in time series. Also, this can be made more tractable by using conditional probability.
Beginning with the causal AR(1) process Xt = µ + φ(Xt−1 − µ) + wt with some
mean µ ∈ R, we use the recursive definition of the process to write the likelihood as
L(µ, φ, σ 2 ; X1 , . . . , XT ) = f (X1 , . . . , XT ; µ, φ, σ 2 )
= f (X1 )f (X2 |X1 ) . . . f (XT |XT −1 ).
Assuming the white noise is Gaussian, the term X1 ∼ N(µ, σ²/(1 − φ²)) as we have
solved before.1 Meanwhile, recalling that normality is preserved under conditioning,
the conditional distribution of Xt | Xt−1 is N(µ + φ(Xt−1 − µ), σ²). Hence, putting
it all together gives

L(µ, φ, σ²) = (1 − φ²)^{1/2} / (2πσ²)^{T/2}
              × exp( −(1/(2σ²)) [ (X1 − µ)²(1 − φ²) + Σ_{t=2}^T ((Xt − µ) − φ(Xt−1 − µ))² ] )
1
Note that here, we are considering X1 as an infinite causal linear process.

[Figure 3.1 contains six panels: the simulated AR(1) and AR(3) series plotted against time, their sample ACFs, and their sample PACFs.]

Figure 3.1: Simulated AR(1) and AR(3) processes with estimated ACF and PACF.

Writing the unconditional sum of squares in the exponent as

Su(µ, φ) = (X1 − µ)²(1 − φ²) + Σ_{t=2}^T ((Xt − µ) − φ(Xt−1 − µ))²

we can take derivatives of the log likelihood to solve for the MLEs. For the variance,
∂ log(L)/∂σ² = ∂/∂σ² [ (1/2) log(1 − φ²) − (T/2) log(2π) − (T/2) log(σ²) − Su(µ, φ)/(2σ²) ]
             = −T/(2σ²) + Su(µ, φ)/(2σ⁴),

which gives σ̂² = Su(µ, φ)/T. However, solving for the MLEs µ̂ and φ̂ is not as
straightforward, because we would have to solve the nonlinear system of equations
0 = −2(1 − φ²)(X1 − µ) − 2(1 − φ) Σ_{t=2}^T (Xt − φXt−1 − µ(1 − φ))

0 = −φ/(1 − φ²) + (1/(2σ²)) { 2φ(X1 − µ)² + 2 Σ_{t=2}^T (Xt − φXt−1 − µ(1 − φ))(Xt−1 − µ) }.

This headache arises due to the starting point X1 . If we condition the likelihood on
X1 , we can simplify the problem.2
Conditioning on X1, we have

L(µ, φ, σ² | X1) = (2πσ²)^{−(T−1)/2} exp( −(1/(2σ²)) Σ_{t=2}^T ((Xt − µ) − φ(Xt−1 − µ))² ).

Thus, the MLE for the variance is σ̂² = Sc(µ, φ)/(T − 1) where, similarly to above,
Sc is the conditional sum of squares in the exponent. We can rewrite Sc as

Sc(µ, φ) = Σ_{t=2}^T (Xt − (α + φXt−1))²

where α = µ(1 − φ), which coincides with simple linear regression. Hence, for the
design matrix X ∈ R(T −1)×2 with columns 1 and Xt for t = 1, . . . , T − 1, the least
squares estimator is
 
(α̂, φ̂)ᵀ = (XᵀX)^{−1} Xᵀ (X2, . . . , XT)ᵀ,
2
What we just did above is the unconditional likelihood. What follows is the conditional likelihood
as we condition on X1 to remove the nonlinearity.

which after some computation can be reduced to

φ̂ = Σ_{t=2}^T (Xt − X̄(2))(Xt−1 − X̄(1)) / Σ_{t=2}^T (Xt−1 − X̄(1))²

where X̄(1) = (T − 1)^{−1} Σ_{t=1}^{T−1} Xt and X̄(2) = (T − 1)^{−1} Σ_{t=2}^T Xt.3 Given φ̂, we can
determine α̂ = X̄(2) − φ̂X̄(1) and finally

µ̂ = ( X̄(2) − φ̂X̄(1) ) / ( 1 − φ̂ ).
To compare these estimators to the Yule-Walker estimator, we note that for the
AR(1) process

φ̂^YW = K̂X(1)/K̂X(0) = Σ_{t=2}^T (Xt − X̄)(Xt−1 − X̄) / Σ_{t=1}^T (Xt − X̄)²,

which is very similar to the MLE except that the MLE uses X̄(1) and X̄(2), which
are adjusted for the end points of the time series. In the limit, the two are
equivalent.
Similarly, for the mean, the Yule-Walker approach chooses µ̂^YW = X̄. In contrast,
for the MLE, we have

µ̂ = ( X̄(2) − φ̂X̄(1) ) / ( 1 − φ̂ ) ≈ ( X̄ − φ̂X̄ ) / ( 1 − φ̂ ) = X̄.

For AR(p) processes, the MLE estimators can still be computed in a similar
manner, but the equations are more complex. Still, conditioning on the starting
values X1, . . . , Xp allows for a reduction to linear regression.

Proof of Asymptotic Normality for Yule-Walker


3.1.2 Forecasting for AR processes
Another significant problem in time series analysis is that of forecasting. That is,
given data X1 , . . . , XT , we want to compute the best predictions for subsequent
time points XT +1 , XT +2 , . . . , XT +m . We won’t initially assume that the process is
autoregressive, but we will assume that Xt is stationary.
First, we can consider linear predictors, which are those of the form

X̂T+m = α0 + Σ_{t=1}^T αt Xt
3 Note that this is just Σ_{i=1}^n xi yi / Σ_{i=1}^n xi² from linear regression.

where we want to make a good choice of parameters αt. To do that, we minimize
the squared error as usual:

arg min_{α0, α1, . . . , αT} E[ ( XT+m − α0 − Σ_{t=1}^T αt Xt )² ].

Taking the derivative with respect to each αt gives a system of equations

0 = E( XT+m − X̂T+m )
0 = E( (XT+m − X̂T+m) X1 )
...
0 = E( (XT+m − X̂T+m) XT )

Let the mean of the process be µ. By the first equation,

µ = E(XT+m) = E(X̂T+m) = E( α0 + Σ_{t=1}^T αt Xt ) = α0 + Σ_{t=1}^T αt µ,

so that α0 = µ ( 1 − Σ_{t=1}^T αt ). Hence, we have X̂T+m = µ + Σ_{t=1}^T αt (Xt − µ). Thus, we can centre the process and
consider time series with µ = 0 and α0 = 0.
For a one-step-ahead prediction, which is to estimate X̂T +1 , we solve the above
equations to get
0 = E( (XT+1 − X̂T+1) X1 ) = KX(T) − Σ_{t=1}^T αt KX(t − 1)
0 = E( (XT+1 − X̂T+1) X2 ) = KX(T − 1) − Σ_{t=1}^T αt KX(t − 2)
...
0 = E( (XT+1 − X̂T+1) XT ) = KX(1) − Σ_{t=1}^T αt KX(T − t).

Similar to before, let K be the T-long vector with entries KX(T), . . . , KX(1)
and let Γ be the T × T matrix with (i, j)th entry KX(i − j), i.e.

Γ = [ KX(0)      KX(1)      · · ·  KX(T − 1)
      KX(1)      KX(0)      · · ·  KX(T − 2)
      ...        ...        . . .  ...
      KX(T − 1)  KX(T − 2)  · · ·  KX(0)   ].

Then, the above equations can be written as K = Γα, or α = Γ^{−1}K in the case that
the inverse exists. Thus, for X = (X1, . . . , XT)ᵀ, our one-step prediction can be
written as

X̂T+1 = αᵀX = Kᵀ Γ^{−1} X.

As with estimation for the AR process via the Yule-Walker equations, our prediction
is based on the autocovariances. If we knew the autocovariance exactly—i.e. we use
K and Γ instead of K̂ and Γ̂—then the mean squared prediction error is

E( XT+1 − X̂T+1 )² = E( XT+1 − KᵀΓ^{−1}X )²
  = E(XT+1²) − 2KᵀΓ^{−1}E(X XT+1) + KᵀΓ^{−1}E(X Xᵀ)Γ^{−1}K
  = KX(0) − 2KᵀΓ^{−1}K + KᵀΓ^{−1}ΓΓ^{−1}K
  = KX(0) − KᵀΓ^{−1}K.

For the AR(p) process,

Xt = wt + Σ_{i=1}^p φi Xt−i,

the one-step-ahead prediction comes precisely from estimating the coefficients as in
the Yule-Walker equations to get

X̂T+1 = Σ_{i=1}^p φ̂i XT+1−i.

However, if we do not know a priori that we have an order-p process, then we would
have to estimate αi for all i = 1, . . . , T , which could require the inversion of a very
large matrix. Thus, for general ARMA models, we have to work harder.

The Durbin-Levinson Algorithm


The system of equations α = Γ^{−1}K and the computation of the mean squared prediction
error for the one-step-ahead estimate,

P^T_{T+1} := E( XT+1 − X̂T+1 )² = KX(0) − KᵀΓ^{−1}K,

can be solved iteratively. This is because Γ is a Toeplitz matrix, and the Durbin-
Levinson algorithm can be used to solve a system of equations involving a Toeplitz

matrix. To do this, we need to consider a sequence of one-step-ahead predictors

X̂_2^1 = α1,1 X1
X̂_3^2 = α2,1 X1 + α2,2 X2
X̂_4^3 = α3,1 X1 + α3,2 X2 + α3,3 X3
...
X̂_{T+1}^T = αT,1 X1 + αT,2 X2 + . . . + αT,T XT.

We begin recursively with α0,0 = 0 and P_1^0 = KX(0), which says that the MSPE given
no information is just the variance. Then, given coefficients αt,1, . . . , αt,t, we can
compute

α_{t+1,1} = ( ρX(t) − Σ_{i=1}^{t−1} α_{t−1,i} ρX(i) ) / ( 1 − Σ_{i=1}^{t−1} α_{t−1,i} ρX(t − i) ),

P_{t+1}^t = P_t^{t−1} ( 1 − α²_{t+1,1} ), and, for the remaining coefficients, α_{t+1,i} = α_{t,t−i−1} − α_{t+1,1} α_{t,i}.
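A sketch of this recursion is given below in the more common textbook indexing, where φ_{t,k} are the coefficients of the lag-t predictor and the diagonal entries φ_{t,t} are the partial autocorrelations; this indexing differs slightly from the α notation used above.

# Durbin-Levinson recursion (standard phi_{t,k} indexing), returning the PACF values.
# Input: rho = c(rho_X(1), ..., rho_X(H)), e.g. acf(x, plot = FALSE)$acf[-1].
durbin_levinson <- function(rho) {
  H <- length(rho)
  phi <- matrix(0, H, H)
  phi[1, 1] <- rho[1]
  if (H > 1) {
    for (t in 2:H) {
      k <- 1:(t - 1)
      phi[t, t] <- (rho[t] - sum(phi[t - 1, k] * rho[t - k])) /
                   (1 - sum(phi[t - 1, k] * rho[k]))
      phi[t, k] <- phi[t - 1, k] - phi[t, t] * phi[t - 1, t - k]
    }
  }
  diag(phi)   # partial autocorrelations; should agree with pacf()
}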

The Innovations Algorithm


The innovations for a time series are the residuals for the one-step-ahead estimate,
Xt − X̂_t^{t−1}. First, note that Xt − X̂_t^{t−1} and Xs − X̂_s^{s−1} are uncorrelated for s ≠ t.
The innovations algorithm for calculating the one-step-ahead prediction X̂_{T+1}^T
is as follows.
First, initialize X_1^0 = 0 and P_1^0 = KX(0). Then, given the past observations
Xt, . . . , X1 and the past one-step predictions X_t^{t−1}, . . . , X_1^0, we compute

X_{t+1}^t = Σ_{j=1}^t θ_{t,j} ( X_{t+1−j} − X_{t+1−j}^{t−j} )

P_{t+1}^t = KX(0) − Σ_{j=0}^{t−1} θ²_{t,t−j} P_{j+1}^j

where

θ_{t,t−j} = ( KX(t − j) − Σ_{k=0}^{j−1} θ_{j,j−k} θ_{t,t−k} P_{k+1}^k ) (P_{j+1}^j)^{−1}.

The innovations algorithm is useful for computing predictions for ARMA(p,q)


processes specifically for the MA part. To see this, we consider a simple example:
the MA(1) process.
Let Xt = θwt−1 + wt. Then, we know that the autocovariance is KX(0) =
σ²(1 + θ²), KX(1) = σ²θ, and KX(h) = 0 for h ≥ 2. Then, we have that θ0,0 = 1

and

θ1,1 = [KX(1) − 0]/[KX(0) − 0] = θ/(1 + θ²) = σ²θ/P_1^0
θ2,2 = [KX(2) − 0]/[KX(0) − 0] = 0
θ2,1 = [KX(1) − θ1,1 θ2,2 P_1^0]/[P_2^1] = [σ²θ − 0]/[P_2^1]
...
θt,j = 0, for j > 1
θt,1 = [KX(1) − 0]/[P_t^{t−1}] = σ²θ/P_t^{t−1}.

We can also update the MSPEs as P_{t+1}^t = (1 + θ² − θ θt,1) σ². Therefore, the one-step-ahead prediction is

X_{t+1}^t = θt,1 ( Xt − X_t^{t−1} ) = θ ( Xt − X_t^{t−1} ) σ²/P_t^{t−1}.
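A direct sketch of these MA(1) recursions; θ and σ² are assumed known here purely for illustration.

# One-step-ahead innovations predictions for an MA(1) with known theta and sigma^2
innovations_ma1 <- function(x, theta, sigma2 = 1) {
  T <- length(x)
  pred <- numeric(T + 1)          # pred[t] = X_t^{t-1}, the one-step prediction of X_t
  P <- numeric(T + 1)             # P[t] = P_t^{t-1}, the corresponding MSPE
  P[1] <- sigma2 * (1 + theta^2)  # P_1^0 = K_X(0)
  for (t in 1:T) {
    theta_t1 <- sigma2 * theta / P[t]              # theta_{t,1}
    pred[t + 1] <- theta_t1 * (x[t] - pred[t])     # X_{t+1}^t
    P[t + 1] <- (1 + theta^2 - theta * theta_t1) * sigma2
  }
  list(pred = pred[-1], mspe = P[-1])
}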

3.2 The ARMA Process


3.2.1 Estimation for ARMA processes
For an ARMA(p,q) process, we have parameters µ, σ 2 , φ1 , . . . , φp , θ1 , . . . θq to esti-
mate. To estimate terms for an ARMA process, we revisit the maximum likelihood
approach from above for the AR process. Note first that to do this, we assume
that the white noise is Gaussian so that we have a distribution for the likelihood
equation. Similar to before, we need to carefully rewrite the likelihood to make it
tractable. This time, we consider conditioning the tth time point on the previous
t − 1 time points. That is,
L(µ, σ², φ, θ) = Π_{t=1}^T f(Xt | Xt−1, . . . , X1),

which means that we will consider each Xt predicted by the previous observations
Xt−1 , . . . , X1 .
As we assume we have a causal invertible ARMA(p,q) process, we can write it
as a linear process in the form

Xt = Σ_{i=0}^∞ ψi wt−i

where assuming an infinite past will be convenient even if the data is finite. The
distribution of Xt | Xt−1, . . . , X1 will be normal with mean X̂_t^{t−1}, the one-step-ahead
prediction. The variance will be E[ (Xt − X̂_t^{t−1})² ], which is also the mean squared
prediction error P_t^{t−1}. In the case of t = 1, we just use the variance of the linear
process,

KX(0) = σ² Σ_{i=0}^∞ ψi².

From there, we can use the Durbin-Levinson algorithm to update the MSPE by
P_{t+1}^t = P_t^{t−1}(1 − α²_{t+1,1}). The precise computation is not important for our purposes,
but we can write P_t^{t−1} = σ² rt where rt does not depend on σ². This allows us to
write the likelihood as

L(µ, σ², φ, θ) = (2πσ²)^{−T/2} [ Π_{t=1}^T rt ]^{−1/2} exp( −S(µ, φ, θ)/(2σ²) )

with the sum of squares S(µ, φ, θ) = Σ_{t=1}^T (Xt − X̂_t^{t−1})²/rt.
P

From all of this, we can get the MLE for the variance, σ̂² = S(µ̂, φ̂, θ̂)/T, as a
function of the other estimators. To find those estimators, we can maximize the
concentrated likelihood, which is obtained by replacing σ² with S(µ, φ, θ)/T and then
solving for the MLEs µ̂, φ̂, θ̂. That is, for some constant C,

log L(µ, φ, θ) = C − (T/2) log σ̂² − (1/2) Σ_{t=1}^T log rt,   or
ℓ(µ, φ, θ) = log( S(µ, φ, θ)/T ) + (1/T) Σ_{t=1}^T log rt.
We could check that for AR(p) processes—that is, without any MA part—that we
recover the MLE estimates from before.

Asymptotic Distribution
Similar to the Yule-Walker equations for the AR process, we have a central limit-like
theorem for the MLE for the ARMA process. If we let β = (φ1, . . . , φp, θ1, . . . , θq),
then as T → ∞,

√T (β̂ − β) →d N( 0, σ² Γ^{−1}_{p,q} )

where Γ_{p,q} is a (p + q) × (p + q) matrix with block form

Γ_{p,q} = [ Γφφ  Γφθ
            Γθφ  Γθθ ]

where the i, jth entry in Γφφ is KY (i − j) for the AR process Φ(B)Yt = wt , and
the i, jth entry in Γθθ is KY 0 (i − j) for the AR process Θ(B)Yt0 = wt , and the i, jth
entry in Γφθ is the cross covariance between Y and Y 0 .

Example 3.2.1 (AR(1)). For the causal AR(1) process Xt = φXt−1 + wt, we recall
that KX(0) = σ²/(1 − φ²). Therefore, Γ1,0 = σ²(1 − φ²)^{−1} and we have that

φ̂ →d N( φ, (1 − φ²)/T ).

Example 3.2.2 (MA(1)). Similar to the AR(1) process, consider the invertible
MA(1) process Xt = θwt−1 + wt. The MA polynomial is Θ(B) = 1 + θB, so the
AR(1) process Θ(B)Yt = wt has a variance of KY(0) = σ²/(1 − θ²). Thus, we have
that

θ̂ →d N( θ, (1 − θ²)/T ).
Note that similar to linear regression and many other areas of statistics, if we
include too many terms when fitting an ARMA process to a dataset, the standard
errors of the estimate will be larger than necessary. Thus, it is typically good to
keep the number of parameters as small as possible. Hence, the use of AIC and BIC
in the R function arima().
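A sketch of fitting competing ARMA orders and comparing them by AIC; the simulated model below is an arbitrary illustrative choice.

# Fitting candidate ARMA models with arima() and comparing by AIC
set.seed(10)
x <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 300)
fit11 <- arima(x, order = c(1, 0, 1))    # the correct ARMA(1,1)
fit21 <- arima(x, order = c(2, 0, 1))    # one unnecessary AR term
c(fit11$aic, fit21$aic)                  # the extra parameter usually does not pay off
fit11$coef                               # estimates
sqrt(diag(fit11$var.coef))               # their standard errors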

3.2.2 Forecasting for ARMA processes


We already discussed approaches to forecasting that can be applied to the ARMA(p,q)
process. In this section, we take a closer look at forecasting for the ARMA process
and the behaviour of these predicted values. As usual, we will assume that the
ARMA(p,q) process written in operator form as

Φ(B)Xt = Θ(B)wt

is both causal and invertible as well as mean zero.4 We will also assume that we are
making a prediction based on observed data X1 , . . . , XT .
Given a sample size T , there are two possible estimators for the future point
XT +h to consider. The prediction that minimizes the mean squared error is

X̂T +h = E (XT +h | XT , . . . , X1 ) .

However, it is often mathematically convenient to consider the estimator based on


the infinite past, which is

X̃T +h = E (XT +h | XT , . . . , X1 , X0 , X−1 , . . .) .

As the data size T gets large, these two predictions are very close.
4
As usual, we can subtract the mean to achieve this last assumption.

As we assume that the ARMA process is both causal and invertible, we can
rewrite it in two different forms:

XT+h = wT+h + Σ_{j=1}^∞ ψj wT+h−j    (Causal)
wT+h = XT+h + Σ_{j=1}^∞ πj XT+h−j    (Invertible)

and we can consider the above conditional expectations applied to each of these
equations.
First, we note that

E( wt | XT, . . . , X0, . . . ) = wt if t ≤ T,  and  0 if t > T.

This is because (1) if t > T then wt is independent of the sequence XT, XT−1, . . ., and
(2) if t ≤ T then, based on the causal and invertible representations, we have a one
to one correspondence between the X's and the w's. Similarly,

E( Xt | XT, . . . , X0, . . . ) = Xt if t ≤ T,  and  X̃t if t > T.
Applying this idea to the causal representation, we get that

X̃T+h = Σ_{j=h}^∞ ψj wT+h−j

as the first h − 1 terms in the sum become zero. Then, subtracting this from XT+h
gives

XT+h − X̃T+h = Σ_{j=0}^{h−1} ψj wT+h−j.

Hence, the mean squared prediction error is

P^T_{T+h} = E[ (XT+h − X̃T+h)² ] = σ² Σ_{j=0}^{h−1} ψj².

Note that we can also apply the conditional expectation to the invertible repre-
sentation. In that case, we get
0 = X̃T+h + Σ_{j=1}^{h−1} πj X̃T+h−j + Σ_{j=h}^∞ πj XT+h−j

X̃T+h = − Σ_{j=1}^{h−1} πj X̃T+h−j − Σ_{j=h}^∞ πj XT+h−j.
This shows that the T + h predicted value is a function of the data XT , . . . and the
previous h − 1 predicted values X̃T +h−1 , . . . , X̃T +1 .
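A sketch of producing such forecasts in R from a fitted model (the simulated data are illustrative only):

# h-step-ahead forecasts and their prediction standard errors from a fitted ARMA model
set.seed(11)
x <- arima.sim(model = list(ar = 0.8, ma = 0.3), n = 200)
fit <- arima(x, order = c(1, 0, 1))
fc <- predict(fit, n.ahead = 20)
fc$pred   # forecasts: settle towards the estimated mean as the horizon grows
fc$se     # prediction standard errors: grow towards the process standard deviation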

Long Range Forecast Behaviour


What happens if we try to predict far into the future? If we consider an ARMA(p,q)
process with mean µ, then the h-step-ahead estimator based on the infinite past is

X
X̃T +h = µ + ψj wt+h−j .
j=h

Now, we know from before that the coefficients ψj tend to zero fast enough to be
absolutely summable—i.e. Σ_{j=0}^∞ |ψj| < ∞. Therefore, Σ_{j=h}^∞ |ψj| → 0 as h → ∞.
This implies that

X̃T+h →P µ   as h → ∞.

We can prove this by first noting that the variance of X̃T+h is σ² Σ_{j=h}^∞ ψj². Then, for
any ε > 0, we can use Chebyshev's inequality to get that

P( |X̃T+h − µ| > ε ) = P( | Σ_{j=h}^∞ ψj wT+h−j | > ε ) ≤ ε^{−2} σ² Σ_{j=h}^∞ ψj² → 0

as h → ∞.
Meanwhile, the MSPE from above is

P^T_{T+h} = σ² Σ_{j=0}^{h−1} ψj².

Therefore, as h → ∞, the MSPE tends to KX(0), which is just the variance of the
process Xt. Hence, in the long run, the forecast for an ARMA(p,q) process tends towards
its mean, and the prediction variance tends to the variance of the process.

Truncating the infinite past


For a small sample size T , we can forecast by solving the system of equations pre-
sented above by inverting the T × T matrix Γ. For a large sample size T , we can
use the recursive approach to forecast. However, it is worth considering what the
effect is of not having access to the past time points X0 , X−1 , . . . before the dataset
was collected.
Using the invertible representation of the time series, we have the truncated
h-step-ahead prediction

X̃^T_{T+h} = − Σ_{j=1}^{h−1} πj X̃^T_{T+h−j} − Σ_{j=h}^{T+h−1} πj XT+h−j.

Given the coefficients φi and θj, we can write this truncated prediction as

X̃^T_{T+h} = Σ_{i=1}^p φi X̃^T_{T+h−i} + Σ_{j=1}^q θj w̃^T_{T+h−j}

where we replace the predicted value X̃^T_{T+h−i} with the observed value XT+h−i if
i ∈ [h, T + h − 1] and with 0 if i ≥ T + h. Similarly, w̃^T_t = 0 if t < 1 or if t > T.
Otherwise,

w̃^T_t = Φ(B) X̃^T_t − Σ_{j=1}^q θj w̃^T_{t−j}.

To see all of this in action, we can consider a few simple examples.

Example 3.2.3 (ARMA(1,1)). For the causal invertible ARMA(1,1) process Xt+1 =
φXt + wt+1 + θwt, we can consider the one-step-ahead prediction

X̃^T_{T+1} = φXT + θw̃^T_T

and the h-step-ahead prediction X̃^T_{T+h} = φX̃^T_{T+h−1} for h ≥ 2.


Hence, to forecast for the ARMA(1,1) process, we only need XT and the estimate
w̃^T_T. For the error term, we have that

wt+1 = Xt+1 − φXt − θwt
w̃^T_{t+1} = Xt+1 − φXt − θw̃^T_t.

Hence, we can start from w̃^T_0 = 0 and X0 = 0 and compute the w̃^T_{t+1} iteratively.
We can also compute the variance of the prediction (the MSPE). For this, we
note that the ARMA(1,1) process can be written in a causal form as

Xt = wt + Σ_{i=1}^∞ ψi wt−i = wt + Σ_{i=1}^∞ (φ + θ) φ^{i−1} wt−i.

Then, we have that

P^T_{T+h} = σ² Σ_{j=0}^{h−1} ψj² = σ² ( 1 + (φ + θ)² Σ_{j=1}^{h−1} φ^{2j−2} )
          = σ² ( 1 + (φ + θ)² (1 − φ^{2h−2})/(1 − φ²) ) → σ² ( 1 + (φ + θ)²/(1 − φ²) )

as h → ∞.

Backcasting
We can also consider forecasting into the past, or backcasting. That is, we can predict
backwards h time units into the past by

X̂^T_{1−h} = Σ_{i=1}^T αi Xi

for some coefficients αi. To do this, we proceed as before by considering, for t = 1, . . . , T,

E( X1−h Xt ) = Σ_{i=1}^T αi E( Xi Xt ),   i.e.,   KX(t + h − 1) = Σ_{i=1}^T αi KX(t − i).

This means we can compute the coefficients αi by solving the system of equations
K = Γα where

K = ( KX(h), KX(h + 1), . . . , KX(T + h − 1) )ᵀ

and Γ is the same T × T autocovariance matrix as before, with (i, j)th entry KX(i − j),
just as we did for forecasting.
Remark 3.2.4 (Fun Fact). For a stationary Gaussian process, the vector (XT+1, XT, . . . , X1)
is equal in distribution to (X0, X1, . . . , XT), so forecasting and backcasting are equivalent problems.

3.3 Seasonal ARIMA


Very often with time series, there is a strong seasonal component to the data. For
example, temperatures and other climate measurements have annual cycles. Fi-
nancial data may have annual or quarterly cycles. For another example, electricity
consumption may have both annual cycles and daily cycles—we use more electricity
when we are awake than when we are asleep, and we use more electricity in the
winter when it is dark and we are prone to staying inside, for example.
Thus, it is often beneficial to consider modelling time series data at certain lags
based on the seasonality of the data. For example, instead of considering an AR(1)
process
Xt = φXt−1 + wt ,
we could consider an annual AR(1) process
Xt = φXt−12 + wt

or more generally, Xt = φXt−s + wt which we will call an AR(1)s process for some
value of s > 1.
In general, we can consider a seasonal ARMA process, denoted ARMA(p′, q′)s,
which is

Φs(B^s) Xt = Θs(B^s) wt

where the polynomials Φs and Θs are

Φs(B^s) = 1 − ϕ1 B^s − ϕ2 B^{2s} − . . . − ϕ_{p′} B^{p′s}
Θs(B^s) = 1 + ϑ1 B^s + ϑ2 B^{2s} + . . . + ϑ_{q′} B^{q′s}.

The reason for the notation is to combine the seasonal ARMA with the regular
ARMA process to get an ARMA(p, q) × (p′, q′)s process, which can be written as

Φs(B^s) Φ(B) Xt = Θs(B^s) Θ(B) wt.

If we were to include a differencing operator, as in the ARIMA model, to account
for non-stationarity, we get the seasonal ARIMA or SARIMA(p, d, q) × (p′, d′, q′)s
model. This takes on the form

Φs(B^s) Φ(B) ∇_s^{d′} ∇^d Xt = Θs(B^s) Θ(B) wt

where ∇^d = (1 − B)^d and ∇_s^{d′} = (1 − B^s)^{d′}. The large number of parameters is why
the auto.arima() function in the R package forecast takes so long to run when
the seasonal component is included.
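As a sketch, a purely seasonal AR(1)_12 can be simulated and fit as follows; the coefficient 0.8 and the series length are illustrative choices.

# Simulating a seasonal AR(1)_12 and fitting a SARIMA model with arima()
set.seed(12)
y <- ts(arima.sim(model = list(ar = c(rep(0, 11), 0.8)), n = 240), frequency = 12)
fit <- arima(y, order = c(0, 0, 0),
             seasonal = list(order = c(1, 0, 0), period = 12))
fit   # the seasonal AR coefficient should be near 0.8

# The slower automated search mentioned above:
# library(forecast); auto.arima(y, seasonal = TRUE)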

3.3.1 Seasonal Autoregressive Processes


We can consider a purely seasonal AR process, such as an annual AR(1) for monthly data:

(1 − ϕB^{12}) Xt = wt,   i.e.,   Xt = ϕXt−12 + wt

where |ϕ| < 1. To compute the autocovariance quickly, we can rewrite this as a
linear process:

Xt = ϕXt−12 + wt
   = ϕ(ϕXt−24 + wt−12) + wt
   = ϕ²Xt−24 + ϕwt−12 + wt
   ...
   = Σ_{j=0}^∞ ϕ^j wt−12j.

Therefore, the variance is, as usual, KX(0) = σ²/(1 − ϕ²). For lags h = 1, . . . , 11, we
have

KX(h) = cov( Σ_{j=0}^∞ ϕ^j wt−12j, Σ_{i=0}^∞ ϕ^i wt−h−12i ) = Σ_{j=0}^∞ Σ_{i=0}^∞ ϕ^{j+i} cov( wt−12j, wt−h−12i ) = 0,

because the indices t − 12j and t − h − 12i will never be equal unless h is a multiple
of 12. In that case, we have

KX(12) = σ²ϕ Σ_{j=0}^∞ ϕ^{2j} = σ²ϕ / (1 − ϕ²).

Note that this is the same as KY (1) for the AR(1) process Yt = ϕYt−1 + wt . Hence,
the above seasonal AR process is effectively 12 uncorrelated AR processes running
in parallel to each other. This is why we often include a seasonal and non-seasonal
component in the SARIMA models.
Note also that the AR(1, 0)12 could also be thought of as an AR(12, 0). However,
trying to estimate or forecast with an AR(12, 0) process will include many param-
eters that are unnecessary, which will increase the variance for our estimators and
predictions.

3.3.2 Seasonal ARMA Processes


We can take the above process and add in an MA(1) process to get the ARMA(0, 1)×
(1, 0)12 which is
(1 − ϕB 12 )Xt = θwt−1 + wt
with |θ| < 1 and |ϕ| < 1. We can compute the variance as usual by noting that the
MA piece is based on wt−1 , which is uncorrelated with the AR(1)12 piece to get

KX(0) = Var(Xt) = Var( ϕXt−12 + θwt−1 + wt ) = ϕ²KX(0) + σ²(θ² + 1),

which gives that

KX(0) = σ² (1 + θ²)/(1 − ϕ²).
For the autocovariances at other lags, we first note that if h = 12m is a multiple of
12, then, iterating the same argument as in the purely seasonal AR case,

KX(h) = KX(12m) = ϕ^m KX(0) = σ² ϕ^m (1 + θ²)/(1 − ϕ²),

where the MA piece now enters only through KX(0). However, if h ≡ 1 mod 12 or
h ≡ 11 mod 12, we have to work harder. First, consider that

KX(1) = E( Xt−1 [ϕXt−12 + θwt−1 + wt] ) = ϕKX(11) + σ²θ,   and
KX(11) = E( Xt−11 [ϕXt−12 + θwt−1 + wt] ) = ϕKX(1).

From this, we have that

KX(1) = σ²θ/(1 − ϕ²)   and   KX(11) = σ²θϕ/(1 − ϕ²).

Continuing on, we have that

KX(13) = ϕKX(1) = σ²θϕ/(1 − ϕ²)

as well. Hence, we can generalize this to

KX(h) = KX(12m ± 1) = σ²θϕ^m/(1 − ϕ²).

Lastly, for any lags h ≢ −1, 0, 1 mod 12, we have KX(h) = 0, as none of the indices
line up in the autocovariance computation.

Chapter 4

Analysis in the Frequency


Domain

Introduction
Time series data often exhibits cyclic behaviour as we saw with SARIMA models
in the previous chapter. Furthermore, a time series may have more than one cycle
occurring simultaneously. In this chapter, we will consider spectral methods for
identifying the cyclic behaviour of time series data.
In general, we are interested in the frequency ω of the time series. For example,
for a time series that repeats every 12 months, the frequency is ω = 1/12 cycles per month.

4.1 Periodic Processes


We will consider the periodic process

Xt = A cos(2πωt + φ)

where A is the amplitude, ω is the frequency, and φ is the phase. This process can
be rewritten as linear combination of trig functions as

Xt = U1 cos(2πωt) + U2 sin(2πωt)

where U1 = A cos(φ) and U2 = −A sin(φ). This is due to the identity cos(x + y) =


cos(x) cos(y) − sin(x) sin(y).
If we take U1 and U2 to be uncorrelated mean zero random variables with vari-

ance σ², then we can compute the autocovariance as

KX(h) = cov( U1 cos(2πω(t + h)) + U2 sin(2πω(t + h)), U1 cos(2πωt) + U2 sin(2πωt) )
      = cov( U1 cos(2πω(t + h)), U1 cos(2πωt) ) + cov( U2 sin(2πω(t + h)), U2 sin(2πωt) )
      = σ² cos(2πω(t + h)) cos(2πωt) + σ² sin(2πω(t + h)) sin(2πωt)
      = σ² cos(2πωh),

so depending on the lag h, the autocovariance will rise or fall.


We can combine q different frequencies ω1 , . . . , ωq to get a more general period
process
X q
Xt = {Uj1 cos(2πωj t) + Uj2 sin(2πωj t)}
j=1

where all of the Uj1 and Uj2 are uncorrelated mean zero random variance with
potentially different variances σj2 . The autocovariance in this case is
q
X
KX (h) = σj2 cos(2πωj h).
j=1

Remark 4.1.1 (Aliasing). Aliasing is a problem that can occur when taking a discrete
sample from a continuous signal. Since we have to sample at a certain rate, high
frequency behaviour in the signal may look like low frequency patterns in our sample.
This is displayed in Figure 4.1, where the red dots are sampled too infrequently,
making it appear as if there is a low frequency oscillation in the data instead of the
actual high frequency oscillation in black.

4.1.1 Regression, Estimation, and the FFT


Given some time series data X1, . . . , XT with T odd, we can exactly represent
the data as

Xt = β0 + Σ_{j=1}^{(T−1)/2} { βj1 cos(2πt j/T) + βj2 sin(2πt j/T) }.

This is because the sines and cosines form a basis and, similarly to how a degree T − 1
polynomial can pass through T points, these T − 1 parameters β can be used to fit
the data exactly. Note that for T even, we can also do this with

Xt = β0 + Σ_{j=1}^{T/2−1} { βj1 cos(2πt j/T) + βj2 sin(2πt j/T) } + βT/2 cos(πt).

[Figure: a high frequency sinusoid (black) sampled at too low a rate (red points), so the samples trace out a spurious low frequency oscillation.]

Figure 4.1: Aliasing occurs when we sample too infrequently to capture high
frequency oscillations.

Of course, in practice, we do not want to include T − 1 parameters to fit our data


exactly. Instead, we assume that most of these β will be zero and the only non-zero
β’s will correspond to prominent frequencies in the time series.
We can estimate all of the β’s by treating this as a linear regression problem.1
1 Recall that for Y = Xβ + ε, if XᵀX = cIn, then β̂ = XᵀY/c.

First, with a little work, we can show that

Σ_{t=1}^T cos²(2πtj/T) = Σ_{t=1}^T sin²(2πtj/T) = T/2   for j = 1, . . . , T/2 − 1
Σ_{t=1}^T cos²(2πtj/T) = T   for j = 0, T/2
Σ_{t=1}^T sin²(2πtj/T) = 0   for j = 0, T/2
Σ_{t=1}^T cos(2πtj/T) cos(2πtk/T) = 0   for j ≠ k
Σ_{t=1}^T sin(2πtj/T) sin(2πtk/T) = 0   for j ≠ k
Σ_{t=1}^T cos(2πtj/T) sin(2πtk/T) = 0   for any j, k.

Hence, our design matrix for linear regression has orthogonal columns, so computing
each β̂ becomes, for j ≠ 0, T/2,

β̂j1 = (2/T) Σ_{t=1}^T Xt cos(2πtj/T)
β̂j2 = (2/T) Σ_{t=1}^T Xt sin(2πtj/T)

and β̂0 = X̄ and, if T is even, β̂T/2 = T^{−1} Σ_{t=1}^T (−1)^t Xt.

Given these estimates, we can define the scaled periodogram P(j/T) = β̂j1² + β̂j2²,
which can be used to determine which frequencies are the most prominent in the
time series Xt. Note that the variance of βj1 cos(2πtj/T) + βj2 sin(2πtj/T) is
βj1² + βj2², so the periodogram is the sample variance for frequency j/T.
Computing all of these β̂'s as above is computationally infeasible for large T.
However, if T is a highly composite integer—i.e. one with a lot of small integer
factors, like 2^m—then we can use the discrete Fourier transform,

d(j/T) = (1/√T) Σ_{t=1}^T Xt exp(−2πitj/T)
       = (1/√T) { Σ_{t=1}^T Xt cos(2πtj/T) − i Σ_{t=1}^T Xt sin(2πtj/T) }.

The squared magnitude of the coefficients,

|d(j/T)|² = (1/T) { Σ_{t=1}^T Xt cos(2πtj/T) }² + (1/T) { Σ_{t=1}^T Xt sin(2πtj/T) }²,

is the (unscaled) periodogram. The scaled periodogram is P(j/T) = (4/T)|d(j/T)|²,
which follows from the equations for the β̂ above. Noting that cos(2π − θ) = cos(θ)
and that sin(2π − θ) = − sin(θ), we have that |d(1 − j/T)|² = |d(j/T)|² and likewise
P(1 − j/T) = P(j/T). Hence, we only consider frequencies j/T < 1/2.
The DFT can be computed quickly via the Fast Fourier Transform (FFT).
The DFT is just a linear transformation of the data Xt , which can be written at
d = W X for some matrix W . This type of transformation would take O(T 2 ) time to
compute. However, the FFT uses a sparse representation of W to reduce the time
to O(T log2 T ). There are many algorithms for the FFT, but the most common
takes a divide-and-conquer approach. That is, if T = 2m , then it breaks the data
in half based on odd and even indices X1 , X3 , . . . , XT −1 and X2 , X4 , . . . , XT and
computes the Fourier transform of each separately. However, since T /2 = 2m−1 is
also divisible by 2, this idea can be applied recursively to get 4 partitions of the
data, then 8, and so on.
If we rewrite the DFT as

√T d(j/T) = Σ_{t=1}^{T/2} X_{2t} e^{−2πitj/(T/2)} + e^{−2πij/T} Σ_{t=1}^{T/2} X_{2t−1} e^{−2πitj/(T/2)}
          = Ej + e^{−2πij/T} Oj,

then we can decompose it into even and odd parts Ej and Oj, respectively. These
two pieces are each DFTs of size T/2. We also note that there is a redundancy in
the calculations: for j < T/2, we have

√T d(j/T) = Ej + e^{−2πij/T} Oj   and   √T d(j/T + 1/2) = Ej − e^{−2πij/T} Oj.
Remark 4.1.2 (Scaling and the FFT). In Fourier analysis, and for different FFT
implementations in code, there are often different scaling factors included. Hence,
to make sure we are estimating what we want to estimate, one needs to take care
when using FFT algorithms.
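A sketch of computing the scaled periodogram with R's fft(), whose unnormalized output must be rescaled to match the 1/√T convention used here; fft() also indexes time from 0, but that phase factor does not affect the periodogram.

# Scaled periodogram via the FFT
x <- as.numeric(co2)                     # any numeric series; co2 is a built-in example
T <- length(x)
d <- fft(x - mean(x)) / sqrt(T)          # DFT with the 1/sqrt(T) scaling of these notes
P <- (4 / T) * Mod(d)^2                  # scaled periodogram P(j/T)
freq <- (0:(T - 1)) / T
plot(freq[freq < 0.5], P[freq < 0.5], type = "h",
     xlab = "frequency", ylab = "scaled periodogram")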

4.2 Spectral Distribution and Density


We begin by presenting a way to represent the autocovariance function as described
in the Wiener-Khintchine Theorem.2 The theorem applies to continuous time pro-
2 https://en.wikipedia.org/wiki/Wiener–Khinchin_theorem

cesses. However, in this course, we only consider discrete time processes.

Theorem 4.2.1 (Wiener-Khintchine Theorem I). Let Xt be stationary with autocovariance
KX(h) = cov(Xt+h, Xt). Then, there exists a unique monotonically increasing
function FX(ω), called the spectral distribution function, with FX(−1/2) = 0
and FX(1/2) = KX(0), such that

KX(h) = ∫_{−1/2}^{1/2} e^{2πiωh} dFX(ω)

where this is a Riemann-Stieltjes integral.

Given slightly stronger conditions, we can also define the spectral density. That
is, if the autocovariance is absolutely summable, then the spectral distribution is
absolutely continuous in turn implying that the derivative exists almost everywhere:
dFX (ω) = fX (ω)dω.

Theorem 4.2.2 (Wiener-Khintchine Theorem II). Let Xt be stationary with autocovariance
KX(h) = cov(Xt+h, Xt) such that

Σ_{h=−∞}^∞ |KX(h)| < ∞.

Then, we can write

KX(h) = ∫_{−1/2}^{1/2} e^{2πiωh} fX(ω) dω.

Furthermore, we have the inverse transformation

fX(ω) = Σ_{h=−∞}^∞ KX(h) e^{−2πiωh}

for ω ∈ [−1/2, 1/2].

From here we see that if fX(ω) exists, then it is an even function—i.e. fX(ω) = fX(−ω).
Also, since KX(0) = ∫_{−1/2}^{1/2} fX(ω) dω, the variance of the process can be
thought of as the integral of the spectral density over all frequencies. In a way, this
is similar to how the total sum of squares can be decomposed into separate sums of
squares in an ANOVA.
As a simple example, consider the periodic process Xt = U1 cos(2πω0 t) + U2 sin(2πω0 t)
from before. Then, the autocovariance is

KX(h) = σ² cos(2πω0 h) = (σ²/2) ( e^{2πiω0 h} + e^{−2πiω0 h} ) = ∫_{−1/2}^{1/2} e^{2πiωh} dFX(ω)

where FX(ω) = 0 for ω < −ω0, FX(ω) = σ²/2 for ω ∈ [−ω0, ω0), and FX(ω) = σ²
for ω ≥ ω0. Note that in this case, the autocovariance is not absolutely summable.
As a second example, we consider the white noise process wt . In this case,
the autocovariance is simply σ 2 at lag h = 0 and 0 for all other lags h. Hence,
it is absolutely summable and the spectral density is just fX (ω) = σ 2 for all ω ∈
[−1/2, 1/2]. Hence, as mentioned in Chapter 1, white noise in a sense contains every
frequency at once with equal power.

4.2.1 Filtering and ARMA


Given the spectral density for one time series Xt, we can find the spectral density
for another related time series Yt = Σ_{j=−∞}^∞ aj Xt−j for some fixed sequence aj with
Σ_{j=−∞}^∞ |aj| < ∞. Treating a : Z → R as a function, a(j) = aj is called the
impulse response function, and its Fourier transform

A(ω) = Σ_{j=−∞}^∞ aj e^{−2πiωj}
is the frequency response function. Given all of this, we have the following theorem.
Theorem 4.2.3. Let Xt be a time series with spectral density fX(ω) and let
Σ_{j=−∞}^∞ |aj| < ∞. Then the spectral density for Yt = Σ_{j=−∞}^∞ aj Xt−j is

fY(ω) = |A(ω)|² fX(ω)

for A(ω) the frequency response function from above.
Proof. To prove the above theorem, we just compute the autocovariance directly.

KY(h) = cov( Σ_{j=−∞}^∞ aj Xt+h−j, Σ_{l=−∞}^∞ al Xt−l )
      = Σ_{j} Σ_{l} aj al KX(h − j + l)
      = Σ_{j} Σ_{l} aj al ∫_{−1/2}^{1/2} e^{2πiω(h−j+l)} fX(ω) dω
      = ∫_{−1/2}^{1/2} [ Σ_{j} aj e^{−2πiωj} ] [ Σ_{l} al e^{2πiωl} ] e^{2πiωh} fX(ω) dω
      = ∫_{−1/2}^{1/2} e^{2πiωh} |A(ω)|² fX(ω) dω.

Therefore, by the uniqueness of the Fourier transform, we have that fY(ω) = |A(ω)|² fX(ω).

We can apply this result to a causal ARMA(p,q) process. For Φ(B)Xt = Θ(B)wt,
we can rewrite it as

Xt = ( Θ(B)/Φ(B) ) wt = Σ_{j=0}^∞ ψj wt−j.

Writing Ψ(z) = Θ(z)/Φ(z) = Σ_{j=0}^∞ ψj z^j and using the ψj as the aj from the above
theorem, we have that

A(ω) = Σ_{j} ψj e^{−2πiωj} = Ψ(e^{−2πiω}) = Θ(e^{−2πiω}) / Φ(e^{−2πiω}).

Using the fact that fw(ω) = σ² for all ω, we have finally that

fX(ω) = |A(ω)|² fw(ω) = σ² | Θ(e^{−2πiω}) / Φ(e^{−2πiω}) |².
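A sketch evaluating this formula numerically for given coefficients; the φ and θ values below are arbitrary illustrative choices.

# Spectral density of a causal ARMA(p,q): sigma^2 |Theta(e^{-2 pi i w})|^2 / |Phi(e^{-2 pi i w})|^2
arma_spec <- function(phi = numeric(0), theta = numeric(0), sigma2 = 1,
                      freq = seq(0, 0.5, length.out = 500)) {
  z <- exp(-2i * pi * freq)
  Phi   <- sapply(z, function(zz) 1 - sum(phi   * zz^seq_along(phi)))
  Theta <- sapply(z, function(zz) 1 + sum(theta * zz^seq_along(theta)))
  sigma2 * Mod(Theta)^2 / Mod(Phi)^2
}
f <- arma_spec(phi = 0.7, theta = 0.4)
plot(seq(0, 0.5, length.out = 500), f, type = "l",
     xlab = "frequency", ylab = "spectral density")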

4.3 Spectral Statistics


Thus far, our discussion of spectral analysis for time series has not considered the
issue of working with noisy data. Given a finite set of data X1, . . . , XT, we can
compute the DFT

d(ωj) = (1/√T) Σ_{t=1}^T Xt e^{−2πiωj t}

for frequencies ωj = j/T. We can also compute the real and imaginary parts separately
via the cosine and sine transforms:

dc(ωj) = (1/√T) Σ_{t=1}^T Xt cos(2πωj t)
ds(ωj) = (1/√T) Σ_{t=1}^T Xt sin(2πωj t)

so that d(ωj) = dc(ωj) − i ds(ωj).


We can compute the inverse DFT by

Xt = (1/√T) Σ_{j=0}^{T−1} d(ωj) e^{2πiωj t}

for t = 1, . . . , T. This gives us the periodogram, defined to be

I(ωj) = |d(ωj)|² = dc(ωj)² + ds(ωj)².

We can also centre the DFT when j ≠ 0 to get

d(ωj) = (1/√T) Σ_{t=1}^T (Xt − X̄) e^{−2πiωj t}

since Σ_{t=1}^T e^{−2πiωj t} = 0 for any ωj ≠ 0. This is useful as it allows us to write the
periodogram for j ≠ 0 as

I(ωj) = | (1/√T) Σ_{t=1}^T (Xt − X̄) e^{−2πiωj t} |²
      = (1/T) Σ_{t=1}^T Σ_{s=1}^T (Xt − X̄)(Xs − X̄) e^{−2πiωj (t−s)}
      = (1/T) Σ_{h=−(T−1)}^{T−1} e^{−2πiωj h} Σ_{t=1}^{T−|h|} (Xt+|h| − X̄)(Xt − X̄)
      = Σ_{h=−(T−1)}^{T−1} K̂X(h) e^{−2πiωj h}.

Hence, the periodogram can be written in terms of the Fourier transform of the
estimated autocovariance as we might have expected from the previous discussion.
The problem we face here is that the estimator K̂X(h) is very poor for large h,
as there are relatively few pairs of time points to consider. Hence, we often truncate
this summation by only summing over |h| ≤ m for some m ≪ T.

4.3.1 Spectral ANOVA


We can consider the spectral approach to time series as an ANOVA problem. That
is, we can consider how much variation in the time series is due to a certain frequency
much like the sum of squares decomposition from classic ANOVA.
For simplicity, let T be odd. We consider

Xt = β0 + Σ_{j=1}^{(T−1)/2} { βj1 cos(2πt j/T) + βj2 sin(2πt j/T) }

where we found before that

β̂j1 = (2/T) Σ_{t=1}^T Xt cos(2πtj/T) = (2/√T) dc(ωj)
β̂j2 = (2/T) Σ_{t=1}^T Xt sin(2πtj/T) = (2/√T) ds(ωj)

and β̂0 = X̄. Therefore, we have

Xt − X̄ = (2/√T) Σ_{j=1}^{(T−1)/2} { dc(ωj) cos(2πt j/T) + ds(ωj) sin(2πt j/T) }

and

Σ_{t=1}^T (Xt − X̄)² = 2 Σ_{j=1}^{(T−1)/2} ( dc(ωj)² + ds(ωj)² ) = 2 Σ_{j=1}^{(T−1)/2} I(ωj).

This is because Σ_{t=1}^T cos²(2πtj/T) = T/2 and similarly for the sine series.
Thus, we have decomposed the total sum of squares over T data points into
the sum of (T − 1)/2 terms, 2I(ωj ), each with 2 degrees of freedom. Thus, the
periodogram I(ωj ) can directly be thought of as the variation due to frequency ωj
in the time series. Note that one would never want to use the aov() function in R
to compute this. Instead, the fft() is much more efficient.

4.3.2 Large Sample Behaviour


In this section, we assume Xt is a stationary process with mean µ, absolutely
summable autocovariance function KX(h), and spectral density fX(ω). If we write
the periodogram using the true mean µ—this makes the calculations easier than
using X̄—we find that

I(ωj) = (1/T) Σ_{h=−(T−1)}^{T−1} e^{−2πiωj h} Σ_{t=1}^{T−|h|} (Xt+|h| − µ)(Xt − µ)

E[I(ωj)] = Σ_{h=−(T−1)}^{T−1} ( (T − |h|)/T ) KX(h) e^{−2πiωj h}.

For taking the limit as T → ∞, we have to consider a sequence of frequencies
ωj^{(T)} that tends towards some ω as the sample size grows. For example, if we want
ω = 1/3, we could consider

ω1^{(2)} = 1/2,  ω1^{(4)} = 1/4,  ω3^{(8)} = 3/8,  ω5^{(16)} = 5/16,  ω11^{(32)} = 11/32,  . . .

In this case, if we have ωj^{(T)} → ω as T → ∞, then

E[I(ωj^{(T)})] → fX(ω) = Σ_{h=−∞}^∞ KX(h) e^{−2πiωh}.
Going further, if we strengthen the absolute summability condition Σ_{h=−∞}^∞ |KX(h)| < ∞
to the condition

c = Σ_{h=−∞}^∞ |h| |KX(h)| < ∞,
66
then we have that

fX (ωj )/2 + εT for ωj = ωk
cov (dc (ωj ), dc (ωk )) =
εT 6 ωk
for ωj =
and similarly for ds where εT is an error term bound by |εT | ≤ c/T . Hence, the
estimated covariance matrix should have a strong diagonal with smaller noisy off-
diagonal entries.
We can use this to find, via the central limit theorem, that if our process Xt is
just iid white noise with variance σ², then

dc(ωj^{(T)}) →d N(0, σ²/2)   and   ds(ωj^{(T)}) →d N(0, σ²/2).

Thus, recalling that I(ωj) = dc(ωj)² + ds(ωj)², we have that

2 I(ωj^{(T)}) / σ² →d χ²(2),

and this I(ωj^{(T)}) will be asymptotically independent of any other I(ωk^{(T)}).
For the general linear process, we have

Theorem 4.3.1. If Xt = Σ_{j=−∞}^∞ ψj wt−j with the ψj absolutely summable, with
wt being iid white noise with variance σ², and with

Σ_{h=−∞}^∞ |h| |KX(h)| < ∞,

then for any collection of m frequencies ωj^{(T)} → ωj, we have jointly that

2 I(ωj^{(T)}) / f(ωj) →d χ²(2)

given that f(ωj) > 0 for j = 1, . . . , m.
Thus, we can use this result for many statistical applications, like constructing a
1 − α confidence interval for the spectral density fX at some frequency ω:

2 I(ωj^{(T)}) / χ²_{2, 1−α/2}  ≤  fX(ω)  ≤  2 I(ωj^{(T)}) / χ²_{2, α/2}.

4.3.3 Banding, Tapering, Smoothing, and more


In the previous section, it is noted that the periodogram is a natural estimator for the
spectral density. However, there are many aspects of such estimation to consider.
In this section, we will consider methods of averaging, smoothing, banding, and
tapering, which will, if used correctly, give us a better estimator for the spectral
density. Ultimately, there is no one right way to estimate a given spectral density
from some time series data. One must explore some of the following techniques as
appropriate.

Bartlett’s and Welch’s Methods
For a time series X1, . . . , XT, we are able to compute I(ωj) for any frequency ωj =
j/T. However, such fine granularity is often not necessary. Instead, computing the
periodogram for fewer frequencies with better accuracy is often preferable.
Bartlett's method takes this into consideration by splitting the time series into
K separate disjoint series of equal length m = T/K. That is,

{X1, . . . , Xm}, {Xm+1, . . . , X2m}, . . . , {XT−m+1, . . . , XT}.

Then, for each of these K pieces, we can compute periodograms I^{(1)}(ωj), . . . , I^{(K)}(ωj)
and average them to get

Ī(ωj) = K^{−1} Σ_{i=1}^K I^{(i)}(ωj).

In this case, we only have periodogram values for ωj = j/m for j = 1, . . . , m, but
the variance of the estimate decreases. It is also faster to compute, as performing K
DFTs of size m is faster than performing one DFT of size mK = T.
Welch's method is nearly identical to Bartlett's method. However, this new
approach allows the time series to be partitioned into overlapping pieces that
overlap by a fixed number of data points.

Banding
Instead of partitioning in the time domain as Bartlett's method does, we can instead
partition the frequencies into bands. In this case, we can define a frequency band of
2m + 1 frequencies to be

B = { ω : ωj − m/T ≤ ω ≤ ωj + m/T }.

Here, we say that (2m + 1)/T is the bandwidth of B. The idea is that if fX(ω) is
locally constant for all frequencies in the band B, then the spectral density can be
estimated by

Ī(ω) = (1/(2m + 1)) Σ_{i=−m}^m I(ωj + i/T)

for any ω ∈ B. Considering the previous result that 2I(ωj^{(T)})/f(ωj) →d χ²(2), we
have the extension that

2(2m + 1) Ī(ωj^{(T)}) / f(ωj) →d χ²(2(2m + 1))

as long as T is large and m is small relative to T. Note that there typically is no optimal
bandwidth, and many can be tried when analyzing a time series in the spectral domain.

The above notion of banding simply weights all frequencies in the band B equally,
with weight 1/(2m + 1). Instead, we can use a weighted average of the frequencies of
the form

Ĩ(ω) = Σ_{i=−m}^m ci I(ωj + i/T)

where the weights ci sum to 1. To mathematically get convergence of this estimator
to a chi-squared distribution as before, we require that, as T → ∞ and m → ∞ such
that m/T → 0, we also have Σ_{i=−m}^m ci² → 0. Then, it can be shown that

E[Ĩ(ω)] → fX(ω)

and

cov( Ĩ(ω1), Ĩ(ω2) ) / ( Σ_{i=−m}^m ci² ) →  0 if ω1 ≠ ω2;   fX(ω)² if ω1 = ω2 ≠ 0, 1/2;   2fX(ω)² if ω1 = ω2 = 0 or 1/2.

In this case, Ĩ is asymptotically a weighted sum of chi-squared random variables,
which is hard to work with directly. Instead, we can approximate the "length" of
the band by

L = [ Σ_{i=−m}^m ci² ]^{−1}

to get roughly that

2L Ĩ(ωj^{(T)}) / f(ωj) →d χ²(2L).

f (ωj )
Note that this works perfectly if we replace ci with (2m+1)−1 recovering the equally
weighted scenario from above.
One way to choose specific weights is via the Daniell kernel. In this case, we
begin simply with ci = 1/3 for i = −1, 0, 1. Applying this to a sequence ut results
in

ut^{(1)} = ut−1/3 + ut/3 + ut+1/3.

If we apply this kernel a second time, we get

ut^{(2)} = ut−1^{(1)}/3 + ut^{(1)}/3 + ut+1^{(1)}/3
         = ut−2/9 + 2ut−1/9 + 3ut/9 + 2ut+1/9 + ut+2/9.

Also, sometimes the modified Daniell kernel is applied, which is the same idea as
above but starting with

ut^{(1)} = ut−1/4 + ut/2 + ut+1/4.
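A sketch of these smoothers using R's built-in spectrum tools; the AR(2) example is chosen only because it has a pronounced spectral peak, and the kernel widths are arbitrary.

# Raw and Daniell-smoothed periodograms via spec.pgram()
set.seed(13)
x <- arima.sim(model = list(ar = c(1.0, -0.9)), n = 512)   # strongly cyclic AR(2)
spec.pgram(x, log = "no", taper = 0)                                    # raw periodogram
spec.pgram(x, kernel = kernel("daniell", 4), log = "no", taper = 0)     # Daniell smoothing
spec.pgram(x, kernel = kernel("modified.daniell", c(4, 4)), log = "no", taper = 0)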

Tapering
Tapering a time series is another way to focus in on estimating the spectral density
for a certain range of frequencies. To discuss this, we begin in the time domain.
For a mean zero stationary process Xt with spectral density fX(ω), we construct a
tapered process Yt = at Xt for some coefficients at. Thus, the DFT for Yt gives us

dY(ωj) = (1/√T) Σ_{t=1}^T at Xt e^{−2πiωj t}.

Then, the expected value of the periodogram is

E[ IY(ωj) ] = ∫_{−1/2}^{1/2} WT(ωj − ω) fX(ω) dω

with WT(ω) = |AT(ω)|², where AT(ω) = T^{−1/2} Σ_{t=1}^T at e^{−2πiωt}. Here, we refer to
WT(ω) as the spectral window.

Example 4.3.1. The Fejér or modified Bartlett kernel is

WT(ω) = sin²(Tπω) / ( T sin²(πω) )

with WT(0) = T, which comes from at = 1 for all t. When averaging over a
band B as above, the spectral window is similarly averaged. That is, for Ī(ω) =
(1/(2m + 1)) Σ_{i=−m}^m I(ωj + i/T), we have

WT(ω) = (1/(2m + 1)) Σ_{i=−m}^m sin²(Tπ(ω + i/T)) / ( T sin²(π(ω + i/T)) ).

Example 4.3.2. Other tapers that live up to the name “tapering” include the cosine
taper, which sets the coefficients at = [1 + cos(2π(t − (T + 1)/2)/T )]/2.

4.3.4 Parametric Estimation

4.4 Filtering
