Stat 479
Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1
Preface
The following are lecture notes originally produced for an undergraduate course
on time series at the University of Alberta in the winter of 2020. The aim of these
notes is to introduce the main topics, applications, and mathematical underpinnings of time series analysis.
These notes were produced by consolidating two main sources being the textbook
of Shumway and Stoffer, Time Series Analysis and Its Applications, and the past
course notes produced by Dr. Doug Wiens, also of the University of Alberta.
Adam B Kashlak
Edmonton, Canada
January 2020
Chapter 1
Introduction
In this chapter we consider different types of time series processes that we may
encounter in practice. The main difference between time series and other areas of
statistics like linear regression is that the noise or errors can be correlated. This
arises from the fact that time implies causality; the past predicts the future. Thus,
we no longer live in the independent and identically distributed setting of most other
areas of statistics.
This chapter also reintroduces notions of covariance and correlation in the con-
text of time series, which become autocovariance and autocorrelation. The critical
property of stationarity is defined, which allows us to estimate such autocovariances
and autocorrelations from a given time series dataset.
That is, wt and ws are uncorrelated but not necessarily independent.
This can be strengthened to iid noise if uncorrelated is replaced with independent.
This can be further strengthened to Gaussian white noise where every w_t ∼ N(0, σ²).
The intuition behind the term white noise comes from signals processing where a
signal is white if it contains all possible frequencies. Furthermore, the white noise
process will be used to generate all of the subsequent processes.
1.1.2 Autoregressive
The autoregressive (AR) process is a natural way to encode causality into the white
noise process, that is, demonstrate how the past influences the future. The general
formula is
X_t = Σ_{i=1}^p θ_i X_{t−i} + w_t,
which is that each past observation X_{t−i} contributes, with weight θ_i ∈ R, to the present time observation X_t.
For example, if p = 1 and θ_1 = 1, then we have the process
X_t = X_{t−1} + w_t,
which is known as the random walk.
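Although the course examples use R, the AR recursion above can be sketched in Python. This is a minimal simulation; numpy, the seed, and the zero initial values are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# White noise: mean-zero, uncorrelated, common variance sigma^2 = 1.
w = rng.normal(size=T)

def simulate_ar(theta, w):
    """Run the AR(p) recursion X_t = sum_i theta_i X_{t-i} + w_t
    with zero initial values (an illustrative convention)."""
    x = np.zeros(len(w))
    for t in range(len(w)):
        for i, th in enumerate(theta, start=1):
            if t - i >= 0:
                x[t] += th * x[t - i]
        x[t] += w[t]
    return x

# With p = 1 and theta_1 = 1 the recursion reproduces the random walk,
# which is just the cumulative sum of the noise.
random_walk = np.cumsum(w)
ar1 = simulate_ar([1.0], w)
```

The equality of `ar1` and `random_walk` is exact, since X_t = X_{t−1} + w_t is precisely a running sum.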
Based solely on the plots, the random walk and the chosen AR(3) process look sim-
ilar. Likewise, it may be hard to immediately identify that the top left plot is white
noise while the bottom left is an AR(2) process.
In this way, we can consider the MA process as a weighted average of the white noise process. A simple example is X_t = (w_{t−1} + w_t + w_{t+1})/3.
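The three-term average in this example is a convolution of the noise with equal weights, so it can be sketched in one line of Python (numpy assumed; the endpoints, where the window is incomplete, are dropped):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=500)

# Centered three-term moving average X_t = (w_{t-1} + w_t + w_{t+1}) / 3.
# mode="valid" keeps only the positions where the full window fits.
x = np.convolve(w, np.ones(3) / 3, mode="valid")
```

The smoothed series `x` is two points shorter than `w`, and its first entry equals (w_0 + w_1 + w_2)/3.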
[Figure 1.1 here: four panels (White Noise, Random Walk, AR 2, AR 3), each plotted against Time.]
Figure 1.1: Examples of the white noise process and autoregressive versions of that
white noise process of order 1, 2, and 3.
[Figure 1.2 here: four panels (White Noise, MA 3, MA 9, MA 21).]
Figure 1.2: Examples of the white noise process and moving averaged versions of
that white noise process averaged over windows of length 3, 9, and 21.
The AR process with p = 1 and θ1 = 1 is an example of a martingale. Note that
supermartingales and submartingales have also been studied where the above = is
replaced by a ≤ or ≥, respectively.
As the normal distribution lends itself elegantly to other areas of statistics, so
does it to time series. The Gaussian process is a generalization of the multivariate
normal distribution. It is a stochastic process Xt where for any finite collection of
time points {t1 , . . . , tk }, the random vector (Xt1 , . . . , Xtk ) is multivariate normal.
Much like the multivariate normal distribution, the Gaussian process can be defined by its mean µ_t and covariance Σ_{s,t} where
µ_t = E X_t and Σ_{s,t} = cov(X_s, X_t).
Many time series fall under the category of linear processes. Given a white noise
process wt , a linear process is defined as
X_t = µ + Σ_{j=−∞}^∞ θ_j w_{t−j},
which is that every X_t is a linear combination of the terms in the white noise process with some mean µ added on. Here, we require Σ_j θ_j² < ∞ in order for the process to have a finite variance. However, as we are generally interested in modelling causal processes in time, i.e. the past predicts the future and not vice versa, we can instead consider the more restricted definition
X_t = µ + Σ_{j=0}^∞ θ_j w_{t−j}.
Example 1.1.2 (The AR(1) Process). We revisit the AR(1) process, Xt = θXt−1 +
wt , and by using this recursive definition and assuming we can extend the series
infinitely into the past, we can rewrite it as
X_t = θ(θX_{t−2} + w_{t−1}) + w_t = … = Σ_{j=0}^∞ θ^j w_{t−j}.
Infinite series are limits, i.e. Σ_{j=0}^∞ θ^j w_{t−j} = lim_{N→∞} Σ_{j=0}^N θ^j w_{t−j}. Hence, this sum may not converge in any meaningful way. Let S_N(θ) = Σ_{j=0}^N θ^j w_{t−j}; then
E S_N = 0 and Var(S_N) = σ² Σ_{j=0}^N θ^{2j} = σ² (1 − θ^{2N+2})/(1 − θ²).
Thus, if |θ| < 1, then Var(S_N(θ)) → σ²/(1 − θ²), and if w_t is Gaussian noise, then S_N(θ) →_d N(0, σ²/(1 − θ²)).
In the case of the random walk, which is θ = 1, the series does not converge, but by the central limit theorem, we have
N^{−1/2} S_N(1) →_d N(0, σ²).
1.2 Properties of Time Series
1.2.1 Autocovariance
As a time series Xt can be thought of as a single entity, the covariance between two
time points is referred to as the autocovariance and is defined as
KX (s, t) = cov(Xt , Xs ).
The notation K comes from treating the autocovariance as a kernel function for an
integral transform.2 Note that the autocovariance function is symmetric, K(s, t) =
K(t, s), and positive (semi) definite in the sense that for any finite collection of time
points {t1 , . . . , tk }, we have a k × k matrix with i, jth entry K(ti , tj ) and this matrix
is positive (semi) definite.
Similar to the multivariate setting, we can normalize the autocovariance into an
autocorrelation by
ρ(s, t) = K_X(s, t) / √(K_X(s, s) K_X(t, t)).
Example 1.2.1 (AR(1) with drift). For wt a white noise process with variance σ 2 ,
consider the AR(1) with drift process
Xt = a + θXt−1 + wt
for some real a and θ. We can use the recursive definition to get that
Then, assuming this process has an infinite past and that |θ| < 1, we can take m to
infinity to get
X_t = a/(1 − θ) + Σ_{j=0}^∞ θ^j w_{t−j},
which happens to be a linear process. The mean can now be quickly calculated to be
2 For example, g(s) = ∫ f(t) K(s, t) dt.
EXt = a/(1 − θ) as Ewt = 0 for all t. Furthermore, the autocovariance is
This implies that the variance is σ 2 /(1 − θ2 ). Note that this process is a weakly
stationary process, which will be defined below.
1.2.2 Cross-Covariance
The cross covariance is similar to the autocovariance, but applies to multivariate time series. More simply, if we have two time series X_t and Y_t, then we can consider
1.2.3 Stationarity
In a broad sense, stationarity implies some property of the time series is invariant
to shifts in time. There are two such notions we will consider.
where F() is the joint CDF of the k random variables. That is, F(x_{t_1}, …, x_{t_k}) = P(X_{t_1} ≤ x_{t_1}, …, X_{t_k} ≤ x_{t_k}).
For a weakly (and thus for a strongly) stationary process Xt , we have that the
autocovariance function is
KX (s, t) = KX (s − t, 0) = KX (τ )
for some τ being the difference between two time points (the time lag). Hence, if
a process is weakly stationary, the autocovariance can be treated as a univariate
function.
Furthermore, this univariate function is both symmetric and bounded in τ . This
can be seen by noting, for symmetry, that
KX (τ ) = KX ((τ + t) − t) = KX (τ + t, t)
= KX (t, τ + t) = KX (t − (τ + t)) = KX (−τ ),
and, for boundedness, that K_X(0) = Var(X_t) for all t and by applying the Cauchy-Schwarz inequality that, for any r, |K_X(r)| ≤ K_X(0).
For example, if X_t = a + bt + Y_t for a stationary process Y_t, then
X_t − a − bt = Y_t,
which is stationary.
Definition 1.2.5 (Joint Stationarity). The processes Xt and Yt are said to be jointly
stationary if both are individually stationary and also if the cross covariance function
is also stationary—i.e. KXY (t, s) = KXY (t + r, s + r) for any r.
1.3 Estimation
Estimating parameters in a time series model is harder than it is in standard statis-
tics where we often assume that the observations are iid. Now, we are faced with a
single sequence of points X1 , X2 , . . . , XT which are not iid. To estimate the mean
and autocovariance we require the process to be weakly stationary. If it isn’t, then,
for example, every Xt will have its own mean and we cannot estimate it.
Note, as indicated above, we will consider time series with T total observations observed at equally spaced intervals. If the observations are irregularly spaced, more work has to be done.
Note that in the uncorrelated case K_X(0) = σ² and K_X(h) = 0 for h ≠ 0. Thus, the formula reduces to the usual σ²/T.
As h gets bigger, the number of terms in the sum decreases giving less accurate
estimation. Similarly, the sample autocorrelation function is defined to be
and the sample cross covariance and cross correlation are
K̂_XY(h) = (1/T) Σ_{t=1}^{T−h} (X_{t+h} − X̄)(Y_t − Ȳ) and ρ̂_XY(h) = K̂_XY(h) / √(K̂_X(0) K̂_Y(0)).
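The univariate version of this estimator is a direct transcription into code. A numpy sketch (the data here are simulated white noise, my own example; note the divisor T rather than T − h, as in the notes):

```python
import numpy as np

def sample_acov(x, h):
    """K_hat(h) = (1/T) sum_{t=1}^{T-h} (x_{t+h} - xbar)(x_t - xbar).
    The divisor T (not T - h) keeps the estimator positive semi-definite."""
    T = len(x)
    d = x - x.mean()
    return np.dot(d[h:], d[:T - h]) / T

def sample_acf(x, h):
    return sample_acov(x, h) / sample_acov(x, 0)

rng = np.random.default_rng(7)
x = rng.normal(size=2000)  # white noise, so rho(h) should be near 0 for h != 0
```

By construction sample_acf(x, 0) = 1, and for white noise the estimates at nonzero lags sit near zero.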
Examples of estimated autocovariance functions are in Figure 1.3 for the white noise
process, the random walk, the moving average process with a window of length 9,
and the autoregressive process Xt = Xt−1 − 0.2Xt−2 + wt . However, we note that
the random walk and this AR(2) process are not stationary. Thus, even though we
can compute the autocovariance in R with the acf function, we must consider its
validity.
The sample autocovariance is defined in such a way to make the function positive semi-definite. This ensures that estimates of variances will never be negative as, for any real vector a = (a_1, …, a_T) ∈ R^T, the variance estimate for a · X = Σ_{t=1}^T a_t X_t is
Σ_{t=1}^T Σ_{s=1}^T a_t a_s K̂(|t − s|),
[Figure 1.3 here: four ACF-against-Lag panels (White Noise, Random Walk, MA9, AR2).]
Figure 1.3: Estimated autocorrelations from the processes from Figures 1.1 and 1.2
using the acf function in R.
For h ≠ 0, this has zero mean as E(w_t w_s) = 0 for s ≠ t, and has a variance of
Var((1/T) Σ_{t=1}^{T−h} w_{t+h} w_t) = (1/T²) Σ_{s,t=1}^{T−h} E[w_{s+h} w_s w_{t+h} w_t] = (1/T²) Σ_{t=1}^{T−h} E[w_{t+h}² w_t²] = ((T − h)/T²) σ⁴ ≈ σ⁴/T
for large T. Thus, the variance of the autocorrelation is approximately 1/T, and we would expect the values ρ̂_X(h) to be within two standard deviations of the mean, or ±2/√T, when X_t is a white noise process. Hence, we can examine the
autocorrelations to determine whether or not the process looks like white noise.
Chapter 2
Introduction
Two main goals that traverse much of statistics are fitting models to data and using
such models for prediction or forecasting. In this chapter, we consider classic linear
regression (briefly) to discuss its uses and its shortcomings with respect to time series
data. Then, we discuss more sophisticated models for time series of the ARIMA
family of models.
As mentioned in the previous chapter, we need to transform a time series Xt
into a time series Yt that is stationary. This is because stationarity allows us to do
estimation from a single series. The hard part is to determine how to extract such a stationary Y_t from the series X_t. Before continuing, we need some definitions.
Definition 2.0.1 (Backshift Operator). The backshift operator B acts on time se-
ries by BXt = Xt−1 . This can be iterated to get B k Xt = Xt−k . This can also be
inverted to get the forward shift operator B −1 Xt = Xt+1 . Thus, B −1 B = BB −1 = I
the identity operator.
Definition 2.0.2 (Difference Operator). The difference operator ∇ acts on time
series by ∇Xt = Xt − Xt−1 . Note that ∇Xt = (1 − B)Xt . This operator can also be
iterated to get
∇^k X_t = (1 − B)^k X_t = Σ_{i=0}^k (−1)^i \binom{k}{i} X_{t−i}.
For example, the second difference operator is ∇2 Xt = Xt − 2Xt−1 + Xt−2 .
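The difference operator corresponds to numpy's diff, and the binomial expansion can be verified directly. A small sketch on a quadratic trend (my own example), which the second difference reduces to a constant:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])  # x_t = (t + 1)^2, a quadratic trend

d1 = np.diff(x)        # first difference, i.e. (1 - B) x_t
d2 = np.diff(x, n=2)   # second difference, i.e. (1 - B)^2 x_t

# Binomial expansion: (1 - B)^2 x_t = x_t - 2 x_{t-1} + x_{t-2}.
d2_formula = x[2:] - 2 * x[1:-1] + x[:-2]
```

The second difference of a degree-2 polynomial trend is constant (here, 2 everywhere), illustrating why differencing is used to remove polynomial trends.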
Differencing a series can be used to remove periodic trends. Fun fact: using the
gamma function, we can also consider fractional differencing for κ ∈ R+ , which is
∇^κ X_t = Σ_{i=0}^∞ (−1)^i [Γ(κ + 1) / (i! Γ(κ − i + 1))] X_{t−i}.
2.1 Regression
2.1.1 Linear Regression in Brief
In linear regression, we consider modelling a response variable Xt by some indepen-
dent variables or predictors zt,1 , . . . , zt,p as
Xt = β0 + β1 zt,1 + . . . + βp zt,p + εt
where the β_i are unknown fixed parameters and the ε_t are mean zero uncorrelated errors with common variance. Written in vector-matrix form, we have X = Zβ + ε where X ∈ R^T and Z ∈ R^{T×(p+1)}. Given this setup, the Gauss–Markov theorem tells us that the least squares estimator, β̂ = (Z^T Z)^{−1} Z^T X, is the best linear unbiased estimator for β.
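The least squares estimator above can be sketched in a few lines of numpy; the design (an intercept plus a linear trend) and the coefficients are my own illustrative choices, not data from the notes:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
t = np.arange(T, dtype=float)

# Design matrix Z: intercept column plus one predictor z_{t,1} = t.
Z = np.column_stack([np.ones(T), t])
beta_true = np.array([2.0, 0.5])
X = Z @ beta_true + rng.normal(scale=0.1, size=T)

# Normal equations (Z^T Z)^{-1} Z^T X, and the numerically stabler lstsq route.
beta_ne = np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
```

Solving the normal equations and calling lstsq give the same answer; lstsq is preferred in practice because it avoids explicitly forming Z^T Z.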
[Figure 2.1 here: four panels (Monthly US Prescription Drug Costs with fitted line, ACF for Prescrip, residual series, residual ACF).]
Figure 2.1: Plotted time series of prescription drug costs with fitted regression
line (top left). The residuals for the series plotted (bottom left). The estimated
autocorrelation functions for the original series (top right) and the residual series
(bottom right).
By removing this trend and focusing on the residuals, we see the bottom two
plots from Figure 2.1. Here, the estimated autocorrelation does not seem as extreme.
However, there are still some periodic patterns that emerge and that will need to be
dealt with.
To the residual process, we can fit a linear model using sines and cosines, which
gives the fitted model
The F-statistic of 33.36 with degrees of freedom (2, 65) is also very significant (p-
value ≈ 10−10 ), and the R summary for the fitted model is as follows.
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     0.02        0.06     0.34     0.732
sin            -0.09        0.08    -1.07     0.290
cos            -0.65        0.08    -8.07  2.24e-11
In Figure 2.2, we have the residuals for this trigonometric model as well as the
estimated autocorrelation. In the ACF plot, we see a large spike at lag = 0.1.
Hence, considering the first difference operator applied to this process, we have the
bottom two plots in Figure 2.2. Now the estimated autocorrelation is looking much
more like white noise.
Xt = β̂0 + β̂1 Yt .
2.2 Smoothing
Smoothing methods are a large class of statistical tools that can be applied to noisy
data. While much theoretical work has gone into understanding these methods, we
will just present the main idea briefly.
[Figure 2.2 here: five panels (trig model fit, trig residuals, trig-residual ACF, first differences of the trig residuals, first-difference ACF).]
Figure 2.2: The trig model fitted to the residuals (top left). The residuals for the
trig model (top right). The ACF for the trig residual model (middle). The first
differences of the trig residuals (bottom left). The ACF for the first differences of
the trig residuals (bottom right).
Kernel Smoothing
A kernel function κ_h(x, x_0) is a non-negative function that is decreasing in |x − x_0| and has a bandwidth parameter h. We also require that ∫_R κ_h(x, x_0) dx < ∞ for all x_0 ∈ R. Examples include
where the notation (. . .)+ means take the positive part and set the rest to zero.
In the time series context, we can use a kernel to construct weights to be used
in the previously mentioned moving average smoother. Specifically, for some r ∈ N,
we can define θi for i = −r, . . . , 0, . . . , r as
θ_i = κ_h(i, 0) / Σ_{j=−r}^r κ_h(j, 0).
Kernel-based methods also occur in probability density estimation (the kernel den-
sity estimator) and in linear regression (Nadaraya-Watson) as well as others. Note
further that kernel based estimators typically are biased estimators and as the band-
width h increases, the bias increases while the variance decreases. As a result, much
research has gone into bandwidth selection. This can be implemented in R via the
ksmooth function.
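The kernel-weight construction above is easy to sketch; this uses a Gaussian kernel, which is my own choice of example (any non-negative, decreasing-in-distance kernel would do):

```python
import numpy as np

def gaussian_kernel(x, x0, h):
    # One standard choice: non-negative, decreasing in |x - x0|, bandwidth h.
    return np.exp(-0.5 * ((x - x0) / h) ** 2)

def kernel_weights(r, h):
    # theta_i = kappa_h(i, 0) / sum_{j=-r}^{r} kappa_h(j, 0)
    i = np.arange(-r, r + 1)
    k = gaussian_kernel(i, 0.0, h)
    return k / k.sum()

theta = kernel_weights(r=3, h=2.0)  # 7 symmetric weights summing to one
```

The weights are symmetric about lag 0, peak at the centre, and sum to one, so they define a valid moving average smoother.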
Lowess
Lowess is an acronym for locally weighted scatterplot smoothing. This method
combines nearest neighbours and weighted linear regression. Effectively, it takes
a window of data points, say Xt−r , . . . , Xt+r , and applies a low degree polynomial
regression to it. This can be implemented in R by the function lowess. The R
manual page states that lowess “is defined by a complex algorithm”.
These points are referred to as the knots. Then, a separate polynomial—typically
cubic but could have another degree instead—is fit to each subinterval of approxi-
mately T /k data points. That is, we fit a linear model
M_t^{(i)} = β_{i,0} + β_{i,1} t + β_{i,2} t² + β_{i,3} t³
to the ith interval. As a result, this can be written as a least squares estimation
problem where the M̂t are the fitted values that minimize
Σ_{t=1}^T (X_t − M_t)².
Then, we have
∫ (M_s″)² ds = β^T Ω β,
which gives the penalized least squares estimator β̂ = (F^T F + λΩ)^{−1} F^T X.
1 Recall that the ridge estimator for the model Y = Xβ + ε is β̂ = (X^T X + λI)^{−1} X^T Y.
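The penalized solve (F^T F + λΩ)^{−1} F^T X can be sketched numerically; taking Ω = I recovers the ridge estimator from the footnote. The basis matrix F and the data here are synthetic stand-ins, not the spline basis of the notes:

```python
import numpy as np

rng = np.random.default_rng(9)
T, p = 100, 4
F = rng.normal(size=(T, p))            # synthetic stand-in for a basis matrix
X = F @ np.ones(p) + rng.normal(scale=0.1, size=T)

lam = 0.5
Omega = np.eye(p)                      # Omega = I recovers the ridge estimator

# beta_hat = (F^T F + lam * Omega)^{-1} F^T X
beta_hat = np.linalg.solve(F.T @ F + lam * Omega, F.T @ X)
beta_ols = np.linalg.solve(F.T @ F, F.T @ X)  # the lam = 0 limit
```

With Ω = I the penalty shrinks the coefficient vector toward zero relative to ordinary least squares, which is the usual bias-variance trade-off controlled by λ.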
2.3 ARIMA Models for Time Series
In the previous section, we considered regression models like
Xt = f (t) + wt
where f (t) is deterministic and wt is white noise. The goal was to estimate the
deterministic piece by some fˆ(t). But such an approach cannot handle time series
models like the AR(1) process Xt = Xt−1 + wt . In this section, we take a closer look
at fitting such time series models to observed data.
Note that we choose X_t to have zero mean for convenience. If X_t has mean µ ≠ 0, then we can rewrite X_t = X̃_t + µ where X̃_t is mean zero to get
X̃_t + µ = w_t + Σ_{i=1}^p φ_i (X̃_{t−i} + µ)
X̃_t = −µ(1 − Σ_{i=1}^p φ_i) + w_t + Σ_{i=1}^p φ_i X̃_{t−i}.
That is, we can rewrite Xt as a mean zero AR(p) process with an added constant.
A major aspect to consider when analyzing AR processes is causality. We have
seen that the AR(1) process with |φ1 | < 1 has a causal representation as a linear
process. Specifically,
X_t = φ_1 X_{t−1} + w_t = Σ_{i=0}^∞ φ_1^i w_{t−i}.
This process was also shown to be stationary. However, when φ1 = 1, we have a
random walk which is not stationary. Writing the random walk as a linear process
gives a series that does not converge.
Similarly, we can consider the setting of an AR(1) process with |φ1 | > 1, which
will grow exponentially fast. However, we can still write this process in the form of
a non-causal linear process.
X_t = φ_1^{−1}(φ_1^{−1} X_{t+2} − φ_1^{−1} w_{t+2}) − φ_1^{−1} w_{t+1}
= φ_1^{−2} X_{t+2} − φ_1^{−2} w_{t+2} − φ_1^{−1} w_{t+1} = … = −Σ_{i=1}^∞ φ_1^{−i} w_{t+i},
which is a linear process with reverse causality.2 Using this linear process represen-
tation, we can compute the stationary autocovariance to be
K_X(τ) = E[Σ_{i,j=1}^∞ φ_1^{−(i+j)} w_{t+τ+i} w_{t+j}]
= σ² Σ_{i,j=1}^∞ φ_1^{−(i+j)} 1[τ + i = j] = σ² φ_1^{−|τ|} Σ_{i=1}^∞ φ_1^{−2i} = σ² φ_1^{−|τ|} φ_1^{−2} / (1 − φ_1^{−2}).
Thus, these two processes are stochastically equivalent—i.e. for any finite collection
of time points t1 , . . . , tk , the vectors (Xt1 , . . . , Xtk ) and (Yt1 , . . . , Ytk ) are equal in
distribution. Thus the non-causal AR(1) process with |φ| > 1 has an equivalent
causal representation.
2 Note: the summation starts from i = 1 instead of i = 0.
We can extend this idea to general AR(p) processes in order to rewrite a recursively defined AR(p) process as a stationary linear process. For some AR operator,
we have the general form
Φ(B)Xt = wt .
If the operator Φ(B) is invertible, then we can simply write the linear process form
Xt = Φ−1 (B)wt .
But then we have to determine if the inverse operator exists and what its form is.
Reconsidering the AR(1) process above, we write it as (1 − φ1 B)Xt = wt . Con-
sidering the complex polynomial Φ(z) = 1 − φ1 z for z ∈ C, we note that
Φ^{−1}(z) = 1/(1 − φ_1 z) = 1 + Σ_{j=1}^∞ φ_1^j z^j,
which converges for |z| ≤ 1 when |φ_1| < 1.
For the general AR(p) process, consider the complex polynomial Φ(z) = 1 − φ_1 z − … − φ_p z^p and recall that this can be factored over C into Φ(z) = −φ_p (z − r_1) ⋯ (z − r_p) where r_1, …, r_p are the roots.3 Then, noting that (−1)^{p+1} φ_p Π_{j=1}^p r_j = 1, we can write
Φ^{−1}(z) = 1 / [(1 − r_1^{−1} z) ⋯ (1 − r_p^{−1} z)].
Now, assuming further that all of the roots |ri | > 1, we can write Φ(B)Xt = wt as
a causal linear process
X_t = Φ^{−1}(B) w_t = Π_{i=1}^p (1 + Σ_{j=1}^∞ r_i^{−j} B^j) w_t.
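The root check and the resulting linear-process weights can be sketched numerically. Here the AR(2) coefficients are my own example; the ψ weights come from the recursion implied by Φ(B)Ψ(B) = 1, with the product-of-geometric-series form serving as the theory behind it:

```python
import numpy as np

phi = np.array([0.5, 0.3])  # assumed example: X_t = 0.5 X_{t-1} + 0.3 X_{t-2} + w_t

# Phi(z) = 1 - 0.5 z - 0.3 z^2; np.roots wants highest-degree coefficients first.
roots = np.roots(np.concatenate([-phi[::-1], [1.0]]))
causal = bool(np.all(np.abs(roots) > 1))  # all roots outside the unit disk

# psi_j from Phi(B) Psi(B) = 1: psi_0 = 1 and psi_j = sum_i phi_i psi_{j-i}.
n = 50
psi = np.zeros(n)
psi[0] = 1.0
for j in range(1, n):
    for i, ph in enumerate(phi, start=1):
        if j - i >= 0:
            psi[j] += ph * psi[j - i]

# Sanity check: convolving the coefficients of Phi and Psi gives (1, 0, 0, ...).
prod = np.convolve(np.concatenate([[1.0], -phi]), psi)[:n]
```

Because both roots lie outside the unit disk, the ψ weights decay geometrically, consistent with an absolutely summable causal representation.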
where w_t is white noise with variance σ² and θ_1, …, θ_q ∈ R are constants with θ_q ≠ 0. Using the backshift operator B, we can write the MA(q) process as X_t = Θ(B)w_t
where
Θ(B) = 1 + Σ_{j=1}^q θ_j B^j.
Unlike for autoregressive processes, we already have the MA(q) process written
in the form of a linear process. Hence, it will be stationary for any choice of the θj .
However, similar to how we were able to find a causal AR process that is equivalent
to a non-causal one, there is a uniqueness problem with the MA process that needs
to be addressed.
For simplicity, consider the MA(1) process Xt = wt + θ1 wt−1 with wt white noise
with variance σ 2 . This has mean zero and stationary autocovariance
K_X(τ) = (1 + θ_1²)σ² for τ = 0, K_X(τ) = θ_1 σ² for τ = 1, and K_X(τ) = 0 for τ ≥ 2.
Alternatively, we note that the process Y_t = v_t + θ_1^{−1} v_{t−1}, with v_t white noise with variance θ_1² σ², is also mean zero with the same autocovariance as X_t. Hence, if the white noise processes are Gaussian, then X_t and Y_t are stochastically equivalent.
This can certainly cause trouble in a statistics context: if we were to estimate the parameters for the MA(1) model, which parameters would we be estimating?
To choose a specific representation for the MA process, we consider which one is
invertible. That is, which process can be written as a causal AR process for white
noise in terms of Xt ? Starting with the general form, we have Xt = Θ(B)wt . If Θ(B)
is invertible, then we can write wt = Θ−1 (B)Xt . Using the above MA(1) process as
an example, we can express the white noise process as
w_t = Σ_{i=0}^∞ (−1)^i θ_1^i X_{t−i} or v_t = Σ_{i=0}^∞ (−1)^i θ_1^{−i} Y_{t−i}.
Thus, as only one of θ1 and θ1−1 can be less than 1 in magnitude, only one of the
above series is convergent in the mean squared sense. That process will be the
invertible one. Note that wt is equal in distribution to θ1−1 vt . Note also that if
θ1 = 1 then we do not have invertibility.
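The summability contrast between the two representations is easy to see numerically; a short sketch with θ = 0.5, an illustrative value of mine:

```python
import numpy as np

theta = 0.5  # invertible, since |theta| < 1

# AR(infinity) weights of w_t = sum_i (-1)^i theta^i X_{t-i}: geometric in theta.
i = np.arange(60)
pi_inv = (-theta) ** i         # absolutely summable: sum |pi_i| -> 1 / (1 - theta)
pi_noninv = (-1 / theta) ** i  # weights built from theta^{-1}: they blow up

sum_inv = np.sum(np.abs(pi_inv))
```

Only the weights built from the parameter with |θ| < 1 are absolutely summable, which is exactly why that representation is singled out as the invertible one.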
Definition 2.3.3 (Autoregressive Moving Average Process). The time series Xt is
an ARMA(p, q) process if Xt has zero mean and if we can write it as
X_t = w_t + Σ_{i=1}^p φ_i X_{t−i} + Σ_{j=1}^q θ_j w_{t−j}
The first thing to note is that similar to the introduction of the AR process, we
assume in the definition that X_t has zero mean. If instead it has a mean µ ≠ 0, we
can subtract off the mean to get
Φ(B)(X_t − µ) = Θ(B)w_t
Φ(B)X_t = µ(1 − Σ_{i=1}^p φ_i) + Θ(B)w_t.
A separate issue is parameter redundancy: multiplying both sides by any common polynomial η(B) yields the equivalent model
η(B)Φ(B)X_t = η(B)Θ(B)w_t.
For example, we can consider the white noise process Xt = wt and, for some |θ| < 1,
the equivalent process
(1 − θB)Xt = (1 − θB)wt
Xt = θXt−1 − θwt−1 + wt .
This may look like a more complex ARMA process, but is in fact just white noise in
disguise. To address this problem, we only want to consider AR and MA operators
that are relatively prime. That is, for z ∈ C, we want the polynomials
Φ(z) = 1 − φ_1 z − … − φ_p z^p, and
Θ(z) = 1 + θ_1 z + … + θ_q z^q
to not have any common roots. In the case that Θ is invertible, we can write the
ARMA process as
[Φ(B)/Θ(B)] X_t = w_t.
Thus, in this form, we see that common factors in Φ and Θ can be cancelled out.
When we write Xt in this way, it is said to be invertible if we have
w_t = [Φ(B)/Θ(B)] X_t = X_t + Σ_{j=1}^∞ π_j X_{t−j}
where Σ_{j=1}^∞ |π_j| < ∞. Hence, returning to the previous discussion on the MA
processes, we want to write out the process as a convergent series. Considering the
MA polynomial Θ(z) for z ∈ C, the ARMA process Xt is invertible if and only if all
of the roots of Θ(z) lie outside of the unit disk D = {z : |z| ≤ 1}.
Similarly, we can write the ARMA process in the form of a stationary linear
process
X_t = [Θ(B)/Φ(B)] w_t.
However, this process may not be causal. A causal process as discussed before can
be written as
X_t = w_t + Σ_{j=1}^∞ ψ_j w_{t−j}
with Σ_{j=1}^∞ |ψ_j| < ∞. A necessary and sufficient condition for causality in an ARMA
process is to have an autoregressive polynomial Φ(z) such that all of its roots lie
outside of the unit disk D = {z : |z| ≤ 1}.
In summary, an ARMA process Xt is
1. causal if the roots r_1, …, r_p of Φ(z) satisfy |r_i| > 1;
Proof of Causality. Let the roots of Φ(z) be r1 , . . . , rp . First, assume that the roots
are all outside of the unit disk, and without loss of generality, are ordered so that
1 < |r_1| ≤ … ≤ |r_p|. Then, let |r_1| = 1 + ε for some ε > 0. This implies that Φ^{−1}(z) exists and has a power series expansion
Φ^{−1}(z) = Σ_{j=0}^∞ a_j z^j,
which converges for |z| < 1 + ε; fix any δ ∈ (0, ε).
As this series converges, we know that there exists a constant c > 0 such that
|a_j (1 + δ)^j| < c for all j. Hence, |a_j| < c(1 + δ)^{−j}. Thus,
Σ_{j=0}^∞ |a_j| < c Σ_{j=0}^∞ (1 + δ)^{−j} < ∞,
and the sequence of aj is absolutely summable. This implies that for the ARMA
process Φ(B)Xt = Θ(B)wt , we can write
X_t = Φ^{−1}(B)Θ(B)w_t = w_t + Σ_{j=1}^∞ ψ_j w_{t−j}.
Since the aj are absolutely summable, so are the coefficients ψj . Thus, we have that
Xt is a causal process.
For the reverse, assume that X_t, defined by Φ(B)X_t = Θ(B)w_t, is a causal process. That is, we can write
X_t = w_t + Σ_{j=1}^∞ ψ_j w_{t−j} and Σ_{j=1}^∞ |ψ_j| < ∞.
As a result, we can write Xt = Ψ(B)wt and also Φ(B)Xt = Θ(B)wt . Equating the
two right hand expressions, we have
Θ(B)wt = Φ(B)Ψ(B)wt .
2.3.4 ARIMA
Often, we do not have an ARMA process but an ARMA process with some determin-
istic trend. Thus, the process is not stationary, but often can be transformed into a
stationary process via the differencing operator. For example, if X_t is a stationary process, and we have Y_t defined by
Y_t = β_0 + β_1 t + X_t,
then differencing removes the linear trend:
∇Y_t = β_1 + X_t − X_{t−1},
which is a stationary series. Recall that the dth-order difference is
∇^d X_t = (1 − B)^d X_t,
As before, we assume in the definition that ∇^d X_t has zero mean. If instead it has a mean µ ≠ 0, we write Φ(B)(1 − B)^d X_t = µ(1 − Σ_{i=1}^p φ_i) + Θ(B)w_t. For example, if we have X_t = β_0 + β_1 t + φX_{t−1} + w_t for β_0, β_1 ∈ R and |φ| < 1, then
∇X_t = X_t − X_{t−1}
= [β_0 + β_1 t + φX_{t−1} + w_t] − [β_0 + β_1 (t − 1) + φX_{t−2} + w_{t−1}]
= β_1 + φ∇X_{t−1} + w_t − w_{t−1}.
That is,
(1 − φB)∇X_t = β_1 + (1 − B)w_t.
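This differencing identity holds term by term, so it can be verified exactly on a simulated path; the parameter values below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, phi = 1.0, 0.2, 0.7  # illustrative values
T = 300
w = rng.normal(size=T)

x = np.zeros(T)
for t in range(1, T):
    x[t] = beta0 + beta1 * t + phi * x[t - 1] + w[t]

d = np.diff(x)  # the differenced series: d[t-1] = X_t - X_{t-1}

# Term-by-term check of (1 - phi B)(grad X_t) = beta1 + (1 - B) w_t for t >= 2.
lhs = d[1:] - phi * d[:-1]
rhs = beta1 + w[2:] - w[1:-1]
```

The two sides agree to floating-point precision at every time point, confirming that one difference turns the trending series into an ARMA-type equation in ∇X_t.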
2.4.1 Box-Pierce and Ljung-Box Tests
The R function Box.test() in the stats package performs both the Box-Pierce
and Ljung-Box tests.
For a stationary time series X_t, we denote the estimated autocorrelations to be ρ̂_X(h) at lag h. If we assume that the true autocorrelations are zero, i.e. ρ_X(h) = 0 for all h ≠ 0, then we have white noise. Instead of visually looking at a plot of the
autocorrelation, we can use the Box-Pierce test to test for non-zero correlations by
combining the estimated autocorrelations at lags 1, . . . , h for some user chosen value
h. We aim to test the hypotheses
which will be approximately χ2 (h) under H0 . Recall, however, that the ability to
estimate ρ̂X becomes harder for large lags especially if the data size is small. Hence,
h should not be set to be too large in practice.
Another version of this test is the Ljung-Box test, which has a similar form to the Box-Pierce test and the same approximate χ²(h) distribution. The alternative form is supposed to give a distribution under H_0 that is closer to the desired χ²(h).
The test statistic is
Q_LB = n(n + 2) Σ_{j=1}^h ρ̂_X(j)² / (n − j).
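The statistic is straightforward to compute by hand; a numpy sketch on simulated white noise (seed and sample size are my own choices; in practice R's Box.test() does this and returns a p-value):

```python
import numpy as np

def ljung_box(x, h):
    """Q_LB = n (n + 2) * sum_{j=1}^{h} rho_hat(j)^2 / (n - j)."""
    n = len(x)
    d = x - x.mean()
    acov0 = np.dot(d, d) / n
    q = 0.0
    for j in range(1, h + 1):
        rho_j = (np.dot(d[j:], d[:n - j]) / n) / acov0
        q += rho_j ** 2 / (n - j)
    return n * (n + 2) * q

rng = np.random.default_rng(13)
q_white = ljung_box(rng.normal(size=500), h=10)  # approx chi^2(10) under H0
```

For white noise, the statistic should be an unremarkable draw from roughly a χ²(10) distribution; values far in the right tail would suggest non-zero autocorrelations.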
autocorrelations among the residuals. Similarly, the Durbin-Watson test tests for
autocorrelations of order 1 among the residuals of a linear model.
Considering the linear model
X_t = β_0 + β_1 t + … + β_p t^p + r_t
and fitting an order-1 autoregression to the residuals,
r̂_t = ρ r̂_{t−1} + w_t,
If this test statistic is close to zero, it implies that r̂t and r̂t−1 are close in value
indicating a strong positive autocorrelation of order 1. In contrast, if the test statis-
tic is large (close to the max of 4), then it indicates that there is a strong negative
autocorrelation of order 1. Otherwise, a test statistic near 2 indicates no autocorrelation of order 1. In the R function dwtest() in the lmtest package, p-values are computed for this statistic. The documentation claims that "under the assumption of normally distributed [errors], the null distribution of the Durbin-Watson statistic is the distribution of a linear combination of chi-squared variables." Furthermore, for large sample sizes, this code apparently switches to a normal approximation for the p-value computation.
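The statistic itself is just a ratio of sums, so the behaviour described above is easy to reproduce; a sketch with simulated residuals (the AR(1) coefficient 0.9 is my own example):

```python
import numpy as np

def durbin_watson(r):
    # DW = sum_t (r_t - r_{t-1})^2 / sum_t r_t^2, which lies in [0, 4];
    # values near 2 indicate no order-1 autocorrelation.
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

rng = np.random.default_rng(17)
dw_white = durbin_watson(rng.normal(size=2000))  # expect a value near 2

# Strong positive order-1 autocorrelation drives the statistic toward 0.
w = rng.normal(size=2000)
r_pos = np.zeros(2000)
for t in range(1, 2000):
    r_pos[t] = 0.9 * r_pos[t - 1] + w[t]
dw_pos = durbin_watson(r_pos)
```

Roughly, DW ≈ 2(1 − ρ̂(1)), which is why independent residuals land near 2 and strongly positively correlated residuals land near 0.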
2.4.4 Augmented Dickey-Fuller Test
Switching away from testing for non-zero autocorrelations, we now consider testing
for stationarity or non-stationarity of a time series. These tests are often referred
to as unit root tests, because—recalling the previous sections—if the autoregressive
operator Φ has a unit root, then the process is not stationary. Hence, these tests
aim to determine whether or not a unit root exists based on some observed data.
The Dickey-Fuller test performs such a unit root test for AR(1) models. In this case, the null hypothesis is that Φ(z) has a root |r| = 1, and the alternative is that |r| > 1. If
Xt = φXt−1 + wt
then the first difference can be written as
∇Xt = (φ − 1)Xt−1 + wt .
where the coefficients are φ′_1 = Σ_{j=1}^p φ_j − 1 and φ′_i = −Σ_{j=i}^p φ_j for i > 1. Thus, if 1 is a root, i.e. if Φ(1) = 0, then that implies that φ′_1 = 0. Hence, we can perform a similar test to the Dickey-Fuller test above.
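The core regression of the AR(1) Dickey-Fuller idea, regressing ∇X_t on X_{t−1} so the slope estimates φ − 1, can be sketched as follows (the proper test also needs the non-standard null distribution of the statistic, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(19)
T = 2000
w = rng.normal(size=T)

def df_slope(x):
    # Regress grad X_t on X_{t-1}; the slope estimates phi - 1 (0 under the null).
    dx = np.diff(x)
    xlag = x[:-1]
    return np.dot(xlag, dx) / np.dot(xlag, xlag)

# Stationary AR(1) with phi = 0.5: the slope should sit near -0.5.
x_stat = np.zeros(T)
for t in range(1, T):
    x_stat[t] = 0.5 * x_stat[t - 1] + w[t]

# Random walk (phi = 1): the slope should sit near 0.
x_rw = np.cumsum(w)
```

The slope separates the two regimes clearly; turning it into a p-value requires the Dickey-Fuller tables, which is what adf.test() handles.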
In R, the Augmented Dickey-Fuller test is implemented in the function adf.test()
in the tseries package. In this version of the test, a constant and a linear term are
first assumed and the residual process is run through the above test. That is, we
consider the model
X_t = β_0 + β_1 t + Σ_{i=1}^p φ_i X_{t−i} + w_t
with a deterministic linear trend. The R function also requires the user to choose
how many lags to use when estimating the parameters φ̂.
2.4.5 Phillips–Perron test
An alternative to the Augmented Dickey-Fuller test is the Phillips-Perron test. The
set up is the same, but the test is designed to be more robust to deviations in
the assumptions. In R, this test can be performed by the function pp.test() in
the tseries package. The test statistic is more complicated, and the p-value is
computed via a table of values and linear interpolation. The documentation also
points out that the Newey-West estimator is used for the variance, which is a robust estimator of the covariance in linear regression when the classic assumptions of homoscedastic uncorrelated errors are violated.
Definition 2.5.1 (Partial Autocorrelation). Let Xt be a stationary process, then
the partial autocorrelation at lag h is
ϕ_X(h) = ρ_X(1) if h = 1, and ϕ_X(h) = corr(X_{t+h} − X̂_{t+h}, X_t − X̂_t) if h > 1
where X̂t+h and X̂t are the result of regressing each respective term on all of the
intermediate terms. That is,
X̂t+h = β1 Xt+h−1 + . . . + βh−1 Xt+1
X̂t = β1 Xt+1 + . . . + βh−1 Xt+h−1
where the intercept term is excluded as Xt is assumed to be mean zero. Note that
due to stationarity of Xt and the symmetry of the autocorrelation function, the β’s
above are the same coefficients. Lastly, if Xt is a Gaussian process, we can write
ϕX (h) = corr(Xt+h , Xt |Xt+h−1 , . . . , Xt+1 )
for h > 1.
The roots of the characteristic polynomial will tell us about the behaviour of the
process X_t. For 1 − φ_1 z − φ_2 z², we denote the two roots as z_1 and z_2. Recall that |z_i| > 1 as we assume X_t is causal. There are three possible settings to consider:5
1. if z_1 ≠ z_2 and the roots are real, then we have the solution to the second order difference equation
ρ(h) = c_1 z_1^{−h} + c_2 z_2^{−h}
where c_1 and c_2 are two constants such that c_1 + c_2 = 1.
2. if z_1 = z_2 and necessarily real, then we have ρ(h) = z_1^{−h}(c_1 + c_2 h).
3. if z_1 = z̄_2 are complex conjugate roots, then
ρ(h) = c_1 z_1^{−h} + c̄_1 z̄_1^{−h}
= |c_1||z_1|^{−h} (e^{−ib} e^{−iθh} + e^{ib} e^{iθh})
= 2|c_1||z_1|^{−h} cos(θh + b)
In all three cases, we have the autocorrelation ρ(h) decaying exponentially to zero.
The rate of decay depends on the magnitude of the roots. Furthermore, if the roots
are complex, then there is periodic behaviour in the process.
This can be extended to the AR(p) process where we have a pth order difference
equation for ρ. The resulting solution will look like
ρ(h) = z1−h f1 (h) + . . . + zr−h fr (h)
where z1 , . . . , zr are the unique roots with multiplicities m1 , . . . , mr such that
m1 + . . . + mr = p, and where fi (h) is a polynomial in h of degree mi − 1.
Thus, unlike the AR process, the autocorrelation for the MA(q) process is zero for
lags h > q, and so the ACF can be used to identify the order of the process.
5
For more details, see a textbook on difference equations.
2.5.3 PACF for AR(p)
To introduce why the partial correlation is of interest, we first consider the causal
AR(1) process Xt = φXt−1 + wt . From before, we saw that ρX (h) = φh , so in
particular ρX (2) = φ2 , which is nonzero whenever φ is. In contrast, if we consider
cov (Xt − φXt−1 , Xt−2 ) = cov (wt , Xt−2 ) = 0 by causality.
Hence when taking Xt−1 into account, the autocorrelation between Xt and Xt−2 is
zero.
To properly consider the partial autocorrelation, we need to compute the least
squares estimator X̂t for Xt based on previous time points. For example, to compute
ϕ(2), we take X̂t = β̂Xt−1 where β̂ is chosen to minimize E(Xt − βXt−1 )2 .
By taking the derivative with respect to β, we can find the critical point β̂ =
KX (1)/KX (0). Similarly, for X̂t−2 = β̂Xt−1 , we have
E(Xt−2 − βXt−1 )2 = E[Xt−2 2 ] − 2βE[Xt−2 Xt−1 ] + β 2 E[Xt−1 2 ] = (1 + β 2 )KX (0) − 2βKX (1)
as before. In the case of the AR(1) model, we have β̂ = φ. Thus, we have from
before that ϕ(1) = φ and ϕ(2) = 0 and, in fact, ϕ(h) = 0 for h ≥ 2.
For the general AR(p) process, Xt = wt + φ1 Xt−1 + . . . + φp Xt−p , we have a similar
set up. For lags h > p, the least squares estimator is X̂t+h = φ1 Xt+h−1 + . . . + φp Xt+h−p ,
so that Xt+h − X̂t+h = wt+h , which is uncorrelated with Xt − X̂t ; hence ϕ(h) = 0 for
h > p. In the case that the lag is less than or equal to p, we need to determine how
to estimate the coefficients βi before computing the PACF.
2.5.4 PACF for MA(1)
For the invertible MA(1) model Xt = wt + θwt−1 , that is, with |θ| < 1, we can
write it as a convergent infinite series
Xt = wt + θXt−1 − θ2 Xt−2 + θ3 Xt−3 − . . .
in terms of the Xt−i . Then, applying similar tricks as above gives a least squares
estimator for X̂t = β̂Xt−1 to be β̂ = KX (1)/KX (0). In the case of the MA(1)
process, we have β̂ = θ/(1 + θ2 ). Hence,
cov ( Xt − θXt−1 /(1 + θ2 ), Xt−2 − θXt−1 /(1 + θ2 ) )
= KX (2) − (2θ/(1 + θ2 ))KX (1) + (θ/(1 + θ2 ))2 KX (0) = −σ 2 θ2 /(1 + θ2 ).
Thus, the partial autocorrelation at lag 2 is ϕ(2) = −θ2 /(1 + θ2 + θ4 ). This can be
extended to larger lags to show that the partial autocorrelation for the MA(1)
process decreases but does not vanish as the lag increases.
Hence, we have the following table:

          AR(p)                      MA(q)
ACF       decreases geometrically    zero for lags > q
PACF      zero for lags > p          decreases geometrically

This means that we can use the ACF and PACF to try to understand the behaviour
of a time series process.
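To make the table concrete, we can compute the theoretical ACFs directly: for a causal AR(1), ρ(h) = φh decays geometrically, while for an MA(1), ρ(1) = θ/(1 + θ2 ) and ρ(h) = 0 for h > 1. The notes use R; the following Python sketch is an illustrative stand-in:

```python
# Theoretical ACFs illustrating the AR/MA table:
# the AR(1) ACF decays geometrically; the MA(1) ACF cuts off after lag 1.

def acf_ar1(phi, h):
    """rho(h) = phi^|h| for a causal AR(1)."""
    return phi ** abs(h)

def acf_ma1(theta, h):
    """rho(1) = theta/(1 + theta^2); rho(h) = 0 for |h| > 1."""
    if h == 0:
        return 1.0
    if abs(h) == 1:
        return theta / (1 + theta ** 2)
    return 0.0

phi, theta = 0.7, 0.5
ar_acf = [acf_ar1(phi, h) for h in range(6)]
ma_acf = [acf_ma1(theta, h) for h in range(6)]
print(ar_acf)  # geometric decay: 1, 0.7, 0.49, ...
print(ma_acf)  # cutoff: 1, 0.4, 0, 0, 0, 0
```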
Chapter 3
Introduction
Thus far, we have considered many types of time series models, but have performed
little in the way of actual statistics. In this chapter, we consider two main goals of
time series models: estimating parameters and forecasting/predicting. For the first
topic, we will consider different methods for estimating parameters as well as model
selection methods to determine the best fit to the data. For the second part, we
consider the task of prediction in time series.
For an observed time series X1 , . . . , XT , we may want to fit a causal invertible
ARMA(p,q) process,
Xt = wt + φ1 Xt−1 + . . . + φp Xt−p + θ1 wt−1 + . . . + θq wt−q ,
38
X̄ = T −1 (X1 + . . . + XT ) and then consider the centred time series Xt − X̄.
For estimating the variance of the white noise process σ 2 , we can use the AR(p) equation
Xt = wt + φ1 Xt−1 + . . . + φp Xt−p to write
KX (0) = cov (Xt , Xt ) = cov (Xt , wt + φ1 Xt−1 + . . . + φp Xt−p ) = σ 2 + φ1 KX (1) + . . . + φp KX (p),
so that
σ 2 = KX (0) − φ1 KX (1) − . . . − φp KX (p).
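As a numerical check (a sketch with an arbitrarily chosen causal AR(2)), we can solve the Yule-Walker equations ρ(1) = φ1 + φ2 ρ(1) and ρ(2) = φ1 ρ(1) + φ2 for the theoretical autocorrelations and confirm that the relation above recovers σ 2 :

```python
# Yule-Walker check for a causal AR(2): X_t = w_t + phi1 X_{t-1} + phi2 X_{t-2}.
phi1, phi2, sigma2 = 0.5, 0.25, 2.0

# Theoretical autocorrelations from the Yule-Walker equations.
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2

# Variance of the process, then autocovariances K(h) = rho(h) K(0).
K0 = sigma2 / (1 - phi1 * rho1 - phi2 * rho2)
K1, K2 = rho1 * K0, rho2 * K0

# Recover the white noise variance: sigma^2 = K(0) - phi1 K(1) - phi2 K(2).
sigma2_recovered = K0 - phi1 * K1 - phi2 * K2
print(sigma2_recovered)  # 2.0
```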
Corollary 3.1.1 (Asymptotic Normality for PACF). For the causal AR(p) process,
as T → ∞,
√T ϕ̂(h) →d N (0, 1)
for lags h > p.
In the standard stats package in R, the function ar() fits an autoregressive
model to time series data, which can implement many ways to estimate the parameters,
but defaults to the Yule-Walker equations. To demonstrate it, we can use the
arima.sim() function to simulate T = 100 data points from the AR(1) process
Xt = 0.7Xt−1 + wt .
Using the Yule-Walker equations, we get φ̂1 = 0.73. Note that the ar() function
fits models for AR(1) up to AR(20) and then chooses the best with respect to AIC.
In the case of data from the AR(3) process
Xt = 0.7Xt−1 − 0.3Xt−3 + wt
we get the fitted model
Xt = 0.752Xt−1 − 0.002Xt−2 − 0.285Xt−3 .
Plots of these two processes with the estimated ACF and PACF are displayed in
Figure 3.1.
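The same Yule-Walker fit can be sketched by hand outside of R (arima.sim() and ar() are the R tools; here we simulate an AR(1) in Python and compute φ̂ = K̂X (1)/K̂X (0) directly, so the exact estimate depends on the seed):

```python
import random

# Simulate T = 500 points from X_t = 0.7 X_{t-1} + w_t with Gaussian noise.
rng = random.Random(42)
T, phi = 500, 0.7
x = [0.0]
for _ in range(T - 1):
    x.append(phi * x[-1] + rng.gauss(0, 1))

# Yule-Walker estimate for an AR(1): phi_hat = K_hat(1) / K_hat(0).
xbar = sum(x) / T
k0 = sum((v - xbar) ** 2 for v in x) / T
k1 = sum((x[t] - xbar) * (x[t - 1] - xbar) for t in range(1, T)) / T
phi_hat = k1 / k0
print(round(phi_hat, 3))  # close to the true value 0.7
```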
L(µ, φ, σ 2 ) = (1 − φ2 )1/2 (2πσ 2 )−T /2 exp( −[ (X1 − µ)2 (1 − φ2 ) + ΣTt=2 ((Xt − µ) − φ(Xt−1 − µ))2 ] / 2σ 2 )
1
Note that here, we are considering X1 as an infinite causal linear process.
Figure 3.1: Simulated AR(1) and AR(3) processes with estimated ACF and PACF.
Writing the unconditional sum of squares in the exponent as
Su (µ, φ) = (X1 − µ)2 (1 − φ2 ) + ΣTt=2 ((Xt − µ) − φ(Xt−1 − µ))2
we can take derivatives of the log likelihood to solve for the MLEs. For the variance,
∂ log(L)/∂σ 2 = ∂/∂σ 2 [ (1/2) log(1 − φ2 ) − (T /2) log(2π) − (T /2) log(σ 2 ) − Su (µ, φ)/(2σ 2 ) ]
= −T /(2σ 2 ) + Su (µ, φ)/(2σ 4 ),
which gives σ̂ 2 = Su (µ, φ)/T . However, solving for the MLEs µ̂ and φ̂ is not as
straightforward, because we would have to solve the nonlinear system of equations
0 = (1 − φ2 )(X1 − µ) + (1 − φ) ΣTt=2 (Xt − φXt−1 − µ(1 − φ))
0 = −φ/(1 − φ2 ) + (1/σ 2 ) [ φ(X1 − µ)2 + ΣTt=2 (Xt − φXt−1 − µ(1 − φ))(Xt−1 − µ) ].
This headache arises due to the starting point X1 . If we condition the likelihood on
X1 , we can simplify the problem.2
Conditioning on X1 , we have
L(µ, φ, σ 2 |X1 ) = (2πσ 2 )−(T −1)/2 exp( −(1/2σ 2 ) ΣTt=2 ((Xt − µ) − φ(Xt−1 − µ))2 ).
Thus, the MLE for the variance is σ̂ 2 = Sc (µ, φ)/(T − 1) where similarly to above
Sc is the conditional sum of squares in the exponent. We can rewrite Sc as
Sc (µ, φ) = ΣTt=2 (Xt − (α + φXt−1 ))2
where α = µ(1 − φ), which coincides with simple linear regression. Hence, for the
design matrix X ∈ R(T −1)×2 with columns 1 and Xt for t = 1, . . . , T − 1, the least
squares estimator is
(α̂, φ̂)T = (X T X)−1 X T (X2 , . . . , XT )T ,
2
What we just did above is the unconditional likelihood. What follows is the conditional likelihood
as we condition on X1 to remove the nonlinearity.
which after some computation can be reduced to
φ̂ = ΣTt=2 (Xt − X̄(2) )(Xt−1 − X̄(1) ) / ΣTt=2 (Xt−1 − X̄(1) )2
where X̄(1) = (T − 1)−1 ΣT −1t=1 Xt and X̄(2) = (T − 1)−1 ΣTt=2 Xt .3 Given φ̂, we can
then compute
µ̂ = (X̄(2) − φ̂X̄(1) )/(1 − φ̂).
To compare these estimators to the Yule-Walker estimator, we note for the
AR(1) process that
φ̂YW = K̂X (1)/K̂X (0) = ΣTt=2 (Xt − X̄)(Xt−1 − X̄) / ΣTt=1 (Xt − X̄)2 ,
which is very similar to the MLE estimator except that the MLE uses X̄(1) and X̄(2)
that are adjusted for the end points of the time series. In the limit, the two are
equivalent.
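We can check this near-equivalence numerically (a sketch: both estimators are computed by hand on one simulated AR(1) series; the exact values depend on the seed, but the difference is O(1/T )):

```python
import random

# Simulate an AR(1) series with phi = 0.6.
rng = random.Random(7)
T, phi = 400, 0.6
x = [0.0]
for _ in range(T - 1):
    x.append(phi * x[-1] + rng.gauss(0, 1))

# Yule-Walker: phi_hat = K_hat(1)/K_hat(0), using the full-series mean.
xbar = sum(x) / T
k0 = sum((v - xbar) ** 2 for v in x)
k1 = sum((x[t] - xbar) * (x[t - 1] - xbar) for t in range(1, T))
phi_yw = k1 / k0

# Conditional MLE / least squares: uses the endpoint-adjusted means.
xbar1 = sum(x[:-1]) / (T - 1)  # mean of X_1, ..., X_{T-1}
xbar2 = sum(x[1:]) / (T - 1)   # mean of X_2, ..., X_T
num = sum((x[t] - xbar2) * (x[t - 1] - xbar1) for t in range(1, T))
den = sum((x[t - 1] - xbar1) ** 2 for t in range(1, T))
phi_css = num / den

print(abs(phi_yw - phi_css))  # small: the two agree as T grows
```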
Similarly, for the mean, the Yule-Walker equations choose µ̂YW = X̄. In contrast,
the conditional MLE gives µ̂ = (X̄(2) − φ̂X̄(1) )/(1 − φ̂) as above.
For AR(p) processes, the MLE estimators can still be computed in a similar
manner, but the equations are more complex. Still, conditioning on the starting
values X1 , . . . , Xp allows for a reduction to linear regression.
where we want to make a good choice of parameters αt . To do that, we minimize
the squared error as usual:
arg min over α0 , α1 , . . . , αT of E( XT +m − α0 − ΣTt=1 αt Xt )2 .
Hence, we have X̂T +m = µ + ΣTt=1 αt (Xt − µ). Thus, we can centre the process and
consider time series with µ = 0 and α0 = 0.
For a one-step-ahead prediction, which is to estimate X̂T +1 , we solve the above
equations to get
0 = E (XT +1 − X̂T +1 )X1 = KX (T ) − ΣTt=1 αt KX (t − 1)
0 = E (XT +1 − X̂T +1 )X2 = KX (T − 1) − ΣTt=1 αt KX (t − 2)
. . .
0 = E (XT +1 − X̂T +1 )XT = KX (1) − ΣTt=1 αt KX (T − t)
Then, the above equations can be written as K = Γα or α = Γ−1 K in the case that
the inverse exists. Thus, for X = (X1 , . . . , XT )T , our one-step prediction can be
written as
X̂T +1 = αT X = K T Γ−1 X.
As like estimation for the AR process with the Yule-Walker equations, our prediction
is based on the autocovariances. If we knew what the autocovariance is—i.e. we use
K and Γ instead of K̂ and Γ̂—then the mean squared prediction error is
2 2
E XT +1 − X̂T +1 = E XT +1 − K T Γ−1 X
= E XT2 +1 − 2K T Γ−1 XXT +1 + K T Γ−1 XX T Γ−1 K
However, if we do not know a priori that we have an order-p process, then we would
have to estimate αi for all i = 1, . . . , T , which could require the inversion of a very
large matrix. Thus, for general ARMA models, we have to work harder.
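As a small worked check (assuming the AR(1) autocovariances are known exactly, and using plain Gaussian elimination for the solve), the weights α = Γ−1 K recover the intuitive AR(1) predictor X̂T +1 = φXT :

```python
# One-step prediction weights: solve Gamma alpha = K for a known AR(1).
phi, sigma2, T = 0.8, 1.0, 5
K0 = sigma2 / (1 - phi ** 2)
cov = lambda h: K0 * phi ** abs(h)

Gamma = [[cov(t - i) for t in range(1, T + 1)] for i in range(1, T + 1)]
K = [cov(T + 1 - i) for i in range(1, T + 1)]

def solve(A, b):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

alpha = solve(Gamma, K)
print([round(a, 6) for a in alpha])  # (0, 0, 0, 0, phi): predict X_{T+1} by phi X_T
```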
can be solved iteratively. This is because Γ is a Toeplitz matrix, and the Durbin-
Levinson algorithm can be used to solve a system of equations involving a Toeplitz
matrix. To do this, we need to consider a sequence of one-step-ahead predictors
X̂21 = α1,1 X1
X̂32 = α2,1 X1 + α2,2 X2
X̂43 = α3,1 X1 + α3,2 X2 + α3,3 X3
..
.
X̂TT+1 = αT,1 X1 + αT,2 X2 + . . . + αT,T XT .
We begin recursively with α1,1 = ρX (1) and P21 = KX (0)(1 − ρX (1)2 ); with no data,
the MSPE P10 = KX (0) is just the variance. Then, given coefficients αt,1 , . . . , αt,t ,
we can compute
αt+1,1 = [ ρX (t + 1) − Σti=1 αt,i ρX (i) ] / [ 1 − Σti=1 αt,i ρX (t + 1 − i) ],
the MSPE update Pt+2t+1 = Pt+1t (1 − α2t+1,1 ), and the remaining coefficients
αt+1,i = αt,i−1 − αt+1,1 αt,t+2−i for i = 2, . . . , t + 1
where
θt,t−j = ( KX (t − j) − Σj−1k=0 θj,j−k θt,t−k Pk+1k ) (Pj+1j )−1
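The Durbin-Levinson recursion can be sketched in code. This version uses the common textbook indexing in which φt,t , the coefficient on the most remote observation, is the lag-t partial autocorrelation; it is a hedged reimplementation, not the notes' exact notation:

```python
def durbin_levinson(rho, max_lag):
    """Standard Durbin-Levinson recursion on the autocorrelations rho(h).

    Returns the partial autocorrelations phi_{hh} for h = 1, ..., max_lag.
    """
    pacf = [rho(1)]
    phi = {1: rho(1)}          # phi[k] = phi_{n,k} at the current order n
    for n in range(2, max_lag + 1):
        num = rho(n) - sum(phi[k] * rho(n - k) for k in range(1, n))
        den = 1 - sum(phi[k] * rho(k) for k in range(1, n))
        phi_nn = num / den
        new_phi = {k: phi[k] - phi_nn * phi[n - k] for k in range(1, n)}
        new_phi[n] = phi_nn
        phi = new_phi
        pacf.append(phi_nn)
    return pacf

# For an AR(1) with rho(h) = phi^h, the PACF is phi at lag 1 and 0 after.
pacf_ar1 = durbin_levinson(lambda h: 0.7 ** h, 4)
print([round(p, 10) for p in pacf_ar1])  # [0.7, 0, 0, 0]

# For an MA(1) with theta = 0.5: rho(1) = theta/(1 + theta^2), 0 after.
theta = 0.5
rho_ma1 = lambda h: theta / (1 + theta ** 2) if h == 1 else 0.0
pacf_ma1 = durbin_levinson(rho_ma1, 2)
print([round(p, 10) for p in pacf_ma1])  # nonzero at both lags
```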
and
which means that we will consider each Xt predicted by the previous observations
Xt−1 , . . . , X1 .
As we assume we have a causal invertible ARMA(p,q) process, we can write it
as a linear process in the form
Xt = Σ∞i=0 ψi wt−i
where the infinite past will be convenient to assume even if the data is finite. The
distribution of Xt |Xt−1 , . . . , X1 will be normal with mean X̂tt−1 , the one-step-ahead
prediction. The variance will be E (Xt − X̂tt−1 )2 , which is also the mean squared
prediction error Ptt−1 . In the case of t = 1, we just use the variance for the linear
process, KX (0) = σ 2 Σ∞i=0 ψi2 .
From there, we can use the Durbin-Levinson algorithm to update the MSPE by
Pt+2t+1 = Pt+1t (1 − α2t+1,1 ). The precise computation is not important for our purposes,
but we can write Ptt−1 = σ 2 rt where rt does not depend on σ 2 . This allows us to
write the likelihood as
L(µ, σ 2 , φ, θ) = (2πσ 2 )−T /2 [ ΠTt=1 rt ]−1/2 exp( −S(µ, φ, θ)/(2σ 2 ) )
From all of this, we can get the MLE for the variance σ̂ 2 = S(µ̂, φ̂, θ̂)/T as a
function of the other estimators. To find those estimators, we can maximize the
concentrated likelihood, which is when we replace σ 2 with S(µ, φ, θ)/T and solve for
the MLEs µ̂, φ̂, θ̂. That is, for some constant C,
log L(µ, φ, θ) = C − (T /2) log σ̂ 2 − (1/2) ΣTt=1 log rt , or
`(µ, φ, θ) = log( S(µ, φ, θ)/T ) + (1/T ) ΣTt=1 log rt .
We could check that, for AR(p) processes—that is, without any MA part—we
recover the MLE estimates from before.
Asymptotic Distribution
Similar to the Yule-Walker equations for the AR process, we have a central limit-like
theorem for the MLE estimator for the ARMA process. If we let β = (φ1 , . . . , φp , θ1 , . . . , θq ),
then as T → ∞,
√T (β̂ − β) →d N( 0, σ 2 Γ−1p,q )
where the i, jth entry in Γφφ is KY (i − j) for the AR process Φ(B)Yt = wt , and
the i, jth entry in Γθθ is KY 0 (i − j) for the AR process Θ(B)Yt0 = wt , and the i, jth
entry in Γφθ is the cross covariance between Y and Y 0 .
Example 3.2.1 (AR(1)). For the causal AR(1) process Xt = φXt−1 + wt , we recall
that KX (0) = σ 2 /(1 − φ2 ). Therefore, Γ1,0 = (1 − φ2 )−1 and we have that
φ̂ →d N( φ, (1 − φ2 )/T ).
Example 3.2.2 (MA(1)). Similar to the AR(1) process, consider the invertible
MA(1) process Xt = θwt−1 + wt . The MA polynomial is Θ(B) = 1 + θB, so the
AR(1) process Θ(B)Yt = wt has a variance of KY (0) = σ 2 /(1 − θ2 ). Thus, we have
that
θ̂ →d N( θ, (1 − θ2 )/T ).
Note that similar to linear regression and many other areas of statistics, if we
include too many terms when fitting an ARMA process to a dataset, the standard
errors of the estimate will be larger than necessary. Thus, it is typically good to
keep the number of parameters as small as possible. Hence, the use of AIC and BIC
in the R function arima().
We assume that the ARMA process Φ(B)Xt = Θ(B)wt
is both causal and invertible as well as mean zero.4 We will also assume that we are
making a prediction based on observed data X1 , . . . , XT .
Given a sample size T , there are two possible estimators for the future point
XT +h to consider. The prediction that minimizes the mean squared error is
X̂T +h = E (XT +h | XT , . . . , X1 ) ,
while an easier predictor to work with is X̃T +h = E (XT +h | XT , XT −1 , . . .), which
conditions on the infinite past. As the data size T gets large, these two predictions
are very close.
4
As usual, we can subtract the mean to achieve this last assumption.
As we assume that the ARMA process is both causal and invertible, we can
rewrite it in two different forms:
XT +h = wT +h + Σ∞j=1 ψj wT +h−j (Causal)
wT +h = XT +h + Σ∞j=1 πj XT +h−j (Invertible)
and we can consider the above conditional expectations applied to each of these
equations.
First, we note that
E (wt |XT , . . . , X0 , . . .) = wt if t ≤ T and 0 if t > T.
This is because (1) if t > T then wt is independent of the sequence XT , . . . and
(2) if t ≤ T then, based on the causal and invertible representations, we have a
one-to-one correspondence between the X's and the w's. Similarly,
E (Xt |XT , . . . , X0 , . . .) = Xt if t ≤ T and X̃t if t > T.
Applying this idea to the causal representation, we get that
X̃T +h = Σ∞j=h ψj wT +h−j
as the first h − 1 terms in the sum become zero. Then, subtracting this from XT +h
gives
XT +h − X̃T +h = Σh−1j=0 ψj wT +h−j .
Hence, the mean squared prediction error is
PTT+h = E (XT +h − X̃T +h )2 = σ 2 Σh−1j=0 ψj2 .
Note that we can also apply the conditional expectation to the invertible repre-
sentation. In that case, we get
0 = X̃T +h + Σh−1j=1 πj X̃T +h−j + Σ∞j=h πj XT +h−j
X̃T +h = −Σh−1j=1 πj X̃T +h−j − Σ∞j=h πj XT +h−j .
This shows that the T + h predicted value is a function of the data XT , . . . and the
previous h − 1 predicted values X̃T +h−1 , . . . , X̃T +1 .
Now, we know from before that the coefficients ψj tend to zero fast enough to be
absolutely summable—i.e. Σ∞j=0 |ψj | < ∞. Therefore, Σ∞j=h |ψj | → 0 as h → ∞.
This implies that
X̃T +h → µ in probability as h → ∞.
We can prove this first by noting that the variance of X̃T +h is σ 2 Σ∞j=h ψj2 . Then, for
any ε > 0, we can use Chebyshev's inequality to get that
P( |X̃T +h − µ| > ε ) = P( |Σ∞j=h ψj wT +h−j | > ε ) ≤ ε−2 σ 2 Σ∞j=h ψj2 → 0
as h → ∞.
Meanwhile, the MSPE from above is PTT+h = σ 2 Σh−1j=0 ψj2 .
Therefore, as h → ∞, we have that the MSPE tends to KX (0), which is just the
variance of the process Xt .
Hence, in the long run, the forecast for an ARMA(p,q) process tends towards
its mean, and the variance tends to the variance of the process.
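For a mean zero causal AR(1), this long-run behaviour is fully explicit: X̃T +h = φh XT tends to the mean 0, while the MSPE σ 2 (1 − φ2h )/(1 − φ2 ) tends to KX (0). A quick sketch:

```python
# h-step forecasts and MSPE for a causal AR(1): X_t = phi X_{t-1} + w_t.
phi, sigma2, xT = 0.8, 1.0, 2.5
K0 = sigma2 / (1 - phi ** 2)

def forecast(h):
    return phi ** h * xT                                   # tends to the mean 0

def mspe(h):
    return sigma2 * (1 - phi ** (2 * h)) / (1 - phi ** 2)  # tends to K_X(0)

print([round(forecast(h), 4) for h in (1, 5, 50)])
print([round(mspe(h), 4) for h in (1, 5, 50)], round(K0, 4))
```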
Given the coefficients φi and θj , we can write this truncated prediction as
X̃TT+h = Σpi=1 φi X̃TT+h−i + Σqj=1 θj w̃TT +h−j
where we replace the predicted value X̃TT+h−i with the observed value XT +h−i if
i ∈ [h, T + h − 1] and with 0 if i ≥ T + h. Similarly, w̃tT = 0 if t < 1 or if t > T .
Otherwise,
w̃tT = Φ(B)X̃tT − Σqj=1 θj w̃Tt−j .
Example 3.2.3 (ARMA(1,1)). For the causal invertible ARMA(1,1) process Xt+1 =
φXt + wt+1 + θwt , we can consider the one-step-ahead prediction X̃TT+1 = φXT + θw̃TT .
Hence, we can start from w̃0T = 0 and X0 = 0 and compute the w̃tT iteratively.
We can also compute the variance of the prediction (the MSPE). For this, we
note that the ARMA(1,1) process can be written in a causal form as
Xt = wt + Σ∞i=1 ψi wt−i = wt + Σ∞i=1 (φ + θ)φi−1 wt−i .
Hence, the MSPE for the h-step-ahead prediction is
PTT+h = σ 2 ( 1 + (φ + θ)2 (1 − φ2h−2 )/(1 − φ2 ) ) → σ 2 ( 1 + (φ + θ)2 /(1 − φ2 ) )
as h → ∞.
Backcasting
We can also consider forecasting into the past or backcasting. That is, we can predict
backwards h time units into the past by
X̂T1−h = ΣTi=1 αi Xi .
This means we can compute the coefficients αi by solving the system of equations
K = Γα where K = (KX (h), KX (h + 1), . . . , KX (T + h − 1))T and Γ is the T × T
matrix with entries Γij = KX (|i − j|),
just as we did before for forecasting.
Remark 3.2.4 (Fun Fact). For a stationary Gaussian process, the vector (XT +1 , XT , . . . , X1 )
is equal in distribution to (X0 , X1 , . . . , XT ), so forecasting and backcasting are equivalent.
or more generally, Xt = φXt−s + wt which we will call an AR(1)s process for some
value of s > 1.
In general, we can consider a seasonal ARMA process denoted ARMA(p′ , q ′ )s
which is
Φs (B s )Xt = Θs (B s )wt
where the polynomials Φs and Θs are
Φs (B s ) = 1 − ϕ1 B s − ϕ2 B 2s − . . . − ϕp′ B p′s
Θs (B s ) = 1 + ϑ1 B s + ϑ2 B 2s + . . . + ϑq′ B q′s .
The reason for the notation is to combine the seasonal ARMA with the regular
ARMA process to get an ARMA(p, q) × (p′ , q ′ )s process, which can be written as
Φs (B s )Φ(B)Xt = Θs (B s )Θ(B)wt .
For example, consider the seasonal AR(1)12 process
(1 − ϕB 12 )Xt = wt , i.e. Xt = ϕXt−12 + wt ,
where |ϕ| < 1. To compute the autocovariance quickly, we can rewrite this as a
linear process:
Xt = ϕXt−12 + wt = ϕ(ϕXt−24 + wt−12 ) + wt = ϕ2 Xt−24 + ϕwt−12 + wt = . . . = Σ∞j=0 ϕj wt−12j .
Therefore, the variance is as usual KX (0) = σ 2 /(1 − ϕ2 ). For lags h = 1, . . . , 11, we
have
KX (h) = cov( Σ∞j=0 ϕj wt−12j , Σ∞i=0 ϕi wt−h−12i ) = Σ∞j=0 Σ∞i=0 ϕj+i cov (wt−12j , wt−h−12i ) = 0,
because the indices t − 12j and t − h − 12i will never be equal unless h is a multiple
of 12. In that case, we have
KX (12) = σ 2 ϕ Σ∞j=0 ϕ2j = σ 2 ϕ/(1 − ϕ2 ).
Note that this is the same as KY (1) for the AR(1) process Yt = ϕYt−1 + wt . Hence,
the above seasonal AR process is effectively 12 uncorrelated AR processes running
in parallel to each other. This is why we often include a seasonal and non-seasonal
component in the SARIMA models.
Note also that the AR(1, 0)12 could also be thought of as an AR(12, 0). However,
trying to estimate or forecast with an AR(12, 0) process will include many param-
eters that are unnecessary, which will increase the variance of our estimators and
predictions.
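We can verify the autocovariance structure numerically by truncating the linear process representation Xt = Σ ϕj wt−12j (a sketch; the truncation length N is an arbitrary choice that makes the geometric tail negligible):

```python
# Autocovariance of the seasonal AR: X_t = vphi X_{t-12} + w_t,
# via its linear-process psi-weights: psi_{12j} = vphi^j, zero otherwise.
vphi, sigma2, N = 0.6, 1.0, 600   # N: truncation length of the psi series

psi = [0.0] * N
for j in range(0, N, 12):
    psi[j] = vphi ** (j // 12)

def K(h):
    # For a linear process, K(h) = sigma^2 sum_i psi_i psi_{i+h}.
    return sigma2 * sum(psi[i] * psi[i + h] for i in range(N - h))

print(round(K(12), 6))                          # sigma^2 vphi / (1 - vphi^2)
print([round(K(h), 6) for h in range(1, 12)])   # all exactly zero
```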
as above, since the MA piece will not affect the calculation. However, if h ≡ 1
mod 12 or h ≡ 11 mod 12, we have to work harder. First, consider that
KX (1) = σ 2 θ/(1 − ϕ2 ) and KX (11) = σ 2 θϕ/(1 − ϕ2 ).
Continuing on, we have that
KX (13) = ϕKX (1) = σ 2 θϕ/(1 − ϕ2 )
as well. Hence, we can generalize this to
KX (h) = KX (12m ± 1) = σ 2 θϕm /(1 − ϕ2 ).
Lastly, for any lags h not congruent to −1, 0, or 1 mod 12, we have KX (h) = 0 as
none of the indices line up in the autocovariance computation.
Chapter 4
Introduction
Time series data often exhibit cyclic behaviour, as we saw with SARIMA models
in the previous chapter. Furthermore, a time series may have more than one cycle
occurring simultaneously. In this chapter, we will consider spectral methods for
identifying the cyclic behaviour of time series data.
In general, we are interested in the frequency ω of the time series. For example,
for a time series that repeats every 12 months, the frequency is ω = 1/12 cycles per
month.
Consider the periodic process
Xt = A cos(2πωt + φ)
where A is the amplitude, ω is the frequency, and φ is the phase. This process can
be rewritten as a linear combination of trig functions as
Xt = U1 cos(2πωt) + U2 sin(2πωt)
Assuming U1 and U2 are uncorrelated, mean zero random variables with common
variance σ 2 , we can compute the autocovariance as
KX (h) = cov (U1 cos(2πω(t + h)) + U2 sin(2πω(t + h)), U1 cos(2πωt) + U2 sin(2πωt))
= cov (U1 cos(2πω(t + h)), U1 cos(2πωt)) + cov (U2 sin(2πω(t + h)), U2 sin(2πωt))
= σ 2 cos(2πω(t + h)) cos(2πωt) + σ 2 sin(2πω(t + h)) sin(2πωt)
= σ 2 cos(2πωh),
More generally, we can consider a sum of q such periodic processes with frequencies
ω1 , . . . , ωq , where all of the Uj1 and Uj2 are uncorrelated mean zero random variables
with potentially different variances σj2 . The autocovariance in this case is
KX (h) = Σqj=1 σj2 cos(2πωj h).
Remark 4.1.1 (Aliasing). Aliasing is a problem that can occur when taking a dis-
crete sample from a continuous signal. Since we have to sample at a certain rate, high
frequency behaviour in the signal may look like low frequency patterns in our sam-
ple. This is displayed in Figure 4.1 where the red dots are sampled too infrequently,
making it appear as if there is a low frequency oscillation in the data instead of the
actual high frequency oscillation in black.
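The aliasing in Figure 4.1 comes down to a trigonometric identity: at integer sampling times t, the frequencies ω and 1 − ω are indistinguishable, since cos(2π(1 − ω)t) = cos(2πωt). A quick check:

```python
import math

# At integer sample times, frequency 1 - omega aliases to omega:
# cos(2*pi*(1 - omega)*t) = cos(2*pi*t - 2*pi*omega*t) = cos(2*pi*omega*t).
omega = 0.1
for t in range(20):
    hi = math.cos(2 * math.pi * (1 - omega) * t)  # "fast" signal at 0.9
    lo = math.cos(2 * math.pi * omega * t)        # "slow" signal at 0.1
    assert abs(hi - lo) < 1e-9
print("frequencies", 1 - omega, "and", omega, "agree at every integer t")
```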
This is because the sines and cosines form a basis, and similarly to how a degree T − 1
polynomial can pass through T points, these T − 1 parameters β can be used to fit
the data exactly. Note that for T even, we can also do this with
Xt = β0 + ΣT /2−1j=1 {βj1 cos(2πt j/T ) + βj2 sin(2πt j/T )} + βT /2 cos(πt).
Figure 4.1: Aliasing occurs when we sample too infrequently to capture high
frequency oscillations.
First, with a little work, we can show that
ΣTt=1 cos2 (2πtj/T ) = ΣTt=1 sin2 (2πtj/T ) = T /2 for j = 1, . . . , T /2 − 1
ΣTt=1 cos2 (2πtj/T ) = T for j = 0, T /2
ΣTt=1 sin2 (2πtj/T ) = 0 for j = 0, T /2
ΣTt=1 cos(2πtj/T ) cos(2πtk/T ) = 0 for j ≠ k
ΣTt=1 sin(2πtj/T ) sin(2πtk/T ) = 0 for j ≠ k
ΣTt=1 cos(2πtj/T ) sin(2πtk/T ) = 0 for any j, k.
Hence, our design matrix for linear regression has orthogonal columns, so computing
each β̂ becomes, for j ≠ 0, T /2,
β̂j1 = (2/T ) ΣTt=1 Xt cos(2πtj/T )
β̂j2 = (2/T ) ΣTt=1 Xt sin(2πtj/T )
The squared magnitude of the coefficients
|d(j/T )|2 = ( T −1/2 ΣTt=1 Xt cos(2πtj/T ) )2 + ( T −1/2 ΣTt=1 Xt sin(2πtj/T ) )2
is the (unscaled) periodogram. The scaled periodogram is P (j/T ) = (4/T )|d(j/T )|2 ,
which follows from the equations for the β̂ above. Noting that cos(2π − θ) = cos(θ)
and that sin(2π − θ) = − sin(θ), we have that |d(1 − j/T )|2 = |d(j/T )|2 and likewise
P (1 − j/T ) = P (j/T ). Hence, we only consider frequencies j/T < 1/2.
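For a pure cosine Xt = A cos(2πtj0 /T ), the scaled periodogram picks out the frequency exactly: P (j0 /T ) = A2 and P (j/T ) = 0 at the other Fourier frequencies. A direct-sum sketch:

```python
import math

# Scaled periodogram P(j/T) = (4/T)|d(j/T)|^2 computed by direct sums.
T, j0, A = 32, 4, 2.0
X = [A * math.cos(2 * math.pi * t * j0 / T) for t in range(1, T + 1)]

def scaled_periodogram(j):
    c = sum(X[t - 1] * math.cos(2 * math.pi * t * j / T) for t in range(1, T + 1))
    s = sum(X[t - 1] * math.sin(2 * math.pi * t * j / T) for t in range(1, T + 1))
    return (4 / T) * ((c / math.sqrt(T)) ** 2 + (s / math.sqrt(T)) ** 2)

P = [scaled_periodogram(j) for j in range(1, T // 2)]
print(round(P[j0 - 1], 6))  # A^2 = 4 at the true frequency j0/T
```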
The DFT can be computed quickly via the Fast Fourier Transform (FFT).
The DFT is just a linear transformation of the data Xt , which can be written at
d = W X for some matrix W . This type of transformation would take O(T 2 ) time to
compute. However, the FFT uses a sparse representation of W to reduce the time
to O(T log2 T ). There are many algorithms for the FFT, but the most common
takes a divide-and-conquer approach. That is, if T = 2m , then it breaks the data
in half based on odd and even indices X1 , X3 , . . . , XT −1 and X2 , X4 , . . . , XT and
computes the Fourier transform of each separately. However, since T /2 = 2m−1 is
also divisible by 2, this idea can be applied recursively to get 4 partitions of the
data, then 8, and so on.
If we rewrite the DFT as
√T d(j/T ) = ΣT /2t=1 X2t e−2πitj/(T /2) + e−2πij/T ΣT /2t=1 X2t−1 e−2πitj/(T /2) = Ej + e−2πij/T Oj ,
then we can decompose it into even and odd parts Ej and Oj , respectively. These
two pieces are each DFTs of size T /2. We also note that there is a redundancy in
the calculations: for j < T /2, we have
√T d(j/T ) = Ej + e−2πij/T Oj
and
√T d(j/T + 1/2) = Ej − e−2πij/T Oj .
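This divide-and-conquer idea can be sketched as a recursive radix-2 FFT (a minimal illustration using the 0-indexed convention dj = Σt xt e−2πitj/T common in software; agreement with the direct O(T 2 ) DFT is checked at the end):

```python
import cmath

def dft(x):
    """Direct O(T^2) DFT: d_j = sum_t x_t e^{-2 pi i t j / T}."""
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / T) for t in range(T))
            for k in range(T)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    T = len(x)
    if T == 1:
        return x[:]
    E = fft(x[0::2])                      # DFT of the even-indexed points
    O = fft(x[1::2])                      # DFT of the odd-indexed points
    out = [0j] * T
    for j in range(T // 2):
        tw = cmath.exp(-2j * cmath.pi * j / T) * O[j]   # twiddle factor
        out[j] = E[j] + tw                # d at frequency j/T
        out[j + T // 2] = E[j] - tw       # d at frequency j/T + 1/2
    return out

x = [float(t % 5) for t in range(16)]
err = max(abs(a - b) for a, b in zip(fft(x), dft(x)))
print(err < 1e-9)  # True: both transforms agree
```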
Remark 4.1.2 (Scaling and the FFT). In Fourier analysis and for different FFT
implementations in code, there are often different scaling factors included. Hence,
to make sure we are estimating what we want to estimate, one needs to take care
when using FFT algorithms.
However, in this course, we only consider discrete time processes.
Given slightly stronger conditions, we can also define the spectral density. That
is, if the autocovariance is absolutely summable, then the spectral distribution is
absolutely continuous in turn implying that the derivative exists almost everywhere:
dFX (ω) = fX (ω)dω.
From here we see that if fX (ω) exists, then it is an even function—i.e. fX (ω) =
fX (−ω). Also, since KX (0) = ∫ 1/2−1/2 fX (ω)dω, the variance of the process can be
thought of as the integral of the spectral density over all frequencies. In a way, this
is similar to how the total sum of squares can be decomposed into separate sums of
squares in an ANOVA.
As a simple example, consider the periodic process Xt = U1 cos(2πω0 t) + U2 sin(2πω0 t)
from before. Then, the autocovariance is
KX (h) = σ 2 cos(2πω0 h) = (σ 2 /2)( e2πiω0 h + e−2πiω0 h ) = ∫ 1/2−1/2 e2πiωh dFX (ω)
where FX (ω) = 0 for ω < −ω0 , FX (ω) = σ 2 /2 for ω ∈ [−ω0 , ω0 ], and FX (ω) = σ 2
for ω > ω0 . Note that in this case, the autocovariance is not absolutely summable.
As a second example, we consider the white noise process wt . In this case,
the autocovariance is simply σ 2 at lag h = 0 and 0 for all other lags h. Hence,
it is absolutely summable and the spectral density is just fX (ω) = σ 2 for all ω ∈
[−1/2, 1/2]. Hence, as mentioned in Chapter 1, white noise in a sense contains every
frequency at once with equal power.
is the frequency response function. Given all of this, we have the following theorem.
Theorem 4.2.3. Let Xt be a time series with spectral density fX (ω) and let
Σ∞j=−∞ |aj | < ∞. Then the spectral density for yt = Σ∞j=−∞ aj Xt−j is
fy (ω) = |A(ω)|2 fX (ω).
We can apply this result to a causal ARMA(p,q) process. For Φ(B)Xt = Θ(B)wt ,
we can rewrite it as
Xt = (Θ(B)/Φ(B)) wt = Σ∞j=0 ψj wt−j .
Writing Ψ(z) = Θ(z)/Φ(z) = Σ∞j=0 ψj z j and using the ψj as the aj from the above
theorem, we have that
A(ω) = Σ∞j=−∞ ψj e−2πiωj = Ψ(e−2πiω ) = Θ(e−2πiω )/Φ(e−2πiω ).
Using the fact that fw (ω) = σ 2 for all ω, we have finally that
fX (ω) = |A(ω)|2 fw (ω) = σ 2 |Θ(e−2πiω )|2 / |Φ(e−2πiω )|2 .
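For a causal AR(1), Φ(z) = 1 − φz and Θ(z) = 1, so the formula reduces to the closed form fX (ω) = σ 2 /(1 − 2φ cos(2πω) + φ2 ). A sketch comparing the two expressions:

```python
import cmath
import math

# Spectral density of a causal AR(1) via f_X(w) = sigma^2 |Theta/Phi|^2.
phi, sigma2 = 0.6, 1.0

def f_arma(w):
    z = cmath.exp(-2j * cmath.pi * w)
    return sigma2 * abs(1.0 / (1 - phi * z)) ** 2

def f_closed(w):
    return sigma2 / (1 - 2 * phi * math.cos(2 * math.pi * w) + phi ** 2)

for w in (0.0, 0.1, 0.25, 0.5):
    assert abs(f_arma(w) - f_closed(w)) < 1e-12
print(round(f_arma(0.0), 4), round(f_arma(0.5), 4))  # most power at low frequency
```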
We can also centre the DFT when j ≠ 0 to get
d(ωj ) = T −1/2 ΣTt=1 (Xt − X̄)e−2πiωj t .
Hence, the periodogram can be written in terms of the Fourier transform of the
estimated autocovariance, as we might have expected from the previous discussion.
The problem we face here is that the estimator K̂X (h) is very poor for large h
as there are relatively few pairs of time points to consider. Hence, we often truncate
this summation by only summing over |h| ≤ m for some m ≪ T .
and β̂0 = X̄. Therefore, we have
Xt − X̄ = (2/√T ) Σ(T −1)/2j=1 {dc (ωj ) cos(2πt j/T ) + ds (ωj ) sin(2πt j/T )}.
The periodogram can in turn be written as
I(ωj ) = ΣT −1h=−(T −1) e−2πiωj h (1/T ) ΣT −|h|t=1 (Xt+|h| − µ)(Xt − µ),
so that
E[I(ωj )] = ΣT −1h=−(T −1) ((T − |h|)/T ) KX (h) e−2πiωj h .
then we have that
cov (dc (ωj ), dc (ωk )) = fX (ωj )/2 + εT for ωj = ωk , and cov (dc (ωj ), dc (ωk )) = εT for ωj ≠ ωk ,
and similarly for ds , where εT is an error term bounded by |εT | ≤ c/T . Hence, the
estimated covariance matrix should have a strong diagonal with smaller noisy off-
diagonal entries.
We can use this to find via the central limit theorem that if our process Xt is
just iid white noise with variance σ 2 , then
dc (ωj(T ) ) →d N (0, σ 2 /2)
ds (ωj(T ) ) →d N (0, σ 2 /2).
Thus, recalling that I(ωj ) = dc (ωj )2 + ds (ωj )2 , we have that
2I(ωj(T ) )/σ 2 →d χ2 (2)
and this I(ωj(T ) ) will be asymptotically independent of any other I(ωk(T ) ).
For the general linear process, we have
Theorem 4.3.1. If Xt = Σ∞j=−∞ ψj wt−j with the ψj absolutely summable, with wt
being iid white noise with variance σ 2 , and with
Σ∞h=−∞ |h||KX (h)| < ∞,
then for any collection of m frequencies ωj(T ) → ωj , we have jointly that
2I(ωj(T ) )/f (ωj ) →d χ2 (2)
given that f (ωj ) > 0 for j = 1, . . . , m.
Thus, we can use this result for many statistical applications, like constructing a
1 − α confidence interval for the spectral density fX at some frequency ω by
2I(ωj(T ) )/χ22,1−α/2 ≤ fX (ω) ≤ 2I(ωj(T ) )/χ22,α/2 .
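Since χ2 (2) is an exponential distribution with mean 2, its quantiles have the closed form χ22,p = −2 log(1 − p), so the interval is easy to compute (a sketch; the periodogram value used is illustrative):

```python
import math

def chi2_2_quantile(p):
    """Quantile of chi-squared with 2 df, i.e. the Exp(mean 2) distribution."""
    return -2 * math.log(1 - p)

def spectral_ci(I, alpha=0.05):
    """(1 - alpha) CI for f_X(omega) from a periodogram ordinate I."""
    lo = 2 * I / chi2_2_quantile(1 - alpha / 2)
    hi = 2 * I / chi2_2_quantile(alpha / 2)
    return lo, hi

I = 1.3   # an illustrative periodogram value
lo, hi = spectral_ci(I)
print(round(lo, 4), round(hi, 4))  # a very wide interval: chi2(2) is dispersed
```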
Bartlett’s and Welch’s Methods
For a time series X1 , . . . , XT , we are able to compute I(ωj ) for any frequency ωj =
j/T . However, such fine granularity is often not necessary. Instead, computing the
periodogram for fewer frequencies with better accuracy is often preferrable.
Bartlett’s method take this into consideration by splitting the time series into
K separate disjoint series of equal length m = T /K. That is,
Then, for each of these K time series pieces, we can compute periodograms I (1) (ωj ), . . . , I (K) (ωj )
and average them to get
XK
I(ωj ) = K −1 I (i) (ωj ).
i=1
In this case, we only have periodogram values for ωj = j/m for j = 1, . . . , m. But
the variance of the estimate decreases. It is also faster to compute as performing K
DFTs of size m is faster than performing one DFT of size mK = T .
Welch’s method is nearly identical to Bartlett’s method. However, this new
approach allows for the time series to be partitioned into overlaping pieces that
overlap by a fixed number of data point.
Banding
Instead of partitioning in the time domain as Bartlett's method does, we can instead
partition the frequencies into bands. In this case, we can define a frequency band of
2m + 1 frequencies to be
B = { ω : ωj − m/T ≤ ω ≤ ωj + m/T }.
Here, we say that (2m + 1)/T is the bandwidth of B. The idea is that if fX (ω) is
locally constant for all frequencies in the band B, then the spectral density can be
estimated by
Ī(ω) = (2m + 1)−1 Σmi=−m I(ωj + i/T )
for any ω ∈ B. Considering the previous result that 2I(ωj(T ) )/f (ωj ) →d χ2 (2), we
have the extension that
2(2m + 1)Ī(ωj(T ) )/f (ωj ) →d χ2 (4m + 2)
as long as T is large and m is small. Note that there typically is no optimal band-
width, and many can be tried when analyzing a time series in the spectral domain.
The above notion of banding simply weights all frequencies in the band B equally,
which is 1/(2m + 1). Instead, we can use a weighted average of the frequencies of
the form
Ĩ(ω) = Σmi=−m ci I(ωj + i/T )
where the weights ci sum to 1. To mathematically get convergence for this object
to a chi-squared distribution as before, we require that, as T → ∞ and m → ∞ such
that m/T → 0, we have Σmi=−m c2i → 0. Then, it can be shown that
E[Ĩ(ω)] → fX (ω)
and
cov( Ĩ(ω1 ), Ĩ(ω2 ) ) / Σmi=−m c2i → 0 for ω1 ≠ ω2 ,
→ fX (ω)2 for ω1 = ω2 ≠ 0, 1/2, and
→ 2fX (ω)2 for ω1 = ω2 = 0 or 1/2.
Tapering
Tapering a time series is another way to focus in on estimating the spectral density
for a certain range of frequencies. To discuss this, we begin in the time domain.
For a mean zero stationary process Xt with spectral density fX (ω), we construct a
tapered process with Yt = at Xt for some coefficients at . Thus, the DFT for Yt gives
us
dY (ωj ) = T −1/2 ΣTt=1 at Xt e−2πiωj t ,
with WT (ω) = |AT (ω)|2 where AT (ω) = T −1/2 ΣTt=1 at e−2πiωt . Here, we refer to
WT (ω) as the spectral window.
For example, with at = 1 for all t (no tapering),
WT (ω) = sin(T πω)2 / (T sin(πω)2 )
with WT (0) = T .
band B ¯
as above, the spectral window is similarly averaged. That is, for I(ω) =
1 Pm
2m+1 i=−m I(ωj + i/T ), we have
m
1 X sin(nπ(ω + i/T ))2
WT (ω) = .
2m + 1 n sin(π(ω + i/T ))2
i=−m
Example 4.3.2. Other tapers that live up to the name “tapering” include the cosine
taper, which sets the coefficients at = [1 + cos(2π(t − (T + 1)/2)/T )]/2.
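A quick sketch of the cosine taper weights: they are near zero at the ends of the series and exactly one in the middle, smoothly down-weighting the endpoints:

```python
import math

# Cosine taper: a_t = [1 + cos(2*pi*(t - (T+1)/2)/T)]/2 for t = 1, ..., T.
T = 101   # odd, so the midpoint lands on an integer t
a = [(1 + math.cos(2 * math.pi * (t - (T + 1) / 2) / T)) / 2
     for t in range(1, T + 1)]

print(round(a[0], 4), round(a[(T - 1) // 2], 4), round(a[-1], 4))
# endpoints near 0, midpoint exactly 1
```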
4.4 Filtering