
Time series analysis

Notes for the course HS 2022

Nicolai Meinshausen
[email protected]

December 13, 2022

These notes are intended to just give a quick summary of what we discussed in the course.
Some parts of this script are reused from an earlier script of Prof. Künsch. For examples and
illustrations of the concepts and methods, you should look at the R-demonstrations which
are on the course web page and the examples in the book Shumway & Stoffer. There are
bound to be errors – I would appreciate it if you point them out to me so that I can get a corrected
version to everybody.
Any text printed in small font size/gray covers topics we did not discuss in class (and hence they
are not part of the exam) but is there for anyone interested in reading a bit more.

Objectives of time series analysis


The goals of time series analysis can be classified into one of the following categories:
i) Compact description of data as Xt = Tt + St + Yt, where Xt is the observed time-series, Tt a trend,
St a seasonal component and Yt stationary noise. This can aid with interpretation, for example by
seasonal adjustment of unemployment figures.
ii) Hypothesis testing. We might for example want to test whether the trend component
Tt vanishes for summer rainfall figures in Zurich over the last 10 years.
iii) Prediction. Examples are: predicting unemployment data, the strength of El Nino, airline passenger
numbers or the next word in a text. Prediction might sometimes only be possible via simulation, as when
trying to forecast hurricane intensity for the next decade at a specific location.
iv) Control/Causality/Reinforcement learning. One example is the impact of monetary policy (interest
rates) on inflation, where the causal impact can be quite different (possibly even of different sign) from
the purely observational correlation. Another example is the optimal filling and draining of lakes for
energy storage.

Overview of the course
1. Characteristics of Time-Series (mostly Chapter 1 in book)
(a) Stationarity
(b) Auto-correlation function
(c) Transformations
2. Time domain methods (mostly Chapter 3)
(a) AR/MA/ARMA processes
(b) Forecasting
(c) Parameter estimation
(d) ARIMA models
3. Spectral analysis (mostly Chapter 4.1-4.5)
(a) Periodogram
(b) Spectral density
(c) Spectral estimation
4. State-space models (parts of Chapter 6)
(a) Hidden Markov models (HMM)
(b) Inference for HMM
(c) Forward-backward algorithms
(d) Kalman filter
(e) Particle filters
5. Additional topics (possibly independent component analysis, recurrent neural net-
works, reinforcement learning)

1 Characteristics of Time Series


1.1 Stochastic Process
A stochastic process is a mathematical model for a time series.
Stochastic process = Collection of random variables (Xt (ω); t ∈ T ). Alternative view:
Stochastic process as a random function from T to R.

A basic distinction is between continuous and discrete equispaced time T . Models in con-
tinuous time are preferred for irregular observation points. In this course we will restrict
ourselves mostly to discrete equispaced time and, if not stated otherwise, use T = Z.
In all interesting cases, there is dependence between the random variables at different times.
Hence need to consider joint distributions, not only marginals. Gaussian stochastic processes
have joint Gaussian distribution for any number of time points.
A stochastic process describes what different time series (for different draws of ω) could look like.
In most cases, we observe only one realization xt(ω) of the stochastic process (a single ω). Hence it
is clear that we need additional assumptions if we want to draw conclusions about the joint
distributions (which involve many ω's) from a single realization.
The most common such assumption is stationarity.
Stationarity means the same behavior of the observed time series in different time windows.
Mathematically, it is formulated as invariance of (joint) distributions when time is shifted.
Stationarity justifies taking of averages (mathematically, one needs ergodicity in addition).
Some examples:
a) White noise Xt = Wt , where Wt ∼ WN (0, σ 2 ), that is Wt ∼ F iid for some distribution
F with mean 0 and variance σ 2 . Special case is Gaussian White noise, where F = Φ.
b) Harmonic oscillations plus (white) noise,

    Xt = Σ_{k=1}^{K} αk cos(λk t + φk) + Wt,

where Wt ∼ WN(0, σ²) is a white noise process as above and K, αk, λk, φk are unknown parameters.


c) Moving averages. For example

    Xt = (Wt + Wt−1 + Wt−2)/3,

where Wt ∼ WN(0, σ²).
d) Auto-regressive processes. For example

Xt = 0.9Xt−1 + Wt ,

plus initial conditions.


e) Random Walk (special case of an auto-regressive process)

Xt = Xt−1 + Wt

or, with drift,


Xt = Xt−1 + 0.2 + Wt .

f) Auto-regressive conditional heteroscedastic models, for example

    Xt = √(1 + 0.9 X_{t−1}²) · Wt,

where again Wt ∼ WN(0, σ²).
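
As an illustration of examples a), d) and e), here is a minimal R sketch (Gaussian white noise with σ = 1 and the sample size are arbitrary assumptions):

    set.seed(1)
    n <- 200
    W <- rnorm(n)                       # a) Gaussian white noise W_t ~ N(0, 1)

    X_ar <- numeric(n)                  # d) AR(1): X_t = 0.9 X_{t-1} + W_t, started at 0
    for (t in 2:n) X_ar[t] <- 0.9 * X_ar[t - 1] + W[t]

    X_rw  <- cumsum(W)                  # e) random walk: X_t = X_{t-1} + W_t
    X_rwd <- cumsum(0.2 + W)            #    random walk with drift 0.2

    par(mfrow = c(2, 2))
    plot.ts(W); plot.ts(X_ar); plot.ts(X_rw); plot.ts(X_rwd)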

1.2 Measures of dependence


We want to summarize the distribution of a stochastic process (Xt ) by the first two moments.
1. The mean function of the process (Xt) is defined as

    µt := E(Xt) = ∫_{−∞}^{∞} x ft(x) dx,

where ft is the density of Xt (if it exists).


2. The auto-covariance function is for all s, t ∈ Z defined as
γ(s, t) := Cov(Xs , Xt ) = E((Xs − µs )(Xt − µt )).

3. The auto-correlation function (ACF) is defined as

    ρ(s, t) = γ(s, t) / √(γ(s, s) γ(t, t)).

4. The cross-covariance for two time series (Xt ) and (Yt ) is defined as
γX,Y (s, t) = Cov(Xs , Yt ).

We can also use a subscript X for the first three definitions but usually drop it for notational
simplicity.
In vector notation, we can thus write for a collection of n time-points that the vector
(X1 , . . . , Xn )
has mean
(µ1 , . . . , µn )
and covariance matrix

    ( γ(1,1)  γ(1,2)  γ(1,3)  ...  γ(1,n) )
    ( γ(2,1)  γ(2,2)               ...    )
    (   ...                               )
    ( γ(n,1)          ...          γ(n,n) ).

The first two moments of any collection of n random variables can be described as above.
For time-series, we would like to see translational invariance in time, which will be called
stationarity.

1.3 Stationarity
Definition: A time-series (Xt ) is strictly stationary iff the distribution of (Xt1 , . . . , Xtk )
is identical to the distribution of (Xt1 +h , . . . , Xtk +h ) for all k ∈ N+ , time-points t1 , . . . , tk ∈ Z
and shifts h ∈ Z.
If (Xt ) is strictly stationary, then
(i) ∃µ ∈ R such that µt = µ for all t ∈ Z, that is the mean is constant.
(ii) γ(s, t) = γ(s + h, t + h) for all s, t, h ∈ Z, that is the covariance is invariant under
time-shifts and we write γ without the second argument as

γ(k) := γ(k, 0) ∀k ∈ Z.

We can also use just the last two properties about the first two moments (which are implied
by strict stationarity) to define weak stationarity.
Definition: A time-series (Xt ) is weakly stationary (or just stationary henceforth) iff
(Xt ) has finite variance and
(i) the mean function µt does not depend on t ∈ Z
(ii) the autocovariance γ(s, t) depends on s, t only through |s − t| and we use again the
notation
γ(k) := γ(k, 0) ∀k ∈ Z.

The mean and covariance of a collection of n consecutive observations, for example (X1 , . . . , Xn ),
are now
(µ, . . . , µ)
and covariance matrix

    ( γ(0)    γ(1)  γ(2)  ...  γ(n−1) )
    ( γ(1)    γ(0)  γ(1)  ...         )
    ( γ(2)    ...                     )            (1)
    (  ...                            )
    ( γ(n−1)  ...              γ(0)   ),

in contrast to the general case discussed above. The invariance under time shifts is now
easily visible.
Note that a strictly stationary time-series is always also weakly stationary.
Moreover, if the distribution of (Xt1 , . . . , Xtk ) is multivariate Gaussian for all k ∈ N and
t1 , . . . , tk ∈ Z, then weak stationarity implies strict stationarity (the proof is trivial since a
multivariate Gaussian distribution is uniquely identified by its mean and covariance).

Examples. We look at the same examples as above and see whether they are (weakly)
stationary.
a) White noise Xt = Wt with Wt ∼ WN(0, σ²), that is, Wt is iid with distribution F with mean 0 and
variance σ². The expected value is

    E(Xt) = 0 ∀ t ∈ Z.

The variance is given by E(Xt²) = σ² for all t and the auto-covariance is

    γ(t + h, t) = σ² if h = 0, and 0 if h ≠ 0.

We can thus write γ(t + h, t) = γ(h) for all t, and white noise is (weakly) stationary. The ACF is given by

    ρ(h) = γ(h)/γ(0) = 1 if h = 0, and 0 if h ≠ 0.

b) The harmonic oscillator is not stationary as the mean function µt is not constant in time.
c) Moving average. Say

    Xt = Wt + θ Wt−1,

where Wt ∼ WN(0, σ²). Then

    E(Xt) = 0

and

    γ(t + h, t) = Cov(Xt, Xt+h) = E(Xt Xt+h) = σ²(1 + θ²) if h = 0,  σ²θ if h ∈ {−1, 1},  0 if |h| > 1,

and we can again write γ(h) = γ(t + h, t) for all t, and the process is weakly stationary. The ACF is then

    ρ(h) = γ(h)/γ(0) = 1 if h = 0,  θ/(1 + θ²) if h ∈ {−1, 1},  0 if |h| > 1.

d) AR(1)-process. Will be discussed in the second Chapter in more detail.


e) Random Walk

    Xt = Σ_{j=1}^{t} Wj   with Wj ∼ WN(0, σ²).

First,

    E(Xt) = 0 ∀ t ∈ Z,

and second

    γ(s, t) = Cov(Xs, Xt) = Cov(Σ_{j=1}^{s} Wj, Σ_{j=1}^{t} Wj) = E(Σ_{j=1}^{min{s,t}} Wj²) = min{s, t} σ²,

and the Random Walk is thus not stationary.


f) ARCH model

    Xt = √(1 + φ X_{t−1}²) · Wt   and   Wt ∼ WN(0, σ²).

First,

    E(Xt) = 0 ∀ t ∈ Z.

Second, for 0 ≤ φσ² < 1, the process is weakly stationary since

    γ(t, t + h) = Cov(Xt, Xt+h) = 0 if h ≠ 0,

and the variance γ(t, t) is time-invariant with

    E(Xt²) = E(1 + φ X_{t−1}²) σ² = σ²/(1 − φσ²).

The ACF is hence ρ(h) = 1{h = 0}, just as for a white noise process. Note, however, that while Xt−1 and
Xt are uncorrelated, they are not independent. For example, |Xt−1| and |Xt| or X_{t−1}² and Xt² will in
general have a positive correlation in this model.
While the stationarity above refers to weak stationarity, all weakly stationary examples above are also
strictly stationary.

1.4 Properties of the autocovariance for stationary time-series


In general, for a stationary time-series,
(i) The variance is given by γ(0) = E((Xt − µ)²) ≥ 0.
(ii) |γ(h)| ≤ γ(0) for all h ∈ Z. This follows by Cauchy-Schwarz as

    |γ(h)| = |E((Xt − µ)(Xt+h − µ))| ≤ [E((Xt − µ)²) E((Xt+h − µ)²)]^{1/2} = [γ(0)²]^{1/2} = γ(0).

(iii) γ(−h) = γ(h) (follows trivially).


(iv) γ is positive semi-definite, that is, for all a ∈ R^n (and any choice of n ∈ N),

    Σ_{i,j=1}^{n} a_i γ(i − j) a_j ≥ 0.

As a proof of (iv), consider the variance of (X1, . . . , Xn) a = Σ_{i=1}^{n} a_i X_i, where a ∈ R^n is a
column-vector:

    0 ≤ Var(Σ_{i=1}^{n} a_i X_i) = Σ_{i,j=1}^{n} a_i a_j Cov(X_i, X_j) = Σ_{i,j=1}^{n} a_i γ(i − j) a_j,

which completes the proof.

1.5 Estimating the auto-covariance


For observations x1 , . . . , xn of a stationary time-series, estimate the mean, auto-covariance
and auto-correlation as follows
(i) Sample mean µ̂ = x̄ = (1/n) Σ_{i=1}^{n} x_i.

(ii) The sample auto-covariance function is, for −n ≤ h ≤ n,

    γ̂(h) = (1/n) Σ_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(x_t − x̄),

and set to 0 otherwise.

(iii) The sample auto-correlation is given by

    ρ̂(h) = γ̂(h)/γ̂(0).

Note that γ̂(h) is identical to the sample covariance of (X1 , X1+h ), . . . , (Xn−h , Xn ), except
that we normalize by n instead of n − h to keep γ̂ positive semi-definite (see below).
Properties of the sample ACF The four properties of the ACF are also true for the
sample ACF:
(iii) γ̂(−h) = γ̂(h) holds trivially.
(iv) γ̂ is positive semi-definite (proof below).
(i)+(ii) γ̂(0) ≥ 0 and |γ̂(h)| ≤ γ̂(0) follows from property (iv).
Proof of (iv): We can write

    Γ̂n = ( γ̂(0)    γ̂(1)  γ̂(2)  ...  γ̂(n−1) )
          ( γ̂(1)    γ̂(0)  γ̂(1)  ...         )
          ( γ̂(2)    ...                     )   = (1/n) M Mᵀ,
          (  ...                            )
          ( γ̂(n−1)  ...              γ̂(0)   )

where the n × (2n − 1)-dimensional matrix M is given by

    M := (  0    ...   0    X̃1  X̃2  X̃3  ...  X̃n−1  X̃n )
         (  0    ...   0    X̃1  X̃2  X̃3  ...  X̃n    0  )
         (  ...                                        )
         (  0    X̃1   X̃2   ...  X̃n   0   ...   0    0  )
         (  X̃1   X̃2   ...  X̃n   0    0   ...   0    0  ),

where X̃t := Xt − µ̂. Hence, for any a ∈ R^n,

    aᵀ Γ̂n a = (1/n) (aᵀ M)(Mᵀ a) = (1/n) ‖Mᵀ a‖₂² ≥ 0,
which completes the proof.
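
As a numerical illustration of Sections 1.4–1.5 (a sketch; the AR(1) input series is an arbitrary assumption), the sample auto-covariance can be computed with acf(), which uses the same 1/n normalization as above, and the positive semi-definiteness of Γ̂n can be checked via its eigenvalues:

    set.seed(2)
    x <- arima.sim(model = list(ar = 0.7), n = 300)          # some stationary series

    # sample auto-covariances gamma_hat(0), ..., gamma_hat(n-1)
    gamma_hat <- acf(x, type = "covariance", lag.max = length(x) - 1,
                     demean = TRUE, plot = FALSE)$acf[, 1, 1]

    # Gamma_hat_n is the Toeplitz matrix built from gamma_hat;
    # all eigenvalues should be >= 0 (up to numerical rounding)
    Gamma_hat <- toeplitz(gamma_hat)
    min(eigen(Gamma_hat, symmetric = TRUE, only.values = TRUE)$values)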

1.6 Transforming to stationarity


Several steps/strategies, not always in the same order
1 Plot the time series: look for trends, seasonal components, step changes, outliers etc.
2 Transform data so that residuals are stationary.
(a) Estimate and subtract trend Tt and seasonal components St
(b) Differencing

(c) Nonlinear transformations (log, √·, etc.).
3 Fit a stationary model to residuals. This yields then an overall model for the data.
For 2(a), we can use non-parametric estimation (with a large bandwidth) to get the trend Tt and
smoothing (with a medium bandwidth) to get the seasonal component. The seasonal component can also be
estimated as the empirical average of the detrended data in, for example, each given month (if the data
are monthly with a yearly seasonal cycle).
For 2(b), define lag-1 difference operator via
(∇X)t = (1 − B)Xt = Xt − Xt−1 ,
where B is the backshift operator defined via
(BX)t = Xt−1 .
(i) For a linear trend, that is if
Xt = µ + βt + Nt ,
with Nt the noise process, we have
(1 − B)Xt = β + (1 − B)Nt .
If differenced noise (1 − B)Nt is stationary, we can estimate slope β from data as the
mean of the differenced time-series ∇X.

(ii) For a polynomial trend plus noise, that is if

    Xt = Σ_{j=1}^{k} βj t^j + Nt,

difference k times to get

    ∇^k Xt = (1 − B)^k Xt = k! βk + (1 − B)^k Nt.

If the k-times differenced noise (1 − B)^k Nt is stationary, we can estimate the highest-order term as
the mean of the k-times differenced time-series.
(iii) For a seasonal variation of length s, define lag-s differencing as

    (1 − B^s) Xt = Xt − B^s Xt = Xt − Xt−s,

where B^s is the backshift operator applied s times. If

    Xt = Tt + St + Nt,

and St has period s, then

    (1 − B^s) Xt = Tt − Tt−s + (1 − B^s) Nt,

and the seasonal component has been removed; we can then proceed as in (i) or (ii), depending on the
nature of the trend.
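
In R, lag-1 and lag-s differencing correspond directly to diff(); a short sketch (using the monthly AirPassengers series and a log transformation as assumed examples of 2(b) and 2(c)):

    x <- log(AirPassengers)        # nonlinear transformation as in 2(c)

    d1  <- diff(x)                 # (1 - B) X_t, removes a (locally) linear trend
    d12 <- diff(x, lag = 12)       # (1 - B^12) X_t, removes a yearly seasonal component
    d   <- diff(diff(x, lag = 12)) # (1 - B)(1 - B^12) X_t

    par(mfrow = c(2, 2))
    plot(x); plot(d1); plot(d12); plot(d)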

1.7 Cointegration
We say (Xt)t∈Z is integrated of order d if the d times differenced time-series

    (1 − B)^d Xt is stationary

but (1 − B)^{d′} Xt is not stationary for all d′ < d. Let (Xt)t∈Z and (Yt)t∈Z be two time-series, each integrated
of order d = 1. The two time-series are said to be co-integrated if there exists a linear combination of both
that is integrated of order 0 (that is, the linear combination is stationary), that is, there exists a β ∈ R² such
that
    Ut = Xt β1 + Yt β2
is stationary. The β is then clearly not unique and one can for example set β1 = 1 wlog. Examples are
stock prices of Apple and Google (where the difference is perhaps stationary) or economic data on money
supply, income, prices and interest rates. If cointegration holds, we can model the difference as a stationary
process. Cointegration means intuitively that while the processes can drift marginally, they cannot drift far
apart from each other.

1.8 Regression with correlated errors
We can try to estimate a linear trend in two different models

(i) A trend-stationary model, where


Xt = µ + βt + Nt ,
where Nt is stationary noise.
(ii) A difference-stationary model with
Xt = µ + βt + Nt ,
where ∇Nt is stationary noise.

Our goal is to estimate the trend β and give a confidence interval for this parameter.
Note: if model (i) is correct, then model (ii) is also correct, but we lose efficiency in the estimation if we
proceed with differencing.

1.8.1 Trend-stationary models and pre-whitening


If model (i) is correct, we would like to use least-squares estimation to estimate the slope β and derive a
confidence interval as in standard least-squares regression. We can write in vector notation

X = Zθ + N, (2)

where

X = (X1, . . . , Xn)ᵀ,
N = (N1, . . . , Nn)ᵀ,

    Z = ( 1  1 )
        ( 1  2 )
        ( 1  3 )
        ( ...  )
        ( 1  n ),

θ = (µ, β)ᵀ.

Note that we have chosen the special case (t1 , . . . , tn ) = (1, . . . , n) in this example to simplify notation.
Naive least-squares estimation would then use the estimator

    θ̂ = argmin_{θ′} ‖X − Zθ′‖₂².

The least-squares estimator is motivated as the maximum-likelihood estimator if the noise contributions are
independent and have a Gaussian distribution. In a time-series context, the first assumption will be violated.
Assume, for simplicity, though that the Gaussian assumption is correct and N ∼ N(0, Σ) for an n×n full-rank
matrix Σ of the general form for stationary time-series discussed before. Then, to get the maximum-likelihood
estimate
    argmin_{θ′} (X − Zθ′)ᵀ Σ⁻¹ (X − Zθ′),
observe that (2) is equivalent to
    AX = AZθ + AN,            (3)

for any full-rank matrix A ∈ R^{n×n}. Now choose A such that AN ∼ N(0, 1n), where 1n is the identity matrix
in n dimensions. If Σ = CᵀC is the Cholesky decomposition of Σ (and C is invertible since we assumed that
Σ has full rank), then such a pre-whitening matrix A is given for example by

    A = C⁻ᵀ,

since then
    Var(AN) = E(AN NᵀAᵀ) = A E(N Nᵀ) Aᵀ = A CᵀC Aᵀ = 1n.
Alternatively, if Σ = ΦΛΦᵀ is the eigenvalue decomposition of Σ with orthogonal Φ (specifically, ΦᵀΦ = 1n)
and diagonal Λ ∈ R^{n×n}, then we can equivalently choose

    A = Λ^{−1/2} Φᵀ

as a pre-whitening matrix, where Λ^{−1/2}_{ij} = 0 if i ≠ j and Λ^{−1/2}_{ii} = 1/√Λ_{ii} for i = 1, . . . , n, since then

    Var(AN) = A Σ Aᵀ = A ΦΛΦᵀ Aᵀ = Λ^{−1/2} ΦᵀΦ Λ ΦᵀΦ Λ^{−1/2} = Λ^{−1/2} Λ Λ^{−1/2} = 1n.

Once we have such a “pre-whitening” matrix A, the maximum-likelihood estimator for θ is

    θ̂ = argmin_{θ′} ‖AX − AZθ′‖₂² = ((AZ)ᵀ(AZ))⁻¹ (AZ)ᵀ(AX).

The point-estimate for β is thus the second entry of θ̂,

    β̂ = [((AZ)ᵀ(AZ))⁻¹ (AZ)ᵀ(AX)]₂.




To get a confidence interval for β, we need to know the variance of β̂. A confidence interval is easiest to derive
if β̂ is unbiased (that is, E(β̂) = β, which is true for the estimator above) and has a Gaussian distribution
(which it has under the assumption made above that N ∼ N(0, Σ)). Otherwise some modifications are
necessary. The distribution of θ̂ under the Gaussian distribution for the noise is given by

    θ̂ ∼ N(θ, ((AZ)ᵀ(AZ))⁻¹).

A brief argument for this: note that the least squares estimator is given by

    θ̂ = ((AZ)ᵀ(AZ))⁻¹ (AZ)ᵀ(AX).

The expected value is hence θ as AX = AZθ + AN. The variance of θ̂ is thus given by

    Var(θ̂) = Var(((AZ)ᵀ(AZ))⁻¹ (AZ)ᵀ AN),

where AN ∼ N(0, 1n). Thus

    Var(θ̂) = ((AZ)ᵀ(AZ))⁻¹ (AZ)ᵀ E((AN)(AN)ᵀ) (AZ) ((AZ)ᵀ(AZ))⁻ᵀ = ((AZ)ᵀ(AZ))⁻¹,

using E((AN)(AN)ᵀ) = 1n.

The distribution is furthermore jointly normal (as it is a linear combination of normal random variables),
which completes the argument.
Remember that β̂ is the second component of θ̂. We thus know that

    (β̂ − β) / √Var(β̂) ∼ N(0, 1),

where Var(β̂) = (((AZ)ᵀ(AZ))⁻¹)₂,₂ in our example. Thus

    P(−q ≤ (β̂ − β)/√Var(β̂) ≤ q) ≥ 1 − α,

where q = Φ⁻¹(1 − α/2) is the 1 − α/2 quantile of a standard normal distribution (for example q ≈ 1.96 ≈ 2
for α = 0.05). A (1 − α)-confidence interval for β is then given by

    [β̂ − q √Var(β̂),  β̂ + q √Var(β̂)].

If you are unsure about any of this, please consult a textbook on regression or introductory statistics. If we
estimate Σ from the data, we will have to modify the confidence intervals accordingly (they tend to get
wider), but this is beyond the scope here.
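
A minimal R sketch of the pre-whitening (generalized least squares) idea above, under the assumption — made only for this illustration — that Σ is known and corresponds to AR(1) noise with parameter 0.6:

    set.seed(3)
    n  <- 200
    tt <- 1:n
    N  <- arima.sim(model = list(ar = 0.6), n = n)      # correlated noise
    X  <- 2 + 0.05 * tt + N                             # trend-stationary model (i)
    Z  <- cbind(1, tt)

    Sigma <- toeplitz(0.6^(0:(n - 1))) / (1 - 0.6^2)    # AR(1) covariance (assumed known)
    C <- chol(Sigma)                                    # Sigma = C' C, C upper triangular
    A <- solve(t(C))                                    # pre-whitening matrix A = C^{-T}

    AX <- A %*% X; AZ <- A %*% Z
    theta_hat <- solve(t(AZ) %*% AZ, t(AZ) %*% AX)      # GLS estimate of (mu, beta)
    V <- solve(t(AZ) %*% AZ)                            # Var(theta_hat)
    beta_hat <- theta_hat[2]
    beta_hat + c(-1, 1) * qnorm(0.975) * sqrt(V[2, 2])  # 95% confidence interval for beta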

1.8.2 Difference-stationary models


For model (ii), let Zt = ∇Xt be the differenced time-series. The slope β in the model can be estimated as
the mean of the differenced time-series, that is,

    β̂ = (1/n) Σ_{t=1}^{n} Zt.

To get a confidence interval, we need to know again the variance of β̂. If we do know the variance (and
assume a Gaussian distribution of β̂ for simplicity), then a confidence interval is given again by

    [β̂ − q √Var(β̂),  β̂ + q √Var(β̂)],

where the quantile q of the standard normal distribution is defined just as above. Now, if the Zt were
independent (and Zt stationary), then

    Var(β̂) = Var((1/n) Σ_{t=1}^{n} Zt) = Var(Z1)/n.

More generally, if we allow correlations and Zt has autocovariance γ, then

(a)
    Var((1/n) Σ_{t=1}^{n} Zt) = (1/n) Σ_{k=−n+1}^{n−1} (1 − |k|/n) γ(k).

(b) If Σ_{k=−∞}^{∞} |γ(k)| < ∞, then, as n → ∞,

    n · Var((1/n) Σ_{t=1}^{n} Zt) → Σ_{k=−∞}^{∞} γ(k) = Var(Z1) Σ_{k=−∞}^{∞} ρ(k)
                                  = Var(Z1)(1 + Σ_{k≠0} ρ(k)) = Var(Z1)(1 + 2 Σ_{k=1}^{∞} ρ(k)),

and the asymptotic variance is thus inflated (or deflated) compared to the independence case by the factor
(1 + 2 Σ_{k=1}^{∞} ρ(k)), which is typically larger than 1 but can also be less than 1 when negative
auto-correlations appear.
Proof of (a):

    Var(Σ_{t=1}^{n} Zt) = Σ_{t=1}^{n} Σ_{s=1}^{n} Cov(Zt, Zs)
                        = Σ_{t=1}^{n} Σ_{s=1}^{n} γ(t − s)
                        = Σ_{k=−n+1}^{n−1} γ(k) · (number of pairs (t, s) with t − s = k)
                        = Σ_{k=−n+1}^{n−1} γ(k) · (n − |k|)

and

    Var((1/n) Σ_{t=1}^{n} Zt) = (1/n²) Var(Σ_{t=1}^{n} Zt) = (1/n) Σ_{k=−n+1}^{n−1} (1 − |k|/n) γ(k).

Proof of (b): Using (a),

    Σ_{k=−n+1}^{n−1} (1 − |k|/n) γ(k) = Σ_{k=−∞}^{∞} max{0, 1 − |k|/n} γ(k),

where each summand converges to γ(k) as n → ∞, and the claim follows by dominated convergence.
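
A small R sketch of the finite-sample variance formula in (a), where the autocovariances are replaced by their sample estimates (an assumption of this illustration, not part of the derivation above):

    set.seed(4)
    Z <- arima.sim(model = list(ma = 0.5), n = 400)              # differenced series Z_t (assumed)
    n <- length(Z)
    beta_hat <- mean(Z)

    g <- acf(Z, type = "covariance", lag.max = n - 1, plot = FALSE)$acf[, 1, 1]
    k <- 1:(n - 1)
    var_betahat <- (g[1] + 2 * sum((1 - k / n) * g[k + 1])) / n  # (1/n) sum (1-|k|/n) gamma(k)
    c(naive = g[1] / n, corrected = var_betahat)                 # compare with Var(Z_1)/n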

2 Time-Domain models
2.1 Causal and stationary autoregressions
A stochastic process (Xt )t∈Z is called a Markovian autoregressive process of order p if

Xt = φ1 Xt−1 + . . . + φp Xt−p + Wt

where Wt is independent of all Xs, s < t (this is the "causal" condition; see the discussion in the next
section) and φp ≠ 0. If the process is stationary, we call this a causal, stationary AR(p) process.

The variable Wt is called the innovation at time t. In operator notation,

Φ(B)Xt = Wt ,

where B is again the backshift operator and

    Φ(B) := 1 − φ1 B − . . . − φp B^p.

For a Markovian autoregression, φ1 Xt−1 + . . . + φp Xt−p + E(Wt) is the best prediction of Xt from the
past. Furthermore, the innovations at different times are independent: for t > s, Wt is independent of
Xs − φ1 Xs−1 − . . . − φp Xs−p = Ws.
When is a Markovian autoregression stationary? First it is clear that under stationarity, the innovations
are not only independent, but also identically distributed. For an AR(1)-process, we obtain by iteration

    Xt = Σ_{j=0}^{t−1} φ^j Wt−j + φ^t X0.

Hence if second moments exist and if (Xt) is stationary, then

    γ(0) = Var(W) Σ_{j=0}^{t−1} φ^{2j} + φ^{2t} γ(0),

since by assumption all terms on the right are independent. Clearly this implies that |φ| < 1, as the
equation cannot be fulfilled if |φ| > 1 on the one hand (the right-hand side will always be larger than the
left-hand side, for example), and it has a solution for |φ| < 1:

    γ(0) = Var(W) Σ_{j=0}^{t−1} φ^{2j} / (1 − φ^{2t}) = Var(W) [(1 − φ^{2t})/(1 − φ²)] / (1 − φ^{2t}) = Var(W)/(1 − φ²).
The general case is covered by the next theorem.

Theorem 1. A stationary Markovian autoregression with finite second moments exists iff all zeroes of the
polynomial

    Φ(z) = 1 − φ1 z − . . . − φp z^p

are outside the unit disc {z; |z| ≤ 1}. In that case the process has the representation

    Xt = Σ_{j=0}^{∞} ψj Wt−j,

where the coefficients ψj are the solution of the recursion

    ψj = φ1 ψ_{j−1} + . . . + φp ψ_{j−p}   (j ≥ 1)            (4)

with initial conditions ψ0 = 1, ψ_{−1} = . . . = ψ_{1−p} = 0, and they thus converge to zero exponentially
fast. Moreover, if we define for t > 0

    X*_t = φ1 X*_{t−1} + . . . + φp X*_{t−p} + Wt

with arbitrary initial conditions X*_0, X*_{−1}, . . . , X*_{1−p}, then Xt − X*_t → 0 almost surely and in L¹.

Proof. We write the autoregression of order p as a vector autoregression of order 1: If we set
Zt = (Xt, Xt−1, . . . , Xt−p+1)ᵀ, ηt = (Wt, 0, . . . , 0)ᵀ and

    Φ = ( φ1   ...   φp−1   φp )
        (                    0 )
        (     I_{p−1}        ⋮ )
        (                    0 )

(with I_{p−1} the identity matrix in dimension p − 1), then

    Zt = Φ Zt−1 + ηt.

Iterating this autoregression, we obtain

    Zt = Σ_{j=0}^{t−1} Φ^j ηt−j + Φ^t Z0.

If Zt is stationary with finite variance, then Φ^t must converge to zero, and this is known to be equivalent
to the condition that all eigenvalues of Φ are smaller than one in absolute value¹. The characteristic
polynomial of Φ is however nothing else than the polynomial z^p Φ(1/z) = z^p − φ1 z^{p−1} − . . . − φp. In
more detail: for an eigenvalue λ we need that Φv = λv, where v ∈ C^p is the eigenvector, that is

    ( φ1   ...   φp−1   φp ) ( v1 )       ( v1 )
    (                    0 ) ( v2 )       ( v2 )
    (     I_{p−1}        ⋮ ) ( ...)  = λ  ( ...)
    (                    0 ) ( vp )       ( vp ),

which is equivalent (using simple algebra) to the condition that λ^p − φ1 λ^{p−1} − . . . − φp = 0, and the
roots of Φ(z) are thus outside the unit circle iff all eigenvalues of Φ are inside the unit circle. The above
argument shows that if there are eigenvalues of Φ with absolute value larger than or equal to 1 (or,
equivalently, if there are roots of the polynomial Φ(z) that lie inside the closed unit disc), then the process
cannot be stationary.
¹ since, if any eigenvalue of Φ is larger than 1 in absolute value, the term E(‖Φ^t Z0‖₂²) would increase
exponentially with increasing t and E(‖Zt‖₂²) can hence not be constant, as required for a stationary
process. To see this, note that Φ^t is identical to U Λ^t U⁻¹ if U Λ U⁻¹ with diagonal Λ is the eigenvalue
decomposition of Φ.

If, on the other hand, we assume that all eigenvalues of Φ are indeed smaller than 1 in absolute value, we
can take the limit in the above recursion and obtain

    Zt = Σ_{j=0}^{∞} Φ^j ηt−j.            (5)

This process now has finite variance and is clearly stationary (since the ηt contributions are iid). Hence
both parts of the first claim follow (stationarity if and only if the roots of Φ are outside the unit circle).
We still need to show the recursion (4) for the coefficients ψj, j ≥ 1. From the definition of Zt and ηt and
(5), it follows that

    ψj = (Φ^j)_{1,1},

that is, the coefficient ψj is equal to the (1, 1) component of the matrix Φ^j for all j ≥ 0. For Φ^j we have
the recursion

    Φ^j = Φ Φ^{j−1}.

Hence the first column of Φ^j, say c^{(j)}, satisfies the recursion

    c^{(j)} = ((φ1, . . . , φp)ᵀ c^{(j−1)}, c^{(j−1)}_1, . . . , c^{(j−1)}_{p−1}).

Note that ψj = (Φ^j)_{1,1} = c^{(j)}_1. The first component of the recursion reads

    ψj = c^{(j)}_1 = φ1 c^{(j−1)}_1 + φ2 c^{(j−1)}_2 + . . . + φp c^{(j−1)}_p.

But we can see from the recursion that c^{(j−1)}_2 = c^{(j−2)}_1 = ψ_{j−2} and c^{(j−1)}_k = c^{(j−k)}_1 = ψ_{j−k},
which shows the claimed recursion for ψ, namely that

    ψj = φ1 ψ_{j−1} + φ2 ψ_{j−2} + . . . + φp ψ_{j−p}.
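
The recursion (4) for the ψ-coefficients is easy to implement; a minimal R sketch (the AR(2) coefficients are an arbitrary assumption), cross-checked against the built-in ARMAtoMA(), which returns ψ1, ψ2, . . .:

    psi_coefficients <- function(phi, J) {
      # psi_j = phi_1 psi_{j-1} + ... + phi_p psi_{j-p}, with psi_0 = 1 and psi_j = 0 for j < 0
      p <- length(phi)
      psi <- c(1, numeric(J))                 # psi[1] corresponds to psi_0
      for (j in 1:J) {
        k <- 1:min(j, p)
        psi[j + 1] <- sum(phi[k] * psi[j + 1 - k])
      }
      psi
    }

    phi <- c(0.5, 0.3)                        # assumed example coefficients
    psi_coefficients(phi, J = 8)              # psi_0, psi_1, ..., psi_8
    ARMAtoMA(ar = phi, lag.max = 8)           # built-in check: psi_1, ..., psi_8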

2.2 Some more discussion of the term “causality”


The theorem above shows that a stationary Markovian autoregression, where Wt is independent of past
values Xs, s < t, can be written as a linear combination of past innovations. Without the condition that Wt
is independent of past values Xs, s < t, the theorem is false. To see why, take any |φ| > 1 and set

    Xt = − Σ_{j=1}^{∞} φ^{−j} W_{t+j}.

Clearly this is stationary if the Wt are i.i.d. Moreover,

    φ Xt−1 = −Wt + Xt,

so the recursion is satisfied. However, Wt contributes to the sum defining Xs for s < t, and thus the two
variables are dependent.

As a side remark: We called the condition that Wt is independent of all preceding values Xs, s < t, the
“causal” condition as it implies conversely that Xt is not a function of future values of Ws, s > t. The book
Shumway & Stoffer defines an AR(p) process without the last “causal” condition that Wt is independent
of all previous Xs, s < t. Which condition is called causal and which condition leads to stationarity is,
unfortunately, inconsistent in the literature. To make this a bit more transparent, we have the corollary that
if the process (Xt)t∈Z is of the Markovian autoregressive form
    Xt = φ1 Xt−1 + . . . + φp Xt−p + Wt
and stationary with finite variance, then the following three conditions are equivalent:

(i) Wt is independent of all Xs, s < t [first possible definition of a “causal” process].

(ii) Xt can be written as a one-sided linear process:

    Xt = Σ_{j=0}^{∞} ψj Wt−j,

where Σ_{j=0}^{∞} |ψj| < ∞ [second possible definition of a “causal” process].

(iii) All zeroes of the polynomial Φ(z) are outside the unit circle {z ∈ C : |z| ≤ 1} [third possible definition].

The first and second arguably make the most sense for defining “causal”, but it is a matter of taste. Note
that if we assume (i), as we have done here, then (iii) is a condition for stationarity; in other words, (i) can
be true and (iii) violated, but the process is then not stationary any longer. Conversely, the process given
above is an example where (iii) is violated and the process is stationary, but then condition (i) is violated (the
book discusses more, see “every explosion has a cause” and related parts).

2.3 Skeletons of AR(p) processes and impulse-response functions.


Suppose we fix initial conditions X0 , . . . , Xp−1 and compute the solutions for the homoge-
neous equation
Φ(B)Xt = 0 ∀t ≥ p. (6)
The solution Xt is then an “impulse-response” (the response to the initial impulse of the
initial condition) and the solutions to the general case are superpositions of such impulse-
response functions since we have a linear system. The solutions to (6) are linear combinations
of the solutions to the homogenous equations
p
X
ut = φj ut−j or, equivalently, Φ(B)ut = 0 ∀t ∈ Z. (7)
j=1

Theorem 2. The set of sequences (ut ) that satisfy the above difference equation (7) is a
vector space of dimension p. A basis is given by the sequences of the form
ut = tj z0−t
where z0 is a root of the polynomial
Φ(z) = 1 − φ1 z − . . . φp z p
with multiplicity m and 0 ≤ j < m.

Proof. It is clear that a linear combination of two solutions is again a solution. Moreover, if p consecutive
values u_{k+1}, . . . , u_{k+p} of a solution (u_t) are given, then the solution is unique: values u_t for
t > k + p follow by forward iteration, those for t ≤ k follow by backward iteration

    u_{t−p} = (u_t − φ1 u_{t−1} − . . . − φ_{p−1} u_{t−p+1}) / φp.

Therefore the dimension of the vector space is p.
Next, we show that the above sequences are indeed solutions. First we take j = 0:

    z_0^{−t} − φ1 z_0^{−(t−1)} − . . . − φp z_0^{−(t−p)} = z_0^{−t} Φ(z_0) ≡ 0.

Similarly, for j = 1 we have

    t z_0^{−t} − φ1 (t − 1) z_0^{−(t−1)} − . . . − φp (t − p) z_0^{−(t−p)} = z_0^{−t} t Φ(z_0) − z_0^{−(t−1)} Φ′(z_0) ≡ 0.

The general case follows because

    Φ^{(j)}(z_0) = −z_0^{−j} Σ_{k=j}^{p} φk k(k − 1) · · · (k − j + 1) z_0^k = 0   (j < m)

implies that also

    Σ_{k=1}^{p} φk k^j z_0^k = 0   (j < m).

The proof will be completed if we can show that the above solutions are linearly independent
since the number of zeroes of a polynomial of degree p counted with their multiplicity is equal
to p. For a proof of the linear independence, see Brockwell and Davies, Theorem 3.6.2.

For real-valued processes, the solutions are thus given by the basis vectors
(i) a single real root z_0 ∈ R yields the basis vector

    u_t = z_0^{−t};

(ii) complex roots z_0 = r exp(iµ), z̄_0 = r exp(−iµ) yield the two vectors

    u_t = cos(µt) r^{−t},   u_t = sin(µt) r^{−t};

(iii) a real root z_0 ∈ R with multiplicity k yields the basis vectors

    u_t = z_0^{−t} t^j   for j = 0, . . . , k − 1;

(iv) complex roots z_0 = r exp(iµ) ∈ C with multiplicity k yield

    u_t = cos(µt) r^{−t} t^j,   u_t = sin(µt) r^{−t} t^j   for j = 0, . . . , k − 1.

The coefficients in the basis are chosen to satisfy the initial conditions. The solutions converge to 0 as
t → ∞ if and only if all roots are outside the unit circle {z ∈ C : |z| ≤ 1} (and if the initial conditions are
not chosen precisely to cancel out the corresponding exponential growth terms, if such terms exist).
Examples:
1) AR(1). u_t = φ1 u_{t−1} with φ1 ≠ 0 has root z_0 = 1/φ1. Exponential growth if |φ1| > 1, critical
behaviour (random-walk type) if φ1 = 1, and convergence to 0 from all initial conditions if |φ1| < 1.
2) AR(2). The roots of Φ(z) = 1 − φ1 z − φ2 z² are

    z_{1,2} = −(φ1 ± √(φ1² + 4φ2)) / (2φ2).

One can verify that z1 and z2 are both outside the unit circle iff

    −1 < φ2 < 1,   φ2 < 1 − |φ1|.

Hence the set of parameters which correspond to a stationary Markovian autoregression is a triangle. The
roots are complex (and the solutions then show oscillatory behaviour) for φ2 < −φ1²/4.
The above discussion shows (again) that the roots of Φ(z) are critical for stationarity of the
process (here in the noiseless case) and determine whether there is oscillatory behaviour.
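
In R, the roots of Φ(z), and hence the stationarity/causality condition, can be checked with polyroot(); a short sketch with assumed AR(2) coefficients:

    phi <- c(0.5, 0.3)                       # assumed AR(2) coefficients
    z <- polyroot(c(1, -phi))                # roots of Phi(z) = 1 - phi_1 z - phi_2 z^2
    Mod(z)                                   # all moduli > 1  <=>  stationary, causal AR(2)
    all(Mod(z) > 1)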

2.4 Invertible moving averages


A linear moving average of order q,

    Xt = Wt + θ1 Wt−1 + . . . + θq Wt−q

with Wt i.i.d., is always stationary. In operator notation, we write

    Xt = Θ(B)Wt,

where

    Θ(B) = 1 + θ1 B + . . . + θq B^q.

Wt is always independent of Xs for s < t for such a process. However, we cannot call Wt the innovation of
the process unless the other terms θ1 Wt−1 + . . . + θq Wt−q on the right-hand side can be expressed with the
values Xs for s < t. If they can be expressed in this way, Wt is independent of all Xs with s < t and a
function of only Xs, s ≤ t.
We therefore call a moving average invertible if there are coefficients (πj) with Σ_j |πj| < ∞ such that

    Wt = Σ_{j=0}^{∞} πj Xt−j.

(An autoregression is always invertible: just set π0 = 1, πj = −φj for 1 ≤ j ≤ p and πj = 0 for j > p.)
Invertibility helps with the uniqueness of the representation; see for example Examples 3.4 and 3.5 in the
book, where it is shown that the following two processes have the identical distribution under Gaussian noise:

    Xt = Wt + (1/5) Wt−1   with Wt iid N(0, 25),
    Xt = Wt + 5 Wt−1       with Wt iid N(0, 1).

We prefer the first representation as it is invertible. Specifically, using the same recursion idea as before,

    Wt = Xt − (1/5) Wt−1
       = Xt − (1/5)(Xt−1 − (1/5) Wt−2)
       = Xt − (1/5)(Xt−1 − (1/5)(Xt−2 − (1/5) Wt−3))
       = . . .
       = Σ_{j=0}^{∞} πj Xt−j   with πj = (−1/5)^j.

Theorem 3. A moving average is invertible iff all zeroes of the polynomial

    Θ(z) = 1 + θ1 z + . . . + θq z^q

are outside the unit circle {z; |z| ≤ 1}. In that case the coefficients πj are the solution of the recursion

    πj = −θ1 π_{j−1} − . . . − θq π_{j−q}

with initial conditions π0 = 1, π_{−1} = . . . = π_{1−q} = 0, and they thus converge to zero exponentially
fast.

Proof. We use the same proof idea as in the theorem about causality, reversing the roles
of Wt and Xt . For q = 1, we simply iterate the equation Xt = Wt + θWt−1 (which is
Wt = Xt −θWt−1 ). For q > 1, we write the process as a vector moving average of order 1.

2.5 ARMA-Processes
An autoregressive moving average process of order (p, q), called ARMA(p, q), combines the properties of
the two previous models. The recursion is

    Xt = Σ_{j=1}^{p} φj Xt−j + Σ_{j=1}^{q} θj Wt−j + Wt.

For a reasonable model, Wt should again be independent of Xs for s < t and Wt should depend only on past
values Xs, s ≤ t, i.e. the model should be invertible, that is, there exist summable coefficients πj,
j = 0, 1, . . ., such that

    Wt = Σ_{j=0}^{∞} πj Xt−j.

Again it then follows that the variables Wt are independent for different times t, and if (Xt) is stationary,
the Wt are even i.i.d.
We also want the condition (that is sometimes referred to as a condition for stationarity and sometimes as
a causal condition, see Section 2.2): there are summable coefficients ψj such that

    Xt = Σ_{j=0}^{∞} ψj Wt−j.

If one wants to generalize the arguments for the autoregressive case, one sees, however, that a problem
occurs: for instance, for any φ,

    Xt = φ Xt−1 + Wt − φ Wt−1

has the stationary solution Xt = Wt, which is also invertible, and Wt is independent of Xs for s < t. The
reason for this problem is that Φ and Θ have common zeroes.
If we assume that Φ and Θ have no common zeroes, then the conditions that all zeroes of Φ
and Θ are outside of the unit circle are again necessary and sufficient for the existence of a
stationary ARMA model which is invertible and causal.
The recursion of the ARMA process can then be written as
Φ(B)Xt = Θ(B)Wt .
Formally, we can thus write

    Xt = Φ(B)^{−1} Θ(B) Wt,   Wt = Θ(B)^{−1} Φ(B) Xt.
If Φ(z) has no zeroes in {z; |z| ≤ 1}, the Taylor series

    Θ(z)/Φ(z) = Σ_{j=0}^{∞} ψj z^j

converges on {z; |z| ≤ 1} and thus we can define

    Φ(B)^{−1} Θ(B) = Σ_{j=0}^{∞} ψj B^j.

From the equality

    Θ(z) = Φ(z) · Σ_{j=0}^{∞} ψj z^j,

we obtain, by comparing the coefficients of z^j on both sides, the equations

    ψj − Σ_{k=1}^{min(p,j)} φk ψ_{j−k} = θj for 0 ≤ j ≤ q,   and   = 0 for j > q            (8)

(we set θ0 = 1). This is a generalisation of the corresponding recursion (4) for AR-processes (where θ0 = 1
and all θj = 0 for j > 0) and the most convenient way to compute the ψj numerically. A similar argument
applies for the coefficients in the invertibility representation Wt = Θ(B)^{−1} Φ(B) Xt. In particular, ψj
again satisfies a difference equation except for an initial part of length q. This generalizes the previous
recursive way of computing the MA-style ψ-coefficients for an AR-model. Note that the coefficients ψj decay
exponentially fast again for a causal stationary process and we can in practice stop the recursion at some
point when the coefficients have become very small.
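
In R, the ψ-coefficients of recursion (8) can be obtained with the built-in ARMAtoMA(); the π-coefficients can be obtained by the same function after swapping the (sign-flipped) roles of the AR and MA polynomials — a small trick used here for illustration only, not something prescribed in the script. The parameter values are arbitrary assumptions:

    phi   <- 0.7                              # assumed AR part
    theta <- 0.4                              # assumed MA part

    psi <- ARMAtoMA(ar = phi, ma = theta, lag.max = 10)   # psi_1, psi_2, ... (psi_0 = 1)
    round(psi, 4)                                         # decays exponentially

    # pi-coefficients of W_t = sum_j pi_j X_{t-j}: coefficients of Phi(z)/Theta(z),
    # obtained by treating -theta as the "AR" part and -phi as the "MA" part
    pi_ <- ARMAtoMA(ar = -theta, ma = -phi, lag.max = 10) # pi_1, pi_2, ... (pi_0 = 1)
    round(pi_, 4)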

2.6 Autocorrelation
Let (Xt ) be a stationary, causal and invertible ARMA(p, q) process.
We compute the autocovariance γ(h) of the process. There are two options. The first one uses the causal
MA(∞)-representation

    Xt = Σ_{j=0}^{∞} ψj Wt−j,

to get

    γ(h) = Cov(X_{t+h}, X_t) = Cov(Σ_{j=0}^{∞} ψj W_{t+h−j}, Σ_{j′=0}^{∞} ψ_{j′} W_{t−j′})
         = Σ_{j=0}^{∞} Σ_{j′=0}^{∞} ψj ψ_{j′} E(W_{t+h−j} W_{t−j′})
         = Var(W) Σ_{j=0}^{∞} ψj ψ_{j+h},

using E(W_{t+h−j} W_{t−j′}) = Var(W) 1{h − j = −j′}.

The second possibility is as follows. Because the covariance is linear in both arguments, we obtain

    γ(h) = Cov(Xh, X0) = Σ_{j=1}^{p} φj Cov(X_{h−j}, X0) + Σ_{k=0}^{q} θk Cov(W_{h−k}, X0)
         = Σ_{j=1}^{p} φj γ(h − j) + Σ_{k=h}^{q} θk Cov(W_{h−k}, X0).

In the last equality, we have used the property that Wt is independent of, and thus uncorrelated with, X0
for t > 0.
For h > q, the second sum on the right runs over an empty set and is thus zero. Therefore, we have shown
that for h ≥ max(p, q + 1) the autocovariance function satisfies the difference equation

    γ(h) = Σ_{j=1}^{p} φj γ(h − j).

In particular, it decays to zero exponentially fast. Moreover, the properties are closely linked to properties
of the zeroes of the polynomial Φ. If Φ has two zeroes r exp(±iν) with r close to one, then the covariance
is (approximately) a damped harmonic with period 2π/ν.
In order to compute the values γ(h) for h < max(p, q + 1), we need Cov(Ws, X0) for s ≤ 0. These
covariances can be computed from the causal representation:

    Xt = Σ_{j=0}^{∞} ψj Wt−j   ⇒   Cov(Ws, X0) = Var(W) ψ_{−s}   (s ≤ 0).

Example: Autoregressions. For autoregressions, we only need Cov(W0, X0), which is equal to Var(W). The
autocovariances γ(h) for 0 ≤ h ≤ p are then obtained from the equations

    γ(0) − Σ_{j=1}^{p} φj γ(j) = Var(W)
    γ(h) − Σ_{j=1}^{p} φj γ(h − j) = 0   (1 ≤ h ≤ p).

These equations are called the Yule-Walker equations and can be written in matrix form as

    Γp φ = γp            (9)
    Var(W) = σ² = γ(0) − φᵀ γp,            (10)

where

    Γp = ( γ(0)     γ(1)  γ(2)  ...  γ(p−1) )
         ( γ(1)     γ(0)  γ(1)  ...         )
         ( γ(2)     ...                     )
         (  ...                             )
         ( γ(p−1)   ...               γ(0)  ),

    φ = (φ1, φ2, φ3, . . . , φp)ᵀ,   γp = (γ(1), γ(2), γ(3), . . . , γ(p))ᵀ.

Replacing Γp and γp by their empirical counterparts Γ̂p and γ̂p will yield a possible estima-
tor (the Yule-Walker estimator) for the coefficients φ in an AR(p)-model and the variance
Var(W ) of the innovations (remember that Γ̂p is positive semi-definite).
Example: ARMA(1,1). From

    Xt = φ Xt−1 + Wt + θ Wt−1

we obtain

    Xt = φ² Xt−2 + Wt + (φ + θ) Wt−1 + φθ Wt−2

and therefore ψ0 = 1, ψ1 = (φ + θ). Hence the autocovariances γ(0) and γ(1) can be found by solving the
equations

    γ(0) = φ γ(1) + Var(W)(1 + θ(φ + θ))
    γ(1) = φ γ(0) + Var(W) θ.

This gives the variance

    γ(0) = Var(W) (1 + 2θφ + θ²) / (1 − φ²)

and the autocorrelations

    ρ(1) = φ + θ(1 − φ²)/(1 + 2θφ + θ²),   ρ(h) = φ^{h−1} ρ(1)   (h > 1).
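
The closed-form ARMA(1,1) autocorrelations above can be cross-checked in R against ARMAacf() and against a long simulation (parameter values are arbitrary assumptions):

    phi <- 0.7; theta <- 0.4                      # assumed parameters
    h <- 1:5

    rho1 <- phi + theta * (1 - phi^2) / (1 + 2 * theta * phi + theta^2)
    rho_formula <- phi^(h - 1) * rho1             # rho(h) = phi^{h-1} rho(1)

    rho_exact <- ARMAacf(ar = phi, ma = theta, lag.max = 5)[-1]   # drop lag 0

    set.seed(5)
    x <- arima.sim(model = list(ar = phi, ma = theta), n = 1e4)
    rho_sample <- acf(x, lag.max = 5, plot = FALSE)$acf[-1, 1, 1]

    cbind(rho_formula, rho_exact, rho_sample)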

2.7 Linear prediction and partial autocorrelations


The Yule-Walker equations from above also appear in the context of linear forecasting.
The best linear prediction of Xt based on (Xr, Xr+1, . . . , Xs) for r ≤ s < t or t < r ≤ s is the linear
combination

    X̂_t^{r:s} = α + Σ_{k=0}^{s−r} β(k) X_{s−k}

which minimizes the mean square error of prediction:

    X̂_t^{r:s} = argmin_x E((Xt − x)² | Xr, . . . , Xs).
Schematically, we try to do the following:

    . . . , X0, . . . , Xr−1, [Xr, Xr+1, . . . , Xs], . . . , Xt, . . .

where we use the values in brackets to predict the target value Xt.
X̂_t^{r:s} is determined (using the projection theorem) by a system of linear equations which involve the
mean and autocovariance of Xt only:

    E(Xt − X̂_t^{r:s}) = 0
    E((Xt − X̂_t^{r:s}) Xu) = 0   (r ≤ u ≤ s).

Assume for simplicity of notation that µ = 0 and a stationary process. Due to stationarity, the problem is
translation invariant (if we shift r, s, t by an amount h ∈ Z, nothing changes). The equations above for the
optimal linear coefficients are then equal to (the first one is automatically fulfilled if the mean is a constant 0;
we use u = s as the first equation, u = s − 1 as the second, and so on until u = r as the last equation):

    Γ_t^{r:s} β_t^{r:s} = γ_t^{r:s},            (11)

where

    β_t^{r:s} = argmin_β E((Xt − Σ_{k=0}^{s−r} β(k) X_{s−k})² | Xr, . . . , Xs)

and, analogous to the Yule-Walker equations in (9) for the coefficients φ (which are, for an AR(p) process,
also the best linear prediction coefficients β, recovered if s = t − 1 and r = t − p),

    Γ_t^{r:s} = ( γ(0)     γ(1)  γ(2)  ...  γ(s−r) )
                ( γ(1)     γ(0)  γ(1)  ...         )
                ( γ(2)     ...                     )
                (  ...                             )
                ( γ(s−r)   ...                γ(0) ),

    β_t^{r:s} = (β_t^{r:s}(0), β_t^{r:s}(1), β_t^{r:s}(2), . . . , β_t^{r:s}(s−r))ᵀ,
    γ_t^{r:s} = (γ(t−s), γ(t−s+1), γ(t−s+2), . . . , γ(t−r))ᵀ.

We can estimate β by using Γ̂_t^{r:s} and γ̂_t^{r:s} instead of Γ_t^{r:s} and γ_t^{r:s} and inverting the
matrix Γ̂_t^{r:s} (which is positive semi-definite; one needs to regularize if one or more singular values are
exactly zero). Note that for stationary time-series,

    β_t^{r:s} = β_{t+h}^{r+h:s+h}

for any h ∈ Z, that is the optimal linear prediction coefficients are invariant under a time-shift
(as the same is true for the first two moments of the time-series itself).
Prediction intervals: We can also compute the variance of the residual

    (σ_t^{r:s})² = E((Xt − X̂_t^{r:s})²) = Var(Xt − Σ_{k=0}^{s−r} β(k) X_{s−k})
                 = Var(Xt) − 2 Σ_{k=0}^{s−r} β(k) Cov(Xt, X_{s−k}) + Σ_{k,k′=0}^{s−r} β(k) β(k′) Cov(X_{s−k}, X_{s−k′})
                 = γ(0) − 2 Σ_{k=0}^{s−r} β(k) γ(t − s + k) + Σ_{k,k′=0}^{s−r} β(k) β(k′) γ(k − k′)
                 = γ(0) − 2 (β_t^{r:s})ᵀ γ_t^{r:s} + (β_t^{r:s})ᵀ Γ_t^{r:s} β_t^{r:s}.

For Gaussian noise, a (1 − α)-prediction interval for Xt is thus given by (under non-Gaussian noise, this
holds approximately in most cases)

    P(Xt ∈ X̂_t^{r:s} ± Φ^{−1}(1 − α/2) σ_t^{r:s}) = 1 − α.

As t − s → ∞, we have for stationary processes that X̂_t^{r:s} → 0 (or to the mean µ if µ ≠ 0, unlike the
µ = 0 assumed above) and (σ_t^{r:s})² → γ(0) = Var(Xt).
Example: One-step-ahead prediction. Let r = s = t − 1 (and we can introduce again a possibly non-zero
mean µ). One easily verifies that

    X̂_t^{t−1:t−1} = µ + ρ(1)(Xt−1 − µ),   E((Xt − X̂_t^{t−1:t−1})²) = γ(0)(1 − ρ(1)²).
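
A small R sketch solving the system (11) numerically for a one-step prediction based on the k most recent values, using the ARMA(1,1) autocovariances from the previous example as an assumed input:

    phi <- 0.7; theta <- 0.4; sigma2 <- 1                  # assumed model
    k <- 5                                                 # predict from X_{t-1}, ..., X_{t-k}

    gamma0 <- sigma2 * (1 + 2 * theta * phi + theta^2) / (1 - phi^2)
    gam    <- gamma0 * ARMAacf(ar = phi, ma = theta, lag.max = k)  # gamma(0), ..., gamma(k)

    Gamma <- toeplitz(gam[1:k])                            # Gamma_t^{r:s} with s = t-1, r = t-k
    g     <- gam[2:(k + 1)]                                # gamma_t^{r:s} = (gamma(1), ..., gamma(k))
    beta  <- solve(Gamma, g)                               # best linear prediction coefficients

    mse <- gam[1] - 2 * sum(beta * g) + as.numeric(t(beta) %*% Gamma %*% beta)  # (sigma_t^{r:s})^2
    list(beta = beta, mse = mse)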

For r < s, specifically r = 0 and s = k − 1 for t = k, we can either use the linear system of equations
above. Alternatively, the Durbin-Levinson algorithm allows us to compute the coefficients of the linear
predictions

    X̂_k^{0:k−1} = α_k + Σ_{j=1}^{k} β_k^{0:k−1}(j) · X_{k−j}

and the mean square errors σ_k² = E((X_k − X̂_k^{0:k−1})²) recursively. We start with σ_0² = γ(0) (as we
do not have any values to base the forecast on and the predicted value is just the mean µ) and α_0 = µ.
Then we have

    β_k^{0:k−1}(j) = β_{k−1}^{0:k−2}(j) + τ(k) β_{k−1}^{0:k−2}(k − j)   (1 ≤ j < k),
    β_k^{0:k−1}(k) = τ(k),
    α_k = µ (1 − Σ_{j=1}^{k} β_k^{0:k−1}(j)),
    σ_k² = σ_{k−1}² (1 − τ(k)²),

where

    τ(k) = (γ(k) − Σ_{j=1}^{k−1} β_{k−1}^{0:k−2}(j) γ(k − j)) / σ_{k−1}²

is the so-called partial autocorrelation at lag k. According to the above formula, τ(k) is the coefficient of
X0 in X̂_k^{0:k−1}, and 1 − τ(k)² gives the reduction in mean square error if one more observation from
the past becomes available.
The distinction between the autocorrelation ρ(·) and the partial autocorrelation τ(·) can be characterized
as follows:

    autocorrelation           ρ(k) = Cor(X0, Xk)
    partial autocorrelation   τ(k) = Cor(X0 − X̂_0^{1:k−1}, Xk − X̂_k^{1:k−1}),

that is, the partial autocorrelation τ(k) is the correlation between X0 and Xk once the linear effects of
the intermediate time-points X1, . . . , Xk−1 have been removed:

    . . . , X−1, X0, [X1, X2, . . . , Xk−1 : effects removed], Xk, . . .

For a derivation of these formulae, see for instance Brockwell and Davies, Chapter 2.5.
Examples:
(i) For an AR(1)-process we have γ(h) = γ(0) φ^{|h|} and therefore

    τ(2) = (γ(2) − φ · γ(1)) / σ_1² = 0.

This means that if we know Xt−1, then there is no additional information in Xt−2 that can be used for
predicting Xt. This holds because Xt = φ Xt−1 + Wt and Wt is independent of all past values.
(ii) For general AR(p)-processes Xt = Σ_{j=1}^{p} φj Xt−j + Wt, the autocorrelation decays exponentially
(as discussed before). For the partial autocorrelation, observe that the best prediction for k > p is

    X̂_k^{1:k−1} = Σ_{j=1}^{p} φj X_{k−j}.

Thus Xk − X̂_k^{1:k−1} = Wk, which is uncorrelated with any past values of Xt, t < k. Hence τ(k) = 0 for
k > p for an AR(p)-process.
(iii) For an MA(q)-process, the partial autocorrelation decays exponentially but the autocorrelation is zero
after q lags.

In summary, we have

                                 AR(p)-process       MA(q)-process       ARMA-process
    Autocorrelation ρ            decays              zero after lag q    decays
    Partial autocorrelation τ    zero after lag p    decays              decays

This distinction can be used for model identification.
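
In R, this identification heuristic amounts to inspecting acf() and pacf() of the (possibly differenced) series; a short sketch on simulated data with assumed orders:

    set.seed(6)
    x_ar <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 500)   # AR(2)
    x_ma <- arima.sim(model = list(ma = c(0.7, 0.3)), n = 500)   # MA(2)

    par(mfrow = c(2, 2))
    acf(x_ar); pacf(x_ar)    # ACF decays, PACF cuts off after lag 2
    acf(x_ma); pacf(x_ma)    # ACF cuts off after lag 2, PACF decays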
Prediction from the infinite past: We assume that both Xt and Wt have mean zero (i.e. we have subtracted
the mean). For a causal and invertible ARMA model,

    X̂_t^{−∞:t−1} = Σ_{j=1}^{p} φj Xt−j + Σ_{j=1}^{q} θj Wt−j            (12)

is then the best prediction of Xt based on the infinite past (Xs, s < t), and Wt is the prediction error. In
order to compute it, one can either express Wt−j with past observations by computing the coefficients πj
according to the formula given above, or one can set W_{s−1} = · · · = W_{s−q} = 0 for a time point s ≪ t and
then iterate the relation

    Wu = Xu − Σ_{j=1}^{p} φj X_{u−j} − Σ_{j=1}^{q} θj W_{u−j}

for u = s, s + 1, . . . , t. The error due to assuming W_{s−1} = · · · = W_{s−q} = 0 decays exponentially as
t − s → ∞.

Predictions from the infinite past for more than one time step ahead can be made as follows: Because the
best prediction is linear, we obtain for k > 0

    X̂_{t+k}^{−∞:t−1} = Σ_{j=1}^{min(p,k−1)} φj X̂_{t+k−j}^{−∞:t−1} + Σ_{j=k}^{p} φj X_{t+k−j} + Σ_{j=k}^{q} θj W_{t+k−j}.

Hence we see that the predictions for different lead times satisfy the difference equation associated with the
AR part, except for finitely many lead times at the beginning. In particular, as the lead time increases, the
predictions tend to zero, the mean of Xt.
Use of exponentially weighted moving averages. Take again the one-step forecast k = 1. In practice,
many people regress Xt onto the m + r predictor variables

    (Xt−1, Xt−2, . . . , Xt−m, E_{t−1}^{λ1}, E_{t−1}^{λ2}, . . . , E_{t−1}^{λr}),

using m past values of Xt (which might for example be a return time-series of an asset), where the
exponentially weighted moving averages (EWMA) are defined for λ ∈ (1, ∞] as

    E_t^{λ} = Σ_{j=0}^{∞} λ^{−j} Xt−j.

Note that this allows an easy recursion and updating as

    E_t^{λ} = Xt + (1/λ) E_{t−1}^{λ}.
In practice, if observing the time-series at n time-points t1, . . . , tn and trying to regress it onto the m + r
variables defined above, one can then define the target variable Y ∈ R^n and the design matrix
Z ∈ R^{n×(m+r)} as

    Z = ( X_{t1−1}  X_{t1−2}  ...  X_{t1−m}   E_{t1−1}^{λ1}  E_{t1−1}^{λ2}  ...  E_{t1−1}^{λr} )
        ( X_{t2−1}  X_{t2−2}  ...  X_{t2−m}   E_{t2−1}^{λ1}  E_{t2−1}^{λ2}  ...  E_{t2−1}^{λr} )            (13)
        ( ...                                                                                  )
        ( X_{tn−1}  X_{tn−2}  ...  X_{tn−m}   E_{tn−1}^{λ1}  E_{tn−1}^{λ2}  ...  E_{tn−1}^{λr} ),

    Y = (X_{t1}, X_{t2}, . . . , X_{tn})ᵀ.

The least squares estimator is then

    β̂ = argmin_β ‖Y − Zβ‖₂² = (ZᵀZ)⁻¹ ZᵀY,

assuming that ZᵀZ is invertible. Combining the m + r predictor variables with the coefficients β̂ will then
yield optimal linear forecasts (given the set of predictor variables).
Is this model useful if the underlying process is an ARMA model? Let us compare this model to the optimal
forecast for an invertible stationary ARMA model. Let π be the coefficients for expressing Wt as a function
of past Xt-values:

    Wt = Σ_{j=0}^{∞} πj Xt−j,

assuming again that Xt is invertible. From Theorem 3, we know that we can write πj as

    πj = Σ_{k=1}^{q} αk z_k^{−j} + π̃j,

where the values zk, k = 1, . . . , q, are the roots of the polynomial Θ(z) (outside the unit circle as Xt is
invertible, and assumed to be single roots for simplicity here), αk are the linear coefficients for each of the
solutions, and the coefficients π̃j (where π̃j = 0 for all j > p) account for the initial solutions up to j = p,
where the inhomogeneous solutions lead to deviations from the exponentially decaying solutions (compare
with (8) to validate, reversing the roles of the AR and MA parts).

Plugging this into (12),

    X̂_t^{−∞:t−1} = Σ_{j=1}^{p} φj X_{t−j} + Σ_{j=1}^{q} θj W_{t−j}
                  = Σ_{j=1}^{p} φj X_{t−j} + Σ_{j=1}^{q} θj Σ_{j′=0}^{∞} π_{j′} X_{t−j−j′}
                  = Σ_{j=1}^{p} φj X_{t−j} + Σ_{j=1}^{q} θj Σ_{j′=0}^{∞} (π̃_{j′} + Σ_{k=1}^{q} αk z_k^{−j′}) X_{t−j−j′}
                  = Σ_{j=1}^{p} (φj + Σ_{l=1}^{min{j,q}} θl π̃_{j−l}) X_{t−j} + Σ_{k=1}^{q} αk (Σ_{l=1}^{q} θl z_k^{l−1}) Σ_{j″=1}^{∞} z_k^{−j″+1} X_{t−j″}
                  = Σ_{j=1}^{p} γj X_{t−j} + Σ_{k=1}^{q} δk E_{t−1}^{z_k},            (14)

where the last inner sum over j″ equals E_{t−1}^{z_k} by the definition of the EWMAs, and

    γj := φj + Σ_{l=1}^{min{j,q}} θl π̃_{j−l}   for 1 ≤ j ≤ p (and γj := 0 for j > p),
    δk := αk Σ_{l=1}^{q} θl z_k^{l−1}   for k = 1, . . . , q.

We see from (14) that the optimal forecast can indeed be written as a linear combination of the past m = p
values X_{t−1}, X_{t−2}, . . . , X_{t−p} plus a linear combination of the r EWMAs, where the parameters λk,
k = 1, . . . , r, need to contain the roots zk, k = 1, . . . , q, of the characteristic polynomial Θ(z). For example,
if λk = zk for all k and r = q and m = p, setting β ∈ R^{m+r} to be

    βj = γj if j ≤ p   and   βj = δ_{j−p} if p < j ≤ (p + q),

we get that the optimal linear forecast at time-points t1, . . . , tn is indeed given by Zβ, where Z is given as
in (13).

In practice, the coefficients φ and θ that define the optimal basis vectors (via p, q and the roots zk,
k = 1, . . . , q) are unknown, but we can estimate the optimal coefficients β directly from data by least-squares
regression or similar, using a basis λk, k = 1, . . . , r, for the EWMAs with a very large r that should allow a
good approximation to the unknown roots of the polynomial Θ(z).

2.8 Statistical inference for ARMA models

2.8.1 Estimation of coefficients
Estimation of the unknown parameters φj, θk and σ_W² = Var(Wt) is usually done with exact or approximate
Gaussian maximum likelihood (MLE). An unknown mean is usually estimated first by the arithmetic mean
of the data and then subtracted.
We have the following general formula for the density of X1, . . . , Xn:

    f(x1, . . . , xn) = f(x1) f(x2 | x1) · · · f(xn | x1, . . . , xn−1).

In the Gaussian case, the conditional densities f(xt | x1, . . . , xt−1) are again Gaussian with mean equal to
the best linear prediction X̂_t^{1:t−1} and variance equal to Var(Xt − X̂_t^{1:t−1}). For the exact MLE, one
computes these means and variances exactly as a function of the unknown parameters. We can restrict
ourselves to mean-zero processes (estimating the mean as the empirical mean of X1, . . . , Xn and subtracting
it from the data). The covariance matrix of X1, . . . , Xn for an ARMA(p,q) process with coefficients
φ = (φ1, . . . , φp) and θ = (θ1, . . . , θq) and noise variance σ_W² = Var(Wt) is then the matrix

    Γn(φ, θ, σ_W²) = ( γ(0)     γ(1)  γ(2)  ...  γ(n−1) )
                     ( γ(1)     γ(0)  γ(1)  ...         )
                     ( γ(2)     ...                     )
                     (  ...                             )
                     ( γ(n−1)   ...                γ(0) ),

where γ = γ_{φ,θ,σ_W²} denotes the auto-covariance function under the parameters φ, θ, σ_W². For given
φ, θ, σ_W², we can compute γ and hence Γn(φ, θ, σ_W²), and compute the likelihood

    L(φ, θ, σ_W²) := (2π)^{−n/2} |Γn(φ, θ, σ_W²)|^{−1/2} exp( −(1/2) (X1, . . . , Xn) Γn(φ, θ, σ_W²)^{−1} (X1, . . . , Xn)ᵀ ).

We would like to choose φ, θ, σ_W² such that we maximize the likelihood L(φ, θ, σ_W²) or, equivalently,
minimize the negative log-likelihood

    −2ℓ(φ, θ, σ_W²) = −2 log L(φ, θ, σ_W²) = constant + log |Γn(φ, θ, σ_W²)| + (X1, . . . , Xn) Γn(φ, θ, σ_W²)^{−1} (X1, . . . , Xn)ᵀ.

This is a difficult optimisation problem as the likelihood and the log-likelihood are in general
not concave or convex functions of their parameters.
An approximate likelihood uses X̂_t^{−∞:t−1} and Var(Wt) instead, where X̂_t^{−∞:t−1} is computed
recursively starting with W0 = · · · = W_{1−q} = 0. In order to reduce the effect of these artificial starting
values, one typically omits the first r = max(p, q + 1) factors in the likelihood, that is, one takes

    f(x_{r+1}, . . . , xn | x1, . . . , xr) = Π_{t=r+1}^{n} f(xt | x1, . . . , x_{t−1}).

For the AR(p) model and Gaussian innovations, this reduces to the least squares estimator

    argmin_φ Σ_{t=p+1}^{n} (xt − Σ_{j=1}^{p} φj x_{t−j})²,

which is particularly simple to compute. The design matrix X and response Y in the traditional regression
setting are given by

    X = ( x_p      x_{p−1}  ...  x_1     )        Y = ( x_{p+1} )
        ( x_{p+1}  x_p      ...  x_2     )            ( x_{p+2} )
        ( ...                            )            ( ...     )
        ( x_{n−1}  x_{n−2}  ...  x_{n−p} ),           ( x_n     ),

and we try to minimize

    ‖Xφ − Y‖₂²,   where φ = (φ1, φ2, . . . , φp)ᵀ,

yielding the solution

    φ̂ = (XᵀX)⁻¹ XᵀY.
The Yule-Walker estimator determines the unknown φj and σ_W² from the Yule-Walker equations with
estimated covariances γ̂(h) (0 ≤ h ≤ p) as

    φ̂ = Γ̂p^{−1} γ̂p,

where Γ̂p and γ̂p are of the form discussed above, using the estimated auto-covariance function. Note that
for large n (and large n − p),

    Γ̂p ≈ (1/n) XᵀX   and   γ̂p ≈ (1/n) XᵀY.

The former are estimates for Γp and the latter for γp, and it is thus easy to see that the Yule-Walker and
least-squares estimators for AR(p) processes converge to the same value as n grows.
The Burg estimator proceeds recursively with respect to p, that is, it estimates the partial
autocorrelations, and it does this by minimizing forward and backward prediction errors.
For long series, all the different versions give similar estimates, but for shorter series and
parameters close to the boundary of the causality and invertibility region, the choice of the
estimator can matter. Usually one prefers the exact MLE or the Burg estimator.
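
A short R sketch comparing the estimators mentioned above on a simulated AR(2) (the true coefficients are an assumption of this illustration); ar.yw(), ar.burg() and the Gaussian ML fit of arima() are all built in:

    set.seed(8)
    x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 300)

    fit_yw   <- ar.yw(x, aic = FALSE, order.max = 2)                # Yule-Walker
    fit_burg <- ar.burg(x, aic = FALSE, order.max = 2)              # Burg
    fit_mle  <- arima(x, order = c(2, 0, 0), include.mean = FALSE)  # Gaussian MLE

    rbind(yw = fit_yw$ar, burg = fit_burg$ar, mle = coef(fit_mle))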

2.8.2 Order selection
A simple technique is to identify the orders p and q from the plot of the autocorrelations
and partial autocorrelations. For an MA(q) process, all autocorrelations ρ(h) = 0 for h > q
whereas the partial autocorrelations τ (h) decay exponentially or like a damped harmonic as
h → ∞. For an AR(p) process, the partial autocorrelations τ (h) are zero for h > p and
the autocorrelations decay exponentially or like a damped harmonic as h → ∞. For an
ARMA(p,q) process with p > 0 and q > 0 both τ (h) and ρ(h) decay exponentially or like a
damped harmonic.
It is also possible to fit ARMA(p,q) models for all p ≤ p0 and q ≤ q0 and to choose the one with the best
fit afterwards. The most popular methods to choose the order are then the selection criteria AIC (Akaike
Information Criterion) and BIC (Bayesian Information Criterion). They are defined as follows:

    −2 sup ℓ(φ, θ, σ_W²) + C(p + q),

where ℓ is the log-likelihood function and C = 2 in case of the AIC and C = log(n) in case
of the BIC. If the order increases, the first term always decreases because the supremum is
taken over a larger set. The second term is a penalty for the complexity of the model. If the
estimates φ̂ and θ̂ are based on an approximate likelihood, then one uses this approximate
likelihood in the AIC or BIC instead of the exact likelihood ℓ. We then search for the model
which minimizes the AIC criterion.
Validation on independent data is an alternative to penalized likelihood.
I do not discuss here the justification of these criteria, but just mention two results: 1) The
AIC is an unbiased estimate of a distance between the fitted and the true model. 2) The
AIC favors complex models and does not provide a consistent estimate of the true order if
the true order is finite.
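
A minimal R sketch of order selection by AIC/BIC over all p ≤ p0 and q ≤ q0 (maximal orders and the simulated model are assumptions of this illustration); note that arima() computes its own exact Gaussian likelihood:

    set.seed(9)
    x <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 400)

    p0 <- 2; q0 <- 2
    ic <- expand.grid(p = 0:p0, q = 0:q0)
    ic$aic <- ic$bic <- NA
    for (i in seq_len(nrow(ic))) {
      fit <- try(arima(x, order = c(ic$p[i], 0, ic$q[i]), include.mean = FALSE), silent = TRUE)
      if (inherits(fit, "try-error")) next
      ic$aic[i] <- AIC(fit)
      ic$bic[i] <- AIC(fit, k = log(length(x)))   # BIC: penalty C = log(n)
    }
    ic[which.min(ic$aic), ]   # order chosen by AIC
    ic[which.min(ic$bic), ]   # order chosen by BIC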
Goodness of fit. Once a model has been fitted (that is, both the orders and the parameters have been
estimated), one should check whether the fit is adequate. As a minimum, one should look at the time series
plot and the acf of the residuals Ŵt = Xt − X̂_t^{−∞:t−1}, which approximate the innovations Wt and thus
should be approximately i.i.d.
The auto-correlation function ρ_res(h) of the residuals should vanish at all lags h > 0. To test this, one can
for example compute the test statistic

    Q(H) = n(n + 2) Σ_{k=1}^{H} ρ̂_res(k)² / (n − k),

where ρ̂_res(h) is the empirical auto-correlation at lag h of the residuals. If the residuals are really i.i.d.,
then Q(H) has approximately a χ²_{H−K}-distribution, where K is the number of fitted parameters (so
K = p + q for an ARMA(p,q)-model). The corresponding p-values of the test (as a function of H, even
though H = 10 or H = 20 is an often-made choice) are shown as Ljung-Box p-values in the residual plot
diagnostics.
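
In R (a sketch), the statistic Q(H) is available via Box.test() with type "Ljung-Box", where the fitdf argument plays the role of K, and tsdiag() shows the corresponding p-values over a range of H:

    set.seed(10)
    x   <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 400)
    fit <- arima(x, order = c(1, 0, 1), include.mean = FALSE)
    res <- residuals(fit)

    Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)  # K = p + q = 2
    acf(res)        # residual ACF should vanish for all lags h > 0
    tsdiag(fit)     # standardized residuals, ACF and Ljung-Box p-values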

It is also a good idea to simulate from the fitted model and compare the plot of a simulated series with the
plot of the original series. Ideally, the two plots should be visually indistinguishable. One can also look for
nonlinear dependence among the residuals, by using for instance lag plots (Ŵ_{t+h} versus Ŵt) or the acf
of the squared residuals, or check for non-Gaussianity with a normal plot of the residuals.
ARIMA-Models
So far all ARMA models were stationary. One way to analyze nonstationary data is to take
differences, see 1.2. This can be included in the ARMA model:

Φ(B)(1 − B)d Xt = Φ∗ (B)Xt = Θ(B)Wt .

The polynomial Φ∗ (z) has degree d+p and it has a root at z = 1 of multiplicity d and p roots
outside of the unit circle. Such a model is called an ARIMA(p, d, q) model (autoregressive
integrated moving average). Note that an ARIMA model is not unique: If (Xt , Wt ) satisfies
the above recursion, then so does (Xt + A0 + A1 t + · · · + Ad−1 t^(d−1), Wt) for arbitrary coefficients
A0, . . . , Ad−1. In other words, the ARIMA model only specifies the conditional distribution
of X1, X2, . . . given the initial values X0, X−1, . . . , X−d+1, and not the distribution of these
initial values. Also note that if E(Wt) ≠ 0, then E(Xt) contains a term c·t^d with c ≠ 0.
Because of this, one usually assumes that E(Wt ) = 0 if d > 0.
Whether we should choose d > 0 usually becomes clear from the inspection of the time series
plot (slowly changing level or slowly changing slope of the series) and of the acf (behaviour
of ρ̂(h) ∼ 1 − const·h with a small value of const). Identifying p and q and estimating the
coefficients is then done based on the differenced series Yt = (1 − B)d Xt .
For forecasting, one usually assumes that the initial values X0, X−1, . . . , X−d+1 are indepen-
dent of the differenced series Yt = (1 − B)d Xt . Then the same formula can be used for
recursive computation of the forecast k steps ahead as in the stationary case.
If a series contains a seasonal component, then we often need to take also seasonal differences
to achieve stationarity. This means that we use a model of the form

Φ(B)(1 − B)d (1 − B M )D Xt = Θ(B)Wt

where M is the number of observations in one seasonal cycle. Moreover, empirically the
seasonal behavior also shows up in the structure of the polynomials Φ and Θ. For instance
in the autoregressive case, Xt depends usually on Xt−1 , Xt−M and maybe Xt−M −1 . This
leads to the so-called seasonal ARIMA(p, d, q, P, D, Q) model:

Φ(B)ΦM (B M )(1 − B)d (1 − B M )D Xt = Θ(B)ΘM (B M )Wt .

In general, one would expect that d + D ≤ 2 and either P or Q is equal to 0 (so the seasonal
component is either of AR or MA but not general ARMA form).

An example is given by the so-called airline model (because it fits the data on airline pas-
sengers, one of the standard data sets in R, well):
(1 − B)(1 − B M )Xt = (1 + θ1 B)(1 + θM,1 B M )Wt ,
which is a seasonal ARIMA(0,1,1,0,1,1) model.
There are many other possible seasonal models, but the main point is that the stationary ARMA models yield a good modelling basis and allow us to derive many interesting non-stationary models for a given dataset.
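As a sketch, the airline model can be fitted in R with the built-in arima() function; the log transform of the AirPassengers series is the usual variance-stabilising choice and an assumption made here, not part of the model above:

## Seasonal ARIMA(0,1,1,0,1,1) with M = 12 (the airline model) for AirPassengers.
y   <- log(AirPassengers)
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit                                     # estimates of theta_1 and theta_{M,1}
pred <- predict(fit, n.ahead = 24)      # forecasts (on the log scale)
tsdiag(fit)                             # residual diagnostics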

3 Spectral methods
3.1 The spectral representation
3.1.1 Some results from deterministic spectral analysis
Fourier theory is concerned with the representation of signals g as a superposition of har-
monics with different frequencies and amplitudes. If g is a signal in continuous time t with
finite energy

∫_{−∞}^{∞} g(t)² dt < ∞,
then it can be represented as

g(t) = ∫_{−∞}^{∞} G(ν) exp(i2πνt) dν,    (15)

where

G(ν) = ∫_{−∞}^{∞} g(t) exp(−i2πνt) dt.    (16)
Hence g is a superposition of harmonics with continuous frequencies ν. If we write G(ν)
in polar coordinates G(ν) = |G(ν)| exp(iφ(ν)), we see that the harmonic G(ν) exp(i2πνt)
has amplitude |G(ν)| and phase φ(ν). Since g is real, G(−ν) = G(ν) and we also have
a representation in terms of sine and cosine functions with frequencies ν > 0. Moreover,
Parseval’s theorem says that

∫_{−∞}^{∞} g(t)² dt = ∫_{−∞}^{∞} |G(ν)|² dν,

i.e. the energy is the integral of squared amplitudes.


Next, we consider a signal (gt) observed at time points t∆ with t = 0, ±1, . . . with finite
energy Σ_{t=−∞}^{∞} g_t² < ∞. If we replace the integrand in (16) by a function which is constant
on intervals of length ∆, then we obtain

Gp(ν) = ∆ Σ_{t=−∞}^{∞} g_t exp(−i2πt∆ν),    (17)

and we can represent the signal with Gp:

g_t = ∫_{−1/(2∆)}^{1/(2∆)} Gp(ν) exp(i2πt∆ν) dν.    (18)

Hence again g is a superposition of harmonics, but the continuous frequencies ν are now
restricted to |ν| ≤ 1/(2∆). The reason for this is that in discrete time we cannot distinguish
between harmonics at frequencies ν, ν ± 1/∆, ν ± 2/∆ etc. . This is called aliasing, and
1/(2∆) is called the Nyquist frequency.
If we consider Gp as a function of arbitrary ν, then it is periodic with period 1/∆ (this is
the reason for the subscript p). Note that Gp(ν) ≠ G(ν) for |ν| ≤ 1/∆, but rather

Gp(ν) = Σ_{k=−∞}^{∞} G(ν + k/∆) = G(ν) + Σ_{k=1}^{∞} ( G(ν + k/∆) + G(−ν + k/∆) ).

This means that we add up the amplitudes at all frequencies we cannot distinguish. Finally,
for a discrete time signal, Parseval’s theorem says that

∆ Σ_{t=−∞}^{∞} g_t² = ∫_{−1/(2∆)}^{1/(2∆)} |Gp(ν)|² dν.

In the last step, we consider a signal g observed at finitely many discrete time points t∆
with t = 0, 1, . . . n − 1. By replacing the integrand in (18) by a function which is constant
on intervals of length 1/(n∆), we obtain the representation
g_t = (1/(n∆)) Σ_{k=0}^{n−1} G_k exp(i2πt∆·k/(n∆)) = (1/(n∆)) Σ_{k=0}^{n−1} G_k exp(i2πtk/n)    (19)

whose inversion is

G_k = ∆ Σ_{t=0}^{n−1} g_t exp(−i2πtk/n).    (20)

Hence the signal is now a superposition of harmonics with a finite number of frequencies
νk = k/(∆n), the so-called Fourier frequencies. Again Parseval’s theorem holds

∆ Σ_{t=0}^{n−1} g_t² = (1/(n∆)) Σ_{k=0}^{n−1} |G_k|².

If we use (19) or (20) to define gt for any t ∈ Z or Gk for any k ∈ Z, we obtain periodic
sequences. If we restrict an infinite sequence with Fourier representation

g_t = ∫_{−1/(2∆)}^{1/(2∆)} Gp(ν) exp(i2πt∆ν) dν,

to 0 ≤ t < n, then the relation between Gp(ν) and the discrete amplitudes Gk is

Gk = n∆ ∫_{−1/(2∆)}^{1/(2∆)} Gp(ν) exp(−iπ(n − 1)(νk − ν)∆) Dn(|ν − νk|∆) dν

where Dn is the so-called Dirichlet kernel

Dn(ν) = sin(nπν) / (n sin(πν)).

This means that the amplitude Gk in the discrete representation is a weighted average of the amplitudes
Gp (ν) for ν around νk . The phase shift in the above formula occurs because the time points are not symmetric
around the origin. The proof of this formula uses the summation formula of a geometric series

Σ_{t=0}^{n−1} e^{iλt} = (e^{iλn} − 1)/(e^{iλ} − 1) = e^{i(n−1)λ/2} (e^{inλ/2} − e^{−inλ/2})/(e^{iλ/2} − e^{−iλ/2}) = e^{i(n−1)λ/2} sin(nλ/2)/sin(λ/2).

The discrete Fourier transform (gt) → (Gk) can be computed by the Fast Fourier Transform (FFT) with O(n log₂(n)) operations instead of O(n²) operations in a naive implementation. This algorithm is crucial for the widespread use of Fourier methods in many applications.
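A small R sketch of the discrete Fourier transform (with ∆ = 1) using fft(), checking Parseval's identity on an artificial signal (the signal itself is arbitrary):

## DFT G_k = sum_t g_t exp(-i 2 pi t k / n) of a finite signal, Delta = 1.
n  <- 128
g  <- cos(2 * pi * 0.1 * (0:(n - 1))) + rnorm(n, sd = 0.2)
G  <- fft(g)
nu <- (0:(n - 1)) / n                       # Fourier frequencies k/n
c(sum(g^2), sum(Mod(G)^2) / n)              # Parseval: both numbers agree
plot(nu[1:(n / 2)], Mod(G[1:(n / 2)])^2 / n, type = "h",
     xlab = "frequency", ylab = "|G_k|^2 / n")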
The spectral representation of stationary stochastic processes
For a stationary stochastic process (Xt; t ∈ Z), the energy Σ_{t=−∞}^{∞} Xt² is infinite, but if second moments
exist, the power (energy per time unit) converges to a finite value

(1/(2T + 1)) Σ_{t=−T}^{T} Xt² → E(Xt²).

Hence we cannot expect to have a representation of the form

Xt(ω) = ∫_{−1/2}^{1/2} exp(i2πνt) Z(ν, ω) dν.

However, a deep result says that we have the representation

Xt(ω) = E(Xt) + ∫_{−1/2}^{1/2} exp(i2πνt) Z(dν, ω)

where Z is a (complex) stochastic process with uncorrelated increments:

1. Z(−ν) − Z(−ν − h) = Z(ν + h) − Z(ν) for all ν, h.


2. E(Z(ν + h) − Z(ν)) = 0 for all ν, h.
3. E|Z(ν+h)−Z(ν)|2 = S(ν+h)−S(ν) where S is the spectral distribution function S(ν) = S([−1/2, ν]).
4. For ν < ν + h < ν′ < ν′ + h′, E((Z(ν + h) − Z(ν)) (Z(ν′ + h′) − Z(ν′))) = 0.

Here the integral is defined as the limit of

Σ_j exp(i2πνj t) (Z(νj, ω) − Z(νj−1, ω))

as the partition ν0 = −1/2 < ν1 < . . . < νJ = 1/2 becomes finer. Hence intuitively, the process is a
superposition of harmonics with uncorrelated mean zero amplitudes, and the variances of the amplitudes are
given by the increments of the spectrum. In other words, the spectrum says how strongly the
different frequencies are represented in the process. If the spectral density exists, E(|Z(νj, ω) − Z(νj−1, ω)|²)
is of the order νj − νj−1 and therefore |Z(νj, ω) − Z(νj−1, ω)| is typically of the order √(νj − νj−1) > νj − νj−1.
This is the crucial difference between the spectral representation here and the representations in the previous
subsection.
Formally, we can write the properties of Z as

E(Z(dν) Z(dν′)) = δ_{ν,ν′} S(dν)

where δ_{ν,ν′} = 0 for ν ≠ ν′ and δ_{ν,ν} = 1 (the Kronecker delta). We then obtain the spectral representation of
the autocovariances (Herglotz’s Theorem)

γ(k) = Cov(Xt+k(ω), Xt(ω)) = ∫_{−1/2}^{1/2} ∫_{−1/2}^{1/2} exp(i2πν(t + k)) exp(−i2πν′t) E(Z(dν, ω) Z(dν′, ω))
     = ∫_{−1/2}^{1/2} exp(i2πνk) S(dν).

In particular,
E((Xt − E(Xt))²) = ∫_{−1/2}^{1/2} S(dν)

which is the analogue of Parseval’s theorem.

3.1.2 Linear filters


A (time invariant) linear filter is a transformation of an input time series (Xt ) into an output
time series (Yt) of the following form

Yt = Σ_k a_k Xt−k.

The input or output can be either deterministic or stochastic. Usually one assumes that
Σ_k |a_k| < ∞ or some other condition in order that the right hand side is well defined.

If the input is an impulse at time zero, Xt = δt0 , then the output is equal to Yt = at .
Because of this, the coefficients ak are called the impulse response coefficients. If the input is
a harmonic with frequency ν, Xt = G exp(i2πνt) then the output is again a harmonic with
the same frequency

Yt = G A(ν) exp(i2πνt),    A(ν) = Σ_k a_k exp(−i2πνk).

There is however a change in amplitude by |A(ν)| and also a phase shift unless the coefficients
are symmetric (a−k = ak ). A(ν) is called the transfer function.
By linearity of the linear filter, a superposition of harmonic oscillations is transformed into
another superposition of harmonics where the amplitudes and phases are changed by the
transfer function.

Stationary stochastic processes are superpositions of harmonic oscillations:
Xt = E(Xt) + ∫_{−1/2}^{1/2} exp(i2πνt) Z(dν).

If the coefficients (ak) are summable,

Yt = A(0) E(Xt) + ∫_{−1/2}^{1/2} exp(i2πνt) A(ν) Z(dν).

Therefore the spectral increment process of (Yt ) is A(ν)Z(dν) and we have the following
relation between the spectral measures of (Xt ) and (Yt ):
SY (dν) = |A(ν)|2 SX (dν).
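As an illustration, a short R sketch computing the transfer function A(ν) of a simple symmetric moving-average filter (the filter coefficients are just an example):

## Transfer function A(nu) = sum_k a_k exp(-i 2 pi nu k) of a running mean of length 5.
a  <- rep(1 / 5, 5)
k  <- -2:2
nu <- seq(0, 0.5, length.out = 200)
A  <- sapply(nu, function(f) sum(a * exp(-2i * pi * f * k)))
plot(nu, Mod(A)^2, type = "l", xlab = "frequency", ylab = "|A(nu)|^2")
## Since a_{-k} = a_k, A(nu) is real and the filter induces no phase shift;
## |A(nu)|^2 is the factor by which the spectral measure S_X(dnu) is multiplied.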

3.2 The periodogram


The periodogram of a time series of length n with sampling interval ∆ is defined as
In(ν) = (∆/n) | Σ_{t=1}^{n} (Xt − X̄) exp(−i2πνt∆) |².
In words, we compute the absolute value squared of the Fourier transform of the sample,
that is we consider the squared amplitude and ignore the phase.
Note that In is periodic with period 1/∆ and that In (0) = 0 because we have centered the
observations at the mean. The centering has no effect for Fourier frequencies ν = k/(n∆), k ≠ 0.
By multiplying out the absolute value squared on the right, we obtain

In(ν) = (∆/n) Σ_{t=1}^{n} Σ_{s=1}^{n} (Xt − X̄)(Xs − X̄) exp(−i2πν(t − s)∆) = ∆ Σ_{h=−n+1}^{n−1} γ̂(h) exp(−i2πνh∆).

Hence the periodogram is nothing else than the Fourier transform of the estimated acf. In the
following, we assume that ∆ = 1 in order to simplify the formula (although for applications
the value of ∆ in the original time scale matters for the interpretation of frequencies).
By the above result, the periodogram seems to be the natural estimator of the spectral
density

s(ν) = Σ_{h=−∞}^{∞} γ(h) exp(−i2πνh).
However, a closer inspection shows that the periodogram has two serious shortcomings: It
has large random fluctuations, and also a bias which can be large. We first consider the bias.
Using the spectral representation, we see that up to a term which involves E(Xt) − X̄

In(ν) = (1/n) | ∫ Σ_{t=1}^{n} e^{−i2π(ν−ν′)t} Z(dν′) |² = n | ∫ e^{−iπ(n+1)(ν−ν′)} Dn(ν − ν′) Z(dν′) |².

Taking the expectation on both sides and using the properties of Z, we obtain

E(In(ν)) = n ∫ Dn(ν − ν′)² s(ν′) dν′.

In order to gain insight from this formula, we need to understand the behaviour of the Dirichlet kernel Dn and the so-called Fejér kernel

Fn(ν) = n Dn(ν)².

It can be checked that Fn(0) = n, Fn(ν) → 0 as n → ∞ for all 0 < |ν| ≤ 1/2 and ∫_{−1/2}^{1/2} Fn(ν) dν = 1 for all
n. Hence Fn approximates the Dirac delta function and for a continuous spectral density we obtain E(In(ν)) → s(ν)
for any ν ≠ 0.
Still, for some applications, the bias of the periodogram can be substantial. In such cases the bias is reduced
if we use a so-called taper. This is a set of weights h1 , h2 , . . . , hn which are one for t close to n/2 and decay
smoothly to zero for t near 1 and n. With these weights, we compute the tapered periodogram as follows

In^h(ν) = (1 / Σ_{t=1}^{n} h_t²) | Σ_{t=1}^{n} h_t (Xt − X̄) exp(−i2πνt) |².

If we use a taper, then we obtain

E(In^h(ν)) = ∫ Hn(ν − ν′) s(ν′) dν′

where

Hn(ν) = (1 / Σ_{t=1}^{n} h_t²) | Σ_{t=1}^{n} h_t exp(−2πiνt) |².

If ht is as described above, Hn (ν) has smaller sidelobes than the Fejér kernel.

The variances and covariances of the periodogram depend in principle on the fourth moments of the process. However, for many processes a Central Limit Theorem applies for the Fourier transform and thus the real and imaginary part of Σ_t h_t (Xt − X̄) exp(2πiνt) have asymptotically a normal distribution with mean zero and variance s(ν)/2 for ν ≠ 0, 1/2. Because of this

In^h(ν) / s(ν)  approximately ∼ Exp(1) (= χ²_2 / 2).
In particular, the periodogram is an asymptotically unbiased, but not consistent estimator
for the spectral density, and

( In^h(ν) / (−log(0.025)) ,  In^h(ν) / (−log(0.975)) )

is an approximate 95% confidence interval for s(ν). On the logarithmic scale, this interval
has constant width.
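In R, the raw periodogram is available through spec.pgram(); a minimal sketch (the simulated AR(1) series is only illustrative, and spec.pgram has its own normalisation conventions, so constants may differ from the formulas above):

## Raw periodogram of a simulated AR(1) series (no taper, no smoothing).
set.seed(1)
x  <- arima.sim(n = 512, model = list(ar = 0.7))
pg <- spec.pgram(x, taper = 0, detrend = FALSE, demean = TRUE, plot = FALSE)
plot(pg$freq, pg$spec, type = "h", log = "y",
     xlab = "frequency", ylab = "periodogram")
## Approximate 95% confidence interval for s(nu) at one frequency,
## based on I(nu)/s(nu) being approximately Exp(1):
I1 <- pg$spec[10]
c(lower = I1 / (-log(0.025)), upper = I1 / (-log(0.975)))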
For two different frequencies ν ≠ ν′, the periodogram values are asymptotically independent, in particular
the covariance tends to zero. This explains the irregular behaviour of the periodogram as a function of
frequency. Because of this and because of the inconsistency, the periodogram is of limited value.

For two frequencies close together, we have the following approximation

Cov(In^h(ν), In^h(ν′)) ≈ ( s(ν) s(ν′) / (Σ_{t=1}^{n} h_t²)² ) | Σ_{t=1}^{n} h_t² exp(−2πi(ν − ν′)t) |².

Without a taper, i.e. for h_t ≡ 1, the periodogram values at two Fourier frequencies j/n and j′/n are thus
approximately uncorrelated. This does not hold if we use a taper.

I refer to the literature for exact statements and proofs of these results.

3.3 Smoothing the periodogram


The reason why the periodogram is not consistent is that as the length n of the time series
increases, we obtain independent estimates of the spectral density at an increasingly dense
set of Fourier frequencies νk = k/n. If the spectral density is smooth, we can therefore pool
the information from nearby frequencies.
The tapered and smoothed spectral estimate is

ŝ^(ts)(k/n) = Σ_{j=−J}^{J} w_j In^h((k − j)/n),

where the wj's are weights with the following properties

wj > 0,    wj = w−j (−J ≤ j ≤ J),    Σ_{j=−J}^{J} wj = 1.

If k ≤ J, the smoothing includes the periodogram at the origin, which is equal to or very close to
zero if the mean µ is estimated. In this case, we exclude j = k from the sum and renormalize
the weights.
The properties of this estimator can be derived by the same arguments that are used for kernel
smoothers in nonparametric regression. If we neglect the bias of the tapered periodogram,
the bias of ŝ^(ts) is approximately

( s″(k/n) / 2 ) · (1/n²) Σ_{j=−J}^{J} j² wj.

The variance of ŝ(ts) (k/n) depends on whether or not a taper is used. Without a taper the
summands are approximately uncorrelated, and we obtain for k ≠ 0, n/2

Var(ŝ^(ts)(k/n)) ≈ s(k/n)² Σ_{j=−J}^{J} wj².

With a taper, we have to take the correlation of the summands into account. We skip the
details and just state that in this case the variance is increased by the factor

M(h) = ( (1/n) Σ_{t=1}^{n} h_t⁴ ) / ( (1/n) Σ_{t=1}^{n} h_t² )².
By Cauchy-Schwarz, M (h) is strictly greater than one unless ht is constant, and thus asymp-
totically tapering entails some loss of precision. However, this is often more than compen-
sated by a reduction in bias.
The choice of J, that is the number of frequencies involved in the smoothed estimate, is
difficult. Small values of J give a small bias, but a large variance, and vice versa. Asymp-
totically, the optimal choice is J = O(n^{4/5}), but the constants involve both s and s″ which
are unknown. In practice, one often looks at the estimate for different values of J and then
makes a subjective choice.
The above results imply that to a first approximation

E( ŝ^(ts)(k/n) / s(k/n) ) = 1,    Var( ŝ^(ts)(k/n) / s(k/n) ) = Σ_{j=−J}^{J} wj² M(h).

Because the periodogram values have asymptotically an exponential distribution and the
sum of m independent exponential random variables is distributed as 1/2 times a chi-
squared random variable with 2m degrees of freedom, one approximates the distribution
of ŝ(ts) (k/n)/s(k/n) by Zd /d where Zd ∼ χ2d and the degrees of freedom d are chosen to
match the variance given above. This then leads to the following confidence interval for
s(k/n):

[ d ŝ^(ts)(k/n) / χ²_{d,1−α/2} ,  d ŝ^(ts)(k/n) / χ²_{d,α/2} ]    where    d = 2 / ( Σ_{j=−J}^{J} wj² M(h) ).
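Continuing with the simulated series x from the previous sketch, the smoothed and tapered estimate and its equivalent degrees of freedom can be obtained with spec.pgram(); the spans and taper values below are arbitrary:

## Smoothed (modified Daniell kernel) and tapered spectral estimate.
sm <- spec.pgram(x, spans = c(7, 7), taper = 0.1, plot = FALSE)
plot(sm, log = "yes")      # smoothed estimate on a log scale
sm$df                      # equivalent degrees of freedom d for the chi^2 interval
sm$bandwidth               # bandwidth of the implied smoothing kernel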

3.4 Alternative estimators of the spectrum


So far, we have averaged over the values of the periodogram at the Fourier frequencies k/n
because they are approximately independent in the case of no taper and because the fast
Fourier transform can be used for computation.
We can also use a different grid k/n′ with n′ > n (we then have to set Xt = X̄ for n < t ≤ n′
in order to use the fast Fourier transform). In the limit we then have a continuous average

ŝ^(lw)(ν) = ∫ W(ν − ν′) In^h(ν′) dν′.

This can be shown to be equal to

Σ_{k=−n+1}^{n−1} wk γ̂^h(k) exp(−2πiνk)

where

wk = ∫ W(ν) exp(2πiνk) dν

and

γ̂^h(k) = (1 / Σ_{t=1}^{n} h_t²) Σ_{t=1}^{n−|k|} h_t (Xt − X̄) h_{t+|k|} (X_{t+|k|} − X̄)

are the autocovariances of the tapered series. In other words, smoothing of the periodogram
is equivalent to downweighting the estimated autocovariances in the inversion formula

s(ν) = Σ_{k=−∞}^{∞} γ(k) exp(−2πikν).

This estimator is therefore called a lag weight estimator (which explains the superscript lw).
For computational reasons, ŝ^(ts) is usually preferred.
A different approach consists in averaging the periodograms for segments of m < n consecutive observations:

ŝ^(os)(ν) = (1 / (J Σ_{t=1}^{m} h_t²)) Σ_{j=0}^{J−1} | Σ_{t=1}^{m} h_t (X_{t+jd} − X̄) e^{−2πiνt} |²

where J is the integer part of (n − m)/d. The parameter d regulates how much the segments
overlap: For d = 1 we have maximal overlap whereas for d = m there is no overlap (os
stands for overlapping segments). It can be shown that in case of maximal overlap, this is
essentially a lag weight estimator. It has however the advantage that it gives also information
about changes in the periodogram over time. It is thus the first step towards a time-frequency
analysis where one wants to analyze how strongly different frequencies are present at different
times. This is however an ill-posed question since a high resolution in time entails a low
resolution in frequency and vice versa.
Yet a different approach to spectral estimation consists in using the spectral density of a
fitted autoregressive model. Usually, one chooses the order of the autoregression by AIC.
This usually gives very smooth estimates, but sometimes details are lost that can be detected
by ŝ^(ts). A combination of both methods fits an autoregression, usually of low order, without
assuming that the innovations

Wt = Xt − Σ_{k=1}^{p} φk Xt−k

are exactly white noise. In any case, the general formula

sX(ν) = sW(ν) / |1 − Σ_{k=1}^{p} φk exp(−2πiνk)|².

holds, and one estimates sW (ν) by smoothing the periodogram of the residuals. Even when
sW (ν) is not exactly constant, it is at least much flatter than sX (ν) and thus the problems
with the bias are less serious. This approach is called prewhitening.
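In R, the autoregressive spectral estimate is available as spec.ar(); a rough prewhitening sketch could look as follows (the low AR order p = 2 and the smoothing spans are ad-hoc choices, and the series x from the sketches above is assumed):

## AR spectral estimate and a simple prewhitening sketch.
spec.ar(x)                                    # AR spectral estimate, order chosen by AIC
p    <- 2
fit  <- ar(x, order.max = p, aic = FALSE)     # low-order AR fit
resw <- na.omit(fit$resid)
sw   <- spec.pgram(resw, spans = c(7, 7), taper = 0.1, plot = FALSE)
denom <- sapply(sw$freq, function(f)
  Mod(1 - sum(fit$ar * exp(-2i * pi * f * (1:p))))^2)
plot(sw$freq, sw$spec / denom, type = "l", log = "y",
     xlab = "frequency", ylab = "prewhitened estimate of s_X")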
Wavelets in time series analysis
Wavelets are suitable both for smoothing time series and for a time-frequency analysis. We can only give a
very brief introduction. The discrete wavelet transform decomposes an equispaced time series of length n as
follows:
−j
J 2 X
X n−1 2−J
X n−1
−j/2 −j
Xt = dj,k 2 ψ(2 t − k) + aJ,k 2−J/2 φ(2−J t − k) (t = 0, 1, . . . , n − 1)
j=1 k=0 k=0

where ψ is the so-called mother wavelet – a small wave located near zero – and φ is the so-called father
wavelet or scaling function which represents a smooth part. Hence we have a decomposition into oscillations
with frequencies 2^{−j} located at times k·2^{j} for j = 1, 2, . . . , J and a part which contains the lower frequencies.
The simplest example is the Haar wavelet where

ψ(t) = 1[0,1/2) (t) − 1[1/2,1) (t), φ(t) = 1[0,1) (t).

For other cases, ψ and φ are defined through a limiting operation and thus have to be calculated numerically.
The amplitudes dj,k and aJ,k are computed from the original series by iterative application of an orthogonal
transformation. We start with a_{0,t} = Xt and set for j = 1, 2, . . . , J ≤ log₂(n)

a_{j,t} = Σ_{ℓ=0}^{L−1} g_ℓ a_{j−1,2t+1−ℓ},    d_{j,t} = Σ_{ℓ=0}^{L−1} h_ℓ a_{j−1,2t+1−ℓ}    (t = 0, 1, . . . , 2^{−j}n − 1).

(all indices are extended periodically). In words, we take the coefficients a_{j−1,t} for odd times and apply
to them two linear filters with impulse response coefficients g_ℓ and h_ℓ = (−1)^ℓ g_{L−ℓ−1}, respectively. The
coefficients g_ℓ are defined through the father wavelet (details omitted). They can be chosen arbitrarily
subject to the constraints that L must be even and

Σ_{ℓ=0}^{L−1} g_ℓ = √2,    Σ_{ℓ=0}^{L−1−2n} g_ℓ g_{ℓ+2n} = δ_{n,0}    (n = 0, 1, . . . , L/2 − 1).

For L = 2, 4 there is essentially only one solution, e.g. for L = 2 we have g_0 = g_1 = 1/√2. For L ≥ 6, there
are several solutions.
Because the discrete wavelet transform is a product of orthogonal linear transformations and thus is again
linear and orthogonal, the computation of the inverse is easy. For smoothing, one typically sets dj,k and aJ,k
equal to zero if their absolute value is small and then applies the inverse transform. This retains features in
the data which are not smooth in a conventional sense.
In the maximal overlap discrete wavelet transform, one uses the above recursions without omitting coefficients
a_{j−1,t} for even t:

ã_{j,t} = 2^{−j/2} Σ_{ℓ=0}^{L−1} g_ℓ ã_{j−1, t−2^{j−1}ℓ},    d̃_{j,t} = 2^{−j/2} Σ_{ℓ=0}^{L−1} h_ℓ ã_{j−1, t−2^{j−1}ℓ}    (t = 0, 1, . . . , n − 1).

This creates redundancies, but is sometimes easier for a time-frequency interpretation.


If Xt is a stochastic process, the amplitudes aj,t and dj,t are random variables, and one can study their
distributions. Because the wavelet transform is orthogonal, these amplitudes are again i.i.d. for Gaussian

white noise. It turns out that also under dependence they become approximately independent like the
periodogram values. In addition, the average of the a_{j,t}² for fixed j is essentially an estimate of the spectrum
integrated over the frequency interval [2^{−j−1}, 2^{−j}]. A key difference is however that this holds also for
integrated processes: We only need that (1 − B)d Xt is stationary for some d < L/2.

4 State space models


The following is meant just as a quick overview of state space models.²

4.1 General state space models/ Hidden Markov models


General state space models (or Hidden Markov models - HMM) consist of
(i) An unobserved (latent) state process (Xt ) with Markovian dependence
(ii) Observations (Yt ) which are derived from Xt .
Concretely, this means
(i) X0 , X1 , X2 , . . . is a Markov chain
(ii) Conditionally on (Xt ), all Yt are independent and depend only on Xt .
As a graphical model, we can use a directed acyclic graph in which the hidden states form a chain X0 → X1 → X2 → X3 → · · · and each observation has a single incoming edge Xt → Yt (t = 1, 2, 3, . . .).

The graphical model is a convenient way of writing down assumptions (i) and (ii). For a
general joint distribution of (X0, X1, . . . , XT, Y1, . . . , YT) we can always write the density as

f0(X0) Π_{t=1}^{T} ft(Xt | X0, X1, . . . , Xt−1, Y1, . . . , Yt−1) gt(Yt | X0, X1, . . . , Xt−1, Xt, Y1, . . . , Yt−1)

for appropriate conditional densities ft , t = 0, . . . , T and gt , t = 1, . . . , T . The model above


is equivalent to the joint density having a simpler factorization

f0(X0) Π_{t=1}^{T} ft(Xt | Xt−1) gt(Yt | Xt).

Note that (Xt ) is a Markov chain and also (Zt ) is a Markov chain if we concatenate Zt =
(Xt , Yt ). However, the observations (Yt ) on their own are not a Markov chain and exhibit
² most parts here are in analogy (but simplified) from a book chapter “State Space and Hidden Markov Models” by Prof. Hansruedi Künsch

more complex time-dependencies. The HMM allows, in other words, to model such more
complex time-dependencies in the observations by a simple Markov model.
Goals of HMM analysis include
(a) Given observations y1 , . . . , yT and known transition densities/probabilities ft (Xt |Xt−1 )
and gt (Yt |Xt ), provide inference for the underlying state vector x0 , . . . , xT (there are several different forms of inference, which we return to).
(b) Given observations y1 , . . . , yT and unknown transition densities/probabilities ft (Xt |Xt−1 )
and gt (Yt |Xt ), provide inference for the underlying state vector x0 , . . . , xT and for ft and
gt simultaneously, either in a Bayesian or frequentist form
In the following, we will first assume we are in setting (a), that is, the transition densities are assumed to be known.
Examples for state space models:
(i) Linear state space model for Xt ∈ Rp and Yt ∈ Rq :

Xt = Gt Xt−1 + Vt
Yt = Ht Xt + Wt ,

where Gt ∈ R^{p×p} and Ht ∈ R^{q×p}. The state vector Xt could for example be position and velocity of a moving object and Yt are noisy measurements of the object's location. The goal is to infer Xt as accurately as possible. In case of Gaussian error terms there is an explicit solution (Kalman filter).
(ii) ARMA models. Let (Yt) be a causal and invertible ARMA(p,q) process. It can be written as an HMM. Take as an example the previously discussed case of an AR(p) model:

Yt = Σ_{j=1}^{p} φj Yt−j + Wt .

Define Xt as the collection of the past p observations Xt := (Yt , Yt−1 , Yt−2 , . . . , Yt−p+1 )t .
Then

Xt = Φt Xt−1 + ηt
Yt = Ht Xt + εt ,

where

Φ = ( φ1   φ2   · · ·   φp−1   φp )
    ( I_{p−1}                  0  ) ,    ηt = (Wt, 0, . . . , 0)^t,    Ht = (1, 0, 0, . . . , 0),    and εt ≡ 0.

Advantages of writing an ARMA process as a state space model include the ability to
deal easily with missing data (for example by setting Ht = (0, 0, . . . , 0) and Yt = 0 at times t where we have missing data) and the ability to introduce different types of outliers in the noise distributions (using ηt and εt , respectively).
(iii) Speech recognition. The sequence (Xt ) can be seen as the hidden sequence of words
a speaker is trying to say and Yt are the observed sound measurements.
(iv) Biological examples include ion-channel analysis (determining whether an ion gate
is on or off–the underlying state Xt ∈ {0, 1}) based on noisy measurements Yt of the
current flowing through the gate. They also include DNA analysis, where the index of time is taken by the index of position along the chromosome and we can try to infer, for example, regions with heightened copy numbers or so-called CG-islands (regions where the bases C and G appear more often than A and T).
(v) Physics/meteorology. State Xt includes all relevant atmospheric variables at time
t. Transition dynamics of Xt are given by the underlying physics (and approximated by
meteorological models). The observations Yt can be satellite measurements, wind and
rainfall sensors etc. that help to infer the true underlying state.

4.2 Discrete state space models


Inference is easier (also notationally) for discrete state space models, where without loss of generality

Xt ∈ {1, . . . , ℓ}
Yt ∈ {1, . . . , m}.
The joint density factorizes again as

P(X0 = x0, X1 = x1, . . . , XT = xT, Y1 = y1, Y2 = y2, . . . , YT = yT) =
P(X0 = x0) · Π_{t=1}^{T} [ P(Xt = xt | Xt−1 = xt−1) · P(Yt = yt | Xt = xt) ].

Or, taking the logarithm,

log P(X0 = x0, X1 = x1, . . . , XT = xT, Y1 = y1, Y2 = y2, . . . , YT = yT) =
log P(X0 = x0) + Σ_{t=1}^{T} [ log P(Xt = xt | Xt−1 = xt−1) + log P(Yt = yt | Xt = xt) ]    (21)

Let matrices A ∈ R^{ℓ×ℓ} and B ∈ R^{m×ℓ} describe the transition probabilities Xt−1 → Xt and
Xt → Yt in the sense that

P(Xt = j′ | Xt−1 = j) = A_{j′,j}
P(Yt = o | Xt = j) = B_{o,j}

(Note that often people work with At instead of A and Bt instead of B but for our purposes it is more convenient to define the matrices as above.) We have (using the fact that conditional probabilities have to be non-negative and sum to 1):

Σ_{j′=1}^{ℓ} A_{j′,j} = 1 for all j ∈ {1, . . . , ℓ}    (matrix A is column-normalised),
Σ_{o=1}^{m} B_{o,j} = 1 for all j ∈ {1, . . . , ℓ}    (matrix B is column-normalised),
and A_{j′,j} ≥ 0 and B_{o,j} ≥ 0 for all j, j′ ∈ {1, . . . , ℓ} and o ∈ {1, . . . , m}.

4.3 Filtering, smoothing and prediction


Let π 0 ∈ R` be the initial/prior distribution of X0 :

πj0 := P (X0 = j) for all j = 1, . . . , `.

Let yst = (ys , . . . , yt ) be a vector of observations.


Goal: find conditional distribution P (Xt+k |yst ). This is called
i) Prediction if k > 0
ii) Filtering if k = 0
iii) Smoothing if k < 0.
Will here look mostly at filtering and prediction.

Prediction. The prediction problem can be solved iteratively and hence reduced to filter-
ing.
Let π t+k|t be the conditional distribution of Xt+k , given y1t in the sense that for all j ∈
{1, . . . , `},
π_j^{t+k|t} := P(Xt+k = j | y1t).
We can now get a recursion for π t+k|t (a recursion in k) by conditioning on Xt+k−1 and using

the conditional independence between Xt+k and Y1t given Xt+k−1 :
π_j^{t+k|t} = P(Xt+k = j | y1t)
            = Σ_{j′=1}^{ℓ} P(Xt+k = j | Xt+k−1 = j′, y1t) · P(Xt+k−1 = j′ | y1t)
            = Σ_{j′=1}^{ℓ} P(Xt+k = j | Xt+k−1 = j′) · P(Xt+k−1 = j′ | y1t)
            = Σ_{j′=1}^{ℓ} A_{j,j′} · π_{j′}^{t+k−1|t}.

Or, in vector form,


π^{t+k|t} = A π^{t+k−1|t}.

Reiterating back to time t, we get

π^{t+k|t} = A^k π^{t|t},

and we have thus reduced it to a filtering problem since π t|t is the conditional distribution
of Xt , given (y1 , . . . , yt ).
The distribution of Yt+k , given y1t , follows by conditioning on Xt+k in similar form. If we set

p_o^{t+k|t} := P(Yt+k = o | y1t),

then, by conditioning,

p_o^{t+k|t} = P(Yt+k = o | y1t)
            = Σ_{j=1}^{ℓ} P(Yt+k = o | Xt+k = j, y1t) · P(Xt+k = j | y1t)
            = Σ_{j=1}^{ℓ} P(Yt+k = o | Xt+k = j) · P(Xt+k = j | y1t).

In vector form,

p^{t+k|t} = B · π^{t+k|t}.

Substituting from the result above, we can also write it as

p^{t+k|t} = B · A^k · π^{t|t},

and we are going to look at the filtering distribution π t|t next.

Filtering. We want a recursion for the filtering density π_j^{t|t} = P(Xt = j | y1t), which is also
used in the prediction tasks. We can again use conditional independence of Y1t and Yt+1,
given Xt+1 , and Bayes formula to get the desired recursion.
Recall that for two events A, B Bayes formula derives from

P (A, B) = P (A|B)P (B) = P (B|A)P (A) = P (A, B),

and can be written as


P(A|B) = P(B|A) P(A) / P(B).

We can furthermore condition on yet another event C:

P(A|B, C) = P(B|A, C) P(A|C) / P(B|C).

Setting

A = {Xt+1 = j}
B = {Yt+1 = yt+1 }
C = {Y1t = y1t },

we get for the desired filtering density


π_j^{t+1|t+1} = P(Xt+1 = j | y1t+1)
             = P(Yt+1 = yt+1 | Xt+1 = j, Y1t = y1t) · P(Xt+1 = j | Y1t = y1t) / P(Yt+1 = yt+1 | Y1t = y1t).

Using the conditional independencies at this point, we can simplify to

π_j^{t+1|t+1} = P(Yt+1 = yt+1 | Xt+1 = j) · P(Xt+1 = j | Y1t = y1t) / P(Yt+1 = yt+1 | Y1t = y1t)
             = P(Yt+1 = yt+1 | Xt+1 = j) · P(Xt+1 = j | Y1t = y1t) / [ Σ_{j′=1}^{ℓ} P(Yt+1 = yt+1 | Xt+1 = j′) P(Xt+1 = j′ | Y1t = y1t) ]
             = π_j^{t+1|t} B_{yt+1,j} / [ Σ_{j′=1}^{ℓ} π_{j′}^{t+1|t} B_{yt+1,j′} ].

The recursion thus works schematically as

π t|t → π t+1|t → π t+1|t+1 → . . . ,

where the second step π t+1|t → π t+1|t+1 requires the new observation yt+1 at time t + 1.

Using the recursion for π^{t+1|t} from the prediction task, we can directly write the recursion for π^{t|t} without going via the prediction density as

π_j^{t+1|t+1} = (A π^{t|t})_j B_{yt+1,j} / [ Σ_{j′=1}^{ℓ} (A π^{t|t})_{j′} B_{yt+1,j′} ].

The denominator can be seen as a normalisation that ensures

Σ_{j=1}^{ℓ} π_j^{t|t} = 1 for all times t

(and can conveniently be implemented by such a normalisation without having to compute the denominator explicitly).
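A minimal R sketch of this filtering recursion (the two-state transition matrix A, emission matrix B, prior pi0 and observation sequence y below are a purely illustrative toy example):

## Forward filtering pi^{t|t} for a discrete HMM with column-normalised A and B.
hmm_filter <- function(y, A, B, pi0) {
  ell  <- length(pi0)
  filt <- matrix(NA, nrow = ell, ncol = length(y))
  pred <- A %*% pi0                       # pi^{1|0}
  for (t in seq_along(y)) {
    unnorm    <- pred * B[y[t], ]         # (A pi^{t-1|t-1})_j * B_{y_t, j}
    filt[, t] <- unnorm / sum(unnorm)     # normalise -> pi^{t|t}
    pred      <- A %*% filt[, t]          # pi^{t+1|t}
  }
  filt
}

A   <- matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2)   # A[j', j] = P(X_t = j' | X_{t-1} = j), columns sum to 1
B   <- matrix(c(0.8, 0.2, 0.3, 0.7), nrow = 2)   # B[o, j]  = P(Y_t = o  | X_t = j),     columns sum to 1
pi0 <- c(0.5, 0.5)
y   <- c(1, 1, 2, 2, 1)
hmm_filter(y, A, B, pi0)                          # column t is P(X_t = . | y_1^t)
## Prediction k steps ahead: pi^{t+k|t} is obtained by multiplying pi^{t|t} with A k times.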

4.4 Posterior mode, Viterbi and forward-backward algorithms and dynamic programming
The filtering approach yields posterior densities of, say, XT , given y1 , . . . , yT . It does not
and cannot answer questions about the most likely sequence xT0 = (x0 , . . . , xT ) given the observations, which is

x̂T0 = argmaxxT0 P (X0T = xT0 |y1T ).

The most likely sequence can be computed with dynamic programming in a forward and
backwards recursion.
Taking the log and using the previous decomposition (21),

logP (X0T = xT0 |Y1T = y1T ) ∝ logP (X0T = xT0 , Y1T = y1T )
= logP (X0 = x0 , X1 = x1 , . . . , XT = xT , Y1 = y1 , Y2 = y2 , . . . YT = yT )
= log P(X0 = x0) + Σ_{t=1}^{T} [ log P(Xt = xt | Xt−1 = xt−1) + log P(Yt = yt | Xt = xt) ]
= log π⁰(x0) + Σ_{t=1}^{T} [ log A_{xt,xt−1} + log B_{yt,xt} ]

The optimization problem can be seen as one of minimizing the cost of traversing time from
0 to T and passing through x0 , x1 , . . . , xT along the way where we incur
(i) Cost for passing through the initial state x0 which depends on the prior distribution
π 0 for X0 :
−logπ 0 (x0 )

(ii) Cost for passing from state xt−1 to state xt at every time t = 1, . . . , T :

−log(Axt ,xt−1 )

(iii) Cost for state xt at every time t = 1, . . . , T (depending on the observation yt at this
point).
−log(Byt ,xt )

We can first make a forward recursion going through t = 1, . . . , T, and record in ψt(x) the lowest cost (negative log-likelihood) achievable up to this point t in time if we end up in position x at time t, that is

ψt(x) = min_{(x0,...,xt−1), xt=x} ( −log π⁰(x0) + Σ_{t′=1}^{t} [ −log A_{xt′,xt′−1} − log B_{yt′,xt′} ] ).

Note that

x̂T0 = argmin_{(x0,...,xT)} ( −log π⁰(x0) + Σ_{t′=1}^{T} [ −log A_{xt′,xt′−1} − log B_{yt′,xt′} ] )

and hence
x̂T = argminx ψT (x)

The function ψt can be calculated now for t = 1, 2, . . . in a forward recursion as

ψ0 (x) = −logπ 0 (x)



ψt(x) = min_{xt−1} [ ψt−1(xt−1) − log(A_{x,xt−1}) − log(B_{yt,x}) ]    for t = 1, . . . , T.

We also record the value of xt−1 (the back-pointer) for which the minimum was achieved at
time t − 1 if we pass through x at time t as

ξt−1(x) = argmin_{xt−1} [ ψt−1(xt−1) − log(A_{x,xt−1}) − log(B_{yt,x}) ]
        = argmin_{xt−1} [ ψt−1(xt−1) − log(A_{x,xt−1}) ].

The optimal path (x̂0 , . . . , x̂T ) is then calculated in a backwards recursion as

x̂T = argminx ψT (x)


x̂t−1 = ξt−1(x̂t) for t = T, T − 1, . . . , 1.

This is sometimes called the Viterbi algorithm.
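A minimal R sketch of the Viterbi recursion in terms of the costs ψt (it reuses the toy quantities A, B, pi0 and y from the filtering sketch above):

## Viterbi algorithm: most likely sequence (x_0, ..., x_T) given y_1, ..., y_T.
hmm_viterbi <- function(y, A, B, pi0) {
  Tn  <- length(y); ell <- length(pi0)
  psi <- matrix(0, ell, Tn + 1)           # psi[, t + 1] stores psi_t(.)
  xi  <- matrix(NA_integer_, ell, Tn)     # back-pointers xi_{t-1}(.)
  psi[, 1] <- -log(pi0)                   # psi_0
  for (t in 1:Tn) {
    for (x in 1:ell) {
      cost          <- psi[, t] - log(A[x, ]) - log(B[y[t], x])
      psi[x, t + 1] <- min(cost)
      xi[x, t]      <- which.min(cost)
    }
  }
  xhat <- integer(Tn + 1)                 # xhat[t + 1] corresponds to hat x_t
  xhat[Tn + 1] <- which.min(psi[, Tn + 1])
  for (t in Tn:1) xhat[t] <- xi[xhat[t + 1], t]
  xhat
}

hmm_viterbi(y, A, B, pi0)                 # hat x_0, ..., hat x_T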

4.5 Parameter estimation via the EM-algorithm
Assume we have a distribution with discrete observed variables Y and latent X and unknown parameter θ
(same ideas work for continuous variables). We would like to get the Maximum-likelihood estimate of θ as

θ̂ = argmaxθ `(θ),

where the log-likelihood `(θ) is given by logPθ (Y = y) if the observations of Y are y.


The problem is that Pθ (Y ) is not easily available in tractable form. What is available is the likelihood
Pθ (Y, X) if we could observe the latents X as well. The EM (Expectation-Maximization; also called
Baum-Welch for HMMs) algorithm greedily optimizes the likelihood by alternating between

(i) estimating the latent variables X, given the observed variables Y and the current parameter estimate,
and then
(ii) updating the parameter estimates in a second step.

Starting from some initial estimate θ(1) , the steps are for an iteration t = 1, . . .,

E-step (Expectation): Compute the conditional distribution of the latent variables X, given the
observed variables and the current parameter estimates θ(t) :

Pθ̂(t) (X|Y = y).

This is similar to the type of inference we discussed. Define the expected log-likelihood under the distri-
bution of X implied by the current parameter estimate as
Qt(θ) := E_{θ̂(t)} [ logPθ(Y = y, X) | Y = y ],

where the expectation is with respect to the random X, conditional on Y = y and the current
parameter estimate θ̂(t) .
M-step (Maximization): update the parameters as

θ̂(t+1) = argmaxθ Qt (θ).

We get monotonically increasing likelihood

`(θ̂(t) ) ≥ `(θ̂(t−1) ),

that is the parameter estimates θ̂(t) will converge to a local maximum of the likelihood for t → ∞. Depending
on the starting value we might reach the global optimum but this is not guaranteed.
The EM algorithm appears very often in practice. Compare also the applications in eg clustering we discuss.
Sometimes the expectation is replaced with just computing the most-likely state of x0 , given Y = y and
θ̂(t) and setting Qt (θ) := logPθ (Y = y, x0 ). This is what happens in the K-means clustering algorithm, for
example. For HMMs it corresponds to computing the most likely sequence with the Viterbi algorithm, as
discussed. The approach is sometimes referred to as hard-EM. The result will again depend on the chosen
starting values in general.
But why do we get monotonically increasing likelihood? The proof sheds some more light on EM. To start
with, it holds for all functions f over the space X of the hidden variables with Σ_{x∈X} f(x) = 1 and f(x) ≥ 0

for all x ∈ X ,

`(θ) = logPθ(Y = y)
     = log Σ_{x∈X} Pθ(Y = y, X = x)
     = log Σ_{x∈X} f(x) · Pθ(Y = y, X = x)/f(x)
     ≥ Σ_{x∈X} f(x) log( Pθ(Y = y, X = x)/f(x) ),

where the last inequality uses Jensen's inequality and the fact that log is a concave function. Note that
equality holds iff

Pθ(Y = y, X = x) / f(x)

does not depend on x, for example if f(x) = Pθ(X = x|Y = y).
Hence there exists a constant c > 0 at each time-step³ such that

`(θ′) ≥ Qt(θ′) + c for all θ′ and `(θ̂(t)) = Qt(θ̂(t)) + c.

Hence
`(θ̂(t+1) ) ≥ Qt (θ̂(t+1) ) + c ≥ Qt (θ̂(t) ) + c = `(θ̂(t) ),
where the first inequality is due to the argument just above and the second is true as θ̂(t+1) is, by definition,
maximizing Qt .

4.6 Kalman filter


Suppose we have a linear dynamical system

Xt = AXt−1 + Vt
Yt = BXt + Wt ,

where Xt ∈ R^p is the latent state, Yt ∈ R^q the observations, Vt the so-called process noise and Wt the
measurement noise.
Under a Gaussian noise assumption

Wt ∼ N (0, W ), Vt ∼ N (0, V ),

the joint vector of latent and observations over t = 1, . . . , T will have a joint Gaussian distribution. In
principle it is thus easy to derive the conditional distribution of, say, Xt |Y1t . Let Z ∈ Rp be a random vector
with a Gaussian distribution,
Z ∼ N (µ, Σ).
Then the conditional distribution of Zk, conditional on ZS = zS for some S ⊆ {1, . . . , p}, will again be Gaussian,

Zk | ZS = zS ∼ N(µ_{k|S}, Σ_{k|S}),
³ namely the entropy of f(x) = P_{θ̂(t)}(X = x|Y = y), as the entropy is −Σ_x f(x) log f(x)

with

µ_{k|S} = µ_k + Σ_{k,S} Σ_{S,S}^{−1} (z_S − µ_S)

Σ_{k|S} = Σ_{k,k} − Σ_{k,S} Σ_{S,S}^{−1} Σ_{S,k}

The problem with the direct approach is that the dimensionality of S grows linearly with T if we condition on the
observations Y1T = (Y1 , . . . , YT ).
Using the structure of the HMM again and the same message-passing as in the discrete case, we can define

X̂t|t = E(Xt |Y1t )


X̂t+1|t = E(Xt+1 |Y1t )
Σt|t = Cov(Xt |Y1t )
Σt+1|t = Cov(Xt+1 |Y1t )

The updates are usually split again into two parts, the time- and the measurement update. The time-update
concerns

X̂t|t → X̂t+1|t
Σt|t → Σt+1|t ,

while the measurement update concerns

X̂t+1|t → X̂t+1|t+1
Σt+1|t → Σt+1|t+1 ,

taking into account the new observation made at time t.


Conditioning on Y1t , we get

Xt+1 |Y1t = (AXt + Vt+1)|Y1t = AXt|Y1t + Vt+1

Yt+1 |Y1t = (BXt+1 + Wt+1)|Y1t = BXt+1|Y1t + Wt+1 .

The first equation yields the so-called time-update (updating t → t + 1 without using the new observation
at time t + 1)

X̂t+1|t = E(Xt+1 |Y1t ) = AX̂t|t


Σt+1|t = AΣt|t At + V

The second equation yields the measurement update (where the new observation Yt+1 is used to update
the conditional distribution). Note that the distribution of (Xt+1 , Yt+1 )|Y1t has a multivariate Gaussian
distribution
(Xt+1 , Yt+1 )|Y1t ∼ N (µ, S),
with
µ = ( X̂t+1|t , B X̂t+1|t )^t ,
S = [ Σt+1|t ,       Σt+1|t B^t ;
      B Σt+1|t ,     B Σt+1|t B^t + W ].
Hence we get the measurement update as

X̂t+1|t+1 = X̂t+1|t + Σt+1|t B^t (B Σt+1|t B^t + W)^{−1} (Yt+1 − B X̂t+1|t)

Σt+1|t+1 = Σt+1|t − Σt+1|t B^t (B Σt+1|t B^t + W)^{−1} B Σt+1|t

Perhaps surprisingly, the error covariance can be computed ahead of time (without seeing any observations).
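A compact R sketch of the time and measurement updates above (the system and noise matrices in the toy example are arbitrary illustrative choices):

## Kalman filter for X_t = A X_{t-1} + V_t, Y_t = B X_t + W_t with Gaussian noise.
kalman_filter <- function(Y, A, B, V, W, x0, P0) {
  Tn <- ncol(Y); p <- nrow(as.matrix(A))
  Xf <- matrix(NA, p, Tn)                     # hat X_{t|t}
  x  <- x0; P <- P0                           # hat X_{0|0}, Sigma_{0|0}
  for (t in 1:Tn) {
    xp <- A %*% x                             # time update
    Pp <- A %*% P %*% t(A) + V
    S  <- B %*% Pp %*% t(B) + W               # measurement update
    K  <- Pp %*% t(B) %*% solve(S)
    x  <- xp + K %*% (Y[, t, drop = FALSE] - B %*% xp)
    P  <- Pp - K %*% B %*% Pp
    Xf[, t] <- x
  }
  Xf
}

## Toy example: noisy observations of a scalar AR(1) state.
set.seed(1)
xs <- numeric(50)
for (t in 2:50) xs[t] <- 0.9 * xs[t - 1] + rnorm(1)
ys <- xs + rnorm(50, sd = 2)
filt <- kalman_filter(matrix(ys, nrow = 1), A = matrix(0.9), B = matrix(1),
                      V = matrix(1), W = matrix(4), x0 = 0, P0 = matrix(1))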

Steady-state Kalman filter If Σt+1|t converges to a Σ∗ (which it will in general), then Σ∗ is the
solution of a Riccati-type equation

Σ∗ = AΣ∗A^t + V − AΣ∗B^t (BΣ∗B^t + W)^{−1} BΣ∗A^t .

The estimated means then follow the recursion

X̂t+1|t = AX̂t|t−1 + L(Yt − BX̂t|t−1),

where L = AΣ∗B^t (BΣ∗B^t + W)^{−1} is the so-called Kalman gain. The first term updates the guess according
to the dynamics of the system while the second corrects it by using the newly available information via the
new observation.

