100% found this document useful (1 vote)

160 views66 pages

Time Series Analysis

This document provides an overview of topics in time series analysis including: 1) Definitions of time series, stationarity, and autocovariance functions. 2) Examples of modeling stationary time series using autoregressive (AR), moving average (MA), and ARMA processes. 3) Statistical analysis techniques like Box-Jenkins modeling, frequency analysis, and state space models.

Uploaded by

Shiv Ganesh S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

160 views66 pages

Time Series Analysis

Uploaded by

Shiv Ganesh S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 66

Time Series

H ILARY T ERM 2010

P ROF. G ESINE R EINERT
http://www.stats.ox.ac.uk/~reinert

Overview

• Chapter 1: What are time series? Types of data, examples, objectives. Def-
initions, stationarity and autocovariances.

• Chapter 2: Models of stationary processes. Linear processes. Autoregres-

sive, moving average models, ARMA processes, the Backshift operator.
Differencing, ARIMA processes. Second-order properties. Autocorrelation
and partial autocorrelation function. Tests on sample autorcorrelations.

• Chapter 3: Statistical Analyis. Fitting ARIMA models: The Box-Jenkins

approach. Model identification, estimation, verification. Analysis in the
frequency domain. Spectrum, periodogram, smoothing, filters.

• Chapter 4: State space models. Linear models. Kalman filters.

• Chapter 5: Nonlinear models. ARCH and stochastic volatility models.

Chaos.

Relevant books

1. P.J. Brockwell and R.A. Davis (2002). Introduction to Time Series and Fore-
casting. Springer.

2. P.J. Brockwell and R.A. Davis (1991). Time Series: Theory and methods.
Springer.

3. P. Diggle (1990). Time Series. Clarendon Press.

4. R.H. Shumway and D.S. Stoffer (2006). Time Series Analysis and Its Ap-
plications. With R Examples. 2nd edition. Springer.

5. R.L. Smith (2001) Time Series. At

http://www.stat.unc.edu/faculty/rs/s133/tsnotes.pdf

1
6. W.N. Venables and B.D. Ripley (2002). Modern Applied Statistics with S.
Springer.

Lectures take place Mondays 11-12 and Thursdays 10-11, weeks 1-4, plus
Wednesday Week 1 at 11, and not Thursday Week 3 at 10. There will be two
problem sheets, and two Practical classes Friday of Week 2 and Friday of Week
4 and there will be two Examples classes Tuesday 10-11 of Weeks 3 and 5. The
Practical in Week 4 will be assessed. Your marker for the problem sheets is Yang
Wu; the work is due Friday of Weeks 2 and 4 at 5 pm.
While the examples class will cover problems from the problem sheet, there
may not be enough time to cover all the problems. You will benefit most from the
examples class if you (attempt to) solve the problems on the sheet ahead of the
examples class.
Lecture notes are published at http://www.stats.ox.ac.uk/~reinert/
timeseries/timeseries.htm. The notes may cover more material than the
lectures. The notes may be updated throughout the lecture course.
Time series analysis is a very complex topic, far beyond what could be covered
in an 8-hour class. Hence the goal of the class is to give a brief overview of the
basics in time series analysis. Further reading is recommended.

1 What are Time Series?

Many statistical methods relate to data which are independent, or at least uncorre-
lated. There are many practical situations where data might be correlated. This is
particularly so where repeated observations on a given system are made sequen-
tially in time.
Data gathered sequentially in time are called a time series.

Examples

Here are some examples in which time series arise:

• Economics and Finance

• Environmental Modelling

• Meteorology and Hydrology

2
• Demographics
• Medicine
• Engineering
• Quality Control

The simplest form of data is a long-ish series of continuous measurements at

equally spaced time points.
That is
• observations are made at distinct points in time, these time points being
equally spaced
• and, the observations may take values from a continuous distribution.

The above setup could be easily generalised: for example, the times of obser-
vation need not be equally spaced in time, the observations may only take values
from a discrete distribution, . . .

If we repeatedly observe a given system at regular time intervals, it is very

likely that the observations we make will be correlated. So we cannot assume that
the data constitute a random sample. The time-order in which the observations
are made is vital.

Objectives of time series analysis:

• description - summary statistics, graphs
• analysis and interpretation - find a model to describe the time dependence
in the data, can we interpret the model?
• forecasting or prediction - given a sample from the series, forecast the next
value, or the next few values
• control - adjust various control parameters to make the series fit closer to a
target
• adjustment - in a linear model the errors could form a time series of cor-
related observations, and we might want to adjust estimated variances to
allow for this

3
2 Examples: from Venables and Ripley, data from Diggle (1990)
lh: a series of 48 observations at 10-minute intervals on luteinizing hormone
levels for a human female

3.5
3.0
2.5
lh

2.0
1.5

0 10 20 30 40

Time

deaths: monthly deaths in the UK from a set of common lung diseases for
the years 1974 to 1979
4000
3500
3000
2500
deaths

2000
1500
1000
500

1974 1975 1976 1977 1978 1979 1980

year

dotted series = males, dashed = females, solid line = total

(We will not split the series into males and females from now on.)

1.1 Definitions

Assume that the series Xt runs throughout time, that is (Xt )t=0,±1,±2,... , but is only
observed at times t = 1, . . . , n.

4
So we observe (X1 , . . . , Xn ). Theoretical properties refer to the underlying
process (Xt )t∈Z .
The notations Xt and X(t) are interchangeable.
The theory for time series is based on the assumption of ‘second-order sta-
tionarity’. Real-life data are often not stationary: e.g. they exhibit a linear trend
over time, or they have a seasonal effect. So the assumptions of stationarity below
apply after any trends/seasonal effects have been removed. (We will look at the
issues of trends/seasonal effects later.)

1.2 Stationarity and autocovariances

The process is called weakly stationary or second-order stationary if for all inte-
gers t, τ

E(Xt ) = µ
cov(Xt+τ , Xτ ) = γt

where µ is constant and γt does not depend on τ .

The process is strictly stationary or strongly stationary if

(Xt1 , . . . , Xtk ) and (Xt1 +τ , . . . , Xtk +τ )

have the same distribution for all sets of time points t1 , . . . , tk and all integers τ .

Notice that a process that is strictly stationary is automatically weakly station-

ary. The converse of this is not true in general.
However, if the process is Gaussian, that is if (Xt1 , . . . , Xtk ) has a multivariate
normal distribution for all t1 , . . . , tk , then weak stationarity does imply strong
stationarity.

Note that var(Xt ) = γ0 and, by stationarity, γ−t = γt .

The sequence (γt ) is called the autocovariance function.

The autocorrelation function (acf) (ρt ) is given by
γt
ρt = corr(Xt+τ , Xτ ) = .
γ0
The acf describes the second-order properties of the time series.

5
We estimate γt by ct , and ρt by rt , where
min(n−t,n)
1 X ct
ct = [Xs+t − X][Xs − X] and rt = .
n c0
s=max(1,1−t)

• For t > 0, the covariance cov(Xt+τ , Xτ ) is estimated from the n − t ob-

served pairs
(Xt+1 , X1 ), . . . , (Xn , Xn−t ).
If we take the usual covariance of these pairs, we would be using different
estimates of the mean and variances for each of the subseries (Xt+1 , . . . , Xn )
and (X1 , . . . , Xn−t ), whereas under the stationarity assumption these have
the same mean and variance. So we use X (twice) in the above formula.

A plot of rt against t is called the correlogram.

A series (Xt ) is said to be lagged if its time axis is shifted: shifting by τ lags
gives the series (Xt−τ ).
So rt is the estimated autocorrelation at lag t; it is also called the sample
autocorrelation function.

lh: autocovariance function

Series lh
0.30
0.25
0.20
0.15
ACF (cov)

0.10
0.05
0.00
−0.05

0 5 10 15

Lag

lh: autocorrelation function

6
Series lh

1.0
0.8
0.6
0.4
ACF

0.2
0.0
−0.2

0 5 10 15

Lag

deaths: autocorrelation function

Series deaths
1.0
0.5
ACF

0.0
−0.5

0.0 0.5 1.0 1.5

Lag

2 Models of stationary processes

Assume we have a time series without trends or seasonal effects. That is, if nec-
essary, any trends or seasonal effects have already been removed from the series.
How might we construct a linear model for a time series with autocorrelation?

Linear processes

The process (Xt ) is called a linear process if it has a representation of the form
∞
X
Xt = µ + cr ǫt−r
r=−∞

7
where µ is a common mean, {cr } is a sequence of fixed constants and {ǫt } are
independent randomP variables with mean 0 and common variance.
We assume c2r < ∞ to ensure that the variance of Xt is finite.
If the {ǫt } are identically distributed, then such a orocess is strictly stationary.
If cr = 0 for r < 0 it is said to be causal, i.e. the process at time t does not depend
on the future, as yet unobserved, values of ǫt .
The AR, MA and ARMA processes that we are now going to define are all
special cases of causal linear processes.

2.1 Autoregressive processes

Assume that a current value of the series is linearly dependent upon its previous
value, with some error. Then we could have the linear relationship

Xt = αXt−1 + ǫt

where ǫt is a white noise time series. [That is, the ǫt are a sequence of uncorrelated
random variables (possibly normally distributed, but not necessarily normal) with
mean 0 and variance σ 2 .]
This model is called an autoregressive (AR) model, since X is regressed on
itself. Here the lag of the autoregression is 1.
More generally we could have an autoregressive model of order p, an AR(p)
model, defined by
p
X
Xt = αi Xt−i + ǫt .
i=1

At first sight, the AR(1) process

Xt = αXt−1 + ǫt
P
is not in the linear form Xt = µ + cr ǫt−r . However note that

Xt = αXt−1 + ǫt
= ǫt + α(ǫt−1 + αXt−2 )
= ǫt + αǫt−1 + α2 ǫt−2 + · · · + αk−1 ǫt−k+1 + αk Xt−k
= ǫt + αǫt−1 + α2 ǫt−2 + · · ·

8
which is in linear form.

If ǫt has variance σ 2 , then from independence we have that

V ar(Xt ) = σ 2 + α2 σ 2 + · · · + α2(k−1) σ 2 + α2k V ar(Xt−k ).

The sum converges as we assume finite variance.

But the sum converges only if |α| < 1. Thus |α| < 1 is a requirement for the
AR(1) process to be stationary.
We shall calculate the acf later.

2.2 Moving average processes

Another possibility is to assume that the current value of the series is a weighted
sum of past white noise terms, so for example that

Xt = ǫt + βǫt−1 .

Such a model is called a moving average (MA) model, since X is expressed as a

weighted average of past values of the white noise series.
Here the lag of the moving average is 1. We can think of the white noise
series as being innovations or shocks: new stochastically uncorrelated information
which appears at each time step, which is combined with other innovations (or
shocks) to provide the observable series X.

More generally we could have a moving average model of order q, an MA(q)

model, defined by
q
X
Xt = ǫt + βj ǫt−j .
j=1

If ǫt has variance σ 2 , then from independence we have that

q
X
2
V ar(Xt ) = σ + βj2 σ 2 .
j=1

We shall calculate the acf later.

9
2.3 ARMA processes
An autoregressive moving average process ARMA(p, q) is defined by
p q
X X
Xt = αi Xt−i + βj ǫt−j
i=1 j=0

where β0 = 1.
A slightly more general definition of an ARMA process incorporates a non-
zero mean value µ, and can be obtained by replacing Xt by Xt − µ and Xt−i by
Xt−i − µ above.

From its definition we see that an MA(q) process is second-order stationary

for any β1 , . . . , βq .
However the AR(p) and ARMA(p, q) models do not necessarily define second-
order stationary time series.
For example, we have already seen that for an AR(1) model we need the con-
dition |α| < 1. This is the stationarity condition for an AR(1) process. All AR
processes require a condition of this type.
Define, for any complex number z, the autoregressive polynomial
φα (z) = 1 − α1 z − · · · − αp z p .
Then the stationarity condition for an AR(p) process is:
all the zeros of the function φα (z) lie outside the unit circle in the complex plane.
This is exactly the condition that is needed on {α1 , . . . , αp } to ensure that the
process is well-defined and stationary (see Brockwell and Davis 1991), pp. 85-87.

2.4 The backshift operator

Define the backshift operator B by
BXt = Xt−1 , B 2 Xt = B(BXt ) = Xt−2 , ...
We include the identity operator IXt = B 0 Xt = Xt . P
Using this notation we can write the AR(p) process Xt = pi=1 αi Xt−i + ǫt as
p
!
X
I− αi B i Xt = ǫt
i=1

10
or even more concisely
φα (B)X = ǫ.

P
Recall that an MA(q) process is Xt = ǫt + qj=1 βj ǫt−j .
Define, for any complex number z, the moving average polynomial

φβ (z) = 1 + β1 z + · · · + βq z q .

Then, in operator notation, the MA(q) process can be written

q
!
X
j
Xt = I + βj B ǫt
j=1

or
X = φβ (B)ǫ.

For an MA(q) process we have already noted that there is no need for a station-
arity condition on the coefficients βj , but there is a different difficulty requiring
some restriction on the coefficients.
Consider the MA(1) process

Xt = ǫt + βǫt−1 .

As ǫt has mean zero and variance σ 2 , we can calculate the autocovariances to be

γ0 = V ar(X0 ) = (1 + β 2 )σ 2
γ1 = Cov(X0 , X1 )
= Cov(ǫ0 , ǫ1 ) + Cov(ǫ0 , βǫ0 ) + Cov(βǫ−1 , ǫ1 ) + Cov(βǫ−1 , βǫ0 )
= Cov(ǫ0 , βǫ0 )
= βσ 2 ,
γk = 0, k > 2.

So the autocorrelations are

β
ρ0 = 1, ρ1 = , ρk = 0 k > 2.
1 + β2

11
Now consider the identical process but with β replaced by 1/β. From above
we can see that the autocorrelation function is unchanged by this transformation:
the two processes defined by β and 1/β cannot be distinguished.
It is customary to impose the following identifiability condition:

all the zeros of the function φβ (z) lie outside the unit circle in the complex plane.

The ARMA(p, q) process

p q
X X
Xt = αi Xt−i + βj ǫt−j
i=1 j=0

where β0 = 1, can be written

φα (B)X = φβ (B)ǫ.

The conditions required are

1. the stationarity condition on {α1 , . . . , αp }

2. the identifiability condition on {β1 , . . . , βq }

3. an additional identifiability condition: φα (z) and φβ (z) have no common

roots.

Condition 3 is to avoid having an ARMA(p, q) model which can, in fact, be ex-

pressed as a lower order model, say as an ARMA(p − 1, q − 1) model.

2.5 Differencing
The difference operator ∇ is given by

∇Xt = Xt − Xt−1

These differences form a new time series ∇X (of length n−1 if the original series
had length n). Similarly

∇2 Xt = ∇(∇Xt ) = Xt − 2Xt−1 + Xt−2

12
and so on.
If our original time series is not stationary, we can look at the first order dif-
ference process ∇X, or second order differences ∇2 X, and so on. If we find that
a differenced process is a stationary process, we can look for an ARMA model of
that differenced process.
In practice if differencing is used, usually d = 1, or maybe d = 2, is enough.

2.6 ARIMA processes

The process Xt is said to be an autoregressive integrated moving average process
ARIMA(p, d, q) if its dth difference ∇d X is an ARMA(p, q) process.
An ARIMA(p, d, q) model can be written
φα (B)∇d X = φβ (B)ǫ
or
φα (B)(I − B)d X = φβ (B)ǫ.

2.7 Second order properties of MA(q)

P
For the MA(q) process Xt = qj=0 βj ǫt−j , where β0 = 1, it is clear that E(Xt ) =
0 for all t.
Hence, for k > 0, the autocovariance function is
γk = E(Xt Xt−k )
" q q
#
X X
=E ( βj ǫt−j )( βi ǫt−k−i )
j=0 i=0
q q
XX
= βj βi E(ǫt−j ǫt−k−i ).
j=0 i=0

Since the ǫt sequence is white noise, E(ǫt−j ǫt−k−i ) = 0 unless j = i + k.

Hence the only non-zero terms in the sum are of the form σ 2 βi βi+k and we
have ( P
q−|k|
σ 2 i=0 βi βi+|k| |k| 6 q
γk =
0 |k| > q

13
and the acf is obtained via ρk = γk /γ0 .
In particular notice that the acf if zero for |k| > q. This ‘cut-off’ in the acf after
lag q is a characteristic property of the MA process and can be used in identifying
the order of an MA process.

Simulation: MA(1) with β = 0.5

Series ma1.sim

1.0
0.8
0.6
ACF

0.4
0.2
0.0

0 5 10 15 20 25 30

Lag

Simulation: MA(2) with β1 = β2 = 0.5

Series ma2.sim
1.0
0.8
0.6
ACF

0.4
0.2
0.0

0 5 10 15 20 25 30

Lag

To identify an MA(q) process:

We have already seen that for an MA(q) time series, all values of the acf be-
yond lag q are zero: i.e. ρk = 0 for k > q.
So plots of the acf should show a sharp drop to near zero after the qth coeffi-
cient. This is therefore a diagnostic for an MA(q) process.

14
2.8 Second order properties of AR(p)

Consider the AR(p) process

p
X
Xt = αi Xt−i + ǫt .
i=1

For this model E(Xt ) = 0 (why?).

Hence multiplying both sides of the above equation by Xt−k and taking ex-
pectations gives
p
X
γk = αi γk−i , k > 0.
i=1

In terms of the autocorrelations ρk = γk /γ0

p
X
ρk = αi ρk−i , k>0
i=1

These are the Yule-Walker equations.

The population autocorrelations ρk are thus found by solving the Yule-Walker
equations: these autocorrelations are generally all non-zero.
Our present interest in the Yule-Walker equations is that we could use them
to calculate the ρk if we knew the αi . However later we will be interested in
using them to infer the values of αi corresponding to an observed set of sample
autocorrelation coefficients.

Simulation: AR(1) with α = 0.5

Series ar1.sim
1.0
0.8
0.6
ACF

0.4
0.2
0.0

0 5 10 15 20 25 30

Lag

15
Simulation: AR(2) with α1 = 0.5, α2 = 0.25

Series ar2.sim

1.0
0.8
0.6
ACF

0.4
0.2
0.0

0 5 10 15 20 25 30

Lag

To identify an AR(p) process:

The AR(p) process has ρk decaying smoothly as k increases, which can be
difficult to recognize in a plot of the acf.
Instead, the corresponding diagnostic for an AR(p) process is based on a quan-
tity known as the partial autocorrelation function (pacf).
The partial autocorrelation at lag k is the correlation between Xt and Xt−k
after regression on Xt−1 , . . . , Xt−k+1 .
To construct these partial autocorrelations we successively fit autoregressive
processes of order 1, 2, . . . and, at each stage, define the partial autocorrelation
coefficient ak to be the estimate of the final autoregressive coefficient: so ak is
the estimate of αk in an AR(k) process. If the underlying process is AR(p), then
αk = 0 for k > p, so a plot of the pacf should show a cutoff after lag p.

The simplest way to construct the pacf is via the sample analogues of the Yule-
Walker equations for an AR(p)
p
X
ρk = αi ρ|k−i| k = 1, . . . , p
i=1

The sample analogue of these equations replaces ρk by its sample value rk :

p
X
rk = ai,p r|k−i| k = 1, . . . , p
i=1

16
where we write ai,p to emphasize that we are estimating the autoregressive coeffi-
cients α1 , . . . , αp on the assumption that the underlying process is autoregressive
of order p.
So we have p equations in the unknowns a1,p , . . . , ap,p , which could be solved,
and the pth partial autocorrelation coefficient is ap,p .

Calculating the pacf

In practice the pacf is found as follows.

Consider the regression of Xt on Xt−1 , . . . , Xt−k , that is the model
k
X
Xt = aj,k Xt−j + ǫt
j=1

with ǫt independent of X1 , . . . , Xt−1 .

Given data X1 , . . . , Xn , least squares estimates of {a1,k , . . . , ak,k } are obtained
by minimising
n k
!2
1 X X
σk2 = Xt − aj,k Xt−j .
n t=k+1 j=1

These aj,k coefficients can be found recursively in k for k = 0, 1, 2, . . . .

For k = 0: σ02 = c0 ; a0,0 = 0, and a1,1 = ρ(1).
And then, given the aj,k−1 values, the aj,k values are given by
Pk−1
ρk − j=1 aj,k−1 ρk−j
ak,k = Pk−1
1 − j=1 aj,k−1 ρj
aj,k = aj,k−1 − ak,k ak−j,k−1 j = 1, . . . , k − 1

and then

σk2 = σk−1
2
(1 − a2k,k ).

This recursive method is the Levinson-Durbin recursion.

The ak,k value is the kth sample partial correlation coefficient.
In the case of a Gaussian process, we have the interpretation that

ak,k = corr(Xt , Xt−k | Xt−1 , . . . , Xt−k+1 ).

17
If the process Xt is genuinely an AR(p) process, then ak,k = 0 for k > p.
So a plot of the pacf should show a sharp drop to near zero after lag p, and this
is a diagnostic for identifying an AR(p).

Simulation: AR(1) with α = 0.5

Series ar1.sim

0.5
0.4
0.3
Partial ACF

0.2
0.1
0.0

0 5 10 15 20 25 30

Lag

Simulation: AR(2) with α1 = 0.5, α2 = 0.25

Series ar2.sim
0.6
0.4
Partial ACF

0.2
0.0

0 5 10 15 20 25 30

Lag

Simulation: MA(1) with β = 0.5

18
Series ma1.sim

0.3
0.2
Partial ACF

0.1
0.0
−0.1
−0.2

0 5 10 15 20 25 30

Lag

Simulation: MA(2) with β1 = β2 = 0.5

Series ma2.sim
0.4
0.2
Partial ACF

0.0
−0.2

0 5 10 15 20 25 30

Lag

Tests on sample autocorrelations

To determine whether the values of the acf, or the pacf, are negligible, we√can
use the approximation that √ they each have a standard deviation of around 1/ n.
So this would give ±2/ n as approximate confidence bounds (2 is an approx-
imation to 1.96). In R these are shown
√ as blue dotted lines.
Values outside the range ±2/ n can be regarded as significant at about the
5% level. But if a large number of rk values, say, are calculated it is likely that
some will exceed this threshold even if the underlying time series is a white noise
sequence.
Interpretation is also complicated by the fact that the rk are√not independently
distributed. The probability of any one rk lying outside ±2/ n depends on the
values of the other rk .

19
3 Statistical Analysis
3.1 Fitting ARIMA models: The Box-Jenkins approach
The Box-Jenkins approach to fitting ARIMA models can be divided into three
parts:

• Identification;

• Estimation;

• Verification.

3.1.1 Identification

This refers to initial preprocessing of the data to make it stationary, and choosing
plausible values of p and q (which can of course be adjusted as model fitting
progresses).
To assess whether the data come from a stationary process we can

• look at the data: e.g. a time plot as we looked at for the lh series;

• consider transforming it (e.g. by taking logs;)

• consider if we need to difference the series to make it stationary.

For stationarity the acf should decay to zero fairly rapidly. If this is not true,
then try differencing the series, and maybe a second time if necessary. (In practice
it is rare to go beyond d = 2 stages of differencing.)

The next step is initial identification of p and q. For this we use the acf and the
pacf, recalling that

• for an MA(q) series, the acf is zero beyond lag q;

• for an AR(p) series, the pacf is zero beyond lag p.

√
We can use plots of the acf/pacf and the approximate ±2/ n confidence bounds.

20
3.1.2 Estimation: AR processes

For the AR(p) process

p
X
Xt = αi Xt−i + ǫt
i=1
P
we have the Yule-Walker equations ρk = pi=1 αi ρ|i−k| , for k > 0.
We fit the parameters α1 , . . . , αp by solving
p
X
rk = αi r|i−k| , k = 1, . . . , p
i=1

These are p equations for the p unknowns α1 , . . . , αp which, as before, can be

solved using a Levinson-Durbin recursion.
The Levinson-Durbin recursion gives the residual variance
n p
!2
1 X X
bp2 =
σ Xt − αbj Xt−j .
n t=p+1 j=1

This can be used to guide our selection of the appropriate order p. Define an
approximate log likelihood by
σp2 ).
−2 log L = n log(b
Then this can be used for likelihood ratio tests.
Alternatively, p can be chosen by minimising AIC where
AIC = −2 log L + 2k
and k = p is the number of unknown parameters in the model.
If (Xt )t is a causal AR(p) process with i.i.d. WN(0, σǫ2 ), then (see Brockwell
and Davis (1991), p.241) then the Yule-Walker estimator α̂ is optimal with respect
to the normal distribution.
Moreover (Brockwell and Davis (1991), p.241) for the pacf of a causal AR(p)
process we have that, for m > p,
√
nα̂mm
is asymptotically standard normal. However, the elements of the vector α̂m =
(α̂1m , . . . , α̂mm ) are in general not asymptotically uncorrelated.

21
3.1.3 Estimation: ARMA processes
Now we consider an ARMA(p, q) process. If we assume a parametric model for
the white noise – this parametric model will be that of Gaussian white noise – we
can use maximum likelihood.
We rely on the prediction error decomposition. That is, X1 , . . . , Xn have joint
density
n
Y
f (X1 , . . . , Xn ) = f (X1 ) f (Xt | X1 , . . . , Xt−1 ).
t=2

Suppose the conditional distribution of Xt given X1 , . . . , Xt−1 is normal with

bt and variance Pt−1 , and suppose that X1 ∼ N (X
mean X b1 , P0 ). (This is as for the
Kalman filter – see later.)

Then for the log likelihood we obtain

n
( )
X bt )2
(Xt − X
−2 log L = log(2π) + log Pt−1 + .
t=1
Pt−1

Here Xbt and Pt−1 are functions of the parameters α1 , . . . , αp , β1 , . . . , βq , and

so maximum likelihood estimators can be found (numerically) by minimising
−2 log L with respect to these parameters.

The matrix of second derivatives of −2 log L, evaluated at the mle, is the ob-
served information matrix, and its inverse is an approximation to the covariance
matrix of the estimators. Hence we can obtain approximate standard errors for the
parameters from this matrix.

In practice, for AR(p) for example, the calculation is often simplified if we

condition on the first m values of the series for some small m. That is, we use a
conditional likelihood, and so the sum in the expression for −2 log L is taken over
t = m + 1 to n.
For an AR(p) we would use some small value of m, m > p.

When comparing models with different numbers of parameters, it is important

to use the same value of m, in particular when minimising AIC = −2 log L +
2(p + q). In R this corresponds to keeping n.cond in the arima command fixed
when comparing the AIC of several models.

22
3.1.4 Verification

The third step is to check whether the model fits the data.
Two main techniques for model verification are
• Overfitting: add extra parameters to the model and use likelihood ratio or t
tests to check that they are not significant.
• Residual analysis: calculate residuals from the fitted model and plot their
acf, pacf, ‘spectral density estimates’, etc, to check that they are consistent
with white noise.

3.1.5 Portmanteau test of white noise

A useful test for the residuals is the Box-Pierce portmanteau test. This is based on
K
X
Q=n rk2
k=1

where K > p + q but much smaller than n, and rk is the acf of the residual series.
If the model is correct then, approximately,

Q ∼ χ2K−p−q

so we can base a test on this: we would reject the model at level α if Q >
χ2K−p−q (1 − α).
An improved test is the Box-Ljung procedure which replaces Q by
K
X rk2
Q̃ = n(n + 2) .
k=1
n−k

The distribution of Q̃ is closer to a χ2K−p−q than that of Q.

Once we have a suitable model for the time series, we could apply it to es-
timate, say, a trend in a time series. For example, suppose that x1 , . . . , xk are
explanatory variables, that ǫt is an ARMA(p,q)-process, and that we observe a
series yt . Our null model may then be that

Yt = µ + β1 x1 + . . . + βk xk + ǫt , t = 1, . . . , T,

23
and the alternative model could be

Yt = µ + ft (λ) + β1 x1 + . . . + βk xk + ǫt , t = 1, . . . , T,

where ft (λ) is a function for the trend. As ǫt is ARMA, we can write down the
likelihoods under the two models, and then carry out a generalised likelihood ratio
test to assess whether the trend is significant.

For confidence intervals, assume that all errors are independently normally
distributed. Then we can estimate the covariance matrix for ǫt using the Yule-
Walker equations; call this estimate V . Let X be the T × (k + 2) design matrix.
Then we estimate the covariance matrix of (µ̂, λ̂, βˆk ) by

Σ̂ = (X T (σˆ2 V )−1 X)−1 .

If σλ is the square root of the diagonal element in Σ̂ corresponding to λ, then

λ̂ ± σλ tα/2 is a 100 α-confidence interval for λ.

As an example, see X.Zheng, R.E.Basher, C.S.Thompson: Trend detection in

regional-mean temperature series: Maximum, minimum, mean, diurnal range and
SST. In: Journal of Climate Vol. 10 Issue 2 (1997), pp. 317–326.

3.2 Analysis in the frequency domain

We can consider representing the variability in a time series in terms of harmonic

components at various frequencies. For example, a very simple model for a time
series Xt exhibiting cyclic fluctuations with a known period, p say, is

Xt = α cos(ωt) + β sin(ωt) + ǫt

where ǫt is a white noise sequence, ω = 2π/p is the known frequency of the cyclic
fluctuations, and α and β are parameters (which we might want to estimate).
Examining the second-order properties of a time series via autocovariances/autocorrelations
is ‘analysis in the time domain’.
What we are about to look at now, examining the second-order properties by
considering the frequency components of a series is ‘analysis in the frequency
domain’.

24
3.2.1 The spectrum
Suppose we have a stationary time series Xt with autocovariances (γk ).
For any sequence of autocovariances (γk ) generated by a stationary process,
there exists a function F such that
Z π
γk = eikλ dF (λ)
−π

where F is the unique function on [−π, π] such that

1. F (−π) = 0

2. F is non-decreasing and right-continuous

3. the increments of F are symmetric about zero, meaning that for 0 6 a <
b 6 π,
F (b) − F (a) = F (−a) − F (−b).

The function F is called the spectral distribution function or spectrum. F has

many of the properties of a probability distribution function, which helps explain
its name, but F (π) = 1 is not required.
The interpretation is that, for 0 6 a < b 6 π, F (b) − F (a) measures the
contribution to the total variability of the process within the frequency range a <
λ 6 b.
If F is everywhere continuous and differentiable, then

dF (λ)
f (λ) =
dλ
is called the spectral density function and we have
Z π
γk = eikλ f (λ)dλ.
−π

P
It |γk | < ∞, then it can be shown that f always exists and is given by
∞ ∞
1 X iλk γ0 1X
f (λ) = γk e = + γk cos(λk).
2π k=−∞ 2π π k=1

25
By the symmetry of γk , f (λ) = f (−λ).
From the mathematical point of view, the spectrum and acf contain equiv-
alent information concerning the underlying stationary random sequence (Xt ).
However, the spectrum has a more tangible interpretation in terms of the inherent
tendency for realizations of (Xt ) to exhibit cyclic variations about the mean.

[Note that some authors put constants of 2π in different places. For example,
some put a factor of 1/(2π) in the integral expression for γk in terms of F, f , and
then they don’t need a 1/(2π) factor when giving f in terms of γk .]

Example: WN(0, σ 2 )

Here, γ0 = σ 2 , γk = 0 for k 6= 0, and so we have immediately

σ2
f (λ) = for all λ
2π
which is independent of λ.
The fact that the spectral density is constant means that all frequencies are
equally present, and this is why the sequence is called ‘white noise’. The converse
also holds: i.e. a process is white noise if and only if its spectral density is constant.

Note that the frequency is measured in cycles per unit time; for example, at
frequency 21 the series makes a cycle every two time units. The number of time
periods to complete a cycle is 2. In general, for frequency λ the number of time
units to complete a cycle is λ1 .
Data which occurs at discrete time points will need at least two points to de-
termine a cycle. Hence the highest frequency of interest is 12 .
Rπ
The integral −π eikλ dF (λ) is interpreted as a so-called Riemann-Stieltjes in-
tegral. If F is differentiable with derivative f , then
Z π Z π
ikλ
e dF (λ) = eikλ f (λ)dλ.
−π −π

If F is such that

0 if λ < λ0
F (λ) =
a if λ ≥ λ0
then Z π
eikλ dF (λ) = aeikλ0 .
−π

26
The integral is additive; if

 0 if λ < λ0
F (λ) = a if λ0 ≤ λ < λ1

a+b if λ ≥ λ1

then
Z π Z λ1 Z π
ikλ ikλ
e dF (λ) = e dF (λ) + eikλ dF (λ)
−π λ0 λ1
ikλ0
= ae + (a + b − a)eikλ1
= aeikλ0 + beikλ1 .

Example: Consider the process

Xt = U1 sin(2πλ0 t) + U2 cos(2πλ0 t)

with U1 , U2 independent, mean zero, variance σ 2 random variables. Then this pro-
cess has frequency λ0 ; the number of time periods for the above series to complete
one cycle is exactly λ10 . We calculate

γh = E{U1 sin(2πλ0 t) + U2 cos(2πλ0 t))

×(U1 sin(2πλ0 (t + h)) + U2 cos(2πλ0 (t + h))}
= σ 2 {sin(2πλ0 t) sin(2πλ0 (t + h)) + cos(2πλ0 t)) cos(2πλ0 (t + h))}.

Now we use that

1
sin α sin β = (cos(α − β) − cos(α + β))
2
1
cos α cos β = (cos(α − β) + cos(α + β))
2
to get

σ2
γh = (cos(2πλ0 h) − cos(2πλ0 (2t + h))
2
+ cos(2πλ0 h) − + cos(2πλ0 (2t + h)))
= σ 2 cos(2πλ0 h)
σ 2 −2πiλ0 h
= e + e2πiλ0 h .
2
27
σ2
So, with a = b = 2
, we use

 0 if λ < −λ0
σ2
F (λ) = 2
if − λ0 ≤ λ < λ0
 2
σ if λ ≥ λ0 .

Example: AR(1): Xt = αXt−1 + ǫt .

Here γ0 = σ 2 /(1 − α2 ) and γk = α|k| γ0 for k 6= 0.

So
∞
X
1
f (λ) = γ0 α|k| eiλk
2π k=−∞
∞ ∞
γ0 1 X k iλk 1 X k −iλk
= + γ0 α e + γ0 α e
2π 2π k=1 2π k=1

γ0 αeiλ αe−iλ
= 1+ +
2π 1 − αeiλ 1 − αe−iλ
γ0 (1 − α2 )
=
2π(1 − 2α cos λ + α2 )
σ2
=
2π(1 − 2α cos λ + α2 )

where we used e−iλ + eiλ = 2 cos λ.

Simulation: AR(1) with α = 0.5

Series: ar1.sim
AR (1) spectrum
2.0
spectrum

1.0
0.5

0.0 0.1 0.2 0.3 0.4 0.5

frequency

28
Simulation: AR(1) with α = −0.5
Series: ar1b.sim
AR (2) spectrum

2.0
spectrum

1.0
0.5

0.0 0.1 0.2 0.3 0.4 0.5

frequency

Plotting the spectral density f (λ), we see that in the case α > 0 the spectral
density f (λ) is a decreasing function of λ: that is, the power is concentrated at
low frequencies, corresponding to gradual long-range fluctuations.
For α < 0 the spectral density f (λ) increases as a function of λ: that is,
the power is concentrated at high frequencies, which reflects the fact that such a
process tends to oscillate.

ARMA(p, q) process
p q
X X
Xt = αi Xt−i + βj ǫt−j
i=1 j=0

The spectral density for an ARMA(p,q) process is related to the AR and MA

polynomials φα (z) and φβ (z).
The spectral density of Xt is

σ 2 |φβ (e−iλ )|2

f (λ) = .
2π |φα (e−iλ )|2

29
Example: AR(1) Here φα (z) = 1 − αz and φβ (z) = 1, so, for −π 6 λ < π,

σ2
f (λ) = |1 − αe−iλ |−2
2π
σ2
= |1 − α cos λ + iα sin λ|−2
2π
σ2
= {(1 − α cos λ)2 + (α sin λ)2 }−1
2π
σ2
=
2π(1 − 2α cos λ + α2 )

as calculated before.

Example: MA(1)
Here φα (z) = 1, φβ (z) = 1 + θz, and we obtain, for −π 6 λ < π,

σǫ2
f (λ) = |1 + θe−iλ |2
2π
σǫ2
= (1 + 2θ cos(λ) + θ2 ).
2π

Plotting the spectral density f (λ), we would see that in the case θ > 0 the
spectral density is large for low frequencies, small for high frequencies. This is
not surprising, as we have short-range positive correlation, smoothing the series.
For θ < 0 the spectral density is large around high frequencies, and small
for low frequencies; the series fluctuates rapidly about its mean value. Thus, to a
coarse order, the qualitative behaviour of the spectral density is similar to that of
an AR(1) spectral density.

3.2.2 The Periodogram

To estimate the spectral density we use the periodogram.
For a frequency ω we compute the squared correlation between the time series

30
and the sine/cosine waves of frequency ω. The periodogram I(ω) is given by
n 2
1 X
−iωt
I(ω) = e Xt
2πn t=1
( )2 ( n )2 
Xn X
1 
= Xt sin(ωt) + Xt cos(ωt)  .
2πn t=1 t=1

The periodogram is related to the estimated autocovariance function by

∞ ∞
1 X −iωt c0 1X
I(ω) = ct e = + ct cos(ωt);
2π t=−∞ 2π π t=1
Z π
ct = eiωt I(ω)dω.
−π

So the periodogram and the estimated autocovariance function contain the same
information. For the purposes of interpretation, sometimes one will be easier to
interpret, other times the other will be easier to interpret.

Simulation: AR(1) with α = 0.5

Series: ar1.sim
Raw Periodogram
1e+01
1e+00
1e−01
spectrum

1e−02
1e−03

0.0 0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.000144

Simulation: AR(1) with α = −0.5

31
Series: ar1b.sim
Raw Periodogram

1e+01
1e+00
spectrum

1e−01
1e−02
1e−03

0.0 0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.000144

Simulation: MA(1) with β = 0.5

Series: ma1.sim
Raw Periodogram
1e+01
1e+00
spectrum

1e−01
1e−02
1e−03

0.0 0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.000144

From asymptotic theory, at Fourier frequencies ω = ωj = 2πj/n, j =

1, 2, . . . , the periodogram ordinates {I(ω1 ), I(ω2 ), . . . } are approximately inde-
pendent with means {f (ω1 ), f (ω2 ), . . . }. That is for these ω
I(ω) ∼ f (ω)E
where E is an exponential distribution with mean 1.

Note that var[I(ω)] ≈ f (ω)2 , which does not tend to zero as n → ∞. So I(ω)
is NOT a consistent estimator.

The cumulative periodogram U (ω) is defined by

⌊n/2⌋
X X
U (ω) = I(ωk ) / I(ωk ).
0<ωk 6ω 1

32
This can be used to test residuals in a fitted model, for example. If we hope that
our residual series is white noise, the the cumulative periodogram of the residuals
should increase linearly: i.e. we can plot the cumulative periodogram (in R) and
look to see if the plot is an approximate straight line.
2πk
If Xt , t = 0, ±1, ±2, . . . is Gaussian white noise, and if ωk = n
are the
Fourier frequencies, −π < ωk ≤ π, then the random variables
Pi
I(ωk )
Pqk=1 , r = 1, . . . , q − 1,
k=1 I(ωk )

are distributed as the order statistics of q − 1 independent random variables, each

being uniformly distributed on [0, 1].
As a consequence, we may apply a Kolmogorov-Smirnov test to assess whether
the residuals of a time series are white noise.

Example: Brockwell & Davis (p 339, 340): Data generated by

Xt = cos(πt/3) + ǫt t = . . . , 100

where {ǫt } is Gaussian white noise with variance 1. There is a peak in the peri-
odogram at ω17 = 0.34π.

In addition, the independence of the periodogram ordinates at different Fourier

frequencies suggests that the sample periodogram, as a function of ω, will be
extremely irregular. For this reason smoothing is often applied, for instance using
a moving average, or more generally a smoothing kernel.

3.2.3 Smoothing
The idea behind smoothing is to take weighted averages over neighbouring fre-
quencies in order to reduce the variability associated with individual periodogram
values.
The main form of a smoothed esimator is given by
Z
ˆ 1 λ−ω
f (ω) = K I(λ)dλ.
h h
Here K is some kernel function (= a probability density function), for example a
standard normal pdf, and h is the bandwidth.

33
The bandwidth h affects the degree to which this process smooths the peri-
odogram. Small h = indictes a little smoothing, large h = a lot of smoothing.

In practice, the smoothed esimate fˆ(ω) will be evaluated by the sum

X Z ωj 1 λ − ω
fˆ(ω) = K I(λ)dλ
j ωj−1
h h

2π X 1 ωj − ω
≈ K I(ωj ).
n j h h

Writing
2π ωj − ω
gj = K
hn h
we calculate that
X
E(fˆ(ω)) ≈ gj f (ωj )
j

and
X Z
2π
V ar(fˆ(ω)) ≈ gj2 f (ωj )2 ≈ f (ω)2 K(x)2 dx
j
nh

as well as
Z
f ′′ (ω) 2
bias(fˆ(ω)) ≈ h x2 K(x)dx,
2
see Venables and Ripley, p.408. Then
q
2bias(fˆ(ω))/f ′′ (ω)

is referred to as the bandwidth in R.

As the degree of smoothing h increases, the variance decreases but the bias
increases.

Example series: lh

34
3.5
3.0
2.5
lh

2.0
1.5

0 10 20 30 40

Time

Example series: deaths

4000
3500
3000
2500
deaths

2000
1500
1000
500

1974 1975 1976 1977 1978 1979 1980

year

deaths: unsmoothed periodogram

deaths: Raw Periodogram

60
55
50
45
spectrum (dB)

40
35
30
25

0 1 2 3 4 5 6

frequency
bandwidth = 0.0481, 95% C.I. is (−6.26,16.36)dB

35
Suppose we have estimated the periodogram values I(ω1 ), I(ω2 ), . . . , where
ωj = 2πj/n, j = 1, 2, . . . .
An example of a simple way to smooth is to use a moving average, and so
estimate I(ωj ) by
1 1 1
I(ωj−4 ) + [I(ωj−3 ) + I(ωj−2 ) + · · · + I(ωj+3 )] + I(ωj+4 ).
16 8 16
1
Observe that the sum of the weights above (i.e. the 16 s and the 18 s) is 1.
Keeping the sum of weights equal to 1, this process could be modified by using
more, or fewer, I(ωk ) values to estimate I(ωj ).
Also, this smoothing process could be repeated.

If a series is (approximately) periodic, say with frequency ω0 , then periodogram

will show a peak near this frequency.
It may well also show smaller peaks at frequencies 2ω0 , 3ω0 , . . . .
The integer multiples of ω0 are called its harmonics, and the secondary peaks
at these high frequencies arise because the cyclic variation in the original series is
non-sinusoidal. (So a situation like this warns against interpreting multiple peaks
in the periodogram as indicating the presence of several distinct cyclic mecha-
nisms in the underlying process.)

In R, smoothing is controlled by the option spans to the spectrum function.

The unsmoothed periodogram (above) was obtained via spectrum(lh). The
plots are on log scale, in units of decibels, that is, the plot is of 10 log10 I(ω).
The smoothed versions below are
spectrum(lh, spans = 3)
spectrum(lh, spans = c(3,3))
spectrum(lh, spans = c(3,5))
In R, the default is to use the modified Daniell kernel. This kernel places half
the weights at the endpoints; the other half is distributed uniformly.
All of the examples, above and below, from Venables & Ripley.
V & R advise:
• trial and error needed to choose the spans;

• spans should be odd integers;

• use at least two, which are different, to get a smooth plot.

36
lh: Smoothed Periodogram, spans=3

0
−5
spectrum (dB)

−10
−15

0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.0159, 95% C.I. is (−4.32, 7.73)dB

lh: Smoothed Periodogram, spans=c(3,3)

−2
−4
−6
spectrum (dB)

−8
−10
−12
−14

0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.0217, 95% C.I. is (−3.81, 6.24)dB

lh: Smoothed Periodogram, spans=c(3,5)

−2
−4
spectrum (dB)

−6
−8
−10
−12

0.1 0.2 0.3 0.4 0.5

frequency
bandwidth = 0.0301, 95% C.I. is (−3.29, 4.95)dB

37
deaths: Smoothed Periodogram, spans=c(3,3)

55
50
spectrum (dB)

45
40
35

0 1 2 3 4 5 6

frequency
bandwidth = 0.173, 95% C.I. is (−3.81, 6.24)dB

deaths: Smoothed Periodogram, spans=c(3,5)

50
45
spectrum (dB)

40
35

0 1 2 3 4 5 6

frequency
bandwidth = 0.241, 95% C.I. is (−3.29, 4.95)dB

deaths: Smoothed Periodogram, spans=c(5,7)

50
45
spectrum (dB)

40
35

0 1 2 3 4 5 6

frequency
bandwidth = 0.363, 95% C.I. is (−2.74, 3.82)dB

lh: cumulative periodogram

38
Series: lh

1.0
0.8
0.6
0.4
0.2
0.0
0.0 0.1 0.2 0.3 0.4 0.5

frequency

deaths: cumulative periodogram

Series: deaths
1.0
0.8
0.6
0.4
0.2
0.0

0 1 2 3 4 5 6

frequency

3.3 Model fitting using time and frequency domain

3.3.1 Fitting ARMA models
The value of ARMA processes lies primarily in their ability to approximate a wide
range of second-order behaviour using only a small number of parameters.
Occasionally, we may be able to justify ARMA processes in terms of the basic
mechanisms generating the data. But more frequently, they are used as a means
of summarising a time series by a few well-chosen summary statistics: i.e. the
parameters of the ARMA process.

Now consider fitting an AR model to the lh series. Look at the pacf:

39
Series lh

0.6
0.4
Partial ACF

0.2
0.0
−0.2

5 10 15

Lag

Fit an AR(1) model:

lh.ar1 <- ar(lh, F, 1)

The fitted model is:

Xt = 0.58Xt−1 + ǫt
with σ 2 = 0.21.
One residual plot we could look at is

cpgram(lh.ar1$resid)

lh: cumulative periodogram of residuals from AR(1) model

AR(1) fit to lh
1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.1 0.2 0.3 0.4 0.5

frequency

Also try select the order of the model using AIC:

40
lh.ar <- ar(lh, order.max = 9)
lh.ar$order
lh.ar$aic

This selects the AR(3) model:

Xt = 0.65Xt−1 − 0.06Xt−2 − 0.23Xt−3 + ǫt

with σ 2 = 0.20.
The same order is selected when using

lh.ar <- ar(lh, order.max = 20)

lh.ar$order

lh: cumulative periodogram of residuals from AR(3) model

AR(3) fit to lh
1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.1 0.2 0.3 0.4 0.5

frequency

By default, ar fits by using the Yule-Walker equations.

We can also use
arima in library(MASS)
to fit these models using maximum likelihood. (Examples in Venables & Rip-
ley, and in the practical class)
The function tsdiag produces diagnostic residuals plots. As mentioned in a
previous lecture, the p-values from the Ljung-Box statistic are of concern if they
go below 0.05 (marked with a dotted line on the plot).

lh: diagnostic plots from AR(1) model

41
Standardized Residuals

2
1
0
−1
0 10 20 30 40

Time

ACF of Residuals

1.0
0.6
ACF

0.2
−0.2
0 5 10 15

Lag

p values for Ljung−Box statistic

0.8
p value

0.4
0.0

2 4 6 8 10

lag

lh: diagnostic plots from AR(3) model

Standardized Residuals
3
2
1
0
−1

0 10 20 30 40

Time

ACF of Residuals
1.0
0.6
ACF

0.2
−0.2

0 5 10 15

Lag

p values for Ljung−Box statistic

0.8
p value

0.4
0.0

2 4 6 8 10

lag

3.3.2 Estimation and elimination of trend and seasonal components

The first step in the analysis of any time series is to plot the data.
If there are any apparent discontinuities, such as a sudden change of level, it
may be advisable to analyse the series by first breaking it into a homogeneous
segments.

We can think of a simple model of a time series as comprising

• deterministic components, i.e. trend and seasonal components

• plus a random or stochastic component which shows no informative pattern.

42
We might write such a decomposition model as the additive model

Xt = mt + st + Zt

where

mt = trend component (or mean level) at time t;

st = seasonal component at time t;
Zt = random noise component at time t.

Here the trend mt is a slowly changing function of t, and if d is the number of

observations in a complete cycle then st = st−d .
In some applications a multiplicative model may be appropriate

Xt = mt st Zt .

After taking logs, this becomes the previous additive model.

It is often possible to look at a time plot of the series to spot trend and seasonal
behaviour. We might look for a linear trend in the first instance, though in many
applications non-linear trend is also of interest and present.
Periodic behaviour is also relatively straightforward to spot. However, if there
are two or more cycles operating at different periods in a time series, then it may
be difficult to detect such cycles by eye. A formal Fourier analysis can help.
The presence of both trend and seasonality together can make it more difficult
to detect one or the other by eye.

Example: Box and Jenkins airline data. Monthly totals (thousands) of inter-
national airline passengers, 1949 to 1960.
600
500
AirPassengers

400
300
200
100

1950 1952 1954 1956 1958 1960

Time

43
airpass.log <- log(AirPassengers)
ts.plot(airpass.log)

6.5
6.0
airpass.log

5.5
5.0

1950 1952 1954 1956 1958 1960

Time

We can aim to estimate and extract the deterministic components mt and st ,

and hope that the residual or noise component Zt turns out to be a stationary
process. We can then try to fit an ARMA process, for example, to Zt .

An alternative approach (Box-Jenkins) is to apply the difference operator ∇

repeatedly to the series Xt until the differenced series resembles a realization of a
stationary process, and then fit an ARMA model to the suitably differenced series.

3.3.3 Elimination of trend when there is no seasonal component

The model is
Xt = mt + Zt
where we can assume E(Zt ) = 0.

1: Fit a Parametric Relationship

We can take mt to be the linear trend mtP = α0 + α1 t, or some similar polyno-
mial trend, and estimate mt by minimising (Xt − mt )2 with respect to α0 , α1 .
Then consider fitting stationary models to Yt = Xt − m b t , where m
bt = α
b0 + α
b1 t.
Non-linear trends are also possible of course, say log mt = α0 + α1 k t (0 <
k < 1), mt = α0 /(1 + α1 e−α2 t ), . . .
In practice, fitting a single parametric relationship to an entire time series is
unrealistic, so we may fit such curves as these locally, by allowing the parameters
α to vary (slowly) with time.

44
The resulting series Yt = Xt − m
b t is the detrended time series.

Fit a linear trend:

6.5
6.0
airpass.log

5.5
5.0

0 20 40 60 80 100 120 140

time step

The detrended time series:

0.3
0.2
0.1
0.0
−0.1
−0.2
−0.3

0 20 40 60 80 100 120 140

time step

2: Smoothing
If the aim is to provide an estimate of the local trend in a time series, then we
can apply a moving average. That is, take a small sequence of the series values
Xt−q , . . . , Xt , . . . , Xt+q , and compute a (weighted) average of them to obtain a
smoothed series value at time t, say m b t , where
q
1 X
m
bt = Xt+j .
2q + 1 j=−q

45
It is useful to think of {m bt } by application of
b t } as a process obtained from {X
P∞
b t = j=−∞ aj Xt+j , with weights aj = 1/(2q + 1), −q 6 j 6 q,
a linear filter m
and aj = 0, |j| > q.

This filter is a ‘low pass’ filter since it takes data Xt and removes from it the
rapidly fluctuating component Yt = Xt − m b t , to leave the slowly varying estimated
trend term m b t.
We should not choose q too large since, if mt is not linear, although the filtered
process will be smooth, it will not be a good estimate of mt .
If we apply two filters in succession, for example to progressively smooth a
series, we are said to be using a convolution of the filters.

By careful choice of the weights aj , it is possible to design a filter that will not
only be effective in attenuating noise from the data, but which will also allow a
larger class of trend functions.
Spencer’s 15-point filter has weights
aj = a−j |j| 6 7
aj = 0 |j| > 7
1
(a0 , a1 , . . . , a7 ) = (74, 67, 46, 21, 3, −5, −6, −3)
320
and has the property that a cubic polynomial passes through the filter undistorted.
spencer.wts <- c(-3,-6,-5,3,21,46,67,74,67,46,21,3,-5,-6,-3)/320
airpass.filt <- filter(airpass.log, spencer.wts)
ts.plot(airpass.log, airpass.filt, lty=c(2,1))

Original series and filtered series using Spencer’s 15-point filter:

6.5
6.0
log(AirPassengers)

5.5
5.0

1950 1952 1954 1956 1958 1960

Time

46
Detrended series via filtering:

0.15
0.10
0.05
airpass.log − airpass.filt

0.00
−0.05
−0.10
−0.15

1950 1952 1954 1956 1958 1960

Time

3: Differencing
Recall that the difference operator is ∇Xt = Xt −Xt−1 . Note that differencing
is a special case of applying a linear filter.
We can think of differencing as a ‘sample derivative’. If we start with a linear
function, then differentiation yields a constant function, while if we start with a
quadratic function we need to differentiate twice to get to a constant function.
Similarly, if a time series has a linear trend, differencing the series once will
remove it, while if the series has a quadratic trend we would need to difference
twice to remove the trend.

Detrended series via differencing:

0.2
0.1
0.0
−0.1
−0.2

1950 1952 1954 1956 1958 1960

Year

47
3.4 Seasonality
After removing trend, we can remove seasonality. (Above, all detrended versions
of the airline data clearly still have a seasonal component.)
1: Block averaging
The simplest way to remove seasonality is to average the observations at the
same point in each repetition of the cycle (for example, for monthly data average
all the January values) and subtract that average from the values at those respective
points in the cycle.
2: Seasonal differencing
The seasonal difference operator is ∇s Xt = Xt − Xt−s where s is the period
of the seasonal cycle. Seasonal differencing will remove seasonality in the same
way that ordinary differencing will remove a polynomial trend.

airpass.diff<-diff(airpass.log)
airpass.diff2 <- diff(airpass.diff, lag=12)
ts.plot(airpass.diff2)
0.15
0.10
0.05
airpass.diff2

0.00
−0.05
−0.10
−0.15

1950 1952 1954 1956 1958 1960

Time

After differencing at lag 1 (to remove trend), then at lag 12 (to remove seasonal
effects), the log(AirPassengers) series appears stationary.
That is, the series ∇∇12 X, or equivalently the series (1 − B)(1 − B 12 )X,
appears stationary.

R has a function stl which you can use to estimate and remove trend and
seasonality using ‘loess’.

48
stl is a complex function, you should consult the online documentation be-
fore you use it. The time series chapter of Venables & Ripley contains examples
of how to use stl. As with all aspects of that chapter, it would be a good idea for
you to work through the examples there.

We could now look to fit an ARMA model to ∇∇12 X, or to the residual

component extracted by stl.

Seasonal ARIMA models

Recall that X is an ARMA(p, q) process if

p q
X X
Xt − αi Xt−i = ǫt + βj ǫt−j
i=1 j=1

and X is an ARIMA(p, d, q) process if ∇d X is ARMA(p, q).

In shorthand notation, these processes are

φα (B)X = φβ (B)ǫ and φα (B)∇d X = φβ (B)ǫ.

Suppose we have monthly observations, so that seasonal patterns repeat every

s = 12 observations. Then we may typically expect Xt to depend on such terms
as Xt−12 , and maybe Xt−24 , as well as Xt−1 , Xt−2 , . . . .

A general seasonal ARIMA (SARIMA) model, is

Φp (B)ΦP (B s )Y = Φq (B)ΦQ (B s )ǫ

where Φp , ΦP , Φq , ΦQ are polynomials of orders p, P, q, Q and where

Y = (1 − B)d (1 − B s )D X.

Here:
• s is the number of observations per season, so s = 12 for monthly data;

• D is the order of seasonal differencing, i.e. differencing at lag s (we were

content with D = 1 for the air passenger data);

• d is the order of ordinary differencing (we were content with d = 1 for the
air passenger data).

49
This model is often referred to as an ARIMA((p, d, q) × (P, D, Q)s ) model.

Examples

1. Consider a ARIMA model of order (1, 0, 0) × (0, 1, 1)12 .

This model can be written

(1 − αB)Yt = (1 + βB 12 )ǫt

where
Yt = Xt − Xt−12 .

2. The ‘airline model’ (so named because of its relevance to the air passenger
data) is a ARIMA model of order (0, 1, 1) × (0, 1, 1)12 .
This model can be written

Yt = (1 + β1 B)(1 + β2 B 12 )ǫt

where Yt = ∇∇12 X is the series we obtained after differencing to reach sta-

tionarity, i.e. one step of ordinary differencing, plus one step of seasonal (lag 12)
differencing.

3.5 Forecasting in ARMA models

As a linear time series, under our usual assumptions on the AR-polynomial and
the MA-polynomial, we can write an ARMA model as a causal model,
∞
X
Xt = cr ǫt−r .
r=0

Suppose that we are interested in forecasting XT +k from observations {Xt , t ≤

T }. Consider forecasts of the form
∞
X
X̂T,k = cr+k ǫT −r .
r=0

50
Then
∞
X ∞
X
XT +k − X̂T,k = cr ǫT +k−r − cr+k ǫT −r
r=0 r=0
k−1
X X∞ ∞
X
= cr ǫT +k−r + cr ǫT +k−r − cs ǫT −s+k
r=0 r=k s=k
k−1
X
= cr ǫT +k−r .
r=0

This gives rise to the mean squared prediction error

k−1
!
X
E{(XT +k − X̂T,k )2 } = c2r σǫ2 .
r=0

Thus ∞
X
X̂T,k = cr+k ǫT −r
r=0
is our theoretical optimal predictor.
Note that the mean squared prediction errors are based solely on the uncer-
tainty of prediction; they do not take errors in model identification into account.

In practice one usually uses a recursive approach. Define X̂T,k to be the

optimal predictor of XT +k given X1 , . . . , XT ; for −T + 1 ≤ k ≤ 0, we set
X̂T,k = XT +k . Then use the recursive relation
p q
X X
X̂T,k = αr X̂T,k−r + ǫ̂T +k + βs ǫ̂T +k−s
r=1 s=1

For k ≤ 0 we can use thus relation to calculate ǫ̂t for 1 ≤ t ≤ T . For k > 0 we
define ǫ̂t = 0 for t > T , to calculate the forecasts.

The difficulty is how to start off the recursion. Two standard solutions are
Either assume Xt = ǫt = 0 for all t ≤ 0,
or forecast the series in reverse direction to determine estimates of X0 , X−1 , . . . ,
as well as ǫ0 = 0, ǫ−1 = 0, etc.
A superior approach is to recast the model in state space form and apply the
Kalman filter.

51
4 State space models
State-space models assume that the observations (Xt )t are incomplete and noisy
functions of some underlying unobservable process (Yt )t , called the state process,
which is assumed to have a simple Markovian dynamics. The general state space
model is described by
1. Y0 , Y1 , Y2 , . . . is a Markov chain
2. Conditionally on {Yt }t , the Xt ’s are independent, and Xt depends on Yt
only.
When the state variables are discrete, one usually calls this model a hidden Markov
model; the term state space model is mainly used for continuous state variables.

4.1 The linear state space model

A prominent role is played by the linear state space model
Yt = Gt Yt−1 + vt (1)
Xt = Ht Yt + wt , (2)
where Gt and Ht are deterministic matrices, and (vt )t and (wt )t are two indepen-
dent white noise sequences with vt and wt being mean zero and having covariance
matrices Vt2 and Wt2 , respectively. The general case,
Yt = gt (Yt−1 , vt )
Xt = ht (Yt , wt ),
is much more flexible. Also, multivariate models are available. The typical ques-
tion on state space models is the estimation or the prediction of the states (Yt )t in
terms of the observed data points (Xt )t .

Example. Suppose the two-dimensional model

0 1 1
Yt = Yt−1 + Xt ,
0 0 β
where Xt is one-dimensional mean zero white noise. Then
Y2,t = βXt
Y1,t = Y2,t−1 + Xt = Xt + βXt−1 ,

52
so we obtain an MA(1)-process.

Example. Suppose the model

Yt = φYt−1 + vt
Xt = Yt + wt ,

where (vt )t and (wt )t are two independent white noise sequences with vt and wt
being mean zero and having variances V 2 and W 2 , respectively. Then

Xt − φXt−1 = Yt − φYt−1 + wt − φwt−1

= vt + wt − φwt−1 .

The right-hand side shows that all correlations at lags > 1 are zero. Hence
the right-hand side is equivalent to an MA(1) model, and thus Xt follows an
ARMA(1,1)-model.

To make the connection with ARMA(1,1) more transparent, note that

ǫ t = v t + wt

gives a mean zero whiteq noise series with variance σǫ2 = V 2 + W 2 . Thus ǫt has
2 2
the same distribution as V W+W2 wt . Putting
r
W2
β=− φ
V 2 + W2
thus gives that

vt + wt − φwt−1 = ǫt + βǫt−1 .

In fact any ARMA(p,q)-model with Gaussian WN can be formulated as a state

space model. The representation of an ARMA model as a state-space model is
however not unique, see Brockwell and Davis (1991), pp.469-470.
Note that the above model is more flexible than an ARMA model. If, for exam-
ple, the observation at time t is missing, then we simply put Ht = (0, 0, . . . , 0)T .

53
4.2 Filtering, smoothing, and forecasting

The primary aims of the analysis of state space models are to produce estimators
for the underlying unobserved signal Yt given the data Xs = (X1 , . . . , Xs ) up to
time s. When s < t the problem is called forecasting, when s = t it is called
filtering, and when s > t it is called smoothing. For a derivation of the results
below see also Smith (2001).
We will throughout assume the white noise to be Gaussian.

In Kalman filters made easy by Terence Tong, at

http://openuav.astroplanes.com/library/docs/writeup.pdf
an analogy of the following type is given.
Suppose that you just met a new friend and you do not know how punctual
your new friend will be. Based on your history, you estimate when the friend will
arrive. You do not want to come too early, but also you do not want to be too late.
You arrive on time at your first meeting, while your friend arrives 30 min late.
So you adapt your estimate, you will not be so early next time.

The Kalman filter is a method for updating parameter estimates instantly when
a new observation occurs, based on the likelihood of the current data - without
having to re-estimate a large number of parameters using all past data.

The Kalman filter was first developed in an engineering framework, and we

shall use it for filtering and forecasting. It is a recursive method to calculate a con-
ditional distribution within a multivariate normal framework. As it is recursive,
only the estimated state from the previous time step and the current measurement
are needed to compute the estimate for the current state.

The state of the filter is represented by two variables: the estimate of the state
at time t; and the error covariance matrix (a measure of the estimated accuracy of
the state estimate). The Kalman filter has two distinct phases: Predict and Update.
The predict phase uses the state estimate from the previous timestep to produce
an estimate of the state at the current timestep. In the update phase, measurement
information at the current timestep is used to refine this prediction to arrive at a
new, (hopefully) more accurate state estimate, again for the current timestep.

It is useful to first revise some distributional results for multivariate normal

54
distributions. Suppose that

Z1 µ1 Σ11 Σ12
∼ MVN , . (3)
Z2 µ2 Σ21 Σ22
Then the conditional distribution of Z1 given Z2 = z2 is

L(Z1 |Z2 = z2 ) = MVN µ1 + Σ12 Σ−1 −1
11 (z2 − µ2 ), Σ11 − Σ12 Σ11 Σ21 (4)
and conversely, if Z2 ∼ MVN (µ2 , Σ22 ) and if (4) holds, then (3) holds.

In particular, the conditional distribution of Z1 given Z2 = z2 is again normal,

and we can give its mean and its covariance matrix explicitly.

If Z1 , Z2 , Z3 are jointly normally distributed with means µp and covariance

matrices Σpq = E[(Zp − µp )(Zq − µq )′ ], for p, q = 1, 2, 3, and assume that µ3 = 0
and Σ23 = 0. Then
E(Z1 |Z2 , Z3 ) = E(Z1 |Z2 ) + Σ13 Σ−1
33 Z3

and
V ar(Z1 |Z2 , Z3 ) = V ar(Z1 |Z2 ) − Σ13 Σ−1 ′
33 Σ13 .

To illustrate how the filter works, we first look at a one-dimensional example.

Let X(t−1) = {x1 , . . . , xt−1 } be the set of past observations from a time series X
which arises in the state space model
Xt = Yt + ǫt
Yt = Yt−1 + ηt−1 ,
where ǫt is mean-zero normal with variance σǫ2 and ηt is mean-zero normal with
variance ση2 ; all independent.

Assume that the conditional distribution of Yt given X(t−1) is N (at , Pt ), where

at and Pt are to be determined. Given at and Pt , our objective is to calculate at+1
and Pt+1 when xt , the next observation, arrives.

Now
at+1 = E(Yt+1 |X(t) )
= E(Yt + ηt |X(t) )
= E(Yt |X(t) )

55
and
Pt+1 = V ar(Yt+1 |X(t) )
= V ar(Yt + ηt |X(t) )
= V ar(Yt |X(t) ) + ση2 .

Define vt = xt − at and Ft = V ar(vt ). Then

E(vt |X(t−1) ) = E(Yt + ǫt − at |X(t−1) )
= at − at = 0.
Thus E(vt ) = E(E(vt |X(t−1) )) = 0 and
Cov(vt , xj ) = E(vt xj ) = E[E(vt |X(t−1) )xj ] = 0,
and as vt and xj are normally distributed, they are independent for j = 1, . . . , t−1.
When X(t) is fixed, X(t−1) and xt are fixed, so X(t−1) and vt are fixed, and vice
versa. Thus
E(Yt |X(t) ) = E(Yt |X(t−1) , vt )
and
V ar(Yt |X(t) ) = V ar(Yt |X(t−1) , vt ).

Now we apply the conditional mean and variance formula for multivariate
normally distributed random variables:
E(Yt |X(t) ) = E(Yt |X(t−1) , vt )
= E(Yt |X(t−1) ) + Cov(Yt , vt )V ar(vt )−1 vt ,
where
Cov(Yt , vt ) = E(Yt (xt − at ))
= E[Yt (Yt + ǫt − at )]
= E[Yt (Yt − at )]
= E[(Yt − at )2 ] + at E[E(Yt − at |X(t−1) )]
= E[(Yt − at )2 ]
= E[E{(Yt − at )2 |X(t−1) }]
= E[V ar(Yt |X(t−1) )]
= Pt ,

56
and

V ar(vt ) = Ft
= V ar(Yt + ǫt − at )
= V ar(Yt |X(t−1) ) + σǫ2
= Pt + σǫ2 .

Put
Pt
Kt =
Ft
then, since at = E(Yt |X(t−1) ), we have

E(Yt |X(t) ) = at + Kt vt .

Now

V ar(Yt |X(t) ) = V ar(Yt |X(t−1) , vt )

= V ar(Yt |X(t−1) ) − Cov(Yt , vt )2 V ar(vt )−1
P2
= Pt − t
Ft
= Pt (1 − Kt ).

Thus the rule set of relations for updating from time t to t + 1 is

vt = xt − at Kalman filter residual; innovation

at+1 = at + K t v t
Ft = Pt + σǫ2
Pt+1 = Pt (1 − Kt ) + ση2
Pt
Kt = ,
Ft
for t = 1, . . . , n.

Note: a1 and P1 are assumed to be known; we shall discuss how to initialize

later.

57
Now consider the more general model

Yt = Gt Yt−1 + vt
Xt = Ht Yt + wt ,

with (vt )t independent white noise W N (0, Vt ), and (wt )t ind. W N (0, Wt ). Here,
Yt is a vector representing unknown states of the system, and Xt are the observed
data. . Put Xt = (X1 , X2 , . . . , Xt ), the history of X up to time t, and

Yts = E(Yt |Xs )

Pts1 ,t2 = E{(Yt1 − Yts1 )(Yt2 − Yts2 )T }
= E{(Yt1 − Yts1 )(Yt2 − Yts2 )T |Xs }.

When t1 = t2 = t, we will write Pts for convenience.

Suppose Y00 = µ and P00 = Σ0 , and that the conditional distribution of Yt−1 given
the history Xt−1 up to time t − 1,

L(Yt−1 |Xt−1 ) = MVN (Yt−1

t−1
, Pt−1 ).

Then L(Yt |Xt−1 ) is again multivariate normal. We have that

E(Xt |Yt ) = Ht Yt
V ar(Xt |Yt ) = Wt .

With

Rt = Gt Pt−1 G−1
t + Vt

the conditional distribution of (Xt , Yt )T given Xt−1 is given by

!
t−1
Xt t−1 Ht Gt Yt−1 Wt + Ht Rt HtT Ht Rt
L X = MVN t−1 , .
Yt Gt Yt−1 Rt HtT Rt

We can compute that the conditional distribution of Yt given Xt−1 is multivariate

(t−1)
normal with mean Ytt and variance Pt , where

Ytt = Gt Yt−1
t−1
+ Rt HtT (Wt + Ht Rt HtT )−1 (Xt − Ht Gt Yt−1
t−1
)
(t−1)
Pt = Rt − Rt HtT (Wt + Ht Rt HtT )−1 Ht Rt .

58
These equations are known as the Kalman filter updating equations. This solves
the filtering problem.
t−1
Have a look at the expression for Ytt . It contains the term Gt Yt−1 , which is simply
t−1
what we would predict if it were known that Yt−1 = Yt−1 , plus a term which
t−1
depends on the observed error in forecasting, i.e. (Xt − Ht Gt Yt−1 ).

Note that we initialized the recursion by X00 = µ and P00 = σ0 . Instead one might
have initialized the recursion by some prior distribution, of by an uninformative
prior X00 = 0, P00 = kI, where I denotes the identity matrix.
s s
For forecasting, suppose t > s. By induction, assume we know Yt−1 , Pt−1 . Then

Yts = Gt Yt−1
s

Pts = Gt Pt−1
s
GTt + Vt .

Recursion solves the forecasting problem.

The R command predict(arima) uses Kalman filters for prediction; see for
example the airline passenger example, with the code on the course website.

We can calculate that the conditional distribution of Xt+1 given Xt is

t T
MVN (Ht+1 Gt+1 Yt+1 , Ht+1 Rt+1 Ht+1 + Wt+1 ).

This fact is the basis of the prediction error decomposition, giving us a likelihood
for parameter estimation.

For smoothing we use the Kalman smoother. We proceed by backwards induction.

Suppose that Ytt , Ptt are known, where Ptt is the conditional covariance matrix of
Xt given {Y1 , . . . , Yt } . With a similar derivation as above, for t = n, n−1, . . . , 1,
n t−1
Yt−1 = Yt−1 + Jt−1 (Ytn − Ytn−1 )
n t−1
Pt−1 = Pt−1 + Jt−1 (Ptn − Ptt−1 )Jt−1
T

where
t−1 T
Jt−1 = Pt−1 H (Ptt−1 )−1 .

Note that these procedures differ for different initial distributions, and sometimes
it may not clear which initial distribution is appropriate.

59
See also Kalman filters made easy by Terence Tong, at
http://openuav.astroplanes.com/library/docs/writeup.pdf.

Example: Johnson & Johnson quarterly earnings per share, 1960-1980. The
model is

Xt = Tt + St + vt , observed
Tt = φTt−1 + wt1 , trend
St = St−1 + St−2 + St−3 + wt2 seasonal component.

Assume that the seasonal components sum to zero over the four quarters, in ex-
pectation. Here wt are i.i.d. mean-zero normal vectors with covariance matrix Q,
and vt are i.i.d. mean-zero normal with covariance R.
The state vector is
Yt = (Tt , St , St−1 , St−2 ).

See Shumway and Stoffer, p.334-336. The initial estimates are as follows.
Growth is about 3 % per year, so choose φ = 1.03. The initial mean is fixed at
(0.5, 0.3, 0.2, 0.1)t , and the initial covariance matrix is diagonal with Σ0,i,i = 0.01,
for i = 1, 2, 3, 4. Initial state covariance values were taken as q11 = 0.01, q22 =
0.1 to reflect relatively low uncertainty in the trend model compared to the sea-
sonal model. All other elements of Q are taken to be 0. We take R = 0.04. Itera-
tive estimation (using the EM algorithm) yielded, after 70 iterations, R = .0086,
φ = 1.035, q11 = 0.0169, q22 = 0.0497, and µ = (.55, .21, .15, .06).

5 Non-linear models

Note that this chapter and the next chapter were not covered in lectures.

Financial time series, e.g. share prices, share price indices, spot interest rates,
currency exchange rates, have led to many specialized models and methods.

There are two main types:

• ARCH models

60
• Stochastic Volatility models

ARCH = autoregressive conditionally heteroscedastic

ARCH models are models analagous to ARMA models, but with AR and MA
components which act on the variances of the process as well as, or instead of, the
means.

Stochastic Volatility
In stochastic volatility models there is some unobserved process known as the
volatility which directly influences the variance of the observed series. That is,
these have some similar characteristics to state space models.

A review of ARCH / Stochastic Volatility models is:

Shephard (1996), which is Chapter 1 of Time Series Models (editors: Cox, Hink-
ley, Barndorff-Nielsen), Chapman and Hall

Usually we consider the daily returns yt given by

xt
yt = 100 log
xt−1

where xt is the price on day t.

Common features of series of this type are:

• there is a symmetric distribution about the mean

• there is little autocorrelation among the values of yt

• there is strong autocorrelation among the values of yt2

• the yt have heavy tailed distributions (i.e. heavier tails than a normal distri-
bution)

• the variance of the process changes substantially over time

Most models of financial time series are of the general structure

yt | zt ∼ N (µt , σt2 )

61
where zt is some set of conditioning random variables (maybe lagged values of
yt ) and µt and σt2 are functions of zt .

An example of an ARCH model is:

yt | zt ∼ N (0, σt2 )

where

zt = (y1 , . . . , yt−1 )
σt2 = α0 + α1 yt−1 2 2
+ · · · + αp yt−p .

Clearly here the variance of yt depends on lagged values of yt .

An example of a stochastic volatility model is

yt | ht ∼ N (0, eht )

where

ht+1 = γ0 + γ1 ht + ηt
ηt ∼ N (0, ση2 )

with the variables ηt being independent as t varies.

The state variable ht is not observed, but could be estimated using the observa-
tions. This situation is similar to that for state space models, but it is the variance
(not the mean) of yt that depends on ht here.

5.1 ARCH models

The simplest ARCH model, ARCH(1), is

yt = σt ǫt , σt2 = α0 + α1 yt−1
2

with ǫt ∼ N (0, 1), and the sequence of ǫt variables being independent. Here
α1 > 0 has to be satisfied to avoid negative variances. Note that the conditional
distribution of Yt given Yt−1 = yt−1 is
2
N (0, α0 + α1 yt−1 ).

62
Hence
E(Yt ) = E[E(Yt |Yt−1 )] = 0.
To calculate the variance, we re-write
yt2 = σt2 ǫ2t
2
α0 + α1 yt−1 = σt2
so that
yt2 − (α0 + α1 yt−1
2
) = σt2 ǫ2t − σt2 ,
or
yt2 = α0 + α1 yt−1
2
+ vt ,
with
vt = σt2 (ǫ2t − 1).
Note that ǫ2t ∼ χ21 . Now
E(vt ) = E[E(vt |Yt−1 )]
= E[σt2 E(ǫ2t − 1)] = 0,
and furthermore
Cov(vt+h , vt ) = E(vt vt+h ) = E[E(vt vt+h |Yt+h−1 )]
= E[vt E(vt vt+h |Yt+h−1 )] = 0.
Thus the error process vt is uncorrelated. If the variance of vt is finite and constant
in time, and if 0 ≤ α1 < 1, then yt2 is a causal AR(1)-process. In particular,
α0
E(Yt2 ) = V ar(Yt ) = .
1 − α1
In order for V ar(Tt2 ) < ∞ we need 3α12 < 1.

As the conditional distribution of Yt given the past is normal and easy to write
down, to estimate parameters in an ARCH(1)-model, usually conditional maxi-
mum likelihood is used. For a wide class of processes, asymptotic normality of
the estimators has been proven. A practical difficulty is that the likelihood sur-
face tends to be flat, so that even for the simplest form ARCH(1), the masimum
likelihood estimates of α0 and α1 can be quite imprecise.

63
5.2 GARCH and other models
The ARCH model can be thought of as an autoregressive model in yt2 . An ob-
vious extension of this idea is to consider adding moving average terms as well.
This generalization of ARCH is called GARCH. The simplest GARCH model is
GARCH(1,1):

yt = σt ǫt , σt2 = α0 + α1 yt−1
2 2
+ β1 σt−1

The sequence is second-order stationary if α1 + β1 < 1.

The simplest estimation scheme for the GARCH(1,1) model uses some initial
sample of observations to obtain a crude estimate of σt2 , and then use maximum
likelihood estimation based on the prediction error decomposition.

A further extension (EGARCH, where E is for exponential) is to model the

log of σt2 as a function of the magnitude, and of the sign, of ǫt−1 .

The R command garch in the tseries package uses the Jarque-Bera test for
normality, based on sample skewness and kurtosis. For a sample x1 , . . . , xn the
test statistic is given by
n 2 (κ − 3)2
s +
6 4
with P
1
n
(xi − x̄)3
s= P 3
1 2 2
n
(x i − x̄)
the sample skewness, and
1
P
n
(xi − x̄)4
κ= P 2
1
n
(xi − x̄)2

the sample kurtosis. For a normal distribution, the expected skewness is 0, and the
expected kurtosis is 3. To test the null hypothesis that the data come from a normal
distribution, the Jarque-Bera statistic is compared to the chi-square distribution
with 2 degrees of freedom.

64
5.3 Stochastic volatility
The basic alternative to ARCH-type models is to allow σt2 to depend not on past
observations but on some unobserved components.
The log-normal stochastic volatility model is

yt = exp(ht /2), ht+1 = γ0 + γ1 ht + ηt

where ǫt ∼ N (0, 1) and ηt ∼ N (0, ση2 ) are independent for all t.

The process ht is strongly stationary if and only if |γ1 | < 1, and if ht is
stationary, then so is yt . Means, and autocorrelations can be computed.
Estimation is not straightforward any more, as log ǫ2t does not have a normal
distribution. Often Monte-Carlo approaches are used: see MCMC lectures!

6 Further topics
6.1 Multivariate time series
Virtually all the above discussion generalizes when a vector is observed at each
point in time. In the time domain, analysis would typically use cross-correlations
and vector autoregressive-moving average models. In the frequency domain, de-
pendencies at different frequencies are analysed separately.

6.2 Threshold models

For example when considering neuron firing in the brain, neurons are stimulated
but will only fire once the stimulus exceeds a threshold. Then threshold models
are used;

Yt+1 = g(Yt ) + ǫt ,

where g(Yt ) is piecewise linear.

6.3 More general nonlinear models

Nonlinear time series are of the form

Yt+1 = g(Yt ) + ǫt , or Yt+1 = g(Yt , ǫt ),

65
where g(y) or g(y, ǫ) is nonlinear.
For nonlinear time series, the amplitude (the periodogram) does not suffice to
estimate the spectral density, and the acf; instead the phase is also needed. That
is, we use vectors of time-delayed observations to describe the evolution of the
system. For example, suppose our time series is

1, 3, 6, 7, 4, 2, 4, 5, 6

and we want to describe it in a 3-dim space, using a delay of 1: then our vectors
are
(1, 3, 6); (3, 6, 7); (6, 7, 4); (7, 4, 2)
and so on, and we can see how these vectors move around in 3-dim space.
The interplay between randomness and nonlinearity generates new effects
such as coexistence of fixed points, periodic points, and chaotic attractors, and
new tools have been developed for these systems. In particular, nonlinear time
series analysis uses many ideas from deterministic chaos theory.

6.4 Chaos
There is a large literature centering around the idea that some simple deterministic
processes generate output that is very like a realization of a stochastic process.
In particular it satisfies sensitivity to the initial conditions. This is a completely
different approach to time series.

Machine Learning Masterclass
100% (11)
Machine Learning Masterclass
108 pages
Chapter 2 - Lecture Notes
No ratings yet
Chapter 2 - Lecture Notes
20 pages
The Power of A Pre-Linked ArchiCAD Template, Part 3
100% (1)
The Power of A Pre-Linked ArchiCAD Template, Part 3
13 pages
Ccleaner Pro 5.57.7182 Crack + License Key (Keygen) Lifetime
No ratings yet
Ccleaner Pro 5.57.7182 Crack + License Key (Keygen) Lifetime
6 pages
UNIT 1 Android Developer Fundamentals (V2) PDF
100% (1)
UNIT 1 Android Developer Fundamentals (V2) PDF
310 pages
Alphatech Case
No ratings yet
Alphatech Case
6 pages
Cheat Sheet
No ratings yet
Cheat Sheet
4 pages
FDMEE Training Final
100% (1)
FDMEE Training Final
77 pages
Engineering Drawing
100% (1)
Engineering Drawing
1 page
Applied Time Series Analysis
No ratings yet
Applied Time Series Analysis
340 pages
Qa High Level Process Map
No ratings yet
Qa High Level Process Map
1 page
TS PartII
100% (1)
TS PartII
50 pages
Module - 3 Notes
No ratings yet
Module - 3 Notes
22 pages
Apspdcl LDC Previous Paper 2012
75% (4)
Apspdcl LDC Previous Paper 2012
6 pages
Time Series With SAS
No ratings yet
Time Series With SAS
364 pages
Ch6 Slides Ed3 Feb2021
No ratings yet
Ch6 Slides Ed3 Feb2021
63 pages
PMP - Process Chart
No ratings yet
PMP - Process Chart
1 page
Detailed Syllabus 2D and 3D Animation
No ratings yet
Detailed Syllabus 2D and 3D Animation
4 pages
ECON 762 Lecture Notes
No ratings yet
ECON 762 Lecture Notes
19 pages
Hypothesis Testing - 2 Populations
100% (1)
Hypothesis Testing - 2 Populations
26 pages
Time Series: H T 2008 P - G R
No ratings yet
Time Series: H T 2008 P - G R
161 pages
Assumption C.5 States That The Values of The Disturbance Term in The Observations in The Sample Are Generated Independently of Each Other
No ratings yet
Assumption C.5 States That The Values of The Disturbance Term in The Observations in The Sample Are Generated Independently of Each Other
129 pages
Univariate Time Series Modelling and Forecasting
100% (2)
Univariate Time Series Modelling and Forecasting
72 pages
Linear Stationary Models
No ratings yet
Linear Stationary Models
16 pages
Time Series
No ratings yet
Time Series
117 pages
Airlive Mfp-101u U
No ratings yet
Airlive Mfp-101u U
113 pages
Characteristics of Time Series
No ratings yet
Characteristics of Time Series
17 pages
Krolzig Macroeconometrics I
No ratings yet
Krolzig Macroeconometrics I
48 pages
Box Jenkins Methodology
100% (1)
Box Jenkins Methodology
29 pages
The Box-Jenkins Methodology For RIMA Models
No ratings yet
The Box-Jenkins Methodology For RIMA Models
172 pages
Forecasting Time Series With Arma and Arima Models The Box-Jenkins Methodology
100% (1)
Forecasting Time Series With Arma and Arima Models The Box-Jenkins Methodology
35 pages
Stochiastic Time Series
No ratings yet
Stochiastic Time Series
49 pages
Class16 PDF
No ratings yet
Class16 PDF
77 pages
AR, MA, ARIMATime Series
No ratings yet
AR, MA, ARIMATime Series
76 pages
Digital Signals and System
No ratings yet
Digital Signals and System
168 pages
Time Series Analysis and Forecasting Using R
No ratings yet
Time Series Analysis and Forecasting Using R
30 pages
Asus U36JC-NYC2 Manual
No ratings yet
Asus U36JC-NYC2 Manual
0 pages
Introduction To Time Series Analysis, Lectures
No ratings yet
Introduction To Time Series Analysis, Lectures
49 pages
KUDU Variable Frequency Drives: For Progressing Cavity Pumps
No ratings yet
KUDU Variable Frequency Drives: For Progressing Cavity Pumps
2 pages
Lecture - 4 Box-Jenkins Model
100% (2)
Lecture - 4 Box-Jenkins Model
16 pages
Station A Rity
No ratings yet
Station A Rity
18 pages
Chapter 2 - Discrete Time Signals and Systems
No ratings yet
Chapter 2 - Discrete Time Signals and Systems
81 pages
Stat 479
No ratings yet
Stat 479
74 pages
Correlation Design
No ratings yet
Correlation Design
5 pages
Box-Jenkins Methodology: 5.2.1 Stage 1: Model Identification
No ratings yet
Box-Jenkins Methodology: 5.2.1 Stage 1: Model Identification
26 pages
Week 1 - Slides White
No ratings yet
Week 1 - Slides White
72 pages
TSA PPT Lesson 03
No ratings yet
TSA PPT Lesson 03
15 pages
Error Correction Model
No ratings yet
Error Correction Model
37 pages
Cc-Me cs1 Did
No ratings yet
Cc-Me cs1 Did
66 pages
Arma Models and The Box Jenkins Methodology
No ratings yet
Arma Models and The Box Jenkins Methodology
25 pages
"Vehicle Information System": Gitarattan International Business School
No ratings yet
"Vehicle Information System": Gitarattan International Business School
6 pages
Project-II Final Report
No ratings yet
Project-II Final Report
76 pages
Document - UL E70524
No ratings yet
Document - UL E70524
4 pages
A-79955e 01 050818
No ratings yet
A-79955e 01 050818
17 pages
Order Management: Table Joins
No ratings yet
Order Management: Table Joins
5 pages
Mr. Ashok Ramchandra Patel: Professional Objective
No ratings yet
Mr. Ashok Ramchandra Patel: Professional Objective
3 pages
Time Series Analysis
No ratings yet
Time Series Analysis
41 pages
Lecture Notes
No ratings yet
Lecture Notes
97 pages
Umpire Club Contact Number Division Comments E-Mail Address
No ratings yet
Umpire Club Contact Number Division Comments E-Mail Address
3 pages
M 3 Unit Test Set B
No ratings yet
M 3 Unit Test Set B
1 page
TS Lecture1 2019
No ratings yet
TS Lecture1 2019
56 pages
Simple Linear Regression PDF
No ratings yet
Simple Linear Regression PDF
40 pages
2.10 Stationary Time Series-1607080624480
No ratings yet
2.10 Stationary Time Series-1607080624480
35 pages
Time Series and Sequential Data
No ratings yet
Time Series and Sequential Data
143 pages
Time Series Analysis
No ratings yet
Time Series Analysis
104 pages
Lesson 2.2.a Stem and Leaf Plot
No ratings yet
Lesson 2.2.a Stem and Leaf Plot
9 pages
STAT0010 Introductory Slides
No ratings yet
STAT0010 Introductory Slides
67 pages
C TSAF Box Jenkins - Method
No ratings yet
C TSAF Box Jenkins - Method
83 pages
Time Series 2022 B
No ratings yet
Time Series 2022 B
57 pages
Math7339TS1TimesSeries Intro
No ratings yet
Math7339TS1TimesSeries Intro
33 pages
Group 9 Time Series Data Analysis (ARIMA)
No ratings yet
Group 9 Time Series Data Analysis (ARIMA)
47 pages
SAP Business One and The Prolification of Technology
No ratings yet
SAP Business One and The Prolification of Technology
23 pages
STA222
No ratings yet
STA222
6 pages
SC Presentation VIT 03272024 Final
No ratings yet
SC Presentation VIT 03272024 Final
98 pages
Time Series
No ratings yet
Time Series
13 pages
Intro To Time Series
No ratings yet
Intro To Time Series
85 pages
Dharmendra
No ratings yet
Dharmendra
2 pages
TS CF
No ratings yet
TS CF
54 pages
Sta 445 1 Stationarity and Non-Stationarity
No ratings yet
Sta 445 1 Stationarity and Non-Stationarity
15 pages
ASAL IT Decisiontree Final
No ratings yet
ASAL IT Decisiontree Final
1 page
Chapter 5 - Time Series Models
No ratings yet
Chapter 5 - Time Series Models
195 pages
Lecture 1
No ratings yet
Lecture 1
45 pages
Lecture 2 Time Series Analysis
No ratings yet
Lecture 2 Time Series Analysis
32 pages
Ch6 Slides Ed3 Feb2024
No ratings yet
Ch6 Slides Ed3 Feb2024
31 pages
OPt Exam Example 2
No ratings yet
OPt Exam Example 2
6 pages
Document 1
No ratings yet
Document 1
6 pages
Dsa Nov 2024
No ratings yet
Dsa Nov 2024
3 pages
Time Series
No ratings yet
Time Series
32 pages
Week4 - 1
No ratings yet
Week4 - 1
18 pages