ST304 Notes Zetai (v2)
Zetai Cen
* The contents of this file are extracted from the course materials, and hence the file only serves as a cookbook for ST304@LSE. See Moodle for more details of the course; all typos or inaccuracies are my responsibility.
Contents

1 Overview and Basics
2 Stationary Time Series Models
3 Non-stationary Time Series Models
4 Estimation: Time-Domain
  4.1 Basic Quantities
    4.1.1 Sample Mean
    4.1.2 Sample ACVS & ACF
  4.2 MME: the Yule-Walker estimators for AR(p)
  4.3 Asymptotic results to test MA(q) & AR(p)
  4.4 LSE: least squares estimators for AR(p) & ARMA(p, q)
  4.5 MLE: MA(q) & AR(p) & ARMA(p, q)
5 Model Selection & Forecasting
  5.1 Techniques for model selection & diagnostics
  5.2 Forecasting: prediction equations (Only for those interested, except for MSPE definition)
  5.3 Forecasting: ARMA(p, q)
1 Overview and Basics
On a suitable filtered probability space $(\mathbb{R}, \mathcal{B}, (\mathcal{B}_t)_t, \mathbb{P})$, we define a time series to be a discrete sequence of random variables X := {Xt , t ∈ Z}, or simply {Xt } if there is no ambiguity.
Definitions
\[
\rho_x(t, s) = \frac{\gamma_x(t, s)}{\sqrt{\gamma_x(t, t)\,\gamma_x(s, s)}}.
\]
Thus, for a weakly stationary, discrete and equally spaced series, we denote the autocovariance sequence (ACVS) and the ACF by:
\[
s_\tau := \gamma_x(t, t+\tau), \qquad
\rho_\tau := \frac{\gamma_x(t, t+\tau)}{\sqrt{\gamma_x(t, t)\,\gamma_x(t+\tau, t+\tau)}} = \frac{s_\tau}{s_0}.
\]
6. A Gaussian process is a process X such that, for any t1 , . . . , tn , the vector (Xt1 , . . . , Xtn ) has a multivariate normal distribution with finite means and covariances. Note that a weakly stationary Gaussian process is also strongly stationary.
Xt = Tt + St + Mt
The overall analysis can be dichotomised into the time domain and the frequency domain. The latter is preferred when periodicity is less apparent, and it is not considered in this course. After estimation, we discuss forecasting. Thus, in general we have:
\[
\underbrace{\text{modelling}}_{\text{Sec.\,2 \& Sec.\,3}} \;\to\; \underbrace{\text{estimation}}_{\text{Sec.\,4}} \;\to\; \underbrace{\text{model selection} \to \text{forecasting}}_{\text{Sec.\,5}}
\]
2 Stationary Time Series Models
A filtered series is a linear combination of time series variables. We start from white
noise, the building block of important models.
This could be written as (1) by assuming E[Xt ] = µ; taking expectations on both sides, we have α = µ(1 − ϕ1 − · · · − ϕp ).
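As a quick numerical check of this relation, here is a minimal simulation sketch (not from the notes), assuming an AR(1) with intercept, Xt = α + ϕ1 Xt−1 + εt , so that µ = α/(1 − ϕ1 ):

import numpy as np

# AR(1) with intercept: the sample mean should be close to mu = alpha / (1 - phi)
rng = np.random.default_rng(0)
phi, alpha, sigma, T = 0.6, 1.0, 1.0, 100_000
x = np.zeros(T)
eps = rng.normal(0.0, sigma, size=T)
for t in range(1, T):
    x[t] = alpha + phi * x[t - 1] + eps[t]

print(x.mean(), alpha / (1 - phi))   # both should be close to mu = 2.5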
2.5 Notes on ARMA
We define the backward shift operator B such that:
\[
B X_t = X_{t-1}, \qquad B^k X_t = X_{t-k}.
\]
In (4), we define:
- Φ(B) as the autoregressive operator
- Θ(B) as the moving average operator
Hence also define:
- Φ(z) as the AR characteristic polynomial
- Θ(z) as the MA characteristic polynomial
Thus, ARMA is written as Φ(B)Xt = Θ(B)εt .
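For instance (a standard special case, not spelled out in the notes), an ARMA(1, 1) model Xt = ϕ1 Xt−1 + εt + θ1 εt−1 reads in operator form as
\[
(1 - \phi_1 B)X_t = (1 + \theta_1 B)\varepsilon_t,
\qquad \Phi(z) = 1 - \phi_1 z, \quad \Theta(z) = 1 + \theta_1 z.
\]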
The above is an extended MA(∞) process if {εt } is weakly stationary with ACVS sετ . Notice the process is not necessarily weakly stationary. If the coefficients are absolutely convergent / absolutely summable, i.e. $\sum_j |\psi_j| < \infty$, then the process is weakly stationary with ACVS:
\[
s^x_\tau = \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \, s^{\varepsilon}_{\tau-(j-k)}.
\]
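As a quick check of this formula (a standard special case, not worked in the notes): if {εt } is white noise with variance σ², only the terms with k = j − τ survive, so
\[
s^x_\tau = \sigma^2 \sum_{j=-\infty}^{\infty} \psi_j \psi_{j-\tau};
\]
for an MA(1) with ψ0 = 1, ψ1 = θ and all other ψj = 0, this gives s0 = σ²(1 + θ²), s±1 = σ²θ and sτ = 0 for |τ | ≥ 2.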
2.5.2 Invertibility of ARMA
For an ARMA process:
We say the process is invertible if there is a sequence of constants {πj } such that $\sum_{j=0}^{\infty} |\pi_j| < \infty$ and:
\[
\varepsilon_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}.
\]
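As a standard illustration (not worked in the notes): an MA(1) process Xt = εt + θεt−1 is invertible precisely when |θ| < 1, since repeated substitution of εt−1 = Xt−1 − θεt−2 gives
\[
\varepsilon_t = X_t - \theta\varepsilon_{t-1} = \sum_{j=0}^{\infty}(-\theta)^j X_{t-j},
\]
so πj = (−θ)^j and $\sum_j |\pi_j| < \infty$ if and only if |θ| < 1.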
2.5.3 Summary
For ARMA(p, q):
stationarity −→ absolute summability, but this implication is not true for other processes.
2.6 ARCH(p), Autoregressive Conditional Heteroskedastic model of order p

An ARCH(p) model, denoted Xt ∼ ARCH(p), is defined as:
\[
X_t = \sigma_t \varepsilon_t, \qquad
\sigma_t^2 = \alpha_0 + \alpha_1 X_{t-1}^2 + \cdots + \alpha_p X_{t-p}^2 \tag{5}
\]
where {εt } is i.i.d. with zero mean and unit variance, denoted εt ∼ IID(0, 1). Furthermore, α0 , αp > 0 and the other αj ≥ 0. Usually we also assume $\sum_{i=1}^{p} \alpha_i < 1$ to ensure stationarity; see property 1 below.
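A minimal simulation sketch of an ARCH(1) (not from the notes), assuming Gaussian innovations and using the standard fact that the unconditional variance is α0 /(1 − α1 ) (presumably property 1 below):

import numpy as np

rng = np.random.default_rng(1)
alpha0, alpha1, T = 0.2, 0.5, 200_000
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    sigma2 = alpha0 + alpha1 * x[t - 1] ** 2          # conditional variance
    x[t] = np.sqrt(sigma2) * eps[t]

print(x.var(), alpha0 / (1 - alpha1))                 # both should be close to 0.4
print(np.corrcoef(x[1:], x[:-1])[0, 1])               # X_t is serially uncorrelated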
Properties
2.7 GARCH(p, q), Generalised ARCH model of order p, q
A GARCH(p, q) model, denoted Xt ∼ GARCH(p, q), is defined as:
\[
X_t = \sigma_t \varepsilon_t, \qquad
\sigma_t^2 = \alpha_0 + \alpha_1 X_{t-1}^2 + \cdots + \alpha_p X_{t-p}^2 + \beta_1 \sigma_{t-1}^2 + \cdots + \beta_q \sigma_{t-q}^2 \tag{6}
\]
where εt ∼ IID(0, 1), α0 , αp , βq > 0, and the other αj , βj ≥ 0. We also assume
\[
\sum_{i=1}^{\max(p,q)} (\alpha_i + \beta_i) := \sum_{i=1}^{p} \alpha_i + \sum_{i=1}^{q} \beta_i < 1.
\]
Properties
3. For GARCH(1, 1), if we adopt Gaussian innovations {εt }, we can check that the kurtosis of Xt is larger than 3, so the process cannot be Gaussian (a quick numerical check is sketched below).
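A minimal sketch of that check (not from the notes), assuming Gaussian innovations:

import numpy as np

rng = np.random.default_rng(2)
alpha0, alpha1, beta1, T = 0.1, 0.15, 0.8, 300_000
x = np.zeros(T)
sigma2 = alpha0 / (1 - alpha1 - beta1)                # start at the unconditional variance
eps = rng.standard_normal(T)
for t in range(1, T):
    sigma2 = alpha0 + alpha1 * x[t - 1] ** 2 + beta1 * sigma2
    x[t] = np.sqrt(sigma2) * eps[t]

kurtosis = np.mean(x ** 4) / np.mean(x ** 2) ** 2
print(kurtosis)                                        # noticeably larger than 3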
3 Non-stationary Time Series Models
3.1 Trend & Seasonality
The d-th order difference on a process {Xt } is defined recursively as:
\[
\Delta^d X_t := \Delta(\Delta^{d-1} X_t), \qquad \Delta^1 X_t = X_t - X_{t-1}
\]
\[
\Delta_s X_t := X_t - X_{t-s} = (1 - B^s)X_t
\]
\[
\Delta^d X_t = y_t
\]
Thus, if the above yt ∼ ARMA(p, q), we say the process Xt ∼ ARIMA(p, d, q).
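A standard example (not worked in the notes): for a random walk Xt = Xt−1 + εt with εt ∼ WN(0, σ²),
\[
\Delta^1 X_t = X_t - X_{t-1} = \varepsilon_t \sim \mathrm{ARMA}(0, 0),
\]
so Xt ∼ ARIMA(0, 1, 0), even though Xt itself is not stationary.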
For a pure seasonal ARMA(P, Q)s model:
\[
\Phi_P(B^s) X_t = \Theta_Q(B^s)\varepsilon_t
\]
\[
\Phi_P(B^s) := 1 - \phi_1 B^s - \phi_2 B^{2s} - \cdots - \phi_P B^{Ps}, \qquad
\Theta_Q(B^s) := 1 + \theta_1 B^s + \theta_2 B^{2s} + \cdots + \theta_Q B^{Qs}
\]
We extend the pure seasonal ARMA to the multiplicative SARMA model. We denote Xt ∼ ARMA(p, q) × (P, Q)s if the model has the form:
\[
\Phi_P(B^s)\Phi(B) X_t = \Theta_Q(B^s)\Theta(B)\varepsilon_t
\]
where εt ∼ WN(0, σ²), and the other operators are defined in the same way as in the pure seasonal ARMA and ARMA models.
Allowing additionally for (seasonal) differencing and a constant, the multiplicative seasonal ARIMA model takes the form:
\[
\Phi_P(B^s)\Phi(B)\,\Delta_s^D \Delta^d X_t = \alpha + \Theta_Q(B^s)\Theta(B)\varepsilon_t
\]
where εt ∼ WN(0, σ²).
For a pure seasonal AR or MA model, we have properties parallel to the ACF and PACF properties for MA and AR models: the ACF cuts off at lag Qs for an MA(Q)s model, and the PACF cuts off at lag P s for an AR(P )s model.
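For example (a standard special case, not worked in the notes): the pure seasonal MA(1)12 process Xt = εt + Θ1 εt−12 has
\[
\rho_{12} = \frac{\Theta_1}{1 + \Theta_1^2}, \qquad \rho_\tau = 0 \ \text{for } \tau \neq 0, 12,
\]
so its ACF cuts off after lag Qs = 12, as claimed.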
4 Estimation: Time-Domain
4.1 Basic Quantities
For later estimation, we first define and discuss some basic quantities. We assume the
process {Xt } is weakly stationary, and we want to estimate the mean µ, ACVS sτ ,
ACF ρτ .
4.1.1 Sample Mean

We call a matrix a Toeplitz matrix if the entries are the same along each diagonal. Specifically, here we define:
\[
\begin{pmatrix}
s_0 & s_1 & \cdots & s_{T-1} \\
s_1 & s_0 & \cdots & s_{T-2} \\
\vdots & \vdots & \ddots & \vdots \\
s_{T-1} & s_{T-2} & \cdots & s_0
\end{pmatrix}
\]
Thus, the summation in (7) simply sums over all the entries of this matrix, so we can re-write it by summing along the diagonals:
\[
\operatorname{var}(\bar{X}) = \frac{1}{T^2} \sum_{\tau=-(T-1)}^{T-1} (T - |\tau|)\, s_\tau
= \frac{1}{T} \sum_{\tau=-T}^{T} \Big(1 - \frac{|\tau|}{T}\Big) s_\tau \tag{8}
\]
If {sτ } is absolutely summable (i.e. $\sum_\tau |s_\tau| < \infty$, see 2.5.1), then as T → ∞ we can see that var(X̄) → 0:
\[
E\big[(\bar{X} - \mu)^2\big] \to 0,
\]
which essentially says that X̄ converges to µ in mean square, and hence in probability.
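A minimal Monte Carlo check of (8) (not from the notes), assuming an AR(1) process Xt = ϕXt−1 + εt , whose true ACVS is sτ = σ²ϕ^|τ| /(1 − ϕ²):

import numpy as np

rng = np.random.default_rng(3)
phi, sigma, T, reps = 0.5, 1.0, 200, 10_000

# theoretical var(X_bar) from (8)
taus = np.arange(-(T - 1), T)
s = sigma**2 * phi**np.abs(taus) / (1 - phi**2)
var_theory = np.sum((T - np.abs(taus)) * s) / T**2

# empirical variance of the sample mean over many replications
eps = rng.normal(0.0, sigma, size=(reps, T))
x = np.zeros((reps, T))
x[:, 0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), size=reps)   # start in stationarity
for t in range(1, T):
    x[:, t] = phi * x[:, t - 1] + eps[:, t]

print(var_theory, x.mean(axis=1).var())                              # should be close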
4.1.2 Sample ACVS & ACF
Define the unbiased estimator for ACVS, assuming µ is known:
\[
s^{(u)}_\tau := \frac{1}{T - |\tau|} \sum_{t=1}^{T-|\tau|} (X_t - \mu)(X_{t+|\tau|} - \mu) \tag{9}
\]
However, we almost never know µ, so (9) is adjusted; the new estimator is biased:
\[
s^{(\star)}_\tau := \frac{1}{T - |\tau|} \sum_{t=1}^{T-|\tau|} (X_t - \bar{X})(X_{t+|\tau|} - \bar{X}) \tag{10}
\]
We propose another biased estimator, which is often preferred:
\[
\hat{s}_\tau := \frac{1}{T} \sum_{t=1}^{T-|\tau|} (X_t - \bar{X})(X_{t+|\tau|} - \bar{X}) \tag{11}
\]
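A minimal implementation sketch of (11) (not from the notes): the biased sample ACVS with the 1/T normalisation, and the corresponding sample ACF:

import numpy as np

def sample_acvs(x, tau):
    """hat{s}_tau as in (11): divide by T rather than by T - |tau|."""
    x = np.asarray(x, dtype=float)
    T, k = len(x), abs(tau)
    xbar = x.mean()
    return np.sum((x[:T - k] - xbar) * (x[k:] - xbar)) / T

def sample_acf(x, max_lag):
    s0 = sample_acvs(x, 0)
    return np.array([sample_acvs(x, k) / s0 for k in range(max_lag + 1)])

# usage: for i.i.d. noise the sample ACF at non-zero lags should be near zero
x = np.random.default_rng(4).standard_normal(500)
print(sample_acf(x, 5))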
4.2 MME: the Yule-Walker estimators for AR(p)
The Yule-Walker estimator is a type of method of moments estimator (MME); in the discussion below, we assume {Xt } is weakly stationary.
For an AR(p) process in the form (1) and τ > 0, we want to estimate a collection of p + 1 parameters: $\{\sigma_\varepsilon^2, \phi_1, \ldots, \phi_p\}$. Taking the covariance with Xt−τ on both sides, we have (13); taking the covariance with Xt instead, we have (14):
\[
s_\tau = \phi_1 s_{\tau-1} + \cdots + \phi_p s_{\tau-p}, \tag{13}
\]
\[
s_0 = \phi_1 s_1 + \cdots + \phi_p s_p + \sigma_\varepsilon^2. \tag{14}
\]
For our p + 1 unknown parameters, we need p + 1 equations. Thus, taking τ = 1, . . . , p in (13), we have the p + 1 Yule-Walker equations, in matrix form:
\[
\gamma_p = \Gamma_p \phi, \qquad \sigma_\varepsilon^2 = s_0 - \phi^T \gamma_p \tag{15}
\]
where γp := (s1 , . . . , sp )T and Γp is the p × p Toeplitz matrix with (i, j) entry s|i−j| .
From (15), we replace all ACVS values by ŝτ in (11) and obtain the estimated parameters, which guarantee a stationary AR(p):
\[
\hat{\phi} = \hat{\Gamma}_p^{-1} \hat{\gamma}_p, \qquad \hat{\sigma}_\varepsilon^2 = \hat{s}_0 - \hat{\phi}^T \hat{\gamma}_p \tag{16}
\]
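A minimal sketch of the Yule-Walker estimator (16) (not from the notes); scipy.linalg.toeplitz is used to build Γ̂p from the biased sample ACVS ŝτ in (11):

import numpy as np
from scipy.linalg import toeplitz

def yule_walker(x, p):
    x = np.asarray(x, dtype=float)
    T, xbar = len(x), x.mean()
    s = np.array([np.sum((x[:T - k] - xbar) * (x[k:] - xbar)) / T for k in range(p + 1)])
    Gamma = toeplitz(s[:p])               # p x p matrix with entries hat{s}_{|i-j|}
    gamma = s[1:p + 1]                    # (hat{s}_1, ..., hat{s}_p)
    phi_hat = np.linalg.solve(Gamma, gamma)
    sigma2_hat = s[0] - phi_hat @ gamma
    return phi_hat, sigma2_hat

# usage: fit an AR(2) to simulated data with phi = (0.5, -0.3), sigma^2 = 1
rng = np.random.default_rng(5)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(yule_walker(x, 2))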
4.3 Asymptotic results to test MA(q) & AR(p)
First, MA(q) could be estimated by the method of moments (MME); try it.
Therefore, $\big(-\tfrac{1.96}{\sqrt{T}},\, +\tfrac{1.96}{\sqrt{T}}\big)$ is a 95% confidence interval on the sample ACF for the process being a white noise. For common MA(q) processes, we often use the sample ACF to replace the real ACF in (17).
If the true order of the AR process is p, while we estimate an AR(h) process with h > p, then by (19):
\[
\sqrt{T}\,\hat{\phi}_h \xrightarrow{D} N(0, 1) \tag{20}
\]
Note the result does NOT hold for other k with p < k < h. Now, by the note in Def. 7 on PACF, we have the PACF ϕhh = ϕh = 0 for the case in (20). Thus, for a well-defined sample PACF:
\[
\sqrt{T}\,\hat{\phi}_{hh} = \sqrt{T}\,\hat{\phi}_h \xrightarrow{D} N(0, 1) \quad \forall h > p \tag{21}
\]
Therefore, we have $\big(-\tfrac{1.96}{\sqrt{T}},\, +\tfrac{1.96}{\sqrt{T}}\big)$ as the 95% confidence bounds on the sample PACF for the AR process having order ≤ p.
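A minimal sketch of the sample PACF (not from the notes): ϕ̂hh is taken as the last Yule-Walker coefficient of a fitted AR(h), for h = 1, 2, . . ., and compared with the ±1.96/√T bounds from (21):

import numpy as np
from scipy.linalg import toeplitz

def sample_pacf(x, max_lag):
    x = np.asarray(x, dtype=float)
    T, xbar = len(x), x.mean()
    s = np.array([np.sum((x[:T - k] - xbar) * (x[k:] - xbar)) / T for k in range(max_lag + 1)])
    pacf = []
    for h in range(1, max_lag + 1):
        phi = np.linalg.solve(toeplitz(s[:h]), s[1:h + 1])   # Yule-Walker fit of AR(h)
        pacf.append(phi[-1])                                  # hat{phi}_{hh}
    return np.array(pacf)

rng = np.random.default_rng(6)
x = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()             # an AR(1), so p = 1

print(sample_pacf(x, 6))            # only the lag-1 value should clearly exceed the bound
print(1.96 / np.sqrt(len(x)))       # approximate 95% bound from (21)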
4.4 LSE: least squares estimators for AR(p) & ARMA(p, q)
AR(p)
We consider the least squares estimators (LSE) for AR(p) processes with zero mean. Given data points {X1 , . . . , XT }, we write the equations in matrix form:
\[
x_F = F\phi + \varepsilon_F \tag{22}
\]
The forward least squares estimator minimises the sum of squared errors:
\[
\hat{\phi}_{OLS} := \operatorname*{argmin}_{\phi} \|x_F - F\phi\|^2 = (F^T F)^{-1} F^T x_F \tag{23}
\]
Hence:
\[
\hat{\sigma}_{OLS}^2 := \frac{\|x_F - F\hat{\phi}_{OLS}\|^2}{(T - p) - p} \tag{24}
\]
However, the above estimators do not ensure a weakly stationary process, which is problematic for forecasting.
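A minimal sketch of the forward LSE for a zero-mean AR(p) (not from the notes): regress Xt on (Xt−1 , . . . , Xt−p ) for t = p + 1, . . . , T by ordinary least squares:

import numpy as np

def ar_lse(x, p):
    x = np.asarray(x, dtype=float)
    T = len(x)
    F = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])   # lagged design matrix
    xF = x[p:]                                                       # responses X_{p+1}, ..., X_T
    phi_hat, _, _, _ = np.linalg.lstsq(F, xF, rcond=None)
    resid = xF - F @ phi_hat
    sigma2_hat = np.sum(resid ** 2) / ((T - p) - p)                  # as in (24)
    return phi_hat, sigma2_hat

# usage: fit an AR(1) with phi = 0.7, sigma^2 = 1
rng = np.random.default_rng(7)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()
print(ar_lse(x, 1))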
ARMA(p, q)
We now consider the LSE for general ARMA(p, q) processes. We only mention the idea, as the details are left to the computer. For a non-zero mean ARMA(p, q):
\[
X_t = c + \sum_{j=1}^{p} b_j X_{t-j} + \varepsilon_t + \sum_{i=1}^{q} a_i \varepsilon_{t-i} \tag{25}
\]
With εt (c, a, b) denoting the value of εt given (c, a, b), where a := (a1 , . . . , aq )T and b := (b1 , . . . , bp )T :
\[
(\hat{c}, \hat{a}, \hat{b}) := \operatorname*{argmin}_{c,a,b} \sum_{t=p+1}^{T} \big[\varepsilon_t(c, a, b)\big]^2
= \operatorname*{argmin}_{c,a,b} \sum_{t=p+1}^{T} \Big[X_t - c - \sum_{j=1}^{p} b_j X_{t-j} - \sum_{i=1}^{q} a_i \varepsilon_{t-i}(c, a, b)\Big]^2 \tag{26}
\]
It is a nonlinear optimisation, but we can initialise at (c0 , a0 , b0 ) and iterate until the estimators converge (see the sketch below).
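A minimal sketch of this procedure for an ARMA(1, 1) (not from the notes): the residuals εt (c, a, b) are built recursively (with the pre-sample residual set to 0), and the sum of squares (26) is minimised numerically with scipy.optimize.minimize:

import numpy as np
from scipy.optimize import minimize

def arma11_sse(params, x):
    c, b1, a1 = params
    eps = np.zeros(len(x))
    for t in range(1, len(x)):
        eps[t] = x[t] - c - b1 * x[t - 1] - a1 * eps[t - 1]
    return np.sum(eps[1:] ** 2)

# simulate an ARMA(1,1) with c = 1, b1 = 0.6, a1 = 0.3
rng = np.random.default_rng(8)
T, e = 3000, rng.standard_normal(3000)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 1.0 + 0.6 * x[t - 1] + e[t] + 0.3 * e[t - 1]

res = minimize(arma11_sse, x0=np.array([0.0, 0.1, 0.1]), args=(x,), method="Nelder-Mead")
print(res.x)                                           # roughly (1.0, 0.6, 0.3)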
4.5 MLE: MA(q) & AR(p) & ARMA(p, q)
We consider maximum likelihood estimators (MLE) for the three types of mod-
els. The general idea is to write MA or ARMA models as AR models by using the
backward shift operator, and we use MLE on the AR model.
Only the AR(1) example is discussed here; other cases are similar. Assume an AR(1) model with zero mean, with data (X0 , X1 , . . . , Xn ), where the εt are i.i.d. N(0, σ²). Thus we can assume:
\[
X_0 \sim N\Big(0, \frac{\sigma^2}{1 - \phi^2}\Big) \tag{27}
\]
Together with the fact that the conditional joint likelihood f (X1 , . . . , Xn |X0 ) is Gaussian, we obtain the log-likelihood function:
\[
l(\phi, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) + \frac{1}{2}\log(1 - \phi^2) - \frac{1}{2\sigma^2} S(\phi) \tag{28}
\]
where we define S(ϕ):
\[
S(\phi) := \sum_{i=1}^{n} (X_i - \phi X_{i-1})^2 + (1 - \phi^2)X_0^2 =: S^\star(\phi) + (1 - \phi^2)X_0^2 \tag{29}
\]
The first method is nonlinear and requires numerical routines, but all three are asymptotically equivalent, since their asymptotic distribution coincides with the asymptotic result for the Yule-Walker estimators.
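A minimal numerical sketch (not from the notes): maximise the exact AR(1) log-likelihood (28)-(29) over (ϕ, σ²) with scipy.optimize.minimize:

import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x):
    phi, log_sigma2 = params                    # parameterise sigma^2 on the log scale
    if abs(phi) >= 1:
        return np.inf
    sigma2 = np.exp(log_sigma2)
    n = len(x) - 1                              # data are (X_0, X_1, ..., X_n)
    S = np.sum((x[1:] - phi * x[:-1]) ** 2) + (1 - phi ** 2) * x[0] ** 2
    l = (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
         + 0.5 * np.log(1 - phi ** 2) - S / (2 * sigma2))
    return -l

# simulate an AR(1) with phi = 0.6, sigma^2 = 1, started in stationarity
rng = np.random.default_rng(9)
x = np.zeros(2000)
x[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - 0.6 ** 2))
for t in range(1, 2000):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(x,), method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))               # roughly 0.6 and 1.0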
5 Model Selection & Forecasting
5.1 Techniques for model selection & diagnostics
For model selection, we can consider Akaike's Information Criterion (AIC), defined as:
\[
\mathrm{AIC} := -2(\text{Maximised Log-likelihood}) + 2m \tag{30}
\]
where m is the number of estimated parameters. We hope to minimise the AIC, and the 2m term penalises large models. A more advanced version based on the AIC is the BIC (or the AICC, which is not discussed here), defined as:
\[
\mathrm{BIC} := -2(\text{Maximised Log-likelihood}) + m\log(n) \tag{31}
\]
where n is the sample size (and hence equals T in our time series setting).
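A minimal helper (not from the notes) computing (30) and (31) from a maximised log-likelihood, the number of estimated parameters m and the sample size n:

import numpy as np

def aic(loglik, m):
    return -2.0 * loglik + 2.0 * m

def bic(loglik, m, n):
    return -2.0 * loglik + m * np.log(n)

# usage: e.g. with the AR(1) MLE sketch above, loglik = -res.fun, m = 2, n = len(x)
print(aic(-1500.0, 2), bic(-1500.0, 2, 1000))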
Next, we might want to test whether a process, such as the residuals, is white noise. At large lags we expect to observe negligible ACF; a formal test uses the Ljung-Box-Pierce statistic:
\[
Q^\star := T(T + 2)\sum_{j=1}^{k} \frac{\hat{\rho}_j^2}{T - j} \tag{32}
\]
where T is the sample size, ρ̂j is the sample ACF, and k is pre-chosen and represents the number of lags we want to look at. As T → ∞, Q⋆ approximately follows a $\chi^2_{k-m}$ distribution, where m is the number of estimated parameters.
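A minimal sketch of the Ljung-Box-Pierce test (32) (not from the notes), using scipy.stats.chi2 for the approximate p-value:

import numpy as np
from scipy.stats import chi2

def ljung_box(x, k, m=0):
    x = np.asarray(x, dtype=float)
    T, xbar = len(x), x.mean()
    s0 = np.sum((x - xbar) ** 2) / T
    rho = np.array([np.sum((x[:T - j] - xbar) * (x[j:] - xbar)) / T / s0
                    for j in range(1, k + 1)])                    # sample ACF hat{rho}_j
    Q = T * (T + 2) * np.sum(rho ** 2 / (T - np.arange(1, k + 1)))
    pval = chi2.sf(Q, df=k - m)                                   # approximate chi^2_{k-m} tail
    return Q, pval

# usage: white noise should give a large p-value
x = np.random.default_rng(10).standard_normal(1000)
print(ljung_box(x, k=10))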
5.2 Forecasting: prediction equations (Only for those interested, except for MSPE definition)

We write
\[
E_t[\,\cdot\,] := E[\,\cdot \mid \mathcal{F}_t],
\]
where Ft represents the information of Xs for all s ≤ t.
Theorem 6.2.1. Xt (l) = Et [Xt+l ] minimises the MSPE. The proof is omitted, but please try it.
Notice we can consider the de-meaned process {Xt − µ} and hence α0 in (34) could
be ignored (think about why?). In practice, µ could be estimated by the sample mean
anyway. Next, (35) implies, for j = 0, 1, . . . , t:
\[
s_{t+l-j} = \alpha_1 s_{j-1} + \alpha_2 s_{j-2} + \cdots + \alpha_t s_{j-t}. \tag{36}
\]
In matrix form, we have:
\[
\gamma_l = \Gamma_t \alpha \tag{37}
\]
By replacing all the sj 's by their sample estimates ŝτ in (11), we can obtain the coefficient estimates:
\[
\hat{\alpha} = \hat{\Gamma}_t^{-1} \hat{\gamma}_l.
\]
Assuming the process is invertible, we can also express Xt (l) directly using the Xt 's; the details are not discussed here.
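A minimal sketch of the prediction equations (37) (not from the notes): Γ̂t is built from the biased sample ACVS ŝτ , the system is solved for α̂, and the forecast is formed from the de-meaned observations. Here the convention (an assumption of this sketch) is that α̂k multiplies the k-th most recent observation, and ŝk is taken as 0 for lags beyond the sample:

import numpy as np
from scipy.linalg import toeplitz

def forecast(x, l):
    """Forecast X_{T+l} from X_1, ..., X_T via hat{alpha} = hat{Gamma}_T^{-1} hat{gamma}_l."""
    x = np.asarray(x, dtype=float)
    T, xbar = len(x), x.mean()
    s = np.zeros(T + l)                                  # hat{s}_k = 0 beyond the sample
    for k in range(T):
        s[k] = np.sum((x[:T - k] - xbar) * (x[k:] - xbar)) / T
    Gamma = toeplitz(s[:T])                              # entries hat{s}_{|i-j|}
    gamma_l = s[l:l + T]                                 # (hat{s}_l, ..., hat{s}_{l+T-1})
    alpha = np.linalg.solve(Gamma, gamma_l)
    return xbar + alpha @ (x[::-1] - xbar)               # most recent observation first

# usage: for an AR(1) the one-step forecast should be close to phi * X_T
rng = np.random.default_rng(11)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()
print(forecast(x, 1), 0.8 * x[-1])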
Version edited using LaTeX.