Time Series
PhD LBS
Luca Gambetti
UAB, Barcelona GSE
February-March 2014
1
Contacts
Description
The main objective of the course is to provide the students with the knowledge of a comprehensive
set of tools necessary for empirical research with time series data.
2
Contents
• Introduction
• ARMA models (2 sessions)
– Representation
– Estimation
– Forecasting
• VAR models (1 session)
– Representation
– Estimation
– Forecasting
• Structural VAR (2 sessions)
– Recursive identification
– Long-run identification
– Sign identification
– Applications
3
References
Grades
Take-Home Exam.
Econometric Software
MATLAB.
4
1. INTRODUCTION
5
1 What does a macroeconometrician do?
”Macroeconometricians do four things: describe and summarize macroeconomic data, make macroeconomic forecasts, quantify what we do or do not know about the true structure of the macroeconomy, and advise (and sometimes become) macroeconomic policymakers.” Stock and Watson, JEP, 2001.
Except advising and becoming policymakers, this is what we are going to do in this course.
6
2 Preliminaries
• The lag operator L maps a sequence {xt} into a sequence {yt} such that yt = Lxt = xt−1, for all t.
• If we apply L to a constant c, Lc = c.
• Inversion: L−1 is the inverse of L, L−1xt = xt+1. It is such that L−1(L)xt = xt.
• The lag operator and multiplication operator are commutative L(βxt) = βLxt (β a constant).
• The lag operator is distributive over the addition operator L(xt + wt) = Lxt + Lwt
7
• Absolutely summable (one-sided) filters. Let {αj}_{j=0}^∞ be a sequence of absolutely summable coefficients, i.e. Σ_{j=0}^∞ |αj| < ∞. We define the filter
α(L) = α0 + α1L + α2L^2 + ...
which gives
α(L)xt = α0xt + α1xt−1 + α2xt−2 + ...
• α(0) = α0.
• α(1) = α0 + α1 + α2 + ....
8
• α(L)xt + β(L)xt = (α(L) + β(L))xt.
• α(L)[β(L)xt] = β(L)[α(L)xt].
• Lag polynomials can also be inverted. For a polynomial φ(L), we are looking for the values of
the coefficients αi of φ(L)−1 = α0 + α1L + α2L2 + ... such that φ(L)−1φ(L) = 1.
Example: p = 1. Let φ(L) = (1 − φL) with |φ| < 1. To find the inverse write
(1 − φL)(α0 + α1L + α2L2 + ...) = 1
note that all the coefficients of the non-zero powers of L must be equal to zero. This gives
α0 =1
−φ + α1 =0 ⇒ α1 = φ
−φα1 + α2 =0 ⇒ α2 = φ2
−φα2 + α3 =0 ⇒ α3 = φ3
9
and so on. In general αk = φ^k, so (1 − φL)^{-1} = Σ_{j=0}^∞ φ^j L^j, provided that |φ| < 1.
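As a quick numerical illustration, a minimal MATLAB sketch (phi = 0.6 and the truncation K are arbitrary assumed values) checks that the truncated inverse filter applied to (1 − φL) is approximately the identity:

% Minimal sketch: verify the truncated inverse of (1 - phi*L), assuming |phi| < 1
phi   = 0.6;                     % assumed coefficient
K     = 50;                      % truncation of the infinite filter
alpha = phi.^(0:K);              % coefficients of (1 - phi*L)^(-1)
check = conv([1 -phi], alpha);   % convolve the two filters
disp(check(1:5))                 % approximately [1 0 0 0 0]; truncation error only at the end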
10
Let T be an index which has an ordering relation defined on it. Thus, if t1, t2 ∈ T , then t1 ≤ t2 or
t1 > t2. Usually, T = R or T = Z, the sets of real or integer numbers, respectively.
Example 1 Let the index set be T = {1, 2, 3} and let the space of outcomes Ω be the possible outcomes associated with rolling a die:
Ω = {1, 2, 3, 4, 5, 6}
Define Xt = t + [value on the die]^2 · t. Then for a particular ω, say ω = 3, the realization or path would be (10, 20, 30), and this stochastic process has 6 different possible realizations (one for each value of the die).
11
2.2 Stationarity
This course will mainly focus on stationary processes. There are two definitions of stationarity: strict
and weak (or second order).
Strict Stationarity The time series Xt is said to be strictly stationary if the joint distributions of
(Xt1 , ....Xtk )0 and (Xt1+h, ....Xtk +h)0 are the same for all positive integers for all t1, ..., tk , h ∈ Z.
Interpretation: This means that the graphs over two equal-length time intervals of a realization
of the time series should exhibit similar statistical characteristics.
In order to define the concept of weak stationarity we first need to introduce the concept of au-
tocovariance function. This function is a measure of dependence between elements of the sequence
Xt .
The autocovariance function If Xt is a process such that V ar(Xt) < ∞ for each t ∈ T , then
the autocovariance function γt(r, s) of Xt is defined by
γ(r, s) = Cov(Xr , Xs) = E[(Xr − E(Xr ))(Xs − E(Xs))], (1)
16
Weak Stationarity The time series Xt is said to be weakly stationary if
(i) E|Xt|2 < ∞ for all t.
(ii) E(Xt) = µ for all t.
(iii) γ(r, s) = γ(r + t, s + t) for all t, r, s.
Notice that for a covariance stationary process γ(r, s) = γ(r − s, 0) = γ(h).
In summary: weak stationarity means that the mean, the variance are finite and constant and that
the autocovariance function only depends on h, the distance between observations.
Autocorrelation function, ACF For a stationary process Xt, the autocorrelation function at lag
h is defined as
ρ(h) = γ(h)/γ(0) = Corr(Xt+h, Xt) for all t, h ∈ Z.
17
Partial correlation function (PACF). The partial autocorrelation α(.) of a stationary time series
is defined by
α(1) = Corr(X2, X1) = ρ(1)
and
α(k) = Corr(Xk+1 − P(Xk+1|1, X2, ..., Xk), X1 − P(X1|1, X2, ..., Xk)) for k ≥ 2
An equivalent definition is that the kth partial autocorrelation α(k) is the last coefficient in the linear projection of Yt on its k most recent values,
Yt = α_1^{(k)}Yt−1 + α_2^{(k)}Yt−2 + ... + α_k^{(k)}Yt−k + εt,
so that α(k) = α_k^{(k)}.
Strict stationarity implies weak stationarity, provided the first and second moments of the vari-
ables exist, but the converse of this statement is not true in general.
There is one important case where both concepts of stationarity are equivalent.
Gaussian Time series The process Xt is a Gaussian time series if and only if the joint den-
sity of (Xt1 , ..., Xtn )0 is Gaussian for all t1, t2, ..., tn
If Xt is a stationary Gaussian time series, then it is also strictly stationary, since for all n = {1, 2, ...}
and for all h, t1, t2, ... ∈ Z, the random vectors (Xt1 , ..., Xtn )0, and (Xt1+h, ..., Xtn+h)0 have the same
mean, and covariance matrix, and hence they have the same distribution.
18
2.3 Ergodicity
Consider a stationary process Xt, with E(Xt) = µ for all t. Assume that we are interested in
estimating µ. The standard approach for estimating the mean of a single random variable consists of
computing its sample mean
X̄ = (1/N) Σ_{i=1}^N X_t^{(i)}
(we call this the ensemble average) where the X_t^{(i)}'s are different realizations of the variable Xt.
When working in a laboratory, one could generate different observations for the variable Xt under identical conditions.
However, when analyzing economic variables over time, we can only observe a single realization of each random variable Xt, so it is not possible to estimate µ by computing the above average.
Whether time averages converge to the same limit as the ensemble average, E(Xt), has to do with the concept of ergodicity.
19
Ergodicity for the mean A covariance stationary process Xt is said to be ergodic for the mean if
X̄ = (1/T) Σ_{t=1}^T Xt converges in probability to E(Xt) as T gets large.
Ergodicity for the second moments A covariance stationary process is said to be ergodic for
the second moments if
[1/(T − j)] Σ_{t=j+1}^T (Xt − µ)(Xt−j − µ) →p γj
Important result: A sufficient condition for ergodicity in the mean of a stationary process is that Σ_{h=0}^∞ |γ(h)| < ∞. If the process is also Gaussian, then the above condition also implies ergodicity for all higher moments.
For many applications, ergodicity and stationarity turn out to amount to the same requirements. However, we now present an example of a stationary process that is not ergodic.
20
Example Consider the following process: y0 = u0 with u0 ∼ N(0, σ^2), and yt = yt−1 for t = 1, 2, 3, .... It is easy to see that the process is stationary. In fact
E(yt) = 0,  E(yt^2) = σ^2,  E(yt−j yt−i) = σ^2 for i ≠ j.
However
(1/T) Σ_{t=1}^T yt = (1/T) Σ_{t=1}^T u0 = u0 ≠ 0 (in general),
so the time average of a single realization does not converge to E(yt) = 0.
21
2.4 Some processes
iid sequences The sequence εt is i.i.d with zero mean and variance σ 2 , written ε ∼ iid(0, σ 2), (inde-
pendent and identically distributed) if all the variables are independent and share the same univariate
distribution.
White noise The process εt is called white noise, written ε ∼ W N (0, σ 2) if it is weakly station-
ary with E(ε) = 0 and autocovariance function γ(0) = σ 2 and γ(h) = 0 for h 6= 0.
Note that an i.i.d sequence with zero mean and variance σ 2 is also white noise. The converse is
not true in general.
Martingale difference sequence, m.d.s. A process εt, with E(εt) = 0 is called a martingale differ-
ence sequence if
E(εt|εt−1, εt−2...) = 0, t = 2, 3, ...
22
Example (a non-stationary process) Consider the random walk yt = yt−1 + ut = y0 + Σ_{j=1}^t uj, with ut ∼ WN(0, σ^2). It is easy to see that E(yt) = y0. Moreover, the variance is
γ(0) = E(Σ_{j=1}^t uj)^2 = Σ_{j=1}^t E(uj^2) = tσ^2
and
γ(h) = E[(Σ_{j=1}^t uj)(Σ_{k=1}^{t−h} uk)] = Σ_{k=1}^{t−h} E(uk^2) = (t − h)σ^2
The autocorrelation is
ρ(h) = (t − h)σ^2 / [tσ^2 (t − h)σ^2]^{1/2} = (1 − h/t)^{1/2}   (2)
Both the variance and the autocovariances depend on t, so the process is not stationary.
23
3 Linear projections
24
4 Moments estimation
The sample mean is the natural estimator for the expected value of a process.
The sample autocovariance and autocorrelation can be computed for any data set, and are not
restricted to realizations of a stationary process. For stationary processes both functions will show
a rapid decay towards zero as h increases. However, for non-stationary data, these functions will
exhibit quite different behavior. For instance for variables with trend, the autocorrelation reduces
very slowly.
25
26
To compute the k-th partial autocorrelation, one simply runs an OLS regression of Yt on its k most recent values. The last coefficient is the k-th partial autocorrelation, that is,
Yt = α̂_1^{(k)}Yt−1 + α̂_2^{(k)}Yt−2 + ... + α̂_k^{(k)}Yt−k + ε̂t
where ε̂t denotes the OLS residual, so that α̂(k) = α̂_k^{(k)}.
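A minimal MATLAB sketch of this regression (y is an assumed T × 1 series and k an assumed lag; both hypothetical):

% Sketch: k-th sample partial autocorrelation as the last OLS coefficient
k = 4;                               % assumed lag
T = length(y);
X = ones(T-k,1);                     % constant
for j = 1:k
    X = [X, y(k+1-j:T-j)];           % columns: Y_{t-1}, ..., Y_{t-k}
end
b       = X \ y(k+1:T);              % OLS of Y_t on a constant and k lags
alpha_k = b(end);                    % k-th partial autocorrelation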
27
2. ARMA
This part is based on the Hamilton textbook.
1
1 MA
1.1 MA(1)
The MA(1) process is Yt = εt + θεt−1, with εt ∼ WN(0, σ^2). (Everything that follows also works for processes with non-zero mean.)
1.1.1 Moments
The variance is given by
E(Yt)2 = E(εt + θεt−1)2
= E(εt)2 + θ2E(εt−1)2 + 2θE(εtεt−1)
= σ 2 + θ2σ 2
= (1 + θ2)σ 2
(3)
The first autocovariance is
E(YtYt−1) = E(εt + θεt−1)(εt−1 + θεt−2)
= E(εtεt−1) + E(θεtεt−2) + E(θε2t−1) + E(θ2εt−1εt−2)
= θσ 2
Higher autocovariances are all zero, E(YtYt−j ) = 0 for j > 1.
Since the mean and the autocovariances are not functions of time, the process is stationary regardless of the value of θ. Moreover, the process is also ergodic for the first moment since Σ_{j=−∞}^∞ |γj| < ∞. If εt is also Gaussian, then the process is ergodic for all moments.
Figure 1
4
Note that the first order autocorrelation can be plotted as a function of θ as in Figure 2.
Figure 2
5
Note that:
1. positive values of θ induce positive autocorrelation, while negative values induce negative autocorrelation.
2. the largest possible value is 0.5 (θ = 1) and the smallest one is −0.5 (θ = −1)
3. for any value of ρ1 in [−0.5, 0.5] there are two values of θ that produce the same autocorrelation, because ρ1 is unchanged if we replace θ with 1/θ:
(1/θ)/(1 + 1/θ^2) = θ^2(1/θ)/[θ^2(1 + 1/θ^2)] = θ/(θ^2 + 1)   (4)
So the processes
Yt = εt + 0.5εt−1
and
Yt = εt + 2εt−1
generate the same autocorrelation functions.
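A quick numerical check of this equivalence, as a minimal MATLAB sketch (θ = 0.5 is an arbitrary assumed value):

theta = 0.5;                               % assumed MA coefficient
rho_a = theta/(1 + theta^2);               % rho_1 for the MA(1) with parameter theta
rho_b = (1/theta)/(1 + (1/theta)^2);       % rho_1 for the MA(1) with parameter 1/theta
disp([rho_a rho_b])                        % both equal 0.4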
6
Figure 3. Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
7
1.2 MA(q)
1.2.1 Moments
9
Example Consider an MA(2), Yt = εt + θ1εt−1 + θ2εt−2. Then
γ0 = (1 + θ1^2 + θ2^2)σ^2
γ1 = E(YtYt−1) = E[(εt + θ1εt−1 + θ2εt−2)(εt−1 + θ1εt−2 + θ2εt−3)] = θ1E(ε_{t−1}^2) + θ1θ2E(ε_{t−2}^2) = (θ1 + θ2θ1)σ^2
γ2 = E(YtYt−2) = θ2E(ε_{t−2}^2) = θ2σ^2   (6)
The MA(q) process is covariance stationary for any value of the θi. Moreover, the process is also ergodic for the first moment since Σ_{j=−∞}^∞ |γj| < ∞. If εt is also Gaussian, then the process is ergodic for all moments.
10
1.3 MA(∞)
1.3.1 Moments
• The variance is
γ0 = E(Yt^2) = E(Σ_{j=0}^∞ θjεt−j)^2 = E(θ0εt + θ1εt−1 + θ2εt−2 + ...)^2 = (θ0^2 + θ1^2 + θ2^2 + ...)σ^2 = σ^2 Σ_{j=0}^∞ θj^2   (9)
Again, all the terms involving expected values of products of different εj's are zero because of the WN assumption.
• Autocovariances are
γj = E(YtYt−j) = E[(θ0εt + θ1εt−1 + θ2εt−2 + ...)(θ0εt−j + θ1εt−j−1 + θ2εt−j−2 + ...)] = (θjθ0 + θj+1θ1 + θj+2θ2 + θj+3θ3 + ...)σ^2
The process is stationary: its first and second moments are finite and constant (provided Σ_{j=0}^∞ θj^2 < ∞).
Moreover, an MA(∞) with absolutely summable coefficients has absolutely summable autocovariances, Σ_{j=−∞}^∞ |γj| < ∞, so it is ergodic for the mean.
12
1.4 Invertibility and Fundamentalness
Consider an MA(1)
Yt = εt + θεt−1 (10)
where εt is WN(0, σ^2). Provided that |θ| < 1, both sides can be multiplied by (1 + θL)^{-1} = 1 − θL + θ^2L^2 − θ^3L^3 + ... to obtain
(1 − θL + θ^2L^2 − θ^3L^3 + ...)Yt = εt
which can be viewed as an AR(∞) representation.
Let us investigate what invertibility means in terms of the first and second moments of the pro-
cess. Consider the following MA(1)
Ỹt = (1 + θ̃L)ε̃t (11)
where ε̃t is WN(0, σ̃^2). Moreover, suppose that the parameters in this new MA(1) are related to the
other as follows:
θ̃ = θ−1
σ̃ 2 = θ2σ 2
13
Let us derive the first two moments of the two processes. E(Yt) = E(Ỹt) = 0. For Yt
E(Yt2) = σ 2(1 + θ2)
E(YtYt−1) = θσ 2
For Ỹt
E(Ỹt^2) = σ̃^2(1 + θ̃^2)
E(ỸtỸt−1) = θ̃σ̃^2
However, note that given the above restrictions
(1 + θ^2)σ^2 = (1 + 1/θ̃^2) θ̃^2σ̃^2 = [(θ̃^2 + 1)/θ̃^2] θ̃^2σ̃^2 = (θ̃^2 + 1)σ̃^2
and
θσ^2 = (1/θ̃) θ̃^2σ̃^2 = θ̃σ̃^2
That is, the first two moments of the two processes are identical. Note that if |θ| < 1 then |θ̃| > 1. In other words, for any invertible MA representation we can find a non-invertible representation with identical first and second moments. The two processes share the same autocovariance generating function.
The value of εt associated with the invertible representation is sometimes called the fundamental
innovation for Yt. For the borderline case |θ| = 1 the process is non invertible but still fundamental.
Invertibility An MA(q) process defined by the equation Yt = θ(L)εt is said to be invertible if there exists a sequence of constants {πj}_{j=0}^∞ such that Σ_{j=0}^∞ |πj| < ∞ and Σ_{j=0}^∞ πjYt−j = εt.
Proposition A MA process defined by the equation Yt = θ(L)εt is invertible if and only if θ(z) 6= 0
for all z ∈ C such that |z| ≤ 1.
15
1.5 Wold’s decomposition theorem
Theorem Any zero-mean covariance stationary process Yt can be represented in the form
Yt = Σ_{j=0}^∞ θjεt−j + kt   (12)
where:
1. θ0 = 1,
2. Σ_{j=0}^∞ θj^2 < ∞,
3. εt is the error made in forecasting Yt on the basis of a linear function of lagged Yt (fundamental
innovation),
4. the value of kt is uncorrelated with εt−j for any j and can be perfectly predicted from a linear
function of the past values of Y .
The term kt is called the linearly deterministic component of Yt. If kt = 0 then the process is called
purely non-deterministic.
The result is very powerful since it holds for any covariance stationary process.
16
However, the theorem does not imply that (12) is the true representation of the process.
• For instance, the process could be stationary but non-linear or non-invertible. If the true system is generated by a nonlinear difference equation Yt = g(Yt−1, ..., Yt−p) + ηt, then when we fit a linear approximation, as in the Wold theorem, the shock we recover, εt, will be different from ηt.
• If the model is non invertible then the true shock will not be the Wold shock.
17
2 AR
2.1 AR(1)
Consider the AR(1) process (1 − φL)Yt = εt with εt ∼ WN(0, σ^2) and |φ| < 1. This condition ensures that the MA(∞) representation Yt = Σ_{j=0}^∞ φ^jεt−j exists, and that the process is stationary and ergodic in the mean. Recall that Σ_{j=0}^∞ φ^j is a geometric series converging to 1/(1 − φ) if |φ| < 1.
18
2.2 Moments
• The mean is
E(Yt) = 0
• The variance is
γ0 = E(Yt^2) = σ^2 Σ_{j=0}^∞ φ^{2j} = σ^2/(1 − φ^2)
19
To find the moments of the AR(1) we can use a different strategy by directly working with the AR
representation and the assumption of stationarity.
20
Figure 4
21
Figure 5
22
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
23
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
24
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
25
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
26
2.3 AR(2)
27
2.3.1 Moments
• The mean is
E(Yt) = φ1E(Yt−1) + φ2E(Yt−2) + E(εt)
again by stationarity E(Yt) = E(Yt−j ) and therefore (1 − φ1 − φ2)E(Yt) = 0.
• The variance
E(Yt2) = φ1E(Yt−1Yt) + φ2E(Yt−2Yt) + E(εtYt)
The equation can be written as
γ0 = φ1γ1 + φ2γ2 + σ 2
where the last term comes from the fact that
E(εtYt) = φ1E(εtYt−1) + φ2E(εtYt−2) + E(ε2t )
and that E(εtYt−1) = E(εtYt−2) = 0. Note that γj /γ0 = ρj . Therefore the variance can be rewritten
as
γ0 = φ1ρ1γ0 + φ2ρ2γ0 + σ^2
= (φ1ρ1 + φ2ρ2)γ0 + σ^2
= [φ1^2/(1 − φ2) + φ2φ1^2/(1 − φ2) + φ2^2] γ0 + σ^2
so that, solving for γ0,
γ0 = (1 − φ2)σ^2 / {(1 + φ2)[(1 − φ2)^2 − φ1^2]}   (17)
29
Figure 4
30
2.4 AR(p)
31
2.4.1 Moments
• E(Yt) = 0;
• The variance is
γ0 = φ1γ1 + φ2γ2 + ... + φpγp + σ^2
32
2.4.2 Finding the roots of (1 − φ1 z − φ2 z 2 − ... − φp z p )
An easy way to find the roots of the polynomial is the following. Define two new vectors Zt = [Yt, Yt−1, ..., Yt−p+1]′ and ϵt = [εt, 0, ..., 0]′ (both p × 1), and the p × p companion matrix
F = [ φ1  φ2  φ3  ...  φp−1  φp
      1   0   0   ...  0     0
      0   1   0   ...  0     0
      ...
      0   0   ...      1     0 ]
Then Zt satisfies the AR(1)
Zt = F Zt−1 + t
The roots of the polynomial (1−φ1z −φ2z 2 −...−φpz p) coincide with the reciprocal of the eigenvalues
of F .
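A minimal MATLAB sketch of this equivalence (the AR(3) coefficients below are arbitrary assumed values):

phi = [0.5 0.3 -0.2];                   % assumed AR coefficients
p   = length(phi);
F   = [phi; eye(p-1), zeros(p-1,1)];    % companion matrix
lam = eig(F);                           % eigenvalues of F
z   = roots([-phi(end:-1:1) 1]);        % roots of 1 - phi_1*z - ... - phi_p*z^p
disp(sort(abs(lam)))                    % moduli of the eigenvalues ...
disp(sort(abs(1./z)))                   % ... coincide with the moduli of the reciprocal roots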
33
2.5 Causality and stationarity
Proposition An AR process φ(L)Yt = εt is causal if and only if φ(z) 6= 0 for all z such that |z| ≤ 1.
Stationarity The AR(p) is stationary if and only if φ(z) 6= 0 for all the z such that |z| = 1
35
3 ARMA
3.1 ARMA(p,q)
36
Consider the ARMA(p,q) process Yt = φ1Yt−1 + ... + φpYt−p + εt + θ1εt−1 + ... + θqεt−q, with MA(∞) representation Yt = ψ(L)εt (so that E(εt−jYt) = ψjσ^2, with ψ0 = 1).
• The variance of the process is
E(Yt^2) = φ1E(Yt−1Yt) + φ2E(Yt−2Yt) + ... + φpE(Yt−pYt) + E(εtYt) + θ1E(εt−1Yt) + ... + θqE(εt−qYt)
= φ1[σ^2(ψ1ψ0 + ψ2ψ1 + ...)] + φ2[σ^2(ψ2ψ0 + ψ3ψ1 + ...)] + ... + φp[σ^2(ψpψ0 + ψp+1ψ1 + ...)] + σ^2(ψ0 + θ1ψ1 + ... + θqψq)
The jth autocovariance is
γj = E(YtYt−j) = φ1E(Yt−1Yt−j) + φ2E(Yt−2Yt−j) + ... + φpE(Yt−pYt−j) + E(εtYt−j) + θ1E(εt−1Yt−j) + ... + θqE(εt−qYt−j)
For j ≤ q
γj = φ1γj−1 + φ2γj−2 + ... + φpγj−p + σ^2(θjψ0 + θj+1ψ1 + ...)
while for j > q the autocovariances are
γj = φ1γj−1 + φ2γj−2 + ... + φpγj−p
37
There is a potential problem of redundant parametrization with ARMA processes. Consider a simple WN
Yt = εt
and multiply both sides by (1 − ρL) to get
(1 − ρL)Yt = (1 − ρL)εt
an ARMA(1,1) whose AR and MA polynomials share the same root (1 − ρL). Both representations are valid; however, it is important to avoid such parametrizations since we would run into trouble when estimating the parameters.
38
3.2 ARMA(1,1)
39
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
40
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.
41
3. ESTIMATION
This part is based on the Hamilton textbook.
1
1 Estimating an autoregression with OLS
Consider an AR(p), yt = c + φ1yt−1 + ... + φpyt−p + εt, with the roots of (1 − φ1z − φ2z^2 − ... − φpz^p) = 0 outside the unit circle and with εt an i.i.d. sequence with zero mean, variance σ^2 and finite fourth moment.
The autoregression can be written in the standard regression form yt = x_t′β + ut with xt = [1, yt−1, ..., yt−p]′.
Note that the autoregression cannot satisfy assumption A1 since ut is not independent of yt+1.
2
1.1 Asymptotic results for the estimator
The asymptotic results seen before hold. Suppose we have T + p observations so that T observations
can be used to estimate the model. Then:
Consistency
" T
#−1 " T
#
X X
β̂ − β = (xtx0t) (xtut)
t=1 t=1
" T
#−1 " T
#
X X
= (1/T ) (xtx0t) (1/T ) (xtut)
t=1 t=1
Again, xtut is a martingale difference sequence with finite variance-covariance matrix E(xtutx_t′ut) = E(xtx_t′)σ^2, so that
(1/T) Σ_{t=1}^T xtut →p 0
Moreover, the first term is the inverse of
(1/T) Σ_{t=1}^T xtx_t′ =
[ 1               T^{-1}Σyt−1        T^{-1}Σyt−2        ...  T^{-1}Σyt−p
  T^{-1}Σyt−1     T^{-1}Σy_{t−1}^2   T^{-1}Σyt−1yt−2    ...  T^{-1}Σyt−1yt−p
  T^{-1}Σyt−2     T^{-1}Σyt−2yt−1    T^{-1}Σy_{t−2}^2   ...  T^{-1}Σyt−2yt−p
  ...
  T^{-1}Σyt−p     T^{-1}Σyt−pyt−1    T^{-1}Σyt−pyt−2    ...  T^{-1}Σy_{t−p}^2 ]
3
The elements in the first row or column converge in probability to µ by proposition C.13L. The other
elements by Theorem 7.2.1BD converge in probability to
E(yt−iyt−j ) = γ|i−j| + µ2
4
Asymptotic normality
Again, this follows from the fact that xtut is a martingale difference sequence with variance-covariance matrix E(xtutx_t′ut) = E(xtx_t′)σ^2, so that
(1/√T) Σ_{t=1}^T xtut →L N(0, σ^2Q)
and
√T(β̂ − β) →L N(0, Q^{-1}(σ^2Q)Q^{-1}) = N(0, σ^2Q^{-1})
5
2 Maximum likelihood: Introduction
Here we explore how to estimate the values of φ1, ..., φp, θ1, ..., θq on the basis of observations of
y. The principle on which the estimation is based is maximum likelihood.
Let θ = (φ1, ..., φp, θ1, ..., θq , σ 2) denote the vector of population parameters and suppose we have a
sample of T observations (y1, y2, ..., yT ).
The maximum likelihood (ML) estimate is the value of θ for which the sample likelihood (2) is maximized.
6
2.1 Likelihood function for an AR(1)
9
2.2 Conditional maximum likelihood estimates
An alternative to numerical optimization of the exact likelihood function is to consider the first value y1 as deterministic and to maximize the likelihood conditional on the first observation. In this case the conditional log-likelihood is given by
L(θ) = Σ_{t=2}^T log f(yt|yt−1, ..., y1; θ)   (9)
that is
L(θ) = −[(T − 1)/2] log(2π) − [(T − 1)/2] log(σ^2) − Σ_{t=2}^T (yt − c − φyt−1)^2/(2σ^2)   (10)
Maximization of (10) is equivalent to minimization of
Σ_{t=2}^T (yt − c − φyt−1)^2
which is achieved by an OLS regression of yt on a constant and its lagged value. The conditional ML estimates of c and φ are given by
[ĉ ; φ̂] = [ T − 1     Σyt−1 ; Σyt−1     Σy_{t−1}^2 ]^{-1} [ Σyt ; Σytyt−1 ]   (11)
(all sums running from t = 2 to T).
The conditional ML estimate of the innovation variance is found by differentiating (10) with respect to σ^2 and setting the result equal to zero,
−(T − 1)/(2σ^2) + Σ_{t=2}^T (yt − c − φyt−1)^2/(2σ^4) = 0
which gives
σ̂^2 = Σ_{t=2}^T (yt − ĉ − φ̂yt−1)^2/(T − 1)   (12)
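A minimal MATLAB sketch of the conditional ML (OLS) estimates for the AR(1) with constant (y is an assumed T × 1 data vector):

T    = length(y);
X    = [ones(T-1,1), y(1:T-1)];     % regressors: constant and y_{t-1}
b    = X \ y(2:T);                  % b = [c_hat; phi_hat]; OLS coincides with conditional ML
ehat = y(2:T) - X*b;                % residuals
sig2 = (ehat'*ehat)/(T-1);          % conditional ML estimate of sigma^2, as in (12)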
11
2.3 Likelihood function for an AR(p)
Consider an AR(p), yt = c + φ1yt−1 + φ2yt−2 + ... + φpyt−p + εt, with εt ∼ i.i.d. N(0, σ^2). In this case θ = (c, φ1, φ2, ..., φp, σ^2)′. Here we only study the conditional likelihood.
12
2.3.1 Conditional maximum likelihood estimates
The conditional likelihood can be derived as in the AR(1) case. In particular we have
L(θ) = −[(T − p)/2] log(2π) − [(T − p)/2] log(σ^2) − Σ_{t=p+1}^T (yt − c − φ1yt−1 − φ2yt−2 − ... − φpyt−p)^2/(2σ^2)   (14)
The conditional ML estimate of the innovation variance is found by differentiating (14) with respect to σ^2 and setting the result equal to zero:
σ̂^2 = Σ_{t=p+1}^T (yt − ĉ − φ̂1yt−1 − φ̂2yt−2 − ... − φ̂pyt−p)^2/(T − p)   (15)
• Note that even if the disturbances are not Gaussian, the ML estimates of the parameters are consistent estimates of the population parameters, because they correspond to OLS estimates and we saw in the previous class that consistency does not depend on the assumption of normality.
14
2.4 Conditional likelihood function for an MA(1)
Consider the MA(1), yt = µ + εt + θεt−1, with εt ∼ i.i.d. N(0, σ^2). Conditional on εt−1,
yt|εt−1 ∼ N(µ + θεt−1, σ^2)
or
f(yt|εt−1; θ) = (1/√(2πσ^2)) exp[−(yt − µ − θεt−1)^2/(2σ^2)]
Setting ε0 = 0, ε1 = y1 − µ is known with certainty, and then ε2 = y2 − µ − θε1. Proceeding this way, given the knowledge of ε0, the full sequence of innovations can be calculated by iterating on εt = yt − µ − θεt−1.
15
The conditional log-likelihood of the sample is therefore
L(θ) = −(T/2) log(2π) − (T/2) log(σ^2) − Σ_{t=1}^T ε_t^2/(2σ^2)
• For a particular numerical value of θ we calculate the implied sequence of ε's; the conditional likelihood is then a function of their sum of squares.
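A minimal MATLAB sketch of this calculation (mu, theta, sig2 are assumed trial parameter values and y an assumed data vector; ε0 is set to zero as in the text):

T    = length(y);
e    = zeros(T,1);                  % innovations implied by the trial parameters
e(1) = y(1) - mu;                   % uses eps_0 = 0
for t = 2:T
    e(t) = y(t) - mu - theta*e(t-1);    % iterate on eps_t = y_t - mu - theta*eps_{t-1}
end
logL = -T/2*log(2*pi) - T/2*log(sig2) - sum(e.^2)/(2*sig2);   % conditional log-likelihood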
16
2.5 Conditional likelihood function for an MA(q)
with εt ∼ i.i.d.N (0, σ 2). In this case θ = (µ, θ1, ..., θq , σ 2)0.
17
2.6 Likelihood function for an ARMA(p,q)
with εt ∼ i.i.d.N (0, σ 2). In this case θ = (µ, φ1, ..., φp, θ1, ..., θq , σ 2)0.
Taking values of y0, ...y−p+1, ε0, ..., ε−q+1 as given the sequence of ε’s can be calculated using
18
4. FORECASTING
This part is based on the Hamilton textbook.
1
1 Principles of forecasting
• Suppose we are interested in forecasting the value of Yt+1 based on a set of variables Xt.
• To evaluate the usefulness of a forecast we need to specify a loss function. Here we specify a quadratic loss function, which means that Yt+1|t is chosen to minimize E(Yt+1 − Yt+1|t)^2.
• E(Yt+1 − Yt+1|t)^2 is known as the mean squared error associated with the forecast Yt+1|t, denoted MSE(Yt+1|t).
• Fundamental result: the forecast with the smallest M SE is the expectation of Yt+1|t conditional
on Xt that is
Yt+1|t = E(Yt+1|Xt)
2
We now verify the claim. Let g(Xt) be any other function and let Yt+1|t = g(Xt). The associated MSE is
E[Yt+1 − g(Xt)]^2 = E[Yt+1 − E(Yt+1|Xt) + E(Yt+1|Xt) − g(Xt)]^2
= E[Yt+1 − E(Yt+1|Xt)]^2 + 2E(ηt+1) + E[E(Yt+1|Xt) − g(Xt)]^2   (2)
Define
ηt+1 = [Yt+1 − E(Yt+1|Xt)][E(Yt+1|Xt) − g(Xt)]
Conditional on Xt, the second factor is a constant, so E(ηt+1|Xt) = 0. Therefore, by the law of iterated expectations,
E(ηt+1) = E(E(ηt+1|Xt)) = 0
and the MSE is minimized by setting the last term in (2) to zero, i.e. by choosing
g(Xt) = E(Yt+1|Xt)
E(Yt+1|Xt) is the optimal forecast of Yt+1 conditional on Xt under a quadratic loss function. The MSE of this optimal forecast is E[Yt+1 − E(Yt+1|Xt)]^2.
4
1.2 Forecast based on linear projections
• The linear projection produces the smallest forecast error among the class of linear forecasting
rules. To verify this let g 0Xt be any arbitrary forecasting rule.
E(Yt+1 − g′Xt)^2 = E(Yt+1 − α′Xt + α′Xt − g′Xt)^2
= E(Yt+1 − α′Xt)^2 + 2E[(Yt+1 − α′Xt)(α′Xt − g′Xt)] + E(α′Xt − g′Xt)^2
The middle term is
E[(Yt+1 − α′Xt)(α′Xt − g′Xt)] = E[(Yt+1 − α′Xt)X_t′](α − g) = 0′(α − g) = 0
by definition of the linear projection (α satisfies E[(Yt+1 − α′Xt)X_t′] = 0′). Thus
E(Yt+1 − g′Xt)^2 = E(Yt+1 − α′Xt)^2 + E(α′Xt − g′Xt)^2
The optimal linear forecast is thus g′Xt = α′Xt. We use the notation
P̂(Yt+1|Xt) = α′Xt
Notice that
MSE[P̂(Yt+1|Xt)] ≥ MSE[E(Yt+1|Xt)]
The projection coefficient α can be calculated in terms of the moments of Yt+1 and Xt:
E(Yt+1X_t′) = α′E(XtX_t′)  ⇒  α′ = E(Yt+1X_t′)[E(XtX_t′)]^{-1}
Here we denote by Ŷt+s|t = Ê(Yt+s|Xt) = P̂(Yt+s|1, Xt) the best linear forecast of Yt+s conditional on Xt (and a constant).
6
1.3 Linear projections and OLS regression
There is a close relationship between the OLS estimator and the linear projection coefficient. If Yt+1 and Xt are stationary processes that are also ergodic for the second moments, then
(1/T) Σ_{t=1}^T XtX_t′ →p E(XtX_t′)
(1/T) Σ_{t=1}^T XtYt+1 →p E(XtYt+1)
implying
β̂ →p α
The OLS regression yields a consistent estimate of the linear projection coefficient.
7
2 Forecasting an AR(1)
In general the j-step ahead forecast Ŷt+j|t can be computed using the recursion
Zt = F Zt−1 + t
We have
Ẑt+1|t = F Zt
Ẑt+2|t = F 2Zt
...
Ẑt+s|t = F sZt
which means
Ẑt+s|t = F Ẑt+s−1|t
The forecast Ŷt+s|t will be the first element of Ẑt+s|t
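A minimal MATLAB sketch of this recursion for an AR(2) (the coefficients phi and the data vector y are assumed, hypothetical values):

phi = [0.5 0.2];  p = numel(phi);        % assumed AR coefficients
F   = [phi; eye(p-1), zeros(p-1,1)];     % companion matrix
Zt  = [y(end); y(end-1)];                % Z_t = [Y_t; Y_{t-1}]
s   = 4;                                 % forecast horizon
Zf  = F^s * Zt;                          % \hat Z_{t+s|t} = F^s Z_t
Yf  = Zf(1);                             % \hat Y_{t+s|t} is the first element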
10
The associated forecast errors will be
M SE(Ẑt+1|t) = Σ
M SE(Ẑt+2|t) = F ΣF 0 + Σ
...
11
4 Forecasting an MA(1)
For an MA(1), Yt = (1 + θL)εt, the one-step ahead forecast is Ŷt+1|t = θεt, while for s > 1
Ŷt+s|t = 0   (10)
because [(1 + θL)/L^s]_+ = 0 for s > 1.
12
5 Forecasting an MA(q)
Yt = εt + θ1εt−1 + θ2εt−2 + ... + θqεt−q = (1 + θ1L + θ2L^2 + ... + θqL^q)εt
Ŷt+s|t = [(1 + θ1L + θ2L^2 + ... + θqL^q)/L^s]_+ · [1/(1 + θ1L + θ2L^2 + ... + θqL^q)] Yt
where
[(1 + θ1L + θ2L^2 + ... + θqL^q)/L^s]_+ = θs + θs+1L + ... + θqL^{q−s}  for s = 1, 2, ..., q
and zero for s = q + 1, q + 2, ....
13
6 Forecasting an ARMA(1,1)
Ŷt+s|t = φŶt+s−1|t
15
7 Direct forecast
An alternative is to compute the direct forecast, i.e. the projection of Yt+h on Yt (and its lags). To see this, consider a bivariate VAR(p) with two variables, xt and yt. We want to forecast xt+h given the information available at time t.
16
8 Comparing Predictive Accuracy
Diebold and Mariano propose a procedure to formally compare the forecasting performance of two competing models. Let Ŷ^1_{t+s|t} and Ŷ^2_{t+s|t} be two forecasts based on the same information set but obtained using two different models (e.g. an MA(1) and an AR(1)). Let
w^1_{τ+s|τ} = Y_{τ+s} − Ŷ^1_{τ+s|τ}
w^2_{τ+s|τ} = Y_{τ+s} − Ŷ^2_{τ+s|τ}
be the two forecast errors, where τ = T0, ..., T − s.
The accuracy of each forecast is measured by a particular loss function, say quadratic, i.e. L(w^i_{τ+s|τ}) = (w^i_{τ+s|τ})^2. The Diebold-Mariano procedure is based on a test of the null hypothesis
H0: E(dτ) = 0,  H1: E(dτ) ≠ 0
where dτ = L(w^1_{τ+s|τ}) − L(w^2_{τ+s|τ}). The Diebold-Mariano statistic is
S = d̄ / (LRV̂_d̄)^{1/2}
where
d̄ = [1/(T − T0 − s)] Σ_{τ=T0}^{T−s} dτ
and LRV̂_d̄ is a consistent estimate of
LRV_d̄ = γ0 + 2 Σ_{j=1}^∞ γj
• When comparing forecasts obtained from estimated models one has to be careful: for nested models the distribution of the DM statistic is non-normal.
• For nested models an alternative is the bootstrap procedure in Clark and McCracken (Advances in Forecast Evaluation, 2013).
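A minimal MATLAB sketch of the statistic under quadratic loss (w1 and w2 are assumed vectors of forecast errors; the Bartlett/Newey-West long-run variance estimator, the truncation lag q, and the scaling of the long-run variance by the sample size are assumptions, one of several common choices):

d    = w1.^2 - w2.^2;                    % loss differential d_tau
dbar = mean(d);
n    = length(d);
q    = 4;                                % assumed truncation lag
lrv  = mean((d-dbar).^2);                % gamma_0
for j = 1:q
    gj  = mean((d(1+j:n)-dbar).*(d(1:n-j)-dbar));
    lrv = lrv + 2*(1-j/(q+1))*gj;        % Bartlett-weighted autocovariances
end
S = dbar / sqrt(lrv/n);                  % compare with standard normal critical values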
18
9 Forecast in practice
• So far we assumed that the value of the coefficients is known. This is obviously not the case in
practice. In real applications we will have to estimate the value of the parameters.
• For instance, with an AR(p) we first obtain estimates φ̂1, φ̂2, ..., φ̂p using OLS and then use the formulas seen before to produce forecasts of Y.
• Suppose we have data up to date T . We estimate the model using all the T observations for
Yt and we forecast YT +s, call it ỸT +s|T to distinguish it from the forecast where coefficients were
known with certainty ŶT +s|T .
19
9.1 Forecast evaluation
• A key issue in forecasting is to evaluate the forecasting accuracy of a model of interest. In particular, we will often be interested in comparing the performance of competing forecasting models, e.g. AR(1) vs. ARMA(1,1).
• How? We can compare their mean squared errors in pseudo out-of-sample forecast exercises.
20
9.2 Pseudo out-of-sample exercises
21
5. FORECASTING: APPLICATIONS
22
“Why has U.S. inflation become harder to forecast?”
By Stock, J. and M. Watson
23
10 “Why has U.S. inflation become harder to forecast?”
• The rate of price inflation in the United States has become both harder and easier to forecast.
• Easier: inflation is much less volatile than it was in the 1970s and the root mean squared er-
ror of inflation forecasts has declined sharply since the mid-1980s.
• Harder: standard multivariate forecasting models do not do a better job than simple naive models. The point was made by Atkeson and Ohanian (2001) (henceforth, AO), who found that, since 1984 in the U.S., backwards-looking Phillips curve forecasts have been inferior to a forecast of twelve-month inflation by its average rate over the previous twelve months (naive or random-walk forecast).
• Relevance of the topic. Changes in forecasting properties can signal changes in the structure of the economy; they can be taken as evidence that some relations have changed.
• What relations? Structural models should be employed (next part of the course).
24
10.1 U.S. Inflation forecasts: facts and puzzles
10.1.1 Data
• Robustness analysis done using personal consumption expenditure deflator for core items (PCEcore),
the personal consumption expenditure deflator for all items (PCE-all), and the consumer price index
(CPI, the official CPI-U).
• Real activity variables: the unemployment rate (u), log real GDP (y), the capacity utilization
rate, building permits, and the Chicago Fed National Activity Index (CFNAI)
• Quarterly data. Quarterly values for monthly series are averages of the three months in the quarter.
25
10.1.2 Forecasting models
• Let πt = 400 log(pt/pt−1), where pt is the quarterly price index, and let the h-period average inflation be π^h_t = (1/h) Σ_{i=0}^{h−1} πt−i. Let π^h_{t+h|t} be the forecast of π^h_{t+h} using information up to date t.
26
10.1.3 AR(r)
• Forecasts made using a univariate autoregression with r lags. r is estimated using the Akaike
Information Criterion (AIC).
• Multistep forecasts are computed by the direct method: projecting h-period ahead inflation on
r lags
• The h-step ahead AR(r) forecast was computed using the model
π^h_{t+h} − πt = µ^h + α^h(L)∆πt + v^h_t   (11)
where
1. µh is a constant
2. αh(L) is a polynomial in the lag operator
3. vth is the h-step ahead error term
• The number of lags r is chosen according to the Akaike Information Criterion (AIC), meaning that r is such that
AIC = T log(Σ_{t=1}^T ε̂_t^2) + 2r
is minimized. An alternative criterion is the Bayesian Information Criterion (BIC)
BIC = T log(Σ_{t=1}^T ε̂_t^2) + r log(T)
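A minimal MATLAB sketch of lag selection by AIC for a generic autoregression (y is an assumed series and rmax an assumed maximum lag; all candidate lag lengths are evaluated on the same effective sample):

rmax = 8;                                  % assumed maximum number of lags
T0   = length(y);
aic  = zeros(rmax,1);
for r = 1:rmax
    yr = y(rmax+1:T0);                     % common effective sample
    X  = ones(length(yr),1);
    for j = 1:r
        X = [X, y(rmax+1-j:T0-j)];         % lags 1,...,r
    end
    e      = yr - X*(X\yr);                % OLS residuals
    aic(r) = length(yr)*log(sum(e.^2)) + 2*r;   % criterion as in the text
end
[~, rhat] = min(aic);                      % chosen lag length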
28
10.1.4 AO. Atkeson-Ohanian (2001)
AO forecasted the average four-quarter rate of inflation as the average rate of inflation over the
previous four quarters. The AO forecast is
π^h_{t+h|t} = π^4_t = (1/4)(πt + πt−1 + πt−2 + πt−3)
29
10.1.5 Backwards-looking Phillips curve (PC)
The PC forecasts are computed by adding a predictor to (11) to form the autoregressive distributed
lag (ADL) specification,
π^h_{t+h} − πt = µ^h + α^h(L)∆πt + β^h xgap_t + δ^h(L)∆xt + v^h_t   (12)
where
1. µh is a constant
2. α^h(L), δ^h(L) are polynomials in the lag operator (lag lengths chosen using AIC)
3. xgap_t is the gap variable (deviations from a low-pass filter) based on the variable xt
4. v^h_t is the h-step ahead error term
• The PC forecast using the unemployment rate as the activity variable (xt = ut, so that xgap_t is the unemployment gap) is called PC-u.
30
10.2 Out-of-sample methodology
• The forecasts were computed using the pseudo out-of-sample forecast methodology: that is, for
a forecast made at date t, all estimation, lag length selection, etc. was performed using only data
available through date t.
• The forecasts are recursive, so that forecasts at date t are based on all the data (beginning in
1960:I) through date t.
• The period 1960-1970 was used for initial parameter estimation. The forecast period 1970:I 2004:IV
was split into the two periods 1970:I 1983:IV and 1984:I 2004:IV.
31
10.3 Results
32
• The RMSFE of forecasts of GDP inflation has declined and the magnitude of this reduction is
striking. In this sense inflation has become easier to forecast
• The relative performance of the Phillips curve forecasts deteriorated substantially from the first pe-
riod to the second. This deterioration of Phillips curve forecasts is found for all the activity predictors.
• The AO forecast substantially improves upon the AR(AIC) and Phillips curve forecasts at the
four- and eight-quarter horizons in the 1984-2004 period, but not at shorter horizons and not in the
first period.
⇒ Substantial changes in the univariate inflation process and in the bivariate process of inflation
and its activity-based predictors.
33
“Unpredictability and Macroeconomic Stability”
By D’Agostino, A., D. Giannone and P. Surico
34
11 Unpredictability and Macroeconomic Stability
• D’Agostino Giannone and Surico extend the result for inflation to other economic activity variables:
the ability to predict several measures of real activity declined remarkably, relative to naive forecasts,
since the mid-1980s.
• The fall in the predictive ability is a common feature of many forecasting models including those
used by public and private institutions.
• The forecasts for output (and also inflation) of the Federal Reserve's Greenbook and the Survey of
Professional Forecasters (SPF) are significantly more accurate than a random walk only before 1985.
After 1985, in contrast, the hypothesis of equal predictive ability between naive random walk forecasts
and the predictions of those institutions is not rejected for all horizons but the current quarter.
• The decline in predictive accuracy is far more pronounced for institutional forecasters and methods
based on large information sets than for univariate specifications.
• The fact that larger models are associated with larger historical changes suggests that the main
sources of the decline in predictability are the dynamic correlations between variables rather than the
autocorrelations of output and inflation.
35
11.1 Data
• Forecasts for nine monthly key macroeconomic series: three price indices, four measures of real
activity and two interest rates:
1. The three nominal variables are Producer Price Index (PPI ), Consumer Price Index (CPI ) and
Personal Consumption Expenditure implicit Deflator (PCED).
2. The four forecasted measures of real activity are Personal Income (PI ), Industrial Production
(IP) index, Unemployment Rate (UR), and EMPloyees on non-farm Payrolls (EMP).
3. the interest rates are 3 month Treasury Bills (TBILL) and 10 year Treasury Bonds (TBOND).
• The data set consists of monthly observations from 1959:1 through 2003:12 on 131 U.S.macroeconomic
time series including also the nine variables of interest.
36
11.2 Forecasting models
37
11.3 Out-of-sample methodology
• Pseudo out-of-sample forecasts are calculated for each variable and method over the horizons h =
1, 3, 6, and 12 months.
• The pseudo out-of-sample forecasting period begins in January 1970 and ends in December 2003.
Forecasts constructed at date T are based on models that are estimated using observations dated T
and earlier.
• Forecast based on rolling samples using, at each point in time, observations over the most recent
10 years.
38
11.4 Results: full sample
39
• For all prices and most real activity indicators, the forecasts based on large information are signifi-
cantly more accurate than the Naive forecasts.
• Univariate autoregressive forecasts significantly improve on the naive models for EMP at all hori-
zons and for CPI and PCED at one and three month horizons only. As far as interest rates are
concerned, no forecasting model performs significantly better than the naive benchmark.
40
12 Results: sub samples - inflation
• For all lags except the first, the AO result is confirmed: the forecasting performance for inflation deteriorates.
41
12.1 Results: sub samples - real activity
42
• Little change in the structure of univariate models for real activity.
• The relative MSFEs of FAAR and POOL suggest that important changes have occurred in the
relationship between output and other macroeconomic variables.
• The decline in predictability does not seem to extend to the labor market, especially at short
horizons. The forecasts of the employees on nonfarm payrolls are associated with the smallest per-
centage changes across subsamples.
43
12.2 Results: sub samples - interest rates
• In the second sample the AR, FAAR and P OOL methods produce more accurate forecasts than
the RW at one month horizon.
• Possible interpretation: increased predictability of the FED due to a better communication strategy.
44
12.3 Results: private and institutional forecasters
• The predictions for output and its deflator from two large forecasters representing the private sector
and the policy institutions are considered.
• The survey was introduced by the American Statistical Association and the National Bureau of
Economic Research and is currently maintained by the Philadelphia Fed. The SPF refers to quarterly
measures and is conducted in the middle of the second month of each quarter (here the median of
the individual forecasts is considered)
• The forecasts of the Greenbook are prepared by the Board of Governors at the Federal Reserve for
the meetings of the Federal Open Market Committee (FOMC), which take place roughly every six
weeks.
• The measure of output is Gross National Product (GNP) until 1991 and Gross Domestic Product
(GDP) from 1992 onwards.
• The evaluation sample begins in 1975 (prior to this date the Greenbook forecasts were not al-
ways available up to the fourth quarter horizon).
45
12.4 Results: private and institutional forecasters - inflation
46
12.5 Results: private and institutional forecasters - real activity
47
”The Return of the Wage Phillips Curve”
48
13 The Return of the Wage Phillips Curve
• Previous evidence has been taken as a motivation to dismiss the Phillips curve as a theoretical
concept.
• In 1958 Phillips uncovered an inverse relation between wage rate inflation and unemployment.
• In recent years, however, the focus has shifted to price inflation.
49
13.1 Back to the origins
50
51
13.2 Results
• The Phillips curve still (now more than then) characterizes the dynamics of wage growth and unemployment.
• Crucial question: what has changed in the relation between prices and wages?
52
5: MULTIVARIATE STATIONARY PROCESSES
1
1 Some Preliminary Definitions and Concepts
Random Vector: A vector X = (X1, ..., Xn) whose components are scalar-valued random variables
on the same probability space.
Vector Random Process: A family of random vectors {Xt, t ∈ T } defined on a probability space,
where T is a set of time points. Typically T = R, T = Z or T = N, the sets of real, integer and
natural numbers, respectively.
Matrix of polynomials in the lag operator: Φ(L) is a matrix of polynomials in the lag operator if its elements are polynomials in the lag operator, e.g.
Φ(L) = [ 1   −0.5L ; L   1 + L ] = Φ0 + Φ1L
where
Φ0 = [ 1 0 ; 0 1 ],   Φ1 = [ 0 −0.5 ; 1 1 ],   Φj = [ 0 0 ; 0 0 ] for j > 1.
2
The inverse of Φ(L) is a matrix lag polynomial such that Φ(L)^{-1}Φ(L) = I. Suppose Φ(L) = (I − Φ1L). Its inverse Φ(L)^{-1} = A(L) is a matrix polynomial such that (A0 + A1L + A2L^2 + ...)Φ(L) = I. That is
A0 = I
A1 − Φ1 = 0 ⇒ A1 = Φ1
A2 − A1Φ1 = 0 ⇒ A2 = Φ1^2
...
Ak − Ak−1Φ1 = 0 ⇒ Ak = Φ1^k   (1)
3
1.1 Covariance Stationarity
Let Yt be an n-dimensional vector of time series, Yt′ = [Y1t, ..., Ynt]. Then Yt is covariance (weakly) stationary if E(Yt) = µ and the autocovariance matrices Γj = E[(Yt − µ)(Yt−j − µ)′] are independent of t and finite for all j.
− Stationarity of each of the components of Yt does not imply stationarity of the vector Yt. Station-
arity in the vector case requires that the components of the vector are stationary and costationary.
− Although γj = γ−j for a scalar process, the same is not true for a vector process. The correct relation is
Γj = Γ_{−j}′
Example: n = 2 and µ = 0:
Γ1 = [ E(Y1tY1t−1)  E(Y1tY2t−1) ; E(Y2tY1t−1)  E(Y2tY2t−1) ]
   = [ E(Y1t+1Y1t)  E(Y1t+1Y2t) ; E(Y2t+1Y1t)  E(Y2t+1Y2t) ] = Γ_{−1}′
4
2 Vector Moving average processes
An n-dimensional vector white noise εt′ = [ε1t, ..., εnt] ∼ WN(0, Ω) is such that E(εt) = 0 and Γk = Ω (a symmetric positive definite matrix) for k = 0 and Γk = 0 for k ≠ 0. If εt and ετ are independent for t ≠ τ, the process is an independent vector white noise (i.i.d.). If in addition εt ∼ N, the process is a Gaussian WN.
Important: A vector whose components are white noise is not necessarily a vector white noise. Example: let ut be a scalar white noise and define εt = (ut, ut−1)′. Then
E(εtεt′) = [ σ_u^2  0 ; 0  σ_u^2 ]   and   E(εtε_{t−1}′) = [ 0  0 ; σ_u^2  0 ] ≠ 0.
5
2.2 Vector Moving Average (VMA)
Given the n-dimensional vector white noise εt, a vector moving average of order q is defined as
Yt = µ + εt + C1εt−1 + C2εt−2 + ... + Cqεt−q
VMA(1)
Let us consider the VMA(1)
Yt = µ + εt + C1εt−1
with εt ∼ WN(0, Ω) and µ the mean of Yt. The variance of the process is given by
Γ0 = E[(Yt − µ)(Yt − µ)′] = Ω + C1ΩC1′
with autocovariances
Γ1 = C1Ω,  Γ−1 = ΩC1′,  Γj = 0 for |j| > 1.
The VMA(q)
Let us consider the VMA(q)
Yt = µ + εt + C1εt−1 + ... + Cqεt−q
with εt ∼ WN(0, Ω) and µ the mean of Yt. The variance of the process is given by
Γ0 = Ω + C1ΩC1′ + C2ΩC2′ + ... + CqΩCq′
with autocovariances
Γj = CjΩ + Cj+1ΩC1′ + ... + CqΩC_{q−j}′ for j = 1, ..., q, and Γj = 0 for j > q.
7
The VMA(∞)
A useful process, as we will see, is the VMA(∞)
Yt = µ + Σ_{j=0}^∞ Cjεt−j   (2)
The process can be thought of as the limiting case of a VMA(q) (for q → ∞); recall from the previous result that it converges in mean square if {Cj} is absolutely summable.
Here εt is a vector WN with E(εt−j) = 0, E(εtε_{t−j}′) = Ω for j = 0 and zero otherwise, and {Cj}_{j=0}^∞ is absolutely summable. Let Yit denote the ith element of Yt and µi the ith element of µ. Then:
(a) The autocovariance between the ith variable at time t and the jth variable s periods earlier, E[(Yit − µi)(Yjt−s − µj)], exists and is given by the row i, column j element of
Γs = Σ_{v=0}^∞ C_{s+v} Ω C_v′
for s = 0, 1, 2, ....
8
(b) The sequence of matrices {Γs}_{s=0}^∞ is absolutely summable.
If furthermore {εt}_{t=−∞}^∞ is an i.i.d. sequence with E|ε_{i1t} ε_{i2t} ε_{i3t} ε_{i4t}| < ∞ for i1, i2, i3, i4 = 1, 2, ..., n, then also
(c) E|Y_{i1t1} Y_{i2t2} Y_{i3t3} Y_{i4t4}| < ∞ for all t1, t2, t3, t4
(d) (1/T) Σ_{t=1}^T YitYjt−s →p E(YitYjt−s), for i, j = 1, 2, ..., n and for all s
Implications:
1. Result (a) implies that the second moments of a M A(∞) with absolutely summable coefficients
can be found by taking the limit of the autocovariance of an M A(q).
2. Result (b) ensures ergodicity for the mean
3. Result (c) says that Yt has bounded fourth moments
4. Result (d) says that Yt is ergodic for second moments
Note: the vector M A(∞) representation of a stationary VAR satisfies the absolute summability
condition, so that the assumptions of the previous proposition hold.
9
2.3 Invertible and fundamental VMA
Invertibility The VMA is invertible, i.e. it possesses a VAR representation, if and only if the determinant of C(L) vanishes only outside the unit circle, i.e. if det(C(z)) ≠ 0 for all |z| ≤ 1. Example: if det(C(z)) = θ − z, the determinant is zero for z = θ, so the process is invertible if and only if |θ| > 1.
Fundamentalness The VMA is fundamental if and only if det(C(z)) ≠ 0 for all |z| < 1. In the previous example the process is fundamental if and only if |θ| ≥ 1. In the case |θ| = 1 the process is fundamental but noninvertible.
Provided that |θ| > 1 the MA process can be inverted and the shock can be obtained as a com-
bination of present and past values of Yt. That is the VAR (Vector Autoregressive) representation
can be recovered. The representation will entail infinitely many lags of Yt with absolutely summable
coefficients, so that the process converges in mean square.
10
Considering the above example,
[ 1   −L/(θ − L) ; 0   1/(θ − L) ] [ Y1t ; Y2t ] = [ ε1t ; ε2t ]
or
Y1t − (L/θ) [1/(1 − (1/θ)L)] Y2t = ε1t
(1/θ) [1/(1 − (1/θ)L)] Y2t = ε2t   (3)
11
2.4 Wold Decomposition
Any covariance stationary vector process Yt can be represented as
Yt = C(L)εt + µt   (4)
where C(L)εt is the stochastic component, with C(L) = Σ_{i=0}^∞ CiL^i, and µt is the purely deterministic component, the one perfectly predictable using linear combinations of past Yt.
Equation (4) is the Wold representation of Yt, which is unique and for which the following properties hold:
(a) εt is the innovation for Yt, i.e. εt = Yt − Proj(Yt|Yt−1, Yt−2, ...), i.e. the shock is fundamental.
(b) εt is white noise: E(εt) = 0, E(εtε_τ′) = 0 for t ≠ τ, E(εtε_t′) = Ω.
(c) The coefficients are square summable: Σ_{j=0}^∞ ‖Cj‖^2 < ∞.
(d) C0 = I
12
• The result is very powerful since it holds for any covariance stationary process.
• However, the theorem does not imply that (4) is the true representation of the process. For instance, the process could be stationary but non-linear or non-invertible.
13
2.5 Other fundamental MA(∞) Representations
• It is easy to extend the Wold representation to the general class of invertible MA(∞) representations.
For any non-singular matrix R of constants we can define ut = R^{-1}εt and we have
Yt = C(L)Rut = D(L)ut
• Notice that all these representations, obtained as linear combinations of the Wold representation, are fundamental. In fact, det(C(z)R) = det(C(z))det(R), therefore if det(C(z)) ≠ 0 for all |z| < 1 then so is det(C(z)R).
14
3 VAR: representations
• Every stationary vector process Yt admits a Wold representation. If the MA matrix of lag polyno-
mials is invertible, then a unique VAR exists.
• We define C(L)−1 as an (n × n) lag polynomial such that C(L)−1C(L) = I; i.e. when these
lag polynomial matrices are matrix-multiplied, all the lag terms cancel out. This operation in effect
converts lags of the errors into lags of the vector of dependent variables.
• Thus we move from the MA coefficients to the VAR coefficients. Define A(L) = C(L)^{-1}. Then, given the (invertible) MA coefficients, it is easy to map these into the VAR coefficients:
Yt = C(L)εt
A(L)Yt = εt   (5)
where A(L) = A0 − A1L − A2L^2 − ... and the Aj are (n × n) matrices of coefficients.
• To show that this matrix lag polynomial exists and how it maps into the coefficients in C(L), note that by assumption we have the identity
(A0 − A1L − A2L^2 − ...)(C0 + C1L + C2L^2 + ...) = I
After distributing, the identity implies that the coefficients on the positive powers of L must be zero, which gives the following recursive solution for the VAR coefficients:
A0 = I
A1 = A0C1
Ak = A0Ck − A1C_{k−1} − ... − A_{k−1}C1
• As noted, the VAR is of infinite order (i.e. infinite number of lags required to fully represent joint
density).
• In practice, the VAR is usually restricted for estimation by truncating the lag-length. Recall
that the AR coefficients are absolutely summable and vanish at long lags.
Note: Here we are considering zero mean processes. In case the mean of Yt is not zero we should add
a constant in the VAR equations.
16
VAR(1) representation Any VAR(p) can be rewritten as a VAR(1). To form a VAR(1) from the general model we define e_t′ = [ε_t′, 0, ..., 0] and Y_t′ = [Yt′, Y_{t−1}′, ..., Y_{t−p+1}′], together with the companion matrix
A = [ A1  A2  ...  Ap−1  Ap
      In  0   ...  0     0
      0   In  ...  0     0
      ...
      0   0   ...  In    0 ]
Therefore we can rewrite the VAR(p) as a VAR(1)
Yt = AYt−1 + et
17
SUR representation The VAR(p) can be stacked as
Y = XΓ + u
Vec representation Let vec denote the column-stacking operator, i.e. if
X = [ X11 X12 ; X21 X22 ; X31 X32 ]
then
vec(X) = [X11, X21, X31, X12, X22, X32]′
Let γ = vec(Γ); then the VAR can be rewritten as
Yt = (In ⊗ X_t′)γ + εt
18
4 VAR: Stationarity
Yt = µ + AYt−1 + εt
= µ + A(µ + AYt−2 + εt−1) + εt
= (I + A)µ + A^2Yt−2 + Aεt−1 + εt
...
= (I + A + ... + A^{j−1})µ + A^jYt−j + Σ_{i=0}^{j−1} A^iεt−i
• The eigenvalues of A, λ, solve det(A − λI) = 0. If all the eigenvalues of A are smaller than one in modulus, the sequence A^i, i = 0, 1, ..., is absolutely summable. Therefore:
1. the infinite sum Σ_{i=0}^∞ A^iεt−i exists in mean square;
19
Therefore, if the eigenvalues are smaller than one in modulus, then Yt has the representation
Yt = (I − A)^{-1}µ + Σ_{i=0}^∞ A^iεt−i
• Note that the eigenvalues correspond to the reciprocals of the roots of the determinant of A(z) = I − Az. A VAR(1) is called stable if all the eigenvalues of A are smaller than one in modulus.
Stability A VAR(p) is stable if and only if all the eigenvalues of A (the AR matrix of the companion form of Yt) are smaller than one in modulus, or equivalently if and only if det(In − A1z − A2z^2 − ... − Apz^p) ≠ 0 for all |z| ≤ 1.
• Notice that the converse is not true: an unstable process can be stationary.
20
4.2 Back to the Wold representation
• We know how to find the MA(∞) representation of a stationary AR(1). We can proceed similarly for the VAR(1). Substituting backward in the companion form we have
Yt = (I − AL)^{-1}et = Σ_{j=0}^∞ A^j et−j = C(L)et
where C0 = A^0 = I, C1 = A, C2 = A^2, ..., Ck = A^k. The MA coefficient Cj of the original VAR(p) is the n × n upper-left block of Cj.
22
5 VAR: second moments
Let us consider the companion form of a stationary (zero mean for simplicity) VAR(p) defined earlier
Yt = AYt−1 + et (7)
The variance of Yt is given by
Σ̃ = E[YtYt′] = AΣ̃A′ + Ω̃   (8)
where Ω̃ = E[ete_t′]. A closed-form solution to (8) can be obtained in terms of the vec operator. Let A, B, C be matrices such that the product ABC exists. A property of the vec operator is that
vec(ABC) = (C′ ⊗ A)vec(B)
Applying the vec operator to both sides of (8) we have
vec(Σ̃) = (A ⊗ A)vec(Σ̃) + vec(Ω̃)
If we define 𝒜 = (A ⊗ A) then we have
vec(Σ̃) = (I − 𝒜)^{-1}vec(Ω̃)
where
Γ̃0 = Σ̃ = [ Γ0      Γ1      ...  Γp−1
             Γ−1     Γ0      ...  Γp−2
             ...
             Γ−p+1   Γ−p+2   ...  Γ0 ]
23
The variance Σ = Γ0 of the original series Yt is given by the first n rows and columns of Σ̃.
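A minimal MATLAB sketch of this computation (A denotes the companion matrix, Om_tilde the companion innovation variance Ω̃ and n the dimension of the original VAR; all three are assumed to be already defined):

m      = size(A,1);
SigVec = (eye(m^2) - kron(A,A)) \ Om_tilde(:);   % solves vec(Sigma) = (A⊗A)vec(Sigma) + vec(Omega)
Sig    = reshape(SigVec, m, m);                  % Sigma_tilde
Gamma0 = Sig(1:n, 1:n);                          % variance of the original Y_t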
The jth autocovariance of Yt (denoted Γ̃j) can be found by post-multiplying (7) by Y_{t−j}′ and taking expectations:
E(YtY_{t−j}′) = AE(Yt−1Y_{t−j}′) + E(etY_{t−j}′)
Thus
Γ̃j = AΓ̃j−1
or
Γ̃j = A^jΓ̃0 = A^jΣ̃
The autocovariances Γj of the original series Yt are given by the first n rows and columns of Γ̃j and satisfy
Γh = A1Γh−1 + A2Γh−2 + ... + ApΓh−p
known as the Yule-Walker equations.
24
6. VAR: ESTIMATION AND HYPOTHESIS TESTING
This part is based on the Hamilton textbook.
1
1 Conditional Likelihood
Consider the VAR(p), Yt = c + A1Yt−1 + A2Yt−2 + ... + ApYt−p + εt, with εt ∼ i.i.d. N(0, Ω). Suppose we have a sample of T + p observations on these variables. Conditioning on the first p observations we can form the conditional likelihood
f(YT, YT−1, ..., Y1|Y0, Y−1, ..., Y−p+1; θ)   (2)
where θ is a vector containing all the parameters of the model. We refer to (2) as the ”conditional likelihood function”.
The joint density of observations 1 through t conditioned on Y0, ..., Y−p+1 satisfies
f (Yt, Yt−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ) = f (Yt−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ)
×f (Yt|Yt−1, ..., Y1, Y0, Y−1, ..., Y−p+1, θ)
Applying the formula recursively, the likelihood for the full sample is the product of the individual conditional densities
f(YT, YT−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ) = Π_{t=1}^T f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ)   (3)
2
At each t, conditional on the values of Y through date t − 1
Yt|Yt−1, Yt−2, ..., Y−p+1 ∼ N (c + A1Yt−1 + A2Yt−2 + ... + ApYt−p, Ω)
Recall that
Xt = [1, Y_{t−1}′, Y_{t−2}′, ..., Y_{t−p}′]′
is an (np + 1) × 1 vector, and let Π′ = [c, A1, A2, ..., Ap] be an n × (np + 1) matrix of coefficients. Using this notation we have that
Yt|Yt−1, Yt−2, ..., Y−p+1 ∼ N(Π′Xt, Ω)
Thus the conditional density of the tth observation is
f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ) = (2π)^{−n/2} |Ω^{-1}|^{1/2} exp[(−1/2)(Yt − Π′Xt)′Ω^{-1}(Yt − Π′Xt)]   (4)
The sample log-likelihood is found by substituting (4) into (3) and taking logs:
L(θ) = Σ_{t=1}^T log f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ)
= −(Tn/2) log(2π) + (T/2) log|Ω^{-1}| − (1/2) Σ_{t=1}^T (Yt − Π′Xt)′Ω^{-1}(Yt − Π′Xt)   (5)
4
2 Maximum Likelihood Estimate (MLE) of Π
The MLE of the jth row of Π′ is
π̂_{jT} = [Σ_{t=1}^T XtX_t′]^{-1} [Σ_{t=1}^T XtYjt]
which is the estimated coefficient vector from an OLS regression of Yjt on Xt. Thus the MLE estimates for equation j are found by an OLS regression of Yjt on a constant and p lags of all the variables in the system.
We can verify that Π̂′_MLE = Π̂′_OLS. To verify this, rewrite the last term in the log-likelihood as
Σ_{t=1}^T (Yt − Π′Xt)′Ω^{-1}(Yt − Π′Xt) = Σ_{t=1}^T [(Yt − Π̂′Xt + Π̂′Xt − Π′Xt)′Ω^{-1}(Yt − Π̂′Xt + Π̂′Xt − Π′Xt)]
= Σ_{t=1}^T [ε̂t + (Π̂′ − Π′)Xt]′Ω^{-1}[ε̂t + (Π̂′ − Π′)Xt]
5
= Σ_{t=1}^T ε̂_t′Ω^{-1}ε̂t + 2 Σ_{t=1}^T ε̂_t′Ω^{-1}(Π̂′ − Π′)Xt + Σ_{t=1}^T X_t′(Π̂′ − Π′)′Ω^{-1}(Π̂′ − Π′)Xt
The term Σ_{t=1}^T ε̂_t′Ω^{-1}(Π̂′ − Π′)Xt is a scalar, so that
Σ_{t=1}^T ε̂_t′Ω^{-1}(Π̂′ − Π′)Xt = tr[Σ_{t=1}^T ε̂_t′Ω^{-1}(Π̂′ − Π′)Xt] = tr[Σ_{t=1}^T Ω^{-1}(Π̂′ − Π′)Xtε̂_t′] = tr[Ω^{-1}(Π̂′ − Π′) Σ_{t=1}^T Xtε̂_t′]
But Σ_{t=1}^T Xtε̂_t′ = 0 by construction, since the regressors are orthogonal to the OLS residuals, so that we have
Σ_{t=1}^T (Yt − Π′Xt)′Ω^{-1}(Yt − Π′Xt) = Σ_{t=1}^T ε̂_t′Ω^{-1}ε̂t + Σ_{t=1}^T X_t′(Π̂′ − Π′)′Ω^{-1}(Π̂′ − Π′)Xt
6
Given that Ω is positive definite, so is Ω^{-1}; thus the smallest value of
Σ_{t=1}^T X_t′(Π̂′ − Π′)′Ω^{-1}(Π̂′ − Π′)Xt
is achieved by setting Π = Π̂, i.e. the log-likelihood is maximized at Π = Π̂. This establishes the claim that the MLE coincides with the OLS estimator, which in stacked form reads
Π̂ = (X′X)^{-1}X′Y
7
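A minimal MATLAB sketch of the equation-by-equation OLS/ML estimation (Y is an assumed T × n data matrix and p an assumed lag length):

[T, n] = size(Y);
X = ones(T-p, 1);
for j = 1:p
    X = [X, Y(p+1-j:T-j, :)];            % regressors: constant, Y_{t-1}', ..., Y_{t-p}'
end
B  = X \ Y(p+1:T, :);                    % column j = OLS coefficients of equation j
E  = Y(p+1:T, :) - X*B;                  % residuals
Om = (E'*E)/(T-p);                       % ML estimate of Omega (no degrees-of-freedom correction)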
3 MLE of Ω
Let X be an n × 1 vector and let A be a nonsymmetric and unrestricted matrix. Consider the quadratic
form X 0AX.
8
3.2 The estimator
Ω̂ = (1/T) Σ_{t=1}^T ε̂tε̂_t′   (9)
t=1
9
4 Asymptotic distribution of Π̂
Maximum likelihood estimates are consistent even if the true innovations are non-Gaussian. The asymptotic properties of the MLE are summarized in the following proposition.
Let Yt = c + A1Yt−1 + ... + ApYt−p + εt, where εt is i.i.d. with mean zero and variance Ω and E(εitεjtεltεmt) < ∞ for all i, j, l, m, and where the roots of
|In − A1z − A2z^2 − ... − Apz^p| = 0
lie outside the unit circle. Let k = np + 1 and let Xt be the k × 1 vector Xt = [1, Y_{t−1}′, ..., Y_{t−p}′]′.
Let π̂T = vec(Π̂T) denote the nk × 1 vector of coefficients resulting from OLS regressions of each element of Yt on Xt for a sample of size T:
π̂T = [π̂_{1T}′, π̂_{2T}′, ..., π̂_{nT}′]′
10
where
π̂_{iT} = [Σ_{t=1}^T XtX_t′]^{-1} [Σ_{t=1}^T XtYit]
and let π denote the vector of corresponding population coefficients. Finally, let
Ω̂ = (1/T) Σ_{t=1}^T ε̂tε̂_t′
where ε̂t = [ε̂1t, ε̂2t, ..., ε̂nt]′ and ε̂it = Yit − X_t′π̂_{iT}.
Then:
(a) (1/T) Σ_{t=1}^T XtX_t′ →p Q, where Q = E(XtX_t′)
(b) π̂T →p π
(c) Ω̂ →p Ω
(d) √T(π̂T − π) →L N(0, Ω ⊗ Q^{-1})
Notice that result (d) implies that
√T(π̂_{iT} − πi) →L N(0, σ_i^2 Q^{-1})
11
where σ_i^2 is the variance of the error term of the ith equation; σ_i^2 is consistently estimated by (1/T) Σ_{t=1}^T ε̂_{it}^2, and Q^{-1} is consistently estimated by [(1/T) Σ_{t=1}^T XtX_t′]^{-1}. Therefore we can test hypotheses on the coefficients of each equation in the usual way.
12
5 Number of lags
As in the univariate case, care must be taken to account for all systematic dynamics in multivariate
models. In VAR models, this is usually done by choosing a sufficient number of lags to ensure that
the residuals in each of the equations are white noise.
AIC: Akaike information criterion. Choose the p that minimizes
AIC(p) = log|Ω̂(p)| + (2/T) pn^2
BIC: Bayesian information criterion. Choose the p that minimizes
BIC(p) = log|Ω̂(p)| + (log T/T) pn^2
HQ: Hannan-Quinn information criterion. Choose the p that minimizes
HQ(p) = log|Ω̂(p)| + (2 log log T/T) pn^2
where Ω̂(p) is the ML estimate of Ω from a VAR(p). The p̂ obtained using BIC and HQ are consistent, while the one obtained with AIC is not.
13
AIC overestimates the true order with positive probability and underestimates the true order with zero probability.
Suppose a VAR(p) is fitted to Y1, ..., YT (Yt not necessarily stationary). In small samples the following relations hold:
p̂BIC ≤ p̂AIC if T ≥ 8
p̂BIC ≤ p̂HQ for all T
p̂HQ ≤ p̂AIC if T ≥ 16
14
6 Testing: Wald Test
A general hypothesis of the form Rπ = r, involving coefficients across different equations can be tested
using the Wald form of the χ2 test seen in the first part of the course. Result (d) of Proposition 11.1H
establishes that
√T(Rπ̂T − r) →L N(0, R(Ω ⊗ Q^{-1})R′)   (10)
The following proposition establishes a useful result for testing.
Proposition (3.5L). Suppose (10) holds, Ω̂ →p Ω and (1/T) Σ_{t=1}^T XtX_t′ →p Q, with Q and Ω both nonsingular. Then, under H0: Rπ = r,
T(Rπ̂T − r)′[R(Ω̂ ⊗ Q̂^{-1})R′]^{-1}(Rπ̂T − r) →L χ^2(m)
where m is the number of restrictions (rows of R) and Q̂ = (1/T) Σ_{t=1}^T XtX_t′.
15
7 Testing: Likelihood ratio Test
L(Ω̂, Π̂) = −(Tn/2) log(2π) + (T/2) log|Ω̂^{-1}| − (Tn/2)   (12)
Suppose we want to test the null hypothesis that a set of variables was generated by a VAR with p0 lags against the alternative specification with p1 > p0 lags. Let Ω̂0 = (1/T) Σ_{t=1}^T ε̂t(p0)ε̂t(p0)′, where ε̂t(p0) is the residual estimated in the VAR(p0); the corresponding maximized log-likelihood is L0 = −(Tn/2) log(2π) + (T/2) log|Ω̂0^{-1}| − (Tn/2). Let Ω̂1 = (1/T) Σ_{t=1}^T ε̂t(p1)ε̂t(p1)′, where ε̂t(p1) is the residual estimated in the VAR(p1), with log-likelihood L1 defined analogously. The likelihood-ratio statistic is
2(L1 − L0) = T(log|Ω̂0| − log|Ω̂1|)
Under the null hypothesis, this asymptotically has a χ2 distribution with degrees of freedom equal to
the number of restriction imposed under H0. Each equation in the restricted model has n(p1 − p0)
restrictions, in total n2(p1 − p0). Thus is asymptotically χ2 with n2(p1 − p0) degrees of freedom.
17
Example. Suppose n = 2, p0 = 3, p1 = 4, T = 46. Let (1/T) Σ_{t=1}^T [ε̂(p0)1t]^2 = 2, (1/T) Σ_{t=1}^T [ε̂(p0)2t]^2 = 2.5 and (1/T) Σ_{t=1}^T ε̂(p0)1tε̂(p0)2t = 1. Then
Ω̂0 = [ 2.0  1.0 ; 1.0  2.5 ]
for which log|Ω̂0| = log 4 = 1.386. Moreover
Ω̂1 = [ 1.8  0.9 ; 0.9  2.2 ]
for which log|Ω̂1| = 1.147. Then the LR statistic is 46 × (1.386 − 1.147) = 10.99.
The degrees of freedom for this test are 2^2(4 − 3) = 4. Since 10.99 > 9.49 (the 5% critical value for a χ^2_4 variable), the null hypothesis is rejected.
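A minimal MATLAB sketch reproducing the numbers of this example (the 5% critical value 9.49 is the one quoted in the text):

T    = 46;
Om0  = [2.0 1.0; 1.0 2.5];                    % restricted model, p0 = 3
Om1  = [1.8 0.9; 0.9 2.2];                    % unrestricted model, p1 = 4
LR   = T*(log(det(Om0)) - log(det(Om1)));     % about 10.99
crit = 9.49;                                  % 5% critical value of a chi-square with 4 df
reject = LR > crit;                           % true: the null of p0 = 3 lags is rejected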
18
8 Granger Causality
Granger causality. If a scalar y cannot help in forecasting x we say that y does not Granger cause x.
y fails to Granger cause x if for all s > 0 the mean squared error of a forecast of xt+s based on
(xt, xt−1, ...) is the same as the MSE of a forecast of xt+s based on (xt, xt−1, ...) and (yt, yt−1, ...).
If we restrict ourselves to linear functions, y fails to Granger-cause x if

$$MSE\left[\hat{E}(x_{t+s}\,|\,x_t, x_{t-1}, \ldots)\right] = MSE\left[\hat{E}(x_{t+s}\,|\,x_t, x_{t-1}, \ldots, y_t, y_{t-1}, \ldots)\right] \qquad (16)$$

where Ê(x|y) is the linear projection of the vector x on the vector y, i.e. the linear function α'y
satisfying E[(x − α'y)y] = 0. We also say that x is exogenous in the time series sense with respect
to y if (16) holds.
19
8.1 Granger Causality in Bivariate VAR
In a bivariate VAR, Y2t fails to Granger-cause Y1t if and only if the Wold representation is lower
triangular,

$$\begin{bmatrix} Y_{1t} \\ Y_{2t} \end{bmatrix} = \begin{bmatrix} C_{11}(L) & 0 \\ C_{21}(L) & C_{22}(L) \end{bmatrix} \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \qquad (19)$$

that is, the second Wold shock has no effect on the first variable. This is easy to show by deriving
the Wold representation, i.e. by inverting the VAR polynomial matrix.
21
8.2 Econometric test for Granger Causality
The simplest approach to testing for Granger causality in an autoregressive framework is to estimate
the bivariate VAR with p lags by OLS and test the null hypothesis $H_0: A_{12}^{(1)} = A_{12}^{(2)} = \ldots = A_{12}^{(p)} = 0$
with an F-test based on

$$S_1 = \frac{(RSS_0 - RSS_1)/p}{RSS_1/(T - 2p - 1)}$$

where RSS0 and RSS1 are the residual sums of squares of the restricted and unrestricted equations
for Y1t, rejecting if S1 > F(α, p, T − 2p − 1). An asymptotically equivalent test is

$$S_2 = \frac{T(RSS_0 - RSS_1)}{RSS_1}$$

rejecting if S2 > χ²(α, p).
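A sketch of the F-test in MATLAB, assuming y1 and y2 are column vectors holding the two series and
p is the lag length (the function name granger_test is illustrative; fcdf requires the Statistics Toolbox):

% Sketch: test H0 "y2 does not Granger cause y1" in a bivariate VAR(p).
function [S1, pval] = granger_test(y1, y2, p)
T = length(y1) - p;                                     % effective sample size
X0 = ones(T,1); X1 = ones(T,1);
for j = 1:p
    X0 = [X0, y1(p+1-j:end-j)];                         % own lags only (restricted)
    X1 = [X1, y1(p+1-j:end-j), y2(p+1-j:end-j)];        % own lags + lags of y2
end
y  = y1(p+1:end);
e0 = y - X0*((X0'*X0)\(X0'*y));   RSS0 = e0'*e0;        % restricted residuals
e1 = y - X1*((X1'*X1)\(X1'*y));   RSS1 = e1'*e1;        % unrestricted residuals
S1 = ((RSS0 - RSS1)/p) / (RSS1/(T - 2*p - 1));          % F statistic
pval = 1 - fcdf(S1, p, T - 2*p - 1);                    % reject if pval < alpha
end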
22
8.3 Application 1: Output Growth and the Yield Curve
• Many research papers have found that the yield curve (the difference between long and short yields)
has been a good predictor, i.e. a variable that helps to forecast, of real GDP growth in the US (Estrella,
2000, 2005). However, recent evidence suggests that its predictive power has declined since the begin-
ning of the 80s (see D'Agostino, Giannone and Surico, 2006). This means we should find that the
yield curve Granger causes output growth before the mid 80s but not after.
• We estimate a bivariate VAR for the growth rate of real GDP and the difference between
the 10-year rate and the federal funds rate. Data are from FRED II (St. Louis Fed) and span
1954:III-2007:III. The AIC criterion suggests p = 6.
23
Figure 1: Blue: real GDP growth rate; green: long-short spread.
24
Table 1: F-Tests of Granger Causality, for the samples 1954:IV-2007:III, 1954:IV-1990:I and
1990:I-2007:III.
We cannot reject the hypothesis that the spread does not Granger cause real output growth in the
last subsample, while we reject the hypothesis in the other samples. This can be explained by a change
in the information content of private agents' expectations, which is the information embedded in the
yield curve.
25
8.4 Application 2: Money, Income and Causality
In the 1950s and 1960s there was a big debate about the importance of money and monetary policy.
Does money affect output? For Friedman and the monetarists, yes. For Keynesians (Tobin), no: output
was affecting money, and movements in the money stock reflected movements in money demand. Sims
in 1972 ran a test in order to distinguish between the two views. He found that money was Granger-
causing output but not the reverse, providing evidence in favor of the monetarist view.
26
Figure 2: Blue: real GNP growth rate; green: M1 growth rate.
27
Table 2: F-Tests of Granger Causality

                     1959:II-1972:I    1959:II-2007:III
M → Y                    4.4440             2.2699
Y → M                    0.5695             3.5776
10% critical value       2.0948             1.7071
5% critical value        2.6123             1.9939
1% critical value        3.8425             2.6187
In the first sample money Granger-causes output (at 5%) but not the converse (Sims (1972)'s result). In
the second sample, at the 5% level, output Granger-causes money and money Granger-causes output.
28
8.5 Caveat: Granger Causality Tests and Forward Looking Behavior
Let us consider the following simple model of stock price determination, where Pt is the price of one
share of a stock, Dt+1 are dividends paid at t + 1 and r is the rate of return of the stock:
(1 + r)Pt = Et(Dt+1 + Pt+1)
According to the theory the stock price incorporates the market's best forecast of the present value of
future dividends. Solving forward we have

$$P_t = E_t\sum_{j=1}^{\infty}\left(\frac{1}{1+r}\right)^j D_{t+j}$$
Suppose
Dt = d + ut + δut−1 + vt
where ut, vt are Gaussian WN and d is the mean dividend. The forecast of Dt+j based on this
information is

$$E_t(D_{t+j}) = \begin{cases} d + \delta u_t & \text{for } j = 1 \\ d & \text{for } j = 2, 3, \ldots \end{cases}$$
Substituting in the stock price equation we have
Pt = d/r + δut/(1 + r)
Thus the price (net of its constant mean d/r) is white noise and could not be forecast on the basis of
lagged stock prices or dividends: no series should Granger cause prices. The value of ut−1 can however
be uncovered from the lagged stock price,

$$\delta u_{t-1} = (1 + r)P_{t-1} - (1 + r)d/r$$
The bivariate VAR takes the form

$$\begin{bmatrix} P_t \\ D_t \end{bmatrix} = \begin{bmatrix} d/r \\ -d/r \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ (1+r) & 0 \end{bmatrix}\begin{bmatrix} P_{t-1} \\ D_{t-1} \end{bmatrix} + \begin{bmatrix} \delta u_t/(1+r) \\ u_t + v_t \end{bmatrix} \qquad (20)$$
Granger causation runs in the opposite direction from the true causation. Dividends fail to Granger
cause prices even though expected dividends are the only determinant of prices. On the other hand
prices Granger cause dividends even though this is not the case in the true model.
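A small simulation makes the point concrete (parameter values are illustrative; the sketch reuses the
hypothetical granger_test function from above): the tests should indicate that prices Granger-cause
dividends while dividends do not Granger-cause prices, the opposite of the economic causation.

% Sketch: simulate the present-value model and run Granger tests in both directions.
rng(1); Tsim = 500;
r = 0.05; d = 1; delta = 0.8;
u = randn(Tsim,1); v = randn(Tsim,1);
P = d/r + delta*u/(1+r);                      % P_t = d/r + delta*u_t/(1+r)
D = d + u + delta*[0; u(1:end-1)] + v;        % D_t = d + u_t + delta*u_{t-1} + v_t
[S_DP, p_DP] = granger_test(P, D, 1);         % H0: D does not Granger cause P (should not reject)
[S_PD, p_PD] = granger_test(D, P, 1);         % H0: P does not Granger cause D (should reject)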
30
8.6 Granger Causality in a Multivariate Context
Suppose now we are interested in testing for Granger causality in a multivariate (n > 2) context. Let
us consider the following representation of a VAR(p)
$$\begin{aligned} \tilde{Y}_{1t} &= \tilde{A}_1\tilde{X}_{1t} + \tilde{A}_2\tilde{X}_{2t} + \tilde{\varepsilon}_{1t} \\ \tilde{Y}_{2t} &= \tilde{B}_1\tilde{X}_{1t} + \tilde{B}_2\tilde{X}_{2t} + \tilde{\varepsilon}_{2t} \end{aligned} \qquad (21)$$

where Ỹ1t and Ỹ2t are two vectors containing respectively n1 and n2 variables of Yt. Let

$$\tilde{X}_{1t} = \begin{bmatrix} \tilde{Y}_{1t-1} \\ \tilde{Y}_{1t-2} \\ \vdots \\ \tilde{Y}_{1t-p} \end{bmatrix} \qquad \tilde{X}_{2t} = \begin{bmatrix} \tilde{Y}_{2t-1} \\ \tilde{Y}_{2t-2} \\ \vdots \\ \tilde{Y}_{2t-p} \end{bmatrix}$$
Ỹ1t is said to be block exogenous in the time series sense with respect to Ỹ2t if the elements in Ỹ2t are
of no help in improving the forecast of any variable in Ỹ1t. Ỹ1t is block exogenous if Ã2 = 0.
In order to test block exogeneity we can proceed as follows. First notice that the log likelihood
can be rewritten in terms of a conditional and a marginal log density
$$L(\theta) = \sum_{t=1}^{T}\ell_{1t} + \sum_{t=1}^{T}\ell_{2t}$$

where
$$\begin{aligned} \ell_{1t} &= \log f(\tilde{Y}_{1t}\,|\,\tilde{X}_{1t}, \tilde{X}_{2t}, \theta) \\ &= -\frac{n_1}{2}\log(2\pi) - \frac{1}{2}\log|\Omega_{11}| - \frac{1}{2}(\tilde{Y}_{1t} - \tilde{A}_1\tilde{X}_{1t} - \tilde{A}_2\tilde{X}_{2t})'\,\Omega_{11}^{-1}\,(\tilde{Y}_{1t} - \tilde{A}_1\tilde{X}_{1t} - \tilde{A}_2\tilde{X}_{2t}) \end{aligned}$$

$$\begin{aligned} \ell_{2t} &= \log f(\tilde{Y}_{2t}\,|\,\tilde{Y}_{1t}, \tilde{X}_{1t}, \tilde{X}_{2t}, \theta) \\ &= -\frac{n_2}{2}\log(2\pi) - \frac{1}{2}\log|H| - \frac{1}{2}(\tilde{Y}_{2t} - \tilde{D}_0\tilde{Y}_{1t} - \tilde{D}_1\tilde{X}_{1t} - \tilde{D}_2\tilde{X}_{2t})'\,H^{-1}\,(\tilde{Y}_{2t} - \tilde{D}_0\tilde{Y}_{1t} - \tilde{D}_1\tilde{X}_{1t} - \tilde{D}_2\tilde{X}_{2t}) \end{aligned}$$
where D̃0Ỹ1t + D̃1X̃1t + D̃2X̃2t and H represent respectively the mean and the variance of Ỹ2t
conditioning also on Ỹ1t.
Consider the maximum likelihood estimation of the system subject to the constraint Ã2 = 0,
giving estimates Â1(0), Ω̂11(0), D̂0, D̂1, D̂2, Ĥ. Now consider the unrestricted maximum likelihood
estimation of the system, providing the estimates Â1, Â2, Ω̂11, D̂0, D̂1, D̂2, Ĥ. The likelihood functions
evaluated at the MLE in the two cases differ only through the Ω̂11 terms, so the likelihood ratio statistic
for the null Ã2 = 0 is

$$T\left(\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\right)$$
32
In practice we perform OLS regressions of each of the elements in Ỹ1t on p lags of all the elements
in Ỹ1t and all the elements in Ỹ2t. Let ε̂1t be the vector of sample residuals and Ω̂11 their estimated
variance-covariance matrix. Next we perform OLS regressions of each element of Ỹ1t on p lags of Ỹ1t
only. Let ε̂1t(0) be the vector of sample residuals and Ω̂11(0) their estimated variance-covariance matrix.
If T(log|Ω̂11(0)| − log|Ω̂11|) is greater than the 5% critical value of a χ²(n1n2p), then the null hypothesis
is rejected and the conclusion is that some elements in Ỹ2t are important for forecasting Ỹ1t.
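A sketch of the block-exogeneity test in MATLAB (the function name and inputs are illustrative; Y1
and Y2 are T x n1 and T x n2 data matrices):

% Sketch: LR test of block exogeneity of Y1 with respect to Y2 in a VAR(p).
function [LR, df] = block_exog_test(Y1, Y2, p)
[Tfull, n1] = size(Y1);  n2 = size(Y2, 2);  T = Tfull - p;
Xr = ones(T,1); Xu = ones(T,1);
for j = 1:p
    Xr = [Xr, Y1(p+1-j:Tfull-j, :)];                          % lags of Y1 only
    Xu = [Xu, Y1(p+1-j:Tfull-j, :), Y2(p+1-j:Tfull-j, :)];    % lags of Y1 and Y2
end
Ylhs = Y1(p+1:Tfull, :);
Er = Ylhs - Xr*((Xr'*Xr)\(Xr'*Ylhs));   Om_r = (Er'*Er)/T;    % restricted
Eu = Ylhs - Xu*((Xu'*Xu)\(Xu'*Ylhs));   Om_u = (Eu'*Eu)/T;    % unrestricted
LR = T*(log(det(Om_r)) - log(det(Om_u)));                     % test statistic
df = n1*n2*p;                    % compare LR with a chi-squared(df) critical value
end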
33
7. STRUCTURAL VAR: THEORY
1
1 Structural Vector Autoregressions
Impulse response functions are interpreted under the assumption that all the other shocks are held
constant. However, in the Wold representation the shocks are not orthogonal, so the assumption is
not very realistic!
This is why we need Structural VARs in order to perform policy analysis. Ideally we would like to
have
1) orthogonal shocks;
2) shocks with economic meaning (technology, demand, labor supply, monetary policy, etc.).
Two standard ways of orthogonalizing the innovations are:
1) the Cholesky decomposition;
2) the spectral decomposition.
2
1.2 Cholesky decomposition
Let us consider the matrix Ω. The Cholesky factor, S, of Ω is defined as the unique lower triangular
matrix such that SS' = Ω. This implies that we can rewrite the VAR in terms of orthogonal shocks
$\eta_t = S^{-1}\varepsilon_t$ with identity covariance matrix:

$$A(L)Y_t = S\eta_t$$

$$Y_t = C(L)S\eta_t = \sum_{j=0}^{\infty} C_j S\eta_{t-j} \qquad (1)$$
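In MATLAB the Cholesky factor and the orthogonalized impulse responses can be computed as in the
sketch below (Phi and Omega stand for an already-estimated set of VAR coefficient matrices and the
innovation covariance; chol returns an upper triangular factor unless the 'lower' option is used):

% Sketch: orthogonalized (Cholesky) impulse responses of a VAR(p).
% Phi is n x n x p (VAR coefficient matrices), Omega the innovation covariance.
H = 20;                                   % number of horizons
n = size(Omega,1);  p = size(Phi,3);
S = chol(Omega, 'lower');                 % S*S' = Omega
C = zeros(n, n, H+1);  C(:,:,1) = eye(n); % Wold coefficients C_0, C_1, ...
for j = 1:H
    for i = 1:min(j, p)
        C(:,:,j+1) = C(:,:,j+1) + Phi(:,:,i)*C(:,:,j+1-i);   % C_j = sum_i Phi_i C_{j-i}
    end
end
IRF = zeros(n, n, H+1);
for j = 0:H
    IRF(:,:,j+1) = C(:,:,j+1)*S;          % response of each variable to each eta shock
end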
3
1.3 Spectral Decomposition
Let V be a matrix containing the eigenvectors of Ω and Λ a diagonal matrix with the eigenvalues
of Ω on the main diagonal. Then we have that V ΛV' = Ω. This implies that we can rewrite the
VAR in terms of orthogonal shocks $\xi_t = (V\Lambda^{1/2})^{-1}\varepsilon_t$ with identity covariance matrix:

$$A(L)Y_t = V\Lambda^{1/2}\xi_t$$

$$Y_t = C(L)V\Lambda^{1/2}\xi_t = \sum_{j=0}^{\infty} C_j V\Lambda^{1/2}\xi_{t-j} \qquad (3)$$
4
Problem: what is the economic interpretation of the orthogonal shocks? What is the economic
information contained in the impulse response functions to orthogonal shocks?
5
1.4 The Class of Orthonormal Representations
From the class of invertible MA representations of Yt we can derive the class of orthonormal represen-
tations, i.e. the class of representations of Yt in terms of orthonormal shocks. Let H be any orthogonal
matrix, i.e. HH' = H'H = I. Defining $w_t = (SH)^{-1}\varepsilon_t$ we can recover the general class of
orthonormal representations of Yt:

$$Y_t = C(L)SHw_t = F(L)w_t$$
6
2 The Identification Problem
• Identifying the VAR means fixing a particular matrix H, i.e. choosing one particular representation
of Yt in order to recover the structural shocks from the VAR innovations
• Therefore structural economic shocks are linear combinations of the VAR innovations.
• In order to choose a matrix H we have to fix n(n − 1)/2 parameters, since there is a total of
n² parameters and a total of n(n + 1)/2 restrictions implied by orthonormality.
• Use economic theory in order to derive restrictions on the effects of some shocks on particular
variables to fix the remaining n(n − 1)/2 parameters.
7
2.1 Zero restrictions: contemporaneous restrictions
Example. Let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix. n(n+1)/2 = 3
are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1 free parameter.
Suppose that theory tells us that shock 2 has no effect on impact (contemporaneously) on Y1,
that is F0,12 = 0. This is the additional restriction that allows us to identify the shocks. In
particular we will have the following restrictions:

$$HH' = I$$

$$S_{11}H_{12} + S_{12}H_{22} = 0$$

• A common identification scheme is the Cholesky scheme (as in this case). This amounts to setting
H = I. Such an identification scheme creates a recursive contemporaneous ordering among variables
since S⁻¹ is triangular.
• This means that any variable in the vector Yt does not depend contemporaneously on the variables
ordered after it.
8
• Results depend on the particular ordering of the variables.
9
2.2 Zero restrictions: long run restrictions
• An identification scheme based on zero long-run restrictions imposes restrictions on the matrix
F(1) = F0 + F1 + F2 + ..., the matrix of the long-run coefficients.
Example. Again let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix.
n(n + 1)/2 = 3 are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1
free parameter. Suppose that theory tells us that shock 2 does not affect Y1 in the long run, i.e.
F12(1) = 0. This is the additional restriction that allows us to identify the shocks. In particular we
will have the following restrictions:

$$HH' = I$$

$$D_{11}(1)H_{12} + D_{12}(1)H_{22} = 0$$

where D(1) = C(1)S represents the long-run effects of the Cholesky shocks.
10
2.3 Sign restrictions
• The previous two examples yield just-identification in the sense that the shock is uniquely identified:
there exists a unique matrix H yielding the structural shocks.
• Sign identification is based on qualitative restrictions involving the sign of the effects of some shocks
on some variables. In this case we will have sets of impulse response functions consistent with the
restrictions.
Example. Again let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix.
n(n + 1)/2 = 3 are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1
free parameter. Suppose that theory tells us that shock 2, which is the interesting one, produces a
positive effect on Y1 for k periods after the shock, Fj,12 > 0 for j = 1, ..., k. We will have the following
restrictions:

$$HH' = I$$

$$S_{11}H_{12} + S_{12}H_{22} > 0$$

$$D_{j,12}H_{12} + D_{j,22}H_{22} > 0 \qquad j = 1, \ldots, k$$

• In a classical statistics approach this does not deliver exact identification, since there can be many
H consistent with such restrictions. That is, for each parameter of the impulse response functions
we will have an admissible set of values.
• Increasing the number of restrictions can be helpful in reducing the number of H consistent with
such restrictions.
12
2.4 Parametrizing H
• A useful way to parametrize the matrix H so as to incorporate the orthonormality restrictions is to
use rotation matrices. Let us consider the bivariate case. A rotation matrix in this case is

$$H = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$$

• Note that such a matrix incorporates the orthonormality conditions. The parameter θ will be found
by imposing the additional economic restriction.
• In general the rotation matrix will be found as the product of n(n − 1)/2 rotation matrices.
For the case of three shocks the rotation matrix can be found as the product of the following three
matrices:

$$\begin{bmatrix} \cos\theta_1 & \sin\theta_1 & 0 \\ -\sin\theta_1 & \cos\theta_1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\theta_2 & 0 & \sin\theta_2 \\ 0 & 1 & 0 \\ -\sin\theta_2 & 0 & \cos\theta_2 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_3 & \sin\theta_3 \\ 0 & -\sin\theta_3 & \cos\theta_3 \end{bmatrix}$$
Example. Suppose that n = 2 and the restriction we want to impose is that the effect of the first
shock on the second variable has a positive sign, i.e.

$$S_{21}H_{11} + S_{22}H_{21} > 0$$

Using the parametrization seen before, the restriction becomes

$$S_{21}\cos\theta - S_{22}\sin\theta > 0$$

which, for θ ∈ (−π/2, π/2) so that cos θ > 0, implies

$$\tan\theta = \frac{\sin\theta}{\cos\theta} < \frac{S_{21}}{S_{22}}$$

If S21 = 0.5 and S22 = 1, then all the impulse response functions obtained with θ < arctan(0.5) satisfy
the restriction and should be kept.
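The example can be checked numerically with a grid of angles (the grid is illustrative):

% Sketch: which rotation angles satisfy S21*cos(theta) - S22*sin(theta) > 0 ?
S21 = 0.5;  S22 = 1;
theta = linspace(-pi/2, pi/2, 1001);             % grid of candidate angles
keep  = S21*cos(theta) - S22*sin(theta) > 0;     % the sign restriction
max(theta(keep))                                 % approximately atan(0.5) = 0.4636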
14
2.5 Partial Identification
• In many cases we might be interested in identifying just a single shock and not all the n shocks.
• Since the shocks are orthogonal we can also partially identify the model, i.e. fix just one (or a
subset of) column(s) of H. In this case what we have to do is fix n − 1 elements of the relevant
column of the identifying matrix; the additional restriction is that the norm of the column equals
one.
Example. Suppose n = 3. We want to identify a single shock using the restriction that the shock
has no effect on the first variable on impact, a positive effect on the second variable and a negative
effect on the third variable.
First of all we notice that the first column of the product of orthogonal matrices seen before is

$$H_1 = \begin{bmatrix} \cos\theta_1\cos\theta_2 \\ -\sin\theta_1\cos\theta_2 \\ -\sin\theta_2 \end{bmatrix}$$

therefore the impact effects of the first shock are given by

$$\begin{bmatrix} S_{11} & 0 & 0 \\ S_{21} & S_{22} & 0 \\ S_{31} & S_{32} & S_{33} \end{bmatrix}
\begin{bmatrix} \cos\theta_1\cos\theta_2 \\ -\sin\theta_1\cos\theta_2 \\ -\sin\theta_2 \end{bmatrix}$$

To implement the first restriction we can set θ1 = π/2, i.e. cos(θ1) = 0. This implies that the impact
effects become

$$\begin{bmatrix} S_{11} & 0 & 0 \\ S_{21} & S_{22} & 0 \\ S_{31} & S_{32} & S_{33} \end{bmatrix}
\begin{bmatrix} 0 \\ -\cos\theta_2 \\ -\sin\theta_2 \end{bmatrix}$$
16
2.6 Variance Decomposition
• The second type of analysis which is usually done in SVAR is the variance decomposition analysis.
• The idea is to decompose the total variance of a time series into the percentages attributable
to each structural shock.
• Variance decomposition analysis is useful in order to address questions like ”What are the sources
of the business cycle?” or ”Is the shock important for economic fluctuations?”.
17
Let us consider the MA representation of an identified SVAR
Yt = F (L)wt
18
It is also possible to study the fraction of the variance of the series explained by the shock at different
horizons, i.e. short vs. long run. Consider the forecast error in terms of structural shocks. The horizon-h
forecast error is given by

$$Y_{t+h} - Y_{t+h|t} = F_0 w_{t+h} + F_1 w_{t+h-1} + \ldots + F_{h-1} w_{t+1}$$

The variance of the forecast error of the ith variable is thus

$$\mathrm{var}(Y_{i,t+h} - Y_{i,t+h|t}) = \sum_{k=1}^{n}\sum_{j=0}^{h-1} F_{j,ik}^2\,\mathrm{var}(w_{kt}) = \sum_{k=1}^{n}\sum_{j=0}^{h-1} F_{j,ik}^2$$

since var(wkt) = 1. The share of this variance attributable to the kth structural shock is then
$\sum_{j=0}^{h-1} F_{j,ik}^2 \,\big/\, \mathrm{var}(Y_{i,t+h} - Y_{i,t+h|t})$.
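A sketch of the computation in MATLAB, assuming the structural impulse responses Fj = Cj S H are
already stored in an array F (the array name and the horizon are illustrative):

% Sketch: forecast error variance decomposition at horizon h.
% F is n x n x (H+1) with F(:,:,j+1) holding F_j; rows are variables, columns shocks.
h = 8;  n = size(F,1);
fevd = zeros(n, n);
for i = 1:n
    total = 0;
    for k = 1:n
        fevd(i,k) = sum(squeeze(F(i,k,1:h)).^2);   % sum_{j=0}^{h-1} F_{j,ik}^2
        total = total + fevd(i,k);
    end
    fevd(i,:) = fevd(i,:)/total;                   % shares sum to one across shocks
end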
19
8. STRUCTURAL VAR: APPLICATIONS
1
1 Monetary Policy Shocks (Christiano Eichenbaum and Evans, 1999 HoM)
• The monetary policy shock is the unexpected part of the equation for the monetary policy instrument
(St):

$$S_t = f(I_t) + w_t^{mp}$$

f(It) represents the systematic response of monetary policy to economic conditions, It is the
information set at time t and $w_t^{mp}$ is the monetary policy shock.
• The "standard" way to identify monetary policy shocks is through zero contemporaneous restric-
tions. Using the standard monetary VAR (a simplified version of the CEE 98 VAR) including output
growth, inflation and the federal funds rate, we identify the monetary policy shock using the following
restrictions:
1) Monetary policy shocks do not affect output within the same quarter
2) Monetary policy shocks do not affect inflation within the same quarter
• These two restrictions are not sufficient to identify all the shocks but are sufficient to identify
the monetary policy shock.
2
• A simple way to implement these restrictions is to take the Cholesky decomposition of the
variance-covariance matrix in a system in which the federal funds rate is ordered last. The last column
of the impulse response functions is then the column of the monetary policy shock.
3
Cholesky impulse response functions of a system with GDP, inflation and the federal funds rate. The
monetary shock is in the third column.
4
• Notice that after a monetary tightening inflation goes up, which is completely counterintuitive ac-
cording to the standard transmission mechanism. This phenomenon is known as the price puzzle.
Why is this the case?
• ”Sims (1992) conjectured that prices appeared to rise after certain measures of a contrac-
tionary policy shock because those measures were based on specifications of It that did not
include information about future inflation that was available to the Fed. Put differently, the
conjecture is that policy shocks which are associated with substantial price puzzles are actually
confounded with non-policy disturbances that signal future increases in prices.” (CEE 98)
• Sims shows that including commodity prices (signaling future inflation increases) may solve the
puzzle.
5
2 Uhlig (JME 2006) monetary policy shocks
• Uhlig proposes a very different method to identify monetary policy shocks. Instead of using zero
restrictions as in CEE, he uses sign restrictions.
• He identifies the effects of a monetary policy shock using restrictions which are implied by several
economic models.
• If we order the variables in the vector Yt as follows: GDP, inflation, money growth and the inter-
est rate, the restrictions imply F_{k,i1} < 0 for i = 2, 3 and F_{k,41} > 0 at the constrained horizons k.
6
• In order to draw impulse response functions he applies the following algorithm (a MATLAB sketch
follows the list):
1. He assumes that the relevant column of H, H1, represents the coordinates of a point uniformly
distributed over the unit hypersphere (in the case of a bivariate VAR, a point on the unit circle).
To draw such a point he draws from a N(0, I) and divides by the norm of the vector.
2. Compute the impulse response functions Cj S H1 for j = 1, ..., k.
3. If the draw satisfies the restrictions keep it and go to 1), otherwise discard it and go to 1). Repeat
1)-3) a large number of times L.
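A minimal sketch of the accept/reject loop, assuming the Cholesky responses Dj = Cj S are already
stored in an array D, the ordering is GDP, inflation, money growth, interest rate as above, and the
horizons and number of accepted draws are illustrative:

% Sketch: sign-restriction identification of a monetary policy shock a la Uhlig.
% D is n x n x (K+1) with D(:,:,j+1) = C_j*S (Cholesky impulse responses).
n = size(D,1);  K = 5;  L = 200;  accepted = {};
while numel(accepted) < L
    h1 = randn(n,1);  h1 = h1/norm(h1);          % uniform draw on the unit sphere
    resp = zeros(n, K+1);
    for j = 0:K
        resp(:,j+1) = D(:,:,j+1)*h1;             % responses to the candidate shock
    end
    ok = all(resp(2,:) <= 0) && all(resp(3,:) <= 0) && all(resp(4,:) >= 0);
    if ok                                        % inflation and money down, rate up
        accepted{end+1} = resp;                  % keep the draw, otherwise discard it
    end
end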
7
Source: What are the effects of a monetary policy shock... JME H. Uhlig (2006)
8
Source: What are the effects of a monetary policy shock... JME H. Uhlig (2006)
9
3 Blanchard Quah (AER 1989) aggregate demand and supply shocks
• Blanchard and Quah proposed an identification scheme based on long run restrictions.
• In their model there are two shocks: an aggregate demand and an aggregate supply disturbance.
• The restriction used to identify the shocks is that aggregate demand shocks have no effect on the
long-run level of output, i.e. demand shocks are transitory for output. The idea behind such a
restriction is the existence of a vertical long-run aggregate supply curve. The structural representation
is

$$\begin{bmatrix} \Delta Y_t \\ U_t \end{bmatrix} = F(L)\begin{bmatrix} w_t^s \\ w_t^d \end{bmatrix}$$

where Yt is output, Ut is the unemployment rate and $w_t^s$, $w_t^d$ are the aggregate supply and demand
disturbances respectively.
10
• The restriction can be implemented in the following way. Let us consider the Wold representation
of the reduced form VAR,

$$\begin{bmatrix} \Delta Y_t \\ U_t \end{bmatrix} = \begin{bmatrix} A_{11}(L) & A_{12}(L) \\ A_{21}(L) & A_{22}(L) \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}$$

where E(εtεt') = Ω. Let S be the lower triangular Cholesky factor of A(1)ΩA(1)' and define

$$w_t = K^{-1}\varepsilon_t, \qquad F(L) = A(L)K, \qquad K = A(1)^{-1}S$$

Then the matrix of long-run multipliers is

$$F(1) = A(1)K = A(1)A(1)^{-1}S = S$$

which is lower triangular, so the demand shock (ordered second) has no long-run effect on output, as
required. Moreover the shocks are orthogonal, since

$$E(w_t w_t') = K^{-1}\Omega\,(K^{-1})' = S^{-1}A(1)\,\Omega\,A(1)'(S^{-1})' = S^{-1}SS'(S^{-1})' = I$$

And

$$KK' = A(1)^{-1}SS'(A(1)^{-1})' = A(1)^{-1}A(1)\Omega A(1)'(A(1)^{-1})' = \Omega$$

so εt = Kwt is consistent with the reduced form covariance.
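A sketch of the implementation in MATLAB, assuming Phi contains the estimated VAR coefficient
matrices for (∆Yt, Ut) and Omega their innovation covariance (all names illustrative):

% Sketch: Blanchard-Quah long-run identification in a bivariate VAR(p).
n = 2;  p = size(Phi,3);
A1 = inv(eye(n) - sum(Phi,3));      % long-run Wold matrix A(1) = (I - Phi_1 - ... - Phi_p)^{-1}
S  = chol(A1*Omega*A1', 'lower');   % Cholesky factor of the long-run covariance
K  = A1\S;                          % impact matrix: K*K' = Omega and A(1)*K = S lower triangular
% Column 1 of K gives the impact effect of the supply shock, column 2 that of the
% demand shock; the (1,2) element of A(1)*K is zero: no long-run effect of demand on output.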
12
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):
13
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):
14
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):
15
4 The technology shocks and hours debate: Gali (AER 1999), Christiano, Eichenbaum and
Vigfusson (NBER WP, 2003)
This is a nice example of how SVAR models can be used in order to distinguish among competing
models of the business cycle.
16
4.1 The model
• Equilibrium:

$$\Delta x_t = \left(1 - \frac{1}{\varphi}\right)\Delta\xi_t + \left(\frac{1-\gamma}{\varphi} + \gamma\right)\eta_t + (1 - \gamma)\left(1 - \frac{1}{\varphi}\right)\eta_{t-1}$$

$$n_t = \frac{1}{\varphi}\,\xi_t - \frac{1-\gamma}{\varphi}\,\eta_t$$

or

$$\begin{bmatrix} \Delta x_t \\ n_t \end{bmatrix} = \begin{bmatrix} \frac{1-\gamma}{\varphi} + \gamma + (1 - \gamma)\left(1 - \frac{1}{\varphi}\right)L & \left(1 - \frac{1}{\varphi}\right)(1 - L) \\ -\frac{1-\gamma}{\varphi} & \frac{1}{\varphi} \end{bmatrix}\begin{bmatrix} \eta_t \\ \xi_t \end{bmatrix} \qquad (3)$$
17
Note the model's prediction: if monetary policy is not completely accommodative (γ < 1), then the
impact response of hours to a technology shock, −(1 − γ)/ϕ, is negative.
18
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WK (2003)
19
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WK (2003)
20
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WK (2003)
21
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)
22
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)
23
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)
24
5 Government spending shocks
• Understanding the effects of government spending shocks is important for policy authorities but
also to assess competing theories of the business cycle.
25
5.1 Government spending shocks: Blanchard and Perotti (QJE 2002)
• BP (originally) use a VAR for real per capita taxes, government spending and GDP, with the
restriction that government spending does not react to taxes and GDP contemporaneously: a Cholesky
identification with government spending ordered first. The government spending shock is the first
one (quadratic trend, four lags).
26
Source: IDENTIFYING GOVERNMENT SPENDING SHOCKS: IT’S ALL IN THE TIMING Valerie A. Ramey, QJE
27
5.2 Government spending shocks: Ramey and Shapiro (1998)
• Ramey and Shapiro (1998) use a narrative approach to identify shocks to government spending.
• Focus on episodes where Business Week suddenly forecast large rises in defense spending induced
by major political events that were unrelated to the state of the U.S. economy (exogenous episodes
of government spending).
• Three such episodes: the Korean War, the Vietnam War and the Carter-Reagan buildup, later
augmented with 9/11.
• The military date variable takes a value of unity in 1950:3, 1965:1, 1980:1, and 2001:3, and zeros
elsewhere.
• To identify government spending shocks, the military date variable is embedded in the standard
VAR, but ordered before the other variables.
• VARs: shocks are often anticipated (with fiscal foresight the shocks may not be invertible).
6 Monetary policy and the housing market: Jarocinski and Smets (2008)
• Strategy:
1. Estimate a VAR with nine variables (including the short-term interest rate, an interest rate
spread, the housing investment share of GDP, real GDP, real consumption, real house prices, the
price level, a commodity price index and a money indicator).
2. Identify the monetary policy shock using the restriction that the shock does not affect prices
and output contemporaneously but affects the short-term interest rate, the spread and the money
stock, and analyze the impulse response functions.
3. Shut down the identified shock and study the counterfactual path of housing prices over time.
29
Source: Jarocinki and Smets (2008)
30
Source: Jarocinki and Smets (2008)
31
Source: Jarocinki and Smets (2008)
32