
ECONOMETRICS Part II

PhD LBS

Luca Gambetti
UAB, Barcelona GSE

February-March 2014

1
Contacts

Prof.: Luca Gambetti


email: [email protected]
webpage: http://pareto.uab.es/lgambetti/

Description

This is an introductory course in Time Series Analysis with applications in macroeconomics.

Goal of the course

The main objective of the course is to provide students with a comprehensive set of tools for empirical research with time series data.

2
Contents

• Introduction
• ARMA models (2 sessions)
– Representation
– Estimation
– Forecasting
• VAR models (1 session)
– Representation
– Estimation
– Forecasting
• Structural VAR (2 sessions)
– Recursive identification
– Long-run identification
– Sign identification
– Applications

3
References

1. Brockwell, P. J. and R. A. Davis (2009), Time Series: Theory and Methods, Springer-Verlag: Berlin.
2. Hamilton, J. D. (1994), Time Series Analysis, Princeton University Press: Princeton.
3. Lütkepohl, H. (2005), New Introduction to Multiple Time Series Analysis, Springer-Verlag: Berlin.
4. Sargent, T. J. (1987), Macroeconomic Theory, Academic Press.

Grades

Take-Home Exam.

Econometric Software

MATLAB.

4
1. INTRODUCTION

5
1 What does a macroeconometrician do?

"Macroeconometricians do four things: describe and summarize macroeconomic data, make macroeconomic forecasts, quantify what we do or do not know about the true structure of the macroeconomy, and advise (and sometimes become) macroeconomic policymakers." Stock and Watson, JEP, 2001.

Except advising and becoming policymakers, this is what we are going to do in this course.

6
2 Preliminaries

2.1 Lag operators

• The lag operator L maps a sequence {xt} into a sequence {yt} such that yt = Lxt = xt−1, for all t.

• It can be applied repeatedly on a process, for instance L(L(Lxt)) = L3xt = xt−3.

• If we apply L to a constant c, Lc = c.

• Inversion: L−1 is the inverse of L, L−1xt = xt+1. It is such that L−1(L)xt = xt.

• The lag operator and multiplication operator are commutative L(βxt) = βLxt (β a constant).

• The lag operator is distributive over the addition operator L(xt + wt) = Lxt + Lwt

• Polynomials in the lag operator:


α(L) = α0 + α1L + α2L2 + ... + αpLp
is a polynomial in the lag operator of order p and is such that
α(L)xt = α0xt + α1xt−1 + ... + αpxt−p
with αi (i = 1, ..., p) constant.

7
• Absolutely summable (one-sided) filters. Let {αj}_{j=0}^∞ be a sequence of absolutely summable coefficients, i.e. Σ_{j=0}^∞ |αj| < ∞. We define the filter
α(L) = α0 + α1L + α2L² + ...
which gives
α(L)xt = α0xt + α1xt−1 + α2xt−2 + ...

• Square summable (one-sided) filters. Let {αj}_{j=0}^∞ be a sequence of square summable coefficients, i.e. Σ_{j=0}^∞ αj² < ∞. We define the filter
α(L) = α0 + α1L + α2L² + ...
which gives
α(L)xt = α0xt + α1xt−1 + α2xt−2 + ...

• Absolute summability implies square summability: if Σ_{j=0}^∞ |αj| < ∞ then Σ_{j=0}^∞ αj² < ∞.

• α(0) = α0.

• α(1) = α0 + α1 + α2 + ....

• α(L)(bxt + cwt) = α(L)bxt + α(L)cwt.

8
• α(L)xt + β(L)xt = (α(L) + β(L))xt.

• α(L)[β(L)xt] = β(L)[α(L)xt].

• Polynomial factorization: consider
α(L) = 1 + α1L + α2L² + ... + αpL^p.
Then
α(L) = (1 − λ1L)(1 − λ2L)...(1 − λpL)
where the λi (i = 1, ..., p) are the reciprocals of the roots of the polynomial in L.

• Lag polynomials can also be inverted. For a polynomial φ(L), we look for the coefficients αi of φ(L)^{−1} = α0 + α1L + α2L² + ... such that φ(L)^{−1}φ(L) = 1.

Example: p = 1. Let φ(L) = (1 − φL) with |φ| < 1. To find the inverse write
(1 − φL)(α0 + α1L + α2L² + ...) = 1
and note that all the coefficients on the non-zero powers of L must be equal to zero. This gives
α0 = 1
−φ + α1 = 0 ⇒ α1 = φ
−φα1 + α2 = 0 ⇒ α2 = φ²
−φα2 + α3 = 0 ⇒ α3 = φ³
9
and so on. In general αk = φ^k, so (1 − φL)^{−1} = Σ_{j=0}^∞ φ^j L^j, provided that |φ| < 1.

It is easy to check this because
(1 − φL)(1 + φL + φ²L² + ... + φ^k L^k) = 1 − φ^{k+1}L^{k+1}
so
(1 + φL + φ²L² + ... + φ^k L^k) = (1 − φ^{k+1}L^{k+1}) / (1 − φL)
and Σ_{j=0}^k φ^j L^j → 1/(1 − φL) as k → ∞.
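As an aside, this inversion is easy to verify numerically in MATLAB (the software used in this course). The sketch below is illustrative and not part of the original notes; the value φ = 0.6 and the truncation order k are arbitrary.

% Check numerically that (1 - phi*L)^(-1) = 1 + phi*L + phi^2*L^2 + ...
phi = 0.6;                       % any |phi| < 1 (illustrative value)
k   = 50;                        % truncation order
psi = phi.^(0:k);                % coefficients of the truncated inverse
check = conv([1 -phi], psi);     % product (1 - phi*L)*(truncated inverse)
disp(max(abs(check(2:end-1))))   % middle coefficients are exactly zero
disp(check(end))                 % last term is -phi^(k+1), vanishing as k grows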

10
Let T be an index which has an ordering relation defined on it. Thus, if t1, t2 ∈ T , then t1 ≤ t2 or
t1 > t2. Usually, T = R or T = Z, the sets of real or integer numbers, respectively.

Stochastic process. A stochastic process is a family of random variables {Xt, t ∈ T } indexed by


some set T .

A stochastic process can be discrete or continuous according to whether T is continuous, e.g., T = R,


or discrete, e.g., T = Z.

Time series. A time series is part of a realization of a stochastic process

Example 1 Let the index set be T = {1, 2, 3} and let the space of outcomes Ω be the possible outcomes associated with tossing a die:
Ω = {1, 2, 3, 4, 5, 6}.
Define Xt = t + (value on the die)² · t. Then for a particular ω, say ω = 3, the realization or path is (10, 20, 30), and this stochastic process has 6 different possible realizations (one for each value of the die).

2.2 Stationarity

This course will mainly focus on stationary processes. There are two definitions of stationarity: strict
and weak (or second order).

Strict Stationarity The time series Xt is said to be strictly stationary if the joint distributions of (Xt1, ..., Xtk)′ and (Xt1+h, ..., Xtk+h)′ are the same for all positive integers k and for all t1, ..., tk, h ∈ Z.

Interpretation: This means that the graphs over two equal-length time intervals of a realization
of the time series should exhibit similar statistical characteristics.

In order to define the concept of weak stationarity we first need to introduce the concept of au-
tocovariance function. This function is a measure of dependence between elements of the sequence
Xt .

The autocovariance function If Xt is a process such that Var(Xt) < ∞ for each t ∈ T, then the autocovariance function γ(r, s) of Xt is defined by
γ(r, s) = Cov(Xr, Xs) = E[(Xr − E(Xr))(Xs − E(Xs))].    (1)

16
Weak Stationarity The time series Xt is said to be weakly stationary if
(i) E|Xt|2 < ∞ for all t.
(ii) E(Xt) = µ for all t.
(iii) γ(r, s) = γ(r + t, s + t) for all t, r, s.
Notice that for a covariance stationary process γ(r, s) = γ(r − s, 0) = γ(h), where h = r − s.
In summary: weak stationarity means that the mean and the variance are finite and constant, and that the autocovariance function depends only on h, the distance between observations.

If γ(.) is the autocovariance function of a stationary process, then it satisfies


(i) γ(0) ≥ 0.
(ii) |γ(h)| ≤ γ(0) for all h ∈ Z.
(iii) γ(−h) = γ(h) for all h ∈ Z

Autocorrelation function, ACF For a stationary process Xt, the autocorrelation function at lag h is defined as
ρ(h) = γ(h)/γ(0) = Corr(Xt+h, Xt) for all t, h ∈ Z.

17
Partial autocorrelation function (PACF). The partial autocorrelation α(·) of a stationary time series is defined by
α(1) = Corr(X2, X1) = ρ(1)
and
α(k) = Corr(Xk+1 − P(Xk+1|1, X2, ..., Xk), X1 − P(X1|1, X2, ..., Xk)) for k ≥ 2.
An equivalent definition is that the k-th partial autocorrelation α(k) is the last coefficient in the linear projection of Yt on its k most recent values,
Yt = α1^{(k)}Yt−1 + α2^{(k)}Yt−2 + ... + αk^{(k)}Yt−k + εt,
so that α(k) = αk^{(k)}.

Strict stationarity implies weak stationarity, provided the first and second moments of the vari-
ables exist, but the converse of this statement is not true in general.

There is one important case in which the two concepts of stationarity are equivalent.

Gaussian Time series The process Xt is a Gaussian time series if and only if the joint den-
sity of (Xt1 , ..., Xtn )0 is Gaussian for all t1, t2, ..., tn

If Xt is a stationary Gaussian time series, then it is also strictly stationary, since for all n = {1, 2, ...}
and for all h, t1, t2, ... ∈ Z, the random vectors (Xt1 , ..., Xtn )0, and (Xt1+h, ..., Xtn+h)0 have the same
mean, and covariance matrix, and hence they have the same distribution.
18
2.3 Ergodicity

Consider a stationary process Xt with E(Xt) = µ for all t. Assume that we are interested in estimating µ. The standard approach for estimating the mean of a single random variable consists of computing its sample mean
X̄ = (1/N) Σ_{i=1}^N Xt^{(i)}
(we call this the ensemble average), where the Xt^{(i)} are different realizations of the variable Xt.

When working in a laboratory, one could generate different observations for the variable Xt un-
der identical conditions.

However, when analyzing economic variables over time, we can only observe a unique realization of each random variable Xt, so that it is not possible to estimate µ by computing the above average.

However, we can compute a time average
X̄ = (1/T) Σ_{t=1}^T Xt.

Whether time averages converge to the same limit as the ensemble average, E(Xt), has to do with the concept of ergodicity.
19
Ergodicity for the mean A covariance stationary process Xt is said to be ergodic for the mean if X̄ = (1/T) Σ_{t=1}^T Xt converges in probability to E(Xt) as T gets large.

Ergodicity for the second moments A covariance stationary process is said to be ergodic for the second moments if
[1/(T − j)] Σ_{t=j+1}^T (Xt − µ)(Xt−j − µ) →^p γj for all j.

Important result: A sufficient condition for ergodicity for the mean of a stationary process is that Σ_{h=0}^∞ |γ(h)| < ∞. If the process is also Gaussian, then the above condition also implies ergodicity for all higher moments.

For many applications, ergodicity and stationarity turn out to amount to the same requirements. However, we now present an example of a stationary process that is not ergodic.

20
Example Consider the following process: y0 = u0 with u0 ∼ N(0, σ²), and yt = yt−1 for t = 1, 2, 3, .... It is easy to see that the process is stationary. In fact,
E(yt) = 0, E(yt²) = σ², E(yt yt−j) = σ² for all j ≠ 0.
However,
(1/T) Σ_{t=1}^T yt = u0,
which is a random draw and does not converge to E(yt) = 0: the process is not ergodic for the mean.

21
2.4 Some processes

iid sequences The sequence εt is i.i.d with zero mean and variance σ 2 , written ε ∼ iid(0, σ 2), (inde-
pendent and identically distributed) if all the variables are independent and share the same univariate
distribution.

White noise The process εt is called white noise, written ε ∼ W N (0, σ 2) if it is weakly station-
ary with E(ε) = 0 and autocovariance function γ(0) = σ 2 and γ(h) = 0 for h 6= 0.

Note that an i.i.d sequence with zero mean and variance σ 2 is also white noise. The converse is
not true in general.

Martingale difference sequence, m.d.s. A process εt, with E(εt) = 0 is called a martingale differ-
ence sequence if
E(εt|εt−1, εt−2...) = 0, t = 2, 3, ...

Random Walk Consider the process


yt = yt−1 + ut, t = 0, 1, ...
where u is a W N (0, σ 2) and y0 fixed. Substituting backward
yt = y0 + u1 + u2 + ... + ut−2 + ut−1 + ut

22
It is easy to see that E(yt) = y0. Moreover, the variance is
γ(0) = E[(Σ_{j=1}^t uj)²] = Σ_{j=1}^t E(uj²) = tσ²
and
γ(h) = E[(Σ_{j=1}^t uj)(Σ_{k=1}^{t−h} uk)] = Σ_{k=1}^{t−h} E(uk²) = (t − h)σ².
The autocorrelation is
ρ(h) = (t − h)σ² / [tσ² (t − h)σ²]^{1/2} = (1 − h/t)^{1/2}.    (2)

23
3 Linear projections

Let Xt be a (k + 1) × 1 vector of variables with non-singular variance-covariance matrix, and let Yt be a variable. Consider the linear function of Xt
P(Yt|X1t, ..., Xkt) = β′Xt    (3)
satisfying
E[(Yt − β′Xt)Xt′] = 0.
Then (3) is the linear projection of Yt onto Xt and β is the projection coefficient, satisfying
β′ = E(YtXt′)[E(XtXt′)]^{−1}.

The linear projection has the following properties:

1. If E(Y|X) = Xβ, then E(Y|X) = P(Y|X).
2. P(aZ + bY|X) = aP(Z|X) + bP(Y|X).
3. P(Y|X) = P(P(Y|X, Z)|X).

24
4 Moments estimation

The sample mean is the natural estimator for the expected value of a process.

The sample mean of Y1, Y2, ..., YT is defined as
ȲT = (1/T) Σ_{t=1}^T Yt.

The sample autocovariance of Y1, Y2, ..., YT is defined as
γ̂(j) = (1/T) Σ_{t=j+1}^T (Yt − ȲT)(Yt−j − ȲT).

The sample autocorrelation of Y1, Y2, ..., YT is defined as
ρ̂(j) = γ̂(j)/γ̂(0).

The sample autocovariance and autocorrelation can be computed for any data set and are not restricted to realizations of a stationary process. For stationary processes both functions show a rapid decay towards zero as h increases. For non-stationary data, however, they exhibit quite different behavior: for instance, for variables with a trend the sample autocorrelation decays very slowly.
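These formulas translate directly into MATLAB. The function below is a minimal sketch (not from the course material) that computes the sample mean, autocovariances and autocorrelations up to lag J for a column vector y.

function [ybar, gammahat, rhohat] = samplemoments(y, J)
% Sample mean, autocovariances and autocorrelations up to lag J,
% using the (1/T) scaling of the formulas above. y is a column vector.
T    = length(y);
ybar = mean(y);
gammahat = zeros(J+1, 1);
for j = 0:J
    gammahat(j+1) = ((y(j+1:T) - ybar)' * (y(1:T-j) - ybar)) / T;
end
rhohat = gammahat / gammahat(1);   % rhohat(j+1) corresponds to rho_hat(j)
end

For example, [m, g, r] = samplemoments(y, 10) returns ρ̂(0), ..., ρ̂(10) in r.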
25
26
To compute the k-th partial autocorrelation, one simply runs an OLS regression of Yt on its k most recent values,
Yt = α̂1^{(k)}Yt−1 + α̂2^{(k)}Yt−2 + ... + α̂k^{(k)}Yt−k + ε̂t,
where ε̂t denotes the OLS residual. The last coefficient gives the k-th partial autocorrelation: α̂(k) = α̂k^{(k)}.

27
2. ARMA

[1] This part is based on the Hamilton textbook.

1
1 MA

1.1 MA(1)

Let εt be WN with variance σ² and consider the zero-mean process [2]
Yt = εt + θεt−1    (1)
where θ is a constant. This time series is called a first-order moving average process, denoted MA(1).

1.1.1 Moments

The expectation of Yt is given by


E(Yt) = E(εt + θεt−1)
= E(εt) + θE(εt−1)
= 0 (2)
Clearly, with a constant term µ added to (1), the expectation would be µ.

[2] Everything below also works for non-zero-mean processes.

2
The variance is given by
E(Yt²) = E(εt + θεt−1)²
       = E(εt²) + θ²E(εt−1²) + 2θE(εtεt−1)
       = σ² + θ²σ²
       = (1 + θ²)σ².    (3)
The first autocovariance is
E(YtYt−1) = E[(εt + θεt−1)(εt−1 + θεt−2)]
          = E(εtεt−1) + θE(εtεt−2) + θE(εt−1²) + θ²E(εt−1εt−2)
          = θσ².
Higher autocovariances are all zero: E(YtYt−j) = 0 for j > 1.

Since the mean and the autocovariances are not functions of time, the process is stationary regardless of the value of θ. Moreover, the process is also ergodic for the first moment since Σ_{j=−∞}^∞ |γj| < ∞. If εt is also Gaussian, then the process is ergodic for all moments.

The j-th autocorrelation is
ρ1 = γ1/γ0 = θσ²/[(1 + θ²)σ²] = θ/(1 + θ²)
for j = 1, and zero for j > 1.
3
Figure 1 displays the autocorrelation functions for
Yt = εt + 0.8εt−1

Figure 1

4
Note that the first order autocorrelation can be plotted as a function of θ as in Figure 2.

Figure 2

5
Note that:
1. positive values of θ induce positive autocorrelation, while negative values induce negative autocorrelation;
2. the largest possible value of ρ1 is 0.5 (θ = 1) and the smallest is −0.5 (θ = −1);
3. for any value of ρ1 in [−0.5, 0.5] there are two values of θ that produce the same autocorrelation, because ρ1 is unchanged if we replace θ with 1/θ:
(1/θ)/(1 + 1/θ²) = θ²(1/θ)/[θ²(1 + 1/θ²)] = θ/(θ² + 1).    (4)
So the processes
Yt = εt + 0.5εt−1
and
Yt = εt + 2εt−1
generate the same autocorrelation functions.

6
Figure 3. Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

7
1.2 MA(q)

A q-th order moving average process, denoted MA(q), is characterized by


Yt = εt + θ1εt−1 + θ2εt−2 + ... + θq εt−q (5)
where εt is WN and the θi’s are any real number.

1.2.1 Moments

• The mean of (5) is


E(Yt) = E(εt) + θ1E(εt−1) + θ2E(εt−2) + ... + θq E(εt−q )
= 0

• The variance of (5) is


γ0 = E(Yt2) = E(εt + θ1εt−1 + θ2εt−2 + ... + θq εt−q )2
= (1 + θ12 + θ22 + ... + θq2)σ 2
because all the terms involving the expected value of different εj ’s are zero because of the WN as-
sumption.

• The autocovariances are


γj = (θj + θj+1θ1 + θj+2θ2... + θq θq−j )σ 2 j = 1, 2, ..., q
and zero for j > q.
8
Figure 3

9
Example Consider an MA(2). Then
γ0 = (1 + θ1² + θ2²)σ²
γ1 = E(YtYt−1)
   = E(εtεt−1) + E(θ1εtεt−2) + E(θ2εtεt−3) + E(θ1εt−1²) + E(θ1²εt−1εt−2) + E(θ1θ2εt−1εt−3) + E(θ2εt−2εt−1) + E(θ2θ1εt−2²) + E(θ2²εt−2εt−3)
   = (θ1 + θ2θ1)σ²
γ2 = E(YtYt−2) = θ2E(εt−2²) = θ2σ².    (6)
The MA(q) process is covariance stationary for any value of the θi. Moreover, the process is also ergodic for the first moment since Σ_{j=−∞}^∞ |γj| < ∞. If εt is also Gaussian, then the process is ergodic for all moments.

10
1.3 MA(∞)

The MA(∞) can be thought of as the limit of an MA(q) process for q → ∞:
Yt = Σ_{j=0}^∞ θjεt−j = θ0εt + θ1εt−1 + θ2εt−2 + ...    (7)
with absolutely summable coefficients, Σ_{j=0}^∞ |θj| < ∞.
If the MA coefficients are square summable (implied by absolute summability), i.e. Σ_{j=0}^∞ θj² < ∞, then the above infinite sum generates a mean-square convergent random variable.

1.3.1 Moments

• The mean of the process is
E(Yt) = Σ_{j=0}^∞ θjE(εt−j) = 0.    (8)

• The variance is
γ0 = E(Yt²) = E(Σ_{j=0}^∞ θjεt−j)²
   = E(θ0εt + θ1εt−1 + θ2εt−2 + ...)²
   = (θ0² + θ1² + θ2² + ...)σ²
   = σ² Σ_{j=0}^∞ θj².    (9)
Again, all the terms involving the expected value of products of different ε's are zero because of the WN assumption.

• The autocovariances are
γj = E(YtYt−j) = E[(θ0εt + θ1εt−1 + θ2εt−2 + ...)(θ0εt−j + θ1εt−j−1 + θ2εt−j−2 + ...)]
   = (θjθ0 + θj+1θ1 + θj+2θ2 + θj+3θ3 + ...)σ².

The process is stationary: its first and second moments are finite and constant.

Moreover, an MA(∞) with absolutely summable coefficients has absolutely summable autocovariances, Σ_{j=−∞}^∞ |γj| < ∞, so it is ergodic for the mean. If the ε's are also Gaussian, it is ergodic for all moments.

12
1.4 Invertibility and Fundamentalness

Consider an MA(1)
Yt = εt + θεt−1 (10)
where εt is WN(0, σ²). Provided that |θ| < 1, both sides can be multiplied by (1 + θL)^{−1} to obtain
(1 − θL + θ²L² − θ³L³ + ...)Yt = εt
which can be viewed as an AR(∞) representation.

If a moving average representation can be rewritten as an AR(∞) representation by inverting (1 + θL), then the moving average representation is said to be invertible. For an MA(1), invertibility requires |θ| < 1.

Let us investigate what invertibility means in terms of the first and second moments of the process. Consider the following MA(1):
Ỹt = (1 + θ̃L)ε̃t    (11)
where ε̃t is WN(0, σ̃²). Moreover, suppose that the parameters of this new MA(1) are related to those of (10) as follows:
θ̃ = θ^{−1}
σ̃² = θ²σ²

13
Let us derive the first two moments of the two processes. E(Yt) = E(Ỹt) = 0. For Yt,
E(Yt²) = (1 + θ²)σ²
E(YtYt−1) = θσ².
For Ỹt,
E(Ỹt²) = (1 + θ̃²)σ̃²
E(ỸtỸt−1) = θ̃σ̃².
However, note that given the above restrictions,
(1 + θ²)σ² = (1 + 1/θ̃²)θ̃²σ̃² = [(θ̃² + 1)/θ̃²]θ̃²σ̃² = (1 + θ̃²)σ̃²
and
θσ² = (1/θ̃)θ̃²σ̃² = θ̃σ̃².
That is, the first two moments of the two processes are identical. Note that if |θ| < 1 then |θ̃| > 1. In other words, for any invertible MA representation we can find a non-invertible representation with identical first and second moments. The two processes share the same autocovariance generating function.

The value of εt associated with the invertible representation is sometimes called the fundamental
innovation for Yt. For the borderline case |θ| = 1 the process is non invertible but still fundamental.

Here we give a formal definition of invertibility

Invertibility An MA(q) process defined by the equation Yt = θ(L)εt is said to be invertible if there exists a sequence of constants {πj}_{j=0}^∞ such that Σ_{j=0}^∞ |πj| < ∞ and Σ_{j=0}^∞ πjYt−j = εt.

Proposition An MA process defined by the equation Yt = θ(L)εt is invertible if and only if θ(z) ≠ 0 for all z ∈ C such that |z| ≤ 1.

A similar concept is that of fundamentalness defined below.

Fundamentalness An MA process Yt = θ(L)εt is fundamental if and only if θ(z) ≠ 0 for all z ∈ C such that |z| < 1.

15
1.5 Wold’s decomposition theorem

Here is a very powerful result known as Wold's decomposition theorem.

Theorem Any zero-mean covariance stationary process Yt can be represented in the form
Yt = Σ_{j=0}^∞ θjεt−j + kt    (12)
where:
1. θ0 = 1;
2. Σ_{j=0}^∞ θj² < ∞;
3. εt is the error made in forecasting Yt on the basis of a linear function of lagged Yt (the fundamental innovation);
4. kt is uncorrelated with εt−j for any j and can be perfectly predicted from a linear function of the past values of Y.
The term kt is called the linearly deterministic component of Yt. If kt = 0, the process is called purely non-deterministic.

The result is very powerful since it holds for any covariance stationary process.

16
However, the theorem does not imply that (12) is the true representation of the process.

• For instance, the process could be stationary but non-linear or non-invertible. If the true system is generated by a nonlinear difference equation Yt = g(Yt−1, ..., Yt−p) + ηt, then when we fit a linear approximation, as in the Wold theorem, the shock we recover, εt, will be different from ηt.

• If the model is non-invertible, then the true shock will not be the Wold shock.

17
2 AR

2.1 AR(1)

A first-order autoregression, denoted AR(1), satisfies the following difference equation:


Yt = φYt−1 + εt (13)
where again εt is a WN. When |φ| < 1, the solution to (13) is
Yt = εt + φεt−1 + φ2εt−2 + φ3εt−3 + ... (14)
(14) can be viewed as an MA(∞) with ψj = φ^j. When |φ| < 1 the autocovariances are absolutely summable, since the MA coefficients are absolutely summable:
Σ_{j=0}^∞ |ψj| = Σ_{j=0}^∞ |φ|^j = 1/(1 − |φ|).

This ensures that the MA representation exists and that the process is stationary and ergodic for the mean. Recall that Σ_{j=0}^∞ φ^j is a geometric series converging to 1/(1 − φ) if |φ| < 1.

18
2.2 Moments

• The mean is
E(Yt) = 0.

• The variance is
γ0 = E(Yt²) = σ² Σ_{j=0}^∞ φ^{2j} = σ²/(1 − φ²).

• The j-th autocovariance is
γj = E(YtYt−j) = φ^j σ²/(1 − φ²).

• The j-th autocorrelation is
ρj = φ^j.

19
To find the moments of the AR(1) we can use a different strategy by directly working with the AR
representation and the assumption of stationarity.

• Note that the mean


E(Yt) = φE(Yt−1) + E(εt)
given the stationarity assumption E(Yt) = E(Yt−1) and therefore (1 − φ)E(Yt) = 0.

• The jth autocovariance is


γj = E(YtYt−j ) = φE(Yt−1Yt−j ) + E(εtYt−j )
= φγj−1
= φ j γ0

• Similarly the variance


E(Yt2) = φE(Yt−1Yt) + E(εtYt)
= φE(Yt−1Yt) + σ 2
γ0 = φ 2 γ0 + σ 2
= σ 2/(1 − φ2)
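The moments derived above are easy to check by simulation in MATLAB; the sketch below (illustrative values φ = 0.8, σ = 1, T = 1000) compares the sample autocorrelations of a simulated AR(1) with the theoretical values ρj = φ^j.

% Simulate an AR(1) and compare sample and theoretical autocorrelations
phi = 0.8; T = 1000; J = 10;
eps = randn(T, 1);                         % WN(0,1) innovations
y   = filter(1, [1 -phi], eps);            % y_t = phi*y_{t-1} + eps_t
ybar = mean(y);
rho  = zeros(J, 1);
for j = 1:J
    rho(j) = ((y(j+1:T) - ybar)' * (y(1:T-j) - ybar)) / sum((y - ybar).^2);
end
disp([rho, phi.^(1:J)'])                   % sample vs. theoretical rho_j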

20
Figure 4

21
Figure 5

22
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

23
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

24
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

25
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

26
2.3 AR(2)

A second-order autoregression, denoted AR(2), satisfies


Yt = φ1Yt−1 + φ2Yt−2 + εt (15)
where again εt is a WN. Using the lag operator
(1 − φ1L − φ2L2)Yt = εt (16)
The difference equation is stable provided that the roots of the polynomial
1 − φ1z − φ2z 2
lie outside the unit circle. When this condition is satisfied the process is covariance stationary and
the inverse of the autoregressive operator is
ψ(L) = (1 − φ1L − φ2L2)−1 = ψ0 + ψ1L + ψ2L2 + ψ3L3 + ...
with Σ_{j=0}^∞ |ψj| < ∞.

27
2.3.1 Moments

To find the moments of the AR(2) we can proceed as before.

• The mean is
E(Yt) = φ1E(Yt−1) + φ2E(Yt−2) + E(εt)
again by stationarity E(Yt) = E(Yt−j ) and therefore (1 − φ1 − φ2)E(Yt) = 0.

• The jth autocovariance is given by


γj = E(YtYt−j ) = φ1E(Yt−1Yt−j ) + φ2E(Yt−2Yt−j ) + E(εtYt−j )
= φ1γj−1 + φ2γj−2

• Similarly the jth autocorrelation is


ρj = φ1ρj−1 + φ2ρj−2
In particular, setting j = 1,
ρ1 = φ1 + φ2ρ1  ⇒  ρ1 = φ1/(1 − φ2),
and setting j = 2,
ρ2 = φ1ρ1 + φ2 = φ1²/(1 − φ2) + φ2.

• The variance
E(Yt2) = φ1E(Yt−1Yt) + φ2E(Yt−2Yt) + E(εtYt)
The equation can be written as
γ0 = φ1γ1 + φ2γ2 + σ 2
where the last term comes from the fact that
E(εtYt) = φ1E(εtYt−1) + φ2E(εtYt−2) + E(ε2t )
and that E(εtYt−1) = E(εtYt−2) = 0. Note that γj/γ0 = ρj. Therefore the variance can be rewritten as
γ0 = φ1ρ1γ0 + φ2ρ2γ0 + σ²
   = (φ1ρ1 + φ2ρ2)γ0 + σ²
   = [φ1²/(1 − φ2) + φ2φ1²/(1 − φ2) + φ2²]γ0 + σ²
which gives
γ0 = (1 − φ2)σ² / {(1 + φ2)[(1 − φ2)² − φ1²]}.    (17)

29
Figure 4

30
2.4 AR(p)

A p-th order autoregression, denoted AR(p), satisfies
Yt = φ1Yt−1 + φ2Yt−2 + ... + φpYt−p + εt    (18)
where again εt is a WN. Using the lag operator,
(1 − φ1L − φ2L² − ... − φpL^p)Yt = εt.    (19)
The difference equation is stable provided that the roots of the polynomial
1 − φ1z − φ2z² − ... − φpz^p
lie outside the unit circle. When this condition is satisfied the process is covariance stationary and the inverse of the autoregressive operator is
ψ(L) = (1 − φ1L − φ2L² − ... − φpL^p)^{−1} = ψ0 + ψ1L + ψ2L² + ψ3L³ + ...
with Σ_{j=0}^∞ |ψj| < ∞.

31
2.4.1 Moments

• E(Yt) = 0;

• The jth autocovariance is


γj = φ1γj−1 + φ2γj−2 + ... + φpγj−p
for j = 1, 2, ...

• The variance is
γ0 = φ1γ1 + φ2γ2 + ... + φpγp + σ².

• Dividing the autocovariances by γ0, one obtains the Yule-Walker equations
ρj = φ1ρj−1 + φ2ρj−2 + ... + φpρj−p.

32
2.4.2 Finding the roots of (1 − φ1 z − φ2 z 2 − ... − φp z p )

An easy way to find the roots of the polynomial is the following. Define the vectors Zt = [Yt, Yt−1, ..., Yt−p+1]′ and ϵt = [εt, 0_{1×(p−1)}]′, and the p × p matrix

F = [ φ1  φ2  φ3  ...  φp
       1   0   0  ...   0
       0   1   0  ...   0
      ...
       0  ...   0   1   0 ]

Then Zt satisfies the AR(1)
Zt = F Zt−1 + ϵt.
The roots of the polynomial (1 − φ1z − φ2z² − ... − φpz^p) coincide with the reciprocals of the eigenvalues of F.
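A minimal MATLAB sketch of this check, for an AR(2) with illustrative coefficients φ1 = 0.5, φ2 = 0.3: the reciprocals of the roots of 1 − φ1z − φ2z² are compared with the eigenvalues of F.

% Roots of the AR polynomial vs. eigenvalues of the companion matrix F
phi = [0.5 0.3];                        % illustrative AR(2) coefficients
p   = length(phi);
F   = [phi; eye(p-1), zeros(p-1,1)];    % companion matrix
z   = roots([-phi(end:-1:1), 1]);       % roots of 1 - phi_1*z - ... - phi_p*z^p
disp(sort(1./z))                        % reciprocals of the roots ...
disp(sort(eig(F)))                      % ... coincide with the eigenvalues of F

The process is covariance stationary (and causal) when all the displayed eigenvalues are smaller than one in absolute value.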

33
2.5 Causality and stationarity

Causality An AR(p) process defined by the equation (1 − φ1L − φ2L² − ... − φpL^p)Yt = φ(L)Yt = εt is said to be causal if there exists a sequence of constants {ψj}_{j=0}^∞ such that Σ_{j=0}^∞ |ψj| < ∞ and Yt = Σ_{j=0}^∞ ψjεt−j.

Proposition An AR process φ(L)Yt = εt is causal if and only if φ(z) ≠ 0 for all z such that |z| ≤ 1.

Stationarity The AR(p) is stationary if and only if φ(z) ≠ 0 for all z such that |z| = 1.

Here we focus on AR processes which are causal and stationary.

Example Consider the process


Yt = φYt−1 + εt
where εt is WN and |φ| > 1. Clearly the process is not causal. However, we can rewrite the process as
Yt = (1/φ)Yt+1 − (1/φ)εt+1
or, using the forward operator F = L^{−1},
Yt = (1/φ)F Yt − (1/φ)F εt
(1 − (1/φ)F)Yt = −(1/φ)F εt
Yt = −(1 − (1/φ)F)^{−1}(1/φ)F εt    (20)
which is a mean-square convergent random variable. Using the lag operator it is easy to see that
(1 − φL) = ((φL)^{−1} − 1)(φL) = (1 − (φL)^{−1})(−φL).

35
3 ARMA

3.1 ARMA(p,q)

An ARMA(p,q) process includes both autoregressive and moving average terms:


Yt = φ1Yt−1 + φ2Yt−2 + ... + φpYt−p + εt + θ1εt−1 + ... + θq εt−q (21)
where again εt is a WN. Using the lag operator,
(1 − φ1L − φ2L² − ... − φpL^p)Yt = (1 + θ1L + θ2L² + ... + θqL^q)εt.    (22)
Provided that the roots of the polynomial
1 − φ1z − φ2z² − ... − φpz^p
lie outside the unit circle, the ARMA process can be written as
Yt = ψ(L)εt
where
ψ(L) = (1 + θ1L + θ2L² + ... + θqL^q)/(1 − φ1L − φ2L² − ... − φpL^p).
Stationarity of the ARMA process depends on the AR parameters: stationarity is implied by the roots of the AR polynomial lying outside the unit circle.

36
• The variance of the process is
E(Yt²) = φ1E(Yt−1Yt) + φ2E(Yt−2Yt) + ... + φpE(Yt−pYt) + E(εtYt) + θ1E(εt−1Yt) + ... + θqE(εt−qYt)
       = φ1[σ²(ψ1ψ0 + ψ2ψ1 + ...)] + φ2[σ²(ψ2ψ0 + ψ3ψ1 + ...)] + ... + φp[σ²(ψpψ0 + ψp+1ψ1 + ...)] + σ²(ψ0 + θ1ψ1 + ... + θqψq)
and the j-th autocovariance is
γj = E(YtYt−j) = φ1E(Yt−1Yt−j) + φ2E(Yt−2Yt−j) + ... + φpE(Yt−pYt−j) + E(εtYt−j) + θ1E(εt−1Yt−j) + ... + θqE(εt−qYt−j).
For j ≤ q,
γj = φ1γj−1 + φ2γj−2 + ... + φpγj−p + σ²(θjψ0 + θj+1ψ1 + ...),
while for j > q the autocovariances are
γj = φ1γj−1 + φ2γj−2 + ... + φpγj−p.

37
There is a potential problem of redundant parametrization with ARMA processes. Consider a simple WN,
Yt = εt,
and multiply both sides by (1 − ρL) to get
(1 − ρL)Yt = (1 − ρL)εt,
an ARMA(1,1) with φ = ρ and θ = −ρ. Both representations are valid; however, it is important to avoid such parametrizations, since we would get into trouble when estimating the parameters.

38
3.2 ARMA(1,1)

The ARMA(1,1) satisfies


(1 − φL)Yt = (1 + θL)εt (23)
where again εt is a WN, and
ψ(L) = (1 + θL)/(1 − φL).
Here we have
γ0 = φE(Yt−1Yt) + E(εtYt) + θE(εt−1Yt)
   = φσ²(ψ1ψ0 + ψ2ψ1 + ...) + σ² + θψ1σ²
γ1 = φE(Yt−1²) + E(εtYt−1) + θE(εt−1Yt−1) = φγ0 + θσ²
γ2 = φγ1.    (24)

39
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

40
Figure . Source: W.Wei ”Time Series Analysis: Univariate and Multivariate Methods”.

41
3. ESTIMATION

[1] This part is based on the Hamilton textbook.

1
1 Estimating an autoregression with OLS

Assumption: The regression model is

yt = c + φ1yt−1 + φ2yt−2 + ... + φpyt−p + εt

with the roots of (1 − φ1z − φ2z² − ... − φpz^p) = 0 outside the unit circle and with εt an i.i.d. sequence with zero mean, variance σ² and finite fourth moment.

The autoregression can be written in the standard regression form with xt = [1, yt−1, ..., yt−p]′.

Note that the autoregression cannot satisfy assumption A1 since ut is not independent of yt+1.

⇒ The estimator is biased

2
1.1 Asymptotic results for the estimator

The asymptotic results seen before hold. Suppose we have T + p observations so that T observations
can be used to estimate the model. Then:
Consistency

β̂ − β = [Σ_{t=1}^T xtxt′]^{−1} [Σ_{t=1}^T xtut]
      = [(1/T) Σ_{t=1}^T xtxt′]^{−1} [(1/T) Σ_{t=1}^T xtut]

Again, xtut is a martingale difference sequence with finite variance-covariance matrix given by E(xtutxt′ut) = E(xtxt′)σ², so that
(1/T) Σ_{t=1}^T xtut →^p 0.
Moreover, the first term is the matrix of sample moments

(1/T) Σ_{t=1}^T xtxt′ =
[ 1              T^{−1}Σ yt−1        T^{−1}Σ yt−2        ...  T^{−1}Σ yt−p
  T^{−1}Σ yt−1   T^{−1}Σ yt−1²       T^{−1}Σ yt−1yt−2    ...  T^{−1}Σ yt−1yt−p
  T^{−1}Σ yt−2   T^{−1}Σ yt−2yt−1    T^{−1}Σ yt−2²       ...  T^{−1}Σ yt−2yt−p
  ...
  T^{−1}Σ yt−p   T^{−1}Σ yt−pyt−1    T^{−1}Σ yt−pyt−2    ...  T^{−1}Σ yt−p² ]

The elements in the first row and column converge in probability to µ by proposition C.13L. The other elements, by Theorem 7.2.1BD, converge in probability to
E(yt−iyt−j) = γ_{|i−j|} + µ²,
hence
[(1/T) Σ_{t=1}^T xtxt′]^{−1} →^p Q^{−1}.
Therefore, as before,
β̂ − β →^p Q^{−1} · 0 = 0,
verifying that the estimator is consistent.

4
Asymptotic normality
Again, this follows from the fact that xtut is a martingale difference sequence with variance-covariance matrix E(xtutxt′ut) = E(xtxt′)σ², so that
(1/√T) Σ_{t=1}^T xtut →^L N(0, σ²Q)
and
√T(β̂ − β) →^L N(0, Q^{−1}(σ²Q)Q^{−1}) = N(0, σ²Q^{−1}).

5
2 Maximum likelihood: Introduction

Consider an ARMA model of the form

Yt = φ1Yt−1 + φ2Yt−2 + ... + φpYt−p + εt + θ1εt−1 + θ2εt−2 + ... + θq εt−q (1)

with εt ∼ i.i.d.N (0, σ 2).

Here we explore how to estimate the values of φ1, ..., φp, θ1, ..., θq on the basis of observations of
y. The principle on which the estimation is based is maximum likelihood.

Let θ = (φ1, ..., φp, θ1, ..., θq , σ 2) denote the vector of population parameters and suppose we have a
sample of T observations (y1, y2, ..., yT ).

The approach is to calculate the probability density, or likelihood function,
f(yT, yT−1, ..., y1; θ).    (2)
The maximum likelihood (ML) estimate is the value of θ such that (2) is maximized.

6
2.1 Likelihood function for an AR(1)

Consider the Gaussian AR(1)


Yt = c + φYt−1 + εt (3)
with εt ∼ i.i.d. N(0, σ²). In this case θ = (c, φ, σ²)′. Since εt is Gaussian, y1 is also Gaussian, with mean
µ = c/(1 − φ)
and variance
E(y1 − µ)² = σ²/(1 − φ²).
Hence the density of the first observation is
f(y1; θ) = [2πσ²/(1 − φ²)]^{−1/2} exp{ −[y1 − c/(1 − φ)]² / [2σ²/(1 − φ²)] }.
Conditional on the first observation, the density of y2 is
f(y2|y1; θ) = N(c + φy1, σ²) = (2πσ²)^{−1/2} exp{ −[y2 − c − φy1]² / (2σ²) }.    (4)
Similarly, the conditional density of the third observation is
f(y3|y2, y1; θ) = N(c + φy2, σ²) = (2πσ²)^{−1/2} exp{ −[y3 − c − φy2]² / (2σ²) }.    (5)
7
Note that, since yt depends only on yt−1, the conditional density of the t-th observation is
f(yt|yt−1; θ) = N(c + φyt−1, σ²) = (2πσ²)^{−1/2} exp{ −[yt − c − φyt−1]² / (2σ²) },    (6)
and the joint density of the first t observations can be written as
f(yt, yt−1, ..., y1; θ) = f(yt|yt−1, ..., y1; θ)f(yt−1, ..., y1; θ)
                       = f(yt|yt−1, ..., y1; θ)f(yt−1|yt−2, ..., y1; θ) ... f(y2|y1; θ)f(y1; θ).
The log-likelihood is therefore
L(θ) = log f(y1; θ) + Σ_{t=2}^T log f(yt|yt−1, ..., y1; θ).    (7)
Substituting (6) into (7) we obtain the log-likelihood function for a sample of size T from a Gaussian AR(1):
L(θ) = −(1/2) log(2π) − (1/2) log[σ²/(1 − φ²)] − [y1 − c/(1 − φ)]² / [2σ²/(1 − φ²)]
       − [(T − 1)/2] log(2π) − [(T − 1)/2] log(σ²) − Σ_{t=2}^T (yt − c − φyt−1)² / (2σ²).    (8)

8
(8) is the exact likelihood function. To find the ML estimates, numerical optimization must be used.

9
2.2 Conditional maximum likelihood estimates

An alternative to numerical optimization of the exact likelihood function is to treat the first value y1 as deterministic and to maximize the likelihood conditional on the first observation. In this case the conditional log-likelihood is given by
L(θ) = Σ_{t=2}^T log f(yt|yt−1, ..., y1; θ),    (9)
that is,
L(θ) = −[(T − 1)/2] log(2π) − [(T − 1)/2] log(σ²) − Σ_{t=2}^T (yt − c − φyt−1)²/(2σ²).    (10)
Maximization of (10) is equivalent to minimization of
Σ_{t=2}^T (yt − c − φyt−1)²,
which is achieved by an OLS regression of yt on a constant and its lagged value. The conditional ML estimates of c and φ are given by

(ĉ, φ̂)′ = [ T − 1    Σ yt−1  ;  Σ yt−1    Σ yt−1² ]^{−1} [ Σ yt ; Σ ytyt−1 ]    (11)

where all sums run over t = 2, ..., T.

The conditional ML estimate of the innovation variance is found by differentiating (10) with respect to σ² and setting the result equal to zero,
−(T − 1)/(2σ²) + Σ_{t=2}^T (yt − c − φyt−1)²/(2σ⁴) = 0,
which gives
σ̂² = Σ_{t=2}^T (yt − ĉ − φ̂yt−1)²/(T − 1).    (12)
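A minimal MATLAB sketch of these conditional ML (i.e. OLS) estimates; the AR(1) data are simulated here purely for illustration, with arbitrary values c = 1, φ = 0.7.

% Conditional ML for a Gaussian AR(1): OLS of y_t on a constant and y_{t-1}
T   = 500; c = 1; phi = 0.7;
y   = filter(1, [1 -phi], c + randn(T,1));   % simulated data (illustrative)
X   = [ones(T-1,1), y(1:T-1)];               % regressors: constant, y_{t-1}
b   = X \ y(2:T);                            % b = [c_hat; phi_hat], as in (11)
ehat      = y(2:T) - X*b;
sigma2hat = (ehat'*ehat) / (T-1);            % conditional ML estimate, as in (12)
disp([b; sigma2hat])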

11
2.3 Likelihood function for an AR(p)

Consider the Gaussian AR(p)

Yt = c + φ1Yt−1 + φ2Yt−2 + ... + φpYt−p + εt    (13)

with εt ∼ i.i.d.N (0, σ 2). In this case θ = (c, φ1, φ2, ...φp, σ 2)0. Here we only study the conditional
likelihood.

12
2.3.1 Conditional maximum likelihood estimates

The conditional likelihood can be derived as in the AR(1) case. In particular, we have
L(θ) = −[(T − p)/2] log(2π) − [(T − p)/2] log(σ²) − Σ_{t=p+1}^T (yt − c − φ1yt−1 − φ2yt−2 − ... − φpyt−p)²/(2σ²).    (14)
Again, maximization of (14) is equivalent to minimization of
Σ_{t=p+1}^T (yt − c − φ1yt−1 − φ2yt−2 − ... − φpyt−p)²,
which is achieved by an OLS regression of yt on a constant and p lagged values of y.

The conditional ML estimate of the innovation variance is found by differentiating (14) with respect to σ² and setting the result equal to zero:
σ̂² = Σ_{t=p+1}^T (yt − ĉ − φ̂1yt−1 − φ̂2yt−2 − ... − φ̂pyt−p)²/(T − p).    (15)

• Note that even if the disturbances are not Gaussian, the ML estimates of the parameters are consistent estimates of the population parameters, because they correspond to OLS estimates, and we saw in the previous class that consistency does not depend on the assumption of normality.

• An estimate that maximizes a misspecified likelihood function is known as a quasi-maximum likelihood estimate.

14
2.4 Conditional likelihood function for an MA(1)

Consider the MA(1) process


Yt = µ + εt + θεt−1
with εt ∼ i.i.d.N (0, σ 2). In this case θ = (µ, θ, σ 2)0.

If the value of εt−1 were known with certainty then

yt|εt−1 ∼ N (µ + θεt−1, σ 2)

or
f(yt|εt−1; θ) = (2πσ²)^{−1/2} exp{ −(yt − µ − θεt−1)²/(2σ²) }.

Suppose we knew ε0 = 0. Then
y1|ε0 ∼ N(µ, σ²).
Moreover, given y1, we know ε1 = y1 − µ, so
f(y2|y1, ε0 = 0; θ) = (2πσ²)^{−1/2} exp{ −(y2 − µ − θε1)²/(2σ²) }.

Since ε1 is then known with certainty, ε2 = y2 − µ − θε1. Proceeding in this way, given knowledge of ε0 the full sequence of innovations can be calculated by iterating on εt = yt − µ − θεt−1.
15
The conditional density of the t-th observation can be calculated as
f(yt|yt−1, yt−2, ..., y1, ε0 = 0; θ) = f(yt|εt−1; θ) = (2πσ²)^{−1/2} exp{ −εt²/(2σ²) }.    (16)
The sample likelihood is the product of the individual densities,
f(yT, yT−1, ..., y1|ε0 = 0; θ) = Π_{t=1}^T f(yt|yt−1, yt−2, ..., y1, ε0 = 0; θ),    (17)
and the conditional log-likelihood is
L(θ) = −(T/2) log(2π) − (T/2) log(σ²) − Σ_{t=1}^T εt²/(2σ²).    (18)

• For a particular numerical value of θ we calculate the implied sequence of ε's; the conditional likelihood is then a function of their sum of squares.

• ML estimates have to be found by numerical optimization.
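A minimal MATLAB sketch of this calculation: the function below (illustrative, not from the notes) evaluates the conditional log-likelihood (18) for a parameter vector param = [µ, θ, σ²] by iterating on εt = yt − µ − θεt−1 with ε0 = 0. With its sign reversed, it can be passed to a numerical optimizer such as fminsearch.

function L = ma1condloglik(param, y)
% Conditional log-likelihood (18) of an MA(1); param = [mu, theta, sigma2]
mu = param(1); theta = param(2); sigma2 = param(3);
T   = length(y);
eps = zeros(T, 1);
eps(1) = y(1) - mu;                          % uses the assumption eps_0 = 0
for t = 2:T
    eps(t) = y(t) - mu - theta*eps(t-1);     % recursion for the innovations
end
L = -(T/2)*log(2*pi) - (T/2)*log(sigma2) - sum(eps.^2)/(2*sigma2);
end

For example, fminsearch(@(p) -ma1condloglik(p, y), [mean(y); 0; var(y)]) returns the conditional ML estimates starting from simple initial values.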

16
2.5 Conditional likelihood function for an MA(q)

Consider the MA(q) process

Yt = µ + εt + θ1εt−1 + θ2εt−2 + ... + θqεt−q

with εt ∼ i.i.d.N (0, σ 2). In this case θ = (µ, θ1, ..., θq , σ 2)0.

As before a simple approach is to condition on the assumption

ε0 = ε−1 = ... = ε−q+1 = 0

Using these starting values we can iterate on

εt = yt − µ − θ1εt−1 − θ2εt−2 − ... − θq εt−q

The conditional likelihood is
L(θ) = −(T/2) log(2π) − (T/2) log(σ²) − Σ_{t=1}^T εt²/(2σ²).    (19)

17
2.6 Likelihood function for an ARMA(p,q)

A Gaussian ARMA(p,q) takes the form

Yt = c + φ1Yt−1 + φ2Yt−2 + ... + φpYt−p + εt + θ1εt−1 + θ2εt−2 + ... + θqεt−q

with εt ∼ i.i.d. N(0, σ²). In this case θ = (c, φ1, ..., φp, θ1, ..., θq, σ²)′.

2.7 Conditional Likelihood function

Taking the values of y0, ..., y−p+1, ε0, ..., ε−q+1 as given, the sequence of ε's can be calculated using

εt = yt − c − φ1yt−1 − φ2yt−2 − ... − φpyt−p − θ1εt−1 − θ2εt−2 − ... − θqεt−q.

The conditional likelihood is then
L(θ) = −(T/2) log(2π) − (T/2) log(σ²) − Σ_{t=1}^T εt²/(2σ²).    (20)

Again, numerical optimization has to be used to compute the ML estimates.

18
4. FORECASTING

[1] This part is based on the Hamilton textbook.

1
1 Principles of forecasting

1.1 Forecast based on conditional expectations

• Suppose we are interested in forecasting the value of Yt+1 based on a set of variables Xt.

• Let Yt+1|t denote such a forecast.

• To evaluate the usefulness of the forecast we need to specify a loss function. Here we specify a quadratic loss function, which means that Yt+1|t is chosen to minimize E(Yt+1 − Yt+1|t)².

• E(Yt+1 − Yt+1|t)² is known as the mean squared error associated with the forecast Yt+1|t, denoted
MSE(Yt+1|t) = E(Yt+1 − Yt+1|t)².

• Fundamental result: the forecast with the smallest MSE is the expectation of Yt+1 conditional on Xt, that is,
Yt+1|t = E(Yt+1|Xt).

2
We now verify the claim. Let g(Xt) be any other function and let Yt+1|t = g(Xt). The associated MSE is
E[Yt+1 − g(Xt)]² = E[Yt+1 − E(Yt+1|Xt) + E(Yt+1|Xt) − g(Xt)]²
                 = E[Yt+1 − E(Yt+1|Xt)]²
                   + 2E{[Yt+1 − E(Yt+1|Xt)][E(Yt+1|Xt) − g(Xt)]}
                   + E{[E(Yt+1|Xt) − g(Xt)]²}.    (2)
Define
ηt+1 ≡ [Yt+1 − E(Yt+1|Xt)][E(Yt+1|Xt) − g(Xt)].    (3)
Its conditional expectation is
E(ηt+1|Xt) = E{[Yt+1 − E(Yt+1|Xt)][E(Yt+1|Xt) − g(Xt)] | Xt}
           = [E(Yt+1|Xt) − g(Xt)] E{[Yt+1 − E(Yt+1|Xt)] | Xt}
           = [E(Yt+1|Xt) − g(Xt)][E(Yt+1|Xt) − E(Yt+1|Xt)]
           = 0

3
Therefore by law of iterated expectations

E(ηt+1) = E(E(ηt+1|Xt)) = 0

This means that

E [Yt+1 − g(Xt)]2 = E [Yt+1 − E(Yt+1|Xt)]2 + E([E(Yt+1|Xt) − g(Xt)])2

Therefore the function that minimizes the M SE is

g(Xt) = E(Yt+1|Xt)

E(Yt+1|Xt) is the optimal forecast of Yt+1 conditional on Xt under a quadratic loss function. The MSE of this optimal forecast is

E[Yt+1 − g(Xt)]2 = E[Yt+1 − E(Yt+1|Xt)]2

4
1.2 Forecast based on linear projections

We now restrict the class of forecasts we consider to be linear function of Xt:


Yt+1|t = α′Xt.
Suppose α is such that the resulting forecast error is uncorrelated with Xt:
E[(Yt+1 − α′Xt)Xt′] = 0′.    (4)
If (4) holds, then we call α′Xt the linear projection of Yt+1 on Xt.

• The linear projection produces the smallest forecast error among the class of linear forecasting rules. To verify this, let g′Xt be any arbitrary linear forecasting rule. Then
E(Yt+1 − g′Xt)² = E(Yt+1 − α′Xt + α′Xt − g′Xt)²
                = E(Yt+1 − α′Xt)²
                  + 2E[(Yt+1 − α′Xt)(α′Xt − g′Xt)]
                  + E(α′Xt − g′Xt)².
The middle term is
E[(Yt+1 − α′Xt)(α′Xt − g′Xt)] = E[(Yt+1 − α′Xt)Xt′(α − g)] = E[(Yt+1 − α′Xt)Xt′](α − g) = 0
by definition of the linear projection. Thus
E(Yt+1 − g′Xt)² = E(Yt+1 − α′Xt)² + E(α′Xt − g′Xt)².

The optimal linear forecast is the value g′Xt = α′Xt. We use the notation
P̂(Yt+1|Xt) = α′Xt
to indicate the linear projection of Yt+1 on Xt.

Notice that
MSE[P̂(Yt+1|Xt)] ≥ MSE[E(Yt+1|Xt)].

The projection coefficient α can be calculated in terms of the moments of Yt+1 and Xt:
E(Yt+1Xt′) = α′E(XtXt′)
α′ = E(Yt+1Xt′)[E(XtXt′)]^{−1}.

Below we denote by Ŷt+s|t = Ê(Yt+s|Xt) = P̂(Yt+s|1, Xt) the best linear forecast of Yt+s conditional on Xt and a constant.

6
1.3 Linear projections and OLS regression

There is a close relationship between the OLS estimator and the linear projection coefficient. If Yt+1 and Xt are stationary processes and also ergodic for the second moments, then
(1/T) Σ_{t=1}^T XtXt′ →^p E(XtXt′)
(1/T) Σ_{t=1}^T XtYt+1 →^p E(XtYt+1)
implying
β̂ →^p α.

The OLS regression yields a consistent estimate of the linear projection coefficient.

7
2 Forecasting an AR(1)

For the covariance-stationary AR(1) we have


Ŷt+s|t = Ê(Yt+s|Yt, Yt−1, ...) = φsYt
The forecast decays geometrically toward zero as the forecast horizon increases.

The forecast error is
Yt+s − Ŷt+s|t = εt+s + φεt+s−1 + φ²εt+s−2 + ... + φ^{s−1}εt+1.
The associated MSE is
E(Yt+s − Ŷt+s|t)² = E(εt+s + φεt+s−1 + φ²εt+s−2 + ... + φ^{s−1}εt+1)²
                  = (1 + φ² + φ⁴ + ... + φ^{2(s−1)})σ²
                  = [(1 − φ^{2s})/(1 − φ²)]σ².
Notice that
lim_{s→∞} Ŷt+s|t = 0
(in general the forecast converges to the mean of the process) and
lim_{s→∞} E(Yt+s − Ŷt+s|t)² = σ²/(1 − φ²),
which is the variance of the process.
8
3 Forecasting an AR(p)

Now consider an AR(p). Recall that the AR(p) can be written as
Zt = F Zt−1 + ϵt
where Zt = [Yt, Yt−1, ..., Yt−p+1]′, ϵt = [εt, 0, ..., 0]′ and

F = [ φ1  φ2  ...  φp−1  φp
       1   0  ...   0     0
       0   1  ...   0     0
      ...
       0   0  ...   1     0 ]

Therefore
Yt+s = f11^{(s)}Yt + f12^{(s)}Yt−1 + ... + f1p^{(s)}Yt−p+1 + εt+s + f11^{(1)}εt+s−1 + f11^{(2)}εt+s−2 + ... + f11^{(s−1)}εt+1
where fmn^{(j)} denotes the (m, n) element of F^j.

The optimal s-step-ahead forecast is thus
Ê(Yt+s|Yt, Yt−1, ...) = f11^{(s)}Yt + f12^{(s)}Yt−1 + ... + f1p^{(s)}Yt−p+1    (5)
and the associated forecast error is
Yt+s − Ê(Yt+s|Yt, Yt−1, ...) = εt+s + f11^{(1)}εt+s−1 + f11^{(2)}εt+s−2 + ... + f11^{(s−1)}εt+1.    (6)
9
The forecast can be computed recursively. Let Ŷt+1|t be the one-step-ahead forecast of Yt+1; we have

Ŷt+1|t = φ1Yt + φ2Yt−1 + ... + φpYt−p+1.    (7)

In general, the j-step-ahead forecast Ŷt+j|t can be computed using the recursion

Ŷt+j|t = φ1Ŷt+j−1|t + φ2Ŷt+j−2|t + ... + φpŶt+j−p|t    (8)

(with Ŷt+i|t = Yt+i for i ≤ 0).

An easy way to see this is to use the AR(1) representation

Zt = F Zt−1 + ϵt.

We have

Ẑt+1|t = F Zt
Ẑt+2|t = F 2Zt
...

Ẑt+s|t = F sZt

which means
Ẑt+s|t = F Ẑt+s−1|t
The forecast Ŷt+s|t will be the first element of Ẑt+s|t

10
The associated forecast errors are

Zt+1 − Ẑt+1|t = ϵt+1
Zt+2 − Ẑt+2|t = Fϵt+1 + ϵt+2
...
Zt+s − Ẑt+s|t = F^{s−1}ϵt+1 + F^{s−2}ϵt+2 + ... + Fϵt+s−1 + ϵt+s.

Let E(ϵtϵt′) = Σ. The mean squared errors are

MSE(Ẑt+1|t) = Σ
MSE(Ẑt+2|t) = FΣF′ + Σ
...
MSE(Ẑt+s|t) = F^{s−1}ΣF^{s−1}′ + F^{s−2}ΣF^{s−2}′ + ... + FΣF′ + Σ = Σ_{j=0}^{s−1} F^jΣF^{j}′.
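A minimal MATLAB sketch of this recursion for a zero-mean AR(p); the coefficients and the last p observations are illustrative placeholders.

% s-step-ahead forecasts of a zero-mean AR(p) via the companion form
phi  = [0.5 0.2 -0.1];                  % illustrative AR(3) coefficients
p    = length(phi); s = 8;
Z    = [0.3; -0.1; 0.4];                % [Y_t; Y_{t-1}; Y_{t-2}] (illustrative)
F    = [phi; eye(p-1), zeros(p-1,1)];   % companion matrix
yhat = zeros(s, 1);
for j = 1:s
    Z = F*Z;                            % Zhat_{t+j|t} = F*Zhat_{t+j-1|t}
    yhat(j) = Z(1);                     % the forecast is the first element
end
disp(yhat)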

11
4 Forecasting an MA(1)

Consider the invertible MA(1)


Yt = εt + θεt−1
with |θ| < 1. Plugging into the Wiener-Kolmogorov formula we obtain
Ŷt+s|t = [(1 + θL)/L^s]_+ [1/(1 + θL)] Yt,
where [·]_+ denotes the annihilation operator, which discards negative powers of L. For s = 1,
Ŷt+1|t = [θ/(1 + θL)] Yt = θεt    (9)
where εt is the outcome of the infinite recursion εt = Yt − θεt−1. For s = 2, 3, ...,
Ŷt+s|t = 0    (10)
because [(1 + θL)/L^s]_+ = 0 for s > 1.

12
5 Forecasting an MA(q)

For an invertible MA(q) process
Yt = εt + θ1εt−1 + θ2εt−2 + ... + θqεt−q,
the Wiener-Kolmogorov formula gives
Ŷt+s|t = [(1 + θ1L + θ2L² + ... + θqL^q)/L^s]_+ [1/(1 + θ1L + θ2L² + ... + θqL^q)] Yt
where
[(1 + θ1L + θ2L² + ... + θqL^q)/L^s]_+ = θs + θs+1L + ... + θqL^{q−s} for s = 1, 2, ..., q
and zero for s = q + 1, q + 2, ....

Therefore the forecast for horizons s = 1, 2, ..., q is given by
Ŷt+s|t = (θs + θs+1L + ... + θqL^{q−s})εt
and zero for longer horizons.

13
6 Forecasting an ARMA(1,1)

Consider the ARMA(1,1) process
(1 − φL)Yt = (1 + θL)εt
with |φ|, |θ| < 1. Then
Ŷt+s|t = [(1 + θL)/((1 − φL)L^s)]_+ [(1 − φL)/(1 + θL)] Yt.
Now
[(1 + θL)/((1 − φL)L^s)]_+ = [(1 + φL + φ²L² + ...)/L^s]_+ + θ[L(1 + φL + φ²L² + ...)/L^s]_+
                           = (φ^s + φ^{s+1}L + φ^{s+2}L² + ...) + θ(φ^{s−1} + φ^sL + φ^{s+1}L² + ...)
                           = (φ^s + θφ^{s−1})(1 + φL + φ²L² + ...)
                           = (φ^s + θφ^{s−1})/(1 − φL).
Therefore
Ŷt+s|t = [(φ^s + θφ^{s−1})/(1 − φL)] [(1 − φL)/(1 + θL)] Yt = [(φ^s + θφ^{s−1})/(1 + θL)] Yt.
For s = 1 the forecast is
Ŷt+1|t = [(φ + θ)/(1 + θL)] Yt.
This can be written as
Ŷt+1|t = {[φ(1 + θL) + θ(1 − φL)]/(1 + θL)} Yt = φYt + θεt
where
εt = [(1 − φL)/(1 + θL)] Yt = Yt − φYt−1 − θεt−1.

For s = 2, 3, ... the forecast obeys the recursion
Ŷt+s|t = φŶt+s−1|t.

15
7 Direct forecast

An alternative is to compute the direct forecast, i.e. the projection of Yt+h directly on variables dated t. To see this, consider a bivariate VAR(p) with two variables, xt and yt. We want to forecast xt+h given the information available at time t.

The direct forecast works as follows:

1. Estimate the projection equation
xt = a + Σ_{i=0}^{p−1} φi xt−h−i + Σ_{i=0}^{p−1} ψi yt−h−i + εt.

2. Using the estimated coefficients, the predictor x̂t+h|t is obtained as
x̂t+h|t = â + Σ_{i=0}^{p−1} φ̂i xt−i + Σ_{i=0}^{p−1} ψ̂i yt−i.

16
8 Comparing Predictive Accuracy

Diebold and Mariano propose a procedure to formally compare the forecasting performance of two competing models. Let Ŷ¹t+s|t and Ŷ²t+s|t be two forecasts based on the same information set but obtained using two different models (e.g. an MA(1) and an AR(1)). Let
w¹τ+s|τ = Yτ+s − Ŷ¹τ+s|τ
w²τ+s|τ = Yτ+s − Ŷ²τ+s|τ
be the two forecast errors, where τ = T0, ..., T − s.

The accuracy of each forecast is measured by a particular loss function, say quadratic, L(wⁱτ+s|τ) = (wⁱτ+s|τ)². The Diebold-Mariano procedure is based on a test of the null hypothesis
H0: E(dτ) = 0 against H1: E(dτ) ≠ 0,
where dτ = L(w¹τ+s|τ) − L(w²τ+s|τ). The Diebold-Mariano statistic is
S = d̄ / (LRV̂_d̄)^{1/2}
where
d̄ = [1/(T − T0 − s)] Σ_{τ=T0}^{T−s} dτ
and LRV̂_d̄ is a consistent estimate of
LRV_d̄ = γ0 + 2 Σ_{j=1}^∞ γj, with γj = Cov(dt, dt−j).

Under the null,
S →^L N(0, 1).

• The DM test is for forecast comparison, not model comparison.

• When using forecasts obtained from estimated models one has to be careful: for nested models the distribution of the DM statistic is non-normal.

• For nested models an alternative is the bootstrap procedure in Clark and McCracken (Advances in Forecast Evaluation, 2013).
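A minimal MATLAB sketch of the DM statistic for quadratic loss, given two column vectors of forecast errors w1 and w2 of equal length. The long-run variance is estimated here with Bartlett weights up to a user-chosen truncation lag q; this is an illustrative implementation, not the authors' code.

function S = dmstat(w1, w2, q)
% Diebold-Mariano statistic with quadratic loss; w1, w2: forecast error columns
d    = w1.^2 - w2.^2;                   % loss differential d_tau
n    = length(d);
dbar = mean(d);
dd   = d - dbar;
lrv  = (dd'*dd)/n;                      % gamma_0
for j = 1:q
    gj  = (dd(j+1:n)'*dd(1:n-j))/n;     % gamma_j
    lrv = lrv + 2*(1 - j/(q+1))*gj;     % Bartlett-weighted long-run variance
end
S = dbar / sqrt(lrv/n);                 % compare with N(0,1) critical values
end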

18
9 Forecast in practice

• So far we assumed that the value of the coefficients is known. This is obviously not the case in
practice. In real applications we will have to estimate the value of the parameters.

• For instance, with an AR(p) we have to obtain estimates φ̂1, φ̂2, ..., φ̂p using OLS and then use the formulas seen before to produce forecasts of Y.

• Suppose we have data up to date T . We estimate the model using all the T observations for
Yt and we forecast YT +s, call it ỸT +s|T to distinguish it from the forecast where coefficients were
known with certainty ŶT +s|T .

19
9.1 Forecast evaluation

• A key issue in forecasting is to evaluate the forecasting accuracy of a model of interest. In particular, we will often be interested in comparing the performance of competing forecasting models, e.g. an AR(1) vs. an ARMA(1,1).

• How can we perform such a forecast evaluation?

• Answer: we can compare the mean squared errors using pseudo-out of sample forecast exercises.

20
9.2 Pseudo out-of-sample exercises

Suppose we have a sample of T observations and let τ = T0. A pseudo out-of-sample exercise works as follows:
1. We use the first τ observations to estimate the parameters of the model.
2. We forecast Yτ+j, obtaining Ỹτ+j|τ, for j = 1, 2, ..., s.
3. We compute the forecast errors wτ+j|τ = Yτ+j − Ỹτ+j|τ for j = 1, 2, ..., s.
4. We update τ = τ + 1 and repeat steps 1-3.
We repeat steps 1-4 up to the end of the sample and compute, for each horizon j, the mean squared error
M̂SE(ỸT+j|T) = [1/(T − T0 − j)] Σ_{τ=T0}^{T−j} w²τ+j|τ
or the root mean squared error
R̂MSE(ỸT+j|T) = { [1/(T − T0 − j)] Σ_{τ=T0}^{T−j} w²τ+j|τ }^{1/2}.
A MATLAB sketch of this exercise is given below.
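The sketch covers one-step-ahead (j = 1) recursive forecasts from an AR(1) with a constant; the data, T0 and the model are illustrative placeholders.

% Pseudo out-of-sample exercise: recursive one-step-ahead AR(1) forecasts
y  = filter(1, [1 -0.7], randn(300,1));      % illustrative data
T  = length(y); T0 = 100;
w  = zeros(T-T0, 1);                         % one-step-ahead forecast errors
for tau = T0:T-1
    X    = [ones(tau-1,1), y(1:tau-1)];      % estimate with data up to tau
    b    = X \ y(2:tau);
    yfor = [1, y(tau)] * b;                  % forecast of y_{tau+1}
    w(tau-T0+1) = y(tau+1) - yfor;
end
disp(sqrt(mean(w.^2)))                       % root mean squared error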

21
5. FORECASTING: APPLICATIONS

22
“Why has U.S. inflation become harder to forecast?”
by J. Stock and M. Watson

23
10 “Why has U.S. inflation become harder to forecast?”

• The rate of price inflation in the United States has become both harder and easier to forecast.

• Easier: inflation is much less volatile than it was in the 1970s and the root mean squared er-
ror of inflation forecasts has declined sharply since the mid-1980s.

• Harder: standard multivariate forecasting models do not do a better job than simple naive models. The point was made by Atkeson and Ohanian (2001) (henceforth, AO), who found that, since 1984 in the U.S., backwards-looking Phillips curve forecasts have been inferior to a forecast of twelve-month inflation by its average rate over the previous twelve months (naive or random walk forecast).

• Relevance of the topic: changes in forecasting properties can signal changes in the structure of the economy, and can be taken as evidence that some relations have changed.

• What relations? Structural models should be employed (next part of the course).

24
10.1 U.S. Inflation forecasts: facts and puzzles
10.1.1 Data

• GDP price index inflation (π).

• Robustness analysis done using personal consumption expenditure deflator for core items (PCEcore),
the personal consumption expenditure deflator for all items (PCE-all), and the consumer price index
(CPI, the official CPI-U).

• Real activity variables: the unemployment rate (u), log real GDP (y), the capacity utilization
rate, building permits, and the Chicago Fed National Activity Index (CFNAI)

• Quarterly data. Quarterly values for monthly series are averages of the three months in the quarter.

• The full sample is from 1960:I through 2004:IV.

25
10.1.2 Forecasting models

• Two univariate models and one multivariate forecasting models.

• Let πt = 400 log(pt/pt−1), where pt is the quarterly price index, and let the h-period average inflation be πt^h = (1/h) Σ_{i=0}^{h−1} πt−i. Let π^h_{t+h|t} be the forecast of π^h_{t+h} using information up to date t.
26
10.1.3 AR(r)

• Forecasts made using a univariate autoregression with r lags. r is estimated using the Akaike
Information Criterion (AIC).

• Multistep forecasts are computed by the direct method: projecting h-period ahead inflation on
r lags

• The h-step-ahead AR(r) forecast was computed using the model
π^h_{t+h} − πt = µ^h + α^h(L)Δπt + v^h_t    (11)
where
1. µ^h is a constant,
2. α^h(L) is a polynomial in the lag operator,
3. v^h_t is the h-step-ahead error term.
• The number of lags is chosen according to the Akaike Information Criterion (AIC), meaning that r is chosen such that
AIC = T log(Σ_{t=1}^T ε̂t²) + 2r
is minimized. An alternative criterion is the Bayesian Information Criterion (BIC),
BIC = T log(Σ_{t=1}^T ε̂t²) + r log(T).
A sketch of AIC-based lag selection in MATLAB is given below.
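This is a minimal illustration of AIC-based lag selection for a univariate AR, as described above; the data vector y and the maximum lag rmax are illustrative placeholders.

% Choose the AR lag length r by minimizing AIC = T*log(sum(e.^2)) + 2*r
y    = randn(200, 1);                    % placeholder data
rmax = 8;  aic = zeros(rmax, 1);
for r = 1:rmax
    T = length(y) - rmax;                % common estimation sample
    Y = y(rmax+1:end);
    X = ones(T, 1);
    for i = 1:r
        X = [X, y(rmax+1-i:end-i)];      % append lag i of y
    end
    e = Y - X*(X\Y);                     % OLS residuals
    aic(r) = T*log(sum(e.^2)) + 2*r;
end
[~, rhat] = min(aic);
disp(rhat)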

28
10.1.4 AO. Atkeson-Ohanian (2001)

AO forecast the average four-quarter rate of inflation as the average rate of inflation over the previous four quarters. The AO forecast is
π^h_{t+h|t} = πt^4 = (1/4)(πt + πt−1 + πt−2 + πt−3).

29
10.1.5 Backwards-looking Phillips curve (PC)

The PC forecasts are computed by adding a predictor to (11) to form the autoregressive distributed lag (ADL) specification
π^h_{t+h} − πt = µ^h + α^h(L)Δπt + β^h xgap_t + δ^h(L)Δxt + v^h_t    (12)
where
1. µ^h is a constant,
2. α^h(L) and δ^h(L) are polynomials in the lag operator (lag length chosen using AIC),
3. xgap_t is the gap variable (deviations from a low-pass filter) based on the variable xt,
4. v^h_t is the h-step-ahead error term.
• The P C forecast using ut = xgapt = xt and ∆ut = ∆xt is called P C − u.

• The forecasts P C − ∆u, P C − ∆y, P C − ∆CapU til, P C − ∆P ermits, P C − CF N AI omit


the gap variable and only include stationary predictors ∆u, ∆y, ∆CapU til, ∆P ermits, CF N AI.

30
10.2 Out-of-sample methodology

• The forecasts were computed using the pseudo out-of-sample forecast methodology: that is, for
a forecast made at date t, all estimation, lag length selection, etc. was performed using only data
available through date t.

• The forecasts are recursive, so that forecasts at date t are based on all the data (beginning in
1960:I) through date t.

• The period 1960-1970 was used for initial parameter estimation. The forecast period 1970:I-2004:IV was split into the two sub-periods 1970:I-1983:IV and 1984:I-2004:IV.

31
10.3 Results

32
• The RMSFE of forecasts of GDP inflation has declined and the magnitude of this reduction is
striking. In this sense inflation has become easier to forecast

• The relative performance of the Phillips curve forecasts deteriorated substantially from the first pe-
riod to the second. This deterioration of Phillips curve forecasts is found for all the activity predictors.

• The AO forecast substantially improves upon the AR(AIC) and Phillips curve forecasts at the
four- and eight-quarter horizons in the 1984-2004 period, but not at shorter horizons and not in the
first period.

⇒ Substantial changes in the univariate inflation process and in the bivariate process of inflation
and its activity-based predictors.

33
“Unpredictability and Macroeconomic Stability”
by A. D’Agostino, D. Giannone and P. Surico

34
11 Unpredictability and Macroeconomic Stability

• D’Agostino Giannone and Surico extend the result for inflation to other economic activity variables:
the ability to predict several measures of real activity declined remarkably, relative to naive forecasts,
since the mid-1980s.

• The fall in the predictive ability is a common feature of many forecasting models including those
used by public and private institutions.

• The forecasts for output (and also inflation) of the Federal Reserve's Greenbook and the Survey of Professional Forecasters (SPF) are significantly more accurate than a random walk only before 1985. After 1985, in contrast, the hypothesis of equal predictive ability between naive random walk forecasts and the predictions of those institutions is not rejected for all horizons but the current quarter.

• The decline in predictive accuracy is far more pronounced for institutional forecasters and methods
based on large information sets than for univariate specifications.

• The fact that larger models are associated with larger historical changes suggests that the main
sources of the decline in predictability are the dynamic correlations between variables rather than the
autocorrelations of output and inflation.

35
11.1 Data

• Forecasts for nine monthly key macroeconomic series: three price indices, four measures of real
activity and two interest rates:
1. The three nominal variables are Producer Price Index (PPI ), Consumer Price Index (CPI ) and
Personal Consumption Expenditure implicit Deflator (PCED).
2. The four forecasted measures of real activity are Personal Income (PI ), Industrial Production
(IP) index, Unemployment Rate (UR), and EMPloyees on non-farm Payrolls (EMP).
3. the interest rates are 3 month Treasury Bills (TBILL) and 10 year Treasury Bonds (TBOND).
• The data set consists of monthly observations from 1959:1 through 2003:12 on 131 U.S.macroeconomic
time series including also the nine variables of interest.

36
11.2 Forecasting models

The model used are the following:


1. A Naive forecast model (N or RW).
2. Univariate AR, where the forecasts are based exclusively on the own past values of the variable
of interest.
3. Factor augmented AR forecast (FAAR), in which the univariate models are augmented with
common factors extracted from the whole panel of series.
4. Pooling of bivariate forecasts (POOL): for each variable the forecast is defined as the average of
130 forecasts obtained by augmenting the AR model with each of the remaining 130 variables in
the data set.

37
11.3 Out-of-sample methodology

• Pseudo out-of-sample forecasts are calculated for each variable and method over the horizons h =
1, 3, 6, and 12 months.

• The pseudo out-of-sample forecasting period begins in January 1970 and ends in December 2003.
Forecasts constructed at date T are based on models that are estimated using observations dated T
and earlier.

• Forecast based on rolling samples using, at each point in time, observations over the most recent
10 years.

38
11.4 Results: full sample

39
• For all prices and most real activity indicators, the forecasts based on large information are signifi-
cantly more accurate than the Naive forecasts.

• The factor augmented model produces the most accurate predictions.

• Univariate autoregressive forecasts significantly improve on the naive models for EMP at all hori-
zons and for CPI and PCED at one and three month horizons only. As far as interest rates are
concerned, no forecasting model performs significantly better than the naive benchmark.

40
12 Results: sub samples - inflation

• For all horizons except the first, the result of AO is confirmed: the forecasting performance for
inflation deteriorates.
41
12.1 Results: sub samples - real activity

42
• Little change in the structure of univariate models for real activity.

• The relative MSFEs of FAAR and POOL suggest that important changes have occurred in the
relationship between output and other macroeconomic variables.

• The decline in predictability does not seem to extend to the labor market, especially at short
horizons. The forecasts of the employees on nonfarm payrolls are associated with the smallest per-
centage changes across subsamples.

43
12.2 Results: sub samples - interest rates

• In the second sample the AR, FAAR and POOL methods produce more accurate forecasts than
the RW at the one month horizon.

• Possible interpretation: increased predictability of the FED due to a better communication strategy.

44
12.3 Results: private and institutional forecasters

• The predictions for output and its deflator from two large forecasters representing the private sector
and the policy institutions are considered.

• The survey was introduced by the American Statistical Association and the National Bureau of
Economic Research and is currently maintained by the Philadelphia Fed. The SPF refers to quarterly
measures and is conducted in the middle of the second month of each quarter (here the median of
the individual forecasts is considered)

• The forecasts of the Greenbook are prepared by the Board of Governors at the Federal Reserve for
the meetings of the Federal Open Market Committee (FOMC), which take place roughly every six
weeks.

• Four forecast horizons ranging from 1 to 4 quarters.

• The measure of output is Gross National Product (GNP) until 1991 and Gross Domestic Product
(GDP) from 1992 onwards.

• The evaluation sample begins in 1975 (prior to this date the Greenbook forecasts were not al-
ways available up to the fourth quarter horizon).
45
12.4 Results: private and institutional forecasters - inflation

46
12.5 Results: private and institutional forecasters - real activity

47
”The Return of the Wage Phillips Curve”

48
13 The Return of the Wage Phillips Curve

• Previous evidence has been taken as a motivation to dismiss the Phillips curve as a theoretical
concept.

• There is a danger with that interpretation.

• In 1958 Phillips uncovered an inverse relation between wage rate inflation and unemployment.

• The focus in recent years, however, has shifted to price inflation.

49
13.1 Back to the origins

50
51
13.2 Results

forecast horizon RMSE VAR/RW RMSE AR/RW % gain using var


1 0.2252 0.2048 9.976
4 0.3642 0.3976 -8.4208
8 0.4892 0.6110 -19.9406
12 0.544 0.6646 -18.1371
16 0.5356 0.6157 -13.0069
18 0.5259 0.5914 -11.0704

• The Phillips curve still (now more than then) characterizes the dynamics of wage growth and unemployment.

• Crucial question: what has changed in the relation between prices and wages?

52
5: MULTIVARIATE STATIONARY PROCESSES

1
1 Some Preliminary Definitions and Concepts

Random Vector: A vector X = (X1, ..., Xn) whose components are scalar-valued random variables
on the same probability space.

Vector Random Process: A family of random vectors {Xt, t ∈ T } defined on a probability space,
where T is a set of time points. Typically T = R, T = Z or T = N, the sets of real, integer and
natural numbers, respectively.

Time Series Vector: A particular realization of a vector random process.

Matrix of polynomials in the lag operator: Φ(L) is a matrix whose elements are polynomials in the lag operator, e.g.

Φ(L) = [1, −0.5L; L, 1 + L] = Φ0 + Φ1L

where

Φ0 = [1, 0; 0, 1],   Φ1 = [0, −0.5; 1, 1],   Φj = [0, 0; 0, 0] for j > 1.

When applied to a vector Xt we obtain

Φ(L)Xt = [1, −0.5L; L, 1 + L] [X1t; X2t] = [X1t − 0.5X2t−1; X1t−1 + X2t + X2t−1]

2
The inverse of Φ(L) is a matrix Φ(L)−1 such that Φ(L)−1Φ(L) = I. Suppose Φ(L) = (I − Φ1L). Its inverse Φ(L)−1 = A(L) is a matrix such that (A0 + A1L + A2L² + ...)(I − Φ1L) = I. That is

A0 = I
A1 − Φ1 = 0 ⇒ A1 = Φ1
A2 − A1Φ1 = 0 ⇒ A2 = Φ1²
...
Ak − Ak−1Φ1 = 0 ⇒ Ak = Φ1^k
(1)

3
1.1 Covariance Stationarity

Let Yt be an n-dimensional vector of time series, Yt′ = [Y1t, ..., Ynt]. Then Yt is covariance (weakly)
stationary if the mean E(Yt) = µ and the autocovariance matrices Γj = E(Yt − µ)(Yt−j − µ)′ are
independent of t and finite, for all j.
− Stationarity of each of the components of Yt does not imply stationarity of the vector Yt. Stationarity
in the vector case requires that the components of the vector are stationary and costationary.
− Although γj = γ−j for a scalar process, the same is not true for a vector process. The correct
relation is
Γj = Γ′−j

Example: n = 2 and µ = 0

Γ1 = [E(Y1tY1t−1), E(Y1tY2t−1); E(Y2tY1t−1), E(Y2tY2t−1)]
   = [E(Y1t+1Y1t), E(Y1t+1Y2t); E(Y2t+1Y1t), E(Y2t+1Y2t)]
   = [E(Y1tY1t+1), E(Y1tY2t+1); E(Y2tY1t+1), E(Y2tY2t+1)]′ = Γ′−1

4
2 Vector Moving average processes

2.1 White Noise (WN)

An n-dimensional vector white noise εt′ = [ε1t, ..., εnt] ∼ WN(0, Ω) is such that E(εt) = 0 and Γk = Ω (Ω
a symmetric positive definite matrix) if k = 0 and Γk = 0 if k ≠ 0. If εt, ετ are independent for t ≠ τ the process is
an independent vector white noise (i.i.d.). If in addition εt ∼ N the process is a Gaussian WN.

Important: A vector whose components are white noise is not necessarily a white noise. Example: let ut be a scalar white noise and define εt = (ut, ut−1)′. Then

E(εtεt′) = [σu², 0; 0, σu²]   and   E(εtε′t−1) = [0, 0; σu², 0]

5
2.2 Vector Moving Average (VMA)

Given the n-dimensional vector white noise εt, a vector moving average of order q is defined as

Yt = µ + εt + C1εt−1 + ... + Cqεt−q

where the Cj are n × n matrices of coefficients.

VMA(1)
Let us consider the VMA(1)
Yt = µ + εt + C1εt−1
with εt ∼ WN(0, Ω); µ is the mean of Yt. The variance of the process is given by

Γ0 = E[(Yt − µ)(Yt − µ)′]
   = Ω + C1ΩC1′

with autocovariances

Γ1 = C1Ω,   Γ−1 = ΩC1′,   Γj = 0 for |j| > 1
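A minimal MATLAB sketch of these formulas; the matrices C1 and Omega below are hypothetical values used only for illustration:

% Second moments of a bivariate VMA(1), Y_t = mu + eps_t + C1*eps_{t-1}
C1    = [0.5 0.2; -0.1 0.3];        % assumed MA coefficient matrix
Omega = [1.0 0.3; 0.3 2.0];         % assumed innovation covariance
Gamma0 = Omega + C1*Omega*C1';      % variance Gamma_0
Gamma1 = C1*Omega;                  % first autocovariance; Gamma_{-1} = Gamma1'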

The VMA(q)
Let us consider the VMA(q)

Yt = µ + εt + C1εt−1 + ... + Cqεt−q

6
with εt ∼ WN(0, Ω); µ is the mean of Yt. The variance of the process is given by

Γ0 = E[(Yt − µ)(Yt − µ)′]
   = Ω + C1ΩC1′ + C2ΩC2′ + ... + CqΩCq′

with autocovariances

Γj = CjΩ + Cj+1ΩC1′ + Cj+2ΩC2′ + ... + CqΩC′q−j   for j = 1, 2, ..., q
Γj = ΩC′−j + C1ΩC′−j+1 + C2ΩC′−j+2 + ... + Cq+jΩCq′   for j = −1, −2, ..., −q
Γj = 0   for |j| > q

7
The VMA(∞)
A useful process, as we will see, is the VMA(∞)

Yt = µ + Σ_{j=0}^∞ Cjεt−j   (2)

The process can be thought of as the limiting case of a VMA(q) (for q → ∞). Recall from the previous result that
the process converges in mean square if {Cj} is absolutely summable.

Proposition (10.2H). Let Yt be an n × 1 vector satisfying

Yt = µ + Σ_{j=0}^∞ Cjεt−j

where εt is a vector WN with E(εt) = 0, E(εtε′t−j) = Ω for j = 0 and zero otherwise, and
{Cj}_{j=0}^∞ is absolutely summable. Let Yit denote the ith element of Yt and µi the ith element of
µ. Then
(a) The autocovariance between the ith variable at time t and the jth variable s periods
earlier, E(Yit − µi)(Yjt−s − µj), exists and is given by the row i, column j element of

Γs = Σ_{v=0}^∞ Cs+vΩCv′

for s = 0, 1, 2, ....
8
(b) The sequence of matrices {Γs}_{s=0}^∞ is absolutely summable.

If furthermore {εt}_{t=−∞}^∞ is an i.i.d. sequence with E|εi1tεi2tεi3tεi4t| < ∞ for i1, i2, i3, i4 =
1, 2, ..., n, then also
(c) E|Yi1t1Yi2t2Yi3t3Yi4t4| < ∞ for all t1, t2, t3, t4
(d) (1/T) Σ_{t=1}^T YitYjt−s →p E(YitYjt−s), for i, j = 1, 2, ..., n and for all s

Implications:
1. Result (a) implies that the second moments of a M A(∞) with absolutely summable coefficients
can be found by taking the limit of the autocovariance of an M A(q).
2. Result (b) ensures ergodicity for the mean
3. Result (c) says that Yt has bounded fourth moments
4. Result (d) says that Yt is ergodic for second moments

Note: the vector MA(∞) representation of a stationary VAR satisfies the absolute summability
condition, so that the assumptions of the previous proposition hold.

9
2.3 Invertible and fundamental VMA

Invertibility. The VMA is invertible, i.e. it possesses a VAR representation, if and only if the determinant
of C(L) vanishes only outside the unit circle, i.e. if det(C(z)) ≠ 0 for all |z| ≤ 1.

Example. Consider the process

[Y1t; Y2t] = [1, L; 0, θ − L] [ε1t; ε2t]

det(C(z)) = θ − z, which is zero for z = θ. The process is invertible if and only if |θ| > 1.

Fundamentalness. The VMA is fundamental if and only if det(C(z)) ≠ 0 for all |z| < 1. In
the previous example the process is fundamental if and only if |θ| ≥ 1. In the case |θ| = 1 the process
is fundamental but noninvertible.

Provided that |θ| > 1, the MA process can be inverted and the shock can be obtained as a combination
of present and past values of Yt. That is, the VAR (Vector Autoregressive) representation
can be recovered. The representation will entail infinitely many lags of Yt with absolutely summable
coefficients, so that the process converges in mean square.

10
Considering the above example,

[1, −L/(θ − L); 0, 1/(θ − L)] [Y1t; Y2t] = [ε1t; ε2t]

or

Y1t − (L/θ) [1/(1 − (1/θ)L)] Y2t = ε1t
(1/θ) [1/(1 − (1/θ)L)] Y2t = ε2t
(3)

11
2.4 Wold Decomposition

Any zero-mean stationary vector process Yt admits the following representation

Yt = C(L)εt + µt   (4)

where C(L)εt is the stochastic component, with C(L) = Σ_{i=0}^∞ CiL^i, and µt is the purely deterministic
component, the one perfectly predictable using linear combinations of past Yt.

If µt = 0 the process is said to be regular. Here we only consider regular processes.

(4) is the Wold representation of Yt, which is unique and for which the following properties hold:
(a) εt is the innovation for Yt, i.e. εt = Yt − Proj(Yt|Yt−1, Yt−2, ...), i.e. the shock is fundamental.
(b) εt is white noise: Eεt = 0, Eεtε′τ = 0 for t ≠ τ, Eεtε′t = Ω.
(c) The coefficients are square summable: Σ_{j=0}^∞ ||Cj||² < ∞.
(d) C0 = I

12
• The result is very powerful since it holds for any covariance stationary process.

• However, the theorem does not imply that (4) is the true representation of the process. For
instance, the process could be stationary but non-linear or non-invertible.

13
2.5 Other fundamental MA(∞) Representations

• It is easy to extend the Wold representation to the general class of invertible MA(∞) representations.
For any non-singular matrix R of constants we can define ut = R−1εt and we have

Yt = C(L)Rut
   = D(L)ut

where ut ∼ WN(0, R−1ΩR−1′).

• Notice that all these representations obtained as linear combinations of the Wold representation
are fundamental. In fact, det(C(z)R) = det(C(z))det(R), so if det(C(z)) ≠ 0 for all |z| < 1,
then also det(C(z)R) ≠ 0 for all |z| < 1.

14
3 VAR: representations

• Every stationary vector process Yt admits a Wold representation. If the MA matrix of lag polyno-
mials is invertible, then a unique VAR exists.

• We define C(L)−1 as an (n × n) lag polynomial such that C(L)−1C(L) = I; i.e. when these
lag polynomial matrices are matrix-multiplied, all the lag terms cancel out. This operation in effect
converts lags of the errors into lags of the vector of dependent variables.

• Thus we move from MA coefficients to VAR coefficients. Define A(L) = C(L)−1. Then given
the (invertible) MA coefficients, it is easy to map these into the VAR coefficients:

Yt = C(L)εt
A(L)Yt = εt   (5)

where A(L) = A0 − A1L − A2L² − ... and the Aj are (n × n) matrices of coefficients.

• To show that this matrix lag polynomial exists and how it maps into the coefficients in C(L),
note that by assumption we have the identity

(A0 − A1L − A2L² − ...)(I + C1L + C2L² + ...) = I

15
After distributing, the identity implies that the coefficients on the lag operators must be zero, which
implies the following recursive solution for the VAR coefficients:

A0 = I
A1 = A0C1 = C1
Ak = A0Ck − A1Ck−1 − A2Ck−2 − ... − Ak−1C1

• As noted, the VAR is of infinite order (i.e. infinite number of lags required to fully represent joint
density).

• In practice, the VAR is usually restricted for estimation by truncating the lag-length. Recall
that the AR coefficients are absolutely summable and vanish at long lags.

pth-order vector autoregression VAR(p). A VAR(p) is given by

Yt = A1Yt−1 + A2Yt−2 + ... + ApYt−p + εt   (6)

Note: Here we are considering zero mean processes. In case the mean of Yt is not zero we should add
a constant in the VAR equations.

16
VAR(1) representation. Any VAR(p) can be rewritten as a VAR(1). To form a VAR(1) from the
general model we define the stacked vectors et′ = [εt′, 0, ..., 0] and Yt′ = [Yt′, Yt−1′, ..., Yt−p+1′] (where Yt on the left denotes the stacked vector), and

A = [A1, A2, ..., Ap−1, Ap;
     In, 0, ..., 0, 0;
     0, In, ..., 0, 0;
     ...
     0, ..., ..., In, 0]

Therefore we can rewrite the VAR(p) as a VAR(1) in the stacked vectors:

Yt = AYt−1 + et

This is also known as the companion form of the VAR(p).
17
SUR representation. The VAR(p) can be stacked as

Y = XΓ + u

where X = [X1, ..., XT]′ with Xt = [Yt−1′, Yt−2′, ..., Yt−p′]′, Y = [Y1, ..., YT]′, u = [ε1, ..., εT]′ and
Γ = [A1, ..., Ap]′.

Vec representation. Let vec denote the column-stacking operator, i.e. if

X = [X11, X12; X21, X22; X31, X32]

then

vec(X) = [X11; X21; X31; X12; X22; X32]

Let γ = vec(Γ); then the VAR can be rewritten as

Yt = (In ⊗ Xt′)γ + εt

18
4 VAR: Stationarity

4.1 Stability and stationarity

• Consider the VAR(1)


Yt = µ + AYt−1 + εt
Substituting backward we obtain

Yt = µ + AYt−1 + εt
= µ + A(µ + AYt−2 + εt−1) + εt
= (I + A)µ + A2Yt−2 + Aεt−1 + εt
...
j−1
X
Yt = (I + A + ... + Aj−1)µ + Aj Yt−j + Aiεt−i
i=0

• The eigenvalues of A, λ, solve det(A − λI) = 0. If all the eigenvalues of A are smaller than one in
modulus, the sequence A^i, i = 0, 1, ..., is absolutely summable. Therefore
1. the infinite sum Σ_{i=0}^∞ A^iεt−i exists in mean square;
2. (I + A + ... + A^{j−1})µ → (I − A)−1µ and A^j → 0 as j goes to infinity.

19
Therefore if the eigenvalues are smaller than one in modulus then Yt has the following representation

Yt = (I − A)−1µ + Σ_{i=0}^∞ A^iεt−i

• Note that the eigenvalues correspond to the reciprocals of the roots of the determinant of
A(z) = I − Az. A VAR(1) is called stable if

det(I − Az) ≠ 0 for |z| ≤ 1.

Stability. A VAR(p) is stable if and only if all the eigenvalues of A (the AR matrix of the
companion form of Yt) are smaller than one in modulus, or equivalently if and only if

det(I − A1z − A2z² − ... − Apz^p) ≠ 0 for |z| ≤ 1.

A condition for stationarity: A stable VAR process is stationary.

• Notice that the converse is not true. An unstable process can be stationary.
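A sketch of the stability check in MATLAB, using a companion matrix A (e.g. as constructed in the earlier sketch):

lambda   = eig(A);                   % eigenvalues of the companion matrix
isStable = all(abs(lambda) < 1);     % true => stable, hence stationary
% Equivalently, the roots of det(I - A1*z - ... - Ap*z^p) are the reciprocals
% of these eigenvalues and must lie outside the unit circle.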

20
4.2 Back to the Wold representation

• If the VAR is stationary, Yt has the following Wold representation

Yt = C(L)εt = Σ_{j=0}^∞ Cjεt−j

where the sequence {Cj} is absolutely summable, Σ_{j=0}^∞ |Cj| < ∞.

• How can we find it? Let us rewrite the VAR(p) as a VAR(1) in companion form.

• We know how to find the MA(∞) representation of a stationary AR(1). We can proceed similarly
for the VAR(1). Substituting backward in the companion form we have

Yt = A^jYt−j + A^{j−1}et−j+1 + ... + Aet−1 + et

If the conditions for stationarity are satisfied, the series Σ_{j=0}^∞ A^j converges and Yt has a VMA(∞)
representation in terms of the Wold shock et given by

Yt = (I − AL)−1et
   = Σ_{j=0}^∞ A^jet−j
   = C(L)et

21
where, in the companion form, C0 = I, C1 = A, C2 = A², ..., Ck = A^k. The MA coefficient Cj of the
original vector Yt is the n × n upper-left block of the corresponding matrix A^j.
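A minimal MATLAB sketch of this recursion; it assumes the companion matrix A from the earlier sketch and an arbitrary horizon H:

H = 20;  C = zeros(n, n, H+1);       % C(:,:,j+1) stores C_j
Apow = eye(n*p);                     % A^0
for j = 0:H
    C(:,:,j+1) = Apow(1:n, 1:n);     % C_j = upper-left n x n block of A^j
    Apow = Apow * A;
end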

22
5 VAR: second moments

Let us consider the companion form of a stationary (zero mean for simplicity) VAR(p) defined earlier

Yt = AYt−1 + et   (7)

The variance of the stacked vector Yt is given by

Σ̃ = E[YtYt′] = AΣ̃A′ + Ω̃   (8)

A closed form solution to (8) can be obtained in terms of the vec operator. Let A, B, C be matrices
such that the product ABC exists. A property of the vec operator is that

vec(ABC) = (C′ ⊗ A)vec(B)

Applying the vec operator to both sides of (8) we have

vec(Σ̃) = (A ⊗ A)vec(Σ̃) + vec(Ω̃)

so that

vec(Σ̃) = (I − A ⊗ A)−1vec(Ω̃)

where

Γ̃0 = Σ̃ = [Γ0, Γ1, ..., Γp−1; Γ−1, Γ0, ..., Γp−2; ...; Γ−p+1, Γ−p+2, ..., Γ0]

23
The variance Σ = Γ0 of the original series Yt is given by the first n rows and columns of Σ̃.

The jth autocovariance of Yt (denoted Γ̃j) can be found by post-multiplying (7) by Y′t−j and
taking expectations:

E(YtY′t−j) = AE(Yt−1Y′t−j) + E(etY′t−j)

Thus

Γ̃j = AΓ̃j−1

or

Γ̃j = A^jΓ̃0 = A^jΣ̃

The autocovariances Γj of the original series Yt are given by the first n rows and columns of Γ̃j and
satisfy

Γh = A1Γh−1 + A2Γh−2 + ... + ApΓh−p

known as the Yule-Walker equations.
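A sketch in MATLAB of the vec formula for the unconditional second moments; it assumes the companion matrix A and an innovation covariance Omega are available:

OmegaTilde = zeros(n*p);  OmegaTilde(1:n,1:n) = Omega;      % covariance of e_t
vecSigma   = (eye((n*p)^2) - kron(A, A)) \ OmegaTilde(:);   % vec(Sigma~)
SigmaTilde = reshape(vecSigma, n*p, n*p);
Gamma0     = SigmaTilde(1:n, 1:n);                          % variance of Y_t
Gamma1     = A(1:n,:) * SigmaTilde(:, 1:n);                 % first autocovariance (upper-left block of A*Sigma~)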

24
6. VAR: ESTIMATION AND HYPOTHESIS TESTING1

1
This part is based on the Hamilton textbook.

1
1 Conditional Likelihood

Let us consider the VAR(p)

Yt = c + A1Yt−1 + A2Yt−2 + ... + ApYt−p + εt   (1)

with εt ∼ i.i.d. N(0, Ω). Suppose we have a sample of T + p observations on these variables. Conditioning
on the first p observations we can form the conditional likelihood

f(YT, YT−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ)   (2)

where θ is a vector containing all the parameters of the model. We refer to (2) as the ”conditional likelihood
function”.

The joint density of observations 1 through t conditioned on Y0, ..., Y−p+1 satisfies

f(Yt, Yt−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ) = f(Yt−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ)
×f(Yt|Yt−1, ..., Y1, Y0, Y−1, ..., Y−p+1, θ)

Applying the formula recursively, the likelihood for the full sample is the product of the individual
conditional densities

f(YT, YT−1, ..., Y1|Y0, Y−1, ..., Y−p+1, θ) = Π_{t=1}^T f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ)   (3)

2
At each t, conditional on the values of Y through date t − 1,

Yt|Yt−1, Yt−2, ..., Y−p+1 ∼ N(c + A1Yt−1 + A2Yt−2 + ... + ApYt−p, Ω)

Recall that

Xt = [1; Yt−1; Yt−2; ...; Yt−p]

is an (np + 1) × 1 vector, and let Π′ = [c, A1, A2, ..., Ap] be the n × (np + 1) matrix of coefficients.
Using this notation we have that

Yt|Yt−1, Yt−2, ..., Y−p+1 ∼ N(Π′Xt, Ω)

Thus the conditional density of the tth observation is

f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ) = (2π)^{−n/2} |Ω−1|^{1/2} exp[(−1/2)(Yt − Π′Xt)′Ω−1(Yt − Π′Xt)]   (4)

The sample log-likelihood is found by substituting (4) into (3) and taking logs:

L(θ) = Σ_{t=1}^T log f(Yt|Yt−1, Yt−2, ..., Y−p+1, θ)
     = −(Tn/2) log(2π) + (T/2) log|Ω−1| − (1/2) Σ_{t=1}^T (Yt − Π′Xt)′Ω−1(Yt − Π′Xt)   (5)

4
2 Maximum Likelihood Estimate (MLE) of Π

The MLE estimates of Π are given by

Π̂′MLE = [Σ_{t=1}^T YtXt′][Σ_{t=1}^T XtXt′]−1

Π̂′MLE is n × (np + 1). The jth row of Π̂′ is

π̂j′ = [Σ_{t=1}^T YjtXt′][Σ_{t=1}^T XtXt′]−1

which is the estimated coefficient vector from an OLS regression of Yjt on Xt. Thus the MLE estimates
for equation j are found by an OLS regression of Yjt on p lags of all the variables in the system.

We can verify that Π̂′MLE = Π̂′OLS. To verify this, rewrite the last term in the log-likelihood as

Σ_{t=1}^T (Yt − Π′Xt)′Ω−1(Yt − Π′Xt) = Σ_{t=1}^T [(Yt − Π̂′Xt + Π̂′Xt − Π′Xt)′Ω−1(Yt − Π̂′Xt + Π̂′Xt − Π′Xt)]
= Σ_{t=1}^T [(ε̂t + (Π̂ − Π)′Xt)′Ω−1(ε̂t + (Π̂ − Π)′Xt)]

5
= Σ_{t=1}^T ε̂t′Ω−1ε̂t + 2 Σ_{t=1}^T ε̂t′Ω−1(Π̂ − Π)′Xt + Σ_{t=1}^T Xt′(Π̂ − Π)Ω−1(Π̂ − Π)′Xt

The term 2 Σ_{t=1}^T ε̂t′Ω−1(Π̂ − Π)′Xt is a scalar, so that

Σ_{t=1}^T ε̂t′Ω−1(Π̂ − Π)′Xt = tr[Σ_{t=1}^T ε̂t′Ω−1(Π̂ − Π)′Xt]
= tr[Σ_{t=1}^T Ω−1(Π̂ − Π)′Xtε̂t′]
= tr[Ω−1(Π̂ − Π)′ Σ_{t=1}^T Xtε̂t′]

But Σ_{t=1}^T Xtε̂t′ = 0 by construction, since the regressors are orthogonal to the OLS residuals, so that we have

Σ_{t=1}^T (Yt − Π′Xt)′Ω−1(Yt − Π′Xt) = Σ_{t=1}^T ε̂t′Ω−1ε̂t + Σ_{t=1}^T Xt′(Π̂ − Π)Ω−1(Π̂ − Π)′Xt

6
Given that Ω is positive definite, so is Ω−1; thus the smallest value of

Σ_{t=1}^T Xt′(Π̂ − Π)Ω−1(Π̂ − Π)′Xt

is achieved by setting Π = Π̂, i.e. the log-likelihood is maximized when Π = Π̂. This establishes the
claim that the MLE estimator coincides with the OLS estimator.

Recall the SUR representation

Y = XA + u

where X = [X1, ..., XT]′ with Xt = [Yt−1′, Yt−2′, ..., Yt−p′]′, Y = [Y1, ..., YT]′, u = [ε1, ..., εT]′ and
A = [A1, ..., Ap]′. The MLE estimator is given by

Â = (X′X)−1X′Y

(notice that Â = Π̂′MLE: different notation, same estimator).
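A minimal MATLAB sketch of the estimator, for a T x n data matrix Y and a lag order p (both assumed given); a constant is included as in the conditional likelihood above:

[T, n] = size(Y);
X = ones(T-p, 1);
for j = 1:p
    X = [X, Y(p+1-j:T-j, :)];            % X_t' = [1, Y_{t-1}', ..., Y_{t-p}']
end
Ydep     = Y(p+1:T, :);
PiHat    = (X'*X) \ (X'*Ydep);           % (np+1) x n; column j = OLS coefficients of eq. j
ehat     = Ydep - X*PiHat;               % residuals
OmegaHat = (ehat'*ehat) / (T-p);         % MLE of Omega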

7
3 MLE of Ω

3.1 Some useful results

Let X be an n × 1 vector and let A be a nonsymmetric and unrestricted n × n matrix. Consider the quadratic
form X′AX.

(i) The first result says that

∂X′AX/∂A = XX′

(ii) The second result says that

∂ log|A|/∂A = (A′)−1

8
3.2 The estimator

We now find the MLE of Ω. When evaluated at Π̂, the log likelihood is

L(θ) = −(Tn/2) log(2π) + (T/2) log|Ω−1| − (1/2) Σ_{t=1}^T ε̂t′Ω−1ε̂t   (6)

Taking derivatives and using the results for matrix derivatives we have

∂L(Ω, Π̂)/∂Ω−1 = (T/2) ∂ log|Ω−1|/∂Ω−1 − (1/2) ∂[Σ_{t=1}^T ε̂t′Ω−1ε̂t]/∂Ω−1
              = (T/2)Ω′ − (1/2) Σ_{t=1}^T ε̂tε̂t′   (7)

The likelihood is maximized when the derivative is set to zero, or when

Ω′ = (1/T) Σ_{t=1}^T ε̂tε̂t′   (8)

Since the right-hand side is symmetric, the MLE is

Ω̂ = (1/T) Σ_{t=1}^T ε̂tε̂t′   (9)

9
4 Asymptotic distribution of Π̂

Maximum likelihood estimates are consistent even if the true innovations are non-Gaussian. The
asymptotic properties of the MLE estimator are summarized in the following proposition

Proposition (11.1H). Let

Yt = c + A1Yt−1 + A2Yt−2 + ... + ApYt−p + εt

where εt is i.i.d. with mean zero and variance Ω, E(εitεjtεltεmt) < ∞ for all i, j, l, m, and where
the roots of

|I − A1z − A2z² − ... − Apz^p| = 0

lie outside the unit circle. Let k = np + 1 and let Xt′ be the 1 × k vector

Xt′ = [1, Y′t−1, Y′t−2, ..., Y′t−p]

Let π̂T = vec(Π̂T) denote the nk × 1 vector of coefficients resulting from the OLS regressions of
each of the elements of Yt on Xt for a sample of size T:

π̂T = [π̂1T; π̂2T; ...; π̂nT]
10
where " #−1 " #
T
X T
X
π̂iT = XtXt0 XtYit
t=1 t=1
and let π denote the vector of corresponding population coefficients. Finally let
T
X
Ω̂ = (1/T ) ε̂tε̂0t
t=1

where ε̂t = [ε̂1t ε̂2t ... ε̂nt], and ε̂it = Yit − Xt0π̂iT .

Then
0 p
Q where Q = E(XtXt0)
P
(a) (1/T ) t=1 Xt Xt →
p
(b) π̂T → π
p
(c) Ω̂ → Ω
√ L
(d) T (π̂T − π) → N (0, Ω ⊗ Q−1)
Notice that result (d) implies that
√ L
T (π̂iT − πi) → N (0, σi2Q−1)

11
where σi² is the variance of the error term of the ith equation. σi² is consistently estimated by
(1/T) Σ_{t=1}^T ε̂it², and Q−1 is consistently estimated by [(1/T) Σ_{t=1}^T XtXt′]−1. Therefore we can
treat π̂i approximately as

π̂i ≈ N(πi, σ̂i² [Σ_{t=1}^T XtXt′]−1)

12
5 Number of lags

As in the univariate case, care must be taken to account for all systematic dynamics in multivariate
models. In VAR models, this is usually done by choosing a sufficient number of lags to ensure that
the residuals in each of the equations are white noise.

AIC: Akaike information criterion. Choose the p that minimizes

AIC(p) = T ln|Ω̂| + 2(n²p)

BIC: Bayesian information criterion. Choose the p that minimizes

BIC(p) = T ln|Ω̂| + (n²p) ln T

HQ: Hannan-Quinn information criterion. Choose the p that minimizes

HQ(p) = T ln|Ω̂| + 2(n²p) ln ln T

The p̂ obtained using BIC and HQ are consistent, while the one obtained with AIC is not.
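A sketch of the lag-length selection in MATLAB, reusing the OLS estimation above inside the loop; pmax and the common estimation sample are assumptions:

pmax = 8;  crit = zeros(pmax, 3);                     % columns: AIC, BIC, HQ
for p = 1:pmax
    X = ones(T-pmax, 1);                              % common sample across p
    for j = 1:p
        X = [X, Y(pmax+1-j:T-j, :)];
    end
    Ydep = Y(pmax+1:T, :);  Teff = T - pmax;
    ehat = Ydep - X*((X'*X)\(X'*Ydep));
    OmegaHat  = (ehat'*ehat)/Teff;
    crit(p,:) = Teff*log(det(OmegaHat)) + (n^2*p)*[2, log(Teff), 2*log(log(Teff))];
end
[~, phat] = min(crit);                                % phat = [p_AIC, p_BIC, p_HQ]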

13
AIC overestimates the true order with positive probability and underestimates the true order with
zero probability.

Suppose a VAR(p) is fitted to Y1, ..., YT (Yt not necessarily stationary). In small samples the following
relations hold:
p̂BIC ≤ p̂AIC if T ≥ 8
p̂BIC ≤ p̂HQ for all T
p̂HQ ≤ p̂AIC if T ≥ 16

14
6 Testing: Wald Test

A general hypothesis of the form Rπ = r, involving coefficients across different equations can be tested
using the Wald form of the χ2 test seen in the first part of the course. Result (d) of Proposition 11.1H
establishes that
√T(Rπ̂T − r) →L N(0, R(Ω ⊗ Q−1)R′)   (10)

The following proposition establishes a useful result for testing.

Proposition (3.5L). Suppose (10) holds, Ω̂ →p Ω, (1/T) Σ_{t=1}^T XtXt′ →p Q (Q, Ω both nonsingular)
and Rπ = r is true.

Then

(Rπ̂T − r)′ [R(Ω̂ ⊗ [Σ_{t=1}^T XtXt′]−1)R′]−1 (Rπ̂T − r) →d χ²(m)

where m is the number of restrictions, i.e. the number of rows of R.

15
7 Testing: Likelihood ratio Test

Let us first consider the log likelihood evaluated at the MLE:

L(Ω̂, Π̂) = −(Tn/2) log(2π) + (T/2) log|Ω̂−1| − (1/2) Σ_{t=1}^T ε̂t′Ω̂−1ε̂t   (11)

The last term is

(1/2) Σ_{t=1}^T ε̂t′Ω̂−1ε̂t = (1/2) tr[Σ_{t=1}^T ε̂t′Ω̂−1ε̂t]
= (1/2) tr[Σ_{t=1}^T Ω̂−1ε̂tε̂t′]
= (1/2) tr[Ω̂−1 T Ω̂]
= (1/2) tr[T In]
= Tn/2

Substituting in (11) we have

L(Ω̂, Π̂) = −(Tn/2) log(2π) + (T/2) log|Ω̂−1| − (Tn/2)   (12)

Suppose we want to test the null hypothesis that a set of variables was generated by a VAR with
p0 lags against the alternative specification with p1 > p0 lags. Let Ω̂0 = (1/T) Σ_{t=1}^T ε̂t(p0)ε̂t(p0)′,
where ε̂t(p0) is the residual estimated in the VAR(p0).
16
The log likelihood is given by

L0 = −(Tn/2) log(2π) + (T/2) log|Ω̂0−1| − (Tn/2)   (13)

Let Ω̂1 = (1/T) Σ_{t=1}^T ε̂t(p1)ε̂t(p1)′, where ε̂t(p1) is the residual estimated in the VAR(p1). The log
likelihood is given by

L1 = −(Tn/2) log(2π) + (T/2) log|Ω̂1−1| − (Tn/2)   (14)

Twice the log ratio is

2(L1 − L0) = 2{(T/2) log|Ω̂1−1| − (T/2) log|Ω̂0−1|}
= T log(1/|Ω̂1|) − T log(1/|Ω̂0|)
= −T log|Ω̂1| + T log|Ω̂0|
= T{log|Ω̂0| − log|Ω̂1|}   (15)

Under the null hypothesis, this asymptotically has a χ² distribution with degrees of freedom equal to
the number of restrictions imposed under H0. Each equation in the restricted model has n(p1 − p0)
restrictions, in total n²(p1 − p0). Thus the statistic is asymptotically χ² with n²(p1 − p0) degrees of freedom.

17
Example. Suppose n = 2, p0 = 3, p1 = 4, T = 46. Let (1/T) Σ_{t=1}^T [ε̂(p0)1t]² = 2, (1/T) Σ_{t=1}^T [ε̂(p0)2t]² = 2.5 and
(1/T) Σ_{t=1}^T ε̂(p0)1tε̂(p0)2t = 1. Then

Ω̂0 = [2.0, 1.0; 1.0, 2.5]

for which log|Ω̂0| = log 4 = 1.386. Moreover

Ω̂1 = [1.8, 0.9; 0.9, 2.2]

for which log|Ω̂1| = 1.147. Then

2(L1 − L0) = 46(1.386 − 1.147) = 10.99

The degrees of freedom for this test are 2²(4 − 3) = 4. Since 10.99 > 9.49 (the 5% critical value for
a χ²(4) variable), the null hypothesis is rejected.
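The numbers in the example can be reproduced with a few lines of MATLAB (chi2inv requires the Statistics Toolbox; the 9.49 critical value can also be taken from a table):

T = 46;  n = 2;  p0 = 3;  p1 = 4;
Omega0 = [2.0 1.0; 1.0 2.5];  Omega1 = [1.8 0.9; 0.9 2.2];
LR  = T * (log(det(Omega0)) - log(det(Omega1)));    % about 10.99
dof = n^2 * (p1 - p0);                              % 4 degrees of freedom
reject = LR > chi2inv(0.95, dof);                   % compare with the 5% critical value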

18
8 Granger Causality

Granger causality If a scalar y cannot help in forecasting x we say that y does not Granger cause x.
y fails to Granger cause x if for all s > 0 the mean squared error of a forecast of xt+s based on
(xt, xt−1, ...) is the same as the MSE of a forecast of xt+s based on (xt, xt−1, ...) and (yt, yt−1, ...).
If we restrict ourselves to linear functions, y fails to Granger cause x if

MSE[Ê(xt+s|xt, xt−1, ...)] = MSE[Ê(xt+s|xt, xt−1, ..., yt, yt−1, ...)]   (16)

where Ê(x|y) is the linear projection of the vector x on the vector y, i.e. the linear function α′y satisfying
E[(x − α′y)y] = 0.
Also, we say that x is exogenous in the time series sense with respect to y if (16) holds.

19
8.1 Granger Causality in Bivariate VAR

Let us consider a bivariate VAR

[Y1t; Y2t] = [A11^(1), A12^(1); A21^(1), A22^(1)][Y1t−1; Y2t−1] + [A11^(2), A12^(2); A21^(2), A22^(2)][Y1t−2; Y2t−2] +
+ ... + [A11^(p), A12^(p); A21^(p), A22^(p)][Y1t−p; Y2t−p] + [ε1t; ε2t]   (17)

We say that Y2 fails to Granger cause Y1 if the elements A12^(j) = 0 for j = 1, ..., p. We can check
that if A12^(j) = 0 the two MSEs coincide. For s = 1 we have

Ê(Y1t+1|Y1t, Y1t−1, ..., Y2t, Y2t−1, ...) = A11^(1)Y1t + A12^(1)Y2t + A11^(2)Y1t−1 + A12^(2)Y2t−1 +
+ ... + A11^(p)Y1t−p+1 + A12^(p)Y2t−p+1

Clearly, if A12^(j) = 0,

Ê(Y1t+1|Y1t, Y1t−1, ..., Y2t, Y2t−1, ...) = A11^(1)Y1t + A11^(2)Y1t−1 + ... + A11^(p)Y1t−p+1
= Ê(Y1t+1|Y1t, Y1t−1, ...)   (18)

An important implication of Granger causality in the bivariate context is that if Y2 fails to Granger
cause Y1 then the Wold representation of Yt is

[Y1t; Y2t] = [C11(L), 0; C21(L), C22(L)][ε1t; ε2t]   (19)

20
that is, the second Wold shock has no effect on the first variable. This is easy to show by deriving
the Wold representation by inverting the VAR polynomial matrix.

21
8.2 Econometric test for Granger Causality

The simplest approach to testing Granger causality in an autoregressive framework is the following:
estimate the bivariate VAR with p lags by OLS and test the null hypothesis H0 : A12^(1) = A12^(2) =
... = A12^(p) = 0 with an F-test based on

S1 = [(RSS0 − RSS1)/p] / [RSS1/(T − 2p − 1)]

and reject if S1 > F(α, p, T−2p−1). An asymptotically equivalent test is

S2 = T(RSS0 − RSS1)/RSS1

and reject if S2 > χ²(α, p).
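A minimal MATLAB sketch of the F-test for "y2 does not Granger cause y1", for two T x 1 series y1, y2 and a lag order p (all assumed given):

Z0 = ones(T-p, 1);  Z1 = Z0;
for j = 1:p
    Z0 = [Z0, y1(p+1-j:T-j)];                        % restricted: lags of y1 only
    Z1 = [Z1, y1(p+1-j:T-j), y2(p+1-j:T-j)];         % unrestricted: lags of y1 and y2
end
ydep = y1(p+1:T);
e0 = ydep - Z0*((Z0'*Z0)\(Z0'*ydep));  RSS0 = e0'*e0;
e1 = ydep - Z1*((Z1'*Z1)\(Z1'*ydep));  RSS1 = e1'*e1;
S1 = ((RSS0 - RSS1)/p) / (RSS1/(T - 2*p - 1));       % compare with F(p, T-2p-1)
S2 = T*(RSS0 - RSS1)/RSS1;                           % compare with chi2(p)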

22
8.3 Application 1: Output Growth and the Yield Curve

• Many research papers have found that yield curve (difference in long and short yield) has been a
good predictor, i.e. a variable that helps to forecast, for the real GDP growth in the US (Estrella,
2000, 2005). However, recent evidence suggests that its predictive power has declined since the beginning
of the 80s (see D'Agostino, Giannone and Surico, 2006). This means we should find that the
yield curve Granger causes output growth before the mid-80s but not after.

• We estimate a bivariate VAR for the growth rates of the real GDP and the difference between
the 10-year rate and the federal funds rate. Data are from FREDII StLouis Fed spanning from
1954:III-2007:III. The AIC criterion suggests p = 6.

23
Figure 1: Blue: real GDP growth rates; green: long-short spread.

24
Table 1: F-Tests of Granger Causality
1954:IV-2007:III 1954:IV-1990:I 1990:I-2007:III

S1 5.4233 6.0047 0.9687


10% 1.8050 1.8222 1.8954
5% 2.1460 2.1725 2.2864
1% 2.8971 2.9508 3.1864

We cannot reject the hypothesis that the spread does not Granger cause real output growth in the
last period, while we reject the hypothesis for the other samples. This can be explained by a change
in the information content of private agents' expectations, which is the information embedded in the
yield curve.

25
8.4 Application 2: Money, Income and Causality

In the 1950s and 1960s there was a big debate about the importance of money and monetary policy. Does money
affect output? For Friedman and the monetarists, yes. For Keynesians (Tobin), no: output was affecting
money, and movements in the money stock were reflecting movements in money demand. Sims in 1972
ran a test in order to distinguish between the two visions. He found that money was Granger-causing
output but not the reverse, providing evidence in favor of the monetarist view.

26
Figure 2: Blue: real GNP growth rates; green: M1 growth rates.

27
Table 2: F-Tests of Granger Causality
1959:II-1972:I 1959:II-2007:III

M →Y 4.4440 2.2699
Y →M 0.5695 3.5776
10% 2.0948 1.7071
5% 2.6123 1.9939
1% 3.8425 2.6187

In the first sample money Granger causes output (at 5%) but not the converse (Sims (1972)'s result). In
the second sample, at the 5% level, both output Granger causes money and money Granger causes output.

28
8.5 Caveat: Granger Causality Tests and Forward Looking Behavior

Let us consider the following simple model of stock price determination, where Pt is the price of one
share of a stock, Dt+1 are dividends paid at t + 1 and r is the rate of return of the stock:
(1 + r)Pt = Et(Dt+1 + Pt+1)
According to the theory stock price incorporates the market’s best forecast of the present value of
the future dividends. Solving forward we have
∞  j
X 1
Pt = Et Dt+j
j=1
1 + r
Suppose
Dt = d + ut + δut−1 + vt
where ut, vt are Gaussian WN and d is the mean dividend. The forecast of Dt+j based on this
information is (
d + δut for j = 1
Et(Dt+j ) =
d for j = 2, 3, ...
Substituting in the stock price equation we have
Pt = d/r + δut/(1 + r)
Thus the price is white noise and could not be forecast on the basis of lagged stock prices or dividends.
No series should Granger cause prices. The value of ut−1 can be uncovered from the lagged stock
29
price
δut−1 = (1 + r)Pt−1 − (1 + r)d/r
The bivariate VAR takes the form

        
Pt d/r 0 0 Pt−1 δut/(1 + r)
= + + (20)
Dt −d/r (1 + r) 0 Dt−1 ut + vt

Granger causation runs in the opposite direction from the true causation. Dividends fail to Granger
cause prices even though expected dividends are the only determinant of prices. On the other hand
prices Granger cause dividends even though this is not the case in the true model.

30
8.6 Granger Causality in a Multivariate Context

Suppose now we are interested in testing for Granger causality in a multivariate (n > 2) context. Let
us consider the following representation of a VAR(p)

Ỹ1t = Ã1X̃1t + Ã2X̃2t + ε̃1t
Ỹ2t = B̃1X̃1t + B̃2X̃2t + ε̃2t
(21)

where Ỹ1t and Ỹ2t are two vectors containing respectively n1 and n2 variables of Yt, and

X̃1t = [Ỹ1t−1; Ỹ1t−2; ...; Ỹ1t−p],   X̃2t = [Ỹ2t−1; Ỹ2t−2; ...; Ỹ2t−p]

Ỹ1t is said to be block exogenous in the time series sense with respect to Ỹ2t if the elements in Ỹ2t are of
no help in improving the forecast of any variable in Ỹ1t. Ỹ1t is block exogenous if Ã2 = 0.

In order to test block exogeneity we can proceed as follows. First notice that the log likelihood
can be rewritten in terms of a conditional and a marginal log density

L(θ) = Σ_{t=1}^T ℓ1t + Σ_{t=1}^T ℓ2t

31
ℓ1t = log f(Ỹ1t|X̃1t, X̃2t, θ)
    = −(n/2) log(2π) − (1/2) log|Ω11| −
      (1/2)[(Ỹ1t − Ã1X̃1t − Ã2X̃2t)′Ω11−1(Ỹ1t − Ã1X̃1t − Ã2X̃2t)]
ℓ2t = log f(Ỹ2t|Ỹ1t, X̃1t, X̃2t, θ)
    = −(n/2) log(2π) − (1/2) log|H| −
      (1/2)[(Ỹ2t − D̃0Ỹ1t − D̃1X̃1t − D̃2X̃2t)′H−1(Ỹ2t − D̃0Ỹ1t − D̃1X̃1t − D̃2X̃2t)]

where D̃0Ỹ1t + D̃1X̃1t + D̃2X̃2t and H represent the mean and the variance, respectively, of Ỹ2t
conditioning also on Ỹ1t.

Consider the maximum likelihood estimation of the system subject to the constraint Ã2 = 0,
giving estimates Â1(0), Ω̂11(0), D̂0, D̂1, D̂2, Ĥ. Now consider the unrestricted maximum likelihood
estimation of the system, providing the estimates Â1, Ω̂11, D̂0, D̂1, D̂2, Ĥ. The likelihood functions
evaluated at the MLE in the two cases are

L(θ̂(0)) = −(T/2) log(2π) + (T/2) log|Ω̂11−1(0)| − (Tn/2)
L(θ̂) = −(T/2) log(2π) + (T/2) log|Ω̂11−1| − (Tn/2)

A likelihood ratio test of the hypothesis Ã2 = 0 can be based on

2(L(θ̂) − L(θ̂(0))) = T(log|Ω̂11−1| − log|Ω̂11−1(0)|)   (22)

32
In practice we perform OLS regressions of each of the elements in Ỹ1t on p lags of all the elements
in Ỹ1t and all the elements in Ỹ2t. Let ε̂1t be the vector of sample residuals and Ω̂11 their estimated
variance covariance matrix. Next perform OLS regressions of each element of Ỹ1t on p lags of Ỹ1t only.
Let ε̂1t(0) be the vector of sample residuals and Ω̂11(0) their estimated variance covariance matrix. If

T{log|Ω̂11(0)| − log|Ω̂11|}

is greater than the 5% critical value of a χ²(n1 n2 p), then the null hypothesis is rejected and the
conclusion is that some elements in Ỹ2t are important for forecasting Ỹ1t.

33
7. STRUCTURAL VAR: THEORY

1
1 Structural Vector Autoregressions

Impulse response functions are interpreted under the assumption that all the other shocks are held
constant. However, in the Wold representation the shocks are not orthogonal, so the assumption is
not very realistic.

This is why we need Structural VARs in order to perform policy analysis. Ideally we would like to
have

1) orthogonal shocks

2) shocks with economic meaning (technology, demand, labor supply, monetary policy etc.)

1.1 Statistical Orthogonalizations

There are two easy ways to orthogonalize shocks.

1) Cholesky decomposition

2) Spectral Decomposition

2
1.2 Cholesky decomposition

Let us consider the matrix Ω. The Cholesky factor, S, of Ω is defined as the unique lower triangular
matrix such that SS′ = Ω. This implies that we can rewrite the VAR in terms of orthogonal shocks
ηt = S−1εt with identity covariance matrix:

A(L)Yt = Sηt

Impulse responses to the orthogonalized shocks are found from the MA representation

Yt = C(L)Sηt = Σ_{j=0}^∞ CjSηt−j   (1)

where CjS has the interpretation

∂Yt+j/∂ηt = CjS   (2)

That is, the row i, column k element of CjS identifies the consequences of a unit increase in ηkt for
the value of the ith variable at time t + j, holding all other shocks constant.
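A sketch in MATLAB, assuming the Wold coefficients C(:,:,j+1) = C_j (e.g. from the recursion sketched earlier) and an estimated covariance OmegaHat are available:

S   = chol(OmegaHat, 'lower');          % S*S' = Omega, lower triangular
IRF = zeros(n, n, H+1);
for j = 0:H
    IRF(:,:,j+1) = C(:,:,j+1) * S;      % element (i,k): response of variable i
end                                     % to Cholesky shock k at horizon j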

3
1.3 Spectral Decomposition

Let V be a matrix containing the eigenvectors of Ω and Λ a diagonal matrix with the eigenvalues
of Ω on the main diagonal, so that V ΛV′ = Ω. This implies that we can rewrite the VAR in terms of
orthogonal shocks ξt = (V Λ^{1/2})−1εt with identity covariance matrix:

A(L)Yt = V Λ^{1/2}ξt

Impulse responses to the orthogonalized shocks are found from the MA representation

Yt = C(L)V Λ^{1/2}ξt = Σ_{j=0}^∞ CjV Λ^{1/2}ξt−j   (3)

where CjV Λ^{1/2} has the interpretation

∂Yt+j/∂ξt = CjV Λ^{1/2}   (4)

That is, the row i, column k element of CjV Λ^{1/2} identifies the consequences of a unit increase in ξkt
at date t for the value of the ith variable at time t + j, holding all other shocks constant.

4
Problem: what is the economic interpretation of the orthogonal shocks? What is the economic
information contained in the impulse response functions to orthogonal shocks?

Except for special cases, it is not clear.

5
1.4 The Class of Orthonormal Representations

From the class of invertible MA representations of Yt we can derive the class of orthonormal representations,
i.e. the class of representations of Yt in terms of orthonormal shocks. Let H be any orthogonal
matrix, i.e. HH′ = H′H = I. Defining wt = (SH)−1εt we can recover the general class of
orthonormal representations of Yt:

Yt = C(L)SHwt
   = F(L)wt

where F(L) = C(L)SH and wt ∼ WN with

E(wtwt′) = E((SH)−1εtεt′(SH)−1′)
         = H′S−1E(εtεt′)(S′)−1H
         = H′S−1Ω(S′)−1H
         = H′S−1SS′(S′)−1H
         = H′H = I

Problem: H can be any orthogonal matrix, so how should we choose one?

6
2 The Identification Problem

• Identifying the VAR means fixing a particular matrix H, i.e. choosing one particular representation
of Yt in order to recover the structural shocks from the VAR innovations

• Therefore structural economic shocks are linear combinations of the VAR innovations.

• In order to choose a matrix H we have to fix n(n − 1)/2 parameters, since there is a total of
n² parameters and a total of n(n + 1)/2 restrictions implied by orthonormality.

• We use economic theory to derive restrictions on the effects of some shocks on particular
variables in order to fix the remaining n(n − 1)/2 parameters.

7
2.1 Zero restrictions: contemporaneous restrictions

• An identification scheme based on zero contemporaneous restrictions is a scheme which imposes
zero restrictions on the matrix F0, the matrix of the impact effects.

Example. Let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix. n(n + 1)/2 = 3
are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1 free parameter.
Suppose that the theory tells us that shock 2 has no effect on impact (contemporaneously) on Y1,
that is F0,12 = 0. This is the additional restriction that allows us to identify the shocks. In
particular we will have the following restrictions:

HH′ = I
S11H12 + S12H22 = 0

Since S12 = 0 the solution is H11 = H22 = 1 and H12 = H21 = 0.

• A common identification scheme is the Cholesky scheme (like in this case). This amounts to setting
H = I. Such an identification scheme creates a recursive contemporaneous ordering
among variables, since S−1 is triangular.

• This means that any variable in the vector Yt does not depend contemporaneously on the
variables ordered after it.
8
• Results depend on the particular ordering of the variables.

9
2.2 Zero restrictions: long run restrictions

• An identification scheme based on zero long run restrictions is a scheme which imposes restrictions
on the matrix F (1) = F0 + F1 + F2 + ..., the matrix of the long run coefficients.

Example. Again let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix.
n(n + 1)/2 = 3 are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1
free parameter. Suppose that the theory tells us that shock 2 does not affect Y1 in the long run, i.e.
F12(1) = 0. This is the additional restriction that allows us to identify the shocks. In particular we
will have the following restrictions:

HH 0 = I
D11(1)H12 + D12(1)H22 = 0

where D(1) = C(1)S represents the long run effects of the Cholesky shocks.

10
2.3 Signs restrictions

• The previous two examples yield exact identification in the sense that the shock is uniquely identified:
there exists a unique matrix H yielding the structural shocks.

• Sign identification is based on qualitative restrictions involving the sign of the effects of some shocks
on some variables. In this case we will have sets of impulse response functions consistent with the restrictions.

Example. Again let us consider a bivariate VAR. We have a total of n² = 4 parameters to fix.
n(n + 1)/2 = 3 are pinned down by the orthonormality restrictions, so that there is n(n − 1)/2 = 1
free parameter. Suppose that the theory tells us that shock 2, which is the interesting one, produces a
positive effect on Y1 for k periods after the shock, i.e. Fj,12 > 0 for j = 1, ..., k. We will have the following
restrictions:

HH 0 = I
S11H12 + S12H22 > 0
Dj,12H12 + Dj,22H22 > 0 j = 1, ..., k

where Dj = Cj S represents the effects at horizon j.

• In a classical statistics approach this does not deliver exact identification, since there can be many
H consistent with such a restriction. That is, for each parameter of the impulse response functions
11
we will have an admissible set of values.

• Increasing the number of restrictions can be helpful in reducing the number of H consistent with
such restrictions.

12
2.4 Parametrizing H

• A useful way to parametrize the matrix H so as to incorporate the orthonormality restrictions is to use
rotation matrices. Let us consider the bivariate case. A rotation matrix in this case is the matrix

H = [cos(θ), sin(θ); −sin(θ), cos(θ)]

• Note that such a matrix incorporates the orthonormality conditions. The parameter θ will be found
by imposing the additional economic restriction.

• In general the rotation matrix will be found as the product of n(n − 1)/2 rotation matrices.
For the case of three shocks the rotation matrix can be found as the product of the following three
matrices:

[cos(θ1), sin(θ1), 0; −sin(θ1), cos(θ1), 0; 0, 0, 1] [cos(θ2), 0, sin(θ2); 0, 1, 0; −sin(θ2), 0, cos(θ2)] [1, 0, 0; 0, cos(θ3), sin(θ3); 0, −sin(θ3), cos(θ3)]
Example. Suppose that n = 2 and the restriction we want to impose is that the effect of the first
shock on the second variable has a positive sign, i.e.

S21H11 + S22H21 > 0

Using the parametrization seen before, the restriction becomes

S21cos(θ) − S22sin(θ) > 0
13
which implies

tan(θ) = sin(θ)/cos(θ) < S21/S22

If S21 = 0.5 and S22 = 1 then all the impulse response functions obtained with θ < atan(0.5) satisfy
the restriction and should be kept.

14
2.5 Partial Identification

• In many cases we might be interested in identifying just a single shock and not all the n shocks.

• Since the shocks are orthogonal we can also partially identify the model, i.e. fix just one (or a
subset of) column(s) of H. In this case what we have to do is to fix n − 1 elements of a column of the
identifying matrix; the additional restriction is provided by the requirement that the norm of the
column equals one.

Example. Suppose n = 3. We want to identify a single shock using the restriction that such a shock
has no effect on the first variable on impact, a positive effect on the second variable and a negative
effect on the third variable.

First of all we notice that the first column of the product of rotation matrices seen before is

H1 = [cos(θ1)cos(θ2); −sin(θ1)cos(θ2); −sin(θ2)]

therefore the impact effects of the first shock are given by

[S11, 0, 0; S21, S22, 0; S31, S32, S33] [cos(θ1)cos(θ2); −sin(θ1)cos(θ2); −sin(θ2)]
15
To implement the first restriction we can set θ1 = π/2, i.e. cos(θ1) = 0. This implies that the impact
effects become

[S11, 0, 0; S21, S22, 0; S31, S32, S33] [0; −cos(θ2); −sin(θ2)]

The second restriction implies that

−S22cos(θ2) > 0

and the third

−S32cos(θ2) − S33sin(θ2) < 0

All the values of θ2 satisfying the two restrictions yield impulse response functions consistent with
the identification scheme.

16
2.6 Variance Decomposition

• The second type of analysis which is usually done in SVAR is the variance decomposition analysis.

• The idea is to decompose the total variance of a time series into the percentages attributable
to each structural shock.

• Variance decomposition analysis is useful in order to address questions like ”What are the sources
of the business cycle?” or ”Is the shock important for economic fluctuations?”.

17
Let us consider the MA representation of an identified SVAR

Yt = F(L)wt

The variance of Yit is given by

var(Yit) = Σ_{k=1}^n Σ_{j=0}^∞ (F^j_{ik})² var(wkt)
         = Σ_{k=1}^n Σ_{j=0}^∞ (F^j_{ik})²

where Σ_{j=0}^∞ (F^j_{ik})² is the variance of Yit generated by the kth shock. This implies that

[Σ_{j=0}^∞ (F^j_{ik})²] / [Σ_{k=1}^n Σ_{j=0}^∞ (F^j_{ik})²]

is the percentage of the variance of Yit explained by the kth shock.

18
It is also possible to study the fraction of the variance of the series explained by the shock at different horizons, i.e. short vs.
long run. Consider the forecast error in terms of the structural shocks. The horizon h forecast error is
given by

Yt+h − Yt+h|t = F0wt+h + F1wt+h−1 + ... + Fh−1wt+1

The variance of the forecast error of the ith variable is thus

var(Yit+h − Yit+h|t) = Σ_{k=1}^n Σ_{j=0}^{h−1} (F^j_{ik})² var(wkt)
                     = Σ_{k=1}^n Σ_{j=0}^{h−1} (F^j_{ik})²

Thus the percentage of the variance of the h-step forecast error of Yit explained by the kth shock is

[Σ_{j=0}^{h−1} (F^j_{ik})²] / [Σ_{k=1}^n Σ_{j=0}^{h−1} (F^j_{ik})²]
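A minimal MATLAB sketch of the forecast error variance decomposition, under the assumption that the structural impulse responses to unit-variance shocks are stored as F(:,:,j+1) = F_j:

h  = 8;  vd = zeros(n, n);
for j = 0:h-1
    vd = vd + F(:,:,j+1).^2;            % (i,k): contribution of shock k to the
end                                     % h-step forecast error variance of Y_i
vd = vd ./ sum(vd, 2);                  % shares: each row sums to one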

19
8. STRUCTURAL VAR: APPLICATIONS

1
1 Monetary Policy Shocks (Christiano Eichenbaum and Evans, 1999 HoM)

• A monetary policy shock is the unexpected part of the equation for the monetary policy instrument
(St):

St = f(It) + w^mp_t

f(It) represents the systematic response of monetary policy to economic conditions, It is the
information set at time t and w^mp_t is the monetary policy shock.

• The ”standard” way to identify monetary policy shock is through zero contemporaneous restric-
tions. Using the standard monetary VAR (a simplified version of the CEE 98 VAR) including output
growth, inflation and the federal funds rate we identify the monetary policy shock using the following
restrictions:

1) Monetary policy shocks do not affect output within the same quarter

2) Monetary policy shocks do not affect inflation within the same quarter

• These two restrictions are not sufficient to identify all the shocks but are sufficient to identify
the monetary policy shock.

2
• A simple way to implement the restrictions is to take the Cholesky decomposition of the
variance covariance matrix in a system in which the federal funds rate is ordered last. The last column
of the impulse response functions is the column of the monetary policy shock.

3
Cholesky impulse response functions of a system with GDP, inflation and the federal funds rate.
The monetary shock is in the third column.

4
• Notice that after a monetary tightening inflation goes up, which is completely counterintuitive
according to the standard transmission mechanism. This phenomenon is known as the price puzzle.
Why is this the case?

• ”Sims (1992) conjectured that prices appeared to rise after certain measures of a contrac-
tionary policy shock because those measures were based on specifications of It that did not
include information about future inflation that was available to the Fed. Put differently, the
conjecture is that policy shocks which are associated with substantial price puzzles are actually
confounded with non-policy disturbances that signal future increases in prices.” (CEE 98)

• Sims shows that including commodity prices (signaling future inflation increases) may solve the
puzzle.

5
2 Uhlig (JME 2006) monetary policy shocks

• Uhlig proposes a very different method to identify monetary policy shocks. Instead of using zero
restrictions as in CEE he uses sign restrictions.

• He identifies the effects of a monetary policy shock using restrictions which are implied by several
economic models.

• In particular a contractionary monetary policy shock:


1. does not increase prices for k periods after the shock
2. does not increase money or monetary aggregates (i.e. reserves) for k periods after the shock
3. does not reduce short term interest rate for k periods after the shock.
• Since just one shock is identified only a column of H has to be identified, say column one.

• If we order the variables in the vector Yt as follows: GDP, inflation, money growth and the interest
rate, the restrictions imply F^j_{i1} < 0 for i = 2, 3 and F^j_{41} > 0, for the restricted horizons j = 0, ..., k.

6
• In order to draw impulse response functions he applies the following algorithm (a minimal sketch in MATLAB follows the list):
1. He assumes that the column of H, H1, represents the coordinates of a point uniformly distributed
over the unit hypersphere (in the case of a bivariate VAR it represents a point on a circle). To draw
such a point he draws from a N(0, I) and divides by the norm of the vector.
2. Compute the impulse response functions CjSH1 for j = 0, ..., k.
3. If the draw satisfies the restrictions keep it and go to 1), otherwise discard it and go to 1). Repeat
1)-3) a big number of times, L.
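A minimal sketch of the accept/reject algorithm in MATLAB, under the assumption that the Wold coefficients C(:,:,j+1), the Cholesky factor S, the number of restricted horizons k and the ordering above (n = 4) are available; the sign vector and the number of draws L are illustrative:

L = 1000;  accepted = {};               % kept candidate columns H1
signs = [0; -1; -1; 1];                 % 0 = unrestricted, -1/+1 = required sign
for d = 1:L
    H1 = randn(n,1);  H1 = H1/norm(H1); % uniform draw on the unit hypersphere
    ok = true;
    for j = 0:k
        r  = C(:,:,j+1)*S*H1;           % response to the candidate shock at horizon j
        ok = ok && all(r(signs==-1) <= 0) && all(r(signs==1) >= 0);
    end
    if ok, accepted{end+1} = H1; end    % keep only draws satisfying the restrictions
end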

7
Source: What are the effects of a monetary policy shock... JME H. Uhlig (2006)
8
Source: What are the effects of a monetary policy shock... JME H. Uhlig (2006)
9
3 Blanchard Quah (AER 1989) aggregate demand and supply shocks

• Blanchard and Quah proposed an identification scheme based on long run restrictions.

• In their model there are two shocks: an aggregate demand and an aggregate supply disturbance.

• The restriction used to identify the shocks is that aggregate demand shocks have no effect on the long run
level of output, i.e. demand shocks are transitory on output. The idea behind such a restriction
is the existence of a vertical aggregate supply curve.

• Let us consider the following bivariate VAR

[∆Yt; Ut] = [F11(L), F12(L); F21(L), F22(L)] [w^s_t; w^d_t]

where Yt is output, Ut is the unemployment rate and w^s_t, w^d_t are the aggregate supply and demand
disturbances respectively.

• The identification restriction is given by F12(1) = 0.

10
• The restriction can be implemented in the following way. Let us consider the reduced form VAR,
written in its Wold MA form

[∆Yt; Ut] = [A11(L), A12(L); A21(L), A22(L)] [ε1t; ε2t]

where E(εtεt′) = Ω.

Let S = chol(A(1)ΩA(1)′) and K = A(1)−1S. The identified shocks are

wt = K−1εt

and the resulting impulse responses to the structural shocks are

F(L) = A(L)K

Notice that the restrictions are satisfied:

F(1) = A(1)K = A(1)A(1)−1S = S

which is lower triangular, implying that F12(1) = 0.
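A sketch of the implementation in MATLAB, where Ahat(:,:,j) are the estimated reduced-form VAR coefficient matrices and OmegaHat the innovation covariance (assumed available, e.g. from the OLS sketch); here A(1) in the notes' notation is the long-run Wold multiplier, i.e. (I − Â1 − ... − Âp)−1:

ARsum = eye(2);
for j = 1:p
    ARsum = ARsum - Ahat(:,:,j);         % I - A1 - ... - Ap
end
A1lr = inv(ARsum);                       % long-run multiplier A(1)
S    = chol(A1lr*OmegaHat*A1lr', 'lower');
K    = ARsum * S;                        % K = A(1)^{-1}*S; impact responses F_0 = K
% structural shocks: w_t = K \ eps_t;  long-run responses F(1) = A(1)*K = S (lower triangular)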

11
Moreover we have that KK′ = Ω, since

KK′ = A(1)−1SS′A(1)−1′
    = A(1)−1A(1)ΩA(1)′A(1)−1′
    = Ω   (1)

and the shocks are orthogonal with unit variance:

E(wtwt′) = E(K−1εtεt′K−1′)
         = K−1ΩK−1′
         = K−1KK′K−1′
         = I   (2)
12
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):

13
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):

14
Source: The Dynamic Effects of Aggregate Demand and Supply Disturbances, (AER) Blanchard and
Quah (1989):

15
4 The technology shocks and hours debate: Gali (AER 1999), Christiano, Eichenbaum
and Vigfusson (NBER WP, 2003)

This is a nice example of how SVAR models can be used in order to distinguish among competing
models of the business cycle.

1) RBC: technology shocks are an important source of business cycles.

2) Other models (sticky prices): technology shocks are not so important.

The response of hours worked is very important in distinguishing among theories:

1) RBC: hours increase.

2) Other models: hours fall.

16
4.1 The model

• Technology shock: zt = zt−1 + ηt, where ηt is the technology shock.

• Monetary policy: mt = mt−1 + ξt + γηt, where ξt is the monetary policy shock.

• Equilibrium:

∆xt = (1 − 1/ϕ)∆ξt + [(1 − γ)/ϕ + γ]ηt + (1 − γ)(1 − 1/ϕ)ηt−1
nt = (1/ϕ)ξt − [(1 − γ)/ϕ]ηt

or

[∆xt; nt] = [ (1 − γ)/ϕ + γ + (1 − γ)(1 − 1/ϕ)L ,  (1 − 1/ϕ)(1 − L) ;  −(1 − γ)/ϕ ,  1/ϕ ] [ηt; ξt]   (3)

In the long run (L = 1)

[∆xt; nt] = [ (1 − γ)/ϕ + γ + (1 − γ)(1 − 1/ϕ) ,  0 ;  −(1 − γ)/ϕ ,  1/ϕ ] [ηt; ξt]   (4)

that is, only the technology shock affects labor productivity in the long run.

17
Note the model prediction: if monetary policy is not completely accommodative (γ < 1), then the
response of hours to a technology shock, −(1 − γ)/ϕ, is negative.

18
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WP (2003)

19
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WP (2003)

20
Source: What Happens After a Technology Shock?... Christiano Eichenbaum and Vigfusson NBER
WP (2003)

21
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)

22
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)

23
Source: Trend Breaks, Long-Run Restrictions, and Contractionary Technology Improvements, JME
John Fernald (2007)

24
5 Government spending shocks

• Understanding the effects of government spending shocks is important for policy authorities but
also to assess competing theories of the business cycle.

• Keynesian theory: G ↑, Y ↑, C ↑ because of the government spending multiplier.

• RBC theory: G ↑, C ↓ because of a negative wealth effect.

• From an empirical point of view, there is disagreement.

25
5.1 Government spending shocks: Blanchard and Perotti (QJE 2002)

• BP (originally) use a VAR with real per capita taxes, government spending, and GDP, with the
restriction that government spending does not react to taxes and GDP contemporaneously: a Cholesky
identification with government spending ordered first. The government spending shock is the first
one (quadratic trend, four lags).

• When the VAR is augmented with consumption, consumption increases.

26
Source: IDENTIFYING GOVERNMENT SPENDING SHOCKS: IT’S ALL IN THE TIMING Valerie A. Ramey, QJE

27
5.2 Government spending shocks: Ramey and Shapiro (1998)

• Ramey and Shapiro (1998) use a narrative approach to identify shocks to government spending.

• Focus on episodes where Business Week suddenly forecast large rises in defense spending induced
by major political events that were unrelated to the state of the U.S. economy (exogenous episodes
of government spending).

• Three such episodes: the Korean War, the Vietnam War and the Carter-Reagan buildup, plus 9/11.

• The military date variable takes a value of unity in 1950:3, 1965:1, 1980:1, and 2001:3, and zeros
elsewhere.

• To identify government spending shocks, the military date variable is embedded in the standard
VAR, but ordered before the other variables.

• Both methodologies have drawbacks.

• VARs: shocks are often anticipated (with fiscal foresight the shocks may not be invertible)

• War Dummy: few observations, subjective.


28
6 Monetary policy and housing

• Central question: how does monetary policy affect house prices?

• Jarocinski and Smets (2008) address this question.

• Strategy:
1. Estimate a VAR with nine variables, including: the short term interest rate, an interest rate spread,
the housing investment share of GDP, real GDP, real consumption, real house prices, prices, a
commodity price index and a money indicator.
2. Identify the monetary policy shock using the restriction that the shock does not affect prices
and output contemporaneously but does affect the short term interest rate, the spread and the money
stock, and analyze the impulse response functions.
3. Shut down the identified shock and study the counterfactual path of housing prices over time.

29
Source: Jarocinski and Smets (2008)

30
Source: Jarocinski and Smets (2008)

31
Source: Jarocinski and Smets (2008)

32
