Time Series Course
Fundamental Concepts of
Time-Series Econometrics
Many of the principles and properties that we studied in cross-section econometrics carry
over when our data are collected over time. However, time-series data present important
challenges that are not present with cross sections and that warrant detailed attention.
Random variables that are measured over time are often called “time series.” We define
the simplest kind of time series, “white noise,” then we discuss how variables with more
complex properties can be derived from an underlying white-noise variable. After studying
basic kinds of time-series variables and the rules, or “time-series processes,” that relate them
to a white-noise variable, we then make the critical distinction between stationary and non-
stationary time-series processes.
Suppose that we observe a variable y over T consecutive periods, giving the sample (y_1, y_2, ..., y_T).
We often think of these observations as being a finite sample from a time-series stochastic pro-
cess that began infinitely far back in time and will continue into the indefinite future:¹
..., y_{−3}, y_{−2}, y_{−1}, y_0 (pre-sample), y_1, y_2, ..., y_{T−1}, y_T (sample), y_{T+1}, y_{T+2}, ... (post-sample).
Each element of the time series is treated as a random variable with a probability distri-
bution. As with the cross-section variables of our earlier analysis, we assume that the distri-
butions of the individual elements of the series have parameters in common. For example,
¹ The theory of discrete-time stochastic processes can be extended to continuous time, but we need not consider this here because econometricians typically have data only at discrete intervals.
we may assume that the variance of each yt is the same and that the covariance between each
adjacent pair of elements cov ( yt , yt −1 ) is the same. If the distribution of yt is the same for all
values of t, then we say that the series y is stationary, which we define more precisely below.
The aim of our statistical analysis is to use the information contained in the sample to infer
properties of the underlying distribution of the time-series process (such as the covariances)
from the sample of available observations.
The simplest kind of time-series process corresponds to the classical, normal error term
of the Gauss-Markov Theorem. We call this kind of variable white noise. If a variable is white
noise, then each element has an identical, independent, mean-zero distribution. Each peri-
od’s observation in a white-noise time series is a complete “surprise”: nothing in the previous
history of the series gives us a clue whether the new value will be positive or negative, large
or small.
Formally, a white-noise process ε satisfies

E(ε_t) = 0, ∀t,
var(ε_t) = σ², ∀t,        (1.1)
cov(ε_t, ε_{t−s}) = 0, ∀s ≠ 0.
Some authors define white noise to include the assumption of normality, but although we
will usually assume that a white-noise process εt follows a normal distribution we do not in-
clude that as part of the definition. The covariances in the third line of equation (1.1) have a
special name: they are called the autocovariances of the time series. The s-order autocovari-
ance is the covariance between the value at time t and the value s periods earlier at time t – s.
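To make this concrete, here is a small Stata sketch (our own illustration, with hypothetical variable names) that simulates a white-noise series and examines its sample autocorrelations:

* Simulate 200 periods of Gaussian white noise
clear
set seed 12345
set obs 200
generate t = _n
tsset t
generate eps = rnormal(0, 1)   // independent N(0,1) draws: white noise
ac eps                         // sample autocorrelations should all be near zero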
Fluctuations in most economic time series tend to persist over time, so elements near
each other in time are correlated. These series are serially correlated and therefore cannot be
white-noise processes. However, even though most variables we observe are not simple
white noise, we shall see that the concept of a white-noise process is extremely useful as a
building block for modeling the time-series behavior of serially correlated processes.
We model serially correlated time series by breaking them into two additive components:
y_t = g(y_{t−1}, y_{t−2}, ..., ε_{t−1}, ε_{t−2}, ...) + ε_t.        (1.2)
The function g is the part of the current yt that can be predicted based on the past. The varia-
ble εt, which is assumed to be white noise, is the fundamental innovation or shock to the series
at time t—the part that cannot be predicted based on the past history of the series.
From equation (1.2) and the definition of white noise, we see that the best possible fore-
cast of yt based on all information observed through period t – 1 (which we denote It – 1) is
E(y_t | I_{t−1}) = g(y_{t−1}, y_{t−2}, ..., ε_{t−1}, ε_{t−2}, ...).
Every stationary time-series process and many useful non-stationary ones can be de-
scribed by equation (1.2). Thus, although most economic time series are not white noise, any
series can be decomposed into predictable and unpredictable components, where the latter is
the fundamental underlying white-noise process of the series.
Characterizing the behavior of a particular time series means describing two things: (1)
the function g that describes the part of yt that is predictable based on past values and (2) the
variance of the innovation εt. The most common specifications we shall use are linear sto-
chastic processes, where the function g is linear and the number of lagged values of y and ε
appearing in g is finite.
The general linear process with p lags of y and q lags of ε is

y_t = α + φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},        (1.3)

where ε_t is white noise.
The process yt described by equation (1.3) has two sets of terms in the g function in addi-
tion to the constant. There are p autoregressive terms involving p lagged values of the variable
and q moving-average terms with q lagged values of the innovation ε. We often refer to such a
stochastic process as an autoregressive-moving-average process of order (p, q), or an ARMA(p, q) process.² If q = 0, so there are no moving-average terms, then the process is a pure
autoregressive process: AR(p). Similarly, if p = 0 and there are no autoregressive terms, the
process is a pure moving-average: MA(q). We will use autoregressive processes extensively
in these chapters, but moving-average processes will appear relatively rarely.
² The analysis of ARMA processes was pioneered by Box and Jenkins (1976).
It is convenient to use a time-series “operator” called the lag operator when writing equa-
tions such as (1.3). The lag operator L ( ⋅) is a mathematical operator or function, just like
the negation operator − ( ⋅) that turns a number or expression into its negative or the inverse
operator (⋅)⁻¹ that takes the reciprocal. Just as with these other operators, we shall often omit
the parentheses when the argument to which the operator is applied is clear without them.
The lag operator’s argument is an element of a time series; when we apply the lag opera-
tor to an element yt we get its predecessor yt – 1:
L(y_t) = Ly_t ≡ y_{t−1}.
We can apply the lag operator iteratively to get lags longer than one period. When we do
this, it is convenient to use an exponent on the L operator to indicate the number of lags:
L²(y_t) ≡ L[L(y_t)] = L[y_{t−1}] = y_{t−2},

and, by extension, Lⁿ(y_t) = Lⁿ y_t ≡ y_{t−n}.
Using this notation, the autoregressive terms of the process can be written

y_t − φ_1 L(y_t) − φ_2 L²(y_t) − ... − φ_p L^p(y_t)
  = y_t − φ_1 Ly_t − φ_2 L²y_t − ... − φ_p L^p y_t
  = (1 − φ_1 L − φ_2 L² − ... − φ_p L^p) y_t        (1.6)
  ≡ φ(L) y_t.
The expression in parentheses in the third line of equation (1.6), which the fourth line defines
as φ ( L ) , is a polynomial in the lag operator. Moving from the second line to the third line in-
volves factoring yt out of the expression, which would be entirely transparent if L were a
number. Even though L is an operator, we can perform this factorization as shown, writing
φ ( L ) as a composite lag function to be applied to yt.
Similarly, the constant and moving-average terms can be written

α + ε_t + θ_1 L(ε_t) + θ_2 L²(ε_t) + ... + θ_q L^q(ε_t)
  = α + (1 + θ_1 L + θ_2 L² + ... + θ_q L^q) ε_t ≡ α + θ(L) ε_t,
with θ ( L ) defined by the second line as the moving-average polynomial in the lag operator.
Using lag operator notation, we can rewrite the ARMA(p, q) process in equation (1.5) com-
pactly as
φ(L) y_t = α + θ(L) ε_t.        (1.7)
The left-hand side of (1.7) is the autoregressive part of the process and the right-hand side is
the moving-average part (plus a constant that allows the mean of y to be non-zero).
Any finite ARMA process can be expressed as a (possibly) infinite moving average pro-
cess. We can see intuitively how this works by recursively substituting for lags of yt in equa-
tion (1.3). For simplicity, consider the ARMA(1, 1) process with zero mean:

y_t = φ_1 y_{t−1} + ε_t + θ_1 ε_{t−1}.        (1.8)

We assume that all observations t are generated according to (1.8). Lagging equation (1.8)
one period yields y_{t−1} = φ_1 y_{t−2} + ε_{t−1} + θ_1 ε_{t−2}, which we can substitute into (1.8) to get
y_t = φ_1(φ_1 y_{t−2} + ε_{t−1} + θ_1 ε_{t−2}) + ε_t + θ_1 ε_{t−1}
    = φ_1² y_{t−2} + ε_t + (φ_1 + θ_1) ε_{t−1} + φ_1 θ_1 ε_{t−2}.
If we continue substituting in this manner we can push the lagged y term on the right-hand
side further and further into the past. As long as |φ_1| < 1,³ the coefficient on the lagged y term gets smaller and smaller as we continue substituting, approaching zero in the limit.
However, each time we substitute we add another lagged ε term to the expression.
In the limit, if we were to hypothetically substitute infinitely many times, the lagged y
term would converge to zero and there would be infinitely many lagged ε terms on the right-
hand side:
y_t = ε_t + ∑_{s=1}^∞ φ_1^{s−1} (φ_1 + θ_1) ε_{t−s},        (1.9)
³ We shall see presently that the condition |φ_1| < 1 is necessary for the ARMA(1, 1) process to be stationary.
which is a moving-average process with (if φ1 ≠ 0) infinitely many terms. Equation (1.9) is
referred to as the infinite-moving-average representation of the ARMA(1, 1) process; it exists
provided that the process is stationary, which in turn requires |φ1| < 1 so that the lagged y
term converges to zero after infinitely many substitutions.
Formally dividing both sides of (1.7) by the autoregressive polynomial φ(L) gives

y_t = α/φ(L) + [θ(L)/φ(L)] ε_t
    = α/(1 − φ_1 − φ_2 − ... − φ_p) + [(1 + θ_1 L + θ_2 L² + ... + θ_q L^q)/(1 − φ_1 L − φ_2 L² − ... − φ_p L^p)] ε_t.        (1.10)
In the first term, there is no time-dependent variable on which the lag operator can operate,
so it just operates on the constant one, which gives the expression in the denominator. The
first term is the unconditional mean of yt because the expected value of all of the ε variables
in the second term is zero.
The quotient in front of εt in the second term is the ratio of two polynomials, which, in
general, will be an infinite polynomial in L. The terms in this quotient can be (laboriously)
evaluated for any particular ARMA model by polynomial long division.
For the zero-mean ARMA(1, 1) process we considered earlier, the first term is zero (be-
cause α = 0). We can use a shortcut to evaluate the second term. We know from basic alge-
bra that
∑_{i=0}^∞ a^i = 1/(1 − a)
(we can pretend that L is one for purposes of assessing whether this expression converges).
From (1.10), we get
y_t = [(1 + θ_1 L)/(1 − φ_1 L)] ε_t = ∑_{i=0}^∞ (φ_1 L)^i (1 + θ_1 L) ε_t,        (1.11)
Variables that have a trend component are one example of nonstationary time series. A trended variable, such as one following the deterministic trend y_t = α + βt + ε_t, grows (or shrinks, if β is negative) over time, so the mean of its distribution is not the same at all dates.
In recent decades, econometricians have discovered that the traditional regression meth-
ods we studied earlier are poorly suited to the estimation of models with nonstationary vari-
ables. So before we begin our analysis of time-series regressions we must define stationarity
clearly. We then proceed in subsequent chapters to examine regression techniques that are
suitable for stationary and nonstationary time series.
A time-series process y is strictly stationary if the joint probability distribution of any pair
of observations from the process (y_t, y_{t−s}) depends only on s, the distance in time between the observations, and not on t, the time in the sample from which the pair is drawn.⁴ For a
time series that we observe for 1900 through 2000, this means that the joint distribution of
the observations for 1920 and 1925 must be identical to the joint distribution of the observa-
tions for 1980 and 1985.
We shall work almost exclusively with normally distributed time series, so we can rely
on the weaker concept of weak stationarity or covariance stationarity. For normally distributed
time series (but not for non-normal series in general), covariance stationarity implies strict
stationarity. A time-series process is covariance stationary if the means, variances, and covariances of any pair of observations (y_t, y_{t−s}) depend only on s and not on t:

E(y_t) = µ, ∀t,
var(y_t) = σ²_y, ∀t,
cov(y_t, y_{t−s}) = σ_s, ∀t, s.
⁴ A truly formal definition would consider more than two observations, but the idea is the same. See Hamilton (1994, 46).
In words, this definition says that µ, σ2y , and σs do not depend on t, so that the moments of
the distribution are the same at every date in the sample.
The s-order autocorrelation is the s-order autocovariance divided by the variance:

ρ_s ≡ corr(y_t, y_{t−s}) = σ_s/σ²_y.
A concept that is closely related to stationarity is ergodicity. Hamilton (1994) shows that a stationary, normally distributed time-series process is ergodic if the sum of its absolute autocorrelations, ∑_{s=0}^∞ |ρ_s|, is finite. This condition is stronger than lim_{s→∞} ρ_s = 0; the autocorrelations must not only go to zero, they must go to zero "sufficiently quickly." In our analysis, we shall routinely assume that stationary processes are also ergodic.
There are numerous ways in which a time-series process can fail to be stationary. A sim-
ple one is breakpoint non-stationarity, in which the parameters of the data-generating process
change at a particular date. For example, there is evidence that many macroeconomic time
series behaved differently after the oil shock of 1973 than they did before it. Another exam-
ple would be an explicit change in a law or policy that would lead to a change in the behav-
ior of a series at the time the new regime comes into effect.
Most of the non-stationary processes we shall examine have neither breakpoints nor
trends. Instead, they are highly persistent, “integrated” processes that are sometimes called
stochastic trends. We can think of these series as having a trend in which the change from pe-
riod to period (β in the deterministic trend above) is a stationary random variable. Integrated
processes can be made stationary by taking differences over time. We shall examine integrat-
ed processes in considerable detail below.
Consider first a model that has no autoregressive terms, the MA(q) model
y_t = ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},
where ε is white noise with variance σ2. (We assume that the mean of the series is zero in all
of our stationarity examples. Adding a constant α to the right-hand side does not change the
variance or autocovariances and simply adds a time-invariant constant to the mean, so it
does not affect stationarity.)
The mean of y_t is

E(y_t) = E(ε_t) + θ_1 E(ε_{t−1}) + ... + θ_q E(ε_{t−q}) = 0

because the mean of all of the white-noise errors is zero. The variance of y_t is

var(y_t) = var(ε_t) + θ_1² var(ε_{t−1}) + ... + θ_q² var(ε_{t−q})
         = (1 + θ_1² + ... + θ_q²) σ²,
where we can ignore the covariances among the ε terms of various dates because they are
zero for a white-noise process.
Finally, the s-order autocovariance of y, the covariance between values of y that are s periods apart, is

cov(y_t, y_{t−s}) = E[(ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q})(ε_{t−s} + θ_1 ε_{t−s−1} + ... + θ_q ε_{t−s−q})].        (1.12)
The only terms in this expression that will be non-zero are those for which the subscripts
match. Thus,
cov(y_t, y_{t−1}) = E[(ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q})(ε_{t−1} + θ_1 ε_{t−2} + ... + θ_q ε_{t−1−q})]
                 = (θ_1 + θ_1θ_2 + θ_2θ_3 + ... + θ_{q−1}θ_q) σ² ≡ σ_1,

cov(y_t, y_{t−2}) = E[(ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q})(ε_{t−2} + θ_1 ε_{t−3} + ... + θ_q ε_{t−2−q})]
                 = (θ_2 + θ_1θ_3 + θ_2θ_4 + ... + θ_{q−2}θ_q) σ² ≡ σ_2,
and so on. The mean, variance, and autocovariances derived above do not depend on t, so
the MA(q) process is stationary, regardless of the values of the θ parameters.
Moreover, for any s > q, the time interval t through t – q in the first expression in (1.12)
does not overlap the time interval t – s through t – s – q of the second expression. Thus, there
are no contemporaneous cross-product terms and σs = 0 for s > q. This implies that all finite
moving-average processes are ergodic.
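For example, for the MA(1) process y_t = ε_t + θ_1 ε_{t−1}, these formulas reduce to var(y_t) = (1 + θ_1²)σ², σ_1 = θ_1σ², and σ_s = 0 for all s > 1.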
While all finite MA processes are stationary, this is not true of autoregressive processes,
so we now consider stationarity of pure autoregressive models. The zero-mean AR(p) pro-
cess is written as
y_t = φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + ε_t.

We first analyze the simplest case, the AR(1) process

y_t = φ_1 y_{t−1} + ε_t,        (1.13)
then generalize. Equation (1.11) shows that the infinite-moving-average representation of the
AR(1) process is
y_t = ∑_{i=0}^∞ φ_1^i ε_{t−i}.        (1.14)
Taking the expected value of both sides shows that E ( yt ) = 0 because all of the white noise ε
terms in the summation have zero expectation.
Taking the variance of both sides of (1.14) gives

var(y_t) = σ² ∑_{i=0}^∞ φ_1^{2i}        (1.15)

because the covariances of the white-noise innovations at different points in time are all zero. If |φ_1| < 1, then φ_1² < 1 and the infinite series in equation (1.15) converges to

σ²/(1 − φ_1²).

If |φ_1| ≥ 1, then the variance in equation (1.15) is infinite and the AR(1) process is nonstationary. Thus, y_t has finite variance only if |φ_1| < 1, which is a necessary condition for stationarity.
A similar calculation based on (1.14) gives the autocovariances, cov(y_t, y_{t−s}) = φ_1^s σ²/(1 − φ_1²). These covariances are finite and independent of t as long as |φ_1| < 1. Thus, |φ_1| < 1 is a necessary and sufficient condition for covariance stationarity of the AR(1) process. The condition |φ_1| < 1 also assures that the AR process is ergodic because

∑_{s=0}^∞ ρ_s = ∑_{s=0}^∞ φ_1^s = 1/(1 − φ_1) < ∞.
Intuitively, the AR(1) process is stationary and ergodic if the effect of a past value of y
dies out as time passes. The same intuition holds for the AR(p) process, but the mathemati-
cal analysis is more challenging. For the general case, we must rely on the concept of poly-
nomial roots.
For the AR(p) process, the polynomial φ(L) is a p-order polynomial, involving powers of
L ranging from zero up to p. Polynomials of order p have p roots that satisfy φ(L) = 0, some or all of which may be complex numbers with imaginary parts. The AR(p) process is stationary if and only if all of the roots satisfy |r_j| > 1, where the absolute value notation is interpreted as a modulus if r_j has an imaginary part. If a complex number is written a + bi, then the modulus is √(a² + b²), which (by the Pythagorean Theorem) is the distance of the point
a + bi from the origin in the complex plane. Therefore, stationarity requires that all of the
roots of φ(L) lie more than one unit from the origin, or “outside the unit circle,” in the com-
plex plane.
To see how this works, consider the AR(2) process y_t = 0.75y_{t−1} − 0.125y_{t−2} + ε_t. The lag polynomial for this process is 1 − 0.75L + 0.125L². We can find that the (two) roots are r_1 = 1/0.5 = 2 and r_2 = 1/0.25 = 4 either by using the quadratic formula or by factoring the polynomial into (1 − 0.5L)(1 − 0.25L). Both roots are real and greater in absolute value than one, so this AR process is stationary.
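We can verify these roots numerically. A minimal sketch using Stata's Mata function polyroots(), which takes the polynomial's coefficients in ascending powers of L:

mata: polyroots((1, -0.75, 0.125))   // returns the roots 2 and 4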
As a second example, we examine y_t = 1.25y_{t−1} − 0.25y_{t−2} + ε_t with lag polynomial 1 − 1.25L + 0.25L². This polynomial factors into (1 − L)(1 − 0.25L), so the roots are r_1 = 4 and r_2 = 1. One root (4) is stable, but the second (1) lies on, not outside, the unit circle. We shall see shortly that nonstationary autoregressive processes that have unit roots are called integrated processes and have special properties.
It should come as no surprise, then, that the stationarity of the ARMA(p, q) process depends
only on the parameters of the autoregressive part, and not at all on the moving-average part.
The ARMA(p, q) process φ(L) y_t = θ(L) ε_t is stationary if and only if the p roots of the order-p polynomial φ(L) all lie outside the unit circle.
The simplest integrated process is the random walk,

y_t = y_{t−1} + ε_t,
which, in words, says that the value of y in period t is equal to its value in t – 1, plus a ran-
dom “step” due to the white-noise shock εt. Bringing yt – 1 to the left-hand side,
Δy_t ≡ y_t − y_{t−1} = ε_t,
so the first difference of y—the change in y from one period to the next—is white noise.
In lag-operator notation, the random walk is

(1 − L) y_t = ε_t.        (1.16)
The term (1 – L) is the difference operator (∆) expressed in terms of lag notation: the current
value minus last period’s value. The lone root of the polynomial (1 – L) is 1, so this process
has a single, unitary root.
Define

z_t ≡ Δy_t = (1 − L) y_t        (1.17)

to be the first difference of y. If y follows a random walk, then by substituting from (1.17) into (1.16), z_t = ε_t is white noise, which is a stationary process.
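As an illustration (our own sketch, not from the text), the following Stata code generates a random walk as the running sum of white-noise shocks and recovers those shocks by first-differencing:

* A random walk and its first difference
clear
set seed 54321
set obs 300
generate t = _n
tsset t
generate eps = rnormal()
generate y = sum(eps)    // random walk: running sum of the shocks
generate dy = d.y        // first difference: equals eps for t > 1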
More generally, any AR polynomial can be factored in terms of its roots as

φ(L) = (1 − (1/r_1)L)(1 − (1/r_2)L) ⋯ (1 − (1/r_p)L),

where r_1, r_2, …, r_p are the p roots of the polynomial. The factor (1 − L) will appear d times in this expression, once for each of the d roots that equal one. Suppose that we arbitrarily order the roots so that the first d roots are equal to one; then

φ(L) = (1 − L)^d (1 − (1/r_{d+1})L) ⋯ (1 − (1/r_p)L).        (1.18)
To achieve stationarity in the presence of d unit roots, we must apply d differences to y_t. Let

z_t ≡ Δ^d y_t = (1 − L)^d y_t.        (1.19)

Then

φ(L) y_t = (1 − (1/r_{d+1})L) ⋯ (1 − (1/r_p)L) Δ^d y_t = (1 − (1/r_{d+1})L) ⋯ (1 − (1/r_p)L) z_t ≡ φ*(L) z_t = ε_t.        (1.20)

Because r_{d+1}, …, r_p are all (by assumption) outside the unit circle, the dth difference series z_t is stationary, described by the stationary (p − d)-order autoregressive process φ*(L) z_t = ε_t defined in (1.20).
The stationary series z equals the non-stationary series y differenced d times, and therefore y is equal to the stationary series z "integrated" d times.
The above examples have examined only pure autoregressive processes, but the analysis
generalizes to ARMA processes as well. The general, integrated time-series process can be
written as
φ(L) Δ^d y_t = θ(L) ε_t,
where φ(L) is an order-p polynomial with roots outside the unit circle and θ(L) is an order-q
polynomial. This process is called an autoregressive integrated moving-average process, or
ARIMA(p, d, q) process.
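Stata estimates such models with the arima command. A minimal sketch, assuming the data have already been tsset and y is the series of interest, for an ARIMA(1, 1, 1) specification:

arima y, arima(1,1,1)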
To take advantage of time-series operations and commands in Stata, you must first de-
fine the time-series structure of your dataset. In order to do so, there must be a “time varia-
ble” that tells where each observation fits in the time sequence. The simplest time variable
would just be a counter that starts at 1 with the first observation. If no time variable exists in
your dataset, you can create such a counter with the following command:
generate timevar = _n
Once you have created a time variable (which can be called anything), you must use the
Stata command tsset to tell Stata about it. The simplest form of the command is
tsset timevar
where timevar is the name of the variable containing the time information.
Although the simple time variable is a satisfactory option for time-series data of any fre-
quency, Stata is capable of customizing the time variable with a variety of time formats, with
monthly, quarterly, and yearly being the ones most used by economists.
If you have monthly time-series data, the dataset will typically have two numeric variables, one indicating the year and one the month. You can use Stata functions to combine these into a single time variable, then assign an appropriate format so that your output will be intuitive.
Suppose that you have a dataset in which the variable mon contains the month number and
yr contains the year. The ym function in Stata will translate these into a single variable,
which can then be used as the time variable:
generate t = ym(yr, mon)
tsset t, monthly
(There are similar Stata functions for quarterly data and other frequencies, as well as func-
tions that translate alphanumeric data strings in case your month variable has the values
“January,” “February,” etc.) The generate command above assigns an integer value to the
variable t on a scale in which January 1960 has the value of 0, February 1960 = 1, etc. The
monthly option in the tsset command tells Stata explicitly that your dataset has a month-
ly time unit.
To get t to be displayed in a form that is easily interpreted as a date, set the variable’s
format to %tm with the command:
format t %tm
(Again, there are formats for quarterly data and other frequencies as well as monthly.) After
setting the format in this way, a value of zero for variable t will display as 1960m1 rather
than as 0.
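For example, quarterly data with (hypothetical) year and quarter variables yr and qtr would be set up analogously:

generate tq = yq(yr, qtr)   // yq() combines year and quarter; 1960q1 = 0
tsset tq, quarterly
format tq %tq               // display values as 1960q1, 1960q2, ...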
One purpose of declaring a time variable is that it allows you to inform Stata about any
gaps in your data without having “empty observations” in your dataset. Suppose that you
have annual data from 1929 to 1940 and 1946 to 2010, with the World War II years missing
from your dataset. The variable year contains the year number and jumps right from a value
of 1940 for the 12th observation to 1946 for the 13th, with no intervening observations. De-
claring the dataset with
tsset year, yearly
will inform Stata that observations corresponding to 1941–1945 are missing from your da-
taset. If you ask for the lagged value one period before the 1946 observation, Stata will cor-
rectly return a missing value rather than incorrectly using the value for 1940. If for any rea-
son you need to add these missing observations to your dataset for the gap years, the Stata
command tsfill will do so, filling them with missing values.
Lags of more than one period can be obtained by including a number after the l in the lag
operator:
generate ytwicelagged = l2.y
In order to use the lag function, the time-series structure of the dataset must have already
been defined using the tsset command.
You can use the difference operator D. or d. to take the difference of a variable. The
following two commands produce identical results:
generate dy = d.y
generate dy = y - l.y
As with the lag operator, a higher-order difference can be obtained by adding a number after the d:

generate d2y = d2.y
If data for the observation prior to period t is missing, either explicitly in the dataset or
because there is no observation in the dataset for period t – 1, then both the lag operator and
the difference operator return a missing value.
If you do not want to create extra variables for lags and differences, you can use the lag
and difference operators directly in many Stata commands. For example, to regress a varia-
ble on its once-lagged value, you could type
regress y l.y
In addition to the standard Stata commands (such as summarize) that produce descrip-
tive statistics, there are some additional descriptive commands specifically for time-series
data. Most of these are available under the Graphics menu entry Time Series Graphs.
We can create a simple plot of one or more variables against the time variable using tsline. To calculate and graph the autocorrelations of a variable, we can use the ac command. The xcorr command will calculate the "cross correlations" between two series, which are defined as corr(y_t, x_{t−s}), s = ..., −2, −1, 0, 1, 2, ....
References
Box, George E. P., and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and
Control. San Francisco: Holden-Day.
Hamilton, James D. 1994. Time Series Analysis. Princeton, N.J.: Princeton University Press.
CHAPTER 2
Regression with Stationary Time Series
However, most of the time series that macroeconomists of the time used—GDP, the
money supply, consumption spending, etc.—contained strong trends. Thinking about it intu-
itively, it is not surprising that two trended series will tend to be highly and positively corre-
lated: both series are smaller at the beginning of the sample than at the end. This correlation
will be strong even if the non-trend movements in the two series are independent.
Consider the time plot in Figure 2-1, which depicts two series: total attendance at Ameri-
can League baseball games and real per-capita GDP from 1960 to 2007. The series seem to
be highly correlated, largely due to the strong trends in both. Regressing AL attendance on
per-capita GDP yields the regression in the first column of Table 2-1 with an R-square value
of nearly 0.94 and an overwhelming t statistic of 26. An economist seeing column (1) would
seem justified in concluding that increases in income per person have very strong and statis-
tically reliable effects on baseball attendance, which seems entirely plausible—except that the
regressor in (1) is per-capita GDP in Botswana, not in the United States!
This is an example of spurious regression, a term coined by Granger and Newbold (1974)
in their seminal article on regressions with nonstationary variables. When the variables in a
regression are nonstationary, R-square values and t-statistics no longer follow the usual dis-
tributions and can be wildly inflated. Consider the regression in column (2) of Table 2-1,
which relates the first difference of attendance to the first difference of Botswana GDP. In
contrast to the levels equation (1), there is no evidence of a relationship in the differenced
regression of column (2), with R-square of 0.005 and a t-statistic less than 1.
                   (1)                (2)
                   AL Attendance      ∆ AL Att.
GDP                3,323***
                   (126.8)
∆GDP                                  765.5
                                      (1,599)
Constant           7.073e+06***       406,547
                   (640,361)          (464,611)
Observations       47                 46
R-squared          0.939              0.005

Standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Table 2-1. Regressions of baseball attendance on GDP
The case for spurious correlation between two strongly trended series as in Figure 2-1 is intuitive. But beyond this, Granger and Newbold demonstrated that nonstationary regression is also unreliable in a less obvious case: independent random walks with no trend or "drift" moving the series in the same direction over time. The only thing that two such series have in common is that the (independent) shocks to both are highly persistent, yet Granger and Newbold's Monte Carlo regressions rejected the null hypothesis of a zero coefficient 76 percent of the time (rather than the appropriate 5 percent) in bivariate regressions. As expected, differenced regressions (which involve two unrelated white-noise series, because the first difference of a random walk is white noise) yielded conventional frequencies of rejection. Moreover, as the sample size gets larger, the problem of spurious regressions with nonstationary variables gets worse, not better. The t statistic between unrelated random walks goes to infinity rather than zero as T → ∞.
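The reader can reproduce this phenomenon with a short Monte Carlo sketch of our own devising (hypothetical variable names):

* Spurious regression between two independent random walks
clear
set seed 2010
set obs 200
generate t = _n
tsset t
generate x = sum(rnormal())   // random walk 1
generate y = sum(rnormal())   // random walk 2, independent of x
regress y x                   // levels: t statistic is typically spuriously large
regress d.y d.x               // differences: conventional, insignificant results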
In a time-series setting, observations that are near each other in time are likely to be re-
lated. The strict exogeneity condition E(ε_t | ..., x_{t+2}, x_{t+1}, x_t, x_{t−1}, x_{t−2}, ...) = 0 requires that the
regressor’s value in period t be unrelated to the disturbance term in every period. This means
that we cannot have dynamic feedback effects in which the past or future value of the regres-
sor might depend on the current disturbance.
For us to apply the Gauss-Markov Theorem to a time-series context, we require the fol-
lowing assumptions:
• TS-1. Linearity: The model can be written as

y_t = β_1 + β_2 x_{t,2} + ... + β_K x_{t,K} + u_t, t = 1, 2, ..., T.
• TS-2. Strict exogeneity: The explanatory variables x.,j are strictly exogenous with re-
spect to the disturbance term. Mathematically,
E(u_t | X) = 0, t = 1, 2, ..., T,

where X includes all K − 1 regressors and all T time periods.

• TS-3. No perfect collinearity: There are no constants a_1, a_2, ..., a_K (not all zero) such that

a_1 + a_2 x_{t,2} + ... + a_K x_{t,K} = 0, ∀t = 1, 2, ..., T.
• TS-4. Homoskedasticity: The disturbance has constant conditional variance,

var(u_t | X) = σ², t = 1, 2, ..., T.

• TS-5. No serial correlation: The disturbances are conditionally uncorrelated across periods,

cov(u_t, u_{t−s} | X) = 0, s = 1, 2, ..., T − 1.
• TS-6. Normality: The disturbance terms are normally distributed,
u_t ~ N(0, σ²).
When conditions TS-1 through TS-3 hold, the OLS coefficient estimator is unbiased.
When we add TS-4 and TS-5 we obtain the unbiasedness of the standard OLS estimator of
the variance of the OLS coefficient estimator, so we can use the standard tools of OLS infer-
ence. Under TS-1 through TS-5, the Gauss-Markov Theorem assures that OLS is BLUE. If
we add TS-6, then the OLS coefficient vector has a normal distribution and the ratio of each
coefficient to its standard error has a t distribution.
Time-series regression analysis focuses mainly on coping with violations of TS-2 and TS-5. If the variables in our model are stationary and ergodic, we can loosen TS-2 to require only weak exogeneity and our OLS estimator will still have desirable asymptotic properties. Coping with serial correlation is discussed in the next section.
In cross-section samples, the law of large numbers assures us that random variations will
tend to “even out” with a large enough sample of independent observations. This happens
because the sample means, variances, and covariances among the variables converge to
fixed, finite population moments such as
lim_{N→∞} (1/N) ∑_{i=1}^N x_i = µ_x,

lim_{N→∞} (1/N) ∑_{i=1}^N (x_i − x̄)² = σ²_x,        (2.1)

lim_{N→∞} (1/N) ∑_{i=1}^N (x_{i,j} − x̄_j)(x_{i,k} − x̄_k) = σ_{x_j x_k}.
To see why asymptotic properties require careful attention in time-series models, consid-
er what happens to the equations in (2.1) when the x variable is a time trend so that xt = t.
The mean of a sample of T observations is
x̄_T ≡ (1/T) ∑_{t=1}^T x_t = (1/T) ∑_{t=1}^T t = (1/T) [T(T + 1)/2] = (T + 1)/2,

and

lim_{T→∞} x̄_T = lim_{T→∞} (T + 1)/2 = ∞.
Therefore the first moment-convergence condition in (2.1) fails when the regressor is a time
trend. It is straightforward to show that the second condition in (2.1) also fails for the time
trend—the sample variance also diverges as T gets large.
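To see this concretely, the sample variance of the trend x_t = t around its mean (T + 1)/2 is

(1/T) ∑_{t=1}^T [t − (T + 1)/2]² = (T² − 1)/12,

which also grows without bound as T → ∞.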
This problem is not restricted to time trends. It occurs with any nonstationary variable
because the mean and/or variance do not converge in large samples. This is the basis for the
spurious-regression problem with which we began this chapter.
A more subtle difficulty occurs with non-ergodic variables. Recall that these variables
have “long memory,” so that observations that are far apart in time are still strongly correlat-
ed. What this means for regression analysis is that even as we accumulate a large number of
observations, the amount of new information in those observations is limited by their corre-
lation with the earlier ones. Intuitively, this means that the information in the sample does
not grow fast enough as the sample size increases to lead to asymptotic convergence of esti-
mators to the true parameter values.
If—and it’s a big if—we are working with stationary and ergodic time series, then we can
weaken the strict exogeneity assumption TS-2 to weak exogeneity and the OLS estimators
still have desirable asymptotic properties. Corresponding to the set of assumptions TS-1
through TS-6 that support the small-sample properties of the OLS estimators, we have ATS-
1 through ATS-5 that foster consistency, asymptotic efficiency, and asymptotic normality.
• ATS-1. Linearity: The model can be written as

y_t = β_1 + β_2 x_{t,2} + ... + β_K x_{t,K} + u_t, t = 1, 2, ..., T.
• ATS-2. Weak exogeneity, stationarity, and ergodicity: The variables of the model
are stationary and ergodic, and the explanatory variables x.,j are weakly exogenous
with respect to the disturbance term. Mathematically,
E(u_t | x_{t,2}, x_{t,3}, ..., x_{t,K}) = 0.
Note that this requires ut to be independent only of the current values of the regres-
sors, not of all past, current, and future values.
• ATS-3. No perfect collinearity: There are no constants a_1, a_2, ..., a_K (not all zero) such that

a_1 + a_2 x_{t,2} + ... + a_K x_{t,K} = 0, ∀t = 1, 2, ..., T.

(Identical to TS-3.)
• ATS-4. Homoskedasticity:

var(u_t | x_{t,2}, x_{t,3}, ..., x_{t,K}) = σ², t = 1, 2, ..., T.

• ATS-5. No serial correlation:

cov(u_t, u_{t−s} | x_{t,2}, x_{t,3}, ..., x_{t,K}) = 0, s = 1, 2, ..., T − 1.
Under assumptions ATS-1 through ATS-3, the OLS estimator is consistent. If we add
assumptions ATS-4 and ATS-5, it is asymptotically efficient and asymptotically normal.
Note that we do not require an asymptotic analog to TS-6, which imposed normality of u.
That is because assumptions ATS-1 through ATS-5 allow us to use central limit theorems to
show that the OLS estimators will converge to a normal distribution in large samples. This
allows us to use the normal distribution to assess the asymptotic significance of our t statis-
tics and the chi-square distribution to evaluate the asymptotic distribution of F statistics.
In discussing the small-sample properties of OLS with time-series data, we identified TS-
2 (strict exogeneity) and TS-5 (no serial correlation) as assumptions that were often violated
in economic data. We then showed that we can relax TS-2 to the more data-friendly ATS-2
(weak exogeneity) if our variables are stationary and ergodic and if our sample is large
enough that asymptotic distributions are reasonably close approximations. In the remainder
of this chapter we assume that the stationarity and ergodicity assumptions in ATS-2 are valid.
We now consider the issue of TS-5 or ATS-5. In economic data, what happens at t is of-
ten related to what happened at t – 1. If that is true of the disturbance terms in our regres-
sion, then we have serial correlation. We know from our discussion above that the con-
sistency of the OLS coefficient estimator requires only ATS-1 through ATS-3, so it does not
depend on the absence of serial correlation in the disturbance. However, in the presence of serial correlation the variance of the OLS estimator will be larger than that of some other estimators, so it is not efficient, and the traditional OLS variance estimator will not estimate the true variance accurately, so our OLS test statistics will not follow the same distribution that they would if ATS-5 were not violated.
To see the problem, let us suppose that assumptions ATS-1 through ATS-4 are satisfied,
but that the disturbance term u follows an AR(1) process:
u_t = ρ u_{t−1} + ε_t        (2.2)

with |ρ| < 1 and ε being white noise with variance σ²_ε. For simplicity, let K = 2 so there is only one (non-constant) regressor: y_t = β_1 + β_2 x_t + u_t. [This follows Wooldridge (2009, 409).]
The OLS slope estimator is

β̂_2 = [∑_{t=1}^T (y_t − ȳ)(x_t − x̄)] / [∑_{t=1}^T (x_t − x̄)²] = β_2 + [∑_{t=1}^T u_t (x_t − x̄)] / [∑_{t=1}^T (x_t − x̄)²].

Just to simplify this expression, assume that we have normalized the regressor so that x̄ = 0. (This is not a restrictive assumption; it just makes the algebra more transparent.) In this case,

β̂_2 − β_2 = [∑_{t=1}^T u_t x_t] / [∑_{t=1}^T x_t²],
and
var(β̂_2 | X) = [1 / (∑_{t=1}^T x_t²)²] E[(∑_{t=1}^T u_t x_t)² | X].
In general, the squared-summation term involves all of the cross-products of the ut terms
with ut – s terms. In the special case of no serial correlation, all of the cross-products are zero
and the expectation expression reduces to σ² ∑_{t=1}^T x_t², leading to our usual OLS formula for
the variance. But this does not happen if the disturbance terms are serially correlated.
Expanding the square,

E[(∑_{t=1}^T u_t x_t)² | X] = E[∑_{t=1}^T x_t² u_t² + 2 ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} x_t x_{t+s} u_t u_{t+s} | X]
  = ∑_{t=1}^T x_t² E(u_t² | X) + 2 ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} x_t x_{t+s} E(u_t u_{t+s} | X)
  = ∑_{t=1}^T x_t² var(u_t | X) + 2 ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} x_t x_{t+s} cov(u_t, u_{t+s} | X)
  = σ_u² ∑_{t=1}^T x_t² + 2σ_u² ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} x_t x_{t+s} corr(u_t, u_{t+s} | X)
  = σ_u² ∑_{t=1}^T x_t² + 2σ_u² ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} ρ^s x_t x_{t+s}.
The final expression uses the property that corr(ut, ut + s) = ρs for an AR(1) process. Substitut-
ing this expression into the variance formula yields
var(β̂_2 | X) = [σ_u² ∑_{t=1}^T x_t² + 2σ_u² ∑_{t=1}^{T−1} ∑_{s=1}^{T−t} ρ^s x_t x_{t+s}] / (∑_{t=1}^T x_t²)²

             = σ_u² / ∑_{t=1}^T x_t² + 2σ_u² [∑_{t=1}^{T−1} ∑_{s=1}^{T−t} ρ^s x_t x_{t+s}] / (∑_{t=1}^T x_t²)².        (2.3)
The standard expression for the OLS variance (assuming no serial correlation) is just the
first term of (2.3) and neglects the second. The second term is zero only if either the disturb-
ance term is not serially correlated (ρ = 0) or the regressor is not serially correlated in our
sample (in which case the cross-product terms would add up to zero). In economic data it is
common for ρ > 0 and for x to be positively serially correlated as well, which means that the
second term in (2.3) is likely to be positive and the true variance of the OLS estimator will be
larger than when there is no serial correlation.
Because the usual OLS standard errors neglect the second term, they will generally be
inconsistent in the presence of serial correlation, meaning that our t and F statistics based on
them will not be valid. We will see below that there are two methods for dealing with a seri-
ally correlated disturbance: we can try to transform the model to eliminate the serial correla-
tion or we can use the (inefficient) OLS estimator and correct the standard errors to reflect
the second term in (2.3).
Serial correlation in the error occurs when the condition corr(u_t, u_{t−s}) = 0 is violated for some s. We can test the null hypothesis of no serial correlation if we have estimators of the error terms u_t that are consistent (when the null hypothesis is true). The OLS residuals û_t are the obvious choice, so our tests for serial correlation will involve testing for correlation between û_t and û_{t−s} for positive values of s up to some chosen limit p. The null hypothesis is therefore corr(u_t, u_{t−s}) = 0, s = 1, 2, ..., p.
The oldest test for (first-order) serial correlation is the Durbin-Watson test. This test has
fallen into disuse for three reasons. First, the critical values of the test statistic depend on the
regressors in the model, so they cannot be tabulated for a general case. Users of the Durbin-
Watson test traditionally relied on upper and lower bounds for the critical values, meaning
that it was impossible to draw a conclusion for calculated test statistics lying in the interval
between the bounds. Second, the Durbin-Watson test has been shown to be invalid if a
lagged dependent variable is among the regressors. This rules out its use for any model with
an autoregressive structure of y. Third, the Durbin-Watson statistic only tests for first-order
serial correlation and cannot easily be extended to p > 1.
The Breusch-Godfrey Lagrange multiplier test is a regression-based test for order-p autocor-
relation of the disturbance. The null hypothesis is that the disturbance is white noise. If the
disturbance is white noise, then the current OLS residual uˆt should be independent of the
lagged residuals û_{t−1}, û_{t−2}, .... The Breusch-Godfrey test uses the regression

û_t = x_t.δ + γ_1 û_{t−1} + γ_2 û_{t−2} + ... + γ_p û_{t−p} + e_t,        (2.4)

where x_t. is the row vector of all explanatory variables in the model corresponding to period t (including the constant) and û_t = y_t − x_t.β̂ is the OLS residual, β̂ being the estimated coefficient vector of the model.
The dependent variable in regression (2.4) is the residual uˆt from the regression of yt on
xt., which is by construction uncorrelated with xt.. If there is no autocorrelation in u, then the
p lagged residuals should also be uncorrelated with uˆt . Thus, under the null hypothesis of no
serial correlation the regressors in (2.4) should have no explanatory power whatsoever for
the dependent variable; if they do explain uˆt , then there must be serial correlation.
The test statistic is

BG(p) = T_0 × R²,

where T_0 is the number of observations and R² is the coefficient of determination from regression (2.4). BG(p) is asymptotically distributed as a χ²(p) variable under the null hypothesis. No autocorrelation and no explanatory power in (2.4) would imply an R² near zero and a
small test statistic, failing to reject the null hypothesis of white noise.
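As an illustration, the test can be computed by hand along the following lines (our own sketch with hypothetical variable names; here the first p = 4 observations are dropped rather than zero-filled):

quietly regress y x
predict uhat, residuals
quietly regress uhat x L(1/4).uhat        // auxiliary regression (2.4)
display "BG(4) = " e(N)*e(r2)
display "p-value = " chi2tail(4, e(N)*e(r2))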
If there are T observations in the original model, then we have values for uˆt and xt. for
all of these T observations. However, we cannot observe uˆt −1 , ..., uˆt − p for the first p observa-
tions. There are two strategies for handling this difficulty. We can either omit these observa-
tions and estimate (2.4) for the T – p observations t = p + 1, p + 2, …, T or we can use all T
observations and substitute zero (the expected value) for the missing lagged residuals. In ei-
ther case, the T0 used to calculate BG(p) is the number of observations used in (2.4), T0 = T –
p if the observations are dropped and T0 = T if zero is substituted.
To perform the Breusch-Godfrey test in Stata, we can use the estat bgodfrey com-
mand. The order of autocorrelation to be tested (p) is specified by the lags( ) option. By
default, Stata substitutes zero for the missing lagged residuals and uses the full sample. To
use the shorter sample without substituting, we specify the nomiss0 option.
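For example, to test for autocorrelation up to fourth order after an OLS regression of (hypothetical) y on x:

regress y x
estat bgodfrey, lags(4)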
A second test, the Q test, is based directly on the sample autocorrelations of the residuals:

Q(p) = T(T + 2) ∑_{s=1}^p r_s² / (T − s),

where

r_s ≡ [∑_{t=s+1}^T û_t û_{t−s}] / [∑_{t=1}^T û_t²]
is the sample autocorrelation coefficient at lag s. Like the Breusch-Godfrey test statistic, Q(p)
converges asymptotically to a χ²(p) distribution. The Q test and the Breusch-Godfrey test are
asymptotically equivalent under the null hypothesis if there are no lagged dependent varia-
bles among the regressors, but Greene (2012, 923) argues that the Breusch-Godfrey test is
more powerful because it controls for the regressors when testing for correlation between the
current and lagged residuals.
The Stata command wntestq implements the Q test with the option lags( ) specify-
ing p. The residuals must be explicitly retrieved and included in the command. For example,
to test the first 12 autocorrelations of a residual series stored in uhat, we would type
wntestq uhat, lags(12). The general command corrgram for calculating autocor-
relations also shows the successive Q statistics at each order p up to the limit specified. It is
important to remember that Q (p) (like the Breusch-Godfrey test) tests the joint null hypothe-
sis that the first p autocorrelations are zero, not the simple null that the individual pth-order
autocorrelation is zero.
Generalized least squares estimation allows us to transform a model whose error term
has a non-classical distribution into one whose error term follows the classical assumptions.
In the case of autoregressive error terms, the transformation consists of a “quasi-
differencing” filter that purges the error term of autocorrelation. If ut is an order-p autoregres-
sive process
u_t = ρ_1 u_{t−1} + ... + ρ_p u_{t−p} + ε_t,

then the quasi-differenced series u*_t ≡ u_t − ρ_1 u_{t−1} − ... − ρ_p u_{t−p} = ε_t has no autocorrelation.
The most common case is p = 1, where the error term follows an AR(1) process. In this
case, u*_t = u_t − ρ_1 u_{t−1} = (1 − ρ_1 L) u_t = ε_t. If we begin with the model y_t = α + βx_t + u_t and apply this quasi-difference to every term, we get

y_t − ρ_1 y_{t−1} = (α + βx_t + u_t) − ρ_1(α + βx_{t−1} + u_{t−1})
                  = α(1 − ρ_1) + β(x_t − ρ_1 x_{t−1}) + (u_t − ρ_1 u_{t−1}),

or

y*_t = α c*_t + β x*_t + u*_t,        (2.5)

where

y*_t ≡ y_t − ρ_1 y_{t−1},
c*_t ≡ 1 − ρ_1,
x*_t ≡ x_t − ρ_1 x_{t−1},        (2.6)
u*_t ≡ u_t − ρ_1 u_{t−1} = ε_t.
The transformed model (2.5) has an error term ut* = εt that is serially uncorrelated, so it can
be estimated efficiently by OLS. But in order to apply GLS to this model we must solve two
problems: (1) what to do about the first observation and (2) how to obtain an estimate of ρ1.
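Before turning to those problems, here is a minimal sketch of the transformation itself (our own illustration with hypothetical variable names; the prais command discussed below automates all of this, including the first observation):

quietly regress y x
predict uhat, residuals
quietly regress uhat L.uhat, noconstant   // estimate rho_1 from the residuals
scalar rho = _b[L.uhat]
generate ystar = y - rho*L.y              // quasi-differenced variables;
generate xstar = x - rho*L.x              // the first observation is lost
regress ystar xstar                       // Cochrane-Orcutt-style second step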
The first-observation problem is that we cannot use the transformation in (2.6) when t =
1 because we generally do not have observations for y0 and x0. There are two choices for
dealing with this problem. We can omit the first observation and estimate (2.5) for T – 1 ob-
servations starting with t = 2, but this solution loses the information from the omitted obser-
vation, which may be significant in small samples.¹ Alternatively, we can include the first
observation, but if we were to add this one untransformed observation to the T – 1 trans-
formed observations, it would have a different error variance. An untransformed observation
has variance var(u) = σ_u². The transformed observations have a smaller variance equal to var(ε) = (1 − ρ_1²) σ_u². Thus, to make the variance of the untransformed initial observation
match the transformed observations we must calculate
y*_1 = √(1 − ρ_1²) y_1,
c*_1 = √(1 − ρ_1²),
x*_1 = √(1 − ρ_1²) x_1,        (2.7)
u*_1 = √(1 − ρ_1²) u_1
¹ If p > 1, the transformations in (2.6) have p lags and we would lose p observations from the beginning of the sample.
in order to add the first observation into the transformed estimating sample.
To perform feasible GLS in the AR(1) model, we require an estimate of ρ1. The most
common way of estimating ρ1 is using the residuals û from an OLS regression to approxi-
mate u and then calculating ρ1 either as the correlation coefficient between uˆt and uˆt −1 or as
the estimated coefficient in a regression uˆt = ρ1uˆt −1 + εt . The Prais-Winsten estimator uses this
method to estimate ρ1 and includes all T observations using (2.7) for t = 1. The Cochrane-
Orcutt estimator estimates ρ1 in the same way but omits the first observation. The two estima-
tors are asymptotically equivalent because the importance of the first observation becomes
negligible as T gets large. Because Prais-Winsten is more efficient in small samples, we shall
not consider the Cochrane-Orcutt estimator further.
To estimate FGLS models with AR(1) error terms in Stata, we use the prais com-
mand, which has a format similar to regress. The default method is iterated Prais-Winsten
estimation. We can alter the method used by specifying options: corc produces Cochrane-
Orcutt estimates; twostep disables iteration; ssesearch specifies the Hildreth-Lu search
procedure.
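For example, with hypothetical variables y and x:

prais y x              // iterated Prais-Winsten (the default)
prais y x, corc        // Cochrane-Orcutt: drops the first observation
prais y x, twostep     // stop after a single estimate of rho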
As noted above, autocorrelation introduces two problems with the OLS model: ineffi-
cient estimators and inconsistent standard errors. GLS methods seek to achieve efficient es-
timation of parameters in the presence of autocorrelated errors. An alternative is to accept
the inefficiency of the OLS estimators but correct the standard errors so that valid inference
can be performed.
The Newey-West HAC-robust standard errors for the OLS estimators are consistent when
the error term is heteroskedastic, autocorrelated, or both, as long as the regressors are sta-
tionary and ergodic. These robust standard errors are kin to White’s heteroskedasticity-
robust standard errors, but the formulas are more complex.
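In Stata, Newey-West standard errors are provided by the newey command, which requires the user to choose the maximum autocorrelation lag used in the correction. A minimal sketch (the lag length 4 is arbitrary here):

newey y x, lag(4)      // OLS coefficients with HAC-robust standard errors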
We begin with the standard regression model (with one variable, for simplicity):
y_t = α + βx_t + u_t,        (2.8)

where cov(u_t, u_{t−s}) ≠ 0 for s ≠ 0. For this derivation, we assume that var(u_t) = σ_u² for all ob-
servations, although the Newey-West standard errors are consistent even when the error
term is heteroskedastic.
The derivation of equation (2.3) suggests how the OLS standard errors can be corrected
to account for the presence of serial correlation. [The derivation here is based on Stock and
Watson (2011, 595-600).] We know that the OLS estimator for β in equation (2.8) can be
written
b = β + [(1/T) ∑_{t=1}^T (x_t − x̄) u_t] / [(1/T) ∑_{t=1}^T (x_t − x̄)²].
Recall that probability limits are very forgiving in that plim f(x) = f(plim x) quite generally. We know that plim x̄ = µ_x and, if x is stationary and ergodic, then

plim (1/T) ∑_{t=1}^T (x_t − x̄)² = σ²_x.
Therefore,
plim(b − β) = [plim (1/T) ∑_{t=1}^T (x_t − µ_x) u_t] / σ²_x = plim(v̄) / σ²_x,

where v_t ≡ (x_t − µ_x) u_t and v̄ = (1/T) ∑_{t=1}^T v_t.
In large samples where we can assume that b is arbitrarily close to plim(b),

var(b) = var(v̄/σ²_x) = var(v̄)/σ⁴_x.

With no serial correlation in v,

var(v̄) = (1/T) var(v_t) = σ_v²/T,
which simplifies to the usual formula for the OLS standard error. But with serial correlation,
when we take the variance of the sum in v the covariance terms are not zero.
In the case where there is serial correlation we have to take into account the covariance of
the vt terms:
var(v̄) = var[(v_1 + v_2 + ⋯ + v_T)/T]
       = (1/T²) ∑_{i=1}^T ∑_{j=1}^T E(v_i v_j)
       = (1/T²) ∑_{i=1}^T [var(v_i) + ∑_{j≠i} cov(v_i, v_j)]
       = (1/T²) [T var(v_t) + 2(T − 1) cov(v_t, v_{t−1}) + 2(T − 2) cov(v_t, v_{t−2}) + ⋯ + 2 cov(v_t, v_{t−(T−1)})]
       = (σ_v²/T) f_T,
where
f_T ≡ 1 + 2 ∑_{j=1}^{T−1} [(T − j)/T] corr(v_t, v_{t−j})
    = 1 + 2 ∑_{j=1}^{T−1} [(T − j)/T] ρ_j.
Thus, var(b_2) = (1/T)(σ_v²/σ_x⁴) f_T, which expresses the variance as the product of the no-
autocorrelation variance and the fT factor that corrects for autocorrelation. In order to im-
plement this, we need to know fT, which depends on the autocorrelations of v for orders 1
through T – 1.
In practice, just as in GLS estimation, these are not known and must be estimated. As
usual, we use the OLS residuals uˆt as estimators for ut, which allows us to compute esti-
mates of vt for each observation in the sample. For ρ1 we have lots of information because
there are T – 1 pairs of values for (vt, vt – 1) in the sample. Similarly, we have T – 2 pairs to use
in estimating ρ2, T – 3 pairs for ρ3, and so on.
As the order of autocorrelation j gets larger, there are fewer and fewer observations from
which to estimate ρj. When we get to ρT – 1, there is only one pair of observations that are T –
1 periods apart—namely (uT, u1)—on which to base an estimate, so this correlation cannot be
calculated at all.
References
Granger, C. W. J., and Paul Newbold. 1974. Spurious Regressions in Econometrics. Journal
of Econometrics 2 (2):111-120.
Greene, William H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, N.J.: Pearson
Education.
Stock, James H., and Mark Watson. 2011. Introduction to Econometrics. 3rd ed. Boston:
Pearson Education/Addison Wesley.
Wooldridge, Jeffrey M. 2009. Introductory Econometrics: A Modern Approach. 4th ed. Mason,
OH: South-Western Cengage Learning.
CHAPTER 3
Distributed-Lag Models
A distributed-lag model is a dynamic model in which the effect of a regressor x on y occurs
over time rather than all at once. In the simple case of one explanatory variable and a linear
relationship, we can write the model as
y_t = α + β(L) x_t + u_t = α + ∑_{s=0}^∞ β_s x_{t−s} + u_t,        (3.1)

where u_t is a stationary error term.¹ This form is very similar to the infinite-moving-average
representation of an ARMA process, except that the lag polynomial on the right-hand side is
applied to the explanatory variable x rather than to a white-noise process ε. The individual
coefficients β_s are called lag weights and they collectively comprise the lag distribution. They
define the pattern of how x affects y over time.
One difficulty that is common to all distributed-lag models is choice of lag length,
whether this be choosing the point q at which to truncate a finite lag distribution in (3.1) or
choosing how many lagged dependent variables to include. We defer this question until later
in the chapter, after various distributed-lag models have been introduced.
¹ We assume that y_t does not depend on future values of x; thus we exclude negative values of s from the summation. However, it is theoretically possible to have "negative lags" on the right-hand side; for example, people might change their behavior now if they know that a law is going to change in the future. Equation (3.1) can be modified appropriately for such circumstances.
The central questions in estimating a distributed-lag model concern not only how much effect x has on y, but when it has the effect. Is the effect immediate? Does it emerge
slowly? Is there an initial effect that goes away after a few periods? In order to answer these
questions, we must estimate the lag distribution relating y to x.
In terms of partial derivatives,

∂y_{t+s}/∂x_t = ∂y_t/∂x_{t−s} = β_s.        (3.2)
Note that the first equation in (3.2) requires that the time-series relationship between y and x
be stationary, so we can think of βs either as the effect of current xt on future yt + s or as the
effect of past xt – s on current yt.
When reporting the results of a lag regression it is common to express the lag weights
either in a table, on a graph, or both. Suppose that we estimate a finite distributed lag with
weights of 4, 2, and 0.5. For this example, equation (3.1) becomes
y_t = α + 4x_t + 2x_{t−1} + 0.5x_{t−2} + u_t.        (3.3)
We might show the lag weights for equation (3.3) in a graph similar to Figure 3-1.
[Figure 3-1. Lag weights for equation (3.3): the effect of x on y (vertical axis) at lags 0, 1, and 2 (horizontal axis).]
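A finite distributed lag such as (3.3) can be estimated by OLS using Stata's lag-operator varlist notation (a sketch with hypothetical variables, assuming the data are tsset):

regress y L(0/2).x     // regresses y on x, L.x, and L2.x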
To see the interpretation of the lag weights, we consider two special cases: a temporary
change in x and a permanent change in x. Suppose that x increases temporarily by one unit in
period t, then returns to its original lower level for periods t + 1 and all future periods. For
the temporary change, the time path of the changes in x looks like Figure 3-2: the change is
zero except in period t, where it is one.
[Figure 3-2. Change in x over time for a temporary one-unit increase at time t: the change is zero except in period t, where it is one.]
In period t itself, y rises by β_0 = 4, the coefficient on the current value of x. Given that x_{t+1}, x_{t−1}, and the disturbance are unchanged, the change in y in period t + 1 is 2,
the coefficient on the first lag of x (β1). This is the dynamic marginal effect of x on y at one lag.
By similar analysis, we can see that the effect of the temporary change in x at time t on yt + 2 is
β2 = 0.5.
From this example we can see that the pattern of dynamic marginal effects of a tempo-
rary change in x on y is given by the coefficients of the lag distribution βs that are shown in
Figure 3-1.
Now consider the case of a permanent increase in x at time t: x increases by one unit in
period t and remains higher in all periods after t than it was before t. This change is graphed
in Figure 3-3.
[Figure 3-3 here: time path of the change in x for a permanent change; "Change in x" is zero before period t and one in period t and every period thereafter.]
The effect of the permanent change accumulates over time: yt rises by β0 = 4; yt+1 rises by β0 + β1 = 6 because both xt+1 and xt are one unit higher; and yt+2 rises by β0 + β1 + β2 = 6.5. Moving one more period into the future, the effect on yt+3 will be the same as the effect on yt+2 because once again all of the x terms on the right-hand side are increased by one unit. The time path of the cumulative effects is shown in Figure 3-4, with the cumulative effect staying at 6.5 for all lags starting at 2. The limit of the cumulative effect as the lag length goes to infinity is called the long-run cumulative effect of x on y. It measures how much y will eventually change in response to a permanent change in x.
[Figure 3-4 here: cumulative effect of a permanent change in x over periods t through t + 4, rising from 4 to 6 to 6.5 and remaining at 6.5 thereafter.]
In this case, the long-run cumulative effect is Σ_{s=0}^{q} βs. If the moving-average representation converges to zero slowly as s goes to infinity rather than truncating at finite q, then the long-run cumulative effect is lim_{q→∞} Σ_{s=0}^{q} βs or Σ_{s=0}^{∞} βs.
The pattern of the dynamic marginal effects and cumulative effects tells us about both
the magnitude and the timing of the effect of x on y. In the example we studied above, per-
manent increases in x lead to permanent increases in y that get larger over the first three peri-
ods of the change. Temporary changes in x, by contrast, lead to temporary changes in y that
die away after three periods.
Another pattern that is plausible for some economic relationships is that permanent
changes in x may lead to only temporary changes in y. For example, standard macroeconom-
ic theory tells us that changes in the rate of monetary growth have only temporary effects on
real output growth. In such a situation, the positive marginal effects at short lags (β0, β1, and
β2, perhaps) would be offset by negative marginal effects at longer lags so that the long-run
cumulative effect (the sum of all the β coefficients) is zero.
For example, suppose that we are looking at the effect of changing the legal drinking age
on traffic fatalities. If we were to model this relationship with x being the change from year
to year in the drinking age and y being fatalities, then a once-and-for-all increase in the drink-
ing age leads to a positive value of x in the year of the increase and x = 0 in subsequent years
when the drinking age stays high. We might expect a permanent reduction in traffic fatalities
to result, which would mean negative β values extending into the indefinite future.
We can solve this problem by redefining x to be the legal drinking age itself, rather than
the change in the drinking age. If we do this, then the once-and-for-all increase leads to a
permanently higher value of x. We saw above that permanent changes in x may have a per-
manent effect on y even if the β coefficients eventually converge to zero.
Another way of resolving this difficulty is by redefining y to be the change in traffic fatali-
ties leaving x as the change in drinking age. In this case, the temporary change in x associat-
ed with the rise in the drinking age would be associated with a temporary (though possibly
lagged) change in y measuring the change in traffic deaths. We would expect the increases in
fatalities eventually to return to zero after the one-time change in drinking age, so again the β
coefficients would converge to zero over time.
In specifying dynamic econometric models, it is crucial to think very carefully about the
nature of the dynamic relationship among the variables. We must decide how we would ex-
pect y to respond over time to a one-time change in x, then define the variables as levels or
changes in order to represent the expected relationship with a stationary lag function.
In cases where the effect of x on y dies out quickly, it may be feasible to estimate equa-
tion (3.5) directly. The finite distributed lag model has several advantages. The coefficients
can be estimated by OLS or GLS, assuming that x is strictly exogenous. Interpretation of the
βs coefficients is straightforward using the analysis above. Since there are no restrictions im-
posed on the q + 1 lag coefficients, any finite pattern of lag weights can be estimated.
There are two disadvantages to the finite distributed lag model. The first is multicolline-
arity. Even if x is stationary, it may be highly autocorrelated, meaning that xt and xt – 1 are
strongly correlated, as are xt – 1 and xt – 2, xt – 2 and xt – 3, etc. High levels of correlation among
the regressors imply multicollinearity, which leads to unreliable coefficient estimates with
large variances and standard errors. Estimation of finite distributed lag models with strongly
autocorrelated regressors often leads to lag distributions in which the sequence of lag coeffi-
cients bounces around between large and small—and sometimes positive and negative—
values in ways that are not consistent with economic theory. When this happens, econome-
tricians often try to restrict the estimated β coefficients to satisfy prior assumptions of
smoothness.
The second disadvantage of finite distributed lags is that they can be problematic
when the lag length is long, especially in small samples. If we have data for observations
from t = 1 through T, then the earliest observation that can be included in the estimating
sample is t = q + 1, because we need to have data for q periods before the beginning of the
estimating sample for the lagged terms on the right-hand side. Thus, we have T – q observa-
tions available to estimate q + 2 coefficients (including the constant, assuming that there is
only one regressor). This affords us T – 2q – 2 degrees of freedom. Each time we lengthen the
lag by one period, we lose two degrees of freedom—one because we must estimate another
coefficient and one because we must reduce our sample by one period. Unless T is very large
compared to plausible values of q, degrees of freedom can be depleted very quickly as the lag
length increases. Note also that this effect is magnified if there are two or more regressors for
which we must estimate lag distributions.
In summary, the finite distributed lag model is most suitable for estimating dynamic relationships when lag weights decline to zero relatively quickly, when the regressor is not highly autocorrelated, and when the sample is long relative to the length of the lag distribution.
We may believe strongly (perhaps based on economic theory) that the lag weights βs
should be a smooth function of s. If the unrestricted finite distributed lag estimates contradict
this smoothness, we may choose to restrict the model to impose smooth lag weights. Re-
stricting the lag coefficients not only imposes smoothness, but also reduces the number of parameters that must be estimated. There are many smooth patterns that we can choose to impose on the weights. We shall discuss a simple example in detail, then consider other possible restrictions.
One simple restriction on the lag weights is that they decline linearly from an initial posi-
tive or negative impact effect to zero at a lag of length q + 1. [Might want a graph here.] In
other words, each of the lag weights β1, β2, …, βq is a linearly declining fraction of the impact effect β0, according to Table 3-1. Each lag weight βs is smaller than its predecessor βs−1 by the fixed amount β0/(q + 1), until the effect dies to zero at s = q + 1. The formula for the lag weights in Table 3-1 is
βs = ((q + 1 − s)/(q + 1)) β0, s = 1, 2, ..., q. (3.6)
To estimate the linear-declining lag model for given q, we substitute for each βs from
equation (3.6) into equation (3.5) to get
yt = α + Σ_{s=0}^{q} ((q + 1 − s)/(q + 1)) β0 xt−s + ut = α + β0 [Σ_{s=0}^{q} ((q + 1 − s)/(q + 1)) xt−s] + ut. (3.7)
Equation (3.7) has two parameters to be estimated, α and β0, with the single regressor in the
model being the bracketed term in (3.7). The remainder of the lag distribution is determined
by the choice of q and the linear restriction. Estimation of (3.7) will have only T − q observations because we still need to exclude the first q data points to construct the weighted-sum lag variable, but it has only 2 parameters to be estimated, giving T − q − 2 degrees of freedom. Relative to unrestricted estimation, we have saved q degrees of freedom.
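As an illustration, a minimal Stata sketch of estimating (3.7) with q = 4 might look like the following; the variable names y, x, and xldl and the time variable year are hypothetical:

    * Construct the bracketed regressor in (3.7) for q = 4:
    * weights (q+1-s)/(q+1) = 5/5, 4/5, 3/5, 2/5, 1/5
    tsset year
    generate xldl = x + (4/5)*L1.x + (3/5)*L2.x + (2/5)*L3.x + (1/5)*L4.x
    * OLS now estimates only the constant and the impact effect beta0
    regress y xldl

The coefficient on xldl estimates β0; the remaining lag weights then follow from (3.6).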
The impact effect is β0 in this model, and the long-run cumulative effect is
β0 Σ_{s=0}^{q} (q + 1 − s)/(q + 1) = (β0/(q + 1)) Σ_{s=0}^{q} (s + 1)
= (β0/(q + 1)) [(q + 1) + Σ_{s=0}^{q} s]
= β0 [1 + q(q + 1)/(2(q + 1))]
= β0 (1 + q/2).
Table 3-1. Lag weights for the linear-declining lag with q = 4

  s                            βs
  0                            β0
  1                            (4/5)β0
  2                            (3/5)β0
  3                            (2/5)β0
  4                            (1/5)β0
  Long-run cumulative effect   3β0
The same procedure we used to estimate the linear-declining lag model can be applied to
other shapes as well. For example, if the lag weights increase linearly to a peak at lag m and then decline to zero in a symmetric "tent" shape, we can model βs as
βs = βm (1 − |m − s|/(m + 1)), s = 0, 1, ..., 2m.
Once again, this formula can be substituted into (3.5) to get an estimating form that allows
βm to be estimated. The long-run cumulative effect in this model is (m + 1)βm. For a lag of
length 6 that reaches its maximum effect at 3 lags, the lag coefficients of the tent lag would
be
  s                            βs
  0                            (1/4)β3
  1                            (1/2)β3
  2                            (3/4)β3
  3                            β3
  4                            (3/4)β3
  5                            (1/2)β3
  6                            (1/4)β3
  Long-run cumulative effect   4β3
A more common application of restricted distributed lags is the polynomial distributed lag
first explored by Shirley Almon (1965). The most common application of the polynomial
distributed lag is restricting the lag coefficients to lie on a quadratic function. This imposes
smoothness on the coefficients, but allows for considerable flexibility in the shapes of the lag
distributions that it permits. Depending on what part of the parabola lies in the range (0, q),
the commonly plausible shapes shown in Figure 3-5 can be estimated. The quadratic lag dis-
tribution also allows linear lags as a special case, including flat lags and the linear-declining
lags discussed earlier.
[Figure 3-5 here: plausible shapes of quadratic lag distributions, with βs plotted against s.]
The quadratic polynomial distributed lag models the lag weights as
βs = ξ0 + ξ1s + ξ2s², s = 0, 1, ..., q, (3.8)
where ξ0, ξ1, and ξ2 are the parameters of the quadratic function describing the lag weights.
Substituting into (3.5) yields
yt = α + Σ_{s=0}^{q} (ξ0 + ξ1s + ξ2s²) xt−s + ut
= α + ξ0 Σ_{s=0}^{q} xt−s + ξ1 Σ_{s=0}^{q} s xt−s + ξ2 Σ_{s=0}^{q} s² xt−s + ut,
or
yt = α + ξ0 z t0 + ξ1 z t1 + ξ2 z t2 + ut , (3.9)
where
zt0 ≡ Σ_{s=0}^{q} xt−s,  zt1 ≡ Σ_{s=0}^{q} s xt−s,  zt2 ≡ Σ_{s=0}^{q} s² xt−s.
The z variables can be constructed by simple transformations once you have chosen a value
of q, which allows equation (3.9) to be estimated by standard linear methods. Once we have
obtained estimates for the parameters of (3.9), we can obtain the implied estimates of the lag
weights from (3.8). Since the functions in (3.8) relating the lag weights β to the quadratic pa-
rameters ξ are linear, calculation of the standard errors of the lag weights is straightforward
using the usual procedure for linear functions of coefficients.
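As a sketch, the z variables for a quadratic lag with, say, q = 8 might be constructed as follows in Stata; y, x, and year are hypothetical names:

    * Build the z regressors of (3.9) for q = 8
    tsset year
    generate z0 = 0
    generate z1 = 0
    generate z2 = 0
    forvalues s = 0/8 {
        replace z0 = z0 + L`s'.x
        replace z1 = z1 + `s'*L`s'.x
        replace z2 = z2 + (`s'^2)*L`s'.x
    }
    * Estimate (3.9); the coefficients on z0, z1, z2 are xi0, xi1, xi2
    regress y z0 z1 z2
    * Implied lag weight at s = 2 from (3.8), with its standard error
    lincom z0 + 2*z1 + 4*z2

The lincom command computes the linear combination ξ0 + 2ξ1 + 4ξ2 = β2 and its standard error directly.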
Of course, quadratic lag distributions can also come out looking counterintuitive, for ex-
ample with lags that diverge from zero at the far endpoint or that swoop into negative values
in the middle. Implausible estimated lag distributions may be evidence of misspecification of
the model, so they should not be ignored. It is possible to force the quadratic lag distribution
to have certain properties, such as having the weights converge smoothly to a zero value at
lag q + 1 (as we did for the linear-declining lags). For the quadratic case, we want to restrict
βq + 1 to have the value zero, so βq +1 = ξ0 + ξ1 ( q + 1) + ξ2 ( q + 1) = 0. This means that the pa-
2
rameters of (3.9) have the linear restriction ξ0 = −ξ1 ( q + 1) − ξ2 ( q + 1) , which can be im-
2
posed in estimation to assure that the quadratic lag distribution dies to zero smoothly at the
end of the lag.
The method of polynomial distributed lags can be used with cubic or even higher-order
polynomials as well. The higher the order of the polynomial, the less “smooth” the lag dis-
tribution is allowed to be and the more parameters must be estimated. Of course, just as we
can always find a line passing through any two points and a parabola passing through any
three points, any unrestricted lag distribution with q + 1 lag weights lies on a polynomial lag
distribution of order q, so a q-order polynomial distributed lag is equivalent to the unrestrict-
ed distributed lag model.
The restrictions implied by a polynomial distributed lag model can always be tested as
linear restrictions on the unrestricted polynomial estimates. This tells us whether the data
can reject the smoothness imposed by the polynomial lag model. If non-smoothness of the
unrestricted lag weights was due to multicollinearity (which makes the point estimates unre-
liable), then the smoothness restrictions may not be rejected. If the restrictions are rejected,
then the data contain strong evidence that the lag distribution does not follow the smoothed
model we have imposed, and we impose the smoothness model at our peril.
The easiest way to implement the test of the polynomial restrictions is in terms of “dif-
ferences” of coefficients. For the quadratic lag, we difference (3.8) in terms of s to get
βs − βs−1 = (ξ0 + ξ1s + ξ2s²) − (ξ0 + ξ1(s − 1) + ξ2(s − 1)²)
= ξ1 + ξ2 [s² − (s − 1)²] = ξ1 + ξ2 [s² − (s² − 2s + 1)]
= ξ1 + ξ2 (2s − 1), s = 1, 2, ..., q,

(βs − βs−1) − (βs−1 − βs−2) = ξ1 + ξ2(2s − 1) − [ξ1 + ξ2(2(s − 1) − 1)]
= ξ2 [2s − 1 − (2s − 2 − 1)]
= 2ξ2, s = 2, 3, ..., q.
Thus, the second difference of the lag coefficients (differencing with respect to the lag) is a
constant. That means that we can test the following q – 2 restrictions on β0, β1, …, β q to test
the quadratic restriction:
β2 − 2β1 + β0 = β3 − 2β2 + β1
β3 − 2β2 + β1 = β4 − 2β3 + β2
⋮
βq−1 − 2βq−2 + βq−3 = βq − 2βq−1 + βq−2.
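Hypothetically, with q = 4 (so that there are q − 2 = 2 restrictions), the test might be run in Stata as follows; y and x are placeholder names, and the lag coefficients are referenced by their time-series-operator names:

    * Unrestricted finite distributed lag with q = 4
    regress y L(0/4).x
    * Test equality of successive second differences of the lag weights
    test (L2.x - 2*L1.x + x = L3.x - 2*L2.x + L1.x) ///
         (L3.x - 2*L2.x + L1.x = L4.x - 2*L3.x + L2.x)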
The first-order autoregressive lag model is often called the Koyck lag in recognition of
the seminal application of the model to the macroeconomic investment function by L. M.
Koyck (1954). With a single explanatory variable x, the model is written
yt = δ + φ1 yt −1 + θ0 x t + ut . (3.10)
Estimation of equation (3.10) presents challenges because yt – 1 is by definition not strictly ex-
ogenous and, unless ut is white noise, is not even weakly exogenous. We postpone these es-
timation concerns for the moment and discuss the interpretation of the coefficients in the
Koyck model.
Using the lag operator, we can write equation (3.10) as
(1 − φ1L) yt = δ + θ0xt + ut.
This suggests solving the model for yt as
yt = δ/(1 − φ1L) + [θ0/(1 − φ1L)] xt + [1/(1 − φ1L)] ut,
or
yt = δ/(1 − φ1) + θ0 Σ_{s=0}^{∞} φ1^s xt−s + Σ_{s=0}^{∞} φ1^s ut−s, (3.11)
as long as |φ1| < 1.
Equation (3.11) has the form of the infinite distributed lag (3.1), with
α = δ/(1 − φ1),
βs = θ0 φ1^s,
and the disturbance term having an infinite-moving-average form. The dynamic marginal
effects of x on y in the Koyck model are
∂yt+s/∂xt = βs = θ0 φ1^s. (3.12)
If, as is usually the case, 0 < φ1 < 1, the lag weights decline exponentially to zero from an
initial value of θ0. The long-run cumulative effect of x on y is
θ0 Σ_{s=0}^{∞} φ1^s = θ0/(1 − φ1).
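For a hypothetical illustration, suppose θ0 = 0.5 and φ1 = 0.8. The lag weights are then 0.5, 0.4, 0.32, 0.256, …, each 80 percent of its predecessor, and the long-run cumulative effect is 0.5/(1 − 0.8) = 2.5.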
This exponentially declining lag distribution seems to fit many economic relationships
well. Moreover, some theoretical models, such as exponential convergence models in eco-
nomic growth and models with quadratic adjustment costs, predict exponentially declining
lag weights.
The Koyck lag model can be used with more than one regressor in the equation, but it
imposes a significant restriction on the lag distributions. Suppose that we have two regres-
sors, x and z:
yt = δ + φ1 yt −1 + θ0 x t + λ 0 z t + ut .
The dynamic marginal effects of x on y will be as in equation (3.12). The effects of z on y will
be
∂yt+s/∂zt = λ0 φ1^s.
Both the effects of x and the effects of z decline at the same exponential rate φ1 as s increases,
so with the Koyck lag it is not possible for y to respond more quickly to one explanatory var-
iable than to another. This symmetry of dynamic responses may be appropriate in some ap-
plications. For example, if the reason for lagged response is costs of adjusting y to its optimal
level, then it may be reasonable to expect that the same adjustment costs—and thus the same
lag structure—would apply regardless of which variable caused a change in the optimal y. In
other cases, we might expect different patterns of dynamic response to different regressors.
Consumption of asparagus may adjust more (or less) quickly to changes in the price of as-
paragus than to changes in income. The Koyck lag would not allow independent estimation of the lag structures of the two variables.²
The theoretical and empirical appeal of the Koyck lag has led to its frequent use. How-
ever, consistent estimation of the Koyck-lag model can be problematic. The lagged depend-
ent variable as a regressor on the right-hand side is never strictly exogenous, so the small-
sample assumptions needed for the Gauss-Markov Theorem cannot be satisfied.
Even the weak exogeneity required for consistency is often dubious. Lagging the model
one period means that yt−1 = δ + φ1yt−2 + θ0xt−1 + ut−1, so yt−1 is clearly correlated with ut−1. If ut is not white noise, so that cov(ut, ut−1) ≠ 0, then yt−1 and ut are correlated, the lagged dependent variable is not even weakly exogenous, and the OLS estimator is inconsistent. Given the ubiquity of serial correlation in error terms of dynamic models, it is
hard to maintain confidence in the consistent estimation of the Koyck model.
Avoiding inconsistency requires that we make sure that the error term is not serially cor-
related. GLS methods can be used to transform the model into one that is not serially corre-
lated, but only with great caution. Because the OLS estimators are inconsistent, so are all test
statistics and estimators based on the OLS residuals. For example, the Durbin-Watson test
and other tests that rely on the correlation coefficient between uˆt and uˆt −1 are not valid.
Moreover, we must estimate the disturbance autocorrelation parameter using a method that
does not rely on OLS residuals. Thus, the Hildreth-Lu search procedure is valid for estimat-
ing a model with an AR(1) error and a lagged dependent variable, but the Prais-Winsten and
Cochrane-Orcutt methods are not.
The Koyck lag treats y as a first-order autoregressive process. One lag of the dependent variable is often enough to capture the dynamic relationship between y and the regressors, but longer autoregressive lags can be included as well. The general auto-
regressive lag model AR(p) would be written
φ ( L ) yt = δ + θ0 x t + ut , (3.13)
with φ(L) a p-order polynomial in the lag operator. In order for the relationship between y
and x to be dynamically stable, the roots of φ(L) must lie outside the unit circle. This general-
izes the condition |φ1| < 1 from the Koyck lag model. If the stability condition does not
2
We shall see below that adding one or more lags of x and/or z could allow differences in the first few
terms of the lag distribution. However, the “tail” of the lag distribution beyond the longest lag of x or z
always depends only on φ1.
hold, then a one-time change in x will cause permanent or explosive changes in y, which
suggests differencing y to make the order of integration the same on both sides of the equa-
tion.
The φ parameters of the autoregressive lags determine the shape of the lag distribution.
As with the first-order Koyck lag, lag weights can decline smoothly according to an expo-
nential pattern. But with higher-order lags, other patterns are possible. For example, if the
roots of φ(L) are complex numbers (i.e., have non-zero imaginary terms), then lag weights
may oscillate back and forth like a damped sine function, converging eventually to zero.
(Complex roots on the unit circle would have non-damped oscillations like a pure sine wave;
complex roots inside the unit circle would lead to explosive oscillations.)
[Figure here: an oscillating lag distribution; the lag weights alternate in sign and damp toward zero over lags 0 through 19, ranging between roughly −0.4 and 0.8.]
For example, an increase in output might raise investment initially but later lower it (due to increases in the size of the capital stock), with the effect converging to zero in the long run.³
Given the autoregressive lag relationship in equation (3.13), a logical extension is to al-
low lags of x on the right-hand side. The general autoregressive distributed lag (ARDL) model
is written
φ ( L ) yt = δ + θ ( L ) x t + ut , (3.14)
where φ(L) is an order-p polynomial that, for stability, has roots lying outside the unit circle
and θ(L) is an order-q polynomial. Expanding the lag polynomials, equation (3.14) can be
written as
yt = δ + φ1 yt −1 + ... + φ p yt − p + θ0 x t + θ1 x t −1 + ... + θq x t −q + ut .
With a sample of T observations, this model can be estimated for T – max{p, q} observa-
tions.
As we did with univariate ARMA models, we can divide both sides of (3.14) by the au-
toregressive polynomial to get
yt = δ/φ(L) + [θ(L)/φ(L)] xt + [1/φ(L)] ut
= α + [θ(L)/φ(L)] xt + vt, (3.15)
where α and v are the constant and error term implicitly defined in (3.15). Because the lag distribution of the ARDL model can be represented by the ratio of two finite lag polynomials, it is sometimes called the rational lag. (Recall that rational numbers can be represented as the ratio of two integers.)
The analysis of ARDL models parallels that of univariate ARMA processes; the difference is that the lag structure on the right-hand side of (3.15) is applied to an explanatory variable x rather than to a white-noise error term ε as in an ARMA process. As in the ARMA models, the coefficients of the order-q polynomial θ(L) affect only the first q lags of the dynamic lag distribution of the effect of x on
y. The behavior of the “tail” of the lag distribution beyond q depends entirely on the auto-
regressive polynomial φ(L). The property that the dynamic effect is stable only if the roots of
φ(L) lie outside the unit circle carries over from the autoregressive lag model of the previous
section.
3
See Chapter 5 of Romer (2012).
Consider, for example, the Koyck model with a first-order autoregressive error:
yt = δ + φ1yt−1 + θ0xt + ut,
ut = ρut−1 + εt.
We assume that both φ1 and ρ are between –1 and 1 for stability and stationarity and that ε is
white noise. Solving for ut from the first line, ut = yt − δ − φ1yt−1 − θ0xt, so lagging once yields ut−1 = yt−1 − δ − φ1yt−2 − θ0xt−1. Substituting this into the second line and then plugging the re-
sulting expression for ut into the first line yields
yt = δ + φ1yt−1 + θ0xt + ρ(yt−1 − δ − φ1yt−2 − θ0xt−1) + εt
= (1 − ρ)δ + (φ1 + ρ)yt−1 − ρφ1yt−2 + θ0xt − ρθ0xt−1 + εt. (3.16)
Equation (3.16) shows that the ARDL(1,0) model with a first-order autoregressive error can
be reduced to an ARDL(2,1) model with a white-noise error. Given this apparent equiva-
lence, how could we determine which is the appropriate model?
The equivalence is not quite perfect because there is one nonlinear restriction on the co-
efficients of (3.16). The ARDL(2,1) model in (3.16) has 5 linear coefficients (the constant,
two lagged y coefficients, and two coefficients on current and lagged x), but they depend on
only 4 underlying parameters: δ, φ1, θ0, and ρ. Thus, if we needed to, we could estimate the
general ARDL(2,1) model (3.16) and test the nonlinear coefficient restriction:
coef(yt−1) = coef(xt) · [coef(yt−2)/coef(xt−1)] − coef(xt−1)/coef(xt). (3.17)
If the model is truly a Koyck lag with AR(1) error, then it is true that we can save one
degree of freedom and thus gain a marginal bit of efficiency by imposing restriction (3.17) on
the estimation of (3.16) (or by doing this implicitly through a two-step procedure such as
Hildreth-Lu). However, in practice, it is much easier to estimate the unrestricted ARDL(2,1)
model, which accounts (though not fully efficiently) for the possibility of an AR(1) error in
the simpler Koyck model.
This example shows that lengthening the lags in the ARDL model can eliminate serial
correlation in the error term. It is fairly straightforward to determine how many lags need to
be included. In (3.16), if ρ = 0 then the coefficients of yt – 2 and xt – 1 are both zero, so testing
the last lag terms and eliminating them if they are near zero will assure that we do not include an unneeded correction for autocorrelated disturbances. Testing the residuals for au-
tocorrelation should reveal whether the remaining error term is white noise. If it is not, then
adding more lags might be appropriate. Adding lags until the residual seems to be white
noise is the most common way that modern time-series econometricians deal with the possi-
bility of serial correlation in the disturbance. Two-step GLS estimation using the Prais-
Winsten or Hildreth-Lu procedures is used much less frequently.
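A sketch of this workflow in Stata (y, x, and the time variable year are hypothetical):

    * Estimate an ARDL(2,1) by OLS
    tsset year
    regress y L(1/2).y L(0/1).x
    * Breusch-Godfrey LM test for serial correlation in the residuals
    estat bgodfrey, lags(1/4)
    * If the test rejects white noise, add further lags of y and/or x and re-test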
The methods we discuss for choosing lag length can apply either to the lagged x terms on
the right-hand side—the lag length that we have been calling q—or to the number of lagged
dependent variables to include in an autoregressive-lag or ARDL model—what we have
called p. However, most of these methods cannot be applied in a straightforward way to de-
termining the length of restricted lag models such as the linearly-declining lag or the poly-
nomial distributed lag.
An obvious way to choose the length of a lag is to start with a long lag, test the statistical significance of the coefficient at the longest lag—the "trailing lag"—and shorten the lag by one period if we cannot reject the null hypothesis that the effect at the longest lag is zero. We continue shortening the lag until the trailing lag coefficient is statistically significant.
Although this method has appeal, there are dangers as well. Remember that an insignifi-
cant t statistic on the trailing lag only fails to reject the hypothesis of a zero coefficient; it does
not prove that the coefficient is zero! It is therefore quite possible that this procedure will
choose a lag length that is too short.
An alternative that also relies on statistical tests of significance is to start with a very
short lag and successively add lag terms, continuing to add lags that are statistically signifi-
cant and stopping when the marginal lag coefficient is not. This method is similar to the one
above and often, though not necessarily always, leads to a similar choice of lag length. To
see why they are not identical, consider what would happen if the first, second, and fourth
lags were (always) statistically significant but the third lag and all lags longer than four are
not. Starting from a long lag and working downward you would stop at four lags, eliminat-
ing the insignificant fifth lag but retaining the third lag by convention. Starting from a short
lag length and working upward you would stop at two lags; you would never discover the
significant fourth lag.
Information criteria are designed to measure the amount of information about the de-
pendent variable contained in a set of regressors. They are goodness-of-fit measures of the
same type as R2 or R̄2, but without the convenient interpretation as share of variance explained that we give to R2 in an OLS regression with an intercept term. The two most commonly used criteria are the Akaike information criterion (AIC) and the Schwarz/Bayesian information criterion (SBIC). They are usually calculated in log form by the formulas
AIC = ln(Σ_{t=1}^{T} ût² / T) + 2K/T,
SBIC = ln(Σ_{t=1}^{T} ût² / T) + K lnT / T. (3.18)
In equation (3.18), T is sample length, K is the total number of estimated coefficients, and uˆt
are the residuals.
The “main ingredient” in both information criteria is the sum of squared residuals,
which we want to make as small as possible. Thus, we want to minimize the criteria and
choose the model with the smallest AIC or SBIC value.
When using the information criteria to choose lag length, we must be very careful to
make sure that all candidate models among which we are choosing are estimated over exact-
ly the same sample period. This requires particular caution in lag models because there will
usually be more observations available for models with shorter lags (because with fewer lags
we “lose” fewer observations at the beginning of the sample). Passively allowing Stata to set
the sample by using all available observations will result in samples with different T for mod-
els with lags of different length, so the information criteria calculated from them cannot be compared. You should always use an in or if clause in your estimation command to keep the sample the same across regressions being compared with AIC or SBIC, then verify that all of the regressions being compared have identical samples.
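For example, the following hypothetical Stata sketch compares q = 1 through 4 over a common sample, assuming annual data beginning in 1960 so that four lags are available from 1964 on:

    * Fix the estimation sample to what the longest candidate lag (q = 4) allows
    forvalues q = 1/4 {
        display "Lag length q = `q'"
        quietly regress y L(0/`q').x if year >= 1964
        estat ic
    }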
The first term of the information criteria (common to both) is the log of the estimated residual variance (the square of the standard error of the estimate, uncorrected for degrees of freedom). This measures how well the
model explains the dependent variable. The second term is a “penalty term” that depends
positively on the number of estimated parameters K. Increasing lag length will lower the first
term by marginally improving the fit, but will also increase the second term because the
number of parameters will be larger. The information criteria provide alternative ways of
“trading off” improved fit against more parameters to estimate.
From (3.18), the SBIC penalizes additional parameters more strongly than the AIC (as-
suming, plausibly, that ln T > 2). Thus, the SBIC always chooses a lag length that is shorter
than (or the same length as) the one that minimizes the AIC. Neither is "better," so one might consider the SBIC as a lower bound and the AIC as an upper bound for the appropriate lag length. In the case that they happen to agree, the choice is clear.
When using residual autocorrelation to determine lag length, one adds lags until the re-
siduals appear to be white noise. After running the distributed-lag regression, one extracts
the residuals and uses a Breusch-Godfrey LM test or a Box-Ljung Q test to test the null hy-
pothesis that the residuals are white noise. Rejecting the white-noise null hypothesis means
that more lags should be added to the regression according to this criterion.
References
Almon, Shirley. 1965. The Distributed Lag Between Capital Appropriations and
Expenditures. Econometrica 33 (1):178-196.
Box, George E. P., and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and
Control. San Francisco: Holden-Day.
Granger, C. W. J., and Paul Newbold. 1974. Spurious Regressions in Econometrics. Journal
of Econometrics 2 (2):111-120.
Hamilton, James D. 1994. Time Series Analysis. Princeton, N.J.: Princeton University Press.
Koyck, L. M. 1954. Distributed Lags and Investment Analysis. Amsterdam: North-Holland.
Romer, David. 2012. Advanced Macroeconomics. 4th ed. New York: McGraw-Hill.
CHAPTER 4
Regression with Nonstationary Variables
If there are discrete breakpoints at which the structure of the individual series or the rela-
tionship between them changes, then we must adapt our model to accommodate these
changes. Splitting the sample into two or more sub-samples is the most obvious and common
way of doing this.
Little econometric attention has been devoted to the case where the series are explosive,
such as an AR(1) process with a parameter φ1 > 1. Such series are probably uncommon in
economics, since they have the property that a small shock leads to an ever-increasing effect
on the series that becomes infinite in the limit.
The cases that have dominated modern time-series analysis are borderline-nonstationary
cases. The most common one is regression involving unit-root or integrated time-series pro-
cesses. Another case that has received some attention is the trend-stationary process for
which deviations from a deterministic trend are stationary.
Economic relationships among variables may change over time. New laws or other as-
pects of the institutional environment can change discretely at a particular point in time,
leading to changes in economic agents’ behavior. Or behavior may evolve gradually over
time. In either case, the parameters of an econometric model are likely to change—suddenly or
gradually—through the time-series sample.
Ideally, the econometrician can measure the variables that have caused changes in be-
havior and incorporate them in the model. Suppose that the model of interest is
yt = α + β x t + ε t ,
where we assume for simplicity that there is only one explanatory variable, there is no lag
structure, and the error term is white noise. (All of these simplifications can be readily gener-
alized in a straightforward way.) We observe a variable zt that is likely to change the relation-
ship between y and x, perhaps a tax rate or other policy variable.
If a change in z affects E ( yt | x t ) , but not ∂E ( yt | x t ) / ∂x t , then only the level of the rela-
tionship between y and x is affected by z and we can insert z as a variable in the regression:
yt = δ + γz t + βx t + εt , (4.1)
with the constant intercept term α being replaced by a function of z: δ + γzt. If changes in z
affect the slope β = ∂E ( yt | x t ) / ∂x t as well as the level of the relationship, then an interac-
tion term is required:
yt = δ + γz t + ( λ 0 + λ1 z t ) x t + εt = δ + γz t + λ 0 x t + λ1 x t z t + εt . (4.2)
Here the original intercept term has been replaced with δ + γzt as above, plus the β coefficient
measuring the effect of x on y is replaced with λ0 + λ1zt.
Modeling changes in intercept terms or slope coefficients over time as a function of an-
other variable is the simple extension to a time-series context of the general procedure for
adding a variable to the model, including the possibility of an interaction. We can test the
hypothesis that γ = 0 in equation (4.1) or that γ = λ1 = 0 in equation (4.2) to determine if the
effect is statistically significant.
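For concreteness, a minimal Stata sketch of estimating (4.2) and testing γ = λ1 = 0 follows; the variable names y, x, z, and xz are hypothetical:

    * Estimate the interaction model (4.2)
    generate xz = x*z
    regress y z x xz
    * Joint test that z affects neither the level (gamma = 0) nor the slope (lambda1 = 0)
    test z xz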
The variable z may be a dummy variable reflecting a discrete change in the environment
or a continuous variable that leads to gradual changes. A common special case occurs when
z is a dummy variable that switches from zero to one at a fixed date and remains at one
through the rest of the sample:
zt = 0 for t = 1, 2, ..., T1, and zt = 1 for t = T1 + 1, T1 + 2, ..., T. (4.3)
For the case of a dummy variable such as this, we can think of its effect as a “breakpoint” in
the sample. The relationship between y and x is different (in intercept, slope, or both) after T1
than before.
If we know the breakpoint T1, then the test of γ = λ1 = 0 becomes a simple Chow test of
the null hypothesis that the coefficients are the same before and after the break. We can often
hypothesize about possible breakpoints even if we cannot always measure the underlying
variables that cause the relationship to change. For example, there is considerable economet-
ric evidence that macroeconomic relationships in the United States (and other advanced
economies) changed around 1973. Many factors may explain why the macroeconomy
changed in the early 1970s: a discrete rise in oil prices followed by OPEC’s waxing influ-
ence, the move to floating exchange rates, the emergence of Japan as an industrial power
and the increasing effects of globalization, and the entry of the baby-boom generation into
the work force are a few. Even though we may have variables that measure some of these
changes, disentangling the effects of all of these simultaneously changing factors may be be-
yond the power of our data. Therefore it is common to use a dummy variable to capture the
overall effect of multiple changes in the economic environment around 1973 and consider
whether the sample should be split there.
We often suspect that the structure of the model has changed at some point in the sam-
ple, but we do not know the date of the breakpoint. In other words, we believe that an equa-
tion such as (4.2) with zt defined in (4.3) is a good model for the data, but the date of the
breakpoint T1 is unknown.
With a known breakpoint, the Chow F statistic provides us with a statistical measure of
the magnitude of the break. So comparing the Chow F statistic associated with different pos-
sible breakpoints could give us an indicator of which breakpoint seems to be most strongly
supported by the data. This is the intuition of the Quandt likelihood ratio (QLR) test. [This
section is based on Stock & Watson, 3/e, pp. 558-561. See references there to Quandt (1960)
and Andrews (2003).]
To implement the QLR test, we must deal with two issues. First, in order to test whether
two sub-samples have the same coefficients, we must have enough observations in each sub-
sample to get reliable coefficient estimates. This means that we cannot detect or test potential
breakpoints that are close to either end of the sample. The reliability of the sub-sample esti-
mates depends on the number of degrees of freedom, the difference between the number of
observations in the sub-sample and the number of coefficients we want to estimate. Thus, the
degree of “trimming” of possible breakpoint dates that is necessary will depend on the length
of the sample and the number of parameters in the model. A conventional choice is to trim
15% of the observations from each end of the sample, looking for breakpoints only within
the central 70% of observations. Figure 4-1 shows a schematic representation of trimming.
The breakpoints must obviously be rounded to integer values.
If τ1 and τ2 are the minimum and maximum observations we consider as possible break-
points, the QLR statistic is defined as
QLR = max{Fτ1, Fτ1+1, ..., Fτ2−1, Fτ2},
where Fτ is the Chow F statistic for a breakpoint at observation τ. Because QLR is the maxi-
mum of a set of F statistics, it does not follow the F distribution.
[Figure 4-1 here: schematic of 15-percent trimming; candidate breakpoints run from observation 0.15×T to observation 0.85×T within the full sample 1, ..., T.]
The QLR test can be readily generalized to test for more than one possible breakpoint,
and indeed has been demonstrated to be effective at testing the null hypothesis of structural
stability even when the change in the coefficients is continuous rather than discrete. There
does not seem to be an implementation of the QLR test available for Stata.
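The test is straightforward to compute by brute force, however. The following rough sketch (assuming hypothetical variables y and x and a time index t running 1, …, T) loops over candidate breakpoints in the central 70% of the sample and keeps the largest Chow F statistic for a level-and-slope break:

    * QLR: maximum Chow F statistic over trimmed candidate breakpoints
    summarize t
    local t1 = ceil(r(min) + 0.15*(r(max) - r(min)))
    local t2 = floor(r(max) - 0.15*(r(max) - r(min)))
    local qlr = 0
    forvalues tau = `t1'/`t2' {
        quietly {
            generate d = (t > `tau')     // post-break dummy
            generate dx = d*x            // slope interaction
            regress y x d dx
            test d dx                    // Chow F for a break at tau
            drop d dx
        }
        if r(F) > `qlr' local qlr = r(F)
    }
    display "QLR = " `qlr'

The resulting maximum must be compared with QLR critical values (tabulated in Stock and Watson), not with the usual F table.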
…implied by a deterministic trend with the complications and surprises faced year after year by workers, businesses, and governments." The simplest trend-stationary process is
yt = α + γt + ut, (4.4)
where ut is a stationary disturbance term with constant variance σu². The variable yt has constant variance (and covariances) over time, but its mean E(yt) = α + γt changes with t, so yt is nonstationary as, of course, is t itself.
Recall that an integrated variable (of order one) is a variable whose first difference is stationary. We use the notation I(1) to denote such a variable and I(0) to denote a stationary variable. We use the term "levels" to refer to the actual values of the variable yt (or xt or ut) and the term "differences" to refer to the first differences ∆yt ≡ (1 − L)yt = yt − yt−1. By definition, if yt is I(1), then ∆yt is I(0).
To estimate the dynamic relationship between y and x, it is important to get the orders of
integration right! If the dependent variable is integrated, then at least some of the regressors
must also be integrated, otherwise we are trying to explain something that is nonstationary
by a set of explanatory variables that are not. Similarly, if the dependent variable is station-
ary, then it cannot “follow” an integrated explanatory variable on its nonstationary wander-
ings, so the model must be misspecified.
With one regressor, the order of integration of y and x must match for the specification to
make economic sense. With more than one regressor and an integrated dependent variable,
it is possible to have a mixture of integrated and stationary regressors. For example, we
could add some (stationary) dummy variables to a regression with integrated y and x. A good
rule of thumb is that you can’t explain something nonstationary with (only) stationary varia-
bles. Any nonstationary regressor will transmit its nonstationarity to the dependent variable,
so you cannot explain a stationary variable with a nonstationary one.
We saw in the baseball example that opened Chapter 2 that regressions in which y and x
are nonstationary lead to misleading conclusions: R2 and t statistics are likely to be large even
if the underlying variables are not truly related. The second column of Table 2-1 shows that
performing the same regression in terms of first differences yields the (correct) result that we
cannot reject the null hypothesis of a zero coefficient.
This suggests that differencing may be appropriate in nonstationary models, and this is often correct.¹ Granger and Newbold (1974) present strong evidence that regressions involv-
ing random walks are spurious when performed on the levels, but not on the differences. Ta-
ble 4-1 is taken from their subsequent book and reports the results of a Monte Carlo study in
which they generated unrelated random walks and performed regressions on them. The top
part of the table shows the results for regressions in levels, where all of the variables on both
sides of the equation are I(1); the bottom part shows the regressions where all variables are
differenced, and thus I(0).
If the variables are unrelated, the true null hypothesis that all β coefficients are zero
should be rejected 5% of the time. We see rejection rates of 76% to 96% depending on the number of regressors in the equation when estimating in levels, but correctly see 2% to 10% rejection² when the variables are made stationary by differencing. Similarly, the adjusted R2 coefficients are inflated in levels, but near zero with differencing. With five regressors, they find an adjusted R2 over 0.7 more than one-third of the time!
Table 4-1. Regressions among independent random walks: number of regressors, percent of samples rejecting H0 that all β = 0, average adjusted R2, and percent of samples with adjusted R2 > 0.7.

  Levels
    3    93    0.46     25
    4    95    0.55     34
    5    96    0.59     37
  Differences
    1     8    0.004     0
    2     4    0.001     0
    3     2   −0.007     0
    4    10    0.006     0
    5     6    0.012     0
1
An important exception is the special case of cointegration, which is discussed below.
2
Computer time was not cheap in the 1970s, so Granger and Newbold made do with only 100 samples. We could replicate this result for 100,000 samples on today's computers in a matter of minutes.
Formally, suppose that our hypothesized model is yt = α + βx t + ut . (We could add lags
of x or additional I(1) regressors without changing the basic principle.) Both y and x are I(1)
variables such as random walks, and we shall assume that the error term u is also I(1). Tak-
ing the first difference of the equation yields
yt = α + βxt + ut
−  yt−1 = α + βxt−1 + ut−1 (4.5)
∆yt = β∆xt + ∆ut
If y, x, and u are all I(1), then their differences, which appear in (4.5), are all stationary and
we can estimate β reliably.
However, notice that the constant term α disappears when we take differences. Because
α affects all values of y in the same way, taking the difference eliminates it from the equa-
tion. When performing a regression in differences, we generally want to remove the constant
term. Including a constant in the differenced equation would be equivalent to having a time
trend in the original “levels” equation.
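With tsset data in Stata, this is simply (y and x being hypothetical names):

    * Regression (4.5) in first differences; suppress the constant unless a
    * time trend is intended in the levels equation
    regress D.y D.x, noconstant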
4.3.2 Cointegration
A special case of great interest to econometricians arises when y and x are I(1), but the
error term in the relationship between them u is stationary. In this case, we say that y and x
are cointegrated. Two series that are cointegrated are nonstationary, but they are nonstation-
ary “together.” Think of two variables taking a “random walk together,” in which they both
move in a nonstationary manner over time, but the difference between them (or some other
linear function of them) is stationary, tending to return back to a stable, constant value after
being disturbed by a shock.
The econometric concept of cointegration often fits nicely with the economic concept of
long-run equilibrium. For example, the median price series for houses in Portland and Bea-
verton are both nonstationary—they are never going to return to the levels of two decades
ago. However, prices in the two cities cannot get too far out of line with each other, so it is
plausible that some linear combination of the two price series would be stationary, tending to
return to zero after a shock.
In terms of our model, we again have yt = α + βxt + ut, but we now assume that u is I(0) (with x and y both still assumed to be I(1)). We call this equation, with nonstationary variables but a stationary error, a cointegrating equation. The vector of coefficients (1, −α, −β) that makes 1·yt − α − βxt = ut stationary is called a cointegrating vector. For two variables, there can be only one cointegrating vector having a coefficient of one on y, although any multiple of the cointegrating vector (1, −α, −β) is also a cointegrating vector because, for example, 2yt − 2α − 2βxt = 2ut is obviously also stationary.
Can we estimate a cointegrated model in differences? Yes, but we probably do not want
to. If u is stationary, then its difference ∆u is also stationary, so the regression in differences is
valid. However, differencing the regression loses information contained in the stable long-
run relationship between the series.
To see why this happens consider a simple model. Suppose that P is the price of houses
in Portland and B is the price of houses in Beaverton. The long-run relationship between
them is
Bt = 0 + 0.9Pt + ut, (4.6)
where B and P are I(1) and u is I(0). (The zero constant term is included just to show that there can be one.) The differenced equation is
∆Bt = 0.9∆Pt + vt, (4.7)
where v ≡ ∆u.
Suppose that a shock u1 in period one causes the Beaverton price to increase relative to
the Portland price by 1 unit (perhaps $1,000). Assuming, for simplicity, that u0 was zero,
then v1 = u1 – u0 = u1 = +1. Because u is stationary, we know that this shock is going to dissi-
pate over time and that Beaverton’s house prices will eventually be expected to fall back
down to their long-run equilibrium relationship with Portland’s. However, if we use equa-
tion (4.7), we will predict that future changes in Beaverton’s price will be 0.9 times the
changes in Portland’s because E ( v2 =
) E ( v3 =) ...= 0. . There is no tendency in (4.7) to re-
store the long-run equilibrium relationship in (4.6).
How can we build our knowledge that u is stationary into our prediction? Because u1 > 0,
we expect that future values of u will be less than u1, which means that future values of vt =
∆ut will be negative. Thus, cointegration means that E ( v2 |u1 > 0 ) < 0, which is lost in the
differenced equation (4.7) when we ignore past shocks and simply assume E ( v2 ) = 0 . We
need to modify equation (4.7) to include a term that reflects the tendency of B to return to its long-run equilibrium relationship to P.³ A model incorporating such a term is called an error-correction model.
From equation (4.6), the degree to which Bt is above or below its long-run equilibrium relationship to Pt at the beginning of period t is measured by ut−1 = Bt−1 − 0.9Pt−1. An error-correction model adds this lagged disequilibrium to the differenced equation:
∆Bt = β∆Pt − λut−1 + vt. (4.8)
3
We might expect that an “over-differenced” error term like v would be negatively serially correlated:
a positive value in one period would be followed by negative values afterward as u returns to zero.
We expect that −λ < 0, so that a positive value of ut−1 is associated with reductions in future changes in B below what would be predicted by the corresponding future changes in P, as Beaverton prices return to their normal relationship to Portland prices. Notice that the error-
correction equation (4.8) is “balanced” in the sense that all terms on both sides—∆B, ∆P, u,
and v—are I(0). We can also estimate (4.8) successfully by OLS because all terms are sta-
tionary.
In order to estimate (4.8), we must know the cointegrating vector so that we can calcu-
late ut −1 . In this example, we must know the value 0.9. We can estimate the cointegrating
vector in one of two ways. We can estimate (4.8) by nonlinear least-squares with 0.9 re-
placed by an unknown parameter γ1 (and a constant γ0, because we would not know that the
constant was zero):
∆Bt = β∆Pt − λ ( Bt −1 − γ 0 − γ1 Pt −1 ) + vt .
However, we can also follow a two-step procedure in which we first estimate the cointegrat-
ing vector by running the “cointegrating regression”
Bt = γ 0 + γ1 Pt + ut (4.9)
in levels and then using estimated coefficients γ̂ 0 and γ̂1 to calculate uˆt −1 ≡ Bt −1 − γˆ 0 − γˆ 1 Pt −1
in the error-correction model (4.8).
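A minimal sketch of the two-step procedure in Stata, with hypothetical variables B and P and a generated residual series uhat:

    * Step 1: estimate the cointegrating regression (4.9) in levels
    regress B P
    predict uhat, residuals
    * Step 2: estimate the error-correction model (4.8) in differences
    regress D.B D.P L.uhat

The coefficient on L.uhat estimates −λ and should be negative if B adjusts back toward its long-run relationship to P.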
How can we get away with estimating (4.6) without encountering spurious regression
difficulties given that both of the variables are I(1)? It turns out that in the special case of
cointegration, the OLS estimator γ̂ is not only consistent, it is “super-consistent,” meaning
that its variance converges to zero at a rate proportional to 1/T rather than the usual rate,
which is proportional to 1/√T. Intuitively, this happens because it is very easy for OLS to find the "right" values of γ; any other value leads to a non-stationary error term, which will tend to have large squared residuals.⁴ Moreover, because γ̂ is super-consistent, we can estimate the error-correction model (4.8), with ût−1 in place of ut−1,
4
Although the OLS coefficient estimator in the cointegrating regression is super-consistent, we still
cannot use t tests based on its estimated standard error for the same reasons as in the spurious-
regression case.
without worrying about the potential inaccuracy of γ̂ —we can treat it as a known constant.
Although the long-run relationship between the levels of B and P can probably be de-
scribed effectively without worrying too much about lags, the adjustment of B to P over time
is probably not immediate. This leads us to think about incorporating lagged differences into
the error-correction model. If we allow for p lags of ∆Bt and q lags of ∆Pt , we arrive at a
model like
∆Bt = a1∆Bt−1 + … + ap∆Bt−p + b0∆Pt + b1∆Pt−1 + … + bq∆Pt−q − λut−1 + vt. (4.11)
Equation (4.11) is a typical form of an error-correction model, with the lengths p and q of the
lags to be determined by the methods discussed in Chapter 3.
Table 4-2 summarizes the four possible cases of stationarity and nonstationarity (I(1)) for
regressors and the error term. If yt = α + βx t + ut , then the time-series behavior of y is gov-
erned by the behavior of x and u. The first two columns of the table show the four possible
patterns of stationarity and nonstationarity for x and u. The only model that is not plausible
is the second line of the table, when x is stationary but nonstationarity in the error makes the
dependent variable nonstationary. It is hopeless to attempt to explain a nonstationary varia-
ble with regressors that are strictly stationary—they cannot capture the wandering over time that
will occur in y.
The first case is the one we examined in Chapter 3. It can be estimated with the distrib-
uted-lag models discussed there, possibly corrected for (stationary) serial correlation in the
error term. The third case is the spurious-regression model, where estimating the model in
first-differences is appropriate. The final case is the cointegration case that we have just ex-
amined, where the appropriate estimation technique is the error-correction model.
We now know how to deal with nonstationarity if it arises in our regression models, but
one key question remains: How do we determine if a variable is stationary or nonstationary?
We now turn to this question.
A stationary variable tends to return to a fixed mean after being disturbed by a shock.
We sometimes use the adjective mean-reverting as a synonym for stationary. This tendency to
revert back to the mean is the intuitive basis for the oldest and most basic test for stationarity:
the Dickey-Fuller test.
Consider the AR(1) process
yt = ρyt−1 + ut. (4.12)
If ρ = 1, then y is a random walk and is nonstationary; if |ρ| < 1, then y is stationary.
It looks like we could just estimate (4.12) by OLS and use the conventional t test to ex-
amine the null hypothesis of nonstationarity, but remember the problem with spurious re-
gressions. Under the null hypothesis, y and yt–1 are nonstationary, so the t statistic will be in-
flated and unreliable. Instead, we subtract yt−1 from both sides to get
∆yt = γyt−1 + ut, (4.13)
with γ ≡ ρ − 1. The null hypothesis ρ = 1 is now equivalent to γ = 0, with the alternative γ < 0.
The intuition of equation (4.13) is that for a mean-reverting (stationary) process, a high value last
period should be associated (on average) with a negative change in the series this period to
move it back toward the mean. Thus, if y is stationary, γ should be negative. If y is nonsta-
tionary, then there will be no tendency for high values of y in t – 1 to be reversed in t, and we
should find γ = 0.
The Dickey-Fuller test statistic is the t statistic of γ̂ in the OLS regression of (4.13). How-
ever, because the regressor is non-stationary under the null, γˆ / s.e. ( γˆ ) does not follow the t
distribution. Many authors have used Monte Carlo methods to calibrate the distribution of
the DF statistic; the critical values for the DF test are more negative than the usual –1.65 that
we would use for a one-tailed t test. Hill, Griffiths, and Lim show critical values in Table
12.2 on page 486. If the calculated DF test statistic is less than (i.e., more negative than) the
negative critical value, then we reject the null hypothesis and conclude that the variable is
stationary. If the test statistic is positive or less negative than the critical value, then we can-
not reject the hypothesis that y is nonstationary.
Of course, failing to reject nonstationarity is not the same thing as proving, or even con-
cluding, that y is nonstationary. For series that are very persistent but stationary (such as
AR(1) processes with ρ > 0.8), the DF test has very low power, meaning that it often fails to
reject false null hypotheses. Thus, deciding that a series is nonstationary based on a marginal
failure to reject the DF test can be misleading.
The DF test is valid if the error term u in equation (4.12) is white noise because then the
assumptions of the time-series Gauss-Markov Theorem are satisfied. But we know that error
terms in time-series data are usually autocorrelated, and that makes the OLS estimates inef-
ficient and biases the standard errors. We can deal with this problem in either of two ways:
The first correction leads to the augmented Dickey-Fuller test and is implemented by adding lagged values of ∆y to the right-hand side of equation (4.13).⁵ Thus, an ADF test with p lags would be a regression
∆yt = γyt−1 + a1∆yt−1 + … + ap∆yt−p + εt, (4.14)
where our test statistic is again the t ratio for γ̂ and we select p to be large enough that the
error term ε is white noise. The critical values for the ADF test are different than those for
the basic DF test and depend on the number of lags p.
The basic DF and ADF tests are tests of whether a series is a random walk, with the al-
ternative being a stationary AR process. There are variants of these tests that can include
“drift” (a random walk with a nonzero mean period-to-period change) or a linear trend. To
test for a random walk with drift (against stationarity), we add a constant term to (4.14) to get
∆yt = α0 + γyt−1 + a1∆yt−1 + … + ap∆yt−p + εt. (4.15)
The test statistic for (4.15) is again the t statistic on γ̂ , but the critical values are different in
the presence of a constant term than in (4.14). To test whether the series is I(1) against the alternative that it is stationary around a fixed linear trend, we add a trend term along with the constant:
∆yt = α0 + α1t + γyt−1 + a1∆yt−1 + … + ap∆yt−p + εt.
Once again, we must use a different table of critical values when including the trend.
5
There is a close relationship between lagged dependent variables and serial correlation of the error.
Adding lagged dependent variables as regressors can be an effective alternative to using a transfor-
mation such as Prais-Winsten to correct the error term.
Rather than trying to figure out how many lags should be included in the ADF specifica-
tion, the Phillips-Perron test uses the OLS t test from (4.13) and uses the Newey-West proce-
dure to correct the standard errors for autocorrelation in u.
Stata implements the ADF and Phillips-Perron tests with the commands dfuller and pperron, respectively. In both cases, one can specify the presence or absence of a constant and/or trend (the default is to include a constant but not a trend) and the number of lags (of ∆y in the ADF and in the Newey-West approximation in Phillips-Perron). Stata will show approximate critical values for the test, tailored to the particular specification used, so you should not need to refer to any textbook tables.
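For example, to run an ADF test with four lagged differences and a trend term on a series y, followed by the corresponding Phillips-Perron test with a Newey-West lag window of four (the variable name y is a placeholder), we might type

. dfuller y , lags(4) trend
. pperron y , lags(4) trend

Both commands report the test statistic alongside interpolated 1%, 5%, and 10% critical values for the chosen specification.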
We noted above that the ADF and Phillips-Perron tests find it difficult to distinguish be-
tween ρ = 1 (nonstationarity) and ρ just less than one (stationary but persistent). Because
many stationary series are persistent, this low power is problematic. Stock and Watson rec-
ommend an alternative test as being more powerful in these cases. The DF-GLS test is a
Dickey-Fuller test in which the variables are “quasi-differenced” in a manner similar to the
Prais-Winsten GLS procedure we use for stationary AR(1) error terms.
To implement this test, we first use the following formulas to create two quasi-differenced series: z, which is a transformed version of y, and x1, which is analogous to a constant:

zt = yt , for t = 1,
zt = yt − (1 − 7/T) yt−1 , for t = 2, 3, ..., T,

x1t = 1, for t = 1,
x1t = 7/T , for t = 2, 3, ..., T.
We then regress zt = δ0 x1t + ut with no constant term (because x1 is essentially a constant) and calculate a "detrended" y series as ytd ≡ yt − δ̂0 . We then apply the Dickey-Fuller test to the detrended series yd using critical values developed for the DF-GLS test. A dfgls command implementing this procedure is included in recent versions of Stata; in older versions, a user-written dfgls can be downloaded from the online Stata archives.
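As an illustration, the quasi-differencing step can be sketched by hand in Stata (a minimal sketch assuming the data are tsset, the series is named y, and the lag count 4 is a placeholder; note that the critical values printed by dfuller do not apply here, so the statistic must be compared with DF-GLS critical values):

. quietly count
. scalar Tn = r(N)
. generate z = y - (1 - 7/Tn)*L.y
. replace z = y in 1
. generate x1 = 7/Tn
. replace x1 = 1 in 1
. regress z x1 , noconstant
. generate yd = y - _b[x1]
. dfuller yd , noconstant lags(4)

In practice it is easier to let dfgls perform all of these steps in one command.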
Testing for cointegration is simply testing the stationarity of the error term in the cointe-
grating equation. When Engle and Granger first explored cointegrated models, their test, now known as the Engle-Granger test, simply applied the ADF test to the residuals of the cointegrating
regression (4.9). Because these are estimated residuals rather than a free-standing time series,
yet another set of custom critical values must be used for this test. A more recent test, the
Johansen-Juselius test is more general, and tests for the possibility of multiple cointegrating
relationships when there are more than two variables. This test is integrated into the proce-
dure for vector error-correction models in Stata. We study these models in Chapter 5.
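Returning to the Engle-Granger test, a sketch of the two-step procedure in Stata might look like the following (variable names y and x and the lag count are placeholders; the statistic must be compared with Engle-Granger critical values rather than those printed by dfuller):

. regress y x
. predict double uhat , residuals
. dfuller uhat , noconstant lags(4)

Because uhat is a series of OLS residuals, its sample mean is already zero, which is why no constant is included in the test regression.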
References
Granger, C. W. J., and Paul Newbold. 1974. Spurious Regressions in Econometrics. Journal
of Econometrics 2 (2):111-120.
Stock, James H., and Mark Watson. 2011. Introduction to Econometrics. 3rd ed. Boston:
Pearson Education/Addison Wesley.
CHAPTER 5
Vector Autoregression and Vector Error-Correction Models
Although estimating the equations of a VAR does not require strong identification as-
sumptions, some of the most useful applications of the estimates, such as calculating impulse-
response functions (IRFs) or variance decompositions, do require identifying restrictions. A typi-
cal restriction takes the form of an assumption about the dynamic relationship between a
pair of variables, for example, that x affects y only with a lag, or that x does not affect y in the
long run.
A VAR system contains a set of m variables, each of which is expressed as a linear func-
tion of p lags of itself and of all of the other m – 1 variables, plus an error term. (It is possible
to include exogenous variables such as seasonal dummies or time trends in a VAR, but we
shall focus on the simple case.) With two variables, x and y, an order-p VAR would be the two equations

yt = βy0 + βyy1 yt−1 + ... + βyyp yt−p + βyx1 xt−1 + ... + βyxp xt−p + vty
(5.1)
xt = βx0 + βxy1 yt−1 + ... + βxyp yt−p + βxx1 xt−1 + ... + βxxp xt−p + vtx .
We adopt the subscript convention that βxyp represents the coefficient of y in the equation for
x at lag p. If we were to add another variable z to the system, there would be a third equation
for zt and terms involving p lagged values of z, for example, βxzp, would be added to the right-
hand side of each of the three equations.
A key feature of equations (5.1) is that no current variables appear on the right-hand side
of any of the equations. This makes it plausible, though not always certain, that the regres-
sors of (5.1) are weakly exogenous and that, if all of the variables are stationary and ergodic,
OLS can produce asymptotically desirable estimators. Variables that are known to be exoge-
nous—a common example is seasonal dummy variables—may be added to the right-hand
side of the VAR equations without difficulty, and obviously without including additional
equations to model them. Our examples will not include such exogenous variables.
The error terms in (5.1) represent the parts of yt and xt that are not related to past values
of the two variables: the unpredictable “innovation” in each variable. These innovations will,
in general, be correlated with one another because there will usually be some tendency for
movements in yt and xt to be correlated, perhaps because of a contemporaneous causal rela-
tionship (or because of the common influence of other variables).
A key distinction in understanding and applying VARs is between the innovation terms v
in the VAR and underlying exogenous, orthogonal shocks to each variable, which we shall
call ε. The innovation in yt is the part of yt that cannot be predicted by past values of x and y.
Some of this unpredictable variation in yt that we measure by vty is surely due to εty , an exogenous shock to yt that has no relationship to what is happening with x or any other variable
that might be included in the system. However, if x has a contemporaneous effect on y, then
some part of vty will be due to the indirect effect of the current shock to x, ε tx , which enters
the yt equation in (5.1) through the error term because current xt is not allowed to be on the
right-hand side. We will study in the next section how, by making identifying assumptions,
we can identify the exogenous shocks ε from our estimates of the VAR coefficients and re-
siduals.
Correlation between the error terms of two equations, such as that present in (5.1), usual-
ly means that we can gain efficiency by using the seemingly unrelated regressions (SUR) sys-
tem estimator rather than estimating the equations individually by OLS. However, the VAR
system conforms to the one exception to that rule: the regressors of all of the equations are
identical, meaning that SUR and OLS lead to identical estimators. The only situation in
which we gain by estimating the VAR as a system of seemingly unrelated regressions is
when we impose restrictions on the coefficients of the VAR, a case that we shall ignore here.
When the variables of a VAR are cointegrated, we use a vector error-correction (VEC)
model. A VEC for two variables might look like

∆yt = βy0 + λy ( yt−1 − α0 − α1 xt−1 ) + vty
(5.2)
∆xt = βx0 + λx ( yt−1 − α0 − α1 xt−1 ) + vtx ,
where yt = α 0 + α1 x t is the long-run cointegrating relationship between the two variables and
λy and λx are the error-correction parameters that measure how y and x react to deviations
from long-run equilibrium.
When we apply the VEC model to more than two variables, we must consider the possi-
bility that more than one cointegrating relationship exists among the variables. For example,
if x, y, and z all tend to be equal in the long run, then xt = yt and yt = zt (or, equivalently, xt =
zt) would be two cointegrating relationships. To deal with this situation we need to general-
ize the procedure for testing for cointegrating relationships to allow more than one cointe-
grating equation, and we need a model that allows multiple error-correction terms in each
equation.
The structure of equations (5.1) is designed to model how the values of the variables in
period t are related to past values. This makes the VAR a natural for the task of forecasting
the future paths of x and y conditional on their past histories.
Suppose that we have a sample of observations on x and y that ends in period T, and that
we wish to forecast their values in T + 1, T + 2, etc. To keep the algebra simple, suppose that
p = 1, so there is only one lagged value on the right-hand side. For period T + 1, our VAR is
yT +1 = β y 0 + β yy1 yT + β yx 1 xT + vTy +1
(5.3)
xT +1 = β x 0 + β xy1 yT + β xx 1 xT + vTx +1.
Taking the expectation conditional on the relevant information from the sample (xT and yT)
gives
E ( yT +1 | xT , yT ) = β y 0 + β yy1 yT + β yx 1 xT + E ( vTy +1 | xT , yT )
(5.4)
E ( xT +1 | xT , yT ) = β x 0 + β xy1 yT + β xx 1 xT + E ( vTx +1 | xT , yT ).
The conditional expectation of the VAR error terms on the right-hand side must be zero
in order for OLS to estimate the coefficients consistently. Whether or not this assumption is
valid will depend on the serial correlation properties of the v terms—we have seen that serial-
ly correlated errors and lagged dependent variables of the kind present in the VAR can be a
toxic combination.
Thus, we want to make sure that E ( vtj | vtx−1 , vty−1 ) = 0 for j = x, y. As we saw in an earlier chapter,
adding lagged values of y and x can often eliminate serial correlation of the error, and this
method is now more common than using GLS procedures to correct for possible autocorrela-
tion. We assume that our VAR system has sufficient lag length that the error term is not seri-
ally correlated, so that the conditional expectation of the error term for all periods after T is
zero. This means that the final term on the right-hand side of each equation in (5.4) is zero,
so
E ( yT +1 | xT , yT ) = β y 0 + β yy1 yT + β yx 1 xT
(5.5)
E ( xT +1 | xT , yT ) = β x 0 + β xy1 yT + β xx 1 xT .
If we knew the β coefficients, we could use (5.5) to calculate a forecast for period T + 1.
Naturally, we use our estimated VAR coefficients in place of the true values to calculate our predictions

ŷT+1|T = β̂y0 + β̂yy1 yT + β̂yx1 xT
(5.6)
x̂T+1|T = β̂x0 + β̂xy1 yT + β̂xx1 xT .
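As an illustration, the one-step-ahead point forecasts in (5.6) can be computed by hand from the estimated coefficients (a minimal sketch assuming a one-lag VAR in tsset variables named y and x, with yT and xT taken from the last sample observation):

. var y x , lags(1/1)
. scalar yhat1 = [y]_b[_cons] + [y]_b[L.y]*y[_N] + [y]_b[L.x]*x[_N]
. scalar xhat1 = [x]_b[_cons] + [x]_b[L.y]*y[_N] + [x]_b[L.x]*x[_N]
. display yhat1
. display xhat1

The fcast compute command described below performs this recursion automatically and also computes standard errors.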
The forecast error in the predictions in (5.6) will come from two sources: the unpredicta-
ble period T + 1 error term and the errors we make in estimating the β coefficients. Formal-
ly,
yT+1 − ŷT+1|T = ( βy0 − β̂y0 ) + ( βyy1 − β̂yy1 ) yT + ( βyx1 − β̂yx1 ) xT + vTy +1
xT+1 − x̂T+1|T = ( βx0 − β̂x0 ) + ( βxy1 − β̂xy1 ) yT + ( βxx1 − β̂xx1 ) xT + vTx +1 .
If our estimates of the β coefficients are consistent and there is no serial correlation in v, then
the expectation of the forecast error is asymptotically zero. The variance of the forecast error
is
var ( yT+1 − ŷT+1|T ) = var ( β̂y0 ) + var ( β̂yy1 ) yT2 + var ( β̂yx1 ) xT2
+ 2 cov ( β̂y0 , β̂yy1 ) yT + 2 cov ( β̂y0 , β̂yx1 ) xT + 2 cov ( β̂yy1 , β̂yx1 ) xT yT
+ var ( vTy +1 )

var ( xT+1 − x̂T+1|T ) = var ( β̂x0 ) + var ( β̂xy1 ) yT2 + var ( β̂xx1 ) xT2
+ 2 cov ( β̂x0 , β̂xy1 ) yT + 2 cov ( β̂x0 , β̂xx1 ) xT + 2 cov ( β̂xy1 , β̂xx1 ) xT yT
+ var ( vTx +1 ).
As our consistent estimates of the β coefficients converge to the true values (as T gets large),
all of the terms in this expression converge to zero except the last one. Thus, in calculating
the variance of the forecast error, the error in estimating the coefficients is often neglected, giving

var ( yT+1 − ŷT+1|T ) ≈ var ( vTy +1 ) ,  var ( xT+1 − x̂T+1|T ) ≈ var ( vTx +1 ). (5.7)
One of the most useful attributes of the VAR is that it can be used recursively to extend
forecasts into the future. For period T + 2,
E ( yT + 2 | xT +1 , yT +1 ) = β y 0 + β yy1 yT +1 + β yx 1 xT +1
E ( xT + 2 | xT +1 , yT +1 ) = β x 0 + β xy1 yT +1 + β xx 1 xT +1 ,
so by recursive expectations
E ( yT + 2 | xT , yT ) = β y 0 + β yy1 E ( yT +1 | xT , yT ) + β yx 1 E ( xT +1 | xT , yT )
= β y 0 + β yy1 ( β y 0 + β yy1 yT + β yx 1 xT ) + β yx 1 ( β x 0 + β xy1 yT + β xx 1 xT )
E ( xT + 2 | xT , yT ) = β x 0 + β xy1 E ( yT +1 | xT , yT ) + β xx 1 E ( xT +1 | xT , yT )
= β x 0 + β xy1 ( β y 0 + β yy1 yT + β yx 1 xT ) + β xx 1 ( β x 0 + β xy1 yT + β xx 1 xT ).
The corresponding forecasts are again obtained by substituting coefficient estimates to get

ŷT+2|T = β̂y0 + β̂yy1 ŷT+1|T + β̂yx1 x̂T+1|T
(5.8)
x̂T+2|T = β̂x0 + β̂xy1 ŷT+1|T + β̂xx1 x̂T+1|T .

If we once again ignore error in estimating the coefficients, then the two-period-ahead forecast error in (5.8) is

yT+2 − ŷT+2|T ≈ βyy1 vTy +1 + βyx1 vTx +1 + vTy + 2
xT+2 − x̂T+2|T ≈ βxy1 vTy +1 + βxx1 vTx +1 + vTx + 2 .

In general, the error terms for period T + 1 will be correlated across equations, so the variance of the two-period-ahead forecast is approximately

var ( yT+2 − ŷT+2|T ) ≈ (βyy1)2 var ( vty ) + (βyx1)2 var ( vtx ) + 2 βyy1 βyx1 cov ( vty , vtx ) + var ( vty )
(5.9)
var ( xT+2 − x̂T+2|T ) ≈ (βxy1)2 var ( vty ) + (βxx1)2 var ( vtx ) + 2 βxy1 βxx1 cov ( vty , vtx ) + var ( vtx ).
The two-period-ahead forecast error has larger variance than the one-period-ahead error
because the errors that we make in forecasting period T + 1 propagate into errors in the fore-
cast for T + 2. As our forecast horizon increases, the variance gets larger and larger, reflect-
ing our inability to forecast a great distance into the future even if (as we have optimistically
assumed here) we have accurate estimates of the coefficients.
The calculations in equation (5.9) become increasingly complex as one considers longer
forecast horizons. Including more than two variables in the VAR or more than one lag on
the right-hand side also increases the number of terms in both (5.8) and (5.9) rapidly. We are
fortunate that modern statistical software, including Stata, has automated these tasks for us.
We now discuss the basics of estimating a VAR in Stata.
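Throughout, we assume that the dataset has already been declared as time-series data, for example (the dataset and time-variable names here are hypothetical):

. use uscan , clear
. tsset quarter

All of the var-family commands below require tsset data.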
At one level, estimating a VAR is a simple task—because it is estimated with OLS, the
Stata regression command will handle the estimation. However, for everything we do with a
VAR beyond estimation, we need to consider the system as a whole, so Stata provides a fam-
ily of procedures that are tailored to the VAR application. The two essential VAR com-
mands are var and varbasic. The latter is easy to use (potentially as easy as listing the
variables you want in the system), but lacks the flexibility of the former to deal with asym-
metric lag patterns across equations, additional exogenous variables that have no equations
of their own, and coefficient constraints across equations. We discuss var first; later in the
chapter we will go back and show the use of the simpler varbasic command. (As always,
only simple examples of Stata commands are shown here. The current Stata manual availa-
ble through the Stata Help menu contains full documentation of all options and variations,
along with additional examples.)
To run a simple VAR for variables x and y with two lags of each variable in each equa-
tion and no constraints or exogenous variables, we can simply type
var x y , lags(1/2)
Notice that we need 1/2 rather than just 2 in the lag specification because we want lags 1
through 2, not just the second lag. The output from this command will give the β coefficients
from OLS estimation of the two regressions, plus some system and individual-equation
goodness-of-fit statistics.
Once we have estimated a VAR model, there are a variety of tests that can be used to
help us determine whether we have a good model. One important requirement for our estimates to have desirable asymptotic properties is that the model be stable, in the sense that the estimated coefficients imply that ∂yt+s / ∂vtj and ∂xt+s / ∂vtj ( j = x, y ) become small as s gets large. If these conditions do not hold, then the VAR implies that x and y are not jointly ergodic: the effects of shocks do not die out.
The Stata command varstable (which usually needs no arguments or options) calcu-
lates the eigenvalues of a “companion matrix” for the system. If all of the calculated eigen-
values (which can be complex) are less than one (in modulus, if they have imaginary parts),
then the model is stable. This condition is the vector extension to the stationarity condition
that the roots of an autoregressive polynomial of a single variable lie outside the unit circle.
If the varstable command reports an eigenvalue with modulus greater than one, then
the VAR is unstable and forecasts will explode. This can arise when the variables in the
model are non-stationary or when the model is misspecified. Differencing (and perhaps, after
checking for cointegration, using a VEC) may yield a stable system.
If the VAR is stable, then the main issue in specification is lag length. We discussed lag
length issues above in the context of single-variable distributed lag models. The issues and
methods in a VAR are similar, but apply simultaneously to all of the equations of the model
and all of the variables, since we conventionally choose a common length for all lags.
Forecasting with a VAR assumes that there is no serial correlation in the error term. The
Stata command varlmar implements a VAR version of the Lagrange multiplier test for se-
rial correlation in the residual. This command tests the null hypotheses cov ( vtj , vtj− s ) = 0 with
j indexing the variables of the model. The main option in the varlmar command allows
you to specify the highest order of autocorrelation (the default is 2) that you want to test in
the residual. For example, varlmar , mlag(4) would perform the above test individual-
ly for s = 1, s = 2, s = 3, and s = 4. If the Lagrange multiplier test rejects the null hypothesis
of no serial correlation, then you may want to include additional lags in the equations and
perform the test again.
Another way of deciding on lag length is to use standard (Wald) test statistics to test
whether all of the coefficients at each lag are zero. Stata automates this in the varwle
command, which requires no options.
Once you have settled on a VAR model that includes an appropriate number of lags, is
stable, and has serially uncorrelated errors, you can proceed to use the model to generate
forecasts. There are two commands for creating and graphing forecasts. The fcast com-
pute command calculates the predictions of the VAR and stores them in a set of new vari-
ables. If you want your forecasts to start in the period immediately following the last period
of the estimating sample, then the only option you need in the fcast compute command
is step(#), with which you specify the forecast horizon (how many periods ahead you
want to forecast). The forecast variables are stored in variables that attach a prefix that you
specify to the names of the VAR variables being forecasted. For example, to forecast your VAR model for 10 periods beginning after the estimating sample and store predicted values of x in pred_x and y in pred_y, you could type

fcast compute pred_ , step(10)
The fcast compute command also generates standard errors of the forecasts and uses
them to calculate upper and lower confidence bounds. After computing the forecasts, you
can graph them along with the confidence bounds by typing fcast graph pred_*. If you have
actual observed values for the variables for the forecast period, they can be added to the
graph with the observed option (separated from the command by a comma, as always
with Stata options).
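Continuing the example above, something like

fcast graph pred_x pred_y , observed

would plot the two forecast series with their confidence bounds and overlay the actual values of x and y wherever they exist.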
One of the first, and undeniable, maxims that every econometrician or statistician is
taught is that “correlation does not imply causality.” Correlation or covariance is a symmet-
ric, bivariate relationship; cov ( x , y ) = cov ( y, x ) . We cannot, in general, infer anything
about the existence or direction of causality between x and y by observing non-zero covari-
ance. Even if our statistical analysis is successful in establishing that the covariance is highly
unlikely to have occurred by chance, such a relationship could occur because x causes y, be-
cause y causes x, because each causes the other, or because x and y are responding to some
third variable without any causal relationship between them.
However, Clive Granger defined the concept of Granger causality, which, under some
controversial assumptions, can be used to shed light on the direction of possible causality
between pairs of variables. The formal definition of Granger causality asks whether past val-
ues of x aid in the prediction of yt, conditional on having already accounted for the effects on
yt of past values of y (and perhaps of past values of other variables). If they do, then x is said to
“Granger cause” y.
The VAR is a natural framework for examining Granger causality. Consider the two-
variable system in equations (5.1). The first equation models yt as a linear function of its own
past values, plus past values of x. If x Granger causes y (which we write as x ⇒ y ), then some
or all of the lagged x values have non-zero effects: lagged x affects yt conditional on the ef-
fects of lagged y. Testing for Granger causality in (5.1) amounts to testing the joint blocks of
coefficients βyxs and βxys to see if they are zero. The null hypothesis x ⇒ y (x does not
Granger cause y) in this VAR is
H 0 : β yx 1 =
β yx 2 =
... =
β yxp =
0,
which can be tested using a standard Wald F or χ2 test. Similarly, the null hypothesis y ⇏ x can be expressed in the VAR as

H0 : βxy1 = βxy2 = ... = βxyp = 0.
Running both of these tests can yield four possible outcomes, as shown in Table 5-1: no
Granger causality, one-way Granger causality in either direction, or “feedback,” with
Granger causality running both ways.
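These block tests can be performed directly with test after estimating the VAR. A sketch for a two-lag VAR in variables y and x (names are placeholders):

. var y x , lags(1/2)
. * H0: x does not Granger cause y
. test [y]L.x [y]L2.x
. * H0: y does not Granger cause x
. test [x]L.y [x]L2.y

Each test command reports a Wald statistic for the joint hypothesis that the named block of lag coefficients is zero.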
There are multiple ways to perform Granger causality tests between a pair of variables,
so no result is unique or definitive. Within the two-variable VAR, one may obtain different
results with different lag lengths p. Moreover, including additional variables in the VAR sys-
tem may change the outcome of the Wald tests that underpin Granger causality. In a three-
variable VAR, there are three pairs of variables, (x, y), (y, z), and (x, z) that can be tested for
Granger causality in both directions: six tests with 64 possible combinations of outcomes.
The effect of lagged x on yt can disappear when lagged values of a third variable z are added
to the regression. For example, if x ⇒ z and z ⇒ y , then omitting z from the VAR system
could lead us to conclude that x ⇒ y even if there is no direct Granger causality in the larger
system.
Is “Granger causality” really “causality”? Obviously, if the maxim about correlation and
causality is true, then there must be something tricky happening, and indeed there is.
Granger causality tests whether lagged values of one variable conditionally help predict an-
other variable. Under what conditions can we interpret this as “causality”? Two assumptions
are sufficient.
First, temporal order must reflect causal order: the past can cause the future, but the future cannot cause the past. This seems self-evident, but agents' expectations of future values (which are correlated with the future variables themselves) can change agents' current choices, which might result in causality that would appear to violate this assumption.
Second, any causal relationship that is strictly immediate in the sense that a change in xt
leads to a change in yt but no change in any future values of y would fly under the radar of a
Granger causality test, which only measures and tests lagged effects. Most causal economic
relationships are dynamic in that effects are not fully realized within a single time period, so
this difficulty may not present a practical problem in many cases.
Stata implements Granger causality tests automatically with vargranger, which tests
all of the pairs of variables in a VAR for Granger causality. In systems with more than two
variables, it also tests the joint hypothesis that all of the other variables fail to Granger cause
each variable in turn. This joint test amounts to testing whether all of the lagged terms other
than those of the dependent variable have zero effects.
Suppose that two variables, x and y, evolve over time according to the structural model
x t = α 0 + α1 x t −1 + θ1 yt −1 + εtx
(5.10)
yt = φ0 + φ1 yt −1 + δ0 x t + δ1 x t −1 + εty ,
where the ε terms are exogenous white-noise shocks to x and y that are "orthogonal" (uncorrelated) to one another: var ( εtx ) = σ2x , var ( εty ) = σ2y , and cov ( εtx , εty ) = 0. The ε shocks are
changes in the variables that come from outside the VAR system. Because they are (assumed
to be) exogenous, we can measure the effect of an exogenous change in x on the path of y
and x by looking at the dynamic marginal effects of ε tx , for example, ∂yt + s / ∂εtx . This is the
key distinction between the VAR error terms v and the exogenous structural shocks ε: without identifying assumptions, we cannot generally interpret a change in v as an exogenous shock to one variable.
We assume that x and y are stationary and ergodic, which imposes restrictions on the autoregressive coefficients of the model.1
The first equation of (5.10) is already in the form of a VAR equation: it expresses the cur-
rent value of xt as a function of lagged values of x and y. If we solve (5.10) by substituting for
xt in the second equation using the first equation, we get
yt = φ0 + φ1 yt −1 + δ0 ( α 0 + α1 x t −1 + θ1 yt −1 + εtx ) + δ1 x t −1 + εty
(5.11)
= ( φ0 + δ0 α 0 ) + ( φ1 + δ0 θ1 ) yt −1 + ( δ1 + δ0 α1 ) x t −1 + ( εty + δ0 εtx ) ,
which also has the VAR form. Thus, we can write the reduced-form system of (5.10) as
yt = β y 0 + β yy1 yt −1 + β yx 1 x t −1 + vty
(5.12)
x t = β x 0 + β xy1 yt −1 + β xx 1 x t −1 + vtx ,
with
βy0 = φ0 + δ0 α0        βx0 = α0
βyy1 = φ1 + δ0 θ1        βxy1 = θ1 (5.13)
βyx1 = δ1 + δ0 α1        βxx1 = α1
vty = εty + δ0 εtx        vtx = εtx .
Given our assumptions about the distributions of the exogenous shocks ε, we can determine the variances and covariance of the VAR error terms v as

var ( vtx ) = σ2x ,
var ( vty ) = σ2y + (δ0)2 σ2x , (5.14)
cov ( vtx , vty ) = E ( vtx vty ) = E [ εtx ( εty + δ0 εtx ) ] = δ0 σ2x .
Let's now consider what can be estimated using the VAR system (5.12) and to what extent these estimates allow us to infer the values of the parameters in the structural system (5.10). In terms of coefficients, there are six β coefficients that can be estimated in the VAR and seven structural coefficients in (5.10). This seems a pessimistic start to the task of identification. However, we can also estimate the variances and covariance of the v terms using the VAR residuals: var ( vtx ) , var ( vty ) , and cov ( vtx , vty ) . Conditions (5.14) allow us to estimate three parameters—σ2x , σ2y , and δ0—from the three estimated variances and covariance:
1
In a single-variable autoregressive model, we would require that the coefficient φ1 for yt – 1 be in the
range (–1, 1). The corresponding conditions in the vector setting are more involved, but similar in na-
ture.
σ̂2x = var ( v̂tx ) ,

δ̂0 = cov ( v̂tx , v̂ty ) / var ( v̂tx ) , (5.15)

σ̂2y = var ( v̂ty ) − (δ̂0)2 var ( v̂tx ) .
Armed with an estimate of δ0 from the covariance term, we can now use the six β coeffi-
cients to estimate the remaining six structural coefficients using (5.13). The system is just
identified.
Estimates of the structural shocks can then be recovered from the VAR residuals as

ε̂tx = v̂tx ,
ε̂ty = v̂ty − δ̂0 v̂tx .
This makes it clear that we are interpreting the VAR residual for x to be an exogenous, struc-
tural shock to x. In order to extract the structural shock to y, we subtract the part of vˆty ,
δˆ 0 vˆtx =δˆ 0 εˆ tx , that is due to the effect of the shock to xt on yt. From an econometric standpoint,
we could equally well make the opposite assumption, assuming that yt affects xt rather than
vice versa, which would interpret vˆty as εˆ ty and calculate εˆ tx as the part of vˆtx that is not ex-
plained by vˆty . Choosing which interpretation to use must be done on the basis of theory:
which variable is more plausibly exogenous within the immediate period. We may get differ-
ent results depending on which identification assumption we choose, so if there is no clear
choice it may be useful to examine whether results are robust across different choices.
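A sketch of this recursive identification by hand in Stata (variable names are placeholders; in practice Stata's orthogonalized IRFs perform the equivalent Cholesky decomposition automatically):

. var y x , lags(1/1)
. predict double vy , equation(y) residuals
. predict double vx , equation(x) residuals
. correlate vy vx , covariance
. matrix C = r(C)
. scalar delta0 = C[1,2] / C[2,2]
. generate double ehat_x = vx
. generate double ehat_y = vy - delta0*vx

Here C[2,2] is the residual variance of the x equation and C[1,2] the cross-equation covariance, so delta0 implements the middle line of (5.15).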
Identifying the structural shocks allows us to trace out their dynamic effects on the variables of the system, which we call impulse-response functions (IRFs). We discuss the computation and interpretation of IRFs in the next section.
In our example, we identified shocks by limiting the contemporaneous effects among the
variables. With only two variables, there are two possible choices: (1) the assumption we
made, that xt affects yt immediately but yt does not have an immediate effect on xt or (2) the
opposite assumption, that yt affects xt immediately but xt does not affect yt except with a lag.
We can think of the choice between these alternatives as an “ordering” of the variables, with
the variables lying higher in the order having instantaneous effects on those lower in the or-
der, but the lower variables only affecting those above them with a lag.
Although identification by ordering is still common, subsequent research has shown that
other kinds of restrictions can be used. For example, in some macroeconomic models we can
assume that changes in a variable such as the money supply would have no long-run effect
on another variable such as real output. In a simple system such as (5.10), this might show
up as the assumption that δ0 + δ1 = 0, for example. Imposing this condition would allow the
seven structural coefficients of (5.10) to be identified from the six β coefficients of the VAR
without using restrictions on the covariances.
It is important to stress that, unlike forecasts and Granger causality tests, both IRFs and
variance decompositions can only be calculated based on a set of identifying assumptions and
that a different set of identification assumptions may lead to different conclusions.
Suppose that we have an n-variable VAR with lags up to order p. If the variables of the
system are y1, y2, …, yn, then we can write the n equations of the VAR as
yti = βi0 + Σ(j=1 to n) Σ(s=1 to p) βijs ytj−s + vti ,  i = 1, 2, ..., n. (5.16)
The impulse-response functions are the n × n set of dynamic marginal effects of a one-
time shock to variable j on itself or another variable i:
∂yti+s / ∂εtj ,  s = 0, 1, 2, .... (5.17)
Note that there is in principle no limit on how far into the future these dynamic impulse re-
sponses can extend. If the VAR is stable, then the IRFs should converge to zero as the time
from the shock s gets large—one-time shocks should not have permanent effects. As noted
above, non-convergent IRFs and unstable VARs are indications of non-stationarity in the
variables of the model, which may be corrected by differencing.
IRFs are usually presented graphically with the time lag s running from zero up to some
user-set limit S on the horizontal axis and the impact at the s-order lag on the vertical. They
can also be expressed in tabular form if the numbers themselves are important. One common
format for the entire collection of IRFs corresponding to a VAR is as an n × n matrix of
graphs, with the “impulse variable” (the shock) on one dimension and the “response varia-
ble” on the other.
Each of the n2 IRF graphs tells us how a shock to one variable affects another (or the
same) variable. There are two common conventions for determining the size of the shock to
the impulse variable. One is to use a shock of magnitude one. Since we can think of the im-
pulse shock as the ∂ε in the denominator of (5.17), setting the shock to one means that the
values reported are the dynamic marginal effects as in (5.17).
However, a shock of size one does not always make economic sense: Suppose that the
shock variable is banks’ ratio of reserves to deposits, expressed as a fraction. An increase of
one in this variable, say from 0.10 to 1.10, would be implausible. To aid in interpretation,
some software packages normalize the size of the shocks to be one standard deviation of the
variable rather than one unit. Under this convention, the values plotted are
( ∂yti+s / ∂εtj ) σ̂j ,  s = 0, 1, 2, ...
and are interpreted as the change in each response variable resulting from a one-standard-
deviation increase in the impulse variable. This makes the magnitude of the changes in the
response variables more realistic, but does not allow the IRF values to be interpreted directly
as dynamic marginal effects.
Because the VAR model is linear, the marginal effects in (5.17) are constant, so which
normalization to choose for the shocks—one unit or one standard deviation—is arbitrary
and should be done to facilitate interpretation. Stata uses the convention of the one-unit im-
pulse in its “simple IRFs” and one standard deviation in its “orthogonalized IRFs.”
If the impulse variable is the same as the response variable, then the IRF tells us how
persistent shocks to that variable tend to be. By definition,
∂yti / ∂εti = 1,
so the zero-order own impulse response is always one. If the VAR is stable, reflecting the
stationarity and ergodicity of the underlying variables, then the own impulse responses decay
to zero as the time horizon increases:
lim (s→∞) ∂yti+s / ∂εti = 0.
If the impulse responses decay to zero only slowly then shocks to the variable tend to change
its value for many periods, whereas a short impulse response pattern indicates that shocks
are more transitory.
For cross-variable effects, where the impulse and response variables are different, general
patterns of positive or negative responses are possible. Depending on the identification as-
sumption (the “ordering”), the zero-period response may be zero or non-zero. By assump-
tion, shocks to variables near the bottom of the ordering have no current-period effect on var-
iables higher in the order, so the zero-lag impulse response in such cases is exactly zero.
As a preliminary check, we verify that both growth series are stationary. To be conserva-
tive, we include four lagged differences to eliminate serial correlation in the error term of the
Dickey-Fuller regression.
. dfuller usgr , lags(4)
. dfuller cgr , lags(4)
In both cases, we comfortably reject the presence of a unit root in the growth series because
the test statistic is more negative than the critical value, even at a 1% level of significance.
Phillips-Perron tests lead to similar conclusions. Therefore, we conclude that VAR analysis
can be performed on the two growth series without differencing.
To assess the optimal lag length, we use the Stata varsoc command with a maximum
lag length of four:
. varsoc usgr cgr , maxlag(4)
Selection-order criteria
Sample: 1976q1 - 2011q4 Number of obs = 144
+---------------------------------------------------------------------------+
|lag | LL LR df p FPE AIC HQIC SBIC |
|----+----------------------------------------------------------------------|
| 0 | -709.83 67.4086 9.88653 9.90329 9.92777 |
| 1 | -671.726 76.208* 4 0.000 41.9769* 9.41286* 9.46314* 9.5366* |
| 2 | -670.318 2.8151 4 0.589 43.5178 9.44887 9.53267 9.6551 |
| 3 | -667.543 5.55 4 0.235 44.2688 9.46588 9.58321 9.75461 |
| 4 | -664.14 6.8067 4 0.146 44.6449 9.47417 9.62501 9.84539 |
+---------------------------------------------------------------------------+
Endogenous: usgr cgr
Exogenous: _cons
Note that all of the regressions leading to the numbers in the table are run for a sample be-
ginning in 1976q1, which is the earliest date for which 4 lags are available, even though the
regressions with fewer than 4 lags could use a longer sample. In this VAR, all of the criteria
support a lag of length one, so that is what we choose.
Although we could accomplish the tasks we desire using varbasic, we will use the
more general commands to demonstrate their use. To run the VAR regressions, we use var:

. var usgr cgr , lags(1/1)

Vector autoregression
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
usgr |
usgr |
L1. | .2512771 .0898504 2.80 0.005 .0751735 .4273807
|
cgr |
L1. | .2341061 .0975328 2.40 0.016 .0429453 .4252668
|
_cons | 1.494612 .3354005 4.46 0.000 .8372394 2.151985
-------------+----------------------------------------------------------------
cgr |
usgr |
L1. | .3759117 .0726571 5.17 0.000 .2335065 .5183169
|
cgr |
L1. | .2859551 .0788694 3.63 0.000 .1313739 .4405362
|
_cons | .8755421 .2712199 3.23 0.001 .3439609 1.407123
------------------------------------------------------------------------------
We have not yet attempted any shock identification, so at this point the ordering of the vari-
ables in the command is arbitrary. The VAR regressions are run starting the sample at the
earliest possible date with one lag, which is 1975q2 because our first available observation is
1975q1. Because it uses three additional observations, the reported AIC and SBIC values
from the VAR output do not match those from the varsoc table above.
To assess the validity of our VAR, we test for stability and for autocorrelation of the re-
siduals. The varstable command examines the dynamic stability of the system. None of
the eigenvalues is even close to one, so our system is stable.
. varstable
The varlmar command performs a Lagrange multiplier test for the joint null hypothesis of no auto-
correlation of the residuals of the two equations:
. varlmar , mlag(4)
Lagrange-multiplier test
+--------------------------------------+
| lag | chi2 df Prob > chi2 |
|------+-------------------------------|
| 1 | 4.6880 4 0.32084 |
| 2 | 5.5347 4 0.23670 |
| 3 | 3.5253 4 0.47404 |
| 4 | 7.1422 4 0.12856 |
+--------------------------------------+
H0: no autocorrelation at lag order
We cannot reject the null of no residual autocorrelation at orders 1 through 4 at any conven-
tional significance level, so we have no evidence to contradict the validity of our VAR.
To determine if the growth rates of the US and Canada affect one another over time, we
can perform Granger causality tests using our VAR.
. vargranger
We see strong evidence that lagged Canadian growth helps predict US growth (the p-value is
0.016) and overwhelming evidence that lagged US growth helps predict Canadian growth (p-
value less than 0.001). It is not surprising, given the relative sizes of the economies, that the
US might have a stronger effect on Canada than vice versa. Note that because we have only
one lag in our VAR, the Granger causality tests have only one degree of freedom and are
equivalent to the single-coefficient tests in the VAR regression tables. The coefficient of Ca-
nadian growth in the US equation has a z value of 2.40, which is the square root of the
5.7613 value reported in the vargranger table; both have identical p-values. If we had
more than one lag in the regressions, then the z values in the VAR table would be tests of
individual lag coefficients and the χ2 values in the Granger causality table would be joint
tests of the blocks of lag coefficients associated with each variable.
We next explore the implications of our VARs for the behavior of GDP in the two coun-
tries in 2012 and 2013. What does our model forecast? To find out, we issue the command:
fcast compute p_, step(8). This command produces no output, but changes our
dataset in two ways. First, it extends the dataset by eight observations to 2013q4, filling in
appropriate values for the date variable. Second, it adds eight variables to the dataset with
values for those eight quarters. The new variables p_usgr and p_cgr contain the forecasts,
and the variables p_usgr_SE, p_usgr_LB, and p_usgr_UB (and corresponding variables
for Canada) contain the standard error, 95% lower bound, and 95% upper bound for the
forecasts.
We can examine these forecast values in the data browser or with any statistical com-
mands in the Stata arsenal, but it is often most informative to graph them. The command
fcast graph p_usgr p_cgr generates the graph shown in Figure 5-1. The graph
shows that the confidence bands on our forecasts are very large: our VARs do not forecast
very confidently. The point forecasts predict little change in growth rates. The US growth
rate is predicted to decline very slightly and then hold steady near its mean; the Canadian
growth rate to increase a bit and then converge to its mean. If the goal of our VAR exercise
was to obtain insightful and reliable forecasts, we have not succeeded!
Estimation, Granger causality, and forecasting can all be accomplished without any
identification assumptions. But this is as far as we can go with our VAR analysis without
making some assumptions to allow us to identify the structural shocks to US and Canadian
GDP growth.
[Figure 5-1: VAR forecasts of US and Canadian growth rates, 2011q3 to 2013q3, with 95% confidence intervals]
Only if the residuals of the two equations are contemporaneously uncorrelated can we
interpret them as structural shocks. We would expect that innovations to US and Canadian
growth in any period would tend to be positively correlated, and indeed the cross-equation
correlation coefficient for the residuals in our VAR regressions is 0.44.
After declaring a file to hold IRF results (with irf set uscan), we create an IRF entry in the file called "L1" to hold the results of our one-lag VARs,
running the IRF effect horizon out 20 quarters (five years):
. irf create L1 , step(20) order(usgr cgr)
(file uscan.irf updated)
We specified the ordering explicitly in creating the IRF. However, because the ordering is
the same as the order in which the variables were listed in the var command itself, Stata
would have chosen this ordering by default.
We can now use the irf graph command to produce impulse-response function and
variance decomposition graphs. To get the (orthogonalized) IRFs, we type
. irf graph oirf , irf(L1) ustep(8)
It is important to specify oirf rather than irf because the latter gives impulse responses
assuming (counterfactually) that the VAR residuals are uncorrelated. The resulting IRFs are
shown in Figure 5-2.
[Figure 5-2: Orthogonalized impulse-response functions with 95% confidence intervals, ordering (usgr cgr), steps 0 to 8]
The diagonal panels in Figure 5-2 show the effects of shocks to each country’s GDP
growth on future values of its own growth. In both cases, the shock dies out quickly, reflect-
ing the stationarity of the variables. A one-standard-deviation shock to Canadian GDP growth in the top-left panel is just over 2 percentage points; a corresponding shock to U.S. growth is close to 3 percentage points.
The off-diagonal panels (bottom-left and top-right) show the effects of a growth shock in
one country on the path of growth in the other. In the bottom-left panel, we see that a one-
standard-deviation (about 3 percentage points) shock in U.S. growth raises Canadian growth
by about 1 percentage point in the current quarter, then by a bit more in the next quarter as
the lagged effect kicks in. From the second lag on, the effect decays rapidly to zero, with the
statistical significance of the effect vanishing at about one year.
In the top-right panel, we see the estimated effects of a shock to Canadian growth on
growth in the United States. The first thing to notice is that the effect is zero in the current
period (at zero lag). This is a direct result of our identification assumption: we imposed the
condition that Canadian growth has no immediate effect on U.S. growth in order to identify
the shocks. The second noteworthy result is that the dynamic effect that occurs in the second
period is much smaller than the effect of the U.S. on Canada. This is as we expected.
But how much of this greater dependence of Canada on the United States is really the
data speaking and how much is our assumption that contemporaneous correlation in shocks
runs only from the U.S. to Canada? Recall that our identification assumption imposes the
condition that any “common shocks” that affect both countries are assumed to be U.S.
shocks, with Canada shocks being the part of the Canadian VAR innovation that is not ex-
plained by the common shock. This might cause the Canadian shocks to have smaller vari-
ance (which it does in Figure 5-2) and might also overestimate the effect of the U.S. shocks
on Canada.
To assess the sensitivity of our conclusions to the ordering assumption, we examine the
IRFs making the opposite assumption: that contemporaneous correlation in the innovations
is due to Canada shocks affecting the U.S. Figure 5-3 shows the graphs of the reverse-
ordering IRFs. As expected, the effect of the U.S. on Canada (lower left) now begins at zero
and the effect of Canada on the U.S. (upper right) does not.
Beyond this, there are a couple of interesting changes when we reverse the order. First,
note that both shocks now have a standard deviation of about 2.5 rather than the U.S. shock
having a much larger standard deviation. This occurs because we now attribute the “com-
mon” part of the innovation to the Canadian shock rather than the U.S. shock. Second, after
the initial period in which the U.S.-to-Canada effect is constrained to be zero, the two effects
are of similar magnitudes and die out in a similar way.
This example shows the difficulty of identifying impulse responses in VARs. The impli-
cations can depend on the identification assumption we make, so if we are not sure which
assumption is better we may be left with considerable uncertainty in interpreting our results.
[Figure 5-3: Orthogonalized IRFs under the reverse ordering (cgr usgr), with 95% confidence intervals, steps 0 to 8]
We can also use Stata’s irf graph command to plot the cumulative effect of a perma-
nent shock to one of the variables. For the preferred ordering this looks like Figure 5-4. Us-
ing the top-left panel, a permanent positive shock of one standard deviation to Canada’s
growth—an exogenous increase of about 2 percentage points of growth that is sustained over
time—would eventually cause Canadian growth to be about 3.5 percentage points higher.
This magnification comes from two effects. First, shocks to Canadian growth tend to persist
for a period or two after the shock, so growth increases more as a result. Second, a positive
shock to Canadian growth increases U.S. growth (even with no exogenous shock in the
U.S.), which feeds back positively on Canadian growth. The same multiplier effect happens
in the other panels.
[Figure 5-4: Cumulative orthogonalized IRFs, preferred ordering, with 95% confidence intervals, steps 0 to 8]
Another tool that is available for analysis of identified VARs is the forecast-error vari-
ance decomposition, which measures the extent to which each shock contributes to unex-
plained movements (forecast errors) in each variable. Figure 5-5 results from the Stata com-
mand irf graph fevd , irf(L1) ustep(8) and shows how each shock contributes
to the variation in each variable. All variance decompositions start at zero because there is
no forecast error at a zero lag.
The left-column panels show that (with the preferred identification assumption) the
Canadian shock contributes about 80% of the variance in the one-period-ahead forecast error
for Canadian growth, with the U.S. shock contributing the other 20%. As our forecast hori-
zon moves further into the future, the effect of the U.S. shock on Canadian growth increases
and the shares converge to less than 60% of variation in Canadian growth being due to the
Canadian shock and more than 40% due to the U.S. shock. The right-column panels indicate
that very little (less than 5%) of the variation in U.S. growth is attributable to Canadian
growth shocks in the short run or long run.
[Figure 5-5: Forecast-error variance decompositions, preferred ordering: fraction of MSE due to each impulse, steps 0 to 8]
Figure 5-6 shows the very different results that we get when we reverse the contempora-
neous causal ordering. Now the Canadian shock (which includes the shock that is common
to both countries under this assumption) explains most (80%) of the variation in Canadian
growth and much (30%) of the variation in growth in the United States.
First, the United States has a stronger effect on Canada than vice versa. Interpreting the VAR results in favor of Canada's effect (by putting Canada first in the ordering) gives Canada a substantial effect on the U.S., but U.S. shocks clearly remain important for both countries; interpreting the results in favor of the U.S. effect virtually wipes out the effect of Canada on the United States.
Second, because of the way the results vary between orderings, it is clear that much of the variation in growth in both countries is due to a common shock. Whichever country is (perhaps arbitrarily) assigned ownership of this shock seems to have a large effect relative to the other. While this doesn't resolve the "causality question," it is very useful information about the co-movement of U.S. and Canadian growth.
[Figure 5-6: Forecast-error variance decompositions under the reverse ordering: fraction of MSE due to each impulse, steps 0 to 8]
We now consider how cointegrated variables can be used in a VAR using a vector error-correction (VEC) model. First we examine the two-variable case, which extends the simple single-equation error-correction model that we studied earlier.
If two I(1) series x and y are cointegrated, then there exist unique α0 and α1 such that
ut ≡ yt − α 0 − α1 x t is I(0). In the single-equation model of cointegration where we thought of
y as the dependent variable and x as an exogenous regressor, we saw that the error-correction
model

∆yt = β0 + β1 ∆xt + λ ut−1 + et (5.18)
was an appropriate specification. All terms in equation (5.18) are I(0) as long as the α coeffi-
cients (the “cointegrating vector”) are known or at least consistently estimated. The ut −1 term
is the magnitude by which y was above or below its long-run equilibrium value in the previ-
ous period. The coefficient λ (which we expect to be negative) represents the amount of “cor-
rection” of this period-(t – 1) disequilibrium that happens in period t. For example, if λ is –
0.25, then one quarter of the gap between yt – 1 and its equilibrium value would tend (all else
equal) to be reversed (because the sign is negative) in period t.
The VEC model extends this single-equation error-correction model to allow y and x to
evolve jointly over time as in a VAR system. In the two-variable case, there can be only one
cointegrating relationship and the y equation of the VEC system is similar to (5.18), except
that we mirror the VAR specification by putting lagged differences of y and x on the right-
hand side. With only one lagged difference (there can be more) the bivariate VEC can be
written

∆yt = βy0 + βyy1 ∆yt−1 + βyx1 ∆xt−1 + λy ( yt−1 − α0 − α1 xt−1 ) + vty
(5.19)
∆xt = βx0 + βxy1 ∆yt−1 + βxx1 ∆xt−1 + λx ( yt−1 − α0 − α1 xt−1 ) + vtx .
As in (5.18), all of the terms in both equations of (5.19) are I(0) if the variables are coin-
tegrated with cointegrating vector (1, –α0, –α1), in other words, if yt − α 0 − α1 x t is stationary.
The λ coefficients are again the error-correction coefficients, measuring the response of each
variable to the degree of deviation from long-run equilibrium in the previous period. We ex-
pect λy < 0 for the same reason as above: if yt −1 is above its long-run value in relation to x t −1
then the error-correction term in parentheses is positive and this should lead, other things
constant, to downward movement in y in period t. The expected sign of λx depends on the
sign of α1. We expect ∂∆x t / ∂x t −1 = −λ x α1 < 0 for the same reason that we expect
∂∆yt / ∂yt −1 = λ y < 0 : if x t −1 is above its long-run relation to y, then we expect ∆x t to be neg-
ative, other things constant.
A simple, concrete example may help clarify the role of the error-correction terms in a
VEC model. Let the long-run cointegrating relationship be yt = xt , so that α0 = 0 and α1 = 1. The parenthetical error-correction term in each equation of (5.19) is now yt−1 − xt−1 , the difference between y and x in the previous period. Suppose that because of previous shocks, yt−1 = xt−1 + 1, so that y is above its long-run equilibrium relationship to x by one unit (or, equivalently, x is below its long-run equilibrium relationship to y by one unit). To move toward long-run equilibrium in period t, we expect (if there are no other changes) ∆yt < 0 and ∆xt > 0. Using equation (5.19), ∆yt changes in response to this disequilibrium by λy ( yt−1 − xt−1 ) = λy , so for stable adjustment to occur λy < 0; y is "too high" so it must decrease in response to the disequilibrium. The corresponding change in ∆xt from equation (5.19) is λx ( yt−1 − xt−1 ) = λx . Since x is "too low," stable adjustment requires that the response in x be positive, so we need λx > 0. Note that if the long-run relationship between y and x were inverse (α1 < 0), then x would need to decrease in order to move toward equilibrium and we would need λx < 0. The expected sign of λx depends on the sign of α1.
If theory tells us the coefficients α0 and α1 of the cointegrating relationship, as in the case
of purchasing-power parity, then we can calculate the error-correction term in (5.19) and es-
timate it as a standard VAR. However, we usually do not know these coefficients, so they
must be estimated.
We now consider a vector error-correction model with three variables x, y, and z. This
situation is more complex because the number of linear combinations of the three variables
that are stationary could be 0, 1, or 2. In other words, there could be zero, one, or two com-
mon trends among the three variables.
If there are no cointegrating relationships, then the series are not cointegrated and a VAR
in differences is the appropriate specification. There is no long-run relationship to which the
levels of the variables tend to return, so there is no basis for an error-correction term in any
equation.
There would be one cointegrating relationship among the three variables if there is one
long-run equilibrium condition tying the levels of the variables together. An example would
be the purchasing-power-parity (PPP) condition between two countries under floating ex-
change rates. Suppose that P1 is the price of a basket of goods in country one, P2 is the price
of the same basket in country two, and X is the exchange rate: the number of units of country one's currency that buys one unit of country two's. If goods are to cost the same in both countries—purchasing-power parity—then X = P1/P2. Any increase in prices in country
one should be reflected, in long-run equilibrium, by an increase of equal proportion in the
amount of country-one currency needed to buy a unit of country-two’s currency.
In practice, economists have to rely on price indexes whose market baskets differ across
countries, so the PPP equation would need a constant of proportionality to reflect this differ-
ence: X = A0 P1/P2. Denoting logs of the variables by small letters, this implies the long-run equilibrium condition

xt = α0 + p1,t − p2,t ,

where α0 ≡ ln A0.
We could estimate a VEC system (with one lag, for simplicity) for the evolution of the three variables x, p1, and p2 with one cointegrating relationship (with some known coefficients):

∆xt = βx0 + λx ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + βx1 ∆xt−1 + βx2 ∆p1,t−1 + βx3 ∆p2,t−1 + vtx
∆p1,t = β10 + λ1 ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + β11 ∆xt−1 + β12 ∆p1,t−1 + β13 ∆p2,t−1 + vt1 (5.20)
∆p2,t = β20 + λ2 ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + β21 ∆xt−1 + β22 ∆p1,t−1 + β23 ∆p2,t−1 + vt2 .
If the exchange rate is out of equilibrium, say, too high, then we expect some combination of
adjustments in x, p1, and p2 to move back toward long-run equilibrium. The error-correction
coefficients λx, λ1, and λ2 measure these responses. Using the logic described above, we
would expect λx and λ2 to be negative and λ1 to be positive.
This does not exhaust the possible degree of cointegration among these variables, how-
ever. Suppose that country one is on a gold standard, so that the price level in that country tends to be constant in the long run.2 This would impose a second long-run equilibrium condition—a second cointegrating relationship—on the variables: p1 = α1 . The VEC system incorporating both cointegrating relationships would add a second error-correction term to each equation of (5.20):

∆xt = βx0 + λx ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + θx ( p1,t−1 − α1 ) + βx1 ∆xt−1 + βx2 ∆p1,t−1 + βx3 ∆p2,t−1 + vtx
∆p1,t = β10 + λ1 ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + θ1 ( p1,t−1 − α1 ) + β11 ∆xt−1 + β12 ∆p1,t−1 + β13 ∆p2,t−1 + vt1 (5.21)
∆p2,t = β20 + λ2 ( xt−1 − α0 − p1,t−1 + p2,t−1 ) + θ2 ( p1,t−1 − α1 ) + β21 ∆xt−1 + β22 ∆p1,t−1 + β23 ∆p2,t−1 + vt2 ,

where the θ coefficients measure each variable's adjustment to the second disequilibrium.
2
Another example we could use would be a fixed-exchange-rate system in which one country keeps x
near a constant level in the long run.
In equations (5.20) and (5.21) we started with a system in which we “knew” the nature
of the cointegrating relationship(s) among the variables. It is more common that we must test
for the possibility of cointegration (and determine how many cointegrating relationships ex-
ist) and estimate a full set of α parameters for them. We now turn to that process, then to an
example of an estimated VEC model.
The most common tests to determine the number of cointegrating relationships among
the series in a VAR/VEC are due to Johansen (1995). Although the mathematics of the tests
involve methods that are beyond our reach, the intuition is very similar to testing for unit
roots in the polynomial representing an AR process.
If we have n I(1) variables that are modeled jointly in a dynamic system, there can be up to n – 1 cointegrating relationships linking them. Stock and Watson (2011) think of each cointegrating relationship as a common trend linking some or all of the series in the system; we shall think of "cointegrating relationship" and "common trend" as synonymous. The cointegrating rank of the system is the number of such common trends, or the number of cointegrating relationships.3
To determine the cointegrating rank r, we perform a sequence of tests. First we test the
null hypothesis of r = 0 against r ≥ 1 to determine if there is at least one cointegrating rela-
tionship. If we fail to reject r = 0, then we conclude that there are no cointegrating relation-
ships or common trends among the series. In this case, we do not need a VEC model and
can simply use a VAR in the differences of the series.
If we reject r = 0 at the initial stage, then at least some of the series are cointegrated and we want to determine the number of cointegrating relationships. We proceed to a second step to test the null hypothesis that r ≤ 1 against r ≥ 2. If we cannot reject the hypothesis of no more than one common trend, then we estimate a VEC system with one cointegrating relationship, such as (5.20). If we reject r ≤ 1, we continue with r ≤ 2 against r ≥ 3, and so on, until we fail to reject.
3
For those familiar with linear algebra, the term "rank" refers to the rank of a matrix characterizing the dynamic system. If a dynamic system of n variables has r cointegrating relationships, then the rank of this matrix is r: it has r eigenvalues that are nonzero and n – r that are zero. The Johansen tests are based on determining the number of nonzero eigenvalues.
Johansen proposed several related tests that can be used at each stage. The most com-
mon (and the default in Stata) is the trace statistic. The Stata command vecrank prints the
trace statistic or, alternatively, the maximum-eigenvalue statistic (with the max option) or
various information criteria (with the ic option).
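For example, to compute trace statistics for the PPP system above with two lagged differences (variable names are placeholders for the log exchange rate and log price indexes), we might type

. vecrank x p1 p2 , lags(3) trend(constant)

The lags() option counts lags of the underlying VAR in levels, so lags(3) corresponds to two lagged differences in the VEC. Stata marks the selected cointegrating rank with an asterisk on the trace statistic.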
The Johansen procedure invoked in Stata by the vec command estimates both the pa-
rameters of the adjustment process (the β coefficients on the lagged changes in all variables)
and the long-run cointegrating relationships themselves (the α coefficients on the long-run
relationships) by maximum likelihood. We must tell Stata whether to include constant terms
in the differenced VEC regressions—remember that a constant term in a differenced equa-
tion corresponds to a trend term in the levels—or perhaps trend terms (which would be a
quadratic trend in the levels). It is also possible to include seasonal variables where appropri-
ate or to impose constraints on the coefficients of either the cointegrating relationships or the
adjustment equations.
Once the VEC system has been estimated, we can proceed to calculate IRFs and vari-
ance decompositions, or to generate forecasts as we would with a VAR.
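A sketch of the full sequence for the three-variable system, assuming the tests above selected one cointegrating relationship and three levels lags (names again placeholders):

. vec x p1 p2 , rank(1) lags(3) trend(constant)
. veclmar , mlag(4)
. irf set ppp
. irf create vec1 , step(20)
. irf graph oirf

After vec, the same irf and fcast machinery used for the VAR applies to the estimated VEC system.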
References
Johansen, Soren. 1995. Likelihood-Based Inference in Cointegrated Vector Autoregressive Models.
Oxford: Clarendon Press.
Sims, Christopher A. 1980. Macroeconomics and Reality. Econometrica 48 (1):1-48.