EcmAll PDF
Paul Söderlind1
June 2002 (some typos corrected and some material added later)
1 University of St. Gallen. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. Gallen,
Switzerland. E-mail: [email protected]. Document name: EcmAll.TeX.
Contents

1 Introduction
1.1 Means and Standard Deviation
1.2 Testing Sample Means
1.3 Covariance and Correlation
1.4 Least Squares
1.5 Maximum Likelihood
1.6 The Distribution of $\hat{\beta}$
1.7 Diagnostic Tests
1.8 Testing Hypotheses about $\hat{\beta}$
A Practical Matters
B A CLT in Action
3.3 Summary
4 Least Squares
4.1 Definition of the LS Estimator
4.2 LS and $R^2$
4.3 Finite Sample Properties of LS
4.4 Consistency of LS
4.5 Asymptotic Normality of LS
4.6 Inference
4.7 Diagnostic Tests of Autocorrelation, Heteroskedasticity, and Normality
7 GMM
7.1 Method of Moments
7.2 Generalized Method of Moments
7.3 Moment Conditions in GMM
7.4 The Optimization Problem in GMM
7.5 Asymptotic Properties of GMM
7.6 Summary of GMM
7.7 Efficient GMM and Its Feasible Implementation
7.8 Testing in GMM
7.9 GMM with Sub-Optimal Weighting Matrix
7.10 GMM without a Loss Function
7.11 Simulated Moments Estimator
15 Nonparametric Regressions and Tests
15.1 Nonparametric Regressions
15.2 Estimating and Testing Distributions
1 Introduction

The standard deviation (here denoted $\mathrm{Std}(x_t)$), the square root of the variance, is the most common measure of volatility.

The mean and standard deviation are often estimated on rolling data windows (for instance, a "Bollinger band" is $\pm 2$ standard deviations around a moving average, computed from a moving data window; it is sometimes used in the analysis of financial prices).
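As a rough illustration, here is a minimal Python/numpy sketch of a Bollinger band computed from a moving data window (the function name and window length are just for this example, not from the text):

```python
import numpy as np

def bollinger_band(x, window=20, n_std=2.0):
    """Rolling mean and +/- n_std rolling standard deviations over a moving window."""
    x = np.asarray(x, dtype=float)
    mid, upper, lower = [], [], []
    for t in range(window - 1, len(x)):
        w = x[t - window + 1 : t + 1]      # the moving data window
        m, s = w.mean(), w.std(ddof=1)     # sample mean and std of the window
        mid.append(m)
        upper.append(m + n_std * s)
        lower.append(m - n_std * s)
    return np.array(mid), np.array(upper), np.array(lower)

# example: a random-walk "price" series
prices = np.cumsum(np.random.randn(250)) + 100
mid, up, lo = bollinger_band(prices, window=20)
```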
If $x_t$ is iid (independently and identically distributed), then it is straightforward to find the variance of the sample average. Then, note that

$$\mathrm{Var}\Big(\sum_{t=1}^{T} x_t/T\Big) = \sum_{t=1}^{T}\mathrm{Var}(x_t/T) = T\,\mathrm{Var}(x_t)/T^2 = \mathrm{Var}(x_t)/T. \qquad (1.2)$$

The first equality follows from the assumption that $x_t$ and $x_s$ are independently distributed (so the covariance is zero). The second equality follows from the assumption that $x_t$ and $x_s$ are identically distributed (so their variances are the same). The third equality is a trivial simplification.
A sample average is (typically) unbiased, that is, the expected value of the sample average equals the population mean. To illustrate that, consider the expected value of the sample average of the iid $x_t$

$$\mathrm{E}\Big(\sum_{t=1}^{T} x_t/T\Big) = \sum_{t=1}^{T}\mathrm{E}(x_t)/T = \mathrm{E}\,x_t. \qquad (1.3)$$

The first equality is always true (the expectation of a sum is the sum of expectations), and
Figure 1.1: Sampling distributions. This figure shows the distribution of the sample mean and of $\sqrt{T}$ times the sample mean of the random variable $z_t - 1$, where $z_t \sim \chi^2(1)$. (Panel a: distribution of the sample average; panel b: distribution of $\sqrt{T}$ times the sample average; curves for $T = 5, 25, 50, 100$.)
the second equality follows from the assumption of identical distributions which implies
identical expectations.
The law of large numbers (LLN) says that the sample mean converges to the true popula-
tion mean as the sample size goes to infinity. This holds for a very large class of random
variables, but there are exceptions. A sufficient (but not necessary) condition for this con-
vergence is that the sample average is unbiased (as in (1.3)) and that the variance goes to
zero as the sample size goes to infinity (as in (1.2)). (This is also called convergence in
mean square.) To see the LLN in action, see Figure 1.1.
The central limit theorem (CLT) says that $\sqrt{T}\,\bar{x}$ converges in distribution to a normal distribution as the sample size increases. See Figure 1.1 for an illustration. This also holds for a large class of random variables, and it is a very useful result since it allows us to test hypotheses. Most estimators (including LS and other methods) are effectively some kind of sample average, so the CLT can be applied.
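To see the LLN and CLT at work numerically, here is a small simulation sketch (Python with numpy assumed; the sample sizes mirror Figure 1.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Distribution of the sample mean of z_t - 1, z_t ~ chi-square(1),
# and of sqrt(T) times that sample mean (cf. Figure 1.1 and Appendix B).
for T in (5, 25, 50, 100):
    z = rng.chisquare(df=1, size=(10_000, T))   # 10,000 simulated samples
    xbar = (z - 1).mean(axis=1)                 # sample means (true mean removed)
    print(f"T={T:4d}  Std(xbar)={xbar.std():.3f}  "
          f"Std(sqrt(T)*xbar)={(np.sqrt(T) * xbar).std():.3f}")
# Std(xbar) shrinks towards zero (LLN), while Std(sqrt(T)*xbar) stays close to
# sqrt(2), the standard deviation of a chi-square(1) variable (CLT scaling).
```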
The basic approach in testing a hypothesis (the "null hypothesis") is to compare the test statistic (the sample average, say) with how the distribution of that statistic (which is a random number since the sample is finite) would look if the null hypothesis were true. For instance, suppose the null hypothesis is that the population mean is $\mu$. Suppose also that we know that the distribution of the sample mean is normal with a known variance $h^2$ (which will typically be estimated and then treated as if it were known). Under the null hypothesis, the sample average should then be $N(\mu, h^2)$. We would then reject the null
hypothesis if the sample average is far out in one of the tails of the distribution. A traditional two-tailed test amounts to rejecting the null hypothesis at the 10% significance level if the test statistic is so far out that there is only 5% probability mass further out in that tail (and another 5% in the other tail). The interpretation is that if the null hypothesis is actually true, then there would only be a 10% chance of getting such an extreme (positive or negative) sample average, and these 10% are considered so low that we say that the null is probably wrong.
[Figure 1.2: Examples of normal distributions: pdfs of $x$, of $y = x - 0.5$, and of $z = (x - 0.5)/\sqrt{2}$.]
See Figure 1.2 for some examples of normal distributions. Recall that in a normal distribution, the interval $\pm 1$ standard deviation around the mean contains 68% of the probability mass; $\pm 1.65$ standard deviations contain 90%; and $\pm 2$ standard deviations contain 95%.

In practice, the test of a sample mean is done by "standardizing" the sample mean so
that it can be compared with a standard $N(0,1)$ distribution. The logic of this is as follows. If $\bar{x} \sim N(\mu, h^2)$, then $(\bar{x} - \mu)/h \sim N(0,1)$, so the probability of $\bar{x} \ge 2.7$ can be calculated by calculating how much probability mass of the standard normal density function there is above $(2.7 - \mu)/h$.

To construct a two-tailed test, we also need the probability that $\bar{x}$ is below some number. This number is chosen to make the two-tailed test symmetric, that is, so that there is as much probability mass below the lower number (lower tail) as above the upper number (upper tail). With a normal distribution (or, for that matter, any symmetric distribution) this is done as follows. Note that $(\bar{x} - \mu)/h \sim N(0,1)$ is symmetric around 0. This means that the probability of being above some number, $(C - \mu)/h$, must equal the probability of being below $-1$ times the same number, or

$$\Pr\Big(\frac{\bar{x} - \mu}{h} \ge \frac{C - \mu}{h}\Big) = \Pr\Big(\frac{\bar{x} - \mu}{h} \le -\frac{C - \mu}{h}\Big). \qquad (1.6)$$

A 10% critical value is the value of $(C - \mu)/h$ that makes both these probabilities equal to 5%, which happens to be 1.645. The easiest way to look up such critical values is by looking at the normal cumulative distribution function; see Figure 1.2.
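For instance, the critical values and tail probabilities can be looked up numerically; a sketch assuming scipy is available:

```python
from scipy.stats import norm

# two-tailed critical values for a standard normal test statistic
for level in (0.10, 0.05, 0.01):
    c = norm.ppf(1 - level / 2)     # e.g. 1.645 for a 10% two-tailed test
    print(f"{level:.0%} two-tailed critical value: {c:.3f}")

# probability mass above a standardized value (2.7 - mu)/h, here taken to be 1.8
print("Pr(N(0,1) > 1.8) =", 1 - norm.cdf(1.8))
```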
The covariance and correlation of $x_t$ and $z_t$ are estimated as

$$\widehat{\mathrm{Cov}}(x_t, z_t) = \sum_{t=1}^{T}(x_t - \bar{x})(z_t - \bar{z})/T, \qquad (1.7)$$

$$\widehat{\mathrm{Corr}}(x_t, z_t) = \frac{\widehat{\mathrm{Cov}}(x_t, z_t)}{\widehat{\mathrm{Std}}(x_t)\,\widehat{\mathrm{Std}}(z_t)}, \qquad (1.8)$$

which is always between $-1$ and $1$ (try to show it). Note that covariance and correlation measure the degree of linear relation only. This is illustrated in Figure 1.4.

[Figure (power of a t-test of $H_0\!:\beta = 0.5$): each panel shows the pdf of $t = (b - 0.5)/\sigma$ when the true $\beta$ is $-0.1$, $0.51$, or $2$; if $b \sim N(\beta, \sigma^2)$, then $t \sim N((\beta - 0.5)/\sigma, 1)$. The probabilities (power) of $t \le -1.65$ and $t > 1.65$ (the 10% critical values) are shown in each panel.]
The $p$th autocovariance of $x_t$ is estimated by

$$\widehat{\mathrm{Cov}}(x_t, x_{t-p}) = \sum_{t=1}^{T}(x_t - \bar{x})(x_{t-p} - \bar{x})/T, \qquad (1.9)$$

where we use the same estimated (using all data) mean in both places. Similarly, the $p$th autocorrelation is estimated as

$$\widehat{\mathrm{Corr}}(x_t, x_{t-p}) = \frac{\widehat{\mathrm{Cov}}(x_t, x_{t-p})}{\widehat{\mathrm{Std}}(x_t)^2}. \qquad (1.10)$$

Compared with a traditional estimate of a correlation (1.8), we here impose that the standard deviations of $x_t$ and $x_{t-p}$ are the same (which typically does not make much of a difference).
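A minimal sketch of the estimators (1.9)-(1.10) in Python/numpy (the function name and the MA(1) example are illustrative only):

```python
import numpy as np

def autocorr(x, p):
    """Sample p-th autocovariance and autocorrelation as in (1.9)-(1.10):
    the same full-sample mean and the same variance (lag 0) are used throughout."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xbar = x.mean()
    cov_p = np.sum((x[p:] - xbar) * (x[:T - p] - xbar)) / T
    var_0 = np.sum((x - xbar) ** 2) / T
    return cov_p, cov_p / var_0

rng = np.random.default_rng(0)
e = rng.standard_normal(100_001)
x = e[1:] + 0.8 * e[:-1]            # an MA(1): theory gives Corr = 0.8/1.64 ~ 0.49
print(autocorr(x, 1))
```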
Figure 1.4: Example of correlations (0.9 and 0) on an artificial sample ($y$ plotted against $x$). Both subfigures use the same sample of $y$.
Consider the regression model

$$y_t = x_t\beta_0 + u_t, \qquad (1.11)$$

where all variables are zero mean scalars and where $\beta_0$ is the true value of the parameter we want to estimate. The task is to use a sample $\{y_t, x_t\}_{t=1}^{T}$ to estimate $\beta$ and to test hypotheses about its value, for instance that $\beta = 0$.

If there were no movements in the unobserved errors, $u_t$, in (1.11), then any sample would provide us with a perfect estimate of $\beta$. With errors, any estimate of $\beta$ will still leave us with some uncertainty about what the true value is. The two perhaps most important issues in econometrics are how to construct a good estimator of $\beta$ and how to assess
the uncertainty about the true value.
For any possible estimate, $\hat{\beta}$, we get a fitted residual

$$\hat{u}_t = y_t - x_t\hat{\beta}. \qquad (1.12)$$

One appealing method of choosing $\hat{\beta}$ is to minimize the part of the movements in $y_t$ that we cannot explain by $x_t\hat{\beta}$, that is, to minimize the movements in $\hat{u}_t$. There are several candidates for how to measure the "movements," but the most common is by the mean of squared errors, that is, $\sum_{t=1}^{T}\hat{u}_t^2/T$. We will later look at estimators where we instead use $\sum_{t=1}^{T}|\hat{u}_t|/T$.
With the sum or mean of squared errors as the loss function, the optimization problem

$$\min_{\beta}\,\frac{1}{T}\sum_{t=1}^{T}(y_t - x_t\beta)^2 \qquad (1.13)$$

has the first order condition that the derivative should be zero at the optimal estimate $\hat{\beta}$

$$\frac{1}{T}\sum_{t=1}^{T} x_t\big(y_t - x_t\hat{\beta}\big) = 0, \qquad (1.14)$$

where a hat indicates a sample estimate. This is the Least Squares (LS) estimator.
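For the zero-mean scalar case, the first order condition (1.14) can be solved directly; a small Python/numpy sketch (function and variable names are just for illustration):

```python
import numpy as np

def ls_scalar(y, x):
    """Solve (1.14) for a zero-mean scalar regression y_t = x_t*beta + u_t:
    beta_hat = sum(x*y) / sum(x^2)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return np.sum(x * y) / np.sum(x * x)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
u = rng.standard_normal(200)
beta0 = 0.5
y = x * beta0 + u
print(ls_scalar(y, x))     # close to 0.5
```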
Suppose the errors $u_t$ in (1.11) are iid normally distributed with variance $\sigma^2$; then the pdf of $u_t$ is

$$\mathrm{pdf}(u_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{u_t^2}{2\sigma^2}\Big). \qquad (1.17)$$

Since the errors are independent, we get the joint pdf of $u_1, u_2, \ldots, u_T$ by multiplying the marginal pdfs of each of the errors. Then substitute $y_t - x_t\beta$ for $u_t$ (the derivative of the transformation is unity) and take logs to get the log likelihood function of the sample

$$\ln L = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2}\sum_{t=1}^{T}(y_t - x_t\beta)^2/\sigma^2. \qquad (1.18)$$

This likelihood function is maximized by minimizing the last term, which is proportional to the sum of squared errors, just like in (1.13): LS is ML when the errors are iid normally distributed.
Maximum likelihood estimators have very nice properties, provided the basic dis-
tributional assumptions are correct. If they are, then MLE are typically the most effi-
cient/precise estimators, at least asymptotically. ML also provides a coherent framework
for testing hypotheses (including the Wald, LM, and LR tests).
Equation (1.15) will give different values of $\hat{\beta}$ when we use different samples, that is, different draws of the random variables $u_t$, $x_t$, and $y_t$. Since the true value, $\beta_0$, is a fixed constant, this distribution describes the uncertainty we should have about the true value after having obtained a specific estimated value.
To understand the distribution of $\hat{\beta}$, use (1.11) in (1.15) to substitute for $y_t$

$$\hat{\beta} = \Big(\frac{1}{T}\sum_{t=1}^{T} x_t^2\Big)^{-1}\frac{1}{T}\sum_{t=1}^{T} x_t(x_t\beta_0 + u_t) = \beta_0 + \Big(\frac{1}{T}\sum_{t=1}^{T} x_t^2\Big)^{-1}\frac{1}{T}\sum_{t=1}^{T} x_t u_t, \qquad (1.19)$$
when $x_t$ is a set of deterministic variables which will always be the same irrespective of which sample we use, then $\hat{\beta} - \beta_0$ is a time invariant linear function of $u_t$, so the distribution of $u_t$ carries over to the distribution of $\hat{\beta}$. This is probably an unrealistic case, which forces us to look elsewhere to understand the properties of $\hat{\beta}$.
There are two main routes to learn more about the distribution of $\hat{\beta}$: (i) set up a small "experiment" in the computer and simulate the distribution, or (ii) use the asymptotic distribution as an approximation. The asymptotic distribution can often be derived, in contrast to the exact distribution in a sample of a given size. If the actual sample is large, then the asymptotic distribution may be a good approximation.
A law of large numbers would (in most cases) say that both $\sum_{t=1}^{T} x_t^2/T$ and $\sum_{t=1}^{T} x_t u_t/T$ in (1.19) converge to their expected values as $T \to \infty$. The reason is that both are sample averages of random variables (clearly, both $x_t^2$ and $x_t u_t$ are random variables). These expected values are $\mathrm{Var}(x_t)$ and $\mathrm{Cov}(x_t, u_t)$, respectively (recall that both $x_t$ and $u_t$ have zero means). The key to showing that $\hat{\beta}$ is consistent, that is, has a probability limit equal to $\beta_0$, is that $\mathrm{Cov}(x_t, u_t) = 0$. This highlights the importance of using good theory to derive not only the systematic part of (1.11), but also to understand the properties of the errors. For instance, when theory tells us that $y_t$ and $x_t$ affect each other (as prices and quantities typically do), then the errors are likely to be correlated with the regressors, and LS is inconsistent. One common way to get around that is to use an instrumental variables technique. More about that later. Consistency is a feature we want from most estimators, since it says that we would at least get it right if we had enough data.
Suppose that $\hat{\beta}$ is consistent. Can we say anything more about the asymptotic distribution? Well, the distribution of $\hat{\beta}$ converges to a spike with all the mass at $\beta_0$, but the distribution of $\sqrt{T}\hat{\beta}$, or $\sqrt{T}(\hat{\beta} - \beta_0)$, will typically converge to a non-trivial normal distribution. To see why, note from (1.19) that we can write

$$\sqrt{T}\big(\hat{\beta} - \beta_0\big) = \Big(\frac{1}{T}\sum_{t=1}^{T} x_t^2\Big)^{-1}\frac{\sqrt{T}}{T}\sum_{t=1}^{T} x_t u_t. \qquad (1.20)$$

The first term on the right hand side will typically converge to the inverse of $\mathrm{Var}(x_t)$, as discussed earlier. The second term is $\sqrt{T}$ times a sample average (of the random variable $x_t u_t$) with a zero expected value, since we assumed that $\hat{\beta}$ is consistent. Under weak conditions, a central limit theorem applies so $\sqrt{T}$ times a sample average converges to a normal distribution. This shows that $\sqrt{T}\hat{\beta}$ has an asymptotic normal distribution. It
turns out that this is a property of many estimators, basically because most estimators are
some kind of sample average. For an example of a central limit theorem in action, see
Appendix B.
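Route (i) above, the computer "experiment," might look as follows for the scalar model (1.11); this is a sketch (Python/numpy assumed, parameter values arbitrary) that simulates the distribution of $\sqrt{T}(\hat{\beta} - \beta_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, T, n_sim = 0.5, 100, 5000
stats = np.empty(n_sim)
for i in range(n_sim):
    x = rng.standard_normal(T)
    u = rng.standard_normal(T)
    y = x * beta0 + u
    beta_hat = np.sum(x * y) / np.sum(x * x)      # LS estimator from (1.14)
    stats[i] = np.sqrt(T) * (beta_hat - beta0)
# Asymptotics suggest N(0, Var(u)/Var(x)) = N(0, 1) in this setup.
print(stats.mean(), stats.std())                  # roughly 0 and 1
```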
We could then test hypotheses about $\hat{\beta}$ as for any other random variable. For instance, consider the hypothesis that $\beta_0 = 0$. If this is true, then

$$\Pr\big(\sqrt{T}\hat{\beta}/v < -2\big) = \Pr\big(\sqrt{T}\hat{\beta}/v > 2\big) \approx 0.025, \qquad (1.22)$$

which says that there is only a 2.5% chance that a random sample will deliver a value of $\sqrt{T}\hat{\beta}/v$ less than $-2$ and also a 2.5% chance that a sample delivers a value larger than 2, assuming the true value is zero.

We then say that we reject the hypothesis that $\beta_0 = 0$ at the 5% significance level (95% confidence level) if the test statistic $|\sqrt{T}\hat{\beta}/v|$ is larger than 2. The idea is that, if the hypothesis is true ($\beta_0 = 0$), then this decision rule gives the wrong decision in 5% of the cases. That is, 5% of all possible random samples will make us reject a true hypothesis. Note, however, that this test can only be taken to be an approximation since it relies on the asymptotic distribution, which is an approximation of the true (and typically unknown) distribution.
Figure 1.5: Probability density functions. Panel a: pdf of $N(0,1)$; panel b: pdf of $\chi^2(n)$ for $n = 1, 2, 5$.
The natural interpretation of a really large test statistic, $|\sqrt{T}\hat{\beta}/v| = 3$ say, is that it is very unlikely that this sample could have been drawn from a distribution where the hypothesis $\beta_0 = 0$ is true. We therefore choose to reject the hypothesis. We also hope that the decision rule we use will indeed make us reject false hypotheses more often than we reject true hypotheses. For instance, we want the decision rule discussed above to reject $\beta_0 = 0$ more often when $\beta_0 = 1$ than when $\beta_0 = 0$.
There is clearly nothing sacred about the 5% significance level. It is just a matter of
convention that the 5% and 10% are the most widely used. However, it is not uncommon
to use the 1% or the 20%. Clearly, the lower the significance level, the harder it is to reject
a null hypothesis. At the 1% level it often turns out that almost no reasonable hypothesis
can be rejected.
The t-test described above works only if the null hypothesis contains a single restriction. We have to use another approach whenever we want to test several restrictions jointly. The perhaps most common approach is a Wald test. To illustrate the idea, suppose $\beta$ is an $m \times 1$ vector and that $\sqrt{T}\hat{\beta} \to^d N(0, V)$ under the null hypothesis, where $V$ is a covariance matrix. We then know that

$$\sqrt{T}\hat{\beta}'\,V^{-1}\,\sqrt{T}\hat{\beta} \to^d \chi^2(m). \qquad (1.23)$$

The decision rule is then that if the left hand side of (1.23) is larger than the 5%, say, critical value of the $\chi^2(m)$ distribution, then we reject the hypothesis that all elements in $\beta$ are zero.
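A sketch of how (1.23) could be computed (Python with scipy assumed; the coefficient values and covariance matrix below are hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, V, T):
    """Wald statistic (1.23) for H0: beta = 0, where V is the asymptotic
    covariance matrix of sqrt(T)*beta_hat."""
    b = np.sqrt(T) * np.asarray(beta_hat, float)
    W = b @ np.linalg.inv(V) @ b
    m = len(b)
    return W, chi2.ppf(0.95, df=m), 1 - chi2.cdf(W, df=m)  # stat, 5% crit, p-value

# hypothetical numbers: two estimated coefficients and their covariance matrix
print(wald_test([0.20, -0.15], V=np.array([[1.0, 0.2], [0.2, 0.8]]), T=100))
```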
A Practical Matters
A.0.1 Software
4. Pindyck and Rubinfeld (1998), Econometric Models and Economic Forecasts (gen-
eral, easy)
11. Judge et al (1985), Theory and Practice of Econometrics (general, a bit old)
13. Spanos (1986), Statistical Foundations of Econometric Modelling, Cambridge Uni-
versity Press (general econometrics, LSE approach)
15. Harvey (1989), Forecasting, Structural Time Series... (structural time series, Kalman
filter).
16. Lütkepohl (1993), Introduction to Multiple Time Series Analysis (time series, VAR
models)
17. Priestley (1981), Spectral Analysis and Time Series (advanced time series)
19. Silverman (1986), Density Estimation for Statistics and Data Analysis (density es-
timation).
B A CLT in Action
This is an example of how we can calculate the limiting distribution of a sample average.
Remark B.1 If $\sqrt{T}(\bar{x} - \mu)/\sigma \sim N(0,1)$, then $\bar{x} \sim N(\mu, \sigma^2/T)$.
p
Example B.2 (Distribution of ˙ tTD1 .z t 1/ =T and T ˙ tTD1 .z t 1/ =T when z t
2 .1/:) When z t is iid 2 .1/, then ˙ tTD1 z t is distributed as a 2 .T / variable with pdf
T
fT ./. We now construct a new variable by transforming ˙ tD1 z t as to a sample mean
around one (the mean of z t )
Example B.3 Continuing the previous example, we now consider the random variable

$$\bar{z}_2 = \sqrt{T}\,\bar{z}_1,$$

with inverse function $\bar{z}_1 = \bar{z}_2/\sqrt{T}$. By applying the "change of variable" rule again, we get the pdf of $\bar{z}_2$ as

$$h(\bar{z}_2) = g\big(\bar{z}_2/\sqrt{T}\big)/\sqrt{T} = f_T\big(\sqrt{T}\bar{z}_2 + T\big)\sqrt{T}.$$
Example B.4 When $z_t$ is iid $\chi^2(1)$, then $\sum_{t=1}^{T} z_t$ is $\chi^2(T)$, which we denote $f(\sum_{t=1}^{T} z_t)$. We now construct two new variables by transforming $\sum_{t=1}^{T} z_t$

$$\bar{z}_1 = \sum_{t=1}^{T} z_t/T - 1 = \sum_{t=1}^{T}(z_t - 1)/T, \text{ and}$$
$$\bar{z}_2 = \sqrt{T}\,\bar{z}_1.$$
Example B.5 We transform this distribution by first subtracting one from $z_t$ (to remove the mean) and then by dividing by $T$ or $\sqrt{T}$. This gives the distributions of the sample mean and scaled sample mean, $\bar{z}_2 = \sqrt{T}\bar{z}_1$, as

$$f(\bar{z}_1) = \frac{1}{2^{T/2}\Gamma(T/2)}\,y^{T/2-1}\exp(-y/2) \quad\text{with } y = T\bar{z}_1 + T, \text{ and}$$

$$f(\bar{z}_2) = \frac{1}{2^{T/2}\Gamma(T/2)}\,y^{T/2-1}\exp(-y/2) \quad\text{with } y = \sqrt{T}\bar{z}_2 + T.$$

These distributions are shown in Figure 1.1. It is clear that $f(\bar{z}_1)$ converges to a spike at zero as the sample size increases, while $f(\bar{z}_2)$ converges to a (non-trivial) normal distribution.
Example B.6 (Distribution of $\sum_{t=1}^{T}(z_t - 1)/T$ and $\sqrt{T}\sum_{t=1}^{T}(z_t - 1)/T$ when $z_t \sim \chi^2(1)$.) When $z_t$ is iid $\chi^2(1)$, then $\sum_{t=1}^{T} z_t$ is $\chi^2(T)$, that is, has the probability density function

$$f\Big(\sum_{t=1}^{T} z_t\Big) = \frac{1}{2^{T/2}\Gamma(T/2)}\Big(\sum_{t=1}^{T} z_t\Big)^{T/2-1}\exp\Big(-\sum_{t=1}^{T} z_t/2\Big).$$

We transform this distribution by first subtracting one from $z_t$ (to remove the mean) and then by dividing by $T$ or $\sqrt{T}$. This gives the distributions of the sample mean, $\bar{z}_1 = \sum_{t=1}^{T}(z_t - 1)/T$, and scaled sample mean, $\bar{z}_2 = \sqrt{T}\bar{z}_1$, as

$$f(\bar{z}_1) = \frac{1}{2^{T/2}\Gamma(T/2)}\,y^{T/2-1}\exp(-y/2) \quad\text{with } y = T\bar{z}_1 + T, \text{ and}$$

$$f(\bar{z}_2) = \frac{1}{2^{T/2}\Gamma(T/2)}\,y^{T/2-1}\exp(-y/2) \quad\text{with } y = \sqrt{T}\bar{z}_2 + T.$$

These distributions are shown in Figure 1.1. It is clear that $f(\bar{z}_1)$ converges to a spike at zero as the sample size increases, while $f(\bar{z}_2)$ converges to a (non-trivial) normal distribution.
Bibliography
Amemiya, T., 1985, Advanced econometrics, Harvard University Press, Cambridge, Mas-
sachusetts.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Härdle, W., 1990, Applied nonparametric regression, Cambridge University Press, Cam-
bridge.
Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,
Cambridge University Press.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
Lütkepohl, H., 1993, Introduction to multiple time series, Springer-Verlag, 2nd edn.
Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric foundations, Cam-
bridge University Press, Cambridge.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
Priestley, M. B., 1981, Spectral analysis and time series, Academic Press.
Silverman, B. W., 1986, Density estimation for statistics and data analysis, Chapman and
Hall, London.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
2 Univariate Time Series Analysis
Reference: Greene (2000) 13.1-3 and 18.1-3
Additional references: Hayashi (2000) 6.2-4; Verbeek (2004) 8-9; Hamilton (1994); John-
ston and DiNardo (1997) 7; and Pindyck and Rubinfeld (1998) 16-18
Consider a sample $\{y_1^i, y_2^i, \ldots, y_T^i\}$, where subscripts indicate time periods. The superscripts indicate that this sample is from planet (realization) $i$. We could imagine a continuum of parallel planets where the same time series process has generated different samples with $T$ different numbers (different realizations).
Consider period $t$. The distribution of $y_t$ across the (infinite number of) planets has some density function, $f_t(y_t)$. The mean of this distribution

$$\mathrm{E}\,y_t = \int_{-\infty}^{\infty} y_t f_t(y_t)\,dy_t \qquad (2.1)$$

is the expected value of the value in period $t$, also called the unconditional mean of $y_t$. Note that $\mathrm{E}\,y_t$ could be different from $\mathrm{E}\,y_{t+s}$. The unconditional variance is defined similarly.
Now consider periods $t$ and $t - s$ jointly. On planet $i$ we have the pair $\{y_{t-s}^i, y_t^i\}$. The bivariate distribution of these pairs, across the planets, has some density function $g_{t-s,t}(y_{t-s}, y_t)$.¹ Calculate the covariance between $y_{t-s}$ and $y_t$ as usual

$$\mathrm{Cov}(y_{t-s}, y_t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(y_{t-s} - \mathrm{E}\,y_{t-s})(y_t - \mathrm{E}\,y_t)\,g_{t-s,t}(y_{t-s}, y_t)\,dy_t\,dy_{t-s} \qquad (2.2)$$
$$= \mathrm{E}\,(y_{t-s} - \mathrm{E}\,y_{t-s})(y_t - \mathrm{E}\,y_t). \qquad (2.3)$$

¹The relation between $f_t(y_t)$ and $g_{t-s,t}(y_{t-s}, y_t)$ is, as usual, $f_t(y_t) = \int_{-\infty}^{\infty} g_{t-s,t}(y_{t-s}, y_t)\,dy_{t-s}$.
This is the $s$th autocovariance of $y_t$. (Of course, $s = 0$ or $s < 0$ are allowed.)

A stochastic process is covariance stationary if

$$\mathrm{E}\,y_t = \mu \text{ is independent of } t, \qquad (2.4)$$
$$\mathrm{Cov}(y_{t-s}, y_t) = \gamma_s \text{ depends only on } s, \text{ and} \qquad (2.5)$$
$$\text{both } \mu \text{ and } \gamma_s \text{ are finite.} \qquad (2.6)$$
Most of these notes are about covariance stationary processes, but Section 2.7 is about
non-stationary processes.
Humanity has so far only discovered one planet with coin flipping; any attempt to estimate the moments of a time series process must therefore be based on the realization of the stochastic process from planet earth only. This is meaningful only if the process is ergodic for the moment you want to estimate. A covariance stationary process is said to be ergodic for the mean if

$$\mathrm{plim}\,\frac{1}{T}\sum_{t=1}^{T} y_t = \mathrm{E}\,y_t, \qquad (2.7)$$

so the sample mean converges in probability to the unconditional mean. A sufficient condition for ergodicity for the mean is

$$\sum_{s=0}^{\infty}\big|\mathrm{Cov}(y_{t-s}, y_t)\big| < \infty. \qquad (2.8)$$

This means that the link between the values in $t$ and $t - s$ goes to zero sufficiently fast as $s$ increases (you may think of this as getting independent observations before we reach the limit). If $y_t$ is normally distributed, then (2.8) is also sufficient for the process to be ergodic for all moments, not just the mean. Figure 2.1 illustrates how a longer and longer sample (of one realization of the same time series process) gets closer and closer to the unconditional distribution as the sample gets longer.
Let $y_t$ be a vector of a covariance stationary and ergodic process. The $s$th covariance matrix is

$$R(s) = \mathrm{E}\,(y_t - \mathrm{E}\,y_t)(y_{t-s} - \mathrm{E}\,y_{t-s})'. \qquad (2.9)$$
Figure 2.1: Sample of one realization of $y_t = 0.85y_{t-1} + \varepsilon_t$ with $y_0 = 4$ and $\mathrm{Std}(\varepsilon_t) = 1$. (Panels: one sample from the AR(1); histogram of observations 1-20; histogram of observations 1-1000; mean and standard deviation over longer and longer samples.)
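A minimal simulation sketch of the process in Figure 2.1 (Python/numpy assumed; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_sim(a=0.85, y0=4.0, T=1000):
    """Simulate y_t = a*y_{t-1} + eps_t with Std(eps_t) = 1, as in Figure 2.1."""
    y = np.empty(T)
    y_prev = y0
    for t in range(T):
        y_prev = a * y_prev + rng.standard_normal()
        y[t] = y_prev
    return y

y = ar1_sim()
for n in (20, 100, 1000):
    print(f"first {n:4d} obs: mean={y[:n].mean():6.3f}  std={y[:n].std():5.3f}")
# the sample moments drift towards the unconditional ones
# (mean 0, std = 1/sqrt(1 - 0.85**2) ~ 1.90) as the sample grows
```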
Note that $R(s)$ does not have to be symmetric unless $s = 0$. However, note that $R(s) = R(-s)'$. This follows from noting that

$$R(-s) = \mathrm{E}\,(y_t - \mathrm{E}\,y_t)(y_{t+s} - \mathrm{E}\,y_{t+s})' = \mathrm{E}\,(y_{t-s} - \mathrm{E}\,y_{t-s})(y_t - \mathrm{E}\,y_t)', \qquad (2.10a)$$

where we have simply changed time subscripts and exploited the fact that $y_t$ is covariance stationary. Transpose to get

$$R(-s)' = \mathrm{E}\,(y_t - \mathrm{E}\,y_t)(y_{t-s} - \mathrm{E}\,y_{t-s})', \qquad (2.11)$$

which is the same as in (2.9). If $y_t$ is a scalar, then $R(s) = R(-s)$, which shows that autocovariances are symmetric around $s = 0$.
Example 2.1 (Bivariate case.) Let $y_t = [x_t, z_t]'$ with $\mathrm{E}\,x_t = \mathrm{E}\,z_t = 0$. Then

$$R(s) = \mathrm{E}\begin{bmatrix} x_t \\ z_t \end{bmatrix}\begin{bmatrix} x_{t-s} & z_{t-s} \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(x_t, x_{t-s}) & \mathrm{Cov}(x_t, z_{t-s}) \\ \mathrm{Cov}(z_t, x_{t-s}) & \mathrm{Cov}(z_t, z_{t-s}) \end{bmatrix}.$$

Note that $R(-s)$ is

$$R(-s) = \begin{bmatrix} \mathrm{Cov}(x_t, x_{t+s}) & \mathrm{Cov}(x_t, z_{t+s}) \\ \mathrm{Cov}(z_t, x_{t+s}) & \mathrm{Cov}(z_t, z_{t+s}) \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(x_{t-s}, x_t) & \mathrm{Cov}(x_{t-s}, z_t) \\ \mathrm{Cov}(z_{t-s}, x_t) & \mathrm{Cov}(z_{t-s}, z_t) \end{bmatrix},$$
2.3 White Noise

A white noise process $\varepsilon_t$ satisfies

$$\mathrm{E}\,\varepsilon_t = 0,\quad \mathrm{Var}(\varepsilon_t) = \sigma^2, \text{ and}\quad \mathrm{Cov}(\varepsilon_{t-s}, \varepsilon_t) = 0 \text{ if } s \ne 0. \qquad (2.15)$$

If, in addition, $\varepsilon_t$ is normally distributed, then it is said to be Gaussian white noise. The conditions in (2.4)-(2.6) are satisfied so this process is covariance stationary. Moreover, (2.8) is also satisfied, so the process is ergodic for the mean (and all moments if $\varepsilon_t$ is normally distributed).
The moving average, MA($q$), process is

$$y_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q}, \qquad (2.16)$$

where the innovation $\varepsilon_t$ is white noise (usually Gaussian). We could also allow both $y_t$ and $\varepsilon_t$ to be vectors; such a process is called a vector MA (VMA).

We have $\mathrm{E}\,y_t = 0$ and

$$\mathrm{Var}(y_t) = \mathrm{E}\,\big(\varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q}\big)\big(\varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q}\big) = \sigma^2\big(1 + \theta_1^2 + \ldots + \theta_q^2\big).$$
Example 2.2 The mean of an MA(1), $y_t = \varepsilon_t + \theta_1\varepsilon_{t-1}$, is zero since the mean of $\varepsilon_t$ (and $\varepsilon_{t-1}$) is zero. The first autocovariances are

$$\mathrm{Var}(y_t) = \mathrm{E}\,(\varepsilon_t + \theta_1\varepsilon_{t-1})(\varepsilon_t + \theta_1\varepsilon_{t-1}) = \sigma^2\big(1 + \theta_1^2\big),$$
$$\mathrm{Cov}(y_{t-1}, y_t) = \sigma^2\theta_1,$$

and $\mathrm{Cov}(y_{t-s}, y_t) = 0$ for $|s| \ge 2$. Since both the mean and the covariances are finite and constant across $t$, the MA(1) is covariance stationary. Since the absolute values of the covariances sum to a finite number, the MA(1) is also ergodic for the mean. The first autocorrelation of an MA(1) is

$$\mathrm{Corr}(y_{t-1}, y_t) = \frac{\theta_1}{1 + \theta_1^2}.$$
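A quick numerical check of the MA(1) autocorrelation formula (a Python/numpy sketch with an arbitrary $\theta_1 = 0.5$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = 0.5
eps = rng.standard_normal(100_001)
y = eps[1:] + theta1 * eps[:-1]            # an MA(1)

emp = np.corrcoef(y[1:], y[:-1])[0, 1]     # sample first autocorrelation
print(emp, theta1 / (1 + theta1 ** 2))     # both close to 0.4
```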
Since the white noise process is covariance stationary, and since an MA($q$) with $q < \infty$ is a finite order linear function of $\varepsilon_t$, it must be the case that the MA($q$) is covariance stationary. It is ergodic for the mean since $\mathrm{Cov}(y_{t-s}, y_t) = 0$ for $s > q$, so (2.8) is satisfied. As usual, Gaussian innovations are then sufficient for the MA($q$) to be ergodic for all moments.

The effect of $\varepsilon_t$ on $y_t, y_{t+1}, \ldots$, that is, the impulse response function, is the same as the MA coefficients

$$\frac{\partial y_t}{\partial\varepsilon_t} = 1,\quad \frac{\partial y_{t+1}}{\partial\varepsilon_t} = \theta_1, \;\ldots,\; \frac{\partial y_{t+q}}{\partial\varepsilon_t} = \theta_q, \quad\text{and}\quad \frac{\partial y_{t+q+k}}{\partial\varepsilon_t} = 0 \text{ for } k > 0. \qquad (2.19)$$
This is easily seen from applying (2.16)
The forecasts made in $t = 2$ then have the following expressions, with an example using $\theta_1 = 2$, $\varepsilon_1 = 3/4$, and $\varepsilon_2 = 1/2$ in the second column

General / Example
$y_2 = \varepsilon_2 + \theta_1\varepsilon_1$ / $= 1/2 + 2\cdot 3/4 = 2$
$\mathrm{E}_2 y_3 = \mathrm{E}_2(\varepsilon_3 + \theta_1\varepsilon_2) = \theta_1\varepsilon_2$ / $= 2\cdot 1/2 = 1$
$\mathrm{E}_2 y_4 = \mathrm{E}_2(\varepsilon_4 + \theta_1\varepsilon_3) = 0$ / $= 0$
Example 2.4 (MA(1) and conditional variances.) From Example 2.3, the forecasting variances are, with the numerical example continued assuming that $\sigma^2 = 1$,

General / Example
$\mathrm{Var}(y_2 - \mathrm{E}_2 y_2) = 0$ / $= 0$
$\mathrm{Var}(y_3 - \mathrm{E}_2 y_3) = \mathrm{Var}(\varepsilon_3 + \theta_1\varepsilon_2 - \theta_1\varepsilon_2) = \sigma^2$ / $= 1$
$\mathrm{Var}(y_4 - \mathrm{E}_2 y_4) = \mathrm{Var}(\varepsilon_4 + \theta_1\varepsilon_3) = \sigma^2 + \theta_1^2\sigma^2$ / $= 5$
If the innovations are iid Gaussian, then the distribution of the $s$-period forecast error is

$$(y_t - \mathrm{E}_{t-s}\,y_t) \sim N\big(0,\,\sigma^2\big(1 + \theta_1^2 + \ldots + \theta_{s-1}^2\big)\big), \qquad (2.21)$$

since $\varepsilon_t, \varepsilon_{t-1}, \ldots, \varepsilon_{t-(s-1)}$ are independent Gaussian random variables. This implies that the conditional distribution of $y_t$, conditional on $\{\varepsilon_w\}_{w=-\infty}^{t-s}$, is normal. The conditional mean is the point forecast and the variance is the variance of the forecast error. Note that if $s > q$, then the conditional distribution coincides with the unconditional distribution since $\varepsilon_{t-s}$ for $s > q$ is of no help in forecasting $y_t$.
Continuing the numerical example (where $\Omega_2$ indicates the information set in $t = 2$), the conditional distributions are

General / Example
$y_2 \mid \Omega_2 \sim N(y_2, 0)$ / $= N(2, 0)$
$y_3 \mid \Omega_2 \sim N(\mathrm{E}_2 y_3,\,\mathrm{Var}(y_3 - \mathrm{E}_2 y_3))$ / $= N(1, 1)$
$y_4 \mid \Omega_2 \sim N(\mathrm{E}_2 y_4,\,\mathrm{Var}(y_4 - \mathrm{E}_2 y_4))$ / $= N(0, 5)$
2.5 Autoregression

A VAR($p$) is just like the AR($p$) in (2.24), but where $y_t$ is interpreted as a vector and $a_i$ as a matrix.

All stationary AR($p$) processes can be written on MA($\infty$) form by repeated substitution. To do so we rewrite the AR($p$) as a first order vector autoregression, VAR(1). For instance, an AR(2) $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \varepsilon_t$ can be written as

$$\begin{bmatrix} x_t \\ x_{t-1} \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ x_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}, \text{ or} \qquad (2.25)$$

$$y_t = Ay_{t-1} + \varepsilon_t, \qquad (2.26)$$

where $y_t$ is a $2\times 1$ vector and $A$ a $2\times 2$ matrix. This works also if $x_t$ and $\varepsilon_t$ are vectors. In this case, we interpret $a_i$ as matrices and 1 as an identity matrix.
Iterate backwards on (2.26).

Remark 2.7 (Spectral decomposition.) The $n$ eigenvalues ($\lambda_i$) and associated eigenvectors ($z_i$) of the $n\times n$ matrix $A$ satisfy

$$(A - \lambda_i I_n)z_i = 0_{n\times 1}.$$

Decompose $A = Z\Lambda Z^{-1}$, where $\Lambda$ is a diagonal matrix of eigenvalues and $Z$ the matrix of eigenvectors. Then

$$A^2 = AA = Z\Lambda Z^{-1}Z\Lambda Z^{-1} = Z\Lambda\Lambda Z^{-1} = Z\Lambda^2 Z^{-1} \;\Rightarrow\; A^q = Z\Lambda^q Z^{-1}.$$

Remark 2.8 (Modulus of complex number.) If $\lambda = a + bi$, where $i = \sqrt{-1}$, then $|\lambda| = |a + bi| = \sqrt{a^2 + b^2}$.

We note from the spectral decomposition that $A^{K+1} = Z\Lambda^{K+1}Z^{-1}$, where $Z$ is the matrix of eigenvectors and $\Lambda$ a diagonal matrix with eigenvalues. Clearly, $\lim_{K\to\infty} A^{K+1}y_{t-K-1} = 0$ is satisfied if the eigenvalues of $A$ are all less than one in modulus and $y_{t-K-1}$ does not grow without a bound.
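This stability condition is easy to check numerically by forming the companion matrix in (2.25)-(2.26) and computing its eigenvalues; a sketch (Python/numpy assumed, function name illustrative):

```python
import numpy as np

def is_stable_ar(coeffs):
    """Check stationarity of an AR(p) via the eigenvalues of the companion
    (VAR(1)) matrix A in (2.25)-(2.26): all moduli must be below one."""
    p = len(coeffs)
    A = np.zeros((p, p))
    A[0, :] = coeffs                 # first row: a_1, ..., a_p
    A[1:, :-1] = np.eye(p - 1)       # identity block below
    eigvals = np.linalg.eigvals(A)
    return bool(np.all(np.abs(eigvals) < 1)), eigvals

print(is_stable_ar([1.5, -0.5]))     # eigenvalues 1 and 0.5: not stationary
print(is_stable_ar([0.85]))          # AR(1) with a = 0.85: stationary
```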
Figure 2.2: Conditional moments (mean and variance) and conditional distributions for different forecast horizons $s$ for the AR(1) process $y_t = 0.85y_{t-1} + \varepsilon_t$ with $y_0 = 4$ and $\mathrm{Std}(\varepsilon_t) = 1$.
Example 2.9 (AR(1).) For the univariate AR(1) $y_t = ay_{t-1} + \varepsilon_t$, the characteristic equation is $(a - \lambda)z = 0$, which is only satisfied if the eigenvalue is $\lambda = a$. The AR(1) is therefore stable (and stationary) if $-1 < a < 1$. This can also be seen directly by noting that $a^{K+1}y_{t-K-1}$ declines to zero if $|a| < 1$ as $K$ increases.

Example 2.11 (Covariance stationarity of an AR(1) with $|a| < 1$.) From the MA-representation $y_t = \sum_{s=0}^{\infty} a^s\varepsilon_{t-s}$, the expected value of $y_t$ is zero, since $\mathrm{E}\,\varepsilon_{t-s} = 0$. We know that $\mathrm{Cov}(y_t, y_{t-s}) = a^{|s|}\sigma^2/(1 - a^2)$, which is constant and finite.
Example 2.12 (Ergodicity of a stationary AR(1).) We know that $\mathrm{Cov}(y_t, y_{t-s}) = a^{|s|}\sigma^2/(1 - a^2)$, so

$$\big|\mathrm{Cov}(y_t, y_{t-s})\big| = |a|^{|s|}\sigma^2/(1 - a^2),$$

which has a finite sum over $s$ when $|a| < 1$, so (2.8) is satisfied and the process is ergodic for the mean.

Example 2.13 (Conditional distribution of AR(1).) For the AR(1) $y_t = ay_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim N(0, \sigma^2)$, we get

$$\mathrm{E}_t\,y_{t+s} = a^s y_t,$$
$$\mathrm{Var}(y_{t+s} - \mathrm{E}_t\,y_{t+s}) = \big(1 + a^2 + a^4 + \ldots + a^{2(s-1)}\big)\sigma^2 = \frac{a^{2s} - 1}{a^2 - 1}\,\sigma^2.$$

The distribution of $y_{t+s}$ conditional on $y_t$ is normal with these parameters. See Figure 2.2 for an example.
Recall that the joint and conditional pdfs of some variables $z$ and $x$ are related as $\mathrm{pdf}(x, z) = \mathrm{pdf}(x \mid z)\,\mathrm{pdf}(z)$. Repeating this for the entire sample gives the likelihood function for the sample

$$\mathrm{pdf}\big(\{y_t\}_{t=1}^{T} \mid y_0\big) = \big(2\pi\sigma^2\big)^{-T/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_t - a_1 y_{t-1})^2\Big). \qquad (2.33)$$

Taking logs, and evaluating the first order conditions for $\sigma^2$ and $a$, gives the usual OLS estimator. Note that this is MLE conditional on $y_0$. There is a corresponding exact MLE, but the difference is usually small (the asymptotic distributions of the two estimators are the same under stationarity; under non-stationarity OLS still gives consistent estimates). The MLE of $\mathrm{Var}(\varepsilon_t)$ is given by $\sum_{t=1}^{T}\hat{v}_t^2/T$, where $\hat{v}_t$ is the OLS residual.

These results carry over to any finite-order VAR. The MLE, conditional on the initial observations, of the VAR is the same as OLS estimates of each equation. The MLE of the $ij$th element in $\mathrm{Cov}(\varepsilon_t)$ is given by $\sum_{t=1}^{T}\hat{v}_{it}\hat{v}_{jt}/T$, where $\hat{v}_{it}$ and $\hat{v}_{jt}$ are the OLS residuals.

To get the exact MLE, we need to multiply (2.33) with the unconditional pdf of $y_0$ (since we have no information to condition on)

$$\mathrm{pdf}(y_0) = \frac{1}{\sqrt{2\pi\sigma^2/(1 - a^2)}}\exp\Big(-\frac{y_0^2}{2\sigma^2/(1 - a^2)}\Big), \qquad (2.34)$$

since $y_0 \sim N\big(0, \sigma^2/(1 - a^2)\big)$. The optimization problem is then non-linear and must be solved by a numerical optimization routine.
2.5.2 Lag Operators

A common and convenient way of dealing with leads and lags is the lag operator, $L$. It is such that

$$L^s y_t = y_{t-s} \text{ for all (integer) } s.$$

For instance, the ARMA(2,1) model $y_t = a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t + \theta_1\varepsilon_{t-1}$ can be written as

$$\big(1 - a_1 L - a_2 L^2\big)y_t = (1 + \theta_1 L)\,\varepsilon_t, \qquad (2.36)$$
The first term in (2.38) is the inverse of the sample estimate of the covariance matrix of $x_t$ (since $\mathrm{E}\,y_t = 0$), which converges in probability to $\Sigma_{xx}^{-1}$ ($y_t$ is stationary and ergodic for all moments if $\varepsilon_t$ is Gaussian). The last term, $\frac{1}{T}\sum_{t=1}^{T} x_t\varepsilon_t$, is serially uncorrelated, so we can apply a CLT. Note that $\mathrm{E}\,x_t\varepsilon_t\varepsilon_t'x_t' = \mathrm{E}\,\varepsilon_t\varepsilon_t'\,\mathrm{E}\,x_t x_t' = \sigma^2\Sigma_{xx}$ since $\varepsilon_t$ and $x_t$ are independent. We therefore have

$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_t\varepsilon_t \to^d N\big(0, \sigma^2\Sigma_{xx}\big). \qquad (2.39)$$
Combining these facts, we get the asymptotic distribution

$$\sqrt{T}\big(\hat{\beta}_{LS} - \beta\big) \to^d N\big(0, \Sigma_{xx}^{-1}\sigma^2\big). \qquad (2.40)$$
It is straightforward to see the relation between autocorrelations and the AR model when
the AR model is the true process. This relation is given by the Yule-Walker equations.
For an AR(1), the autoregression coefficient is simply the first autocorrelation coeffi-
cient. For an AR(2), $y_t = a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t$, we have

$$\begin{bmatrix} \mathrm{Cov}(y_t, y_t) \\ \mathrm{Cov}(y_{t-1}, y_t) \\ \mathrm{Cov}(y_{t-2}, y_t) \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(y_t,\,a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t) \\ \mathrm{Cov}(y_{t-1},\,a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t) \\ \mathrm{Cov}(y_{t-2},\,a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t) \end{bmatrix}, \text{ or}$$

$$\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \end{bmatrix} = \begin{bmatrix} a_1\gamma_1 + a_2\gamma_2 + \mathrm{Var}(\varepsilon_t) \\ a_1\gamma_0 + a_2\gamma_1 \\ a_1\gamma_1 + a_2\gamma_0 \end{bmatrix}. \qquad (2.41)$$

To transform to autocorrelations, divide through by $\gamma_0$. The last two equations are then

$$\begin{bmatrix} \rho_1 \\ \rho_2 \end{bmatrix} = \begin{bmatrix} a_1 + a_2\rho_1 \\ a_1\rho_1 + a_2 \end{bmatrix} \quad\text{or}\quad \begin{bmatrix} \rho_1 \\ \rho_2 \end{bmatrix} = \begin{bmatrix} a_1/(1 - a_2) \\ a_1^2/(1 - a_2) + a_2 \end{bmatrix}. \qquad (2.42)$$
If we know the parameters of the AR(2) model ($a_1$, $a_2$, and $\mathrm{Var}(\varepsilon_t)$), then we can solve for the autocorrelations. Alternatively, if we know the autocorrelations, then we
can solve for the autoregression coefficients. This demonstrates that testing that all the
autocorrelations are zero is essentially the same as testing if all the autoregressive coeffi-
cients are zero. Note, however, that the transformation is non-linear, which may make a
difference in small samples.
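A small sketch of the Yule-Walker mapping (2.42) in both directions (Python/numpy assumed; the coefficient values are arbitrary):

```python
import numpy as np

def ar2_autocorr(a1, a2):
    """Solve (2.42) for the first two autocorrelations of an AR(2)."""
    rho1 = a1 / (1 - a2)
    rho2 = a1 ** 2 / (1 - a2) + a2
    return rho1, rho2

def ar2_coeffs(rho1, rho2):
    """Invert (2.42): solve for (a1, a2) given (rho1, rho2)."""
    A = np.array([[1.0, rho1],        # rho1 = a1 + a2*rho1
                  [rho1, 1.0]])       # rho2 = a1*rho1 + a2
    a1, a2 = np.linalg.solve(A, np.array([rho1, rho2]))
    return a1, a2

print(ar2_autocorr(0.5, 0.2))                  # -> (0.625, 0.5125)
print(ar2_coeffs(*ar2_autocorr(0.5, 0.2)))     # recovers (0.5, 0.2)
```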
Estimation of ARMA processes is typically done by setting up the likelihood function and
then using some numerical method to maximize it.
Even low-order ARMA models can be fairly flexible. For instance, the ARMA(1,1) model is

$$y_t = ay_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}, \text{ where } \varepsilon_t \text{ is white noise.} \qquad (2.44)$$
[Figure: Impulse response functions (over 10 periods) of ARMA(1,1) processes with $\theta = -0.8$, $0$, and $0.8$; panels for $a = 0.9$, $a = 0$, and a third value of $a$.]
2.7.1 Introduction

A random walk with drift is

$$y_t = \mu + y_{t-1} + \varepsilon_t, \qquad (2.49)$$

where $\varepsilon_t$ is white noise. The name "unit root process" comes from the fact that the largest eigenvalue of the canonical form (the VAR(1) form of the AR($p$)) is one. Such a process is said to be integrated of order one (often denoted I(1)) and can be made stationary by taking first differences.
Example 2.14 (Non-stationary AR(2).) The process $y_t = 1.5y_{t-1} - 0.5y_{t-2} + \varepsilon_t$ can be written

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} 1.5 & -0.5 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix},$$

where the matrix has the eigenvalues 1 and 0.5 and is therefore non-stationary. Note that subtracting $y_{t-1}$ from both sides gives $y_t - y_{t-1} = 0.5(y_{t-1} - y_{t-2}) + \varepsilon_t$, so the variable $x_t = y_t - y_{t-1}$ is stationary.
The distinguishing feature of unit root processes is that the effect of a shock never vanishes. This is most easily seen for the random walk. Substitute repeatedly in (2.49) to get

$$y_t = \mu + (\mu + y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t = \ldots = t\mu + y_0 + \sum_{s=1}^{t}\varepsilon_s. \qquad (2.50)$$

The effect of $\varepsilon_t$ never dies out: a non-zero value of $\varepsilon_t$ gives a permanent shift of the level of $y_t$. This process is clearly non-stationary. A consequence of the permanent effect of a shock is that the variance of the conditional distribution grows without bound as the forecasting horizon is extended. For instance, for the random walk with drift, (2.50), the distribution conditional on the information in $t = 0$ is $N(y_0 + t\mu,\, t\sigma^2)$ if the innovations are Gaussian. This means that the expected change is $t\mu$ and that the conditional variance grows linearly with the forecasting horizon. The unconditional variance is therefore infinite and the standard results on inference are not applicable.

In contrast, the conditional distribution from the trend stationary model, (2.48), is $N(st, \sigma^2)$.
A process could have two unit roots (integrated of order 2: I(2)). In this case, we need
to difference twice to make it stationary. Alternatively, a process can also be explosive,
that is, have eigenvalues outside the unit circle. In this case, the impulse response function
diverges.
Example 2.15 (Two unit roots.) Suppose $y_t$ in Example (2.14) is actually the first difference of some other series, $y_t = z_t - z_{t-1}$. We then have

$$\begin{bmatrix} z_t \\ z_{t-1} \\ z_{t-2} \end{bmatrix} = \begin{bmatrix} 2.5 & -2 & 0.5 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} z_{t-1} \\ z_{t-2} \\ z_{t-3} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \end{bmatrix}.$$

The eigenvalues are 1, 1, and 0.5, so $z_t$ has two unit roots (is integrated of order 2: I(2)) and needs to be differenced twice to become stationary.

Example 2.16 (Explosive AR(1).) Consider the process $y_t = 1.5y_{t-1} + \varepsilon_t$. The eigenvalue is then outside the unit circle, so the process is explosive. This means that the impulse response to a shock to $\varepsilon_t$ diverges (it is $1.5^s$ for $s$ periods ahead).
$$\Delta y_t = y_t - y_{t-1} = \varepsilon_t, \qquad (2.51)$$

which is white noise (any finite difference, like $y_t - y_{t-s}$, will give a stationary series),
so we could proceed by applying standard econometric tools to $\Delta y_t$.

One may then be tempted to try first-differencing all non-stationary series, since it may be hard to tell if they are unit root processes or just trend-stationary. For instance, a first difference of the trend stationary process, (2.48), gives a series whose trend is gone, but whose errors are now of MA(1) type (in fact, non-invertible, and therefore tricky, in particular for estimation). It is unclear if this is an improvement.
Consider the AR(1)

$$y_t = ay_{t-1} + \varepsilon_t, \qquad (2.53)$$

where the true value of $|a| < 1$. The asymptotic distribution of the LS estimator is

$$\sqrt{T}(\hat{a} - a) \sim N\big(0, 1 - a^2\big). \qquad (2.54)$$

(The variance follows from the standard OLS formula where the variance of the estimator is $\sigma^2(X'X/T)^{-1}$. Here $\mathrm{plim}\,X'X/T = \mathrm{Var}(y_t)$, which we know is $\sigma^2/(1 - a^2)$.)

It is well known (but not easy to show) that when $a = 1$, then $\hat{a}$ is biased towards zero in small samples. In addition, the asymptotic distribution is no longer (2.54). In fact, there is a discontinuity in the limiting distribution as we move from a stationary to a non-stationary variable. This, together with the small sample bias, means that we have to use simulated critical values for testing the null hypothesis of $a = 1$ based on the OLS estimate from (2.53).
The approach is to calculate the test statistic

$$t = \frac{\hat{a} - 1}{\mathrm{Std}(\hat{a})},$$

and reject the null of non-stationarity if $t$ is less than the critical values published by Dickey and Fuller (typically more negative than the standard values to compensate for the small sample bias) or from your own simulations.
In principle, distinguishing between a stationary and a non-stationary series is very
difficult (and impossible unless we restrict the class of processes, for instance, to an AR(2)), since any sample of a non-stationary process can be arbitrarily well approximated by some stationary process, and vice versa. The lesson to be learned, from a practical point of view, is that strong persistence in the data generating process (stationary or not) invalidates the usual results on inference. We are usually on safer ground to apply the unit root results in this case, even if the process is actually stationary.
Reference: Fuller (1976), Introduction to Statistical Time Series; Dickey and Fuller (1979),
“Distribution of the Estimators for Autoregressive Time Series with a Unit Root,” Journal
of the American Statistical Association, 74, 427-431.
Consider the AR(1) with intercept

$$y_t = \mu + \alpha y_{t-1} + u_t, \quad\text{or}\quad \Delta y_t = \mu + \beta y_{t-1} + u_t, \text{ where } \beta = \alpha - 1. \qquad (2.55)$$

The DF test is to test the null hypothesis that $\beta = 0$, against $\beta < 0$, using the usual t statistic. However, under the null hypothesis, the distribution of the t statistic is far from a Student-t or normal distribution. Critical values, found in Fuller and in Dickey and Fuller, are lower than the usual ones. Remember to add any nonstochastic regressors that are required, for instance, seasonal dummies, trends, etc. If you forget a trend, then the power of the test goes to zero as $T \to \infty$. The critical values are lower the more deterministic components that are added.
The asymptotic critical values are valid even under heteroskedasticity, and non-normal
distributions of u t . However, no autocorrelation in u t is allowed for. In contrast, the
simulated small sample critical values are usually only valid for iid normally distributed
disturbances.
The ADF test is a way to account for serial correlation in $u_t$. The same critical values apply. Consider an AR(1) $u_t = \rho u_{t-1} + e_t$. A Cochrane-Orcutt transformation of (2.55) gives

$$\Delta y_t = \mu(1 - \rho) + \tilde{\beta}y_{t-1} + \rho(\beta + 1)\,\Delta y_{t-1} + e_t, \text{ where } \tilde{\beta} = \beta(1 - \rho). \qquad (2.56)$$
More generally, include $p$ lags of $\Delta y_t$ if $u_t$ is an AR($p$). The test remains valid even under an MA structure if the number of lags included increases at the rate $T^{1/3}$ as the sample length increases. In practice: add lags until the remaining residual is white noise. The size of the test (probability of rejecting $H_0$ when it is actually correct) can be awful in small samples for a series that is an I(1) process that initially "overshoots" over time, as $\Delta y_t = e_t - 0.8e_{t-1}$, since this makes the series look mean reverting (stationary). Similarly, the power (probability of rejecting $H_0$ when it is false) can be awful when there is a lot of persistence, for instance, if $\alpha = 0.95$.
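In practice, the (augmented) Dickey-Fuller test is available in standard software; one possible sketch uses statsmodels' adfuller (assuming statsmodels is installed; the simulated series are illustrative):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
e = rng.standard_normal(500)

random_walk = np.cumsum(e)                 # I(1): should not reject a unit root
stationary_ar1 = np.empty(500)
y = 0.0
for t in range(500):
    y = 0.85 * y + e[t]
    stationary_ar1[t] = y                  # persistent but stationary AR(1)

for name, series in [("random walk", random_walk), ("AR(1), a=0.85", stationary_ar1)]:
    stat, pval, usedlag, nobs, crit, _ = adfuller(series, regression="c", autolag="AIC")
    print(f"{name:15s}  ADF stat={stat:6.2f}  p-value={pval:.3f}  5% crit={crit['5%']:.2f}")
```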
The power of the test depends on the span of the data, rather than the number of
observations. Seasonally adjusted data tend to look more integrated than they are. Should
apply different critical values, see Ghysel and Perron (1993), Journal of Econometrics,
55, 57-98. A break in mean or trend also makes the data look non-stationary. Should
perhaps apply tests that account for this, see Banerjee, Lumsdaine, Stock (1992), Journal
of Business and Economics Statistics, 10, 271-287.
Park (1990, “Testing for Unit Roots and Cointegration by Variable Addition,” Ad-
vances in Econometrics, 8, 107-133) sets up a framework where we can use both non-
stationarity as the null hypothesis and where we can have stationarity as the null. Consider
the regression

$$y_t = \sum_{s=0}^{p}\beta_s t^s + \sum_{s=p+1}^{q}\beta_s t^s + u_t, \qquad (2.57)$$
2.7.5 Cointegration
Suppose $y_t$ and $x_t$ are both (scalar) unit root processes, but that

$$z_t = y_t - \beta x_t \qquad (2.59)$$

is stationary. The processes $y_t$ and $x_t$ must then share the same common stochastic trend, and are therefore cointegrated with the cointegrating vector $[1\;\; -\beta]$. Running the regression (2.59) gives an estimator $\hat{\beta}_{LS}$ which converges much faster than usual (it is "superconsistent") and is not affected by any simultaneous equations bias. The intuition for the second result is that the simultaneous equations bias depends on the simultaneous reactions to the shocks, which are stationary and therefore without any long-run importance.
This can be generalized by letting $y_t$ be a vector of $n$ unit root processes which follows a VAR. For simplicity assume it is a VAR(2)

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + \varepsilon_t. \qquad (2.60)$$

Subtract $y_{t-1}$ from both sides, add and subtract $A_2 y_{t-1}$ from the right hand side

$$y_t - y_{t-1} = A_1 y_{t-1} + A_2 y_{t-2} + \varepsilon_t - y_{t-1} + A_2 y_{t-1} - A_2 y_{t-1} = (A_1 + A_2 - I)\,y_{t-1} - A_2(y_{t-1} - y_{t-2}) + \varepsilon_t. \qquad (2.61)$$

The left hand side is now stationary, and so are $y_{t-1} - y_{t-2}$ and $\varepsilon_t$ on the right hand side. It must therefore be the case that $(A_1 + A_2 - I)\,y_{t-1}$ is also stationary; it must be $n$ linear combinations of the cointegrating vectors. Since the number of cointegrating vectors must be less than $n$, the rank of $A_1 + A_2 - I$ must be less than $n$. To impose this calls for special estimation methods.
The simplest of these is Engle and Granger's two-step procedure. In the first step, we estimate the cointegrating vectors (as in (2.59)) and calculate the different $z_t$ series (fewer than $n$). In the second step, these are used in the error correction form of the VAR

$$y_t - y_{t-1} = \gamma z_{t-1} - A_2(y_{t-1} - y_{t-2}) + \varepsilon_t \qquad (2.62)$$

to estimate $\gamma$ and $A_2$. The relation to (2.61) is most easily seen in the bivariate case. Then, by using (2.59) in (2.62) we get

$$y_t - y_{t-1} = \gamma\begin{bmatrix} 1 & -\beta \end{bmatrix}y_{t-1} - A_2(y_{t-1} - y_{t-2}) + \varepsilon_t, \qquad (2.63)$$
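A stylized sketch of the two-step procedure for a bivariate case (Python/numpy assumed; the data generating process, names, and the simplified scalar error correction equation are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
# two cointegrated I(1) series: x is a random walk, y = 2*x + stationary error
x = np.cumsum(rng.standard_normal(T))
y = 2.0 * x + rng.standard_normal(T)

# step 1: cointegrating regression y_t = beta*x_t + z_t
beta_hat = np.sum(x * y) / np.sum(x * x)
z = y - beta_hat * x                         # estimated cointegrating residual

# step 2: error correction regression  dy_t = gamma*z_{t-1} + a*dy_{t-1} + e_t
dy = np.diff(y)
X = np.column_stack([z[1:-1], dy[:-1]])      # regressors: z_{t-1}, dy_{t-1}
coef, *_ = np.linalg.lstsq(X, dy[1:], rcond=None)
print("beta_hat:", beta_hat, " gamma, a:", coef)
```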
Bibliography
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
3 The Distribution of a Sample Average
Reference: Hayashi (2000) 6.5
Additional references: Hamilton (1994) 14; Verbeek (2004) 4.10; Harris and Matyas
(1999); and Pindyck and Rubinfeld (1998) Appendix 10.1; Cochrane (2001) 11.7
$$R(s) = \mathrm{Cov}(m_t, m_{t-s}) = \mathrm{E}\,m_t m_{t-s}', \qquad (3.2)$$
Example 3.1 ($m_t$ is a scalar iid process.) When $m_t$ is a scalar iid process, then

$$\mathrm{Var}\Big(\frac{1}{T}\sum_{t=1}^{T} m_t\Big) = \frac{1}{T^2}\sum_{t=1}^{T}\mathrm{Var}(m_t) \;\text{/*independently distributed*/} = \frac{1}{T^2}\,T\,\mathrm{Var}(m_t) \;\text{/*identically distributed*/} = \frac{1}{T}\,\mathrm{Var}(m_t).$$

This is the classical iid case. Clearly, $\lim_{T\to\infty}\mathrm{Var}(\bar{m}) = 0$. By multiplying both sides by $T$ we instead get $\mathrm{Var}\big(\sqrt{T}\bar{m}\big) = \mathrm{Var}(m_t)$, which is often more convenient for asymptotics.
Example 3.2 Let $x_t$ and $z_t$ be two scalars, with sample averages $\bar{x}$ and $\bar{z}$. Let $m_t = [x_t, z_t]'$. Then $\mathrm{Cov}(\bar{m})$ is

$$\mathrm{Cov}(\bar{m}) = \mathrm{Cov}\begin{bmatrix} \bar{x} \\ \bar{z} \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(\bar{x}) & \mathrm{Cov}(\bar{x}, \bar{z}) \\ \mathrm{Cov}(\bar{z}, \bar{x}) & \mathrm{Var}(\bar{z}) \end{bmatrix}.$$
For $T = 3$ we have

$$\mathrm{Cov}(T\bar{m}) = \mathrm{E}\,(m_1 + m_2 + m_3)(m_1' + m_2' + m_3') = \underbrace{\mathrm{E}\,(m_1 m_1' + m_2 m_2' + m_3 m_3')}_{3R(0)} + \underbrace{\mathrm{E}\,(m_2 m_1' + m_3 m_2')}_{2R(1)} + \underbrace{\mathrm{E}\,(m_1 m_2' + m_2 m_3')}_{2R(-1)} + \underbrace{\mathrm{E}\,m_3 m_1'}_{R(2)} + \underbrace{\mathrm{E}\,m_1 m_3'}_{R(-2)}.$$
Dividing by $T$ and generalizing to an arbitrary sample size gives

$$\mathrm{Cov}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-(T-1)}^{T-1}\Big(1 - \frac{|s|}{T}\Big)R(s). \qquad (3.4)$$
In many cases, we use the asymptotic expression (limiting value as $T \to \infty$) instead. If $R(s) = 0$ for $s > q$ so $m_t$ is an MA($q$), then the limit as the sample size goes to infinity is

$$\mathrm{ACov}\big(\sqrt{T}\bar{m}\big) = \lim_{T\to\infty}\mathrm{Cov}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-q}^{q} R(s), \qquad (3.5)$$

where ACov stands for the asymptotic variance-covariance matrix. This continues to hold even if $q = \infty$, provided $R(s)$ goes to zero sufficiently quickly, as it does in stationary VAR systems. In this case we have

$$\mathrm{ACov}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-\infty}^{\infty} R(s). \qquad (3.6)$$
Estimation in finite samples will of course require some cut-off point, which is discussed below.

The traditional estimator of $\mathrm{ACov}\big(\sqrt{T}\bar{m}\big)$ is just $R(0)$, which is correct when $m_t$ has no autocorrelation, that is

$$\mathrm{ACov}\big(\sqrt{T}\bar{m}\big) = R(0) = \mathrm{Cov}(m_t, m_t) \text{ if } \mathrm{Cov}(m_t, m_{t-s}) = 0 \text{ for } s \ne 0. \qquad (3.7)$$

By comparing with (3.5) we see that this underestimates the true variance if autocovariances are mostly positive, and overestimates if they are mostly negative. The errors can be substantial.
Suppose $m_t$ is an AR(1), $m_t = \rho m_{t-1} + u_t$, so that $R(s) = \rho^{|s|}\sigma^2/(1 - \rho^2)$. Then

$$\mathrm{AVar}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-\infty}^{\infty} R(s) = \frac{\sigma^2}{1-\rho^2}\sum_{s=-\infty}^{\infty}\rho^{|s|} = \frac{\sigma^2}{1-\rho^2}\Big(1 + 2\sum_{s=1}^{\infty}\rho^s\Big) = \frac{\sigma^2}{1-\rho^2}\,\frac{1+\rho}{1-\rho},$$

which is increasing in $\rho$ (provided $|\rho| < 1$, as required for stationarity). The variance of $\bar{m}$ is much larger for $\rho$ close to one than for $\rho$ close to zero: the high autocorrelation creates long swings, so the mean cannot be estimated with any good precision in a small sample. If we disregard all autocovariances, then we would conclude that the variance of $\sqrt{T}\bar{m}$ is $\sigma^2/(1-\rho^2)$, which is smaller (larger) than the true value when $\rho > 0$ ($\rho < 0$). For instance, with $\rho = 0.85$, it is approximately 12 times too small. See Figure 3.1 for an illustration.

Figure 3.1: Variance of $\sqrt{T}$ times the sample mean of the AR(1) process $m_t = \rho m_{t-1} + u_t$ (left panel), and Var(sample mean)/Var(series) (right panel), as functions of the AR(1) coefficient.
Example 3.5 (Variance of sample mean of AR(1), continued.) Part of the reason why $\mathrm{Var}(\bar{m})$ increased with $\rho$ in the previous examples is that $\mathrm{Var}(m_t)$ increases with $\rho$. We can eliminate this effect by considering how much larger $\mathrm{AVar}(\sqrt{T}\bar{m})$ is than in the iid case, that is, $\mathrm{AVar}(\sqrt{T}\bar{m})/\mathrm{Var}(m_t) = (1+\rho)/(1-\rho)$. This ratio is one for $\rho = 0$ (iid data), less than one for $\rho < 0$, and greater than one for $\rho > 0$. This says that if relatively more of the variance in $m_t$ comes from long swings (high $\rho$), then the sample mean is more uncertain. See Figure 3.1 for an illustration.
Example 3.6 (Variance of sample mean of AR(1), illustration of why $\lim_{T\to\infty}$ of (3.4).) For an AR(1), (3.4) is

$$\mathrm{Var}\big(\sqrt{T}\bar{m}\big) = \frac{\sigma^2}{1-\rho^2}\sum_{s=-(T-1)}^{T-1}\Big(1 - \frac{|s|}{T}\Big)\rho^{|s|} = \frac{\sigma^2}{1-\rho^2}\Big[1 + 2\sum_{s=1}^{T-1}\Big(1 - \frac{s}{T}\Big)\rho^s\Big] = \frac{\sigma^2}{1-\rho^2}\Big[1 + 2\frac{\rho}{1-\rho} + 2\frac{\rho^{T+1} - \rho}{T(1-\rho)^2}\Big].$$
The last term in brackets goes to zero as T goes to infinity. We then get the result in
Example 3.4.
Newey and West (1987) suggested the following estimator of the covariance matrix in (3.5) (for some $n < T$)

$$\widehat{\mathrm{ACov}}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-n}^{n}\Big(1 - \frac{|s|}{n+1}\Big)\hat{R}(s) = \hat{R}(0) + \sum_{s=1}^{n}\Big(1 - \frac{s}{n+1}\Big)\big(\hat{R}(s) + \hat{R}(-s)\big), \text{ or, since } \hat{R}(-s) = \hat{R}'(s),$$
$$= \hat{R}(0) + \sum_{s=1}^{n}\Big(1 - \frac{s}{n+1}\Big)\big(\hat{R}(s) + \hat{R}'(s)\big), \text{ where} \qquad (3.8)$$

$$\hat{R}(s) = \frac{1}{T}\sum_{t=s+1}^{T} m_t m_{t-s}' \quad (\text{if } \mathrm{E}\,m_t = 0). \qquad (3.9)$$
The tent shaped (Bartlett) weights in (3.8) guarantee a positive definite covariance
estimate. In contrast, equal weights (as in (3.5)), may give an estimated covariance matrix
which is not positive definite, which is fairly awkward. Newey and West (1987) showed
that this estimator is consistent if we let n go to infinity as T does, but in such a way that
n=T 1=4 goes to zero.
There are several other possible estimators of the covariance matrix in (3.5), but sim-
ulation evidence suggest that they typically do not improve a lot on the Newey-West
estimator.
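A compact sketch of (3.8)-(3.9) in Python/numpy (the function name and the MA(1) test case are illustrative):

```python
import numpy as np

def newey_west(m, n):
    """Newey-West estimate (3.8)-(3.9) of ACov(sqrt(T)*mbar) for a T x k
    matrix of zero-mean moments m, using n lags and Bartlett weights."""
    m = np.atleast_2d(np.asarray(m, dtype=float))
    if m.shape[0] < m.shape[1]:
        m = m.T                       # make it T x k
    T = m.shape[0]
    S = m.T @ m / T                   # R_hat(0)
    for s in range(1, n + 1):
        R_s = m[s:].T @ m[:-s] / T    # R_hat(s), see (3.9)
        S += (1 - s / (n + 1)) * (R_s + R_s.T)
    return S

# example: an MA(1) moment series, so the true long-run variance is
# R(0) + R(1) + R(-1) = (1 + theta)^2 with unit innovation variance
rng = np.random.default_rng(0)
e = rng.standard_normal(10_001)
m = (e[1:] + 0.5 * e[:-1]).reshape(-1, 1)
print(newey_west(m - m.mean(), n=3))   # close to (1 + 0.5)^2 = 2.25
```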
Example 3.7 ($m_t$ is MA(1).) Suppose we know that $m_t = \varepsilon_t + \theta\varepsilon_{t-1}$. Then $R(s) = 0$ for $s \ge 2$, so it might be tempting to use $n = 1$ in (3.8). This gives $\widehat{\mathrm{ACov}}\big(\sqrt{T}\bar{m}\big) = \hat{R}(0) + \frac{1}{2}\big[\hat{R}(1) + \hat{R}'(1)\big]$, while the theoretical expression (3.5) is $\mathrm{ACov} = R(0) + R(1) + R'(1)$. The Newey-West estimator puts too low weights on the first lead and lag, which suggests that we should use $n > 1$ (or more generally, $n > q$ for an MA($q$) process).
[Figure: Scatter plots of $y$ against $x$ with iid residuals and with Var(residual) depending on $x^2$. The model is $y = 0.03 + 1.3x + u$. Solid regression lines are based on all data, dashed lines exclude the crossed out data point.]
It can also be shown that, under quite general circumstances, $\hat{S}$ in (3.8)-(3.9) is a consistent estimator of $\mathrm{ACov}\big(\sqrt{T}\bar{m}\big)$, even if $m_t$ is heteroskedastic (on top of being autocorrelated). (See Hamilton (1994) 10.5 for a discussion.)
Economic theory and/or stylized facts can sometimes help us choose the lag length $n$. For instance, we may have a model of stock returns which typically show little autocorrelation, so it may make sense to set $n = 0$ or $n = 1$ in that case. A popular choice of $n$ is to round $(T/100)^{1/4}$ down to the closest integer, although this does not satisfy the consistency requirement.

It is important to note that the definitions of the covariance matrices in (3.2) and (3.9) assume that $m_t$ has zero mean. If that is not the case, then the mean should be removed in the calculation of the covariance matrix. In practice, you remove the same number, estimated on the whole sample, from both $m_t$ and $m_{t-s}$. It is often recommended to remove the sample means even if theory tells you that the true mean is zero.
[Figure: Standard deviation of the LS slope coefficient under heteroskedasticity, as a function of $\alpha$ (the effect of the regressor on the variance): OLS formula, White's estimator, and simulated values.]
3.3 Summary

Let $\bar{m} = \frac{1}{T}\sum_{t=1}^{T} m_t$ and $R(s) = \mathrm{Cov}(m_t, m_{t-s})$. Then

$$\mathrm{ACov}\big(\sqrt{T}\bar{m}\big) = \sum_{s=-\infty}^{\infty} R(s),$$
$$\mathrm{ACov}\big(\sqrt{T}\bar{m}\big) = R(0) = \mathrm{Cov}(m_t, m_t) \text{ if } R(s) = 0 \text{ for } s \ne 0,$$
$$\text{Newey-West:}\quad \widehat{\mathrm{ACov}}\big(\sqrt{T}\bar{m}\big) = \hat{R}(0) + \sum_{s=1}^{n}\Big(1 - \frac{s}{n+1}\Big)\big(\hat{R}(s) + \hat{R}'(s)\big).$$
Bibliography
Cochrane, J. H., 2001, Asset pricing, Princeton University Press, Princeton, New Jersey.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
[Figure: Scatter plots of $y$ against $x$ with iid and autocorrelated residuals. The model is $y = 0.03 + 1.3x + u$. Solid regression lines are based on all data, dashed lines are based on the late sample (high $x$ values). The regressor is (strongly) autocorrelated, since it is an increasing series ($-10, -9.9, \ldots, 10$).]
Harris, D., and L. Matyas, 1999, "Introduction to the generalized method of moments estimation," in Laszlo Matyas (ed.), Generalized Method of Moments Estimation, chap. 1, Cambridge University Press.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
[Figure: Standard deviation of the LS slope under autocorrelation, as a function of $\rho$ (the autocorrelation of the residual), for $\kappa = -0.9$, $0$, and $0.9$: OLS formula, Newey-West, and simulated values. Model: $y_t = 0.9x_t + \epsilon_t$, where $\epsilon_t = \rho\epsilon_{t-1} + \xi_t$, $\xi_t$ iid $N$, and $x_t = \kappa x_{t-1} + \eta_t$, $\eta_t$ iid $N$.]
4 Least Squares
Reference: Greene (2000) 6
Additional references: Hayashi (2000) 1-2; Verbeek (2004) 1-4; Hamilton (1994) 8
Remark 4.1 (Summation and vectors) Let $z_t$ and $x_t$ be the vectors

$$z_t = \begin{bmatrix} z_{1t} \\ z_{2t} \end{bmatrix} \quad\text{and}\quad x_t = \begin{bmatrix} x_{1t} \\ x_{2t} \\ x_{3t} \end{bmatrix};$$

then

$$\sum_{t=1}^{T} x_t z_t' = \sum_{t=1}^{T}\begin{bmatrix} x_{1t} \\ x_{2t} \\ x_{3t} \end{bmatrix}\begin{bmatrix} z_{1t} & z_{2t} \end{bmatrix} = \sum_{t=1}^{T}\begin{bmatrix} x_{1t}z_{1t} & x_{1t}z_{2t} \\ x_{2t}z_{1t} & x_{2t}z_{2t} \\ x_{3t}z_{1t} & x_{3t}z_{2t} \end{bmatrix} = \begin{bmatrix} \sum_{t=1}^{T} x_{1t}z_{1t} & \sum_{t=1}^{T} x_{1t}z_{2t} \\ \sum_{t=1}^{T} x_{2t}z_{1t} & \sum_{t=1}^{T} x_{2t}z_{2t} \\ \sum_{t=1}^{T} x_{3t}z_{1t} & \sum_{t=1}^{T} x_{3t}z_{2t} \end{bmatrix}.$$
We then have

$$\sum_{t=1}^{T} x_t z_t' = X'Z.$$

We can then rewrite the loss function (4.2) as $e'e$, and the first order conditions (4.3) and (4.4) as (recall that $y_t = y_t'$ since it is a scalar)

$$0_{k\times 1} = X'\big(Y - X\hat{\beta}_{LS}\big) \qquad (4.10)$$
$$X'Y = X'X\hat{\beta}_{LS}, \qquad (4.11)$$
4.2 LS and $R^2$

This implies that the fitted residuals and fitted values are orthogonal, $\sum_{t=1}^{T}\hat{y}_t\hat{u}_t = \sum_{t=1}^{T}\hat{\beta}'x_t\hat{u}_t = 0$. If we let $x_t$ include a constant, then (4.13) also implies that the fitted residuals have a zero mean, $\sum_{t=1}^{T}\hat{u}_t/T = 0$. We can then decompose the sample variance (denoted $\widehat{\mathrm{Var}}$) of $y_t = \hat{y}_t + \hat{u}_t$ as

$$\widehat{\mathrm{Var}}(y_t) = \widehat{\mathrm{Var}}(\hat{y}_t) + \widehat{\mathrm{Var}}(\hat{u}_t), \qquad (4.14)$$

since $\hat{y}_t$ and $\hat{u}_t$ are uncorrelated in this case. (Note that $\mathrm{Cov}(\hat{y}_t, \hat{u}_t) = \mathrm{E}\,\hat{y}_t\hat{u}_t - \mathrm{E}\,\hat{y}_t\,\mathrm{E}\,\hat{u}_t$, so the orthogonality is not enough to allow the decomposition; we also need $\mathrm{E}\,\hat{y}_t\,\mathrm{E}\,\hat{u}_t = 0$; this holds for sample moments as well.)
We define $R^2$ as the fraction of $\widehat{\mathrm{Var}}(y_t)$ that is explained by the model

$$R^2 = \frac{\widehat{\mathrm{Var}}(\hat{y}_t)}{\widehat{\mathrm{Var}}(y_t)} \qquad (4.15)$$
$$= 1 - \frac{\widehat{\mathrm{Var}}(\hat{u}_t)}{\widehat{\mathrm{Var}}(y_t)}. \qquad (4.16)$$
LS minimizes the sum of squared fitted errors, which is proportional to $\widehat{\mathrm{Var}}(\hat{u}_t)$, so it maximizes $R^2$.

We can rewrite $R^2$ by noting that

$$\widehat{\mathrm{Cov}}(y_t, \hat{y}_t) = \widehat{\mathrm{Cov}}(\hat{y}_t + \hat{u}_t, \hat{y}_t) = \widehat{\mathrm{Var}}(\hat{y}_t). \qquad (4.17)$$

Using this in (4.15) gives

$$R^2 = \frac{\widehat{\mathrm{Cov}}(y_t, \hat{y}_t)^2}{\widehat{\mathrm{Var}}(y_t)\,\widehat{\mathrm{Var}}(\hat{y}_t)} = \widehat{\mathrm{Corr}}(y_t, \hat{y}_t)^2, \qquad (4.18)$$

which shows that $R^2$ is the square of the correlation coefficient of the actual and fitted value. Note that this interpretation of $R^2$ relies on the fact that $\widehat{\mathrm{Cov}}(\hat{y}_t, \hat{u}_t) = 0$. From (4.14) this implies that the sample variance of the fitted values is smaller than the sample variance of $y_t$. From (4.15) we see that this implies that $0 \le R^2 \le 1$.
To get a bit more intuition for what $R^2$ represents, suppose the estimated coefficients equal the true coefficients, so $\hat{y}_t = x_t'\beta_0$. In this case, $R^2 = \mathrm{Corr}\big(x_t'\beta_0 + u_t,\,x_t'\beta_0\big)^2$, that is, the squared correlation of $y_t$ with the systematic part of $y_t$. Clearly, if the model is perfect so $u_t = 0$, then $R^2 = 1$. In contrast, when there are no movements in the systematic part ($\beta_0 = 0$), then $R^2 = 0$.
Consider a simple regression with fitted values $\hat{y}_t = \hat{b}x_t$, and let $\hat{d}_{LS}$ be the LS slope from the reverse regression of $x_t$ on $y_t$. Then

$$R^2 = \frac{\widehat{\mathrm{Cov}}\big(y_t, \hat{b}x_t\big)^2}{\widehat{\mathrm{Var}}(y_t)\,\widehat{\mathrm{Var}}\big(\hat{b}x_t\big)} = \frac{\hat{b}^2\,\widehat{\mathrm{Cov}}(y_t, x_t)^2}{\hat{b}^2\,\widehat{\mathrm{Var}}(y_t)\,\widehat{\mathrm{Var}}(x_t)}.$$

We therefore have

$$\hat{b}_{LS}\,\hat{d}_{LS} = \frac{\widehat{\mathrm{Cov}}(y_t, x_t)^2}{\widehat{\mathrm{Var}}(y_t)\,\widehat{\mathrm{Var}}(x_t)} = R^2.$$
This shows that dOLS D 1=bOLS if (and only if) R2 D 1.
Use the true model (4.1) to substitute for y t in the definition of the LS estimator (4.6)
1
T T
!
1X 1X
ˇOLS D x t x t0 x t x t0 ˇ0 C u t
T t D1 T t D1
T
! 1 T
1X 0 1X
D ˇ0 C xt xt xt ut : (4.19)
T t D1 T tD1
D ˇ0 ; (4.21)
since E.u t jx t s / D 0 for all s. This is, for instance, the case when the regressors are
deterministic. Notice that E. u t j x t / D 0 is not enough for unbiasedness since (4.19)
contains terms involving x t s x t u t from the product of . T1 TtD1 x t x t0 / 1 and x t u t .
P
In this case, the assumption E.u t jx t s / D 0 for all s (that is, s D :::; 1; 0; 1; :::) is false,
since x t C1 D y t and u t and y t are correlated. We can therefore not use this way of
proving that ˛O LS is unbiased. In fact, it is not, and it can be shown that ˛O LS is downward-
biased if ˛ > 0, and that this bias gets quite severe as ˛ gets close to unity.
57
The finite sample distribution of the LS estimator is typically unknown.
Even in the most restrictive case where u t is iid N 0; 2 and E.u t jx t s / D 0 for all
This says that the estimator, conditional on the sample of regressors, is normally dis-
tributed. With deterministic x t , this clearly means that ˇOLS is normally distributed in a
small sample. The intuition is that the LS estimator with deterministic regressors is just
a linear combination of the normally distributed y t , so it must be normally distributed.
However, if x t is stochastic, then we have to take into account the distribution of fx t gTtD1
to find the unconditional distribution of ˇOLS . The principle is that
Z 1 Z 1
pdf ˇ DO O
pdf ˇ; x dx D O
pdf ˇ jx pdf .x/ dx;
1 1
so the distribution in (4.22) must be multiplied with the probability density function of
fx t gTtD1 and then integrated over fx t gTtD1 to give the unconditional distribution (marginal)
of ˇOLS . This is typically not a normal distribution.
Another way to see the same problem is to note that ˇOLS in (4.19) is a product of
T
two random variables, .˙ tD1 x t x t0 =T / 1 and ˙ tTD1 x t u t =T . Even if u t happened to be
normally distributed, there is no particular reason why x t u t should be, and certainly no
strong reason for why .˙ tTD1 x t x t0 =T / 1 ˙ tTD1 x t u t =T should be.
4.4 Consistency of LS
Reference: Greene (2000) 9.3-5 and 11.2; Hamilton (1994) 8.2; Davidson (2000) 3
We now study if the LS estimator is consistent.
Remark 4.6 Suppose the true parameter value is ˇ0 . The estimator ˇOT (which, of course,
depends on the sample size T ) is said to be consistent if for every " > 0 and ı > 0 there
exists N such that for T N
O
Pr
ˇT ˇ0
> ı < ":
58
p
(kxk D x 0 x, the Euclidean distance of x from zero.) We write this plim ˇOT D ˇ0 or
just plim ˇO D ˇ0 , or perhaps ˇO !p ˇ0 . (For an estimator of a covariance matrix, the
most convenient is to stack the unique elements in a vector and then apply the definition
above.)
Remark 4.7 (Slutsky’s theorem.) If g .:/ is a continuous function, then plim g .zT / D
g .plim zT /. In contrast, note that Eg .zT / is generally not equal to g .EzT /, unless g .:/
is a linear function.
Remark 4.8 (Probability limit of product.) Let xT and yT be two functions of a sample
of length T . If plim xT D a and plim yT D b, then plim xT yT D ab.
Assume
T
1X
plim x t x t0 D ˙xx < 1; and ˙xx invertible. (4.23)
T tD1
The plim carries over to the inverse by Slutsky’s theorem.1 Use the facts above to write
the probability limit of (4.19) as
T
O 1 1X
plim ˇLS D ˇ0 C ˙xx plim xt ut : (4.24)
T t D1
This is fairly easy to establish in special cases, for instance, when w t D x t u t is iid or
when there is either heteroskedasticity or serial correlation. The case with both serial
correlation and heteroskedasticity is just a bit more complicated. In other cases, it is clear
that the covariance the residuals and the regressors are not all zero—for instance when
some of the regressors are measured with error or when some of them are endogenous
variables.
An example of a case where LS is not consistent is when the errors are autocorrelated
and the regressors include lags of the dependent variable. For instance, suppose the error
1
This puts non-trivial restrictions on the data generating processes. For instance, if x t include lagged
values of y t , then we typically require y t to be stationary and ergodic, and that u t is independent of x t s
for s 0.
59
is a MA(1) process
u t D " t C 1 " t 1; (4.26)
where " t is white noise and that the regression equation is an AR(1)
y t D y t 1 C ut : (4.27)
This is an ARMA(1,1) model and it is clear that the regressor and error in (4.27) are
correlated, so LS is not a consistent estimator of an ARMA(1,1) model.
Reference: Greene (2000) 9.3-5 and 11.2; Hamilton (1994) 8.2; Davidson (2000) 3
Remark 4.9 (Continuous mapping theorem.) Let the sequences of random matrices fxT g
d p
and fyT g, and the non-random matrix faT g be such that xT ! x, yT ! y, and aT ! a
d
(a traditional limit). Let g.xT ; yT ; aT / be a continuous function. Then g.xT ; yT ; aT / !
g.x; y; a/. Either of yT and aT could be irrelevant in g.
d
Remark 4.10 From the previous remark: if xT ! x (a random variable) and plim QT D
d
Q (a constant matrix), then QT xT ! Qx.
p
Premultiply (4.19) by T and rearrange as
T
! 1 p T
p 1X T X
T ˇOLS ˇ0 D x t x t0 xt ut : (4.28)
T t D1 T t D1
If the first term on the right hand side converges in probability to a finite matrix (as as-
sumed in (4.23)), and the vector of random variables x t u t satisfies a central limit theorem,
then
p d
T .ˇOLSˇ0 / ! N 0; ˙xx1 S0 ˙xx1 , where
(4.29)
T p T !
1X T X
˙xx D plim x t x t0 and S0 D Cov xt ut :
T t D1 T t D1
The last matrix in the covariance matrix does not need to be transposed since it is sym-
metric (since ˙xx is). This general expression is valid for both autocorrelated and het-
eroskedastic residuals—all such features are loaded into the S0 matrix. Note that S0 is
60
p
the variance-covariance matrix of T times a sample average (of the vector of random
variables x t u t ), which can be complicated to specify and to estimate. In simple cases,
we can derive what it is. To do so, we typically need to understand the properties of the
residuals. Are they autocorrelated and/or heteroskedastic? In other cases we will have to
use some kind of “non-parametric” approach to estimate it.
A common approach is to estimate ˙xx by ˙ tTD1 x t x t0 =T and use the Newey-West
estimator of S0 .
This means that all cross terms (involving different observations) drop out and that we
can write
T
1X
S0 D Ex t x t0 Eu2t (4.32)
T tD1
T
1 X2
D E x t x t0 (since u t is iid and 2 D Eu2t ) (4.33)
T t D1
D 2 ˙xx : (4.34)
61
Using this in (4.29) gives
p
Asymptotic CovŒ T .ˇOLS ˇ0 / D ˙xx1 S0 ˙xx1 D ˙xx1 2 ˙xx ˙xx1 D 2 ˙xx1 :
Reference: Greene (2000) 12.2 and Davidson and MacKinnon (1993) 16.2.
This section shows that the classical LS formula for the covariance matrix is valid
even if the errors are heteroskedastic—provided the heteroskedasticity is independent of
the regressors.
The only difference compared with the classical LS assumptions is that u t is now
allowed to be heteroskedastic, but this heteroskedasticity is not allowed to depend on the
moments of x t . This means that (4.32) holds, but (4.33) does not since Eu2t is not the
same for all t.
However, we can still simplify (4.32) a bit more. We assumed that Ex t x t0 and Eu2t
(which can both be time varying) are not related to each other, so we could perhaps multi-
ply Ex t x t0 by ˙ tTD1 Eu2t =T instead of by Eu2t . This is indeed true asymptotically—where
any possible “small sample” relation between Ex t x t0 and Eu2t must wash out due to the
assumptions of independence (which are about population moments).
In large samples we therefore have
T T
! !
1X 2 1X 0
S0 D Eu Ex t x t
T t D1 t T t D1
T T
! !
1X 2 1X
D Eu E x t x t0
T t D1 t T tD1
D ! 2 ˙xx ; (4.35)
where ! 2 is a scalar. This is very similar to the classical LS case, except that ! 2 is
the average variance of the residual rather than the constant variance. In practice, the
estimator of ! 2 is the same as the estimator of 2 , so we can actually apply the standard
LS formulas in this case.
This is the motivation for why White’s test for heteroskedasticity makes sense: if the
heteroskedasticity is not correlated with the regressors, then the standard LS formula is
correct (provided there is no autocorrelation).
62
4.6 Inference
Suppose we want to test the null hypothesis that the s linear restrictions Rˇ0 D r hold,
where R is an s k matrix and r is an s 1 vector. If the null hypothesis is true, then
p d
T .RˇO r/ ! N.0; RVR0 /; (4.37)
since the s linear combinations are linear combinations of random variables with an
asymptotic normal distribution as in (4.37).
Remark 4.12 From the previous remark and Remark (4.9), it follows that if the n 1
d d
vector x ! N.0; ˙/, then x 0 ˙ 1
x ! 2n .
From this remark, it follows that if the null hypothesis, Rˇ0 D r, is true, then Wald
test statistics converges in distribution to a 2s variable
1 d
T .RˇO r/0 RVR0 .RˇO r/ ! 2s :
(4.38)
Values of the test statistics above the x% critical value of the 2s distribution mean that
we reject the null hypothesis at the x% significance level.
p
When there is only one restriction (s D 1), then T .RˇO r/ is a scalar, so the test
can equally well be based on the fact that
p
T .RˇO r/ d
p ! N.0; 1/:
RVR0
In this case, we should reject the null hypothesis if the test statistics is either very low
(negative) or very high (positive). In particular, let ˚./ be the standard normal cumulative
distribution function. We then reject the null hypothesis at the x% significance level if the
test statistics is below xL such that ˚.xL / D .x=2/% or above xH such that ˚.xH / D
1 .x=2/% (that is with .x=2/% of the probability mass in each tail).
63
Example 4.13 (TR2 =.1 R2 / as a test of the regression.) Recall from (4.15)-(4.16) that
R2 D Var
c .yO t / =Var
c .y t / D 1 Varc .uO t / =Var
c .y t /, where yO t and uO t are the fitted value
and residual respectively. We therefore get
To simplify the algebra, assume that both y t and x t are demeaned and that no intercept is
used. (We get the same results, but after more work, if we relax this assumption.) In this
O so we can rewrite the previous equation as
case, yO t D x t0 ˇ,
This is identical to (4.38) when R D Ik and r D 0k1 and the classical LS assumptions
are fulfilled (so V D Var .uO t / ˙xx1 /. The TR2 =.1 R2 / is therefore a 2k distributed
statistics for testing if all the slope coefficients are zero. More generally, the test statistic
is distributed as 2k 1 when there are k 1 slopes and one intercept.
Example 4.14 (F version of the test.) There is also an Fk;T k version of the test in the
previous example: .R2 =k/=Œ.1 R2 /=.T k/. Note that k times an Fk;T k variable
converges to a 2k variable as T k ! 1. This means that the 2k form in the previous
example can be seen as an asymptotic version of the (more common) F form. More
generally, the test statistic ŒR2 =.k 1/=Œ.1 R2 /=.T k/ is distributed as Fk 1;T k
when there are k 1 slopes and one intercept.
To test non-linear restrictions, we can use the delta method which gives the asymptotic
distribution of a function of a random variable.
Fact 4.15 Remark 4.16 (Delta method) Consider an estimator ˇOk1 which satisfies
p d
T ˇO ˇ0 ! N .0; ˝/ ;
q1 D g .ˇ/ ;
64
where g .:/ is has continuous first derivatives. The result is
p h i d
T g ˇO
g .ˇ0 / ! N 0; qq ; where
@g .ˇ0 / @g .ˇ0 /0 @g .ˇ0 /
D 0 ˝ , where is q k:
@ˇ @ˇ @ˇ 0
4.6.2 On F Tests
F tests are sometimes used instead of chi–square tests. However, F tests rely on very spe-
cial assumptions and typically converge to chi–square tests as the sample size increases.
There are therefore few compelling theoretical reasons for why we should use F tests.2
This section demonstrates that point.
65
To use the F test to test s linear restrictions Rˇ0 D r, we need to assume that the
p
small sample distribution of the estimator is normal, T .ˇO ˇ0 / N.0; 2 W /, where
2 is a scalar and W a known matrix. This would follow from an assumption that the
residuals are normally distributed and that we either consider the distribution conditional
on the regressors or that the regressors are deterministic. In this case W D ˙xx1 .
Consider the test statistics
1
F D T .RˇO r/0 RO 2 WR0 .RˇO
r/=s:
This is similar to (4.38), expect that we use the estimated covariance matrix O 2 W instead
of the true 2 W (recall, W is assumed to be known) and that we have divided by the
number of restrictions, s. Multiply and divide this expressions by 2
1
T .RˇO r/0 R 2 WR0 .RˇO
r/=s
F D :
O 2 = 2
The numerator is an 2s variable divided by its degrees of freedom, s. The denominator
can be written O 2 = 2 D ˙.uO t =/2 =T , where uO t are the fitted residuals. Since we just
assumed that u t are iid N.0; 2 /, the denominator is an 2T variable divided by its degrees
of freedom, T . It can also be shown that the numerator and denominator are independent
(essentially because the fitted residuals are orthogonal to the regressors), so F is an Fs;T
variable.
We need indeed very strong assumptions to justify the F distributions. Moreover, as
d
T ! 1, sF ! 2n which is the Wald test—which do not need all these assumptions.
Reference: Greene (2000) 12.3, 13.5 and 9.7; Johnston and DiNardo (1997) 6; and Pindyck
and Rubinfeld (1998) 6, Patterson (2000) 5
LS and IV are still consistent even if the residuals are autocorrelated, heteroskedastic,
and/or non-normal, but the traditional expression for the variance of the parameter esti-
mators is invalid. It is therefore important to investigate the properties of the residuals.
We would like to test the properties of the true residuals, u t , but these are unobserv-
able. We can instead use residuals from a consistent estimator as approximations, since
the approximation error then goes to zero as the sample size increases. The residuals from
66
an estimator are
uO t D y t x t0 ˇO
D x t0 ˇ0 ˇO C u t : (4.39)
4.7.1 Autocorrelation
Let Os be the estimate of the sth autocorrelation coefficient of some variable, for instance,
the fitted residuals. The sampling properties of Os are complicated, but there are several
useful large sample results for Gaussian processes (these results typically carry over to
processes which are similar to the Gaussian—a homoskedastic process with finite 6th
moment is typically enough). When the true autocorrelations are all zero (not 0 , of
course), then for any i and j different from zero
" # " # " #!
p Oi 0 1 0
T !d N ; : (4.40)
Oj 0 0 1
This result can be used to construct tests for both single autocorrelations (t-test or 2 test)
and several autocorrelations at once (2 test).
Example 4.18 (t-test) We want to test the hypothesis that 1 D 0. Since the N.0; 1/
distribution has 5% of the probability mass below -1.65 and another 5% above 1.65, we
p
can reject the null hypothesis at the 10% level if T jO1 j > 1:65. With T D 100, we
p
therefore need jO1 j > 1:65= 100 D 0:165 for rejection, and with T D 1000 we need
67
p
jO1 j > 1:65= 1000 0:0:53.
p
The Box-Pierce test follows directly from the result in (4.40), since it shows that T Oi
p
and T Oj are iid N(0,1) variables. Therefore, the sum of the square of them is distributed
as an 2 variable. The test statistics typically used is
L
X
QL D T Os2 !d L
2
: (4.41)
sD1
Example 4.19 (Box-Pierce) Let O1 D 0:165, and T D 100, so Q1 D 100 0:1652 D
2:72. The 10% critical value of the 21 distribution is 2.71, so the null hypothesis of no
autocorrelation is rejected.
Example 4.20 (Residuals follow an AR(1)process) If u t D 0:9u t 1 C " t , then the true
autocorrelation coefficients are j D 0:9j .
A common test of the serial correlation of residuals from a regression is the Durbin-
Watson test
d D 2 .1 O1 / ; (4.42)
where the upper and lower critical values can be found in tables. (Use 4 d to let
negative autocorrelation be the alternative hypothesis.) This test is typically not useful
when lagged dependent variables enter the right hand side (d is biased towards showing
no autocorrelation). Note that DW tests only for first-order autocorrelation.
68
Example 4.21 (Durbin-Watson.) With O1 D 0:2 we get d D 1:6. For large samples,
the 5% critical value is dlower 1:6, so O1 > 0:2 is typically considered as evidence of
positive autocorrelation.
The fitted residuals used in the autocorrelation tests must be consistent in order to in-
terpret the result in terms of the properties of the true residuals. For instance, an excluded
autocorrelated variable will probably give autocorrelated fitted residuals—and also make
the coefficient estimator inconsistent (unless the excluded variable is uncorrelated with
the regressors). Only when we know that the model is correctly specified can we interpret
a finding of autocorrelated residuals as an indication of the properties of the true residuals.
4.7.2 Heteroskedasticity
White’s test for heteroskedasticity tests the null hypothesis of homoskedasticity against
the kind of heteroskedasticity which can be explained by the levels, squares, and cross
products of the regressors. Let w t be the unique elements in x t ˝ x t , where we have
added a constant to x t if there was not one from the start. Run a regression of the squared
fitted LS residuals on w t
uO 2t D w t0
C " t (4.43)
and test if all elements (except the constant) in
are zero (with a 2 or F test). The
reason for this specification is that if u2t is uncorrelated with x t ˝ x t , then the usual LS
covariance matrix applies.
69
Breusch-Pagan’s test is very similar, except that the vector w t in (4.43) can be any
vector which is thought of as useful for explaining the heteroskedasticity. The null hy-
pothesis is that the variance is constant, which is tested against the alternative that the
variance is some function of w t .
The fitted residuals used in the heteroskedasticity tests must be consistent in order to
interpret the result in terms of the properties of the true residuals. For instance, if some
of the of elements in w t belong to the regression equation, but are excluded, then fitted
residuals will probably fail these tests.
4.7.3 Normality
We often make the assumption of normally distributed errors, for instance, in maximum
likelihood estimation. This assumption can be tested by using the fitted errors. This works
since moments estimated from the fitted errors are consistent estimators of the moments
of the true errors. Define the degree of skewness and excess kurtosis for a variable z t
(could be the fitted residuals) as
T
O 1X
3 D .z t zN /3 =O 3 ; (4.44)
T t D1
T
1X
O4 D .z t zN /4 =O 4 3; (4.45)
T t D1
Remark 4.24 (2 .n/ distribution.) If xi are independent N.0; i2 / variables, then ˙inD1 xi2 =i2
2 .n/.
In a normal distribution, the true values are zero and the test statistics O3 and O4 are
themselves normally distributed with zero covariance and variances 6=T and 24=T , re-
spectively (straightforward, but tedious, to show). Therefore, under the null hypothesis
of a normal distribution, T O32 =6 and T O42 =24 are independent and both asymptotically
distributed as 2 .1/, so the sum is asymptotically a 2 .2/ variable
O 2 O
W D T 3 =6 C 4 =24 !d 2 .2/:
2
(4.46)
70
Histogram of 100 draws from a U(0,1) distribution
0.2
θ3 = −0.14, θ4 = −1.4, W = 8
0.15
0.1
0.05
0
0 0.2 0.4 0.6 0.8 1
Figure 4.1: This figure shows a histogram from 100 draws of iid uniformly [0,1] dis-
tributed variables.
Bibliography
Davidson, J., 2000, Econometric theory, Blackwell Publishers, Oxford.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
71
Average LS estimate of ρ Std of LS estimate of ρ
simulation
0.9 asymptotic
0.1
0.85
0.8
0.05
0.75
0.7 0
0 100 200 300 400 500 0 100 200 300 400 500
Sample size, T Sample size, T
√
T × Std of LS estimate of ρ True model: yt = 0.9yt−1 + ǫt ,
0.8 where ǫt is iid N(0,2)
0.5
0.4
0 100 200 300 400 500
Sample size, T
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
72
Distribution of LS t-stat, T = 5 Distribution of LS t-stat, T = 100
t = (b̂ − 0.9)/Std(b̂)
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
−4 −2 0 2 4 −4 −2 0 2 4
Model: Rt = 0.9ft + ǫt , ǫt = vt − 2,
where vt has a χ22 distribution Probability density functions
73
5 Instrumental Variable Method
Reference: Greene (2000) 9.5 and 16.1-2
Additional references: Hayashi (2000) 3.1-4; Verbeek (2004) 5.1-4; Hamilton (1994) 8.2;
and Pindyck and Rubinfeld (1998) 7
In many cases the law of large numbers applies to both terms on the right hand side. The
first term is typically a matrix with finite elements and the second term is the covariance of
the regressors and the true residuals. This covariance must be zero for LS to be consistent.
74
Suppose the true model is
y t D x t0 ˇ0 C ut : (5.5)
y t D x t0 ˇ C u t ; (5.6)
where y t and x t are proxies for the correct variables (the ones that the model is true for).
We can think of the difference as measurement errors
y t D y t C v ty and (5.7)
x t D x t C v tx ; (5.8)
where the errors are uncorrelated with the true values and the “true” residual ut .
Use (5.7) and (5.8) in (5.5)
0
yt v ty D x t v tx ˇ0 C ut or
y t D x t0 ˇ0 C " t where " t D v tx0 ˇ0 C v ty C ut : (5.9)
Suppose that x t is a measured with error. From (5.8) we see that v tx and x t are corre-
lated, so LS on (5.9) is inconsistent in this case. To make things even worse, measurement
errors in only one of the variables typically affect all the coefficient estimates.
To illustrate the effect of the error, consider the case when x t is a scalar. Then, the
probability limit of the LS estimator of ˇ in (5.9) is
75
since x t and v tx are uncorrelated with ut and with each other. This shows that ˇOLS goes
to zero as the measurement error becomes relatively more volatile compared with the true
value. This makes a lot of sense, since when the measurement error is very large then the
regressor x t is dominated by noise that has nothing to do with the dependent variable.
Suppose instead that only y t is measured with error. This not a big problem since this
measurement error is uncorrelated with the regressor, so the consistency of least squares
is not affected. In fact, a measurement error in the dependent variable is like increasing
the variance in the residual.
Suppose economic theory tells you that the structural form of the m endogenous variables,
y t , and the k predetermined (exogenous) variables, z t , is
1 1
yt D F Gz t C F ut (5.12)
D ˘ z t C " t , with Cov ." t / D ˝: (5.13)
Fj y t C Gj z t D ujt ; (5.14)
where Fj and Gj are the j th rows of F and G, respectively. Suppose the model is nor-
malized so that the coefficient on yjt is one (otherwise, divide (5.14) with this coefficient).
76
Then, rewrite (5.14) as
where zQ t and yQ t are the exogenous and endogenous variables that enter the j th equation,
which we collect in the x t vector to highlight that (5.15) looks like any other linear re-
gression equation. The problem with (5.15), however, is that the residual is likely to be
correlated with the regressors, so the LS estimator is inconsistent. The reason is that a
shock to ujt influences yjt , which in turn will affect some other endogenous variables in
the system (5.11). If any of these endogenous variable are in x t in (5.15), then there is a
correlation between the residual and (some of) the regressors.
Note that the concept of endogeneity discussed here only refers to contemporaneous
endogeneity as captured by off-diagonal elements in F in (5.11). The vector of predeter-
mined variables, z t , could very well include lags of y t without affecting the econometric
endogeneity problem.
Example 5.1 (Supply and Demand. Reference: GR 16, Hamilton 9.1.) Consider the
simplest simultaneous equations model for supply and demand on a market. Supply is
q t D p t C ust ; > 0;
and demand is
q t D ˇp t C ˛A t C udt ; ˇ < 0;
where A t is an observable demand shock (perhaps income). The structural form is there-
fore " #" # " # " #
1
qt 0 ust
C At D :
1 ˇ pt ˛ udt
The reduced form is " # " # " #
qt 11 "1t
D At C :
pt 21 "2t
If we knew the structural form, then we can solve for q t and p t to get the reduced form in
77
terms of the structural parameters
" # " # " #" #
ˇ
qt ˇ
˛ ˇ
ˇ
ust
D 1
At C 1 1
:
pt ˇ
˛ ˇ
ˇ
udt
Example 5.2 (Supply equation with LS.) Suppose we try to estimate the supply equation
in Example 5.1 by LS, that is, we run the regression
q t D p t C " t :
If data is generated by the model in Example 5.1, then the reduced form shows that p t is
correlated with ust , so we cannot hope that LS will be consistent. In fact, when both q t
and p t have zero means, then the probability limit of the LS estimator is
Cov .q t ; p t /
plim O D
Var .p t /
Cov
˛ˇ A t C
ˇ udt ˇ
us ; ˛ A C
1 ˇ udt
ˇ t
ˇ t
1
ud
ˇ t
D
Var
˛ ˇ A t C
1 ˇ udt 1
u
ˇ t
s
;
where the second line follows from the reduced form. Suppose the supply and demand
shocks are uncorrelated. In that case we get
˛2
udt C ˇ 2 Var ust
Var .A t / C Var
.
ˇ /2 .
ˇ /2 .
ˇ /
plim O D ˛2 1
Var u t C 1 2 Var .ust /
d
Var .A t / C
.
ˇ /2 .
ˇ /2 .
ˇ /
2 d
˛ Var .A t / C
Var u t C ˇVar ust
D :
˛ 2 Var .A t / C Var udt C Var .ust /
First, suppose the supply shocks are zero, Var ust D 0, then plim O D
, so we indeed
estimate the supply elasticity, as we wanted. Think of a fixed supply curve, and a demand
curve which moves around. These point of p t and q t should trace out the supply curve. It
is clearly ust that causes a simultaneous equations problem in estimating the supply curve:
ust affects both q t and p t and the latter is the regressor in the supply equation. With no
movements in ust there is no correlation between the shock and the regressor. Second, now
suppose instead that the both demand shocks are zero (both A t D 0 and Var udt D 0).
Then plim O D ˇ, so the estimated value is not the supply, but the demand elasticity. Not
good. This time, think of a fixed demand curve, and a supply curve which moves around.
78
Distribution of LS estimator, T=200 Distribution of LS estimator, T=900
0.88 0.89
0.4 Mean and std:
0.03
0.4 0.01
0.2 0.2
0 0
0.6 0.8 1 0.6 0.8 1
Example 5.3 (A flat demand curve.) Suppose we change the demand curve in Example
5.1 to be infinitely elastic, but to still have demand shocks. For instance, the inverse
demand curve could be p t D A t C uD t . In this case, the supply and demand is no longer
a simultaneous system of equations and both equations could be estimated consistently
with LS. In fact, the system is recursive, which is easily seen by writing the system on
vector form " #" # " # " #
1 0 pt uDt
C At D :
1
qt 0 ust
A supply shock, ust , affects the quantity, but this has no affect on the price (the regressor
in the supply equation), so there is no correlation between the residual and regressor in
the supply equation. A demand shock, uD t , affects the price and the quantity, but since
quantity is not a regressor in the inverse demand function (only the exogenous A t is) there
is no correlation between the residual and the regressor in the inverse demand equation
either.
79
LS, T=200 IV, T=200 ML, T=200
0 0 0
0.6 0.8 1 0.6 0.8 1 0.6 0.8 1
0 0 0
0.6 0.8 1 0.6 0.8 1 0.6 0.8 1
Reference: Greene (2000) 9.5; Hamilton (1994) 8.2; and Pindyck and Rubinfeld (1998)
7.
Consider the linear model
y t D x t0 ˇ0 C u t ; (5.16)
80
Recall that LS minimizes the variance of the fitted residuals, uO t D y t x t0 ˇOLS . The
first order conditions for that optimization problem are
T
1X
0kx1 D xt yt x t0 ˇOLS : (5.17)
T t D1
If x t and u t are correlated, then plim ˇOLS ¤ ˇ0 . The reason is that the probability limit of
the right hand side of (5.17) is Cov.x t ; y t x t0 ˇOLS /, which at ˇOLS D ˇ0 is non-zero, so the
first order conditions (in the limit) cannot be satisfied at the true parameter values. Note
that since the LS estimator by construction forces the fitted residuals to be uncorrelated
with the regressors, the properties of the LS residuals are of little help in deciding if to
use LS or IV.
The idea of the IV method is to replace the first x t in (5.17) with a vector (of similar
size) of some instruments, z t . The identifying assumption of the IV method is that the
instruments are uncorrelated with the residuals (and, as we will see, correlated with the
regressors)
0kx1 D Ez t u t (5.18)
x t0 ˇ0 :
D Ez t y t (5.19)
The intuition is that the linear model (5.16) is assumed to be correctly specified: the
residuals, u t , represent factors which we cannot explain, so z t should not contain any
information about u t .
The sample analogue to (5.19) defines the IV estimator of ˇ as1
T
1X
0kx1 D zt yt x t0 ˇOI V ; or (5.20)
T t D1
1
T T
!
1X 1X
ˇOI V D z t x t0 zt yt : (5.21)
T t D1 T t D1
Remark 5.4 (Probability limit of product) For any random variables yT and xT where
plim yT D a and plim xT D b (a and b are constants), we have plim yT xT D ab.
1
1
In matrix notation where z t0 is the t t h row of Z we have ˇOI V D .Z 0 X=T / .Z 0 Y =T /.
81
To see if the IV estimator is consistent, use (5.16) to substitute for y t in (5.20) and
take the probability limit
T T T
1X 1X 1X
plim 0
z t x t ˇ0 C plim z t u t D plim z t x t0 ˇOI V : (5.22)
T t D1 T t D1 T t D1
Two things are required for consistency of the IV estimator, plim ˇOI V D ˇ0 . First, that
plim ˙ z t u t =T D 0. Provided a law of large numbers apply, this is condition (5.18).
Second, that plim ˙z t x t0 =T has full rank. To see this, suppose plim ˙z t u t =T D 0 is
satisfied. Then, (5.22) can be written
T
!
1X
plim z t x t0 ˇ0 plim ˇOI V D 0: (5.23)
T t D1
If plim ˙z t x t0 =T has reduced rank, then plim ˇOI V does not need to equal ˇ0 for (5.23) to
be satisfied. In practical terms, the first order conditions (5.20) do then not define a unique
value of the vector of estimates. If a law of large numbers applies, then plim ˙z t x t0 =T D
Ez t x t0 . If both z t and x t contain constants (or at least one of them has zero means), then
a reduced rank of Ez t x t0 would be a consequence of a reduced rank of the covariance
matrix of the stochastic elements in z t and x t , for instance, that some of the instruments
are uncorrelated with all the regressors. This shows that the instruments must indeed be
correlated with the regressors for IV to be consistent (and to make sense).
Remark 5.5 (Second moment matrix) Note that Ezx 0 D EzEx 0 C Cov.z; x/. If Ez D 0
and/or Ex D 0, then the second moment matrix is a covariance matrix. Alternatively,
suppose both z and x contain constants normalized to unity: z D Œ1; zQ 0 0 and x D Œ1; xQ 0 0
where zQ and xQ are random vectors. We can then write
" # " #
1 h i 0 0
Ezx 0 D 1 ExQ 0 C
EQz Q x/
0 Cov.z; Q
" #
1 ExQ 0
D :
EQz EQz ExQ 0 C Cov.z;
Q x/
Q
For simplicity, suppose zQ and xQ are scalars. Then Ezx 0 has reduced rank if Cov.z;
Q x/
Q D 0,
0
Q x/
since Cov.z; Q is then the determinant of Ezx . This is true also when zQ and xQ are vectors.
82
Example 5.6 (Supply equation with IV.) Suppose we try to estimate the supply equation
in Example 5.1 by IV. The only available instrument is A t , so (5.21) becomes
1
T T
!
1X 1X
OI V D At pt At qt ;
T t D1 T t D1
1
plim
OI V D Cov .A t ; p t / Cov .A t ; q t / ;
since all variables have zero means. From the reduced form in Example 5.1 we see that
1
Cov .A t ; p t / D ˛Var .A t / and Cov .A t ; q t / D ˛Var .A t / ;
ˇ
ˇ
so
1
1
plim
OI V D ˛Var .A t / ˛Var .A t /
ˇ
ˇ
D
:
Little is known about the finite sample distribution of the IV estimator, so we focus on the
asymptotic distribution—assuming the IV estimator is consistent.
d
Remark 5.7 If xT ! x (a random variable) and plim QT D Q (a constant matrix),
d
then QT xT ! Qx.
T
! 1 p T
p 1X T X
T .ˇOI V ˇ0 / D z t x t0 zt ut : (5.25)
T tD1 T tD1
83
If the first term on the right hand side converges in probability to a finite matrix (as as-
sumed in in proving consistency), and the vector of random variables z t u t satisfies a
central limit theorem, then
p d
T .ˇOI Vˇ0 / ! N 0; ˙zx1 S0 ˙xz1 , where
(5.26)
T p T !
1X T X
˙zx D z t x t0 and S0 D Cov zt ut :
T t D1 T t D1
0
The last matrix in the covariance matrix follows from .˙zx1 /0 D .˙zx / 1 D ˙xz1 . This
general expression is valid for both autocorrelated and heteroskedastic residuals—all such
features are loaded into the S0 matrix. Note that S0 is the variance-covariance matrix of
p
T times a sample average (of the vector of random variables x t u t ).
Example 5.8 (Choice of instrument in IV, simplest case) Consider the simple regression
y t D ˇ1 x t C u t :
5.4.2 2SLS
Suppose now that we have more instruments, z t , than regressors, x t . The IV method does
not work since, there are then more equations than unknowns in (5.20). Instead, we can
use the 2SLS estimator. It has two steps. First, regress all elements in x t on all elements
84
in z t with LS. Second, use the fitted values of x t , denoted xO t , as instruments in the IV
method (use xO t in place of z t in the equations above). In can be shown that this is the
most efficient use of the information in z t . The IV is clearly a special case of 2SLS (when
z t has the same number of elements as x t ).
It is immediate from (5.22) that 2SLS is consistent under the same condiditons as
IV since xO t is a linear function of the instruments, so plim TtD1 xO t u t =T D 0, if all the
P
Cov .p t ; A t / 1
p t D ıA t C u t ) plim ıOLS D D ˛:
Var .A t / ˇ
Construct the predicted values as
pO t D ıOLS A t :
Second, regress q t on pO t
b
plim Cov q t ; ıOLS A t
plim
O2SLS D
c ıOLS A t
plim Var
Cov .q t ; A t / plim ıOLS
D
Var .A t / plim ıOLS
2
h ih i
1
ˇ
˛Var .A t / ˇ
˛
D h i2
Var .A t / ˇ 1
˛
D :
Note that the trick here is to suppress some the movements in p t . Only those movements
85
that depend on A t (the observable shifts of the demand curve) are used. Movements in p t
which are due to the unobservable demand and supply shocks are disregarded in pO t . We
know from Example 5.2 that it is the supply shocks that make the LS estimate of the supply
curve inconsistent. The IV method suppresses both them and the unobservable demand
shock.
Proof. Consider the estimator ˇOc C .1 / ˇOe , which is clearly consistent under H0
since both ˇOc and ˇOe are. The asymptotic variance of this estimator is
2 Var ˇOc C .1 /2 Var ˇOe C 2 .1 / Cov ˇOc ; ˇOe ;
which is minimized at D 0 (since ˇOe is asymptotically efficient). The first order condi-
tion with respect to
2Var ˇOc 2 .1 / Var ˇOe C 2 .1 2/ Cov ˇOc ; ˇOe D 0
86
This means that we can write
O O O
Var ˇe ˇc D Var ˇe C Var ˇc O O O
2Cov ˇe ; ˇc
D Var ˇOc Var ˇOe : (5.28)
We can use this to test, for instance, if the estimates from least squares (ˇOe , since LS
is efficient if errors are iid normally distributed) and instrumental variable method (ˇOc ,
since consistent even if the true residuals are correlated with the regressors) are the same.
In this case, H0 is that the true residuals are uncorrelated with the regressors.
All we need for this test are the point estimates and consistent estimates of the vari-
ance matrices. Testing one of the coefficient can be done by a t test, and testing all the
parameters by a 2 test
0 1
ˇOe ˇOc Var ˇOe ˇOc ˇOe ˇOc 2 .j / ; (5.29)
where j equals the number of regressors that are potentially endogenous or measured
with error. Note that the covariance matrix in (5.28) and (5.29) is likely to have a reduced
rank, so the inverse needs to be calculated as a generalized inverse.
When we use 2SLS, then we can test if instruments affect the dependent variable only
via their correlation with the regressors. If not, something is wrong with the model since
some relevant variables are excluded from the regression.
Bibliography
Davidson, J., 2000, Econometric theory, Blackwell Publishers, Oxford.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
87
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
88
6 Simulating the Finite Sample Properties
Reference: Greene (2000) 5.3 and Horowitz (2001)
Additional references: Cochrane (2001) 15.2; Davidson and MacKinnon (1993) 21; Davi-
son and Hinkley (1997); Efron and Tibshirani (1993) (bootstrapping, chap 9 in particular);
and Berkowitz and Kilian (2000) (bootstrapping in time series models)
We know the small sample properties of regression coefficients in linear models with
fixed regressors and iid normal error terms. Monte Carlo simulations and bootstrapping
are two common techniques used to understand the small sample properties when these
conditions are not satisfied.
How they should be implemented depends crucially on the properties of the model
and data: if the residuals are autocorrelated, heteroskedastic, or perhaps correlated across
regressions equations. These notes summarize a few typical cases.
The need for using Monte Carlos or bootstraps varies across applications and data
sets. For a case where it is not needed, see Figure 6.1.
Monte Carlo simulations is essentially a way to generate many artificial (small) samples
from a parameterized model and then estimating the statistic on each of those samples.
The distribution of the statistic is then used as the small sample distribution of the estima-
tor.
The following is an example of how Monte Carlo simulations could be done in the
special case of a linear model with a scalar dependent variable
y t D x t0 ˇ C u t ; (6.1)
where u t is iid N.0; 2 / and x t is stochastic but independent of u t ˙s for all s. This means
that x t cannot include lags of y t .
Suppose we want to find the small sample distribution of a function of the estimate,
89
alpha t LS t NW t boot
US industry portfolios, 1970:1-2012:12 all NaN NaN NaN NaN
15 A (NoDur) 3.62 2.72 2.71 2.70
Mean excess return
90
sample of xQ t to calculate an artificial sample yQ t for t D 1; : : : ; T from
yQ t D xQ t0 ˇ C uQ t ; (6.2)
3. Repeat the previous steps N (3000, say) times. The more times you repeat, the
better is the approximation of the small sample distribution.
Remark 6.1 (Generating N.; ˙/ random numbers) Suppose you want to draw an n 1
vector " t of N.; ˙/ variables. Use the Cholesky decomposition to calculate the lower
triangular P such that ˙ D PP 0 . Draw u t from an N.0; I / distribution, and define
" t D C P u t . Note that Cov." t / D E P u t u0t P 0 D PIP 0 D ˙ .
If x t contains lags of y t , then we must set up the simulations so that feature is preserved in
every artificial sample that we create. For instance, suppose x t includes y t 1 and another
91
Average LS estimate of ρ Std of LS estimate of ρ
simulation
0.9 asymptotic
0.1
0.85
0.8
0.05
0.75
0.7 0
0 100 200 300 400 500 0 100 200 300 400 500
Sample size, T Sample size, T
√
T × Std of LS estimate of ρ True model: yt = 0.9yt−1 + ǫt ,
0.8 where ǫt is iid N(0,2)
0.5
0.4
0 100 200 300 400 500
Sample size, T
Figure 6.2: Results from a Monte Carlo experiment of LS estimation of the AR coeffi-
cient.
vector z t of variables which are independent of u t ˙s for all s. We can then generate an
artificial sample as follows. First, create a sample zQ t for t D 1; : : : ; T by some time series
model (for instance, a VAR) or by taking the observed sample itself. Second, observation
t of .xQ t ; yQ t / is generated as
" #
yQ t 1
xQ t D and yQ t D xQ t0 ˇ C uQ t for t D 1; : : : ; T (6.3)
zQ t
We clearly need the initial value yQ0 to start up the artificial sample—and then the rest of
the sample (t D 1; 2; :::) is calculated recursively.
For instance, for a VAR(2) model (where there is no z t )
y t D A1 y t 1 C A2 y t 2 C ut ; (6.4)
92
Distribution of LS estimator, T = 25 Distribution of LS estimator, T = 100
Mean and std: Mean and std:
0.1 0.74 0.16 0.1 0.86 0.06
0.05 0.05
0 0
0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1
Figure 6.3: Results from a Monte Carlo experiment of LS estimation of the AR coeffi-
cient.
the procedure is straightforward. First, estimate the model on data and record the esti-
mates (A1 ; A2 ; Var.u t /). Second, draw a new time series of residuals, uQ t for t D 1; : : : ; T
and construct an artificial sample recursively (first t D 1, then t D 2 and so forth) as
yQ t D A1 yQ t 1 C A2 yQ t 2 C uQ t : (6.5)
(This requires some starting values for y 1 and y0 .) Third, re-estimate the model on the
the artificial sample, yQ t for t D 1; : : : ; T .
It is straightforward to sample the errors from other distributions than the normal, for in-
stance, a student-t distribution. Equipped with uniformly distributed random numbers,
you can always (numerically) invert the cumulative distribution function (cdf) of any
distribution to generate random variables from any distribution by using the probability
transformation method. See Figure 6.4 for an example.
Remark 6.2 Let X U.0; 1/ and consider the transformation Y D F 1 .X/, where
F 1 ./ is the inverse of a strictly increasing cumulative distribution function F , then Y
has the cdf F .
93
Distribution of LS t-stat, T = 5 Distribution of LS t-stat, T = 100
t = (b̂ − 0.9)/Std(b̂)
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
−4 −2 0 2 4 −4 −2 0 2 4
Model: Rt = 0.9ft + ǫt , ǫt = vt − 2,
where vt has a χ22 distribution Probability density functions
Figure 6.4: Results from a Monte Carlo experiment with thick-tailed errors.
Example 6.3 The exponential cdf is x D 1 exp. y/ with inverse y D ln .1 x/ =.
Draw x from U.0:1/ and transform to y to get an exponentially distributed variable.
It is more difficult to handle non-iid errors, like those with autocorrelation and het-
eroskedasticity. We then need to model the error process and generate the errors from that
model.
If the errors are autocorrelated, then we could estimate that process from the fitted
errors and then generate artificial samples of errors (here by an AR(2))
uQ t D a1 uQ t 1 C a2 uQ t 2 C "Q t : (6.6)
94
However, this specification does not account for any link between the volatility and the
regressors (squared)—as tested for by White’s test. This would invalidate the usual OLS
standard errors and therefore deserves to be taken seriously. A simple, but crude, approach
is to generate residuals from a N.0; t2 ) process, but where t2 is approximated by the
fitted values from
"2t D c 0 w t C t ; (6.8)
where w t include the squares and cross product of all the regressors.
6.2 Bootstrapping
yQ t D x t0 ˇ C uQ t ; (6.9)
where uQ t is drawn (with replacement) from the fitted residual and where ˇ is the
point estimate.
.yQ3 ; xQ 3 / .x30 ˇ0 C u2 ; x3 /
95
The approach in (6.9) works also when y t is a vector of dependent variables—and
will then help retain the cross-sectional correlation of the residuals.
When x t contains lagged values of y t , then we have to modify the approach in (6.9) since
uQ t can become correlated with x t . For instance, if x t includes y t 1 and we happen to
sample uQ t D u t 1 , then we get a non-zero correlation. The easiest way to handle this
is as in the Monte Carlo simulations in (6.3), but where uQ t are drawn (with replacement)
from the sample of fitted residuals. The same carries over to the VAR model in (6.4)–(6.5).
Suppose now that the errors are heteroskedastic, but serially uncorrelated. If the het-
eroskedasticity is unrelated to the regressors, then we can still use (6.9).
On contrast, if the heteroskedasticity is related to the regressors, then the traditional LS
covariance matrix is not correct (this is the case that White’s test for heteroskedasticity
tries to identify). It would then be wrong to pair x t with just any uQ t D us since that
destroys the relation between x t and the variance of the residual.
An alternative way of bootstrapping can then be used: generate the artificial sample
by drawing (with replacement) pairs .ys ; xs /, that is, we let the artificial pair in t be
.yQ t ; xQ t / D .xs0 ˇ0 C us ; xs / for some random draw of s so we are always pairing the
residual, us , with the contemporaneous regressors, xs . Note that we are always sampling
with replacement—otherwise the approach of drawing pairs would be to just re-create the
original data set.
This approach works also when y t is a vector of dependent variables.
.yQ3 ; xQ 3 / .x30 ˇ0 C u3 ; x3 /
It could be argued (see, for instance, Davidson and MacKinnon (1993)) that bootstrap-
ping the pairs .ys ; xs / makes little sense when xs contains lags of ys , since the random
sampling of the pair .ys ; xs / destroys the autocorrelation pattern on the regressors.
96
6.2.4 Autocorrelated Errors
It is quite hard to handle the case when the errors are serially dependent, since we must
the sample in such a way that we do not destroy the autocorrelation structure of the data.
A common approach is to fit a model for the residuals, for instance, an AR(1), and then
bootstrap the (hopefully iid) innovations to that process.
Another approach amounts to resampling blocks of data. For instance, suppose the
sample has 10 observations, and we decide to create blocks of 3 observations. The first
block is .uO 1 ; uO 2 ; uO 3 /, the second block is .uO 2 ; uO 3 ; uO 4 /, and so forth until the last block,
.uO 8 ; uO 9 ; uO 10 /. If we need a sample of length 3 , say, then we simply draw of those
block randomly (with replacement) and stack them to form a longer series. To handle
end point effects (so that all data points have the same probability to be drawn), we also
create blocks by “wrapping” the data around a circle. In practice, this means that we add
a the following blocks: .uO 10 ; uO 1 ; uO 2 / and .uO 9 ; uO 10 ; uO 1 /. The length of the blocks should
clearly depend on the degree of autocorrelation, but T 1=3 is sometimes recommended as
a rough guide. An alternative approach is to have non-overlapping blocks. See Berkowitz
and Kilian (2000) for some other approaches.
See Figures 6.5–6.6 for an illustration.
There are many other ways to do bootstrapping. For instance, we could sample the re-
gressors and residuals independently of each other and construct an artificial sample of
the dependent variable yQ t D xQ t0 ˇO C uQ t . This clearly makes sense if the residuals and
regressors are independent of each other and errors are iid. In that case, the advantage of
this approach is that we do not keep the regressors fixed.
Bibliography
Berkowitz, J., and L. Kilian, 2000, “Recent developments in bootstrapping time series,”
Econometric-Reviews, 19, 1–48.
Cochrane, J. H., 2001, Asset pricing, Princeton University Press, Princeton, New Jersey.
97
Std of LS slope under autocorrelation Std of LS slope under autocorrelation
0.1 κ = −0.9 0.1 κ = 0
OLS formula
Newey-West
Simulated
Bootstrapped
0.05 0.05
0 0
−0.5 0 0.5 −0.5 0 0.5
ρ (autocorrelation of residual) ρ (autocorrelation of residual)
Model: yt = 0.9xt + ǫt ,
Std of LS slope under autocorrelation where ǫt = ρǫt−1 + ξt , ξt is iid N
0.1 κ = 0.9 xt = κxt−1 + ηt , ηt is iid N
Davison, A. C., and D. V. Hinkley, 1997, Bootstrap methods and their applications, Cam-
bridge University Press.
Efron, B., and R. J. Tibshirani, 1993, An introduction to the bootstrap, Chapman and Hall,
New York.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Horowitz, J. L., 2001, “The Bootstrap,” in J.J. Heckman, and E. Leamer (ed.), Handbook
of Econometrics . , vol. 5, Elsevier.
98
Std of LS intercept under autocorrelation Std of LS intercept under autocorrelation
0.1 κ = −0.9 0.1 κ = 0
0.05 0.05
OLS formula
Newey-West
Simulated
Bootstrapped
0 0
−0.5 0 0.5 −0.5 0 0.5
ρ (autocorrelation of residual) ρ (autocorrelation of residual)
Model: yt = 0.9xt + ǫt ,
Std of LS intercept under autocorrelation where ǫt = ρǫt−1 + ξt , ξt is iid N
0.1 κ = 0.9 xt = κxt−1 + ηt , ηt is iid N
99
7 GMM
References: Greene (2000) 4.7 and 11.5-6
Additional references: Hayashi (2000) 3-4; Verbeek (2004) 5; Hamilton (1994) 14; Ogaki
(1993), Johnston and DiNardo (1997) 10; Harris and Matyas (1999); Pindyck and Rubin-
feld (1998) Appendix 10.1; Cochrane (2001) 10-11
Let m .x t / be a k 1 vector valued continuous function of a stationary process, and let the
probability limit of the mean of m .:/ be a function
.:/ of a k 1 vector ˇ of parameters.
We want to estimate ˇ. The method of moments (MM, not yet generalized to GMM)
estimator is obtained by replacing the probability limit with the sample mean and solving
the system of k equations
T
1X
m .x t /
.ˇ/ D 0k1 (7.1)
T t D1
Example 7.1 (Moment conditions for variances and covariance) Suppose the series x t
and y t have zero means. The following moment conditions define the traditional variance
and covariance estimators
1 XT
x2 xx D 0
T t D1 t
1 XT
y2 yy D 0
T tD1 t
1 XT
xt yt xy D 0:
T t D1
100
It does not matter if the parameterers are estimated separately or jointly. In contrast, if
we want the correlation, xy , instead of the covariance, then we change the last moment
condition to
1 XT p p
x t y t xy xx yy D 0;
T t D1
E .y t y t 1 / D E Œ. t C t 1 / . t 1 C t 2 / D 2 :
GMM extends MM by allowing for more orthogonality conditions than parameters. This
could, for instance, increase efficiency and/or provide new aspects which can be tested.
Many (most) traditional estimation methods, like LS, IV, and MLE are special cases
of GMM. This means that the properties of GMM are very general, and therefore fairly
difficult to prove.
101
from which we want to estimate the k 1 (k q) vector of parameters, ˇ. The true
values are ˇ0 . We assume that w t is a stationary and ergodic (vector) process (otherwise
the sample means does not converge to anything meaningful as the sample size increases).
The sample averages, or “sample moment conditions,” evaluated at some value of ˇ, are
T
1X
N
m.ˇ/ D m.w t ; ˇ/: (7.3)
T t D1
The sample average m N .ˇ/ is a vector of functions of random variables, so they are ran-
dom variables themselves and depend on the sample used. It will later be interesting to
N .ˇ/. Note that m.ˇ
calculate the variance of m N 1 / and m.ˇ
N 2 / are sample means obtained
by using two different parameter vectors, but on the same sample of data.
Example 7.3 (Moments conditions for IV/2SLS.) Consider the linear model y t D x t0 ˇ0 C
u t , where x t and ˇ are k 1 vectors. Let z t be a q 1 vector, with q k. The moment
conditions and their sample analogues are
T
1X
0q1 D Ez t u t D EŒz t .y t x t0 ˇ0 /; N .ˇ/ D
and m z t .y t x t0 ˇ/;
T t D1
(or Z 0 .Y Xˇ/=T in matrix form). Let q D k to get IV; let z t D x t to get LS.
Example 7.4 (Moments conditions for MLE.) The maximum likelihood estimator maxi-
mizes the log likelihood function, T1 ˙ tTD1 ln L .w t I ˇ/, which requires T1 ˙ tTD1 @ ln L .w t I ˇ/ =@ˇ D
0. A key regularity condition for the MLE is that E@ ln L .w t I ˇ0 / =@ˇ D 0, which is just
like a GMM moment condition.
102
The condition E.ujz/ D 0 then implies Cov.z; u/ D 0. Recall that Cov.z; u/ D Ezu EzEu,
and that E.ujz/ D 0 implies that Eu D 0 (by iterated expectations). We therefore get that
" #
Cov .z; u/ D 0
E .ujz/ D 0 ) ) Euz D 0:
Eu D 0
Example 7.6 (Euler equation for optimal consumption.) The standard Euler equation
for optimal consumption choice which with isoelastic utility U .C t / D C t1
= .1
/ is
C t C1
ˇ
ˇ
E R t C1 ˇ 1ˇˇ ˝ t D 0;
Ct
where R t C1 is a gross return on an investment and ˝ t is the information set in t . Let
z t 2 ˝ t , for instance asset returns or consumption t or earlier. The Euler equation then
implies
C t C1
E R t C1 ˇ z t z t D 0:
Ct
Let z t D .z1t ; :::; znt /0 , and define the new (unconditional) moment conditions as
2 3
u1 .x t ; ˇ/z1t
6 7
6 u1 .x t ; ˇ/z2t 7
::
6 7
:
6 7
6 7
6 7
m.w t ; ˇ/ D u.x t ; ˇ/ ˝ z t D 6 6 u 1 .x t ; ˇ/znt
7
7 ; (7.5)
6 u2 .x t ; ˇ/z1t 7
6 7
6 :: 7
:
6 7
4 5
um .x t ; ˇ/znt q1
This a set of unconditional moment conditions—just as in (7.2). The sample moment con-
ditions (7.3) are therefore valid also in the conditional case, although we have to specify
m.w t ; ˇ/ as in (7.5).
Note that the choice of instruments is often arbitrary: it often amounts to using only
a subset of the information variables. GMM is often said to be close to economic theory,
103
but it should be admitted that economic theory sometimes tells us fairly little about which
instruments, z t , to use.
Example 7.7 (Euler equation for optimal consumption, continued) The orthogonality
conditions from the consumption Euler equations in Example 7.6 are highly non-linear,
and theory tells us very little about how the prediction errors are distributed. GMM has
the advantage of using the theoretical predictions (moment conditions) with a minimum
of distributional assumptions. The drawback is that it is sometimes hard to tell exactly
which features of the (underlying) distribution that are tested.
N
where m.ˇ/ is the sample average of m.w t ; ˇ/ given by (7.3), and where W is some
q q symmetric positive definite weighting matrix. (We will soon discuss a good choice
of weighting matrix.) There are k parameters in ˇ to estimate, and we have q moment
N
conditions in m.ˇ/. We therefore have q k overidentifying moment restrictions.
With q D k the model is exactly identified (as many equations as unknowns), and it
should be possible to set all q sample moment conditions to zero by a choosing the k D q
parameters. It is clear that the choice of the weighting matrix has no effect in this case
N ˇ/
since m. O D 0 at the point estimates ˇ.
O
y t D x t ˇ0 C u t ; (7.9)
104
where y t and x t are zero mean scalars. The moment condition and loss function are
T
1X
N .ˇ/ D
m x t .y t x t ˇ/ and
T t D1
" T
#2
1X
J DW x t .y t x t ˇ/ ;
T t D1
Example 7.9 (IV/2SLS method continued.) From Example 7.3, we note that the loss func-
tion for the IV/2SLS method is
T
" #0 " T
#
0 1X 1X
N
m.ˇ/ N
W m.ˇ/ D z t .y t x t0 ˇ/ W z t .y t x t0 ˇ/ :
T t D1 T t D1
When q D k, then the model is exactly identified, so the estimator could actually be found
by setting all moment conditions to zero. We then get the IV estimator
T
1X
0D z t .y t x t0 ˇOI V / or
T tD1
1
T T
!
1X 1X
ˇOI V D z t x t0 zt yt
T t D1 T t D1
D ˙O zx1 ˙O zy ;
where ˙O zx D ˙ tD1 T
z t x t0 =T and similarly for the other second moment matrices. Let
z t D x t to get LS
ˇOLS D ˙O xx1 ˙O xy :
Remark 7.10 (Matrix differentiation of non-linear functions.) Let the vector yn1 be a
function of the vector xm1
2 3 2 3
y1 f1 .x/
6 : 7
6 :: 7 D f .x/ D 6 :::
6 7
7:
4 5 4 5
yn fn .x/
105
Then, @y=@x 0 is an n m matrix
2 3 2 3
@f1 .x/ @f1 .x/ @f1 .x/
@x 0 @x1
@xm
@y 6 :
6 ::
7 6 :
7 D 6 :: :: 7
D : 7:
@x 0 4
@f
5 4 5
1 .x/ @fn .x/ @fn .x/
@x 0 @x1
@xm
(Note that the notation implies that the derivatives of the first element in y, denoted y1 ,
with respect to each of the elements in x 0 are found in the first row of @y=@x 0 . A rule to
help memorizing the format of @y=@x 0 : y is a column vector and x 0 is a row vector.)
Remark 7.12 As a special case of the previous remark y D z 0 x where both z and x are
vectors. Then @ .z 0 x/ =@x 0 D z 0 (since z 0 plays the role of A).
Remark 7.13 (Matrix differentiation of quadratic forms.) Let xn1 , f .x/m1 , and Amm
symmetric. Then
@f .x/0 Af .x/ @f .x/ 0
D2 Af .x/ :
@x @x 0
Remark 7.14 If f .x/ D x, then @f .x/ =@x 0 D I , so @ .x 0 Ax/ =@x D 2Ax.
The k first order conditions for minimizing the GMM loss function in (7.8) with re-
spect to the k parameters are that the partial derivatives with respect to ˇ equal zero at the
O
estimate, ˇ,
N ˇ/
@m. O 0 W m. O
N ˇ/
0k1 D
@ˇ
30 2
N 1 .ˇO / N 1 .ˇO /
2 32 3
@m @m O
W11 W1q m N 1 .ˇ/
6 : @ˇ1 ::
@ˇk
7 6 :: :: :: 76 :: 7
6 :: : 7 6 : : : 76 : 7
D6 : 7 (with ˇOk1 );
6 7 6 76 7
6 :: :: 7 6 :: ::: :: 76 ::
4 : 7
5
6
4 : : 76
54 : 7
5
@mN q .ˇO / N q .ˇO /
@m O
@ˇ1
@ˇk
W1q Wqq m N q .ˇ/
(7.10)
!0
N ˇ/
@m. O
D W m. O :
N ˇ/ (7.11)
@ˇ 0 „ƒ‚…„ƒ‚…
„ ƒ‚ … qq q1
kq
106
We can solve for the GMM estimator, ˇ,O from (7.11). This set of equations must often be
solved by numerical methods, except in linear models (the moment conditions are linear
functions of the parameters) where we can find analytical solutions by matrix inversion.
Example 7.15 (First order conditions of simple linear regression.) The first order condi-
tions of the loss function in Example 7.8 is
" T
#2
d 1X O
0D W x t .y t x t ˇ/
dˇ T t D1
T T
" # " #
1X 2 1X O ; or
D x W x t .y t x t ˇ/
T t D1 t T t D1
T
! 1 T
O 1 X
2 1X
ˇD x xt yt :
T t D1 t T t D1
Example 7.16 (First order conditions of IV/2SLS.) The first order conditions correspond-
ing to (7.11) of the loss function in Example 7.9 (when q k) are
" #0
N ˇ/
@m. O
0k1 D W m. N ˇ/ O
@ˇ 0
" T
#0 T
@ 1X 0 O 1X O
D 0
z t .y t x t ˇ/ W z t .y t x t0 ˇ/
@ˇ T tD1 T tD1
" T
#0 T
1X 0 1X O
D zt xt W z t .y t x t0 ˇ/
T t D1 T tD1
D ˙O xz W .˙O zy O
˙O zx ˇ/:
When q D k, then the first order conditions can be premultiplied with .˙O xz W / 1 , since
˙O xz W is an invertible k k matrix in this case, to give
0k1 D ˙O zy O so ˇOI V D ˙O 1 ˙O zy :
˙O zx ˇ, zx
This shows that the first order conditions are just the same as the sample moment condi-
107
tions, which can be made to hold exactly since there are as many parameters as there are
equations.
We know very little about the general small sample properties, including bias, of GMM.
We therefore have to rely either on simulations (Monte Carlo or bootstrap) or on the
asymptotic results. This section is about the latter.
GMM estimates are typically consistent and normally distributed, even if the series
m.w t ; ˇ/ in the moment conditions (7.3) are serially correlated and heteroskedastic—
provided w t is a stationary and ergodic process. The reason is essentially that the esti-
mators are (at least as a first order approximation) linear combinations of sample means
which typically are consistent (LLN) and normally distributed (CLT). More about that
later. The proofs are hard, since the GMM is such a broad class of estimators. This
section discusses, in an informal way, how we can arrive at those results.
7.5.1 Consistency
Sample moments are typically consistent, so plim m .ˇ/ D E m.w t ; ˇ/. This must hold
at any parameter vector in the relevant space (for instance, those inducing stationarity and
variances which are strictly positive). Then, if the moment conditions (7.2) are true only at
the true parameter vector, ˇ0 , (otherwise the parameters are “unidentified”) and that they
are continuous in ˇ, then GMM is consistent. The idea is thus that GMM asymptotically
solves
0q1 D plim m.
N ˇ/ O
O
D E m.w t ; ˇ/;
Remark 7.17 (Slutsky’s theorem.) If fxT g is a sequence of random matrices such that
plim xT D x and g.xT / a continuous function, then plim g.xT / D g.x/.
108
Example 7.18 (Consistency of 2SLS.) By using y t D x t0 ˇ0 Cu t , the first order conditions
in Example 7.16 can be rewritten
T
O 1X O
0k1 D ˙xz W z t .y t x t0 ˇ/
T t D1
T
1X h i
D ˙O xz W z t u t C x t0 ˇ0 ˇO
T t D1
O O O O
D ˙xz W ˙zu C ˙xz W ˙zx ˇ0 O
ˇ :
where we use the definition of m N .ˇ0 / in (7.3). (To estimate S0 it is important to recognize
that it is a scaled sample average.) Let R .s/ be the q q covariance (matrix) of the vector
109
m.w t ; ˇ0 / with the vector m.w t 2 ; ˇ0 /
In practice, we often estimate this by using the Newey-West estimator (or something
similar).
Second, let D0 (a q k matrix) denote the probability limit of the gradient of the
sample moment conditions with respect to the parameters, evaluated at the true parameters
N 0/
@m.ˇ
D0 D plim , where (7.16)
@ˇ 0
2 3
@m
N 1 .ˇ / @m
N 1 .ˇ /
@ˇ1
@ˇk
6 : :: 7
N 0/ 6
@m.ˇ 6 :: : 7
D 7 at the true ˇ vector. (7.17)
7
6 : ::
@ˇ 0 6 :: : 7
4 5
@m
N q .ˇ / @m
N q .ˇ /
@ˇ1
@ˇk
W D S0 1 : (7.18)
It can be shown that this choice of weighting matrix gives the asymptotically most ef-
ficient estimator for a given set of orthogonality conditions. For instance, in 2SLS, this
means a given set of instruments and (7.18) then shows only how to use these instruments
in the most efficient way. Of course, another set of instruments might be better (in the
O
sense of giving a smaller Cov(ˇ)).
With the definitions in (7.12) and (7.16) and the choice of weighting matrix in (7.18)
and the added assumption that the rank of D0 equals k (number of parameters) then we
110
can show (under fairly general conditions) that
p d 1
T .ˇO ˇ0 / ! N.0k1 ; V /, where V D D00 S0 1 D0
: (7.19)
This holds also when the model is exactly identified, so we really do not use any weighting
matrix.
To prove this note the following.
Remark 7.19 (Continuous mapping theorem.) Let the sequences of random matrices
d p
fxT g and fyT g, and the non-random matrix faT g be such that xT ! x, yT ! y,
and aT ! a (a traditional limit). Let g.xT ; yT ; aT / be a continuous function. Then
d
g.xT ; yT ; aT / ! g.x; y; a/. Either of yT and aT could be irrelevant in g. (See Mittel-
hammer (1996) 5.3.)
p
Example 7.20 For instance, the sequences in Remark 7.19 could be xT D T ˙ tTD w t =T ,
the scaled sample average of a random variable w t ; yT D ˙ tTD w t2 =T , the sample second
moment; and aT D ˙ tTD1 0:7t .
d
Remark 7.21 From the previous remark: if xT ! x (a random variable) and plim QT D
d
Q (a constant matrix), then QT xT ! Qx.
Proof. (The asymptotic distribution (7.19). Sketch of proof.) This proof is essentially
an application of the delta rule. By the mean-value theorem the sample moment condition
evaluated at the GMM estimate, ˇ, O is
O D m.ˇ N 1/ O
@m.ˇ
N ˇ/
m. N 0/ C .ˇ ˇ0 / (7.20)
@ˇ 0
for some values ˇ1 between ˇO and ˇ0 . (This point is different for different elements in
N Premultiply with Œ@m.
m.) O
N ˇ/=@ˇ 0 0
W . By the first order condition (7.11), the left hand
side is then zero, so we have
!0 !0
N ˇ/
@m. O N ˇ/
@m. O N 1/ O
@m.ˇ
0k1 D 0
N 0/ C
W m.ˇ 0
W .ˇ ˇ0 /: (7.21)
@ˇ @ˇ @ˇ 0
111
p
Multiply with T and solve as
" !0 # 1 !0
p N ˇ/
@m. O N 1/
@m.ˇ N ˇ/
@m. O p
T ˇO ˇ0 D W W N 0 /:
T m.ˇ (7.22)
@ˇ 0 @ˇ 0 @ˇ 0
„ ƒ‚ …
If
N ˇ/
@m. O N 0/
@m.ˇ N 1/
@m.ˇ
plim 0
D 0
D D0 ; then plim D D0 ;
@ˇ @ˇ @ˇ 0
O Then
since ˇ1 is between ˇ0 and ˇ.
1
D00 WD0 D00 W:
plim D (7.23)
p p
The last term in (7.22), T m.ˇN 0 /, is T times a vector of sample averages, so by a CLT
it converges in distribution to N.0; S0 /, where S0 is defined as in (7.12). By the rules of
limiting distributions (see Remark 7.19) we then have that
p d
T ˇO ˇ0 ! plim something that is N .0; S0 / ; that is,
p d
O
T ˇ ˇ0 ! N 0k1 ; .plim /S0 .plim 0 / :
112
vector. In practice, this means plugging in the point estimates into the sample moment
conditions and calculate the derivatives with respect to parameters (for instance, by a
numerical method).
Similarly, S0 in (7.13) can be estimated by, for instance, Newey-West’s estimator of
p
CovŒ T m. O once again using the point estimates in the moment conditions.
N ˇ/,
hp p T !
i T X
S0 D ACov Tm N .ˇ0 / D ACov zt ut
T t D1
T
!
N 0/
@m.ˇ 1X
D0 D plim 0
D plim z t x t0 D ˙zx :
@ˇ T t D1
p
This gives the asymptotic covariance matrix of T .ˇO ˇ0 /
1 1
V D D00 S0 1 D0 0
S0 1 ˙zx
D ˙zx :
113
7.7 Efficient GMM and Its Feasible Implementation
The efficient GMM (remember: for a given set of moment conditions) requires that we
use W D S0 1 , which is tricky since S0 should be calculated by using the true (unknown)
parameter vector. However, the following two-stage procedure usually works fine:
First, estimate model with some (symmetric and positive definite) weighting matrix.
The identity matrix is typically a good choice for models where the moment con-
ditions are of the same order of magnitude (if not, consider changing the moment
conditions). This gives consistent estimates of the parameters ˇ. Then a consistent
estimate SO can be calculated (for instance, with Newey-West).
Use the consistent SO from the first step to define a new weighting matrix as W D
SO 1 . The algorithm is run again to give asymptotically efficient estimates of ˇ.
Iterate at least once more. (You may want to consider iterating until the point esti-
mates converge.)
Example 7.23 (Implementation of 2SLS.) Under the classical 2SLS assumptions, there is
no need for iterating since the efficient weighting matrix is ˙zz1 = 2 . Only 2 depends
on the estimated parameters, but this scaling factor of the loss function does not affect
ˇO2SLS .
114
7.8 Testing in GMM
The result in (7.19) can be used to do Wald tests of the parameter vector. For instance,
suppose we want to test the s linear restrictions that Rˇ0 D r (R is s k and r is s 1)
then it must be the case that under null hypothesis
p d
T .RˇO r/ ! N.0s1 ; RVR0 /: (7.26)
Remark 7.24 (Distribution of quadratic forms.) If the n 1 vector x N.0; ˙/, then
x 0 ˙ 1 x 2n .
From this remark and the continuous mapping theorem in Remark (7.19) it follows
that, under the null hypothesis that Rˇ0 D r, the Wald test statistics is distributed as a 2s
variable
1 d
T .RˇO r/0 RVR0 .RˇO r/ ! 2s : (7.27)
We might also want to test the overidentifying restrictions. The first order conditions
(7.11) imply that k linear combinations of the q moment conditions are set to zero by
solving for ˇ:O Therefore, we have q k remaining overidentifying restrictions which
should also be close to zero if the model is correct (fits data). Under the null hypothe-
sis that the moment conditions hold (so the overidentifying restrictions hold), we know
p
that T m N .ˇ0 / is a (scaled) sample average and therefore has (by a CLT) an asymptotic
normal distribution. It has a zero mean (the null hypothesis) and the covariance matrix in
(7.12). In short,
p d
TmN .ˇ0 / ! N 0q1 ; S0 : (7.28)
O 0 S 1 m.
N ˇ/ O
If would then perhaps be natural to expect that the quadratic form T m. 0 N ˇ/
should be converge in distribution to a 2q variable. That is not correct, however, since ˇO
chosen is such a way that k linear combinations of the first order conditions always (in
every sample) are zero. There are, in effect, only q k nondegenerate random variables in
the quadratic form (see Davidson and MacKinnon (1993) 17.6 for a detailed discussion).
The correct result is therefore that if we have used optimal weight matrix is used, W D
S0 1 , then
d
O 0 S 1 m.
N ˇ/
T m. O !
N ˇ/ 2 ; if W D S 1 : (7.29)
0 q k 0
115
The left hand side equals T times of value of the loss function (7.8) evaluated at the point
estimates, so we could equivalently write what is often called the J test
O 2
TJ.ˇ/ if W D S0 1 :
q k; (7.30)
This also illustrates that with no overidentifying restrictions (as many moment conditions
as parameters) there are, of course, no restrictions to test. Indeed, the loss function value
is then always zero at the point estimates.
Example 7.25 (Test of overidentifying assumptions in 2SLS.) In contrast to the IV method,
2SLS allows us to test overidentifying restrictions (we have more moment conditions than
parameters, that is, more instruments than regressors). This is a test of whether the residu-
als are indeed uncorrelated with all the instruments. If not, the model should be rejected.
It can be shown that test (7.30) is (asymptotically, at least) the same as the traditional
(Sargan (1964), see Davidson (2000) 8.4) test of the overidentifying restrictions in 2SLS.
In the latter, the fitted residuals are regressed on the instruments; TR2 from that regres-
sion is 2 distributed with as many degrees of freedom as the number of overidentifying
restrictions.
Example 7.26 (Results from GMM on CCAPM; continuing Example 7.6.) The instru-
ments could be anything known at t or earlier could be used as instruments. Actually,
Hansen and Singleton (1982) and Hansen and Singleton (1983) use lagged Ri;t C1 c t C1 =c t
as instruments, and estimate
to be 0:68 to 0:95, using monthly data. However, TJT .ˇ/O
is large and the model can usually be rejected at the 5% significance level. The rejection
is most clear when multiple asset returns are used. If T-bills and stocks are tested at the
same time, then the rejection would probably be overwhelming.
Another test is to compare a restricted and a less restricted model, where we have
used the optimal weighting matrix for the less restricted model in estimating both the less
restricted and more restricted model (the weighting matrix is treated as a fixed matrix in
the latter case). It can be shown that the test of the s restrictions (the “D test”, similar in
flavour to an LR test), is
The weighting matrix is typically based on the unrestricted model. Note that (7.30) is a
special case, since the model with allows q non-zero parameters (as many as the moment
116
conditions) always attains J D 0, and that by imposing s D q k restrictions we get a
restricted model.
When the optimal weighting matrix is not used, that is, when (7.18) does not hold, then
the asymptotic covariance matrix of the parameters is given by (7.25) instead of the result
in (7.19). That is,
p d 1 1
T .ˇO ˇ0 / ! N.0k1 ; V2 /, where V2 D D00 WD0 D00 W S0 W 0 D0 D00 WD0
:
(7.32)
The consistency property is not affected.
The test of the overidentifying restrictions (7.29) and (7.30) are not longer valid. In-
stead, the result is that
p
O !d N 0q1 ; 2 , with
N ˇ/
T m. (7.33)
1 0 1
2 D ŒI D0 D00 WD0 D0 D00 WD0 D00 W 0 :
D0 W S0 ŒI (7.34)
This covariance matrix has rank q k (the number of overidentifying restriction). This
distribution can be used to test hypotheses about the moments, for instance, that a partic-
ular moment condition is zero.
Proof. (Sketch of proof of (7.33)-(7.34)) Use (7.22) in (7.20) to get
p p p @m.ˇ
N 1/
T m. O D
N ˇ/ N 0/ C
T m.ˇ
T N 0/
m.ˇ
@ˇ 0
p
N 1/
@m.ˇ
D IC N 0 /:
T m.ˇ
@ˇ 0
1
The term in brackets has a probability limit, which by (7.23) equals I D0 D00 WD0 D00 W .
p
N 0 / !d N 0q1 ; S0 we get (7.33).
Since T m.ˇ
Remark 7.27 If the n 1 vector X N.0; ˙/, where ˙ has rank r n then Y D
X 0 ˙ C X 2r where ˙ C is the pseudo inverse of ˙.
Remark 7.28 The symmetric ˙ can be decomposed as ˙ D ZZ 0 where Z are the
0
orthogonal eigenvectors (Z Z D I ) and is a diagonal matrix with the eigenvalues
117
along the main diagonal. The pseudo inverse can then be calculated as ˙ C D ZC Z 0 ,
where " #
1
11 0
C D ;
0 0
with the reciprocals of the non-zero eigen values along the principal diagonal of 111 .
This remark and (7.34) implies that the test of overidentifying restrictions (Hansen’s
J statistics) analogous to (7.29) is
O 0 C m.
N ˇ/ O d 2
T m. 2 N ˇ/ ! q k: (7.35)
Suppose we sidestep the whole optimization issue and instead specify k linear combi-
nations (as many as there are parameters) of the q moment conditions directly. That is,
instead of the first order conditions (7.11) we postulate that the estimator should solve
0k1 D „ƒ‚…
A m. O (ˇO is k 1).
N ˇ/ (7.36)
„ƒ‚…
kq q1
The matrix A is chosen by the researcher and it must have rank k (lower rank means that
we effectively have too few moment conditions to estimate the k parameters in ˇ). If A
is random, then it should have a finite probability limit A0 (also with rank k). One simple
case when this approach makes sense is when we want to use a subset of the moment
conditions to estimate the parameters (some columns in A are then filled with zeros), but
we want to study the distribution of all the moment conditions.
By comparing (7.11) and (7.36) we see that A plays the same role as Œ@m. O
N ˇ/=@ˇ 0 0
W,
but with the difference that A is chosen and not allowed to depend on the parameters.
In the asymptotic distribution, it is the probability limit of these matrices that matter, so
we can actually substitute A0 for D00 W in the proof of the asymptotic distribution. The
118
covariance matrix in (7.32) then becomes
1 1
V3 D .A0 D0 / A0 S0 Œ.A0 D0 / A0 0
1
D .A0 D0 / A0 S0 A00 Œ.A0 D0 / 1 0 ; (7.37)
where 3 has reduced rank. As before, this covariance matrix can be used to construct
both t type and 2 tests of the moment conditions. For instance, the test of overidentifying
restrictions (Hansen’s J statistics)
d
O 0 C m.
N ˇ/ O 2
T m. 3 N ˇ/ ! q k; (7.40)
but the model is so non-linear that we cannot find a closed form expression for Var_of_model.ˇ0 /.
Similary, we could match a covariance of
The SME involves (i) drawing a set of random numbers for the stochastic shocks in
the model; (ii) for a given set of parameter values generate a model simulation with Tsi m
observations, calculating the moments and using those instead of Var_of_model.ˇ0 / (or
similarly for other moments), which is then used to evaluate the loss function JT . This is
119
repeated for various sets of parameter values until we find the one which minimizes JT .
Basically all GMM results go through, but the covariance matrix should be scaled up
with 1 C T =Tsi m , where T is the sample length. Note that the same sequence of random
numbers should be reused over and over again (as the parameter values are changed).
Example 7.29 Suppose w t has two elements, x t and y t , and that we want to match both
variances and also the covariance. For simplicity, suppose both series have zero means.
Then we can formulate the moment conditions
2 2 3
x t Var(x)_in_model.ˇ/
m.x t ; y t ; ˇ/ D 4 y t2 Var(y)_in_model.ˇ/ 5 : (7.43)
6 7
xt yt Cov(x,y)_in_model.ˇ/
Bibliography
Cochrane, J. H., 2001, Asset pricing, Princeton University Press, Princeton, New Jersey.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Hansen, L., and K. Singleton, 1983, “Stochastic consumption, risk aversion and the tem-
poral behavior of asset returns,” Journal of Political Economy, 91, 249–268.
Harris, D., and L. Matyas, 1999, “Introduction to the generalized method of moments esti-
mation,” in Laszlo Matyas (ed.), Generalized Method of Moments Estimation . chap. 1,
Cambridge University Press.
120
Ingram, B.-F., and B.-S. Lee, 1991, “Simulation estimation of time-series models,” Jour-
nal of Econometrics, 47, 197–205.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
Mittelhammer, R. C., 1996, Mathematical statistics for economics and business, Springer-
Verlag, New York.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
121
8 Examples and Applications of GMM
The model is
y t D x t0 ˇ0 C u t ; (8.1)
where ˇ is a k 1 vector.
The k moment conditions are
T T T
1X 1X 1X
N .ˇ/ D
m x t .y t x t0 ˇ/ D xt yt x t x t0 ˇ: (8.2)
T t D1 T t D1 T t D1
The point estimates are found by setting all moment conditions to zero (the model is
N .ˇ/ D 0k1 , which gives
exactly identified), m
1
T T
!
1X 1X
ˇO D x t x t0 x t y t ˇ: (8.3)
T t D1 T t D1
If we define
hp p T !
i T X
S0 D ACov Tm N .ˇ0 / D ACov xt ut (8.4)
T t D1
T
!
N 0/
@m.ˇ 1X
D0 D plim 0
D plim x t x t0 D ˙xx : (8.5)
@ˇ T t D1
p
then the asymptotic covariance matrix of T .ˇO ˇ0 /
1 1
VLS D D00 S0 1 D0 0
S0 1 ˙xx D ˙xx1 S0 ˙xx1 :
D ˙xx (8.6)
122
8.1.2 The IV/2SLS Estimator (General)
The model is (8.1), but we use an IV/2SLS method. The q moment conditions (with
q k) are
T T T
1X 1X 1X
N .ˇ/ D
m z t .y t x t0 ˇ/ D zt yt z t x t0 ˇ: (8.7)
T t D1 T t D1 T t D1
The loss function is (for some positive definite weighting matrix W , not necessarily
the optimal)
T
" #0 " T
#
0 1X 1X
N
m.ˇ/ N
W m.ˇ/ D z t .y t x t0 ˇ/ W z t .y t x t0 ˇ/ ; (8.8)
T t D1 T tD1
O
N ˇ/=@ˇ
and the k first order conditions, .@m. 0 0 O D 0, are
N ˇ/
/ W m.
" T
#0 T
@ 1X 0 O 1X O
0k1 D z t .y t x t ˇ/ W z t .y t x t0 ˇ/
@ˇ 0 T tD1 T tD1
" T
#0 T
1X 1X O
D z t x t0 W z t .y t x t0 ˇ/
T t D1 T tD1
D ˙O xz W .˙O zy O
˙O zx ˇ/: (8.9)
We solve for ˇO as
1
O O O
ˇ D ˙xz W ˙zx ˙O xz W ˙O zy : (8.10)
Define
hp p T !
i T X
S0 D ACov Tm N .ˇ0 / D ACov zt ut (8.11)
T t D1
T
!
N 0/
@m.ˇ 1X
D0 D plim D plim z t x t0 D ˙zx : (8.12)
@ˇ 0 T t D1
p
This gives the asymptotic covariance matrix of T .ˇO ˇ0 /
1 1
V D D00 S0 1 D0 0
S0 1 ˙zx
D ˙zx : (8.13)
When the model is exactly identified (q D k/, then we can make some simplifications
123
since ˙O xz is then invertible. This is the case of the classical IV estimator. We get
1
ˇO D ˙O zx1 ˙O zy and V D ˙zx1 S0 ˙zx
0
if q D k. (8.14)
1 1 1 1
(Use the rule .ABC / DC B A to show this.)
This means that all cross terms (involving different observations) drop out and that we
can write
T
1X
S0 D E x t x t0 E u2t (8.17)
T t D1
T
1 X
D 2 E x t x t0 (since u t is iid and 2 D E u2t ) (8.18)
T t D1
D 2 ˙xx : (8.19)
124
Using this in (8.6) gives
V D 2 ˙xx1 : (8.20)
Reference: Greene (2000) 12.2 and Davidson and MacKinnon (1993) 16.2.
The only difference compared with the classical LS assumptions is that u t is now
allowed to be heteroskedastic, but this heteroskedasticity is not allowed to depend on the
moments of x t . This means that (8.17) holds, but (8.18) does not since Eu2t is not the
same for all t.
However, we can still simplify (8.17) a bit more. We assumed that Ex t x t0 and Eu2t
(which can both be time varying) are not related to each other, so we could perhaps multi-
ply E x t x t0 by ˙ tTD1 E u2t =T instead of by E u2t . This is indeed true asymptotically—where
any possible “small sample” relation between E x t x t0 and E u2t must wash out due to the
assumptions of independence (which are about population moments).
In large samples we therefore have
T T
! !
1X 1 X
S0 D E u2t E x t x t0
T tD1 T t D1
T T
! !
1X 2 1X 0
D E ut E xt xt
T tD1 T t D1
D ! 2 ˙xx ; (8.21)
where ! 2 is a scalar. This is very similar to the classical LS case, except that ! 2 is
the average variance of the residual rather than the constant variance. In practice, the
estimator of ! 2 is the same as the estimator of 2 , so we can actually apply the standard
LS formulas in this case.
This is the motivation for why White’s test for heteroskedasticity makes sense: if the
heteroskedasticity is not correlated with the regressors, then the standard LS formula is
correct (provided there is no autocorrelation).
125
8.1.5 Estimating the Mean of a Process
so by following the same steps in (8.16)-(8.19) we get S0 D 2 ˙zz :The optimal weight-
ing matrix is therefore W D ˙zz1 = 2 (or .Z 0 Z=T / 1 = 2 in matrix form). We use this
result in (8.10) to get
1
ˇO2SLS D ˙O xz ˙O zz1 ˙O zx ˙O xz ˙O zz1 ˙O zy ; (8.23)
y t D F .x t I ˇ0 / C " t ; (8.25)
126
where F .x t I ˇ0 / is a potentially non-linear equation of the regressors x t , with a k 1
vector of parameters ˇ0 . The non-linear least squares (NLS) approach is minimize the
sum of squared residuals, that is, to solve
To express this as a GMM problem, use the first order conditions for (8.26) as moment
conditions
1 PT @F .x t I ˇ/
mN .ˇ/ D Œy t F .x t I ˇ/ : (8.27)
T tD1 @ˇ
The model is then exactly identified so the point estimates are found by setting all moment
N .ˇ/ D 0k1 .
conditions to zero , m
p
As usual, S0 D CovŒ T m N .ˇ0 /, while the Jacobian is
N 0/
@m.ˇ
D0 D plim
@ˇ 0
1 PT @F .x t I ˇ/ @F .x t I ˇ/ 1 PT @2 F .x t I ˇ/
D plim plim Œy t F .x t I ˇ/ :
T t D1 @ˇ @ˇ 0 T t D1 @ˇ@ˇ 0
(8.28)
which is a 2 1 vector. Notice that the outer product of the gradient (first term) in (8.28)
is a 2 2 matrix. Similarly, the matrix with the second derivatives (the Hessian) is also a
2 2 matrix " 2
@2 F .x t Iˇ /
#
@ F .x t Iˇ /
@2 F .x t I ˇ/
D @2@ˇ 1 @ˇ1
F .x t Iˇ /
@ˇ1 @ˇ2
@2 F .x t Iˇ /
:
@ˇ@ˇ 0 @ˇ @ˇ @ˇ @ˇ
2 1 2 2
127
the same in the LS case (except possibly for the sign of D0 , but that is of no consequence
since it is only the square of D0 that matters.)
Example 8.3 (Logistic smooth transition regression) Let G.z/ be a logistic (increasing
but “S-shaped”) function
1
G.z/ D ;
1 C expŒ
.z c/
where the parameter c is the central location (where G.z/ D 1=2) and
> 0 determines
the steepness of the function (a high
implies that the function goes quickly from 0 to 1
around z D c.) See Figure 8.1 for an illustration. A logistic smooth transition regression
is
The regression coefficients vary smoothly with z t : from ˇ1 at low values of z t to ˇ2 at high
values of z t . See Figure 8.1 for an illustration. The parameter vector (
; c; ˇ1 ; ˇ2 —called
just ˇ in (8.25)) is easily estimated by NLS by concentrating the loss function: optimize
(numerically) over .
; c/ and let (for each value of .
; c/) the parameters (ˇ1 ; ˇ2 ) be the
OLS coefficients on the vector of “regressors” .Œ1 G.z t / x t ; G.z t /x t /. The most com-
mon application of this model is obtained by letting x t D y t s (this is the LSTAR model—
logistic smooth transition auto regression model), see Franses and van Dijk (2000).
One way to handle unbalanced panels (when there is more data on sone variables than
on others), is to artificially expand the sample and then interact the moment conditions
with a dummy variable to pick out the correct subsample. This example illustrates how
and why that works. To keep it simple, the example discusses the case of estimating a
sample mean of x t —for which we have data over the sample t D 1 to and the sample
is artificially extended to include T data points.
128
Logistic function, different γ Coefficient on xt, different β2
1
0.4 β2=0.5
0.3 β2=0.25
0.5
0.2 β2=0
γ=1
γ=5 0.1
0 0
−2 0 2 −2 0 2
z z
G(z) = 1/[1+exp(−γ(z−c))], c=0 yt = [1−G(zt)]β1xt + G(zt)β2xt + εt, β1=0.25
G(zt) is a logistic function with γ=2 and c=0
Figure 8.1: Logistic function and the effective slope coefficient in a Logistic smooth
transition regression
2 C .T /0 2
S0 D D (8.31)
T T
N 0 /=@ˇ 0 ) is
and the Jacobian (plim @m.ˇ
D0 D : (8.32)
T
129
(This is also what a numerical derivative based on a function for the moment conditions
would give.)
Combining gives
p d
T .O / ! N.0; V /; where
1
2 1
V D
T T T
T 2 T
D
T
T
D 2 : (8.33)
Therefore,
2
O D V =T D
Var./ ; (8.34)
which is the correct result—the artificial extension of the sample does not lead to a spu-
riously low uncertainty. This demonstrates that the aritificial and spurious extension of
the sample actually does no harm: the inference based on standard GMM formulas is still
correct.
8.1.9 Moment Conditions with Spuriously Extended Sample 2 (Dummies for Miss-
ing Values)
y t D bx t C " t (8.35)
and suppose the sample length is T , but only the first observations have full data, while
the last T observations include some missing values. (Putting these observations last
is just a matter of convenience.)
Suppose we prune (“excise”) the sample by simply skipping the observations with
missing values. Under the standard iid assumptions, we then have that the LS estimate
O is distributed as
(b)
p d
T .bO b0 / ! N .0k1 ; V / ; (8.36)
130
where the covariance matrix is
! 1
2 1X 2 1X 2
V D plim xt xt and D plim " : (8.37)
t D1 tD1 t
Instead, suppose we use all T observations, but let d t D 1 if there is data for period t
and zero otherwise. This gives the sample moment condition
T
1X
N D
m d t x t .y t bx t / (8.38)
T t D1
The Jacobian is
T
1X
D0 D plim dt xt xt (8.39)
T t D1
and the covariance of the moment conditions (under the standard iid assumptions)
T T T
1X 1X 1X
S0 D plim d t x t x t d t "2t D s 2 plim d t x t x t ; where s 2 D plim d t "2t :
T t D1 T t D1 T t D1
(8.40)
Combining as in (8.6) gives the covariance matrix
1
T
!
1X
V b D s2 dt xt xt : (8.41)
T tD1
which is the same as in (8.37). This makes a lot of sense since the dummy approach is just
131
about nullifying the effect of the periods with missing values. In a sense this makes the
Jacobian too small, but that is compensated for by making S0 too large. This demonstrates
that the estimation could be done in either way.
F y t C Gz t D u t ; (8.43)
where x t contains the endogenous and exogenous variables that enter the j th equation
with non-zero coefficients, that is, subsets of y t and z t .
We want to estimate ˇ in (8.44). Least squares is inconsistent if some of the regressors
are endogenous variables (in terms of (8.43), this means that the j th row in F contains
at least one additional non-zero element apart from coefficient on yjt ). Instead, we use
IV/2SLS. By assumption, the structural model summarizes all relevant information for
the endogenous variables y t . This implies that the only useful instruments are the vari-
ables in z t . (A valid instrument is uncorrelated with the residuals, but correlated with the
regressors.) The moment conditions for the j th equation are then
T
1X
x t0 ˇ x t0 ˇ D 0:
E z t yjt D 0 with sample moment conditions z t yjt
T t D1
(8.45)
If there are as many moment conditions as there are elements in ˇ, then this equation
is exactly identified, so the sample moment conditions can be inverted to give the Instru-
1
By premultiplying with F 1 and rearranging we get the reduced form y t D ˘ z t C" t , with ˘ D F 1
132
mental variables (IV) estimator of ˇ. If there are more moment conditions than elements
in ˇ, then this equation is overidentified and we must devise some method for weighting
the different moment conditions. This is the 2SLS method. Finally, when there are fewer
moment conditions than elements in ˇ, then this equation is unidentified, and we cannot
hope to estimate the structural parameters of it.
We can partition the vector of regressors in (8.44) as x t0 D ŒzQ t0 ; yQ t0 , where y1t and z1t
are the subsets of z t and y t respectively, that enter the right hand side of (8.44). Partition
z t conformably z t0 D ŒzQ t0 ; z t0 , where z t are the exogenous variables that do not enter
(8.44). We can then rewrite the moment conditions in (8.45) as
" # " #0 !
zQ t zQ t
E yjt ˇ D 0: (8.46)
z t yQ t
yjt D Gj zQ t Fj yQ t C ujt
D x t0 ˇ C ujt , where x t0 D zQ t0 ; yQ t0 ;
(8.47)
This shows that we need at least as many elements in z t as in yQ t to have this equations
identified, which confirms the old-fashioned rule of thumb: there must be at least as
many excluded exogenous variables (z t ) as included endogenous variables (yQ t ) to have
the equation identified.
This section has discussed identification of structural parameters when 2SLS/IV, one
equation at a time, is used. There are other ways to obtain identification, for instance, by
imposing restrictions on the covariance matrix. See, for instance, Greene (2000) 16.1-3
for details.
Example 8.4 (Supply and Demand. Reference: GR 16, Hamilton 9.1.) Consider the
simplest simultaneous equations model for supply and demand on a market. Supply is
q t D p t C ust ; > 0;
and demand is
q t D ˇp t C ˛A t C udt ; ˇ < 0;
where A t is an observable exogenous demand shock (perhaps income). The only mean-
133
ingful instrument is A t . From the supply equation we then get the moment condition
E A t .q t p t / D 0;
which gives one equation in one unknown,
. The supply equation is therefore exactly
identified. In contrast, the demand equation is unidentified, since there is only one (mean-
ingful) moment condition
E A t .q t ˇp t ˛A t / D 0;
Example 8.5 (Supply and Demand: overidentification.) If we change the demand equa-
tion in Example 8.4 to
q t D ˇp t C ˛A t C bB t C udt ; ˇ < 0:
There are now two moment conditions for the supply curve (since there are two useful
instruments) " # " #
A t .q t
p t / 0
E D ;
B t .q t
p t / 0
but still only one parameter: the supply curve is now overidentified. The demand curve is
still underidentified (two instruments and three parameters).
This section discusses how GMM can be used to test if a series is autocorrelated. The
analysis focuses on first-order autocorrelation, but it is straightforward to extend it to
higher-order autocorrelation.
Consider a scalar random variable x t with a zero mean (it is easy to extend the analysis
to allow for a non-zero mean). Consider the moment conditions
T
" # " # " #
x t2 2 1 X x t2 2 2
m t .ˇ/ D N
; so m.ˇ/ D , with ˇ D :
x t x t 1 2 T t D1 x t x t 1 2
(8.48)
2 2
is the variance and the first-order autocorrelation so is the first-order autocovari-
134
ance. We want to test if D 0. We could proceed along two different routes: estimate
and test if it is different from zero or set to zero and then test overidentifying restric-
tions. We analyze how these two approaches work when the null hypothesis of D 0 is
true.
We estimate both 2 and by using the moment conditions (8.48) and then test if D
0. To do that we need to calculate the asymptotic variance of O (there is little hope of
being able to calculate the small sample variance, so we have to settle for the asymptotic
variance as an approximation).
We have an exactly identified system so the weight matrix does not matter—we can
then proceed as if we had used the optimal weighting matrix (all those results apply).
To find the asymptotic covariance matrix of the parameters estimators, we need the
probability limit of the Jacobian of the moments and the covariance matrix of the moments—
evaluated at the true parameter values. Let m N i .ˇ0 / denote the i th element of the m.ˇ/
N
vector—evaluated at the true parameter values. The probability of the Jacobian is
" # " # " #
@mN 1 .ˇ0 /=@ 2 @mN 1 .ˇ0 /=@ 1 0 1 0
D0 D plim D D ;
@mN 2 .ˇ0 /=@ 2 @mN 2 .ˇ0 /=@ 2 0 2
(8.49)
since D 0 (the true value). Note that we differentiate with respect to 2 , not , since
we treat 2 as a parameter.
The covariance matrix is more complicated. The definition is
"p T
# "p T #0
T X T X
S0 D E m t .ˇ0 / m t .ˇ0 / :
T t D1 T t D1
S0 D E m t .ˇ0 /m t .ˇ0 /0 .
This assumption is stronger than assuming that D 0, but we make it here in order to
illustrate the asymptotic distribution. To get anywhere, we assume that x t is iid N.0; 2 /.
135
In this case (and with D 0 imposed) we get
" #" #0 " #
x t2 2 x t2 2 .x t2 2 /2 .x t2 2 /x t x t 1
S0 D E DE
xt xt 1 xt xt 1 .x t2 2 /x t x t 1 .x t x t 1 /2
" # " #
E x t4 2 2 E x t2 C 4 0 2 4 0
D D : (8.50)
0 E x t2 x t2 1 0 4
To make the simplification in the second line we use the facts that E x t4 D 3 4 if x t
N.0; 2 /, and that the normality and the iid properties of x t together imply E x t2 x t2 1 D
E x t2 E x t2 1 and E x t3 x t 1 D E 2 x t x t 1 D 0.
By combining (8.49) and (8.50) we get that
" #!
p O 2 0
1
1
ACov T D D0 S0 D0
O
0" #0 " # 1" #1 1
4
1 0 2 0 1 0
D@ 2 4
A
0 0 0 2
" #
2 4 0
D : (8.51)
0 1
p
O
This shows the standard expression for the uncertainty of the variance and that the T .
p
Since GMM estimators typically have an asymptotic distribution we have T O !d
N.0; 1/, so we can test the null hypothesis of no first-order autocorrelation by the test
statistics
T O2 21 . (8.52)
136
8.3.2 Testing the Overidentifying Restriction of No Autocorrelation
We can estimate 2 alone and then test if both moment condition are satisfied at D 0.
There are several ways of doing that, but the perhaps most straightforward is skip the loss
function approach to GMM and instead specify the “first order conditions” directly as
0 D Am
N
T
" #
h i1X x t2 2
D 1 0 ; (8.53)
T t D1 xt xt 1
We therefore get
" #" #" #0 " #
p 0 0 2 4 0 0 0 0 0
ACovŒ T m. O D
N ˇ/ D : (8.57)
0 1 0 4 0 1 0 4
Note that the first moment condition has no sampling variance at the estimated parameters,
since the choice of O 2 always sets the first moment condition equal to zero.
The test of the overidentifying restriction that the second moment restriction is also
137
zero is p C
0
N ACovŒ T m.
Tm O
N ˇ/ m N 21 ; (8.58)
where we have to use a generalized inverse if the covariance matrix is singular (which it
is in (8.57)).
In this case, we get the test statistics (note the generalized inverse)
" #0 " #" # T 2
0 0 0 0 ˙ t D1 x t x t 1 =T
T DT ;
˙ tTD1 x t x t 1 =T 0 1= 4 T
˙ tD1 x t x t 1 =T 4
(8.59)
which is the T times the square of the sample covariance divided by 4 . A sample cor-
relation, , O would satisfy ˙ tTD1 x t x t 1 =T D OO 2 , which we can use to rewrite (8.59) as
T O2 O 4 = 4 . By approximating 4 by O 4 we get the same test statistics as in (8.52).
This section discusses how the GMM framework can be used to test if a variable is nor-
mally distributed. The analysis cold easily be changed in order to test other distributions
as well.
Suppose we have a sample of the scalar random variable x t and that we want to test if
the series is normally distributed. We analyze the asymptotic distribution under the null
hypothesis that x t is N.; 2 /.
We specify four moment conditions
2 3 2 3
xt xt
T 6
6 .x t /2 2 7 6 .x t /2 2 7
6 7 7
1 X
mt D 6 3
7 so mN D 6
3
7 (8.60)
6 .x
4 t /
7
5 T t D1 6
4 .x t /
7
5
.x t /4 3 4 .x t /4 3 4
138
values. The probability of the Jacobian is
2 3
@mN 1 .ˇ0 /=@ @m N 1 .ˇ0 /=@ 2
2 7
6 7
6 @m N 2 .ˇ0 /=@ @ N
m 2 .ˇ0 /=@
D0 D plim 6 7
2 7
6 @m
4 N 3 .ˇ0 /=@ @ N
m 3 .ˇ0 /=@ 5
2
@mN 4 .ˇ0 /=@ @m N 4 .ˇ0 /=@
2 3 2 3
1 0 1 0
T 6 7 6 7
1 X6 2.x t / 1 7 6 0 1 7
D plim 6
2
7D6 7: (8.61)
T tD1 6 4 3.x t / 0 7 6
5 4 3 2 0 7
5
3 2
4.x t / 6 0 6 2
which is the form we use here for illustration. We therefore have (provided m t .ˇ0 / is not
autocorrelated)
02 31 02 310 2 3
xt xt 2 0 3 4 0
2 2 2 2 4
12 6
B6 7 C B 6 7 C 6 7
B6 .x t / 7 7C B6 .x t / 7 7C D 6 0 2 0
C B6 C 6 7
S0 D E B 6 7:
B6
@4 .x t /3 7C B6
5A @4 .x t /3 7C
5A
6 3 4
4 0 15 6 0 7
5
4 4 4 4 6
.x t / 3 .x t / 3 0 12 0 96 8
(8.64)
It is straightforward to derive this result once we have the information in the following
remark.
Remark 8.6 If X N.; 2 /, then the first few moments around the mean of a are
E.X / D 0, E.X /2 D 2 , E.X /3 D 0 (all odd moments are zero), E.X /4 D
3 4 , E.X /6 D 15 6 , and E.X /8 D 105 8 .
139
Suppose we use the efficient weighting matrix. The asymptotic covariance matrix of
the estimated mean and variance is then (.D00 S0 1 D0 / 1 )
02 30 2 3 1 2 31 1
1 0 2 0 3 4 0 1 0 " # 1
1
2 4 12 6 7
B6 7 6 7 6 7C
B6 0 1 7 6 0 0 6 0 1 7C
2
0
B6 7 6 7 6 7C D 1
B6
@4 3 2 0 7
5
6
4 3 4 0 15 6 0 7 5
6
4 3 2 0 7C
5A 0 2 4
0 6 2 0 12 6 0 96 8 0 6 2
" #
2 0
D :
0 2 4
(8.65)
This is the same as the result from maximum likelihood estimation which use the sample
mean and sample variance as the estimators. The extra moment conditions (overidenti-
fying restrictions) does not produce any more efficient estimators—for the simple reason
that the first two moments completely characterizes the normal distribution.
The payoff from the overidentifying restrictions is that we can test if the series is actually
normally distributed. There are several ways of doing that, but the perhaps most straight-
forward is skip the loss function approach to GMM and instead specify the “first order
conditions” directly as
0 D Am
N
2 3
xt
T 6
" #
.x t /2 2 7
7
1 0 0 0 1 X6
D 6 7: (8.66)
0 1 0 0 T t D1 6
4 .x t /3 7
5
.x t /4 3 4
The asymptotic covariance matrix the moment conditions is as in (8.55). In this case,
140
the matrix with brackets is
0 1 1
2 3 2 3B 2 3C
1 0 0 0 1 0 B 1 0 C
6 7 6 7B " #6 7C " #
6
6 0 1 0 0 7
7
6
6 0 1 7B
7B 1 0 0 0 6
6 0 1 7C
7C 1 0 0 0
2 7B 2 7C
6
4 0 0 1 0 7
5
6
4 3 0 5B 0 1 0 0 4
B 6 3 0 5C
C 0 1 0 0
6 2 B„ 6 2
ƒ‚ … „ ƒ‚ …
0 0 0 1 0 @ A0 0 C
A A0
„ ƒ‚ … „ ƒ‚ … „ ƒ‚ …
I4 D0 D0
2 3
0 0 0 0
6 7
6 0 0 0 0 7
D6 7 (8.67)
6
4 3 2 0 1 0 7
5
0 6 2 0 1
We therefore get
2 32 32 30
0 0 0 0 2 0 3 4 0 0 0 0 0
p 2 4 12 6 7
6 76 76 7
O
6 0 0 0 0 7 6 0 0 6 0 0 0 0 7
N ˇ/ D 6
ACovŒ T m. 76 76 7
6
4 3 2 0 1 0 7
54
6 3 4 0 15 6 0 7 54
6 3 2 0 1 0 7
5
0 6 2 0 1 0 12 6 0 96 8 0 6 2 0 1
2 3
0 0 0 0
6 7
6 0 0 0 0 7
D6
6 0 0 6 6
7 (8.68)
4 0 7
5
0 0 0 24 8
We now form the test statistics for the overidentifying restrictions as in (8.58). In this
case, it is (note the generalized inverse)
2 30 2 32 3
0 0 0 0 0 0
6 7 6 76 7
6 0 7 6 0 0 0 0 76 0 7
T66 ˙ T .x 3
7 6
7 6 0 0 1=.6 6 /
76
7 6 ˙ T .x
7
4 t D1 t / =T 5 4 0 5 4 t D1 t /3 =T 7
5
T
˙ t D1 Œ.x t / 4 4
3 =T 0 0 0 1=.24 /8
˙ tTD1 Œ.x t /4 3 4 =T
2 2
T ˙ tTD1 .x t /3 =T T ˙ tTD1 Œ.x t /4 3 4 =T
˚
D C : (8.69)
6 6 24 8
When we approximate by O then this is the same as the Jarque and Bera test of nor-
141
mality.
The analysis shows (once again) that we can arrive at simple closed form results by
making strong assumptions about the data generating process. In particular, we assumed
that the moment conditions were serially uncorrelated. The GMM test, with a modified
estimator of the covariance matrix S0 , can typically be much more general.
and estimate it with GMM using moment conditions. One of the moment condition is
that the sample average of the labor share in value added equals the coefficient on labor
in a Cobb-Douglas production function, another is that just the definitions of a standard
deviation, and so forth.
The distribution of the estimator for is asymptotically normal. Note that the covari-
ance matrix of the moments is calculated similarly to the Newey-West estimator.
The second step is to note that the RBC model generates second moments as a function
h .:/ of the model parameters fı; :::; g, which are in , that is, the model generated
second moments can be thought of as h . /.
The third step is to test if the non-linear restrictions of the model (the model mapping
from parameters to second moments) are satisfied. That is, the restriction that the model
second moments are as in data
y
c p
H . / D h . / ; :::; Corr ; n D 0; (8.71)
y n
is tested with a Wald test. (Note that this is much like the Rˇ D 0 constraints in the linear
142
case.) From the delta-method we get
p 0
@Hd @H
T H. O / ! N 0; Cov. O / : (8.72)
@ 0 @
Forming the quadratic form
0
1
@H O / @H
TH. O / 0
Cov. H. O /; (8.73)
@ 0 @
will as usual give a 2 distributed test statistic with as many degrees of freedoms as
restrictions (the number of functions in (8.71)).
0
y1t D x1t ˇ1 C u1t
0
y2t D x2t ˇ2 C u2t ;
and two sets of instruments, z1t and z2t with the same dimensions as x1t and x2t , respec-
tively. The sample moment conditions are
T
" #
0
1 X z1t y1t x1t ˇ1
N 1 ; ˇ2 / D
m.ˇ 0
;
T t D1 z2t y2t x2t ˇ2
143
This is invertible so we can premultiply the first order condition with the inverse of
0 0
N
Œ@m.ˇ/=@ˇ N
A and get m.ˇ/ D 0k1 . We can solve this system for ˇ1 and ˇ2 as
" # " PT # 1" P #
1 0 1 T
ˇ1 z x
tD1 1t 1t 0 z y
1t 1t
D T
1
PT 0
T
1
PTt D1
ˇ2 0 T t D1 z2t x2t T t D1 z2t y2t
2 P 1 3"
1 T 0 PT #
t D1 z1t x1t 0 1
T z 1t y1t
D4 P 1 5 T1 PTtD1 :
0 1 T
z 2t x 0
T t D1 z2t y2t
T tD1 2t
This is IV on each equation separately, which follows from having an exactly identified
system.
Bibliography
Christiano, L. J., and M. Eichenbaum, 1992, “Current real-business-cycle theories and
aggregate labor-market fluctuations,” American Economic Review, 82, 430–450.
Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance,
Cambridge University Press.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
144
12 Vector Autoregression (VAR)
Reference: Hamilton (1994) 10-11; Greene (2000) 17.5; Johnston and DiNardo (1997)
9.1-9.2 and Appendix 9.2; and Pindyck and Rubinfeld (1998) 9.2 and 13.5.
Let y t be an n 1 vector of variables. The VAR(p) is
Issues:
Variable selection
Lag length
Estimation
145
Example 12.2 (Canonical form of a univariate AR(2).)
" # " # " #" # " #
yt a1 a2 yt 1 "t
D C C .
yt 1 0 1 0 yt 2 0
Example 12.3 (Canonical for of VAR(2) of 21 vector.) Continuing on the previous
example, we get
2 3 2 32 3 2 3
xt A1;11 A1;11 A2;11 A2;12 xt 1 "1;t
6 7 6 76 7 6 7
6 zt 7 6 A1;21 A1;22 A2;21 A2;22 76 zt 1 7 6 "2;t 7
6 7D6 76 7C6 7:
6 x 7 6 1 0 0 0 76 xt 7 6 0 7
4 t 1 5 4 54 2 5 4 5
zt 1 0 1 0 0 zt 2 0
y t D Ay t 1 C "t : (12.5)
y t D A Ay t C "t C "t
2 1
D A2 y t 2 C A"t
C "t 1
Remark 12.4 (Spectral decomposition.) The n eigenvalues (i ) and associated eigen-
vectors (zi ) of the n n matrix A satisfies
.A i In / zi D 0n1 :
146
If the eigenvectors are linearly independent, then
2 3
1 0 0
6 0 2 0
6 7 h i
A D ZZ 1 , where D 6
7
7 and Z D z1 z2 zn :
6 :: :: :
4 : : ::
7
5
0 0 n
A2 D AA D ZZ 1
ZZ 1
D ZZ 1
D Z2 Z 1
) Aq D Zq Z 1
:
p
Remark 12.5 (Modulus of complex number.) If D a C bi, where i D 1, then
p
jj D ja C bij D a2 C b 2 .
Example 12.6 (AR(1).) For the univariate AR(1) y t D ay t 1 C " t , the characteristic
equation is .a / z D 0, which is only satisfied if the eigenvalue is D a. The AR(1) is
therefore stable (and stationarity) if 1 < a < 1.
We may pick out the first n equations from (12.7) (to extract the “original” variables from
the canonical form) and write them as
147
Example 12.7 (AR(2), Example (12.2) continued.) Let D 0 in 12.2 and note that the
VMA of the canonical form is
" # " # " #" # " #" #
yt "t a1 a2 "t 1 a12 C a2 a1 a2 "t 2
D C C C :::
yt 1 0 1 0 0 a1 a2 0
The MA of y t is therefore
C a12 C a2 " t
y t D " t C a1 " t 1 2 C :::
Note that
@y t @E t y t Cs
0
D Cs or D Cs ; with C0 D I (12.9)
@" t s @"0t
so the impulse response function is given by fI; C1 ; C2 ; :::g. Note that it is typically only
meaningful to discuss impulse responses to uncorrelated shocks with economic interpreta-
tions. The idea behind structural VARs (discussed below) is to impose enough restrictions
to achieve this.
Example 12.8 (Impulse response function for AR(1).) Let y t D y t 1 C " t . The MA
Pt s s
representation is y t D sD0 " t s , so @y t =@" t s D @E t y t Cs =@" t D . Stability
requires jj < 1, so the effect of the initial value eventually dies off (lims!1 @y t =@" t s D
0).
The eigenvalues are approximately 0:52 and 0:32, so this is a stable VAR. The VMA is
" # " # " #" # " #" #
xt "1;t 0:5 0:2 "1;t 1 0:27 0:04 "1;t 2
D C C C :::
zt "2;t 0:1 0:3 "2;t 1 0:02 0:11 "2;t 2
Remark 12.10 (Generalized impulse response function) The impulse response function
makes most sense when the shocks (" t ) are uncorrelated. If they are not, then a gener-
alized impulse response function (see Pesaran and Shin (1998)) might still be useful: in
(12.8), replace Cs by Cs ˙.diag.˙// 1 , where ˙ D Cov." t /. It can be interpreted as fol-
lows: what is the typical response to "i t —when we incorporate the typical comovement
148
of the other shocks with "i t ? In practice, this means that the impulse response function
with respect to "i t is not calculated by setting " t D ei where ei is a vector of zeros except
that element i is unity. Instead, we use E." t j"i t D 1/ D ˙i =i i where ˙i is column i of
˙ (that is, Cov." t ; "i t /), so the value of the entire vector " t is predicted (assuming that
" t is normally distributed) using the information "i t D 1. Clearly, when the shocks are
uncorrelated, then this approach coincides with a traditional impulse response function
(where the impulse is "i t D 1 and "jt D 0 for j ¤ i ).
12.3 Estimation
The MLE, conditional on the initial observations, of the VAR is the same as OLS esti-
mates of each equation separately. The MLE of the ij t h element in Cov(" t ) is given by
PT
O i t vOjt =T , where vO i t and vOjt are the OLS residuals.
t D1 v
Note that the VAR system is a system of “seemingly unrelated regressions,” with the
same regressors in each equation. The OLS on each equation is therefore the GLS, which
coincides with MLE if the errors are normally distributed.
Main message: Granger-causality might be useful, but it is not the same as causality.
Definition: if z cannot help forecast x, then z does not Granger-cause x; the MSE of
the forecast E. x t j x t s ; z t s ; s > 0/ equals the MSE of the forecast E.x t j x t s ; s > 0/.
Test: Redefine the dimensions of x t and z t in (12.2): let x t be n1 1 and z t is n2 1. If
the n1 n2 matrices A1;12 D 0 and A2;12 D 0, then z fail to Granger-cause x. (In general,
we would require As;12 D 0 for s D 1; :::; p.) This carries over to the MA representation
in (12.8), so Cs;12 D 0.
These restrictions can be tested with an F-test. The easiest case is when x is a scalar,
since we then simply have a set of linear restrictions on a single OLS regression.
Example 12.11 (RBC and nominal neutrality.) Suppose we have an RBC model which
says that money has no effect on the real variables (for instance, output, capital stock,
and the productivity level). Money stock should not Granger-cause real variables.
Example 12.12 (Granger causality and causality.) Do Christmas cards cause Christ-
mas?
149
Example 12.13 (Granger causality and causality II, from Hamilton 11.) Consider the
price P t of an asset paying dividends D t . Suppose the expected return (E t (P tC1 C
D t C1 /=P t ) is a constant, R. The price then satisfies P t DE t 1 s
P
sD1 R D t Cs . Suppose
D t D u t C ıu t 1 C v t , so E t D t C1 D ıu t and E t D t Cs D 0 for s > 1. This gives
P t D ıu t =R, and D t D u t C v t C RP t 1 , so the VAR is
" # " #" # " #
Pt 0 0 Pt 1 ıu t =R
D C ;
Dt R 0 Dt 1 ut C vt
Example 12.14 (Money and output, Sims (1972).) Sims found that output, y does not
Granger-cause money, m, but that m Granger causes y. His interpretation was that
money supply is exogenous (set by the Fed) and that money has real effects. Notice how
he used a combination of two Granger causality test to make an economic interpretation.
Example 12.15 (Granger causality and omitted information. ) Consider the VAR
2 3 2 32 3 2 3
y1t a11 a12 0 y1t 1 "1t
4 y2t 5 D 4 0 a22 0 5 4 y2t 1 5 C 4 "2t 5
6 7 6 76 7 6 7
Notice that y2t and y3t do not depend on y1t 1 , so the latter should not be able to
Granger-cause y3t . However, suppose we forget to use y2t in the regression and then
ask if y1t Granger causes y3t . The answer might very well be yes since y1t 1 contains
information about y2t 1 which does affect y3t . (If you let y1t be money, y2t be the (auto-
correlated) Solow residual, and y3t be output, then this is a short version of the comment
in King (1986) comment on Bernanke (1986) (see below) on why money may appear to
Granger-cause output). Also note that adding a nominal interest rate to Sims (see above)
money-output VAR showed that money cannot be taken to be exogenous.
150
so the covariance matrix of the (s periods ahead) forecasting errors is
Note that lims!1 E t y t Cs D 0, that is, the forecast goes to the unconditional mean
(which is zero here, since there are no constants - you could think of y t as a deviation
from the mean). Consequently, the forecast error becomes the VMA representation (12.8).
Similarly, the forecast error variance goes to the unconditional variance.
D ˝ C A.Ey t y t0 /A0 ;
If the shocks are uncorrelated, then it is often useful to calculate the fraction of Var(yi;tCs E t yi;t Cs )
due to the j t h shock, the forecast error variance decomposition. Suppose the covariance
matrix of the shocks, here ˝, is a diagonal n n matrix with the variances !i i along the
diagonal. Let cqi be the it h column of Cq . We then have
n
X 0
Cq ˝ Cq0 D !i i cqi cqi : (12.14)
i D1
151
Example 12.17 (Illustration of (12.14) with n D 2:) Suppose
" # " #
c11 c12 !11 0
Cq D and ˝ D ;
c21 c22 0 !22
then " #
2 2
!11 c11 C !22 c12 !11 c11 c21 C !22 c12 c22
Cq ˝ Cq0 D 2 2
;
!11 c11 c21 C !22 c12 c22 !11 c21 C !22 c22
which should be compared with
" #" #0 " #" #0
c11 c11 c12 c12
!11 C !22
c21 c21 c22 c22
" # " #
2 2
c11 c11 c21 c12 c12 c22
D !11 2
C !22 2
:
c11 c21 c21 c12 c22 c22
which shows how the covariance matrix for the s-period forecast errors can be decom-
posed into its n components.
We are usually not interested in the impulse response function (12.8) or the variance
decomposition (12.11) with respect to " t , but with respect to some structural shocks, u t ,
which have clearer interpretations (technology, monetary policy shock, etc.).
Suppose the structural form of the model is
152
This could, for instance, be an economic model derived from theory.1
Provided F 1 exists, it is possible to write the time series process as a vector autore-
gression
1 1 1 1
yt D F ˛CF B1 y t 1 C ::: C F Bp y t p CF ut (12.17)
D C A1 y t 1 C ::: C Ap y t p C " t ; Cov ." t / D ˝, (12.18)
where
1 1 1 1 1 0
DF ˛, As D F Bs , and " t D F u t so ˝ D F D F . (12.19)
1
Remark 12.18 The easiest way to calculate this representation is by first finding F
(see below), then writing (12.18) as
1
y t D C A1 y t 1 C ::: C Ap y t p CF ut : (12.21)
To calculate the impulse responses to the first element in u t , set y t 1 ; :::; y t p equal to
the long-run average, .I A1 ::: Ap/ 1 (or just put them all to zero), make the
first element in u t unity and all other elements zero. Calculate the response by iterating
1
This is a “structural model” in a traditional, Cowles commission, sense. This might be different from
what modern macroeconomists would call structural.
153
forward on (12.21), but putting all elements in u t C1 ; u tC2 ; ::: to zero. This procedure can
be repeated for the other elements of u t .
We would typically pick F such that the elements in u t are uncorrelated with each
other, so they have a clear interpretation.
The VAR form can be estimated directly from data. Is it then possible to recover the
structural parameters in (12.16) from the estimated VAR (12.18)? Not without restrictions
on the structural parameters in F; Bs , ˛, and D. To see why, note that in the structural
form (12.16) we have .p C 1/ n2 parameters in fF; B1 ; : : : ; Bp g, n parameters in ˛, and
n.n C 1/=2 unique parameters in D (it is symmetric). In the VAR (12.18) we have fewer
parameters: pn2 in fA1 ; : : : ; Ap g, n parameters in in , and n.n C 1/=2 unique param-
eters in ˝. This means that we have to impose at least n2 restrictions on the structural
parameters fF; B1 ; : : : ; Bp ; ˛; Dg to identify all of them. This means, of course, that
many different structural models have can have exactly the same reduced form.
Example 12.19 (Structural form of the 2 1 case.) Suppose the structural form of the
previous example is
" #" # " #" # " #" # " #
F11 F12 xt B1;11 B1;12 xt 1 B2;11 B2;12 xt 2 u1;t
D C C :
F21 F22 zt B1;21 B1;22 zt 1 B2;21 B2;22 zt 2 u2;t
This structural form has 3 4 C 3 unique parameters. The VAR in (12.2) has 2 4 C 3.
We need at least 4 restrictions on fF; B1 ; B2 ; Dg to identify them from fA1 ; A2 ; ˝g.
154
This is a recursive simultaneous equations model, and we obviously need to be careful
with how we order the variables. The assumptions that Fi i D 1 is just a normalization.
A diagonal D matrix seems to be something that we would often like to have in
a structural form in order to interpret the shocks as, for instance, demand and supply
shocks. The diagonal elements of D are the variances of the structural shocks.
Example 12.20 (Lower triangular F : going from structural form to VAR.) Suppose the
structural form is
" #" # " #" # " #
1 0 xt B11 B12 xt 1 u1;t
D C :
˛ 1 zt B21 B22 zt 1 u2;t
This is a recursive system where x t does not not depend on the contemporaneous z t , and
therefore not on the contemporaneous u2t (see first equation). However, z t does depend
on x t (second equation). The VAR (reduced form) is obtained by premultiplying by F 1
" # " #" #" # " #" #
xt 1 0 B11 B12 xt 1 1 0 u1;t
D C
zt ˛ 1 B21 B22 zt 1 ˛ 1 u2;t
" #" # " #
A11 A12 xt 1 "1;t
D C :
A21 A22 zt 1 "2;t
This means that "1t D u1t , so the first VAR shock equals the first structural shock. In
contrast, "2;t D ˛u1;t C u2;t , so the second VAR shock is a linear combination of the first
two shocks. The covariance matrix of the VAR shocks is therefore
" # " #
"1;t Var .u1t / ˛Var .u1t /
Cov D :
"2;t ˛Var .u1t / ˛ 2 Var .u1t / C Var .u2t /
Remark 12.21 (Importance of the ordering of the VAR) Suppose the our objective is to
analyze the effects of policy shocks on the other variables in the VAR system, for instance,
output and prices. The identification rests on the ordering of the VAR, that is, on the struc-
ture of the contemporaneous correlations as captured by F . It is therefore important to
understand how the results on the policy shock (denoted upt ) are changed if the variables
155
are reordered. We have the following result. First, the partitioning of y t into variables
which come before, x1t , and after, x2t , the policy instrument is important for upt and the
impulse response function of all variables with respect to upt . Second, the order within
x1t and x2t does not matter for upt or the impulse response function of any variable with
respect to upt . This suggests that we can settle for partial identification in the sense that
we must take a stand on which variables that come before and after the policy instru-
ment, but the ordering within those blocks are unimportant for understanding the effects
of policy shocks.
The identifying restrictions in Section 12.7.2 is actually the same as assuming that F is
triangular and that D D I . In this latter case, the restriction on the diagonal elements of F
has been moved to the diagonal elements of D. This is just a change of normalization (that
the structural shocks have unit variance). It happens that this alternative normalization is
fairly convenient when we want to estimate the VAR first and then recover the structural
parameters from the VAR estimates.
Example 12.22 (Change of normalization in Example 12.20) Suppose the structural shocks
in Example 12.20 have the covariance matrix
" # " #
u1;t 12 0
D D Cov D :
u2;t 0 22
to get
" #" # " #" # " #
1=1 0 xt B11 =1 B12 =1 xt 1 u1;t =1
D C :
˛=2 1=2 zt B21 =2 B22 =2 zt 1 u2;t =2
This structural form has a triangular F matrix (with diagonal elements that can be dif-
ferent from unity), and a covariance matrix equal to an identity matrix.
156
The reason why this alternative normalization is convenient is that it allows us to use
the widely available Cholesky decomposition.
Remark 12.24 Note the following two important features of the Cholesky decomposition.
First, each column of P is only identified up to a sign transformation; they can be reversed
at will. Second, the diagonal elements in P are typically not unity.
Remark 12.25 (Changing sign of column and inverting.) Suppose the square matrix A2
is the same as A1 except that the i t h and j t h columns have the reverse signs. Then A2 1
is the same as A1 1 except that the i t h and j t h rows have the reverse sign.
This set of identifying restrictions can be implemented by estimating the VAR with
LS and then take the following steps.
0
Step 1. From (12.19) ˝ D F 1 I F 1 (recall D D I is assumed), so a Cholesky
decomposition recovers F 1 (lower triangular F gives a similar structure of F 1 ,
and vice versa, so this works). The signs of each column of F 1 can be chosen
freely, for instance, so that a productivity shock gets a positive, rather than negative,
effect on output. Invert F 1 to get F .
Step 2. Invert the expressions in (12.19) to calculate the structural parameters from
the VAR parameters as ˛ D F, and Bs D FAs .
Example 12.26 (Identification of the 2 1 case.) Suppose the structural form of the
previous example is
" #" # " #" # " #" # " #
F11 0 xt B1;11 B1;12 xt 1 B2;11 B2;12 xt 2 u1;t
D C C ;
F21 F22 zt B1;21 B1;22 zt 1 B2;21 B2;22 zt 2 u2;t
" #
1 0
with D D :
0 1
157
Step 1 above solves
" # " # 1
0" # 1 10
˝11 ˝12 F11 0 F11 0
D @ A
˝12 ˝22 F21 F22 F21 F22
2 3
1 F21
2 2
F11 F11 F22
D4 F21
2
F21 CF11 2
5
2 2 2
F11 F22 F11 F22
for the three unknowns F11 , F21 , and F22 in terms of the known ˝11 , ˝12 , and ˝22 . Note
that the identifying restrictions are that D D I (three restrictions) and F12 D 0 (one
restriction). (This system is just four nonlinear equations in three unknown - one of the
equations for ˝12 is redundant. You do not need the Cholesky decomposition to solve it,
since it could be solved with any numerical solver of non-linear equations—but why make
life even more miserable?)
y t D ei t C C1 ei t 1 C C 2 ei t 2 C ::: (12.22)
where ˝ is the covariance matrix of the VAR residuals, ˝ D Cov." t /. (Notice that
diag.˝/ 1 should be understood as the inverse of diag.˝/).
The generalized impulse response function can be interpreted as follows: what is the
typical response to the i th shock, when we incorporate the typical comovement of that
shock with the other shocks? To see that, notice that if the residuals are normally dis-
tributed, then the expected value of the vector of residuals, conditional on the ith element
158
being one, is
E." t j"i t D 1/ D ˝i =!i i (12.24)
where ˝i is column i of ˝ (that is, Cov." t ; "i t /). Clearly, when the shocks are uncorre-
lated, then this approach coincides with a traditional impulse response function.
Comparing with the structural form (12.20) where the impulse response function (with
respect to the i th shock) is calculated as
1 1 1
yt D F ei t C C 1 F ei t 1 C C2 F ei t 2 C :::; (12.25)
1
F D ˝diag.˝/ 1 , so F D diag.˝/˝ 1
: (12.26)
Example 12.27 (Generalized impulse response function when the true structural form is
recursive) Consider Example 12.20 and suppose Var.ui t / D 1 notice that we then have
" #" #" #0 " #
1 0 1 0 1 0 1 ˛
˝D D , so
˛ 1 0 1 ˛ 1 ˛ ˛2 C 1
" #
2
1 ˛=.˛ C 1/
˝diag.˝/ 1 D :
˛ 1
This is used in the generalized impulse response function (12.23). The generalized im-
pulse response with respect to the first shock is then calculated as
" #" # " #
1 ˛=.˛ 2 C 1/ 1 1
Cs D Cs :
˛ 1 0 ˛
The implication is that when the first shock is one, then the second shock is assumed to
be ˛, which is the same as in Example 12.20. Similarly, the generalized impulse response
with respect to the second shock is then calculated as
" #" # " #
1 ˛=.˛ 2 C 1/ 0 ˛=.˛ 2 C 1/
Cs D Cs :
˛ 1 1 1
The implication is that when the second shock is one, then the first shock is assumed to be
˛=.˛ 2 C 1/, which is different from Example 12.20 (where the first shock is then assumed
to be zero).
159
Example 12.28 (Generalized impulse response function and the implied structural form)
Using the results from the previous example in (12.26) gives
" #
1 ˛2 C 1 ˛
F D diag.˝/˝ D ;
˛.˛ 2 C 1/ ˛ 2 C 1
which can be normalized to get unit coefficients on the diagonal by multiplying by the
inverse of diag.F /
" #
2
1 ˛=.˛ C 1/
F D F diag.F / 1 D :
˛ 1
This shows the same pattern as in the previous example. The first column coincides with
the (true) structural form, while the second column (wrongly) breaks the recursive struc-
ture by assuming that the first variable depends (contemporaneously) on the second vari-
able. The reason is that the generalized impulse response function is based on the cor-
relation pattern—and it is indeed the case that the first variable is correlated with the
second. However, that is not driven by shocks to the second variable (as assumed by the
generalized impulse response function), rather the opposite.
1 1 0
˝DF F : (12.27)
As before we need n.n 1/=2 restrictions on F , but this time we don’t want to impose
the restriction that all elements in F above the principal diagonal are zero. Given these
restrictions (whatever they are), we can solve for the remaining elements in B, typically
with a numerical method for solving systems of non-linear equations.
160
12.7.6 What if the VAR Shocks are Uncorrelated (˝ D I )?
Suppose we estimate a VAR and find that the covariance matrix of the estimated residuals
is (almost) an identity matrix (or diagonal). Does this mean that the identification is
superfluous? No, not in general. Yes, if we also want to impose the restrictions that F is
triangular.
There are many ways to reshuffle the shocks and still get orthogonal shocks. Recall
that the structural shocks are linear functions of the VAR shocks, u t D F " t , and that we
assume that Cov." t / D ˝ D I and we want Cov.u t / D I , that, is from (12.19) we then
have (D D I )
FF 0 D I: (12.28)
There are many such F matrices: the class of those matrices even have a name: orthogonal
matrices (all columns in F are orthonormal). However, there is only one lower triangular
F which satisfies (12.28) (the one returned by a Cholesky decomposition, which is I ).
Suppose you know that F is lower triangular (and you intend to use this as the identi-
fying assumption), but that your estimated ˝ is (almost, at least) diagonal. The logic then
requires that F is not only lower triangular, but also diagonal. This means that u t D " t
(up to a scaling factor). Therefore, a finding that the VAR shocks are uncorrelated com-
bined with the identifying restriction that F is triangular implies that the structural and
reduced form shocks are proportional. We can draw no such conclusion if the identifying
assumption is something else than lower triangularity.
161
which is an identity matrix since cos2 C sin2 D 1. The transformation u D G 0 " gives
u t D " t for t ¤ i; k
ui D " i c "k s
uk D "i s C "k c:
Example 12.30 (Givens rotations and the F matrix.) We could take F in (12.28) to be
(the transpose) of any such sequence of givens rotations. For instance, if G1 and G2 are
0
givens rotations, then F D G10 or F D G2 or F D G10 G20 are all valid.
Suppose we have estimated a VAR system (12.1) for the first differences of some variables
y t D x t , and that we have calculated the impulse response function as in (12.8), which
we rewrite as
x t D C .L/ " t C x t 1
162
As before the structural shocks, u t , are
The C C .L/ polynomial is known from the estimation, so we need to identify F in order to
use this equation for impulse response function and variance decompositions with respect
to the structural shocks.
As before we assume that D D I , so
1 1 0
˝DF D F (12.32)
@x t Cs
lim D lim CsC F 1
s!1 @u0t s!1
1
D C.1/F ; (12.33)
where C.1/ D j1D0 Cs . We impose n.n 1/=2 restrictions on these long run responses.
P
1
D C.1/F (12.35)
by calculating the Cholesky decomposition of the left hand side of (12.34) (which is
163
1
available from the VAR estimate). Finally, we solve for F from (12.35).
or since " t D F 1 u t
" # " # " # " #
xt u 1;t u1;t 1 u1;t 2
DF 1 1
C A2 C A C I F 1
C.A C I / F C:::
zt u2;t u2;t 1 u2;t 2
There are 8+3 parameters in the structural form and 4+3 parameters in the VAR, so we
need four restrictions. Assume that Cov.u t / D I (three restrictions) and that the long
run response of u1;t s on x t is zero, that is,
" # " # 1
unrestricted 0 2
F11 F12
D I C A C A C :::
unrestricted unrestricted F21 F22
" # 1
1 F11 F12
D .I A/
F21 F22
" # 1 " # 1
1 A11 A12 F11 F12
D :
A21 1 A22 F21 F22
164
The upper right element of the right hand side is
F12 C F12 A22 C A12 F11
.1 A22 A11 C A11 A22 A12 A21 / .F11 F22 F12 F21 /
1 1 0
which is one restriction on the elements in F . The other three are given by F F D
˝, that is,
2 2
2 3 "
F22 CF12 F22 F21 CF12 F11
#
4 .F11 F22 F12 F21 / 2 2
.F11 F22 F12 F21 / 5 ˝ 11 ˝ 12
F22 F21 CF12 F11
2
F21 2
CF11
D :
2 2
˝ 12 ˝ 22
.F11 F22 F12 F21 / .F11 F22 F12 F21 /
These notes are a reading guide to Mellander, Vredin, and Warne (1992), which is well be-
yond the first year course in econometrics. See also Englund, Vredin, and Warne (1994).
(I have not yet double checked this section.)
t D t 1 C 't , (12.37)
where ˚ .L/ is a stable matrix polynomial in the lag operator. We see that the k 1 vector
' t has permanent effects on (at least some elements in) y t , while the r 1 (r D n k)
t does not.
Remark 12.32 (Lag operator.) We have the following rules: (i) Lk x t D x t k ; (ii) if
˚ .L/ D a C bL m C cLn , then ˚ .L/ .x t C y t / D a .x t C y t / C b .x t Cm C y t Cm / C
c .x t n C y t n / and ˚ .1/ D a C b C c.
165
Example 12.33 (Söderlind and Vredin (1996)). Suppose we have
2 3 2 3
ln Y t (output) 0 1
6 7 6 7
6 ln P t (price level) 7 6 1 1 7
yt D 66 ln M (money stock)
7, D 6
7 6 1 0 7,
7
4 t 5 4 5
ln R t (gross interest rate) 0 0
" #
money supply trend
and t D ;
productivity trend
then we see that ln R t and ln Y t C ln P t ln M t (that is, log velocity) are stationary, so
" #
0 0 0 1
˛0 D
1 1 1 0
are (or rather, span the space of) cointegrating vectors. We also see that ˛ 0 D 022 .
This can easily be rewritten on the VAR form (12.1) or on the vector MA representation
for y t
y t D C .L/ " t C y t 1
166
We now try to write (12.41) in a form which resembles the common trends representation
(12.36)-(12.37) as much as possible.
We want to split a vector of non-stationary series into some random walks and the rest
(which is stationary). Rewrite (12.41) by adding and subtracting C.1/." t C " t 1 C :::/
y t D C .1/ ." t C " t 1 C "t 2 C ::: C "0 /CŒC.L/ C .1/ ." t C " t
C ::: C "0 / : 1 C "t 2
(12.42)
Suppose "s D 0 for s < 0 and consider the second term in (12.42). It can be written
I C C1 L C C2 L2 C ::::
C .1/ ." t C " t 1 C "t 2 C ::: C "0 /
D /*since C .1/ D I C C1 C C2 C :::*/
Œ C1 C2 C3 ::: " t C Œ C2 C3 ::: " t 1 C Œ C3 ::: " t 2: (12.43)
t D t 1 C "t ; (12.44)
D "t C "t 1 C "t 2 C ::: C "0 :
(12.48)
167
h i0
0 0
Since both " t and ' t t are white noise, we notice that the response of y t Cs to
either must be the same, that is,
" #
h i '
t
C .1/ C Cs " t D
0nr C ˚s for all t and s 0: (12.49)
t
This means that the VAR shocks are linear combinations of the structural shocks (as
in the standard setup without cointegration)
" #
't
D F "t
t
" #
Fk
D "t : (12.50)
Fr
must hold for all s 0. In particular, it must hold for s ! 1 where both Cs and ˚s
vanishes
C .1/ D Fk : (12.52)
Step 1. From (12.36) we see that ˛ 0 D 0rk must hold for ˛ 0 y t to be stationary.
Given an (estimate of) ˛, this gives rk equations from which we can identify rk
elements in . (It will soon be clear why it is useful to know ).
2
Equivalently, we can use (12.52) and (12.51) to calculate and ˚s (for all s) and then calculate the
impulse response function from (12.48).
168
Step 2. From (12.49) we have ' t D C .1/ " t as s ! 1. The variances of both
sides must be equal
169
on Fr are restrictions on @y t =@ t0 , that is, on the contemporaneous response.
This is exactly as in the standard case without cointegration.
Bibliography
Bernanke, B., 1986, “Alternative explanations of the money-income correlation,”
Carnegie-Rochester Series on Public Policy, 25, 49–100.
Englund, P., A. Vredin, and A. Warne, 1994, “Macroeconomic shocks in an open econ-
omy - a common trends representation of Swedish data 1871-1990,” in Villy Bergström,
and Anders Vredin (ed.), Measuring and interpreting business cycles . pp. 125–233,
Claredon Press.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
King, R. G., 1986, “Money and business cycles: comments on Bernanke and related
literature,” Carnegie-Rochester Series on Public Policy, 25, 101–116.
Mellander, E., A. Vredin, and A. Warne, 1992, “Stochastic trends and economic fluctua-
tions in a small open economy,” Journal of Applied Econometrics, 7, 369–394.
Pesaran, H. H., and Y. Shin, 1998, “Generalized impulse response analysis in linear mul-
tivariate models,” Economics Letters, 58, 17–29.
Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts,
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.
170
Söderlind, P., and A. Vredin, 1996, “Applied cointegration analysis in the mirror of
macroeconomic theory,” Journal of Applied Econometrics, 11, 363–382.
171
12 Kalman filter
The distribution of the random variable Z conditional on that X D x is also normal with
mean (expectation of the random variable Z conditional on that the random variable X
has the value x)
ZN C ˙zx ˙xx1 x XN ;
E .Zjx/ D „ƒ‚… (12.2)
„ ƒ‚ … „ƒ‚…„ƒ‚…„ ƒ‚ …
m1 m1 mn nn n1
172
12.2 Kalman Recursions
where y t and t are n1 vectors, and Z an nm matrix. (12.4) expresses some observable
variables y t in terms of some (partly) unobservable state variables ˛ t . The transition
equation for the states is
where ˛ t and u t are m 1 vectors, and T an m m matrix. This system is time invariant
since all coefficients are constant. It is assumed that all errors are normally distributed,
and that E. t u t s / D 0 for all s.
173
Now we want an estimate of ˛ t based ˛O t 1 . From (12.5) the obvious estimate, denoted
by ˛ tjt 1 , is
˛O t jt 1 D T ˛O t 1 . (12.7)
D E ŒT ˛ t 1 C u t T ˛O t 1 ŒT ˛ t 1 C u t T ˛O t 1 0
˚
D E ŒT .˛O t 1 ˛ t 1 / u t ŒT .˛O t 1 ˛ t 1 / u t 0
˚
0
D TP t 1T C Q; (12.8)
where we have used (12.5), (12.6), and the fact that u t is uncorrelated with ˛O t 1 ˛t 1.
yO t jt 1 D Z ˛O t jt 1, (12.9)
174
The variance of the prediction error is
F t D E v t v t0
n 0 o
D E Z ˛ t ˛O t jt 1 C t Z ˛ t ˛O t jt 1 C t
h 0 i 0
D ZE ˛ t ˛O t jt 1 ˛ t ˛O t jt 1 Z C E t t0
0
D ZP t jt 1Z C H; (12.11)
where we have used the definition of P t jt 1 in (12.8), and of H in 12.4. Similarly, the
covariance of the prediction errors for y t and for ˛ t is
Cov ˛ t ˛O t jt 1; yt yO t jt 1 D E ˛ t ˛O tjt 1 y t yO t jt 1
n 0 o
D E ˛ t ˛O t jt 1 Z ˛ t ˛O tjt 1 C t
h 0 i
D E ˛ t ˛O t jt 1 ˛ t ˛O t jt 1 Z 0
0
D P t jt 1Z : (12.12)
Suppose that y t is observed and that we want to update our estimate of ˛ t from ˛O tjt 1
175
with variance
0 0 0
1
P t D P t jt 1 P tjt 1 Z ZP t jt 1 Z C H ZP t jt 1 , (12.14)
„ƒ‚… „ƒ‚… „ ƒ‚ …„ ƒ‚ …„ ƒ‚ …
Var.zjx/ ˙zz ˙zx ˙xx1 ˙xz
0 0
1 0 0
1
P tjt 1Z ZP t jt 1Z CH E yt Z˛ t jt 1 yt Z˛ t jt 1 ZP t jt 1Z CH ZP t jt 1;
(12.15)
where we have exploited the symmetry of covariance matrices. Note that y t Z˛ t jt 1 D
y t yO t jt 1 , so the middle term in the previous expression is
0 0
E yt Z˛ t jt 1 yt Z˛ t jt 1 D ZP t jt 1Z C H: (12.16)
The Kalman algorithm calculates optimal predictions of ˛ t in a recursive way. You can
also calculate the prediction errors v t in (12.10) as a by-prodct, which turns out to be
useful in estimation.
2. Calculate (12.7), (12.8), (12.13), and (12.14) in that order. This gives values for ˛O t
and P t . If you want v t for estimation purposes, calculate also (12.10) and (12.11).
Increase t with one step.
3. Iterate on 2 until t D T .
One choice of starting values that work in stationary models is to set P0 to the uncon-
ditional covariance matrix of ˛ t , and ˛0 to the unconditional mean. This is the matrix P
176
to which (12.8) converges: P D TP T 0 C Q. (The easiest way to calculate this is simply
to start with P D I and iterate until convergence.)
In non-stationary model we could set
For any (conditionally) Gaussian time series model for the observable y t the log likeli-
hood for an observation is
n 1 1 0 1
ln L t D ln .2/ ln jF t j v F vt : (12.18)
2 2 2 t t
In case the starting conditions are as in (12.17), the overall log likelihood function is
( P
T
ln L t in stationary models
ln L D PTt D1 (12.19)
t DmC1 ln L t in non-stationary models.
We can, of course, use all the asymptotic MLE theory, like likelihood ratio tests etc. For
diagnostoic tests, we will most often want to study the normalized residuals
p
vQ i t D vi t = element i i in F t , i D 1; :::; n;
since element i i in F t is the standard deviation of the scalar residual vi t . Typical tests are
CUSUMQ tests for structural breaks, various tests for serial correlation, heteroskedastic-
ity, and normality.
Bibliography
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,
Cambridge University Press.
177
Lütkepohl, H., 1993, Introduction to multiple time series, Springer-Verlag, 2nd edn.
178
13 Outliers and Robust Estimators
O and uO t D y t
yO t D x t0 ˇ; yO t : (13.4)
A common way to study the sensitivity of the results with respect to excluding observa-
tions is to plot ˇO .s/ ˇ, O and yOs.s/ yOs . Note that we here plot the fitted value of ys using
the coefficients estimated by excluding observation s from the sample. Extreme values
prompt a closer look at data (errors in data?) and perhaps also a more robust estimation
method than LS, which is very sensitive to outliers.
Another useful way to spot outliers is to study the standardized residuals, uO s =O and
.s/
uO s =O .s/ , where O and O .s/ are standard deviations estimated from the whole sample and
excluding observation s, respectively. Values below -2 or above 2 warrant attention (recall
179
that Pr.x > 1:96/ 0:025 in a N.0; 1/ distribution).
Sometimes the residuals are instead standardized by taking into account the uncer-
tainty of the estimated coefficients. Note that
uO .s/
t D yt x t0 ˇO .s/
D u t C x t0 ˇ ˇO .s/ ; (13.6)
When t D s, which is the case we care about, the covariance term drops out since ˇO .s/
cannot be correlated with us since period s is not used in the estimation (this statement
assumes that shocks are not autocorrelated). The first term is then estimated as the usual
variance of the residuals (recall that period s is not used) and the second term is the
estimated covariance matrix of the parameter vector (once again excluding period s) pre-
and postmultiplied by xs .
Example 13.1 (Errors are iid independent of the regressors.) In this case the variance of
the parameter vector is estimated as O 2 .˙x t x t0 / 1 (excluding period s), so we have
Var uO .s/ D O 2 1 C xs0 .˙x t x t0 / 1 xs :
t
180
Rescursive residuals from AR(1) with corr=0.85
CUSUM statistics and 95% confidence band
50
2
0 0
−2
−50
0 100 200 0 100 200
period period
Figure 13.1: This figure shows recursive residuals and CUSUM statistics, when data are
simulated from y t D 0:85y t 1 C u t , with Var.u t / D 1.
This is repeated for the rest of the sample by extending the sample used in the estimation
by one period, making a one-period ahead forecast, and then repeating until we reach the
end of the sample.
A first diagnosis can be made by examining the standardized residuals, uO Œs sC1 = O Œs ,
where O Œs can be estimated as in (13.7) with a zero covariance term, since usC1 is not
correlated with data for earlier periods (used in calculating ˇO Œs ), provided errors are not
autocorrelated. As before, standardized residuals outside ˙2 indicates problems: outliers
or structural breaks (if the residuals are persistently outside ˙2).
The CUSUM test uses these standardized residuals to form a sequence of test statistics.
A (persistent) jump in the statistics is a good indicator of a structural break. Suppose we
use r observations to form the first estimate of ˇ, so we calculate ˇO Œs and uO Œs
sC1 =O Œs for
s D r; :::; T . Define the cumulative sums of standardized residuals
t
uO Œs
X
Wt D O Œs , t D r; :::; T:
sC1 = (13.9)
sDr
Under the null hypothesis that no structural breaks occurs, that is, that the true ˇ is the
same for the whole sample, W t has a zero mean and a variance equal to the number of
elements in the sum, t r C 1. This follows from the fact that the standardized resid-
uals all have zero mean and unit variance and are uncorrelated with each other. Typ-
pW t is plotted along with
ically,
p
a 95%
confidence interval, which can be shown to be
˙ a T r C 2a .t r/ = T r with a D 0:948. The hypothesis of no structural
break is rejected if the W t is outside this band for at least one observation. (The derivation
of this confidence band is somewhat tricky, but it incorporates the fact that W t and W t C1
181
are very correlated.)
Reference: Greene (2000) 9.8.1; Rousseeuw and Leroy (1987); Donald and Maddala
(1993); and Judge, Griffiths, Lütkepohl, and Lee (1985) 20.4.
Outliers and other extreme observations can have very decisive influence on the estimates
of the key statistics needed for financial analysis, including mean returns, variances, co-
variances and also regression coefficients.
The perhaps best way to solve these problems is to carefully analyse the data—and
then decide which data points to exclude. Alternatively, robust estimators can be applied
instead of the traditional ones.
To estimate the mean, the sample average can be replaced by the median or a trimmed
mean (where the x% lowest and highest observations are excluded).
Similarly, to estimate the variance, the sample standard deviation can be replaced by
the interquartile range (the difference between the 75th and the 25th percentiles), divided
by 1:35
StdRobust D Œquantile.0:75/ quantile.0:25/=1:35; (13.10)
Both these would coincide with the standard deviation if data was indeed drawn from a
normal distribution without outliers.
A robust covariance can be calculated by using the identity
and using a robust estimator of the variances—like the square of (13.10). A robust cor-
relation is then created by dividing the robust covariance with the two robust standard
deviations.
See Figures 13.2–13.4 for empirical examples.
182
US industry portfolios, ERe
Monthly data 1947:1−2008:6
mean
0.11 median
0.1
0.09
β
0.08
0.07
0.06
A B C D E F G H I J
The idea of robust estimation is to give less weight to extreme observations than in
Least Squares. When the errors are normally distributed, then there should be very few ex-
treme observations, so LS makes a lot of sense (and is indeed the MLE). When the errors
have distributions with fatter tails (like the Laplace or two-tailed exponential distribution,
f .u/ D exp. juj =/=2), then LS is no longer optimal and can be fairly sensitive to
outliers. The ideal way to proceed would be to apply MLE, but the true distribution is
often unknown. Instead, one of the “robust estimators” discussed below is often used.
O Then, the least absolute deviations (LAD), least median squares
Let uO t D y t x t0 ˇ.
(LMS), and least trimmed squares (LTS) estimators solve
T
X
ˇOLAD D arg min juO t j (13.13)
ˇ
t D1
Note that the LTS estimator in (13.15) minimizes of the sum of the h smallest squared
residuals.
183
US industry portfolios, Std
Monthly data 1947:1−2008:6
std
0.2 iqr/1.35
0.18
β
0.16
0.14
0.12
A B C D E F G H I J
Example 13.2 (Algorithm for LAD.) The LAD estimator can be written
T
X
ˇOLAD D arg min w t uO 2t , w t D 1= juO t j ;
ˇ
tD1
so it is a weighted least squares where both y t and x t are multiplied by 1= juO t j. It can be
shown that iterating on LS with the weights given by 1= juO t j, where the residuals are from
the previous iteration, converges very quickly to the LAD estimator.
It can be noted that LAD is actually the MLE for the Laplace distribution discussed
above.
13.4 Multicollinearity
184
Correlations, traditional Difference in correlations: traditional − robust
1
0.1
0.5 0
−0.1
0
10 10
10 10
5 5 5 5
Industry Industry
ments in xi t or xjt . This means that the point estimates might fluctuate wildly over sub-
samples and it is often the case that individual coefficients are insignificant even though
the R2 is high and the joint significance of the coefficients is also high. The estimators
are still consistent and asymptotically normally distributed, just very imprecise.
A common indicator for multicollinearity is to standardize each element in x t by
subtracting the sample mean and then dividing by its standard deviation
The condition number of a matrix is the ratio of the largest (in magnitude) of the
eigenvalues to the smallest
c D jjmax = jjmin : (13.18)
(Some authors take c 1=2 to be the condition number; others still define it in terms of the
“singular values” of a matrix.) If the regressors are uncorrelated, then the condition value
of A is one. This follows from the fact that A is a (sample) covariance matrix. If it is
185
OLS vs LAD
2
1.5
1
0.5
0
−0.5 Data
−1 0.75*x
OLS
−1.5
LAD
−2
−3 −2 −1 0 1 2 3
x
Figure 13.5: This figure shows an example of how LS and LAD can differ. In this case
y t D 0:75x t C u t , but only one of the errors has a non-zero value.
diagonal, then the eigenvalues are equal to diagonal elements, which are all unity since
the standardization in (13.16) makes all variables have unit variances. Values of c above
several hundreds typically indicate serious problems.
Bibliography
Donald, S. G., and G. S. Maddala, 1993, “Identifying outliers and influential observations
in econometric models,” in G. S. Maddala, C. R. Rao, and H. D. Vinod (ed.), Handbook
of Statistics, Vol 11 . pp. 663–701, Elsevier Science Publishers B.V.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Judge, G. G., W. E. Griffiths, H. Lütkepohl, and T.-C. Lee, 1985, The theory and practice
of econometrics, John Wiley and Sons, New York, 2nd edn.
Rousseeuw, P. J., and A. M. Leroy, 1987, Robust regression and outlier detection, John
Wiley and Sons, New York.
186
14 Generalized Least Squares
Reference: Greene (2000) 11.3-4
Additional references: Hayashi (2000) 1.6; Johnston and DiNardo (1997) 5.4; Verbeek
(2004) 4
14.1 Introduction
This allows for both heteroskedasticity (different elements along the main diagonal) and
autocorrelation (non-zero off-diagonal elements). LS is still consistent even if ˝ is not
proportional to an identity matrix, but it is not efficient. Generalized least squares (GLS)
187
is. The trick of GLS is to transform the variables and the do LS.
Remark 14.1 If the n 1 vector x has a multivariate normal distribution with mean vec-
tor and covariance matrix ˝, then the joint probability density function is .2/ n=2 j˝j 1=2 expŒ .x
/0 ˝ 1 .x /=2.
If the T 1 vector u is N .0; ˝/, then the joint pdf of u is .2/ n=2 j˝j 1=2 expŒ u0 ˝ 1 u=2.
Change variable from u to y Xˇ (the Jacobian of this transformation equals one), and
take logs to get the (scalar) log likelihood function
n 1 1
ln L D ln .2/ ln j˝j .y Xˇ/0 ˝ 1
.y Xˇ/ : (14.3)
2 2 2
To simplify things, suppose we know ˝. It is then clear that we maximize the likelihood
function by minimizing the last term, which is a weighted sum of squared errors.
In the classical LS case, ˝ D 2 I , so the last term in (14.3) is proportional to the
unweighted sum of squared errors. The LS is therefore the MLE when the errors are iid
normally distributed.
When errors are heteroskedastic, but not autocorrelated, then ˝ has the form
2 3
2
0 0
6 1 :: 7
6 0 2 : 7
2
˝D6 : 7: (14.4)
6 7
6 :: : :: 0 7
4 5
0 0 T2
1
In this case, we can decompose ˝ as
2 3
1=1 0 0
6 :: 7
1
6 0 1=2 : 7
˝ D P 0 P , where P D 6 7: (14.5)
6 7
:: :::
6
4 : 0 7
5
0 0 1=T
188
The last term in (14.3) can then be written
1 1
.y Xˇ/0 ˝ 1
.y Xˇ/ D .y Xˇ/0 P 0 P .y Xˇ/
2 2
1
D .P y PXˇ/0 .P y PXˇ/ : (14.6)
2
This very practical result says that if we define y t D y t = t and x t D x t = t , then we
get ML estimates of ˇ running an LS regression of y t on x t . (One of the elements in x t
could be a constant—also this one should be transformed). This is the generalized least
squares (GLS).
Remark 14.3 Let A be an n n symmetric positive definite matrix. The Cholesky decom-
position gives the unique lower triangular P1 such that A D P1 P10 or an upper triangular
matrix P2 such that A D P20 P2 (clearly P2 D P10 ). Note that P1 and P2 must be invertible
(since A is).
1
˝ D P 0 P: (14.7)
The Cholesky decomposition is often a convenient tool, but other decompositions can
also be used. We can then apply (14.6) also in this case—the only difference is that P
is typically more complicated than in the case without autocorrelation. In particular, the
transformed variables P y and PX cannot be done line by line (y t is a function of y t ,
y t 1 , and perhaps more).
Example 14.4 (AR(1) errors, see Davidson and MacKinnon (1993) 10.6.) Let u t D
au t 1 C " t where " t is iid. We have Var.u t / D 2 = 1 a2 , and Corr.u t ; u t s / D as .
189
For T D 4, the covariance matrix of the errors is
h i0
˝ D Cov u1 u2 u3 u4
2 3
1 a a2 a3
2 6 a 1 a a2 7
6 7
D 6 7:
1 a2 6 4 a 2
a 1 a 7
5
3 2
a a a 1
The inverse is 2 3
1 a 0 0
2
6 7
1 1 6 a 1Ca a 0 7
˝ D 2 6 7;
6
4 0 a 1 C a2 a 7
5
0 0 a 1
and note that we can decompose it as
2 p 30 2 p 3
1 a2 0 0 0 1 a2 0 0 0
6 7 6 7
16 a 1 0 0 7 16 a 1 0 0
˝ 1D 6
7
7 6 7:
46 0 a 1 0 76
5 4 0 a 1 0 7
5
0 0 a 1 0 0 a 1
„ ƒ‚ …„ ƒ‚ …
P0 P
This is not a Cholesky decomposition, but certainly a valid decomposition (in case of
doubt, do the multiplication). Premultiply the system
2 3 2 3 2 3
y1 x10 u1
6 y2 7 6 x20 7
6 7 6 7 6 7
7 ˇ0 C 6 u 2 7
6 7
6 7D6
6 y 7 6 x0 7 6 u 7
4 3 5 4 3 5 4 3 5
0
y4 x4 u4
by P to get
2 p 3 2 p 3 2 p 3
.1 a2 /y1 .1 a2 /x10 .1 a2 /u1
x20 ax10
6 7 6 7 6 7
166 y2 ay1 7D 16 7 ˇ0 C 1 6 "2
7 6 7 6 7
7:
6 4 y3 ay2 7 6
5 4 x30 ax20 7
5 46 "3 7
5
y4 ay3 x40 ax30 "4
190
Note that all the residuals are uncorrelated in this formulation. Apart from the first ob-
servation, they are also identically distributed. The importance of the first observation
becomes smaller as the sample size increases—in the limit, GLS is efficient.
When the errors are not normally distributed, then the MLE approach in the previous
section is not valid. But we can still note that GLS has the same properties as LS has with
iid non-normally distributed errors. In particular, the Gauss-Markov theorem applies,
so the GLS is most efficient within the class of linear (in y t ) and unbiased estimators
(assuming, of course, that GLS and LS really are unbiased, which typically requires that
u t is uncorrelated with x t s for all s). This follows from that the transformed system
P y D PXˇ0 C P u
y D X ˇ0 C u ; (14.8)
191
established. Evidence from simulations suggests that the FGLS estimator can be a lot
worse than LS if the estimate of ˝ is bad.
To use maximum likelihood when ˝ is unknown requires that we make assumptions
about the structure of ˝ (in terms of a small number of parameters), and more gener-
ally about the distribution of the residuals. We must typically use numerical methods to
maximize the likelihood function.
Example 14.5 (MLE and AR(1) errors.) If u t in Example 14.4 are normally distributed,
then we can use the ˝ 1 in (14.3) to express the likelihood function in terms of the un-
known parameters: ˇ, , and a. Maximizing this likelihood function requires a numerical
optimization routine.
Bibliography
Davidson, R., and J. G. MacKinnon, 1993, Estimation and inference in econometrics,
Oxford University Press, Oxford.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th
edn.
Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
192
15 Nonparametric Regressions and Tests
Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Härdle (1990); Pagan and Ullah
(1999); Mittelhammer, Judge, and Miller (2000) 21
15.1.1 Introduction
Nonparametric regressions are used when we are unwilling to impose a parametric form
on the regression equation—and we have a lot of data.
Let the scalars y t and x t be related as
where " t is uncorrelated over time and where E " t D 0 and E." t jx t / D 0. The function
b./ is unknown and possibly non-linear.
Suppose the sample had 3 observations (say, t D 3, 27, and 99) with exactly the same
value of x t , say 1:9. A natural way of estimating b.x/ at x D 1:9 would then be to
average over these 3 observations as we can expect average of the error terms to be close
to zero (iid and zero mean).
Unfortunately, we seldom have repeated observations of this type. Instead, we may
try to approximate the value of b.x/ (x is a single value, 1.9, say) by averaging over
observations where x t is close to x. The general form of this type of estimator is
PT
O w.x t x/y t
b.x/ D Pt D1
T
; (15.2)
t D1 w.x t x/
where w.x t x/=˙ tTD1 w.x t x/ is the weight on observation t. Note that the denominator
makes the weights sum to unity. The basic assumption behind (15.2) is that the b.x/
function is smooth so local (around x) averaging makes sense.
As an example of a w.:/ function, it could give equal weight to the k values of x t
which are closest to x and zero weight to all other observations (this is the “k-nearest
193
neighbor” estimator, see Härdle (1990) 3.2). As another example, the weight function
O
could be defined so that it trades off the expected squared errors, EŒy t b.x/ 2
, and the
2O 2 2
expected squared acceleration, EŒd b.x/=dx . This defines a cubic spline (often used
in macroeconomics when x t D t , and is then called the Hodrick-Prescott filter).
Remark 15.1 (Easy way to calculate the “nearest neighbor” estimator, univariate case)
Create a matrix Z where row t is .y t ; x t /. Sort the rows of Z according to the second
column (x). Calculate an equally weighted centered moving average of the first column
(y).
A Kernel regression uses a pdf as the weight function, w.x t x/ D K Œ.x t x/= h,
where the choice of h (also called bandwidth) allows us to easily vary the relative weights
of different observations. The pdf of N.0; 1/ is often used for K./. This weighting func-
tion is positive, so all observations get a positive weight, but the weights are highest for
observations close to x and then taper off in a bell-shaped way. A low value of h means
that the weights taper off fast. See Figure 15.1 for an example.
With the N.0; 1/ kernel, we get the following estimator of b.x/ at a point x
PT xt x
2
t D1 K y t exp u =2
O
b.x/ D P h
, where K.u/ D p : (15.3)
T xt x
t D1 K h
2
Remark 15.2 (Kernel as a pdf of N.x; h2 /) If K.z/ is the pdf of an N.0; 1/ variable,
then K Œ.x t x/= h = h is the same as using an N.x; h2 / pdf of x t . Clearly, the 1= h term
would cancel in (15.3).
O
In practice we have to estimate b.x/ at a finite number of points x. This could, for
instance, be 100 evenly spread points in the interval between the minimum and the max-
imum values observed in the sample. See Figure 15.2 for an illustration. Special correc-
tions might be needed if there are a lot of observations stacked close to the boundary of
the support of x (see Härdle (1990) 4.4).
Example 15.3 (Kernel regression) Suppose the sample has three data points Œx1 ; x2 ; x3 D
Œ1:5; 2; 2:5 and Œy1 ; y2 ; y3 D Œ5; 4; 3:5. Consider the estimation of b.x/ at x D 1:9.
194
Data and weights for b(1.7) Data and weights for b(1.9)
5 5
weights (solid)
⊗ 1 1
⊗
4 4
0 0
1.4 1.6 1.8 2 2.2 2.4 1.4 1.6 1.8 2 2.2 2.4
xt xt
0
1.4 1.6 1.8 2 2.2 2.4
xt
The denominator is
XT p
.1:5 1:9/2 =2 .2 1:9/2 =2 .2:5 1:9/2 =2
K.x t x/ D e Ce Ce = 2
t D1
p
2:75= 2:
O
b.1:9/ 11:52=2:75 4:19:
195
Kernel regression
5.2
Data
5
h= 0.25
4.8 h= 0.2
4.6
4.4
y
4.2
4
3.8
3.6
3.4
1.4 1.6 1.8 2 2.2 2.4
x
Kernel regressions are typically consistent, provided longer samples are accompanied
by smaller values of h, so the weighting function becomes more and more local as the
sample size increases. It can be shown (see Härdle (1990) 3.1 and Pagan and Ullah (1999)
3.3–4) that under the assumption that x t is iid, the mean squared error, variance and bias
of the estimator at the value x are approximately (for general kernel functions)
h i n o2
O
MSE.x/ D Var b.x/ O
C BiasŒb.x/ , with
h
O
i 1 2 .x/ R 1
Var b.x/ D 1 K.u/2 du
T h f .x/
2
O 2 1 d b.x/ df .x/ 1 db.x/ R1
BiasŒb.x/ D h 2
C 1 K.u/u2 du: (15.4)
2 dx dx f .x/ dx
In these expressions, 2 .x/ is the variance of the residuals in (15.1) and f .x/ the marginal
density of x. The remaining terms are functions of either the true regression function or
the kernel.
With a N.0; 1/ kernel these expressions can be simplified since
R1 2 1 R1
1 K.u/ du D p and 1 K.u/u2 du D 1, if N.0; 1/ kernel. (15.5)
2
196
In this case (15.4) becomes
h
O
i 1 2 .x/ 1
Var b.x/ D p
T h f .x/ 2
2
O 2 1 d b.x/ df .x/ 1 db.x/
BiasŒb.x/ D h C : (15.6)
2 dx 2 dx f .x/ dx
A smaller h increases the variance (we effectively use fewer data points to estimate
b.x/) but decreases the bias of the estimator (it becomes more local to x). If h decreases
less than proportionally with the sample size (so hT in the denominator of the first term
increases with T ), then the variance goes to zero and the estimator is consistent (since the
bias in the second term decreases as h does).
The variance is a function of the variance of the residuals and the “peakedness” of the
kernel, but not of the b.x/ function. The more concentrated the kernel is (s K.u/2 du
large) around x (for a given h), the less information is used in forming the average around
x, and the uncertainty is therefore larger—which is similar to using a small h. A low
density of the regressors (f .x/ low) means that we have little data at x which drives up
the uncertainty of the estimator.
The bias increases (in magnitude) with the curvature of the b.x/ function (that is,
.d 2 b.x/=dx 2 /2 ). This makes sense, since rapid changes of the slope of b.x/ make it hard
to get b.x/ right by averaging at nearby x values. It also increases with the variance of
the kernel since a large kernel variance is similar to a large h.
It is clear that the choice of h has a major importance on the estimation results. A
lower value of h means a more “local” averaging, which has the potential of picking up
sharp changes in the regression function—at the cost of being more affected by random-
ness. See Figures 15.3–15.4 for an example.
A good (but computationally intensive) approach to choose h is by the leave-one-out
cross-validation technique. This approach would, for instance, choose h to minimize the
expected (or average) prediction error
XT h i2
EPE.h/ D yt bO t .x t ; h/ =T; (15.7)
t D1
197
Drift vs level, in bins Volatility vs level, in bins
1.5
0.2 Volt+1 = (∆yt+1 − fitted ∆yt+1)2
1
0.1
Volt+1
∆yt+1
0
0.5
−0.1
−0.2 0
0 5 10 15 20 0 5 10 15 20
yt yt
Daily federal funds rates 1954−2006
x t )—and then repeat this for different values of h to find the minimum.
Remark 15.4 (Speed and fast Fourier transforms) The calculation of the kernel estimator
can often be speeded up by the use of a fast Fourier transform.
If the observations are independent, then it can be shown (see Härdle (1990) 4.2,
Pagan and Ullah (1999) 3.3–6, and also (15.6)) that, with a Gaussian kernel, the estimator
at point x is asymptotically normally distributed
p h i
1 2
.x/
O
T h b.x/ O
E b.x/ d
! N 0; p ; (15.8)
2 f .x/
where 2 .x/ is the variance of the residuals in (15.1) and f .x/ the marginal density of
x. (A similar expression holds for other choices of the kernel.) This expression assumes
that the asymptotic bias is zero, which is guaranteed if h is decreased (as T increases)
slightly faster than T 1=5 . In practice, to implement this expression 2 .x/ is estimated as
a kernel regression (15.3) but with Œy t b.x O t /2 substituted for y t , f .x/ is estimated by
a kernel density estimate as in (15.24) and it is assumed that the asymptotic bias is zero
O
(E b.x/ D b.x/). Notice that the estimation of 2 .x/ is quite computationally intensive
O
since it requires estimating b.x/ at every point x D x t in the sample.
See Figure 15.5 for an example—and Figure 15.7 for the reason why the confidence
band varies across x values.
198
Drift vs level, kernel regression Vol vs level, kernel regression
1.5
0.2 Small h Small h
Large h 1 Large h
0.1
Volt+1
∆yt+1
0
0.5
−0.1
−0.2 0
0 5 10 15 20 0 5 10 15 20
yt yt
where " t is uncorrelated over time and where E " t D 0 and E." t jx t ; z t / D 0. This makes
the estimation problem much harder since there are typically few observations in every
bivariate bin (rectangle) of x and z. For instance, with as little as a 20 intervals of each
of x and z, we get 400 bins, so we need a large sample to have a reasonable number of
observations in every bin.
199
Drift vs level, kernel regression Vol vs level, kernel regression
1.5
0.2 Volt+1 = (∆yt+1 − fitted ∆yt+1)2
Point estimate and 1
0.1
Volt+1
90% confidence band
∆yt+1
0
0.5
−0.1
−0.2 0
0 5 10 15 20 0 5 10 15 20
yt yt
Daily federal funds rates 1954−2006
In any case, the most common way to implement the kernel regressor is to let
PT
xt x zt z
K
t D1 x hx
K z hz
yt
Ob.x; z/ D ; (15.10)
PT xt x zt z
t D1 Kx hx
K z hz
where Kx .u/ and Kz .v/ are two kernels like in (15.3) and where we may allow hx and
hy to be different (and depend on the variance of xt andy t ).
In this case, the weight of
xt x zt z
the observation (x t ; z t ) is proportional to Kx hx Kz hz , which is high if both x t
and y t are close to x and y respectively.
A possible way out of the curse of dimensionality of the multivariate kernel regression is
to specify a partially linear model
where " t is uncorrelated over time and where E " t D 0 and E." t jx t ; z t / D 0. This model
is linear in z t , but possibly non-linear in x t since the function b.x t / is unknown.
To construct an estimator, start by taking expectations of (15.11) conditional on x t
200
Subtract from (15.11) to get
The “double residual” method (see Pagan and Ullah (1999) 5.2) has several steps. First,
estimate E.y t jx t / by a kernel regression of y t on x t , bOy .x/, and E.z t jx t / by a similar
kernel regression of z t on x t , bOz .x/. Second, use these estimates in (15.13)
and estimate ˇ by least squares. Third, use these estimates in (15.12) to estimate b.x t / as
O t / D bOy .x t /
b.x O
bOz .x t /0 ˇ: (15.15)
It can be shown that (under the assumption that y t , z t and x t are iid)
p
T .ˇO ˇ/ !d N 0; Var." t / Cov.z t jx t / 1 :
(15.16)
We can consistently estimate Var." t / by the sample variance of the fitted residuals in
(15.11)—plugging in the estimated ˇ and b.x t /: and we can also consistently estimate
Cov.z t jx t / by the sample variance of z t bOz .x t /. Clearly, this result is as if we knew the
non-parametric parts of the problem (which relies on the consistency of their estimators),
so it is only an asymptotic results. By using this result, together with the known properties
of the kernel regressions (see, for instance, 15.6), it should be possible to apply the delta
method to (15.15) to construct the asymptotic variance of b.x O t / (that is, at a given point
x t ).
Reference: Harvey (1989) 260, Davidson and MacKinnon (1993) 267, Silverman (1986);
Mittelhammer (1996), DeGroot (1986)
201
15.2.1 Parametric Tests of Normal Distribution
The skewness, kurtosis and Bera-Jarque test for normality are useful diagnostic tools. For
an iid normally distributed variable, x t iid N ; 2 , they are
This is implemented by using the estimated mean and standard deviation. The distribu-
tions stated on the right hand side of (15.17) are under the null hypothesis that x t is iid
N ; 2 .
The intuition for the 22 distribution of the Bera-Jarque test is that both the skewness
and kurtosis are, if properly scaled, N.0; 1/ variables. It can also be shown that they,
under the null hypothesis, are uncorrelated. The Bera-Jarque test statistic is therefore a
sum of the square of two uncorrelated N.0; 1/ variables, which has a 22 distribution.
The Bera-Jarque test can also be implemented as a test of overidentifying restrictions
in GMM. The moment conditions
2 3
xt
T 6 2 2 7
7
2 1 X6 .x t /
g.; / D 6
3
7; (15.18)
T t D1 6
4 .x t /
7
5
.x t /4 3 4
should all be zero if x t is N.; 2 /. We can estimate the two parameters, and 2 , by
using the first two moment conditions only, and then test if all four moment conditions
are satisfied. It can be shown that this is the same as the Bera-Jarque test if x t is indeed
iid N.; 2 /.
202
Empirical distribution function and theoretical cdf
0.8
0.6
0.4
data
0.2 cdf of N(3,1)
EDF
0
defined as the fraction of observations which are less or equal to x, that is,
T
(
1X 1 if q is true
EDF .x/ D ı.x t x/; where ı.q/ D (15.19)
T tD1 0 else.
The EDF.x t / and F .x t / are often plotted against the sorted (in ascending order) sample
fx t gTtD1 . See Figure 15.6.
Example 15.5 (EDF) Suppose we have a sample with three data points: Œx1 ; x2 ; x3 D
Œ5; 3:5; 4. The empirical distribution function is then as in Figure 15.6.
Example 15.6 (Kolmogorov-Smirnov test statistic) Figure 15.6 also shows the cumula-
tive distribution function (cdf) of a normally distributed variable. The test statistic (15.20)
is then the largest difference (in absolute terms) of the EDF and the cdf—among the ob-
served values of x t .
203
p
We reject the null hypothesis that EDF.x/ D F .x/ if T D t > c, where c is a critical
value which can be calculated from
p 1
2i 2 c 2
X
lim Pr T DT c D 1 2 . 1/i 1
e : (15.21)
T !1
i D1
It can be approximated by replacing 1 with a large number (for instance, 100). For
instance, c D 1:35 provides a 5% critical value. There is a corresponding test for com-
paring two empirical cdfs.
Pearson’s 2 test does the same thing as the K-S test but for a discrete distribution.
Suppose you have K categories with Ni values in category i. The theoretical distribution
predicts that the fraction pi should be in category i , with K
P
i D1 pi D 1. Then
K
X .Ni Tpi /2 2
K 1: (15.22)
i D1
Tpi
exp u2 =2
1 XT x
t x
fO .x/ D K , where K.u/ D p : (15.24)
Th t D1 h 2
204
Histogram (scaled: area=1) Kernel density estimation
0.2 0.2
Small h
0.15 0.15 Large h
0.1 0.1
0.05 0.05
0 0
0 5 10 15 20 0 5 10 15 20
yt yt
The value h D Std.x t /1:06T 1=5 is sometimes recommended, since it can be shown to be
the optimal choice (in MSE sense) if data is normally distributed and the N.0; 1/ kernel
is used. Clearly, using K Œ.x t x/= h = h is the same as using pdf of N.x; h2 /. The
bandwidth h could be chosen by a leave-one-out cross-validation technique.
The results on bias and variance in (15.4) are approximately true also for the kernel
density estimation if we interpret b.x/ as the pdf of x. In particular, it can be shown that
(with iid data and a Gaussian kernel) the asymptotic distribution is
p h i
1
O O d
T h f .x/ E f .x/ ! N 0; p f .x/ ; (15.25)
2
The easiest way to handle a bounded support of x is to transform the variable into one
with an unbounded support, estimate the pdf for this variable, and then use the “change
of variable” technique to transform to the pdf of the original variable.
We can also estimate multivariate pdfs. Let x t be a d 1 matrix and ˝O be the estimated
covariance matrix of x t . We can then estimate the pdf at a point x by using a multivariate
205
Gaussian kernel as
1 XT
fO .x/ D expŒ .x x t /0 ˝O 1
.x x t /=.2h2 /: (15.26)
O 1=2
T hd .2/d=2 j˝j t D1
1=.d C4/
The value h D 0:96T is sometimes recommended.
This estimator involve non-linearities, but a a simple iteration works nicely. If we assume
that the median of the true residual, u t , is zero, then we (typically) have
p XT
T .ˇOLAD ˇ0 / !d N 0; f .0/ 2 ˙xx1 =4 , where ˙xx D plim x t x t0 =T;
t D1
(15.28)
where f .0/ is the value of the pdf of u t at zero. Unless we know this density function
(or else we would probably have used MLE instead of LAD), we need to estimate it—for
instance with a kernel density method.
p
Example 15.7 (N.0; 2 /) When u t N.0; 2 ), then f .0/ D 1= 2 2 , so the covari-
ance matrix in (15.28) becomes 2 ˙xx1 =2. This is =2 times larger than when using
LS.
Remark 15.8 (Algorithm for LAD) The LAD estimator can be written
T
X
ˇOLAD D arg min w t uO t .b/2 , w t D 1= juO t .b/j ;
ˇ
t D1
so it is a weighted least squares where both y t and x t are multiplied by 1= juO t .b/j. It can
be shown that iterating on LS with the weights given by 1= juO t .b/j, where the residuals
are from the previous iteration, converges very quickly to the LAD estimator.
206
Bibliography
Amemiya, T., 1985, Advanced econometrics, Harvard University Press, Cambridge, Mas-
sachusetts.
Härdle, W., 1990, Applied nonparametric regression, Cambridge University Press, Cam-
bridge.
Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,
Cambridge University Press.
Mittelhammer, R. C., 1996, Mathematical statistics for economics and business, Springer-
Verlag, New York.
Silverman, B. W., 1986, Density estimation for statistics and data analysis, Chapman and
Hall, London.
207
.D..3
The task is to evaluate if alphas or betas of individual investors (or funds) are related
to investor (fund) characteristics, for instance, age or trading activity. The data set is
panel with observations for T periods and N investors. (In many settings, the panel is
unbalanced, but, to keep things reasonably simple, that is disregarded in the discussion
below.)
The calendar time (CalTime) approach is to first define M discrete investor groups (for
instance, age 18–30, 31–40, etc) and calculate their respective average excess returns (yNjt
for group j )
1 P
yNjt D yi t ; (16.1)
Nj i 2Groupj
where Nj is the number of individuals in group j .
Then, we run a factor model
where x t typically includes a constant and various return factors (for instance, excess re-
turns on equity and bonds). By estimating these M equations as a SURE system with
White’s (or Newey-West’s) covariance estimator, it is straightforward to test various hy-
potheses, for instance, that the intercept (the “alpha”) is higher for the M th group than
for the for first group.
Example 16.1 (CalTime with two investor groups) With two investor groups, estimate the
208
following SURE system
yN1t D x t0 ˇ1 C v1t ;
yN2t D x t0 ˇ2 C v2t :
The CalTime approach is straightforward and the cross-sectional correlations are fairly
easy to handle (in the SURE approach). However, it forces us to define discrete investor
groups—which makes it hard to handle several different types of investor characteristics
(for instance, age, trading activity and income) at the same time.
The cross sectional regression (CrossReg) approach is to first estimate the factor
model for each investor
and to then regress the (estimated) betas for the pth factor (for instance, the intercept) on
the investor characteristics
ˇOpi D zi0 cp C wpi : (16.4)
16.3.1 OLS
209
where xi t is an K 1 vector. Notice that the coefficients are the same across individuals
(and time). Define the matrices
T N
1 XX
˙xx D xi t xi0 t (an K K matrix) (16.6)
T N t D1 i D1
T N
1 XX
˙xy D xi t yi t (a K 1 vector). (16.7)
T N t D1 i D1
16.3.2 GMM
Remark 16.2 (Distribution of GMM estimates) Under fairly weak assumption, the ex-
p d
actly identified GMM estimator T N .ˇO ˇ0 / ! N.0; D0 1 S0 D0 1 /, where D0 is the
p
Jacobian of the average moment conditions and S0 is the covariance matrix of T N
times the average moment conditions.
ˇO ˇ0 N.0; D0 1 SD0 1 /;
where S D S0 =.T N / which is the same as the covariance matrix of the average moment
conditions (16.9).
To apply these remarks, first notice that the Jacobian D0 corresponds to (the probabil-
ity limit of) the ˙xx matrix in (16.6). Second, notice that
T N
!
1X 1 X
Cov.average moment conditions/ D Cov hi t (16.10)
T t D1 N i D1
210
looks differently depending on the assumptions of cross correlations.
In particular, if hi t has no correlation across time (effectively, N1 N
P
i D1 hi t is not auto-
correlated), then we can simplify as
T N
!
1 X 1 X
Cov.average moment conditions/ D 2 Cov hi t : (16.11)
T t D1 N i D1
We would then design an estimator that would consistently estimate this covariance matrix
by using the time dimension.
16.3.3 Driscoll-Kraay
O D ˙ 1 S˙ 1 ;
Cov.ˇ/ (16.12)
xx xx
where
T N
1 X 0 1 X
SD 2 h t h t ; with h t D hi t , hi t D xi t "i t ; (16.13)
T t D1 N i D1
where hi t is the LS moment condition for individual i . Clearly, hi t and h t are K 1, so S
is KK. Since we use the covariance matrix of the moment conditions, heteroskedasticity
is accounted for.
Notice that h t is the cross-sectional average moment condition (in t ) and that S is an
211
estimator of the covariance matrix of those average moment conditions
b
1 PT PN
S D Cov hi t :
T N tD1 i D1
To calculate this estimator, (16.13) uses the time dimension (and hence requires a reason-
ably long time series).
Remark 16.5 (Relation to the notation in Hoechle (2011)) Hoechle writes Cov.ˇ/ O D
.X 0 X/ 1 SOT .X 0 X/ 1 , where SOT D TtD1 hO t hO 0t ; with hO t D N
P P
i D1 hi t . Clearly, my ˙xx D
X 0 X=.T N / and my S D SOT =.T 2 N 2 /. Combining gives Cov.ˇ/ O D .˙xx T N / 1 ST 2 N 2 .˙xx T N / 1 ,
so we can write
" 4
1 X
SD c it /
Var.h
T 16 i D1
b b b
C 2Cov.h1t ; h2t / C 2Cov.h1t ; h3t / C 2Cov.h1t ; h4t /
bov.h ; h / C 2Cbov.h
C 2C 2t 3t 2t ; h4t /
b i
C2Cov.h ; h / :
3t 4t
212
Notice that S is the (estimate of) the variance of the cross-sectional average, Var.h t / D
VarŒ.h1t C h2t C h3t C h4t /=4.
A cluster method puts restrictions on the covariance terms (of hi t ) that are allowed
to enter the estimate S. In practice, all terms across clusters are left out. This can be
implemented by changing the S matrix. In particular, instead of interacting all i with
each other, we only allow for interaction within each of the G clusters (g D 1; :::; G/
G T
X 1 X g g 0 1
h t h t , where hgt D
X
SD 2
hi t : (16.14)
gD1
T t D1 N i 2 cluster g
(Remark: the cluster sums should be divided by N , not the number of individuals in the
cluster.)
so we can write
b b
" 4 #
1 X
SD c i t / C 2Cov.h1t ; h2t / C 2Cov.h3t ; h4t / :
Var.h
T 16 i D1
Example 16.8 (Cluster method on N D 4) From (16.14) we have the cluster (group)
averages
1 1
h1t D .h1t C h2t / and h2t D .h3t C h4t / :
4 4
T 0
Assuming only one regressor (to keep it simple), the time averages, T1 hgt hgt , are
P
t D1
213
then (for cluster 1 and then 2)
T T 2 T
1 X 1 1 0 1X 1 1X 1 2
h1t C h22t C 2h1t h2t , and
ht ht D .h1t C h2t / D
T t D1 T t D1 4 T t D1 16
T T
1 X 2 2 0 1X 1 2
h3t C h24t C 2h3t h4t :
ht ht D
T t D1 T t D1 16
Finally, summing across these time averages gives the same expression as in Example
16.7. The following 4 4 matrix illustrates which cells that are included (assumption: no
dependence across time)
i 1 2 3 4
2
1 h1t h1t h2t 0 0
2 h1t h2t h22t 0 0
2
3 0 0 h3t h3t h4t
4 0 0 h3t h4t h24t
In comparison, the iid case only sums up the principal diagonal, while the DK method
fills the entire matrix.
Instead, we get White’s covariance matrix by excluding all cross terms. This can be
accomplished by defining
T N
1 X 1 X
SD 2 2
hi t h0i t : (16.15)
T t D1 N i D1
Example 16.9 (White’s method on N D 4) With only one regressor (16.15) gives
T
1 X 1 2
h1t C h22t C h23t C h24t
SD 2
T tD1 16
4
1 X
D c it /
Var.h
T 16 i D1
Finally, the traditional LS covariance matrix assumes that E hi t h0i t D ˙xx E "2it , so
we get
T N
O 1 2 2 1 XX 2
CovLS .ˇ/ D ˙xx s =T N , where s D " : (16.16)
T N t D1 i D1 i t
214
Remark 16.10 (Why the cluster method fails when there is a missing “time fixed effect”—
and one of the regressors indicates the cluster membership) To keep this remark short,
assume yi t D 0qi t C "i t , where qi t indicates the cluster membership of individual i (con-
stant over time). In addition, assume that all individual residuals are entirely due to an
(excluded) time fixed effect, "i t D w t . Let N D 4 where i D .1; 2/ belong to the first
cluster (qi D 1) and i D .3; 4/ belong to the second cluster (qi D 1). (Using the values
qi D ˙1 gives qi a zero mean, which is convenient.) It is straightforward to demon-
strate that the estimated (OLS) coefficient in any sample must be zero: there is in fact no
uncertainty about it. The individual moments in period t are then hi t D qi t w t
2 3 2 3
h1t wt
6 7 6 7
6 h2t 7 6 w t 7
6 h 7 D 6 w 7:
6 7 6 7
4 3t 5 4 t 5
h4t wt
i 1 2 3 4
2 2
1 wt wt 0 0
2 2
2 wt wt 0 0
3 0 2
0 w t w t2
4 0 0 w t2 w t2
215
ance matrix) are unchanged, then the matrix in Example 16.8 becomes
i 1 2 3 4
1 w t2 w t2 0 0
2 w t2 w t2 0 0
3 0 0 w t2 w t2
4 0 0 w t2 w t2
which sum to zero: the cluster covariance estimator works fine. The DK method also
works since it adds the off-diagonal elements which are
i 1 2 3 4
1 w t2 w t2
2 w t2 w t2
3 w t2 w t2
4 w t2 w t2
which also sum to zero. This suggests that the cluster covariance matrix goes wrong
only when the cluster definition (for the covariance matrix) is strongly related to the qi
regressor.
The CalTime estimates can be replicated by using the individual data in the panel. For instance, with two investor groups we could estimate the following two regressions

y_{it} = x_t'β_1 + u_{it}^{(1)} for i ∈ group 1,   (16.17)
y_{it} = x_t'β_2 + u_{it}^{(2)} for i ∈ group 2.   (16.18)

More interestingly, these regression equations can be combined into one panel regression (and still give the same estimates) with the help of dummy variables. Let z_{ji} = 1 if individual i is a member of group j and zero otherwise. Stacking all the data, we have
(still with two investor groups)

y_{it} = (z_{1i} x_t)'β_1 + (z_{2i} x_t)'β_2 + u_{it}
      = ([z_{1i} x_t; z_{2i} x_t])' [β_1; β_2] + u_{it}
      = (z_i ⊗ x_t)'β + u_{it}, where z_i = [z_{1i}; z_{2i}].   (16.19)
Hoechle, Schmid, and Zimmermann (2009) (HSZ) suggest the following regression on all the data (t = 1, ..., T and i = 1, ..., N)

y_{it} = (z_{it} ⊗ x_t)'d + v_{it}   (16.22)
      = ([1, z_{1it}, ..., z_{mit}] ⊗ [1, x_{1t}, ..., x_{kt}])'d + v_{it}.   (16.23)

The z_{it} vector contains information about the number of fund switches as well as other possible drivers of performance like gender. The x_t vector contains the pricing factors. In case the characteristics z_{1it}, ..., z_{mit} sum to unity (for a given individual i and time t), the constant in [1, z_{1it}, ..., z_{mit}] is dropped.
This model is estimated with LS (stacking all N T observations), but the standard
errors are calculated according to Driscoll and Kraay (1998) (DK)—which accounts for
cross-sectional correlations, for instance, correlations between the residuals of different
investors (say, v1t and v7t ).
HSZ prove the following two propositions.
Example 16.13 (One investor characteristic and one pricing factor) In this case (16.22) is

y_{it} = [1, x_{1t}, z_{it}, z_{it} x_{1t}]' d + v_{it}
      = d_0 + d_1 x_{1t} + d_2 z_{it} + d_3 z_{it} x_{1t} + v_{it}.
In case we are interested in how the investor characteristics (zi t ) affect the alpha (inter-
cept), then d2 is the key coefficient.
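To illustrate the mechanics of (16.22), the following sketch (in Python) builds the Kronecker regressors z_{it} ⊗ x_t and estimates d by pooled LS on simulated data. It is only a minimal illustration: the variable names, dimensions and coefficient values are assumptions, and the Driscoll and Kraay standard errors discussed below are not computed here.

    import numpy as np

    # Minimal sketch (simulated data): pooled LS on y_it = (z_it kron x_t)'d + v_it
    rng = np.random.default_rng(seed=0)
    T, N = 200, 50
    d_true = np.array([0.1, 0.8, 0.3, -0.2])           # (d0, d1, d2, d3)
    x1 = rng.normal(size=T)                             # one pricing factor
    z = rng.integers(0, 2, size=N).astype(float)        # one characteristic per individual
    rows, y = [], []
    for t in range(T):
        for i in range(N):
            w = np.kron(np.array([1.0, z[i]]), np.array([1.0, x1[t]]))  # [1, x1, z, z*x1]
            rows.append(w)
            y.append(w @ d_true + rng.normal())
    W, y = np.array(rows), np.array(y)
    d_hat = np.linalg.lstsq(W, y, rcond=None)[0]
    print(d_hat)                                        # close to d_true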
16.6 Monte Carlo Experiment
This section reports results from a simple Monte Carlo experiment. We use the model r_{it} = α + β f_t + δ g_i + ε_{it} (described in detail in the notes to Table 16.1). The covariance matrix of the LS point estimates is calculated as

Cov(b̂) = Σ_xx^{-1} S Σ_xx^{-1}, where Σ_xx = (1/(TN)) Σ_{t=1}^{T} Σ_{i=1}^{N} x_{it} x_{it}',   (16.26)

and S is the covariance matrix of the moment conditions.
The three methods differ with respect to how the S matrix is estimated:

S_DK = (1/(T²N²)) Σ_{t=1}^{T} h_t h_t', where h_t = Σ_{i=1}^{N} h_{it};
S_Cl = (1/(T²N²)) Σ_{t=1}^{T} Σ_{j=1}^{M} h_t^j (h_t^j)', where h_t^j = Σ_{i ∈ cluster j} h_{it};
S_Wh = (1/(T²N²)) Σ_{t=1}^{T} Σ_{i=1}^{N} h_{it} h_{it}'.   (16.27)
To see the difference, consider a simple example with N = 4, where i = (1,2) belong to the first cluster and i = (3,4) belong to the second cluster. The following matrix shows the outer products of the moment conditions of all individuals. White's estimator sums up the cells on the principal diagonal, the cluster method adds the other within-cluster cells (the (1,2), (2,1), (3,4) and (4,3) cells), and the DK method also adds the remaining cells:

 i       1               2               3               4
 1   h_{1t}h_{1t}'   h_{1t}h_{2t}'   h_{1t}h_{3t}'   h_{1t}h_{4t}'
 2   h_{2t}h_{1t}'   h_{2t}h_{2t}'   h_{2t}h_{3t}'   h_{2t}h_{4t}'
 3   h_{3t}h_{1t}'   h_{3t}h_{2t}'   h_{3t}h_{3t}'   h_{3t}h_{4t}'
 4   h_{4t}h_{1t}'   h_{4t}h_{2t}'   h_{4t}h_{3t}'   h_{4t}h_{4t}'   (16.28)
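The following sketch computes the three estimators in (16.27) from an array of moment conditions. It is a minimal illustration under assumed names and shapes (h has dimensions T × N × k and cluster is an N-vector of cluster labels), not the author's code.

    import numpy as np

    # Minimal sketch: the three S estimators in (16.27) from moments h (T x N x k).
    def s_matrices(h, cluster):
        T, N, k = h.shape
        h_t = h.sum(axis=1)                              # sum over all individuals
        S_dk = h_t.T @ h_t / (T**2 * N**2)
        S_wh = sum(h[t, i][:, None] @ h[t, i][None, :]
                   for t in range(T) for i in range(N)) / (T**2 * N**2)
        S_cl = np.zeros((k, k))
        for j in np.unique(cluster):
            h_tj = h[:, cluster == j, :].sum(axis=1)     # sum within cluster j
            S_cl += h_tj.T @ h_tj
        S_cl /= T**2 * N**2
        return S_dk, S_cl, S_wh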
To generate data with correlated (in the cross-section) residuals, let the residual of individual i (belonging to group j) in period t be

ε_{it} = u_{it} + v_{jt} + w_t,

where u_{it} ~ N(0, σ_u²), v_{jt} ~ N(0, σ_v²) and w_t ~ N(0, σ_w²), and the three components are uncorrelated. This implies that the residuals of two individuals in the same group have the covariance σ_v² + σ_w², while the residuals of two individuals in different groups have the covariance σ_w².
Clearly, when σ_w² = 0 then the correlation across groups is zero, but there may be correlation within a group. If both σ_v² = 0 and σ_w² = 0, then there is no correlation at all across individuals. For CalTime portfolios (one per activity group), we expect the u_{it} to average out, so a group portfolio has the variance σ_v² + σ_w² and the covariance of two different group portfolios is σ_w².
The Monte Carlo simulations consider different values of the variances—to illustrate
the effect of the correlation structure.
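The following sketch shows one way to generate residuals with this structure; the parameter values and dimensions are arbitrary assumptions meant only to illustrate the design, not to replicate the exact simulations.

    import numpy as np

    # Minimal sketch of the error components e_it = u_it + v_jt + w_t.
    rng = np.random.default_rng(seed=2)
    T, N, G = 2000, 100, 5
    sigma2_u, sigma2_v, sigma2_w = 0.67, 0.33, 0.0      # Panel B style variances
    group = np.arange(N) % G                             # cluster membership of each individual
    u = np.sqrt(sigma2_u) * rng.normal(size=(T, N))
    v = np.sqrt(sigma2_v) * rng.normal(size=(T, G))
    w = np.sqrt(sigma2_w) * rng.normal(size=(T, 1))
    eps = u + v[:, group] + w                            # T x N matrix of residuals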
Table 16.1 reports the fraction of times the absolute value of a t-statistic for a true null hypothesis is higher than 1.96. The table has three panels for different correlation patterns of the residuals (ε_{it}): no correlation between individuals, correlations only within the pre-specified clusters, and correlation across all individuals.
In the upper panel, where the residuals are iid, all three methods have rejection rates
around 5% (the nominal size).
In the middle panel, the residuals are correlated within each of the five clusters, but there is no correlation between individuals that belong to different clusters. In this case, both the DK and the cluster method have the right rejection rates, while White's method gives much too high rejection rates (around 85%). The reason is that White's method disregards correlation between individuals, and in this way underestimates the uncertainty about the point estimates. It is also worth noticing that the good performance of the cluster method depends on pre-specifying the correct clustering. Further simulations (not tabulated) show that a completely random cluster specification (unknown to the econometrician) gives almost the same results as White's method.
The lower panel has no cluster correlations, but all individuals are now equally correlated (similar to a fixed time effect). For the intercept (α) and the slope coefficient on the common factor (β), the DK method still performs well, while the cluster and White's methods give too many rejections: the latter two methods underestimate the uncertainty since some correlations across individuals are disregarded. Things are more complicated for the slope coefficient on the cluster number (δ). Once again, DK performs well, but both the cluster and White's methods lead to too few rejections. The reason is the interaction of the common component in the residual with the cross-sectional dispersion of the group number (g_i).
To understand this last result, consider a stylised case where y_{it} = δ g_i + ε_{it} with δ = 0 and ε_{it} = w_t, so all residuals are due to an (excluded) time fixed effect. In this case, the matrix above becomes

 i     1       2       3       4
 1    w_t²    w_t²   −w_t²   −w_t²
 2    w_t²    w_t²   −w_t²   −w_t²
 3   −w_t²   −w_t²    w_t²    w_t²
 4   −w_t²   −w_t²    w_t²    w_t²   (16.31)

(This follows from g_i = (−1, −1, 1, 1) and since h_{it} = g_i w_t we get (h_{1t}, h_{2t}, h_{3t}, h_{4t}) = (−w_t, −w_t, w_t, w_t).) Both White's and the cluster method sum up only positive cells, so S is a strictly positive number. (For the cluster method, this result relies on the assumption that the clusters used in estimating S correspond to the values of the regressor, g_i.) However, that is wrong since it is straightforward to demonstrate that the estimated coefficient in any sample must be zero. This is seen by noticing that Σ_{i=1}^{N} h_{it} = 0 at a zero slope coefficient holds for all t, so there is in fact no uncertainty about the slope coefficient. In contrast, the DK method adds the remaining (cross-cluster) elements, which all equal −w_t², giving the correct result S = 0.
See Table 16.2 for results on a ten-year panel of some 60,000 Swedish pension savers (Dahlquist, Martinez and Söderlind, 2011).
Bibliography
Driscoll, J., and A. Kraay, 1998, “Consistent Covariance Matrix Estimation with Spatially
Dependent Panel Data,” Review of Economics and Statistics, 80, 549–560.
Hoechle, D., 2011, "Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence," The Stata Journal, forthcoming.

Hoechle, D., M. Schmid, and H. Zimmermann, 2009, "A Generalization of the Calendar Time Portfolio Approach and the Performance of Private Investors," Working paper, University of Basel.
[Table 16.1 about here. Columns: White, Cluster, and Driscoll-Kraay. Panels: A. No cross-sectional correlation; B. Within-cluster correlations; C. Correlation across all individuals.]
Table 16.1: Simulated size of different covariance estimators. This table presents the fraction of rejections of true null hypotheses for three different estimators of the covariance matrix: White's (1980) method, a cluster method, and Driscoll and Kraay's (1998) method. The model of individual i in period t, who belongs to cluster j, is r_{it} = α + β f_t + δ g_i + ε_{it}, where f_t is a common regressor (iid normally distributed) and g_i is the demeaned number of the cluster that the individual belongs to. The simulations use 3000 repetitions of samples with t = 1, ..., 2000 and i = 1, ..., 1665. Each individual belongs to one of five different clusters. The error term is constructed as ε_{it} = u_{it} + v_{jt} + w_t, where u_{it} is an individual (iid) shock, v_{jt} is a shock common to all individuals who belong to cluster j, and w_t is a shock common to all individuals. All shocks are normally distributed. In Panel A the variances of (u_{it}, v_{jt}, w_t) are (1,0,0), so the shocks are iid; in Panel B the variances are (0.67,0.33,0), so there is a 33% correlation within a cluster but no correlation between different clusters; in Panel C the variances are (0.67,0,0.33), so there is no cluster-specific shock and all shocks are equally correlated, effectively giving a 33% correlation both within a cluster and between clusters.
Table 16.2: Investor activity, performance, and characteristics.

[Table 16.2 about here, with regression specifications I-IV.]

The table presents the results of pooled regressions of an individual's daily excess return on return factors, and on measures of individuals' fund changes and other characteristics. The return factors are the excess returns of the Swedish stock market, the Swedish bond market, and the world stock market, and they are allowed to vary across the individuals' characteristics. For brevity, the coefficients on these return factors are not presented in the table. The measure of fund changes is either a dummy variable for an activity category (see Table ??) or a variable counting the number of fund changes. Other characteristics are the individuals' age in 2000, gender, or pension rights in 2000, which is a proxy for income. The constant term and coefficients on the dummy variables are expressed in % per year. The income variable is scaled down by 1,000. Standard errors, robust to conditional heteroscedasticity and spatial cross-sectional correlations as in Driscoll and Kraay (1998), are reported in parentheses. The sample consists of 62,640 individuals followed daily over the 2000 to 2010 period.
21 Some Statistics
This section summarizes some useful facts about statistics. Heuristic proofs are given in
a few cases.
Some references: Mittelhammer (1996), DeGroot (1986), Greene (2000), Davidson
(2000), Johnson, Kotz, and Balakrishnan (1994).
Fact 21.1 (cdf and pdf) The cumulative distribution function of the random variable X is F(x) = Pr(X ≤ x) = ∫_{−∞}^{x} f(q) dq. Clearly, f(x) = dF(x)/dx. Note that x is just a number, not a random variable.
Fact 21.2 (Moment generating function of X) The moment generating function of the random variable X is mgf(t) = E e^{tX}. The rth moment is the rth derivative of mgf(t) evaluated at t = 0: E X^r = d^r mgf(0)/dt^r. If a moment generating function exists (that is, E e^{tX} < ∞ for some small interval t ∈ (−h, h)), then it is unique.
Fact 21.3 (Moment generating function of a function of X) If X has the moment generating function mgf_X(t) = E e^{tX}, then g(X) has the moment generating function E e^{t g(X)}. The affine function a + bX (a and b are constants) has the moment generating function mgf_{g(X)}(t) = E e^{t(a+bX)} = e^{ta} E e^{tbX} = e^{ta} mgf_X(bt). By setting b = 1 and a = −E X we obtain a mgf for central moments (variance, skewness, kurtosis, etc), mgf_{(X−E X)}(t) = e^{−t E X} mgf_X(t).

Example 21.4 Let X ~ N(μ, σ²) and Z = (X − μ)/σ, so a = −μ/σ and b = 1/σ. This gives mgf_Z(t) = exp(−μt/σ) mgf_X(t/σ) = exp(t²/2). (Of course, this result can also be obtained by directly setting μ = 0 and σ = 1 in mgf_X.)
Fact 21.5 (Characteristic function and the pdf) The characteristic function of a random variable x is

g(ξ) = E exp(iξx) = ∫ exp(iξx) f(x) dx,

where f(x) is the pdf. This is a Fourier transform of the pdf (if x is a continuous random variable). The pdf can therefore be recovered by the inverse Fourier transform as

f(x) = (1/(2π)) ∫_{−∞}^{∞} exp(−iξx) g(ξ) dξ.

In practice, we typically use a fast (discrete) Fourier transform to perform this calculation, since there are very quick computer algorithms for doing that.

Fact 21.6 (Characteristic function of the normal distribution) If X ~ N(μ, σ²), then its characteristic function is g(ξ) = exp(iμξ − ξ²σ²/2).
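As a small illustration of the inversion formula in Fact 21.5, the sketch below recovers the N(0,1) pdf from its characteristic function exp(−ξ²/2) by plain numerical integration (a simple quadrature rather than an FFT; the grid limits are arbitrary choices).

    import numpy as np

    # Minimal sketch: invert the characteristic function of N(0,1) numerically.
    xi = np.linspace(-30, 30, 20001)              # integration grid
    g = np.exp(-xi**2 / 2)                         # characteristic function of N(0,1)
    for x in (-1.0, 0.0, 2.0):
        f_x = np.trapz(np.exp(-1j * xi * x) * g, xi).real / (2 * np.pi)
        print(x, f_x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))   # the two should agree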
Fact 21.7 (Change of variable, univariate case, monotonic function) Suppose X has the probability density function f_X(c) and cumulative distribution function F_X(c). Let Y = g(X) be a continuously differentiable function with dg/dX > 0 (so g(X) is increasing) for all c such that f_X(c) > 0. Then the cdf of Y is

F_Y(c) = F_X[g^{-1}(c)],

where g^{-1} is the inverse function of g such that g^{-1}(Y) = X. We also have that the pdf of Y is

f_Y(c) = f_X[g^{-1}(c)] |dg^{-1}(c)/dc|.

If, instead, dg/dX < 0 (so g(X) is decreasing), then we instead have the cdf of Y

F_Y(c) = 1 − F_X[g^{-1}(c)],

while the pdf is as above.
Proof. Differentiate F_Y(c), that is, F_X[g^{-1}(c)], with respect to c.

Example 21.8 Let X ~ U(0,1) and Y = g(X) = F^{-1}(X), where F(c) is a strictly increasing cdf. We then get

f_Y(c) = dF(c)/dc.

The variable Y then has the pdf dF(c)/dc and the cdf F(c). This shows how to generate random numbers from the F(·) distribution: draw X ~ U(0,1) and calculate Y = F^{-1}(X).
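The following sketch applies this recipe; the exponential distribution is an assumed example since its cdf has a closed-form inverse.

    import numpy as np

    # Minimal sketch of inverse-cdf sampling: for the Exp(1) distribution,
    # F(y) = 1 - exp(-y), so F^{-1}(x) = -log(1 - x).
    rng = np.random.default_rng(seed=42)
    x = rng.uniform(size=100_000)         # X ~ U(0,1)
    y = -np.log(1.0 - x)                  # Y = F^{-1}(X) ~ Exp(1)
    print(y.mean(), y.var())              # both should be close to 1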
Example 21.10 Let X ~ U(0,2), so the pdf and cdf of X are 1/2 and c/2 respectively. Now, Y = g(X) = −X has the pdf 1/2 and the cdf 1 + c/2. The latter is clearly the same as 1 − F_X[g^{-1}(c)] = 1 − (−c/2).
Fact 21.11 (Distribution of a truncated random variable) Let the probability distribution and density functions of X be F(x) and f(x), respectively. The corresponding functions, conditional on a < X ≤ b, are [F(x) − F(a)]/[F(b) − F(a)] and f(x)/[F(b) − F(a)]. Clearly, outside a < X ≤ b the pdf is zero, while the cdf is zero below a and unity above b.
Fact 21.12 (Joint and marginal cdf) Let X and Y be (possibly vectors of) random variables and let x and y be two numbers. The joint cumulative distribution function of X and Y is H(x, y) = Pr(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} h(q_x, q_y) dq_y dq_x, where h(x, y) = ∂²H(x, y)/∂x∂y is the joint probability density function.
Fact 21.13 (Joint and marginal pdf) The marginal cdf of X is obtained by integrating out Y: F(x) = Pr(X ≤ x, Y anything) = ∫_{−∞}^{x} [∫_{−∞}^{∞} h(q_x, q_y) dq_y] dq_x. This shows that the marginal pdf of x is f(x) = dF(x)/dx = ∫_{−∞}^{∞} h(x, q_y) dq_y.
Fact 21.14 (Conditional distribution) The pdf of Y conditional on X D x (a number) is
g.yjx/ D h.x; y/=f .x/. This is clearly proportional to the joint pdf (at the given value
x).
Fact 21.15 (Change of variable, multivariate case, monotonic function) The result in Fact 21.7 still holds if X and Y are both n × 1 vectors, but the derivative is now the n × n matrix ∂g^{-1}(c)/∂c'. If g_i^{-1} is the ith function in the vector g^{-1}, then

∂g^{-1}(c)/∂c' = [ ∂g_1^{-1}(c)/∂c_1  ...  ∂g_1^{-1}(c)/∂c_n
                   ...
                   ∂g_n^{-1}(c)/∂c_1  ...  ∂g_n^{-1}(c)/∂c_n ].
Fact 21.17 (−1 ≤ Corr(X, Y) ≤ 1) Let Y and X in Fact 21.16 be zero mean variables (or variables minus their means). We then get [Cov(X, Y)]² ≤ Var(X) Var(Y), that is, −1 ≤ Cov(X, Y)/[Std(X) Std(Y)] ≤ 1.
Fact 21.19 (Conditional moments as random variables) Before we observe X , the condi-
tional moments are random variables—since X is. We denote these random variables by
E .Y jX /, Var .Y jX/, etc.
Fact 21.20 (Law of iterated expectations) E Y D EŒE .Y jX/. Note that E .Y jX/ is a
random variable since it is a function of the random variable X . It is not a function of Y ,
however. The outer expectation is therefore an expectation with respect to X only.
Proof. E[E(Y|X)] = ∫ [∫ y g(y|x) dy] f(x) dx = ∫∫ y g(y|x) f(x) dy dx = ∫∫ y h(y, x) dy dx = E Y.
Fact 21.21 (Conditional vs. unconditional variance) Var .Y / D Var ŒE .Y jX /CE ŒVar .Y jX /.
Proof. Note from Fact 21.22 that E(Y|X) = 0 implies Cov(X, Y) = 0, so E XY = E X E Y (recall that Cov(X, Y) = E XY − E X E Y). Note also that E(Y|X) = 0 implies that E Y = 0 (by iterated expectations). We therefore get

E(Y|X) = 0  implies  {Cov(X, Y) = 0 and E Y = 0}  implies  E YX = 0.
Fact 21.25 (Linear projection) Suppose we want to forecast the scalar Y using the k × 1 vector X and that we restrict the forecasting rule to be linear Ŷ = X'β. This rule is a linear projection, denoted P(Y|X), if β satisfies the orthogonality conditions E[X(Y − X'β)] = 0_{k×1}, that is, if β = (E XX')^{-1} E XY. A linear projection minimizes E[Y − k(X)]² within the class of linear k(X) functions.
Fact 21.26 (Properties of linear projections) (a) The orthogonality conditions in Fact 21.25 mean that

Y = X'β + ε,

where E(Xε) = 0_{k×1}. This implies that E[P(Y|X)ε] = 0, so the forecast and forecast error are orthogonal. (b) The orthogonality conditions also imply that E[XY] = E[X P(Y|X)]. (c) When X contains a constant, so E ε = 0, then (a) and (b) carry over to covariances: Cov[P(Y|X), ε] = 0 and Cov[X, Y] = Cov[X, P(Y|X)].

Remark 21.28 Some authors prefer to take the transpose of the forecasting rule, that is, to use Ŷ = β'X. Clearly, since E XX' is symmetric, we get β' = E(YX')(E XX')^{-1}.
Fact 21.29 (Linear projection with a constant in X) If X contains a constant, then P .aY C
bjX / D aP .Y jX/ C b.
Fact 21.30 (Linear projection versus regression function) Both the linear projection and the regression function (see Fact 21.24) minimize E[Y − k(X)]², but the linear projection imposes the restriction that k(X) is linear, whereas the regression function does not impose any restrictions. In the special case when Y and X have a joint normal distribution, the linear projection is the regression function.
Fact 21.31 (Linear projection and OLS) The linear projection is about population mo-
ments, but OLS is its sample analogue.
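A minimal sketch of this sample analogue, using simulated data (the true coefficients and sample size are arbitrary assumptions):

    import numpy as np

    # OLS as the sample analogue of beta = (E XX')^{-1} E XY.
    rng = np.random.default_rng(seed=1)
    T = 10_000
    X = np.column_stack([np.ones(T), rng.normal(size=T)])   # constant + one regressor
    Y = X @ np.array([0.5, 2.0]) + rng.normal(size=T)        # true beta = (0.5, 2.0)
    Sxx = X.T @ X / T                                        # sample E XX'
    Sxy = X.T @ Y / T                                        # sample E XY
    print(np.linalg.solve(Sxx, Sxy))                          # close to (0.5, 2.0)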
Fact 21.32 (Convergence in probability) The sequence of random variables {X_T} converges in probability to the random variable X if (and only if) for all ε > 0

lim_{T→∞} Pr(|X_T − X| > ε) = 0.

We denote this X_T →p X or plim X_T = X (X is the probability limit of X_T). Note: (a) X can be a constant instead of a random variable; (b) if X_T and X are matrices, then X_T →p X if the previous condition holds for every element in the matrices.
Fact 21.34 (Convergence in mean square) The sequence of random variables {X_T} converges in mean square to the random variable X if (and only if)

lim_{T→∞} E(X_T − X)² = 0.

We denote this X_T →m X. Note: (a) X can be a constant instead of a random variable; (b) if X_T and X are matrices, then X_T →m X if the previous condition holds for every element in the matrices.
[Figure: distributions of the sample average and of √T × the sample average.]
This means that both the variance and the squared bias go to zero as T → ∞.

Example 21.38 Consider the random variable in Example 21.33. The expected value is E X_T = 0 × (T − 1)/T + T × (1/T) = 1. This means that the squared bias does not go to zero, so X_T does not converge in mean square to zero.
Fact 21.39 (Slutsky's theorem) If {X_T} is a sequence of random matrices such that plim X_T = X and g(X_T) is a continuous function, then plim g(X_T) = g(X).

Fact 21.40 (Continuous mapping theorem) Let the sequences of random matrices {X_T} and {Y_T}, and the non-random matrix {a_T}, be such that X_T →d X, Y_T →p Y, and a_T → a (a traditional limit). Let g(X_T, Y_T, a_T) be a continuous function. Then g(X_T, Y_T, a_T) →d g(X, Y, a).
Fact 21.43 (The Lindeberg-Lévy theorem) Let X_t be independently and identically distributed (iid) with E X_t = 0 and Var(X_t) = σ² < ∞. Then (1/√T) Σ_{t=1}^{T} X_t/σ →d N(0, 1).
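A minimal simulation sketch of this result (the uniform distribution and sample sizes are arbitrary assumptions):

    import numpy as np

    # (1/sqrt(T)) * sum(X_t)/sigma should be roughly N(0,1) for iid zero-mean X_t.
    rng = np.random.default_rng(seed=3)
    T, n_sim = 1_000, 20_000
    x = rng.uniform(-1, 1, size=(n_sim, T))            # iid U(-1,1), variance 1/3
    z = x.sum(axis=1) / np.sqrt(T) / np.sqrt(1/3)
    print(z.mean(), z.var())                            # close to 0 and 1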
21.5 Stationarity
X_t is covariance stationary if

E X_t = μ is independent of t,
Cov(X_{t−s}, X_t) = γ_s depends only on s, and
both μ and γ_s are finite.

Fact 21.45 (Strict stationarity) X_t is strictly stationary if, for all s, the joint distribution of X_t, X_{t+1}, ..., X_{t+s} does not depend on t.

Fact 21.46 (Strict stationarity versus covariance stationarity) In general, strict stationarity does not imply covariance stationarity or vice versa. However, strict stationarity with finite first two moments implies covariance stationarity.
21.6 Martingales
Proof. E(X_{t+1} g_t | Ω_t) = E(X_{t+1} | Ω_t) g_t since g_t is a function of Ω_t.

Fact 21.53 (LLN for martingale differences) If X_t is a martingale difference, then plim Σ_{t=1}^{T} X_t/T = 0 if either (a) X_t is strictly stationary and E|X_t| < ∞, or (b) E|X_t|^{1+δ} < ∞ for δ > 0 and all t. (See Davidson (2000) 6.2.)

Fact 21.54 (CLT for martingale differences) Let X_t be a martingale difference. If plim Σ_{t=1}^{T}(X_t² − E X_t²)/T = 0 and either of two regularity conditions (see Davidson (2000) 6.2) holds, then (Σ_{t=1}^{T} X_t/√T)/(Σ_{t=1}^{T} E X_t²/T)^{1/2} →d N(0, 1).
Fact 21.55 (Univariate normal distribution) If X ~ N(μ, σ²), then the probability density function of X is

f(x) = (1/√(2πσ²)) e^{−(1/2)((x−μ)/σ)²}.

The moment generating function is mgf_X(t) = exp(μt + σ²t²/2) and the moment generating function of X − μ is mgf_{(X−μ)}(t) = exp(σ²t²/2).

Example 21.56 The first few moments around the mean are E(X−μ) = 0, E(X−μ)² = σ², E(X−μ)³ = 0 (all odd moments are zero), E(X−μ)⁴ = 3σ⁴, E(X−μ)⁶ = 15σ⁶, and E(X−μ)⁸ = 105σ⁸.
Figure 21.2: Normal distributions. [Panels: pdf of N(0,1); pdfs of a bivariate normal with correlation 0.1.]
Fact 21.57 (Standard normal distribution) If X ~ N(0, 1), then the moment generating function is mgf_X(t) = exp(t²/2). Since the mean is zero, mgf_X(t) gives central moments. The first few are E X = 0, E X² = 1, E X³ = 0 (all odd moments are zero), and E X⁴ = 3. The distribution function is Pr(X ≤ a) = Φ(a) = 1/2 + (1/2) erf(a/√2), where erf(·) is the error function, erf(z) = (2/√π) ∫_{0}^{z} exp(−t²) dt. The complementary error function is erfc(z) = 1 − erf(z). Since the distribution is symmetric around zero, we have Φ(−a) = Pr(X ≤ −a) = Pr(X ≥ a) = 1 − Φ(a). Clearly, 1 − Φ(−a) = Φ(a) = (1/2) erfc(−a/√2).
[Figure: pdf of a bivariate normal with correlation 0.1 and the conditional pdf of y given x = −0.8 and x = 0.]
Fact 21.59 (Conditional normal distribution) Suppose Z_{m×1} and X_{n×1} are jointly normally distributed

[Z; X] ~ N( [μ_Z; μ_X], [Σ_ZZ Σ_ZX; Σ_XZ Σ_XX] ).

The distribution of Z conditional on X = x is also normal, with mean

E(Z|x) = μ_Z + Σ_ZX Σ_XX^{-1} (x − μ_X),

and variance (variance of Z conditional on that X = x, that is, the variance of the prediction error Z − E(Z|x))

Var(Z|x) = Σ_ZZ − Σ_ZX Σ_XX^{-1} Σ_XZ.
Note that the conditional variance is constant in the multivariate normal distribution
(Var .ZjX / is not a random variable in this case). Note also that Var .Zjx/ is less than
Var.Z/ D ˙ZZ (in a matrix sense) if X contains any relevant information (so ˙ZX is
not zero, that is, E .Zjx/ is not the same for all x).
Example 21.60 (Conditional normal distribution) Suppose Z and X are scalars in Fact 21.59 and that the joint distribution is

[Z; X] ~ N( [3; 5], [1 2; 2 6] ),

so E(Z|x) = 3 + (2/6)(x − 5) and Var(Z|x) = 1 − 2 × (1/6) × 2 = 1/3.
Fact 21.62 (Stein's lemma 2) It follows from Fact 21.61 that if X and Y have a bivariate normal distribution and h(·) is a differentiable function such that E|h'(Y)| < ∞, then Cov[X, h(Y)] = Cov(X, Y) E h'(Y).

Example 21.63 (a) With h(Y) = exp(Y) we get Cov[X, exp(Y)] = Cov(X, Y) E exp(Y); (b) with h(Y) = Y² we get Cov[X, Y²] = Cov(X, Y) 2 E Y, so with E Y = 0 we get a zero covariance.
Fact 21.64 (Stein’s lemma 3) Fact 21.62 still holds if the joint distribution of X and Y is
a mixture of n bivariate normal distributions, provided the mean and variance of Y is the
same in each of the n components. (See Söderlind (2009) for a proof.)
Fact 21.65 (Truncated normal distribution) Let X ~ N(μ, σ²), and consider truncating the distribution so that we want moments conditional on a < X ≤ b. Define a0 = (a − μ)/σ and b0 = (b − μ)/σ. Then,

E(X | a < X ≤ b) = μ − σ [φ(b0) − φ(a0)] / [Φ(b0) − Φ(a0)], and
Var(X | a < X ≤ b) = σ² { 1 − [b0 φ(b0) − a0 φ(a0)] / [Φ(b0) − Φ(a0)] − ( [φ(b0) − φ(a0)] / [Φ(b0) − Φ(a0)] )² }.

Fact 21.66 (Lower truncation) In Fact 21.65, let b → ∞, so we only have the truncation a < X. Then, we have

E(X | a < X) = μ + σ φ(a0) / [1 − Φ(a0)], and
Var(X | a < X) = σ² { 1 + a0 φ(a0) / [1 − Φ(a0)] − ( φ(a0) / [1 − Φ(a0)] )² }.

Example 21.67 Suppose X ~ N(0, σ²) and we want to calculate E|X|. This is the same as E(X | X > 0) = 2σφ(0).

Fact 21.68 (Upper truncation) In Fact 21.65, let a → −∞, so we only have the truncation X ≤ b. Then, we have

E(X | X ≤ b) = μ − σ φ(b0) / Φ(b0), and
Var(X | X ≤ b) = σ² { 1 − b0 φ(b0) / Φ(b0) − ( φ(b0) / Φ(b0) )² }.
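A quick simulation check of the lower-truncation formulas in Fact 21.66 (the values μ = 0, σ = 1 and a = 0.5 are arbitrary assumptions):

    import numpy as np
    from scipy.stats import norm

    mu, sigma, a = 0.0, 1.0, 0.5
    a0 = (a - mu) / sigma
    mean_formula = mu + sigma * norm.pdf(a0) / (1 - norm.cdf(a0))
    var_formula = sigma**2 * (1 + a0 * norm.pdf(a0) / (1 - norm.cdf(a0))
                              - (norm.pdf(a0) / (1 - norm.cdf(a0)))**2)
    x = np.random.default_rng(seed=7).normal(mu, sigma, size=2_000_000)
    x_trunc = x[x > a]                            # draws above the truncation point
    print(mean_formula, x_trunc.mean())           # should be close
    print(var_formula, x_trunc.var())             # should be close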
Consider an estimator β̂ of the k × 1 vector β with √T(β̂ − β0) →d N(0, Ω), and a transformation

q_{q×1} = g(β),

where g(·) has continuous first derivatives. The result is

√T [g(β̂) − g(β0)] →d N(0, Ψ_{q×q}), where Ψ = [∂g(β0)/∂β'] Ω [∂g(β0)/∂β']', and ∂g(β0)/∂β' is q × k.
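A minimal simulation sketch of this delta-method result for the scalar transformation g(β) = exp(β) (all parameter values are arbitrary assumptions):

    import numpy as np

    # sqrt(T)[exp(beta_hat) - exp(beta0)] should have variance close to exp(beta0)^2 * omega.
    rng = np.random.default_rng(seed=11)
    beta0, omega, T, n_sim = 0.2, 2.0, 500, 50_000
    beta_hat = beta0 + np.sqrt(omega / T) * rng.normal(size=n_sim)   # draws of the estimator
    lhs = np.sqrt(T) * (np.exp(beta_hat) - np.exp(beta0))
    print(lhs.var(), np.exp(beta0)**2 * omega)                        # should be close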
Example 21.71 The first two moments are E y = exp(μ + σ²/2) and E y² = exp(2μ + 2σ²).
[Figure: pdfs of normal distributions, N(0,1) and N(0,0.5), plotted against ln x, and the corresponding lognormal distributions plotted against x.]
For a lognormal variable y (so that ln y ~ N(μ, σ²)) truncated to a < y ≤ b, the moments are

E(y^r | a < y ≤ b) = E(y^r) [Φ(rσ − a0) − Φ(rσ − b0)] / [Φ(b0) − Φ(a0)],

where a0 = (ln a − μ)/σ and b0 = (ln b − μ)/σ. Note that the denominator is Pr(a < y ≤ b) = Φ(b0) − Φ(a0). Clearly, E(y^r) = exp(rμ + r²σ²/2).

Example 21.74 The first two moments of the truncated (from below) lognormal distribution are E(y | y > a) = exp(μ + σ²/2) Φ(σ − a0)/Φ(−a0) and E(y² | y > a) = exp(2μ + 2σ²) Φ(2σ − a0)/Φ(−a0).

Example 21.75 The first two moments of the truncated (from above) lognormal distribution are E(y | y ≤ b) = exp(μ + σ²/2) Φ(−σ + b0)/Φ(b0) and E(y² | y ≤ b) = exp(2μ + 2σ²) Φ(−2σ + b0)/Φ(b0).
Fact 21.76 (Multivariate lognormal distribution) Let the n × 1 vector x have a multivariate normal distribution

x ~ N(μ, Σ), where μ = [μ_1; ...; μ_n] and Σ = [σ_11 ... σ_1n; ...; σ_n1 ... σ_nn].

Then y = exp(x) has a lognormal distribution, with the means and covariances

E y_i = exp(μ_i + σ_ii/2) and Cov(y_i, y_j) = exp(μ_i + μ_j + (σ_ii + σ_jj)/2) [exp(σ_ij) − 1].

Clearly, Var(y_i) = exp(2μ_i + σ_ii) [exp(σ_ii) − 1]. Cov(y_i, y_j) and Corr(y_i, y_j) have the same sign as Corr(x_i, x_j) and are increasing in it. However, Corr(y_i, y_j) is closer to zero.
Fact 21.77 (The χ²_n distribution) If Y ~ χ²_n, then the pdf of Y is f(y) = [1/(2^{n/2} Γ(n/2))] y^{n/2−1} e^{−y/2}, where Γ(·) is the gamma function. The moment generating function is mgf_Y(t) = (1 − 2t)^{−n/2} for t < 1/2. The first moments of Y are E Y = n and Var(Y) = 2n.

Fact 21.79 (Distribution of X'AX) If the n × 1 vector X ~ N(0, I), and A is a symmetric idempotent matrix (A = A' and A = AA = A'A) of rank r, then Y = X'AX ~ χ²_r.
matrix associated with the r non-zero eigenvalues (found in the r × r matrix Λ_11). The generalized inverse can be shown to be

Σ⁺ = [C_1 C_2] [Λ_11^{-1} 0; 0 0] [C_1 C_2]' = C_1 Λ_11^{-1} C_1'.

We can write Σ⁺ = C_1 Λ_11^{-1/2} Λ_11^{-1/2} C_1'. Consider the r × 1 vector Z = Λ_11^{-1/2} C_1' X, and note that it has the covariance matrix E ZZ' = Λ_11^{-1/2} C_1' Σ C_1 Λ_11^{-1/2} = I_r (since Σ C_1 = C_1 Λ_11).
Example 21.82 If Y = Σ_{i=1}^{n} X_i²/σ_i², then this transformation means Z = (Σ_{i=1}^{n} X_i²/σ_i² − n)/n^{1/2}.

Proof. We can directly note from the moments of a χ²_n variable that E Z = (E Y − n)/n^{1/2} = 0, and Var(Z) = Var(Y)/n = 2. From the general properties of moment generating functions, we note that the moment generating function of Z is

mgf_Z(t) = e^{−t√n} (1 − 2t/n^{1/2})^{−n/2}, with lim_{n→∞} mgf_Z(t) = exp(t²).

This is the moment generating function of a N(0, 2) distribution, which shows that Z →d N(0, 2). This result should not come as a surprise as we can think of Y as the sum of n variables; dividing by n^{1/2} is then like creating a scaled sample average for which a central limit theorem applies.
Fact 21.83 (The F(n1, n2) distribution) If Y1 ~ χ²_{n1} and Y2 ~ χ²_{n2} and Y1 and Y2 are independent, then Z = (Y1/n1)/(Y2/n2) has an F(n1, n2) distribution. This distribution has no moment generating function, but E Z = n2/(n2 − 2) for n2 > 2.
Figure 21.5: χ², F, and t distributions. [Panels include the pdf of χ²(n) for n = 1, 2, 5, 10 and the pdf of F(n1, 10) for n1 = 2, 5, 10.]
Fact 21.85 (The t_n distribution) If X ~ N(0, 1) and Y ~ χ²_n and X and Y are independent, then Z = X/(Y/n)^{1/2} has a t_n distribution. The moment generating function does not exist, but E Z = 0 for n > 1 and Var(Z) = n/(n − 2) for n > 2.
21.7.5 The Bernoulli and Binomial Distributions

Fact 21.88 (Bernoulli distribution) The random variable X can only take two values: 1 or 0, with probability p and 1 − p respectively. The moment generating function is mgf(t) = p e^t + 1 − p. This gives E(X) = p and Var(X) = p(1 − p).

Example 21.89 (Shifted Bernoulli distribution) Suppose the Bernoulli variable takes the values a or b (instead of 1 and 0) with probability p and 1 − p respectively. Then E(X) = pa + (1 − p)b and Var(X) = p(1 − p)(a − b)².

Fact 21.90 (Binomial distribution) Suppose X_1, X_2, ..., X_n all have Bernoulli distributions with the parameter p. Then, the sum Y = X_1 + X_2 + ... + X_n has a Binomial distribution with parameters p and n. The pdf is pdf(Y) = n!/[y!(n − y)!] p^y (1 − p)^{n−y} for y = 0, 1, ..., n. The moment generating function is mgf(t) = [p e^t + 1 − p]^n. This gives E(Y) = np and Var(Y) = np(1 − p).

Example 21.91 (Shifted Binomial distribution) Suppose the Bernoulli variables X_1, X_2, ..., X_n take the values a or b (instead of 1 and 0) with probability p and 1 − p respectively. Then, the sum Y = X_1 + X_2 + ... + X_n has E(Y) = n[pa + (1 − p)b] and Var(Y) = n[p(1 − p)(a − b)²].
Fact 21.92 (Skew-normal distribution) Let φ and Φ be the standard normal pdf and cdf respectively. The pdf of a skew-normal distribution with shape parameter α is then

f(z) = 2φ(z)Φ(αz).

If Z has this distribution and

Y = μ + ωZ with ω > 0,

then Y is said to have a SN(μ, ω², α) distribution (see Azzalini (2005)). Clearly, the pdf of Y is

f(y) = 2 φ[(y − μ)/ω] Φ[α(y − μ)/ω]/ω.

When α = 0 the distribution becomes a normal distribution. When α → ∞, the density function is zero for Y < μ, and 2 φ[(y − μ)/ω]/ω otherwise; this is a half-normal distribution.

Example 21.93 The first three moments are as follows. First, notice that E Z = √(2/π) δ, Var(Z) = 1 − 2δ²/π and E(Z − E Z)³ = (4/π − 1)√(2/π) δ³, where δ = α/√(1 + α²). Then we have

E Y = μ + ω E Z,
Var(Y) = ω² Var(Z),
E(Y − E Y)³ = ω³ E(Z − E Z)³.

Notice that with α = 0 (so δ = 0), these moments of Y become μ, ω² and 0 respectively.
Fact 21.94 (Cdf and pdf of the generalized Pareto distribution) The generalized Pareto distribution is described by a scale parameter (β > 0) and a shape parameter (ξ). The cdf (Pr(Z ≤ z), where Z is the random variable and z is a value) is

G(z) = 1 − (1 + ξz/β)^{−1/ξ} if ξ ≠ 0, and
G(z) = 1 − exp(−z/β) if ξ = 0.

The mean is defined (finite) if ξ < 1 and is then E(z) = β/(1 − ξ), the median is (2^ξ − 1)β/ξ and the variance is defined if ξ < 1/2 and is then β²/[(1 − ξ)²(1 − 2ξ)].
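A quick numerical check of these mean and variance formulas, using scipy's generalized Pareto distribution (the parameter values ξ = 0.3 and β = 2 are arbitrary assumptions):

    from scipy.stats import genpareto

    xi, beta = 0.3, 2.0
    print(genpareto.mean(xi, scale=beta), beta / (1 - xi))                        # means agree
    print(genpareto.var(xi, scale=beta), beta**2 / ((1 - xi)**2 * (1 - 2 * xi)))  # variances agree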
21.8 Inference
Fact 21.95 (Comparing variance-covariance matrices) Let Var(β̂) and Var(β*) be the variance-covariance matrices of two estimators, β̂ and β*, and suppose Var(β̂) − Var(β*) is a positive semi-definite matrix. This means that for any non-zero vector R, R'Var(β̂)R ≥ R'Var(β*)R, so every linear combination of β̂ has a variance that is at least as large as the variance of the same linear combination of β*. In particular, this means that the variance of every element in β̂ (the diagonal elements of Var(β̂)) is at least as large as the variance of the corresponding element of β*.
Bibliography
Azzalini, A., 2005, “The skew-normal distribution and related Multivariate Families,”
Scandinavian Journal of Statistics, 32, 159–188.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.
Mittelhammer, R. C., 1996, Mathematical statistics for economics and business, Springer-
Verlag, New York.
Söderlind, P., 2009, "An Extended Stein's Lemma for Asset Pricing," Applied Economics Letters, 16, 1005–1008.
22 Some Facts about Matrices
Some references: Greene (2000), Golub and van Loan (1989), Björk (1996), Anton
(1987), Greenberg (1988).
22.1 Rank
Fact 22.1 (Submatrix) Any matrix obtained from the m n matrix A by deleting at most
m 1 rows and at most n 1 columns is a submatrix of A.
Fact 22.2 (Rank) The rank of the m × n matrix A is ρ if the largest submatrix with a non-zero determinant is ρ × ρ. The number of linearly independent row vectors (and column vectors) of A is then ρ.
Fact 22.3 (Vector p-norm) Let x be an n × 1 vector. The p-norm is defined as

||x||_p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p}.

The Euclidean (2-)norm is therefore

||x||_2 = ( Σ_{i=1}^{n} x_i² )^{1/2} = √(x'x).
Fact 22.4 (Linear systems of equations) Consider the linear system Ax D c where A is
m n, x is n 1, and c is m 1. A solution is a vector x such that Ax D c. It has
a unique solution if and only if rank.A/ D rank.Œ A c / D n; an infinite number of
solutions if and only if rank.A/ D rank.Œ A c / < n; and no solution if and only if
rank.A/ ¤ rank.Œ A c /.
Example 22.5 (Linear systems of equations, unique solution when m = n) Let x be 2 × 1, and consider the linear system

Ax = c with A = [1 5; 2 6] and c = [3; 6],

which has the unique solution x = (3, 0).
[Figure: sum of squared errors of the solution to Ax = c for a scalar x; one case with a unique solution and one with no solution.]
Fact 22.9 (Pseudo inverse or generalized inverse) Suppose that no solution exists to Ax = c, and that A'A is not invertible. There are then several approximations, x̂, which all minimize ||c − Ax̂||_2. The one with the smallest ||x̂||_2 is given by x̂ = A⁺c, where A⁺ is the pseudo inverse (see Fact 22.56).

Example 22.10 (Linear systems of equations, unique solution when m > n) Change c in Example 22.6 to c = [3 6 0]'. Then rank(A) = 1 and rank([A c]) = 1, and the unique solution is x = 3.
Example 22.11 (Linear systems of equations, infinite number of solutions, m < n) Let x be 2 × 1, and consider the linear system

Ax = c with A = [1 2] and c = 5.

Here rank(A) = 1 and rank([A c]) = 1. Any value of x_1 on the line 5 − 2x_2 is a solution.

Example 22.12 (Pseudo inverses again) In the previous example, there is an infinite number of solutions along the line x_1 = 5 − 2x_2. Which one has the smallest norm ||x̂||_2 = [(5 − 2x_2)² + x_2²]^{1/2}? The first order condition gives x_2 = 2, and therefore x_1 = 1. This is the same value as given by x̂ = A⁺c, since A⁺ = [0.2, 0.4]' in this case.
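A minimal numerical version of Example 22.12, using numpy's Moore-Penrose pseudo inverse:

    import numpy as np

    A = np.array([[1.0, 2.0]])
    c = np.array([5.0])
    x_hat = np.linalg.pinv(A) @ c     # minimum-norm solution of [1 2] x = 5
    print(x_hat)                       # [1. 2.]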
Fact 22.13 (Rank and computers) Numerical calculations of the determinant are poor indicators of whether a matrix is singular or not. For instance, det(0.1 I_20) = 10^{−20}. Use the condition number instead (see Fact 22.53).

Fact 22.14 (Some properties of inverses) If A, B, and C are invertible, then (ABC)^{-1} = C^{-1}B^{-1}A^{-1}; (A^{-1})' = (A')^{-1}; if A is symmetric, then A^{-1} is symmetric; (A^n)^{-1} = (A^{-1})^n.

Fact 22.15 (Changing sign of column and inverting) Suppose the square matrix A_2 is the same as A_1 except that the ith and jth columns have the reverse signs. Then A_2^{-1} is the same as A_1^{-1} except that the ith and jth rows have the reverse sign.
Fact 22.17 (Complex matrices) Let A^H denote the transpose of the complex conjugate of A, so that if

A = [1  2+3i]  then  A^H = [1; 2−3i].

A square matrix A is unitary (similar to orthogonal) if A^H = A^{-1}, for instance,

A = [ (1+i)/2   (1+i)/2       gives   A^H = A^{-1} = [ (1−i)/2    (1+i)/2
      (1−i)/2  −(1−i)/2 ]                              (1−i)/2   −(1+i)/2 ].

A Hermitian matrix has real elements along the principal diagonal and A_{ji} is the complex conjugate of A_{ij}. Moreover, the quadratic form x^H Ax is always a real number.
22.5 Eigenvalues and Eigenvectors
Fact 22.18 (Homogeneous linear system) Consider the linear system in Fact 22.4 with c = 0: A_{m×n} x_{n×1} = 0_{m×1}. Then rank(A) = rank([A c]), so it has a unique solution if and only if rank(A) = n, and an infinite number of solutions if and only if rank(A) < n. Note that x = 0 is always a solution, and it is the unique solution if rank(A) = n. We can thus get a nontrivial solution (not all elements are zero) only if rank(A) < n.

The eigenvalues λ_i and associated eigenvectors z_i of the n × n matrix A satisfy

(A − λ_i I) z_i = 0_{n×1}.

We require the eigenvectors to be non-trivial (not all elements are zero). From Fact 22.18, an eigenvalue must therefore satisfy

det(A − λ_i I) = 0.
Fact 22.20 (Right and left eigenvectors) A "right eigenvector" z (the most common) satisfies Az = λz, and a "left eigenvector" v (seldom used) satisfies v'A = λv', that is, A'v = λv.

Fact 22.21 (Rank and eigenvalues) For any m × n matrix A, rank(A) = rank(A') = rank(A'A) = rank(AA') and equals the number of non-zero eigenvalues of A'A or AA'.

Example 22.22 Let x be an n × 1 vector, so rank(x) = 1. We then have that the outer product, xx', also has rank 1.
Fact 22.23 (Determinant and eigenvalues) For any n × n matrix A, det(A) = Π_{i=1}^{n} λ_i.

Fact 22.24 (Triangular matrix) A lower (upper) triangular matrix has zero elements above (below) the main diagonal.

Fact 22.25 (Eigenvalues of triangular matrix) For a triangular matrix A, the eigenvalues equal the diagonal elements of A. This follows from the fact that A − λI is also triangular, so det(A − λI) = Π_{i=1}^{n} (a_ii − λ), which is zero exactly when λ equals one of the diagonal elements.
For an orthogonal rotation in the (i, k) plane (with c² + s² = 1), premultiplying a vector ε gives

u_t = ε_t for t ≠ i, k,
u_i = ε_i c − ε_k s,
u_k = ε_i s + ε_k c.
Fact 22.30 (Positive definite matrix) The n n matrix A is positive definite if for any
non-zero n 1 vector x, x 0 Ax > 0. (It is positive semidefinite if x 0 Ax 0.)
Fact 22.31 (Some properties of positive definite matrices) If A is positive definite, then all eigenvalues are positive and real. (To see why, note that an eigenvalue λ satisfies Ax = λx. Premultiply by x' to get x'Ax = λx'x. Since both x'Ax and x'x are positive real numbers, λ must also be.)

Fact 22.33 (More properties of positive definite matrices) If A is positive definite, then det(A) > 0 and all diagonal elements are positive; if A is positive definite, then A^{-1} is too.
Fact 22.36 (Properties of symmetric matrices) If A is symmetric, then all eigenvalues are
real, and eigenvectors corresponding to distinct eigenvalues are orthogonal.
1
Fact 22.37 If A is symmetric, then A is symmetric.
Fact 22.40 (Spectral decomposition) If the eigenvectors are linearly independent, then we can decompose A as

A = ZΛZ^{-1}, where Λ = diag(λ_1, ..., λ_n) and Z = [z_1 z_2 ... z_n],

where Λ is a diagonal matrix with the eigenvalues along the principal diagonal, and Z is a matrix with the corresponding eigenvectors in the columns.

If A is symmetric with rank r, it can also be written A = C_1 Λ_1 C_1', where the n × r matrix C_1 contains the r eigenvectors associated with the r non-zero eigenvalues in the r × r matrix Λ_1.
Example 22.43 (2 × 2 matrix) For a 2 × 2 matrix we have the following Cholesky decomposition

chol( [a b; b d] ) = [ √a        0
                       b/√a   √(d − b²/a) ].

If a symmetric matrix can be written as ADA', where A is lower triangular with unit diagonal and D is diagonal with elements D_11, ..., D_nn, then the Cholesky decomposition is of the form

P = [ √D_11          0            ...   0
      A_21 √D_11     √D_22        ...   0
      ...
      A_n1 √D_11     A_n2 √D_22   ...   √D_nn ].
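A quick numerical check of the 2 × 2 formula in Example 22.43 (the values a = 4, b = 1, d = 2 are arbitrary assumptions that give a positive definite matrix):

    import numpy as np

    a, b, d = 4.0, 1.0, 2.0
    P = np.linalg.cholesky(np.array([[a, b], [b, d]]))      # lower triangular, P P' = A
    P_formula = np.array([[np.sqrt(a), 0.0],
                          [b / np.sqrt(a), np.sqrt(d - b**2 / a)]])
    print(np.allclose(P, P_formula))                         # True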
Fact 22.45 (Schur decomposition) The Schur decomposition of the n × n matrix A gives the n × n matrices T and Z such that

A = Z T Z^H,

where Z is a unitary n × n matrix and T is an n × n upper triangular Schur form with the eigenvalues along the diagonal. Note that premultiplying by Z^{-1} = Z^H and postmultiplying by Z gives

T = Z^H A Z,

which is upper triangular. The ordering of the eigenvalues in T can be reshuffled, although this requires that Z is reshuffled conformably to keep A = Z T Z^H, which involves a bit of tricky "book keeping."

The generalized Schur decomposition of the n × n matrices G and D gives unitary matrices Q and Z and upper triangular matrices S and T such that

G = Q S Z^H and D = Q T Z^H, that is, Q^H G Z = S and Q^H D Z = T.
Fact 22.48 (QR decomposition) Let A be m × n with m ≥ n. The QR decomposition is

A = QR = [Q_1 Q_2] [R_1; 0] = Q_1 R_1,

where Q is orthogonal (Q'Q = I) and R upper triangular. The last line is the "thin QR decomposition," where Q_1 is an m × n orthogonal matrix and R_1 an n × n upper triangular matrix.

A linear system Ax = c can then be solved by noting that

Q_1 R_1 x = c, so R_1 x = Q_1'c.

This is an upper triangular system which can be solved very easily by back substitution (solve the last equation first, then use that solution in the second to last equation, and so forth).
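A minimal sketch of solving a least squares problem this way (the data are simulated and the coefficient values are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(seed=5)
    A = rng.normal(size=(100, 3))
    c = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    Q1, R1 = np.linalg.qr(A, mode="reduced")       # thin QR: A = Q1 R1
    x = np.linalg.solve(R1, Q1.T @ c)              # solve the triangular system R1 x = Q1'c
    print(x)                                        # close to (1, -2, 0.5)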
0
A D Umm Smn Vnn
where U and V are orthogonal and S is diagonal with the first elements being non-zero,
that is, 2 3
" # s11 0
S1 0 6 : :
: : ::: 7 :
7
SD , where S1 D 4 ::
6
0 0 5
0 s
257
Fact 22.51 (Singular values and eigenvalues) The singular values of A are the nonnegative square roots of the eigenvalues of AA^H if m ≤ n and of A^H A if m > n.

Remark 22.52 If the square matrix A is symmetric and idempotent (A = A'A), then the singular values are the same as the eigenvalues. From Fact 22.41 we know that a symmetric A can be decomposed as A = CΛC'. It follows that this is the same as the singular value decomposition.
Fact 22.53 (Condition number) The condition number of a matrix is the ratio of the largest (in magnitude) of the singular values to the smallest,

c = |s_ii|_max / |s_ii|_min.

For a square matrix, we can calculate the condition number from the eigenvalues of AA^H or A^H A (see Fact 22.51). In particular, for a square matrix we have

c = |√λ_i|_max / |√λ_i|_min,

where λ_i are the eigenvalues of AA^H or A^H A.

Fact 22.54 (Condition number and computers) The determinant is not a good indicator of the reliability of numerical inversion algorithms. Instead, let c be the condition number of a square matrix. If 1/c is close to a computer's floating-point precision (10^{−13} or so), then numerical routines for a matrix inverse become unreliable. For instance, while det(0.1 I_20) = 10^{−20}, the condition number of 0.1 I_20 is unity and the matrix is indeed easy to invert to get 10 I_20.
Fact 22.55 (Inverting by using the SVD decomposition) The inverse of the square matrix A is found by noting that if A is square, then from Fact 22.50 we have

AA^{-1} = I or USV'A^{-1} = I, so A^{-1} = VS^{-1}U',

provided S is invertible.
Fact 22.56 (Pseudo inverse or generalized inverse) The Moore-Penrose pseudo (generalized) inverse of an m × n matrix A is defined as

A⁺ = V S⁺ U', where S⁺_{n×m} = [ S_1^{-1} 0; 0 0 ],

where V and U are from Fact 22.50. The submatrix S_1^{-1} contains the reciprocals of the non-zero singular values along the principal diagonal. A⁺ satisfies the Moore-Penrose conditions

A A⁺ A = A,  A⁺ A A⁺ = A⁺,  (A A⁺)' = A A⁺, and (A⁺ A)' = A⁺ A.

See Fact 22.9 for the idea behind the generalized inverse.
Fact 22.57 (Some properties of generalized inverses) If A has full rank, then A⁺ = A^{-1}; (BC)⁺ = C⁺B⁺; if B and C are invertible, then (BAC)⁺ = C^{-1}A⁺B^{-1}; (A⁺)' = (A')⁺; if A is symmetric, then A⁺ is symmetric.
Fact 22.59 (Pseudo inverse of symmetric matrix) If A is symmetric, then the SVD is identical to the spectral decomposition A = ZΛZ', where Z is a matrix of the orthogonal eigenvectors (Z'Z = I) and Λ is a diagonal matrix with the eigenvalues along the main diagonal. By Fact 22.56 we then have A⁺ = ZΛ⁺Z', where

Λ⁺ = [ Λ_11^{-1} 0; 0 0 ],

with the reciprocals of the non-zero eigenvalues along the principal diagonal of Λ_11^{-1}.
22.8 Matrix Calculus
Fact 22.60 (Matrix differentiation of non-linear functions, ∂y/∂x') Let the vector y_{n×1} be a function of the vector x_{m×1}

[y_1; ...; y_n] = f(x) = [f_1(x); ...; f_n(x)].

Then ∂y/∂x' is the n × m matrix with ∂f_i(x)/∂x_j in cell (i, j)

∂y/∂x' = [ ∂f_1(x)/∂x_1  ...  ∂f_1(x)/∂x_m
           ...
           ∂f_n(x)/∂x_1  ...  ∂f_n(x)/∂x_m ].

This matrix is often called the Jacobian of the f functions. (Note that the notation implies that the derivatives of the first element in y, denoted y_1, with respect to each of the elements in x' are found in the first row of ∂y/∂x'. A rule to help memorizing the format of ∂y/∂x': y is a column vector and x' is a row vector.)
Fact 22.61 (∂y'/∂x instead of ∂y/∂x') With the notation in the previous Fact, we get

∂y'/∂x = [ ∂f_1(x)/∂x_1  ...  ∂f_n(x)/∂x_1
           ...
           ∂f_1(x)/∂x_m  ...  ∂f_n(x)/∂x_m ] = (∂y/∂x')'.
Fact 22.62 (Matrix differentiation of linear systems) When y_{n×1} = A_{n×m} x_{m×1}, then f(x) is a linear function

[y_1; ...; y_n] = [ a_11 ... a_1m; ...; a_n1 ... a_nm ] [x_1; ...; x_m].

In this case ∂y/∂x' = A and ∂y'/∂x = A'.

Fact 22.63 (Matrix differentiation of inner product) The inner product of two column vectors, y = z'x, is a special case of a linear system with A = z'. In this case we get ∂(z'x)/∂x' = z' and ∂(z'x)/∂x = z. Clearly, the derivatives of x'z are the same (a transpose of a scalar).
Fact 22.65 (First order Taylor series) For each element f_i(x) in the n-vector f(x), we can apply the mean-value theorem

f_i(x) = f_i(c) + [∂f_i(b_i)/∂x'] (x − c),

for some vector b_i between c and x. Stacking these expressions gives

[f_1(x); ...; f_n(x)] = [f_1(c); ...; f_n(c)] + [ ∂f_1(b_1)/∂x_1 ... ∂f_1(b_1)/∂x_m; ...; ∂f_n(b_n)/∂x_1 ... ∂f_n(b_n)/∂x_m ] (x − c), or

f(x) = f(c) + [∂f(b)/∂x'] (x − c),

where the notation f(b) is a bit sloppy. It should be interpreted as meaning that we have to evaluate the derivatives at different points for the different elements in f(x).
Fact 22.66 (Matrix differentiation of quadratic forms) Let x_{m×1} be a vector, A_{m×m} a matrix, and f(x)_{n×1} a vector of functions. Then,

∂(x'Ax)/∂x = (A + A')x, which equals 2Ax if A is symmetric, and
∂[f(x)'Af(x)]/∂x = [∂f(x)/∂x']' (A + A') f(x).
Example 22.67 (∂(x'Ax)/∂x = 2Ax when x is 2 × 1 and A is 2 × 2)

∂/∂x ( [x_1 x_2] [A_11 A_12; A_21 A_22] [x_1; x_2] ) = ( [A_11 A_12; A_21 A_22] + [A_11 A_21; A_12 A_22] ) [x_1; x_2]
                                                     = 2 [A_11 A_12; A_12 A_22] [x_1; x_2] if A_21 = A_12.
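A quick numerical check of ∂(x'Ax)/∂x = (A + A')x with a central-difference gradient (the 3 × 3 matrix and the vector are arbitrary random draws):

    import numpy as np

    rng = np.random.default_rng(seed=9)
    A = rng.normal(size=(3, 3))
    x = rng.normal(size=3)
    h = 1e-6
    grad_num = np.array([((x + h*e) @ A @ (x + h*e) - (x - h*e) @ A @ (x - h*e)) / (2*h)
                         for e in np.eye(3)])               # central differences, element by element
    print(np.allclose(grad_num, (A + A.T) @ x, atol=1e-5))  # True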
Example 22.68 (Least squares) Consider the linear model Y_{m×1} = X_{m×n} β_{n×1} + u_{m×1}. We want to minimize the sum of squared fitted errors by choosing the n × 1 vector β. The fitted errors depend on the chosen β: u(β) = Y − Xβ, so the quadratic loss function is

L = u(β)'u(β) = (Y − Xβ)'(Y − Xβ).

In this case, f(β) = u(β) = Y − Xβ, so ∂f(β)/∂β' = −X. The first order condition for u'u is thus

−2X'(Y − Xβ̂) = 0_{n×1} or X'Y = X'X β̂,

which can be solved as β̂ = (X'X)^{-1}X'Y (provided X'X is invertible).
Fact 22.69 (Matrix of 2nd order derivatives of a non-linear function, ∂²y/∂x∂x') Let the scalar y be a function of the vector x_{m×1}

y = f(x).

Then, let ∂²y/∂x∂x' be the m × m matrix with ∂²y/∂x_i∂x_j in cell (i, j)

∂²y/∂x∂x' = [ ∂²f(x)/∂x_1∂x_1  ...  ∂²f(x)/∂x_1∂x_m
              ...
              ∂²f(x)/∂x_m∂x_1  ...  ∂²f(x)/∂x_m∂x_m ].

This matrix is often called the Hessian of the f function. This is clearly a symmetric matrix.
22.9 Miscellaneous
Fact 22.73 (The vec operator) vec A, where A is m × n, gives an mn × 1 vector with the columns in A stacked on top of each other. For instance,

vec [ a_11 a_12; a_21 a_22 ] = [ a_11; a_21; a_12; a_22 ].

Properties: vec(A + B) = vec A + vec B; vec(ABC) = (C' ⊗ A) vec B; if a and b are column vectors, then vec(ab') = b ⊗ a.
Fact 22.74 (The vech operator) vech A, where A is m × m, gives an m(m+1)/2 × 1 vector with the elements on and below the principal diagonal of A stacked on top of each other (columnwise). For instance,

vech [ a_11 a_12; a_21 a_22 ] = [ a_11; a_21; a_22 ],

that is, like vec, but using only the elements on and below the principal diagonal.
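A quick numerical check of the identity vec(ABC) = (C' ⊗ A) vec B (the matrix sizes are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(seed=13)
    A, B, C = rng.normal(size=(2, 3)), rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
    vec = lambda M: M.reshape(-1, order="F")                       # stack the columns
    print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))   # True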
Fact 22.75 (Duplication matrix) The duplication matrix D_m is defined such that for any symmetric m × m matrix A we have vec A = D_m vech A. The duplication matrix is therefore useful for "inverting" the vech operator (the step from vec A to A is trivial). For instance, to continue the example of the vech operator,

D_2 vech A = [ 1 0 0; 0 1 0; 0 1 0; 0 0 1 ] [ a_11; a_21; a_22 ] = [ a_11; a_21; a_21; a_22 ] = vec A.
A sum of outer products can be written as a matrix product,

S = X'Y.

(To see this, let X(i,:) be the ith row of X, and similarly for Y, so

X'Y = Σ_{t=1}^{T} X(t,:)' Y(t,:).

For instance, with X(t,:) = [a_t b_t], Y(t,:) = [p_t q_t r_t] and T = 2 we have

X'Y = [ a_1 a_2; b_1 b_2 ] [ p_1 q_1 r_1; p_2 q_2 r_2 ] = Σ_{t=1}^{T} [ a_t; b_t ] [ p_t q_t r_t ].)
Fact 22.77 (Matrix geometric series) Suppose the eigenvalues of the square matrix A are all less than one in modulus. Then,

I + A + A² + ... = (I − A)^{-1}.

To see why this makes sense, consider (I − A) Σ_{t=0}^{T} A^t (with the convention that A⁰ = I). It can be written as

(I − A) Σ_{t=0}^{T} A^t = (I + A + ... + A^T) − (A + A² + ... + A^{T+1}) = I − A^{T+1}.

If all the eigenvalues are stable, then lim_{T→∞} A^{T+1} = 0, so taking the limit of the previous equation gives

(I − A) lim_{T→∞} Σ_{t=0}^{T} A^t = I.
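A quick numerical check (the 2 × 2 matrix is an arbitrary choice with eigenvalues inside the unit circle):

    import numpy as np

    A = np.array([[0.5, 0.2],
                  [0.1, 0.3]])
    partial_sum = sum(np.linalg.matrix_power(A, k) for k in range(200))   # I + A + ... + A^199
    print(np.allclose(partial_sum, np.linalg.inv(np.eye(2) - A)))          # True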
Bibliography
Anton, H., 1987, Elementary linear algebra, John Wiley and Sons, New York, 5th edn.
Björk, Å., 1996, Numerical methods for least squares problems, SIAM, Philadelphia.
Golub, G. H., and C. F. van Loan, 1989, Matrix computations, The John Hopkins Univer-
sity Press, Baltimore, 2nd edn.
Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.