
Basic Financial Econometrics

Alois Geyer

Vienna University of Economics and Business

[email protected]
http://www.wu.ac.at/~geyer

this version:

May 12, 2019

preliminary and incomplete

© Alois Geyer 2019 – Some rights reserved.


This document is subject to the following Creative Commons license:
http://creativecommons.org/licenses/by-nc-nd/2.0/at/deed.en_US
Contents
1 Financial Regression Analysis 1
1.1 Regression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Least squares estimation . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Finite sample properties of least squares estimates . . . . . . . . . . . . . . 6
1.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.3 Testing hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 Example 6: CAPM, beta-factors and multi-factor models . . . . . . 15
1.2.5 Example 7: Interest rate parity . . . . . . . . . . . . . . . . . . . . . 19
1.2.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3 Large sample properties of least squares estimates . . . . . . . . . . . . . . 22
1.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.2 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.3 Time series data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5 LM, LR and Wald tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.6 Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.6.1 Log and other transformations . . . . . . . . . . . . . . . . . . . . . 33
1.6.2 Dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6.3 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6.4 Difference-in-differences . . . . . . . . . . . . . . . . . . . . . . . 36
1.6.5 Example 11: Hedonic price functions . . . . . . . . . . . . . . . . . . 37
1.6.6 Example 12: House price changes induced by siting decisions . . . . 38
1.6.7 Omitted and irrelevant regressors . . . . . . . . . . . . . . . . . . . . 39
1.6.8 Selection of regressors . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.7 Regression diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.1 Non-normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.2 Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.7.3 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.8 Generalized least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.8.1 Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.8.2 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.8.3 Example 19: Long-horizon return regressions . . . . . . . . . . . . . 56
1.9 Endogeneity and instrumental variable estimation . . . . . . . . . . . . . . . 58
1.9.1 Endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.9.2 Instrumental variable estimation . . . . . . . . . . . . . . . . . . . . 60
1.9.3 Selection of instruments and tests . . . . . . . . . . . . . . . . . . . 63
1.9.4 Example 21: Consumption based asset pricing . . . . . . . . . . . . 66
1.10 Generalized method of moments . . . . . . . . . . . . . . . . . . . . . . . . 70
1.10.1 OLS, IV and GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.10.2 Asset pricing and GMM . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.10.3 Estimation and inference . . . . . . . . . . . . . . . . . . . . . . . . 75
1.10.4 Example 24: Models for the short-term interest rate . . . . . . . . . 78
1.11 Models with binary dependent variables . . . . . . . . . . . . . . . . . . . . 79
1.12 Sample selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
1.13 Duration models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2 Time Series Analysis 88
2.1 Financial time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.1.1 Descriptive statistics of returns . . . . . . . . . . . . . . . . . . . . . 89
2.1.2 Return distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.1.3 Abnormal returns and event studies . . . . . . . . . . . . . . . . . . 95
2.1.4 Autocorrelation analysis of financial returns . . . . . . . . . . . . 98
2.1.5 Stochastic process terminology . . . . . . . . . . . . . . . . . . . . . 101
2.2 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.2.1 AR models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.2.2 MA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.2.3 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.2.4 Estimating ARMA models . . . . . . . . . . . . . . . . . . . . . . . . 107
2.2.5 Diagnostic checking of ARMA models . . . . . . . . . . . . . . . . . 108
2.2.6 Example 35: ARMA models for FTSE and AMEX returns . . . . . 109
2.2.7 Forecasting with ARMA models . . . . . . . . . . . . . . . . . . . . 111
2.2.8 Properties of ARMA forecast errors . . . . . . . . . . . . . . . . . . 113
2.3 Non-stationary models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.3.1 Random-walk and ARIMA models . . . . . . . . . . . . . . . . . . . 116
2.3.2 Forecasting prices from returns . . . . . . . . . . . . . . . . . . . . . 119
2.3.3 Unit-root tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.4 Diffusion models in discrete time . . . . . . . . . . . . . . . . . . . . . . 125
2.4.1 Discrete time approximation . . . . . . . . . . . . . . . . . . . . . . 127
2.4.2 Estimating parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 127
2.4.3 Probability statements about future prices . . . . . . . . . . . . . . . 130
2.5 GARCH models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.5.1 Estimating and diagnostic checking of GARCH models . . . . . . . . 134
2.5.2 Example 49: ARMA-GARCH models for IBM and FTSE returns . . 134
2.5.3 Forecasting with GARCH models . . . . . . . . . . . . . . . . . . . . 136
2.5.4 Special GARCH models . . . . . . . . . . . . . . . . . . . . . . . . . 137
3 Vector time series models 139
3.1 Vector-autoregressive models . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.1.1 Formulation of VAR models . . . . . . . . . . . . . . . . . . . . . . . 139
3.1.2 Estimating and forecasting VAR models . . . . . . . . . . . . . . . . 141
3.2 Cointegration and error correction models . . . . . . . . . . . . . . . . . . . 144
3.2.1 Cointegration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.2.2 Error correction model . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.2.3 Example 53: The expectation hypothesis of the term structure . . . 146
3.2.4 The Engle-Granger procedure . . . . . . . . . . . . . . . . . . . . . . 147
3.2.5 The Johansen procedure . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.2.6 Cointegration among more than two series . . . . . . . . . . . . . . . 156
3.3 State space modeling and the Kalman filter . . . . . . . . . . . . . . . . 158
3.3.1 The state space formulation . . . . . . . . . . . . . . . . . . . . . . . 158
3.3.2 The Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
3.3.3 Example 60: The Cox-Ingersoll-Ross model of the term structure . . 160
Bibliography 163

I am grateful to many PhD students of the VGSF program, as well as doctoral and master
students at WU for valuable comments which have helped to improve these lecture notes.
1 Financial Regression Analysis
1.1 Regression analysis
We start by reviewing key aspects of regression analysis. Its purpose is to relate a dependent variable y to one or more variables X which are assumed to affect y. The relation is specified in terms of a systematic part which determines the expected value of y and a random part ε. For example, the systematic part could be a (theoretically derived) valuation relationship. The random part represents unsystematic deviations between observations and expectations (e.g. deviations from equilibrium). The relation between y and X depends on unknown parameters β which are used in the function that relates X to the expectation of y.
Assumption AL (linearity): We consider the linear regression equation

y = Xβ + ε.

y is the n×1 vector (y1, ..., yn)' of observations of the dependent (or endogenous) variable, ε is the vector of errors (also called residuals, disturbances, innovations or shocks), β is the K×1 vector of parameters, and the n×K matrix X of regressors (also called explanatory variables or covariates) is defined as follows:
X = [ 1  x11  x12  ···  x1k
      1  x21  x22  ···  x2k
      ⋮    ⋮    ⋮    ⋱    ⋮
      1  xn1  xn2  ···  xnk ].
k is the number of regressors and K = k+1 is the dimension of β = (β0, β1, ..., βk)', where β0 is the constant term or intercept. A single row i of X will be denoted by the K×1 column vector x_i. For a single observation the model equation is written as

y_i = x_i'β + ε_i   (i = 1, ..., n).

We will frequently (mainly in the context of model specification and interpretation) use formulations like

y = β0 + β1 x1 + ··· + βk xk + ε,

where the symbols y, x_i and ε represent the variables in question. It is understood that such equations also hold for a single observation.
1.1.1 Least squares estimation
A main purpose of regression analysis is to draw conclusions about the population using a sample. The regression equation y = Xβ + ε is assumed to hold in the population. The sample estimate of β is denoted by b and the estimate of ε by e. According to the least squares (LS) criterion, b should be chosen such that the sum of squared errors SSE is minimized:

SSE(b) = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − x_i'b)² = (y − Xb)'(y − Xb) → min.

A necessary condition for a minimum is derived from

SSE(b) = y'y − 2b'X'y + b'X'Xb,

and is given by

∂SSE(b)/∂b = 0:   −2X'y + 2X'Xb = 0.
Assumption AR (rank): We assume that X has full rank equal to K (i.e. the columns of X are linearly independent). If X has full rank, X'X is positive definite and the ordinary least squares (OLS) estimates b are given by

b = (X'X)⁻¹X'y.   (1)

The solution is a minimum since

∂²SSE(b)/∂b² = 2X'X

is positive definite by assumption AR.
It is useful to express X'y and X'X in terms of the sums

X'y = Σ_{i=1}^n x_i y_i    X'X = Σ_{i=1}^n x_i x_i'

to point out that the estimate is related to the covariance between the dependent variable and the regressors, and the covariance among regressors. In the special case of the simple regression model y = b0 + b1 x + e with a single regressor the estimates b1 and b0 are given by

b1 = s_yx / s_x² = r_yx · (s_y/s_x)    b0 = ȳ − b1 x̄,

where s_yx (r_yx) is the sample covariance (correlation) between y and x, s_y and s_x are the sample standard deviations of y and x, and ȳ and x̄ are their sample means.
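Both routes to the estimates (the matrix formula (1) and the covariance-based formulas for b1 and b0) are easy to check numerically. A minimal sketch in Python, assuming numpy is available; the data and coefficients are simulated, not taken from the text:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

    # matrix route: b = (X'X)^{-1} X'y, with a column of ones for the intercept
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)

    # simple-regression route: b1 = s_yx / s_x^2, b0 = ybar - b1 * xbar
    s_yx = np.cov(y, x, ddof=1)[0, 1]
    b1 = s_yx / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    print(b, (b0, b1))  # both routes agree up to rounding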
1.1.2 Implications
By the first order condition the OLS estimates satisfy the normal equation

(X'X)b − X'y = X'(y − Xb) = X'e = 0,   (2)
which implies that each column of X is uncorrelated with (orthogonal to) e.
If the first column of X is a column of ones denoted by ι, LS estimation has the following implications:

1. The residuals have zero mean since ι'e = 0 (from the normal equation).

2. This implies that the mean of the fitted values ŷ_i = x_i'b is equal to the sample mean:

   (1/n) Σ_{i=1}^n ŷ_i = ȳ.

3. The fitted values are equal to the mean of y if the regression equation is evaluated for the means of X:

   ȳ = b0 + Σ_{j=1}^k x̄_j b_j.

4. The fitted values and the residuals are orthogonal:

   ŷ'e = 0.

5. The slope in a regression of y on e is always equal to one and the constant is equal to ȳ.¹
The goodness of fit of a regression model can be measured by the coefficient of determination R² defined as

R² = 1 − e'e/((y−ȳ)'(y−ȳ)) = 1 − (y−ŷ)'(y−ŷ)/((y−ȳ)'(y−ȳ)) = (ŷ−ȳ)'(ŷ−ȳ)/((y−ȳ)'(y−ȳ)).

This is the so-called centered version of R² which lies between 0 and 1 if the model contains an intercept. It is equal to the squared correlation between y and ŷ. The three terms in the expression

(y−ȳ)'(y−ȳ) = (ŷ−ȳ)'(ŷ−ȳ) + (y−ŷ)'(y−ŷ)

are called the total sum of squares (SST), the sum of squares from the regression (SSR), and the sum of squared errors (SSE). Based on this relation R² is frequently interpreted as
¹ By implication 3 the constant must be equal to ȳ since the mean of e is zero. The slope is given by (e'e)⁻¹e'ỹ, where ỹ = y − ȳι. The slope is equal to one since e'ỹ = e'e. The latter identity holds since in the original regression e'y = e'Xb + e'e and e'X = 0'. Finally, e'y = e'ỹ since e'ι = 0.
the percentage of y's variance 'explained' by the regression. If the model does not contain an intercept, the centered R² may become negative. In that case the uncentered R² can be used:

uncentered R² = 1 − e'e/(y'y) = ŷ'ŷ/(y'y).

R² is zero if all regression coefficients except for the constant are zero (b = (b0 0')' and ŷ = b0 = ȳ). In this case the regression is a horizontal line. If R² = 1 all observations are located on the regression line (or hyperplane) (i.e. ŷ_i = y_i). R² is (only) a measure for the goodness of the linear approximation implied by the regression. Many other, more relevant aspects of a model's quality are not taken into account by R². Such aspects will become more apparent as we proceed.
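The five implications and the centered R² can be verified numerically in the same way. A short sketch (numpy assumed; simulated data as before):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    yhat = X @ b

    print(abs(e.mean()) < 1e-12)              # implication 1: residuals have zero mean
    print(np.isclose(yhat.mean(), y.mean()))  # implication 2: mean of fitted values
    print(abs(yhat @ e) < 1e-8)               # implication 4: fitted values orthogonal to e
    R2 = 1 - (e @ e) / np.sum((y - y.mean()) ** 2)
    print(np.isclose(R2, np.corrcoef(y, yhat)[0, 1] ** 2))  # R^2 = squared correlation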
1.1.3 Interpretation
The coefficients b can be interpreted on the basis of the fitted values²

ŷ = b0 + x1 b1 + ··· + xk bk.

b_j is the change in ŷ (or, the expected change in y) if x_j changes by one unit ceteris paribus (c.p.), i.e. holding all other regressors fixed. In general the change in the expected value is

Δŷ = Δx1 b1 + ··· + Δxk bk,

which implies that the effects of simultaneously changing several regressors can be added up.
This interpretation is based on the Frisch-Waugh theorem. Suppose we partition the regressors in two groups X1 and X2, and regress y on X1 to save the residuals e1. Next we regress each column of X2 on X1 and save the residuals of these regressions in the matrix E2. According to the Frisch-Waugh theorem the coefficients from the regression of e1 on E2 are equal to the subset of coefficients from the regression of y on X that corresponds to X2. In more general terms, the theorem implies that partial effects can be obtained directly from a multiple regression. It is not necessary to first construct orthogonal variables.
To illustrate the theorem we consider the regression

y = b0 + b1 x1 + b2 x2 + e.

To obtain the coefficient of x2 such that the effect of x1 (and the intercept) is held constant, we first run the two simple regressions

y = c_y + b_y1 x1 + e_y1    x2 = c_x2 + b_21 x1 + e_21.

e_y1 and e_21 represent those parts of y and x2 which do not depend on x1. Subsequently, we run a regression using these residuals to obtain the coefficient b2:

(y − c_y − b_y1 x1) = b2 (x2 − c_x2 − b_21 x1) + u    or    e_y1 = b2 e_21 + u.
² An analogous interpretation holds for β in the population.
In general, this procedure is also referred to as 'controlling for' or 'partialling out' the effect of X1. Simply speaking, if we want to isolate the effects of X2 on y we have to 'remove' the effects of X1 from the entire regression equation.³ However, according to the Frisch-Waugh theorem it is not necessary to run this sequence of regressions in practice. Running a (multiple) regression of y on all regressors X 'automatically' controls for the effects of each regressor on all other regressors. A special case is an orthogonal regression, where all regressors are uncorrelated (i.e. X'X is a diagonal matrix). In this case the coefficients from the multiple regression are identical to those obtained from K simple regressions using one column of X at a time.
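The Frisch-Waugh logic is easy to verify numerically. A minimal sketch in Python (numpy assumed; data and coefficients are hypothetical): the slope of e1 on e2 reproduces the x2-coefficient of the multiple regression.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)              # x1 and x2 deliberately correlated
    y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

    def resid(v, X):
        # residuals from an OLS regression of v on X
        return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

    X1 = np.column_stack([np.ones(n), x1])          # constant and x1
    e1 = resid(y, X1)                               # y purged of X1
    e2 = resid(x2, X1)                              # x2 purged of X1
    b2_fw = (e2 @ e1) / (e2 @ e2)                   # slope of e1 on e2

    X = np.column_stack([np.ones(n), x1, x2])
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b2_fw, b_full[2])                         # identical up to rounding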
Example 1: We use the real investment data from Table 3.1 in Greene (2003) to estimate a multiple regression model. The dependent variable is real investment (in trillion US$; denoted by y). The explanatory variables are real GNP (in trillion US$; g), the (nominal) interest rate r and the inflation rate i (both measured as percentages). The (rounded) estimated coefficients are

b = (−0.0726  0.236  −0.00356  0.000276)',

where the first element is the constant term. The coefficient −0.00356 can be interpreted as follows: if the interest rate goes up by one percentage point and the other regressors do not change, real investment is expected to drop by about 3.56 billion US$. SST=0.0164, SSR=0.0127 and SSE=0.00364. The corresponding R² equals 0.78 (SSR/SST), which means that about 78% of the variance in real investment can be explained by the regressors. Further details can be found in the file investment.xls.

Exercise 1: Use the quarterly data in Table F5.1 from Greene's website
http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm
(see file Table F5.1.xls) to estimate a regression of real investment on a
constant, real GDP, the nominal interest rate (90 day treasury bill rate) and
the inflation rate. Check the validity of the five OLS implications mentioned
on p.3.
Apply the Frisch-Waugh theorem and show how the coefficients of the constant
term and real GDP can be obtained by controlling for the effects of the nominal
interest rate and inflation.

³ As a matter of fact, the effects of X1, or any other set of regressors we want to control for, need not be removed from y. It can be shown that the coefficients associated with X2 can also be obtained from a regression of y on E2. Because of implication 5 the covariance between e1 and the columns of E2 is identical to the covariance between y and E2.
1.2 Finite sample properties of least squares estimates⁴
Review 1: For any constants a and b and random variables Y and X the following relations hold:

E[a + Y] = a + E[Y]   E[aY] = aE[Y]   V[a + Y] = V[Y]   V[aY] = a²V[Y]
E[aX + bY] = aE[X] + bE[Y]   V[aX + bY] = a²V[X] + b²V[Y] + 2ab·cov[X, Y]

Jensen's inequality: E[f(X)] ≥ f(E[X]) for any convex function f(X).
For a constant a and random variables W, X, Y, Z the following relations hold:

if Y = aZ:      cov[X, Y] = a·cov[X, Z]
if Y = W + Z:   cov[X, Y] = cov[X, W] + cov[X, Z]
cov[X, Y] = E[XY] − E[X]E[Y]   cov[Y, a] = 0

If X is an n×1 vector of random variables, V[X] = cov[X] = Σ = E[(X − E[X])(X − E[X])'] is an n×n matrix. Its diagonal elements are the variances of the elements of X. Using μ = E[X] we can write Σ = E[XX'] − μμ'.
If b is an n×1 vector and A is an n×n matrix of constants, the following relations hold:

E[b'X] = b'μ   V[b'X] = b'Σb   E[AX] = Aμ   V[AX] = AΣA'

Review 2: The conditional and unconditional moments of two random variables Y and X are related as follows:

Law of iterated expectations:⁵  E[Y] = E_x[E[Y|X]]
Functions of the conditioning variable:⁶  E[f(X)Y|X] = f(X)E[Y|X]
If E[Y|X] is a linear function of X:  E[Y|X] = E[Y] + (cov[Y, X]/V[X])(X − E[X])
Variance decomposition:  V[Y] = E_x[V[Y|X]] + V_x[E[Y|X]]
Conditional variance:  V[Y|X] = E[(Y − E[Y|X])²|X] = E[Y²|X] − (E[Y|X])²

Review 3:⁷ A set of n observations y_i (i=1,...,n) of a random variable Y is a random sample if the observations are drawn independently from the same population with probability density f(y_i; θ). A random sample is said to be independent, identically distributed (i.i.d.), which is denoted by y_i ∼ i.i.d.
A cross section is a sample of several units (e.g. firms or households) observed at a specific point in time (or time interval). A time series is a chronologically ordered sequence of data usually observed at regular time intervals (e.g. days or months). Panel data is constructed by stacking time series of several cross sections (e.g. monthly consumption and income of several households).
We consider a parameter θ and its estimator θ̂ derived from a random sample of size n. Estimators are rules for calculating estimates from a sample. For simplicity θ̂ both
⁴ Most of this section is based on Greene (2003), sections 2.3 and 4.3 to 4.7, and Hayashi (2000), sections 1.1 and 1.3.
⁵ E[Y|X] is a function of X. The notation E_x indicates expectation over values of X.
⁶ See equation 7-60 in Papoulis (1984, p.165).
⁷ Greene (2003); sections C.1 to C.5.
denotes the estimated value from a specific sample and the estimator (the function used to derive the estimate). θ̂ is a random variable since it depends on the (random) sample. The sampling distribution describes the probability distribution of θ̂ across possible samples.
Unbiasedness: θ̂ is unbiased if E[θ̂] = θ. The expectation is formed with respect to the sampling distribution of θ̂. The bias is E[θ̂] − θ.
Examples: The sample mean and the sample median are unbiased estimators. The unadjusted sample variance

s̃² = (1/n) Σ_{i=1}^n (y_i − ȳ)²

is a biased estimator of σ², whereas s² = n·s̃²/(n−1) is unbiased.
Mean squared error: The mean squared error (MSE) of θ̂ is the sum of the variance and the squared bias:

MSE[θ̂] = E[(θ̂ − θ)²] = V[θ̂] + (E[θ̂ − θ])².

Example: The MSE of the unbiased estimator s² is larger than the MSE of s̃².
Efficiency: θ̂ is efficient if it is unbiased and its sampling variance is lower than the variance of any other⁸ unbiased estimator θ̂′:

V[θ̂] < V[θ̂′].

⁸ If the condition holds for another estimator one could use the term 'relative efficiency'.
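Both claims in Review 3 (the bias of s̃² and the MSE ranking) can be checked with a few lines of Monte Carlo. A sketch in Python (numpy assumed; sample size and σ² are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps, sigma2 = 10, 100_000, 4.0
    draws = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
    s2_tilde = draws.var(axis=1, ddof=0)   # unadjusted: divides by n (biased)
    s2 = draws.var(axis=1, ddof=1)         # adjusted: divides by n-1 (unbiased)
    print(s2_tilde.mean(), s2.mean())      # about 3.6 vs about 4.0
    mse = lambda est: np.mean((est - sigma2) ** 2)
    print(mse(s2) > mse(s2_tilde))         # True: the unbiased estimator has larger MSE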
1.2.1 Assumptions
The sample estimates b and e can be used to draw conclusions about the population. An important question relates to the finite sample properties of the OLS estimates. Exact (or finite sample) inference, as opposed to asymptotic (large sample) inference, is valid for any sample size n and is based on further assumptions (in addition to AL and AR) mentioned and discussed below.
To derive the finite sample properties of the OLS estimate we rewrite b in (1) as follows:

b = (X'X)⁻¹X'(Xβ + ε) = β + (X'X)⁻¹X'ε = β + Hε.   (3)

We consider the statistical properties of b (in particular E[b], V[b], and its distribution). This is equivalent to investigating the sampling error b − β. From

E[b] = β + E[(X'X)⁻¹X'ε]   (4)

we see that the properties of b depend on the properties of X, ε, and their relation. In the so-called classical regression model, X is assumed to be non-stochastic. This means that X can be chosen (like in an experimental situation), or is fixed in repeated samples. Neither case holds in typical financial empirical studies. We will treat X as random, and the finite sample properties derived below are considered to be conditional on the sample X (although we will not always indicate this explicitly). This does not preclude the possibility that X contains constants (e.g. dummy variables). The important requirement (assumption) is that X and ε are generated by mechanisms that are completely unrelated.
Assumption AX (strict exogeneity): The conditional expectation of each ε_i conditional on all observations and variables in X is zero:

E[ε|X] = 0    E[ε_i|x_1, ..., x_n] = 0   (i = 1, ..., n).

According to this assumption, X cannot be used to obtain information about ε. AX has the following implications:

1. (unconditional mean): E[E[ε|X]] = E[ε] = 0.

2. (conditional expectation): E[y|X] = ŷ = Xβ.

3. Regressors and disturbances are orthogonal:

E[x_il ε_j] = 0   (i, j = 1, ..., n; l = 1, ..., K),

since E[x_il ε_j] = E[E[x_il ε_j|x_il]] = E[x_il E[ε_j|x_il]] = 0. This implies that regressors are orthogonal to the disturbances from the same and all other observations. Orthogonality with respect to the same observations is expressed by

E[X'ε] = 0.

Orthogonality is equivalent to zero correlation between X and ε:

cov[X, ε] = E[X'ε] − E[X]E[ε] = 0.
4. y_i ε_i = x_i'β·ε_i + ε_i²  ⇒  E[y_i ε_i] = E[ε_i²] = V[ε_i].
If AX holds, the explanatory variables are (strictly) exogenous. The term endogeneity (i.e. one or all explanatory variables are endogenous) is used if AX does not hold (broadly speaking, if X and ε are correlated).
For example, AX is violated when a regressor, in fact, is determined on the basis of the dependent variable y. This is the case in any situation where y and X (at least one of its columns) are determined simultaneously. A classic example is a regression attempting to analyze the effect of the number of policemen on the crime rate. Such regressions are bound to fail whenever the police force is driven by the number of crimes committed. Solutions to this kind of problem are discussed in section 1.9.1. Another example is a regression relating the performance of funds to their size. It is conceivable that an unobserved variable like the skill of fund managers affects size and performance. If that is the case, AX is violated.
Another important case where AX does not hold is a model where a lagged dependent variable is used as a regressor:

y_t = λy_{t−1} + x_t'β + ε_t    y_{t+1} = λy_t + x_{t+1}'β + ε_{t+1}    y_{t+2} = ...

AX requires the disturbance ε_t to be uncorrelated with regressors from any other observation, e.g. with y_t from the equation for t+1. AX is violated because E[y_t ε_t] ≠ 0.
Predictive regressions consist of a regression of y_t on a lagged predictor x_{t−1}:

y_t = β0 + β1 x_{t−1} + ε_t.

For typically used dependent variables like asset returns (i.e. y_t = ln(p_t/p_{t−1})) and predictors like the dividend-price ratio (i.e. x_t = ln(d_t/p_t)), Stambaugh (1999) argues that, despite E[ε_t|x_{t−1}] = 0, in a predictive regression E[ε_t|x_t] ≠ 0, and thus AX is violated. To understand this reasoning, we consider

ln p_t − ln p_{t−1} = β1 (ln d_{t−1} − ln p_{t−1}) + ε_t
ln p_{t+1} − ln p_t = β1 (ln d_t − ln p_t) + ε_{t+1}   ...

(the left-hand sides are y_t and y_{t+1}; the terms in parentheses are x_{t−1} and x_t), where β0 = 0 for simplicity. Disturbances ε_t affect the price in t (and, for given p_{t−1}, the return during the period t−1 to t). Thus, they are correlated with p_t, and hence with the regressor in the equation for t+1. Although the mechanism appears similar to the case of a lagged dependent variable, here the correlation between the disturbances and very specifically defined predictors x_t is the source of violation of AX. Stambaugh (1999) shows that this leads to a finite-sample bias (see below) in the estimated parameter b1, irrespective of β1 (e.g. even if β1 = 0).
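The finite-sample bias that arises once AX fails is easy to reproduce in a small Monte Carlo. A sketch in Python (numpy assumed) for the lagged-dependent-variable case; the AR(1) design and its parameters are hypothetical:

    import numpy as np

    rng = np.random.default_rng(3)
    lam, n, reps = 0.9, 50, 10_000
    est = np.empty(reps)
    for r in range(reps):
        eps = rng.normal(size=n)
        y = np.zeros(n)
        for t in range(1, n):                   # y_t = lam * y_{t-1} + eps_t
            y[t] = lam * y[t - 1] + eps[t]
        X = np.column_stack([np.ones(n - 1), y[:-1]])
        est[r] = np.linalg.lstsq(X, y[1:], rcond=None)[0][1]
    print(est.mean())   # clearly below 0.9: the OLS estimate is biased downward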
Assumption AH (homoscedasticity; uncorrelatedness): This assumption covers two aspects. It states that the (conditional) variance of the disturbances is constant across observations (assuming that AX holds):

V[ε_i|X] = E[ε_i²|X] − (E[ε_i|X])² = E[ε_i²|X] = σ²   ∀i.

The errors are said to be heteroscedastic if their variance is not constant.
The second aspect of AH relates to the (conditional) covariance of ε which is assumed to be zero:

cov[ε_i, ε_j|X] = 0  ∀i ≠ j    E[εε'|X] = V[ε|X] = σ²I.

This aspect of AH implies that the errors from different observations are not correlated. In a time series context this correlation is called serial or autocorrelation.
Assumption AN (normality): Assumptions AX and AH imply that the mean and variance of ε|X are 0 and σ²I. Adding the assumption of normality we have

ε|X ∼ N(0, σ²I).

Since X plays no role in the distribution of ε, we have ε ∼ N(0, σ²I). This assumption is useful to construct test statistics (see section 1.2.3), although many of the subsequent results do not require normality.
1.2.2 Properties
Expected value of b (AL,AR,AX): We first take the conditional expectation of (3):

E[b|X] = β + E[Hε|X]    H = (X'X)⁻¹X'.

Since H is a function of the conditioning variable X, it follows that

E[b|X] = β + H·E[ε|X],

and by assumption AX (E[ε|X] = 0) we find that b is unbiased:

E[b|X] = β.

By using the law of iterated expectations we can also derive the following unconditional result⁹ (again using AX):

E[b] = E_x[E[b|X]] = β + E_x[H·E[ε|X]] = β.

We note that assumptions AH and AN are not required for unbiasedness, whereas AX is critical. Since a model with a lagged dependent variable violates AX, all coefficients in such a regression will be biased.
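The simulation described in footnote 9 below can be written down directly. A sketch in Python (numpy assumed; β and the design are hypothetical): X is held fixed while ε is redrawn, and the average of b|X settles at β.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 100, 50_000
    beta = np.array([1.0, 0.5])                             # hypothetical true parameters
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X fixed across samples
    H = np.linalg.solve(X.T @ X, X.T)                       # H = (X'X)^{-1} X'
    b = np.array([H @ (X @ beta + rng.normal(size=n)) for _ in range(reps)])
    print(b.mean(axis=0))   # close to beta: the average of b|X equals beta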
Covariance of b (AL,AR,AX,AH): The covariance of b conditional on X is given by

V[b|X] = E[(b − β)(b − β)'|X]
       = E[Hεε'H'|X]
       = H·E[εε'|X]·H'                             (5)
       = H(σ²I)H' = σ²HH'
       = σ²(X'X)⁻¹   since HH' = (X'X)⁻¹X'X(X'X)⁻¹.
For the special case of a single regressor the variance of b1 is given by

V[b1] = σ² / Σ_{i=1}^n (x_i − x̄)² = σ² / ((n−1)σ_x²),   (6)

which shows that the precision of the estimate increases with the sample size and the variance of the regressor σ_x², and decreases with the variance of the disturbances.
To derive the unconditional covariance of b we use the variance decomposition

E[V[b|X]] = V[b] − V[E[b|X]].

⁹ To verify that b is unbiased conditionally and unconditionally by simulation one could generate samples of y = Xβ + ε for fixed X using many realizations of ε. The average over the OLS estimates b|X (corresponding to E[b|X]) should be equal to β. However, if X is also allowed to vary across samples the average over b (corresponding to the unconditional mean E[b] = E[E[b|X]]) should also equal β.
Since E[b|X] = β the second term is zero and

V[b] = E[σ²(X'X)⁻¹] = σ²E[(X'X)⁻¹],

which implies that the unconditional covariance of b depends on the population covariance of the regressors.
Variance of e (AL,AR,AX,AH): The variance of b is expressed in terms of σ² (the population variance of ε). To estimate the covariance of b from a sample we replace σ² by the unbiased estimator

s_e² = e'e/(n − K)    E[s_e²] = σ².

Its square root s_e is the standard error of regression. s_e is measured in the same units as y. It may be a more informative measure for the goodness of fit than R², which is expressed in terms of variances (measured in squared units of y).
The estimated standard error of b denoted by se[b] is the square root of the diagonal of

V̂[b|X] = s_e²(X'X)⁻¹.
Efficiency (AL,AR,AX,AH): The Gauss-Markov theorem states that the OLS estimator b is not only unbiased but has the minimum variance of all linear unbiased estimators (BLUE) and is thus efficient. This result holds whether X is stochastic or not. If AN holds (the disturbances are normal) b has the minimum variance of all unbiased (linear or not) estimators (see Greene (2003), p.47-48).
Sampling distribution of b (AL,AR,AX,AH,AN): Given (3) and AN the distribution of b is normal for given X:

b|X ∼ N(β, σ²(X'X)⁻¹).

The sample covariance of b is obtained by replacing σ² with s_e², and is given by V̂[b] defined above.
Example 2: The standard error of regression from example 1 is 18.2 billion US$. This can be compared to the standard deviation of real investment which amounts to 34 billion US$. s_e is used to compute the (estimated) standard errors for the estimated coefficients which are given by

se[b] = (0.0503  0.0515  0.00328  0.00365)'.

Further details can be found in the file investment.xls.
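All the pieces of this subsection (b, s_e, se[b]) and the t-ratios of the next subsection follow mechanically from the formulas above. A sketch in Python (numpy/scipy assumed; the data are simulated, not Greene's):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 60
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    K = X.shape[1]
    y = X @ np.array([0.2, 1.0]) + rng.normal(scale=0.5, size=n)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2e = (e @ e) / (n - K)                    # unbiased estimate of sigma^2
    se_b = np.sqrt(np.diag(s2e * np.linalg.inv(X.T @ X)))
    t = b / se_b                               # t-ratios for H0: beta_i = 0
    p = 2 * stats.t.sf(np.abs(t), df=n - K)    # two-sided p-values
    print(np.column_stack([b, se_b, t, p]))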
1.2.3 Testing hypotheses
Review 4: A null hypothesis H0 formulates a restriction with respect to an unknown parameter of the population: θ = θ0. In a two-sided test the alternative hypothesis Ha is θ ≠ θ0. The test procedure is a rule that rejects H0 if the sample estimate θ̂ is 'too far away' from θ0. This rule can be based on the 1−α confidence interval θ̂ ± Q(α/2)·se[θ̂], where Q(α) denotes the α-quantile of the sampling distribution of θ̂. H0 is rejected if θ0 is outside the confidence interval.
If Y ∼ N(μ, σ²) and Z = (y−μ)/σ then Z ∼ N(0, 1). Φ(z) = P[Y ≤ y] = Φ((y−μ)/σ) is the standard normal distribution function (e.g. Φ(−1.96) = 0.025). z_α is the α-quantile of the standard normal distribution, such that P[Z ≤ z_α] = α (e.g. z_0.025 = −1.96).
Example 3: Consider a sample of n observations from a normal population with mean μ and standard deviation σ. The sampling distribution of the sample mean ȳ is also normal. The standard error of the mean is σ/√n. The 1−α confidence interval for the unknown mean μ is ȳ ± z_{α/2}·σ/√n. The estimated standard error of the mean se[ȳ] = s/√n is obtained by replacing σ with the sample estimate s. In this case the 1−α confidence interval is given by ȳ ± T(α/2, n−1)·s/√n where T(α, n−1) denotes the α-quantile of the t-distribution (e.g. T(0.025, 20) = −2.086). If n is large the standard normal and t-quantiles are practically equal. In that case the interval is given by ȳ ± z_{α/2}·s/√n.
A type I error is committed if H0 is rejected although it is true. The probability of a type I error is the significance level (or size) α. If H0 is rejected, θ̂ is said to be significantly different from θ0 at a level of α. A type II error is committed if H0 is not rejected although it is false. The power of a test is the probability of correctly rejecting a false null hypothesis. The power depends on the true parameter (which is usually unknown).
A test statistic is based on a sample estimate θ̂ and θ0. It is a random variable. The distribution of the test statistic (usually under H0) can be used to specify a rule for rejecting H0. H0 is rejected if the test statistic exceeds critical values which depend on α (and other parameters). In a two-sided test the critical values are the α/2-quantiles and 1−α/2-quantiles of the distribution. In a one-sided test of the form H0: θ ≥ θ0 (and Ha: θ < θ0) the critical value is the α-quantile (this implies that H0 is rejected if θ̂ is 'far below' θ0). If H0: θ ≤ θ0 the critical value is the 1−α quantile. The p-value is that level of α for which there is indifference between accepting or rejecting H0.
Example 4: We consider a hypothesis about the mean of a population. μ = μ0 can be tested against μ ≠ μ0 using the t-statistic (or t-ratio) t = (ȳ − μ0)/se[ȳ]. t has a standard normal or t-distribution depending on whether σ or s is used to compute se[ȳ]. If s is used, the t-statistic is compared to T(α/2, n−1) in a two-sided test. One-sided tests use T(α, n−1). In a two-sided test, H0 is rejected if |t| > |T(α/2, n−1)|.

If ε is normally distributed the t-statistic

t_i = (b_i − β_i)/se[b_i]

has a t-distribution with n−K degrees of freedom (df). se[b_i] (the standard error of b_i) is the square root of the i-th diagonal element of V̂[b]. t_i can be used to test hypotheses about single elements of β.
A joint test of β_j = 0 (j = 1,...,k) can be based on the statistic

F = (n−K)R² / (k(1−R²)),

which has an F-distribution with df = (k, n−K) if the disturbances are normal.
Example 5: The t-statistics for the estimated coefficients from example 1 are given by (−1.44  4.59  −1.08  0.0755)'. As it turns out only the coefficient of real GNP is significantly different from zero at a level of α = 5%. The F-statistic is 12.8 with a p-value < 0.001. Thus, we reject the hypothesis that the coefficients are jointly equal to zero. Further details can be found in the file investment.xls.

Exercise 2: Use the results from exercise 1 and test the estimated coefficients for individual and joint significance.
In general, hypothesis tests about β can be based on imposing a linear restriction r (a K×1 vector consisting of zeros and (±) ones) on β and b, and comparing γ = r'β to d = r'b. If d differs significantly from γ we conclude that the sample is inconsistent with (or, does not support) the hypothesis expressed by the restriction. Since b is normal, r'b is also normal, and the test statistic

t = (d − γ)/se[d]    se[d] = √(r'[s_e²(X'X)⁻¹]r)

has a t-distribution with df = n−K.
We can consider several restrictions at once by using the m×K matrix R to define γ = Rβ and d = Rb. Under the null that all restrictions hold we can define the Wald statistic

W = (d − γ)'[s_e²R(X'X)⁻¹R']⁻¹(d − γ).   (7)

W has a χ²_m-distribution if the sample is large enough (see section 1.5) (or s_e² in (7) is replaced by the usually unknown σ²). Instead, one can use the test statistic W/m which has an F-distribution with df = (m, n−K). In small samples, a test based on W/m will be more conservative (i.e. will have larger p-values).
So far, restrictions have been tested using the estimates from the unrestricted model. Alternatively, restrictions may be imposed directly when the parameters are estimated. This will lead to a loss of fit (i.e. R² will decrease). If R_r² is based on the parameter vector b_r (where some of the parameters are fixed rather than estimated) and R_u² is based on the unrestricted estimate, the test statistic

F = (n−K)(R_u² − R_r²) / (m(1 − R_u²))

has an F-distribution with df = (m, n−K). It can be shown that F = W/m (see Greene (2003), section 6.3). If F is significantly different from zero, H0 is rejected and the restrictions are considered to be jointly invalid.
The distribution of the test statistics t, F and W depends on assumption AN (normality of disturbances). In section 1.3 we will comment on the case that AN does not hold.
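The Wald statistic (7) and its F-version take only a few lines to compute. A sketch in Python (numpy/scipy assumed); the two restrictions tested (β1 = 1, β2 = 1) are hypothetical:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    K = X.shape[1]
    y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2e = (e @ e) / (n - K)

    R = np.array([[0.0, 1.0, 0.0],             # restriction 1: beta_1 = 1
                  [0.0, 0.0, 1.0]])            # restriction 2: beta_2 = 1
    gamma = np.array([1.0, 1.0])
    m = R.shape[0]
    d = R @ b
    V_d = s2e * R @ np.linalg.inv(X.T @ X) @ R.T
    W = (d - gamma) @ np.linalg.solve(V_d, d - gamma)
    print(W, stats.chi2.sf(W, df=m))           # Wald statistic and chi^2_m p-value
    print(W / m, stats.f.sf(W / m, m, n - K))  # F-version: more conservative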
1.2.4 Example 6: CAPM, beta-factors and multi-factor models
The Capital Asset Pricing Model (CAPM) considers the equilibrium relation between the expected return of an asset or portfolio (μ_i = E[y^i]), the risk-free return r_f, and the expected return of the market portfolio (μ_m = E[y^m]). Based on various assumptions (e.g. quadratic utility or normality of returns) the CAPM states that

μ_i − r_f = β_i (μ_m − r_f).   (8)

This relation is also known as the security market line (SML). In the CAPM the so-called beta-factor β_i defined as

β_i = cov[y^i, y^m] / V[y^m]

is the appropriate measure of an asset's risk. The (total) variance of the asset's returns is an inappropriate measure of risk since a part of this variance can be diversified away by holding the asset in a portfolio. The risk of the market portfolio cannot be diversified any further. The beta-factor β_i shows how the asset responds to market-wide movements and measures the market risk or systematic risk of the asset. The risk premium an investor can expect to obtain (or requires) is proportional to β_i. Assets with β_i > 1 imply more risk than the market and should thus earn a proportionately higher risk premium.
Observed returns of the asset (y_t^i, t = 1,...,n) and the market portfolio (y_t^m) can be used to estimate β_i or to test the CAPM. Under the assumption that observed returns deviate from expected returns we obtain

y_t^i − μ_i = u_t^i    y_t^m − μ_m = u_t^m.

When we substitute these definitions for the expected values in the CAPM we obtain the so-called market model

y_t^i = α_i + β_i y_t^m + ε_t^i,

where α_i = (1 − β_i)r_f and ε_t^i = u_t^i − β_i u_t^m. The coefficients α_i and β_i in this equation can be estimated by OLS. If we write the regression equation in terms of (observed) excess returns x_t^i = y_t^i − r_f and x_t^m = y_t^m − r_f we obtain

x_t^i = α_i + β_i x_t^m + ε_t^i.

Thus the testable implication of the CAPM is that the constant term in a simple linear regression using excess returns should be equal to zero. In addition, the CAPM implies that there must not be any other risk factors than the market portfolio (i.e. the coefficients of such factors should not be significantly different from zero).
We use monthly data on the excess return of two industry portfolios (consumer goods and hi-tech) compiled by French.¹⁰ We regress the excess returns of the two industries
¹⁰ http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. The files capm.wf1 and capm.txt are based on previous versions of data posted there. These files have been compiled using the datasets which are now labelled as "5 Industry Portfolios" and "Fama/French 3 Factors" (which includes the risk-free return r_f).
on the excess market return based on a value-weighted average of all NYSE, AMEX, and NASDAQ firms (all returns are measured in percentage terms). Using data from January 2000 to December 2004 (n=60) we obtain the following estimates for the consumer goods portfolio (p-values in parentheses; details can be found in the file capm.wf1)

x_t^i = 0.343 + 0.624 x_t^m + e_t^i      R² = 0.54   s_e = 2.9
       (0.36)   (0.0)

and for the hi-tech portfolio

x_t^i = −0.717 + 1.74 x_t^m + e_t^i      R² = 0.87   s_e = 3.43.
       (0.11)    (0.0)

The coefficients 0.624 and 1.74 indicate that a change in the (excess) market return by one percentage point implies a change in the expected excess return by 0.624 percentage points and 1.74 percentage points, respectively. In other words, the hi-tech portfolio has much higher market risk than the consumer goods portfolio.
The market model can be used to decompose the total variance of an asset into market- and firm-specific variance as follows (assuming that cov[y^m, ε^i] = 0):

σ_i² = β_i²σ_m² + σ_εi².

β_i²σ_m² can be interpreted as the risk that is market-specific or systematic (cannot be diversified since it is due to market-wide movements) and σ_εi² is firm-specific (or idiosyncratic) risk. Since R² can also be written as (β_i²σ_m²)/σ_i² it measures the proportion of the market-specific variance in total variance. The R² from the two equations imply that 53% and 86% of the variance in the portfolios' returns are systematic. The higher R² from the hi-tech regression indicates that this industry is better diversified than the consumer goods industry. The p-values of the constant terms indicate that the CAPM implication cannot be rejected. This conclusion changes, however, when the sample size is increased.
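A sketch of the market-model regression and this variance decomposition in Python (numpy assumed; the excess returns are simulated stand-ins for the French data, and β = 1.5 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 60
    xm = rng.normal(scale=4.0, size=n)                     # excess market return (percent)
    xi = 0.3 + 1.5 * xm + rng.normal(scale=3.0, size=n)    # asset excess return

    X = np.column_stack([np.ones(n), xm])
    a, beta = np.linalg.solve(X.T @ X, X.T @ xi)
    e = xi - a - beta * xm

    var_market = beta ** 2 * xm.var(ddof=1)                # systematic component
    var_idio = e.var(ddof=1)                               # firm-specific component
    print(beta, var_market / xi.var(ddof=1))               # second value is close to R^2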
The CAPM makes an (equilibrium) statement about all assets as expressed by the security market line (8). In order to test the CAPM, beta-factors β̂_i for many assets are estimated from the market model using time-series regressions. Then mean returns ȳ^i for each asset (as an average across time) are computed, and the cross-sectional regression

ȳ^i = γ_f + γ_m β̂_i + ε_i

is run. The estimates for γ_f and γ_m (the market risk premium) are estimates of r_f and (μ_m − r_f) in equation (8). If the CAPM is valid, the mean returns of all assets should be located on the SML, i.e. on the line implied by this regression. However, there are some problems associated with this regression. The usual OLS standard errors of the estimated coefficients are incorrect because of heteroscedasticity in the residuals. In addition, the regressors β̂_i are subject to an errors-in-variables problem since they are not observed and will not correspond to the 'true' beta-factors.
Fama and MacBeth (1973) have suggested a procedure to improve the precision of the estimates. They first estimate beta-factors β̂_t^i for a large number of assets by running
the market model regression using monthly¹¹ time series of excess returns. The estimated beta-factors are subsequently used as regressors in the cross-sectional regression

y_t^i = γ_ft + γ_mt β̂_t^i + ε_t^i.

Note that β̂_t^i is based on an excess return series which ends one month before the cross-sectional regression is estimated (i.e. using x_s^i and x_s^m for s = t−n,...,t−1). The cross-sectional regression is run in each month of the sample period and a time series of estimates γ̂_ft and γ̂_mt is obtained. The sample means and the standard errors of γ̂_ft and γ̂_mt are used as the final estimates for statistical inference.¹² Although the Fama-MacBeth approach yields improved estimates, Shanken (1992) has pointed out further deficiencies and has suggested a correction.
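The two-pass logic can be sketched compactly in Python (numpy assumed). The panel below is simulated, and the rolling-window estimation of β̂_t^i described in the text is simplified to a single first pass, so this is only a skeleton of the procedure:

    import numpy as np

    rng = np.random.default_rng(8)
    T, N = 120, 25                                  # months and assets
    beta = rng.uniform(0.5, 1.5, size=N)            # true betas (hypothetical)
    xm = rng.normal(loc=0.5, scale=4.0, size=T)     # market excess return, premium 0.5
    r = beta[None, :] * xm[:, None] + rng.normal(scale=3.0, size=(T, N))

    # first pass: time-series regression per asset yields beta estimates
    bh = np.array([np.polyfit(xm, r[:, i], 1)[0] for i in range(N)])

    # second pass: one cross-sectional regression per month
    Z = np.column_stack([np.ones(N), bh])
    lam = np.array([np.linalg.lstsq(Z, r[t], rcond=None)[0] for t in range(T)])

    lam_bar = lam.mean(axis=0)                      # final estimates
    se = lam.std(axis=0, ddof=1) / np.sqrt(T)       # Fama-MacBeth standard errors
    print(lam_bar, lam_bar / se)                    # estimates and t-ratios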
The CAPM has been frequently challenged by empirical evidence indicating significant risk premia associated with other factors than the market portfolio. A crucial aspect of the CAPM (in addition to assumptions about utility or return distributions) is that the market portfolio must include all available assets (which is hard to achieve in empirical studies). According to the Arbitrage Pricing Theory (APT) by Ross (1976) there exist several risk factors F_j that are common to a set of assets. The factors are assumed to be uncorrelated, but no further assumptions about utility or return distributions are made. These risk factors (and not only the market risk) capture the systematic risk component. Although the APT does not explicitly specify the nature of these factors, empirical research has typically considered two types of factors. One factor type corresponds to macroeconomic conditions such as inflation or industrial production (see Chen et al., 1986), and a second type corresponds to portfolios (see Fama and French, 1992). Considering only two common factors (for notational simplicity) the asset returns are governed by the factor model

y_t^i = α_i + β_i1 F_t^1 + β_i2 F_t^2 + ε_t^i,

where β_ij are the factor sensitivities (or factor loadings). The expected return of a single asset in this two-factor model is given by

E[y^i] = μ_i = λ0 + λ1 β_i1 + λ2 β_i2,

where λ_j is the factor risk premium of F_j and λ0 = r_f. Using V[F_j] = σ_j² and cov[F_1, F_2] = 0 the total variance of an asset can be decomposed as follows:

σ_i² = β_i1²σ_1² + β_i2²σ_2² + σ_εi².

Estimation of the beta-factors is done by factor analysis, which is not treated in this text. For further details of the APT and associated empirical investigations see Roll and Ross (1980).
We briefly investigate one version of multi-factor models using the so-called Fama-French benchmark factors SMB (small minus big) and HML (high minus low) to test whether
¹¹ Using monthly data is not a prerequisite of the procedure. It could be performed using other data frequencies as well.
¹² See Fama-MacBeth.xlsx for an illustration of the procedure using only 30 assets and the S&P500 index.
excess returns depend on other factors than the market return. The factor SMB measures the difference in returns of portfolios of small and large stocks, and is intended to measure the so-called size effect. HML measures the difference between value stocks (having a high book value relative to their market value) and growth stocks (with a low book-market ratio).¹³ The estimated regression equations are (details can be found in the file capm.wf1)

x_t^i = 0.085 + 0.68 x_t^m − 0.089 SMB_t + 0.29 HML_t + e_t      R² = 0.7
       (0.8)    (0.0)       (0.30)         (0.0)

for the consumer goods portfolio and

x_t^i = −0.83 + 1.66 x_t^m + 0.244 SMB_t − 0.112 HML_t + e_t      R² = 0.89
       (0.07)   (0.0)       (0.04)         (0.21)

for the hi-tech portfolio. Consistent with the CAPM the constant term in the first case is not significant. The beta-factor remains significant in both industries and changes only slightly compared to the market model estimates. However, the results indicate a significant return premium for holding value stocks in the consumer goods industry. For the hi-tech portfolio we find support for a size effect. Overall, the results can be viewed as supporting multi-factor models.
Exercise 3: Retrieve excess returns for industry portfolios of your choice from
French's website. Estimate beta-factors in the context of multi-factor models.
Interpret the results and test implications of the CAPM.

¹³ Further details on the variable definitions and the underlying considerations can be found on French's website http://mba.tuck.dartmouth.edu/pages/faculty/ken.french.
1.2.5 Example 7: Interest rate parity
We consider a European investor who invests in a riskless US deposit with rate r_f. He buys US dollars at the spot exchange rate S_t (S_t is the amount in euro paid/received for one dollar), invests at r_f, and after one period converts back to euro at the rate S_{t+1}. The one-period return on this investment is given by

ln S_{t+1} − ln S_t + r_f.

Forward exchange rates F_t can be used to hedge against the currency risk (introduced by the unknown S_{t+1}) involved in this investment. If F_t denotes the rate fixed at t to buy/sell US dollars in t+1 the (certain) return is given by

ln F_t − ln S_t + r_f.

Since this return is riskless it must equal the return r_f^d from a domestic riskless investment to avoid arbitrage. This leads to the covered interest rate parity (CIRP)

r_f^d − r_f = ln F_t − ln S_t.

The left-hand side is the interest rate differential and the right-hand side is the forward premium.
The uncovered interest rate parity (UIRP) is defined in terms of the expected spot rate:

r_f^d − r_f = E_t[ln S_{t+1}] − ln S_t.

E_t[ln S_{t+1}] can differ from ln F_t if the market pays a risk premium for taking the risk of an unhedged investment. A narrowly defined version of the UIRP assumes risk neutrality and states that the risk premium is zero (see Engel, 1996, for a survey):

E_t[ln S_{t+1} − ln S_t] = ln F_t − ln S_t.

Observed exchange rates S_{t+1} can deviate from F_t, but the expected difference must be zero. The UIRP can be tested using the Fama regression

s_t − s_{t−1} = β0 + β1 (f_{t−1} − s_{t−1}) + ε_t,

where s_t = ln S_t and f_t = ln F_t. The UIRP imposes the testable restrictions β0 = 0 and β1 = 1.¹⁴
We use a data set¹⁵ from Verbeek (2004) and obtain the following results (t-statistics in parentheses):

s_t − s_{t−1} = 0.0023 + 0.515 (f_{t−1} − s_{t−1}) + e_t      R² = 0.00165.
               (0.72)    (0.67)
¹⁴ Hayashi (2000, p.424) discusses the question why UIRP cannot be tested on the basis of s_t = β0 + β1 f_{t−1} + ε_t.
¹⁵ This data is available from http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html. We use the corrected data set forward2c from chapter 4 (foreign exchange markets). Note that the exchange and forward rates in this dataset are expressed in terms of US dollars paid/received for one euro. To make the data consistent with the description in this section we have defined the logs of spot and forward rates accordingly (although this does not change the substantive conclusions). Details can be found in the file uirp.xls.
Testing the coefficients individually shows that b0 is not significantly different from 0 and b1 is not significantly different from 1.
To test both restrictions at once we define

R = [ 1  0      γ = [ 0
      0  1 ]          1 ].

The Wald statistic for testing both restrictions equals 3.903 with a p-value of 0.142. The p-value of the F-statistic W/2 = 1.952 is 0.144. Alternatively, we can use the R² from the restricted model with β0 = 0 and β1 = 1. This requires defining restricted residuals according to (s_t − s_{t−1}) − (f_{t−1} − s_{t−1}). The associated R² is negative and the F-statistic is again 1.952. Thus, the joint test confirms the conclusion derived from testing individual coefficients, and we cannot reject UIRP (which does not mean that UIRP holds!).
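The restricted-vs-unrestricted F-test used here can be scripted directly. A sketch in Python (numpy/scipy assumed; simulated rates stand in for the Verbeek data, generated so that UIRP holds by construction):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    n = 200
    fp = rng.normal(scale=0.03, size=n)          # forward premium f_{t-1} - s_{t-1}
    ds = fp + rng.normal(scale=0.03, size=n)     # s_t - s_{t-1}; beta_0 = 0, beta_1 = 1

    X = np.column_stack([np.ones(n), fp])
    K = X.shape[1]
    b = np.linalg.solve(X.T @ X, X.T @ ds)
    e_u = ds - X @ b
    sst = np.sum((ds - ds.mean()) ** 2)
    R2_u = 1 - (e_u @ e_u) / sst

    e_r = ds - fp                      # residuals with beta_0 = 0, beta_1 = 1 imposed
    R2_r = 1 - (e_r @ e_r) / sst       # may be negative
    m = 2
    F = (n - K) * (R2_u - R2_r) / (m * (1 - R2_u))
    print(F, stats.f.sf(F, m, n - K))  # F-statistic and p-value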
Exercise 4: Repeat the analysis and tests from example 7 but use the US
dollar/British pound exchange and forward rates in the le uirp.xls to test
the UIRP.
1.2.6 Prediction
Regression models can also be used for out-of-sample prediction. Suppose the estimated model from n observations is y = Xb + e and we want to predict y0 given a new observation of the regressors x0 which has not been included in the estimation (hence: out-of-sample). From the Gauss-Markov theorem it follows that the prediction

ŷ0 = x0'b

is the BLUE of E[y0]. Its variance is given by

V[ŷ0] = V[x0'b] = x0'V[b]x0 = σ²x0'(X'X)⁻¹x0,

and reflects the sampling error of b. The prediction error is

e0 = y0 − ŷ0 = x0'β + ε0 − x0'b = ε0 + x0'(β − b),

and its variance is given by

V[e0] = σ² + V[x0'(β − b)] = σ² + σ²x0'(X'X)⁻¹x0.

The variance can be estimated by using s_e² in place of σ². For the special case of a single regressor the variance of e0 is given by (see (6) and Kmenta (1971), p.240)

σ²[1 + 1/n + (x0 − x̄)² / Σ_{i=1}^n (x_i − x̄)²].

This shows that the variance of the prediction (error) increases with the distance of x0 from the mean of the regressors and decreases with the sample size. The (estimated) variance of the disturbances can be viewed as a lower bound for the variance of the out-of-sample prediction error.
If σ² is replaced by s_e² we can compute a 1−α prediction interval for y0 from

ŷ0 ± z_{α/2}·se[e0],

where se[e0] is the square root of the estimated variance V̂[e0]. These calculations, using example 1, can be found in the file investment.xls on the sheet prediction.
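A sketch of this calculation in Python (numpy/scipy assumed; the data and the evaluation point x0 are hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    K = X.shape[1]
    y = X @ np.array([1.0, 0.8]) + rng.normal(scale=0.4, size=n)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2e = np.sum((y - X @ b) ** 2) / (n - K)
    XtXinv = np.linalg.inv(X.T @ X)

    x0 = np.array([1.0, 2.0])                        # new observation of the regressors
    y0_hat = x0 @ b
    se_e0 = np.sqrt(s2e * (1.0 + x0 @ XtXinv @ x0))  # estimated se of prediction error
    z = stats.norm.ppf(0.975)                        # 95% interval
    print(y0_hat - z * se_e0, y0_hat + z * se_e0)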
1.3 Large sample properties of least squares estimates¹⁶
Review 5:¹⁷ We consider the asymptotic properties of an estimator θ̂_n which hold as the sample size n grows without bound.
Convergence: The random variable θ̂_n converges in probability to the (non-random) constant c if, for any ε > 0,

lim_{n→∞} P[|θ̂_n − c| > ε] = 0.

c is the probability limit of θ̂_n and is denoted by plim θ̂_n = c.
Rules for scalars x_n and y_n:

plim (x_n + y_n) = plim x_n + plim y_n    plim (x_n · y_n) = plim x_n · plim y_n.

Rules for vectors and matrices:

plim Xy = plim X · plim y.

Rule for a nonsingular matrix X:

plim X⁻¹ = (plim X)⁻¹.

Consistency: θ̂_n is consistent for θ if plim θ̂_n = θ. θ̂_n is consistent if the asymptotic bias is zero and the asymptotic variance is zero:

lim_{n→∞} E[θ̂_n] − θ = 0    lim_{n→∞} aV[θ̂_n] = 0.

Example: The sample mean ȳ from a population with μ and σ² is consistent for μ since E[ȳ] = μ and aV[ȳ] = σ²/n. Thus plim ȳ = μ.
Consistency of a mean of functions: Consider a random sample (y1,...,yn) from a random variable Y and any function f(y). If E[f(Y)] and V[f(Y)] are finite constants then

plim (1/n) Σ_{i=1}^n f(y_i) = E[f(Y)].

Limiting distribution: θ̂_n with cdf F_n converges in distribution to a random variable θ with cdf F (this is denoted by θ̂_n →_d θ) if

lim_{n→∞} |F_n − F| = 0

for every continuity point of F. F is the limiting or asymptotic distribution of θ̂_n.
A consistent estimator θ̂_n is asymptotically normal if

√n(θ̂_n − θ) →_d N(0, v)    or    θ̂_n ∼ₐ N(θ, v/n),

where aV[θ̂_n] = v/n is the asymptotic variance of θ̂_n.
¹⁶ Most of this subsection is based on Greene (2003), sections 5.2 and 5.3.
¹⁷ Greene (2003); section D.
Central Limit Theorem: If ȳ is the sample mean of a random sample (y1,...,yn) from a distribution with mean μ and variance σ² (which need not be normal)

√n(ȳ − μ) →_d N(0, σ²)    or    ȳ ∼ₐ N(μ, σ²/n).

Expressed differently,

z_n = (ȳ − μ) / (σ/√n)

is asymptotically standard normal: z_n ∼ₐ N(0, 1).
The finite sample properties of OLS estimates only hold if assumptions AL, AR, AX, and AH are satisfied. AN is required to obtain the exact distribution of b and to derive (the distribution of) test statistics. Large-sample theory drops AN and adds other assumptions about the data generating mechanism. The sample is assumed to be large enough so that certain asymptotic properties hold, and an approximation of the distribution of OLS estimates can be derived.
1.3.1 Consistency
Consistency relates to the properties of b as n → ∞. Therefore we use the formulation

b_n = β + ((1/n)X'X)⁻¹ (1/n)X'ε.   (9)

This shows that the large-sample properties of b_n depend on the behavior of the sample averages of X'X and X'ε. In addition to the assumptions from the previous subsection we assume that (x_i, ε_i) are an i.i.d. sequence of random variables:

Aiid: (x_i, ε_i) ∼ i.i.d.

To prove consistency we consider the probability limit of b_n:

plim b_n = β + plim[((1/n)X'X)⁻¹ · (1/n)X'ε]
         = β + plim((1/n)X'X)⁻¹ · plim((1/n)X'ε).   (10)

We have to make sure that the covariance matrix of regressors X is 'well behaved'. This requires that all elements of X'X/n converge to finite constants (i.e. the corresponding population moments). This is expressed by the assumption

AR: plim (1/n)X'X = Q,   (11)

where Q is a positive definite matrix.
Regarding the second probability limit in (10), Greene (2003, p.66) defines

(1/n)X'ε = (1/n) Σ_{i=1}^n x_i ε_i = (1/n) Σ_{i=1}^n w_i = w̄_n
and uses AX to show that
$$E[\bar{w}_n] = 0 \qquad V[\bar{w}_n] = E[\bar{w}_n\bar{w}_n'] = \frac{\sigma^2}{n}\,E\!\left[\frac{X'X}{n}\right].$$
The variance of $\bar{w}_n$ will converge to zero, which implies that $\mathrm{plim}\,\bar{w}_n=0$, or
$$\mathrm{plim}\,\frac{1}{n}X'\epsilon = 0.$$
Thus the probability limit of $b_n$ is given by
$$\mathrm{plim}\,b_n = \beta + Q^{-1}\cdot 0,$$
and we conclude that $b_n$ is consistent:
$$\mathrm{plim}\,b_n = \beta.$$
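The following Monte Carlo sketch illustrates the result: OLS estimates from growing simulated samples settle down at the true coefficient vector even though the disturbances are non-normal (design and seed are arbitrary).

    import numpy as np

    rng = np.random.default_rng(2)
    beta = np.array([1.0, 0.5])
    for n in (50, 500, 5000, 50000):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        eps = rng.standard_t(df=5, size=n)     # i.i.d. disturbances, but not normal
        y = X @ beta + eps
        bn = np.linalg.solve(X.T @ X, X.T @ y)
        print(n, bn.round(3))                  # approaches beta = (1.0, 0.5)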
1.3.2 Asymptotic normality

Large-sample theory is not based on the normality assumption AN, but derives an approximation of the distribution of OLS estimates. We rewrite (9) as
$$\sqrt{n}\,(b_n - \beta) = \Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{\sqrt{n}}X'\epsilon \qquad (12)$$
to derive the asymptotic distribution of $\sqrt{n}\,(b_n-\beta)$ using the central limit theorem. By AR the probability limit of the first term on the right hand side of (12) is $Q^{-1}$. Next we consider the limiting distribution of
$$\frac{1}{\sqrt{n}}X'\epsilon = \sqrt{n}\,(\bar{w}_n - E[\bar{w}_n]).$$
$\bar{w}_n$ is the average of $n$ i.i.d. random vectors $w_i = x_i\epsilon_i$. From the previous subsection we know that $E[\bar{w}_n]=0$. Greene (2003, p.68) shows that the variance of $\sqrt{n}\,\bar{w}_n$ converges to $\sigma^2 Q$. Thus, in analogy to the univariate case, we can apply the central limit theorem. The means of the i.i.d. random vectors $w_i$ converge to a normal distribution:
$$\frac{1}{\sqrt{n}}X'\epsilon = \sqrt{n}\,\bar{w}_n \xrightarrow{d} \mathrm{N}(0, \sigma^2 Q).$$
We can now complete the derivation of the limiting distribution of (12) by including $Q^{-1}$ to obtain
$$Q^{-1}\frac{1}{\sqrt{n}}X'\epsilon \xrightarrow{d} \mathrm{N}\big(Q^{-1}0,\; Q^{-1}(\sigma^2 Q)Q^{-1}\big)$$
or
$$\sqrt{n}\,(b_n - \beta) \xrightarrow{d} \mathrm{N}(0, \sigma^2 Q^{-1}) \qquad b_n \stackrel{a}{\sim} \mathrm{N}\Big(\beta,\ \frac{\sigma^2}{n}Q^{-1}\Big).$$
Note that the asymptotic normality of $b$ is not based on AN but on the central limit theorem. The asymptotic covariance of $b_n$ is estimated by using $(X'X)^{-1}$ to estimate $(1/n)Q^{-1}$ and $s_e^2=\mathrm{SSE}/(n-K)$ to estimate $\sigma^2$:
$$\widehat{\mathrm{aV}}[b_n] = s_e^2\,(X'X)^{-1}.$$
This implies that $t$- and $F$-statistics are asymptotically valid even if the residuals are not normal. If $F$ has an $F$-distribution with df=($m$, $n-k$) then $W=mF \stackrel{a}{\sim} \chi^2_m$.

In small samples the $t$-distribution may be a reasonable approximation^{18} even when AN does not hold. Since it is more conservative than the standard normal, it may be preferable to use the $t$-distribution. By a similar argument, using the $F$-distribution (rather than $W=mF$ and the $\chi^2$ distribution) can be justified in small samples when AN does not hold.

^{18} If AN does not hold the finite sample distribution of the $t$-statistic is unknown.
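As a sketch of how these quantities are computed in practice, the following lines (simulated data, illustrative names) form $s_e^2(X'X)^{-1}$ and compare standard-normal and $t$-based p-values.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, K = 200, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.standard_t(df=5, size=n)  # AN violated

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2e = e @ e / (n - K)
    aV = s2e * np.linalg.inv(X.T @ X)          # estimated asymptotic covariance of b
    se = np.sqrt(np.diag(aV))
    z = b / se                                 # statistics for H0: beta_i = 0
    print(2 * stats.norm.sf(abs(z)))           # asymptotic (normal) p-values
    print(2 * stats.t.sf(abs(z), df=n - K))    # slightly more conservative t p-values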
1.3.3 Time series data^{19}

With time series data the strict exogeneity assumption AX is usually hard to maintain. For example, a company's returns may depend on the current, exogenous macroeconomic conditions and the firm's past production (or investment, finance, etc.) decisions. To the extent that the company decides upon the level of production based on past realized returns (which include past disturbances), the current disturbances may be correlated with regressors in future equations. More generally, strict exogeneity might not hold if regressors are policy variables which are set depending on past outcomes.

If AX does not hold (e.g. in a model with a lagged dependent variable), $b_n$ is biased. In the previous subsections consistency and asymptotic normality have been established on the basis of Aiid and AR. However, with time series data the i.i.d. assumption need not hold and the applicability of limit theorems is not straightforward. Nevertheless, consistent estimates in a time series context can still be obtained. The additional assumptions needed are based on the following concepts.

A stochastic process $Y_t$ is a sequence^{20} of random variables $Y_{-\infty},\dots,Y_0,Y_1,\dots,Y_{+\infty}$. An observed sequence $y_t$ ($t=1,\dots,n$) is a sample or realization (one possible outcome) of the stochastic process. Any statistical inference about $Y_t$ must be based on the single draw $y_t$ from the so-called ensemble of realizations of the process. Two properties are crucial in this context: the process has to be stationary (i.e. the underlying distribution of $Y_t$ does not change with $t$) and ergodic (i.e. each individual observation provides unique information about the process; adjacent observations must not be too similar). More formally, a stationary process is ergodic if any two random variables $Y_t$ and $Y_{t-\ell}$ are asymptotically (i.e. $\ell\to\infty$) independent.
A stochastic process is characterized by the autocovariance $\gamma_\ell$
$$\gamma_\ell = E[(Y_t-\mu)(Y_{t-\ell}-\mu)] \qquad \mu = E[Y_t], \qquad (13)$$
or the autocorrelation $\rho_\ell$
$$\rho_\ell = \frac{\gamma_\ell}{\gamma_0} = \frac{\gamma_\ell}{\sigma^2}. \qquad (14)$$
A stochastic process is weakly or covariance stationary if $E[Y_t^2]<\infty$ and if $E[Y_t]$, $V[Y_t]$ and $\gamma_\ell$ do not depend on $t$ (i.e. $\gamma_\ell$ and $\rho_\ell$ only depend on $\ell$). If $Y_t$ is strictly stationary the joint distribution of $Y_t$ and $Y_{t-\ell}$ does not depend on the time shift $\ell$. If $Y_t$ is weakly stationary and normally distributed then $Y_t$ is also strictly stationary.
According to the ergodic theorem, averages from a single observed sequence will converge to the corresponding parameters of the population, if the process is stationary and ergodic. If $Y_t$ is stationary and ergodic with $E[Y_t]=\mu$, the sample mean obtained from a single realization $y_t$ converges to $\mu$ asymptotically:
$$\lim_{n\to\infty} \bar{y}_n = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^n y_t = \mu.$$

^{19} Most of this subsection is based on Greene (2003), section 12.4.
^{20} We use the index $t$ since stochastic processes are frequently viewed in terms of chronologically ordered sequences across time. However, the index set is arbitrary and everything we say holds as well if the index refers to other entities (e.g. firms).
If $Y_t$ is covariance stationary it is sufficient that $\sum_{\ell=0}^{\infty}|\gamma_\ell|<\infty$ (absolute summability) for the process to be ergodic for the mean. The theorem extends to any (finite) moment of stationary and ergodic processes. In the special case where $Y_t$ is a normal and stationary process, absolute summability is enough to insure ergodicity for all moments. Whereas many tests for stationarity are available (see section 2.3.3), ergodicity is difficult to test and is usually assumed to hold. Quickly decaying estimated autocorrelations can be taken as empirical evidence of stationarity and ergodicity.
In other words, the ergodic theorem implies that consistency does not require independent observations. Greene (2003, p.73) shows that consistency and asymptotic normality of the OLS estimator can be preserved in a time-series context by replacing AX with^{21}
$$\text{AX: } E[\epsilon_t|x_{t-\ell}] = 0 \quad (\forall\,\ell\ge 0),$$
replacing AR by
$$\text{AR}_t\text{: } \mathrm{plim}\,\frac{1}{n-\ell}\sum_{t=\ell+1}^{n} x_t x_{t-\ell}' = Q(\ell),$$
where $Q(\ell)$ is a finite matrix, and by requiring that $Q(\ell)$ has to converge to a matrix of zeros as $\ell\to\infty$. These properties of $Q(\ell)$ can be summarized by the assumption that $x_t$ is stationary and ergodic.

This has the following implications for models with a lagged dependent variable:
$$y_t = \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + z_t'\gamma + \epsilon_t.$$
Although estimates of $\alpha_i$ and $\gamma$ are biased (since AX is violated), they are consistent provided AX holds, and $x_t=[y_{t-1},\dots,y_{t-p},z_t]$ is stationary and ergodic.

^{21} Other authors (e.g. Hayashi, 2000, p.109) assume that $\epsilon_t$ and $x_t$ are contemporaneously uncorrelated ($E[x_t\epsilon_t]=0$), as implied by AX.
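A small simulation sketch of this claim for a pure AR(1) (no $z_t$): the OLS estimate of $\alpha_1$ is biased downward in short samples, but the bias disappears as $n$ grows; the design and all settings are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    alpha = 0.6
    for n in (25, 100, 1000):
        est = []
        for _ in range(1000):
            y = np.zeros(n)
            for t in range(1, n):              # stationary AR(1) with i.i.d. disturbances
                y[t] = alpha * y[t - 1] + rng.normal()
            est.append((y[:-1] @ y[1:]) / (y[:-1] @ y[:-1]))  # OLS of y_t on y_{t-1}
        print(n, round(float(np.mean(est)), 3))  # biased for small n, consistent for large n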
1.4 Maximum likelihood estimation

Review 6:^{22} We consider a random sample $y_i$ ($i=1,\dots,n$) to estimate the parameters $\mu$ and $\sigma^2$ of a random variable $Y\sim\mathrm{N}(\mu,\sigma^2)$. The maximum likelihood (ML) estimates are those values for the parameters of the underlying distribution which make the observed sample most likely (i.e. would generate it most frequently).

The likelihood (function) $L(\theta)$ is the joint density evaluated at the observations $y_i$ ($i=1,\dots,n$) as a function of the parameter (vector) $\theta$:
$$L(\theta) = f(y_1|\theta)\,f(y_2|\theta)\cdots f(y_n|\theta).$$
$f(y_i|\theta)$ is the value of the density function at $y_i$ given the parameters $\theta$. To simplify the involved calculations the logarithm of the likelihood function (the log-likelihood) is maximized:
$$\ln L(\theta) = \ell(\theta) = \sum_{i=1}^n \ln f(y_i|\theta) \to \max.$$
The ML method requires an assumption about the distribution of the population. Using the density function of the normal distribution and $\theta=(\mu,\sigma^2)$ we have
$$\ln f(y_i|\mu,\sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y_i-\mu)^2}{2\sigma^2},$$
and the log-likelihood as a function of $\mu$ and $\sigma^2$ is given by
$$\ell(\mu,\sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2.$$
From the first derivatives with respect to $\mu$ and $\sigma^2$
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu) \qquad \frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i-\mu)^2$$
we obtain the ML estimates
$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \qquad \tilde{s}^2 = \frac{1}{n}\sum_{i=1}^n (y_i-\bar{y})^2.$$
To estimate more general models the constants $\mu$ and $\sigma^2$ can be replaced by conditional mean $\mu_i$ and variance $\sigma_i^2$, provided the standardized residuals $\epsilon_i=(y_i-\mu_i)/\sigma_i$ are i.i.d. Then the likelihood depends on the coefficients in the equations which determine $\mu_i$ and $\sigma_i^2$.
The ML estimate of a regression model requires the specification of a distribution for the disturbances. If $\epsilon_i = y_i - x_i'\beta$ is assumed to be i.i.d.^{23} and normal N(0, $\sigma^2$), the log-likelihood is given by
$$\ell(\beta,\sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta). \qquad (15)$$

^{22} For details see Kmenta (1971), p.174 or Wooldridge (2003), p.746.
^{23} Note that the i.i.d. assumption is not necessary for the observations but only for the residuals.
The necessary conditions for a maximum are
$$\frac{\partial\ell}{\partial\beta}:\ \frac{1}{\sigma^2}X'(y-X\beta) = \frac{1}{\sigma^2}X'\epsilon = 0$$
$$\frac{\partial\ell}{\partial\sigma^2}:\ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y-X\beta)'(y-X\beta) = 0.$$
The solution of these equations gives the estimates
$$b = (X'X)^{-1}X'y \qquad \tilde{s}_e^2 = \frac{e'e}{n}.$$
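A sketch of estimating (15) numerically with simulated data; scipy minimizes, so the negative log-likelihood is passed, and $\sigma$ is parametrized by its log to keep it positive (all names and settings are illustrative).

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=n)

    def negloglik(theta):
        beta, s2 = theta[:-1], np.exp(2 * theta[-1])
        e = y - X @ beta
        return 0.5 * (n * np.log(2 * np.pi * s2) + e @ e / s2)

    res = minimize(negloglik, x0=np.zeros(X.shape[1] + 1), method='BFGS')
    b_ml, s2_ml = res.x[:-1], np.exp(2 * res.x[-1])
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(b_ml, b_ols)       # virtually identical (up to numerical tolerance)
    print(s2_ml)             # equals e'e/n rather than e'e/(n-K)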
ML estimators are attractive because of their large sample properties: provided that the model is correctly specified they are consistent, asymptotically efficient and asymptotically normal^{24}:
$$b \stackrel{a}{\sim} \mathrm{N}(\beta,\ I(\beta)^{-1}).$$
$I(\beta)$ is the information matrix evaluated at the true parameters. Its inverse can be used to estimate the covariance of $b$. Theoretically, $I(\beta)$ is minus the expected value of the second derivatives of the log-likelihood. In practice this expectation can be computed in either of the two following ways (see Greene, 2003, p.480). One way is to evaluate the Hessian matrix (the second derivatives of $\ell$) at $b$
$$\hat{I}(b) = -\frac{\partial^2\ell}{\partial b\,\partial b'},$$
where the second derivatives are usually calculated numerically. Another way is the BHHH estimator^{25} which is based on the outer-product of the gradient (or score vector):
$$\hat{I}(b) = \sum_{i=1}^n g_i g_i' \qquad g_i = \frac{\partial\ell_i(b)}{\partial b} \qquad \text{or} \qquad \hat{I}(b) = G'G,$$
where $g_i$ is the $K\times 1$ gradient for observation $i$, and $G$ is a $n\times K$ matrix with rows equal to the transpose of the gradients for each observation.

In general, the Hessian and the BHHH approach do not yield the same results, even when the derivatives are available in closed form. The two estimates of $I$ can also differ when the model is misspecified. Quasi-ML (QML) estimates are based on maximizing the likelihood using a distribution which is known to be incorrect (i.e. using the wrong density when formulating (15)). For instance, the normal distribution is frequently used as an approximation when the true distribution is unknown or cumbersome to use.

Significance tests of regression coefficients are based on the asymptotic normality of the ML estimates. $z$-statistics (rather than $t$-statistics) are frequently used to refer to the standard normal distribution of the test statistic $z_i=(b_i-\beta_i)/\mathrm{se}[b_i]$, where the standard error $\mathrm{se}[b_i]$ is the square root of the $i$-th diagonal element of the inverse of $\hat{I}(b)$, and $z_i \stackrel{a}{\sim} \mathrm{N}(0,1)$.

The major weakness of ML estimates is their potential bias in small samples (e.g. the variance estimate is scaled by $n$ rather than $n-K$).

^{24} Greene (2003), p.473.
^{25} BHHH refers to the initials of Berndt, Hall, Hall, and Hausman who have first proposed this approach.
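For the normal regression model the observation-level scores follow from (15) in closed form, so the BHHH estimate $\hat{I}(b)=G'G$ is easy to form. The following sketch uses these closed-form gradients on simulated data; it is an illustration only, not a description of how any particular package computes them.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=n)

    b = np.linalg.solve(X.T @ X, X.T @ y)   # ML estimate of beta coincides with OLS
    e = y - X @ b
    s2 = e @ e / n                          # ML estimate of sigma^2

    # scores per observation w.r.t. (beta, sigma^2), derived from (15):
    # dl_i/dbeta = x_i e_i / s2,  dl_i/dsigma2 = -1/(2 s2) + e_i^2/(2 s2^2)
    G = np.column_stack([X * (e / s2)[:, None],
                         -0.5 / s2 + e**2 / (2 * s2**2)])
    I_hat = G.T @ G                         # BHHH (outer product of gradients)
    se = np.sqrt(np.diag(np.linalg.inv(I_hat)))
    print(se)                               # standard errors of (b, s2); z = b/se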
Example 8: We use the quarterly investment data from 1950-2000 from Table 5.1 in Greene (2003) (see exercise 1) to estimate the same regression as in example 1 by numerically maximizing the log-likelihood. The dependent variable is the log of real investment. The explanatory variables are the log of real output, the nominal interest rate and the rate of inflation. Details can be found in the file investment-ml.xls. The estimated ML coefficients are almost equal to the OLS estimates, and depend on the settings which trigger the convergence of the numerical optimization algorithm. The standard errors are based on the outer-product of the gradients, and are slightly different from those based on the inverse of $X'X$. Accordingly, the $z$-statistics differ from the $t$-statistics. The interest rate turns out to be the only regressor which is not statistically significant at the 5% level.

Exercise 5: Use the annual data and the regression equation from example 1 (see file investment.xls) and estimate the model by maximum likelihood.
1.5 LM, LR and Wald tests^{26}

Suppose the ML estimate $\hat{\theta}$ (a $K\times 1$ parameter vector) shall be used to test $m$ linear restrictions H$_0$: $R\theta=q$. Three test principles can be used for that purpose.

The Wald test is based on unrestricted estimates. If the restrictions are valid, $d=R\hat{\theta}$ will not deviate significantly from $q$. The Wald test statistic for $m$ restrictions is defined as
$$W = (d-q)'(V[d])^{-1}(d-q).$$
The covariance of $d$ can be estimated by $R\hat{V}[\hat{\theta}]R'$, where $\hat{V}[\hat{\theta}]$ can be based on the inverse of the information matrix. Using $\hat{V}[\hat{\theta}]=\hat{I}(\hat{\theta})^{-1}$ we obtain
$$(d-q)'\left[R\,\hat{I}(\hat{\theta})^{-1}R'\right]^{-1}(d-q) \stackrel{a}{\sim} \chi^2_m. \qquad (16)$$
The likelihood-ratio (LR) test requires estimating the model with and without restrictions. The LR test statistic is
$$2[\ell_u - \ell_r] \stackrel{a}{\sim} \chi^2_m, \qquad (17)$$
where $\ell_u$ is the unrestricted log-likelihood, and $\ell_r$ the log-likelihood obtained by imposing $m$ restrictions. If the restrictions are valid, the difference between $\ell_r$ and $\ell_u$ will be close to zero.

If parameters are estimated by OLS, the LR test statistic can be computed using the residuals $e_u$ and $e_r$ from unrestricted and restricted OLS regressions, respectively. For $m$ restrictions the LR test statistic is given by
$$LR = n[\ln(e_r'e_r) - \ln(e_u'e_u)] \qquad LR \stackrel{a}{\sim} \chi^2_m.$$
The Lagrange multiplier (LM) test (or score test) is based on maximizing the log-likelihood under the restrictions using the Lagrangian function
$$L(\theta_r) = \ell(\theta_r) + \lambda'(R\theta_r - q).$$
The estimates $\hat{\theta}_r$ and $\hat{\lambda}$ can be obtained from the first order conditions
$$\frac{\partial L}{\partial\theta_r}:\ \frac{\partial\ell}{\partial\theta_r} + \lambda'R = 0 \qquad\qquad \frac{\partial L}{\partial\lambda}:\ R\theta_r - q = 0.$$
Lagrange multipliers measure the improvement in the likelihood which can be obtained by relaxing constraints. If the restrictions are valid (i.e. hold in the data), imposing them is not necessary, and $\hat{\lambda}$ will not differ significantly from zero. This consideration leads to

^{26} This section is based on Greene (2000), p.150 and Greene (2003), section 17.5.
H$_0$: $\lambda=0$ (hence LM test). This is equivalent to testing the derivatives evaluated at the restricted estimates $\hat{\theta}_r$:
$$g_r = \frac{\partial\ell(\hat{\theta}_r)}{\partial\theta_r} = -\hat{\lambda}'R.$$
Under H$_0$: $g_r=0$ the LM test statistic is given by
$$g_r'\,\hat{I}(\hat{\theta}_r)^{-1}\,g_r \stackrel{a}{\sim} \chi^2_m. \qquad (18)$$
$\hat{I}(\hat{\theta}_r)=G_r'G_r$, where $G_r$ is a $n\times K$ matrix with rows equal to the transpose of the gradients for each observation evaluated at the restricted parameters.

Alternatively, the Lagrange multiplier (LM) test statistic can be derived from a regression of the restricted residuals $e_r$ on all regressors including the constant (see Greene (2003), p.496). This version of LM is defined as
$$LM = nR_e^2 \qquad LM \stackrel{a}{\sim} \chi^2_m,$$
where $R_e^2$ is the coefficient of determination from that auxiliary regression.
Wald, LR and LM tests of linear restrictions in multiple regressions are asymptotically equivalent. Depending on how the information matrix is estimated, the test statistics and the associated conclusions will differ. In small samples the $\chi^2$ distribution may lead to too many rejections of the true H$_0$. Alternatively, the $t$-statistic (for a single restriction) or the $F$-statistic can be used.
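A sketch of the OLS-based LR and LM versions for the two restrictions $\beta_0=0$ and $\beta_1=1$ (the situation of example 9 below, but with simulated data in which the restrictions hold by construction):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(7)
    n = 300
    x = rng.normal(size=n)
    y = 0.0 + 1.0 * x + rng.normal(scale=0.5, size=n)   # restrictions are true here
    X = np.column_stack([np.ones(n), x])

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e_u = y - X @ b                # unrestricted residuals
    e_r = y - x                    # restricted residuals (beta0=0, beta1=1 imposed)

    LR = n * (np.log(e_r @ e_r) - np.log(e_u @ e_u))

    c = np.linalg.solve(X.T @ X, X.T @ e_r)             # regress e_r on all regressors
    u = e_r - X @ c
    tss = (e_r - e_r.mean()) @ (e_r - e_r.mean())
    LM = n * (1 - (u @ u) / tss)                        # LM = n * R^2

    m = 2
    print(LR, chi2.sf(LR, m), LM, chi2.sf(LM, m))       # large p-values: do not reject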
Example 9: We use the data and results from example 7 to test the restrictions $\beta_0=0$ and $\beta_1=1$ using OLS based LR and LM tests. Using the residuals from the unrestricted and restricted regressions we obtain LR=3.904 with a $p$-value of 0.142. Regressing the residuals from the restricted model on $X$ we obtain LM=0.402 with a $p$-value of 0.818. Details can be found in the file uirp.xls.

Example 10: We use the same data to estimate the Fama regression numerically by ML. The coefficients are virtually identical to the OLS estimates, while the standard errors (derived from the inverse of the information matrix) differ slightly from the OLS standard errors. To test the restrictions $\beta_0=0$ and $\beta_1=1$ we use ML based Wald, LR and LM tests. All three tests agree that these restrictions cannot be rejected with $p$-values ranging from 0.11 to 0.17. Details can be found in the file uirp-ml.xls.

Exercise 6: Use the data from exercise 4 (i.e. US dollar/British pound exchange and forward rates) to test the restrictions $\beta_0=0$ and $\beta_1=1$ using OLS and ML based Wald, LR and LM tests.
1.6 Specifications

The specification of the regression equation has key importance for a successful application of regression analysis (in addition to a careful definition and selection of variables). The linearity assumption AL may appear to be a very strong restriction. However, $y$ and $X$ can be arbitrary functions of the underlying variables of interest. Thus, as we will show in this section, there exist several linear formulations to model a variety of practically relevant and interesting cases.
1.6.1 Log and other transformations

The log-linear model^{27}
$$\ln y = \ln b_0 + b_1\ln x_1 + \cdots + b_k\ln x_k + e = \hat{y}^{\ln} + e$$
corresponds to the multiplicative expression
$$y = b_0\,x_1^{b_1}\cdots x_k^{b_k}\exp\{e\}.$$
In this model $b_i$ is the (estimated) elasticity of $y$ with respect to $x_i$:
$$\frac{\partial y}{\partial x_i} = b_i\,b_0\,x_1^{b_1}\cdots x_i^{b_i-1}\cdots x_k^{b_k}\exp\{e\} = b_i\,\frac{y}{x_i} \quad\Longrightarrow\quad b_i = \frac{\partial y}{y}\Big/\frac{\partial x_i}{x_i}.$$
In other words, a change in $x_i$ by $p$ percent leads to a c.p. change in $\hat{y}$ by $p\,b_i$ percent. This implies that the change in $\hat{y}$ in response to a change in $x_i$ depends on the levels of $y$ and $x_i$, whereas these levels are irrelevant in the linear model.

To compute $\hat{y}$ using the fitted values $\hat{y}^{\ln}$ from the log-linear model we have to account for the properties of the lognormal distribution (see review 9). If the residuals from the log-linear model are (approximately) normal the expected value of $y$ is given by
$$\hat{y}_i = \exp\{\hat{y}_i^{\ln} + 0.5\,s_e^2\},$$
where $s_e$ is the standard error of the residuals from the log-linear model. Note that these errors are given by
$$e_i = \ln y_i - \hat{y}_i^{\ln},$$
where $\hat{y}_i^{\ln}$ is the fitted value of $\ln y_i$. $e_i$ is not equal to $\ln y_i - \ln\hat{y}_i$ because of Jensen's inequality $\ln E[y] > E[\ln y]$ (see review 1). The standard error of residuals $s_e$ is an approximation for the magnitude of the percentage error $(y_i-\hat{y}_i)/\hat{y}_i$.

In the semi-log model
$$\ln y = b_0 + b_1 x_1 + \cdots + b_k x_k + e$$

^{27} In section 1.6 we will formulate regression models in terms of estimated parameters since these are usually used for interpretations.
the expected c.p. percentage change in $\hat{y}$ is given by $b_i\cdot 100$, if $x_i$ changes by one unit. More accurately, $\hat{y}$ changes by $(\exp\{b_i\}-1)\cdot 100$ percent. This model is appropriate when the growth rate of $y$ is assumed to be a linear function of the regressors. The chosen specification will mainly be driven by assumptions about the nature of the underlying relationships. However, taking logs is frequently also used to reduce or eliminate heteroscedasticity.

Another version of a semi-log model is
$$y = b_0 + b_1\ln x_1 + \cdots + b_k\ln x_k + e.$$
Here, a one percent change in $x_i$ yields a c.p. change in $\hat{y}$ of $0.01\,b_i$ units.

The logistic model
$$\ln\frac{y}{1-y} = b_0 + b_1 x_1 + \cdots + b_k x_k + e \qquad 0<y<1$$
implies that $\hat{y}$ is s-shaped according to:
$$\hat{y} = \frac{\exp\{b_0 + b_1 x_1 + \cdots + b_k x_k\}}{1+\exp\{b_0 + b_1 x_1 + \cdots + b_k x_k\}} = \frac{1}{1+\exp\{-(b_0 + b_1 x_1 + \cdots + b_k x_k)\}}.$$
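The retransformation step for the log-linear model above is easy to get wrong; the following sketch (simulated multiplicative data, illustrative parameters) contrasts the naive $\exp\{\hat{y}^{\ln}\}$ with the corrected $\exp\{\hat{y}^{\ln}+0.5\,s_e^2\}$.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 500
    x = rng.uniform(1, 10, size=n)
    y = 2.0 * x**0.7 * np.exp(rng.normal(scale=0.4, size=n))  # multiplicative errors

    X = np.column_stack([np.ones(n), np.log(x)])
    b = np.linalg.solve(X.T @ X, X.T @ np.log(y))             # log-linear regression
    e = np.log(y) - X @ b
    s2e = e @ e / (n - X.shape[1])

    yhat_naive = np.exp(X @ b)               # underestimates E[y] (Jensen's inequality)
    yhat = np.exp(X @ b + 0.5 * s2e)         # lognormal correction
    print(y.mean(), yhat_naive.mean(), yhat.mean())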
1.6.2 Dummy variables

Explanatory variables which are measured on a nominal scale (i.e. the variables are qualitative in nature) can be used in regressions after they have been recoded. A binary valued (0-1) dummy variable is defined for each category except one which constitutes the reference category. Suppose there are $m{+}1$ categories (e.g. industries or regions). We define $m$ dummy variables $d_i$ ($d_i=1$ if an observation belongs to category $i$ and 0 otherwise). Note that defining a dummy for each category leads to an exact linear relationship among the regressors. If the model contains an intercept the sum of all dummies is equal to the first column of $X$, and $X$ will not have full rank. The coefficients $\gamma_i$ in the regression model
$$\hat{y} = b_0 + b_1 x_1 + \cdots + \gamma_1 d_1 + \cdots + \gamma_m d_m$$
correspond to parallel shifts of the regression line (hyperplane). $\gamma_i$ represents the change in $\hat{y}$ for a c.p. shift from the reference category to category $i$.

If categories have a natural ordering, an alternative definition of dummy variables may be appropriate. In this case all dummy variables $d_1,\dots,d_j$ are set equal to 1 if an observation belongs to category $j$. Now $\gamma_j$ represents the expected change in $\hat{y}$ for a c.p. shift from category $j{-}1$ to category $j$.
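A sketch of the recoding step with $m{+}1=3$ hypothetical categories; adding a dummy for the reference category as well would destroy the full rank of $X$.

    import numpy as np

    cat = np.array(['north', 'south', 'west', 'south', 'north', 'west'])
    levels = ['north', 'south', 'west']            # 'north' is the reference category
    D = np.column_stack([(cat == c).astype(float) for c in levels[1:]])
    X = np.column_stack([np.ones(len(cat)), D])    # intercept plus m = 2 dummies
    print(np.linalg.matrix_rank(X))                # 3: full rank

    X_bad = np.column_stack([np.ones(len(cat))] +
                            [(cat == c).astype(float) for c in levels])
    print(np.linalg.matrix_rank(X_bad))            # 3 < 4: dummies sum to the intercept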
1.6.3 Interactions

Dummy variables cannot be used to model changes in the slope (e.g. differences in the propensity to consume between men and women). If the slope is assumed to be different among categories the following specification can be used:
$$\hat{y} = b_0 + b_1 x_1 + b_2 d + b_3 d x_1.$$
The product $d x_1$ is an interaction term. If $d=0$ this specification implies $\hat{y}=b_0+b_1x_1$, and if $d=1$ it implies $\hat{y}=(b_0+b_2)+(b_1+b_3)x_1$. Thus, the coefficient $b_3$ measures the expected c.p. change in the slope of $x_1$ when switching categories.

It is important to note that the presence of an interaction term changes the 'usual' interpretation of the coefficients associated with the components of the interaction. First, the coefficient $b_1$ of $x_1$ must be interpreted as the slope of the reference category (for which $d=0$). Second, the coefficient $b_2$ of the dummy variable is not the expected c.p. difference between the two categories anymore (except for $x_1=0$). Now the difference depends on the level of $x_1$. Even if $x_1$ is held constant the difference in $\hat{y}$ between $d=1$ and $d=0$ is given by $b_2+b_3x_1$.

Interactions are not confined to using dummy variables but can be based on two 'regular' regressors. The equation
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$$
implies a change in the slope of $x_1$ that depends on the level of $x_2$ and vice versa. To simplify the interpretation of the coefficients it is useful to evaluate $\hat{y}$ for typical values of one of the two variables (e.g. using $\bar{x}_2$ and $\bar{x}_2\pm s_{x_2}$).

If interactions are defined using logs of variables such as in the following so-called translog model
$$\ln y = \ln b_0 + b_1\ln x_1 + b_2\ln x_2 + b_3\ln x_1\ln x_2 + e$$
the conditional expectation of $y$ is given by
$$\hat{y} = b_0\,x_1^{b_1+b_3\ln x_2}\,x_2^{b_2}\exp\{0.5\,s_e^2\}.$$
This implies that a c.p. change of $x_2$ by $p$ percent changes the elasticity of $x_1$ (i.e. $b_1+b_3\ln x_2$) by approximately $0.01\,p\,b_3$. However, if $\hat{y}$ is defined as
$$\hat{y} = b_0\,x_1^{b_1+b_3 x_2}\,x_2^{b_2}\exp\{0.5\,s_e^2\},$$
it is necessary to estimate the model
$$\ln y = \ln b_0 + b_1\ln x_1 + b_2\ln x_2 + b_3(\ln x_1)x_2 + e.$$
In this case a c.p. change of $x_2$ by one unit leads to an expected change of the elasticity $b_1$ by $b_3$ units.
1.6.4 Difference-in-differences

As a special case, an interaction can be defined as the product of a time-dummy and another dummy identifying group membership. (Quasi) natural experiments are typical situations where this is an appropriate specification. The purpose of the analysis is to find out about the effect of a certain stimulus or a special event (e.g. a policy or legal change, a crisis, an announcement, etc.). Such experiments are characterized by a treatment and a control group: the treatment group consists of those objects under study which are subject to the special event, whereas the remaining, unaffected subjects constitute the control group.

The effect size can be estimated by comparing the means of the two groups before and after the event. This difference (among groups) of the difference (over time), hence difference-in-differences (or diff-in-diff), can be simply estimated from the coefficient $b_3$ in the regression
$$\hat{y} = b_0 + b_1 T + b_2 d + b_3 T d,$$
where $T$ denotes a time-dummy (i.e. being 0 before and 1 after the event), and $d$ is the dummy distinguishing the treatment ($d=1$) and the control ($d=0$) group. Note that $b_0$ is the average of $y$ for $T=0$ and $d=0$ (i.e. the control group before the event), $b_1$ estimates the average change in $y$ over the two time periods for $d=0$, and $b_2$ estimates the average difference between treatment and control for $T=0$. $b_3$ is the estimate which is of primary interest in such studies.

Note that the simple formulation above is only appropriate if no other regressors need to be accounted for. If this is not the case, the model has to be extended as follows:
$$\hat{y} = b_0 + b_1 T + b_2 d + b_3 T d + Xb.$$
As soon as the term $Xb$ is included in the specification, the coefficient $b_3$ is still the main object of interest; however, it is not a difference of sample averages any more, but has the corresponding ceteris paribus interpretation.

The diff-in-diff approach can be used to account for a so-called selection bias. For example, when assessing the effect of an MBA on salaries, people who choose to do an MBA may already have higher salaries than those who do not. Thus, the assignment to treatment and control group is not random but depends on (existing or expected) salaries. This problem of so-called self-selection results whenever subjects enter the treatment sample for reasons which are related to the dependent variable.

The appropriateness of the difference-in-differences approach rests on the parallel-trends assumption. Absent the effect under study, the dependent variable of the two groups must not have different "trends" (i.e. must not have differing slopes with respect to time). If this assumption is violated, the effect is over- or underestimated (because of diverging or converging trends), and partially but falsely attributed to the treatment. In the MBA-salary example this assumption is violated when the salaries of people who choose to do an MBA already increase more quickly than the salaries of those who do not.

Note that the interaction term $Td$ already accounts for different slopes with respect to time. Therefore, it is impossible to separate the effect under study from possibly different trends of the two groups which have nothing to do with the effect under study.
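A minimal simulated sketch of the basic specification (the true effect is set to $-5$; names and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    n = 2000
    T = rng.integers(0, 2, size=n).astype(float)   # 0 before, 1 after the event
    d = rng.integers(0, 2, size=n).astype(float)   # 1 = treatment group
    y = 10 + 2 * T + 3 * d - 5 * T * d + rng.normal(size=n)

    X = np.column_stack([np.ones(n), T, d, T * d])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b.round(2))                              # b3 recovers the effect (about -5)

    mean = lambda tt, dd: y[(T == tt) & (d == dd)].mean()
    did = (mean(1, 1) - mean(0, 1)) - (mean(1, 0) - mean(0, 0))
    print(round(did, 2))                           # identical to b3 without further regressors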
1.6.5 Example 11: Hedonic price functions

Hedonic price functions are used to define the implicit price of key attributes of goods as revealed by their sales price. We use a subset of a dataset used in Wooldridge (2003, p.194)^{28} consisting of the price of houses ($y$), the number of bedrooms ($x_1$), the size measured in square feet ($x_2$) and a dummy variable to indicate the style of the house ($x_3$) (see hedonic.wf1). We estimate the regression equation
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \epsilon,$$
where the interaction term $x_1 x_2$ is used to model the importance of the number of bedrooms depending on the size of the house. The underlying hypothesis is that additional bedrooms in large houses have a stronger effect on the price than in small houses (i.e. it is expected that $\beta_4>0$). The estimated equation is
$$\hat{y} = 199.45 - 45.51\,x_1 + 0.025\,x_2 + 20.072\,x_3 + 0.026\,x_1 x_2.$$
(0.034)  (0.072)  (0.575)  (0.191)  (0.014)
The interaction term is significant and has the expected sign. To facilitate the model's interpretation it is useful to evaluate the regression equation using typical values for one of the variables in the interaction term. Mean and standard deviation of size ($x_2$) are 2013.7 and 578.7. We can formulate equations for the expected price for small, average and large houses as a function of style ($x_3$) and the number of bedrooms ($x_1$):
$$x_2=1500:\quad \hat{y} = 236.48 + 20.072\,x_3 - 6.63\,x_1$$
$$x_2=2000:\quad \hat{y} = 248.82 + 20.072\,x_3 + 6.33\,x_1$$
$$x_2=2500:\quad \hat{y} = 261.16 + 20.072\,x_3 + 19.29\,x_1.$$
This shows that a regression equation with interactions can be viewed as a model with varying intercept and slope, where this variation depends on one of the interaction variables. The first equation shows that additional bedrooms in small houses lead to a c.p. drop in the expected price (probably because those bedrooms would be rather small and thus unattractive). We find a positive effect of bedrooms in houses with (above) average size.

^{28} Source: Go to the book companion site of Wooldridge (2003) at http://www.cengage.com/ (latest edition), click on "Data Sets", download one of the zip-files and choose "HPRICE1.*".
1.6.6 Example 12: House price changes induced by siting decisions

We consider a part of the dataset used by Kiel and McClain (1995)^{29} who study the impact of building an incinerator on house prices. Prices are available for 1978, before any rumors about potentially building an incinerator, and 1981 when its construction began.^{30} The purpose of the analysis is to quantify the effect of building a new incinerator on house prices. For simplicity we first ignore the control variables considered in the original study.

Running a regression of house prices on a dummy indicating whether a house is near^{31} the incinerator ($d=1$) makes no sense. If the incinerator was built in an area where house prices were already (relatively) low, the coefficient would not estimate the impact of the incinerator. In other words, the sample suffers from a selection bias. To determine the effect of the incinerator we must compare the average house prices before and after rumors by distinguishing houses near and far from the incinerator.

A non-regression based approach to estimate the effect is shown in diff-in-diff.xlsx. More specifically, we find that prices for houses far from the incinerator (control group) have increased by 18790.3 from 1978 to 1981, whereas prices of houses nearby (treatment group) have increased by only 6926.4. The difference of these differences between treatment and control group is $-11863.9$; i.e. houses nearby have increased less strongly, an effect that can be attributed to the incinerator. Alternatively, one could compare the difference in prices for the two types of houses in 1978 ($-18824.4$; i.e. houses nearby are cheaper) and 1981 ($-30688.3$; i.e. houses nearby are now much cheaper). This also results in an estimated effect of $-11863.9$. Simply comparing averages leads to the same results as a regression with dummy variables and an interaction with time, provided no further regressors (i.e. control variables) are used.

Adding further regressors (see file diff-in-diff.wf1) accounts for various (additional) features of the houses to be compared. This avoids any biases from ignoring other effects (see section 1.6.7). The revised estimate turns out to be $-14177.9$. This estimate cannot be simply derived from comparing averages, but measures the effect of the incinerator on house prices after controlling for other features of houses (i.e. ceteris paribus). Note that adding further regressors not only controls for additional features but (usually) also improves the goodness of fit. Thereby, standard errors of coefficients are reduced, and statistical inference is enhanced.

^{29} Source: Go to the book companion site of Wooldridge (2003) at http://www.cengage.com/ (latest edition), click on "Data Sets", download one of the zip-files, and choose "KIELMC.*".
^{30} Note that this is a pooled dataset, i.e. prices for different houses are considered in the two years.
^{31} We ignore the available measure for the distances between houses and the incinerator to be able to illustrate the difference-in-differences approach.
1.6.7 Omitted and irrelevant regressors

Relevant variables may have been omitted from a regression equation by mistake or because of a lack of data. Omitting relevant variables may have serious consequences. To investigate the potential effects we suppose that the correctly specified model is given by
$$y = X_1\beta_1 + X_2\beta_2 + \epsilon, \qquad (19)$$
but the model is estimated without $X_2$. The OLS estimate of $\beta_1$ is given by
$$b_1 = (X_1'X_1)^{-1}X_1'y.$$
We rewrite (19) as $y=X_1\beta_1+\epsilon_1$ where $\epsilon_1=X_2\beta_2+\epsilon$ and substitute for $y$ in the equation for $b_1$ to obtain
$$b_1 = (X_1'X_1)^{-1}X_1'(X_1\beta_1+\epsilon_1) = (X_1'X_1)^{-1}X_1'X_1\beta_1 + (X_1'X_1)^{-1}X_1'\epsilon_1 = \beta_1 + (X_1'X_1)^{-1}X_1'\epsilon_1.$$
The expectation of $b_1$ is given by
$$E[b_1] = \beta_1 + (X_1'X_1)^{-1}E[X_1'\epsilon_1].$$
This shows that assumption AX is violated since
$$E[X_1'\epsilon_1] = E[X_1'(X_2\beta_2+\epsilon)] = X_1'X_2\beta_2$$
is non-zero unless $X_1'X_2=0$ (i.e. all elements of $X_1$ and $X_2$ are uncorrelated) or $\beta_2=0$. Thus $b_1$ is biased and inconsistent if there are omitted regressors which are correlated with included regressors. The expected value of $b_1$ is given by the so-called omitted variable formula
$$E[b_1] = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2. \qquad (20)$$
The formula shows that the bias depends on the term that is multiplied with $\beta_2$. This term is equal to the coefficients from a regression of omitted regressors $X_2$ on included regressors $X_1$.

As a further consequence, the standard error of regression $s_e$ and the standard errors of $b_1$ will also be biased. Thus, statistical tests about $\beta_1$ are not meaningful. Usually, $s_e$ will be too high, and $\mathrm{se}[b_1]$ can be lower than if $X_2$ is included in the regression (see Greene, 2000, p.337).

In the simple case of only two regressors, where the correct equation is given by
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \qquad (21)$$
and $x_2$ is omitted from the estimated regression, the bias is given by
$$E[b_1] - \beta_1 = \beta_2\,\frac{\mathrm{cov}[x_1,x_2]}{V[x_1]}.$$
If $x_1$ and $x_2$ are uncorrelated, the estimate of $\beta_0$ is still biased^{32} and inconsistent. In this case the estimate $b_1$ is unbiased, but the standard error of $b_1$ is too large (see Kmenta, 1971, p.392).

The estimated coefficients in the reduced regression for a specific sample can also be computed from the omitted variable formula. If $b_1^{\mathrm{full}}$ denotes the subset of coefficients from the full model corresponding to $X_1$ and $b_2^{\mathrm{full}}$ corresponds to $X_2$, the coefficients in the reduced model are given by
$$b_1 = b_1^{\mathrm{full}} + (X_1'X_1)^{-1}X_1'X_2\,b_2^{\mathrm{full}}.$$
The omission of explanatory variables cannot be detected by a statistical test. Indirect (but possibly ambiguous) evidence may be obtained from the analysis of residuals. Note that the correlation between the residuals of the OLS regression $e=y-Xb$ and $X$ cannot be used to detect this problem. It is one of the implications of the LS principle that this correlation is always zero (see section 1.1.2). Proxy variables may be used instead of actually required, but unavailable regressors. Proxies should be highly correlated with the unavailable variables, but one can only make assumptions about this correlation. The negative consequences of omitted variables can be mitigated or eliminated using instrumental variable estimates (see section 1.9), or panel data analysis^{33}.

Including irrelevant variables in the model leads to inefficient but unbiased and consistent estimates. The inefficiency can be shown by considering an alternative definition of the variance of the OLS estimate $b_j$ (see Greene, 2003, p.57)
$$V[b_j] = \frac{\sigma^2}{(1-R_j^2)\displaystyle\sum_{t=1}^n (x_{tj}-\bar{x}_j)^2}, \qquad (22)$$
where $R_j^2$ is the $R^2$ from a regression of regressor $j$ on all other regressors (including a constant term). This definition shows that c.p. the variance of $b_j$ will increase with the correlation between variable $x_j$ and other regressors. This fact is also known as the multicollinearity problem which becomes relevant if $R_j^2$ is very close to one.

Suppose that the correct model is given by $y=\beta_0+\beta_1 x_1+\epsilon$ but the irrelevant variable $x_2$ (i.e. $\beta_2=0$) is added to the estimated regression equation. Denote the estimate of $\beta_1$ from the overfitted model by $\tilde{b}_1$. The variance of $b_1$ (from the correct regression) is given by (6), p.11 whereas $V[\tilde{b}_1]$ is given by (22). Thus, unless $x_1$ and $x_2$ are uncorrelated in the sample, $V[\tilde{b}_1]$ will be larger than necessary (i.e. larger than $V[b_1]$).

Exact multicollinearity holds when there are exact linear relationships among some regressors (i.e. $X$ does not have full rank). This can easily be corrected by eliminating redundant regressors (e.g. superfluous dummies). Typical signs of strong (but not exact) multicollinearity are wrong signs or implausible magnitudes of coefficients, as well as a strong sensitivity to changes in the sample (dropping or adding observations). The inflating effect on standard errors of coefficients may lead to cases where several coefficients are individually insignificant, but eliminating them (jointly) from the model leads to a significant drop in $R^2$ (based on an $F$-test).

^{32} Exception: the mean of $x_2$ is zero.
^{33} For an introduction to the principles of panel data analysis, see Wooldridge (2003), chapters 13 and 14.
The consequences of including irrelevant regressors (inefficiency) have to be compared to the consequences of omitting relevant regressors (bias and inconsistency). We hesitate to formulate a general recommendation, but it is worthwhile asking "What is the point of estimating a parameter more precisely if it is potentially biased?"
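The omitted variable formula (20) can be verified numerically; in this sketch $x_2$ is correlated with $x_1$, and the short regression reproduces $\beta_1 + \beta_2\,\mathrm{cov}[x_1,x_2]/V[x_1]$ (simulated data, illustrative parameters).

    import numpy as np

    rng = np.random.default_rng(10)
    n = 100000
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)             # omitted regressor, correlated with x1
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

    X1 = np.column_stack([np.ones(n), x1])
    b_short = np.linalg.solve(X1.T @ X1, X1.T @ y) # regression omitting x2
    bias = 1.5 * np.cov(x1, x2)[0, 1] / x1.var()   # beta2 * cov[x1,x2]/V[x1]
    print(b_short[1], 2.0 + bias)                  # both approximately 2.9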
1.6.8 Selection of regressors

The search for a correct specification of a regression model is usually difficult. The selection procedure can either start from a model with one or only a few explanatory variables, and subsequently add variables to the equation (the specific to general approach). Alternatively, one can start with a large model and subsequently eliminate insignificant variables. The second approach (general to specific) is preferable, since the omission of relevant variables has more drawbacks than the inclusion of irrelevant variables. In any case, it is strongly recommended to select regressors on the basis of a sound theory or a thorough investigation of the subject matter. A good deal of common sense is always useful.

The following guidelines can be used in the model selection process:

1. The selection of variables must not be based on simple correlations between the dependent variable and preselected regressors. Because of the potential bias associated with omitted variables any selection should be done in the context of estimating multiple regressions.

2. If the p-value of a coefficient is above the significance level this indicates that the associated variable can be eliminated. If several coefficients are insignificant one can start by eliminating the variable with the largest p-value and re-estimate the model.

3. If the p-value indicates elimination but the associated variable is considered to be of key importance theoretically, the variable should be kept in the model (in particular if the p-value is not far above the significance level). A failure to find significant coefficients may be due to insufficient data or a random sample effect (bad luck).

4. Statistical significance alone is not sufficient. There should be a very good reason for a variable to be included in a model and its coefficient should have the expected sign.

5. Adding a regressor will always lead to an increase of $R^2$. Thus, $R^2$ is not a useful selection criterion. A number of model selection criteria have been defined to facilitate the model choice in terms of a compromise between goodness of fit and the principle of parsimony. The adjusted $R^2$
$$\bar{R}^2 = 1 - \frac{n-1}{n-K}(1-R^2) = 1 - \frac{s_e^2}{s_y^2} \qquad s_e^2 = \frac{e'e}{n-K}$$
is a criterion that can be used for model selection. Note that removing a variable whose $t$-statistic is less than 1 leads to an increase of $\bar{R}^2$ ($R^2$ always drops if a regressor is removed!), and a decrease of the standard error of regression ($s_e$). It has been found, however, that $\bar{R}^2$ puts too little penalty on the loss in degrees of freedom. Alternative criteria are Akaike's information criterion
$$\mathrm{AIC} = -\frac{2\ell}{n} + \frac{2K}{n}$$
and the Schwarz criterion^{34}
$$\mathrm{SC} = -\frac{2\ell}{n} + \frac{K\ln n}{n},$$
where $\ell=-0.5\,n\,[1+\ln(2\pi)+\ln(e'e/n)]$. We finally note that model selection criteria must never be used to compare models with different dependent variables (e.g. to compare linear and log-linear models). A short computational sketch of these criteria follows after this list.

^{34} These are the definitions of AIC and SC used in EViews. Alternatively, the first term in the definition of AIC and SC can be replaced by $\ln(e'e/n)=\ln(\tilde{s}_e^2)$.
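A sketch of these criteria for one OLS fit (simulated data; the formulas follow the EViews-style definitions above):

    import numpy as np

    rng = np.random.default_rng(11)
    n = 120
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)  # third regressor is irrelevant

    K = X.shape[1]
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    R2 = 1 - (e @ e) / ((y - y.mean()) @ (y - y.mean()))
    R2_adj = 1 - (n - 1) / (n - K) * (1 - R2)
    ll = -0.5 * n * (1 + np.log(2 * np.pi) + np.log(e @ e / n))
    AIC = -2 * ll / n + 2 * K / n
    SC = -2 * ll / n + K * np.log(n) / n
    print(round(R2_adj, 3), round(AIC, 3), round(SC, 3))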
Exercise 7: Consider the data on the salary of 208 employees in the file salary.wf1^{35}. Estimate and choose a regression model for salary using available information such as gender, education level, experience and others. Note that EDUC is a categorical variable measuring the education level in terms of degrees obtained (1=finished high school, 2=finished some college courses, 3=obtained a bachelor's degree, 4=took some graduate courses, 5=obtained a graduate degree). Use model formulations which allow you to test for gender-specific payment behavior.

^{35} Source: Albright et al. (2002, p.686), Example 13.3.
1.7 Regression diagnostics

The purpose of diagnostic checking is to test whether important assumptions of regression analysis hold. In this subsection we will present some frequently applied tests, discuss some implications associated with violated assumptions, and provide simple remedies to correct for negative consequences. Note that a model that passes diagnostic tests need not necessarily be correctly specified.
1.7.1 Non-normality

The Jarque-Bera test is based on the null hypothesis of a normal distribution and takes skewness $S$ and kurtosis $U$ into account:
$$JB = \frac{n-K}{6}\left[S^2 + \frac{1}{4}(U-3)^2\right] \qquad JB \stackrel{a}{\sim} \chi^2_2,$$
where
$$S = \frac{1}{n}\sum_{i=1}^n \frac{(y_i-\bar{y})^3}{\tilde{s}^3} \qquad U = \frac{1}{n}\sum_{i=1}^n \frac{(y_i-\bar{y})^4}{\tilde{s}^4}.$$
If OLS residuals are not normally distributed, OLS estimates are unbiased and consistent, but not efficient (see Kmenta (1971), p.248). There exist other estimators with greater accuracy (of course, only if the correct (or a more suitable) distribution is used in those estimators). In addition, the $t$-statistics for significance testing are not appropriate. However, this is only true in small samples, and when the deviation from the normal distribution is 'strong'. A failure to obtain normal residuals in a regression may indicate missing regressors and/or other specification problems (although the specific kind of problem cannot be easily inferred). At any rate, normality of the dependent variables is not a requirement of OLS (as can be derived from sections 1.2.1 and 1.2.2).
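A sketch of the JB computation for OLS residuals (simulated fat-tailed disturbances; with an intercept the residuals have zero mean, so $e$ replaces $y_i-\bar{y}$ in the moments):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(12)
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.standard_t(df=4, size=n)  # fat tails

    K = X.shape[1]
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s = np.sqrt((e**2).mean())          # ML-type standard deviation
    S = (e**3).mean() / s**3            # skewness
    U = (e**4).mean() / s**4            # kurtosis (3 under normality)
    JB = (n - K) / 6 * (S**2 + 0.25 * (U - 3)**2)
    print(JB, chi2.sf(JB, 2))           # small p-value: reject normality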
Example 13: We use the data from example 8 and estimate the regression equation by OLS. Details can be found in the file investment quarterly.wf1. The distribution of residuals is positively skewed (0.25). This indicates an asymmetric distribution whose right tail is slightly longer than the left one. The kurtosis is far greater than three (5.08) which indicates more concentration around the mean than a normal distribution. JB is 38.9 with a p-value of zero. This clearly indicates that we can reject H$_0$ and we conclude that the residuals are not normally distributed.
1.7.2 Heteroscedasticity

Heteroscedasticity means that the variance of disturbances is not constant across observations
$$V[\epsilon_i] = \sigma_i^2 = \sigma^2\omega_i \quad \forall\,i,$$
and thus violates assumption AH. To analyze the implications of heteroscedasticity we assume that the covariance matrix is diagonal
$$E[\epsilon\epsilon'] = \sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}. \qquad (23)$$
If the variance of $\epsilon$ is not given by $\sigma^2 I$ but $\sigma^2\Omega$, the model is a so-called generalized least squares (GLS) model.

It can be shown that the finite sample properties of the OLS estimator are not affected if only AH is violated (see Greene (2003), section 10.2.1). However, the covariance of $b$ is not given by (5), p.11 but is given by
$$V[b] = (X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1}. \qquad (24)$$
Provided that $X'\Omega X/n$ converges to a positive definite matrix it can be shown that in the presence of heteroscedasticity the OLS estimator $b$ is unbiased, consistent and asymptotically normal (see Greene (2003), section 10.2.2):
$$b \stackrel{a}{\sim} \mathrm{N}\!\left(\beta,\ \frac{\sigma^2}{n}\,Q^{-1}Q^*Q^{-1}\right), \qquad (25)$$
where $Q$ is defined in (11), p.23 and
$$Q^* = \mathrm{plim}\,\frac{1}{n}X'\Omega X.$$
However, the OLS estimator $b$ is inefficient since it does not use all the information available in the sample. The estimated standard errors of $b$ are biased and the associated $t$- and $F$-statistics are incorrect. For instance, if $\sigma_i^2$ and a regressor $x_j$ are positively correlated, the bias in the standard error of $b_j$ is negative (see Kmenta (1971), p.256). Depending on the correlation between the heteroscedasticity and the regressors (and their cross-products) the consequences may be substantial (see Greene (2000), p.502-505).
The Breusch-Pagan test for heteroscedasticity is based on an auxiliary regression of $e_i^2$ on a constant and the regressors. Under the null of homoscedasticity we can use the $R^2$ from this regression to compute the test statistic $nR^2 \stackrel{a}{\sim} \chi^2_k$ ($k$ is the number of regressors excluding the constant). The White-test for heteroscedasticity is based on regressing $e_i^2$ against a constant, the regressors and their squares. In a more general version of the test the cross products of regressors may be added, too. Under the null of homoscedasticity the test statistic is also $nR^2 \stackrel{a}{\sim} \chi^2_k$, where $k$ is the number of regressors in the auxiliary regression excluding the constant. The advantage of the White-test is that no assumptions about the type of heteroscedasticity are required. On the other hand, rejecting H$_0$ need not be due to heteroscedasticity but may indicate other specification errors (e.g. omitted variables).

In section 1.8 we will present estimators that make use of some knowledge about $\Omega$. If no such information is available the OLS estimator may still be retained. However, to improve statistical inference about coefficients the estimated standard errors can be corrected using the White heteroscedasticity consistent (WHC) estimator
$$\widehat{\mathrm{aV}}[b] = \frac{n}{n-K}\,(X'X)^{-1}\left(\sum_{i=1}^n e_i^2\,x_i x_i'\right)(X'X)^{-1}. \qquad (26)$$
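Both the test and the correction are available in standard software; a sketch using Python's statsmodels (simulated heteroscedastic data; the 'HC1' option applies the $n/(n-K)$ scaling of (26)):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white

    rng = np.random.default_rng(13)
    n = 400
    x = rng.uniform(1, 10, size=n)
    y = 1.0 + 0.5 * x + x * rng.normal(size=n)   # disturbance variance grows with x
    X = sm.add_constant(x)

    res = sm.OLS(y, X).fit()
    lm, lm_pval, f_stat, f_pval = het_white(res.resid, X)
    print(lm, lm_pval)                           # tiny p-value: reject homoscedasticity

    res_whc = sm.OLS(y, X).fit(cov_type='HC1')   # White-corrected standard errors
    print(res.bse.round(4), res_whc.bse.round(4))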
Example 14: We use a dataset^{36} on hourly earnings ($y$), employment duration ($x_1$) and years of schooling ($x_2$) ($n=49$) (see earnings.wf1). A plot of the residuals from the estimated regression ($t$-statistics in parentheses)
$$\ln y = 1.22 + 0.027\,x_1 + 0.126\,x_2 + e$$
(6.1)  (4.4)  (3.6)
against $x_1$ shows a strong increase in the variance of $e$. The White-test statistic 23.7 based on the regression ($p$-values in parentheses)
$$e^2 = 0.1 - 0.022\,x_1 + 0.0009\,x_1^2 + 0.12\,x_2 - 0.019\,x_2^2 + u \qquad R^2 = 0.484$$
(0.5)  (0.15)  (0.008)  (0.09)  (0.08)
is highly significant (the $p$-value is very close to zero) and we can firmly reject the homoscedasticity assumption. The $t$-statistics based on the WHC standard errors are 9.7, 4.0 and 4.9, respectively. Thus, in this example, the conclusions regarding the significance of coefficients are not affected by the heteroscedasticity of residuals.

Exercise 8: Test the residuals from the models estimated in exercise 7 for non-normality and heteroscedasticity.

^{36} Source: Thomas (1997), p.293.
1.7.3 Autocorrelation

Autocorrelation (or serial correlation) is only relevant in case of time series data. It means that consecutive disturbances are correlated, which violates assumption AH. For instance, if the dependent variable is subject to seasonality (e.g. a monthly time series which has local peaks during the months of summer and troughs during winter) which is not accounted for by the regressors, the residuals $e_t$ and $e_{t-12}$ will be correlated.

To analyze the implications of autocorrelation we assume that the covariance matrix of disturbances is given by
$$E[\epsilon\epsilon'] = \sigma^2\Omega = \sigma^2\begin{pmatrix} 1 & \rho_1 & \cdots & \rho_{n-1}\\ \rho_1 & 1 & \cdots & \rho_{n-2}\\ \vdots & \vdots & \ddots & \vdots\\ \rho_{n-1} & \rho_{n-2} & \cdots & 1 \end{pmatrix}, \qquad (27)$$
where $\rho_\ell$ is the (auto)correlation between $\epsilon_t$ and $\epsilon_{t-\ell}$. It can be shown (see Greene (2003), section 10.2.2) that under this assumption, autocorrelated disturbances have the same consequences as heteroscedasticity. The OLS estimator $b$ is unbiased, consistent, asymptotically normal as in (25), but inefficient. This implies that the standard errors of the coefficients are biased. For instance, if the majority of autocorrelations is positive the standard errors are too small (see Kmenta (1971), p.273).

Autocorrelations can be estimated from the sample by
$$r_\ell = \frac{1}{e'e}\sum_{t=\ell+1}^n e_t e_{t-\ell},$$
and tests for the significance of individual autocorrelations can be based on
$$r_\ell \stackrel{a}{\sim} \mathrm{N}(-1/n,\ 1/n).$$
The asymptotic properties of $r_\ell$ hold if the disturbances $\epsilon$ are uncorrelated (see Chatfield (1989), p.51). The Ljung-Box Q-statistic
$$Q_p = n(n+2)\sum_{\ell=1}^p \frac{r_\ell^2}{n-\ell} \qquad Q_p \stackrel{a}{\sim} \chi^2_p$$
can be used as a joint test for all autocorrelations up to lag $p$. The Durbin-Watson test statistic $\mathrm{DW}\approx 2(1-r_1)$ has a long tradition in econometrics. However, it only takes the autocorrelation at lag 1 into account and has other conceptual problems; e.g. it is not appropriate if the lagged dependent variable is used as a regressor (see Greene (2003), p.270).

The Breusch-Godfrey test is based on an auxiliary regression of $e_t$ on $p$ lagged residuals and the original regressors. Under the null of no autocorrelation we can use the $R^2$ from this regression to compute the test statistic $nR^2 \stackrel{a}{\sim} \chi^2_p$.
Similar to the WHC estimator the Newey-West (HAC) estimator can be used to account for residual autocorrelation without changing the model specification. It is a covariance estimator that is consistent in the presence of both heteroscedasticity and autocorrelation (hence HAC) of unknown form. It is given by
$$\widehat{\mathrm{aV}}[b] = (X'X)^{-1}\,\hat{\Sigma}\,(X'X)^{-1}$$
where
$$\hat{\Sigma} = \frac{n}{n-K}\left(\sum_{t=1}^n e_t^2\,x_t x_t' + \sum_{j=1}^q w_j\sum_{t=j+1}^n \big(x_t e_t e_{t-j} x_{t-j}' + x_{t-j} e_{t-j} e_t x_t'\big)\right).$$
$w_j = 1 - j/(q+1)$, and the truncation lag $q$ determines how many autocorrelations are taken into account. Newey and West (1987) suggest to set $q = 4(n/100)^{2/9}$.
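A sketch using statsmodels' implementation of this estimator (simulated data where both the regressor and the disturbances follow an AR(1), so the usual standard errors are biased downward):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(14)
    n = 400
    x = np.zeros(n); u = np.zeros(n)
    for t in range(1, n):                        # AR(1) regressor and disturbances
        x[t] = 0.7 * x[t - 1] + rng.normal()
        u[t] = 0.7 * u[t - 1] + rng.normal()
    y = 1.0 + 0.5 * x + u
    X = sm.add_constant(x)

    q = int(4 * (n / 100) ** (2 / 9))            # Newey-West truncation lag
    res_ols = sm.OLS(y, X).fit()
    res_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': q})
    print(res_ols.bse, res_hac.bse)              # HAC standard errors are clearly larger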
We now take a closer look at implications of autocorrelated residuals and consider the model^{37}
$$y_t = \beta x_t + u_t \qquad (28)$$
$$u_t = \rho u_{t-1} + \epsilon_t \qquad |\rho|<1 \qquad \epsilon_t \sim \text{i.i.d.}$$
The first equation may be viewed as being incorrectly specified, as we are going to show now. The second equation for the autocorrelated residuals $u_t$ is a so-called first order autoregression AR(1). Substituting $u_t$ into the first equation ($y_t = \beta x_t + \rho u_{t-1} + \epsilon_t$) and using $u_{t-1} = y_{t-1} - \beta x_{t-1}$, we obtain
$$y_t = \rho y_{t-1} + \beta x_t - \rho\beta\,x_{t-1} + \epsilon_t.$$
This shows that the autocorrelation in $u_t$ can be viewed as a result of missing lags in the original equation. If we run a regression without using $y_{t-1}$ and $x_{t-1}$, we have an omitted variables problem. The coefficient $\hat{\beta}$ of the incomplete regression $y_t=\beta x_t+\epsilon_t$ is given by
$$\hat{\beta} = \frac{\sum y_t x_t}{\sum x_t^2} = M\sum y_t x_t \qquad M = \Big(\sum x_t^2\Big)^{-1}.$$
Substituting for $y_t$ from the complete regression we obtain
$$E[\hat{\beta}] = E\Big[M\sum x_t(\rho y_{t-1} + \beta x_t - \rho\beta x_{t-1} + \epsilon_t)\Big] = E\Big[M\sum x_t\rho y_{t-1}\Big] + E\Big[M\sum \beta x_t^2\Big] - E\Big[M\sum \rho\beta x_t x_{t-1}\Big] + E\Big[M\sum x_t\epsilon_t\Big].$$
To simplify this relation it is useful to write the AR(1) equation as
$$u_t = \rho^t u_0 + \sum_{i=0}^{t-1}\rho^i\epsilon_{t-i},$$

^{37} For the sake of simplicity we consider only a single regressor $x_t$, and assume that $y_t$ and $x_t$ have mean zero.
which implies that equation (28) can be written as ($\rho^t u_0$ vanishes for large $t$, since $|\rho|<1$)
$$y_t = \beta x_t + \sum_{i=0}^{t-1}\rho^i\epsilon_{t-i}.$$
We also make use of $E[x_t\epsilon_{t-i}]=0$ ($\forall\,i\ge 0$), and note that the autocorrelation of $x_t$ is given by $\rho_x = M\sum x_t x_{t-1}$. We find that $\hat{\beta}$ is unbiased since its expectation is given by
$$E[\hat{\beta}] = E\Big[M\sum x_t\Big(\rho\beta x_{t-1} + \rho\sum_i\rho^i\epsilon_{t-1-i}\Big)\Big] + \beta - \rho\beta\rho_x = \rho\beta\rho_x + \beta - \rho\beta\rho_x = \beta.$$
Thus, despite the incomplete regression and the presence of autocorrelated residuals, we obtain unbiased estimates.

We now add a lagged dependent variable to equation (28). From section 1.2 we know that a lagged dependent variable leads to biased estimates. However, the estimates are consistent provided that assumptions AX and AR$_t$ hold. We now investigate what happens if the disturbances are autocorrelated. We consider the model
$$y_t = \gamma y_{t-1} + \beta x_t + u_t$$
$$u_t = \rho u_{t-1} + \epsilon_t \qquad |\rho|<1 \qquad \epsilon_t \sim \text{i.i.d.},$$
which can be written as
$$y_t = (\gamma+\rho)\,y_{t-1} - \gamma\rho\,y_{t-2} + \beta x_t - \rho\beta\,x_{t-1} + \epsilon_t.$$
Suppose we run a regression without using $y_{t-2}$ and $x_{t-1}$. From the omitted variable formula (20) we know that the resulting bias depends on the coefficients of omitted regressors ($\beta_2 = [-\gamma\rho\ \ -\rho\beta]'$ in the present case), and the matrix of coefficients from regressing $y_{t-2}$ and $x_{t-1}$ on included regressors. This matrix will be proportional to the following matrix (i.e. we ignore the inverse of the matrix associated with included regressors):
$$\begin{pmatrix} \sum y_{t-1}y_{t-2} & \sum y_{t-1}x_{t-1}\\ \sum x_t y_{t-2} & \sum x_t x_{t-1} \end{pmatrix}.$$
The elements in the first row will be non-zero (if $\gamma\ne 0$ and $\beta\ne 0$), and thus the estimated coefficient of $y_{t-1}$ is biased. It is more difficult to say something general about the first element in the second row, but autocorrelation in $x_t$ leads to a bias in $\hat{\beta}$. Greene (2003, p.266) considers the simplified case $\beta=0$, and states that the probability limit of the estimated coefficient of a regression of $y_t$ on $y_{t-1}$ alone is given by $(\gamma+\rho)/(1+\gamma\rho)$.
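Greene's probability limit is easy to check by simulation; a sketch with the illustrative values $\gamma=0.5$ and $\rho=0.4$, for which $(\gamma+\rho)/(1+\gamma\rho)=0.75$:

    import numpy as np

    rng = np.random.default_rng(15)
    gamma, rho, n = 0.5, 0.4, 200000
    u = np.zeros(n); y = np.zeros(n)
    for t in range(1, n):
        u[t] = rho * u[t - 1] + rng.normal()
        y[t] = gamma * y[t - 1] + u[t]           # beta = 0: no x_t in the model
    g_hat = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1]) # OLS of y_t on y_{t-1} alone
    print(g_hat, (gamma + rho) / (1 + gamma * rho))   # both approximately 0.75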
Example 15: We consider the data set analyzed by Coen et al. (1969) who formulate a regression for the Financial Times index ($y_t$) using the UK car production index ($p_t$) lagged by six quarters and the Financial Times commodity index ($c_t$) lagged by seven quarters as regressors. Details can be found in the file coen.wf1. These lags were found by "graphing the series on transparencies and then superimposing them" (p.136). The estimated equation is (all $p$-values are very close to zero; we report $t$-statistics below coefficients for later comparisons)
$$y_t = 653 + 0.47\,p_{t-6} - 6.13\,c_{t-7} + e_t. \qquad (29)$$
(11.6)  (14.1)  (9.9)
The Coen et al. study has raised considerable debate (see the discussion in their paper and in Granger and Newbold, 1971, 1974) because the properties of the residuals had not been thoroughly tested. As it turns out DW=0.98, and the Breusch-Godfrey test statistic using $p=1$ is 12.4 with a $p$-value below 0.001. This is evidence of considerable autocorrelation. In fact, using Newey-West HAC standard errors, the $t$-statistics are reduced to 11.4 and 7.6, respectively.

Stock prices or indices are frequently claimed to follow a random walk (see section 2.3) $y_t = y_{t-1} + \epsilon_t$ ($\epsilon_t$ i.i.d.). Thus we add the lagged dependent variable $y_{t-1}$ to Coen et al.'s equation and find
$$y_t = 276 + 0.661\,y_{t-1} + 0.127\,p_{t-6} - 2.59\,c_{t-7} + e_t. \qquad (30)$$
(4.1)  (6.9)  (2.3)  (3.8)
The residuals in this equation are not autocorrelated which indicates that the estimates are consistent (to the extent that no regressors have been omitted). The coefficients and the $t$-statistics of $p_{t-6}$ and $c_{t-7}$ are considerably lower than before. It is not straightforward to test whether the coefficient of $y_{t-1}$ is equal to one for reasons explained in section 2.3.3. In sum, our results raise some doubt about the highly significant lagged relationships found by Coen et al.
Example 16: We briefly return to example 7 where we have considered tests of the UIRP based on one-month forward rates and monthly data. Since forward rates are also available for other maturities, this provides further opportunities to test the UIRP. We use the three-month forward rate $F_t^3$ for which the UIRP implies
$$\mathrm{E}_t[\ln S_{t+3}] = \ln F_t^3.$$
This can be tested by running the regression
$$s_t - s_{t-3} = \beta_0 + \beta_1\,(f_{t-3}^3 - s_{t-3}) + \epsilon_t.$$
The estimated regression is^{38}
$$s_t - s_{t-3} = 0.01 + 0.994\,(f_{t-3}^3 - s_{t-3}) + e_t \qquad R^2 = 0.0123.$$
(1.76)  (1.86)
Before we draw any conclusions from this regression it is important to note that the observation frequency need not (and in the present case does not) conform to the maturity. $s_t - s_{t-3}$ is a three-month return (i.e. the sum of three consecutive monthly returns). This introduces autocorrelation in three-month returns even though the monthly returns are not serially correlated (similar to section 1.8.3). This is known as the overlapping samples problem. In fact, the residual autocorrelations at lags 1 and 2 are highly significant (and positive), and the $p$-value of the Breusch-Godfrey test is zero. Thus, the standard errors cannot be used since they will most likely be biased (the bias will be negative since the autocorrelations are positive).

One way to overcome this problem is to use quarterly data (i.e. to use only every third monthly observation). However, this leads to a substantial loss of information, and reduces the power of the tests. Alternatively, we can use Newey-West standard errors to find $t$-statistics of $b_0$ and $b_1$ equal to 1.29 and 1.21, which are much lower, as expected. Whereas a Wald test based on the usual standard errors has a $p$-value of about 0.017 (which implies a rejection of the UIRP), the $p$-value of a Wald test based on Newey-West standard errors is 0.19.

^{38} See file uirp.wf1 for details.

Exercise 9: Use the data in the file coen.wf1 to estimate and test alternative models for the Financial Times index. Make use of additional regressors available in that file.

Exercise 10: Use the US dollar/British pound exchange rate and the three-month forward rate in the file uirp.wf1 to test the UIRP.
1.8 Generalized least squares

We now consider alternative estimators to overcome the inefficiency of OLS estimates associated with features of disturbances (i.e. violations of assumption AH) introduced in sections 1.7.2 and 1.7.3. In general the matrix $\Omega$ in (23) or (27) is unknown. If its structure is known, or assumptions are made about its structure, it is possible to derive alternative estimators.

1.8.1 Heteroscedasticity

We first consider the problem of heteroscedasticity and suppose that the variance of disturbances is given by
$$V[\epsilon_i] = \sigma_i^2 = \sigma^2\omega_i \quad \forall\,i.$$
In the method of weighted least squares (WLS) the regression equation is multiplied by a suitable variable $\lambda_i$^{39}
$$\lambda_i y_i = \beta_0\lambda_i + \sum_{j=1}^k \beta_j\lambda_i x_{ij} + \lambda_i\epsilon_i \qquad i=1,\dots,n.$$
The variance of the disturbance term $\epsilon_i^* = \lambda_i\epsilon_i$ is
$$V[\epsilon_i^*] = V[\lambda_i\epsilon_i] = \lambda_i^2\,E[\epsilon_i^2] = \lambda_i^2\sigma_i^2.$$
Obviously, if $\lambda_i$ is chosen such that it is equal to $1/\sqrt{\omega_i}$ the variance of the disturbances in the modified equation is constant
$$V[\epsilon_i^*] = \lambda_i^2\sigma_i^2 = \sigma_i^2/\omega_i = \sigma^2 \qquad \text{if } \lambda_i = 1/\sqrt{\omega_i},$$
and OLS estimation of the modified equation should give efficient estimates for $\beta$. This can be achieved if $\lambda$ is chosen such that it is proportional to the reciprocal of the standard deviation of the disturbances. Since $\omega_i$ cannot be observed, one can try to define $\lambda$ in terms of a regressor of the model. The Breusch-Pagan or White-test may serve as starting points for this choice. From a regression of $e^2$ on all regressors and their squares one may find, for example, a significant coefficient for $x_j^2$. In this case $\lambda=1/x_j$ may be a good choice and the modified regression equation is given by:
$$y/x_j = b_0/x_j + b_1 x_1/x_j + \cdots + b_j + \cdots + b_k x_k/x_j + e/x_j,$$
or, using transformed variables $y^* = y/x_j$ and $x_l^* = x_l/x_j$ ($l=0,\dots,k$; $x_0=1$):
$$y^* = b_0 x_0^* + b_1 x_1^* + \cdots + b_j + \cdots + b_k x_k^* + e^*. \qquad (31)$$
Note that the coefficient $b_j$ is the constant term in the modified regression but is still the estimator of the coefficient for regressor $x_j$. The variance of the modified residuals is
$$V[e^*] = E[e^2/x_j^2].$$

^{39} The index $i$ is meant to emphasize that $\lambda_i$ differs across observations.
To the extent that $e^2$ and $x_j^2$ are related (as indicated by the White-test regressions) the variance of $e^*$ should be approximately constant. If the White test shows a significant coefficient for $x_j$, a good choice may be $\lambda=1/\sqrt{x_j}$.
Standard errors for coefficients estimated by WLS are based on the covariance derived from $X^*$ (the matrix of weighted regressors) and $e^*=y^*-X^*b_{WLS}$:
$$\hat{\mathrm{V}}[b_{WLS}] = s_{e^*}^2\,(X^{*\prime}X^*)^{-1}.$$
The $R^2$ from the modified regression (31) must not be compared to the $R^2$ from the original equation since the dependent variables are not the same. It does not describe the relation of interest and thus the $R^2$ from the transformed equation is rather useless. Equation (31) is mainly a device to obtain efficient estimates and correct statistical inference. Therefore, parameter estimates should also be interpreted and used in the context of the original (untransformed) model.
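As an illustration, the following sketch (in Python with simulated data; all names are illustrative and not part of the course files) applies WLS with the weight $\lambda_i=1/x_{i1}$ and computes the covariance estimate given above:

import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(1.0, 10.0, n)
eps = rng.normal(0.0, 1.0, n) * x1            # V[eps_i] proportional to x1^2
y = 2.0 + 0.5 * x1 + eps

lam = 1.0 / x1                                # weight lambda_i = 1/x_i1
Xs = np.column_stack([lam, lam * x1])         # transformed constant and x1
ys = lam * y
b_wls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
e_s = ys - Xs @ b_wls
s2 = e_s @ e_s / (n - Xs.shape[1])
cov = s2 * np.linalg.inv(Xs.T @ Xs)           # Vhat[b_wls] = s^2 (X*'X*)^(-1)
# b_wls[1], the constant of the transformed regression, estimates the slope of x1
print(b_wls, np.sqrt(np.diag(cov)))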
Another approach is based on using weights that are estimated from the data. It may be assumed that heteroscedasticity depends on regressors $z$ (which may include original regressors $x$). Plausible candidates are $\sigma_i^2=z_i'\alpha_z$ or $\sigma_i^2=\sigma^2\exp\{z_i'\alpha_z\}$. In the method of feasible generalized least squares (FGLS) estimates of $\sigma_i^2$ are used to obtain an estimate of the matrix $\Omega$ defined in (23). The corresponding FGLS estimator is given by
$$b_{fgls} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y.$$
The FGLS estimator is asymptotically efficient if the estimate used to construct $\hat{\Omega}$ is consistent. Estimates $\hat{\sigma}_i^2$ can be obtained by using $z_i'b_z$ from the auxiliary regressions
$$e_i^2 = z_i'b_z + u_i \qquad\text{or}\qquad \ln e_i^2 = z_i'b_z + v_i.$$
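A minimal FGLS sketch along these lines (Python with simulated data; the exponential variance function is an assumption made only for this illustration):

import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
sig2 = np.exp(0.5 + 0.8 * x)                  # true (unknown) variance function
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sig2)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # step 1: OLS residuals
e = y - X @ b_ols
a, *_ = np.linalg.lstsq(X, np.log(e**2), rcond=None)  # step 2: ln e^2 = z'a + v
sig2_hat = np.exp(X @ a)                       # fitted variances

W = 1.0 / sig2_hat                             # step 3: FGLS with Omega-hat
b_fgls = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * y))
print(b_ols, b_fgls)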
Example 17: We use data and results from example 14. The White test regression has shown a significant relation between $e^2$ and $x_1^2$. We use $1/x_1$ as the weight in a WLS regression and obtain (t-statistics in parentheses)
$$\ln y = 1.23 + 0.042\,x_1 + 0.025\,x_2 + e^*.$$
(27.9) (8.7) (1.8)
Note the small changes in the estimated coefficients (which are expected) and the substantial changes in the t-statistics compared to example 14.
Exercise 11: Consider the data from example 11 (see file hedonic.wf1). Estimate a model excluding the interaction term. Test the residuals from this model for heteroscedasticity, and obtain WLS or FGLS estimates if required.
1.8.2 Autocorrelation
Autocorrelation is another case where assumption AH is violated. To overcome the associated inefficiency of OLS we assume, for simplicity, that autocorrelations in the covariance matrix (27) can be expressed in terms of the first order (lag one) autocorrelation $\rho_1$ only:
$$\rho_\tau = \rho_1^\tau \qquad \tau=1,\ldots,n-1.$$
This is equivalent⁴⁰ to the model
$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t \qquad u_t = \rho_1 u_{t-1} + \epsilon_t \qquad \epsilon_t \sim \text{i.i.d.}$$
Upon substitution of $u_t$ from the second equation we find
$$y_t = \beta_0(1-\rho_1) + \beta_1 x_{t1} - \beta_1\rho_1 x_{t-1,1} + \cdots + \beta_k x_{tk} - \beta_k\rho_1 x_{t-1,k} + \rho_1 y_{t-1} + \epsilon_t.$$
Alternatively, we can use transformed variables $y_t^*=y_t-\rho_1 y_{t-1}$ and $x_{tj}^*=x_{tj}-\rho_1 x_{t-1,j}$ (the so-called partial differences):
$$y_t^* = \beta_0(1-\rho_1) + \beta_1 x_{t1}^* + \cdots + \beta_k x_{tk}^* + \epsilon_t.$$
$\epsilon_t$ is uncorrelated, and estimating the transformed equation by OLS gives efficient estimates. $\rho_1$ is unknown, but we can use FGLS if $\rho_1$ is replaced by a consistent estimate. Several options are available to estimate $\rho_1$ consistently (e.g. Cochrane-Orcutt or Prais-Winsten; see Greene (2003), section 12.9). The simplest is to use the first order autocorrelation of the residuals from the original (consistent) regression.
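The following sketch (simulated data, illustrative names) estimates $\rho_1$ from OLS residuals and re-estimates the model on partial differences:

import numpy as np

rng = np.random.default_rng(2)
n, rho = 400, 0.6
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                  # AR(1) disturbances
    u[t] = rho * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols
rho_hat = (e[1:] @ e[:-1]) / (e @ e)   # lag-1 residual autocorrelation

ys = y[1:] - rho_hat * y[:-1]          # partial differences (first obs. lost)
Xs = X[1:] - rho_hat * X[:-1]          # constant column becomes 1 - rho_hat,
b_fgls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)  # so b_fgls[0] still estimates beta0
print(rho_hat, b_ols, b_fgls)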
We also note that autocorrelation of disturbances may be viewed as the result of a misspecified equation (see section 1.7.3). In other words, the (original) equation has to be modified to account for the dynamics of the variables and responses involved. According to this view, GLS is not appropriate to resolve the inefficiency of OLS. A starting point for the reformulation may be obtained from the equation using partial differences.
WLS and FGLS may be summarized in terms of the Cholesky decomposition of the inverse of $\Omega$:
$$\Omega^{-1} = C'C \qquad C\Omega C' = I.$$
The matrix $C$ is used to transform the model such that the transformed disturbances are homoscedastic and non-autocorrelated:
$$Cy = CX\beta + C\epsilon \;\Longrightarrow\; y^* = X^*\beta + \epsilon^*$$
$$\mathrm{E}[C\epsilon] = C\,\mathrm{E}[\epsilon] = 0 \qquad \mathrm{V}[C\epsilon] = C\,\mathrm{V}[\epsilon]\,C' = C\sigma^2\Omega C' = \sigma^2 C\Omega C' = \sigma^2 I.$$
If the transformed equation is estimated by OLS the GLS estimator is obtained:
$$b_{gls} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'C'CX)^{-1}X'C'Cy = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
⁴⁰ The autocorrelations of $u_t$ from the model $u_t=\rho_1 u_{t-1}+\epsilon_t$ can be shown to be given by $\rho_\tau=\rho_1^\tau$ (see section 2.2).
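As a numerical check of this identity, the sketch below (assuming a known AR(1) structure $\Omega_{ts}=\rho^{|t-s|}$ purely for the illustration) computes the GLS estimator both via the Cholesky transform and via the closed form:

import numpy as np

rng = np.random.default_rng(3)
n, rho = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Omega) @ rng.normal(size=n)

L = np.linalg.cholesky(np.linalg.inv(Omega))
C = L.T                                    # Omega^{-1} = C'C
b_gls, *_ = np.linalg.lstsq(C @ X, C @ y, rcond=None)   # OLS on transformed model
Oinv = np.linalg.inv(Omega)
b_check = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)  # closed form
print(b_gls, b_check)                      # identical up to rounding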
Example 18: We consider the model (29) estimated in example 15. The estimated residual autocorrelation at lag 1 is $\hat{\rho}_1=0.5$ and can be used to form partial differences (e.g. $y_t^*=y_t-0.5y_{t-1}$). The FGLS estimates are given by (t-statistics below coefficients)
$$y_t^* = 299.7 + 0.446\,p_{t-6}^* - 5.41\,c_{t-7}^* + e_t.$$
(7.2) (8.7) (5.8)
The increased efficiency associated with the FGLS estimator cannot be derived from comparing these estimates to those in example 15. It is worth noting, however, that the t-statistics are much lower and closer to those obtained in equation (30).
Exercise 12: Consider a model you have estimated in exercise 9. Test the
residuals for autocorrelation, and obtain FGLS estimates if required.
1.8.3 Example 19: Long-horizon return regressions
Suppose a time series of (overlapping) h-period returns is computed from single-period returns as follows (see section 2.1):
$$y(h)_t = y_t + y_{t-1} + \cdots + y_{t-h} \qquad y_t = \ln p_t - \ln p_{t-1}.$$
In the context of long-horizon return regressions the conditional expected value of $y(h)_t$ is formed on the basis of the information set available at the date when forecasts are made, i.e. at date $t-h-1$. For example, in case of a single regressor a regression equation is formulated as
$$y(h)_t = \beta_0 + \beta_1 x_{t-h-1} + \epsilon_t.$$
Alternatively, the model can be reformulated by shifting the time axis as follows:
$$y(h)_{t+h} = \beta_0 + \beta_1 x_{t-1} + \epsilon_{t+h}.$$
This model specification corresponds to the problem of predicting the sum of h single-period returns during the period t until t+h on the basis of information available at date t-1.
We consider weekly observations of the DAX $p_t$ which are used to form a time series of annual returns
$$y(52)_t = \ln p_t - \ln p_{t-52}.$$
The following regressors are available: the dividend yield $d_t$, the spread between ten-year and one-month interest rates $s_t$, the one-month real interest rate $r_t$, and the growth rate of industrial production $g_t$.⁴¹ Real interest rates are computed by subtracting the inflation rate from one-month interest rates. The inflation rate $i_t$ is calculated from the consumer price index $c_t$ using $i_t=\ln c_t-\ln c_{t-52}$. The growth rate of industrial production is computed in the same way from the industrial production index. Details can be found in the file long.wf1.
Each explanatory variable is included with a lag of 53 weeks and the estimated equation is⁴²
$$y(52)_t = 0.41 + 39.8\,d_{t-53} - 1.76\,s_{t-53} - 8.79\,r_{t-53} + 0.638\,g_{t-53} + e_t.$$
The present result is typical for some similar cases known from the literature which support the 'predictability' of long-horizon returns. All parameters are highly significant, which leads to the (possibly premature) conclusion that the corresponding explanatory variables are relevant when forming expectations.
The most obvious deficiency of this model is the substantial autocorrelation of the residuals ($r_1=0.959$). The literature on return predictability typically reports that $R^2$ increases with h (see Kirby, 1997).
⁴¹ Source: Datastream; January 1983 to December 1997; 782 observations.
⁴² All p-values are less than 0.001 except for $s_{t-53}$ with a p-value equal to 0.002.
Figure 1: Fit and residuals of the long-horizon return regression of annual DAX returns. [Figure: actual and fitted annual returns together with residuals; legend: Residual, Actual, Fitted; sample 1985 to 1996.]
This property of the residuals is mainly caused by the way multi-period returns are constructed. The (positive) autocorrelation of residuals causes a (negative) bias of the standard errors of the parameters. In extreme cases this may lead to the so-called spurious regression problem.⁴³
No matter whether this is, in fact, a spurious regression case, Figure 1 shows that data and in-sample fit may deviate strongly from each other for very long periods. Therefore, when this model is used for out-of-sample forecasts, large errors in the estimation of expected returns can be expected over long time intervals.
Note that the residual autocorrelation cannot simply be corrected by using partial differences (e.g. $y_t-\hat{\rho}_1 y_{t-1}$). Such a formulation would imply single-period expectations, contrary to the intention of long-horizon regressions. On the other hand, partial differences based on longer periods (e.g. $y_t-\hat{\rho}_{53}y_{t-53}$) would not account for the short-term autocorrelation.
A viable alternative is to use Newey-West HAC standard errors, which shows that only the coefficients of $d_{t-53}$ and $r_{t-53}$ remain significant. The p-values of $s_{t-53}$ and $g_{t-53}$ increase to 0.28 and 0.11, respectively. Valkanov (2003) provides theoretical results why t-statistics in long-horizon regressions do not converge to well-defined distributions, and proposes a rescaled t-statistic.
⁴³ For details see Granger and Newbold (1974).
1.9 Endogeneity and instrumental variable estimation⁴⁴
1.9.1 Endogeneity
In sections 1.2 and 1.3 the exogeneity assumptions were found to be crucial for the properties of OLS estimates. The term endogeneity (i.e. regressors are not exogenous, but are correlated with the disturbances) refers to violations of these assumptions. There are several circumstances which may lead to endogeneity. As shown in section 1.6.7 omitted variables lead to biased and inconsistent OLS estimates of regression coefficients, because regressors and disturbances are correlated in this case. Two further reasons for endogeneity are measurement errors and simultaneity (see below). Roberts and Whited (2012) provide a comprehensive treatment of endogeneity, its sources, and econometric techniques aimed at addressing that problem in the context of corporate finance.
$X$ and $\epsilon$ are correlated in case of measurement errors (or errors-in-variables) (see Greene, 2003, section 5.6). For instance, the Fama-MacBeth two-step procedure mentioned in example 6 leads to an errors-in-variables problem, since the second step uses generated regressors. To provide some insight into the associated consequences we consider the regression $y=X\beta+\epsilon$, where $X$ is measured with error: $\tilde{X}=X+u$. $u$ is a mean-zero error, uncorrelated with $y$ and $X$. Upon substituting $X$ with $\tilde{X}-u$ we obtain
$$y = (\tilde{X}-u)\beta + \epsilon = \tilde{X}\beta + v \qquad v = \epsilon - u\beta.$$
Regressors and disturbances in this regression are correlated, since
$$\mathrm{E}[\tilde{X}'v] = \mathrm{E}[(X+u)'(\epsilon-u\beta)] = \mathrm{E}[X'\epsilon + u'\epsilon - X'u\beta - u'u\beta] = -\mathrm{E}[u'u]\beta = -\sigma_u^2\beta.$$
Note that all coefficients are biased and inconsistent, even if only one of the regressors in $X$ is measured with error (see Greene, 2003, p.85). The bias can be derived on the basis of relation (4):
$$\mathrm{E}[b] = \beta + \mathrm{E}\big[(\tilde{X}'\tilde{X})^{-1}\big]\,\mathrm{E}[\tilde{X}'v] = \beta - \sigma_u^2\,\mathrm{E}\big[(\tilde{X}'\tilde{X})^{-1}\big]\beta.$$
In case of the slope in a simple regression we have
$$\mathrm{E}[b] = \beta\left(1 - \frac{\sigma_u^2}{\sigma_{\tilde{x}}^2}\right).$$
Since $\mathrm{E}[X'u]=0$ and $\sigma_{\tilde{x}}^2=\sigma_x^2+\sigma_u^2$ we obtain
$$\mathrm{E}[b] = \beta\,\frac{\sigma_x^2}{\sigma_x^2+\sigma_u^2}.$$
This shows that $b$ will be biased towards zero, since the term in parenthesis is less than one.
44 Most of this section is based on Greene (2003), section 5.4, and Wooldridge (2002), sections 5.1 and
6.2.
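The attenuation bias is easy to verify by simulation. In the following sketch $\sigma_x^2=\sigma_u^2=1$ (an arbitrary choice for the illustration), so the OLS slope converges to $\beta/2$ rather than $\beta$:

import numpy as np

rng = np.random.default_rng(4)
n, beta = 100_000, 1.0
x = rng.normal(0.0, 1.0, n)            # true regressor, var_x = 1
u = rng.normal(0.0, 1.0, n)            # measurement error, var_u = 1
x_tilde = x + u                        # observed regressor
y = beta * x + rng.normal(0.0, 0.5, n)

b = (x_tilde @ y) / (x_tilde @ x_tilde)
print(b)   # close to beta * var_x/(var_x + var_u) = 0.5, not to beta = 1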
A very typical case of endogeneity is simultaneity (see Greene, 2003, p.378). Simultaneity arises if at least one explanatory variable $x_k$ is not exogenous but is determined (partly) as a function of $y$, and thus $x_k$ and $\epsilon$ are correlated. A frequently used example is the case of demand and supply functions in the following simple model:
$$d = \alpha_0^d + \alpha_1^d p + \epsilon^d \qquad s = \alpha_0^s + \alpha_1^s p + \epsilon^s \qquad d = s \text{ (in equilibrium)}.$$
Supply and demand differ conceptually but usually they cannot be separately measured. Thus, we can only observe quantities sold $q$ (representing the market equilibrium values of $d$ and $s$). Using $q=d=s$ we can solve the equations
$$q = \alpha_0^d + \alpha_1^d p + \epsilon^d \qquad q = \alpha_0^s + \alpha_1^s p + \epsilon^s$$
for $p$ and obtain
$$p = \frac{\alpha_0^d - \alpha_0^s}{\alpha_1^s - \alpha_1^d} + \frac{\epsilon^d - \epsilon^s}{\alpha_1^s - \alpha_1^d}.$$
Thus, $p$ is a function of the disturbances from both equations, and is thus endogenous in a regression of $q$ on $p$. Its covariance with $\epsilon^d$ and $\epsilon^s$ is given by⁴⁵
$$\mathrm{cov}[p,\epsilon^d] = \frac{\mathrm{V}[\epsilon^d]}{\alpha_1^s-\alpha_1^d} \qquad \mathrm{cov}[p,\epsilon^s] = -\frac{\mathrm{V}[\epsilon^s]}{\alpha_1^s-\alpha_1^d}.$$
In this case, endogeneity is a consequence of market equilibrium. If we estimate a regression of $q$ on $p$ it is not clear whether the result is the slope of a demand or supply function. It can be shown, however, that the probability limit of the coefficient of $p$ in a regression of $q$ on $p$ is given by (see Hayashi, 2000, p.189)
$$\frac{\alpha_1^d\,\mathrm{V}[\epsilon^s] + \alpha_1^s\,\mathrm{V}[\epsilon^d]}{\mathrm{V}[\epsilon^s]+\mathrm{V}[\epsilon^d]}.$$
Thus, the estimated coefficient is neither the slope of the demand nor the supply function, but a weighted average of both. If supply shocks dominate (i.e. $\mathrm{V}[\epsilon^s]>\mathrm{V}[\epsilon^d]$) the estimate will be closer to the slope $\alpha_1^d$ of the demand function. This may hold in case of agricultural products which are more exposed to supply shocks (e.g. weather conditions). Positive(!) slopes of a "demand" function may be found in case of manufactured goods which are more subject to demand shifts over the business cycle. In an extreme case, where there are no demand shocks, the observed quantities correspond to the intersections of a (constant) demand curve and many supply curves. This would allow us to identify the estimated slope as the slope of a demand function. This observation will be the basis for a solution to the endogeneity bias described below.
⁴⁵ For simplicity we assume that $\epsilon^d$ and $\epsilon^s$ are uncorrelated.
1.9.2 Instrumental variable estimation
Instrumental variable (IV) or two-stage least squares (2SLS) estimation is a method that can be used when exogeneity is violated. We start by considering the case of only one endogenous regressor $x_k$ being correlated with $\epsilon$ in the regression
$$y = \beta_0 + \sum_{j=1}^{k}\beta_j x_j + \epsilon. \qquad (32)$$
IV-estimation is based on using an observable variable $z$ (the so-called instrumental variable or instrument) which satisfies the following conditions:
1. the instrument $z$ must be uncorrelated with $\epsilon$ (orthogonality condition):
$$\mathrm{corr}[z,\epsilon] = 0.$$
This condition implies that the instrument and a potentially omitted variable (absorbed in $\epsilon$) must be uncorrelated. If this condition is violated, the instrument is considered to be invalid.
2. the coefficient $b_z$ of $z$ in the so-called first-stage regression
$$x_k = b_0 + \sum_{j=1}^{k-1}b_j x_j + b_z z + v = \hat{x}_k + v$$
must be non-zero: $b_z\neq 0$. If this condition (which is also called relevance condition) is violated, the instrument is called weak.
3. $z$ must not appear in the original regression (32). Roberts and Whited (2012) call $\mathrm{corr}[z,\epsilon]=0$ the exclusion condition, expressing that the only way $z$ may affect $y$ is through the endogenous regressor $x_k$ (but not directly via equation 32).
It is frequently stated that $z$ must be correlated with the endogenous regressor. Note that the second requirement is stronger, since it refers to the partial correlation between $z$ and $x_k$. Whereas the second condition can be tested, the first requirement of zero correlation between $z$ and $\epsilon$ cannot be tested directly. Since $\epsilon$ is unobservable, this assumption must be maintained or justified on the basis of economic arguments. However, in section 1.9.3 we will show that it is possible to test for exogeneity of regressors and the appropriateness of instruments.
Consistent estimates can be obtained by using $\hat{x}_k$ from the first-stage regression to replace $x_k$ in the original regression (32). $\hat{x}_k$ is exogenous by construction, since it only depends on exogenous regressors and an instrument uncorrelated with $\epsilon$. Thus, the endogeneity is removed from equation (32), and the resulting IV-estimates of its parameters are consistent.
We briefly return to the example of supply and demand functions described in section 1.9.1. A suitable instrument for a demand equation is an observable variable which leads to supply shifts and hence price changes (e.g. temperature variations affect the supply of coffee and its price). We are able to identify the demand function and estimate its slope, if the instrument is uncorrelated with demand shocks (i.e. temperature has little or no impact on the demand for coffee). IV-estimates can be obtained by first regressing price on temperature (and other regressors), and then use the fitted prices as a regressor (among others) to explain quantities sold. Consequently, the second regression can be considered to be a demand equation.
In general, IV-estimation replaces those elements of $X$ which are correlated with $\epsilon$ by a set of instruments which are uncorrelated with $\epsilon$, but related to the endogenous elements of $X$ by first-stage regressions. The number of instruments can be larger than the number of endogenous regressors. However, the number of instruments must not be less than the number of (potentially) endogenous regressors (this is the so-called order condition). To derive the IV-estimator in more general terms we define a matrix $Z$ of exogenous regressors which includes the exogenous elements of $X$ (including the constant) and the instruments. Regressors suspected to be endogenous are not included in $Z$. We assume that $Z$ is uncorrelated with $\epsilon$:
$$\mathrm{E}[Z'\epsilon] = 0. \qquad (33)$$
IV-estimation can be viewed as a two-stage LS procedure. The first stage involves regressing each original regressor $x_i$ on all instruments $Z$:
$$x_i = Zb_{iz} + v_i = \hat{x}_i + v_i \qquad i=1,\ldots,K.$$
Regressors in $Z$ that are also present in the original matrix $X$ are exactly reproduced by this regression. The resulting fitted values are used to construct the matrix $\hat{X}$ which is equal to $X$, except for the columns which correspond to the (suspected) endogenous regressors. $\hat{X}$ is used in the second stage in the regression
$$y = \hat{X}b_{iv} + e. \qquad (34)$$
Since $\hat{X}$ only represents exogenous information the IV-estimator given by
$$b_{iv} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$$
is consistent. $\hat{X}$ can be written as (see Greene, 2003, p.78)
$$\hat{X} = Z(Z'Z)^{-1}Z'X = Zb_z, \qquad (35)$$
where $b_z$ is a matrix (or vector) of coefficients from first-stage regressions. If $Z$ has the same number of columns as $X$ (i.e. the number of instruments is equal to the number of endogenous regressors) the IV-estimator is given by
$$b_{iv} = (Z'X)^{-1}Z'y.$$
If the following conditions are satisfied
$$\mathrm{plim}\,\frac{1}{n}Z'Z = Q_{zz} \quad |Q_{zz}|>0 \qquad \mathrm{plim}\,\frac{1}{n}Z'X = Q_{zx} \quad |Q_{zx}|>0 \qquad \mathrm{plim}\,\frac{1}{n}Z'\epsilon = 0,$$
the IV-estimator can be shown to be consistent
$$\mathrm{plim}\,b_{iv} = \beta + \left(\mathrm{plim}\,\frac{1}{n}Z'X\right)^{-1}\mathrm{plim}\,\frac{1}{n}Z'\epsilon = \beta, \qquad (36)$$
and asymptotically normal (see Greene, 2003, p.77):
$$b_{iv} \overset{a}{\sim} \mathrm{N}\!\left(\beta,\;\frac{\sigma^2}{n}\,Q_{zx}^{-1}Q_{zz}Q_{xz}^{-1}\right).$$
The asymptotic variance of $b_{iv}$ is estimated by
$$\tilde{s}_e^2\,(Z'X)^{-1}Z'Z(X'Z)^{-1}, \qquad (37)$$
where
$$\tilde{s}_e^2 = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'b_{iv})^2.$$
Note that the standard errors and $\tilde{s}_e^2$ are derived from residuals based on $X$ rather than $\hat{X}$:
$$e_{iv} = y - Xb_{iv}. \qquad (38)$$
This further implies that the $R^2$ from IV-estimation based on these residuals cannot be interpreted. The variance of $e_{iv}$ can even be larger than the variance of $y$, and thus $R^2$ can become negative.
If the IV-estimation is done in two OLS-based steps, the standard errors from running the regression (34) will differ from those based on (37), and thus will be incorrect (see Wooldridge, 2002, p.91). We also note that the standard errors of $b_{iv}$ are always larger than the OLS standard errors, since $b_{iv}$ only uses that part of the (co)variance of $X$ which appears in the fitted values $\hat{X}$. Thus, the potential reduction of the bias is associated with a loss in efficiency.
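The two stages and the covariance estimate can be condensed into a short function (a sketch; two_sls is an illustrative name, not a library routine):

import numpy as np

def two_sls(y, X, Z):
    # X (n x K) may contain endogenous columns; Z (n x L), L >= K, holds
    # all exogenous regressors plus the instruments
    n = len(y)
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # first-stage fitted values
    b_iv = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
    e_iv = y - X @ b_iv                               # residuals based on X, see (38)
    s2 = e_iv @ e_iv / n
    cov = s2 * np.linalg.inv(Xhat.T @ Xhat)           # reduces to (37) if L = K
    return b_iv, np.sqrt(np.diag(cov)), e_iv

# illustrative use: one endogenous regressor, one instrument, simulated data
rng = np.random.default_rng(5)
n = 2000
z, eps = rng.normal(size=n), rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)          # endogenous: correlated with eps
y = 1.0 + 2.0 * x + eps
b, se, _ = two_sls(y, np.column_stack([np.ones(n), x]),
                   np.column_stack([np.ones(n), z]))
print(b, se)                                          # b close to (1, 2); OLS is biased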
So far, the treatment of this subject has been based on the assumption that instruments satisfy the conditions stated above. However, violations of these conditions may have serious consequences (see Murray, 2006, p.124). First, invalid instruments (correlated with the disturbance term) yield biased and inconsistent estimates, which can be even more biased than the OLS estimates. Second, if instruments are too weak it may not be possible to eliminate the bias associated with OLS, and standard errors may be misleading even in very large samples. Thus, the selection of instruments is crucial for the properties of IV-estimates. In the next section we will investigate those issues more closely.
1.9.3 Selection of instruments and tests
Suitable instruments are usually hard to find. In the previous section we have mentioned weather conditions. They may serve as instruments to estimate a demand equation for coffee, since they may be responsible for supply shifts, and they are correlated with prices but not with demand shocks. In a study on crime rates Levitt (1997) uses data on electoral cycles as instruments to estimate the effects associated with hiring policemen on crime rates. Such a regression is subject to a simultaneity bias since more policemen should lower crime rates; however, cities with a higher crime rate tend to hire more policemen. Electoral cycles may be suitable instruments: they are exogenously given (predetermined), and expenditures on security may be higher during election years (i.e. the instrument is correlated with the endogenous regressor). In a time series context, lagged variables of the endogenous regressors may be reasonable candidates. The instruments $X_{t-1}$ may be highly correlated with $X_t$ but uncorrelated with $\epsilon_t$. According to Roberts and Whited (2012) suitable instruments can be derived from biological or physical events or features. They stress the importance of understanding the economics of the question at hand, i.e. that the instrument must only affect $y$ via the endogenous regressor. For example, institutional changes may be suitable as long as the economic question under study was not one of the original reasons for the institutional change.
Note that instruments are distinctly different from proxy variables (which may serve as a substitute for an otherwise omitted regressor). Whereas a proxy variable should be highly correlated with an omitted regressor, an instrument must be highly correlated with a (potentially) endogenous regressor $x_k$. However, the higher the correlation of an instrument $z$ with $x_k$, the less justified may be the assumption that $z$ is uncorrelated with $\epsilon$ (given that the correlation between $x_k$ and $\epsilon$ is causing the problem in the first place!).
IV-estimation will only lead to consistent estimates if suitable instruments are found (i.e. $\mathrm{E}[Z'\epsilon]=0$ and $b_z\neq 0$). If instruments are not exogenous because (33) is violated (i.e. $\mathrm{E}[Z'\epsilon]\neq 0$) the IV-estimator is inconsistent (see (36)). Note also that the consistency of $b_{iv}$ critically depends on $\mathrm{cov}[Z,X]$ even if $\mathrm{E}[Z'\epsilon]\approx 0$. (36) shows that poor instruments (being only weakly correlated with $X$) may lead to strongly biased estimates even in large samples (i.e. inconsistency prevails). As noted above, estimated IV standard errors are always larger than OLS standard errors, but they can be strongly biased downward when instruments are weak (see Murray, 2006, p.125).
Given these problems associated with IV-estimation, it is important to first test for the endogeneity of a regressor (i.e. does the endogeneity problem even exist?). This can be done with the Hausman test described below. However, that test requires valid and powerful instruments. Thus, it is necessary to first investigate the properties of potential instruments.
Testing $b_z$: Evidence for weak instruments can be obtained from first-stage regressions. The joint significance of the instruments' coefficients can be tested with an F-test (weak instruments will lead to low F-statistics). If weak instruments are found, the worst should be dropped and replaced by more suitable instruments (if possible at all). To emphasize the impact of (very) weak instruments consider an extreme case, where the instruments' coefficients in the first-stage regression are all zero. In that case, $\hat{x}_k$ could not be used in the second stage, since it would be a linear combination of the other exogenous regressors. As a rule of thumb, n times $R^2$ from the first-stage regression should be larger than the number of instruments so that the bias of IV will tend to be less than the OLS bias (see Murray, 2006, p.124). Staiger and Stock (1997) consider instruments to be weak if the first-stage F-statistic, testing the coefficients $b_z$ of instruments entering the first-stage regression for being jointly zero, is less than ten. Dealing with more than one endogenous regressor is even more demanding (see Stock et al., 2002).
Testing $\mathrm{corr}[z,\epsilon]$: If there are more instruments than necessary,⁴⁶ the equation is overidentified. Rather than eliminating some instruments they are usually kept in the model, since they may increase the efficiency of IV-estimates. However, a gain in efficiency requires valid instruments (i.e. they must be truly exogenous). Since invalid instruments lead to inconsistent IV-estimates, it is necessary to test the overidentifying restrictions. Suppose two instruments $z_1$ and $z_2$ are available, but only one ($z_1$) is used in 2SLS (this is the just identified case). Whereas the condition $\mathrm{corr}[z_1,\epsilon]=0$ cannot be tested (both $e$ from (34) and $e_{iv}$ from (38) are uncorrelated with $z$ by the LS principle), we can test whether $z_2$ is uncorrelated with $e$, and may thus be a suitable instrument. The same applies vice versa, if $z_2$ is used in 2SLS and $z_1$ is tested.
In general, overidentifying restrictions can be tested by regressing the residuals from the 2SLS regression $e_{iv}$ as defined in (38) on all exogenous regressors and all instruments (see Wooldridge, 2002, p.123). Valid (i.e. exogenous) instruments should not be related to 2SLS residuals. Under H0: $\mathrm{E}[Z'\epsilon]=0$, the test statistic is $nR^2\sim\chi^2_m$, where $R^2$ is taken from this regression, and m is the number of instruments minus the number of endogenous regressors. A failure to reject the overidentifying restrictions is an indicator of valid instruments.
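A sketch of this test (assuming homoscedastic disturbances; Z holds all exogenous regressors plus all instruments, and m must be supplied by the user):

import numpy as np
from scipy.stats import chi2

def overid_test(e_iv, Z, m):
    # regress the 2SLS residuals on all exogenous regressors and instruments;
    # under H0 (valid instruments) n*R^2 ~ chi2(m)
    n = len(e_iv)
    b, *_ = np.linalg.lstsq(Z, e_iv, rcond=None)
    u = e_iv - Z @ b
    tss = (e_iv - e_iv.mean()) @ (e_iv - e_iv.mean())
    r2 = 1.0 - (u @ u) / tss
    stat = n * r2
    return stat, chi2.sf(stat, m)      # statistic and p-value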
If acceptable instruments have been found, we can proceed to test for the presence of endogeneity. The Hausman test is based on comparing $b_{iv}$ and $b_{ls}$, the IV and OLS estimates of $\beta$. A significant difference is an indicator of endogeneity. The Hausman test is based on the null hypothesis that $\mathrm{plim}(1/n)X'\epsilon=0$ (i.e. H0: $X$ is not endogenous). In this case OLS and IV are both consistent. If H0 does not hold, only IV is consistent. However, a failure to reject H0 may be due to invalid or weak instruments. Murray (2006, p.126) reviews alternative procedures, which are less affected by weak instruments, and provides further useful guidelines.
The Hausman test is based on $d=b_{iv}-b_{ls}$ and H0: $\mathrm{plim}\,d=0$. The Hausman test statistic is given by
$$H = d'\,\widehat{\mathrm{aV}}[d]^{-1}\,d = \frac{1}{s_e^2}\,d'\big[(\hat{X}'\hat{X})^{-1} - (X'X)^{-1}\big]^{-1}d \qquad H \sim \chi^2_m,$$
where $\hat{X}$ is defined in (35), $m=K-K_0$ and $K_0$ is the number of regressors for which H0 need not be tested (because they are known, actually assumed, to be exogenous).
A simplified but asymptotically equivalent version of the test is based on the residuals from the first-stage regression associated with the endogenous regressor ($v_k$), and an auxiliary OLS regression (see Wooldridge, 2002, p.119), where $v_k$ is added to the original regression (32):⁴⁷
$$y = \beta_0 + \sum_{j=1}^{k}\beta_j x_j + b_v v_k + \epsilon.$$
⁴⁶ By the order condition the number of instruments must be greater than or equal to the number of endogenous regressors.
⁴⁷ Note that adding $v_k$ to equation (32) will not change any of the coefficients in (32) estimated by 2SLS.
Note that $v_k$ represents the endogenous part of $x_k$ (if there is any). If there is an endogeneity problem, the coefficient $b_v$ will pick up this effect (i.e. the endogenous part of $\epsilon$ will move to $b_v v_k$). Thus, endogeneity can be tested by a standard t-test of $b_v$ (based on heteroscedasticity-consistent standard errors). If the coefficient is significant, we reject H0 (no endogeneity) and conclude that the suspected regressor is endogenous. However, failing to reject H0 need not indicate the absence of endogeneity, but may be due to weak instruments. If more than one regressor is suspected to be endogenous, a first-stage regression is run for each one of them. All residuals thus obtained are added to the original equation, and an F-test for the residuals' coefficients being jointly equal to zero can be used.
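A sketch of the regression-based version with a single suspected regressor $x_k$ (for brevity plain OLS standard errors are used here; as noted above, robust standard errors are preferable):

import numpy as np

def endogeneity_ttest(y, X_exog, x_k, Z):
    # Z: all exogenous regressors plus instruments; x_k: suspected regressor
    n = len(y)
    bz, *_ = np.linalg.lstsq(Z, x_k, rcond=None)       # first-stage regression
    v_k = x_k - Z @ bz                                 # first-stage residuals
    Xa = np.column_stack([X_exog, x_k, v_k])           # augmented regression
    b, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    e = y - Xa @ b
    s2 = e @ e / (n - Xa.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xa.T @ Xa)))
    return b[-1] / se[-1]                              # t-statistic of v_k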
Example 20: We use a subset of wage data from Wooldridge (2002, p.89, example 5.2)⁴⁸ to illustrate IV-estimation. Wages are assumed to be (partially) determined by the unobservable variable ability. Another regressor in the wage regression is education (measured by years of schooling), which can be assumed to be correlated with ability. Since ability is an omitted variable, which is correlated with (at least) another regressor, this will lead to inconsistent OLS estimates. The dummy variable 'near' indicates whether someone grew up near a four-year college. This variable can be used as an instrument: it is exogenously given (i.e. uncorrelated with the error term which contains 'ability'), and most likely correlated with education. The first-stage regression shows that the coefficient of the instrument is highly significant (i.e. there seems to be no danger of using a weak instrument). The condition $\mathrm{corr}[z,\epsilon]=0$ cannot be tested since only one instrument is available. The coefficient of $v$ (the residual from the first-stage regression) is highly significant (p-value 0.0165) which indicates an endogeneity problem (as expected). Comparing OLS and IV estimates shows that the IV coefficient of education is three times as high as the OLS coefficient. However, the IV standard error is more than ten times as high as the OLS standard error. Similar results are obtained for other coefficients (see wage.wf1 and wage.xls).
⁴⁸ Source: Go to the book companion site of Wooldridge (2003) at http://www.cengage.com/ (latest edition), click on "Data Sets", download one of the zip-files and choose "CARD.*"; 874 observations have been extracted from the original sample.
1.9.4 Example 21: Consumption based asset pricing
We consider an investor who maximizes the expected utility of present and future consumption by solving
$$\max\;\mathrm{E}_t\left[\sum_{\tau=0}^{T}\delta^\tau U(C_{t+\tau})\right],$$
subject to the budget constraint
$$C_t + W_t = L_t + (1+R_t)W_{t-1},$$
where $\delta$ is the time discount factor, $C_t$ is the investor's consumption, $W_t$ is financial wealth, $L_t$ is labor income and $R_t$ is the return from investments. A first-order condition for the investor's intertemporal consumption and investment problem is given by the intertemporal Euler equation
$$\delta\,\mathrm{E}_t[U'(C_{t+1})(1+R_{t+1})] = U'(C_t).$$
We derive this equation in a simplified, two-period setting, ignoring labor income. In this case the objective is given by
$$\max\;\mathrm{E}_t[U(C_t) + \delta U(C_{t+1})],$$
and the (reformulated) budget constraint is
$$C_{t+1} = (1+R_{t+1})W_t - W_{t+1}.$$
The agent has to decide how much to consume and invest now (in t) in order to maximize expected utility. This implies maximizing current and expected utility with respect to either current consumption or wealth (because of the budget constraint, only one of the two needs to be determined):
$$\max_{W_t}\;U(C_t) + \delta\,\mathrm{E}_t\big[U((1+R_{t+1})W_t - W_{t+1})\big].$$
Taking first derivatives and applying the chain rule gives
$$-U'(C_t) + \delta\,\mathrm{E}_t\big[U'((1+R_{t+1})W_t - W_{t+1})(1+R_{t+1})\big] = 0,$$
which simplifies to
$$U'(C_t) = \delta\,\mathrm{E}_t[U'(C_{t+1})(1+R_{t+1})].$$
Thus, the optimal solution is obtained by equating the expected marginal utility from investment to the marginal utility of consumption. The Euler equation can be rewritten in terms of the so-called stochastic discount factor $M_t$
$$\mathrm{E}_{t-1}[(1+R_t)M_t] = 1 \qquad M_t = \delta\,\frac{U'(C_t)}{U'(C_{t-1})}.$$
This equation is also known as the consumption based CAPM.
Using the power utility function
$$U(C) = \frac{C^{1-\gamma}}{1-\gamma} \qquad U'(C) = C^{-\gamma} \qquad (\gamma\;\ldots\;\text{coefficient of relative risk aversion}),$$
the Euler equation is given by
$$\mathrm{E}_{t-1}\left[\delta(1+R_t)\left(\frac{C_t}{C_{t-1}}\right)^{-\gamma}\right] = 1 \qquad\text{or}\qquad \mathrm{E}_{t-1}[(1+R_t)C_t^{-\gamma}] = \delta^{-1}C_{t-1}^{-\gamma}.$$
Campbell et al. (1997, p.306) assume that $R_t$ and $C_t$ are lognormal and homoscedastic, use the relation $\ln\mathrm{E}[X]=\mathrm{E}[\ln X]+0.5\,\mathrm{V}[\ln X]$ (see (41) in section 2.1.2), and take logs of the term in square brackets to obtain
$$\mathrm{E}_{t-1}[\ln(1+R_t)] + \ln\delta - \gamma\,\mathrm{E}_{t-1}[\Delta\ln C_t] + 0.5c = 0.$$
$c$ is a constant (by the assumption of homoscedasticity) involving variances and covariances of $R_t$ and $C_t$. This equation implies a linear relation between expected returns and expected consumption growth, and can be used to estimate $\gamma$. We replace expectations by observed data on log returns $y_t=\ln(1+R_t)$ and consumption growth $c_t=\Delta\ln C_t$:
$$y_t = \mathrm{E}_{t-1}[\ln(1+R_t)] + a_t \qquad c_t = \mathrm{E}_{t-1}[\Delta\ln C_t] + b_t.$$
Replacing expectations by observed variables implies measurement errors, which further lead to inconsistent OLS estimates if obtained from the regression equation (as shown in section 1.9.1)
$$y_t = \alpha + \gamma c_t + \eta_t \qquad \alpha = -\ln\delta - 0.5c \qquad \eta_t = a_t - \gamma b_t.$$
We estimate this equation to replicate parts of the analysis by Campbell et al. (1997, p.311) using a slightly different annual dataset (n=105) prepared by Shiller.⁴⁹ Details of estimation results can be found in the files ccapm.wf1 and ccapm.xls. Using OLS, the estimated equation is (p-values in parentheses)
$$y_t = 0.0057 + 2.75\,c_t + u_t \qquad R^2 = 0.31.$$
(.72) (.00)
The estimate 2.75 is in a plausible range. However, OLS estimation is not appropriate since the regressor $c_t$ is correlated with $\eta_t$ via $b_t$ (unless $\gamma=0$). An IV-estimate of $\gamma$ can be obtained using instruments which are assumed to be correlated with consumption growth. Campbell et al. use lags of the real interest rate $i_t$ and the log dividend-price ratio $d_t$ as instruments, arguing that $\eta_t$ is uncorrelated with any variables in the information set from t-1. Using 2SLS we obtain
$$y_t = 0.29 - 11.2\,c_t + e_t,$$
(.43) (.53)
⁴⁹ http://www.econ.yale.edu/~shiller/data/chapt26.xls.
which shows that the estimated $\gamma$ is insignificant (which saves us from the need to explain the unexpected negative sign). We use these instruments to test for their suitability, and subsequently, for testing the endogeneity of $c_t$. The first-stage regression of $c_t$ on the instruments yields
$$c_t = 0.013 - 0.042\,i_{t-1} - 0.0025\,d_{t-1} + v_t \qquad R^2 = 0.006.$$
(.64) (.46) (.78)
Obviously, the requirement of high correlation of instruments with the possibly endogenous regressor is not met. The F-statistic is 0.33 with a p-value of 0.72, and the instruments are considered to be very weak. $nR^2<2$ indicates that the IV-bias may be substantially larger than the OLS-bias.
To test the validity of the instruments based on overidentifying restrictions, we run the regression ($e_t$ are the 2SLS residuals)
$$e_t = 0.155 - 0.125\,i_{t-1} + 0.048\,d_{t-1} + a_t \qquad R^2 = 0.0014.$$
(.71) (.88) (.72)
The test statistic is $115\cdot 0.0014=0.16$, with a p-value of 0.69 (m=1). We cannot reject the overidentifying restrictions (i.e. instruments are uncorrelated with the 2SLS residuals $e_t$), and the instruments can be considered to be valid.
Despite this ambiguous evidence regarding the appropriateness of the instruments, we use the residuals from the first-stage regression to test for endogeneity:
$$y_t = 0.29 - 11.2\,c_t + 14.1\,v_t + w_t \qquad R^2 = 0.35.$$
(.005) (.024) (.005)
The coefficient of $v_t$ is significant at low levels, and we can firmly reject the H0 of exogeneity of $c_t$ (as expected).
This leaves us with conflicting results. On the one hand, we have no clear evidence regarding the appropriateness of the instruments (by comparing the first-stage regression and the overidentification test). On the other hand, we can reject exogeneity of $c_t$ on theoretical grounds, but obtain no meaningful estimate for $\gamma$ using 2SLS. From OLS we obtain a reasonable estimate for $\gamma$, but OLS would only be appropriate if $c_t$ was truly exogenous (which is very doubtful).
In a related study, Yogo (2004) applies 2SLS to estimate the elasticity of intertemporal substitution, which is the reciprocal of the risk aversion coefficient $\gamma$ for a specific choice of parameters in the Epstein-Zin utility function. He shows that using weak instruments (i.e. nominal interest rate, inflation, consumption growth, and the log dividend-price ratio lagged twice) leads to biased estimates and standard errors. Yogo's results imply that the lower end of a 95% confidence interval for $\gamma$ is around 4.5 for the US, and not less than 2 across eleven developed countries.
Exercise 13: Use the data from example 21 to replicate the analysis using the same instruments lagged by one and two periods.
Exercise 14: Use the monthly data in the file ie_data.xls which is based on data prepared by Shiller⁵⁰ and Verbeek⁵¹ to replicate the analysis from example 21.
⁵⁰ http://www.econ.yale.edu/~shiller
⁵¹ http://eu.wiley.com/legacy/wileychi/verbeek2ed/
Exercise 15: Use the weekly data in the file JEC.* downloaded from the companion website of http://www.pearsonhighered.com/stock_watson/ to estimate a demand equation for the quantity of grain shipped. Use "price", "ice" and "seas1" to "seas12" as explanatory variables. Discuss potential endogeneity in this equation. Consider using "cartel" as an instrument. Discuss and test the appropriateness of "cartel" as an instrument.
1.10 Generalized method of moments⁵²
Review 7:⁵³ In the method of moments the parameters of a distribution are estimated by equating sample and population moments. Given a sample $x_1,\ldots,x_n$ of independent draws from a distribution, the moment condition
$$\mathrm{E}[X] - \mu = 0$$
is replaced by the sample analog
$$\frac{1}{n}\sum_{i=1}^n x_i - \mu = 0$$
to obtain the sample mean $\bar{x}$ as an estimate of $\mu$. In general, to obtain estimates for a $K\times 1$ parameter vector $\theta$ we have to consider K (sample) moment conditions
$$\mathrm{E}[m_j(X,\theta)] = 0 \qquad \frac{1}{n}\sum_{i=1}^n m_{ij}(x_i,\hat{\theta}) = 0 \qquad j=1,\ldots,K,$$
where $m_{ij}(x_i,\hat{\theta})$ are suitable functions of the sample and the parameter vector. The K parameters can be estimated by solving the system of K equations. The moment estimators are based on averages of functions. By the consistency of a mean of functions (see review 5) they converge to their population counterparts. They are consistent, but not necessarily efficient estimates. In many cases their asymptotic distribution is normal.
Example 22: The first and second central moments of the Gamma distribution with density
$$f(x) = \frac{(x/\beta)^{\alpha-1}e^{-x/\beta}}{\beta\,\Gamma(\alpha)}$$
are given by
$$\mathrm{E}[X] = \mu = \alpha\beta \qquad\text{and}\qquad \mathrm{E}[(X-\mu)^2] = \sigma^2 = \alpha\beta^2.$$
To estimate the two parameters $\alpha$ and $\beta$ we define two functions
$$m_{i1} = x_i - ab \qquad m_{i2} = (x_i - ab)^2 - ab^2 = x_i^2 - (ab^2 + a^2b^2).$$
The two sample moment conditions
$$\frac{1}{n}\sum_{i=1}^n x_i - ab = \bar{x} - ab = 0 \qquad \frac{1}{n}\sum_{i=1}^n x_i^2 - (ab^2 + a^2b^2) = 0$$
can be used to estimate a and b by solving the equations⁵⁴
$$ab - \bar{x} = 0 \qquad\text{and}\qquad ab^2 - \tilde{s}^2 = 0,$$
which yields $b=\tilde{s}^2/\bar{x}$ and $a=\bar{x}^2/\tilde{s}^2$.
⁵² This section is based on selected parts from Greene (2003), section 18. An alternative source is Cochrane (2001), sections 10 and 11. Note that Cochrane uses a different notation with $u_t$ for $m_t$, $g_T$ for $\bar{m}$, $S$ for $\Phi$ and $d$ for $G$.
⁵³ For details see Greene (2003), p.527 or Hastings and Peacock (1975), p.68.
⁵⁴ $\tilde{s}^2$ is the unadjusted sample variance $\tilde{s}^2=(1/n)\sum(x_i-\bar{x})^2$.
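A quick check of these two estimators by simulation (Python; NumPy's Gamma parametrization uses shape $\alpha$ and scale $\beta$):

import numpy as np

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # alpha = 2, beta = 3

xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()        # unadjusted sample variance s~^2
print(xbar**2 / s2, s2 / xbar)       # a (close to 2) and b (close to 3)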
Example 23: We consider the problem of a time series of prices observed at irregularly spaced points in time (i.e. the intervals between observations have varying length). We want to compute mean and standard deviation of returns for a comparable (uniform) time interval by applying the method of moments (see file irregular.xls for a numerical example).
The observed returns are assumed to be determined by the following process:
$$Y(\Delta_i) = \mu\Delta_i + Z_i\,\sigma\sqrt{\Delta_i} \qquad (i=1,\ldots,n),$$
where $\Delta_i$ is the length of the time interval (e.g. measured in days) used to compute the i-th return. $Z_i$ is a pseudo-return with mean zero and standard deviation one. $\mu$ and $\sigma$ are mean and standard deviation of returns associated with the base interval. Assuming that $Z_i$ and $\Delta_i$ are independent (i.e. $\mathrm{E}[Z_i\sqrt{\Delta_i}]=0$), we can take expectations on both sides, and replace these by sample averages to obtain
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y(\Delta_i) = \mu\,\frac{1}{n}\sum_{i=1}^n \Delta_i = \mu\bar{\Delta},$$
from which we can estimate $\hat{\mu}=\bar{Y}/\bar{\Delta}$.
To estimate the standard deviation $\sigma$ we use
$$\frac{(Y(\Delta_i)-\mu\Delta_i)^2}{\Delta_i} = Z_i^2\sigma^2.$$
Taking expectations and using sample averages we obtain (note that $\mathrm{E}[Z^2]=\mathrm{V}[Z]=1$):
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \frac{(Y(\Delta_i)-\hat{\mu}\Delta_i)^2}{\Delta_i}.$$
1.10.1 OLS, IV and GMM
Generalized method of moments (GMM) can not only be used to estimate the parameters of a distribution, but also to estimate the parameters of an econometric model by generalizing the method of moments principle. GMM has its origins and motivation in the context of asset pricing and modeling rational expectations (see Hansen and Singleton, 1996). One of the main objectives was to estimate models without making strong assumptions about the distribution of returns.
We start by showing that the OLS estimator can be regarded as a method of moments estimator. Assumption AX in the context of the regression model $y=X\beta+\epsilon$ implies the orthogonality condition
$$\mathrm{E}[X'\epsilon] = \mathrm{E}[X'(y-X\beta)] = 0.$$
To estimate the $K\times 1$ parameter vector $\beta$ we define K functions and apply them to each observation in the sample⁵⁵
$$m_{ij}(b) = x_{ij}(y_i - x_i'b) = x_{ij}e_i \qquad i=1,\ldots,n;\;j=1,\ldots,K.$$
The moment conditions are the sample averages
$$\bar{m}_j = \frac{1}{n}\sum_{i=1}^n m_{ij} = 0 \qquad j=1,\ldots,K,$$
which are identical to the normal equations (2) which have been used to derive the OLS estimator in section 1.1:
$$\frac{1}{n}\sum_{i=1}^n x_ie_i = \frac{1}{n}\sum_{i=1}^n x_i(y_i-x_i'b) = \frac{1}{n}X'e = 0.$$
If some of the regressors are (possibly) endogenous it is not appropriate to impose the orthogonality condition. Suppose there are instruments $Z$ available for which $\mathrm{E}[Z'\epsilon]=0$ holds. If $Z$ has dimension $n\times K$ (the same as $X$) we can obtain IV-estimates from
$$\frac{1}{n}\sum_{i=1}^n z_i(y_i-x_i'b) = 0.$$
If there are more instruments than parameters we can specify the conditions
$$\frac{1}{n}\hat{X}'e = 0,$$
where $\hat{X}$ is defined in (35). Using $\hat{X}$ generates K conditions, even when there are L>K instruments.⁵⁶
⁵⁵ The notation $m_{ij}=m_j(y_i,x_{ij})$ is used to indicate the dependence of the j-th moment condition on observation i.
⁵⁶ More instruments than necessary can be used to generate overidentifying restrictions and can improve the efficiency of the estimates.
The homoscedasticity assumption implies that the variance of residuals is uncorrelated with the regressors. This can be expressed as
$$\mathrm{E}[x_i(y_i-x_i'\beta)^2] - \mathrm{E}[x_i]\,\mathrm{E}[\epsilon_i^2] = 0.$$
If the model specification is correct the following expression
$$\frac{1}{n}\sum_{i=1}^n x_ie_i^2 - \bar{x}\,\tilde{s}_e^2$$
should be close to zero.
GMM can also be based on conditional moment restrictions of the form
$$\mathrm{E}[\epsilon|X] = 0.$$
This implies that $\epsilon$ is not only uncorrelated with $X$ but with any function of $X$. Thus, depending on the way the conditional expectation is formulated, such conditions can be much stronger than unconditional restrictions. In a time series context, it can be assumed that the expectation of $\epsilon$ conditional on past regressors is zero. Other examples are nonlinear functions of $X$, or restrictions on the conditional variance. If $z$ are regressors assumed to determine the (conditional) variance of disturbances, this can be expressed by the moment condition (see FGLS in section 1.8.1)
$$m_i(b) = (y_i - x_i'b)^2 - f(z_i'b_z).$$
1.10.2 Asset pricing and GMM
In example 21 we have considered the Euler equation
$$\mathrm{E}_{t-1}\left[\delta(1+R_t)\left(\frac{C_t}{C_{t-1}}\right)^{-\gamma}\right] = 1,$$
and have shown how to estimate $\gamma$ based on linearizing this equation. An alternative view is to consider the Euler equation as a testable restriction. It should hold for all assets and across all periods. This implies the following sample moment condition:
$$\frac{1}{n}\sum_{t=1}^n m_t(\delta,\gamma) = 0 \qquad m_t(\delta,\gamma) = \delta(1+R_t)\left(\frac{C_t}{C_{t-1}}\right)^{-\gamma} - 1.$$
The returns of at least two assets are required to estimate the parameters $\delta$ and $\gamma$. Note that no linearization or closed-form solution of the underlying optimization problem is required (as opposed to the approach by Campbell et al. (1997) described in example 21). GMM can accommodate more conditions than necessary (i.e. additional instruments can be used to formulate overidentifying restrictions; see section 1.10.3).
In asset pricing or rational expectation models the errors⁵⁷ in expectations should be uncorrelated with all variables in the information set $I_{t-1}$ of agents forming those expectations. This can be used to formulate orthogonality conditions for any instrument $z_{t-1}\in I_{t-1}$ in the following general way:
$$\mathrm{E}[(y_t - x_t'\beta)z_{t-1}] = 0.$$
The Euler equation in the consumption based CAPM is also expressed in terms of a conditional expectation. Thus, for any element of the information set
$$\frac{1}{n}\sum_{t=2}^n m_t(\delta,\gamma)z_{t-1} = 0$$
should hold.
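To make this construction concrete, the sketch below builds such instrumented moment conditions for the Euler equation (illustrative function names; inputs are assumed to be aligned NumPy arrays):

import numpy as np

def euler_moments(theta, R, c_growth, z_lag):
    # R: returns of N assets (n x N); c_growth: gross consumption growth
    # C_t/C_{t-1} (length n); z_lag: instruments from I_{t-1} (n x L0)
    delta, gamma = theta
    u = delta * (1.0 + R) * c_growth[:, None] ** (-gamma) - 1.0   # n x N errors
    # interact each pricing error with each instrument -> n x (N*L0) conditions
    return (u[:, :, None] * z_lag[:, None, :]).reshape(len(u), -1)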
In example 6 we have briefly described the Fama-MacBeth approach to estimate the parameters of asset pricing models. GMM provides an alternative (and possibly preferable) way to pursue that objective. We consider N assets with excess returns $x_t^i$, and a single-factor model with factor excess return $x_t^m$. The factor model implies that the following equations hold:
$$x_t^i = \beta_i x_t^m + \epsilon_t^i \qquad i=1,\ldots,N,$$
$$\mathrm{E}[x_t^i] = \lambda_m\beta_i \qquad i=1,\ldots,N.$$
Fama and MacBeth estimate $\beta_i$ from the first equation for each asset. Given these estimates, $\lambda_m$ is estimated from a single regression across the second set of equations (using sample means as observations of the dependent variable and estimated beta-factors as observations of the regressor). The CAPM or the APT imply a set of restrictions that should have zero expectation (at the true parameter values). The moment conditions corresponding to the first set of equations for the present example are
$$\frac{1}{n}\sum_{t=1}^n (x_t^i - \beta_i x_t^m)x_t^m = 0 \qquad i=1,\ldots,N.$$
The second set of equations implies
$$\frac{1}{n}\sum_{t=1}^n x_t^i - \lambda_m\beta_i = 0 \qquad i=1,\ldots,N.$$
The generalization to several factors is straightforward.
⁵⁷ These errors are supposed to be evaluated at the true parameters.
1.10.3 Estimation and inference
Generalizing the method of moments we consider the case of L>K moment conditions to estimate K parameters $\theta$. Since there is no unique solution to the overdetermined system of equations we can minimize the sum of squares
$$\sum_{j=1}^L \bar{m}_j^2(\theta) = \bar{m}'\bar{m} \qquad \bar{m} = (\bar{m}_1,\ldots,\bar{m}_L)',$$
where
$$\bar{m}_j(\theta) = \frac{1}{n}\sum_{i=1}^n m_{ij}(\theta) \qquad j=1,\ldots,L.$$
Minimizing this criterion gives consistent but not necessarily efficient estimates of $\theta$. Hansen (1982) has considered estimates based on minimizing the weighted sum of squares
$$J = \bar{m}'W\bar{m}.$$
The weight matrix $W$ has to be positive definite. The choice of $W$ relies on the idea of GLS estimators, with the intention to obtain efficient estimates. Elements of $\bar{m}$ which are more precisely estimated should have a higher weight and have more impact on the value of the criterion function. If $W$ is inversely proportional to the asymptotic covariance of $\bar{m}$, i.e.
$$W = \Phi^{-1} \qquad \Phi = \mathrm{aV}[\sqrt{n}\,\bar{m}],$$
and $\mathrm{plim}\,\bar{m}=0$, the GMM estimates are consistent and efficient.
Before we proceed, we briefly refer to the asymptotic variance of the sample mean $\bar{y}$ (see review 5, p.22). It can be derived from observations $y_i$ and is given by $s^2/n$ where $s^2=(1/(n-1))\sum_i(y_i-\bar{y})^2$. Now, we note that $\bar{m}$ can be viewed as a (multivariate) sample mean. It can be derived from
$$\bar{m} = \frac{1}{n}\sum_{i=1}^n m_i \qquad m_i = m(y_i,x_i),$$
where $m_i$ is a $L\times 1$ vector of conditions evaluated at observation i. Similar to the asymptotic variance of the sample mean, the estimated covariance of $\bar{m}$ can be based on the (estimated) covariance of $m_i$:
$$\hat{\Phi}(\hat{\theta}) = \frac{1}{n-1}\sum_{i=1}^n \big[m_i(\hat{\theta})-\bar{m}\big]\big[m_i(\hat{\theta})-\bar{m}\big]'.$$
In general, the asymptotic covariance matrix of GMM parameter estimates can be estimated by
$$\hat{V} = \frac{1}{n}\left[\hat{G}'\hat{\Phi}^{-1}\hat{G}\right]^{-1}. \qquad (39)$$
$\hat{G}$ is the Jacobian of the moment functions (i.e. the matrix of derivatives of the moment functions with respect to the estimated parameters):
$$\hat{G} = \frac{1}{n}\sum_{i=1}^n \frac{\partial m_i(\hat{\theta})}{\partial\hat{\theta}'}.$$
The columns of the $L\times K$ matrix $\hat{G}$ correspond to the K parameters and the rows to the L moment conditions.
As shown in section 1.10.1, OLS and GMM lead to the same parameter estimates, if GMM is only based on the orthogonality condition $\mathrm{E}[X'\epsilon]$. However, if the covariance of parameters is estimated according to (39), $\hat{\Phi}$ is given by
$$\hat{\Phi} = \frac{1}{n-1}\sum_{i=1}^n m_im_i' = \frac{1}{n-1}\sum_{i=1}^n x_ie_ix_i'e_i = \frac{1}{n-1}\sum_{i=1}^n e_i^2x_ix_i',$$
and $\hat{G}$ is given by
$$\hat{G} = \frac{1}{n}\sum_{i=1}^n \frac{\partial m_i}{\partial b'} = \frac{1}{n}\sum_{i=1}^n \frac{\partial[x_i(y_i-x_i'b)]}{\partial b'} = -\frac{1}{n}X'X.$$
Combining terms we find that the estimated covariance matrix for the GMM parameters of a regression model is given by
$$n\,(X'X)^{-1}\left(\frac{1}{n-1}\sum_{i=1}^n e_i^2x_ix_i'\right)(X'X)^{-1},$$
which corresponds to White's heteroscedasticity consistent estimate (26). In this sense, estimating a regression model by GMM 'automatically' accounts for heteroscedasticity.
In practice, we need to take into account that $\hat{\Phi}$ depends on the (yet to be determined) estimates $\hat{\theta}$. The usually suggested two-step approach starts with an unrestricted estimate $\hat{\theta}_u$ derived from using $W=I$ and minimizing $\bar{m}'\bar{m}$. The resulting estimates $\hat{\theta}_u$ are used to construct $\hat{\Phi}_u$, which is then used in the second step to minimize
$$J = \bar{m}(\hat{\theta})'\,\hat{\Phi}_u^{-1}\,\bar{m}(\hat{\theta}).$$
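A generic two-step sketch along these lines (using a derivative-free optimizer for simplicity; moments must return the $n\times L$ matrix of $m_i(\theta)$, and the function name is illustrative):

import numpy as np
from scipy.optimize import minimize

def gmm_two_step(moments, theta0, *args):
    def J(theta, W):
        mbar = moments(theta, *args).mean(axis=0)
        return mbar @ W @ mbar

    L = moments(theta0, *args).shape[1]
    step1 = minimize(J, theta0, args=(np.eye(L),), method="Nelder-Mead")
    m = moments(step1.x, *args)
    Phi = np.atleast_2d(np.cov(m, rowvar=False))   # (1/(n-1)) sum of centered outer products
    step2 = minimize(J, step1.x, args=(np.linalg.inv(Phi),), method="Nelder-Mead")
    return step2.x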
The asymptotic properties of GMM estimates can be derived on the basis of a set of assumptions (see Greene, 2003, p.540). Among others, the empirical moments are assumed to obey a central limit theorem. They are assumed to have a finite covariance matrix $\Phi/n$, so that
$$\sqrt{n}\,\bar{m} \overset{d}{\to} \mathrm{N}(0,\Phi).$$
Under this and further assumptions (see Greene, 2003, p.540) it can be shown that the asymptotic distribution of GMM estimates is normal, i.e.
$$\hat{\theta} \overset{a}{\sim} \mathrm{N}[\theta, V].$$
The diagonal elements of the estimated covariance matrix $\hat{V}$ can be used to compute t-statistics for the parameter estimates:
$$\frac{\hat{\theta}_j}{\sqrt{\hat{V}_{jj}}} \overset{a}{\sim} \mathrm{N}(0,1).$$
Alternative estimators like the White or the Newey-West estimator can be used if required (see Cochrane, 2001, p.220).
Overidentifying restrictions can be tested on the basis of $nJ\sim\chi^2_{L-K}$. Under the null hypothesis, the restrictions are valid, and the model is correctly specified. Invalid restrictions lead to high values of J and to a rejection of the model. In the just identified case L=K and J=0.
Despite this relatively brief description of GMM, its main advantages should have become clear. GMM does not rely on assumption Aiid, requires no distributional assumptions, it may also be based on conditional moments, and allows for more conditions than parameters to be estimated (i.e. it can be used to formulate and test overidentifying restrictions). The requirements for consistency and asymptotic normality are that $m_i$ must be well behaved (i.e. stationary and ergodic), and the empirical moments must have a finite covariance matrix.
These advantages are not without cost, however. Some of the problems (which have received insufficient space in this short treatment) associated with GMM are: In some cases the first derivative of J may not be known analytically and the optimization of the criterion function J has to be carried out numerically. Moreover, J is not necessarily a convex function which implies that there is no unique minimum, and good starting values are very important for the numerical search algorithm.
1.10.4 Example 24: Models for the short-term interest rate
Chan et al. (1992) use GMM to estimate several models for the short-term interest rate. They consider a general case where the short rate follows the diffusion
$$dr = (\alpha + \beta r)dt + \sigma r^\gamma dZ.$$
By imposing restrictions on the parameters special cases are obtained (e.g. the Vasicek model if $\gamma=0$, or the Brennan-Schwartz model if $\gamma=1$). The discrete-time specification of the model is given by
$$r_t - r_{t-1} = \alpha + \beta r_{t-1} + \epsilon_t \qquad \mathrm{E}[\epsilon_t]=0 \qquad \mathrm{E}[\epsilon_t^2]=\sigma^2 r_{t-1}^{2\gamma}.$$
Using $\theta=(\alpha\;\beta\;\sigma^2\;\gamma)'$, Chan et al. impose the following moment conditions
$$m_t(\theta) = \left[\epsilon_t \quad \epsilon_t r_{t-1} \quad \epsilon_t^2-\sigma^2r_{t-1}^{2\gamma} \quad (\epsilon_t^2-\sigma^2r_{t-1}^{2\gamma})\,r_{t-1}\right]'.$$
Conditions one and three correspond to the mean and variance of $\epsilon_t$. Conditions two and four impose orthogonality between the regressor $r_{t-1}$ and the error from describing the variance of the disturbances $\epsilon_t$. The estimated covariance of the parameter estimates is based on the following components of the Jacobian (rows correspond to conditions, columns to parameters):
$$G_{t,1} = \left[\frac{\partial m_{t,1}}{\partial\alpha}=-1 \quad \frac{\partial m_{t,1}}{\partial\beta}=-r_{t-1} \quad \frac{\partial m_{t,1}}{\partial\sigma^2}=0 \quad \frac{\partial m_{t,1}}{\partial\gamma}=0\right]$$
$$G_{t,2} = \left[-r_{t-1} \quad -r_{t-1}^2 \quad 0 \quad 0\right]$$
$$G_{t,3} = \left[-2m_{t,1} \quad -2m_{t,1}r_{t-1} \quad -r_{t-1}^{2\gamma} \quad -2\sigma^2r_{t-1}^{2\gamma}\ln(r_{t-1})\right]$$
$$G_{t,4} = \left[-2m_{t,1}r_{t-1} \quad -2m_{t,1}r_{t-1}^2 \quad -r_{t-1}^{2\gamma+1} \quad -2\sigma^2r_{t-1}^{2\gamma+1}\ln(r_{t-1})\right].$$
Chan et al. (1992) use monthly observations of the three-month rate for $r_t$ from June 1964 to December 1989. Details of computations and some estimation results can be found in the file ckls.xls. Note that the estimates for $\alpha$, $\beta$ and $\sigma^2$ have to be scaled by $\Delta t=1/12$ to convert them into annual terms, and to make them comparable to the results presented in Chan et al. (1992).
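A sketch of the four moment conditions as a function usable with a two-step GMM routine such as the one in section 1.10.3 (illustrative code, not a replication of the ckls.xls computations):

import numpy as np

def ckls_moments(theta, r):
    # theta = (alpha, beta, sigma2, gamma); r: series of short rates per period
    alpha, beta, sigma2, gamma = theta
    r_lag = r[:-1]
    eps = r[1:] - r_lag - alpha - beta * r_lag
    h = eps**2 - sigma2 * r_lag**(2.0 * gamma)     # error of the variance equation
    return np.column_stack([eps, eps * r_lag, h, h * r_lag])

# e.g. theta_hat = gmm_two_step(ckls_moments, np.array([0.02, -0.2, 0.01, 1.0]), r)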
Exercise 16: Retrieve a series of short-term interest rates from the website http://www.federalreserve.gov/Releases/H15/data.htm or from another source. Estimate two or three different models of the short-term interest rate by GMM.
1.11 Models with binary dependent variables
Review 8: The binomial distribution describes the probabilities associated with a sequence of n independent trials, where each trial has two possible outcomes (usually called success and failure). The probability of success p is the same in each trial. The probability of y successes in n trials is given by
$$f(y) = \binom{n}{y}\,p^y(1-p)^{(n-y)}.$$
Expected value and variance of a binomial random variable are given by np and np(1-p), respectively. If the number of trials in a binomial experiment is large (e.g. np$\geq$5), the binomial distribution can be approximated by the normal distribution. If n=1, the binomial distribution is a Bernoulli distribution.
We now consider the application of regression analysis to the case of binary dependent variables. This applies, for example, when the variable of interest is the result of a choice (e.g. brand choice or choosing means of transport), or an interesting event (e.g. the default of a company or getting unemployed). For simplicity we will only consider the binary case but the models discussed below can be extended to the multinomial case (see Greene (2003), section 21.7).
Observations of the dependent variable y indicate whether the event or decision has taken place or not (y=1 or y=0). The probability for the event is assumed to depend on regressors $X$ and parameters $\beta$, and is expressed in terms of a distribution function F. For a single observation i we specify the conditional probabilities
$$\mathrm{P}[y_i=1] = F(x_i,\beta) \qquad \mathrm{P}[y_i=0] = 1 - F(x_i,\beta).$$
The conditional expectation of $y_i$ is a weighted average of the two possible outcomes:
$$\mathrm{E}[y_i|x_i] = \hat{y}_i = 1\cdot\mathrm{P}[y_i=1] + 0\cdot\mathrm{P}[y_i=0] = F(x_i,\beta).$$
There are several options to formalize F. In the linear model $F(x_i,\beta)=x_i'\beta$, and the corresponding regression model is given by
$$y_i = \mathrm{E}[y_i|x_i] + (y_i - \mathrm{E}[y_i|x_i]) = x_i'\beta + \epsilon_i = \hat{y}_i + \epsilon_i.$$
The linear model has three major drawbacks. First, $x_i'\beta$ is not constrained to the interval [0,1]. Second, the disturbances are not normal but Bernoulli random variables with two possible outcomes (conditional on $x_i$):
$$\mathrm{P}[\epsilon_i=-x_i'\beta] = \mathrm{P}[y_i=0] = 1-x_i'\beta \qquad \mathrm{P}[\epsilon_i=1-x_i'\beta] = \mathrm{P}[y_i=1] = x_i'\beta.$$
This implies the third drawback that the disturbances are heteroscedastic with conditional variance
$$\mathrm{V}[\epsilon_i|x_i] = x_i'\beta(1-x_i'\beta).$$
Instead of specifying F as a linear function, in the probit-model F is assumed to be the
standard normal distribution function
p1 expf 0:5u2gdu:
Z y^i Z y^i
F (xi ) = (^yi ) =
0 (u) du =
1 1 2
In the logit-model or logistic regression model the distribution function is given by
F (x0i ) = L(^yi ) =
1 = expfx0i g :
1 + expf x0i g 1 + expfx0i g
The probit- and logit-models imply (slightly di erent) s-shaped forms of the conditional
expectation E[yijxi]. The logit-model assigns larger probabilities to yi=0 than the probit-
model if x0i is very small. The di erence between the two models will be large if the
sample has only a few cases for which yi=1 (or yi=0), and if an important regressor has
a large variance (see Greene, 2000, p.667).
The interpretation of the coefficients in the three models can be based on the partial
derivatives with respect to regressor $j$:
$$\frac{\partial\, x_i'\beta}{\partial x_{ij}} = \beta_j \qquad \frac{\partial\, \Phi(x_i'\beta)}{\partial x_{ij}} = \phi(x_i'\beta)\,\beta_j \qquad \frac{\partial\, \Lambda(x_i'\beta)}{\partial x_{ij}} = \frac{\exp\{x_i'\beta\}}{(1 + \exp\{x_i'\beta\})^2}\,\beta_j.$$

Hence, in the probit- and logit-model the effect of a change in regressor $j$ depends on the
probability at a given value of $x_i'\beta$. A convenient interpretation of the logit-model is based
on the so-called odds-ratio, which is defined as
$$\frac{\Lambda(\hat{y}_i)}{1 - \Lambda(\hat{y}_i)} = \exp\{x_i'\beta\} \qquad \Lambda(\hat{y}_i) = \frac{\exp\{x_i'\beta\}}{1 + \exp\{x_i'\beta\}}.$$
The log odds-ratio is given by
$$\ln\left(\frac{\Lambda(\hat{y}_i)}{1 - \Lambda(\hat{y}_i)}\right) = x_i'\beta.$$

This implies that $\exp\{\beta_j \Delta x_j\}$ is the factor by which the odds-ratio is changed c.p. if
regressor $j$ is changed by $\Delta x_j$ units. The effect of a change in a regressor on $\Lambda(\hat{y}_i)$ is low
if $\Lambda(\hat{y}_i)$ is close to zero or one.
Binary choice models can be estimated by maximum-likelihood, whereas linear models are
usually estimated by least squares. Each observation in the sample is treated as a random
draw from a Bernoulli distribution. For a single observation the (conditional) probability
of observing $y_i$ is given by
$$P[y_i|x_i; \beta] = F(x_i; \beta)^{y_i}(1 - F(x_i; \beta))^{(1-y_i)} = F_i^{y_i}(1 - F_i)^{(1-y_i)}.$$
For the entire sample the joint probability is given by
$$\prod_{i=1}^{n} F_i^{y_i}(1 - F_i)^{(1-y_i)},$$
and the log-likelihood function is given by
$$\ell(\beta) = \sum_{i=1}^{n} y_i \ln F_i + (1 - y_i)\ln(1 - F_i).$$

Maximizing the log-likelihood requires an iterative procedure. Standard errors of estimated
coefficients in the logit-model can be based on the Hessian
$$\frac{\partial^2 \ell}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} \Lambda(\hat{y}_i)(1 - \Lambda(\hat{y}_i))\,x_i x_i',$$
and tests of the estimated coefficients make use of the asymptotic normality of ML estimators.
A likelihood ratio test of $m$ restrictions $r'\beta = 0$ is based on comparing the restricted likelihood
$\ell_r$ to the unrestricted likelihood $\ell_u$:
$$2[\ell_u - \ell_r] \sim \chi^2_m.$$

The goodness of fit cannot be measured in terms of $R^2$. McFadden's $R^2$ is based on $\ell_u$
and the log-likelihood $\ell_0$ of a model which only contains a constant term:
$$\text{McFadden } R^2 = 1 - \ell_u/\ell_0.$$
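These computations are easily carried out in R; the following is a minimal sketch, assuming a hypothetical data frame d with a binary response y and two regressors x1 and x2:

# Logit estimation by ML, LR test of all slopes, and McFadden's R^2.
fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = d)
summary(fit)                                    # z-values based on the Hessian
fit0 <- glm(y ~ 1, family = binomial, data = d) # constant-only model
lr   <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(fit0)))
pchisq(lr, df = 2, lower.tail = FALSE)          # chi^2 test with m = 2 restrictions
1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit0))  # McFadden's R^2
exp(coef(fit))  # factors by which the odds-ratio changes per unit of x_j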
Example 25: We consider the choice among mortgages with fixed and adjustable
interest rates analyzed by Dhillon et al. (1987), and use part of their data from Studenmund (2001), p.459.⁵⁸ The dependent variable is ADJUST (equal to 1 when an
adjustable rate has been chosen). The regressors are the fixed interest rate (FIXED),
the interest premium on the adjustable rate (PREMIUM), the net worth of the borrower (NET), the ratio of the borrowing costs (adjustable over fixed; POINTS), the
ratio of the adjustable rate maturity to that of the fixed rate (MATURITY) and
the difference between the 10-year and 1-year Treasury rate (YIELD). Details can be
found in the files mortgage.wf1 and mortgage.xls.
The estimation results from the linear and the logit-model are summarized in the
table below. The $z$-values are based on the Hessian matrix. The $p$-values from the
two models are only marginally different. The coefficients of PREMIUM, NET and
YIELD are significant at the 5% level. The fitted probabilities from both models are
very similar which is confirmed by the similarity of $R^2$ and McFadden's $R^2$. The linear
model's probabilities are negative in only two cases, and never greater than one.
58 The data is available from the Student Resources at http://wps.aw.com/aw_studenmund_useecon_5.
              coefficients        t-, z- and LR-statistics      p-values
              linear    logit     linear t  logit z  logit LR   linear   z      LR
constant      -0.083   -3.722     -0.064    -0.514    0.268     0.949   0.607  0.605
FIXED          0.161    0.902      1.963     1.859    3.699     0.054   0.063  0.054
PREMIUM       -0.132   -0.708     -2.643    -2.331    6.386     0.010   0.020  0.012
NET            0.029    0.149      2.437     1.906    4.824     0.017   0.057  0.028
POINTS        -0.088   -0.518     -1.242    -1.217    1.595     0.218   0.224  0.207
MATURITY      -0.034   -0.238     -0.179    -0.229    0.053     0.858   0.819  0.819
YIELD         -0.793   -4.110     -2.451    -2.159    5.378     0.017   0.031  0.020

R^2 = 0.314;  McFadden R^2 = 0.26
The coefficients from the linear model have the usual interpretation. The coefficient
$-0.708$ from the logit-model is transformed to $\exp\{-0.708\}$=0.49, and can be interpreted
as follows: the odds-ratio is about one half of its original value if the premium
increases c.p. by one unit. For the first observation in the sample we obtain a fitted
probability of 0.8 which corresponds to an odds-ratio of 4:1. If the premium changes
from its current value of 1.5 to 2.5 the odds-ratio will fall to 2:1. From the linear
model the corresponding change yields a drop in $\hat{y}$ from 0.78 to 0.65.
1.12 Sample selection⁵⁹
Consider two random variables $y \sim N(\mu, \sigma^2)$ and $z \sim N(\mu_z, \sigma_z^2)$ with correlation $\rho_{yz}$. Suppose
$y$ is only observed if $z > a$ (so-called incidental truncation). The expected value of $y$
conditional on truncation is given by
$$E[y|\text{truncation}] = \mu + \rho_{yz}\sigma\,\lambda(\alpha_z),$$
where (see Greene, 2003, p.781) $\alpha_z = (a - \mu_z)/\sigma_z$, and $\lambda(\alpha_z)$ is the so-called inverse Mills
ratio given by $\lambda(\alpha_z) = f(\alpha_z)/[1 - F(\alpha_z)]$, where $f(\cdot)$ denotes the normal pdf and $F(\cdot)$ the
normal cdf. For example, if we can only observe the income $y$ of people whose wealth $z$
is below $a$ (and $\rho_{yz} > 0$), the average of the sample income is lower than the 'true' average
income in the population.
A similar argument holds for a regression, i.e. for the conditional expectation
$$y_i = \hat{y}_i + \epsilon_i = x_i'\beta + \epsilon_i, \qquad E[\hat{y}_i|\text{truncation}] = x_i'\beta + \rho_{yz}\sigma\,\lambda(\alpha_z).$$

A similar result holds if $z$ is not a (correlated) random variable but determined by an
equation like
$$z_i = w_i'\gamma + u_i = \hat{z}_i + u_i.$$
If sample data can only be observed conditional on some mechanism related to $z$, the
conditional mean of $y$ (now subject to selection) is given by
$$E[\hat{y}_i|\text{selection}] = x_i'\beta + \rho\sigma_\epsilon\,\lambda_i(\alpha_{ui}),$$
where $\alpha_{ui} = -\hat{z}_i/\sigma_u$ and $\lambda_i(\alpha_{ui}) = f(\hat{z}_i/\sigma_u)/F(\hat{z}_i/\sigma_u)$. This result is obtained by assuming
bivariate normality of $\epsilon$ and $u$ (rather than $y$ and $z$). Note that the inverse Mills ratio
$\lambda_i(\cdot)$ is not a constant, but depends on $w_i'\gamma$. Estimating the equation $y_i = x_i'\beta + \epsilon_i$ without
$\lambda_i(\cdot)$ yields inconsistent estimates because of the omitted regressor, or, equivalently, as a
result of sample selection.⁶⁰ Note that a non-zero correlation among $\epsilon$ and $u$ determines
the bias/inconsistency. Thus, a special treatment is required when the unobservable factors
determining inclusion in the subsample are correlated with the unobservable factors
affecting the variable of primary interest.
In many cases, $z$ is not directly observed/observable, but only a binary variable $d$, indicating
the consequence of the $z$-based selection rule. This offers the opportunity to estimate
the so-called selection equation (using a logistic regression as described in section 1.11):
$$d_i = w_i'\gamma + v_i.$$
59 Most of this section is based on Greene (2003), section 22.4.
60 The resulting inconsistency cannot be 'argued away' by stating that the estimated incomplete equation
is representative for the population corresponding to that available subsample. Since the estimated
equation describes (only) the non-random subsample, such a viewpoint is rather useless as long as nothing
is known about the mechanism that determines whether y (in the population) is non-zero.
This forms the basis for the so-called Heckman correction. Heckman (1979) suggested
a two-step estimator⁶¹ which first estimates the selection equation by probit to obtain
$\lambda_i = f(w_i'\hat{\gamma})/F(w_i'\hat{\gamma})$ for every $i$, and then estimates the original equation after adding this
auxiliary regressor to the equation.⁶² The coefficient of $\lambda_i$ can be interpreted by noting
that it is an estimate of the term $\rho_{\epsilon u}\sigma_\epsilon$; i.e. it can be viewed as a scaled correlation
coefficient.⁶³
For the practical implementation of this two-step approach we note that $y_i$ and $x_i$ are only
observed if $d_i$=1, while the regressors $w_i$ must be observed for all cases. The information
in $w_i$ must be able to sufficiently discriminate among subjects who enter or do not enter
the sample. More importantly, the selection equation requires at least one (additional)
exogenous regressor which is not included in $x_i$. In other words, we impose an exclusion
condition on the main equation, and this additional regressor plays a similar role as an
instrument in case of IV regressions for treating endogeneity. Note that IV-estimation is
impossible when the regressors in the first stage are identical to those in the main equation
(because of perfect multicollinearity). In the Heckit approach it is feasible to set $w_i = x_i$
(because of the nonlinearity of the inverse Mills ratio, and the fact that a different number
of observations is used in the two equations) but not recommended (see Wooldridge, 2002,
p.564).
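The two steps can be sketched in a few lines of R (the data frame d and all variable names are hypothetical; obs indicates whether y is observed, x1 is the outcome regressor, and w1, w2 are selection regressors including at least one variable excluded from the main equation):

# Heckman two-step correction by hand.
probit <- glm(obs ~ w1 + w2, family = binomial(link = "probit"), data = d)
zhat   <- predict(probit, type = "link")   # w'gamma-hat for all cases
d$imr  <- dnorm(zhat) / pnorm(zhat)        # inverse Mills ratio lambda_i
step2  <- lm(y ~ x1 + imr, data = d, subset = obs == 1)
summary(step2)   # the coefficient of imr estimates rho*sigma_eps

Note that the second-step OLS standard errors ignore the estimation error in the constructed regressor $\lambda_i$; packaged routines such as heckit in the R-package sampleSelection (mentioned in footnote 64) report corrected standard errors.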
Example 26: We consider a well-known and frequently used dataset about female
labor force participation and wages, and replicate the results in Table 22.7 in Greene
(2003).⁶⁴ A wage model can only be estimated for those 428 females who actually
have a job, so that wage data can be observed.⁶⁵ One can view the absent wage
information for another 325 females in this dataset as the result of truncation: if the
offered wage is below the reservation wage, females are not actively participating in
the labor market.
Estimation results based on the available sample of 428 females may suffer from a
selection bias if unobserved effects in the wage and selection equations are correlated
(i.e. $\rho_{\epsilon u} \neq 0$; see above). Whether this bias results from so-called self-selection (i.e.
women's deliberate choice to participate in the labor market), or other sampling effects
is irrelevant for the problem, but may be important for choosing regressors $w_i$.
The estimated coefficient of the inverse Mills ratio is given by $-1.1$. This can be interpreted
as follows: women who have above average willingness (interest or tendency)
to work (i.e. $z_i$ is above $\hat{z}_i$; $u_i > 0$) tend to earn below average wage (i.e. $y_i$ is below $\hat{y}_i$;
$\epsilon_i < 0$). This estimate is statistically insignificant which indicates that sample selection
may not play an important role in this example.

61 The procedure is often called 'Heckit', because of the combination of the name Heckman and
logit/probit models.
62 $\lambda_i$ are also called 'generalized residuals' of a probit model. For the entire sample (not just the subsample
for which y is observed) they have mean zero and are uncorrelated with the regressors $w_i$.
63 It may be possible to assign a 'physical' meaning to this coefficient upon recalling that the slope in a
(simple) regression of $y$ on $x$ is given by $\rho_{yx}\sigma_y/\sigma_x$ (see p.2). Thus, the ratio's coefficient is proportional to
the slope of a regression of $\epsilon_i$ on $u_i$ (i.e. the slope is multiplied/scaled by $\sigma_u$).
64 Source and description of variables: https://rdrr.io/rforge/Ecdat/man/Mroz.html; this dataset
Mroz87 is also included in the R-package sampleSelection. sample-selection.R contains code for Heckit
estimates (two-stage and ML), as well as an extension which also deals with endogeneity.
65 For the purpose of this example we ignore the potential endogeneity associated with estimating a wage
equation (see example 20). See Wooldridge (2002, p.567) for a treatment of this case.
1.13 Duration models⁶⁶
The purpose of duration analysis (also known as event history analysis or survival
analysis) is to analyze the length of time (the duration) of some phenomenon of interest
(e.g. the length of being unemployed or the time until a loan defaults). A straightforward
application of regression models using observed durations as the dependent variable is
inappropriate, however, because duration data are typically censored. This means that
the actual duration cannot be recorded for some elements of the sample. For example,
some people in the sample are still unemployed at the time of analysis, and it is unknown
when they are going to become employed again (if at all). We can only record the length of
the unemployment period at the time the observation is made. Such records are censored
observations and this fact must be taken into account in the analysis (see below). Two cases
are possible: the subject under study is still in the interesting state when measurements are
made, and it is unknown how long it will continue to stay in that state (right censoring).
Left censoring holds, if the subject has already been in the interesting state before the
beginning of the study, and it is unknown for how long.
We define the (continuous) variable $T$ which measures the length of time spent in the
interesting state, or the time until the event of interest has occurred. The units of measurement
will usually be days, weeks or months, but $T$ is not constrained to integer values.
The distribution of $T$ is described by a cumulative distribution function
$$F(t) = P[T \leq t] = \int_0^t f(s)\,ds.$$
The survival or survivor function is the probability of being in the interesting state for
more than $t$ units of time:
$$S(t) = 1 - F(t) = P[T > t].$$
We now consider the conditional probability of leaving the state of interest between $t$ and
$t+h$ conditional on having 'survived' until $t$:
$$P[t \leq T \leq t+h \,|\, T \geq t] = \frac{P[t \leq T \leq t+h]}{P[T \geq t]} = \frac{F(t+h) - F(t)}{1 - F(t)} = \frac{F(t+h) - F(t)}{S(t)}.$$
This probability is used to define the hazard function
$$\lambda(t) = \lim_{h \to 0} \frac{P[t \leq T \leq t+h \,|\, T \geq t]}{h}.$$
$\lambda(t)$ does not have a straightforward interpretation. It may be viewed as an instantaneous
probability of leaving the state. However, to view it as a probability is not quite appropriate,
since $\lambda(t)$ can be greater than one (in fact, it has no upper bound). If we assume
that the hazard rate is a constant $\lambda$ and assume that the event is repeatable, then $\lambda$ is the
expected number of events per unit of time. Alternatively, a constant hazard rate implies
$E[T] = 1/\lambda$, which is the expected number of periods until the state is left.
66 Most of this section is based on Greene (2003), section 22.5 and Kiefer (1988).
The hazard rate can be expressed in terms of $F(t)$, $f(t)$ and $S(t)$. Since
$$\lim_{h \to 0} \frac{F(t+h) - F(t)}{h} = F'(t) = f(t),$$
the hazard rate can also be written as
$$\lambda(t) = \frac{f(t)}{S(t)} \qquad \text{or} \qquad f(t) = \lambda(t)S(t).$$
It can be shown that
$$F(t) = 1 - \exp\left\{-\int_0^t \lambda(s)\,ds\right\}.$$
A constant hazard rate $\lambda(t) = \lambda$ corresponds to an exponential distribution $F(t) = 1 - e^{-\lambda t}$. It
implies that the probability of leaving the interesting state during the next time interval
does not depend on the time spent in the state. This may not always be a realistic
assumption. Instead, assuming a Weibull distribution for $T$ results in the hazard rate
$$\lambda(t) = \gamma\alpha t^{\alpha-1} \qquad \gamma > 0,\ \alpha > 0,$$
which is increasing if $\alpha > 1$. Assuming a lognormal or log-logistic distribution for $T$ gives
rise to a non-monotonic behavior of the hazard rate.
The parameters $\theta = \{\gamma, \alpha\}$ can be estimated by maximum likelihood. The joint density for
an i.i.d. sample of $n$ uncensored durations $t_i$ is given by
$$L(\theta) = \prod_{i=1}^{n} f(t_i; \theta).$$

When $t_i$ is a right-censored observation, we only know that the actual duration is at
least $t_i$. As a consequence, the contribution to the likelihood is the probability that the
duration is longer than $t_i$, which is given by the survivor function $S(t_i)$. Using the dummy
variable $d_i$=1 to indicate uncensored observations, the log-likelihood is defined as
$$\ell(\theta) = \sum_{i=1}^{n} d_i \ln f(t_i; \theta) + (1 - d_i)\ln S(t_i; \theta).$$

Because $f(t) = \lambda(t)S(t)$ the log-likelihood can be written as
$$\ell(\theta) = \sum_{i=1}^{n} d_i \ln \lambda(t_i; \theta) + \ln S(t_i; \theta).$$

In other words, the likelihood of observing a duration of length $t_i$ depends on survival until
$t_i$, and exiting the interesting state at $t_i$. For censored cases, exiting cannot be accounted
for, and only survival until $t_i$ enters the likelihood.
In case of the Weibull distribution $\lambda(t) = \gamma\alpha t^{\alpha-1}$, $\ln S(t) = -\gamma t^\alpha$, and the log-likelihood is
given by
$$\ell(\theta) = \sum_{i=1}^{n} d_i \ln(\gamma\alpha t_i^{\alpha-1}) - \gamma t_i^\alpha.$$
An obvious extension of modeling durations makes use of explanatory variables⁶⁷. This
can be done by replacing the constant $\gamma$ by the term $\exp\{x_i'\beta\}$.⁶⁸ The resulting parameter
estimates can be interpreted in terms of $\exp\{\beta_j\Delta\}$, which is the factor by which the hazard
rate is multiplied if regressor $j$ is increased ceteris paribus by $\Delta$ units.
The proportional hazards model (or Cox regression model) does not require any
assumption about the distribution of $T$. Rather than modeling the hazard rate as
$$\lambda(t; x_i) = \lambda_0(t)\exp\{x_i'\beta\},$$
where $\lambda_0(t)$ is the baseline hazard function (e.g. $\gamma\alpha t^{\alpha-1}$ for the Weibull model), the Cox
regression assumes that the ratio of the hazard rates of two individuals does not depend
upon time:
$$\frac{\lambda(t; x_i)}{\lambda(t; x_j)} = \frac{\lambda_0(t)\exp\{x_i'\beta\}}{\lambda_0(t)\exp\{x_j'\beta\}} = \frac{\exp\{x_i'\beta\}}{\exp\{x_j'\beta\}}.$$

Hence, there is no need to specify the baseline function $\lambda_0(t)$. Cox defines a partial
likelihood estimator using the log-likelihood
$$\ell(\beta) = \sum_{i=1}^{n} \left[ x_i'\beta - \ln \sum_{j \in R_i} \exp\{x_j'\beta\} \right].$$

For a sample of $n$ distinct exit times $t_1, \ldots, t_n$ (i.e. considering uncensored cases only), the
risk set $R_i$ contains all individuals whose exit time is at least $t_i$ (which includes censored
and uncensored cases).
Example 27: We consider a dataset about lung cancer from the North Central
Cancer Treatment Group.⁶⁹ Ignoring (available) covariates and assuming a Weibull
distribution results in estimates of $\hat{\alpha}$=1.342 and $\hat{\gamma}$=0.0003. The function survreg in
the R-package survival reports a constant term 6.054, which can be derived from
$\ln\hat{\gamma} = -\hat{\alpha} \cdot 6.054$.
Using the available regressors and maintaining the Weibull assumption shows that
sex, ph.ecog and ph.karno are significant covariates. The estimate for sex can be
converted to $\exp\{-0.56\}$=0.571, which implies that the hazard rate for otherwise
identical observations is nearly halved when comparing a man (sex=1) to a female
(sex=2). The estimate 0.0235 for ph.karno implies that an increase of this variable
by (a typical change of) 10 units yields a factor of $\exp\{10 \cdot 0.0235\}$=1.265, i.e. an
approximately 25% increase in the hazard rate. Running a Cox regression yields very
similar parameter estimates.
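The models of example 27 can be reproduced along the following lines (a sketch; the lung dataset ships with the R-package survival, and the sign conversion reflects the log-time parameterization of survreg mentioned in footnote 68):

# Weibull and Cox models for the lung cancer data.
library(survival)
wei <- survreg(Surv(time, status) ~ sex + ph.ecog + ph.karno,
               data = lung, dist = "weibull")
summary(wei)
-coef(wei) / wei$scale   # hazard-scale coefficients as interpreted in the text
cox <- coxph(Surv(time, status) ~ sex + ph.ecog + ph.karno, data = lung)
summary(cox)             # the exp(coef) column gives the hazard-rate factors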
67 In the context of hazard rate models the regressors are frequently called covariates.
68 Note that the function survreg in the R-package survival sets $\gamma = \exp\{-x_i'\beta\}$, and applies a scaling
factor.
69 Source and description of variables:
https://www.rdocumentation.org/packages/survival/versions/2.41-2/topics/lung. Computations
and code can be found in lung.xlsx and lung.R.
2 Time Series Analysis
2.1 Financial time series
A financial time series is a chronologically ordered sequence of data observed on financial
markets. These include stock prices and indices, interest rates, exchange rates (prices for
foreign currencies), and commodity prices. Usually the subject of financial studies are
returns rather than prices. Returns summarize an investment irrespective of the amount
invested, and financial theories are usually expressed in terms of returns.
Log returns $y_t$ are calculated from prices $p_t$ using
$$y_t = \ln p_t - \ln p_{t-1} = \ln(p_t/p_{t-1}).$$
This definition corresponds to continuous compounding. $p_t$ is assumed to include dividend
or coupon payments. Simple returns $r_t$ are computed on the basis of relative price
changes:
$$r_t = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1.$$
This definition corresponds to discrete compounding. Log and simple returns are related
as follows:
$$y_t = \ln(1 + r_t) \qquad r_t = \exp\{y_t\} - 1.$$
A Taylor series expansion of $r_t$ shows that the two return definitions differ with respect
to second and higher order terms:
$$r_t = \exp\{y_t\} - 1 = -1 + \sum_{i=0}^{\infty} \frac{y_t^i}{i!} = \sum_{i=1}^{\infty} \frac{y_t^i}{i!} = y_t + \sum_{i=2}^{\infty} \frac{y_t^i}{i!}.$$

The simple return of a portfolio of $m$ assets is a weighted average of the simple returns
of individual assets
$$r_{p,t} = \sum_{i=1}^{m} w_i r_{it},$$
where $w_i$ is the weight of asset $i$ in the portfolio. For log returns this relation only holds
approximately:
$$y_{p,t} \approx \sum_{i=1}^{m} w_i y_{it}.$$

Some financial models focus on returns and their statistical properties aggregated over
time. Multi-period log returns are the sum of single-period log returns. The $h$-period
log return $(\ln p_t - \ln p_{t-h})$ is given by
$$\ln p_t - \ln p_{t-h} = \ln(p_t/p_{t-1}) + \ln(p_{t-1}/p_{t-2}) + \cdots + \ln(p_{t-h+1}/p_{t-h})$$
$$y_t(h) = y_t + y_{t-1} + \cdots + y_{t-h+1}.$$
The corresponding expression for simple returns is
$$p_t/p_{t-h} = (p_t/p_{t-1})(p_{t-1}/p_{t-2})\cdots(p_{t-h+1}/p_{t-h})$$
$$1 + r_t(h) = (1 + r_t)(1 + r_{t-1})\cdots(1 + r_{t-h+1}) = \prod_{j=0}^{h-1}(1 + r_{t-j}).$$

2.1.1 Descriptive statistics of returns


Basic statistical properties of returns are described by mean, standard deviation, skewness
and kurtosis. The mean is estimated from a sample of log returns $y_t$ ($t = 1, \ldots, n$) using
$$\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t.$$

The mean $\bar{r}$ of simple returns (obtained from the same price series) is not equal to $\bar{y}$. An
approximate⁷⁰ relation between the two means is
$$\bar{r} \approx \exp\{\bar{y} + 0.5s^2\} - 1 \qquad \bar{y} \approx \ln(1 + \bar{r}) - 0.5s^2, \qquad (40)$$
where $s^2$ is the (sample) variance of log returns:
$$s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(y_t - \bar{y})^2.$$
The square root of $s^2$ is the (sample) standard deviation or volatility⁷¹. Examples 28
and 29 document the well-known fact that the variance (or volatility) of returns is not
constant over time (i.e. the heteroscedasticity of financial returns).
Example 28: Figure 2 shows the stock prices of IBM⁷² and its log returns. Log
and simple returns cannot be distinguished in such graphs. Obvious features are the
erratic, strongly oscillating behavior of returns around the more or less constant mean,
and the increase in the volatility towards the end of the sample period.
Example 29: Figure 3 shows the daily log returns of IBM⁷³ over a long period of time
(1962-1997). This series shows that temporary increases in volatility as in Figure 2
are very common. This phenomenon is called volatility clustering and can be found
in many return series.
70 The relation is exact if log returns are normally distributed (see section 2.1.2).
71 In the context of financial economics the term volatility is frequently used in place of the statistical
term standard deviation. Volatility usually refers to the standard deviation expressed in annual terms.
72 Source: Box and Jenkins (1976), p.526; see file ibm.wf1; daily data from 1961/5/17 to 1962/11/2; 369
observations.
73 Source: Tsay (2002), p.257; daily data from 1962/7/3 to 1997/12/31; 8938 observations; available from
http://faculty.chicagobooth.edu/ruey.tsay/teaching/fts2/.

Figure 2: Daily stock prices of IBM and its log returns 1961-1962. [time series plot: price (left scale) and log return (right scale)]

Figure 3: Daily IBM log returns 1962-1997. [time series plot of log returns]

Many financial theories and models assume that returns are normally distributed to facilitate
theoretical derivations and applications. Deviations from normality can be measured
by the (sample) skewness
$$S = \frac{1}{n}\sum_{t=1}^{n}\frac{(y_t - \bar{y})^3}{\tilde{s}^3}$$
and (sample) kurtosis
$$U = \frac{1}{n}\sum_{t=1}^{n}\frac{(y_t - \bar{y})^4}{\tilde{s}^4}.$$
In large samples $S \stackrel{a}{\sim} N(0, 6/n)$ and $U \stackrel{a}{\sim} N(3, 24/n)$. Skewness is a measure of symmetry. If
the skewness is negative, the left tail of the histogram is longer than the right tail. Simply
speaking, the skewness is negative, if $y_t$ has more negative than positive extreme values.
The kurtosis⁷⁴ is a measure for the tail behavior. Financial returns typically have a kurtosis
greater than 3. This is the case if the distribution is more strongly concentrated around
the mean than the normal and assigns correspondingly higher probabilities to extreme
values (positive or negative). Such distributions are leptokurtic and have so-called fat
or heavy tails.
The Jarque-Bera (JB) test can be used to test for normality. It is based on the null
hypothesis of a normal distribution and the test statistic takes skewness $S$ and kurtosis $U$
into account:
$$JB = \frac{n}{6}\left(S^2 + \frac{1}{4}(U - 3)^2\right) \qquad JB \sim \chi^2_2.$$
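A direct R implementation of these statistics (a sketch; y is a vector of log returns):

# Sample skewness, kurtosis and the Jarque-Bera test.
n  <- length(y)
s  <- sqrt(mean((y - mean(y))^2))      # standard deviation with divisor n
S  <- mean((y - mean(y))^3) / s^3      # skewness
U  <- mean((y - mean(y))^4) / s^4      # kurtosis
JB <- n * (S^2 / 6 + (U - 3)^2 / 24)
pchisq(JB, df = 2, lower.tail = FALSE) # p-value under H0: normality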
Example 30: Figure 4 shows the histogram and descriptive statistics of log returns
from the index of the American Stock Exchange (AMEX)⁷⁵. The distribution is skewed
and has fat tails. The JB-test rejects normality.

Figure 4: Histogram and descriptive statistics of AMEX log returns.
[histogram of DLOG(AMEX); sample 1 209, 208 observations]
Mean 3.35e-05, Median 0.000816, Maximum 0.015678, Minimum -0.020474,
Std. Dev. 0.005477, Skewness -0.933927, Kurtosis 5.056577;
Jarque-Bera 66.89270 (Probability 0.000000).

Exercise 17: Download a few financial time series from finance.yahoo.com or
another website, or use another data source. Choose at least two different
types of series (stock prices, indices, exchange rates or commodity prices) or
at least two different frequencies (daily, weekly or monthly). Compute log and
simple returns, obtain their descriptive statistics, and test for normality.

74 $U - 3$ is also called excess kurtosis.
75 Source: SAS (1995) p.163; raw data: http://ftp.sas.com/samples/A55217 (withdrawn by SAS); see
file amex.wf1; daily data from 1993/8/2 to 1994/5/27.
2.1.2 Return distributions
Review 9:⁷⁶ A random variable $X$ has a lognormal distribution if $Y = \ln X$ is
normally distributed. Conversely, if $Y \sim N(\mu, \sigma^2)$ then $X = \exp\{Y\}$ is lognormal. The
density function of a lognormal random variable $X$ is given by
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\} \qquad x \geq 0,$$
where $\mu$ and $\sigma^2$ are mean and variance of $\ln X$, respectively. Mean and variance of $X$
are given by
$$E[X] = E[\exp\{Y\}] = \exp\{\mu + 0.5\sigma^2\} \qquad V[X] = \exp\{2\mu + \sigma^2\}[\exp\{\sigma^2\} - 1].$$

We now consider the log return in $t$ and treat it as a random variable (denoted by $Y_t$;
$y_t$ is the corresponding sample value or realization). $\mu$ and $\sigma^2$ are mean and variance of
the underlying population of log returns. Assuming that log returns are normal random
variables with $Y_t \sim N(\mu, \sigma^2)$ implies that $(1+R_t) = \exp\{Y_t\}$, the simple, gross returns are
lognormal random variables with
$$E[R_t] = \exp\{\mu + 0.5\sigma^2\} - 1 \qquad \text{and} \qquad V[R_t] = \exp\{2\mu + \sigma^2\}[\exp\{\sigma^2\} - 1].$$

If the simple, gross return is lognormal $(1+R_t) \sim LN(1+m, v)$, mean and variance of the
corresponding log return are given by
$$E[Y_t] = \ln(1 + m) - 0.5\sigma_Y^2 \qquad \sigma_Y^2 = V[Y_t] = \ln\left(1 + \frac{v}{(1 + m)^2}\right). \qquad (41)$$
What are the implications for the corresponding prices? Normality of $Y_t$ implies that
prices given by $P_t = \exp\{Y_t\}P_{t-1}$ or $P_t = (1+R_t)P_{t-1}$ are lognormal (for given, non-random
$P_{t-1}$). Thus, prices can never become negative if log returns are normal. Note that the
computation of expected prices from expected returns differs from computing historical
(ex-post) prices. Whereas $p_t = \exp\{y_t\}p_{t-1}$ holds for observed $y_t$ and $p_t$, the expected price
is given by $E[p_t] = \exp\{\mu + 0.5\sigma^2\}p_{t-1}$ if $y_t \sim N(\mu, \sigma^2)$.
Example 31: The mean log return of the FTSE⁷⁷ is 0.00765, whereas the mean of
simple returns is 0.009859. The standard deviation of log returns is 0.065256. Relation
(40) holds pretty well since $\exp\{\bar{y} + 0.5s^2\} - 1$ = 0.009827.
We now compare the ex-post and ex-ante implications of $\bar{y}$ and $\bar{r}$. The value of the
index was $p_0$ = 105.4 in January 1965. Given the average log return $\bar{y}$ and continuous
compounding, the index at $t$ is given by
$$p_t = p_0\exp\{\bar{y}t\} = 105.4\exp\{0.00765t\}.$$
This yields 1137.75 in December 1990 ($t$ = 311), which corresponds to the actual value
of the FTSE. For comparison we use $\bar{r}$, discrete compounding and
$$p_t = p_0(1 + \bar{r})^t = 105.4(1 + 0.009859)^t$$
76 For details see Hastings and Peacock (1975), p.84.
77 The Financial Times All Share Index (FTSE). Source: Mills (1993) p.225; see file ftse.wf1; monthly
data from January 1965 to December 1990.
to compute $p_{311}$ = 2228.06. Thus, at hindsight only the average log return corresponds
exactly to observed prices. We would need to use $\tilde{r} = (p_t/p_0)^{1/t} - 1$ to get the correct
ex-post implications based on discrete returns. However, from an ex-ante perspective,
$\bar{y}$ and $\bar{r}$ imply roughly the same expected prices if log returns are assumed to be normal:
$$E[p_t] = p_0\exp\{t(\bar{y} + 0.5s^2)\} = p_0\exp\{0.009779t\} \approx p_0(1 + 0.009859)^t.$$

Another attractive feature of normal log returns is their behavior under temporal aggregation.
If single-period log returns are normally distributed $Y_t \sim N(\mu, \sigma^2)$, the multi-period
log returns are also normal with $Y_t(h) \sim N(h\mu, h\sigma^2)$. This property is called stability
(under addition). It does not hold for simple returns.
Many financial theories and models assume that simple returns are normal. There are
several conceptual difficulties associated with this assumption. First, simple returns have
a lower bound of $-1$, whereas the normal distribution extends to $-\infty$. Second, multi-period
returns are not normal even if single-period (simple) returns are normal. Third,
a normal distribution for simple returns implies a normal distribution for prices, since
$P_t = (1+R_t)P_{t-1}$. Thus, a non-zero probability may be assigned to negative prices which
is generally not acceptable. These drawbacks can be overcome by using log returns rather
than simple returns. However, empirical properties usually indicate strong deviations from
normality for both simple and log returns.
As a consequence of the empirical evidence against the normality of returns various alternatives
have been suggested. The class of stable distributions has the desirable properties
of fat tails and stability under addition. One example is the Cauchy distribution with
density
$$f(y) = \frac{1}{\pi}\,\frac{b}{b^2 + (y - a)^2} \qquad -\infty < y < \infty.$$
However, the variance of stable distributions does not exist, which causes difficulties for
almost all financial theories and applications.⁷⁸ The Student t-distribution also has
fat tails if its only parameter, the degrees of freedom, is set to a small value. The
t-distribution is a frequently applied alternative to the normal distribution.⁷⁹
The mixture of normal distributions approach assumes that returns are generated by
two or more normal distributions, each with a different variance. For example, a mixture
of two normal distributions⁸⁰ is given by
$$y_t \sim (1 - x)N(\mu, \sigma_1^2) + xN(\mu, \sigma_2^2),$$
where $x$ is a Bernoulli random variable with $P[x=1] = \alpha$. This accounts for the observation
that return volatility is not constant over time (see example 29). The normal mixture
model is based on the notion that financial markets are processing information. The
amount of information can be approximated by the variance of returns. As it turns out, the
mixture also captures non-normality. For instance, a mixture of a low variance distribution
(with high probability $1-\alpha$) and a large variance distribution (with low probability $\alpha$) results
78 For details see Fielitz and Rozelle (1983).
79 For details see Blattberg and Gonedes (1974) or Kon (1984).
80 An example of simulated returns based on a mixture of three normal distributions can be found in the
file mixture of normal distributions.xls.
in a non-normal distribution with fat tails. Thus, if returns are assumed to be conditionally
normal given a certain amount of information, the implied unconditional distribution is
non-normal. Kon (1984) has found that between two and four normal distributions are
necessary and provide a better fit than t-distributions with degrees of freedom ranging
from 3.1 to 5.5.
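The fat-tail effect of mixing is easily reproduced by simulation; a minimal R sketch with arbitrarily chosen parameters:

# Two-component normal mixture: mostly sigma1, occasionally sigma2 > sigma1.
set.seed(1)
n <- 1e5; alpha <- 0.1; sigma1 <- 0.01; sigma2 <- 0.04
x <- rbinom(n, 1, alpha)                          # regime indicator
y <- rnorm(n, mean = 0, sd = ifelse(x == 1, sigma2, sigma1))
mean((y - mean(y))^4) / mean((y - mean(y))^2)^2   # kurtosis far above 3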
2.1.3 Abnormal returns and event studies⁸¹
Financial returns can be viewed as the result of processing information. The purpose
of event studies is to test the statistical significance of events (mainly announcements)
on the returns of one or several assets. For example, a frequently analyzed event is the
announcement of a (planned) merger or takeover. This may be a signal about the value
of the firm which may be reflected in its stock price. Comparing returns before and after
the information becomes publicly available can be used to draw conclusions about the
relevance of this information.
Event studies typically consist of analyzing the effects of a particular type of information
or event across a large number of companies. This requires an alignment of individual
security returns relative to an event date (denoted by $\tau$ = 0). In other words, a new time
index $\tau$ replaces the calendar time $t$ such that $\tau$ = 0 corresponds to the event date in each
case. The event window covers a certain time period around $\tau$ = 0 and is used to make
comparisons with pre-event (or post-event) returns.
The effects of the event have to be isolated from effects that would have occurred irrespective
of the event. For this purpose it is necessary to define normal and abnormal returns.
'Normal' refers to the fact that these returns would 'normally' be observed, either because
of other reasons than the event under study or if the event has no relevance. Normal
returns can be defined either on the basis of average historical returns or a regression
model. These estimates are obtained from the estimation window, which is a time period
preceding the event window. They serve as the expected or predicted returns during
the event window. Abnormal returns are the difference between normal and observed
returns during the event window.
Suppose that the estimation window ranges from $\tau = \tau_0 + 1$ to $\tau_1$ ($n_1$ observations), and the
event window ranges from $\tau = \tau_1 + 1$ to $\tau_2$ ($n_2$ observations) and includes the event date
$\tau$ = 0. We will consider estimating abnormal returns for company $i$ based on the market
model
$$y_{i\tau} = \alpha_i + \beta_i y^m_{i\tau} + \epsilon_{i\tau} \qquad \tau = \tau_0+1, \ldots, \tau_1,$$
where the market return $y^m_{i\tau}$ carries a firm-specific index to indicate that the market
returns have been aligned to match the firm's event date. Given OLS estimates $a_i$ and $b_i$
and observations for the market returns in the event window, we can compute $n_2$ abnormal
returns
$$e_{i\tau} = y_{i\tau} - a_i - b_i y^m_{i\tau} \qquad \tau = \tau_1+1, \ldots, \tau_2.$$
We define the $n_1 \times 2$ matrix $X$ for firm $i$ using $n_1$ observations from the estimation window.
Each of its rows is given by $(1\ \ y^m_{i\tau})$. A corresponding $n_2 \times 2$ matrix $X_0$ is defined for the
event window and the subscript 0 refers to the index set ($\tau_1+1, \ldots, \tau_2$). Given the OLS
estimates $b_i = (a_i\ b_i)'$ for the parameters of the market model, the vector of abnormal
returns for firm $i$ is defined as
$$e_{i0} = y_{i0} - X_0 b_i,$$
81 Most of this section is based on Chapter 4 in Campbell et al. (1997) where further details and references
on the event study methodology can be found. Other useful sources of information are the Event Study
Webpage http://web.mit.edu/doncram/www/eventstudy.html by Don Cram and the lecture notes by
Frank de Jong.
where $y_{i0}$ is the vector of observed returns. From section 1.2.6 we know that $E[e_{i0}] = 0$
(since $X_0 b_i$ is an unbiased estimate of $y_{i0}$), and its variance is given by
$$V[e_{i0}] = V_i = \sigma_i^2 I + \sigma_i^2 X_0(X'X)^{-1}X_0'.$$
$I$ is the $n_2 \times n_2$ identity matrix and $\sigma_i^2$ is the variance of disturbances $\epsilon_i$ in the market
model. The estimated variance $\hat{V}_i$ is obtained by using the error variance from the estimation
period
$$s_i^2 = \frac{e'e}{n_1 - 2} \qquad e = y - Xb_i$$
in place of $\sigma_i^2$.
Event studies are usually based on the null hypothesis that the event under consideration
has no impact on (abnormal) returns. Statistical tests can be based on the assumption that
abnormal returns are normally distributed and the properties just derived: $e_{i0} \sim N(0, V_i)$.
However, the information collected must be aggregated to be able to make statements and
draw conclusions about the event (rather than individual cases or observations). It is not
always known when an event will have an effect and how long it will last. Therefore abnormal
returns are cumulated across time in the event window. In addition, the implications
of the event are expressed in terms of averages across several firms which may potentially
be affected by the event. We start by considering the temporal aggregation.
If the event window consists of more than one observation we can define the cumulative
abnormal return for firm $i$ by summing all abnormal returns from $\tau_1 + 1$ to $\tau$
$$c_{i\tau} = \iota_\tau' e_{i0},$$
where the $n_2 \times 1$ vector $\iota_\tau$ has ones from row one to row $\tau$, and zeros elsewhere. The
estimated variance of $c_{i\tau}$ is given by
$$\iota_\tau' \hat{V}_i\, \iota_\tau.$$
This variance is firm specific. To simplify the notation we define the variance in terms of
$$H = X_0(X'X)^{-1}X_0' \qquad V_i = \sigma_i^2 I + \sigma_i^2 H.$$
Note that $H$ is firm-specific since $X$ is different for each firm (the market returns contained
in $X$ have to be aligned with the event time of firm $i$). The standard error of cumulative
abnormal returns across $n_2$ periods for firm $i$ is given by
$$se[c_i] = \sqrt{n_2 s_i^2 + s_i^2(\iota' H \iota)}.$$

The null hypothesis of zero abnormal returns can be tested using the standardized test
statistic
$$t^c_i = \frac{c_i}{se[c_i]}.$$
Under the assumption that abnormal returns are jointly normal and serially uncorrelated
the test statistic $t^c$ has a t-distribution with df = $n_1 - 2$.
Event studies are frequently based on analyzing many firms which are all subject to the
same kind of event (usually at different points in calendar time). Under the assumption
that abnormal returns for individual firms are uncorrelated (i.e. the event windows do not
overlap) tests can be based on averaging cumulative abnormal returns across $m$ firms and
the test statistic
$$t_1 = \frac{\bar{c}}{se[\bar{c}]} \stackrel{a}{\sim} N(0, 1),$$
where
$$\bar{c} = \frac{1}{m}\sum_{i=1}^{m} c_i \qquad se[\bar{c}] = \sqrt{\frac{1}{m^2}\sum_{i=1}^{m} se[c_i]^2}.$$

Alternatively, the test statistic $t^c_i$ can be averaged to obtain the test statistic
$$t_2 = \sqrt{\frac{m(n_1 - 4)}{n_1 - 2}}\left(\frac{1}{m}\sum_{i=1}^{m} t^c_i\right) \stackrel{a}{\sim} N(0, 1).$$
Example 32: We consider the case of two Austrian mining companies Radex and
Veitscher who were the subject of some rumors about a possible takeover. The first
newspaper reports about a possible 'cooperation' appeared on March 8, 1991. Similar
reports appeared throughout March 29. On April 16 it was officially announced that
Radex will buy a 51% share of Veitscher. The purpose of the analysis is to test
for abnormal returns associated with this event. Details can be found in the file
event.xls.
The estimation window consists of the three year period from January 25, 1988 to
January 24, 1991. We use daily log returns for the two companies and the ATX to
estimate the market model. The event window consists of 51 days (January 25 to
April 10). The cumulative abnormal returns start to increase strongly about 14 days
before March 8 and reach their peak on March 7. After that day cumulative abnormal
returns are slightly decreasing. Based on 51 days of the event period we find $c_1$ = 0.25
and $c_2$ = 0.17 for Radex and Veitscher, respectively. The associated t-statistics are
$t^c_1$ = 3.29 and $t^c_2$ = 2.54 which are both highly significant. Tests based on an aggregation
across the two companies are not appropriate in this example since they share the
same event window.

Exercise 18: Use the data from example 32 to test the significance of cumulative
abnormal returns for event windows ranging from January 25 to March 7
and March 15, respectively. You may also use other event windows that allow
for interesting conclusions.
2.1.4 Autocorrelation analysis of financial returns
The methods of time series analysis are used to investigate the dynamic properties of
a single realization $y_t$, in order to draw conclusions about the nature of the underlying
stochastic process $Y_t$, and to estimate its parameters. Before we define specific time
series models, and consider their estimation and forecasts, we briefly analyze the dynamic
properties of some financial time series.
Autocorrelation analysis is a standard tool for that purpose. The sample autocovariance⁸²
$$c_\ell = \frac{1}{n}\sum_{t=\ell+1}^{n}(y_t - \bar{y})(y_{t-\ell} - \bar{y})$$
and the sample autocorrelation
$$r_\ell = \frac{c_\ell}{c_0} = \frac{c_\ell}{s^2}$$
can be used to investigate linear temporal dependencies in an observed series $y_t$. $c_\ell$ and
$r_\ell$ are sample estimates of $\gamma_\ell$ (13) and $\rho_\ell$ (14). If the underlying process $Y_t$ is i.i.d.,
the sampling distribution of $r_\ell$ is $r_\ell \approx N(-1/n, 1/n)$. This can be used to test individual
autocorrelations for significance (e.g. using the 95% confidence interval⁸³ $-1/n \pm 1.96/\sqrt{n}$).
Rather than testing individual autocorrelations the Ljung-Box statistic can be used to
test jointly that all autocorrelations up to lag $p$ are zero:
$$Q_p = n(n+2)\sum_{\ell=1}^{p}\frac{r_\ell^2}{n - \ell}.$$
Under the null hypothesis of zero autocorrelation in the population ($\rho_1 = \cdots = \rho_p = 0$): $Q_p \sim \chi^2_p$.
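Both steps are available directly in R (y is a return series):

# Autocorrelations with approximate 95% bands and the Ljung-Box test.
acf(y, lag.max = 10)                        # bands at +/- 1.96/sqrt(n)
Box.test(y, lag = 10, type = "Ljung-Box")   # Q_10 against chi^2 with 10 df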
Example 33: The autocorrelations of IBM log returns in Figure 5 are negligibly small
(except for lags 6 and 9). The p-values of the Q-statistic (Prob and Q-Stat in Figure 5)
indicate that the log returns are uncorrelated. The situation is slightly different for
FTSE log returns. These autocorrelations are rather small but the correlations at lags
one, two and five are slightly outside the 95%-interval. The p-values of the Q-statistic
are between 0.01 and 0.05. Depending on the significance level we would either reject
or accept the null hypothesis of no correlation. We conclude that the FTSE log returns
are weakly correlated.

Assuming that returns are independent is stronger than assuming uncorrelated returns⁸⁴.
However, testing for independence is not straightforward because it usually requires to
specify a particular type of dependence. Given that the variance of financial returns is
typically not constant over time, a simple test for independence is based on the autocorrelations
of squared or absolute returns.
Example 34: Figure 6 shows the autocorrelations of squared and absolute log returns
of IBM. There are many significant autocorrelations even at long lags. Thus we
82 This is a biased estimate of the autocovariance which has the advantage of yielding a positive semi-definite
autocovariance matrix. The unbiased estimate is obtained if the sum is divided by $n - 1$.
83 Usually the mean $-1/n$ is ignored and $\pm 1.96/\sqrt{n}$ is used as the 95% confidence interval.
84 Independence and uncorrelatedness are only equivalent if returns are normally distributed.

Figure 5: Autocorrelations of IBM (left panel; 368 observations) and FTSE (right panel;
sample 1965:01-1990:12, 311 observations) log returns.

        IBM                             FTSE
lag    AC      Q-Stat   Prob          AC       Q-Stat   Prob
 1     0.023   0.2017   0.653         0.113    4.0342   0.045
 2     0.006   0.2149   0.898        -0.103    7.3859   0.025
 3    -0.036   0.6889   0.876         0.093   10.118    0.018
 4    -0.055   1.8366   0.766         0.061   11.304    0.023
 5    -0.027   2.1125   0.833        -0.102   14.589    0.012
 6     0.139   9.4080   0.152        -0.036   15.001    0.020
 7     0.070  11.267    0.127         0.043   15.599    0.029
 8     0.041  11.913    0.155        -0.047   16.312    0.038
 9    -0.090  15.011    0.091         0.076   18.152    0.033
10     0.002  15.014    0.132         0.022   18.303    0.050

Figure 6: Autocorrelations of squared (left panel) and absolute (right panel) log returns
of IBM (368 observations).

        squared                         absolute
lag    AC      Q-Stat   Prob          AC       Q-Stat   Prob
 1     0.303   34.004   0.000         0.405    60.964   0.000
 2     0.188   47.096   0.000         0.294    93.200   0.000
 3     0.321   85.576   0.000         0.398   152.34    0.000
 4     0.306  120.61    0.000         0.340   195.55    0.000
 5     0.040  121.20    0.000         0.143   203.22    0.000
 6     0.158  130.55    0.000         0.258   228.33    0.000
 7     0.111  135.19    0.000         0.235   249.16    0.000
 8     0.121  140.77    0.000         0.212   266.10    0.000
 9     0.298  174.39    0.000         0.327   306.66    0.000
10     0.265  201.12    0.000         0.334   349.13    0.000
11     0.146  209.22    0.000         0.252   373.39    0.000
12     0.242  231.62    0.000         0.270   401.18    0.000
13     0.372  284.77    0.000         0.323   441.12    0.000
14     0.067  286.48    0.000         0.178   453.34    0.000
15     0.111  291.25    0.000         0.225   472.81    0.000
16     0.164  301.71    0.000         0.283   503.71    0.000
17     0.200  317.29    0.000         0.314   541.92    0.000
18     0.065  318.96    0.000         0.211   559.26    0.000
19     0.251  343.51    0.000         0.323   600.00    0.000
20     0.196  358.54    0.000         0.297   634.61    0.000

conclude that the IBM log returns are uncorrelated but not independent. At the same
time the significant autocorrelations among squared and absolute returns point at
dependencies in (the variance of) returns.

In section 2.1 we have presented examples of volatility clustering. If the sign of returns is
ignored (either by considering squared or absolute returns), the correlation within clusters
is high. If the variance has moved to a high level it tends to stay there; if it is low it
tends to stay low. This explains that autocorrelations of absolute and squared returns are
positive for many lags.
Significant autocorrelation in squared or absolute returns is evidence for heteroscedasticity.
In this case the standard errors $1/\sqrt{n}$ are not appropriate to test the regular autocorrelations
$r_\ell$ for significance. Corrected confidence intervals can be based on the modified
variance of the autocorrelation coefficient at lag $\ell$:
$$\frac{1}{n}\left(1 + \frac{c_{y^2}(\ell)}{s^4}\right),$$
where $c_{y^2}(\ell)$ is the autocovariance of $y_t^2$ and $s^4$ is the squared variance of $y_t$. The resulting
standard errors are larger than $1/\sqrt{n}$ if squared returns are positively autocorrelated
which is typical for financial returns. This leads to wider confidence intervals and to
more conservative conclusions about the significance of autocorrelations. If the modified
standard errors are used for testing log returns of the FTSE no autocorrelations in Figure 5
are significant ($\alpha$ = 0.05).
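A sketch of the corrected bands in R:

# Heteroscedasticity-corrected 95% bounds for the autocorrelations of y.
n    <- length(y)
r    <- acf(y,   lag.max = 10, plot = FALSE)$acf[-1]
cy2  <- acf(y^2, lag.max = 10, plot = FALSE, type = "covariance")$acf[-1]
band <- 1.96 * sqrt((1 + cy2 / var(y)^2) / n)   # lag-specific bounds
cbind(r, band, significant = abs(r) > band)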
Exercise 19: Use the log returns de ned in exercise 17. Estimate and test
autocorrelations of regular, squared and absolute returns.
2.1.5 Stochastic process terminology
We briefly define some frequently used stochastic processes:
A white-noise process $\epsilon_t$ is a stationary and uncorrelated sequence of random numbers.
It may have mean zero (which is mainly assumed for convenience), but this is not essential.
The key requirement is that the series is serially uncorrelated; i.e. $\rho_\ell = \gamma_\ell = 0$ ($\forall \ell \neq 0$). If $\epsilon_t$
is normally distributed and white-noise it is independent (Gaussian white-noise). If $\epsilon_t$ is
white-noise with constant mean and constant variance with a fixed distribution it is an
i.i.d. sequence⁸⁵ (also called independent white-noise).
A martingale difference sequence (m.d.s.) $Y_t$ is defined with respect to the information
$I_t$ available at $t$. This could include any variables but typically only includes
$Y_t$: $I_t = \{Y_t, Y_{t-1}, \ldots\}$. $\{Y_t\}_{t=1}^{\infty}$ is a m.d.s. (with respect to $I_{t-1}$) if $E[Y_t|Y_{t-1}, Y_{t-2}, \ldots] = 0$
(which implies $E[Y_t] = 0$). Since white-noise restricts the conditional expectation to linear
functions, a m.d.s. implies stronger forms of independence than white-noise.
A random walk with drift $\mu$ is defined as
$$Y_t = Y_{t-1} + \mu + \epsilon_t \qquad \epsilon_t \ldots \text{white noise}.$$
In other words, the time increments of a random walk are white-noise.⁸⁶


If $Y_t$ is an element of $I_t$ and $E[Y_t|I_{t-1}] = Y_{t-1}$ then $Y_t$ is a martingale (or martingale
sequence) with respect to $I_{t-1}$. A random walk is an example of a martingale.
A mean reverting process is a stationary process with non-zero autocorrelations. It is
expected to revert to its (unconditional) mean⁸⁷ $\mu$ from below (above) if $Y_t < \mu$ ($Y_t > \mu$).
Since the process is stationary, it reverts to the mean relatively fast compared to a non-stationary
process without drift (see section 2.3).
An autocorrelated process⁸⁸ can be written as
$$Y_t = \hat{Y}_t + \epsilon_t \qquad \hat{Y}_t = E[Y_t|Y_{t-1}, Y_{t-2}, \ldots] \qquad \sigma_Y^2 \neq \sigma_\epsilon^2,$$
where $\hat{Y}_t$ denotes the conditional mean. If the variance of $\epsilon_t$ is not constant over time,
the conditional variance is defined in a similar way
$$E[(Y_t - \hat{Y}_t)^2|Y_{t-1}, Y_{t-2}, \ldots] = V[\epsilon_t|Y_{t-1}, Y_{t-2}, \ldots] = \sigma_t^2.$$
In this case $\epsilon_t$ is uncorrelated (white noise) but not i.i.d. $\sigma_t^2$ is the conditional variance of
$\epsilon_t$ (i.e. the conditional expectation of $\epsilon_t^2$).

85 We will use the stronger i.i.d. property for a white-noise with constant variance (and distribution).
White-noise only refers to zero autocorrelation and need not have constant variance.
86 Campbell et al. (1997, p.31) distinguish three types of random walks depending on the nature of $\epsilon_t$:
i.i.d. increments, independent (but not identically distributed) increments and uncorrelated increments.
87 Strictly speaking this definition also applies to white-noise, but the term mean reversion is mainly used
in the context of autocorrelated stationary processes.
88 An uncorrelated process would be written as $Y_t = \mu + \epsilon_t$.
2.2 ARMA models
We now introduce some important linear models for the conditional mean. An autoregressive
moving-average (ARMA) process is a linear stochastic process which is completely
characterized by its autocovariances $\gamma_\ell$ (or autocorrelations $\rho_\ell$). Thus, various
ARMA models can be defined and distinguished by their (estimated) autocorrelations. In
practice the (estimated) autocorrelations $r_\ell$ from an observed time series are compared to
the known theoretical autocorrelations of ARMA processes. Based on this comparison a
time series model is specified. This is also called the identification step in the model
building process. After estimating its parameters diagnostic checks are used to confirm
that a suitable model has been chosen (i.e. the underlying stochastic process conforms to
the estimated model). ARMA models are only appropriate for stationary time series.
2.2.1 AR models
The first order autoregressive process AR(1)
$$Y_t = \delta + \phi_1 Y_{t-1} + \epsilon_t \qquad |\phi_1| < 1$$
has exponentially decaying autocorrelations $\rho_\ell = \phi_1^\ell$. $\epsilon_t$ is a white-noise process with mean
zero and constant variance $\sigma_\epsilon^2$. The condition $|\phi_1| < 1$ is necessary and sufficient for the
AR(1) process to be weakly stationary.
The unconditional mean of an AR(1) process is given by
$$E[Y_t] = \mu = \frac{\delta}{1 - \phi_1}.$$
An equivalent formulation of the AR(1) process is given by
$$Y_t - Y_{t-1} = \Delta Y_t = (1 - \phi_1)(\mu - Y_{t-1}) + \epsilon_t.$$
Thus, deviations from the unconditional mean imply expected changes in $Y_t$ which depend
on the extent of the deviation and the degree of mean reversion $1 - \phi_1$.
The unconditional variance of an AR(1) process is derived as follows:
$$V[Y_t] = \sigma_Y^2 = V[\delta + \phi_1 Y_{t-1} + \epsilon_t] = \phi_1^2 V[Y_{t-1}] + V[\epsilon_t].$$
If $Y_t$ is stationary $V[Y_t] = V[Y_{t-1}]$ and
$$V[Y_t] = \frac{\sigma_\epsilon^2}{1 - \phi_1^2},$$
which is bounded and non-negative only if $|\phi_1| < 1$.
The exponential decay of the autocorrelations of an AR(1) process can be derived by
multiplying the model equation by $Y_{t-1}$ and taking the expected value (assuming $\mu$ = 0 for
the sake of simplicity):
$$\gamma_1 = E[Y_t Y_{t-1}] = E[\phi_1 Y_{t-1} Y_{t-1}] + E[\epsilon_t Y_{t-1}] = \phi_1 E[Y_{t-1}^2] = \phi_1\gamma_0 = \phi_1\sigma_Y^2. \qquad (42)$$
Therefore $\rho_1 = \gamma_1/\gamma_0 = \phi_1$ since $\gamma_0 = \sigma_Y^2$. Repeating this procedure for $Y_{t-2}$ gives
$$\gamma_2 = E[Y_t Y_{t-2}] = \phi_1 E[Y_{t-1} Y_{t-2}] = \phi_1^2\sigma_Y^2,$$
so that $\rho_2 = \phi_1^2$, and in general $\rho_\ell = \phi_1^\ell$.
A generalization of the AR(1) process is the AR(p) process
$$Y_t = \delta + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \epsilon_t \qquad E[Y_t] = \mu = \frac{\delta}{1 - \phi_1 - \cdots - \phi_p}.$$
It is convenient to make use of the backshift operator $B$ with the property
$$B^\ell Y_t = Y_{t-\ell}.$$
Using this operator an AR(p) process can be formulated as follows:
$$Y_t = \delta + (\phi_1 B + \phi_2 B^2 + \cdots + \phi_p B^p)Y_t + \epsilon_t$$
$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)Y_t = \delta + \epsilon_t$$
$$\Phi(B)Y_t = \delta + \epsilon_t \qquad \Phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p.$$
$\Phi(B)$ is a polynomial of order $p$ in $B$. Using $\Phi(B)$ we can rewrite the AR process as
$$Y_t = \frac{\delta}{\Phi(B)} + \frac{\epsilon_t}{\Phi(B)} = \mu + \frac{\epsilon_t}{\Phi(B)}.$$
For the AR(1) model $\Phi(B) = (1 - \phi_1 B)$ we have
$$\frac{1}{1 - \phi_1 B} = (1 + \phi_1 B + \phi_1^2 B^2 + \cdots),$$
and $Y_t$ can be written as
$$Y_t = \mu + \epsilon_t + \phi_1\epsilon_{t-1} + \phi_1^2\epsilon_{t-2} + \cdots.$$
The resulting process is an infinite order moving-average (see below).
Since $\epsilon_t$ is stationary (by definition) $Y_t$ will only be stationary if the weighted sum of
disturbances converges, i.e. if $|\phi_1| < 1$. In general, the stationarity of an AR(p) model
depends on the properties of the polynomial $\Phi(B)$.
Review 10: Given a polynomial of degree $n$
$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$
the constants $z_1, \ldots, z_n$ (real or complex) are called zeros of $f(x)$ or roots of $f(x) = 0$
such that
$$f(x) = a_n(x - z_1)\cdots(x - z_n).$$
The stationarity of an AR process depends on the roots of the AR polynomial $\Phi(B)$ which
can be factored as follows:
$$\Phi(B) = (1 - \phi_1 B - \cdots - \phi_p B^p) = \prod_{i=1}^{p}(1 - w_i B),$$
where $w_i$ are the inverted roots of the polynomial, which may be complex valued. The AR
model is stationary if all inverted roots are less than one in absolute value (or, all inverted
roots are inside the unit circle).
Various autocorrelation patterns of an AR(p) process are possible. For instance, the
autocorrelations of an AR(2) model show sinusoidal decay if its inverted roots are complex.
In this case the underlying series has stochastic cycles.89
AR models imply non-zero autocorrelations for many lags. Nevertheless it may be sufficient
to use one or only a few lags of $Y_t$ to define $\hat{Y}_t$. The number of necessary lags $p$ can be
determined on the basis of partial autocorrelations $\phi_{\ell\ell}$. $\phi_{\ell\ell}$ is the $\ell$-th coefficient of
an AR($\ell$) model. It measures the effect of $Y_{t-\ell}$ on $Y_t$ under the condition that the effects
from all other lags are held constant.
Partial autocorrelations can be determined from the solution of the Yule-Walker equations
of an AR(p) process:
$$\rho_\ell = \phi_1\rho_{\ell-1} + \phi_2\rho_{\ell-2} + \cdots + \phi_p\rho_{\ell-p} \qquad \ell = 1, \ldots, p.$$
For example, the Yule-Walker equations of an AR(2) process are given by
$$\ell = 1: \quad \rho_1 = \phi_1\rho_0 + \phi_2\rho_1$$
$$\ell = 2: \quad \rho_2 = \phi_1\rho_1 + \phi_2\rho_0.$$

In this case the solutions are given by (see Box and Jenkins, 1976, p.83)
$$\phi_1 = \frac{\rho_1(1 - \rho_2)}{1 - \rho_1^2} \qquad \phi_2 = \phi_{22} = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}.$$
AR coefficients (and thereby, partial autocorrelations) can be obtained by solving the
Yule-Walker equations recursively for AR models of increasing order. The recursions (in
terms of autocorrelations $\rho_\ell$) are given by (see Box and Jenkins, 1976, p.83)
$$\phi_{p+1,p+1} = \frac{\rho_{p+1} - \sum_{\ell=1}^{p}\phi_{p,\ell}\,\rho_{p+1-\ell}}{1 - \sum_{\ell=1}^{p}\phi_{p,\ell}\,\rho_\ell}$$
$$\phi_{p+1,\ell} = \phi_{p,\ell} - \phi_{p+1,p+1}\,\phi_{p,p-\ell+1} \qquad \ell = 1, \ldots, p.$$


89 An example of such a process is obtained if 1 =1.5 and 2 = 0:7. The le arma.xls can be used to
obtain simulated paths of ARMA models.
If the theoretical autocovariances are replaced by estimated autocovariances or autocorrelations,
(preliminary) AR parameters can be estimated from the solution of the equation
system.
The partial autocorrelations of an AR(p) process cut off at $p$: $\phi_{\ell\ell} = 0$ for $\ell > p$. This is the
basis for identifying AR(p) models empirically. Significant partial autocorrelations up to
lag $p$ and decaying autocorrelations (theoretically) suggest to estimate an AR(p) model.
In practice this identification may be difficult.
2.2.2 MA models
The moving-average process of order $q$ denoted by MA(q) is defined as follows:
$$Y_t = \mu + \theta_1\epsilon_{t-1} + \cdots + \theta_q\epsilon_{t-q} + \epsilon_t \qquad \epsilon_t \ldots \text{white-noise}.$$
Its unconditional mean and variance are given by
$$E[Y_t] = \mu \qquad V[Y_t] = (1 + \theta_1^2 + \cdots + \theta_q^2)\sigma_\epsilon^2.$$
The autocovariance at lag 1 is given by (assuming $\mu$ = 0 for the sake of simplicity):
$$\gamma_1 = E[Y_t Y_{t-1}] = E[(\theta_1\epsilon_{t-1} + \epsilon_t)(\theta_1\epsilon_{t-2} + \epsilon_{t-1})]$$
$$= E[\theta_1^2\epsilon_{t-1}\epsilon_{t-2} + \theta_1\epsilon_{t-1}^2 + \theta_1\epsilon_t\epsilon_{t-2} + \epsilon_t\epsilon_{t-1}]$$
$$= \theta_1 E[\epsilon_{t-1}^2] = \theta_1\sigma_\epsilon^2,$$
since for a white-noise process $E[\epsilon_t\epsilon_s] = 0$ ($\forall t \neq s$). In a similar manner it can be shown that
$\gamma_\ell = 0$ ($\forall\ \ell > 1$). For the general MA(q) process the autocorrelation function is given by
$$\rho_\ell = \frac{\sum_{i=0}^{q-\ell}\theta_i\theta_{i+\ell}}{1 + \sum_{i=1}^{q}\theta_i^2} \quad (\theta_0 = 1) \qquad \ell = 1, \ldots, q; \qquad \rho_\ell = 0 \quad \ell > q. \qquad (43)$$
Thus a MA(q) is characterized by autocorrelations that cut off at lag $q$.
The choice of the term 'moving-average' can be derived from the correspondence between
MA(q) and AR($\infty$) models. We consider a MA(1) model and rewrite it as an AR model:
$$Y_t = \mu + (1 + \theta_1 B)\epsilon_t$$
$$\frac{Y_t}{1 + \theta_1 B} = \frac{\mu}{1 + \theta_1} + \epsilon_t$$
$$(1 - \theta_1 B + \theta_1^2 B^2 - \cdots)Y_t = \mu^* + \epsilon_t, \qquad \mu^* = \frac{\mu}{1 + \theta_1}.$$
This shows that the corresponding AR coefficients imply a weighted average of lagged
$Y_t$ (with decreasing weights, provided $|\theta_1| < 1$). This transformation is possible if the
MA model is invertible, which is the case if all inverted roots of the MA polynomial
are less than one.
The relation between AR and MA models is the foundation for identifying MA(q) models
empirically. Significant autocorrelations up to lag $q$ and decaying partial autocorrelations
suggest to estimate an MA(q) model.

Table 1: Theoretical patterns of (partial) autocorrelations.

                         autocorrelations                                 partial autocorrelations
AR(1), $\phi_1 > 0$      exponential decay; $\rho_\ell = \phi_1^\ell$     cut off at lag 1; $\phi_{11} > 0$
AR(1), $\phi_1 < 0$      oscillating decay; $\rho_\ell = \phi_1^\ell$     cut off at lag 1; $\phi_{11} < 0$
AR(p)                    exponential decay (oscillating)                  cut off at lag $p$
MA(1), $\theta_1 > 0$    cut off at lag 1; $\rho_1 > 0$                   oscillating decay; $\phi_{11} > 0$
MA(1), $\theta_1 < 0$    cut off at lag 1; $\rho_1 < 0$                   exponential decay; $\phi_{11} < 0$
MA(q)                    cut off at lag $q$                               exponential decay (oscillating)
ARMA(p,q)                decay starting at lag $q$                        decay starting at lag $p$
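The patterns in table 1 can be generated with the R function ARMAacf, e.g.:

# Theoretical (partial) autocorrelations of simple ARMA models.
ARMAacf(ar = 0.7, lag.max = 10)                # AR(1): exponential decay
ARMAacf(ar = 0.7, lag.max = 10, pacf = TRUE)   # cut-off after lag 1
ARMAacf(ma = 0.7, lag.max = 10)                # MA(1): cut-off after lag 1
ARMAacf(ma = 0.7, lag.max = 10, pacf = TRUE)   # decaying partials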
2.2.3 ARMA models
ARMA(p, q) models combine AR and MA models:
$$Y_t = \delta + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \theta_1\epsilon_{t-1} + \cdots + \theta_q\epsilon_{t-q} + \epsilon_t$$
$$\Phi(B)Y_t = \delta + \Theta(B)\epsilon_t.$$
Table 1 summarizes the theoretical patterns of autocorrelations and partial autocorrelations
of ARMA(p, q) models. These can be used as a rough guideline for model identification.
In practice, the identification of AR and MA models can be difficult because
estimated (partial) autocorrelations are subject to estimation errors and the assignment
to theoretical patterns may be ambiguous. However, several possible models may be temporarily
selected. For instance, a low-order ARMA model can be specified and extended
step-by-step. Subsequent model estimation is carried out for all selected models. After
estimation and diagnostic checking a final choice among the models can be made.
Low-order ARMA models can substitute high-order AR or MA models with a few parameters
only. Provided that all inverted roots of $\Phi(B)$ are less than one in absolute terms,
an ARMA(p,q) model can be formulated as follows:
$$Y_t = \frac{\delta}{\Phi(B)} + \frac{\Theta(B)}{\Phi(B)}\epsilon_t = \mu + \Psi(B)\epsilon_t.$$
This is equivalent to a MA($\infty$) model:
$$Y_t = \mu + (1 + \psi_1 B + \psi_2 B^2 + \cdots)\epsilon_t = \mu + \sum_{\ell=0}^{\infty}\psi_\ell\epsilon_{t-\ell} \qquad (\psi_0 = 1).$$
This representation not only holds for ARMA models. According to the Wold decomposition
any covariance stationary process (with mean zero) can be written as
$$Y_t = U_t + V_t = \sum_{\ell=0}^{\infty}\psi_\ell\epsilon_{t-\ell} + V_t,$$
where $U_t$ and $V_t$ are uncorrelated, $\psi_0 = 1$ and $\sum_{\ell=0}^{\infty}\psi_\ell^2 < \infty$, $\epsilon_t$ is white-noise defined as
$$\epsilon_t = Y_t - E[Y_t|Y_{t-1}, Y_{t-2}, \ldots],$$
$E[\epsilon_t V_t] = 0$, and $V_t$ can be predicted from $V_{t-1}, V_{t-2}, \ldots$ with zero prediction variance.
2.2.4 Estimating ARMA models
ARMA models can be viewed as a special case of a linear regression model with lagged
dependent variables. Estimating ARMA models leads to biased estimates in small samples
since assumption AX ($E[\epsilon|X] = 0$) is violated. This can be shown using the AR(1) model
$Y_t = \phi_1 Y_{t-1} + \epsilon_t$. Assuming that AX holds (i.e. $\epsilon_t$ is orthogonal to (the stochastic regressor)
$Y_{t-1}$; $E[Y_{t-1}\epsilon_t] = 0$) we have
$$E[Y_t\epsilon_t] = E[(\phi_1 Y_{t-1} + \epsilon_t)\epsilon_t] = E[\epsilon_t^2].$$


This implies that the regressor $Y_{t-1}$ is not orthogonal to the disturbance $\epsilon_{t-1}$ (i.e. there is
correlation between regressors and disturbances across observations). This violates AX
and estimates of $\phi_1$ will be biased. As shown in section 1.3 the ARMA parameters can be
consistently estimated if the weaker assumption holds that $Y_{t-1}$ and $\epsilon_t$ are uncorrelated.
In other words, the bias (due to the unavoidable violation of AX by the lagged dependent
variable) disappears in large samples. We use again the simple AR(1) model with $\delta$ = 0 and consider
$$\text{cov}[Y_{t-1}, \epsilon_t] = E[Y_{t-1}\epsilon_t] - E[Y_{t-1}]E[\epsilon_t] = E[Y_{t-1}(Y_t - \phi_1 Y_{t-1})].$$

We can use (42) to obtain


cov[Yt 1 ; t ] = E[Yt Yt 1 ] 1 E[Yt2 1 ] = 1 1 0 = 0:
Thus we can estimate the parameters of ARMA models consistently by OLS. Note that
the presence of MA terms does not cause any problems if t is white-noise (see section 2.2.5
below), and thus does not violate AX.
Table 2 illustrates the magnitude of the bias associated with estimating AR coefficients. The true model is an AR(1), the estimated model is $y_t=c+f_1y_{t-1}+e_t$. The table shows the means and standard errors of the OLS estimates $\bar{c}=c/(1-f_1)$, $\bar{y}$ and $f_1$ obtained from 10000 simulated realizations of AR(1) processes with mean $\mu=0.5$. For $\phi_1\geq0.9$ and in small samples there are strong biases and large standard errors. While $\bar{c}$ and $\bar{y}$ are almost unbiased for $\phi_1<0.9$, $f_1$ remains biased, but the bias is reduced as n grows.
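A minimal simulation sketch (assuming numpy is available) reproduces the flavor of one cell of Table 2; the values of $\phi_1$, $\mu$ and n are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, mu, n, reps = 0.9, 0.5, 50, 10000
f1 = np.empty(reps); cbar = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n)
    y = np.empty(n); y[0] = mu
    for t in range(1, n):
        # AR(1) with unconditional mean mu: y_t = mu(1-phi1) + phi1*y_{t-1} + e_t
        y[t] = mu * (1 - phi1) + phi1 * y[t - 1] + e[t]
    X = np.column_stack([np.ones(n - 1), y[:-1]])
    b = np.linalg.lstsq(X, y[1:], rcond=None)[0]   # OLS of y_t on (1, y_{t-1})
    f1[r], cbar[r] = b[1], b[0] / (1 - b[1])
print(f1.mean(), f1.std(), cbar.mean())  # f1 is biased towards zero, cf. Table 2
```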
Given an observed time series $y_t$ ($t=1,\ldots,n$) we formulate the model
$$y_t=c+f_1y_{t-1}+\cdots+f_py_{t-p}+h_1e_{t-1}+\cdots+h_qe_{t-q}+e_t.$$
The parameters c, $f_i$ and $h_i$ are estimated such that the sum of squared residuals (errors) is minimized:90
$$\sum_{t=\max\{p,q\}+1}^{n}e_t^2\to\min.$$
Before the model is estimated it is necessary to determine p and q. This choice can be
based upon comparing estimated (partial) autocorrelations to the theoretical (partial)
90 In general, the estimation procedure is iterative. Whereas lagged $y_t$ are fixed explanatory variables (in the sense that they do not depend on the coefficients to be estimated) the lagged values of $e_t$ depend on the parameters to be estimated. For details see Box and Jenkins (1976), p.208.
Table 2: Means and standard errors (in parentheses) across 10000 estimates of AR(1) series for different sample sizes.

 φ₁        n     c̄              ȳ              f₁
 0.990     50    0.413 (142)     0.497 (3.79)    0.886 (0.08)
           100   0.373 (93.3)    0.481 (4.33)    0.938 (0.04)
           200   0.538 (31.1)    0.410 (4.47)    0.964 (0.02)
 0.900     50    0.552 (3.75)    0.514 (1.24)    0.818 (0.09)
           100   0.501 (1.00)    0.505 (0.93)    0.859 (0.06)
           200   0.513 (0.70)    0.512 (0.68)    0.881 (0.04)
 0.500     50    0.502 (0.28)    0.502 (0.28)    0.448 (0.13)
           100   0.499 (0.20)    0.499 (0.20)    0.475 (0.09)
           200   0.499 (0.14)    0.499 (0.14)    0.487 (0.06)
 0.200     50    0.500 (0.18)    0.500 (0.18)    0.169 (0.14)
           100   0.499 (0.13)    0.499 (0.13)    0.182 (0.10)
           200   0.500 (0.09)    0.500 (0.09)    0.192 (0.07)
 −0.200    50    0.500 (0.12)    0.500 (0.12)    −0.208 (0.14)
           100   0.501 (0.09)    0.501 (0.09)    −0.204 (0.10)
           200   0.500 (0.06)    0.500 (0.06)    −0.203 (0.07)
 −0.500    50    0.501 (0.09)    0.501 (0.10)    −0.491 (0.12)
           100   0.500 (0.07)    0.500 (0.07)    −0.496 (0.09)
           200   0.500 (0.05)    0.500 (0.05)    −0.497 (0.06)
 −0.900    50    0.501 (0.07)    0.501 (0.08)    −0.871 (0.08)
           100   0.501 (0.05)    0.500 (0.05)    −0.884 (0.05)
           200   0.500 (0.04)    0.500 (0.03)    −0.892 (0.04)
autocorrelations in Table 1. Alternatively, model selection criteria like the Akaike information criterion (AIC) or the Schwarz criterion (SC) can be used. AIC and SC are based on the log-likelihood91 $\ell=\ln L$ and the number of estimated parameters K (for an ARMA(p,q) model with a constant term $K=p+q+1$):
$$\mathrm{AIC}=-\frac{2\ell}{n}+\frac{2K}{n}\qquad\mathrm{SC}=-\frac{2\ell}{n}+\frac{K\ln n}{n}.$$
If the type of model cannot be uniquely determined from Table 1, several models are
estimated and the model with minimal AIC or SC is selected.
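As an illustration, the following sketch (assuming the statsmodels package is installed; y denotes a stationary return series) fits all ARMA(p,q) combinations up to order 3 and selects by AIC and SC. Note that statsmodels reports AIC and BIC (the SC) without dividing by n, which leaves the ranking unchanged:

```python
import itertools
from statsmodels.tsa.arima.model import ARIMA

crit = {}
for p, q in itertools.product(range(4), range(4)):
    res = ARIMA(y, order=(p, 0, q)).fit()   # ARMA(p,q) with constant term
    crit[(p, q)] = (res.aic, res.bic)       # bic corresponds to the SC
print(min(crit, key=lambda k: crit[k][0]))  # order chosen by AIC
print(min(crit, key=lambda k: crit[k][1]))  # order chosen by SC
```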
2.2.5 Diagnostic checking of ARMA models
ARMA model building is not complete unless the residuals are white-noise. The consequence of residual autocorrelation is inconsistency (see section 1.7.3). This can be shown in terms of the simple model
$$Y_t=\mu+\alpha Y_{t-1}+u_t\qquad u_t=\rho u_{t-1}+\epsilon_t.$$
91 $\ell$ is defined as
$$\ell=-\frac{n}{2}\left[1+\ln(2\pi)+\ln\!\left(\frac{1}{n}\sum_t e_t^2\right)\right].$$
To derive $\mathrm{E}[Y_{t-1}u_t]$ we use
$$Y_{t-1}=\frac{\mu+u_{t-1}}{1-\alpha B}=\mu^*+u_{t-1}(1+\alpha B+\alpha^2B^2+\cdots)=\mu^*+\sum_{i=0}^{\infty}\alpha^iu_{t-1-i}.$$
Hence
$$\mathrm{E}[Y_{t-1}u_t]=\mathrm{E}\left[\sum_{i=0}^{\infty}\alpha^iu_{t-1-i}\,u_t\right]$$
depends on the autocorrelations of $u_t$. $\mathrm{E}[Y_{t-1}u_t]$ will be non-zero as long as $\rho\neq0$, and this will give rise to inconsistent estimates. Thus, it is essential that the model is specified such that the residuals are white-noise. This requirement can also be derived from an alternative viewpoint. The main purpose of a time series model is to extract all dynamic features from a time series. This objective is achieved if the residuals are white-noise.
Autocorrelation of residuals can be removed by changing the model specification (mainly by including additional AR or MA terms). The choice may be based on patterns in (partial) autocorrelations of the residuals. AIC and SC can also be used to support the decision about including additional lags. However, it is not recommended to include lags that cannot be meaningfully interpreted. For instance, even if the coefficient of $y_{t-11}$ is 'significant' in a model for daily returns, this is a highly questionable result.
Indications about possible misspecifications can be derived from the inverted roots of the AR and MA polynomials. If two inverted roots of the AR and the MA polynomial are similar in magnitude, the model possibly contains redundant terms (i.e. the model order is too large). This situation is known as overfitting. If the absolute value of one of the inverted AR roots is close to or above 1.0, the autoregressive term implies non-stationary behavior. This indicates the need to take differences of the time series (we will return to that point in section 2.3.3). An absolute value of one of the inverted MA roots close to or above 1.0 indicates that the series is overdifferenced. Taking first differences of a white-noise series $y_t=\epsilon_t$ leads to
$$\Delta y_t=\epsilon_t-\epsilon_{t-1}.$$
The resulting series $\Delta y_t$ is 'best' described by an MA(1) model with $\theta_1=-1.0$. Its partial autocorrelations can be shown to be decaying. However, a white-noise series must not be differenced at all, and it does not make sense to fit a model to $\Delta y_t$. Similar considerations hold for stationary series in general: they must not be differenced.
If residuals are white-noise but not homoscedastic, modifications of the ARMA model equation are not meaningful. Heteroscedasticity of the disturbances cannot be removed with a linear time series model for the conditional mean. Models to account for residuals that are not normally distributed and/or heteroscedastic will be introduced in section 2.5.
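A sketch of the residual checks described in this section (statsmodels assumed; y as before): the Ljung-Box Q-statistic is applied to the residuals to test for remaining autocorrelation, and to the squared residuals as a simple check for heteroscedasticity:

```python
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA

res = ARIMA(y, order=(0, 0, 2)).fit()       # e.g. an MA(2) candidate
e = res.resid
print(acorr_ljungbox(e, lags=[5, 10]))      # H0: residuals are white-noise
print(acorr_ljungbox(e**2, lags=[5, 10]))   # H0: squared residuals uncorrelated
```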
2.2.6 Example 35: ARMA models for FTSE and AMEX returns
Box and Jenkins (1976) have proposed a modeling strategy which consists of several steps. In the identification step one or several preliminary models are chosen on the basis of (partial) autocorrelations and the patterns in Table 1. After estimating each model the residuals are analyzed, mainly to test for any remaining autocorrelation. If necessary, the models are modified and estimated again. A model is used for forecasting if its residuals are white-noise and its coefficients are significant. If there are several competing models which fulfill these requirements, model selection criteria can be used to make a final choice. We illustrate this procedure by considering FTSE and AMEX log returns.

Table 3: Results of fitting various ARMA models to FTSE log returns.

 model        f_ℓℓ      p-value   AIC      SC        model    AIC      SC
 null model                       −2.618   −2.606
 AR(1)        0.113     0.046     −2.622   −2.597    MA(1)    −2.629   −2.605
 AR(2)        −0.118    0.039     −2.626   −2.590    MA(2)    −2.636   −2.600
 AR(3)        0.123     0.032     −2.632   −2.583    MA(3)    −2.634   −2.586
 AR(4)        0.022     0.708     −2.622   −2.562    MA(4)    −2.632   −2.572
The autocorrelations of FTSE log returns are rather small (see Figure 5) and cannot be easily associated with the theoretical patterns in Table 1. To determine p and q we fit several models with increasing order and observe AIC, SC and the partial autocorrelations. Table 3 summarizes the results. Partial autocorrelations and AIC indicate that an AR(3) model is appropriate. The estimated model is
$$y_t=0.0067+0.14\,y_{t-1}-0.13\,y_{t-2}+0.12\,y_{t-3}+e_t\qquad s_e=0.06449$$
(0.075) (0.014) (0.02) (0.03)
Note that the p-values may be biased if the residuals $e_t$ do not have the properties reviewed in section 1.7.
The overall minimum AIC indicates an MA(2) model:
$$y_t=0.0077+0.15\,e_{t-1}-0.11\,e_{t-2}+e_t\qquad s_e=0.06445$$
(0.044) (0.001) (0.05)
In both models the standard deviation of residuals (the standard error) $s_e$ is not very different from the standard deviation of log returns $s_y\approx0.06526$ (see example 31). This indicates that the conditional mean $\hat{y}_t$ from these models is very close to the unconditional mean $\bar{y}$.
On the basis of SC the null model (p=q=0) would be chosen. In this case the conditional and unconditional mean are identical and the standard error is equal to the standard deviation of returns:
$$y_t=0.00765+e_t\qquad s_e=0.06526.$$
The ARMA(1,1) model (AIC=−2.633, SC=−2.596)
$$y_t=0.011-0.47\,y_{t-1}+0.62\,e_{t-1}+e_t\qquad s_e=0.06457$$
(0.073) (0.039) (0.002)
is not supported by AIC or SC. It is worth mentioning that all p-values of the ARMA(2,2) model (AIC=−2.65, SC=−2.589)
$$y_t=0.019-1.04\,y_{t-1}-0.84\,y_{t-2}+1.17\,e_{t-1}+0.88\,e_{t-2}+e_t\qquad s_e=0.0638$$
are equal to zero. This would be the 'optimal' model according to AIC. However, the model has inverted AR roots $-0.52\pm0.75i$ and MA roots $-0.59\pm0.73i$, which are very similar. This situation is known as overfitting: too many, redundant parameters have been estimated, and a model with fewer coefficients is more appropriate. The ratio of the MA and AR polynomials $(1+0.135B-0.095B^2-0.014B^3+0.094B^4-0.086B^5+\cdots)$ has coefficients which are rather small and similar to the MA(2) model. The significance tests of individual coefficients of an overfitted model have very limited value.

Figure 7: Autocorrelogram of AMEX log returns.

Sample: 1 209; included observations: 208.

 lag    AC       PAC      Q-Stat   Prob
 1      0.318    0.318    21.325   0.000
 2      0.103    0.002    23.566   0.000
 3     −0.006   −0.044    23.574   0.000
 4      0.074    0.098    24.745   0.000
 5      0.010   −0.042    24.767   0.000
 6     −0.024   −0.030    24.890   0.000
 7     −0.043   −0.018    25.293   0.001
 8     −0.024   −0.010    25.414   0.001
 9     −0.025   −0.015    25.555   0.002
 10    −0.032   −0.019    25.783   0.004
We apply diagnostic checking to the residuals from the MA(2) and the AR(3) model. The p-values of $Q_{10}$ to test for autocorrelation in residuals are 0.226 and 0.398, which indicates that the residuals are white-noise. For squared residuals the p-values of $Q_5$ are 0.03 (MA) and 0.35 (AR); the p-values of $Q_{10}$ are 0.079 (MA) and 0.398 (AR). Thus the MA residuals are not quite homoscedastic. The p-values of the JB-test are 0.0 for both models ($S\approx0.5$ and $U\approx13$), which rejects normality. Thus the significance tests of the estimated parameters may be biased.
The (partial) autocorrelations of AMEX log returns (see Figure 7) may be viewed to suggest an MA(1) model. The estimated model is
$$y_t=3.6\times10^{-5}+0.28\,e_{t-1}+e_t\qquad s_e=0.005239$$
(0.94) (0.0)
The residuals are white-noise but not normally distributed. The squared residuals are correlated and indicate heteroscedasticity (i.e. the residuals are not independent).
Exercise 20: Use the log returns defined in exercise 17. Identify and estimate suitable ARMA models and check their residuals.
2.2.7 Forecasting with ARMA models
Forecasting makes statements about the process $Y_t$ at a future date $t+\tau$ on the basis of information available at date t. The forecast $\hat{Y}_{t,\tau}$ is the conditional expected value
$$\hat{Y}_{t,\tau}=\mathrm{E}[Y_{t+\tau}|Y_t,Y_{t-1},\ldots,\epsilon_t,\epsilon_{t-1},\ldots]=\mathrm{E}[Y_{t+\tau}|I_t]\qquad\tau=1,2,\ldots$$
using the model equation. $\tau$ is the forecasting horizon.
Forecasts for future dates $t+\tau$ ($\tau=1,2,\ldots$) made at the same date t are called dynamic (or multi-step) forecasts. The one-step ahead forecast $\hat{Y}_{t,1}$ is the starting point. The next dynamic forecast $\hat{Y}_{t,2}$ (for t+2) is also made in t and uses $\hat{Y}_{t,1}$. In general, a dynamic forecast $\hat{Y}_{t,\tau}$ depends on all previous dynamic forecasts (see below). Static forecasts are a sequence of one-step ahead forecasts $\hat{Y}_{t,1},\hat{Y}_{t+1,1},\ldots$ made at different points in time.
AR model forecasts
Dynamic AR(1) model forecasts are given by:
$$\hat{Y}_{t,1}=\mu+\phi_1Y_t$$
$$\hat{Y}_{t,2}=\mu+\phi_1\mathrm{E}[Y_{t+1}|I_t]=\mu+\phi_1\hat{Y}_{t,1}=\mu+\phi_1(\mu+\phi_1Y_t)$$
$$\hat{Y}_{t,\tau}=\mu(1+\phi_1+\phi_1^2+\cdots+\phi_1^{\tau-1})+\phi_1^{\tau}Y_t$$
$$\lim_{\tau\to\infty}\hat{Y}_{t,\tau}=\frac{\mu}{1-\phi_1}=\bar{\mu}.$$
Unknown future values $Y_{t+1}$ are replaced by the forecasts $\hat{Y}_{t,1}$. Forecasts of AR(1) models decay exponentially to the unconditional mean of the process $\bar{\mu}$. The rate of decay depends on $|\phi_1|$. Dynamic forecasts of stationary AR(p) models show a more complicated pattern but also correspond to the autocorrelations. The forecasts converge to
$$\frac{\mu}{1-\phi_1-\cdots-\phi_p}=\bar{\mu}.$$
Note that $\mu$ is estimated by the constant term c in the model
$$\hat{y}_t=c+f_1y_{t-1}+\cdots+f_py_{t-p}.$$
Forecasts from an estimated AR model are determined by the estimated parameters in the same way as described above. $\bar{\mu}$ is estimated by $\bar{c}=c/(1-f_1-\cdots-f_p)$, which need not agree exactly with the sample mean $\bar{y}$.92
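The recursion is easy to reproduce; a small sketch (numpy assumed) with illustrative values of c and $f_1$:

```python
import numpy as np

c, f1, y_last = 0.1, 0.8, 2.5        # illustrative estimates and last observation
fc = np.empty(20)
fc[0] = c + f1 * y_last              # one-step ahead forecast
for i in range(1, len(fc)):
    fc[i] = c + f1 * fc[i - 1]       # each step uses the previous forecast
print(fc[-1], c / (1 - f1))          # forecasts decay towards c-bar = c/(1-f1)
```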
MA model forecasts
Dynamic MA(q) model forecasts are given by:
$$\hat{Y}_{t,1}=\mu+\theta_1\epsilon_t+\cdots+\theta_q\epsilon_{t-q+1}$$
$$\hat{Y}_{t,2}=\mu+\theta_1\mathrm{E}[\epsilon_{t+1}|I_t]+\theta_2\epsilon_t+\cdots=\mu+\theta_2\epsilon_t+\cdots$$
$$\hat{Y}_{t,\tau}=\mu\qquad(\tau>q).$$
92 In EViews AR models can be estimated with two different specifications. The lag specification LS Y C Y(-1) Y(-2) ... estimates the model
$$y_t=c+f_1y_{t-1}+f_2y_{t-2}+\cdots+e_t.$$
c is an estimate of $\mu$. Using the AR specification LS Y C AR(1) AR(2) ..., however, EViews estimates the model
$$y_t=\bar{c}+u_t\qquad u_t=f_1u_{t-1}+f_2u_{t-2}+\cdots+e_t,$$
where $\bar{c}=c/(1-f_1-f_2-\cdots)$ (using c from the lag specification) is an estimate of $\bar{\mu}$. The estimated coefficients $f_i$ from the two specifications are identical.
The unknown future disturbance terms $\epsilon_{t+\tau}$ are replaced by their expected value zero. Forecasts of MA(q) processes cut off to $\mu$ after q periods. Thus the forecasting behavior corresponds to the autocorrelation pattern.
Forecasts based on an estimated MA model are determined in the same way. The unconditional mean $\mu$ is estimated by the constant term c in the model
$$\hat{y}_t=c+h_1e_{t-1}+\cdots+h_qe_{t-q},$$
which need not agree exactly with the sample mean $\bar{y}$.
ARMA model forecasts
Forecasts using ARMA models can be derived in a similar fashion as described for AR and MA models. The behavior of dynamic forecasts corresponds to the autocorrelations (see Table 1). Once the contribution from the MA part has vanished, the forecasting behavior is driven by the AR part.
To investigate the behavior of ARMA forecasts for $\tau\to\infty$ we make use of Wold's decomposition. It implies that the coefficients $\psi_i$ of the MA($\infty$) representation approach zero as $i\to\infty$. This has two consequences:
1. Dynamic forecasts $\hat{Y}_{t,\tau}$ converge to $\bar{\mu}$ if $Y_t$ is stationary. Since $\hat{Y}_{t,\tau}$ is the conditional expected value of $Y_{t+\tau}$ this implies that returns are expected to approach their unconditional mean. This property is called mean reversion, which requires stationarity. The speed of mean reversion depends on the coefficients $\psi_\ell$ (i.e. on the autocorrelations of the process).
2. The MA($\infty$) representation implies that the variance of the forecast errors $\epsilon_{t,\tau}=Y_{t+\tau}-\hat{Y}_{t,\tau}$ converges to the variance of $Y_t$.93 The variance of $\epsilon_{t,\tau}$ can be used to compute forecast (confidence) intervals.
If the process is non-stationary (see section 2.3) $\psi(B)$ can be written as
$$\psi(B)=\frac{\theta(B)}{(1-B)\phi(B)}=(1+B+B^2+\cdots)\frac{\theta(B)}{\phi(B)}.\qquad(44)$$
Thus, the polynomial $\psi(B)$ does not converge. This implies that the (non-stationary) process is not mean reverting and the forecast variance does not converge.
2.2.8 Properties of ARMA forecast errors
The properties of forecast errors are also based on the MA($\infty$) representation. The $\tau$-step ahead forecast error is given by $Y_{t+\tau}-\hat{Y}_{t,\tau}$. The following properties hold if the forecast (the conditional expectation) is based on the correct process definition:
1. Expected value of the $\tau$-step ahead forecast error:
$$\mathrm{E}[Y_{t+\tau}-\hat{Y}_{t,\tau}]=0.$$
93 For details see Tsay (2002), p.53.
2. Variance of the $\tau$-step ahead forecast error:94
$$\mathrm{V}[\hat{Y}_{t,\tau}]=\sigma_\tau^2=\sigma^2\sum_{i=0}^{\tau-1}\psi_i^2.$$
3. Variance of the one-step ahead forecast error: $\sigma_1^2=\sigma^2$.
4. Forecast errors for a forecasting horizon $\tau$ behave like an MA($\tau-1$) process.
5. One-step ahead forecast errors are white-noise.
6. For $\tau\to\infty$ the variance of forecast errors converges to the variance of the process:
$$\lim_{\tau\to\infty}\sigma_\tau^2=\mathrm{V}[Y_t].$$
The forecast variance of integrated processes (see section 2.3) tends to $\infty$ because $\psi(B)$ defined in (44) implies cumulating the variance of forecast errors.
These properties may be used to determine a (1−α) confidence interval for forecasts that are calculated from an estimated ARMA model using n observations. The (1−α) forecast interval is given by
$$\hat{y}_{t,\tau}\pm T(\alpha/2;n)\,s_\tau,$$
where $T(\alpha;n)$ is the $\alpha$-quantile of the t-distribution with n degrees of freedom, and $s_\tau$ is the estimated standard deviation of the $\tau$-step ahead forecast error:
$$s_\tau^2=s_e^2\sum_{i=0}^{\tau-1}g_i^2\qquad g(B)=\frac{h(B)}{f(B)}.$$
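In practice such intervals are usually produced by the estimation software; a sketch with statsmodels (assumed installed; note that statsmodels uses normal rather than t-quantiles for the interval):

```python
from statsmodels.tsa.arima.model import ARIMA

res = ARIMA(y, order=(3, 0, 0)).fit()   # e.g. the AR(3) model of example 35
fc = res.get_forecast(steps=12)
print(fc.predicted_mean)                # dynamic forecasts y_hat_{t,tau}
print(fc.se_mean)                       # estimated s_tau, growing with tau
print(fc.conf_int(alpha=0.05))          # 95% forecast intervals
```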
Example 36: Long-horizon returns revisited: In example 1.8.3 we have considered regressions with long-horizon returns. Now we will have a closer look at the (partial) autocorrelations of such returns. For simplicity we assume that single-period returns are white-noise $\epsilon_t\sim\mathrm{N}(0,\sigma^2)$, and we consider the sum of only three consecutive, single-period returns:
$$y_t=\epsilon_t+\epsilon_{t-1}+\epsilon_{t-2}.$$
Thus, $y_t$ is an MA(2) process with parameters $\theta_1=\theta_2=1$. The autocovariances of $y_t$ are $\gamma_1=2\sigma^2$, $\gamma_2=\sigma^2$, and $\gamma_\ell=0$ ($\ell>2$), so that the only non-zero autocorrelations are $\rho_1=2/3$ and $\rho_2=1/3$. These autocorrelations can be derived from (43). Partial autocorrelations can be obtained recursively and follow a very specific pattern. $\phi_{\ell\ell}$ at the 'seasonal' lags $\ell=1,4,7,\ldots$ are given by $\rho_1/j$ ($j=1,2,\ldots$); for example, $\phi_{4,4}=1/3$. Partial autocorrelations at 'non-seasonal' lags 2,3,5,6,... are all negative and converge exponentially to zero. Empirically, the appropriate MA(2) model may be (easily) identified from the pattern of (partial) autocorrelations. However, the dynamic features captured by this model cannot be exploited in out-of-sample predictions of three-month returns for more than two periods ahead. These forecasts would be equal to the unconditional mean implied by the model parameters.
94 There is no difference between the variance of the forecast and the variance of the forecast error if the expected value of the forecast error equals zero.
Exercise 21: Use the ARMA models from exercise 20. Estimate the same
model for a subset of the available sample (omit about 10% of the observations
at the end of the sample). Compute static and dynamic out-of-sample forecasts
of returns and prices and compare them to the actual observations. Describe
the behavior of forecasts and evaluate their quality.
2.3 Non-stationary models
2.3.1 Random-walk and ARIMA models
Consider an AR(1) process with parameter $\phi_1=1$. The resulting non-stationary process is called random-walk:
$$Y_t=Y_{t-1}+\epsilon_t\qquad\epsilon_t\ \ldots\ \text{white-noise}.$$
The random-walk can be transformed into the stationary white-noise process $\epsilon_t$ by differencing:
$$\Delta Y_t=Y_t-Y_{t-1}=(1-B)Y_t=\epsilon_t.$$
A process that becomes stationary after differencing is also called integrated or difference-stationary. A random-walk can be written as the sum of all lags of $\epsilon_t$
$$Y_t=\frac{\epsilon_t}{1-B}=(1+B+B^2+\cdots)\epsilon_t=\sum_{i=-\infty}^{t}\epsilon_i,$$
which corresponds to integrating over $\epsilon_t$.
If the first differences of a random-walk are white-noise but the mean of $\Delta Y_t$ is different from zero, then $Y_t$ is a random-walk with drift:
$$Y_t=\mu+Y_{t-1}+\epsilon_t=\mu t+Y_0+\sum_{i=0}^{t}\epsilon_i=\mu t+Y_0+\omega_t.$$
This process has two trend components: the deterministic trend $\mu t$ and the stochastic trend $\omega_t$.
For a fixed, non-random initial value $Y_0$ the random-walk (with drift or without drift) has the following properties:
1. $\mathrm{E}[Y_t]=\mu t+Y_0$
2. $\mathrm{V}[Y_t]=t\sigma^2$
3. $\gamma_k=(t-k)\sigma^2$
4. $r_k$ decay very slowly (approximately linearly).
A random-walk is non-stationary since mean, variance and autocovariance depend on t. Thus it is not mean-reverting, and its (long-term) forecasts are given by
$$\mathrm{E}[\hat{Y}_{t,\tau}|Y_t]=\tau\mu+Y_t.$$
A general class of integrated processes can be defined if the differences $Y_t-Y_{t-1}$ follow an ARMA(p,q) process:
$$Y_t=Y_{t-1}+U_t$$
$$U_t=\mu+\phi_1U_{t-1}+\cdots+\phi_pU_{t-p}+\theta_1\epsilon_{t-1}+\cdots+\theta_q\epsilon_{t-q}+\epsilon_t.$$
In this case $Y_t$ is an ARIMA(p,1,q) (integrated ARMA) process and $Y_t$ is called integrated of order 1: $Y_t\sim I(1)$.
If $Y_t$ is an ARMA(p,q) process after differencing d times so that
$$(1-B)^dY_t=U_t$$
$$U_t=\mu+\phi_1U_{t-1}+\cdots+\phi_pU_{t-p}+\theta_1\epsilon_{t-1}+\cdots+\theta_q\epsilon_{t-q}+\epsilon_t,$$
$Y_t$ is an ARIMA(p,d,q) process and $Y_t\sim I(d)$. Obviously, an ARIMA model for log prices is equivalent to an ARMA model for log returns.
Forecasts of the ARIMA(0,1,1) process $(1-B)Y_t=\mu+\theta_1\epsilon_{t-1}+\epsilon_t$ are obtained by using the same procedure as in section 2.2.7:
$$\hat{Y}_{t,1}=Y_t+\mu+\theta_1\epsilon_t$$
$$\hat{Y}_{t,2}=\hat{Y}_{t,1}+\mu=Y_t+2\mu+\theta_1\epsilon_t$$
$$\hat{Y}_{t,\tau}=Y_t+\tau\mu+\theta_1\epsilon_t.$$
Forecasts of ARIMA(0,1,q) processes converge to a straight line with slope $\mu$, where $\mu$ corresponds to the expected value of $\Delta Y_t$. The transition to the straight line is described by the MA parameters and corresponds to the cut-off pattern of autocorrelations.
The ARIMA(1,1,0) process $(1-\phi_1B)(1-B)Y_t=\mu+\epsilon_t$ can be written as
$$\Delta Y_t=\mu+\phi_1\Delta Y_{t-1}+\epsilon_t\qquad Y_t=Y_{t-1}+\mu+\phi_1\Delta Y_{t-1}+\epsilon_t=Y_{t-1}+\Delta Y_t,$$
and dynamic forecasts are obtained as follows:
$$\hat{Y}_{t,1}=Y_t+\Delta\hat{Y}_{t,1}=Y_t+\mu+\phi_1\Delta Y_t$$
$$\hat{Y}_{t,2}=\hat{Y}_{t,1}+\Delta\hat{Y}_{t,2}=Y_t+\Delta\hat{Y}_{t,1}+\Delta\hat{Y}_{t,2}=Y_t+[\mu+\phi_1\Delta Y_t]+[\mu(1+\phi_1)+\phi_1^2\Delta Y_t]$$
$$\hat{Y}_{t,3}=Y_t+\mu+\mu(1+\phi_1)+\mu(1+\phi_1+\phi_1^2)+(\phi_1+\phi_1^2+\phi_1^3)\Delta Y_t.$$
Box and Jenkins (1976, p.152) show that the forecasts approach the straight line
$$\hat{Y}_{t,\tau}\to Y_t+\tau\bar{\mu}+(Y_t-Y_{t-1}-\bar{\mu})\frac{\phi_1}{1-\phi_1}\qquad\bar{\mu}=\frac{\mu}{1-\phi_1}.$$
In general, forecasts of ARIMA(p,1,0) processes approach a straight line with slope $\bar{\mu}$, which is the expected value of $\Delta Y_t$. The transition to the straight line is described by the AR parameters and corresponds to the pattern of autocorrelations.
The process
$$Y_t=\beta_0+\beta t+U_t$$
is a trend-stationary process. $U_t$ is stationary but need not be white-noise. The process $Y_t$ evolves around a linear, deterministic trend in a stationary way. The appropriate transformation to make this process stationary is to subtract the trend term $\beta_0+\beta t$ from $Y_t$. Note that differencing a trend-stationary process does not only eliminate the trend but also affects the autocorrelations of $\Delta Y_t$:
$$Y_t-Y_{t-1}=\beta+U_t-U_{t-1}.$$
In general, the autocorrelations of $U_t-U_{t-1}$ are not zero. For instance, if $U_t$ is white-noise, $Y_t-Y_{t-1}$ is an MA(1) process with parameter $\theta_1=-1$ and $\rho_1=-0.5$ (see (43), p.105).

Figure 8: Autocorrelogram of the FTSE.

Sample: 1965:01 1990:12; included observations: 312.

 lag    AC       PAC      Q-Stat   Prob
 1      0.988    0.988    307.47   0.000
 2      0.975   −0.027    608.14   0.000
 3      0.965    0.068    903.16   0.000
 4      0.955    0.023    1193.0   0.000
 5      0.943   −0.078    1476.6   0.000
 6      0.929   −0.086    1752.8   0.000
 7      0.915   −0.026    2021.5   0.000
 8      0.901   −0.008    2283.0   0.000
 9      0.891    0.160    2539.6   0.000
 10     0.880   −0.040    2790.9   0.000
Example 37: Many financial time series (prices, indices, rates) or their logarithms are non-stationary. The autocorrelations of a non-stationary series decay very slowly (approximately linearly) and $r_1$ is close to 1.0. The autocorrelogram of the FTSE in Figure 8 shows this typical pattern.
2.3.2 Forecasting prices from returns
Frequently, a model fitted to returns is also used to obtain fitted values or forecasts of the corresponding prices. Suppose $\hat{y}_t$ is the conditional mean of log returns $y_t=\ln(p_t/p_{t-1})$ derived from an ARMA model, and assume that the residuals are normal. Given the properties of the lognormal distribution (see section 2.1.2) the expected value of the price is given by
$$\hat{p}_t=p_{t-1}\exp\{\hat{y}_t+0.5s_e^2\},\qquad(45)$$
where $s_e^2$ is the variance of the residuals from the estimated model which determines $\hat{y}_t$.
Example 38: We fit the ARMA(1,1) model
$$y_t=0.0078+u_t\qquad u_t=-0.473\,u_{t-1}+0.626\,e_{t-1}+e_t\qquad s_e=0.06547$$
(0.0698) (0.049) (0.03)
to the FTSE log returns using the period 1965:01 to 1988:12 (see file ftse.wf1 for details). The one-step ahead (static) out-of-sample forecasts of the index are close to the actual values of the index, and the dynamic forecasts quickly converge to a line with almost constant slope. The slope of this line can be determined as follows. Dynamic $\tau$-period ahead forecasts of the index are given by
$$\hat{p}_{t,\tau}=p_t\exp\left\{\sum_{i=1}^{\tau}\hat{y}_{t,i}+\tau\,0.5s_e^2\right\},$$
where $\hat{y}_{t,i}$ are out-of-sample forecasts from the ARMA model.95 Dynamic forecasts $\hat{y}_{t,\tau}$ converge to the constant $\bar{c}=0.0078$ but the changes in the index do not converge to a constant:
$$\hat{p}_{t,\tau}-\hat{p}_{t,\tau-1}=p_t\left[\exp\{\tau(\bar{c}+0.5s_e^2)\}-\exp\{(\tau-1)(\bar{c}+0.5s_e^2)\}\right].$$
95 Note: EViews does not include the term $0.5s_e^2$.
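A sketch of this calculation (numpy assumed; the constant 0.0078 and $s_e$ are the estimates of example 38, the starting index value is illustrative):

```python
import numpy as np

def price_forecasts(p_last, y_fc, se2):
    # equation (45) applied tau times: cumulate the log-return forecasts
    # and add the lognormal correction tau * 0.5 * se2
    tau = np.arange(1, len(y_fc) + 1)
    return p_last * np.exp(np.cumsum(y_fc) + 0.5 * se2 * tau)

y_fc = np.full(24, 0.0078)          # long-run ARMA forecasts converge to c-bar
print(price_forecasts(1000.0, y_fc, 0.06547**2))
```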
2.3.3 Unit-root tests
An AR process is stationary if the AR polynomial $\phi(B)$ has no inverted root on or outside the unit circle. If there is a root on the unit circle (a so-called unit-root), $\phi(B)$ can be decomposed as follows:
$$\phi(B)=(1-\phi_1^*B-\phi_2^*B^2-\cdots-\phi_{p-1}^*B^{p-1})(1-B).$$
The term $(1-B)$ corresponds to the unit-root and implies taking first differences. The existence of a unit-root has considerable consequences for the behavior of the process and its forecasts. The dynamic forecasts of $Y_t$ converge to a straight line with a slope equal to the expected value of $\Delta Y_t$, and the forecast interval (which is based on the variance of the forecast error) diverges. If there is no unit-root, forecasts of $Y_t$ converge to the (unconditional) mean of $Y_t$, and the variance of forecast errors converges to the variance of $Y_t$.
Example 39: The polynomial of the AR(2) model $(1-1.8B+0.8B^2)Y_t=\epsilon_t$ has two inverted roots (1.0 and 0.8) and can be decomposed into $\phi(B)=(1-0.8B)(1-B)$. Thus $Y_t$ is an ARIMA(1,1,0) process and integrated: $Y_t\sim I(1)$.
The AR(2) model $(1-1.8B+0.81B^2)Y_t=\epsilon_t$ is only marginally different. However, its inverted roots are both equal to 0.9. There is no unit-root and the process is stationary: $Y_t\sim I(0)$.
ARMA models are only suitable for stationary time series. One way to deal with integrated time series is to take first (or higher order) differences, which is not appropriate if the series is trend-stationary. Empirically, it is very difficult to distinguish trend-stationary and difference-stationary processes. A slow, approximately linear decay of autocorrelations and $r_1$ close to one are (heuristic) indicators of an integrated series. However, there are integrated processes where the autocorrelations decay slowly although the decay starts at $r_1\approx0.5$ rather than close to 1.0. The ARIMA(0,1,1) process $\Delta Y_t=(1-0.8B)\epsilon_t$ is an example of this case (see Box and Jenkins, 1976, p.200).
The Dickey-Fuller (DF) unit-root test is based on the equation
$$\Delta Y_t=\mu+(\phi_1-1)Y_{t-1}+\epsilon_t=\mu+\gamma Y_{t-1}+\epsilon_t\qquad\gamma=\phi_1-1$$
$$H_0:\ \gamma=0\ (\phi_1=1)\ \text{(unit-root)}\qquad H_a:\ \gamma<0\ (|\phi_1|<1)\ \text{(stationary)}.$$
$\gamma=0$ if $Y_t$ is integrated, and the estimate $\hat{\gamma}$ should be close to zero. When $\hat{\gamma}$ is significantly less96 than zero, the null hypothesis is rejected, and $Y_t$ is assumed to be stationary. However, it is not straightforward to test $\gamma=0$ (or $\phi_1=1$) based on the null hypothesis of a unit-root and the estimated equation
$$\Delta y_t=c+(f_1-1)y_{t-1}+e_t=c+\hat{\gamma}y_{t-1}+e_t.$$
According to Fuller (1976) the t-statistic $(f_1-1)/\mathrm{se}[f_1]$ is not t-distributed under the null hypothesis (irrespective of n). He shows that $n(f_1-1)$ has a non-degenerate distribution with two main characteristics: $\hat{\gamma}$ is downward biased if $\phi_1=1$ (a fact also indicated by the results in Table 2), and the variance of $\hat{\gamma}$ under the null hypothesis is of order $1/n^2$ (rather than the usual order $1/n$). Critical values have to be derived from simulations since no analytical expression is available for that distribution. $H_0$ is rejected if the t-statistic of $\hat{\gamma}$ is less than the corresponding critical value in Table 4.97
96 The unit-root test is a one-sided test since the coefficient is negative under $H_a$.

Table 4: Critical values for the ADF test.

           without trend               with trend
           0.01     0.05     0.10      0.01     0.05     0.10
 n=50     −3.58    −2.93    −2.60     −4.15    −3.50    −3.18
 n=100    −3.51    −2.89    −2.58     −4.04    −3.45    −3.15
 n=250    −3.46    −2.88    −2.57     −3.99    −3.43    −3.13
 n=500    −3.44    −2.87    −2.57     −3.98    −3.42    −3.13
 n=∞      −3.43    −2.86    −2.57     −3.96    −3.41    −3.12
The critical values in Table 4 are valid even if $\epsilon_t$ is heteroscedastic. However, $\epsilon_t$ must be white-noise. If $\epsilon_t$ is not white-noise, the DF test equation has to be extended (augmented Dickey-Fuller (ADF) test)98:
$$\Delta Y_t=\mu+\gamma Y_{t-1}+\sum_{i=1}^{p}c_i\Delta Y_{t-i}+\epsilon_t.\qquad(46)$$
It is recommended to choose $p\approx n^{1/3}$, but AIC or SC can be used as well to choose p. Insignificant coefficients $c_i$ should be eliminated, but if in doubt, too large values of p are not very harmful. The Phillips-Perron test does not account for the (possible) autocorrelation in the residuals by adding lags to the DF equation. Instead, the test statistic $(f_1-1)/\mathrm{se}[f_1]$ is adjusted for autocorrelation, as in the computation of Newey-West standard errors. The critical values are the same as in the ADF test.
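Both tests are available in standard software; an ADF sketch with statsmodels (assumed installed; regression='c' includes a constant only, 'ct' adds the trend term discussed below):

```python
from statsmodels.tsa.stattools import adfuller

stat, pval, usedlag, nobs, crit, icbest = adfuller(y, regression='c', autolag='AIC')
print(stat, crit)   # reject the unit-root H0 if stat < critical value
```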
If a series shows a more or less monotonic trend, it can be either trend-stationary or an integrated series (e.g. a random-walk) with drift. Consider the integrated process
$$Y_t=\mu+Y_{t-1}+W_t,$$
where $W_t$ is stationary (hence, $Y_t$ is called difference-stationary). This process can be written as
$$Y_t=Y_0+\mu t+\sum_{i=0}^{t}W_i,$$
97 Source: Fuller (1976), p.373.
98 To derive this specification on the basis of an AR(p+1) model we set $\lambda=\phi_1+\cdots+\phi_{p+1}$, $c_s=-(\phi_{s+1}+\cdots+\phi_{p+1})$ ($s=1,\ldots,p$), reformulate the AR polynomial of order p+1 as
$$(1-\lambda B)-(c_1B+\cdots+c_pB^p)(1-B),$$
and write the AR(p+1) model as
$$Y_t=\mu+\lambda Y_{t-1}+c_1\Delta Y_{t-1}+\cdots+c_p\Delta Y_{t-p}+\epsilon_t.$$
The ADF test equation is obtained by subtracting $Y_{t-1}$ from both sides and setting $\gamma=\lambda-1$.
where the sum of (stationary) disturbances makes this process evolve around a linear trend in an integrated (non-stationary) way. If $W_t$ is white-noise, $Y_t$ is a random-walk with drift. A natural alternative to this process is the trend-stationary process
$$Y_t=\beta_0+\beta t+U_t$$
with stationary disturbances $U_t$. Although both $W_t$ and $U_t$ are stationary in these specifications, their properties have to be quite different to make the resulting series appear similar. For example, if $W_t$ is white-noise with $\sigma_W=1$, and $U_t$ is an AR(1) with $\phi_1=0.9$ and $\sigma_Y=5$, some similarity between sample paths of those processes can be obtained (see file nonstationary.xls).
A unit-root test to distinguish among these alternatives is based on estimating the equation
$$\Delta y_t=\hat{\gamma}y_{t-1}+c_0+ct+\sum_{i=1}^{p}c_i\Delta y_{t-i}+e_t,$$
and the critical values from Table 4 (column 'with trend'). If $H_0$ is not rejected, $y_t$ is concluded to be integrated with a drift corresponding to $-c/\hat{\gamma}$ (assuming that $\hat{\gamma}<0$ in any finite sample). If $H_0$ is rejected, $y_t$ is assumed to be trend-stationary with slope $-c/\hat{\gamma}$.
If a series shows no clear trend, a unit-root test can be used to decide whether the series is stationary or integrated without a drift. The integrated process
$$Y_t=Y_{t-1}+W_t\qquad W_t\ \ldots\ \text{stationary}$$
can be written as
$$Y_t=Y_0+\sum_{i=0}^{t}W_i,$$
where the sum introduces non-stationarity. If $W_t$ is white-noise, $Y_t$ is a random-walk without drift. A natural alternative to this process is the stationary process
$$Y_t=\beta_0+U_t$$
with stationary disturbances. In this case we estimate the test equation
$$\Delta y_t=\hat{\gamma}y_{t-1}+c_0+\sum_{i=1}^{p}c_i\Delta y_{t-i}+e_t.$$
If $H_0$ is rejected, $y_t$ is assumed to be stationary. If $H_0$ is not rejected, $y_t$ is assumed to be integrated. In both cases, $c_0$ is proportional to the mean of $y_t$ with factor $-1/\hat{\gamma}$.
In general, unit-root tests should be interpreted with caution. The power of unit-root tests is low, which means that stationary processes are too frequently assumed to be integrated (in particular if $\phi_1$ is close to one). Including irrelevant, deterministic regressors (e.g. constant or trend) in the test equation reduces the power of the test even further (since critical values become more negative). On the other hand, if constant or trend terms are omitted although they belong to the true data generating process, the power can go to zero. Choosing the correct specification is difficult, because two important issues are interrelated: unit-root tests depend on the presence of deterministic regressors, and conversely, tests for the significance of such regressors depend on the presence of a unit root. A standard recommendation is to choose a specification of the test equation that is plausible under both the null and the alternative hypothesis (see Hamilton (1994), p.501, or the guidelines in Enders (2004), p.207).
Kwiatkowski et al. (1992) have proposed a test based on the null hypothesis of (trend) stationarity. The KPSS test runs a regression of $y_t$ on a constant (if $H_0$ is stationarity), or a constant and a time trend t (if $H_0$ is trend stationarity). The residuals are used to compute the test statistic
$$\frac{1}{n^2}\sum_{t=1}^{n}\frac{S_t^2}{\hat{\sigma}_e^2}\qquad S_t=\sum_{i=1}^{t}e_i.$$
$\hat{\sigma}_e^2$ is an estimate of the residual variance that accounts for autocorrelation as in the Newey-West estimator. The asymptotic critical values of the KPSS statistic are tabulated in Kwiatkowski et al. (1992, p.166). For $\alpha=0.05$ the critical value under the null of stationarity is 0.463, and 0.146 for trend stationarity.
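A corresponding sketch (statsmodels assumed); note that the null hypothesis is reversed relative to the ADF test:

```python
from statsmodels.tsa.stattools import kpss

stat, pval, lags, crit = kpss(y, regression='c')   # 'ct' for trend stationarity
print(stat, crit)   # reject stationarity if stat exceeds the critical value
```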
Example 40: A unit-root test of the spread between long- and short-term interest rates in the UK99 leads to ambiguous conclusions. The estimated value of $\gamma$ is −0.143679 and gives the impression that $\phi_1=1+\gamma$ is sufficiently far away from one. The t-statistic of $\hat{\gamma}$ is −3.174306. Although this is below the critical value at the 5% significance level (−2.881), it is above the critical value for $\alpha=0.01$ (−3.4758). Therefore the unit-root hypothesis can be rejected at the 5% but not at the 1% level, and it remains unclear whether the spread can be considered stationary or not. However, given the low power of unit-root tests it may be appropriate to conclude that the spread is stationary. The KPSS test confirms this conclusion since the test statistic is far below the critical values. Details can be found in the file spread.wf1.
Example 41: We consider a unit-root test of the AMEX index (see file amex.wf1). Since the index does not follow a clear trend, we do not include a trend term in the test equation. We use p=1 since the coefficients $\hat{c}_i$ ($i>1$) are insignificant (initially p=6 ($\approx209^{1/3}$) was chosen). The estimate for $\gamma$ is −0.0173 and has the expected negative sign. The t-statistic of $\hat{\gamma}$ is −1.6887 and is clearly above all critical values in Table 4; it is also above the critical values provided by EViews. Therefore the unit-root hypothesis cannot be rejected and the AMEX index is assumed to be integrated (of order one). This is partially confirmed by the KPSS test. The test statistic 0.413 exceeds the critical value only at the 10% level, but stationarity cannot be rejected for lower levels of $\alpha$. To derive the implied mean of $y_t$ from the estimated equation $\Delta\hat{y}_t=-0.0173y_{t-1}+7.998+0.331\Delta y_{t-1}$ we reformulate the equation as $\hat{y}_t=(1-0.0173+0.331)y_{t-1}+7.998-0.331y_{t-2}$; the implied mean is given by $7.998/0.0173\approx462$.
Example 42: We consider a unit-root test of the log of the FTSE index (see file ftse.wf1). We use only data from 1978 to 1986, since during this period it is not clear whether the series has a drift or is stationary around a linear trend. This situation requires including a trend term in the test equation. The estimated equation is $\Delta\hat{y}_t=-0.163y_{t-1}+0.868+0.0023t$. The t-statistic of $\hat{\gamma}$ is −3.19. This is above the 1% and 5% critical values in Table 4 and slightly below the 10% level. Therefore the unit-root hypothesis cannot be rejected, and the log of the FTSE index can be assumed to be integrated (of order one). This is confirmed by the KPSS test, where the test statistic exceeds the critical value and stationarity can be rejected. Since augmented terms are not necessary, the log of the index can be viewed as a random walk with drift approximately given by 0.0023/0.163=0.014.
99 Source: http://www.lboro.ac.uk/departments/ec/cup/data.html; 'Yield on 20 Year UK Gilts' (long; file R20Q.txt) and '91 day UK treasury bill rate' (short; file RSQ.htm); the spread is the difference between long and short; quarterly data from 1952 to 1988; 148 observations.
Exercise 22: Consider the ADF test equation (46) and p=1. Show that the implied sample mean of $y_t$ is given by $-\hat{\mu}/\hat{\gamma}$.
Exercise 23: Use annual data on the real price-earnings ratio from the file pe.wf1 (source: http://www.econ.yale.edu/~shiller/data/chapt26.xls). Test the series for a unit-root. Irrespective of the test results, fit stationary and non-stationary models to the series using data until 1995. Compute out-of-sample forecasts for the series using both types of models.
2.4 Diffusion models in discrete time
Several areas of finance make extensive use of stochastic processes in continuous time. However, data is only available in discrete time, and the empirical analysis has to be done in discrete time, too. In this section we focus on the relation between continuous and discrete time models.
Review 11:100 A geometric Brownian motion (GBM) is defined as
$$dP_t=\mu P_t\,dt+\sigma P_t\,dW_t\qquad\Longleftrightarrow\qquad\frac{dP_t}{P_t}=\mu\,dt+\sigma\,dW_t,$$
where $W_t$ is a Wiener process with the following properties:
1. $\Delta W_t=Z_t\sqrt{\Delta t}$ where $Z_t\sim\mathrm{N}(0,1)$ (standard normal) and $\Delta W_t\sim\mathrm{N}(0,\Delta t)$.
2. The changes over distinct (non-overlapping) intervals are independent101.
3. $W_t\sim\mathrm{N}(0,t)$ if $W_0=0$.
4. $W_t$ evolves in continuous time and has no jumps (no discontinuities). However, its sample paths are not smooth but rather erratic.
5. The increments of $W_t$ can be viewed as the counterpart of a discrete time white-noise process (with mean zero and unit variance if $\Delta t=1$), and $W_t$ corresponds to a discrete time random-walk.
A GBM is frequently used to describe stock prices and implies non-negativity of the price $P_t$. $\mu$ and $\sigma$ can be viewed as mean and standard deviation of the simple return $R_t=dP_t/P_t$. This return is measured over an infinitely small time interval dt and is therefore called instantaneous return. The (instantaneous) expected return is given by
$$\mathrm{E}[dP_t/P_t]=\mathrm{E}[\mu\,dt+\sigma\,dW_t]=\mu\,dt.$$
The (instantaneous) variance is given by
$$\mathrm{V}[dP_t/P_t]=\mathrm{V}[\mu\,dt+\sigma\,dW_t]=\sigma^2\mathrm{V}[dW_t]=\sigma^2dt.$$
Both mean and standard deviation are constant over time. $\mu$ and $\sigma$ are usually measured in annual terms.
The standard or arithmetic Brownian motion defined as
$$dX_t=\mu\,dt+\sigma\,dW_t\qquad(X_{t+\Delta t}-X_t)\sim\mathrm{N}(\mu\Delta t,\sigma^2\Delta t)$$
is not suitable to describe stock prices since $X_t$ can become negative.
A process that is frequently used to model interest rates is the Ornstein-Uhlenbeck process
$$dX_t=\kappa(\mu-X_t)dt+\sigma\,dW_t.$$
This is an example of a mean reverting process. When $X_t$ is above (below) $\mu$ it tends back to $\mu$ at a speed determined by the mean-reversion parameter $\kappa>0$. The square root process
$$dX_t=\kappa(\mu-X_t)dt+\sigma\sqrt{X_t}\,dW_t$$
is also used to model interest rates. It has the advantage that $X_t$ cannot become negative.
A very general process is the Ito process
$$dX_t=\mu(X_t,t)\,dt+\sigma(X_t,t)\,dW_t,$$
where mean and variance can be functions of $X_t$ and t.
100 Campbell et al. (1997), p.341 or Baxter and Rennie (1996), p.44.
101 Because of the normality assumption it is sufficient to require that changes are uncorrelated.
Review 12:102 If $X_t$ is an Ito process, then Ito's lemma states that a function $G_t=f(X_t,t)$ can be described by the stochastic differential equation (SDE)
$$dG_t=\left[\mu(\cdot)f'_X+f'_t+\frac{1}{2}\sigma^2(\cdot)f''_X\right]dt+\sigma(\cdot)f'_X\,dW_t,$$
where
$$f'_X=\frac{\partial G_t}{\partial X_t}\qquad f'_t=\frac{\partial G_t}{\partial t}\qquad f''_X=\frac{\partial^2G_t}{\partial X_t^2}.$$
Example 43: Suppose the stock price $P_t$ follows a GBM. We are interested in the process for the logarithm of the stock price. We have
$$G_t=\ln P_t\qquad\mu(\cdot)=\mu P_t\qquad\sigma(\cdot)=\sigma P_t\qquad\frac{\partial G_t}{\partial P_t}=\frac{1}{P_t}\qquad\frac{\partial G_t}{\partial t}=0\qquad\frac{\partial^2G_t}{\partial P_t^2}=-\frac{1}{P_t^2}.$$
Applying Ito's lemma we obtain
$$d\ln P_t=\left[\mu P_t\frac{1}{P_t}-0.5\sigma^2P_t^2\frac{1}{P_t^2}\right]dt+\sigma P_t\frac{1}{P_t}\,dW_t,$$
$$d\ln P_t=(\mu-0.5\sigma^2)dt+\sigma\,dW_t.$$
Thus, the log stock price $\ln P_t$ is an arithmetic Brownian motion with drift $\mu-0.5\sigma^2$, if $P_t$ is a GBM with drift $\mu$.
102 Tsay (2002), p.226.
2.4.1 Discrete time approximation
We first consider the discrete time approximation of a continuous time stochastic process in the interval $[t_0,T]$. We choose n equidistant time points $t_i=t_0+i\Delta t$ ($i=1,\ldots,n$), where $\Delta t=(T-t_0)/n=t_{i+1}-t_i$. The so-called Euler approximation of an Ito process is given by
$$X_{i+1}=X_i+\mu(\cdot)(t_{i+1}-t_i)+\sigma(\cdot)(W_{i+1}-W_i),$$
where $X_i$ is the discrete time approximation of $X_t$ at $t=t_i$. Equivalently, this could be written as $X_{i+1}=X_i+\Delta X_i$ where
$$\Delta X_i=X_{t_i+\Delta t}-X_{t_i}=\mu(\cdot)\Delta t+\sigma(\cdot)\Delta W_{t_i}.$$
Example 44: The Euler approximation of a GBM is given by103
$$\Delta P_i=\mu P_i\Delta t+\sigma P_i\Delta W_i\qquad\Longrightarrow\qquad\frac{\Delta P_i}{P_i}=R_i(\Delta t)=\mu\Delta t+\sigma\Delta W_i.$$
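A minimal sketch of this approximation for one GBM path (numpy assumed; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, dt, n, p0 = 0.1, 0.2, 1 / 252, 252, 100.0  # annual terms, daily steps
p = np.empty(n + 1); p[0] = p0
for i in range(n):
    dW = np.sqrt(dt) * rng.standard_normal()          # Delta W ~ N(0, dt)
    p[i + 1] = p[i] + mu * p[i] * dt + sigma * p[i] * dW
```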

2.4.2 Estimating parameters


We now assume that the stock price follows a GBM and consider the SDE of the logarithm of the stock price. From example 43 we know that
$$d\ln P_t=(\mu-0.5\sigma^2)dt+\sigma\,dW_t.$$
For a discrete time interval $\Delta t$ the corresponding log return process in discrete time is given by (see Gourieroux and Jasiak, 2001, p.287)104
$$\ln P_{t+\Delta t}-\ln P_t=Y_t(\Delta t)=(\mu-0.5\sigma^2)\Delta t+\sigma Z_t\sqrt{\Delta t}\qquad Z_t\sim\mathrm{N}(0,1).\qquad(47)$$
Suppose we have n+1 observations of a stock price $p_t$ ($t=0,\ldots,n$) sampled in discrete time. We want to use this sample to estimate the parameters $\mu$ and $\sigma^2$ of the underlying GBM in annual terms. We further suppose that log returns $y_t=\ln(p_t/p_{t-1})$ are i.i.d. normal. Several things should be taken into account when comparing continuous and discrete time models:
1. In section 2.1 we have used the symbol $\mu$ to denote the mean of log returns ($Y_t$). However, to follow the notation typically used in diffusion models, in the present section $\mu$ is the mean of the corresponding simple return $R_t$.
2. Time series analysis in discrete time usually does not explicitly specify $\Delta t$. The corresponding discrete time model would use $\Delta t=1$ (i.e. use intervals of one day, one week, ...).
103 Simulated sample paths of a GBM can be found in the file gbm.xls.
104 It is understood that t is a discrete point in time $t_i$ but we suppress the index i.
3. A discrete time series model for i.i.d. log returns would be formulated as
$$y_t=\mu_y+e_t\qquad e_t\sim\mathrm{N}(0,s_e^2),$$
where $\mu_y$ corresponds to $(\mu-0.5\sigma^2)\Delta t$, and $s_e^2$ (or $s_y^2$) to $\sigma^2\Delta t$. To estimate the GBM parameters $\mu$ and $\sigma$ (which are usually given in annual terms) the observation frequency of $y_t$ (which corresponds to $\Delta t$) has to be taken into account. We suppose that the time interval between t and t−1 is $\Delta t$ and is measured in years (e.g. $\Delta t=1/52$ for weekly data).
4. $d\ln P_t$ can be interpreted as the instantaneous log return of $P_t$. The (instantaneous) mean of the log return $d\ln P_t$ is $\mu-0.5\sigma^2$. However, when we compare equations (41), p.92 and (47) we find a discrepancy. The mean of log returns $Y_t$ in section 2.1.2 is given by $\ln(1+m)-0.5\sigma_Y^2$ whereas the mean of $Y_t(\Delta t)$ is given by $(\mu-0.5\sigma^2)\Delta t$. This can be explained by the fact that $\ln(1+m\Delta t)\to m\,dt$ as $\Delta t\to dt$.
The sample estimates from log returns ($\bar{y}$ and $s^2$) correspond to $(\mu-0.5\sigma^2)\Delta t$ and $\sigma^2\Delta t$, respectively. Thus estimates of $\mu$ and $\sigma^2$ are given by
$$\hat{\sigma}^2=s^2/\Delta t\qquad\hat{\mu}=\frac{\bar{y}}{\Delta t}+0.5\hat{\sigma}^2=\frac{\bar{y}}{\Delta t}+0.5\frac{s^2}{\Delta t}.$$
Gourieroux and Jasiak (2001, p.289) show that the asymptotic variance of $\hat{\sigma}^2$ and $\hat{\mu}$ is given by
$$\mathrm{aV}[\hat{\sigma}^2]=\frac{2\sigma^4}{n}\qquad\mathrm{aV}[\hat{\mu}]=\frac{\sigma^2}{n\Delta t}+\frac{\sigma^4}{2n}.$$
By increasing the sampling frequency more observations become available (n increases), but $\Delta t$ becomes accordingly smaller. The net effect is that $n\Delta t$ stays constant, the first term in the definition of $\mathrm{aV}[\hat{\mu}]$ does not become smaller as n increases, and the drift cannot be consistently estimated.
Example 45: In example 31 the mean FTSE log return $\bar{y}$ estimated from monthly data was 0.00765 and the standard deviation s was 0.065256. The estimated mean and variance of the underlying GBM in annual terms are given by
$$\hat{\sigma}^2=0.065256^2\cdot12=0.0511\qquad\hat{\mu}=0.00765\cdot12+0.5\cdot0.0511=0.117346.$$
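The computation is a one-liner; a sketch (numpy assumed, y an array of log returns, dt the observation interval in years):

```python
import numpy as np

def gbm_estimates(y, dt):
    sigma2 = y.var(ddof=1) / dt          # sigma^2-hat = s^2 / dt
    mu = y.mean() / dt + 0.5 * sigma2    # mu-hat = y-bar/dt + 0.5 * sigma^2-hat
    return mu, sigma2

# e.g. monthly returns: gbm_estimates(y, 1/12) reproduces example 45
```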
We now consider estimating the parameters of the Ornstein-Uhlenbeck process using a discrete time series. A simplified discrete time version of the process can be written as
$$X_t-X_{t-\Delta t}=\kappa\mu\Delta t-\kappa\Delta t\,X_{t-\Delta t}+\sigma Z_t\sqrt{\Delta t}$$
$$X_t=\kappa\mu\Delta t+(1-\kappa\Delta t)X_{t-\Delta t}+\sigma Z_t\sqrt{\Delta t}.$$
This is equivalent to an AR(1) model (using the notation from section 2.2)
$$X_t=\mu+\phi_1X_{t-1}+\epsilon_t,$$
where $\mu$ corresponds to $\kappa\mu\Delta t$, $\phi_1$ to $(1-\kappa\Delta t)$, and $\sigma_\epsilon^2$ to $\sigma^2\Delta t$. The Ornstein-Uhlenbeck process is only mean reverting (or stationary) if $\kappa>0$. This corresponds to the condition $|\phi_1|<1$ for AR(1) models. Thus it is useful to carry out a unit-root test before the parameters $\kappa$, $\mu$ and $\sigma$ are estimated.
Given an observed series $x_t$ we can fit the AR(1) model
$$x_t=c+f_1x_{t-1}+e_t$$
and use the estimates c, $f_1$ and $s_e$ to estimate $\kappa$, $\mu$ and $\sigma$ (in annual terms):
$$\hat{\kappa}=\frac{1-f_1}{\Delta t}\qquad\hat{\mu}=\frac{c}{\hat{\kappa}\Delta t}=\frac{c}{1-f_1}\qquad\hat{\sigma}=\frac{s_e}{\sqrt{\Delta t}}.$$
Since estimated AR coefficients are biased105 downwards in small samples, $\hat{\kappa}$ will be biased upwards.
A precise discrete time formulation is given by (see Gourieroux and Jasiak, 2001, p.289)
$$X_t=\mu(1-\exp\{-\kappa\Delta t\})+\exp\{-\kappa\Delta t\}X_{t-\Delta t}+\tilde{\sigma}Z_t\sqrt{\Delta t},$$
where
$$\tilde{\sigma}=\sigma\left[\frac{1-\exp\{-2\kappa\Delta t\}}{2\kappa\Delta t}\right]^{1/2}.$$
Using this formulation the parameters are estimated by
$$\hat{\kappa}=-\frac{\ln f_1}{\Delta t}\qquad\hat{\mu}=\frac{c\exp\{\hat{\kappa}\Delta t\}}{\exp\{\hat{\kappa}\Delta t\}-1}\qquad\hat{\sigma}=\frac{s_e}{\sqrt{\Delta t}}\left[\frac{1-\exp\{-2\hat{\kappa}\Delta t\}}{2\hat{\kappa}\Delta t}\right]^{1/2}.$$
Note that f1 has to be positive in this case.
Example 46: In example 40 we have found that the spread between long- and short-term interest rates in the UK is stationary (or mean reverting). We assume that the spread follows an Ornstein-Uhlenbeck process. The estimated AR(1) model using quarterly data is106
$$x_t=0.1764+0.8563x_{t-1}+e_t\qquad s_e=0.8696,$$
which yields the following estimates in annual terms ($\Delta t=1/4$):
$$\hat{\kappa}=\frac{1-0.8563}{\Delta t}=0.575\qquad\hat{\mu}=\frac{0.1764}{0.575\Delta t}=1.227\qquad\hat{\sigma}=\frac{0.8696}{\sqrt{\Delta t}}=1.74.$$
Using the precise formulation we obtain
$$\hat{\kappa}=-\frac{\ln 0.8563}{\Delta t}=0.62\qquad\hat{\mu}=\frac{0.1764\exp\{0.62\Delta t\}}{\exp\{0.62\Delta t\}-1}=1.227$$
$$\hat{\sigma}=\frac{0.8696}{\sqrt{\Delta t}}\left[\frac{1-\exp\{-2\cdot0.62\Delta t\}}{2\cdot0.62\Delta t}\right]^{1/2}=1.613.$$
105 The bias increases as the AR parameter $\phi_1$ approaches one, or as the mean reversion parameter $\kappa$ approaches zero.
106 Details can be found in the file ornstein-uhlenbeck.xls.
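Both sets of formulas translate directly into code; a sketch (numpy assumed) using the AR(1) estimates of example 46:

```python
import numpy as np

def ou_simple(c, f1, se, dt):
    kappa = (1 - f1) / dt
    mu = c / (1 - f1)
    sigma = se / np.sqrt(dt)
    return kappa, mu, sigma

def ou_precise(c, f1, se, dt):          # requires f1 > 0
    kappa = -np.log(f1) / dt
    mu = c * np.exp(kappa * dt) / (np.exp(kappa * dt) - 1)
    factor = (1 - np.exp(-2 * kappa * dt)) / (2 * kappa * dt)
    sigma = se / np.sqrt(dt) * np.sqrt(factor)   # as in example 46
    return kappa, mu, sigma

print(ou_simple(0.1764, 0.8563, 0.8696, 0.25))   # (0.575, 1.227, 1.74)
print(ou_precise(0.1764, 0.8563, 0.8696, 0.25))  # (0.62, 1.227, 1.613)
```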
2.4.3 Probability statements about future prices
We now focus on longer time intervals and consider price changes over T periods (e.g. 30 days). The T-period log return is the change in log prices between t and t+T. Thus the log return is normally distributed107 with mean and variance
$$\mathrm{E}[\ln P_{t+T}]-\ln P_t=\mathrm{E}[Y_t(T)]=(\mu-0.5\sigma^2)T\qquad\mathrm{V}[Y_t(T)]=\sigma^2T.$$
Equivalently, $P_{t+T}$ is lognormal and $\ln P_{t+T}$ is normally distributed:
$$\ln P_{t+T}\sim\mathrm{N}(\ln P_t+(\mu-0.5\sigma^2)T,\ \sigma^2T).$$
Conditional on $P_t$ the expected value of $P_{t+T}$ is
$$\mathrm{E}[P_{t+T}|P_t]=P_t\exp\{\mu T\}.$$
The discrepancy between this formula and equation (45), p.119 used to forecast prices in section 2.3.2 can be reconciled by noting that here $\mu$ is the mean of simple returns. The corresponding discrete time series model for log returns $y_t$ is
$$y_t=\mu_y+e_t\qquad e_t\sim\mathrm{N}(0,s^2)$$
and the conditional expectation of $p_{t+T}$ is
$$\mathrm{E}[p_{t+T}|p_t]=p_t\exp\{\bar{y}T+0.5s^2T\}.$$
A (1−α) confidence interval for the price in t+T can be computed from the properties of T-period log returns. The boundaries of the interval for log returns are given by
$$(\mu-0.5\sigma^2)T\pm|z_{\alpha/2}|\sigma\sqrt{T},$$
and the boundaries for the price $P_{t+T}$ are given by108
$$P_t\exp\left\{(\mu-0.5\sigma^2)T\pm|z_{\alpha/2}|\sigma\sqrt{T}\right\}.$$
Example 47: On December 28, 1990 the value of the FTSE was 2160.4 (according to finance.yahoo.com). We use the estimated mean and variance from example 45 to compute a 95% confidence interval for the index in nine months (end of September 1991) and ten years (December 2000).109
Using $\hat{\sigma}^2$=0.05 and $\hat{\mu}$=0.117 the interval for T=0.75 is given by110
$$2160.4\cdot\exp\left\{(0.117-0.5\cdot0.05)0.75\pm1.96\sqrt{0.05\cdot0.75}\right\}=[1584;3383]$$
and for T=10
$$2160.4\cdot\exp\left\{(0.117-0.5\cdot0.05)10\pm1.96\sqrt{0.05\cdot10}\right\}=[1356;21676].$$
Note: the actual values of the FTSE were 2621.7 (September 30, 1991) and 6222.5 (December 29, 2000).
107 The normal assumption for log returns cannot be justified empirically unless the observation frequency is low.
108 Note that the bounds are not given by $\mathrm{E}[P_{t+T}]\pm|z_{\alpha/2}|\sqrt{\mathrm{V}[P_{t+T}]}$.
109 Details can be found in the file probability statements.xls.
110 We use rounded values of the estimates $\hat{\mu}$ and $\hat{\sigma}$.
We now consider probabilities like $\mathrm{P}[P_{t+T}\leq K]$, where K is a pre-specified, non-stochastic value (e.g. the strike price in option pricing).
Given that log returns over T periods are normally distributed, $Y_t(T)\sim\mathrm{N}((\mu-0.5\sigma^2)T,\sigma^2T)$, probability statements about $P_{t+T}$ can be based on the properties of $Y_t(T)$:
$$\mathrm{P}[P_t\exp\{Y_t(T)\}\leq K]=\mathrm{P}[P_{t+T}\leq K]=\mathrm{P}[Y_t(T)\leq\ln(K/P_t)].$$
For instance, the probability that the price in t+T is less than K is given by
$$\mathrm{P}[Y_t(T)\leq\ln(K/P_t)]=\Phi\left(\frac{\ln(K/P_t)-(\mu-0.5\sigma^2)T}{\sigma\sqrt{T}}\right).$$
Similar probabilities are used in the Black-Scholes option pricing formula, and can be used in a heuristic derivation of that formula111.
Example 48: We use the information from example 47 to compute the probability that the FTSE will be below K=2000 in September 1991:
$$\mathrm{P}[P_{t+T}\leq K]=\Phi\left(\frac{\ln(2000/2160.4)-(0.117-0.5\cdot0.05)0.75}{\sqrt{0.05\cdot0.75}}\right)=0.225.$$
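A sketch of the computations in examples 47 and 48 (numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import norm

P0, mu, sigma2, T, K = 2160.4, 0.117, 0.05, 0.75, 2000.0
m, s = (mu - 0.5 * sigma2) * T, np.sqrt(sigma2 * T)  # T-period log-return moments
ci = P0 * np.exp(m + np.array([-1.96, 1.96]) * s)    # ~[1584, 3383]
prob = norm.cdf((np.log(K / P0) - m) / s)            # ~0.225
print(ci, prob)
```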
Exercise 24:
1. Use a time series from exercise 17 (stock price, index or exchange rate). Assume that this series follows a GBM and estimate the parameters $\mu$ and $\sigma$ (in annual terms).
2. Select a time series that appears to be mean-reverting. Verify this assumption by a unit-root test. Assume that this series follows an Ornstein-Uhlenbeck process and estimate the parameters $\kappa$, $\mu$ and $\sigma$.
111 For details see Jarrow and Rudd (1983), p.90.
2.5 GARCH models
For many problems in finance the variance or volatility of returns is a parameter of central importance. It can serve as a risk measure, it is necessary for portfolio selection, and it is required in option pricing and in the context of value-at-risk.
The time series models from section 2.2 can be used to replace the unconditional mean by a conditional mean (i.e. the sample mean $\bar{y}$ is replaced by $\hat{y}_t$). Similarly, the purpose of modelling the variance is to replace the unconditional sample estimate $s^2$ by a conditional estimate $s_t^2$. Given that the volatility of returns is typically not constant over time, the conditional variance $s_t^2$ should be a better variance estimate or forecast than $s^2$.
The variance of a GARCH process is not constant over time (heteroscedastic), and its conditional variance follows a generalized AR model (see below). The acronym GARCH stands for 'generalized autoregressive conditional heteroscedasticity'. A GARCH model always consists of two equations:
1. The equation for the conditional mean has the following general form:
$$Y_t=\hat{Y}_{t-1,1}+\epsilon_t=\mathrm{E}[Y_t|I_{t-1}]+\epsilon_t\qquad\epsilon_t\ \ldots\ \text{white-noise}.$$
$\hat{Y}_{t-1,1}$ is the conditional expectation (or the one-step ahead forecast) of $Y_t$ derived from a time series or regression model. $I_{t-1}$ is the information set available at time t−1. If $Y_t$ is white-noise, $\hat{Y}_{t-1,1}=\mu$.
In a GARCH model the variance of the disturbance term $\epsilon_t$ is not constant; the conditional variance is time-varying:
$$\mathrm{E}[(Y_t-\hat{Y}_{t-1,1})^2|I_{t-1}]=\sigma_t^2.$$
What we need is a model that determines how $\sigma_t^2$ evolves over time.
2. In a GARCH(1,1) model the time variation of the conditional variance is given by
$$\sigma_t^2=\omega_0+\omega_1(Y_{t-1}-\hat{Y}_{t-2,1})^2+\beta_1\sigma_{t-1}^2$$
$$=\omega_0+\omega_1\epsilon_{t-1}^2+\beta_1\sigma_{t-1}^2\qquad\omega_0,\omega_1,\beta_1\geq0,\quad(\omega_1+\beta_1)<1.$$
It is frequently assumed but not necessary that the conditional distribution of $\epsilon_t$ is normal: $\epsilon_t|I_{t-1}\sim\mathrm{N}(0,\sigma_t^2)$.
The conditional variance in t is based on 'news' or 'shocks' (i.e. forecast errors $\epsilon_t$) introduced by the term $\omega_1\epsilon_{t-1}^2$. In addition, the variance in t is based on the conditional variance of the previous period weighted by $\beta_1$. $\omega_1$ determines the immediate (but lagged) response to shocks and $\beta_1$ determines the duration of the effect. If $\beta_1$ is much greater than $\omega_1$, $\sigma_t^2$ decays very slowly after extraordinary events (large $\epsilon_t$).
The coefficients $\omega_0$, $\omega_1$ and $\beta_1$ also determine the average level of $\sigma_t^2$, which is identical to the unconditional variance of $\epsilon_t$:
$$\sigma^2=\frac{\omega_0}{1-\omega_1-\beta_1}.\qquad(48)$$
Note that the conditional (and unconditional) variances of $Y_t$ and $\epsilon_t$ are only identical if the conditional mean of $Y_t$ is constant ($\hat{Y}_{t-1,1}=\mu$).
GARCH models account for two well documented features of financial returns (a simulation sketch follows below):
1. Volatility clustering (heteroscedasticity): Suppose a (relatively) large value of $\epsilon_t$ occurs. This leads to an increase in $\sigma_t^2$ in the following period. Thus, the conditional distribution of returns in the subsequent period(s) has a higher variance. This makes further large disturbances more likely. As a result, a phase with approximately the same level of volatility, a volatility cluster, is formed. If $\omega_1$ is greater than $\beta_1$, the conditional variance returns very quickly to a lower level and the degree of volatility clustering is small.
2. Non-normality: The kurtosis of a GARCH(1,1) model is given by112
$$\frac{\mathrm{E}[\epsilon_t^4]}{\mathrm{V}[\epsilon_t]^2}=\frac{3[1-(\omega_1+\beta_1)^2]}{1-(\omega_1+\beta_1)^2-2\omega_1^2}>3.$$
Thus, GARCH models can account for fat tails. Although the unconditional moments implied by a GARCH model can be determined, the unconditional GARCH distribution is not known analytically, even when the conditional distribution is normal.
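Both features are easy to see by simulation; a sketch of a conditionally normal GARCH(1,1) (numpy assumed, parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
w0, w1, b1, n = 1e-5, 0.1, 0.85, 5000
eps = np.empty(n)
s2 = w0 / (1 - w1 - b1)                  # start at the unconditional variance (48)
for t in range(n):
    eps[t] = np.sqrt(s2) * rng.standard_normal()
    s2 = w0 + w1 * eps[t]**2 + b1 * s2   # conditional variance of the next period
kurt = (eps**4).mean() / (eps**2).mean() ** 2
print(kurt)   # clearly above 3 although the conditional distribution is normal
```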
If a time series or regression model has heteroscedastic or non-normal residuals, the standard errors of estimated parameters (and p-values) are biased (see section 1.7). Since GARCH models can account for both problems, adding a GARCH equation to a model for the conditional mean may lead to choosing a different ARMA model or different explanatory variables in a regression model.
The GARCH(p,q) model is a generalization of the GARCH(1,1) model where q past values of $\epsilon_t^2$ and p past values of $\sigma_t^2$ are used:
$$\sigma_t^2=\omega_0+\sum_{i=1}^{q}\omega_i\epsilon_{t-i}^2+\sum_{i=1}^{p}\beta_i\sigma_{t-i}^2.$$
Many empirical investigations found that GARCH(1,1) models are sufficient (see e.g. Bollerslev et al., 1992).
112 For details see Tsay (2002) p.118.
2.5.1 Estimating and diagnostic checking of GARCH models
GARCH models cannot be estimated with least squares because the variance cannot be observed directly. Thus, the difference between 'actual' and fitted variance cannot be computed. GARCH models can be estimated by maximum-likelihood.113 To estimate a GARCH model we need (a) a model for the conditional mean $\hat{y}_t$ to determine the residuals $\epsilon_t=y_t-\hat{y}_t$; (b) a conditional distribution for the residuals; and (c) a model for the conditional variance of $\epsilon_t$. There exists no well established strategy for selecting the order of a GARCH model (similar to the ARMA model building strategy). The choice cannot be based on (partial) autocorrelations of squared returns or residuals. A simple model building strategy starts with a GARCH(1,1) model, adds further lags to the variance equation, and uses AIC or SC to select a final model.
If we assume a conditional normal distribution for the residuals, $\epsilon_t|I_{t-1}\sim\mathrm{N}(0,\sigma_t^2)$, the log-likelihood function is given by
$$\ell=-\frac{n}{2}\ln 2\pi-\frac{1}{2}\sum_{t=1}^{n}\ln\sigma_t^2-\frac{1}{2}\sum_{t=1}^{n}\frac{\epsilon_t^2}{\sigma_t^2},$$
where $\sigma_t^2$ can be defined in terms of a GARCH(p,q) model. The log-likelihood is a straightforward extension of equation (15) in section 1.4. It is obtained by replacing the constant variance $\sigma^2$ by the conditional variance $\sigma_t^2$ from the GARCH equation.
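A bare-bones sketch of this estimator (numpy/scipy assumed, y a return series): the recursion below evaluates the log-likelihood above for a constant-mean GARCH(1,1); positivity and stationarity constraints are omitted for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def negloglik(params, y):
    mu, w0, w1, b1 = params
    e = y - mu
    s2 = np.empty(len(y))
    s2[0] = y.var()                          # initialize with the sample variance
    for t in range(1, len(y)):
        s2[t] = w0 + w1 * e[t - 1]**2 + b1 * s2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(s2) + e**2 / s2)

start = np.array([y.mean(), 1e-5, 0.1, 0.8])
res = minimize(negloglik, start, args=(y,), method='Nelder-Mead')
print(res.x)    # estimates of mu, w0, w1, b1
```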
Diagnostic checking of a GARCH model is based on the standardized residuals $\tilde{e}_t=e_t/s_t$. The GARCH model is adequate if $\tilde{e}_t$, $\tilde{e}_t^2$ and $|\tilde{e}_t|$ are white-noise, and $\tilde{e}_t$ is normal.
2.5.2 Example 49: ARMA-GARCH models for IBM and FTSE returns
In example 33 we found that IBM log returns are white-noise, and example 34 indicated heteroscedasticity of returns. Therefore we estimate a GARCH(1,1) model with constant mean:
$$y_t=0.0002+e_t\qquad s_t^2=9.6\times10^{-6}+0.27\,e_{t-1}^2+0.72\,s_{t-1}^2.$$
(0.75) (0.002) (0.0) (0.0)
However, $\tilde{e}_t$ is not white-noise ($r_1$=0.138, p-value of $Q_1$: 0.008). Therefore we add AR(1) and MA(1) terms to the conditional mean equation. AIC and SC select the following model:
$$y_t=0.0003+0.1\,e_{t-1}+e_t\qquad s_t^2=7.8\times10^{-6}+0.24\,e_{t-1}^2+0.75\,s_{t-1}^2.$$
(0.67) (0.078) (0.003) (0.0) (0.0)
Adding the term $e_{t-2}^2$ to the variance equation is supported by AIC but not by SC. The standardized residuals and their squares are white-noise (the p-values of $Q_5$ are 0.75 and 0.355, respectively). The JB-test rejects normality, but the kurtosis is only 4.06 (the skewness is −0.19), whereas skewness and kurtosis of observed returns are −0.6 and 8.2. We conclude that the GARCH model explains a lot of the non-normality of IBM log returns.
The conditional standard deviation $s_t$ from this model captures the changes in the volatility of residuals $e_t$ very well (see Figure 9). The conditional mean is very close to the unconditional mean, and thus residuals and returns are almost equal (compare the returns in Figure 2 to the residuals in Figure 9).

Figure 9: Conditional standard deviation and residuals from a MA(1)-GARCH(1,1) model for IBM log returns.

113 An example can be found in the file AR-GARCH ML estimation.xls.
We extend the MA(2) model for FTSE log returns from example 2.2.6 and fit the MA(2)-GARCH model
$$y_t=0.0074+0.06\,e_{t-1}-0.15\,e_{t-2}+e_t$$
(0.021) (0.4) (0.01)
$$s_t^2=0.0003+0.099\,e_{t-1}^2+0.82\,s_{t-1}^2.$$
(0.11) (0.016) (0.0)
The p-values of the MA coefficients have changed compared to example 2.2.6. The first MA parameter $h_1$ is clearly insignificant and could be removed from the mean equation. In example 2.2.6 we found that the MA residuals were not normal and not homoscedastic. Since p-values are biased in this case, we expect that adding a GARCH equation which accounts for non-normality and heteroscedasticity should affect the p-values.
The standardized residuals of the MA-GARCH model are white-noise and homoscedastic but not normal. If the conditional normal assumption does not turn out to be adequate, a different conditional distribution has to be used (e.g. a t-distribution).
Exercise 25: Use the ARMA models from exercise 20, estimate ARMA-GARCH models, and carry out diagnostic checking.
2.5.3 Forecasting with GARCH models

GARCH models can be used to determine static and dynamic variance forecasts of a time series. The GARCH(1,1) forecasting equations for future dates $t+\tau$ are
$$\sigma^2_{t,1} = \omega_0 + \omega_1\epsilon_t^2 + \beta_1\sigma_t^2$$
$$\sigma^2_{t,2} = \omega_0 + \omega_1\epsilon_{t+1}^2 + \beta_1\sigma^2_{t,1} = \omega_0 + \omega_1\epsilon_{t+1}^2 + \beta_1(\omega_0 + \omega_1\epsilon_t^2 + \beta_1\sigma_t^2).$$
The unknown future value $\epsilon_{t+1}^2$ in this equation is replaced by the conditional expectation $\mathrm E[\epsilon_{t+1}^2|I_t]=\sigma^2_{t,1}$:
$$\sigma^2_{t,2} = \omega_0 + \omega_1\sigma^2_{t,1} + \beta_1(\omega_0 + \omega_1\epsilon_t^2 + \beta_1\sigma_t^2) = \omega_0 + (\omega_1+\beta_1)\sigma^2_{t,1}.$$
Thus, the variance for $t+2$ can be determined on the basis of $\epsilon_t$ and $\sigma_t^2$. The same procedure can be applied recursively to obtain forecasts for any $\tau$:
$$\sigma^2_{t,\tau} = \omega_0 + (\omega_1+\beta_1)\sigma^2_{t,\tau-1}.$$
For increasing $\tau$ the forecasts $\sigma^2_{t,\tau}$ converge to the unconditional variance $\sigma^2$ from equation (48), provided $(\omega_1+\beta_1)<1$. The time until the level of the unconditional variance is reached depends on the GARCH parameters, the value of the last residual in the sample, and the difference between the unconditional variance and the conditional variance in $t$ (when the forecast is made).
We finally note that, in general, the variance of $h$-period returns $y_t(h)$ estimated from a GARCH model will differ from the (frequently used) unconditional estimate $h\sigma^2$ which is based on homoscedastic returns. The $h$-period variance is given by the sum
$$\sigma^2(h) = \sum_{\tau=1}^{h}\sigma^2_{t,\tau},$$
which also depends on the current level $\sigma_t^2$.
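A minimal sketch of this recursion (Python; the function and variable names are ours): given the last residual $\epsilon_n$ and conditional variance $\sigma_n^2$, it returns the dynamic forecasts and the implied h-period variance:

import numpy as np

def garch11_var_forecast(w0, w1, b1, e_n, s2_n, h):
    s2 = np.empty(h)
    s2[0] = w0 + w1 * e_n**2 + b1 * s2_n          # one-step forecast
    for tau in range(1, h):
        s2[tau] = w0 + (w1 + b1) * s2[tau - 1]    # recursion for tau >= 2
    return s2, s2.sum()                           # forecasts, h-period variance

If $\omega_1+\beta_1<1$ the sequence converges to the unconditional variance $\omega_0/(1-\omega_1-\beta_1)$.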


2.5.4 Special GARCH models

In empirical studies it is usually found that $\omega_1+\beta_1$ is close to one. For instance, in the models estimated in example 49 we found $\omega_1+\beta_1$=0.99 and 0.92. The sum of the GARCH parameters $(\omega_1+\omega_2+\cdots+\beta_1+\beta_2+\cdots)$ can be used as a measure of persistence in variance. Persistence implies that the conditional variance tends to remain at a particular (high or low) level. This tendency increases with the level of persistence. If persistence is high this leads to volatility clustering.
The integrated I-GARCH model is a special case of a GARCH model with the constraint $\omega_1+\beta_1$=1. This saves one parameter to be estimated. The forecasts from an I-GARCH model are given by
$$\sigma^2_{t,\tau} = \omega_0 + \sigma^2_{t,\tau-1},$$
and thus increase by $\omega_0$ at each step rather than converging to an unconditional variance.
A further special case is the exponentially weighted moving average (EWMA) where $\omega_0$=0 and $\omega_1+\beta_1$=1, and only one parameter $\lambda$ is required:
$$\sigma_t^2 = (1-\lambda)\epsilon_{t-1}^2 + \lambda\sigma_{t-1}^2.$$
The EWMA model is used by RiskMetrics for value-at-risk[114] calculations. In this context the parameter $\lambda$ is not estimated. RiskMetrics recommends to use values around 0.95.
Example 50: Figure 10 shows the estimated in-sample (until end of 1987) and out-of-sample (starting 1988) variance of FTSE log returns from the GARCH(1,1) model
$$s_t^2 = 0.0003 + 0.115\,(y_{t-1} - 0.0079)^2 + 0.83\,s_{t-1}^2$$
with constant mean return; the p-values, in order of appearance, are (0.13), (0.01), (0.05) and (0.0). The EWMA variance[115] using $\lambda$=0.95 is shown for comparison. The dynamic forecasts converge to the unconditional variance based on the estimated parameters ($\omega_0/(1-\omega_1-\beta_1)$=0.0003/(1-0.115-0.83)=0.0055). During the in-sample period EWMA and GARCH variance behave similarly. Differences in the decay after large shocks are due to the difference between $\beta_1$=0.83 and $\lambda$=0.95.

GARCH models can be extended in various ways, and numerous formulations of the variance equation exist. In the threshold ARCH (TARCH) model, for instance, asymmetric effects of news on the variance can be taken into account. In this case the variance equation has the following form:
$$\sigma_t^2 = \omega_0 + \omega_1\epsilon_{t-1}^2 + \gamma_1\epsilon_{t-1}^2 d_{t-1} + \beta_1\sigma_{t-1}^2,$$
where $d_t$=1 ($d_t$=0) if $\epsilon_t<0$ ($\epsilon_t\ge 0$). If $\gamma_1>0$ negative disturbances have a stronger effect on the variance than positive ones. The exponential GARCH (EGARCH) model also allows for modelling asymmetric effects. It is formulated in terms of the logarithm of $\sigma_t^2$:
$$\ln\sigma_t^2 = \omega_0 + \gamma_1\frac{\epsilon_{t-1}}{\sigma_{t-1}} + \omega_1\left|\frac{\epsilon_{t-1}}{\sigma_{t-1}}\right| + \beta_1\ln\sigma_{t-1}^2.$$

[114] http://www.riskmetrics.com/mrdocs.html
[115] The EWMA variance during the out-of-sample period is based on observed returns, while the dynamic GARCH variance forecasts do not use any data at all from that period.

Figure 10: GARCH and EWMA estimates and forecasts of the variance of FTSE log returns. [time series plot of the GARCH and the EWMA (0.95) variance, 1966-1990]

If $\epsilon_{t-1}<0$ ($\epsilon_{t-1}>0$) the total impact of $|\epsilon_{t-1}/\sigma_{t-1}|$ on the conditional (log) variance is given by $\omega_1-\gamma_1$ ($\gamma_1+\omega_1$). If bad news have a stronger effect on volatility the expected signs are $\omega_1+\gamma_1>0$ and $\gamma_1<0$.
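The asymmetry is easy to see in a small filter sketch (Python; names and the starting value are ours) that computes the TARCH conditional variance from a residual series:

import numpy as np

def tarch_variance(e, w0, w1, g1, b1):
    n = len(e)
    s2 = np.empty(n)
    s2[0] = np.var(e)                             # assumed starting value
    for t in range(1, n):
        d = 1.0 if e[t - 1] < 0 else 0.0          # indicator for bad news
        s2[t] = w0 + (w1 + g1 * d) * e[t - 1]**2 + b1 * s2[t - 1]
    return s2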
As a further extension, explanatory variables can be included in the variance equation. Some empirical investigations show that the number or the volume of trades have a significant effect on the conditional variance (see Lamoureux and Lastrapes, 1990). After including such explanatory variables the GARCH parameters frequently become smaller or insignificant.
In the GARCH-in-the-mean (GARCH-M) model the conditional variance or standard deviation is used as an explanatory variable in the equation for the conditional mean:
$$Y_t = \mu + \theta\sigma_t^2 + \epsilon_t,$$
where any GARCH model can be specified for $\sigma_t^2$. A significant parameter $\theta$ would support the hypothesis that expected returns of an asset contain a risk premium that is proportional to the variance (or standard deviation) of that asset's returns. However, according to financial theory (e.g. the CAPM) the risk premium of an asset has to be determined in the context of a portfolio of many assets.
Exercise 26: Use the log returns defined in exercise 17 and estimate a TARCH model to test for asymmetry in the conditional variance.
Obtain a daily financial time series from finance.yahoo.com and retrieve the trading volume, too. Add volume as explanatory variable to the GARCH equation. Hint: Rescale the volume series (e.g. divide by $10^6$ or a greater number), and/or divide by the price or index to convert volume into number of trades.
Use the log returns defined in exercise 17 and estimate a GARCH-M model.
3 Vector time series models
3.1 Vector-autoregressive models
3.1.1 Formulation of VAR models
Multivariate time series analysis deals with more than one series and accounts for feedback
among the series. The models can be viewed as extensions or generalizations of univariate
ARMA models. A basic model of multivariate analysis is the vector-autoregressive
(VAR) model.116
VAR models have their origin mainly in macroeconomic modeling, where simultaneous (structural) equation models developed in the fifties and sixties turned out to have inferior forecasting performance. There were also concerns about the validity of the theories underlying the structural models. Simple, small-scale VAR models were found to provide suitable tools for analyzing the impacts of policy changes or external shocks. VAR models
are mainly applied in the context of Granger causality tests and impulse-response analyses
(see Greene, 2003, p.592). In addition, they are the basis for vector error correction models
(see section 3.2).
The standard form or reduced form of a first order VAR model -- VAR(1) -- for two processes $Y_t$ and $X_t$ is given by
$$Y_t = \nu_y + \phi_{yy}Y_{t-1} + \phi_{yx}X_{t-1} + \epsilon_{yt}$$
$$X_t = \nu_x + \phi_{xy}Y_{t-1} + \phi_{xx}X_{t-1} + \epsilon_{xt},$$
where $\epsilon_{yt}$ and $\epsilon_{xt}$ are white-noise disturbances which may be correlated. A VAR(1) process can be written in matrix form as
$$Y_t = V + \Phi_1 Y_{t-1} + \epsilon_t \qquad \epsilon_t\sim\mathrm N(0,\Sigma_\epsilon),$$
where $Y_t$ is a column vector which contains all $k$ series in the model. $V$ is a vector of constants. $\Phi_1$ is a $k\times k$ matrix containing the autoregressive coefficients for lag 1. $\epsilon_t$ is a column vector of disturbance terms assumed to be normally distributed with covariance $\Sigma_\epsilon$. In the two-variable VAR(1) model formulated above $Y_t$, $V$, $\Phi_1$ and $\epsilon_t$ are given by
$$Y_t = \begin{bmatrix}Y_t\\X_t\end{bmatrix} \quad V = \begin{bmatrix}\nu_y\\\nu_x\end{bmatrix} \quad \Phi_1 = \begin{bmatrix}\phi_{yy}&\phi_{yx}\\\phi_{xy}&\phi_{xx}\end{bmatrix} \quad \epsilon_t = \begin{bmatrix}\epsilon_{yt}\\\epsilon_{xt}\end{bmatrix}.$$
$\Sigma_\epsilon$ is related to the correlation matrix of disturbances $C$ and the vector of standard errors $\sigma_\epsilon$ by $\Sigma_\epsilon = \mathrm{diag}(\sigma_\epsilon)\,C\,\mathrm{diag}(\sigma_\epsilon)$.
The moving average (MA) representation of a VAR(1) model exists if the VAR process is stationary. This requires that all eigenvalues of $\Phi_1$ have modulus less than one (see Lütkepohl, 1993, p.10). In this case
$$Y_t = \mu + \sum_{i=0}^{\infty}\Phi_1^i\,\epsilon_{t-i} = \mu + \sum_{i=0}^{\infty}\Psi_i\,\epsilon_{t-i},$$
where $\Phi_1^i$ denotes the matrix power of order $i$, $\Psi_i$ is the MA coefficient matrix for lag $i$, and $\mu=(I-\Phi_1)^{-1}V$. The autocovariance of $Y_t$ for lag $\ell$ is given by
$$\sum_{i=0}^{\infty}\Phi_1^{\ell+i}\,\Sigma_\epsilon\,(\Phi_1^i)' = \sum_{i=0}^{\infty}\Psi_{\ell+i}\,\Sigma_\epsilon\,\Psi_i'.$$

[116] The general case of vector ARMA models will not be presented in this text; see Tsay (2002), p.322 for details.
Extensions to higher order VAR models are possible (see Lütkepohl, 1993, p.11).
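The quantities of the MA representation are straightforward to compute numerically. The following sketch (Python; the parameter values are purely illustrative) obtains $\mu=(I-\Phi_1)^{-1}V$ and the MA matrices $\Psi_i=\Phi_1^i$ after checking stationarity:

import numpy as np

Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])      # illustrative VAR(1) coefficients
V = np.array([0.1, 0.2])           # illustrative constants

assert np.all(np.abs(np.linalg.eigvals(Phi1)) < 1), "VAR(1) is not stationary"
mu = np.linalg.solve(np.eye(2) - Phi1, V)                   # unconditional mean
Psi = [np.linalg.matrix_power(Phi1, i) for i in range(10)]  # MA matrices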
The VAR model in standard form only contains lagged variables on the right hand side. This raises the question whether and how contemporaneous dependencies between $Y_t$ and $X_t$ are taken into account. To answer this question we consider the following example:
$$Y_t = \omega_0 X_t + \omega_1 X_{t-1} + \phi_1 Y_{t-1} + U_t$$
$$X_t = \theta_1 X_{t-1} + W_t.$$
These equations can be formulated as a VAR(1) model in structural form[117]:
$$\begin{bmatrix}1 & -\omega_0\\ 0 & 1\end{bmatrix}\begin{bmatrix}Y_t\\X_t\end{bmatrix} = \begin{bmatrix}\phi_1 & \omega_1\\ 0 & \theta_1\end{bmatrix}\begin{bmatrix}Y_{t-1}\\X_{t-1}\end{bmatrix} + \begin{bmatrix}U_t\\W_t\end{bmatrix}.$$
The structural form may include contemporaneous relations represented by the coefficient matrix on the left side of the equation. Substituting $X_t$ from the second equation into the first equation yields
$$Y_t = (\omega_0\theta_1 + \omega_1)X_{t-1} + \phi_1 Y_{t-1} + \omega_0 W_t + U_t,$$
or in matrix form:
$$\begin{bmatrix}Y_t\\X_t\end{bmatrix} = \begin{bmatrix}\phi_1 & \omega_0\theta_1+\omega_1\\ 0 & \theta_1\end{bmatrix}\begin{bmatrix}Y_{t-1}\\X_{t-1}\end{bmatrix} + \begin{bmatrix}\omega_0 W_t + U_t\\ W_t\end{bmatrix}.$$
Formulating this VAR(1) model in reduced form
$$\begin{bmatrix}Y_t\\X_t\end{bmatrix} = \begin{bmatrix}\phi_{yy} & \phi_{yx}\\ \phi_{xy} & \phi_{xx}\end{bmatrix}\begin{bmatrix}Y_{t-1}\\X_{t-1}\end{bmatrix} + \begin{bmatrix}\epsilon_{yt}\\\epsilon_{xt}\end{bmatrix}$$
yields the following identities:
$$\phi_{yy}=\phi_1 \qquad \phi_{yx}=(\omega_0\theta_1+\omega_1) \qquad \phi_{xy}=0 \qquad \phi_{xx}=\theta_1$$
$$\sigma_y^2 = \omega_0^2\sigma_W^2 + \sigma_U^2 \qquad \sigma_x^2 = \sigma_W^2 \qquad \mathrm{cov}[\epsilon_{yt}\,\epsilon_{xt}] = \omega_0\sigma_W^2.$$
Thus, if $Y_t$ and $X_t$ are contemporaneously related, the disturbance terms $\epsilon_{yt}$ and $\epsilon_{xt}$ of the reduced form are correlated. This correlation depends on the parameter $\omega_0$ in the structural equation. In example 51 the correlation between the residuals is 0.41, which can be used to estimate the parameter $\omega_0$. In general, it is not possible to uniquely determine the parameters of the structural form from the (estimated) parameters of a VAR model in reduced form. For this purpose, suitable assumptions about the dependencies in the structural form must be made.

[117] Note that appropriate estimation of structural forms depends on the specific formulation. For example, if $Y_t$ also appeared as a regressor in the equation for $X_t$, separately estimating each equation would lead to inconsistent estimates because of the associated endogeneity (simultaneous equation bias). The same applies in the present formulation if $U_t$ and $W_t$ are correlated.
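The relation between $\omega_0$ and the residual correlation can be verified by simulation. In the sketch below (Python; all parameter values are illustrative) the reduced-form disturbances are $\epsilon_{yt}=\omega_0 W_t+U_t$ and $\epsilon_{xt}=W_t$, so their covariance should be close to $\omega_0\sigma_W^2$:

import numpy as np

rng = np.random.default_rng(0)
n, w0, w1, f1, th1 = 100_000, 0.8, 0.3, 0.5, 0.6
U, W = rng.normal(size=n), rng.normal(size=n)
Y, X = np.zeros(n), np.zeros(n)
for t in range(1, n):
    X[t] = th1 * X[t - 1] + W[t]
    Y[t] = w0 * X[t] + w1 * X[t - 1] + f1 * Y[t - 1] + U[t]
eps_y, eps_x = w0 * W + U, W            # reduced-form disturbances
print(np.cov(eps_y, eps_x)[0, 1])       # approximately w0 * var(W) = 0.8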
3.1.2 Estimating and forecasting VAR models

The joint estimation of two or more regression equations (system of equations) is beyond the scope of this text. In general, possible dependencies across equations need to be taken into account using GLS or ML. As a major advantage, VAR models in reduced form can be estimated by applying least-squares separately to each equation of the model. OLS yields consistent and asymptotically efficient estimates. None of the series in a VAR model is exogenous as defined in a regression context. A necessary condition is that the series are stationary (i.e. the stationarity conditions for AR processes have to hold), and the residuals in each equation are white-noise. If the residuals are autocorrelated, additional lags are added to the model. The number of lags can also be selected on the basis of information criteria like AIC or SC. No precautions are necessary if the residuals are correlated across equations. Since a VAR model can be viewed as a seemingly unrelated regression (SUR) with identical regressors, OLS has the same properties as GLS (see Greene, 2003, p.343).
The VAR model should only include variables with the same order of integration. If the series are integrated the VAR model is fitted to (first) differences.[118] In section 3.2 we will present a test for integration of several series that can be interpreted as a multivariate version of the DF test.
Lags with insignificant coefficients are usually not eliminated from the VAR model. This may have a negative effect on the forecasts from VAR models since (in most cases) too many parameters are estimated. This inefficiency leads to an unnecessary increase in the variance of forecasts. However, if some coefficients are restricted to zero, least-squares estimates are not efficient any more. In this case, the VAR model can be estimated by (constrained) maximum likelihood[119].
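Equation-by-equation OLS is simple to implement. A minimal sketch (Python; the function name is ours) that regresses each series on a constant and p lags of all series:

import numpy as np

def var_ols(Y, p):
    # Y: (n x k) array of stationary series
    n, k = Y.shape
    X = np.asarray([np.concatenate([[1.0]] + [Y[t - i] for i in range(1, p + 1)])
                    for t in range(p, n)])       # regressors: constant and p lags
    B, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
    return B, Y[p:] - X @ B                      # coefficients and residuals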
Figure 11: VAR(2) model for one-month (Y_1M) and five-year interest rates (Y_5Y).

Sample(adjusted): 1964:04 1993:12
Included observations: 357
t-statistics in parentheses

                 D(Y_1M)      D(Y_5Y)
D(Y_1M(-1))    -0.198355     0.014609
               (-3.41811)    (0.46413)
D(Y_1M(-2))     0.010262     0.048413
                (0.18134)    (1.57724)
D(Y_5Y(-1))     0.624043     0.063369
                (5.81580)    (1.08884)
D(Y_5Y(-2))    -0.275744    -0.129845
               (-2.45457)   (-2.13101)
C              -0.003692     0.003407
               (-0.08909)    (0.15157)

R-squared        0.116266     0.018756
Adj. R-squared   0.106224     0.007605
S.E. equation    0.783066     0.424725
S.D. dependent   0.828293     0.426349

Akaike Information Criteria  3.317573
Schwarz Criteria             3.426193

[118] For a discussion of various alternatives see Hamilton (1994), p.651.
[119] For details see Hamilton (1994), p.315.
Forecasts of VAR(1) models have the same structure as forecasts of AR(1) models. The $\tau$-step ahead forecast is given by
$$\hat Y_{t,\tau} = (I + \Phi_1 + \cdots + \Phi_1^{\tau-1})V + \Phi_1^{\tau} Y_t.$$
These forecasts are unbiased (i.e. $\mathrm E[Y_{t+\tau} - \hat Y_{t,\tau}]=0$). The mean squared errors of the forecasts are minimal if the disturbances are independent white-noise (see Lütkepohl, 1993, p.29). The covariance of $\tau$-step ahead forecast errors is given by (see Lütkepohl, 1993, p.31)
$$\sum_{i=0}^{\tau-1}\Phi_1^i\,\Sigma_\epsilon\,(\Phi_1^i)'.$$
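Both formulas translate directly into code. The sketch below (Python; function names are ours) computes the forecast by iterating the VAR(1) recursion, and the forecast error covariance as the sum above:

import numpy as np

def var1_forecast(V, Phi1, y_n, tau):
    f = np.asarray(y_n, dtype=float)
    for _ in range(tau):                 # iterate Y-hat = V + Phi1 Y-hat
        f = V + Phi1 @ f
    return f

def var1_forecast_cov(Phi1, Sigma, tau):
    P = np.linalg.matrix_power
    return sum(P(Phi1, i) @ Sigma @ P(Phi1, i).T for i in range(tau))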

Figure 12: Out-of-sample forecasts of one-month (Y_1M) and five-year (Y_5Y) interest rates using the VAR model in Figure 11 (estimation until 12/93). [time series plot of both rates, 1965-1995; the forecasts diverge along linear trends]

Example 51: We consider the monthly interest rates of US treasury bills for maturities of one month ($y_t^{1M}$, Y_1M) and five years ($y_t^{5Y}$, Y_5Y)[120]. Both series are integrated and we fit a VAR(2) model to the first differences. The VAR(2) model was selected by AIC. The significance of estimated parameters can be used to draw conclusions about the dependence structure among the series. The estimation results in Figure 11 show a feedback relationship. The one-month rate depends on the five-year rate and the five-year rate depends on the one-month rate (with a lag of two periods). However, the dependence of the one-month rate on the five-year rate is much stronger (as can be seen from $R^2$).
Figure 12 shows dynamic out-of-sample forecasts (starting in January 1994) of the two interest rates. The forecasts converge rapidly to weakly ascending and descending linear trend lines. Their slope is determined by the (insignificant) constant terms.

[120] Source: CRSP database, Government Bond file; see file us-tbill.wf1; monthly data from January 1964 to December 1993; 360 observations.
Exercise 27: Use the data in the file ccva.wf1 which is taken from Campbell et al. (2003). Fit a VAR model using all series in the file and interpret the results.
Fit a VAR model using only data from 1893 to 1981. Obtain dynamic forecasts for all series until 1997 and interpret the results.
3.2 Cointegration and error correction models

Time series models for integrated series are usually based on applying ARMA or VAR models to (first) differences. However, it was frequently argued that differencing may eliminate valuable information about the relationship among integrated series. We now consider the case that two or more integrated series are related in terms of differences and levels.

3.2.1 Cointegration

Two[121] processes $Y_t$ and $X_t$ are cointegrated of first order if
1. each process is integrated of order one[122] and
2. $Z_t = Y_t - \mu - \beta X_t$ is stationary: $Z_t\sim I(0)$.
$$Y_t = \mu + \beta X_t + Z_t \qquad (49)$$
is the cointegration regression or cointegrating equation.
Suppose there is an equilibrium relation between $Y_t$ and $X_t$. Then $Z_t$ represents the extent of disequilibrium in the system. If $Z_t$ is not stationary, it can move 'far away' from zero 'for a long time'. If $Z_t$ is stationary, $Z_t$ will 'stay close' to zero or frequently return to zero (i.e. it is mean-reverting). This is consistent with the view that both processes are controlled by a common (unobserved) stationary process. This process prevents $Y_t$ and $X_t$ from moving 'too far away' from each other.

3.2.2 Error correction model

If $Y_t$ and $X_t$ are cointegrated a vector error correction model (VEC) can be formulated:
$$\Delta Y_t = \alpha_y Z_{t-1} + \nu_y + \omega_{1y}\Delta X_{t-1} + \cdots + \phi_{1y}\Delta Y_{t-1} + \cdots + \epsilon_{yt}$$
$$\Delta X_t = \alpha_x Z_{t-1} + \nu_x + \omega_{1x}\Delta X_{t-1} + \cdots + \phi_{1x}\Delta Y_{t-1} + \cdots + \epsilon_{xt}. \qquad (50)$$
At least one of the coefficients $\alpha_y$ or $\alpha_x$ must be different from zero. The number of lagged differences in the VEC model can be determined by AIC or SC. If cointegration holds, models which do not include $Z_{t-1}$ are misspecified.
Substituting $Z_{t-1}$ in (50) by using the cointegrating equation (49) yields
$$\Delta Y_t = \alpha_y(Y_{t-1} - \mu - \beta X_{t-1}) + \nu_y + \omega_{1y}\Delta X_{t-1} + \cdots + \phi_{1y}\Delta Y_{t-1} + \cdots + \epsilon_{yt}$$
$$\Delta X_t = \alpha_x(Y_{t-1} - \mu - \beta X_{t-1}) + \nu_x + \omega_{1x}\Delta X_{t-1} + \cdots + \phi_{1x}\Delta Y_{t-1} + \cdots + \epsilon_{xt}.$$

[121] The concept of cointegration is also defined for $k$>2 series. We start to introduce the topic by considering only two time series and will gradually broaden the scope to more than two series.
[122] For simplicity, $Y_t$ and $X_t$ are assumed to be I(1) processes. This is very often the case. However, cointegration can also be defined in terms of I(d) processes.
Therefore, the structure of a VEC model corresponds to a VAR model in differences that accounts for the levels of the series using a special (linear) constraint.
A VEC model can be interpreted as follows: deviations from equilibrium, represented by $Z_t$, affect $Y_t$ and $X_t$ such that $Y_t$ and $X_t$ approach each other. This mechanism 'corrects' errors (or imbalances) in the system. Therefore $Z_{t-1}$ is also called error correction term. The degree of correction depends on the so-called speed-of-adjustment parameters $\alpha_y$ and $\alpha_x$.
Consider the simple case (simulated in the sketch after this list)
$$\Delta Y_t = \alpha_y(Y_{t-1} - \beta X_{t-1}) + \epsilon_{yt} \qquad \alpha_y<0$$
$$\Delta X_t = \epsilon_{xt},$$
which implies $\mathrm E[\Delta Y_t|I_{t-1}] = \alpha_y(Y_{t-1} - \beta X_{t-1})$ and $\mathrm E[\Delta X_t|I_{t-1}]=0$. Three cases can be distinguished:
1. If $Y_{t-1} = \beta X_{t-1}$ ($Z_t$=0) the system is in long-run equilibrium. There is no need for adjustments and $\mathrm E[\Delta Y_t|I_{t-1}]$=0.
2. If $Y_{t-1} > \beta X_{t-1}$ ($Z_t$>0) the system is not in long-run equilibrium. There is a need for a downward adjustment of $Y_t$ effected by $\mathrm E[\Delta Y_t|I_{t-1}]$<0.
3. If $Y_{t-1} < \beta X_{t-1}$ ($Z_t$<0) there is an upward adjustment of $Y_t$ since $\mathrm E[\Delta Y_t|I_{t-1}]$>0.
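These adjustment mechanics can be illustrated by simulation. The sketch below (Python; the parameter values, with $\beta$=1 and $\alpha_y$=-0.1, are purely illustrative) generates the simple case above; the resulting $Z_t = Y_t - X_t$ is mean-reverting although both series are integrated:

import numpy as np

rng = np.random.default_rng(1)
n, a_y, beta = 500, -0.1, 1.0
Y, X = np.zeros(n), np.zeros(n)
for t in range(1, n):
    X[t] = X[t - 1] + rng.normal()                                # random walk
    Y[t] = Y[t - 1] + a_y * (Y[t - 1] - beta * X[t - 1]) + rng.normal()
z = Y - beta * X                                                  # stationary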
Example 52:[123] Consider the relation between the spot price $S_t$ of a stock and its corresponding futures[124] price $F_t$. The cost of carry model states that the equilibrium relation between $S_t$ and $F_t$ is given by
$$F_t = S_t\,e^{(r-d)\tau},$$
where $r$ is the risk-free interest rate, and $d$ is the dividend yield derived from holding the stock until the future matures in $t+\tau$. Taking logarithms yields
$$\ln F_t = \ln S_t + (r-d)\tau.$$
In practice, this relation will not hold exactly. But the difference between the left and right hand side can be expected to be stationary (or even white-noise). This suggests that the logs of spot and futures prices can be described by a cointegration regression with $\mu\approx(r-d)\tau$ and $\beta\approx 1$.

[123] For details and an empirical example see Brooks et al. (2001).
[124] Futures are standardized, transferable, exchange-traded contracts that require delivery of a commodity, bond, currency, or stock index, at a specified price, on a specified future date. Unlike options, futures convey an obligation to buy.
3.2.3 Example 53: The expectation hypothesis of the term structure

The (unbiased) expectation hypothesis of the term structure of interest rates (EHT) states that investors are risk neutral, and bonds with different maturities are perfect substitutes. Accordingly, interest rate differentials cannot become too large since otherwise arbitrage opportunities would exist. In efficient markets such possibilities are quickly recognized and lead to a corresponding reduction of interest rate differentials. This is true even if the assumption of risk neutrality is dropped and liquidity premia are taken into account.[125] According to the EHT, a long-term interest rate can be expressed as a weighted average of current and expected short-term interest rates. Let $R_t(\tau)$ be the spot rate of a zero bond with maturity $\tau>1$ and $S_t=R_t(1)$ a short-term rate (e.g. the one-month rate). The EHT states that
$$R_t(\tau) = \frac{1}{\tau}\sum_{j=0}^{\tau-1}\mathrm E[S_{t+j}|I_t] + \Lambda(\tau),$$
where $\Lambda(\tau)$ is a time-invariant but maturity dependent term premium. For instance, the relation between three- and one-month interest rates is given by
$$R_t(3) = \frac{1}{3}\left(S_t + \mathrm E_t[S_{t+1}] + \mathrm E_t[S_{t+2}]\right) + \Lambda(3).$$
If we consider the spread between the long and the short rate we find
$$R_t(3) - S_t = \frac{1}{3}\left(\mathrm E_t[S_{t+1}-S_t] + \mathrm E_t[S_{t+2}-S_t]\right) + \Lambda(3).$$
Usually, interest rates are considered to be integrated processes. Thus, the terms on the right hand side are (first and higher order) differences of integrated processes and should therefore be stationary. This implies that the spread $R_t(3)-S_t$ is also stationary since both sides of the equation must have the same order of integration.
More generally, we now consider the linear combination $\beta_1 R_t(3)+\beta_2 S_t$ which can be written as (ignoring the term premium)
$$\beta_1 R_t(3) + \beta_2 S_t = (\beta_1+\beta_2)S_t + \frac{\beta_1}{3}\left(\mathrm E_t[S_{t+1}-S_t] + \mathrm E_t[S_{t+2}-S_t]\right).$$
The linear combination $\beta_1 R_t(3)+\beta_2 S_t$ will only be stationary if the non-stationary series $(\beta_1+\beta_2)S_t$ drops from the right-hand side. Thus, the right hand side will be stationary if $\beta_1+\beta_2$=0, e.g. if $\beta_1$=1 and $\beta_2$=-1. Empirically, the EHT implies that the residuals from the cointegration regression between $R_t(3)$ and $S_t$ should be stationary and $Z_t \approx R_t(3) - S_t - \Lambda(3)$.

[125] For theoretical details see Ingersoll (1987), p.389; for an empirical example see Engsted and Tanggaard (1994).
3.2.4 The Engle-Granger procedure

Engle and Granger (1987) have developed an approach to specify and estimate error correction models which is only based on least-squares regressions. The procedure consists of the following steps (see the sketch after this list):
1. Test whether each series is integrated of the same order.
2. Estimate the cointegration regression (49) and compute $z_t = y_t - c - bx_t$. In general, fitting a regression model to the levels of integrated series may lead to the so-called spurious regression problem[126]. However, if cointegration holds, the parameter estimate $b$ converges (with increasing sample size) faster to $\beta$ than usual (this is also called super-consistency). If a VAR model is fitted to the levels of integrated series a sufficient number of lags should be included, such that the residuals are white-noise. This should avoid the spurious regression problem.
3. Test whether $z_t$ is stationary. For that purpose use an ADF test without intercept since $z_t$ has zero mean:
$$\Delta z_t = \gamma z_{t-1} + \sum_{j=1}^{p} c_j\Delta z_{t-j} + e_t.$$
The t-statistic of $\gamma$ must not be compared to the usual critical values (e.g. those in Table 4 or those supplied by EViews). Since $z_t$ is an estimated rather than observed time series, the critical values in Table 5[127] must be used. These critical values also depend on $k$ (the number of series which are tested for cointegration).
If $z_t$ is stationary we conclude that $y_t$ and $x_t$ are cointegrated, and a VEC model for the cointegrated time series is estimated. If $z_t$ is integrated a VAR model using differences of $y_t$ and $x_t$ is appropriate.
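Steps 2 and 3 can be carried out with a few lines of code. The sketch below (Python with statsmodels; y and x stand for the two integrated series) runs the cointegration regression and the ADF regression on the residuals. Note that the p-value reported by adfuller does not apply to estimated residuals, so the statistic has to be compared with Table 5; the coint function implements the same residual-based test with appropriate critical values:

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint

ols = sm.OLS(y, sm.add_constant(x)).fit()      # cointegration regression
z = ols.resid
# ADF regression without intercept ("n"), p=1 lagged differences
adf_stat = adfuller(z, regression="n", maxlag=1, autolag=None)[0]
# compare adf_stat with the critical values in Table 5

t_stat, p_value, crit = coint(y, x)            # built-in Engle-Granger test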
Example 54: We illustrate the Engle-Granger procedure by using the two interest series $y_t=y_t^{1M}$ and $x_t=y_t^{5Y}$ from example 51. The assignment of the symbols $y_t$ and $x_t$ to the two time series is only used to clarify the exposition. It implies no assumptions about the direction of dependence, and usually[128] has no effect on the results. Details can be found in the file us-tbill.wf1.
Both interest rate series are assumed to be integrated, although the ADF test statistic for $y_t^{1M}$ is -2.98, which is less than the critical value -2.87 at the 5% level. The OLS estimate of the cointegration regression is given by $y_t = -0.845 + 0.92x_t + z_t$. The t-statistic of $\gamma$ in a unit-root test of $z_t$ (using $p$=1) is -4.48. No direct comparison with the values in Table 5 is possible ($n$=360, $k$=2, $p$=1). However, -4.48 is far less than the critical values in case of $n$=200 and $\alpha$=0.01, so that the unit-root hypothesis for $z_t$ can be rejected at the 1% level. We conclude that $z_t$ is stationary and there is cointegration among the two interest series.
The estimated VEC model is presented in Figure 13. The upper panel of the table shows the cointegration equation and defines $z_t$ (CointEq1): $z_t = y_t - 0.932x_t + 0.926$. This equation is estimated by maximum likelihood, and thus differs slightly from the OLS estimates mentioned above.

[126] For details see Granger and Newbold (1974).
[127] Source: Engle and Yoo (1987), p.157.
[128] For details see Hamilton (1994), p.589.

Table 5: Critical values of the ADF t-statistic for the cointegration test.

              p=0                       p=4
 k    n    0.01   0.05   0.10       0.01   0.05   0.10
      50  -4.32  -3.67  -3.28      -4.12  -3.29  -2.90
 2   100  -4.07  -3.37  -3.03      -3.73  -3.17  -2.91
     200  -4.00  -3.37  -3.02      -3.78  -3.25  -2.98
      50  -4.84  -4.11  -3.73      -4.54  -3.75  -3.36
 3   100  -4.45  -3.93  -3.59      -4.22  -3.62  -3.32
     200  -4.35  -3.78  -3.47      -4.34  -3.78  -3.51
      50  -4.94  -4.35  -4.02      -4.61  -3.98  -3.67
 4   100  -4.75  -4.22  -3.89      -4.61  -4.02  -3.71
     200  -4.70  -4.18  -3.89      -4.72  -4.13  -3.83

p is the number of lags in the ADF regression. k is the number of series. The column headers 0.01, 0.05 and 0.10 are the significance levels.

The lower panel shows the error correction model. $p$=2 was chosen based on the results of the VAR model in example 51. Both (changes in) interest rates depend significantly on the error correction term $z_{t-1}$. Thus, the changes of each time series depend on the interest rate levels, and differences between their levels, respectively. The dependencies on past interest rate changes already known from example 51 are confirmed.
The negative sign of the coefficient -0.1065 of $z_{t-1}$ in the equation for $y_t^{1M}$ can be interpreted as follows. If the five-year interest rates are much greater than the interest rates for one month, $z_{t-1}$ is negative (according to the cointegration regression $z_t = y_t^{1M} - 0.932y_t^{5Y} + 0.926$). Multiplication of this negative value with the negative coefficient -0.1065 has a positive effect (c.p.) on the expected changes in $y_t^{1M}$, and therefore leads to increasing short-term interest rates. This implies a tendency to reduce (or correct) large differences in interest rates. These results agree with the EHT. In efficient markets spreads among interest rates cannot become too large. The positive coefficient 0.041 in the equation for $y_t^{5Y}$ can be interpreted in a similar way. A negative $z_{t-1}$ leads to negative expected changes (c.p.) in $y_t^{5Y}$, and therefore leads to a decline of the long-term interest rates. In addition, these corrections depend on past changes of both interest rates. Whereas the dependence on lagged changes could be called short-term adjustment, the response to $z_{t-1}$ is a long-term adjustment effect.
Figure 14 shows out-of-sample forecasts (starting January 1994) of the two interest rate series using the VEC model from Figure 13. In contrast to forecasts based on the VAR model (see Figure 12), these forecasts do not diverge. This may be explained by the additional error correction term.

The Engle-Granger procedure has two drawbacks. First, if $k$>2 at most $(k-1)$ cointegration relations are (theoretically) possible. It is not straightforward how to test for cointegration in this case. Second, even when $k$=2 the cointegration regression between $y_t$ and $x_t$ can also be estimated in reverse using
$$x_t = c' + b'y_t + z_t'.$$

Figure 13: Cointegration regression and error correction model for one-month (Y_1M) and five-year (Y_5Y) interest rates.

Sample(adjusted): 1964:04 1993:12
Included observations: 357
t-statistics in parentheses

Cointegrating Eq:  CointEq1
Y_1M(-1)           1.000000
Y_5Y(-1)          -0.932143
                  (-9.54095)
C                  0.925997

Error Correction:   D(Y_1M)      D(Y_5Y)
CointEq1          -0.106479     0.041026
                  (-3.15443)    (2.22526)
D(Y_1M(-1))       -0.138652    -0.008395
                  (-2.29743)   (-0.25467)
D(Y_1M(-2))        0.054517     0.031362
                   (0.94613)    (0.99654)
D(Y_5Y(-1))        0.599125     0.072970
                   (5.63847)    (1.25734)
D(Y_5Y(-2))       -0.270073    -0.132030
                  (-2.43414)   (-2.17871)
C                 -0.003420     0.003302
                  (-0.08356)    (0.14772)

R-squared          0.140629     0.032406
Adj. R-squared     0.128387     0.018623
S.E. equation      0.773296     0.422361
S.D. dependent     0.828293     0.426349

Akaike Information Criteria  3.268280
Schwarz Criteria             3.420348

Figure 14: Out-of-sample forecasts of one-month (Y_1M) and five-year (Y_5Y) interest rates using the VEC model in Figure 13 (estimation until 12/93). [time series plot of both rates, 1965-1995; the forecasts do not diverge]
In principle, the formulation is arbitrary. However, since $z_t$ and $z_t'$ are not[129] identical, unit-root tests can lead to different results[130]. Engle and Granger suggest to test both variants.[131]

Exercise 28: Choose two time series which you expect to be cointegrated. Use the Engle-Granger procedure to test the series for cointegration. Depending on the outcome, fit an appropriate VAR or VEC model to the series, and interpret the results.

[129] Suppose we estimate the equation $y=b_0+bx+e$. The estimated slope is given by $b=s_{yx}/s_x^2$. If we estimate $x=c_0+cy+u$ (reverse regression) the estimate $c$ will not be equal to $1/b$: $c=s_{yx}/s_y^2$, which is different from $1/b$ except for the special case $s_y^2=s_{yx}^2/s_x^2$.
[130] Unit-root tests of $z_t$ and $z_t'$ are equivalent only asymptotically.
[131] For details see Hamilton (1994), p.589.
3.2.5 The Johansen procedure

The Johansen procedure[132] can be used to overcome the drawbacks of the Engle-Granger approach. In addition, it offers the possibility to test whether a VEC model, a VAR model in levels, or a VAR model in (first) differences is appropriate.
The Johansen approach is based on a VAR(p+1) model of $k$ (integrated) variables:
$$Y_t = V + \Phi_1 Y_{t-1} + \cdots + \Phi_{p+1}Y_{t-p-1} + \epsilon_t.$$
This model can be reformulated to obtain the following VEC representation:
$$\Delta Y_t = V + \Pi Y_{t-1} + \sum_{i=1}^{p} C_i\Delta Y_{t-i} + \epsilon_t, \qquad (51)$$
where
$$\Pi = \sum_{i=1}^{p+1}\Phi_i - I \qquad C_i = -\sum_{j=i+1}^{p+1}\Phi_j,$$
and $I$ is a $k\times k$ unit matrix. Comparing equation (51) to the ADF test regression
$$\Delta y_t = \nu + \gamma y_{t-1} + \sum_{i=1}^{p} c_i\Delta y_{t-i} + \epsilon_t$$
shows that Johansen's approach can be interpreted as a multivariate unit-root test. In the univariate case, differences of $y_t$ are regressed on the level of $y_{t-1}$. In the multivariate case, differences of the vector of variables are regressed on linear combinations of the vector of past levels (represented by $\Pi Y_{t-1}$). In the univariate case, conclusions about a unit-root of $Y_t$ are based on the null hypothesis $\gamma$=0. Analogously, in the multivariate case, conclusions are based on the properties of the matrix $\Pi$ estimated from equation (51) by maximum-likelihood.
Review 13: The rank $r(A)$ of a $m\times n$ matrix $A$ is the maximum number of linearly independent rows or columns of $A$, and $r(A)\le\min\{m,n\}$.
A scalar $\lambda$ is called eigenvalue of $A$ if the equation $(A-\lambda I)\omega=0$ can be solved for the non-zero eigenvector $\omega$. The solution will be non-trivial if $\det(A-\lambda I)$=0 (this is the characteristic equation). For example, for a $2\times 2$ matrix, the characteristic equation is given by $\lambda^2-(a_{11}+a_{22})\lambda+(a_{11}a_{22}-a_{12}a_{21})$=0. Alternatively, $\lambda^2-\mathrm{tr}(A)\lambda+|A|$=0 with solution $0.5[\mathrm{tr}(A)\pm\sqrt{\mathrm{tr}(A)^2-4|A|}]$.
The maximum number of eigenvalues of a $n\times n$ matrix is $n$. The rank of $A$ is the number of non-zero eigenvalues. Special cases are: The eigenvalues of a unit matrix are all equal to 1 (the matrix has full rank). If the unit matrix is multiplied by $a$ all eigenvalues are equal to $a$. A $n\times n$ matrix with identical elements $c$ has only one non-zero eigenvalue equal to $cn$ (its rank is one). A null matrix has rank zero.

[132] For details see Enders (2004), p.362.
The formal basis for conclusions derived from the Johansen test is provided by Granger's representation theorem: If the rank $r$ of the matrix $\Pi$ is less than $k$, there exist $k\times r$ matrices $\alpha$ and $\beta$ (each with rank $r$) such that $\Pi=\alpha\beta'$ and such that $Z_t=\beta'Y_t$ is stationary. The rank $r$ is equal to the number of cointegration relations (the so-called cointegration rank) and every column of $\beta$ is a cointegration vector.
The following cases can be distinguished:
1. If $\Pi$ has rank zero there is no cointegration. This corresponds to the case $\gamma$=0 in the univariate ADF test. In this case, all elements of $Y_t$ are unit-root series and there exists no stationary linear combination of these elements. Therefore, a VAR model in first differences (with no error correction term) is appropriate.
2. If $\Pi$ has full rank there is no cointegration either. In this case, all elements of $Y_t$ are stationary, which corresponds to the situation $\gamma\ne 0$ in the univariate ADF test. Therefore, a VAR model in levels (with no error correction term) is appropriate.
3. Cointegration holds if $\Pi$ has rank $0<r<k$. In this case a VEC model is appropriate.
If $k$=2 and $r$=1 the decomposition of $\Pi$ can be written as
$$\Pi = \alpha\beta' = \begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix}(\beta_1\ \beta_2) = \begin{pmatrix}\alpha_1\beta_1 & \alpha_1\beta_2\\ \alpha_2\beta_1 & \alpha_2\beta_2\end{pmatrix}.$$
This implies the error correction model
$$\Delta Y_{1t} = \alpha_1(\beta_1 Y_{1,t-1} + \beta_2 Y_{2,t-1}) + \epsilon_{1t}$$
$$\Delta Y_{2t} = \alpha_2(\beta_1 Y_{1,t-1} + \beta_2 Y_{2,t-1}) + \epsilon_{2t}.$$
This decomposition is not feasible if the rank is full (i.e. $|\Pi|\ne 0$). If the rank is not full (i.e. $|\Pi|$=0) the eigenvalues are given by $0.5[\mathrm{tr}(\Pi)\pm\mathrm{tr}(\Pi)]$. Thus, one eigenvalue will be zero and the second eigenvalue is the trace of $\Pi$. Cointegration obtains if the trace is different from zero. For the rank to be zero (i.e. no cointegration) $\mathrm{tr}(\Pi)$ must be zero. This holds if $\alpha_1\beta_1=-\alpha_2\beta_2$, i.e. $\alpha$ and $\beta$ are orthogonal.
Note that the matrices $\alpha$ and $\beta$ are not unique. If $Z_t=\beta'Y_t$ is stationary, then $c\beta'Y_t$ will also be stationary for any nonzero scalar $c$. In general, any linear combination of the cointegrating relations is also a cointegrating relation. This non-uniqueness typically leads to normalizations of $\beta$ that make the interpretation of $Z_t$ easier (see example 58 below).
Example 55: We use the results from example 54 (see Figure 13) and omit the constants for simplicity. $\Pi$ can be written as
$$\Pi = \begin{pmatrix}-0.107\\ 0.041\end{pmatrix}(1\ \ -0.932) = \begin{pmatrix}-0.107 & 0.0997\\ 0.041 & -0.0382\end{pmatrix}.$$
The non-zero eigenvalue is -0.145. Simulated sample paths of the cointegrated series can be found in the file vec.xls on the sheet rank=1. The sheet rank=0 illustrates the case $r$=0 where $\alpha_1=-\alpha_2\beta_2$=0.0382 is used (so that $\mathrm{tr}(\Pi)$=0). The simulated paths of both cases are based on the same disturbances. Comparisons clearly show that the paths of the two random walks typically deviate from each other more strongly and for longer periods of time than the two cointegrated series.
A different normalization can be obtained if $\beta$ is divided by -0.932 and $\alpha$ is multiplied by -0.932. $\Pi$ remains unchanged by this linear transformation. The new normalization $\beta=(-1.075\ \ 1)'$ implies that $z_t\approx y_t^{5Y}-y_t^{1M}$, which corresponds to the more frequently used definition of an interest rate spread.

The Johansen test involves estimating[133] the VEC model (51) and testing how many eigenvalues of the estimated matrix $\Pi$ are significant. Two different types of tests are available. Their critical values are tabulated and depend on $p$ -- the number of lags in the VEC model -- and on assumptions about constant terms. To determine the order $p$ of the VEC model, VAR models with increasing order are fitted to the levels of the series. $p$ is chosen such that a VAR(p+1) model fitted to the levels has minimum AIC or SC. Setting $p$ larger than necessary is less harmful than choosing a value of $p$ that is too small. If a level VAR(1) has minimum AIC or SC (i.e. $p$=0) this may indicate that the series are stationary. In this case the Johansen test can be carried out using $p$=1 to confirm this (preliminary) evidence.
The following five assumptions about constant terms and trends in the cointegrating equation (49) and in the error correction model (50) can be distinguished:[134]
1. There are no constant terms in the cointegrating equation and the VEC model: $\mu=\nu_y=\nu_x$=0.[135]
2. The cointegrating equation has a constant term $\mu\ne 0$, but the VEC model does not have constant terms: $\nu_y$, $\nu_x$=0.[136]
3. The cointegrating equation and the VEC model have constant terms: $\mu$, $\nu_y$, $\nu_x\ne 0$.[137] $\nu_y$, $\nu_x\ne 0$ is equivalent to assuming a 'linear trend in the data' because a constant term in the VEC model for $\Delta Y_t$ corresponds to a drift in the levels $Y_t$.
4. The cointegrating equation has a constant and a linear trend ($Y_t=\mu+\delta t+\beta X_t+Z_t$). This case accounts for the possibility that the imbalance between $Y_t$ and $X_t$ may linearly increase or decrease. Accordingly, the difference in the levels need not necessarily approach zero or $\mu$, but may change in a deterministic way. The VEC model has constant terms: $\nu_y$, $\nu_x\ne 0$.[138]
5. The cointegrating equation has a constant term $\mu\ne 0$ and a linear trend. The VEC model has constants ($\nu_y$, $\nu_x\ne 0$) and a linear trend. The presence of a linear trend in addition to the drift corresponds to a quadratic trend in the level of the series.[139]
The conclusions about cointegration will usually depend on the assumptions about constant terms and trends. This choice may be supported by inspecting graphs of the series or by economic reasoning (for instance, a quadratic trend in interest rates may be excluded a priori). If it is difficult to decide which assumption is most reasonable, the Johansen test can be carried out under all five assumptions. The results can be used to select an assumption that is well supported by the data.

[133] For details about the maximum likelihood estimation of VEC models see Hamilton (1994), p.635.
[134] We only consider the simplest case of two series.
[135] EViews: VAR assumes no deterministic trend in data: No intercept or trend in CE or test VAR.
[136] EViews: Assume no deterministic trend in data: intercept (no trend) in CE - no intercept in VAR.
[137] EViews: Allow for linear deterministic trend in data: Intercept (no trend) in CE and test VAR.
[138] EViews: Allow for linear deterministic trend in data: Intercept and trend in CE - no trend in VAR.
[139] EViews: Allow for quadratic deterministic trend in data: Intercept and trend in CE - linear trend in VAR.
Figure 15: Johansen test for cointegration among $y_t^{1M}$ and $y_t^{5Y}$.

Sample: 1964:01 1993:12
Included observations: 357
Test assumption: No deterministic trend in the data
Series: Y_1M Y_5Y
Lags interval: 1 to 2

             Likelihood   5 Percent        1 Percent        Hypothesized
Eigenvalue   Ratio        Critical Value   Critical Value   No. of CE(s)
0.069316     29.35324     19.96            24.60            None **
0.010333      3.708043     9.24            12.97            At most 1

*(**) denotes rejection of the hypothesis at 5% (1%) significance level
L.R. test indicates 1 cointegrating equation(s) at 5% significance level

Figure 16: Summary of the Johansen cointegration test among $y_t^{1M}$ and $y_t^{5Y}$ under different assumptions.

Sample: 1964:01 1993:12
Included observations: 358
Series: Y_1M Y_5Y
Lags interval: 1 to 1

Data Trend:   None           None        Linear      Linear      Quadratic
Rank or       No Intercept   Intercept   Intercept   Intercept   Intercept
No. of CEs    No Trend       No Trend    No Trend    Trend       Trend

Akaike Information Criteria by Model and Rank
0             3.306984       3.306984    3.318016    3.318016    3.324710
1             3.253165       3.255658    3.261239    3.256909    3.259051
2             3.274557       3.270444    3.270444    3.271651    3.271651

Schwarz Criteria by Model and Rank
0             3.350342       3.350342    3.383053    3.383053    3.411426
1             3.339881       3.353214    3.369634    3.376144    3.389125
2             3.404631       3.422196    3.422196    3.445082    3.445082

L.R. Test:    Rank = 1       Rank = 1    Rank = 2    Rank = 1    Rank = 1
Example 56: Fitting VAR models to the levels of $y_t^{1M}$ and $y_t^{5Y}$ indicates that $p$=1 should be used to estimate the VEC model for the Johansen test. However, we choose $p$=2 to obtain results that are comparable to example 54. Below we will obtain test results using $p$=1. Figure 15 shows the results of the test. The assumption No deterministic trend in the data was used because it appears most plausible in economic terms, and is supported by the results obtained in example 54. EViews provides an interpretation of the test results: L.R. test indicates 1 cointegrating equation(s) at 5% significance level. The null hypothesis 'no cointegration' (None) is rejected at the 1% level. The hypothesis of at most one cointegration relation cannot be rejected. This confirms the conclusion drawn in example 54 that cointegration among $y_t^{1M}$ and $y_t^{5Y}$ exists.
Figure 16 contains a summary of the results for various assumptions and $p$=1. The last line indicates which rank can be concluded on the basis of the likelihood-ratio test for each assumption, using a 5% level. The conclusion $r$=1 is drawn for all assumptions, except the third.
In addition, AIC and SC for every possible rank and every assumption are provided. Note that the specified rank in the row L.R. Test is based on the estimated eigenvalues. The rank is not determined on the basis of AIC or SC, and therefore need not correspond to these criteria (e.g., under assumption 2, SC points at $r$=0).
For a given rank, the values in a row can be compared to find out which assumption about the data is most plausible. Since the alternatives within a line are nested, the precondition for a selection on the basis of AIC and SC is met. If conclusions about the cointegration rank are not unique, and/or no assumption about constant terms and trends is particularly justified, AIC and SC may be used heuristically in order to search for a global minimum across assumptions and ranks. As it turns out both criteria agree in pointing at assumption 1. This corresponds to the result that the intercept terms in the VEC model are not significant (see Figure 13). Therefore, assuming a drift in interest rates is not compatible with the data and could hardly be justified using economic reasoning.

Exercise 29: Choose two time series which you expect to be cointegrated. Use the Johansen procedure to test the series for cointegration. Depending on the outcome of the test, fit an appropriate VAR or VEC model to the series, and interpret the results.
3.2.6 Cointegration among more than two series

Example 57:[140] The purchasing power parity (PPP) states that the currencies of two countries are in equilibrium when their purchasing power is the same in each country. In the long run the exchange rate should equal the ratio of the two countries' price levels. There may be short-term deviations from this relation which should disappear rather quickly. According to the theory, the real exchange rate is given by
$$Q_t = \frac{F_t P_t^f}{P_t^d},$$
where $F_t$ is the nominal exchange rate in domestic currency per unit of foreign currency, $P_t^d$ is the domestic price level, and $P_t^f$ is the foreign price level. Taking logarithms yields the linear relation
$$\ln F_t + \ln P_t^f - \ln P_t^d = \ln Q_t.$$
The PPP holds if the logs of $F_t$, $P_t^f$ and $P_t^d$ are cointegrated with cointegration vector $\beta=(1\ 1\ -1)'$, and the log of $Q_t$ is stationary.
Example 58: Applying the EHT to more than two interest rates implies that all spreads between long- and short-term interest rates ($R_t(\tau_1)-S_t$, $R_t(\tau_2)-S_t$, etc.) should be stationary. In a VEC model with $k$ interest rates this implies $k-1$ cointegration relations. For instance, if $k$=4 and $Y_t=(S_t, R_t(\tau_1), R_t(\tau_2), R_t(\tau_3))'$ the $k\times(k-1)$ cointegration matrix is given by
$$\beta = \begin{pmatrix}1&1&1\\ -1&0&0\\ 0&-1&0\\ 0&0&-1\end{pmatrix}. \qquad (52)$$
Extending example 51, we add the one-year interest rate ($y_t^{1Y}$, Y_1Y) to the one-month and five-year rates. Fitting VAR models to the levels indicates that lagged differences of order one are sufficient. The results from the Johansen test clearly indicate the presence of two cointegration relations (see file us-tbill.wf1). The upper panel of Figure 17 shows the so-called triangular representation (see Hamilton, 1994, p.576) of the two cointegration vectors used by EViews to identify $\beta$. Since any linear combination of the cointegrating relations is also a cointegrating relation, this representation can be transformed to obtain the structure of $\beta$ in equation (52). For simplicity, we set the coefficients in the row of Y_5Y(-1) in Figure 17 equal to -1, and ignore the constants. The representation in Figure 17 implies that the spreads $y_t^{1M}-y_t^{5Y}$ and $y_t^{1Y}-y_t^{5Y}$ are stationary. Using a suitable transformation matrix we obtain
$$\begin{pmatrix}-0.5&-0.5&0.5\\ 1&0&0\\ 0&1&0\end{pmatrix}\begin{pmatrix}1&0\\ 0&1\\ -1&-1\end{pmatrix} = \begin{pmatrix}-1&-1\\ 1&0\\ 0&1\end{pmatrix}.$$
The transformed matrix now implies that the spreads $y_t^{1Y}-y_t^{1M}$ and $y_t^{5Y}-y_t^{1M}$ are stationary. The lower panel in Figure 17 shows significant speed-of-adjustment coefficients in all cases. The effects of lagged differences are clearly less important.

[140] For empirical examples see Hamilton (1994), p.582 or Chen (1995).

Figure 17: VEC model for one-month, and one- and five-year interest rates.

Sample(adjusted): 1964:03 1993:12
Included observations: 358
t-statistics in parentheses

Cointegrating Eq:  CointEq1     CointEq2
Y_1M(-1)           1.000000     0.000000
Y_1Y(-1)           0.000000     1.000000
Y_5Y(-1)          -0.868984    -0.965373
                  (-8.57599)   (-12.0787)
C                  0.522425     0.332333
                   (0.63841)    (0.51487)

Error Correction:   D(Y_1M)      D(Y_1Y)      D(Y_5Y)
CointEq1          -0.314107     0.122230     0.127018
                  (-4.45035)    (2.14357)    (3.21582)
CointEq2           0.327518    -0.211224    -0.141137
                   (3.30355)   (-2.63713)   (-2.54388)
D(Y_1M(-1))       -0.203101    -0.068047    -0.081964
                  (-2.82280)   (-1.17063)   (-2.03563)
D(Y_1Y(-1))        0.375950     0.149988     0.130965
                   (2.32363)    (1.14745)    (1.44644)
D(Y_5Y(-1))        0.093839     0.087987    -0.007274
                   (0.48405)    (0.56177)   (-0.06704)

R-squared          0.183793     0.033800     0.035264
Adj. R-squared     0.174545     0.022851     0.024332
S.E. equation      0.751489     0.607129     0.420547
S.D. dependent     0.827134     0.614187     0.425759

Akaike Information Criteria  3.226556
Schwarz Criteria             3.475864

Exercise 30: Choose three time series which you expect to be cointegrated. Use the Johansen procedure to test the series for cointegration. Depending on the outcome, fit an appropriate VAR or VEC model to the series, and interpret the results.
3.3 State space modeling and the Kalman filter[141]

3.3.1 The state space formulation

The objective of state-space modeling is to estimate (the parameters of) an unobservable vector process $\alpha_t$ ($k\times 1$) on the basis of an observable process $y_t$ (which may, in general, be a vector process, too). Two equations are distinguished. For a single observation $t$ the measurement, signal or observation equation is given by
$$y_t = c_t + z_t'\alpha_t + \epsilon_t,$$
and can be viewed as a regression model with (potentially) time-varying coefficients $\alpha_t$ and $c_t$. $z_t$ is the $k\times 1$ vector of regressors and $\epsilon_t$ is the residual. $\alpha_t$ is assumed to be a first-order (vector) autoregression as defined in the system or transition equation
$$\alpha_t = d_t + T_t\alpha_{t-1} + \eta_t.$$
The disturbances $\epsilon_t$ and $\eta_t$ are assumed to be serially independent with mean zero and covariance
$$\mathrm V\begin{bmatrix}\epsilon_t\\ \eta_t\end{bmatrix} = \begin{bmatrix}h & G\\ G' & Q\end{bmatrix}.$$
The state space formulation can be used for a variety of models (see Harvey (1989) or Wang (2003)). The main areas of application are regressions with time-varying coefficients and the extraction of unobserved components (or latent, underlying factors) from observed series. Harvey (1984) has proposed so-called structural models to extract (or estimate) trend and seasonal components from a time series. One example is a model with (unobservable) level $\mu_t$ and trend $\beta_t$ defined in the system and measurement equations as follows:
$$\begin{bmatrix}\mu_t\\ \beta_t\end{bmatrix} = \begin{bmatrix}1&1\\ 0&1\end{bmatrix}\begin{bmatrix}\mu_{t-1}\\ \beta_{t-1}\end{bmatrix} + \begin{bmatrix}u_t\\ v_t\end{bmatrix} \qquad y_t = [1\ \ 0]\begin{bmatrix}\mu_t\\ \beta_t\end{bmatrix} + \epsilon_t. \qquad (53)$$
This model can be viewed as a random walk with time-varying drift $\beta_t$. If $\sigma_v^2$=0 the drift is constant.
The stochastic volatility model is another model that can be formulated in state space form. Volatility is unobservable and is treated as the state variable. We define $h_t=\ln\sigma_t^2$ with transition equation
$$h_t = d + T h_{t-1} + \eta_t.$$
The observed returns are defined as $y_t=\sigma_t\zeta_t$ where $\zeta_t\sim\mathrm N(0,1)$. If we define $g_t=\ln y_t^2$ and $\kappa_t=\ln\zeta_t^2$ the observation equation can be written as
$$g_t = h_t + \kappa_t.$$

[141] For a more comprehensive treatment of this topic see Harvey (1984, 1989), Hamilton (1994), chapter 13, or Wang (2003), chapter 7.
3.3.2 The Kalman filter

The Kalman filter is a recursive procedure to estimate $\alpha_t$. Assume for the time being that all vectors and matrices except the state vector are known. The recursion proceeds in two steps. In the prediction step $\alpha_t$ is estimated using the available information in $t-1$. This estimate $a_{t|t-1}$ is used to obtain the prediction $y_{t|t-1}$ for the observable process $y_t$. In the updating step the actual observation $y_t$ is compared to $y_{t|t-1}$. Based on the prediction error $y_t - y_{t|t-1}$ the original estimate of the state vector is updated to obtain the (final) estimate $a_t$.
The conditional expectation of $\alpha_t$ is given by
$$a_{t|t-1} = \mathrm E_{t-1}[\alpha_t] = d_t + T_t a_{t-1},$$
and the covariance of the prediction error is
$$P_{t|t-1} = \mathrm E_{t-1}[(\alpha_t - a_{t|t-1})(\alpha_t - a_{t|t-1})'] = T_t P_{t-1} T_t' + Q.$$
Given the estimate $a_{t|t-1}$ for $\alpha_t$ we can estimate the conditional mean of $y_t$ from
$$y_{t|t-1} = \mathrm E_{t-1}[y_t] = c_t + z_t' a_{t|t-1}.$$
The prediction error $e_t = y_t - y_{t|t-1}$ is used in the updating equations
$$a_t = a_{t|t-1} + P_{t|t-1} z_t F_t^{-1} e_t$$
$$P_t = P_{t|t-1} - P_{t|t-1} z_t F_t^{-1} z_t' P_{t|t-1}.$$
$F_t$ is the variance of the prediction error
$$F_t = z_t' P_{t|t-1} z_t + h,$$
which (via the Kalman gain $P_{t|t-1}z_t F_t^{-1}$) determines the correction of $a_{t|t-1}$ and $P_{t|t-1}$.
The application of the Kalman filter requires to specify starting values $a_0$ and $P_0$. In addition $c_t$, $z_t$, $d_t$, $T_t$, $h$, $G$ and $Q$ need to be fixed or estimated from a sample. In general they may depend on further parameters to be estimated. Given a sample of $n$ observations and assuming that $\epsilon_t$ and $\eta_t$ are multivariate normal the log-likelihood is given by
$$\log L = -\frac{n}{2}\ln 2\pi - \frac12\sum_{t=1}^{n}\ln|F_t| - \frac12\sum_{t=1}^{n} e_t' F_t^{-1} e_t. \qquad (54)$$
The initial state vector $\alpha_0$ can also be estimated or set to 'reasonable' values. The diagonal elements of the initial covariance matrix $P_0$ are usually set to large values (e.g. $10^4$), depending on the accuracy of prior information about $\alpha_0$.
The stochastic volatility model cannot be estimated by ML using a normal assumption. Harvey et al. (1994) and Ruiz (1994) have proposed a QML approach for this purpose.
Example 59: Estimating a time-varying beta-factor excluding a constant term is a very simple application of the Kalman filter (see Bos and Newbold (1984) for a more comprehensive study). The system and observation equations are given by
$$\beta_t = \beta_{t-1} + \eta_t \qquad x_{it} = \beta_t x_{mt} + \epsilon_t.$$
In other words we assume that the beta-factor evolves like a random walk without drift. Details of the Kalman filter recursion and ML estimation can be found in the file kalman.xls. Note that the final, updated estimate of the state vector is equal to the LS estimate using the entire sample.

3.3.3 Example 60: The Cox-Ingersoll-Ross model of the term structure

In the $K$-factor Cox-Ingersoll-Ross (CIR) term structure model (see Cox et al., 1985) the instantaneous nominal interest rate $i_t$ is assumed to be the sum of $K$ state variables (or factors) $X_{t,j}$:
$$i_t = \sum_{j=1}^{K} X_{t,j}. \qquad (55)$$
The factors $X_{t,j}$ are assumed to be independently generated by a square-root process
$$dX_{t,j} = \kappa_j(\theta_j - X_{t,j})dt + \sigma_j\sqrt{X_{t,j}}\,dZ_{t,j} \qquad (j=1,\ldots,K),$$
where $Z_{t,j}$ are independent Wiener processes, $\theta_j$ are the long-term means of $X_{t,j}$, and $\kappa_j$ are their mean reversion parameters. The volatility parameters $\sigma_j$ determine the magnitude of changes in $X_{t,j}$.
The price of a pure discount bond with face value 1 maturing at time $t+T$ is
$$P_t(T) = \prod_{j=1}^{K} A_j(T)\,\exp\left(-\sum_{j=1}^{K} B_j(T)X_{t,j}\right),$$
where
$$A_j(T) = \left(\frac{2\gamma_{j,1}\exp(\gamma_{j,2}T/2)}{\gamma_{j,4}}\right)^{\gamma_{j,3}}, \qquad (56)$$
$$B_j(T) = \frac{2(\exp(\gamma_{j,1}T)-1)}{\gamma_{j,4}}, \qquad (57)$$
$$\gamma_{j,1} = \sqrt{(\kappa_j+\lambda_j)^2 + 2\sigma_j^2} \quad \gamma_{j,2} = \kappa_j+\lambda_j+\gamma_{j,1} \quad \gamma_{j,3} = 2\kappa_j\theta_j/\sigma_j^2,$$
$$\gamma_{j,4} = 2\gamma_{j,1} + \gamma_{j,2}(\exp(\gamma_{j,1}T)-1).$$
The parameters $\lambda_j$ are negatively related to the risk premium.
The yield to maturity at time $t$ of a pure discount bond which matures at time $t+T$ is defined as
$$Y_t(T) = -\frac{\log P_t(T)}{T} = \sum_{j=1}^{K}\left(-\frac{\log A_j(T)}{T} + \frac{B_j(T)}{T}X_{t,j}\right), \qquad (58)$$
which is affine in the state-variables $X_{t,j}$.
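Equations (56)-(58) are easy to code. The following sketch (Python; function and parameter names are ours) computes the factor loadings of one CIR factor and the resulting yield:

import numpy as np

def cir_loadings(T, kappa, theta, sigma, lam):
    g1 = np.sqrt((kappa + lam)**2 + 2 * sigma**2)
    g2 = kappa + lam + g1
    g3 = 2 * kappa * theta / sigma**2
    g4 = 2 * g1 + g2 * (np.exp(g1 * T) - 1)
    A = (2 * g1 * np.exp(g2 * T / 2) / g4)**g3       # equation (56)
    B = 2 * (np.exp(g1 * T) - 1) / g4                # equation (57)
    return A, B

def cir_yield(T, X, params):
    # X: K factor values; params: one (kappa, theta, sigma, lam) tuple per factor
    y = 0.0
    for x, (kappa, theta, sigma, lam) in zip(X, params):
        A, B = cir_loadings(T, kappa, theta, sigma, lam)
        y += -np.log(A) / T + B / T * x              # equation (58)
    return y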


To estimate parameters and to extract the unobservable state variables from yields ob-
served at discrete time intervals we use a state-space formulation of the CIR model. We
de ne the state-vector xt=(Xt;1; : : : ; Xt;K )0. The exact transition density P(xtjxt 1) for
the CIR-model is the product of K non-central 2-densities. A quasi-maximum-likelihood
(QML) estimation of the model parameters can be carried out by substituting the exact
transition density by a normal density:
xt jxt 1  N(t ; Qt ):

t and Qt are determined in such a way that the rst two moments of the approximate
normal and the exact transition density are equal. The elements of the K -dimensional
vector t are de ned as
t;j = j [1 exp( j t)] + exp( j t)Xt 1;j ;

where t is a discrete time interval. Qt is a K K diagonal matrix with elements


Qt;j = j2 1 exp( j t)  j 

j 2 [1 exp( j t)] + exp( j t)Xt 1;j :


Let yt=(Yt;1; : : : ; Yt;m)0 be the m-dimensional vector of yields observed at time t. The
observation density P(ytjxt) is based on the linear relation (58) between observed yields
and the state variables. The measurement equation for observed yields is:
yt = at + bt xt + t t  NID(0; H ) (t = 1; : : : ; n);
where n is the number of observations, at is a m1 vector derived from (56) and bt is a
mK matrix derived from (57):

at =
K
X log Aj (Tt;i) (i = 1; : : : ; m);
j =1 Tt;i

Bj (Tt;i )
bt = (i = 1; : : : ; n); (j = 1; : : : ; K ):
Tt;i
Tt is a m1 vector of maturities associated with the vector of yields. H is the variance-
covariance matrix of t with constant dimension mm. It is assumed to be a diagonal
matrix but each diagonal element hi (i=1,: : :,m) may be di erent such that the variance
of errors may depend on maturity.
The Kalman filter recursion consists of the following equations (the transition applied element-wise to the factors):
$$x_{t|t-1} = \theta[1-\exp(-\kappa\Delta t)] + \exp(-\kappa\Delta t)\,x_{t-1|t-1}$$
$$\hat y_t = a_t + b_t x_{t|t-1}.$$
The Kalman filter requires initial values for $t$=0 for the factors and their variance-covariance matrix. We set the initial values for $X_{t,j}$ and $P_t$ equal to their unconditional moments: $X_{0,j}=\theta_j$ and the diagonal elements of $P_0$ are $0.5\,\theta_j\sigma_j^2/\kappa_j$. The initial values for the parameters $\{\kappa_j,\theta_j,\sigma_j,\lambda_j,h_i\}$ can be based on random samples of the parameter vector. Further details and results from an empirical example can be found in Geyer and Pichler (1999).
Bibliography 163
Bibliography
Albright, S. C., Winston, W., and Zappe, C. J. (2002). Managerial Statistics. Duxbury.
Baxter, M. and Rennie, A. (1996). Financial Calculus. Cambridge University Press.
Blattberg, R. and Gonedes, N. (1974). A comparison of the stable and Student distribu-
tions as statistical models for stock prices. Journal of Business, 47:244{280.
Bollerslev, T., Chou, R., and Kroner, K. F. (1992). ARCH modeling in nance. Journal
of Econometrics, 52:5{59.
Bos, T. and Newbold, P. (1984). An empirical investigation of the possibility of stochastic
systematic risk in the market model. Journal of Business, 57:35{41.
Box, G. and Jenkins, G. (1976). Time Series Analysis Forecasting and Control. Holden-
Day, revised edition.
Brooks, C., Rew, A., and Ritson, S. (2001). A trading strategy based on the lead-lag
relationship between the spot index and futures contract for the FTSE 100. International
Journal of Forecasting, 17:31{44.
Campbell, J. Y., Chan, Y. L., and Viceira, L. M. (2003). A multivariate model of strategic
asset allocation. Journal of Financial Economics, 67:41{80.
Campbell, J. Y., Lo, A. W., and MacKinlay, A. C. (1997). The Econometrics of Financial
Markets. Princton University Press.
Chan, K., Karoly, G. A., Longsta , F. A., and Sanders, A. B. (1992). An empirical
comparison of alternative models of the short-term interest rate. Journal of Finance,
47:1209{1227.
Chat eld, C. (1989). The Analysis of Time Series. Chapman and Hall, 4th edition.
Chen, B. (1995). Long-run purchasing power parity: Evidence from some European mon-
etary system countries. Applied Economics, 27:377{383.
Chen, N.-F., Roll, R., and Ross, S. A. (1986). Economic forces and the stock market.
Journal of Business, 59:383{403.
Cochrane, J. H. (2001). Asset Pricing. Princeton University Press.
Coen, P., Gomme, F., and Kendall, M. G. (1969). Lagged relationships in economic
forecasting. Journal of the Royal Statistical Society Series A, 132:133{152.
Cox, J., Ingersoll, J. E., and Ross, S. A. (1985). A theory of the term structure of interest
rates. Econometrica, 53:385{407.
Dhillon, U., Shilling, J., and Sirmans, C. (1987). Choosing between xed and adjustable
rate mortgages. Journal of Money, Credit and Banking, 19:260{267.
Enders, W. (2004). Applied Econometric Time Series. Wiley, 2nd edition.
Engel, C. (1996). The forward discount anomaly and the risk premium: A survey of recent
evidence. Journal of Empirical Finance, 3:123{238.
Bibliography 164
Engle, R. F. and Granger, C. W. (1987). Co-integration and error correction: representation, estimation, and testing. Econometrica, 55:251–276.
Engle, R. F. and Yoo, B. (1987). Forecasting and testing in co-integrated systems. Journal of Econometrics, 35:143–159.
Engsted, T. and Tanggaard, C. (1994). Cointegration and the US term structure. Journal of Banking and Finance, 18:167–181.
Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47:427–465.
Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy, 81:607–636.
Fielitz, B. and Rozelle, J. (1983). Stable distributions and the mixtures of distributions hypotheses for common stock returns. Journal of the American Statistical Association, 78:28–36.
Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley.
Geyer, A. and Pichler, S. (1999). A state-space approach to estimate and test multifactor Cox-Ingersoll-Ross models of the term structure. Journal of Financial Research, 22:107–130.
Gourieroux, C. and Jasiak, J. (2001). Financial Econometrics. Princeton University Press.
Granger, C. W. and Newbold, P. (1971). Some comments on a paper of Coen, Gomme and Kendall. Journal of the Royal Statistical Society Series A, 134:229–240.
Granger, C. W. and Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2:111–120.
Greene, W. H. (2000). Econometric Analysis. Prentice Hall, 4th edition.
Greene, W. H. (2003). Econometric Analysis. Prentice Hall, 5th edition.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50:1029–1054.
Hansen, L. P. and Singleton, K. J. (1996). Efficient estimation of linear asset-pricing models with moving average errors. Journal of Business and Economic Statistics, 14:53–68.
Harvey, A. C. (1984). A unified view of statistical forecasting procedures. Journal of Forecasting, 3:245–275.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Harvey, A. C., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. Review of Economic Studies, 61:247–264.
Hastings, N. and Peacock, J. (1975). Statistical Distributions. Butterworth.
Hayashi, F. (2000). Econometrics. Princeton University Press.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47:153–161.
Ingersoll, J. E. (1987). Theory of Financial Decision Making. Rowman & Littlefield.
Jarrow, R. A. and Rudd, A. (1983). Option Pricing. Dow Jones-Irwin.
Kiefer, N. (1988). Economic duration data and hazard functions. Journal of Economic Literature, 26:646–679.
Kiel, K. A. and McClain, K. T. (1995). House prices during siting decision stages: The case of an incinerator from rumor through operation. Journal of Environmental Economics and Management, 28:241–255.
Kirby, C. (1997). Measuring the predictable variation in stock and bond returns. Review of Financial Studies, 10:579–630.
Kmenta, J. (1971). Elements of Econometrics. Macmillan.
Kon, S. J. (1984). Models of stock returns – a comparison. Journal of Finance, 39:147–165.
Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., and Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 52:159–178.
Lamoureux, C. and Lastrapes, W. (1990). Heteroscedasticity in stock return data: volume versus GARCH effects. Journal of Finance, 45:221–229.
Levitt, S. D. (1997). Using electoral cycles in police hiring to estimate the effect of police on crime. American Economic Review, 87(4):270–290.
Lütkepohl, H. (1993). Introduction to Multiple Time Series Analysis. Springer.
Mills, T. C. (1993). The Econometric Modelling of Financial Time Series. Cambridge University Press.
Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments. Journal of Economic Perspectives, 20(4):111–132.
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55:703–708.
Papoulis, A. (1984). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 2nd edition.
Roberts, M. R. and Whited, T. M. (2012). Endogeneity in Empirical Corporate Finance, volume 2A of Handbook of the Economics of Finance. Elsevier.
Roll, R. and Ross, S. A. (1980). An empirical investigation of the arbitrage pricing theory. Journal of Finance, 35:1073–1103.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13:341–360.
Ruiz, E. (1994). Quasi-maximum likelihood estimation of stochastic volatility models. Journal of Econometrics, 63:289–306.
SAS (1995). Stock Market Analysis Using the SAS System. SAS Institute.
Shanken, J. (1992). On the estimation of beta-pricing models. Review of Financial Studies, 5:1–33.
Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3):557–586.
Stambaugh, R. F. (1999). Predictive regressions. Journal of Financial Economics, 54:375–421.
Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics, 20(4):518–529.
Studenmund, A. (2001). Using Econometrics. Addison Wesley Longman.
Thomas, R. (1997). Modern Econometrics. Addison Wesley.
Tsay, R. S. (2002). Analysis of Financial Time Series. Wiley.
Valkanov, R. (2003). Long-horizon regressions: Theoretical results and applications. Journal of Financial Economics, 68:201–232.
Verbeek, M. (2004). Modern Econometrics. Wiley, 2nd edition.
Wang, P. (2003). Financial Econometrics. Routledge.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. The MIT Press.
Wooldridge, J. M. (2003). Introductory Econometrics. Thomson, 2nd edition.
Yogo, M. (2004). Estimating the elasticity of intertemporal substitution when instruments are weak. The Review of Economics and Statistics, 86(3):797–810.