Basic Financial Econometrics
Alois Geyer
[email protected]
http://www.wu.ac.at/~geyer
this version:
I am grateful to many PhD students of the VGSF program, as well as doctoral and master
students at WU for valuable comments which have helped to improve these lecture notes.
1 Financial Regression Analysis
1.1 Regression analysis
We start by reviewing key aspects of regression analysis. Its purpose is to relate a dependent variable y to one or more variables X which are assumed to affect y. The relation is specified in terms of a systematic part, which determines the expected value of y, and a random part ε. For example, the systematic part could be a (theoretically derived) valuation relationship. The random part represents unsystematic deviations between observations and expectations (e.g. deviations from equilibrium). The relation between y and X depends on unknown parameters β which are used in the function that relates X to the expectation of y.
Assumption AL (linearity): We consider the linear regression equation

y = Xβ + ε.

y is the n×1 vector (y1, ..., yn)' of observations of the dependent (or endogenous) variable, ε is the n×1 vector of errors (also called residuals, disturbances, innovations or shocks), β is the K×1 vector of parameters, and the n×K matrix X of regressors (also called explanatory variables or covariates) is defined as follows:
    X = [ 1  x11  x12  ...  x1k
          1  x21  x22  ...  x2k
          .   .    .   ...   .
          1  xn1  xn2  ...  xnk ].
k is the number of regressors and K=k+1 is the dimension of β=(β0, β1, ..., βk)', where β0 is the constant term or intercept. A single row i of X will be denoted by the K×1 column vector xi. For a single observation the model equation is written as

yi = xi'β + εi    (i = 1, ..., n).

We will frequently (mainly in the context of model specification and interpretation) use formulations like

y = β0 + β1x1 + ... + βkxk + ε,

where the symbols y, xi and ε represent the variables in question. It is understood that such equations also hold for a single observation.
1.1.1 Least squares estimation
A main purpose of regression analysis is to draw conclusions about the population using a sample. The regression equation y = Xβ + ε is assumed to hold in the population. The sample estimate of β is denoted by b and the estimate of ε by e. According to the least squares (LS) criterion, b should be chosen such that the sum of squared errors SSE is minimized:

SSE(b) = Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − xi'b)² = (y − Xb)'(y − Xb) → min.
The minimizing value is the OLS estimator

b = (X'X)⁻¹X'y.    (1)

This expression can be rewritten in terms of sample moments as b = (X'X/n)⁻¹(X'y/n) to point out that the estimate is related to the covariance between the dependent variable and the regressors, and the covariance among regressors. In the special case of the simple regression model y = b0 + b1x + e with a single regressor the estimates b1 and b0 are given by

b1 = syx/sx² = ryx·(sy/sx)        b0 = ȳ − b1·x̄,

where syx (ryx) is the sample covariance (correlation) between y and x, sy and sx are the sample standard deviations of y and x, and ȳ and x̄ are their sample means.
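To make the formulas concrete, the following Python sketch (not part of the original notes; the data are simulated and all names are chosen for illustration) computes b1 and b0 from the covariance formulas above and verifies that the matrix expression (1) gives the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)   # simulated sample

# simple-regression formulas: b1 = s_yx / s_x^2, b0 = ybar - b1 * xbar
s_yx = np.cov(y, x, ddof=1)[0, 1]
b1 = s_yx / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# matrix formula b = (X'X)^{-1} X'y gives the same result
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

print(b0, b1)        # covariance-based estimates
print(b)             # [b0, b1] from the normal equations
```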
1.1.2 Implications
By the first order condition the OLS estimates satisfy the normal equations

(X'X)b − X'y = −X'(y − Xb) = −X'e = 0,    (2)

which implies that each column of X is uncorrelated with (orthogonal to) e.
If the first column of X is a column of ones denoted by ι, LS estimation has the following implications:

1. The residuals have zero mean since ι'e = 0 (from the normal equations).

2. This implies that the mean of the fitted values ŷi = xi'b is equal to the sample mean:

   (1/n) Σ_{i=1}^n ŷi = ȳ.

3. The fitted values are equal to the mean of y if the regression equation is evaluated at the means of X:

   ȳ = b0 + Σ_{j=1}^k x̄j·bj.
The goodness of the fit is commonly measured by the coefficient of determination

R² = 1 − e'e/((y − ȳ)'(y − ȳ)) = 1 − (y − ŷ)'(y − ŷ)/((y − ȳ)'(y − ȳ)) = (ŷ − ȳ)'(ŷ − ȳ)/((y − ȳ)'(y − ȳ)).

This is the so-called centered version of R² which lies between 0 and 1 if the model contains an intercept. It is equal to the squared correlation between y and ŷ. The three terms in the expression

(y − ȳ)'(y − ȳ) = (ŷ − ȳ)'(ŷ − ȳ) + (y − ŷ)'(y − ŷ)

are called the total sum of squares (SST), the sum of squares from the regression (SSR), and the sum of squared errors (SSE). Based on this relation R² is frequently interpreted as
1 By implication 3 the constant must be equal to ȳ since the mean of e is zero. The slope is given by (e'e)⁻¹e'ỹ, where ỹ = y − ȳι. The slope is equal to one since e'ỹ = e'e. The latter identity holds since in the original regression e'y = e'Xb + e'e and e'X = 0'. Finally, e'y = e'ỹ since e'ι = 0.
the percentage of y's variance 'explained' by the regression. If the model does not contain an intercept, the centered R² may become negative. In that case the uncentered R² can be used:

uncentered R² = 1 − e'e/(y'y) = ŷ'ŷ/(y'y).

R² is zero if all regression coefficients except for the constant are zero (b = (b0 0 ... 0)' and ŷ = b0 = ȳ). In this case the regression is a horizontal line. If R² = 1 all observations are located on the regression line (or hyperplane) (i.e. ŷi = yi). R² is (only) a measure for the goodness of the linear approximation implied by the regression. Many other, more relevant aspects of a model's quality are not taken into account by R². Such aspects will become more apparent as we proceed.
1.1.3 Interpretation
The coefficients b can be interpreted on the basis of the fitted values2

ŷ = b0 + x1b1 + ... + xkbk.

bj is the change in ŷ (or, the expected change in y) if xj changes by one unit ceteris paribus (c.p.), i.e. holding all other regressors fixed. In general the change in the expected value is

Δŷ = Δx1·b1 + ... + Δxk·bk,

which implies that the effects of simultaneously changing several regressors can be added up.
This interpretation is based on the Frisch-Waugh theorem. Suppose we partition the regressors into two groups X1 and X2, and regress y on X1 to save the residuals e1. Next we regress each column of X2 on X1 and save the residuals of these regressions in the matrix E2. According to the Frisch-Waugh theorem the coefficients from the regression of e1 on E2 are equal to the subset of coefficients from the regression of y on X that corresponds to X2. In more general terms, the theorem implies that partial effects can be obtained directly from a multiple regression. It is not necessary to first construct orthogonal variables.
To illustrate the theorem we consider the regression

y = b0 + b1x1 + b2x2 + e.

To obtain the coefficient of x2 such that the effect of x1 (and the intercept) is held constant, we first run the two simple regressions

y = cy + by1·x1 + ey1        x2 = cx2 + b21·x1 + e21.

ey1 and e21 represent those parts of y and x2 which do not depend on x1. Subsequently, we run a regression using these residuals to obtain the coefficient b2:

(y − cy − by1·x1) = b2(x2 − cx2 − b21·x1) + u        i.e.        ey1 = b2·e21 + u.
2 An analogous interpretation holds for β in the population.
In general, this procedure is also referred to as 'controlling for' or 'partialling out' the effect of X1. Simply speaking, if we want to isolate the effects of X2 on y we have to 'remove' the effects of X1 from the entire regression equation.3 However, according to the Frisch-Waugh theorem it is not necessary to run this sequence of regressions in practice. Running a (multiple) regression of y on all regressors X 'automatically' controls for the effects of each regressor on all other regressors. A special case is an orthogonal regression, where all regressors are uncorrelated (i.e. X'X is a diagonal matrix). In this case the coefficients from the multiple regression are identical to those obtained from K simple regressions using one column of X at a time.
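The Frisch-Waugh theorem is easy to verify numerically. A minimal sketch (simulated data, illustrative only): regress y on X1, regress x2 on X1, and compare the coefficient from the residual regression with the x2 coefficient of the full multiple regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)                 # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    """Return coefficients and residuals of an OLS regression."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

X1 = np.column_stack([np.ones(n), x1])             # constant and x1
_, e_y1 = ols(X1, y)                               # y purged of x1
_, e_21 = ols(X1, x2)                              # x2 purged of x1

b2_partial, _ = ols(e_21.reshape(-1, 1), e_y1)     # residual regression
b_full, _ = ols(np.column_stack([X1, x2]), y)      # full multiple regression

print(b2_partial[0], b_full[2])                    # both approximately -1.5
```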
Example 1: We use the real investment data from Table 3.1 in Greene (2003) to estimate a multiple regression model. The dependent variable is real investment (in trillion US$; denoted by y). The explanatory variables are real GNP (in trillion US$; g), the (nominal) interest rate r and the inflation rate i (both measured as percentages). The (rounded) estimated coefficients are

b = (−0.0726  0.236  −0.00356  −0.000276)',

where the first element is the constant term. The coefficient −0.00356 can be interpreted as follows: if the interest rate goes up by one percentage point and the other regressors do not change, real investment is expected to drop by about 3.56 billion US$. SST=0.0164, SSR=0.0127 and SSE=0.00364. The corresponding R² equals 0.78 (SSR/SST), which means that about 78% of the variance in real investment can be explained by the regressors. Further details can be found in the file investment.xls.
Exercise 1: Use the quarterly data in Table F5.1 from Greene's website
http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm
(see file Table F5.1.xls) to estimate a regression of real investment on a constant, real GDP, the nominal interest rate (90 day treasury bill rate) and the inflation rate. Check the validity of the five OLS implications mentioned in section 1.1.2.
Apply the Frisch-Waugh theorem and show how the coefficients of the constant term and real GDP can be obtained by controlling for the effects of the nominal interest rate and inflation.
3 As a matter of fact, the effects of X1, or any other set of regressors we want to control for, need not be removed from y. It can be shown that the coefficients associated with X2 can also be obtained from a regression of y on E2. Because of implication 5 the covariance between e1 and the columns of E2 is identical to the covariance between y and E2.
1.2 Finite sample properties of least squares estimates
Review 1: For any constants a and b and random variables Y and X the following relations hold:

E[a + Y] = a + E[Y]    E[aY] = aE[Y]    V[a + Y] = V[Y]    V[aY] = a²V[Y]
E[aX + bY] = aE[X] + bE[Y]    V[aX + bY] = a²V[X] + b²V[Y] + 2ab·cov[X,Y].

Jensen's inequality: E[f(X)] ≥ f(E[X]) for any convex function f(X).

For a constant a and random variables W, X, Y, Z the following relations hold:

if Y = aZ:    cov[X,Y] = a·cov[X,Z]
if Y = W + Z:    cov[X,Y] = cov[X,W] + cov[X,Z]
cov[X,Y] = E[XY] − E[X]E[Y]    cov[Y,a] = 0.

If X is an n×1 vector of random variables, V[X] = cov[X] = Σ = E[(X − E[X])(X − E[X])'] is an n×n matrix. Its diagonal elements are the variances of the elements of X. Using μ = E[X] we can write Σ = E[XX'] − μμ'.

If b is an n×1 vector and A is an n×n matrix of constants, the following relations hold:

E[b'X] = b'μ    V[b'X] = b'Σb    E[AX] = Aμ    V[AX] = AΣA'.
1.2.1 Assumptions
The sample estimates b and e can be used to draw conclusions about the population. An important question relates to the finite sample properties of the OLS estimates. Exact (or finite sample) inference as opposed to asymptotic (large sample) inference is valid for any sample size n and is based on further assumptions (in addition to AL and AR) mentioned and discussed below.

To derive the finite sample properties of the OLS estimate we rewrite b in (1) as follows:

b = (X'X)⁻¹X'(Xβ + ε) = β + (X'X)⁻¹X'ε = β + Hε.    (3)
We consider the statistical properties of b (in particular E[b], V[b], and its distribution). This is equivalent to investigating the sampling error b − β. From

E[b] = β + E[(X'X)⁻¹X'ε]    (4)

we see that the properties of b depend on the properties of X, ε, and their relation. In the so-called classical regression model, X is assumed to be non-stochastic. This means that X can be chosen (like in an experimental situation), or is fixed in repeated samples. Neither case holds in typical empirical studies in finance. We will treat X as random, and the finite sample properties derived below are considered to be conditional on the sample X (although we will not always indicate this explicitly). This does not preclude the possibility that X contains constants (e.g. dummy variables). The important requirement (assumption) is that X and ε are generated by mechanisms that are completely unrelated.
Assumption AX (strict exogeneity): The conditional expectation of each εi conditional on all observations and variables in X is zero:

E[ε|X] = 0        E[εi|x1, ..., xn] = 0    (i = 1, ..., n).

According to this assumption, X cannot be used to obtain information about ε. AX has the following implications:

1. (unconditional mean): E[E[ε|X]] = E[ε] = 0.

2. (conditional expectation): E[y|X] = Xβ.

3. Regressors and disturbances are orthogonal:

   E[xil·εj] = 0    (i, j = 1, ..., n; l = 1, ..., K),

   since E[xil·εj] = E[E[xil·εj|xil]] = E[xil·E[εj|xil]] = 0. This implies that regressors are orthogonal to the disturbances from the same and all other observations. Orthogonality with respect to the same observations is expressed by

   E[X'ε] = 0.

   Orthogonality is equivalent to zero correlation between X and ε:

   cov[X, ε] = E[X'ε] − E[X]E[ε] = 0.
4. yi·εi = xi'β·εi + εi²  ⇒  E[yi·εi] = E[εi²] = V[εi].

If AX holds, the explanatory variables are (strictly) exogenous. The term endogeneity (i.e. one or all explanatory variables are endogenous) is used if AX does not hold (broadly speaking, if X and ε are correlated).

For example, AX is violated when a regressor, in fact, is determined on the basis of the dependent variable y. This is the case in any situation where y and X (at least one of its columns) are determined simultaneously. A classic example is a regression attempting to analyze the effect of the number of policemen on the crime rate. Such regressions are bound to fail whenever the police force is driven by the number of crimes committed. Solutions to this kind of problem are discussed in section 1.9.1. Another example is a regression relating the performance of funds to their size. It is conceivable that an unobserved variable like the skill of fund managers affects size and performance. If that is the case, AX is violated.
Another important case where AX does not hold is a model where a lagged dependent variable is used as a regressor:

yt = γyt−1 + xt'β + εt        yt+1 = γyt + xt+1'β + εt+1        yt+2 = ... .

AX requires the disturbance εt to be uncorrelated with regressors from any other observation, e.g. with yt from the equation for t+1. AX is violated because E[yt·εt] ≠ 0.

Predictive regressions consist of a regression of yt on a lagged predictor xt−1:

yt = β0 + β1xt−1 + εt.

For typically used dependent variables like asset returns (i.e. yt = ln(pt/pt−1)) and predictors like the dividend-price ratio (i.e. xt = ln(dt/pt)), Stambaugh (1999) argues that, despite E[εt|xt−1] = 0, in a predictive regression E[εt|xt] ≠ 0, and thus AX is violated. To understand this reasoning, we consider

yt = ln pt − ln pt−1 = β1(ln dt−1 − ln pt−1) + εt = β1xt−1 + εt,
yt+1 = ln pt+1 − ln pt = β1(ln dt − ln pt) + εt+1 = β1xt + εt+1,   ...,

where β0 = 0 for simplicity. Disturbances εt affect the price in t (and, for given pt−1, the return during the period t−1 to t). Thus, they are correlated with pt, and hence with the regressor in the equation for t+1. Although the mechanism appears similar to the case of a lagged dependent variable, here the correlation between the disturbances and the very specifically defined predictors xt is the source of the violation of AX. Stambaugh (1999) shows that this leads to a finite-sample bias (see below) in the estimated parameter b1, irrespective of β1 (e.g. even if β1 = 0).
Assumption AH (homoscedasticity; uncorrelatedness): This assumption covers two aspects. It states that the (conditional) variance of the disturbances is constant across observations (assuming that AX holds):

V[εi|X] = E[εi²|X] − (E[εi|X])² = E[εi²|X] = σ²    ∀i.

The errors are said to be heteroscedastic if their variance is not constant.

The second aspect of AH relates to the (conditional) covariance of ε which is assumed to be zero:

cov[εi, εj|X] = 0    ∀i ≠ j        E[εε'|X] = V[ε|X] = σ²I.

This aspect of AH implies that the errors from different observations are not correlated. In a time series context this correlation is called serial or autocorrelation.

Assumption AN (normality): Assumptions AX and AH imply that the mean and variance of ε|X are 0 and σ²I. Adding the assumption of normality we have

ε|X ~ N(0, σ²I).

Since X plays no role in the distribution of ε, we have ε ~ N(0, σ²I). This assumption is useful to construct test statistics (see section 1.2.3), although many of the subsequent results do not require normality.
1.2.2 Properties
Expected value of b (AL, AR, AX): We first take the conditional expectation of (3):

E[b|X] = β + E[Hε|X]        H = (X'X)⁻¹X'.

Since H is a function of the conditioning variable X, it follows that

E[b|X] = β + H·E[ε|X],

and by assumption AX (E[ε|X] = 0) we find that b is unbiased:

E[b|X] = β.

By using the law of iterated expectations we can also derive the following unconditional result9 (again using AX):

E[b] = EX[E[b|X]] = β + EX[H·E[ε|X]] = β.

We note that assumptions AH and AN are not required for unbiasedness, whereas AX is critical. Since a model with a lagged dependent variable violates AX, all coefficients in such a regression will be biased.

Covariance of b (AL, AR, AX, AH): The covariance of b conditional on X is given by

V[b|X] = E[(b − β)(b − β)'|X]
       = E[Hεε'H'|X]
       = H·E[εε'|X]·H'    (5)
       = H(σ²I)H' = σ²HH'
       = σ²(X'X)⁻¹    since HH' = (X'X)⁻¹X'X(X'X)⁻¹.
For the special case of a single regressor the variance of b1 is given by

V[b1] = σ² / Σ_{i=1}^n (xi − x̄)²  =  σ² / ((n−1)·sx²),    (6)

which shows that the precision of the estimate increases with the sample size and the variance of the regressor sx², and decreases with the variance of the disturbances.
To derive the unconditional covariance of b we use the variance decomposition

E[V[b|X]] = V[b] − V[E[b|X]].

9 To verify that b is unbiased conditionally and unconditionally by simulation one could generate samples of y = Xβ + ε for fixed X using many realizations of ε. The average over the OLS estimates b|X – corresponding to E[b|X] – should be equal to β. However, if X is also allowed to vary across samples the average over b – corresponding to the unconditional mean E[b] = E[E[b|X]] – should also equal β.
Since E[b|X] = β the second term is zero and

V[b] = E[σ²(X'X)⁻¹] = σ²E[(X'X)⁻¹],

which implies that the unconditional covariance of b depends on the population covariance of the regressors.
Variance of e (AL, AR, AX, AH): The variance of b is expressed in terms of σ² (the population variance of ε). To estimate the covariance of b from a sample we replace σ² by the unbiased estimator

s_e² = e'e/(n − K)        E[s_e²] = σ².

Its square root s_e is the standard error of regression. s_e is measured in the same units as y. It may be a more informative measure for the goodness of fit than R², which is expressed in terms of variances (measured in squared units of y).

The estimated standard error of b denoted by se[b] is the square root of the diagonal of

V̂[b|X] = s_e²(X'X)⁻¹.
Efficiency (AL, AR, AX, AH): The Gauss-Markov theorem states that the OLS estimator b is not only unbiased but has the minimum variance of all linear unbiased estimators (BLUE) and is thus efficient. This result holds whether X is stochastic or not. If AN holds (the disturbances are normal) b has the minimum variance of all unbiased (linear or not) estimators (see Greene (2003), p.47-48).

Sampling distribution of b (AL, AR, AX, AH, AN): Given (3) and AN the distribution of b is normal for given X:

b|X ~ N(β, σ²(X'X)⁻¹).

The sample covariance of b is obtained by replacing σ² with s_e², and is given by V̂[b] defined above.
Example 2: The standard error of regression from example 1 is 18.2 billion US$. This can be compared to the standard deviation of real investment which amounts to 34 billion US$. s_e is used to compute the (estimated) standard errors for the estimated coefficients which are given by

se[b] = (0.0503  0.0515  0.00328  0.00365)'.

Further details can be found in the file investment.xls.
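As a hedged illustration of these formulas (simulated data, not the investment series from the example), the following sketch computes s_e² = e'e/(n−K) and the standard errors of the coefficients as the square roots of the diagonal of s_e²(X'X)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([0.5, 1.0, -2.0, 0.0])
y = X @ beta + rng.normal(scale=0.8, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
K = X.shape[1]

s2_e = e @ e / (n - K)                  # unbiased estimate of sigma^2
V_b = s2_e * np.linalg.inv(X.T @ X)     # estimated covariance of b
se_b = np.sqrt(np.diag(V_b))            # standard errors of the coefficients

print(np.sqrt(s2_e))   # standard error of regression
print(se_b)
```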
1.2.3 Testing hypotheses
Review 4: A null hypothesis H0 formulates a restriction with respect to an unknown parameter of the population: θ = θ0. In a two-sided test the alternative hypothesis Ha is θ ≠ θ0. The test procedure is a rule that rejects H0 if the sample estimate θ̂ is 'too far away' from θ0. This rule can be based on the 1−α confidence interval θ̂ ∓ Q(α/2)·se[θ̂], where Q(α) denotes the α-quantile of the sampling distribution of θ̂. H0 is rejected if θ0 is outside the confidence interval.

If Y ~ N(μ, σ²) and Z = (y − μ)/σ then Z ~ N(0, 1). Φ(z) = P[Y ≤ y] = Φ((y − μ)/σ) is the standard normal distribution function (e.g. Φ(−1.96) = 0.025). zα is the α-quantile of the standard normal distribution, such that P[Z ≤ zα] = α (e.g. z0.025 = −1.96).
Example 3: Consider a sample of n observations from a normal population with mean μ and standard deviation σ. The sampling distribution of the sample mean ȳ is also normal. The standard error of the mean is σ/√n. The 1−α confidence interval for the unknown mean μ is ȳ ∓ zα/2·σ/√n. The estimated standard error of the mean se[ȳ] = s/√n is obtained by replacing σ with the sample estimate s. In this case the 1−α confidence interval is given by ȳ ∓ T(α/2, n−1)·s/√n, where T(α, n−1) denotes the α-quantile of the t-distribution (e.g. T(0.025, 20) = −2.086). If n is large the standard normal and t-quantiles are practically equal. In that case the interval is given by ȳ ∓ zα/2·s/√n.
A type I error is committed if H0 is rejected although it is true. The probability of a type I error is the significance level (or size) α. If H0 is rejected, θ̂ is said to be significantly different from θ0 at a level of α. A type II error is committed if H0 is not rejected although it is false. The power of a test is the probability of correctly rejecting a false null hypothesis. The power depends on the true parameter (which is usually unknown).

A test statistic is based on a sample estimate θ̂ and θ0. It is a random variable. The distribution of the test statistic (usually under H0) can be used to specify a rule for rejecting H0. H0 is rejected if the test statistic exceeds critical values which depend on α (and other parameters). In a two-sided test the critical values are the α/2-quantiles and 1−α/2-quantiles of the distribution. In a one-sided test of the form H0: θ ≥ θ0 (and Ha: θ < θ0) the critical value is the α-quantile (this implies that H0 is rejected if θ̂ is 'far below' θ0). If H0: θ ≤ θ0 the critical value is the 1−α quantile. The p-value is that level of α for which there is indifference between accepting or rejecting H0.
Example 4: We consider a hypothesis about the mean μ of a population. μ = μ0 can be tested against μ ≠ μ0 using the t-statistic (or t-ratio) t = (ȳ − μ0)/se[ȳ]. t has a standard normal or t-distribution depending on whether σ or s is used to compute se[ȳ]. If s is used, the t-statistic is compared to T(α/2, n−1) in a two-sided test. One-sided tests use T(α, n−1). In a two-sided test, H0 is rejected if |t| > |T(α/2, n−1)|.
Exercise 2: Use the results from exercise 1 and test the estimated coefficients for individual and joint significance.
In general, hypothesis tests about β can be based on imposing a linear restriction r (a K×1 vector consisting of zeros and ones) on β and b, and comparing δ = r'β to d = r'b. If d differs significantly from δ we conclude that the sample is inconsistent with (or, does not support) the hypothesis expressed by the restriction. Since b is normal, r'b is also normal, and the test statistic

t = (d − δ)/se[d]        se[d] = √( r'[s_e²(X'X)⁻¹]r )

has a t-distribution with df = n−K.

We can consider several restrictions at once by using the m×K matrix R to define δ = Rβ and d = Rb. Under the null that all restrictions hold we can define the Wald statistic

W = (d − δ)'[s_e²R(X'X)⁻¹R']⁻¹(d − δ).    (7)

W has a χ²(m)-distribution if the sample is large enough (see section 1.5) (or if s_e² in (7) is replaced by the usually unknown σ²). Instead, one can use the test statistic W/m which has an F-distribution with df = (m, n−K). In small samples, a test based on W/m will be more conservative (i.e. will have larger p-values).
So far, restrictions have been tested using the estimates from the unrestricted model. Alternatively, restrictions may directly be imposed when the parameters are estimated. This will lead to a loss of fit (i.e. R² will decrease). If R²r is based on the parameter vector br (where some of the parameters are fixed rather than estimated) and R²u is based on the unrestricted estimate, the test statistic

F = [(n−K)(R²u − R²r)] / [m(1 − R²u)]

has an F-distribution with df = (m, n−K). It can be shown that F = W/m (see Greene (2003), section 6.3). If F is significantly different from zero, H0 is rejected and the restrictions are considered to be jointly invalid.
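The following sketch (simulated data; all names are illustrative) computes the Wald statistic (7) and the equivalent F-statistic W/m for a set of linear restrictions Rβ = q, using the coefficient covariance s_e²(X'X)⁻¹ from above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
K = X.shape[1]
s2_e = e @ e / (n - K)
V_b = s2_e * np.linalg.inv(X.T @ X)

# H0: beta_1 = 1 and beta_2 = 1  (m = 2 restrictions, R beta = q)
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0])
d = R @ b

W = (d - q) @ np.linalg.solve(R @ V_b @ R.T, d - q)   # Wald statistic
m = R.shape[0]
F = W / m

p_wald = 1 - stats.chi2.cdf(W, df=m)
p_F = 1 - stats.f.cdf(F, dfn=m, dfd=n - K)
print(W, p_wald, F, p_F)
```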
The distribution of the test statistics t, F and W depends on assumption AN (normality of disturbances). In section 1.3 we will comment on the case that AN does not hold.
1.2.4 Example 6: CAPM, beta-factors and multi-factor models
The Capital Asset Pricing Model (CAPM) considers the equilibrium relation between the expected return of an asset or portfolio (μi = E[y^i]), the risk-free return rf, and the expected return of the market portfolio (μm = E[y^m]). Based on various assumptions (e.g. quadratic utility or normality of returns) the CAPM states that

μi − rf = βi(μm − rf).    (8)

This relation is also known as the security market line (SML). In the CAPM the so-called beta-factor βi defined as

βi = cov[y^i, y^m] / V[y^m]

is the appropriate measure of an asset's risk. The (total) variance of the asset's returns is an inappropriate measure of risk since a part of this variance can be diversified away by holding the asset in a portfolio. The risk of the market portfolio cannot be diversified any further. The beta-factor βi shows how the asset responds to market-wide movements and measures the market risk or systematic risk of the asset. The risk premium an investor can expect to obtain (or requires) is proportional to βi. Assets with βi > 1 imply more risk than the market and should thus earn a proportionately higher risk premium.
Observed returns of the asset (y_t^i; t = 1, ..., n) and the market portfolio (y_t^m) can be used to estimate βi or to test the CAPM. Under the assumption that observed returns deviate from expected returns we obtain

y_t^i − μi = u_t^i        y_t^m − μm = u_t^m.

When we substitute these definitions for the expected values in the CAPM we obtain the so-called market model

y_t^i = αi + βi·y_t^m + ε_t^i,

where αi = (1 − βi)rf and ε_t^i = u_t^i − βi·u_t^m. The coefficients αi and βi in this equation can be estimated by OLS. If we write the regression equation in terms of (observed) excess returns x_t^i = y_t^i − rf and x_t^m = y_t^m − rf we obtain

x_t^i = αi + βi·x_t^m + ε_t^i.

Thus the testable implication of the CAPM is that the constant term in a simple linear regression using excess returns should be equal to zero. In addition, the CAPM implies that there must not be any other risk factors than the market portfolio (i.e. the coefficients of such factors should not be significantly different from zero).
We use monthly data on the excess returns of two industry portfolios (consumer goods and hi-tech) compiled by French10. We regress the excess returns of the two industries on the excess market return based on a value-weighted average of all NYSE, AMEX, and NASDAQ firms (all returns are measured in percentage terms). Using data from January 2000 to December 2004 (n=60) we obtain the following estimates for the consumer goods portfolio (p-values in parentheses; details can be found in the file capm.wf1)

x_t^i = 0.343 + 0.624·x_t^m + e_t^i        R² = 0.54    s_e = 2.9,
       (0.36)   (0.0)

and for the hi-tech portfolio

x_t^i = 0.717 + 1.74·x_t^m + e_t^i        R² = 0.87    s_e = 3.43.
       (0.11)   (0.0)

The coefficients 0.624 and 1.74 indicate that a change in the (excess) market return by one percentage point implies a change in the expected excess return by 0.624 percentage points and 1.74 percentage points, respectively. In other words, the hi-tech portfolio has much higher market risk than the consumer goods portfolio.

10 http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. The files capm.wf1 and capm.txt are based on previous versions of data posted there. These files have been compiled using the datasets which are now labelled as "5 Industry Portfolios" and "Fama/French 3 Factors" (which includes the risk-free return rf).
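A market-model regression like the ones above can be run in a few lines. The sketch below assumes that two arrays of monthly excess returns, x_industry and x_market, have already been constructed from French's data (the download and cleaning steps are omitted; the simulated arrays at the bottom only stand in for the real series). It returns the estimates of alpha and beta, their standard errors, the p-value for alpha = 0, and R²:

```python
import numpy as np
from scipy import stats

def market_model(x_industry, x_market):
    """OLS of industry excess returns on a constant and market excess returns."""
    n = len(x_industry)
    X = np.column_stack([np.ones(n), x_market])
    b = np.linalg.solve(X.T @ X, X.T @ x_industry)       # [alpha, beta]
    e = x_industry - X @ b
    s2_e = e @ e / (n - 2)
    se_b = np.sqrt(np.diag(s2_e * np.linalg.inv(X.T @ X)))
    t_alpha = b[0] / se_b[0]                              # test of alpha = 0
    p_alpha = 2 * (1 - stats.t.cdf(abs(t_alpha), df=n - 2))
    r2 = 1 - (e @ e) / np.sum((x_industry - x_industry.mean())**2)
    return b, se_b, p_alpha, r2

# example call with simulated returns standing in for the real data
rng = np.random.default_rng(4)
x_market = rng.normal(0.5, 4.0, size=60)
x_industry = 1.7 * x_market + rng.normal(0.0, 3.0, size=60)
print(market_model(x_industry, x_market))
```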
The market model can be used to decompose the total variance of an asset into market- and firm-specific variance as follows (assuming that cov[y^m, ε^i] = 0):

σi² = βi²·σm² + σεi².

βi²σm² can be interpreted as the risk that is market-specific or systematic (cannot be diversified since it is due to market-wide movements) and σεi² is firm-specific (or idiosyncratic) risk. Since R² can also be written as (βi²σm²)/σi² it measures the proportion of the market-specific variance in total variance. The R² from the two equations imply that 53% and 86% of the variance in the portfolios' returns are systematic. The higher R² from the hi-tech regression indicates that this industry is better diversified than the consumer goods industry. The p-values of the constant terms indicate that the CAPM implication cannot be rejected. This conclusion changes, however, when the sample size is increased.
The CAPM makes an (equilibrium) statement about all assets as expressed by the security market line (8). In order to test the CAPM, beta-factors β̂i for many assets are estimated from the market model using time-series regressions. Then mean returns ȳ^i for each asset (as an average across time) are computed, and the cross-sectional regression

ȳ^i = γf + γm·β̂i + εi

is run. The estimates for γf and γm (the market risk premium) are estimates of rf and (μm − rf) in equation (8). If the CAPM is valid, the mean returns of all assets should be located on the SML – i.e. on the line implied by this regression. However, there are some problems associated with this regression. The usual OLS standard errors of the estimated coefficients are incorrect because of heteroscedasticity in the residuals. In addition, the regressors β̂i are subject to an errors-in-variables problem since they are not observed and will not correspond to the 'true' beta-factors.

Fama and MacBeth (1973) have suggested a procedure to improve the precision of the estimates. They first estimate beta-factors β̂_t^i for a large number of assets by running
the market model regression using monthly11 time series of excess returns. The estimated beta-factors are subsequently used as regressors in the cross-sectional regression

y_t^i = γft + γmt·β̂_t^i + ε_t^i.

Note that β̂_t^i is based on an excess return series which ends one month before the cross-sectional regression is estimated (i.e. using x_s^i and x_s^m for s = t−n, ..., t−1). The cross-sectional regression is run in each month of the sample period and a time series of estimates γ̂ft and γ̂mt is obtained. The sample means and the standard errors of γ̂ft and γ̂mt are used as the final estimates for statistical inference12. Although the Fama-MacBeth approach yields improved estimates, Shanken (1992) has pointed out further deficiencies and has suggested a correction.

11 Using monthly data is not a prerequisite of the procedure. It could be performed using other data frequencies as well.
12 See Fama-MacBeth.xlsx for an illustration of the procedure using only 30 assets and the S&P500 index.
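The two-pass Fama-MacBeth procedure can be sketched as follows. This is an illustration with simulated returns, and the fixed rolling window used to estimate the betas is a simplification of the actual implementation: in each month the cross-sectional regression of asset returns on the previously estimated betas yields one pair of estimates, and their time-series means and standard errors are the final estimates.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N, window = 120, 25, 60
x_m = rng.normal(0.5, 4.0, size=T)                        # market excess returns
beta_true = rng.uniform(0.5, 1.5, size=N)
r = 0.3 + np.outer(x_m, beta_true) + rng.normal(0, 3, (T, N))  # asset excess returns

gammas = []
for t in range(window, T):
    # first pass: estimate each asset's beta from the preceding 'window' months
    Xw = np.column_stack([np.ones(window), x_m[t - window:t]])
    betas = np.array([np.linalg.lstsq(Xw, r[t - window:t, i], rcond=None)[0][1]
                      for i in range(N)])
    # second pass: cross-sectional regression of month-t returns on the betas
    Xc = np.column_stack([np.ones(N), betas])
    gammas.append(np.linalg.lstsq(Xc, r[t], rcond=None)[0])

gammas = np.array(gammas)                                  # rows: (gamma_f, gamma_m)
mean_g = gammas.mean(axis=0)
se_g = gammas.std(axis=0, ddof=1) / np.sqrt(len(gammas))
print(mean_g, mean_g / se_g)                               # estimates and t-ratios
```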
The CAPM has been frequently challenged by empirical evidence indicating significant risk premia associated with other factors than the market portfolio. A crucial aspect of the CAPM (in addition to assumptions about utility or return distributions) is that the market portfolio must include all available assets (which is hard to achieve in empirical studies). According to the Arbitrage Pricing Theory (APT) by Ross (1976) there exist several risk factors Fj that are common to a set of assets. The factors are assumed to be uncorrelated, but no further assumptions about utility or return distributions are made. These risk factors (and not only the market risk) capture the systematic risk component. Although the APT does not explicitly specify the nature of these factors, empirical research has typically considered two types of factors. One factor type corresponds to macroeconomic conditions such as inflation or industrial production (see Chen et al., 1986), and a second type corresponds to portfolios (see Fama and French, 1992). Considering only two common factors (for notational simplicity) the asset returns are governed by the factor model

y_t^i = αi + βi1·F_t^1 + βi2·F_t^2 + ε_t^i,

where βij are the factor sensitivities (or factor loadings). The expected return of a single asset in this two-factor model is given by

E[y^i] = μi = λ0 + λ1βi1 + λ2βi2,

where λj is the factor risk premium of Fj and λ0 = rf. Using V[Fj] = σj² and cov[F1, F2] = 0 the total variance of an asset can be decomposed as follows:

σi² = βi1²·σ1² + βi2²·σ2² + σεi².

Estimation of the beta-factors is done by factor analysis, which is not treated in this text. For further details of the APT and associated empirical investigations see Roll and Ross (1980).
We briefly investigate one version of multi-factor models using the so-called Fama-French benchmark factors SMB (small minus big) and HML (high minus low) to test whether excess returns depend on other factors than the market return. The factor SMB measures the difference in returns of portfolios of small and large stocks, and is intended to measure the so-called size effect. HML measures the difference between value stocks (having a high book value relative to their market value) and growth stocks (with a low book-market ratio).13 The estimated regression equations are (details can be found in the file capm.wf1)

x_t^i = 0.085 + 0.68·x_t^m − 0.089·SMBt + 0.29·HMLt + et        R² = 0.7
       (0.8)    (0.0)        (0.30)        (0.0)

for the consumer goods portfolio and

x_t^i = 0.83 + 1.66·x_t^m + 0.244·SMBt − 0.112·HMLt + et        R² = 0.89
       (0.07)  (0.0)        (0.04)        (0.21)

for the hi-tech portfolio. Consistent with the CAPM the constant term in the first case is not significant. The beta-factor remains significant in both industries and changes only slightly compared to the market model estimates. However, the results indicate a significant return premium for holding value stocks in the consumer goods industry. For the hi-tech portfolio we find support for a size effect. Overall, the results can be viewed as supporting multi-factor models.
Exercise 3: Retrieve excess returns for industry portfolios of your choice from
French's website. Estimate beta-factors in the context of multi-factor models.
Interpret the results and test implications of the CAPM.
13 Further details on the variable definitions and the underlying considerations can be found on French's website http://mba.tuck.dartmouth.edu/pages/faculty/ken.french.
1.2.5 Example 7: Interest rate parity
We consider a European investor who invests in a riskless US deposit with rate rf. He buys US dollars at the spot exchange rate St (St is the amount in Euro paid/received for one dollar), invests at rf, and after one period converts back to Euro at the rate St+1. The one-period return on this investment is given by

ln St+1 − ln St + rf.

Forward exchange rates Ft can be used to hedge against the currency risk (introduced by the unknown St+1) involved in this investment. If Ft denotes the rate fixed at t to buy/sell US dollars in t+1 the (certain) return is given by

ln Ft − ln St + rf.

Since this return is riskless it must equal the return rfd from a domestic riskless investment to avoid arbitrage. This leads to the covered interest rate parity (CIRP)

rfd − rf = ln Ft − ln St.

The left hand side is the interest rate differential and the right hand side is the forward premium.
The uncovered interest rate parity (UIRP) is defined in terms of the expected spot rate

rfd − rf = Et[ln St+1] − ln St.

Et[ln St+1] can differ from ln Ft if the market pays a risk premium for taking the risk of an unhedged investment. A narrowly defined version of the UIRP assumes risk neutrality and states that the risk premium is zero (see Engel, 1996, for a survey):

Et[ln St+1 − ln St] = ln Ft − ln St.

Observed exchange rates St+1 can deviate from Ft, but the expected difference must be zero. The UIRP can be tested using the Fama regression

st − st−1 = β0 + β1(ft−1 − st−1) + εt,

where st = ln St and ft = ln Ft. The UIRP imposes the testable restrictions β0 = 0 and β1 = 1.14
We use a data set15 from Verbeek (2004) and obtain the following results (t-statistics in parentheses)

st − st−1 = 0.0023 + 0.515·(ft−1 − st−1) + et        R² = 0.00165.
            (0.72)   (0.67)
14 Hayashi (2000, p.424) discusses the question why UIRP cannot be tested on the basis of st = β0 + β1·ft−1 + εt.
15 This data is available from http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html. We use the corrected data set forward2c from chapter 4 (foreign exchange markets). Note that the exchange and forward rates in this dataset are expressed in terms of US dollars paid/received for one Euro. To make the data consistent with the description in this section we have defined the logs of spot and forward rates accordingly (although this does not change the substantive conclusions). Details can be found in the file uirp.xls.
Testing the coefficients individually shows that b0 is not significantly different from 0 and b1 is not significantly different from 1.

To test both restrictions at once we define

R = [ 1  0        δ = [ 0
      0  1 ]            1 ].

The Wald statistic for testing both restrictions equals 3.903 with a p-value of 0.142. The p-value of the F-statistic W/2 = 1.952 is 0.144. Alternatively, we can use the R² from the restricted model with β0 = 0 and β1 = 1. This requires defining restricted residuals according to (st − st−1) − (ft−1 − st−1). The associated R² is negative and the F-statistic is again 1.952. Thus, the joint test confirms the conclusion derived from testing individual coefficients, and we cannot reject UIRP (which does not mean that UIRP holds!).
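The Fama regression and the joint Wald test of β0 = 0 and β1 = 1 can be reproduced with a short script. The sketch below assumes arrays s and f containing the log spot and log forward rates; the simulated random-walk series at the bottom merely stand in for the data in uirp.xls.

```python
import numpy as np
from scipy import stats

def fama_regression_test(s, f):
    """Fama regression of s_t - s_{t-1} on f_{t-1} - s_{t-1}; Wald test of (0, 1)."""
    ds = s[1:] - s[:-1]                       # dependent variable
    fp = f[:-1] - s[:-1]                      # forward premium (lagged)
    n = len(ds)
    X = np.column_stack([np.ones(n), fp])
    b = np.linalg.solve(X.T @ X, X.T @ ds)
    e = ds - X @ b
    s2_e = e @ e / (n - 2)
    V_b = s2_e * np.linalg.inv(X.T @ X)
    d = b - np.array([0.0, 1.0])              # deviation from H0: beta0=0, beta1=1
    W = d @ np.linalg.solve(V_b, d)
    p = 1 - stats.chi2.cdf(W, df=2)
    return b, W, p

# illustrative call with simulated (random-walk) log rates
rng = np.random.default_rng(6)
s = np.cumsum(rng.normal(0, 0.03, size=200))
f = s + rng.normal(0, 0.002, size=200)
print(fama_regression_test(s, f))
```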
Exercise 4: Repeat the analysis and tests from example 7 but use the US dollar/British pound exchange and forward rates in the file uirp.xls to test the UIRP.
1.2.6 Prediction
Regression models can also be used for out-of-sample prediction. Suppose the estimated model from n observations is y = Xb + e and we want to predict y0 given a new observation of the regressors x0 which has not been included in the estimation (hence: out-of-sample). From the Gauss-Markov theorem it follows that the prediction

ŷ0 = x0'b

is the best linear unbiased predictor of y0. The prediction error is e0 = y0 − ŷ0 and its variance (conditional on X and x0) is

V[e0] = σ²(1 + x0'(X'X)⁻¹x0).

This shows that the variance of the prediction (error) increases with the distance of x0 from the mean of the regressors and decreases with the sample size. The (estimated) variance of the disturbances can be viewed as a lower bound for the variance of the out-of-sample prediction error.
If σ² is replaced by s_e² we can compute a 1−α prediction interval for y0 from

ŷ0 ∓ zα/2·se[e0],

where se[e0] is the square root of the estimated variance V̂[e0]. These calculations, using example 1, can be found in the file investment.xls on the sheet prediction.
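A minimal sketch of such a prediction interval (simulated data; the normal quantile is used as in the text, and all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 80
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.4, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2_e = e @ e / (n - 2)

x0 = np.array([1.0, 2.0])                        # new observation of the regressors
y0_hat = x0 @ b
var_e0 = s2_e * (1.0 + x0 @ np.linalg.solve(X.T @ X, x0))   # prediction-error variance
z = stats.norm.ppf(0.975)                        # 95% interval
print(y0_hat - z * np.sqrt(var_e0), y0_hat + z * np.sqrt(var_e0))
```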
1.3 Large sample properties of least squares estimates
Review 5: We consider the asymptotic properties of an estimator θ̂n which hold as the sample size n grows without bound.

Convergence: The random variable θ̂n converges in probability to the (non-random) constant c if, for any ε > 0,

lim_{n→∞} P[|θ̂n − c| > ε] = 0.

Example: The sample mean ȳ from a population with μ and σ² is consistent for μ since E[ȳ] = μ and aV[ȳ] = σ²/n. Thus plim ȳ = μ.

Consistency of a mean of functions: Consider a random sample (y1, ..., yn) from a random variable Y and any function f(y). If E[f(Y)] and V[f(Y)] are finite constants then

plim (1/n) Σ_{i=1}^n f(yi) = E[f(Y)].
We have to make sure that the covariance matrix of regressors X is 'well behaved'. This requires that all elements of X'X/n converge to finite constants (i.e. the corresponding population moments). This is expressed by the assumption

AR:    plim (1/n)X'X = Q,    (11)

where Q is a positive definite matrix.

Regarding the second probability limit in (10), Greene (2003, p.66) defines

(1/n)X'ε = (1/n) Σ_{i=1}^n xi·εi = (1/n) Σ_{i=1}^n wi = w̄n
and uses AX to show that

E[w̄n] = 0        V[w̄n] = E[w̄n·w̄n'] = (σ²/n)·E[X'X/n].

The variance of w̄n will converge to zero, which implies that plim w̄n = 0, or

plim (1/n)X'ε = 0.

Thus the probability limit of bn is given by

plim bn = β + Q⁻¹·0,

and we conclude that bn is consistent:

plim bn = β.
1.3.2 Asymptotic normality
Large-sample theory is not based on the normality assumption AN, but derives an approximation of the distribution of the OLS estimates. We rewrite (9) as

√n(bn − β) = (X'X/n)⁻¹ · (1/√n)X'ε    (12)

to derive the asymptotic distribution of √n(bn − β) using the central limit theorem. By AR the probability limit of the first term on the right hand side of (12) is Q⁻¹. Next we consider the limiting distribution of

(1/√n)X'ε = √n(w̄n − E[w̄n]).
w̄n is the average of n i.i.d. random vectors wi = xi·εi. From the previous subsection we know that E[w̄n] = 0. Greene (2003, p.68) shows that the variance of √n·w̄n converges to σ²Q. Thus, in analogy to the univariate case, we can apply the central limit theorem. The means of the i.i.d. random vectors wi converge to a normal distribution:

(1/√n)X'ε = √n·w̄n →d N(0, σ²Q).

We can now complete the derivation of the limiting distribution of (12) by including Q⁻¹ to obtain

Q⁻¹·(1/√n)X'ε →d N(Q⁻¹·0, Q⁻¹(σ²Q)Q⁻¹),

or

√n(bn − β) →d N(0, σ²Q⁻¹)        bn ~a N(β, (σ²/n)·Q⁻¹).
Note that the asymptotic normality of b is not based on AN but on the central limit theorem. The asymptotic covariance of bn is estimated by using (X'X)⁻¹ to estimate (1/n)Q⁻¹ and s_e² = SSE/(n−K) to estimate σ²:

aV̂[bn] = s_e²(X'X)⁻¹.

This implies that t- and F-statistics are asymptotically valid even if the residuals are not normal. If F has an F-distribution with df = (m, n−K) then W = mF ~a χ²(m).

In small samples the t-distribution may be a reasonable approximation18 even when AN does not hold. Since it is more conservative than the standard normal, it may be preferable to use the t-distribution. By a similar argument, using the F-distribution (rather than W = mF and the χ² distribution) can be justified in small samples when AN does not hold.

18 If AN does not hold the finite sample distribution of the t-statistic is unknown.
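The claim that t-statistics remain valid in large samples without AN can be checked by simulation: draw markedly non-normal disturbances, re-estimate the slope many times, and record how often a true null hypothesis is rejected. A minimal sketch with illustrative settings:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, reps, beta1 = 200, 2000, 0.0
rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)
    eps = rng.exponential(1.0, size=n) - 1.0        # skewed, mean-zero errors
    y = 1.0 + beta1 * x + eps
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2_e = e @ e / (n - 2)
    se_b1 = np.sqrt(s2_e * np.linalg.inv(X.T @ X)[1, 1])
    t = (b[1] - beta1) / se_b1
    rejections += abs(t) > stats.t.ppf(0.975, df=n - 2)

print(rejections / reps)    # should be close to the nominal size of 0.05
```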
1.3.3 Time series data19
With time series data the strict exogeneity assumption AX is usually hard to maintain. For example, a company's returns may depend on the current, exogenous macroeconomic conditions and the firm's past production (or investment, finance, etc.) decisions. To the extent that the company decides upon the level of production based on past realized returns (which include past disturbances), the current disturbances may be correlated with regressors in future equations. More generally, strict exogeneity might not hold if regressors are policy variables which are set depending on past outcomes.

If AX does not hold (e.g. in a model with a lagged dependent variable), bn is biased. In the previous subsections consistency and asymptotic normality have been established on the basis of Aiid and AR. However, with time series data the i.i.d. assumption need not hold and the applicability of limit theorems is not straightforward. Nevertheless, consistent estimates in a time series context can still be obtained. The additional assumptions needed are based on the following concepts.
A stochastic process Yt is a sequence20 of random variables ..., Y−1, Y0, Y1, ... . An observed sequence yt (t = 1, ..., n) is a sample or realization (one possible outcome) of the stochastic process. Any statistical inference about Yt must be based on the single draw yt from the so-called ensemble of realizations of the process. Two properties are crucial in this context: the process has to be stationary (i.e. the underlying distribution of Yt does not change with t) and ergodic (i.e. each individual observation provides unique information about the process; adjacent observations must not be too similar). More formally, a stationary process is ergodic if any two random variables Yt and Yt−ℓ are asymptotically (i.e. ℓ→∞) independent.

A stochastic process is characterized by the autocovariance

γℓ = E[(Yt − μ)(Yt−ℓ − μ)] = E[Yt·Yt−ℓ] − μ²,    (13)
or the autocorrelation

ρℓ = γℓ/γ0 = γℓ/σ².    (14)

A stochastic process is weakly or covariance stationary if E[Yt²] < ∞ and if E[Yt], V[Yt] and γℓ do not depend on t (i.e. γℓ and ρℓ only depend on ℓ). If Yt is strictly stationary the joint distribution of Yt and Yt−ℓ does not depend on the time shift ℓ. If Yt is weakly stationary and normally distributed then Yt is also strictly stationary.
According to the ergodic theorem, averages from a single observed sequence will converge to the corresponding parameters of the population, if the process is stationary and ergodic. If Yt is stationary and ergodic with E[Yt] = μ, the sample mean obtained from a single realization yt converges to μ asymptotically:

ȳn = (1/n) Σ_{t=1}^n yt → μ    (as n → ∞).
19 Most of this subsection is based on Greene (2003), section 12.4.
20 We use the index t since stochastic processes are frequently viewed in terms of chronologically ordered sequences across time. However, the index set is arbitrary and everything we say holds as well if the index refers to other entities (e.g. firms).
If Yt is covariance stationary it is sufficient that Σ_{ℓ=0}^∞ |γℓ| < ∞ (absolute summability) for the process to be ergodic for the mean. The theorem extends to any (finite) moment of stationary and ergodic processes. In the special case where Yt is a normal and stationary process, absolute summability is enough to ensure ergodicity for all moments. Whereas many tests for stationarity are available (see section 2.3.3), ergodicity is difficult to test and is usually assumed to hold. Quickly decaying estimated autocorrelations can be taken as empirical evidence of stationarity and ergodicity.
In other words, the ergodic theorem implies that consistency does not require independent observations. Greene (2003, p.73) shows that consistency and asymptotic normality of the OLS estimator can be preserved in a time-series context by replacing AX with the weaker condition21

AX:    E[εt|xt−ℓ] = 0    (∀ℓ ≥ 0),

replacing AR by

ARt:    plim (1/n) Σ_{t=ℓ+1}^n xt·xt−ℓ' = Q(ℓ),

where Q(ℓ) is a finite matrix, and by requiring that Q(ℓ) has to converge to a matrix of zeros as ℓ→∞. These properties of Q(ℓ) can be summarized by the assumption that xt is stationary and ergodic.

This has the following implications for models with a lagged dependent variable:

yt = γ1yt−1 + ... + γpyt−p + zt'δ + εt.

Although estimates of γi and δ are biased (since AX is violated), they are consistent provided the weaker condition AX holds, and xt = [yt−1, ..., yt−p, zt] is stationary and ergodic.

21 Other authors (e.g. Hayashi, 2000, p.109) assume that εt and xt are contemporaneously uncorrelated (E[xt·εt] = 0), as implied by AX.
1.4 Maximum likelihood estimation
Review 6:22 We consider a random sample yi (i = 1, ..., n) to estimate the parameters μ and σ² of a random variable Y ~ N(μ, σ²). The maximum likelihood (ML) estimates are those values for the parameters of the underlying distribution which make the observed sample most likely (i.e. would generate it most frequently).

The likelihood (function) L(θ) is the joint density evaluated at the observations yi (i = 1, ..., n) as a function of the parameter (vector) θ:

L(θ) = f(y1|θ)·f(y2|θ) ··· f(yn|θ).

f(yi|θ) is the value of the density function at yi given the parameters θ. To simplify the involved calculations the logarithm of the likelihood function (the log-likelihood) is maximized:

ln L(θ) = ℓ(θ) = Σ_{i=1}^n ln f(yi|θ) → max.

To estimate more general models the constants μ and σ² can be replaced by the conditional mean μi and variance σi², provided the standardized residuals εi = (yi − μi)/σi are i.i.d. Then the likelihood depends on the coefficients in the equations which determine μi and σi².
The ML estimation of a regression model requires the specification of a distribution for the disturbances. If εi = yi − xi'β is assumed to be i.i.d.23 and normal N(0, σ²), the log-likelihood is given by

ℓ(β, σ²) = −(n/2)·ln 2π − (n/2)·ln σ² − (1/(2σ²))·(y − Xβ)'(y − Xβ).    (15)

22 For details see Kmenta (1971), p.174 or Wooldridge (2003), p.746.
23 Note that the i.i.d. assumption is not necessary for the observations but only for the residuals.
The necessary conditions for a maximum are

∂ℓ/∂β:    (1/σ²)·X'(y − Xβ) = (1/σ²)·X'ε = 0

∂ℓ/∂σ²:   −n/(2σ²) + (1/(2σ⁴))·(y − Xβ)'(y − Xβ) = 0.

The solution of these equations gives the estimates

b = (X'X)⁻¹X'y        s̃e² = e'e/n.
ML estimators are attractive because of their large sample properties: provided that the model is correctly specified they are consistent, asymptotically efficient and asymptotically normal24:

b ~a N(β, I(β)⁻¹).

I(β) is the information matrix evaluated at the true parameters. Its inverse can be used to estimate the covariance of b. Theoretically, I(β) is minus the expected value of the second derivatives of the log-likelihood. In practice this expectation can be computed in either of the two following ways (see Greene, 2003, p.480). One way is to evaluate the Hessian matrix (the second derivatives of ℓ) at b:

Î(b) = −∂²ℓ/(∂b∂b'),

where the second derivatives are usually calculated numerically. Another way is the BHHH estimator25 which is based on the outer product of the gradient (or score vector):

Î(b) = Σ_{i=1}^n gi·gi'        gi = ∂ℓi(b)/∂b        or        Î(b) = G'G,

where gi is the K×1 gradient for observation i, and G is an n×K matrix with rows equal to the transpose of the gradients for each observation.
In general, the Hessian and the BHHH approach do not yield the same results, even when the derivatives are available in closed form. The two estimates of I can also differ when the model is misspecified. Quasi-ML (QML) estimates are based on maximizing the likelihood using a distribution which is known to be incorrect (i.e. using the wrong density when formulating (15)). For instance, the normal distribution is frequently used as an approximation when the true distribution is unknown or cumbersome to use.

Significance tests of regression coefficients are based on the asymptotic normality of the ML estimates. z-statistics (rather than t-statistics) are frequently used to refer to the standard normal distribution of the test statistic zi = (bi − βi)/se[bi], where the standard error se[bi] is the square root of the i-th diagonal element of the inverse of Î(b), and zi ~a N(0, 1).

The major weakness of ML estimates is their potential bias in small samples (e.g. the variance estimate is scaled by n rather than n−K).

24 Greene (2003), p.473.
25 BHHH refers to the initials of Berndt, Hall, Hall, and Hausman who have first proposed this approach.
Example 8: We use the quarterly investment data from 1950-2000 from Table 5.1 in Greene (2003) (see exercise 1) to estimate the same regression as in example 1 by numerically maximizing the log-likelihood. The dependent variable is the log of real investment. The explanatory variables are the log of real output, the nominal interest rate and the rate of inflation. Details can be found in the file investment-ml.xls. The estimated ML coefficients are almost equal to the OLS estimates, and depend on the settings which trigger the convergence of the numerical optimization algorithm. The standard errors are based on the outer product of the gradients, and are slightly different from those based on the inverse of X'X. Accordingly, the z-statistics differ from the t-statistics. The interest rate turns out to be the only regressor which is not statistically significant at the 5% level.
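Numerical ML estimation of a regression along these lines can be sketched with scipy.optimize: the log-likelihood (15) is coded directly and minimized (as a negative log-likelihood) with respect to β and ln σ. The data below are simulated, not the investment series from the example.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(9)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.2, 1.0, -0.5]) + rng.normal(scale=0.7, size=n)

def neg_loglik(params, X, y):
    """Negative normal log-likelihood; the last parameter is log(sigma)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    e = y - X @ beta
    n_obs = len(y)
    ll = (-0.5 * n_obs * np.log(2 * np.pi)
          - 0.5 * n_obs * np.log(sigma2)
          - (e @ e) / (2 * sigma2))
    return -ll

start = np.r_[np.zeros(X.shape[1]), 0.0]
res = optimize.minimize(neg_loglik, start, args=(X, y), method="BFGS")
print(res.x[:-1])            # ML coefficients (close to OLS)
print(np.exp(res.x[-1]))     # ML sigma; the implied variance is scaled by n, not n-K
```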
Exercise 5: Use the annual data and the regression equation from example 1 (see file investment.xls) and estimate the model by maximum likelihood.
1.5 LM, LR and Wald tests26
Suppose the ML estimate θ̂ (a K×1 parameter vector) shall be used to test m linear restrictions H0: δ = Rθ. Three test principles can be used for that purpose.

The Wald test is based on unrestricted estimates. If the restrictions are valid, d = Rθ̂ will not deviate significantly from δ. The Wald test statistic for m restrictions is defined as

W = (d − δ)'(V[d])⁻¹(d − δ).

The covariance of d can be estimated by RV̂[θ̂]R', where V̂[θ̂] can be based on the inverse of the information matrix. Using V̂[θ̂] = Î(θ̂)⁻¹ we obtain

W = (d − δ)'[RÎ(θ̂)⁻¹R']⁻¹(d − δ) ~a χ²(m).    (16)
The likelihood-ratio (LR) test requires estimating the model with and without restrictions. The LR test statistic is

2[ℓu − ℓr] ~a χ²(m),    (17)

where ℓu is the unrestricted log-likelihood, and ℓr the log-likelihood obtained by imposing m restrictions. If the restrictions are valid, the difference between ℓr and ℓu will be close to zero.

If parameters are estimated by OLS, the LR test statistic can be computed using the residuals eu and er from unrestricted and restricted OLS regressions, respectively. For m restrictions the LR test statistic is given by

LR = n[ln(er'er) − ln(eu'eu)]        LR ~a χ²(m).
The Lagrange multiplier (LM) test (or score test) is based on maximizing the log-likelihood under the restrictions using the Lagrangian function

L(θr) = ℓ(θr) + λ'(Rθr − δ).

The estimates θ̂r and λ̂ can be obtained from the first order conditions

∂L/∂θr:    ∂ℓ/∂θr + λ'R = 0

∂L/∂λ:     Rθr − δ = 0.

Lagrange multipliers measure the improvement in the likelihood which can be obtained by relaxing constraints. If the restrictions are valid (i.e. hold in the data), imposing them is not necessary, and λ̂ will not differ significantly from zero. This consideration leads to

26 This section is based on Greene (2000), p.150 and Greene (2003), section 17.5.
H0: λ = 0 (hence LM test). This is equivalent to testing whether the derivatives evaluated at the restricted estimates θ̂r are zero:

gr = ∂ℓ(θ̂r)/∂θr = −λ̂'R.

Under H0: gr = 0 the LM test statistic is given by

gr'·Î(θ̂r)⁻¹·gr ~a χ²(m).    (18)

Î(θ̂r) = Gr'Gr, where Gr is an n×K matrix with rows equal to the transpose of the gradients for each observation evaluated at the restricted parameters.

Alternatively, the Lagrange multiplier (LM) test statistic can be derived from a regression of the restricted residuals er on all regressors including the constant (see Greene (2003), p.496). This version of LM is defined as

LM = n·Re²        LM ~a χ²(m),

where Re² is the coefficient of determination from that auxiliary regression.
Wald, LR and LM tests of linear restrictions in multiple regressions are asymptotically equivalent. Depending on how the information matrix is estimated, the test statistics and the associated conclusions may differ. In small samples the χ² distribution may lead to too many rejections of the true H0. Alternatively, the t-statistic (for a single restriction) or the F-statistic can be used.
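The OLS-based LR and LM statistics can be computed directly from restricted and unrestricted residuals, as in the following sketch for the restrictions β0 = 0 and β1 = 1 in a simple regression (simulated data in which the restrictions hold; the uncentered R² of the auxiliary regression is used for LM, which is nearly identical to the centered version when a constant is included):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 150
x = rng.normal(size=n)
y = 0.0 + 1.0 * x + rng.normal(size=n)        # H0 holds in the simulated data
X = np.column_stack([np.ones(n), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
e_u = y - X @ b                               # unrestricted residuals
e_r = y - (0.0 + 1.0 * x)                     # residuals with beta0=0, beta1=1 imposed

m = 2
LR = n * (np.log(e_r @ e_r) - np.log(e_u @ e_u))

# LM: regress restricted residuals on all regressors, LM = n * R^2
b_aux = np.linalg.solve(X.T @ X, X.T @ e_r)
e_aux = e_r - X @ b_aux
R2_aux = 1 - (e_aux @ e_aux) / (e_r @ e_r)    # uncentered R^2 of auxiliary regression
LM = n * R2_aux

print(LR, 1 - stats.chi2.cdf(LR, m))
print(LM, 1 - stats.chi2.cdf(LM, m))
```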
Example 9: We use the data and results from example 7 to test the restrictions β0 = 0 and β1 = 1 using OLS based LR and LM tests. Using the residuals from the unrestricted and restricted regressions we obtain LR = 3.904 with a p-value of 0.142. Regressing the residuals from the restricted model on X we obtain LM = 0.402 with a p-value of 0.818. Details can be found in the file uirp.xls.
Example 10: We use the same data to estimate the Fama regression numerically by ML. The coefficients are virtually identical to the OLS estimates, while the standard errors (derived from the inverse of the information matrix) differ slightly from the OLS standard errors. To test the restrictions β0 = 0 and β1 = 1 we use ML based Wald, LR and LM tests. All three tests agree that these restrictions cannot be rejected, with p-values ranging from 0.11 to 0.17. Details can be found in the file uirp-ml.xls.
Exercise 6: Use the data from exercise 4 (i.e. US dollar/British pound exchange and forward rates) to test the restrictions β0 = 0 and β1 = 1 using OLS and ML based Wald, LR and LM tests.
1.6 Specifications
The specification of the regression equation has key importance for a successful application of regression analysis (in addition to a careful definition and selection of variables). The linearity assumption AL may appear to be a very strong restriction. However, y and X can be arbitrary functions of the underlying variables of interest. Thus, as we will show in this section, there exist several linear formulations to model a variety of practically relevant and interesting cases.
1.6.1 Log and other transformations

The log-linear model27

ln y = ln b0 + b1·ln x1 + ... + bk·ln xk + e = ŷln + e

corresponds to the multiplicative expression

y = b0·x1^b1 ··· xk^bk·exp{e}.

In this model the coefficient bi measures the elasticity of ŷ with respect to xi: a one percent change in xi implies a c.p. change in ŷ of about bi percent. se is the standard error of the residuals from the log-linear model. Note that these errors are given by

ei = ln yi − ŷiln,

where ŷiln is the fitted value of ln yi. ei is not equal to ln yi − ln ŷi because of Jensen's inequality ln E[y] > E[ln y] (see review 1). The standard error of residuals se is an approximation for the magnitude of the percentage error (yi − ŷi)/ŷi.

27 In section 1.6 we will formulate regression models in terms of estimated parameters since these are usually used for interpretations.
In the semi-log model
ln y = b0 + b1x1 + + bk xk + e
27 Insection 1.6 we will formulate regression models in terms of estimated parameters since these are
usually used for interpretations.
1.6 Specications 34
the expected c.p. percentage change in y^ is given by bi100, if xi changes by one unit. More
accurately, y^ changes by (expfbig 1)100 percent. This model is appropriate when the
growth rate of y is assumed to be a linear function of the regressors. The chosen specica-
tion will mainly be driven by assumptions about the nature of the underlying relationships.
However, taking logs is frequently also used to reduce or eliminate heteroscedasticity.
Another version of a semi-log model is
y = b0 + b1 ln x1 + + bk ln xk + e:
Here, a one percent change in xi yields a c.p. change in y^ of 0.01bi units.
The logistic model
ln 1 y y = b0 + b1x1 + + bk xk + e 0<y<1
implies that y^ is s-shaped according to:
y^ =
expfb0 + b1x1 + + bk xk g = 1
1 + expfb0 + b1x1 + + bk xk g 1 + expf (b0 + b1x1 + + bk xk )g :
1.6.2 Dummy variables

Explanatory variables which are measured on a nominal scale (i.e. the variables are quali-
tative in nature) can be used in regressions after they have been recoded. A binary valued
(0-1) dummy variable is defined for each category except one, which constitutes the refer-
ence category. Suppose there are m+1 categories (e.g. industries or regions). We define
m dummy variables di (di=1 if an observation belongs to category i and 0 otherwise).
Note that defining a dummy for each category leads to an exact linear relationship among
the regressors. If the model contains an intercept the sum of all dummies is equal to the
first column of X, and X will not have full rank. The coefficients δi in the regression
model

ŷ = b0 + b1 x1 + ⋯ + δ1 d1 + ⋯ + δm dm

correspond to parallel shifts of the regression line (hyperplane). δi represents the change
in ŷ for a c.p. shift from the reference category to category i.

If categories have a natural ordering, an alternative definition of dummy variables may be
appropriate. In this case all dummy variables d1,…,dj are set equal to 1 if an observation
belongs to category j. Now δj represents the expected change in ŷ for a c.p. shift from
category j−1 to category j.
1.6.3 Interactions

Dummy variables alone cannot be used to model changes in the slope (e.g. differences in
the propensity to consume between men and women). If the slope is assumed to differ
among categories the following specification can be used:

ŷ = b0 + b1 x1 + b2 d + b3 d·x1.

The product d·x1 is an interaction term. If d=0 this specification implies ŷ=b0+b1x1,
and if d=1 it implies ŷ=(b0+b2)+(b1+b3)x1. Thus, the coefficient b3 measures the expected
c.p. change in the slope of x1 when switching categories.

It is important to note that the presence of an interaction term changes the 'usual' inter-
pretation of the coefficients associated with the components of the interaction. First, the
coefficient b1 of x1 must be interpreted as the slope of the reference category (for which
d=0). Second, the coefficient b2 of the dummy variable is no longer the expected c.p.
difference between the two categories (except for x1=0). Now the difference depends on
the level of x1. Even if x1 is held constant, the difference in ŷ when changing from d=0 to
d=1 is given by b2+b3x1.
Interactions are not confined to dummy variables but can also be based on two 'regular'
regressors. The equation

ŷ = b0 + b1 x1 + b2 x2 + b3 x1 x2

implies a change in the slope of x1 that depends on the level of x2 and vice versa. To
simplify the interpretation of the coefficients it is useful to evaluate ŷ for typical values of
one of the two variables (e.g. using x̄2 and x̄2 ± s_x2).
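The following sketch (simulated data, illustrative coefficient values) shows how such an evaluation at typical values of x2 can be carried out; it produces the kind of intercept/slope table used in example 11 below:

# Sketch: evaluate a fitted interaction model y^ = b0 + b1*x1 + b2*x2 + b3*x1*x2
# at typical values of x2 (mean and mean +/- one standard deviation).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(3, 1, n)
x2 = rng.normal(2000, 500, n)
y = 10 - 40 * x1 + 0.03 * x2 + 0.025 * x1 * x2 + rng.normal(0, 20, n)

X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
res = sm.OLS(y, X).fit()
b0, b1, b2, b3 = res.params

for x2_val in (x2.mean() - x2.std(), x2.mean(), x2.mean() + x2.std()):
    intercept = b0 + b2 * x2_val          # intercept implied by holding x2 fixed
    slope_x1 = b1 + b3 * x2_val           # slope of x1 at this level of x2
    print(f"x2 = {x2_val:8.1f}:  y^ = {intercept:8.2f} + {slope_x1:6.2f} * x1")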
If interactions are defined using logs of variables, such as in the following so-called translog
model

ln y = ln b0 + b1 ln x1 + b2 ln x2 + b3 ln x1 ln x2 + e,

the conditional expectation of y is given by

ŷ = b0 x1^(b1 + b3 ln x2) x2^b2 exp{0.5 s_e²}.

This implies that a c.p. change of x2 by p percent leads to an expected change of the
elasticity b1 by p·b3. However, if ŷ is defined as

ŷ = b0 x1^(b1 + b3 x2) x2^b2 exp{0.5 s_e²},

the elasticity of x1 changes by b3 for each one-unit c.p. change in x2.
1.6.4 Difference-in-differences

A frequently used application of dummy variables and interactions is the difference-in-
differences (diff-in-diff) approach, based on the specification

ŷ = b0 + b1 T + b2 d + b3 T·d,

where T denotes a time-dummy (i.e. being 0 before and 1 after the event), and d is the
dummy distinguishing the treatment (d=1) and the control (d=0) group. Note that b0 is
the average of y for T=0 and d=0 (i.e. the control group before the event), b1 estimates the
average change in y over the two time periods for d=0, and b2 estimates the average differ-
ence between treatment and control for T=0. b3 is the estimate which is of primary interest
in such studies.

Note that this simple formulation is only appropriate if no other regressors need to
be accounted for. If this is not the case, the model has to be extended as follows:

ŷ = b0 + b1 T + b2 d + b3 T·d + Xb.

As soon as the term Xb is included in the specification, the coefficient b3 is still the main
object of interest; however, it is no longer a difference of sample averages, but has the
corresponding ceteris paribus interpretation.
The diff-in-diff approach can be used to account for a so-called selection bias. For example,
when assessing the effect of an MBA on salaries, people who choose to do an MBA may
already have higher salaries than those who do not. Thus, the assignment to treatment
and control group is not random but depends on (existing or expected) salaries. This
problem of so-called self-selection arises whenever subjects enter the treatment sample
for reasons which are related to the dependent variable.

The appropriateness of the difference-in-differences approach rests on the parallel-trends
assumption. Absent the effect under study, the dependent variable of the two groups
must not have different "trends" (i.e. must not have differing slopes with respect to time).
If this assumption is violated, the effect is over- or underestimated (because of diverging
or converging trends), and the difference in trends is falsely attributed to the treatment. In
the MBA-salary example this assumption is violated when the salaries of people who choose
to do an MBA already increase more quickly than the salaries of those who do not.

Note that the interaction term T·d already accounts for different slopes with respect to
time. Therefore, it is impossible to separate the effect under study from possibly different
trends of the two groups which have nothing to do with that effect.
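A minimal sketch of a difference-in-differences estimation is given below (simulated data; the true effect and all names are hypothetical). It also shows that, without further regressors, b3 coincides with the double difference of group means:

# Sketch of a difference-in-differences regression y = b0 + b1*T + b2*d + b3*T*d + eps.
# T is the time dummy (0 before, 1 after the event), d the treatment dummy.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
T = rng.integers(0, 2, n)            # before/after indicator
d = rng.integers(0, 2, n)            # treatment/control indicator
effect = 1.5                         # true treatment effect
y = 2.0 + 0.5 * T + 0.8 * d + effect * T * d + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([T, d, T * d]))
res = sm.OLS(y, X).fit()
print(res.params)    # the last coefficient estimates the difference-in-differences effect

# The same number as a double difference of group means:
did = (y[(T == 1) & (d == 1)].mean() - y[(T == 0) & (d == 1)].mean()) \
    - (y[(T == 1) & (d == 0)].mean() - y[(T == 0) & (d == 0)].mean())
print(did)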
1.6.5 Example 11: Hedonic price functions

Hedonic price functions are used to define the implicit price of key attributes of goods
as revealed by their sales price. We use a subset of a dataset used in Wooldridge (2003,
p.194)[28] consisting of the price of houses (y), the number of bedrooms (x1), the size
measured in square feet (x2) and a dummy variable to indicate the style of the house (x3)
(see hedonic.wf1). We estimate the regression equation

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x1 x2 + ε,

where the interaction term x1·x2 is used to model the importance of the number of bed-
rooms depending on the size of the house. The underlying hypothesis is that additional
bedrooms in large houses have a stronger effect on the price than in small houses (i.e. it
is expected that β4>0). The estimated equation is (p-values in parentheses)

ŷ = 199.45 − 45.51 x1 + 0.025 x2 + 20.072 x3 + 0.026 x1x2.
    (0.034)  (0.072)    (0.575)    (0.191)     (0.014)

The interaction term is significant and has the expected sign. To facilitate the model's
interpretation it is useful to evaluate the regression equation using typical values for one of
the variables in the interaction term. Mean and standard deviation of size (x2) are 2013.7
and 578.7. We can formulate equations for the expected price of small, average and large
houses as a function of style (x3) and the number of bedrooms (x1):

x2=1500:  ŷ = 236.48 + 20.072 x3 − 6.63 x1
x2=2000:  ŷ = 248.82 + 20.072 x3 + 6.33 x1
x2=2500:  ŷ = 261.16 + 20.072 x3 + 19.29 x1.

This shows that a regression equation with interactions can be viewed as a model with
varying intercept and slope, where this variation depends on one of the interaction vari-
ables. The first equation shows that additional bedrooms in small houses lead to a c.p.
drop in the expected price (probably because those bedrooms would be rather small and
thus unattractive). We find a positive effect of bedrooms in houses with (above) average
size.
Equation (22) expresses the variance of the estimated coefficient bj as

V[bj] = σ² / [(1 − Rj²) Σi (xij − x̄j)²],

where Rj² is the R² from a regression of regressor j on all other regressors (including
a constant term). This expression shows that, c.p., the variance of bj increases with
the correlation between variable xj and the other regressors. This fact is also known as the
multicollinearity problem, which becomes relevant if Rj² is very close to one.

Suppose that the correct model is given by y=β0+β1x1+ε but the irrelevant variable x2
(i.e. β2=0) is added to the estimated regression equation. Denote the estimate of β1 from
the overfitted model by b̃1. The variance of b1 (from the correct regression) is given by
(6), p.11, whereas V[b̃1] is given by (22). Thus, unless x1 and x2 are uncorrelated in the
sample, V[b̃1] will be larger than necessary (i.e. larger than V[b1]).

Exact multicollinearity holds when there are exact linear relationships among some re-
gressors (i.e. X does not have full rank). This can easily be corrected by eliminating
redundant regressors (e.g. superfluous dummies). Typical signs of strong (but not exact)
multicollinearity are wrong signs or implausible magnitudes of coefficients, as well as a
strong sensitivity to changes in the sample (dropping or adding observations). The in-
flating effect on the standard errors of coefficients may lead to cases where several coefficients
are individually insignificant, but eliminating them (jointly) from the model leads to a
significant drop in R² (based on an F-test).
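Rj² and the implied variance inflation factor 1/(1−Rj²) can be computed directly from auxiliary regressions, as in the following sketch (simulated data; two of the regressors are made almost collinear to show the effect):

# Sketch: compute R_j^2 and the variance inflation factor 1/(1-R_j^2) for each
# regressor by regressing it on all remaining regressors (plus a constant).
# X is an (n x k) array of regressors without the constant column.
import numpy as np
import statsmodels.api as sm

def variance_inflation_factors(X):
    X = np.asarray(X)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2_j = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)     # nearly collinear with x1
x3 = rng.normal(size=500)
print(variance_inflation_factors(np.column_stack([x1, x2, x3])))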
32 Exception: the mean of x2 is zero.
33 For an introduction to the principles of panel data analysis, see Wooldridge (2003), chapters 13 and 14.
The consequences of including irrelevant regressors (inefficiency) have to be compared to
the consequences of omitting relevant regressors (bias and inconsistency). We hesitate to
formulate a general recommendation, but it is worthwhile asking: "What is the point of
estimating a parameter more precisely if it is potentially biased?"
1.6.8 Selection of regressors

The search for a correct specification of a regression model is usually difficult. The selec-
tion procedure can either start from a model with one or only a few explanatory variables,
and subsequently add variables to the equation (the specific to general approach). Alterna-
tively, one can start with a large model and subsequently eliminate insignificant variables.
The second approach (general to specific) is preferable, since the omission of relevant vari-
ables has more drawbacks than the inclusion of irrelevant variables. In any case, it is
strongly recommended to select regressors on the basis of a sound theory or a thorough
investigation of the subject matter. A good deal of common sense is always useful.
The following guidelines can be used in the model selection process:

1. The selection of variables must not be based on simple correlations between the
dependent variable and preselected regressors. Because of the potential bias associ-
ated with omitted variables any selection should be done in the context of estimating
multiple regressions.

2. If the p-value of a coefficient is above the significance level this indicates that the
associated variable can be eliminated. If several coefficients are insignificant one can
start by eliminating the variable with the largest p-value and re-estimate the model.

3. If the p-value indicates elimination but the associated variable is considered to be of
key importance theoretically, the variable should be kept in the model (in particular
if the p-value is not far above the significance level). A failure to find significant
coefficients may be due to insufficient data or a random sample effect (bad luck).

4. Statistical significance alone is not sufficient. There should be a very good reason
for a variable to be included in a model and its coefficient should have the expected
sign.
5. Adding a regressor will always lead to an increase of R². Thus, R² is not a useful
selection criterion. A number of model selection criteria have been defined to fa-
cilitate the model choice in terms of a compromise between goodness of fit and the
principle of parsimony. The adjusted R²

R̄² = 1 − (n−1)/(n−K) · (1 − R²) = 1 − s_e²/s_y²     s_e² = e'e/(n−K)

is a criterion that can be used for model selection. Note that removing a variable
whose t-statistic is less than 1 leads to an increase of R̄² (R² always drops if a
regressor is removed!), and a decrease of the standard error of the regression (s_e). It
has been found, however, that R̄² puts too little penalty on the loss in degrees of
freedom. Alternative criteria are Akaike's information criterion

AIC = −2ℓ/n + 2K/n

and the Schwarz criterion[34]

SC = −2ℓ/n + K·ln n/n,

34 These are the definitions of AIC and SC used in EViews. Alternatively, the first term in the definition
of AIC and SC can be replaced by ln(e'e/n) = ln(s̃_e²).
where ℓ = −0.5·n·[1 + ln(2π) + ln(e'e/n)]. We finally note that model selection criteria
must never be used to compare models with different dependent variables (e.g. to
compare linear and log-linear models).
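The following sketch computes the adjusted R² and the AIC and SC as defined above from an OLS fit on simulated data. Note that statsmodels reports −2ℓ+2K and −2ℓ+K·ln n, i.e. n times the values defined here:

# Sketch: adjusted R^2 and the (EViews-style) AIC and SC defined above,
# computed from an OLS fit on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, k = 200, 3
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()
K = X.shape[1]                                  # number of parameters incl. constant
e = res.resid
loglik = -0.5 * n * (1 + np.log(2 * np.pi) + np.log(e @ e / n))

adj_r2 = 1 - (n - 1) / (n - K) * (1 - res.rsquared)
aic = -2 * loglik / n + 2 * K / n
sc = -2 * loglik / n + K * np.log(n) / n
print(adj_r2, aic, sc)      # res.aic and res.bic equal n times aic and sc defined here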
Exercise 7: Consider the data on the salary of 208 employees in the file
salary.wf1. Estimate and choose a regression model for salary using avail-
able information such as gender, education level, experience and others. Note
that EDUC is a categorical variable measuring the education level in terms
of degrees obtained (1=finished high school, 2=finished some college courses,
3=obtained a bachelor's degree, 4=took some graduate courses, 5=obtained a
graduate degree). Use model formulations which allow you to test for gender-
specific payment behavior.
1.7 Regression diagnostics

If OLS residuals are not normally distributed, OLS estimates are still unbiased and consis-
tent, but not efficient (see Kmenta (1971), p.248). There exist other estimators with
greater accuracy (of course, only if the correct (or a more suitable) distribution is used
in those estimators). In addition, the t-statistics for significance testing are not appropri-
ate. However, this is only relevant in small samples, and when the deviation from the normal
distribution is 'strong'. A failure to obtain normal residuals in a regression may indicate
missing regressors and/or other specification problems (although the specific kind of prob-
lem cannot be easily inferred). At any rate, normality of the dependent variable is not a
requirement of OLS (as can be derived from sections 1.2.1 and 1.2.2).
Example 13: We use the data from example 8 and estimate the regression equation by
OLS. Details can be found in the file investment quarterly.wf1. The distribution of
residuals is positively skewed (0.25). This indicates an asymmetric distribution whose
right tail is somewhat longer than the left one. The kurtosis is far greater than three
(5.08), which indicates fatter tails and a stronger concentration around the mean than
the normal distribution. JB is 38.9 with a p-value of zero. This clearly indicates that we
can reject H0 and we conclude that the residuals are not normally distributed.
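The skewness, kurtosis and JB statistic reported in this example can be computed as in the following sketch. The residual series here is simulated (a de-meaned Gamma sample) purely to produce a skewed, fat-tailed distribution, and JB = n/6·[S² + (K−3)²/4] is the usual Jarque-Bera form assumed here:

# Sketch: skewness, kurtosis and the Jarque-Bera statistic of a residual series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
resid = rng.gamma(shape=4, scale=1, size=500)
resid = resid - resid.mean()

n = len(resid)
s = resid.std(ddof=0)
skew = np.mean(resid**3) / s**3
kurt = np.mean(resid**4) / s**4
jb = n / 6 * (skew**2 + (kurt - 3)**2 / 4)
p_value = stats.chi2.sf(jb, df=2)
print(skew, kurt, jb, p_value)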
1.7.2 Heteroscedasticity

Heteroscedasticity means that the variance of the disturbances is not constant across ob-
servations

V[εi] = σi² = σ²ωi²     ∀i,

and thus violates assumption AH. To analyze the implications of heteroscedasticity we
assume that the covariance matrix is diagonal

E[εε'] = σ²Ω = diag(σ1², σ2², …, σn²).     (23)

If the variance of ε is not given by σ²I but σ²Ω, the model is a so-called generalized
least squares (GLS) model.

It can be shown that the finite sample properties of the OLS estimator are not affected if
only AH is violated (see Greene (2003), section 10.2.1). However, the covariance of b is
not given by (5), p.11, but by

V[b] = (X'X)^{-1} X'(σ²Ω)X (X'X)^{-1}.     (24)
Provided that X'ΩX/n converges to a positive definite matrix, it can be shown that
in the presence of heteroscedasticity the OLS estimator b is unbiased, consistent and
asymptotically normal (see Greene (2003), section 10.2.2):

b ∼a N(β, (σ²/n) Q^{-1} Q* Q^{-1}),     (25)

where Q is defined in (11), p.23, and

Q* = plim (1/n) X'ΩX.

However, the OLS estimator b is inefficient since it does not use all the information
available in the sample. The estimated standard errors of b are biased and the associated
t- and F-statistics are incorrect. For instance, if σi² and a regressor xj are positively
correlated, the bias in the standard error of bj is negative (see Kmenta (1971), p.256).
Depending on the correlation between the heteroscedasticity and the regressors (and their
cross-products) the consequences may be substantial (see Greene (2000), p.502-505).
The Breusch-Pagan test for heteroscedasticity is based on an auxiliary regression of ei²
on a constant and the regressors. Under the null of homoscedasticity we can use the R²
from this regression to compute the test statistic n·R² ∼a χ²_k (k is the number of regressors
excluding the constant). The White test for heteroscedasticity is based on regressing ei²
on a constant, the regressors and their squares. In a more general version of the test
the cross products of the regressors may be added, too. Under the null of homoscedasticity
the test statistic is also n·R² ∼a χ²_k, where k is the number of regressors in the auxiliary
regression excluding the constant. The advantage of the White test is that no assumptions
about the type of heteroscedasticity are required. On the other hand, rejecting H0 need
not be due to heteroscedasticity but may indicate other specification errors (e.g. omitted
variables).
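A sketch of the White test based on its auxiliary regression is given below (simulated data with variance increasing in x1; adding the cross product x1·x2 to Z would give the more general version):

# Sketch of the White test: regress squared OLS residuals on the regressors and
# their squares; n*R^2 from this auxiliary regression is chi-squared under H0.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 300
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
eps = rng.normal(0, 0.2 * x1, n)              # variance increases with x1
y = 1 + 0.5 * x1 + 0.3 * x2 + eps

X = sm.add_constant(np.column_stack([x1, x2]))
e = sm.OLS(y, X).fit().resid

Z = sm.add_constant(np.column_stack([x1, x2, x1**2, x2**2]))
aux = sm.OLS(e**2, Z).fit()
k = Z.shape[1] - 1
white = n * aux.rsquared
print(white, stats.chi2.sf(white, df=k))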
In section 1.8 we will present estimators that make use of some knowledge about Ω. If no
such information is available the OLS estimator may still be retained. However, to improve
statistical inference about the coefficients, the estimated standard errors can be corrected
using the White heteroscedasticity consistent (WHC) estimator

aV̂[b] = n/(n−K) · (X'X)^{-1} (Σ_{i=1}^n ei² xi xi') (X'X)^{-1}.     (26)
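The WHC covariance (26) can be computed directly and compared with the heteroscedasticity-robust option in statsmodels, whose 'HC1' variant uses the same n/(n−K) correction; the data below are simulated:

# Sketch: White heteroscedasticity-consistent covariance (26) computed directly
# and compared with the robust covariance offered by statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(1, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 0.2 * x, n)
X = sm.add_constant(x)

res = sm.OLS(y, X).fit()
e = res.resid
K = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None]**2).T @ X                 # sum of e_i^2 * x_i x_i'
V_whc = n / (n - K) * XtX_inv @ meat @ XtX_inv
print(np.sqrt(np.diag(V_whc)))
print(res.get_robustcov_results(cov_type='HC1').bse)   # should match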
Example 14: We use a dataset[36] on hourly earnings (y), employment duration (x1)
and years of schooling (x2) (n=49) (see earnings.wf1). A plot of the residuals from
the estimated regression (t-statistics in parentheses)

ln y = 1.22 + 0.027 x1 + 0.126 x2 + e
       (6.1)   (4.4)      (3.6)

against x1 shows a strong increase in the variance of e. The White test statistic 23.7,
based on the regression (p-values in parentheses)

e² = 0.1 − 0.022 x1 + 0.0009 x1² + 0.12 x2 − 0.019 x2² + u     R² = 0.484,
     (0.5)  (0.15)     (0.008)      (0.09)    (0.08)

is highly significant (the p-value is very close to zero) and we can firmly reject the
homoscedasticity assumption. The t-statistics based on the WHC standard errors are
9.7, 4.0 and 4.9, respectively. Thus, in this example, the conclusions regarding the
significance of coefficients are not affected by the heteroscedasticity of the residuals.
Exercise 8: Test the residuals from the models estimated in exercise 7 for
non-normality and heteroscedasticity.
1.7.3 Autocorrelation

The sample autocorrelation of the residuals at lag ℓ is denoted by r_ℓ. Tests for the
significance of individual autocorrelations can be based on

r_ℓ ∼a N(−1/n, 1/n).

These asymptotic properties of r_ℓ hold if the disturbances are uncorrelated (see Chatfield
(1989), p.51). The Ljung-Box Q-statistic

Q_p = n(n+2) Σ_{ℓ=1}^p r_ℓ²/(n−ℓ)     Q_p ∼a χ²_p

can be used as a joint test for all autocorrelations up to lag p. The Durbin-Watson test
DW ≈ 2(1−r_1) has a long tradition in econometrics. However, it only takes the autocorre-
lation at lag 1 into account and has other conceptual problems; e.g. it is not appropriate
if the lagged dependent variable is used as a regressor (see Greene (2003), p.270).

The Breusch-Godfrey test is based on an auxiliary regression of e_t on p lagged residuals
and the original regressors. Under the null of no autocorrelation we can use the R² from
this regression to compute the test statistic n·R² ∼a χ²_p.
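A sketch of the Breusch-Godfrey test via its auxiliary regression (simulated AR(1) disturbances; missing initial lagged residuals are set to zero, which is a common convention):

# Sketch of the Breusch-Godfrey test: regress the OLS residuals on p of their own
# lags and the original regressors; n*R^2 is chi-squared(p) under H0.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(8)
n, p = 300, 2
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                      # AR(1) disturbances
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1 + 0.8 * x + u

X = sm.add_constant(x)
e = sm.OLS(y, X).fit().resid

lags = np.column_stack([np.concatenate([np.zeros(j), e[:-j]]) for j in range(1, p + 1)])
aux = sm.OLS(e, np.column_stack([X, lags])).fit()
bg = n * aux.rsquared
print(bg, stats.chi2.sf(bg, df=p))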
Similar to the WHC estimator, the Newey-West (HAC) estimator can be used to account
for residual autocorrelation without changing the model specification. It is a covariance
estimator that is consistent in the presence of both heteroscedasticity and autocorrelation
(hence HAC) of unknown form. It is given by

aV̂[b] = (X'X)^{-1} Φ̂ (X'X)^{-1},

where

Φ̂ = n/(n−K) · [ Σ_{t=1}^n e_t² x_t x_t' + Σ_{j=1}^q w_j Σ_{t=j+1}^n (x_t e_t e_{t−j} x_{t−j}' + x_{t−j} e_{t−j} e_t x_t') ].

w_j = 1 − j/(q+1), and the truncation lag q determines how many autocorrelations are taken
into account. Newey and West (1987) suggest setting q = 4(n/100)^{2/9}.
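In practice this usually amounts to requesting a HAC covariance with the truncation lag from the rule of thumb above; a minimal sketch with simulated data:

# Sketch: OLS with Newey-West (HAC) standard errors; the truncation lag follows
# the rule of thumb q = 4*(n/100)^(2/9).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 400
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = 1 + 0.8 * x + u
X = sm.add_constant(x)

q = int(round(4 * (n / 100) ** (2 / 9)))
res_ols = sm.OLS(y, X).fit()
res_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': q})
print(res_ols.bse)     # conventional standard errors
print(res_hac.bse)     # HAC standard errors, typically larger with positive autocorrelation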
We now take a closer look at the implications of autocorrelated residuals and consider the
model[37]

y_t = βx_t + u_t     (28)
u_t = ρu_{t-1} + ε_t     |ρ| < 1,  ε_t i.i.d.

The first equation may be viewed as being incorrectly specified, as we are going to show
now. The second equation for the autocorrelated residuals u_t is a so-called first order
autoregression AR(1). Substituting u_t into the first equation (y_t = βx_t + ρu_{t-1} + ε_t) and
using u_{t-1} = y_{t-1} − βx_{t-1}, we obtain

y_t = ρy_{t-1} + βx_t − ρβx_{t-1} + ε_t.

This shows that the autocorrelation in u_t can be viewed as a result of missing lags in the
original equation. If we run a regression without using y_{t-1} and x_{t-1}, we have an omitted
variables problem. The coefficient of the incomplete regression y_t = βx_t + ν_t is given by

β̂ = Σ y_t x_t / Σ x_t² = M Σ y_t x_t     M = (Σ x_t²)^{-1}.
Substituting for y_t from the complete regression we obtain

E[β̂] = E[M Σ x_t(ρy_{t-1} + βx_t − ρβx_{t-1} + ε_t)]
      = ρE[M Σ x_t y_{t-1}] + βE[M Σ x_t²] − ρβE[M Σ x_t x_{t-1}] + E[M Σ x_t ε_t].

To simplify this relation it is useful to write the AR(1) equation as

u_t = ρ^t u_0 + Σ_{i=0}^{t-1} ρ^i ε_{t-i},

which implies that equation (28) can be written as (ρ^t u_0 vanishes for large t, since |ρ|<1)

y_t = βx_t + Σ_{i=0}^{t-1} ρ^i ε_{t-i}.

We also make use of E[x_t ε_{t-i}]=0 (∀ i≥0), and note that the autocorrelation of x_t is given
by ρ_x = M Σ x_t x_{t-1}. We find that β̂ is unbiased since its expectation is given by

E[β̂] = ρE[M Σ x_t(βx_{t-1} + Σ ρ^i ε_{t-1-i})] + β − ρβρ_x
      = ρβρ_x + β − ρβρ_x = β.

Thus, despite the incomplete regression and the presence of autocorrelated residuals, we
obtain unbiased estimates.

37 For the sake of simplicity we consider only a single regressor x_t, and assume that y_t and x_t have mean
zero.
We now add a lagged dependent variable to equation (28). From section 1.2 we know that a
lagged dependent variable leads to biased estimates. However, the estimates are consistent
provided that assumptions AX and ARt hold. We now investigate what happens if the
disturbances are autocorrelated. We consider the model

y_t = γy_{t-1} + βx_t + u_t
u_t = ρu_{t-1} + ε_t     |ρ| < 1,  ε_t i.i.d.,

which can be written as

y_t = (γ+ρ)y_{t-1} − γρy_{t-2} + βx_t − ρβx_{t-1} + ε_t.

Suppose we run a regression without using y_{t-2} and x_{t-1}. From the omitted variable for-
mula (20) we know that the resulting bias depends on the coefficients of the omitted regressors
(β2 = [−γρ  −ρβ]' in the present case), and the matrix of coefficients from regressing y_{t-2}
and x_{t-1} on the included regressors. This matrix will be proportional to the following matrix
(i.e. we ignore the inverse of the matrix associated with the included regressors):

[ Σ y_{t-1} y_{t-2}   Σ y_{t-1} x_{t-1} ]
[ Σ x_t y_{t-2}       Σ x_t x_{t-1}   ].

The elements in the first row will be non-zero (if γ≠0 and ρ≠0), and thus the estimated
coefficient of y_{t-1} is biased. It is more difficult to say something general about the first
element in the second row, but autocorrelation in x_t leads to a bias in β̂. Greene (2003,
p.266) considers the simplified case β=0, and states that the probability limit of the
estimated coefficient of a regression of y_t on y_{t-1} alone is given by (γ+ρ)/(1+γρ).
Example 15: We consider the data set analyzed by Coen et al. (1969) who formulate
a regression for the Financial Times index (y_t) using the UK car production index
(p_t) lagged by six quarters and the Financial Times commodity index (c_t) lagged by
seven quarters as regressors. Details can be found in the file coen.wf1. These lags
were found by "graphing the series on transparencies and then superimposing them
(p.136)". The estimated equation is (all p-values are very close to zero; we report
t-statistics below the coefficients for later comparisons)

y_t = 653 + 0.47 p_{t-6} − 6.13 c_{t-7} + e_t.     (29)
     (11.6)  (14.1)       (9.9)

The Coen et al. study has raised considerable debate (see the discussion in their paper
and in Granger and Newbold, 1971, 1974) because the properties of the residuals had
not been thoroughly tested. As it turns out DW=0.98, and the Breusch-Godfrey test
statistic using p=1 is 12.4 with a p-value below 0.001. This is evidence of considerable
autocorrelation. In fact, using Newey-West HAC standard errors, the t-statistics are
reduced to 11.4 and 7.6, respectively.

Stock prices or indices are frequently claimed to follow a random walk (see section
2.3) y_t = y_{t-1} + ε_t (ε_t i.i.d.). Thus we add the lagged dependent variable y_{t-1} to Coen
et al.'s equation and find

y_t = 276 + 0.661 y_{t-1} + 0.127 p_{t-6} − 2.59 c_{t-7} + e_t.     (30)
     (4.1)  (6.9)          (2.3)          (3.8)

The residuals in this equation are not autocorrelated, which indicates that the estimates
are consistent (to the extent that no regressors have been omitted). The coefficients
and the t-statistics of p_{t-6} and c_{t-7} are considerably lower than before. It is not
straightforward to test whether the coefficient of y_{t-1} is equal to one, for reasons
explained in section 2.3.3. In sum, our results raise some doubt about the highly
significant lagged relationships found by Coen et al.
Example 16: We briefly return to example 7 where we have considered tests of the
UIRP based on one-month forward rates and monthly data. Since forward rates are
also available for other maturities, this provides further opportunities to test the UIRP.
We use the three-month forward rate F3_t for which the UIRP implies

E_t[ln S_{t+3}] = ln F3_t.

This can be tested by running the regression

s_t − s_{t-3} = β0 + β1 (f3_{t-3} − s_{t-3}) + ε_t.

The estimated regression is[38]

s_t − s_{t-3} = 0.01 + 0.994 (f3_{t-3} − s_{t-3}) + e_t     R² = 0.0123.
              (1.76)  (1.86)

38 See file uirp.wf1 for details.

Before we draw any conclusions from this regression it is important to note that the
observation frequency need not (and in the present case does not) conform to the
maturity. s_t − s_{t-3} is a three-month return (i.e. the sum of three consecutive monthly
returns). This introduces autocorrelation in three-month returns even though the
monthly returns are not serially correlated (similar to section 1.8.3). This is known
as the overlapping samples problem. In fact, the residual autocorrelations at lags
1 and 2 are highly significant (and positive), and the p-value of the Breusch-Godfrey
test is zero. Thus, the standard errors cannot be used since they will most likely be
biased (the bias will be negative since the autocorrelations are positive).
One way to overcome this problem is to use quarterly data (i.e. to use only every
third monthly observation). However, this leads to a substantial loss of information,
and reduces the power of the tests. Alternatively, we can use Newey-West standard
errors to find t-statistics of b0 and b1 equal to 1.29 and 1.21, which are much lower, as
expected. Whereas a Wald test based on the usual standard errors has a p-value of
about 0.017 (which implies a rejection of the UIRP), the p-value of a Wald test based
on Newey-West standard errors is 0.19.
Exercise 9: Use the data in the file coen.wf1 to estimate and test alterna-
tive models for the Financial Times index. Make use of additional regressors
available in that file.

Exercise 10: Use the US dollar/British pound exchange rate and the three-
month forward rate in the file uirp.wf1 to test the UIRP.
1.8 Generalized least squares

We now consider alternative estimators to overcome the inefficiency of OLS estimates
associated with features of the disturbances (i.e. violations of assumption AH) introduced
in sections 1.7.2 and 1.7.3. In general the matrix Ω in (23) or (27) is unknown. If its
structure is known, or assumptions are made about its structure, it is possible to derive
alternative estimators.

1.8.1 Heteroscedasticity

We first consider the problem of heteroscedasticity and suppose that the variance of the
disturbances is given by

V[εi] = σi² = σ²ωi²     ∀i.

In the method of weighted least squares (WLS) the regression equation is multiplied
by a suitable weight w_i[39]

w_i y_i = β0 w_i + Σ_{j=1}^k βj w_i x_ij + w_i εi     i = 1, …, n.
Exercise 11: Consider the data from example 11 (see file hedonic.wf1). Es-
timate a model excluding the interaction term. Test the residuals from this
model for heteroscedasticity, and obtain WLS or FGLS estimates if required.

1.8.2 Autocorrelation

Autocorrelation is another case where assumption AH is violated. To overcome the asso-
ciated inefficiency of OLS we assume, for simplicity, that the autocorrelations in the co-
variance matrix (27) can be expressed in terms of the first order (lag one) autocorrelation
ρ1 only:

ρ_τ = ρ1^τ     τ = 1, …, n−1.
This is equivalent[40] to the model

y_t = β0 + β1 x_t1 + ⋯ + βk x_tk + u_t
u_t = ρ1 u_{t-1} + ε_t     ε_t i.i.d.

Alternatively, we can use the transformed variables y_t* = y_t − ρ1 y_{t-1} and x_tj* = x_tj − ρ1 x_{t-1,j}
(the so-called partial differences):

y_t* = β0* + β1 x_t1* + ⋯ + βk x_tk* + ε_t     (with β0* = β0(1−ρ1)).

ε_t is uncorrelated, and estimating the transformed equation by OLS gives efficient esti-
mates. ρ1 is unknown, but we can use FGLS if ρ1 is replaced by a consistent estimate.
Several options are available to estimate ρ1 consistently (e.g. Cochrane-Orcutt or Prais-
Winsten; see Greene (2003), section 12.9). The simplest is to use the first order autocor-
relation of the residuals from the original (consistent) regression.
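A single FGLS step of this kind (estimate ρ1 from the OLS residuals, form partial differences, re-estimate by OLS) can be sketched as follows; the data are simulated and the first observation is simply dropped rather than transformed à la Prais-Winsten:

# Sketch of a single Cochrane-Orcutt-style FGLS step with simulated AR(1) errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 400
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 2 + 1.5 * x + u

X = sm.add_constant(x)
e = sm.OLS(y, X).fit().resid

# First-order autocorrelation of the residuals as a consistent estimate of rho_1
rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Partial differences (the first observation is dropped)
y_star = y[1:] - rho * y[:-1]
x_star = x[1:] - rho * x[:-1]
res_fgls = sm.OLS(y_star, sm.add_constant(x_star)).fit()

# The intercept of the transformed equation estimates b0*(1-rho)
print(rho, res_fgls.params, res_fgls.params[0] / (1 - rho))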
We also note that autocorrelation of the disturbances may be viewed as the result of a mis-
specified equation (see section 1.7.3). In other words, the (original) equation has to be
modified to account for the dynamics of the variables and responses involved. According
to this view, GLS is not appropriate to resolve the inefficiency of OLS. A starting point
for the reformulation may be obtained from the equation using partial differences.
WLS and FGLS may be summarized in terms of the Cholesky decomposition of the inverse
of Ω:

Ω^{-1} = C'C     CΩC' = I.

The matrix C is used to transform the model such that the transformed disturbances are
homoscedastic and non-autocorrelated:

Cy = CXβ + Cε   ⟹   y* = X*β + ε*
E[Cε] = C E[ε] = 0
V[Cε] = C V[ε] C' = C σ²Ω C' = σ² CΩC' = σ²I.

If the transformed equation is estimated by OLS the GLS estimator is obtained:

b_gls = (X*'X*)^{-1} X*'y* = (X'C'CX)^{-1} X'C'Cy = (X'Ω^{-1}X)^{-1} X'Ω^{-1}y.
40 The autocorrelations of u_t from the model u_t = ρ1 u_{t-1} + ε_t can be shown to be given by ρ_τ = ρ1^τ (see
section 2.2).

Example 18: We consider the model (29) estimated in example 15. The estimated
residual autocorrelation at lag 1 is ρ̂1=0.5 and can be used to form partial differences
(e.g. y_t* = y_t − 0.5y_{t-1}). The FGLS estimates are given by (t-statistics below the coefficients)

y_t = 299.7 + 0.446 p_{t-6} − 5.41 c_{t-7} + e_t.
     (7.2)    (8.7)          (5.8)

The increased efficiency associated with the FGLS estimator cannot be inferred from
comparing these estimates to those in example 15. It is worth noting, however, that
the t-statistics are much lower and closer to those obtained in equation (30).

Exercise 12: Consider a model you have estimated in exercise 9. Test the
residuals for autocorrelation, and obtain FGLS estimates if required.
1.8.3 Example 19: Long-horizon return regressions

Suppose a time series of (overlapping) h-period returns is computed from single-period
returns as follows (see section 2.1):

y(h)_t = y_t + y_{t-1} + ⋯ + y_{t-h}     y_t = ln p_t − ln p_{t-1}.

In the context of long-horizon return regressions the conditional expected value of y(h)_t is
formed on the basis of the information set available at the date when the forecasts are made,
i.e. at date t−h−1. For example, in the case of a single regressor a regression equation is
formulated as

y(h)_t = β0 + β1 x_{t-h-1} + ε_t.

Alternatively, the model can be reformulated by shifting the time axis as follows:

y(h)_{t+h} = β0 + β1 x_{t-1} + ε_{t+h}.

This model specification corresponds to the problem of predicting the sum of h single-
period returns during the period t until t+h on the basis of information available at date
t−1.

We consider weekly observations of the DAX p_t which are used to form a time series of
annual returns

y(52)_t = ln p_t − ln p_{t-52}.

The following regressors are available: the dividend yield d_t, the spread between ten-
year and one-month interest rates s_t, the one-month real interest rate r_t, and the growth
rate of industrial production g_t[41]. Real interest rates are computed by subtracting the
inflation rate from one-month interest rates. The inflation rate i_t is calculated from the
consumer price index c_t, using i_t = ln c_t − ln c_{t-52}. The growth rate of industrial production
is computed in the same way from the industrial production index. Details can be found
in the file long.wf1.

Each explanatory variable is included with a lag of 53 weeks and the estimated equation
is[42]

y(52)_t = 0.41 + 39.8 d_{t-53} − 1.76 s_{t-53} − 8.79 r_{t-53} + 0.638 g_{t-53} + e_t.
The present result is typical of some similar cases known from the literature which support
the 'predictability' of long-horizon returns. All parameters are highly significant, which
leads to the (possibly premature) conclusion that the corresponding explanatory variables
are relevant when forming expectations.

The most obvious deficiency of this model is the substantial autocorrelation of the residuals
(r1=0.959). The literature on return predictability typically reports that R² increases
with h (see Kirby, 1997). This property of the residuals is mainly caused by the way the
overlapping returns are constructed.

41 Source: Datastream; January 1983 to December 1997; 782 observations.
42 All p-values are less than 0.001 except for s_{t-53} with a p-value equal to 0.002.
Figure 1: Fit and residuals of the long-horizon return regression of annual DAX returns.
1.9 Endogeneity and instrumental variable estimation

Supply and demand differ conceptually, but usually they cannot be separately measured.
Thus, we can only observe quantities sold q (representing the market equilibrium values
of d and s). Using q=d=s we can solve the equations

q = β0d + β1d p + εd     q = β0s + β1s p + εs

for p and q. The probability limit of the estimated slope in a regression of q on p is given by

(β1d V[εs] + β1s V[εd]) / (V[εs] + V[εd]).

Thus, the estimated coefficient is neither the slope of the demand nor the supply function,
but a weighted average of both. If supply shocks dominate (i.e. V[εs]>V[εd]) the estimate
will be closer to the slope β1d of the demand function. This may hold in the case of agricultural
products, which are more exposed to supply shocks (e.g. weather conditions). Positive(!)
slopes of a "demand" function may be found in the case of manufactured goods, which are more
subject to demand shifts over the business cycle. In an extreme case, where there are no
demand shocks, the observed quantities correspond to the intersections of a (constant)
demand curve and many supply curves. This would allow us to identify the estimated
slope as the slope of a demand function. This observation will be the basis for a solution
to the endogeneity bias described below.
2. The coefficient of z in the first-stage regression of x_k on z and the other exogenous
regressors must be non-zero: b_z ≠ 0. If this condition (which is also called the relevance
condition) is violated, the instrument is called weak.

3. z must not appear in the original regression (32). Roberts and Whited (2012) call
corr[z, ε]=0 the exclusion condition, expressing that the only way z may affect y is
through the endogenous regressor x_k (but not directly via equation 32).

It is frequently stated that z must be correlated with the endogenous regressor. Note that
the second requirement is stronger, since it refers to the partial correlation between z and
x_k. Whereas the second condition can be tested, the first requirement of zero correlation
between z and ε cannot be tested directly. Since ε is unobservable, this assumption must
be maintained or justified on the basis of economic arguments. However, in section 1.9.3
we will show that it is possible to test for exogeneity of regressors and the appropriateness
of instruments.

Consistent estimates can be obtained by using x̂_k from the first-stage regression to re-
place x_k in the original regression (32). x̂_k is exogenous by construction, since it only
depends on exogenous regressors and an instrument uncorrelated with ε. Thus, the endo-
geneity is removed from equation (32), and the resulting IV-estimates of its parameters
are consistent.
We briefly return to the example of supply and demand functions described in section
1.9.1. A suitable instrument for a demand equation is an observable variable which leads
to supply shifts and hence price changes (e.g. temperature variations affect the supply of
coffee and its price). We are able to identify the demand function and estimate its slope
if the instrument is uncorrelated with demand shocks (i.e. temperature has little or no
impact on the demand for coffee). IV-estimates can be obtained by first regressing price
on temperature (and other regressors), and then using the fitted prices as a regressor (among
others) to explain quantities sold. Consequently, the second regression can be considered
to be a demand equation.
In general, IV-estimation replaces those elements of X which are correlated with ε by a
set of instruments which are uncorrelated with ε, but related to the endogenous elements
of X by first-stage regressions. The number of instruments can be larger than the number
of endogenous regressors. However, the number of instruments must not be less than
the number of (potentially) endogenous regressors (this is the so-called order condition).
To derive the IV-estimator in more general terms we define a matrix Z of exogenous
regressors which includes the exogenous elements of X (including the constant) and the
instruments. Regressors suspected to be endogenous are not included in Z. We assume
that Z is uncorrelated with ε:

E[Z'ε] = 0.     (33)

IV-estimation can be viewed as a two-stage LS procedure. The first stage involves regress-
ing each original regressor x_i on all instruments Z:

x_i = Z b_iz + v_i = x̂_i + v_i     i = 1, …, K.

Regressors in Z that are also present in the original matrix X are exactly reproduced
by this regression. The resulting fitted values are used to construct the matrix X̂, which
is equal to X except for the columns which correspond to the (suspected) endogenous
regressors. X̂ is used in the second stage in the regression

y = X̂ b_iv + e.     (34)
Since X̂ only represents exogenous information, the IV-estimator given by

b_iv = (X̂'X̂)^{-1} X̂'y

is consistent. X̂ can be written as (see Greene, 2003, p.78)

X̂ = Z(Z'Z)^{-1} Z'X = Z b_z,     (35)

where b_z is a matrix (or vector) of coefficients from the first-stage regressions. If Z has the
same number of columns as X (i.e. the number of instruments is equal to the number of
endogenous regressors) the IV-estimator is given by

b_iv = (Z'X)^{-1} Z'y.

If the following conditions are satisfied

plim (1/n) Z'Z = Q_zz     |Q_zz| > 0
plim (1/n) Z'ε = 0,

the IV-estimator can be shown to be consistent

plim b_iv = β + plim[(1/n) Z'X]^{-1} · plim[(1/n) Z'ε] = β,     (36)

and asymptotically normal (see Greene, 2003, p.77):

b_iv ∼a N(β, (σ²/n) Q_zx^{-1} Q_zz Q_xz^{-1}).

The asymptotic variance of b_iv is estimated by

s̃_e² (Z'X)^{-1} Z'Z (X'Z)^{-1},     (37)

where

s̃_e² = (1/n) Σ_{i=1}^n (y_i − x_i'b_iv)².

Note that the standard errors and s̃_e² are derived from residuals based on X rather than
X̂:

e_iv = y − X b_iv.     (38)

This further implies that the R² from IV-estimation based on these residuals cannot be
interpreted. The variance of e_iv can even be larger than the variance of y, and thus R²
can become negative.
If the IV-estimation is done in two OLS-based steps, the standard errors from running
the regression (34) will differ from those based on (37), and thus will be incorrect (see
Wooldridge, 2002, p.91). We also note that the standard errors of b_iv are always larger
than the OLS standard errors, since b_iv only uses that part of the (co)variance of X which
appears in the fitted values X̂. Thus, the potential reduction of the bias is associated with
a loss in efficiency.
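The following sketch carries out the two stages explicitly for one endogenous regressor and one instrument (simulated data). It also recomputes the standard errors from residuals based on X, as required by (37) and (38); in this just-identified case s̃e²(X̂'X̂)^{-1} coincides with (37):

# Sketch of IV/2SLS with one endogenous regressor and one instrument.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 1000
z = rng.normal(size=n)                       # instrument
v = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z + v                              # endogenous regressor: correlated with the error via v
y = 1 + 2 * x + eps + 1.5 * v                # disturbance contains v -> OLS is inconsistent

Z = sm.add_constant(z)                       # exogenous information (constant + instrument)
X = sm.add_constant(x)

x_hat = sm.OLS(x, Z).fit().fittedvalues      # first stage
X_hat = sm.add_constant(x_hat)
b_iv = sm.OLS(y, X_hat).fit().params         # second stage

e_iv = y - X @ b_iv                          # residuals based on X, not X_hat (eq. 38)
s2 = e_iv @ e_iv / n
V_iv = s2 * np.linalg.inv(X_hat.T @ X_hat)   # equals (37) in the just-identified case
print("OLS:", sm.OLS(y, X).fit().params)
print("IV :", b_iv, "s.e.:", np.sqrt(np.diag(V_iv)))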
So far, the treatment of this subject has been based on the assumption that the instruments
satisfy the conditions stated above. However, violations of these conditions may have
serious consequences (see Murray, 2006, p.124). First, invalid instruments (correlated
with the disturbance term) yield biased and inconsistent estimates, which can be even
more biased than the OLS estimates. Second, if instruments are too weak it may not be
possible to eliminate the bias associated with OLS, and standard errors may be misleading
even in very large samples. Thus, the selection of instruments is crucial for the properties
of IV-estimates. In the next section we will investigate those issues more closely.
1.9.3 Selection of instruments and tests

Suitable instruments are usually hard to find. In the previous section we have mentioned
weather conditions. They may serve as instruments to estimate a demand equation for
coffee, since they may be responsible for supply shifts, and they are correlated with prices
but not with demand shocks. In a study on crime rates Levitt (1997) uses data on electoral
cycles as instruments to estimate the effects associated with hiring policemen on crime
rates. Such a regression is subject to a simultaneity bias: more policemen should
lower crime rates; however, cities with a higher crime rate tend to hire more policemen.
Electoral cycles may be suitable instruments: they are exogenously given (predetermined),
and expenditures on security may be higher during election years (i.e. the instrument is
correlated with the endogenous regressor). In a time series context, lagged variables of the
endogenous regressors may be reasonable candidates. The instruments X_{t-1} may be highly
correlated with X_t but uncorrelated with ε_t. According to Roberts and Whited (2012)
suitable instruments can be derived from biological or physical events or features. They
stress the importance of understanding the economics of the question at hand, i.e. that
the instrument must only affect y via the endogenous regressor. For example, institutional
changes may be suitable as long as the economic question under study was not one of the
original reasons for the institutional change.
Note that instruments are distinctly different from proxy variables (which may serve as
a substitute for an otherwise omitted regressor). Whereas a proxy variable should be
highly correlated with an omitted regressor, an instrument must be highly correlated
with a (potentially) endogenous regressor x_k. However, the higher the correlation of an
instrument z with x_k, the less justified may be the assumption that z is uncorrelated with
ε (given that the correlation between x_k and ε is causing the problem in the first place!).

IV-estimation will only lead to consistent estimates if suitable instruments are found (i.e.
E[Z'ε]=0 and b_z ≠ 0). If instruments are not exogenous because (33) is violated (i.e.
E[Z'ε]≠0) the IV-estimator is inconsistent (see (36)). Note also that the consistency of
b_iv critically depends on cov[Z,X] even if E[Z'ε]≈0. (36) shows that poor instruments
(being only weakly correlated with X) may lead to strongly biased estimates even in large
samples (i.e. inconsistency prevails). As noted above, estimated IV standard errors are
always larger than OLS standard errors, but they can be strongly biased downward when
instruments are weak (see Murray, 2006, p.125).

Given these problems associated with IV-estimation, it is important to first test for the
endogeneity of a regressor (i.e. does the endogeneity problem even exist?). This can be
done with the Hausman test described below. However, that test requires valid and
powerful instruments. Thus, it is necessary to first investigate the properties of potential
instruments.
Testing b_z: Evidence for weak instruments can be obtained from the first-stage regressions.
The joint significance of the instruments' coefficients can be tested with an F-test
(weak instruments will lead to low F-statistics). If weak instruments are found, the
worst should be dropped and replaced by more suitable instruments (if possible at
all). To emphasize the impact of (very) weak instruments consider an extreme case,
where the instruments' coefficients in the first-stage regression are all zero. In that
case, x̂_k could not be used in the second stage, since it would be a linear combination
of the other exogenous regressors. As a rule of thumb, n times the R² from the first-
stage regression should be larger than the number of instruments so that the bias
of IV will tend to be less than the OLS bias (see Murray, 2006, p.124). Staiger and
Stock (1997) consider instruments to be weak if the first-stage F-statistic, testing
whether the coefficients b_z of the instruments entering the first-stage regression are
jointly zero, is less than ten. Dealing with more than one endogenous regressor is even
more demanding (see Stock et al., 2002).
Testing corr[z,ε]: If there are more instruments than necessary,[46] β is overidentified.
Rather than eliminating some instruments they are usually kept in the model, since
they may increase the efficiency of the IV-estimates. However, a gain in efficiency re-
quires valid instruments (i.e. they must be truly exogenous). Since invalid instru-
ments lead to inconsistent IV-estimates, it is necessary to test the overidentifying
restrictions. Suppose two instruments z1 and z2 are available, but only one (z1) is
used in 2SLS (this is the just identified case). Whereas the condition corr[z1,ε]=0
cannot be tested (both e from (34) and e_iv from (38) are uncorrelated with z1 by
the LS principle), we can test whether z2 is uncorrelated with e, and may thus be
a suitable instrument. The same applies vice versa, if z2 is used in 2SLS and z1 is
tested.

In general, overidentifying restrictions can be tested by regressing the residuals from
the 2SLS regression, e_iv as defined in (38), on all exogenous regressors and all instru-
ments (see Wooldridge, 2002, p.123). Valid (i.e. exogenous) instruments should not
be related to the 2SLS residuals. Under H0: E[Z'ε]=0, the test statistic is n·R² ∼a χ²_m,
where R² is taken from this regression, and m is the difference between the number
of instruments and the number of endogenous regressors. A failure to reject the
overidentifying restrictions is an indicator of valid instruments.
If acceptable instruments have been found, we can proceed to test for the presence of
endogeneity. The Hausman test is based on comparing b_iv and b_ls, the IV and OLS
estimates of β. A significant difference is an indicator of endogeneity. The Hausman test
is based on the null hypothesis that plim (1/n)X'ε=0 (i.e. H0: X is not endogenous).
In this case OLS and IV are both consistent. If H0 does not hold, only IV is consistent.
However, a failure to reject H0 may be due to invalid or weak instruments. Murray (2006,
p.126) reviews alternative procedures, which are less affected by weak instruments, and
provides further useful guidelines.
The Hausman test is based on d=b_iv−b_ls and H0: plim d=0. The Hausman test statistic is
given by

H = d' aV̂[d]^{-1} d = (1/s_e²) d'[(X̂'X̂)^{-1} − (X'X)^{-1}]^{-1} d     H ∼a χ²_m,

where X̂ is defined in (35), m=K−K0, and K0 is the number of regressors for which H0
need not be tested (because they are known, or rather assumed, to be exogenous).

A simplified but asymptotically equivalent version of the test is based on the residuals from
the first-stage regression associated with the endogenous regressor (v_k), and an auxiliary
OLS regression (see Wooldridge, 2002, p.119), where v_k is added to the original regression[47]
(32):
y = β0 + Σ_{j=1}^k βj xj + b_v v_k + ε.

Note that v_k represents the endogenous part of x_k (if there is any). If there is an en-
dogeneity problem, the coefficient b_v will pick up this effect (i.e. the endogenous part of
ε will move to b_v v_k). Thus, endogeneity can be tested by a standard t-test of b_v (based
on heteroscedasticity-consistent standard errors). If the coefficient is significant, we reject
H0 (no endogeneity) and conclude that the suspected regressor is endogenous. However,
failing to reject H0 need not indicate the absence of endogeneity, but may be due to weak
instruments. If more than one regressor is suspected to be endogenous, a first-stage re-
gression is run for each one of them. All residuals thus obtained are added to the original
equation, and an F-test for the residuals' coefficients being jointly equal to zero can be
used.

46 By the order condition the number of instruments must be greater than or equal to the number of
endogenous regressors.
47 Note that adding v_k to equation (32) will not change any of the coefficients in (32) estimated by 2SLS.
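The three diagnostics discussed in this section (first-stage F-statistic, the nR² overidentification test, and the regression-based endogeneity test) can be sketched as follows with two instruments and one endogenous regressor; all data are simulated and the numbers are purely illustrative:

# Sketch of IV diagnostics: first-stage F-test, overidentification test, and the
# regression-based endogeneity test that adds the first-stage residual v_k.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(12)
n = 1000
z1, z2 = rng.normal(size=n), rng.normal(size=n)      # two instruments
v = rng.normal(size=n)
x = 0.7 * z1 + 0.4 * z2 + v                          # endogenous regressor
y = 1 + 2 * x + rng.normal(size=n) + 1.2 * v

Z = sm.add_constant(np.column_stack([z1, z2]))
X = sm.add_constant(x)

# 1. First-stage regression and F-test of the instruments' coefficients
first = sm.OLS(x, Z).fit()
print("first-stage F:", first.fvalue)                # weak if well below 10

# 2. 2SLS and the overidentification test: regress IV residuals on all instruments
X_hat = sm.add_constant(first.fittedvalues)
b_iv = sm.OLS(y, X_hat).fit().params
e_iv = y - X @ b_iv
overid = n * sm.OLS(e_iv, Z).fit().rsquared          # chi-squared with m=1 here
print("overid:", overid, stats.chi2.sf(overid, df=1))

# 3. Endogeneity test: add the first-stage residual v_k to the original regression
aux = sm.OLS(y, np.column_stack([X, first.resid])).fit(cov_type='HC1')
print("t-stat of v_k:", aux.tvalues[-1])             # significant -> reject exogeneity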
Example 20: We use a subset of wage data from Wooldridge (2002, p.89) (example
5.2)[48] to illustrate IV-estimation. Wages are assumed to be (partially) determined by
the unobservable variable ability. Another regressor in the wage regression is educa-
tion (measured by years of schooling), which can be assumed to be correlated with
ability. Since ability is an omitted variable which is correlated with (at least) another
regressor, this will lead to inconsistent OLS estimates. The dummy variable 'near'
indicates whether someone grew up near a four-year college. This variable can be
used as an instrument: it is exogenously given (i.e. uncorrelated with the error term
which contains 'ability'), and most likely correlated with education. The first-stage
regression shows that the coefficient of the instrument is highly significant (i.e. there
seems to be no danger of using a weak instrument). The condition corr[z,ε]=0 cannot
be tested since only one instrument is available. The coefficient of v (the residual from
the first-stage regression) is highly significant (p-value 0.0165), which indicates an en-
dogeneity problem (as expected). Comparing OLS and IV estimates shows that the
IV coefficient of education is three times as high as the OLS coefficient. However, the
IV standard error is more than ten times as high as the OLS standard error. Similar
results are obtained for other coefficients (see wage.wf1 and wage.xls).
which simplifies to

U'(C_t) = δ E_t[U'(C_{t+1})(1 + R_{t+1})].

Thus, the optimal solution is obtained by equating the expected marginal utility from
investment to the marginal utility of consumption. The Euler equation can be rewritten
in terms of the so-called stochastic discount factor M_t

E_{t-1}[(1 + R_t) M_t] = 1     M_t = δ U'(C_t)/U'(C_{t-1}).

This equation is also known as the consumption based CAPM.

Using the power utility function

U(C) = C^{1−γ}/(1−γ)     U'(C) = C^{−γ}     (γ … coefficient of relative risk aversion),
Campbell et al. (1997, p.306) assume that R_t and C_t are lognormal and homoscedastic,
use the relation ln E[X] = E[ln X] + 0.5 V[ln X] (see (41) in section 2.1.2), and take logs of
the term in square brackets (i.e. of E_{t-1}[(1+R_t) δ (C_t/C_{t-1})^{−γ}] = 1) to obtain

E_{t-1}[ln(1 + R_t)] + ln δ − γ E_{t-1}[Δ ln C_t] + 0.5c = 0.

c is a constant (by the assumption of homoscedasticity) involving the variances and covari-
ances of R_t and ΔC_t. This equation implies a linear relation between expected returns and
expected consumption growth, and can be used to estimate γ. We replace the expectations
by observed data on log returns y_t = ln(1 + R_t) and consumption growth Δc_t = Δ ln C_t

y_t = E_{t-1}[ln(1 + R_t)] + a_t     Δc_t = E_{t-1}[Δ ln C_t] + b_t.

Replacing expectations by observed variables implies measurement errors, which further
lead to inconsistent OLS estimates if obtained from the regression equation (as shown in
section 1.9.1)

y_t = α + γ Δc_t + ε_t     α = −ln δ − 0.5c     ε_t = a_t − γ b_t.
We estimate this equation to replicate parts of the analysis by Campbell et al. (1997,
p.311) using a slightly different annual dataset (n=105) prepared by Shiller[49]. Details of
the estimation results can be found in the files ccapm.wf1 and ccapm.xls. Using OLS, the
estimated equation is (p-values in parentheses)

y_t = 0.0057 + 2.75 Δc_t + u_t     R² = 0.31.
      (.72)    (.00)

The estimate 2.75 is in a plausible range. However, OLS estimation is not appropriate
since the regressor Δc_t is correlated with ε_t via b_t (unless γ=0). An IV-estimate of γ can be
obtained using instruments which are assumed to be correlated with consumption growth.
Campbell et al. use lags of the real interest rate i_t and the log dividend-price ratio d_t as
instruments, arguing that ε_t is uncorrelated with any variables in the information set from
t−1. Using 2SLS we obtain

y_t = 0.29 − 11.2 Δc_t + e_t,
      (.43)   (.53)

49 http://www.econ.yale.edu/~shiller/data/chapt26.xls.
which shows that the estimated γ is insignificant (which saves us from the need to explain
the unexpected negative sign). We use these instruments to test for their suitability, and
subsequently, for testing the endogeneity of Δc_t. The first-stage regression of Δc_t on the
instruments yields

Δc_t = 0.013 − 0.042 i_{t-1} − 0.0025 d_{t-1} + v_t     R² = 0.006.
       (.64)   (.46)          (.78)

Obviously, the requirement of a high correlation of the instruments with the possibly endoge-
nous regressor is not met. The F-statistic is 0.33 with a p-value of 0.72, and the instruments
must be considered to be very weak. nR²<2 indicates that the IV-bias may be substantially
larger than the OLS-bias.

To test the validity of the instruments based on the overidentifying restrictions, we run the
regression (e_t are the 2SLS residuals)

e_t = 0.155 − 0.125 i_{t-1} + 0.048 d_{t-1} + a_t     R² = 0.0014.
      (.71)    (.88)          (.72)

The test statistic is 115·0.0014=0.16, with a p-value of 0.69 (m=1). We cannot reject the
overidentifying restrictions (i.e. the instruments are uncorrelated with the 2SLS residuals
e_t), and the instruments can be considered to be valid.

Despite this ambiguous evidence regarding the appropriateness of the instruments, we use
the residuals from the first-stage regression to test for endogeneity:

y_t = 0.29 − 11.2 Δc_t + 14.1 v_t + w_t     R² = 0.35.
      (.005)  (.024)     (.005)

The coefficient of v_t is significant at low levels, and we can firmly reject the H0 of exogeneity
of Δc_t (as expected).
This leaves us with conflicting results. On the one hand, we have no clear evidence
regarding the appropriateness of the instruments (comparing the first-stage regression
and the overidentification test). On the other hand, we can reject exogeneity of Δc_t on
theoretical grounds, but obtain no meaningful estimate for γ using 2SLS. From OLS we
obtain a reasonable estimate for γ, but OLS would only be appropriate if Δc_t was truly
exogenous (which is very doubtful).

In a related study, Yogo (2004) applies 2SLS to estimate the elasticity of intertemporal
substitution, which is the reciprocal of the risk aversion coefficient γ for a specific choice
of parameters in the Epstein-Zin utility function. He shows that using weak instruments
(i.e. the nominal interest rate, inflation, consumption growth, and the log dividend-price
ratio, each lagged twice) leads to biased estimates and standard errors. Yogo's results
imply that the lower end of a 95% confidence interval for γ is around 4.5 for the US, and
not less than 2 across eleven developed countries.
Exercise 13: Use the data from example 21 to replicate the analysis using the
same instruments lagged by one and two periods.

Exercise 14: Use the monthly data in the file ie data.xls, which is based
on data prepared by Shiller[50] and Verbeek[51], to replicate the analysis from
example 21.

50 http://www.econ.yale.edu/~shiller
51 http://eu.wiley.com/legacy/wileychi/verbeek2ed/

Exercise 15: Use the weekly data in the file JEC.* downloaded from the
companion website of http://www.pearsonhighered.com/stock_watson/ to
estimate a demand equation for the quantity of grain shipped. Use "price",
"ice" and "seas1" to "seas12" as explanatory variables. Discuss potential en-
dogeneity in this equation. Consider using "cartel" as an instrument. Discuss
and test the appropriateness of "cartel" as an instrument.
1.10 Generalized method of moments[52]

Review 7:[53] In the method of moments the parameters of a distribution are es-
timated by equating sample and population moments. Given a sample x1,…,xn of
independent draws from a distribution, the moment condition

E[X] − μ = 0

is replaced by its sample analog

(1/n) Σ_{i=1}^n xi − μ = x̄ − μ = 0.

In general, K parameters θ are estimated from K moment conditions of the form

(1/n) Σ_{i=1}^n m_ij(xi, θ̂) = 0     j = 1, …, K,

where the m_ij(xi, θ̂) are suitable functions of the sample and the parameter vector. The K
parameters can be estimated by solving the system of K equations. The moment esti-
mators are based on averages of functions. By the consistency of a mean of functions
(see review 5) they converge to their population counterparts. They are consistent,
but not necessarily efficient estimates. In many cases their asymptotic distribution is
normal.
Example 22: The first and second central moments of the Gamma distri-
bution with density

f(x) = (x/b)^{a−1} e^{−x/b} / (b·Γ(a))

are given by

E[X] = μ = ab     and     E[(X−μ)²] = σ² = ab².

To estimate the two parameters a and b we define the two functions

m_i1 = x_i − ab     m_i2 = (x_i − ab)² − ab² = x_i² − (ab² + a²b²).

The two sample moment conditions

(1/n) Σ_{i=1}^n x_i − ab = x̄ − ab = 0     (1/n) Σ_{i=1}^n x_i² − (ab² + a²b²) = 0

can be used to estimate a and b by solving the equations[54]

ab − x̄ = 0     and     ab² − s̃² = 0,

which yields b = s̃²/x̄ and a = x̄²/s̃².
52 This section is based on selected parts from Greene (2003), section 18. An alternative source is
Cochrane (2001), sections 10 and 11. Note that Cochrane uses a different notation with u_t for m_t, g_T for m̄,
S for Φ and d for G.
53 For details see Greene (2003), p.527 or Hastings and Peacock (1975), p.68.
54 s̃² is the unadjusted sample variance s̃² = (1/n) Σ(x_i − x̄)².
Example 23: We consider the problem of a time series of prices observed at irregularly
spaced points in time (i.e. the intervals between observations have varying length). We
want to compute mean and standard deviation of returns for a comparable (uniform)
time interval by applying the method of moments (see file irregular.xls for a numerical example).
The observed returns are assumed to be determined by the following process:

Y(Δ_i) = μΔ_i + σ Z_i √Δ_i    (i = 1,…,n),

where Δ_i is the length of the time interval (e.g. measured in days) used to compute
the i-th return. Z_i is a pseudo-return, with mean zero and standard deviation one.
μ and σ are mean and standard deviation of returns associated with the base interval.
Assuming that Z_i and Δ_i are independent (i.e. E[Z_i√Δ_i]=0), we can take expectations
on both sides, and replace these by sample averages to obtain

Ȳ = (1/n) Σ_{i=1}^n Y(Δ_i) = μ (1/n) Σ_{i=1}^n Δ_i = μΔ̄    σ̂² = (1/n) Σ_{i=1}^n (Y(Δ_i) − μ̂Δ_i)² / Δ_i.

The first equation yields μ̂ = Ȳ/Δ̄.
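The following lines sketch these estimators in R on simulated data (the interval lengths delta and the returns y are hypothetical, not the series in irregular.xls):

# method-of-moments estimates for irregularly spaced returns (simulated data)
set.seed(1)
n     <- 500
delta <- sample(1:5, n, replace = TRUE)            # interval lengths in days
mu    <- 0.0005; sigma <- 0.01                     # 'true' base-interval parameters
y     <- mu * delta + sigma * sqrt(delta) * rnorm(n)

mu.hat     <- mean(y) / mean(delta)                # from  Y-bar = mu * Delta-bar
sigma2.hat <- mean((y - mu.hat * delta)^2 / delta)
c(mu.hat, sqrt(sigma2.hat))                        # compare to 0.0005 and 0.01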
1.10.1 OLS, IV and GMM
Generalized method of moments (GMM) can not only be used to estimate the parameters
of a distribution, but also to estimate the parameters of an econometric model by general-
izing the method of moments principle. GMM has its origins and motivation in the context
of asset pricing and modeling rational expectations (see Hansen and Singleton, 1996). One
of the main objectives was to estimate models without making strong assumptions about
the distribution of returns.
We start by showing that the OLS estimator can be regarded as a method of moments
estimator. Assumption AX in the context of the regression model y = Xβ + ε implies the
orthogonality condition

E[X'ε] = E[X'(y − Xβ)] = 0.

To estimate the K×1 parameter vector β we define K functions and apply them to each
observation in the sample55

m_ij(b) = x_ij(y_i − x_i'b) = x_ij e_i    i = 1,…,n;  j = 1,…,K.
The moment conditions are the sample averages

m̄_j = (1/n) Σ_{i=1}^n m_ij = 0    j = 1,…,K,

which are identical to the normal equations (2) which have been used to derive the OLS
estimator in section 1.1:

(1/n) Σ_{i=1}^n x_i e_i = (1/n) Σ_{i=1}^n x_i(y_i − x_i'b) = (1/n) X'e = 0.
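A minimal R sketch (with simulated data) illustrates that solving these sample moment conditions is the same as solving the normal equations, i.e. it reproduces the OLS estimates from lm():

# OLS as a method-of-moments estimator (simulated data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
X <- cbind(1, x)
b <- solve(t(X) %*% X, t(X) %*% y)   # normal equations X'X b = X'y
cbind(b, coef(lm(y ~ x)))            # identical estimates
crossprod(X, y - X %*% b) / n        # sample moments are (numerically) zero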
If some of the regressors are (possibly) endogenous it is not appropriate to impose the
orthogonality condition. Suppose there are instruments Z available for which E[Z'ε]=0
holds. If Z has dimension n×K (the same as X) we can obtain IV-estimates from

(1/n) Σ_{i=1}^n z_i(y_i − x_i'b) = 0.
If there are more instruments than parameters we can specify the conditions

(1/n) X̂'e = 0,

where X̂ is defined in (35). Using X̂ generates K conditions, even when there are L>K
instruments56.
55 The notation m_ij = m_j(y_i, x_ij) is used to indicate the dependence of the j-th moment condition on the
observation i.
56 More instruments than necessary can be used to generate overidentifying restrictions and can improve
the efficiency of the estimates.
The homoscedasticity assumption implies that the variance of residuals is uncorrelated
with the regressors. This can be expressed as

E[x_i(y_i − x_i'β)²] − E[x_i]E[ε_i²] = 0.

If the model specification is correct the following expression

(1/n) Σ_{i=1}^n x_i e_i² − x̄ s̃_e²

should be close to zero.
In example 21 we have considered the Euler equation of the consumption-based asset pricing model
and have shown how to estimate γ based on linearizing this equation. An alternative view
is to consider the Euler equation as a testable restriction. It should hold for all assets and
across all periods. This implies the following sample moment condition:

(1/n) Σ_{t=1}^n m_t(δ, γ) = 0    m_t(δ, γ) = δ(1 + R_t)(C_t/C_{t−1})^{−γ} − 1.
The returns of at least two assets are required to estimate the parameters δ and γ. Note
that no linearization or closed-form solution of the underlying optimization problem is
required (as opposed to the approach by Campbell et al. (1997) described in example 21).
GMM can accommodate more conditions than necessary (i.e. additional instruments can
be used to formulate overidentifying restrictions; see section 1.10.3).
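A minimal R sketch of this approach on simulated data (consumption growth cg and the returns R1 and R2 are made up, and the identity matrix is used as weight matrix):

# GMM estimation of delta and gamma from the Euler equation moments (simulated data)
set.seed(1)
n  <- 400
cg <- exp(rnorm(n, 0.005, 0.02))                 # gross consumption growth C_t/C_{t-1}
R1 <- 0.01 + 0.5 * (cg - 1) + rnorm(n, 0, 0.03)  # returns of two assets
R2 <- 0.02 + 1.5 * (cg - 1) + rnorm(n, 0, 0.05)

moments <- function(p) {                         # p = c(delta, gamma)
  m1 <- p[1] * (1 + R1) * cg^(-p[2]) - 1
  m2 <- p[1] * (1 + R2) * cg^(-p[2]) - 1
  cbind(m1, m2)
}
J <- function(p) {                               # criterion m-bar' W m-bar with W = I
  mbar <- colMeans(moments(p))
  sum(mbar^2)
}
optim(c(0.99, 2), J)$par                         # estimates of delta and gamma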
In asset pricing or rational expectation models the errors57 in expectations should be
uncorrelated with all variables in the information set I_{t−1} of agents forming those ex-
pectations. This can be used to formulate orthogonality conditions for any instrument
z_{t−1} ∈ I_{t−1} in the following general way:

E[m_t z_{t−1}] = 0

should hold.
In example 6 we have briefly described the Fama-MacBeth approach to estimate the pa-
rameters of asset pricing models. GMM provides an alternative (and possibly preferable)
way to pursue that objective. We consider N assets with excess returns x_t^i, and a single-
factor model with factor excess return x_t^{im}. The factor model implies that the following
equations hold:

x_t^i = β_i x_t^{im} + ε_t^i    i = 1,…,N,
E[x_t^i] = λ_m β_i    i = 1,…,N.
Fama and MacBeth estimate β_i from the first equation for each asset. Given these esti-
mates, λ_m is estimated from a single regression across the second set of equations (using
sample means as observations of the dependent variable and estimated beta-factors as
observations of the regressor). The CAPM or the APT imply a set of restrictions that
should have zero expectation (at the true parameter values). The moment conditions
corresponding to the first set of equations for the present example are

(1/n) Σ_{t=1}^n (x_t^i − β_i x_t^{im}) x_t^{im} = 0    i = 1,…,N.
One way to obtain estimates is to minimize the sum of squared sample moments Σ_{j=1}^L m̄_j(θ)², where

m̄_j(θ) = (1/n) Σ_{i=1}^n m_ij(θ)    j = 1,…,L.

Minimizing this criterion gives consistent but not necessarily efficient estimates of θ.
Hansen (1982) has considered estimates based on minimizing the weighted sum of squares

J = m̄' W m̄.

The weight matrix W has to be positive definite. The choice of W relies on the idea of
GLS estimators, with the intention to obtain efficient estimates. Elements of m̄ which are
more precisely estimated should have a higher weight and have more impact on the value
of the criterion function. If W is inversely proportional to the asymptotic covariance of
m̄, i.e.

W = Φ^{−1}    Φ = aV[√n m̄],

and plim m̄ = 0, the GMM estimates are consistent and efficient.
Before we proceed, we briefly refer to the asymptotic variance of the sample mean ȳ
(see review 5, p.22). It can be derived from observations y_i and is given by s²/n where
s² = (1/(n−1)) Σ_i (y_i − ȳ)². Now, we note that m̄ can be viewed as a (multivariate) sample
mean. It can be derived from

m̄ = (1/n) Σ_{i=1}^n m_i    m_i = m(y_i, x_i).
In general, the asymptotic covariance matrix of GMM parameter estimates can be esti-
mated by

V̂ = (1/n) [Ĝ' Φ̂^{−1} Ĝ]^{−1}.    (39)
Ĝ is the Jacobian of the moment functions (i.e. the matrix of derivatives of the moment
functions with respect to the estimated parameters):

Ĝ = (1/n) Σ_{i=1}^n ∂m_i(θ̂)/∂θ̂'.

The columns of the L×K matrix Ĝ correspond to the K parameters and the rows to the
L moment conditions.
As shown in section 1.10.1, OLS and GMM lead to the same parameter estimates, if
GMM is only based on the orthogonality condition E[X'ε]=0. However, if the covariance of
parameters is estimated according to (39), Φ̂ is given by

Φ̂ = (1/(n−1)) Σ_{i=1}^n m_i m_i' = (1/(n−1)) Σ_{i=1}^n x_i e_i x_i' e_i = (1/(n−1)) Σ_{i=1}^n e_i² x_i x_i',

and Ĝ is given by

Ĝ = (1/n) Σ_{i=1}^n ∂m_i/∂b' = (1/n) Σ_{i=1}^n ∂[x_i(y_i − x_i'b)]/∂b' = −(1/n) X'X.

Combining terms we find that the estimated covariance matrix for the GMM parameters
of a regression model is given by

V̂ = n/(n−1) (X'X)^{−1} ( Σ_{i=1}^n e_i² x_i x_i' ) (X'X)^{−1}.
The GMM criterion evaluated at the estimates is

J = m̄(θ̂)' Φ̂^{−1} m̄(θ̂).
The asymptotic properties of GMM estimates can be derived on the basis of a set of
assumptions (see Greene, 2003, p.540). Among others, the empirical moments are assumed
to obey a central limit theorem. They are assumed to have a finite covariance matrix Φ/n,
so that

√n m̄ →d N[0, Φ].

Under this and further assumptions (see Greene, 2003, p.540) it can be shown that the
asymptotic distribution of GMM estimates is normal, i.e.

θ̂ ∼a N[θ, V].
The diagonal elements of the estimated covariance matrix V^ can be used to compute
t-statistics for the parameter estimates:
θ̂_j / √V̂_jj ∼a N(0, 1).
Alternative estimators like the White or the Newey-West estimator can be used if required
(see Cochrane, 2001, p.220).
Overidentifying restrictions can be tested on the basis of nJ ∼ χ²_{L−K}. Under the null hy-
pothesis, the restrictions are valid, and the model is correctly specified. Invalid restrictions
lead to high values of J and to a rejection of the model. In the just identified case L=K
and J=0.
Despite this relatively brief description of GMM, its main advantages should have become
clear. GMM does not rely on Aiid, requires no distributional assumptions, it may also
be based on conditional moments, and allows for more conditions than parameters to
be estimated (i.e. it can be used to formulate and test overidentifying restrictions). The
requirements for consistency and asymptotic normality are that m_i must be well behaved
(i.e. stationary and ergodic), and the empirical moments must have a finite covariance
matrix.
These advantages are not without cost, however. Some of the problems (which have
received insufficient space in this short treatment) associated with GMM are: In some
cases the first derivative of J may not be known analytically and the optimization of the
criterion function J has to be carried out numerically. Moreover, J is not necessarily a
convex function which implies that there need not be a unique minimum, and good starting values
are very important for the numerical search algorithm.
1.10.4 Example 24: Models for the short-term interest rate
Chan et al. (1992) use GMM to estimate several models for the short-term interest rate.
They consider a general case where the short rate follows the diffusion

dr = (α + βr)dt + σ r^γ dZ.

By imposing restrictions on the parameters special cases are obtained (e.g. the Vasicek
model if γ=0, or the Brennan-Schwartz model if γ=1). The discrete-time specification of
the model is given by

r_t − r_{t−1} = α + β r_{t−1} + ε_t    E[ε_t] = 0    E[ε_t²] = σ² r_{t−1}^{2γ}.

Using θ=(α β σ² γ)', Chan et al. impose the following moment conditions

m_t(θ) = [ ε_t    ε_t r_{t−1}    ε_t² − σ² r_{t−1}^{2γ}    (ε_t² − σ² r_{t−1}^{2γ}) r_{t−1} ]'.
Conditions one and three correspond to the mean and variance of ε_t. Conditions two
and four impose orthogonality between the regressor r_{t−1} and ε_t, and between r_{t−1} and the
error from describing the variance of the disturbances ε_t. The estimated covariance of the parameter estimates
is based on the following components of the Jacobian (rows correspond to conditions,
columns to parameters):
G_{t,1} = [ ∂m_{t,1}/∂α   ∂m_{t,1}/∂β   ∂m_{t,1}/∂σ²   ∂m_{t,1}/∂γ ] = [ −1   −r_{t−1}   0   0 ]

G_{t,2} = [ −r_{t−1}   −r_{t−1}²   0   0 ]

G_{t,3} = [ −2m_{t,1}   −2m_{t,1} r_{t−1}   −r_{t−1}^{2γ}   −2σ² r_{t−1}^{2γ} ln(r_{t−1}) ]

G_{t,4} = [ −2m_{t,1} r_{t−1}   −2m_{t,1} r_{t−1}²   −r_{t−1}^{2γ+1}   −2σ² r_{t−1}^{2γ+1} ln(r_{t−1}) ].
Chan et al. (1992) use monthly observations of the three-month rate for r_t from June 1964
to December 1989. Details of computations and some estimation results can be found in
the file ckls.xls. Note that the estimates for α, β and σ² have to be scaled by Δt=1/12
to convert them into annual terms, and to make them comparable to the results presented
in Chan et al. (1992).
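The following R sketch illustrates the mechanics on a simulated short-rate series (not the data in ckls.xls). Since the system is just identified, the four moment conditions can be solved exactly: α and β from the first two (the OLS normal equations), σ² and γ from the last two:

# solving the CKLS moment conditions on simulated data
set.seed(1)
n <- 500
r <- numeric(n + 1); r[1] <- 0.06
for (t in 1:n)                                   # simulate with gamma = 0.5 (CIR-type)
  r[t + 1] <- r[t] + 0.2 * (0.06 - r[t]) + 0.015 * sqrt(r[t]) * rnorm(1)
r1 <- r[1:n]; dr <- diff(r)                      # r_{t-1} and r_t - r_{t-1}

ols   <- lm(dr ~ r1)                             # conditions 1 and 2
alpha <- coef(ols)[1]; beta <- coef(ols)[2]
e     <- residuals(ols)

sig2  <- function(g) mean(e^2) / mean(r1^(2 * g))            # condition 3 given gamma
cond4 <- function(g) mean((e^2 - sig2(g) * r1^(2 * g)) * r1) # condition 4
gamma <- optimize(function(g) cond4(g)^2, c(0, 3))$minimum
sigma2 <- sig2(gamma)
c(alpha, beta, sigma2, gamma)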
Exercise 16: Retrieve a series of short-term interest rates from the website
http://www.federalreserve.gov/Releases/H15/data.htm or from another
source. Estimate two or three different models of the short-term interest rate
by GMM.
1.11 Models with binary dependent variables
Review 8: The binomial distribution describes the probabilities associated with a
sequence of n independent trials, where each trial has two possible outcomes (usually
called success and failure). The probability of success p is the same in each trial. The
probability of y successes in n trials is given by

f(y) = (n y) p^y (1 − p)^{n−y},

where (n y) denotes the binomial coefficient.
Expected value and variance of a binomial random variable are given by np and
np(1−p), respectively. If the number of trials in a binomial experiment is large (e.g.
np ≥ 5), the binomial distribution can be approximated by the normal distribution. If
n=1, the binomial distribution is a Bernoulli distribution.
We now consider the application of regression analysis to the case of binary dependent
variables. This applies, for example, when the variable of interest is the result of a choice
(e.g. brand choice or choosing means of transport), or an interesting event (e.g. the default
of a company or getting unemployed). For simplicity we will only consider the binary case
but the models discussed below can be extended to the multinomial case (see Greene
(2003), section 21.7).
Observations of the dependent variable y indicate whether the event or decision has taken
place or not (y=1 or y=0). The probability for the event is assumed to depend on regressors
X and parameters β, and is expressed in terms of a distribution function F. For a single
observation i we specify the conditional probabilities
P[y_i = 1] = F(x_i, β)    P[y_i = 0] = 1 − F(x_i, β).
Hence, in the probit- and logit-model the effect of a change in regressor j depends on the
probability at a given value of x_i'β. A convenient interpretation of the logit-model is based
on the so-called odds-ratio, which is defined as

L(ŷ_i)/(1 − L(ŷ_i)) = exp{x_i'β}    L(ŷ_i) = exp{x_i'β}/(1 + exp{x_i'β}).
The log odds-ratio is given by

ln[ L(ŷ_i)/(1 − L(ŷ_i)) ] = x_i'β.
This implies that exp{β_j Δx_j} is the factor by which the odds-ratio is changed c.p. if
regressor j is changed by Δx_j units. The effect of a change in a regressor on L(ŷ_i) is low
if L(ŷ_i) is close to zero or one.
Binary choice models can be estimated by maximum-likelihood, whereas linear models are
usually estimated by least squares. Each observation in the sample is treated as a random
draw from a Bernoulli distribution. For a single observation the (conditional) probability
of observing yi is given by
P[yijxi; ] = F (xi; )yi(1 F (xi; ))(1 yi) = Fiyi(1 Fi)(1 yi) :
For the entire sample the joint probability is given by

∏_{i=1}^n F_i^{y_i} (1 − F_i)^{1−y_i}.

The parameters β are estimated by maximizing the corresponding (log-)likelihood, and tests of the estimated coefficients make use of the asymptotic normality of ML esti-
mators.
A likelihood ratio test of m restrictions is based on comparing the restricted log-like-
lihood ℓ_r to the unrestricted log-likelihood ℓ_u:

2[ℓ_u − ℓ_r] ∼ χ²_m.
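A minimal R sketch of ML estimation of a logit model and a likelihood ratio test (simulated data; glm maximizes the log-likelihood shown above):

# logit estimation and a likelihood ratio test of one restriction
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- 1 / (1 + exp(-(-0.5 + 1.2 * x1)))             # x2 is irrelevant in the DGP
y  <- rbinom(n, 1, p)

unrestricted <- glm(y ~ x1 + x2, family = binomial)  # logit is the default link
restricted   <- glm(y ~ x1,      family = binomial)

LR <- as.numeric(2 * (logLik(unrestricted) - logLik(restricted)))   # m = 1
c(LR, pchisq(LR, df = 1, lower.tail = FALSE))        # statistic and p-value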
If sample data can only be observed conditional on some mechanism related to z, the
conditional mean of y (now subject to selection) is given by

E[y_i | selection] = x_i'β + ρσ_ε λ_i(α_ui),

where α_ui = ẑ_i/σ_u and λ_i(α_ui) = f(ẑ_i/σ_u)/F(ẑ_i/σ_u). This result is obtained by assuming
bivariate normality of ε and u (rather than y and z). Note that the inverse Mills ratio
λ_i(·) is not a constant, but depends on w_i'γ. Estimating the equation y_i = x_i'β + ε_i without
λ_i(·) yields inconsistent estimates because of the omitted regressor, or, equivalently, as a
result of sample selection.60 Note that a non-zero correlation among ε and u determines
the bias/inconsistency. Thus, a special treatment is required when the unobservable fac-
tors determining inclusion in the subsample are correlated with the unobservable factors
affecting the variable of primary interest.
In many cases, z is not directly observed/observable, but only a binary variable d, indicat-
ing the consequence of the z-based selection rule. This offers the opportunity to estimate
the so-called selection equation (using a logistic regression as described in section 1.11):

d_i = w_i'γ + v_i.
59 Most of this section is based on Greene (2003), section 22.4.
60 The resulting inconsistency cannot be 'argued away' by stating that the estimated incomplete equa-
tion is representative for the population corresponding to that available subsample. Since the estimated
equation describes (only) the non-random subsample, such a viewpoint is rather useless as long as nothing
is known about the mechanism that determines whether y (in the population) is non-zero.
This forms the basis for the so-called Heckman correction. Heckman (1979) suggested
a two-step estimator61 which first estimates the selection equation by probit to obtain
λ_i = f(w_i'γ̂)/F(w_i'γ̂) for every i, and then estimates the original equation after adding this
auxiliary regressor to the equation.62 The coefficient of λ_i can be interpreted by noting
that it is an estimate of the term ρσ_ε; i.e. it can be viewed as a scaled correlation
coefficient.63
For the practical implementation of this two-step approach we note that y_i and x_i are only
observed if d_i=1, while the regressors w_i must be observed for all cases. The information
in w_i must be able to sufficiently discriminate among subjects who enter or do not enter
the sample. More importantly, the selection equation requires at least one (additional)
exogenous regressor which is not included in x_i. In other words, we impose an exclusion
condition on the main equation, and this additional regressor plays a similar role as an
instrument in case of IV regressions for treating endogeneity. Note that IV-estimation is
impossible when the regressors in the first stage are identical to those in the main equation
(because of perfect multicollinearity). In the Heckit approach it is feasible to set w_i=x_i
(because of the nonlinearity of the inverse Mills ratio, and the fact that a different number
of observations is used in the two equations) but not recommended (see Wooldridge, 2002,
p.564).
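A minimal R sketch of the two-step procedure on simulated data (the variable names are hypothetical; the R-package sampleSelection mentioned in footnote 64 provides a complete implementation):

# Heckman two-step ('Heckit') estimation on simulated data
set.seed(1)
n   <- 1000
w   <- rnorm(n)                        # exclusion restriction: w enters the selection equation only
x   <- rnorm(n)
u   <- rnorm(n)
eps <- 0.7 * u + rnorm(n, 0, 0.5)      # correlated errors -> selection bias
z   <- -0.3 + 1.0 * w + 0.5 * x + u    # latent selection variable
d   <- as.numeric(z > 0)
y   <- ifelse(d == 1, 1 + 0.8 * x + eps, NA)    # y observed only if d = 1

probit <- glm(d ~ w + x, family = binomial(link = "probit"))
zhat   <- predict(probit)                       # w_i' gamma-hat (linear predictor)
lambda <- dnorm(zhat) / pnorm(zhat)             # inverse Mills ratio

ols    <- lm(y ~ x, subset = d == 1)            # ignores selection
heckit <- lm(y ~ x + lambda, subset = d == 1)   # Heckman correction
summary(heckit)                                 # coefficient of lambda estimates rho * sigma_eps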
Example 26: We consider a well-known and frequently used dataset about female
labor force participation and wages, and replicate the results in Table 22.7 in Greene
(2003).64 A wage model can only be estimated for those 428 females who actually
have a job, so that wage data can be observed.65 One can view the absent wage
information for another 325 females in this dataset as the result of truncation: if the
offered wage is below the reservation wage, females are not actively participating in
the labor market.
Estimation results based on the available sample of 428 females may suffer from a
selection bias if unobserved effects in the wage and selection equations are correlated
(i.e. ρ ≠ 0; see above). Whether this bias results from so-called self-selection (i.e.
women's deliberate choice to participate in the labor market), or other sampling ef-
fects is irrelevant for the problem, but may be important for choosing regressors w_i.
The estimated coefficient of the inverse Mills ratio is given by −1.1. This can be inter-
preted as follows: women who have above average willingness (interest or tendency)
to work (i.e. z_i is above ẑ_i; u_i > 0) tend to earn below average wages (i.e. y_i is below ŷ_i;
ε_i < 0). This estimate is statistically insignificant which indicates that sample selection
may not play an important role in this example.
61 The procedure is often called 'Heckit', because of the combination of the name Heckman and
logit/probit models.
62 λ_i are also called 'generalized residuals' of a probit model. For the entire sample (not just the subsample
for which y is observed) they have mean zero and are uncorrelated with the regressors w_i.
63 It may be possible to assign a 'physical' meaning to this coefficient upon recalling that the slope in a
(simple) regression of y on x is given by ρ_yx σ_y/σ_x (see p.2). Thus, the ratio's coefficient is proportional to
the slope of a regression of ε_i on u_i (i.e. the slope is multiplied/scaled by σ_u).
64 Source and description of variables: https://rdrr.io/rforge/Ecdat/man/Mroz.html; this dataset
Mroz87 is also included in the R-package sampleSelection. sample-selection.R contains code for Heckit
estimates (two-stage and ML), as well as an extension which also deals with endogeneity.
65 For the purpose of this example we ignore the potential endogeneity associated with estimating a wage
equation (see example 20). See Wooldridge (2002, p.567) for a treatment of this case.
1.13 Duration models66
The purpose of duration analysis (also known as event history analysis or survival
analysis) is to analyze the length of time (the duration) of some phenomenon of interest
(e.g. the length of being unemployed or the time until a loan defaults). A straightforward
application of regression models using observed durations as the dependent variable is
inappropriate, however, because duration data are typically censored. This means that
the actual duration cannot be recorded for some elements of the sample. For example,
some people in the sample are still unemployed at the time of analysis, and it is unknown
when they are going to become employed again (if at all). We can only record the length of
the unemployment period at the time the observation is made. Such records are censored
observations and this fact must be taken into account in the analysis (see below). Two cases
are possible: the subject under study is still in the interesting state when measurements are
made, and it is unknown how long it will continue to stay in that state (right censoring).
Left censoring holds, if the subject has already been in the interesting state before the
beginning of the study, and it is unknown for how long.
We define the (continuous) variable T which measures the length of time spent in the
interesting state, or the time until the event of interest has occurred. The units of mea-
surement will usually be days, weeks or months, but T is not constrained to integer values.
The distribution of T is described by a cumulative distribution function

F(t) = P[T ≤ t] = ∫_0^t f(s) ds.
The survival or survivor function is the probability of being in the interesting state for
more than t units of time:

S(t) = 1 − F(t) = P[T > t].
We now consider the conditional probability of leaving the state of interest between t and
t+h conditional on having 'survived' until t:

P[t ≤ T < t+h | T ≥ t] = P[t ≤ T < t+h] / P[T ≥ t] = (F(t+h) − F(t)) / (1 − F(t)) = (F(t+h) − F(t)) / S(t).

This probability is used to define the hazard function

λ(t) = lim_{h→0} P[t ≤ T < t+h | T ≥ t] / h.
λ(t) does not have a straightforward interpretation. It may be viewed as an instantaneous
probability of leaving the state. However, to view it as a probability is not quite appro-
priate, since λ(t) can be greater than one (in fact, it has no upper bound). If we assume
that the hazard rate is a constant λ and assume that the event is repeatable, then λ is the
expected number of events per unit of time. Alternatively, a constant hazard rate implies
E[T] = 1/λ, which is the expected number of periods until the state is left.
66 Most of this section is based on Greene (2003), section 22.5 and Kiefer (1988).
The hazard rate can be expressed in terms of F(t), f(t) and S(t). Since

lim_{h→0} (F(t+h) − F(t)) / h = F'(t) = f(t),

the hazard rate can also be written as

λ(t) = f(t)/S(t)    or    f(t) = λ(t)S(t).

It can be shown that

F(t) = 1 − exp{ −∫_0^t λ(s) ds }.
A constant hazard rate λ(t)=γ corresponds to an exponential distribution F(t) = 1 − e^{−γt}. It
implies that the probability of leaving the interesting state during the next time interval
does not depend on the time spent in the state. This may not always be a realistic
assumption. Instead, assuming a Weibull distribution for T results in the hazard rate

λ(t) = γα t^{α−1}    α > 0, γ > 0.
When t_i is a right-censored observation, we only know that the actual duration is at
least t_i. As a consequence, the contribution to the likelihood is the probability that the
duration is longer than t_i, which is given by the survivor function S(t_i). Using the dummy
variable d_i=1 to indicate uncensored observations, the log-likelihood is defined as

ℓ(θ) = Σ_{i=1}^n d_i ln f(t_i; θ) + (1 − d_i) ln S(t_i; θ).
In other words, the likelihood of observing a duration of length ti depends on survival until
ti , and exiting the interesting state at ti . For censored cases, exiting cannot be accounted
for, and only survival until ti enters the likelihood.
In case of the Weibull distribution λ(t)=γα t^{α−1}, ln S(t) = −γ t^α, and the log-likelihood is
given by

ℓ(θ) = Σ_{i=1}^n d_i ln(γα t_i^{α−1}) − γ t_i^α.
An obvious extension of modeling durations makes use of explanatory variables67. This
can be done by replacing the constant γ by the term exp{x_i'β}.68 The resulting parameter
estimates can be interpreted in terms of exp{β_j Δ}, which is the factor by which the hazard
rate is multiplied if regressor j is increased ceteris paribus by Δ units.
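A minimal R sketch of ML estimation of the Weibull duration model with right censoring (simulated durations; the log-likelihood is the one derived above, parameterized in logs to keep α and γ positive):

# ML estimation of a Weibull duration model with right censoring (simulated data)
set.seed(1)
n     <- 400
alpha <- 1.3; gamma <- 0.05
t.all <- (-log(runif(n)) / gamma)^(1 / alpha)   # Weibull durations: S(t) = exp(-gamma t^alpha)
cens  <- runif(n, 0, 15)                        # censoring times
t.obs <- pmin(t.all, cens)
d     <- as.numeric(t.all <= cens)              # 1 = uncensored (exit observed)

negloglik <- function(p) {                      # p = log(c(alpha, gamma))
  a <- exp(p[1]); g <- exp(p[2])
  -sum(d * log(g * a * t.obs^(a - 1)) - g * t.obs^a)
}
exp(optim(c(0, -2), negloglik)$par)             # estimates of alpha and gamma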
The proportional hazards model (or Cox regression model) does not require any
assumption about the distribution of T. Rather than modeling the hazard rate as

λ(t, x_i) = λ_0(t) exp{x_i'β},

where λ_0(t) is the baseline hazard function (e.g. γα t^{α−1} for the Weibull model), the Cox
regression assumes that the ratio of the hazard rates of two individuals does not depend
upon time:

λ(t, x_i) / λ(t, x_j) = λ_0(t) exp{x_i'β} / (λ_0(t) exp{x_j'β}) = exp{x_i'β} / exp{x_j'β}.

Hence, there is no need to specify the baseline function λ_0(t). Cox defines a partial
likelihood estimator using the log-likelihood

ℓ(β) = Σ_{i=1}^n [ x_i'β − ln Σ_{j∈R_i} exp{x_j'β} ].
For a sample of n distinct exit times t1; : : : ; tn (i.e. considering uncensored cases only), the
risk set Ri contains all individuals whose exit time is at least ti (which includes censored
and uncensored cases).
Example 27: We consider a dataset about lung cancer from the North Central
Cancer Treatment Group.69 Ignoring (available) covariates and assuming a Weibull
distribution results in estimates of α̂=1.342 and γ̂=0.0003. The function survreg in
the R-package survival reports a constant term 6.054, which can be derived from
−ln(γ̂)/α̂.
Using the available regressors and maintaining the Weibull assumption shows that
sex, ph.ecog and ph.karno are significant covariates. The estimate for sex can be
converted to exp{−0.56}=0.571, which implies that the hazard rate for otherwise
identical observations is nearly halved when comparing a man (sex=1) to a female
(sex=2). The estimate 0.0235 for ph.karno implies that an increase of this variable
by (a typical change of) 10 units yields a factor of exp{10·0.0235}=1.265, i.e. an
approximately 25% increase in the hazard rate. Running a Cox regression yields very
similar parameter estimates.
67 In the context of hazard rate models the regressors are frequently called covariates.
68 Note that the function survreg in the R-package survival sets γ = exp{−x_i'β}, and applies a scaling
factor.
69 Source and description of variables:
https://www.rdocumentation.org/packages/survival/versions/2.41-2/topics/lung. Computations
and code can be found in lung.xlsx and lung.R.
2 Time Series Analysis
2.1 Financial time series
A financial time series is a chronologically ordered sequence of data observed on financial
markets. These include stock prices and indices, interest rates, exchange rates (prices for
foreign currencies), and commodity prices. Usually the subject of financial studies are
returns rather than prices. Returns summarize an investment irrespective of the amount
invested, and financial theories are usually expressed in terms of returns.
Log returns yt are calculated from prices pt using
y_t = ln p_t − ln p_{t−1} = ln(p_t/p_{t−1}).
This definition corresponds to continuous compounding. p_t is assumed to include dividend
or coupon payments. Simple returns rt are computed on the basis of relative price
changes:
r_t = (p_t − p_{t−1}) / p_{t−1} = p_t/p_{t−1} − 1.
This definition corresponds to discrete compounding. Log and simple returns are related
as follows:
y_t = ln(1 + r_t)    r_t = exp{y_t} − 1.
A Taylor series expansion of r_t shows that the two return definitions differ with respect
to second and higher order terms:

r_t = exp{y_t} − 1 = Σ_{i=0}^∞ y_t^i/i! − 1 = Σ_{i=1}^∞ y_t^i/i! = y_t + Σ_{i=2}^∞ y_t^i/i!.
The simple return of a portfolio of m assets is a weighted average of the simple returns
of individual assets
r_{p,t} = Σ_{i=1}^m w_i r_{it},
where wi is the weight of asset i in the portfolio. For log returns this relation only holds
approximately:
y_{p,t} ≈ Σ_{i=1}^m w_i y_{it}.
Some financial models focus on returns and their statistical properties aggregated over
time. Multi-period log returns are the sum of single-period log returns. The h-period
log return (ln p_t − ln p_{t−h}) is given by

ln p_t − ln p_{t−h} = ln(p_t/p_{t−1}) + ln(p_{t−1}/p_{t−2}) + ⋯ + ln(p_{t−h+1}/p_{t−h})
y_t(h) = y_t + y_{t−1} + ⋯ + y_{t−h+1}.
The corresponding expression for simple returns is

p_t/p_{t−h} = (p_t/p_{t−1})(p_{t−1}/p_{t−2}) ⋯ (p_{t−h+1}/p_{t−h})
1 + r_t(h) = (1 + r_t)(1 + r_{t−1}) ⋯ (1 + r_{t−h+1}) = ∏_{j=0}^{h−1} (1 + r_{t−j}).
The mean r̄ of simple returns (obtained from the same price series) is not equal to ȳ. An
approximate70 relation between the two means is

r̄ ≈ exp{ȳ + 0.5s²} − 1    ȳ ≈ ln(1 + r̄) − 0.5s²,    (40)

where s² is the (sample) variance of log returns:

s² = (1/(n−1)) Σ_{t=1}^n (y_t − ȳ)².
The square root of s2 is the (sample) standard deviation or volatility71. Examples 28
and 29 document the well-known fact that the variance (or volatility) of returns is not
constant over time (i.e. the heteroscedasticity of financial returns).
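Relation (40) can be verified numerically; a minimal R sketch with simulated i.i.d. normal log returns:

# numerical check of relation (40) between mean simple and mean log returns
set.seed(1)
y <- rnorm(2500, mean = 0.0004, sd = 0.01)   # log returns
r <- exp(y) - 1                              # corresponding simple returns
c(mean(r), exp(mean(y) + 0.5 * var(y)) - 1)  # r-bar vs. its approximation
c(mean(y), log(1 + mean(r)) - 0.5 * var(y))  # y-bar vs. its approximation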
Example 28: Figure 2 shows the stock prices of IBM72 and its log returns. Log
and simple returns cannot be distinguished in such graphs. Obvious features are the
erratic, strongly oscillating behavior of returns around the more or less constant mean,
and the increase in the volatility towards the end of the sample period.
Example 29: Figure 3 shows the daily log returns of IBM73 over a long period of time
(1962-1997). This series shows that temporary increases in volatility as in Figure 2
are very common. This phenomenon is called volatility clustering and can be found
in many return series.
70 The relation is exact if log returns are normally distributed (see section 2.1.2).
71 In the context of financial economics the term volatility is frequently used in place of the statistical
term standard deviation. Volatility usually refers to the standard deviation expressed in annual terms.
72 Source: Box and Jenkins (1976), p.526; see file ibm.wf1; daily data from 1961/5/17 to 1962/11/2; 369
observations.
73 Source: Tsay (2002), p.257; daily data from 1962/7/3 to 1997/12/31; 8938 observations; available from
http://faculty.chicagobooth.edu/ruey.tsay/teaching/fts2/.
Figure 2: Daily stock prices of IBM and its log returns 1961-1962.

Figure 3: Daily log returns of IBM 1962-1997.
Many financial theories and models assume that returns are normally distributed to facili-
tate theoretical derivations and applications. Deviations from normality can be measured
by the (sample) skewness

S = (1/n) Σ_{t=1}^n (y_t − ȳ)³ / s̃³.
Histogram and descriptive statistics of AMEX log returns (DLOG(AMEX), sample 1-209, 208 observations): mean 3.35e-05, median 0.000816, maximum 0.015678, minimum -0.020474, std. dev. 0.005477, skewness -0.934, kurtosis 5.057, Jarque-Bera 66.89 (p-value 0.000).
We now consider the log return in t and treat it as a random variable (denoted by Y_t;
y_t is the corresponding sample value or realization). μ and σ² are mean and variance of
the underlying population of log returns. Assuming that log returns are normal random
variables with Y_t ∼ N(μ, σ²) implies that (1+R_t)=exp{Y_t}, the simple, gross returns are
lognormal random variables with

E[1 + R_t] = exp{μ + σ²/2}    V[1 + R_t] = exp{2μ + σ²}(exp{σ²} − 1).

Another attractive feature of normal log returns is their behavior under temporal aggre-
gation. If single-period log returns are normally distributed Y_t ∼ N(μ, σ²), the multi-period
log returns are also normal with Y_t(h) ∼ N(hμ, hσ²). This property is called stability
(under addition). It does not hold for simple returns.
Many financial theories and models assume that simple returns are normal. There are
several conceptual difficulties associated with this assumption. First, simple returns have
a lower bound of −1, whereas the normal distribution extends to −∞. Second, multi-
period returns are not normal even if single-period (simple) returns are normal. Third,
a normal distribution for simple returns implies a normal distribution for prices, since
P_t=(1+R_t)P_{t−1}. Thus, a non-zero probability may be assigned to negative prices which
is generally not acceptable. These drawbacks can be overcome by using log returns rather
than simple returns. However, empirical properties usually indicate strong deviations from
normality for both simple and log returns.
As a consequence of the empirical evidence against the normality of returns various alterna-
tives have been suggested. The class of stable distributions has the desirable properties
of fat tails and stability under addition. One example is the Cauchy distribution with
density

f(y) = (1/π) · b / (b² + (y − a)²)    −∞ < y < ∞.

However, the variance of stable distributions does not exist, which causes difficulties for
almost all financial theories and applications.78 The Student t-distribution also has
fat tails if its only parameter (the degrees of freedom) is set to a small value. The
t-distribution is a frequently applied alternative to the normal distribution.79
The mixture of normal distributions approach assumes that returns are generated by
two or more normal distributions, each with a different variance. For example, a mixture
of two normal distributions80 is given by

y_t ∼ (1 − x) N(μ, σ1²) + x N(μ, σ2²),

where x is a Bernoulli random variable with P[x=1]=α. This accounts for the observation
that return volatility is not constant over time (see example 29). The normal mixture
model is based on the notion that financial markets are processing information. The
amount of information can be approximated by the variance of returns. As it turns out, the
mixture also captures non-normality. For instance, a mixture of a low variance distribution
(with high probability 1−α) and a large variance distribution (with low probability α) results
78 For details see Fielitz and Rozelle (1983).
79 For details see Blattberg and Gonedes (1974) or Kon (1984).
80 An example of simulated returns based on a mixture of three normal distributions can be found in the
file mixture of normal distributions.xls.
in a non-normal distribution with fat tails. Thus, if returns are assumed to be conditionally
normal given a certain amount of information, the implied unconditional distribution is
non-normal. Kon (1984) has found that between two and four normal distributions are
necessary and provide a better fit than t-distributions with degrees of freedom ranging
from 3.1 to 5.5.
2.1.3 Abnormal returns and event studies81
Financial returns can be viewed as the result of processing information. The purpose
of event studies is to test the statistical significance of events (mainly announcements)
on the returns of one or several assets. For example, a frequently analyzed event is the
announcement of a (planned) merger or takeover. This may be a signal about the value
of the firm which may be reflected in its stock price. Comparing returns before and after
the information becomes publicly available can be used to draw conclusions about the
relevance of this information.
Event studies typically consist of analyzing the effects of a particular type of information
or event across a large number of companies. This requires an alignment of individual
security returns relative to an event date (denoted by τ=0). In other words, a new time
index τ replaces the calendar time t such that τ=0 corresponds to the event date in each
case. The event window covers a certain time period around τ=0 and is used to make
comparisons with pre-event (or post-event) returns.

The effects of the event have to be isolated from effects that would have occurred irrespec-
tive of the event. For this purpose it is necessary to define normal and abnormal returns.
'Normal' refers to the fact that these returns would 'normally' be observed, either because
of other reasons than the event under study or if the event has no relevance. Normal
returns can be defined either on the basis of average historical returns or a regression
model. These estimates are obtained from the estimation window, which is a time pe-
riod preceding the event window. They serve as the expected or predicted returns during
the event window. Abnormal returns are the difference between normal and observed
returns during the event window.
Suppose that the estimation window ranges from τ=τ0+1 to τ1 (n1 observations), and the
event window ranges from τ=τ1+1 to τ2 (n2 observations) and includes the event date
τ=0. We will consider estimating abnormal returns for company i based on the market
model

y_{iτ} = α_i + β_i y_τ^{im} + ε_{iτ}    τ = τ0+1,…,τ1,

where the market return y_τ^{im} has a firm-specific superscript to indicate that the market
returns have been aligned to match the firm's event date. Given OLS estimates a_i and b_i
and observations for the market returns in the event window, we can compute n2 abnormal
returns

e_{iτ} = y_{iτ} − a_i − b_i y_τ^{im}    τ = τ1+1,…,τ2.
We define the n1×2 matrix X for firm i using n1 observations from the estimation window.
Each of its rows is given by (1  y_τ^{im}). A corresponding n2×2 matrix X0 is defined for the
event window and the subscript 0 refers to the index set (τ1+1,…,τ2). Given the OLS
estimates b_i=(a_i  b_i)' for the parameters of the market model, the vector of abnormal
returns for firm i is defined as

e_{i0} = y_{i0} − X0 b_i,
81 Most of this section is based on Chapter 4 in Campbell et al. (1997) where further details and references
on the event study methodology can be found. Other useful sources of information are the Event Study
Webpage http://web.mit.edu/doncram/www/eventstudy.html by Don Cram and the lecture notes by
Frank de Jong.
where y_{i0} is the vector of observed returns. From section 1.2.6 we know that E[e_{i0}]=0
(since X0 b_i is an unbiased estimate of y_{i0}), and its variance is given by

V[e_{i0}] = V_i = σ_i² I + σ_i² X0 (X'X)^{−1} X0'.

I is the n2×n2 identity matrix and σ_i² is the variance of disturbances ε_{iτ} in the market
model. The estimated variance V̂_i is obtained by using the error variance from the esti-
mation period

s_i² = e'e / (n1 − 2)    e = y − Xb_i

in place of σ_i².
Event studies are usually based on the null hypothesis that the event under consideration
has no impact on (abnormal) returns. Statistical tests can be based on the assumption that
abnormal returns are normally distributed and the properties just derived: e_{i0} ∼ N(0, V_i).
However, the information collected must be aggregated to be able to make statements and
draw conclusions about the event (rather than individual cases or observations). It is not
always known when an event will have an effect and how long it will last. Therefore abnor-
mal returns are cumulated across time in the event window. In addition, the implications
of the event are expressed in terms of averages across several firms which may potentially
be affected by the event. We start by considering the temporal aggregation.
If the event window consists of more than one observation we can define the cumulative
abnormal return for firm i by summing all abnormal returns from τ1+1 to τ:

c_{iτ} = ι_τ' e_{i0},

where the n2×1 vector ι_τ has ones from row one to row τ, and zeros elsewhere. The
estimated variance of c_{iτ} is given by

ι_τ' V̂_i ι_τ.

This variance is firm specific. To simplify the notation we define the variance in terms of

H = X0 (X'X)^{−1} X0'    V_i = σ_i² I + σ_i² H.
Note that H is firm-specific since X is different for each firm (the market returns contained
in X have to be aligned with the event time of firm i). The standard error of cumulative
abnormal returns across n2 periods for firm i is given by

se[c_i] = √( n2 s_i² + s_i² (ι'Hι) ).

The null hypothesis of zero abnormal returns can be tested using the standardized test
statistic

t_i^c = c_i / se[c_i].
Under the assumption that abnormal returns are jointly normal and serially uncorrelated
the test statistic t^c has a t-distribution with df = n1 − 2.
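A minimal R sketch of these computations for a single firm (simulated estimation- and event-window returns; event.xls contains the actual data used in example 32 below):

# cumulative abnormal return and its t-statistic for one firm (simulated data)
set.seed(1)
n1 <- 250; n2 <- 21                               # estimation and event window lengths
ym.est <- rnorm(n1, 0, 0.01); y.est <- 0.0002 + 0.9 * ym.est + rnorm(n1, 0, 0.008)
ym.evt <- rnorm(n2, 0, 0.01); y.evt <- 0.0002 + 0.9 * ym.evt + rnorm(n2, 0, 0.008) + 0.003

mm  <- lm(y.est ~ ym.est)                         # market model, estimation window
X   <- cbind(1, ym.est); X0 <- cbind(1, ym.evt)
e0  <- y.evt - X0 %*% coef(mm)                    # abnormal returns in the event window
s2  <- sum(residuals(mm)^2) / (n1 - 2)            # error variance from the estimation window

H    <- X0 %*% solve(t(X) %*% X) %*% t(X0)
iota <- rep(1, n2)                                # cumulate over the whole event window
car  <- sum(e0)
se   <- sqrt(n2 * s2 + s2 * (t(iota) %*% H %*% iota))
c(car, car / se)                                  # CAR and its t-statistic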
Event studies are frequently based on analyzing many firms which are all subject to the
same kind of event (usually at different points in calendar time). Under the assumption
that abnormal returns for individual firms are uncorrelated (i.e. the event windows do not
overlap) tests can be based on averaging cumulative abnormal returns across m firms and
the test statistic

t1 = c̄ / se[c̄] ∼a N(0, 1),

where

c̄ = (1/m) Σ_{i=1}^m c_i    se[c̄] = √( (1/m²) Σ_{i=1}^m se[c_i]² ).

Alternatively, the test statistic t_i^c can be averaged to obtain the test statistic

t2 = √( m(n1 − 4)/(n1 − 2) ) · (1/m) Σ_{i=1}^m t_i^c ∼a N(0, 1).
Example 32: We consider the case of two Austrian mining companies Radex and
Veitscher who were the subject of some rumors about a possible takeover. The first
newspaper reports about a possible 'cooperation' appeared on March 8, 1991. Similar
reports appeared through March 29. On April 16 it was officially announced that
Radex will buy a 51% share of Veitscher. The purpose of the analysis is to test
for abnormal returns associated with this event. Details can be found in the file
event.xls.

The estimation window consists of the three year period from January 25, 1988 to
January 24, 1991. We use daily log returns for the two companies and the ATX to
estimate the market model. The event window consists of 51 days (January 25 to
April 10). The cumulative abnormal returns start to increase strongly about 14 days
before March 8 and reach their peak on March 7. After that day cumulative abnormal
returns are slightly decreasing. Based on 51 days of the event period we find c1=0.25
and c2=0.17 for Radex and Veitscher, respectively. The associated t-statistics are
t1c=3.29 and t2c=2.54 which are both highly significant. Tests based on an aggregation
across the two companies are not appropriate in this example since they share the
same event window.
Exercise 18: Use the data from example 32 to test the significance of cumula-
tive abnormal returns for event windows ranging from January 25 to March 7
and March 15, respectively. You may also use other event windows that allow
for interesting conclusions.
2.1.4 Autocorrelation analysis of financial returns
The methods of time series analysis are used to investigate the dynamic properties of
a single realization y_t, in order to draw conclusions about the nature of the underlying
stochastic process Y_t, and to estimate its parameters. Before we define specific time
series models, and consider their estimation and forecasts, we briefly analyze the dynamic
properties of some financial time series.
Autocorrelation analysis is a standard tool for that purpose. The sample autocovariance82
and the sample autocorrelation

c_ℓ = (1/n) Σ_{t=ℓ+1}^n (y_t − ȳ)(y_{t−ℓ} − ȳ)    r_ℓ = c_ℓ/c_0 = c_ℓ/s̃²

can be used to investigate linear temporal dependencies in an observed series y_t. c_ℓ and
r_ℓ are sample estimates of γ_ℓ (13) and ρ_ℓ (14). If the underlying process Y_t is i.i.d.,
the sampling distribution of r_ℓ is r_ℓ ≈ N(−1/n, 1/n). This can be used to test individual
autocorrelations for significance (e.g. using the 95% confidence interval83 −1/n ± 1.96/√n).
Rather than testing individual autocorrelations the Ljung-Box statistic can be used to
test jointly that all autocorrelations up to lag p are zero:

Q_p = n(n+2) Σ_{ℓ=1}^p r_ℓ² / (n − ℓ).

Under the null hypothesis of zero autocorrelation in the population (ρ_1 = ⋯ = ρ_p = 0): Q_p ∼ χ²_p.
Example 33: The autocorrelations of IBM log returns in Figure 5 are negligibly small
(except for lags 6 and 9). The p-values of the Q-statistic (Prob and Q-Stat in Figure 5)
indicate that the log returns are uncorrelated. The situation is slightly different for
FTSE log returns. These autocorrelations are rather small but the correlations at lags
one, two and five are slightly outside the 95%-interval. The p-values of the Q-statistic
are between 0.01 and 0.05. Depending on the significance level we would either reject
or accept the null hypothesis of no correlation. We conclude that the FTSE log returns
are weakly correlated.
Assuming that returns are independent is stronger than assuming uncorrelated returns84.
However, testing for independence is not straightforward because it usually requires
specifying a particular type of dependence. Given that the variance of financial returns is
typically not constant over time, a simple test for independence is based on the autocor-
relations of squared or absolute returns.
Example 34: Figure 6 shows the autocorrelations of squared and absolute log returns
of IBM. There are many significant autocorrelations even at long lags. Thus we
82 This is a biased estimate of the autocovariance which has the advantage of yielding a positive semi-
definite autocovariance matrix. The unbiased estimate is obtained if the sum is divided by n − 1.
83 Usually the mean −1/n is ignored and ±1.96/√n is used as the 95% confidence interval.
84 Independence and uncorrelatedness are only equivalent if returns are normally distributed.
Figure 5: Autocorrelations of IBM (left panel; 368 observations) and FTSE (right panel; sample 1965:01-1990:12, 311 observations) log returns.

Figure 6: Autocorrelations of squared (left panel) and absolute (right panel) log returns
of IBM (368 observations).
conclude that the IBM log returns are uncorrelated but not independent. At the same
time the significant autocorrelations among squared and absolute returns point at
dependencies in (the variance of) returns.
In section 2.1 we have presented examples of volatility clustering. If the sign of returns is
ignored (either by considering squared or absolute returns), the correlation within clusters
is high. If the variance has moved to a high level it tends to stay there; if it is low it
tends to stay low. This explains that autocorrelations of absolute and squared returns are
positive for many lags.

Significant autocorrelation in squared or absolute returns is evidence for heteroscedasticity.
In this case the standard errors 1/√n are not appropriate to test the regular autocorre-
lations r_ℓ for significance. Corrected confidence intervals can be based on the modified
variance of the autocorrelation coefficient at lag ℓ:

(1/n) (1 + c_{y²}(ℓ)/s⁴),

where c_{y²}(ℓ) is the autocovariance of y_t² and s⁴ is the squared variance of y_t. The resulting
standard errors are larger than 1/√n if squared returns are positively autocorrelated
which is typical for financial returns. This leads to wider confidence intervals and to
more conservative conclusions about the significance of autocorrelations. If the modified
standard errors are used for testing log returns of the FTSE no autocorrelations in Figure 5
are significant (α=0.05).
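A minimal R sketch of these diagnostics (the return series y is simulated with a crude form of volatility clustering; acf() and Box.test() are base R functions):

# autocorrelations, Ljung-Box test and heteroscedasticity-adjusted bands
set.seed(1)
n <- 1000
h <- stats::filter(0.05 * rnorm(n)^2, 0.9, method = "recursive")  # crude volatility clustering
y <- sqrt(h) * rnorm(n)

r <- acf(y, lag.max = 10, plot = FALSE)$acf[-1]          # r_1, ..., r_10
Box.test(y, lag = 10, type = "Ljung-Box")                # Q_10

# modified variance of r_l: (1/n) * (1 + c_{y^2}(l) / s^4)
s4   <- var(y)^2
cy2  <- acf(y^2, lag.max = 10, plot = FALSE, type = "covariance")$acf[-1]
band <- 1.96 * sqrt((1 + cy2 / s4) / n)                  # lag-specific 95% bands
cbind(r, plain = 1.96 / sqrt(n), modified = band)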
Exercise 19: Use the log returns defined in exercise 17. Estimate and test
autocorrelations of regular, squared and absolute returns.
2.1.5 Stochastic process terminology
We briefly define some frequently used stochastic processes:

A white-noise process ε_t is a stationary and uncorrelated sequence of random numbers.
It may have mean zero (which is mainly assumed for convenience), but this is not essential.
The key requirement is that the series is serially uncorrelated; i.e. γ_ℓ = ρ_ℓ = 0 (∀ ℓ ≠ 0). If ε_t
is normally distributed and white-noise it is independent (Gaussian white-noise). If ε_t is
white-noise with constant mean and constant variance with a fixed distribution it is an
i.i.d. sequence85 (also called independent white-noise).
A martingale difference sequence (m.d.s.) Y_t is defined with respect to the infor-
mation I_t available at t. This could include any variables but typically only includes
Y_t: I_t = {Y_t, Y_{t−1}, …}. {Y_t}_{t=1}^∞ is a m.d.s. (with respect to I_{t−1}) if E[Y_t | Y_{t−1}, Y_{t−2}, …] = 0
(which implies E[Y_t]=0). Since white-noise only rules out linear dependence, a m.d.s. implies
stronger forms of independence than white-noise.
A random walk with drift is defined as

Y_t = Y_{t−1} + μ + ε_t    ε_t … white noise.

Ŷ_t denotes the conditional mean E[Y_t | Y_{t−1}, Y_{t−2}, …]. If the variance of ε_t is not constant over time,
the conditional variance is defined in a similar way

E[(Y_t − Ŷ_t)² | Y_{t−1}, Y_{t−2}, …] = V[ε_t | Y_{t−1}, Y_{t−2}, …] = σ_t².

In this case ε_t is uncorrelated (white noise) but not i.i.d. σ_t² is the conditional variance of
ε_t (i.e. the conditional expectation of ε_t²).
85 We will use the stronger i.i.d. property for a white-noise with constant variance (and distribution).
White-noise only refers to zero autocorrelation and need not have constant variance.
86 Campbell et al. (1997, p.31) distinguish three types of random walks depending on the nature of ε_t:
i.i.d. increments, independent (but not identically distributed) increments and uncorrelated increments.
87 Strictly speaking this definition also applies to white-noise, but the term mean reversion is mainly used
in the context of autocorrelated stationary processes.
88 An uncorrelated process would be written as Y_t = μ + ε_t.
2.2 ARMA models
We now introduce some important linear models for the conditional mean. An autore-
gressive moving-average (ARMA) process is a linear stochastic process which is com-
pletely characterized by its autocovariances γ_ℓ (or autocorrelations ρ_ℓ). Thus, various
ARMA models can be defined and distinguished by their (estimated) autocorrelations. In
practice the (estimated) autocorrelations r_ℓ from an observed time series are compared to
the known theoretical autocorrelations of ARMA processes. Based on this comparison a
time series model is specified. This is also called the identification step in the model
building process. After estimating its parameters diagnostic checks are used to confirm
that a suitable model has been chosen (i.e. the underlying stochastic process conforms to
the estimated model). ARMA models are only appropriate for stationary time series.
2.2.1 AR models
The first order autoregressive process AR(1)

Y_t = δ + φ_1 Y_{t−1} + ε_t    |φ_1| < 1

has exponentially decaying autocorrelations ρ_ℓ = φ_1^ℓ. ε_t is a white-noise process with mean
zero and constant variance σ². The condition |φ_1|<1 is necessary and sufficient for the
AR(1) process to be weakly stationary.

The unconditional mean of an AR(1) process is given by

E[Y_t] = μ = δ / (1 − φ_1).

For an AR(p) process the autoregressive polynomial 1 − φ_1 B − ⋯ − φ_p B^p can be written in terms
of its factors ∏_{i=1}^p (1 − w_i B), where w_i are the inverted roots of the polynomial, which may be complex valued. The AR
model is stationary if all inverted roots are less than one in absolute value (or, all inverted
roots are inside the unit circle).
Various autocorrelation patterns of an AR(p) process are possible. For instance, the
autocorrelations of an AR(2) model show sinusoidal decay if its inverted roots are complex.
In this case the underlying series has stochastic cycles.89
AR models imply non-zero autocorrelations for many lags. Nevertheless it may be sufficient
to use one or only a few lags of Y_t to define Ŷ_t. The number of necessary lags p can be
determined on the basis of partial autocorrelations φ_ℓℓ. φ_ℓℓ is the ℓ-th coefficient of
an AR(ℓ) model. It measures the effect of Y_{t−ℓ} on Y_t under the condition that the effects
from all other lags are held constant.
Partial autocorrelations can be determined from the solution of the Yule-Walker equa-
tions of an AR(p) process:

ρ_ℓ = φ_1 ρ_{ℓ−1} + φ_2 ρ_{ℓ−2} + ⋯ + φ_p ρ_{ℓ−p}    ℓ = 1,…,p.

For p=2 the solutions are given by (see Box and Jenkins, 1976, p.83)

φ_1 = ρ_1 (1 − ρ_2) / (1 − ρ_1²)    φ_2 = φ_22 = (ρ_2 − ρ_1²) / (1 − ρ_1²).
AR coefficients (and thereby, partial autocorrelations) can be obtained by solving the
Yule-Walker equations recursively for AR models of increasing order. The recursions (in
terms of autocorrelations ρ_ℓ) are given by (see Box and Jenkins, 1976, p.83)

φ_{p+1,p+1} = ( ρ_{p+1} − Σ_{ℓ=1}^p φ_{p,ℓ} ρ_{p+1−ℓ} ) / ( 1 − Σ_{ℓ=1}^p φ_{p,ℓ} ρ_ℓ ).
Before the model is estimated it is necessary to determine p and q. This choice can be
based upon comparing estimated (partial) autocorrelations to the theoretical (partial)
90 In general, the estimation procedure is iterative. Whereas lagged y_t are fixed explanatory variables
(in the sense that they do not depend on the coefficients to be estimated) the lagged values of e_t depend
on the parameters to be estimated. For details see Box and Jenkins (1976), p.208.
Table 2: Means and standard errors (in parentheses) across 10000 estimates of AR(1)
series for different sample sizes.

  φ1       n     c              ȳ             φ̂1
  0.990    50    0.413 (142)    0.497 (3.79)   0.886 (0.08)
           100   0.373 (93.3)   0.481 (4.33)   0.938 (0.04)
           200   0.538 (31.1)   0.410 (4.47)   0.964 (0.02)
  0.900    50    0.552 (3.75)   0.514 (1.24)   0.818 (0.09)
           100   0.501 (1.00)   0.505 (0.93)   0.859 (0.06)
           200   0.513 (0.70)   0.512 (0.68)   0.881 (0.04)
  0.500    50    0.502 (0.28)   0.502 (0.28)   0.448 (0.13)
           100   0.499 (0.20)   0.499 (0.20)   0.475 (0.09)
           200   0.499 (0.14)   0.499 (0.14)   0.487 (0.06)
  0.200    50    0.500 (0.18)   0.500 (0.18)   0.169 (0.14)
           100   0.499 (0.13)   0.499 (0.13)   0.182 (0.10)
           200   0.500 (0.09)   0.500 (0.09)   0.192 (0.07)
 -0.200    50    0.500 (0.12)   0.500 (0.12)  -0.208 (0.14)
           100   0.501 (0.09)   0.501 (0.09)  -0.204 (0.10)
           200   0.500 (0.06)   0.500 (0.06)  -0.203 (0.07)
 -0.500    50    0.501 (0.09)   0.501 (0.10)  -0.491 (0.12)
           100   0.500 (0.07)   0.500 (0.07)  -0.496 (0.09)
           200   0.500 (0.05)   0.500 (0.05)  -0.497 (0.06)
 -0.900    50    0.501 (0.07)   0.501 (0.08)  -0.871 (0.08)
           100   0.501 (0.05)   0.500 (0.05)  -0.884 (0.05)
           200   0.500 (0.04)   0.500 (0.03)  -0.892 (0.04)
autocorrelations in Table 1. Alternatively, model selection criteria like the Akaike infor-
mation criterion (AIC) or the Schwarz criterion (SC) can be used. AIC and SC are
based on the log-likelihood91 ℓ = ln L and the number of estimated parameters K (for an
ARMA(p,q) model with a constant term K = p+q+1):

AIC = −2ℓ/n + 2K/n    SC = −2ℓ/n + K ln(n)/n.
If the type of model cannot be uniquely determined from Table 1, several models are
estimated and the model with minimal AIC or SC is selected.
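A minimal R sketch computing AIC and SC as defined above for competing models of a simulated series (R's arima() reports its own AIC on a different scale, so the criteria are computed from logLik()):

# AIC and SC for competing ARMA models (simulated MA(2) series)
set.seed(1)
y <- arima.sim(list(ma = c(0.2, 0.15)), n = 500)
n <- length(y)

crit <- function(order) {
  fit <- arima(y, order = order)
  K   <- sum(order[c(1, 3)]) + 1                 # p + q + constant term
  ll  <- as.numeric(logLik(fit))
  c(AIC = -2 * ll / n + 2 * K / n, SC = -2 * ll / n + K * log(n) / n)
}
rbind(AR3 = crit(c(3, 0, 0)), MA2 = crit(c(0, 0, 2)), ARMA11 = crit(c(1, 0, 1)))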
2.2.5 Diagnostic checking of ARMA models
ARMA model building is not complete unless the residuals are white-noise. The conse-
quence of residual autocorrelation is inconsistency (see section 1.7.3). This can be shown
in terms of the simple model
Y_t = δ + φ Y_{t−1} + u_t    u_t = ρ u_{t−1} + ε_t.
91 ℓ is defined as

ℓ = −(n/2) [ 1 + ln(2π) + ln( (1/n) Σ_t e_t² ) ].
To derive E[Y_{t−1} u_t] we use

Y_{t−1} = μ + u_{t−1}/(1 − φB) = μ + u_{t−1}(1 + φB + φ²B² + ⋯) = μ + Σ_{i=0}^∞ φ^i u_{t−1−i}.

Hence

E[Y_{t−1} u_t] = E[ Σ_{i=0}^∞ φ^i u_{t−1−i} u_t ]
depends on the autocorrelations of u_t. E[Y_{t−1} u_t] will be non-zero as long as ρ ≠ 0, and this
will give rise to inconsistent estimates. Thus, it is essential that the model is specified
such that the residuals are white-noise. This requirement can also be derived from an
alternative viewpoint. The main purpose of a time series model is to extract all dynamic
features from a time series. This objective is achieved if the residuals are white-noise.
Autocorrelation of residuals can be removed by changing the model specification (mainly
by including additional AR or MA terms). The choice may be based on patterns in
(partial) autocorrelations of the residuals. AIC and SC can also be used to support the
decision about including additional lags. However, it is not recommended to include lags
that cannot be meaningfully interpreted. For instance, even if the coefficient of y_{t−11} is
'significant' in a model for daily returns this is a highly questionable result.
Indications about possible misspecifications can be derived from the inverted roots of the
AR and MA polynomials. If two inverted roots of the AR and the MA polynomial are
similar in magnitude, the model possibly contains redundant terms (i.e. the model order
is too large). This situation is known as overfitting. If the absolute value of one of
the inverted AR roots is close to or above 1.0 then the autoregressive term implies non-
stationary behavior. This indicates the need to take differences of the time series (we will
return to that point in section 2.3.3). An absolute value of one of the inverted MA roots
close to or above 1.0 indicates that the series is overdifferenced. Taking first differences
of a white-noise series y_t = ε_t leads to

Δy_t = ε_t − ε_{t−1}.

The resulting series Δy_t is 'best' described by an MA(1) model with θ_1 = −1.0. Its auto-
correlations can be shown to be decaying. However, a white-noise must not be differenced
at all, and it does not make sense to fit a model to Δy_t. Similar considerations hold for
stationary series in general: they must not be differenced.
If residuals are white-noise but not homoscedastic, modifications of the ARMA model
equation are not meaningful. Heteroscedasticity of the disturbances cannot be removed
with a linear time series model for the conditional mean. Models to account for residuals
that are not normally distributed and/or heteroscedastic will be introduced in section 2.5.
2.2.6 Example 35: ARMA models for FTSE and AMEX returns
Box and Jenkins (1976) have proposed a modeling strategy which consists of several steps.
In the identification step one or several preliminary models are chosen on the basis of
(partial) autocorrelations and the patterns in Table 1. After estimating each model the
are equal to zero. This would be the 'optimal' model according to AIC. However, the
model has inverted AR roots 0.52 ± 0.75i and MA roots 0.59 ± 0.73i which are very
similar. This situation is known as overfitting: too many, redundant parameters have been
estimated and a model with fewer coefficients is more appropriate. The ratio of MA and
AR polynomials (1 + 0.135B − 0.095B² − 0.014B³ + 0.094B⁴ − 0.086B⁵ + ⋯) has coefficients
which are rather small and similar to the MA(2) model. The significance tests of individual
coefficients for an overfitted model have very limited value.
We apply diagnostic checking to the residuals from the MA(2) and the AR(3) model. The
p-values of Q₁₀ to test for autocorrelation in residuals are 0.226 and 0.398, which indicates
that residuals are white-noise. For squared residuals the p-values of Q₅ are 0.03 (MA)
and 0.35 (AR); p-values of Q₁₀ are 0.079 (MA) and 0.398 (AR). Thus MA residuals are
not quite homoscedastic. The p-values of the JB-test are 0.0 for both models (S≈0.5 and
U≈13) which rejects normality. Thus the significance tests of the estimated parameters
may be biased.
The (partial) autocorrelations of AMEX log returns (see Figure 7) may be viewed to
suggest an MA(1) model. The estimated model is

y_t = 3.6·10⁻⁵ + 0.28e_{t−1} + e_t        s_e = 0.005239
      (0.94)     (0.0)

The residuals are white-noise but not normally distributed. The squared residuals are
correlated and indicate heteroscedasticity (i.e. the residuals are not independent).
Exercise 20: Use the log returns defined in exercise 17. Identify and estimate
suitable ARMA models and check their residuals.
2.2.7 Forecasting with ARMA models
Forecasting makes statements about the process Y_t at a future date t+τ on the basis of
information available at date t. The forecast Ŷ_{t,τ} is the conditional expected value

Ŷ_{t,τ} = E[Y_{t+τ} | Y_t, Y_{t−1}, …, ε_t, ε_{t−1}, …] = E[Y_{t+τ} | I_t]        τ = 1, 2, …

using the model equation. τ is the forecasting horizon.
Forecasts for future dates t+τ (τ = 1, 2, …) from the same date t are called dynamic (or
multi-step) forecasts. The one-step ahead forecast Ŷ_{t,1} is the starting point. The next
dynamic forecast Ŷ_{t,2} (for t+2) is also made in t and uses Ŷ_{t,1}. In general, a dynamic
forecast Ŷ_{t,τ} depends on all previous dynamic forecasts (see below). Static forecasts are
a sequence of one-step ahead forecasts Ŷ_{t,1}, Ŷ_{t+1,1}, … made at different points in time.
AR model forecasts
Dynamic AR(1) model forecasts are given by:

Ŷ_{t,1} = ν + φ₁Y_t
Ŷ_{t,2} = ν + φ₁E[Y_{t+1}|I_t] = ν + φ₁Ŷ_{t,1} = ν + φ₁(ν + φ₁Y_t)
Ŷ_{t,τ} = ν(1 + φ₁ + φ₁² + ⋯ + φ₁^{τ−1}) + φ₁^τ Y_t
lim_{τ→∞} Ŷ_{t,τ} = ν/(1 − φ₁) = μ.

Unknown future values Y_{t+1} are replaced by the forecasts Ŷ_{t,1}. Forecasts of AR(1) models
decay exponentially to the unconditional mean of the process μ. The rate of decay depends
on |φ₁|. Dynamic forecasts of stationary AR(p) models show a more complicated pattern
but also correspond to the autocorrelations. The forecasts converge to

ν/(1 − φ₁ − ⋯ − φ_p) = μ.
This process has two trend components: the deterministic trend μt and the stochastic
trend ω_t.
For a fixed, non-random initial value Y₀ the random-walk (with drift or without drift) has
the following properties:
1. E[Y_t] = μt + Y₀
2. V[Y_t] = tσ²
3. γ_k = (t − k)σ²
4. r_k decay very slowly (approximately linearly).
A random-walk is non-stationary since mean, variance and autocovariance depend on t.
Thus it is not mean-reverting and its (long-term) forecasts are given by

E[Ŷ_{t,τ}|Y_t] = τμ + Y_t.
A general class of integrated processes can be defined, if the differences Y_t − Y_{t−1} follow an
ARMA(p,q) process:

Y_t = Y_{t−1} + U_t
U_t = ν + φ₁U_{t−1} + ⋯ + φ_pU_{t−p} + θ₁ε_{t−1} + ⋯ + θ_qε_{t−q} + ε_t.

In this case Y_t is an ARIMA(p,1,q) (integrated ARMA) process and Y_t is called integrated
of order 1: Y_t ∼ I(1).
If Y_t is an ARMA(p,q) process after differencing d times so that

(1 − B)^d Y_t = U_t
U_t = ν + φ₁U_{t−1} + ⋯ + φ_pU_{t−p} + θ₁ε_{t−1} + ⋯ + θ_qε_{t−q} + ε_t,

Y_t is an ARIMA(p,d,q) process and Y_t ∼ I(d). Obviously, an ARIMA model for log prices
is equivalent to an ARMA model for log returns.
Forecasts of the ARIMA(0,1,1) process (1 − B)Y_t = ν + θ₁ε_{t−1} + ε_t are obtained by using the
same procedure as in section 2.2.7:

Ŷ_{t,1} = Y_t + ν + θ₁ε_t
Ŷ_{t,2} = Ŷ_{t,1} + ν = Y_t + 2ν + θ₁ε_t
Ŷ_{t,τ} = Y_t + τν + θ₁ε_t.

Forecasts of ARIMA(0,1,q) processes converge to a straight line with slope ν, where ν
corresponds to the expected value of ΔY_t. The transition to the straight line is described
by the MA parameters and corresponds to the cut-off pattern of autocorrelations.
The ARIMA(1,1,0) process (1 − φ₁B)(1 − B)Y_t = ν + ε_t can be written as

ΔY_t = ν + φ₁ΔY_{t−1} + ε_t        Y_t = Y_{t−1} + ν + φ₁ΔY_{t−1} + ε_t = Y_{t−1} + ΔY_t,

and dynamic forecasts are obtained as follows:

Ŷ_{t,1} = Y_t + ΔŶ_{t,1} = Y_t + ν + φ₁ΔY_t
Ŷ_{t,2} = Ŷ_{t,1} + ΔŶ_{t,2} = Y_t + ΔŶ_{t,1} + ΔŶ_{t,2} = Y_t + [ν + φ₁ΔY_t] + [ν(1 + φ₁) + φ₁²ΔY_t]
Ŷ_{t,3} = Y_t + ν + ν(1 + φ₁) + ν(1 + φ₁ + φ₁²) + (φ₁ + φ₁² + φ₁³)ΔY_t.
Box and Jenkins (1976, p.152) show that the forecasts approach a straight line:

lim_{τ→∞} Ŷ_{t,τ} = Y_t + τν/(1 − φ₁) + (Y_t − Y_{t−1})·φ₁/(1 − φ₁).
In general, forecasts of ARIMA(p,1,0) processes approach a straight line with slope
ν/(1 − φ₁ − ⋯ − φ_p), which is the expected value of ΔY_t. The transition to the straight line is
described by the AR parameters, and corresponds to the pattern of autocorrelations.
The process

Y_t = β₀ + βt + U_t

is a trend-stationary process. U_t is stationary but need not be white-noise. The process
Y_t evolves around a linear, deterministic trend in a stationary way. The appropriate
transformation to make this process stationary is to subtract the trend term β₀ + βt from
Y_t. Note that differencing a trend-stationary process does not only eliminate the trend
but also affects the autocorrelations of ΔY_t:

Y_t − Y_{t−1} = β + U_t − U_{t−1}.
where ŷ_{t,i} are out-of-sample forecasts from the ARMA model.95 Dynamic forecasts
ŷ_{t,τ} converge to the constant c = 0.0078 but the changes in the index do not converge
to a constant:

p̂_{t,τ} − p̂_{t,τ−1} = p_t·[exp{τ(c + 0.5s_e²)} − exp{(τ−1)(c + 0.5s_e²)}].
and the critical values from Table 4 (column 'with trend'). If H0 is not rejected, y_t is
concluded to be integrated with a drift corresponding to c = −β̂/γ̂ (assuming that γ̂ < 0 in any
finite sample). If H0 is rejected, y_t is assumed to be trend-stationary with slope c = −β̂/γ̂.
If a series shows no clear trends, a unit-root test can be used to decide whether the series
is stationary or integrated without a drift. The integrated process

Y_t = Y_{t−1} + W_t        W_t … stationary

can be written as

Y_t = Y₀ + ∑_{i=1}^t W_i,
σ̂_e² is an estimate of the residual variance that accounts for autocorrelation as in the
Newey-West estimator. The asymptotic critical values of the KPSS statistic are tabulated
in Kwiatkowski et al. (1992, p.166). For α = 0.05 the critical value under the null of
stationarity is 0.463, and 0.146 for trend stationarity.
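Both tests are available in standard software. A sketch using the adfuller and kpss functions from statsmodels (assumed API); the regression argument selects whether a constant only or a constant and a trend are included.

from statsmodels.tsa.stattools import adfuller, kpss  # assumed available

def unit_root_checks(y, trend=False):
    # ADF test (H0: unit root) and KPSS test (H0: stationarity)
    reg = "ct" if trend else "c"
    adf_stat, adf_p, used_lags, nobs, adf_crit, _ = adfuller(y, regression=reg, autolag="AIC")
    kpss_stat, kpss_p, kpss_lags, kpss_crit = kpss(y, regression=reg)
    return {"ADF": (adf_stat, adf_p, adf_crit), "KPSS": (kpss_stat, kpss_p, kpss_crit)}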
Example 40: A unit-root test of the spread between long- and short-term inter-
est rates in the UK99 leads to ambiguous conclusions. The estimated value of γ is
−0.143679 and gives the impression that φ₁ = 1 + γ is sufficiently far away from one.
The t-statistic of γ̂ is −3.174306. Although this is below the critical value at a 5%
significance level (−2.881), it is above the critical value for α = 0.01 (−3.4758). There-
fore the unit-root hypothesis can only be rejected for high significance levels and it
remains unclear, whether the spread can be considered stationary or not. However,
given the low power of unit-root tests it may be appropriate to conclude that the
spread is stationary. The KPSS test confirms this conclusion since the test statistic is
far below the critical values. Details can be found in the file spread.wf1.
Example 41: We consider a unit-root test of the AMEX index (see file amex.wf1).
Since the index does not follow a clear trend, we do not include a trend term in the
test equation. We use p=1 since the coefficients ĉ_i (i>1) are insignificant (initially
p=6 (≈209^{1/3}) was chosen). The estimate for γ is −0.0173 and has the expected
negative sign. The t-statistic of γ̂ is −1.6887, and is clearly above all critical val-
ues in Table 4. It is also above the critical values provided by EViews. Therefore
the unit-root hypothesis cannot be rejected and the AMEX index is assumed to be
integrated (of order one). This is partially confirmed by the KPSS test. The test
statistic 0.413 exceeds the critical value only at the 10% level, but stationarity can-
not be rejected for lower levels of α. To derive the implied mean of Δy_t from the
estimated equation Δŷ_t = −0.0173y_{t−1} + 7.998 + 0.331Δy_{t−1} we reformulate the equa-
tion as ŷ_t = (1 − 0.0173 + 0.331)y_{t−1} + 7.998 − 0.331y_{t−2}, and the implied mean is given
by 7.998/0.0173 ≈ 462.
Example 42: We consider a unit-root test of the log of the FTSE index (see file
ftse.wf1). We use only data from 1978 to 1986 since during this period it is not clear
whether the series has a drift or is stationary around a linear trend. This situation
requires to include a trend term in the test equation. The estimated equation is
99 Source: http://www.lboro.ac.uk/departments/ec/cup/data.html; 'Yield on 20 Year UK Gilts'
(long; file R20Q.txt) and '91 day UK treasury bill rate' (short; file RSQ.htm); the spread is the differ-
ence between long and short; quarterly data from 1952 to 1988; 148 observations.
Δŷ_t = −0.163y_{t−1} + 0.868 + 0.0023t. The t-statistic of γ̂ is −3.19. This is above the 1%
and 5% critical values in Table 4 and slightly below the 10% level. Therefore the unit-
root hypothesis cannot be rejected, and the log of the FTSE index can be assumed
to be integrated (of order one). This is confirmed by the KPSS test where the test
statistic exceeds the critical value, and stationarity can be rejected. Since augmented
terms are not necessary, the log of the index can be viewed as a random walk with
drift approximately given by 0.0023/0.163 = 0.014.
Exercise 22: Consider the ADF test equation (46) and p=1. Show that the
implied sample mean of Δy_t is given by −ν̂/γ̂.
Exercise 23: Use annual data on the real price-earnings ratio from the file
pe.wf1 (source: http://www.econ.yale.edu/~shiller/data/chapt26.xls).
Test the series for a unit-root. Irrespective of the test results, fit stationary
and non-stationary models to the series using data until 1995. Compute out-
of-sample forecasts for the series using both types of models.
2.4 Diffusion models in discrete time
Several areas of finance make extensive use of stochastic processes in continuous time.
However, data is only available in discrete time, and the empirical analysis has to be done
in discrete time, too. In this section we focus on the relation between continuous and
discrete time models.
Review 11:100 A geometric Brownian motion (GBM) is defined as

dP_t = μP_t dt + σP_t dW_t        dP_t/P_t = μ dt + σ dW_t,

where W_t is a Wiener process with the following properties:
1. ΔW_t = Z_t·√Δt where Z_t ∼ N(0,1) (standard normal) and ΔW_t ∼ N(0, Δt).
2. The changes over distinct (non-overlapping) intervals are independent101.
3. W_t ∼ N(0, t) if W₀ = 0.
4. W_t evolves in continuous time and has no jumps (no discontinuities). However,
its sample paths are not smooth but rather erratic.
5. The increments of W_t can be viewed as the counterpart of a discrete time white-
noise process (with mean zero and unit variance if Δt=1), and W_t corresponds
to a discrete time random-walk.
A GBM is frequently used to describe stock prices and implies non-negativity of the
price P_t. μ and σ can be viewed as mean and standard deviation of the simple return
R_t = dP_t/P_t. This return is measured over an infinitely small time interval dt and is
therefore called instantaneous return. The (instantaneous) expected return is given
by

E[dP_t/P_t] = E[μ dt + σ dW_t] = μ dt.

The (instantaneous) variance is given by

V[dP_t/P_t] = V[μ dt + σ dW_t] = σ²V[dW_t] = σ² dt.

Both mean and standard deviation are constant over time. μ and σ are usually
measured in annual terms.
The standard or arithmetic Brownian motion defined as

dX_t = μ dt + σ dW_t        (X_{t+Δt} − X_t) ∼ N(μΔt, σ²Δt)

is not suitable to describe stock prices since X_t can become negative.
A process that is frequently used to model interest rates is the Ornstein-Uhlenbeck
process

dX_t = κ(θ − X_t)dt + σ dW_t.

This is an example of a mean reverting process. When X_t is above (below) θ it tends
back to θ at a speed determined by the mean-reversion parameter κ > 0. The square
root process

dX_t = κ(θ − X_t)dt + σ√X_t dW_t
100 Campbell et al. (1997), p.341 or Baxter and Rennie (1996), p.44.
101 Because of the normality assumption it is sufficient to require that changes are uncorrelated.
is also used to model interest rates. It has the advantage that Xt cannot become
negative.
A very general process is the Ito process

dX_t = μ(X_t, t) dt + σ(X_t, t) dW_t,

where mean and variance can be functions of X_t and t.
Review 12:102 If X_t is an Ito process then Ito's lemma states that a function
G_t = f(X_t, t) can be described by the stochastic differential equation (SDE)

dG_t = [μ(·)f′_X + f′_t + ½σ²(·)f″_X] dt + σ(·)f′_X dW_t,

where

f′_X = ∂G_t/∂X_t        f′_t = ∂G_t/∂t        f″_X = ∂²G_t/∂X_t².
Example 43: Suppose the stock price P_t follows a GBM. We are interested in the
process for the logarithm of the stock price. We have

G_t = ln P_t        μ(·) = μP_t        σ(·) = σP_t        ∂G_t/∂P_t = 1/P_t        ∂G_t/∂t = 0        ∂²G_t/∂P_t² = −1/P_t².

Applying Ito's lemma we obtain

d ln P_t = [μP_t·(1/P_t) − 0.5σ²P_t²·(1/P_t²)] dt + σP_t·(1/P_t) dW_t,
d ln P_t = (μ − 0.5σ²) dt + σ dW_t.

Thus, the log stock price ln P_t is an arithmetic Brownian motion with drift μ − 0.5σ²,
if P_t is a GBM with drift μ.
where ȳ corresponds to (μ − 0.5σ²)Δt, and s_e² (or s_y²) to σ²Δt. To estimate the
GBM parameters μ and σ (which are usually given in annual terms) the observation
frequency of y_t (which corresponds to Δt) has to be taken into account. We suppose
that the time interval between t and t−1 is Δt and is measured in years (e.g. Δt=1/52
for weekly data).
4. d ln P_t can be interpreted as the instantaneous log return of P_t. The (instantaneous)
mean of the log return d ln P_t is μ − 0.5σ². However, when we compare equations (41),
p.92 and (47) we find a discrepancy. The mean of log returns Y_t in section 2.1.2 is
given by ln(1+m) − 0.5σ_Y² whereas the mean of Y_t(Δt) is given by (μ − 0.5σ²)Δt. This
can be explained by the fact that ln(1+mΔt) → m·dt as Δt → dt.
The sample estimates from log returns (ȳ and s²) correspond to (μ − 0.5σ²)Δt and σ²Δt,
respectively. Thus estimates of μ and σ² are given by

σ̂² = s²/Δt        μ̂ = ȳ/Δt + 0.5σ̂² = ȳ/Δt + 0.5s²/Δt.

Gourieroux and Jasiak (2001, p.289) show that the asymptotic variance of σ̂² and μ̂ is
given by

aV[σ̂²] = 2σ⁴/n        aV[μ̂] = σ²/(nΔt) + σ⁴/(2n).
By increasing the sampling frequency more observations become available (n increases),
but Δt becomes accordingly smaller. The net effect is that nΔt stays constant, the first
term in the definition of aV[μ̂] does not become smaller as n increases, and the drift cannot
be consistently estimated.
Example 45: In example 31 the mean FTSE log return ȳ estimated from monthly
data was 0.00765 and the standard deviation s was 0.065256. The estimated mean
and variance of the underlying GBM in annual terms are given by

σ̂² = 0.065256²·12 = 0.0511        μ̂ = 0.00765·12 + 0.5·0.0511 = 0.117346.
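A small Python sketch of these estimators (the function name is illustrative); the same arithmetic applied to the monthly moments of example 45 (ȳ = 0.00765, s = 0.065256, Δt = 1/12) reproduces σ̂² ≈ 0.051 and μ̂ ≈ 0.117.

import numpy as np

def gbm_estimates(log_returns, dt):
    # estimate GBM parameters (annual terms) from log returns observed at intervals dt (in years)
    ybar = np.mean(log_returns)
    s2 = np.var(log_returns, ddof=1)
    sigma2_hat = s2 / dt                    # sigma^2 = s^2 / dt
    mu_hat = ybar / dt + 0.5 * sigma2_hat   # mu = ybar/dt + 0.5*sigma^2
    return mu_hat, sigma2_hat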
We now consider estimating the parameters of the Ornstein-Uhlenbeck process using a
discrete time series. A simplified discrete time version of the process can be written as

X_t − X_{t−Δt} = κθΔt − κΔt·X_{t−Δt} + σZ_t√Δt
X_t = κθΔt + (1 − κΔt)X_{t−Δt} + σZ_t√Δt.

This is equivalent to an AR(1) model (using the notation from section 2.2)

X_t = ν + φ₁X_{t−1} + ε_t,
where ν corresponds to κθΔt, φ₁ to (1 − κΔt), and σ_ε² to σ²Δt. The Ornstein-Uhlenbeck
process is only mean reverting (or stationary) if κ > 0. This corresponds to the condi-
tion |φ₁| < 1 for AR(1) models. Thus it is useful to carry out a unit-root test before the
parameters κ, θ and σ are estimated.
Given an observed series x_t we can fit the AR(1) model

x_t = c + f₁x_{t−1} + e_t

and use the estimates c, f₁ and s_e to estimate κ, θ and σ (in annual terms):

κ̂ = (1 − f₁)/Δt        θ̂ = c/(κ̂Δt) = c/(1 − f₁)        σ̂ = s_e/√Δt.

Since estimated AR coefficients are biased105 downwards in small samples, κ̂ will be biased
upwards.
A precise discrete time formulation is given by (see Gourieroux and Jasiak, 2001, p.289)

X_t = θ(1 − exp{−κΔt}) + exp{−κΔt}X_{t−Δt} + σ*Z_t√Δt,

where

σ* = σ·[(1 − exp{−2κΔt})/(2κΔt)]^{1/2}.

Using this formulation the parameters are estimated by

κ̂ = −ln f₁/Δt        θ̂ = c·exp{κ̂Δt}/(exp{κ̂Δt} − 1)        σ̂ = (s_e/√Δt)·[(1 − exp{−2κ̂Δt})/(2κ̂Δt)]^{1/2}.

Note that f₁ has to be positive in this case.
Example 46: In example 40 we have found that the spread between long- and short-
term interest rates in the UK is stationary (or mean reverting). We assume that
the spread follows an Ornstein-Uhlenbeck process. The estimated AR(1) model using
quarterly data is106

x_t = 0.1764 + 0.8563x_{t−1} + e_t        s_e = 0.8696

which yields the following estimates in annual terms (Δt=1/4):

κ̂ = (1 − 0.8563)/Δt = 0.575        θ̂ = 0.1764/(0.575Δt) = 1.227        σ̂ = 0.8696/√Δt = 1.74.

Using the precise formulation we obtain

κ̂ = −ln 0.8563/Δt = 0.62        θ̂ = c·exp{0.62Δt}/(exp{0.62Δt} − 1) = 1.227
σ̂ = (s_e/√Δt)·[(1 − exp{−2·0.62Δt})/(2·0.62Δt)]^{1/2} = 1.613.
105 The bias increases as the AR parameter approaches one, or as the mean reversion parameter κ
approaches zero.
106 Details can be found in the file ornstein-uhlenbeck.xls.
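The following Python sketch (assuming statsmodels for the AR(1) regression; all other names are illustrative) implements both the simple and the 'precise' estimators given above.

import numpy as np
import statsmodels.api as sm  # assumed available

def ou_estimates(x, dt):
    # fit x_t = c + f1*x_{t-1} + e_t by OLS and translate into annualized OU parameters
    res = sm.OLS(x[1:], sm.add_constant(x[:-1])).fit()
    c, f1 = res.params
    s_e = np.sqrt(res.scale)                # residual standard error

    # simple discretization
    kappa = (1.0 - f1) / dt
    theta = c / (1.0 - f1)
    sigma = s_e / np.sqrt(dt)

    # 'precise' formulation (requires f1 > 0)
    kappa_p = -np.log(f1) / dt
    theta_p = c * np.exp(kappa_p * dt) / (np.exp(kappa_p * dt) - 1.0)
    sigma_p = s_e / np.sqrt(dt) * np.sqrt((1.0 - np.exp(-2.0 * kappa_p * dt)) / (2.0 * kappa_p * dt))
    return (kappa, theta, sigma), (kappa_p, theta_p, sigma_p)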
2.4.3 Probability statements about future prices
We now focus on longer time intervals and consider price changes over T periods (e.g. 30
days). The T-period log return is the change in log prices between t and t+T. Thus the
log return is normally distributed107 with mean and variance

E[ln P_{t+T}] − ln P_t = E[Y_t(T)] = (μ − 0.5σ²)T        V[Y_t(T)] = σ²T.

Equivalently, P_{t+T} is lognormal and ln P_{t+T} is normally distributed:

ln P_{t+T} ∼ N(ln P_t + (μ − 0.5σ²)T, σ²T).
Conditional on P_t the expected value of P_{t+T} is

E[P_{t+T}|P_t] = P_t·exp{μT}.

The discrepancy between this formula and equation (45), p.119 used to forecast prices in
section 2.3.2 can be reconciled by noting that here μ is the mean of simple returns. The
corresponding discrete time series model for log returns y_t is

y_t = μ_y + e_t        e_t ∼ N(0, s²)

and the conditional expectation of p_{t+T} is

E[p_{t+T}|p_t] = p_t·exp{μ_y T + 0.5s²T}.
A (1−α) confidence interval for the price in t+T can be computed from the properties of
T-period log returns. The boundaries of the interval for log returns are given by

(μ − 0.5σ²)T ± |z_{α/2}|σ√T,

and the boundaries for the price P_{t+T} are given by108

P_t·exp{(μ − 0.5σ²)T ± |z_{α/2}|σ√T}.
Example 47: December 28, 1990 the value of the FTSE was 2160.4 (according to
finance.yahoo.com). We use the estimated mean and variance from example 45 to
compute a 95% confidence interval for the index in nine months (end of September
1991) and ten years (December 2000).109
Using σ̂² = 0.05 and μ̂ = 0.117 the interval for T=0.75 is given by110

2160.4·exp{(0.117 − 0.5·0.05)·0.75 ± 1.96·√0.05·√0.75} = [1584, 3383]

and for T=10

2160.4·exp{(0.117 − 0.5·0.05)·10 ± 1.96·√0.05·√10} = [1356, 21676].

Note: the actual values of the FTSE were 2621.7 (September 30, 1991) and 6222.5
(December 29, 2000).
107 The normal assumption for log returns cannot be justified empirically unless the observation frequency
is low.
108 Note that the bounds are not given by E[P_{t+T}] ± |z_{α/2}|·√V[P_{t+T}].
109 Details can be found in the file probability statements.xls.
110 We use rounded values of the estimates μ̂ and σ̂.
We now consider probabilities like P[P_{t+T} ≤ K], where K is a pre-specified, non-stochastic
value (e.g. the strike price in option pricing).
Given that log returns over T periods are normally distributed, Y_t(T) ∼ N((μ − 0.5σ²)T, σ²T),
probability statements about P_{t+T} can be based on the properties of Y_t(T):

P[P_t·exp{Y_t(T)} ≤ K] = P[P_{t+T} ≤ K] = P[Y_t(T) ≤ ln(K/P_t)].

For instance, the probability that the price in t+T is less than K is given by

P[Y_t(T) ≤ ln(K/P_t)] = Φ( [ln(K/P_t) − (μ − 0.5σ²)T] / (σ√T) ).

Similar probabilities are used in the Black-Scholes option pricing formula, and can be used
in a heuristic derivation of that formula111.
Example 48: We use the information from example 47 to compute the probability
that the FTSE will be below K=2000 in September 1991.

P[P_{t+T} ≤ K] = Φ( [ln(2000/2160.4) − (0.117 − 0.5·0.05)·0.75] / (√0.05·√0.75) ) = 0.225.
Exercise 24:
1. Use a time series from exercise 17 (stock price, index or exchange rate).
Assume that this series follows a GBM and estimate the parameters μ
and σ (in annual terms).
2. Select a time series that appears to be mean-reverting. Verify this as-
sumption by a unit-root test. Assume that this series follows an Ornstein-
Uhlenbeck process and estimate the parameters κ, θ and σ.
Ŷ_{t−1,1} is the conditional expectation (or the one-step ahead forecast) of Y_t derived
from a time series or regression model. I_{t−1} is the information set available at time
t−1. If Y_t is white-noise Ŷ_{t−1,1} = μ.
In a GARCH model the variance of the disturbance term ε_t is not constant but the
conditional variance is time-varying:
Many empirical investigations found that GARCH(1,1) models are sufficient (see e.g.
Bollerslev et al., 1992).
where σ_t² can be defined in terms of a GARCH(p,q) model. The log-likelihood is a straight-
forward extension of equation (15) in section 1.4. It is obtained by replacing the constant
variance σ² by the conditional variance σ_t² from the GARCH equation.
Diagnostic checking of a GARCH model is based on the standardized residuals ẽ_t = e_t/s_t.
The GARCH model is adequate if ẽ_t, ẽ_t² and |ẽ_t| are white-noise, and ẽ_t is normal.
2.5.2 Example 49: ARMA-GARCH models for IBM and FTSE returns
In example 33 we found that IBM log returns are white-noise and example 34 indicated
heteroscedasticity of returns. Therefore we estimate a GARCH(1,1) model with constant
mean:

y_t = 0.0002 + e_t        s_t² = 9.6·10⁻⁶ + 0.27e²_{t−1} + 0.72s²_{t−1}
      (0.75)               (0.002)      (0.0)          (0.0)

However, ẽ_t is not white-noise (r₁ = 0.138 and Q₁ = 0.008). Therefore we add AR(1) and
MA(1) terms to the conditional mean equation. AIC and SC select the following model:

y_t = 0.0003 + 0.1e_{t−1} + e_t        s_t² = 7.8·10⁻⁶ + 0.24e²_{t−1} + 0.75s²_{t−1}
      (0.67)    (0.078)                (0.003)      (0.0)          (0.0)
Adding the term e²_{t−2} to the variance equation is supported by AIC but not by SC. The
standardized residuals and their squares are white-noise (the p-values of Q₅ are 0.75 and
0.355, respectively). The JB-test rejects normality but the kurtosis is only 4.06 (the
skewness is −0.19), whereas skewness and kurtosis of observed returns are −0.6 and 8.2.
We conclude that the GARCH model explains a lot of the non-normality of IBM log
returns.
The conditional standard deviation st from this model captures the changes in the volatil-
ity of residuals et very well (see Figure 9). The conditional mean is very close to the
113 An example can be found in the file AR-GARCH ML estimation.xls.
[Figure 9: residuals and conditional standard deviation]
unconditional mean, and thus residuals and returns are almost equal (compare the re-
turns in Figure 2 to the residuals in Figure 9).
We extend the MA(2) model for FTSE log returns from example 2.2.6 and fit the MA(2)-
GARCH model

y_t = 0.0074 + 0.06e_{t−1} − 0.15e_{t−2} + e_t
      (0.021)   (0.4)       (0.01)
s_t² = 0.0003 + 0.099e²_{t−1} + 0.82s²_{t−1}.
      (0.11)    (0.016)        (0.0)
The p-values of the MA coefficients have changed compared to example 2.2.6. The first
MA parameter h₁ is clearly insignificant and could be removed from the mean equation.
In example 2.2.6 we found that MA residuals were not normal and not homoscedastic.
Since p-values are biased in this case, we expect that adding a GARCH equation which
accounts for non-normality and heteroscedasticity should affect the p-values.
The standardized residuals of the MA-GARCH model are white-noise and homoscedastic
but not normal. If the conditional normal assumption does not turn out to be adequate a
different conditional distribution has to be used (e.g. a t-distribution).
Exercise 25: Use the ARMA models from exercise 20, estimate ARMA-
GARCH models, and carry out diagnostic checking.
2.5.3 Forecasting with GARCH models
GARCH models can be used to determine static and dynamic variance forecasts of a time
series. The GARCH(1,1) forecasting equation for future dates t+τ is

σ²_{t,1} = ω₀ + ω₁ε²_t + β₁σ²_t
σ²_{t,2} = ω₀ + ω₁ε²_{t+1} + β₁σ²_{t,1} = ω₀ + ω₁ε²_{t+1} + β₁(ω₀ + ω₁ε²_t + β₁σ²_t).

The unknown future value ε²_{t+1} in this equation is replaced by the conditional expectation
E[ε²_{t+1}|I_t] = σ²_{t,1}:

σ²_{t,2} = ω₀ + ω₁σ²_{t,1} + β₁(ω₀ + ω₁ε²_t + β₁σ²_t) = ω₀ + (ω₁ + β₁)σ²_{t,1}.

Thus, the variance for t+2 can be determined on the basis of ε_t and σ_t². The same
procedure can be applied recursively to obtain forecasts for any τ:

σ²_{t,τ} = ω₀ + (ω₁ + β₁)σ²_{t,τ−1}.
For increasing τ the forecasts σ²_{t,τ} converge to the unconditional variance σ² from equation
(48), provided (ω₁ + β₁) < 1. The time until the level of the unconditional variance is reached
depends on the GARCH parameters, the value of the last residual in the sample, and the
difference between the unconditional variance and the conditional variance in t (when the
forecast is made).
We finally note that, in general, the variance of h-period returns y_t(h) estimated from a
GARCH model will differ from the (frequently used) unconditional estimate h·σ̂² which is
based on homoscedastic returns. The h-period variance is given by the sum

σ²_t(h) = ∑_{τ=1}^h σ²_{t,τ}.
GARCH models can be extended in various ways, and numerous formulations of the
variance equation exist. In the threshold ARCH (TARCH) model, for instance, asym-
metric effects of news on the variance can be taken into account. In this case the variance
equation has the following form:

σ_t² = ω₀ + ω₁ε²_{t−1} + γ₁ε²_{t−1}d_{t−1} + β₁σ²_{t−1},
where d_t = 1 (d_t = 0) if ε_t < 0 (ε_t ≥ 0). If γ₁ > 0 negative disturbances have a stronger effect on
the variance than positive ones. The exponential GARCH (EGARCH) model also
allows for modelling asymmetric effects. It is formulated in terms of the logarithm of σ_t²:

ln σ_t² = ω₀ + γ₁·(ε_{t−1}/σ_{t−1}) + ω₁·|ε_{t−1}/σ_{t−1}| + β₁·ln σ²_{t−1}.
114 http://www.riskmetrics.com/mrdocs.html.
115 The EWMA variance during the out-of-sample period is based on observed returns, while the dynamic
GARCH variance forecasts do not use any data at all from that period.
Figure 10: GARCH and EWMA (0.95) estimates and forecasts of the variance of FTSE log
returns.
If ε_{t−1} < 0 (ε_{t−1} > 0) the total impact of ε_{t−1}/σ_{t−1} on the conditional (log) variance is given
by γ₁ − ω₁ (γ₁ + ω₁). If bad news have a stronger effect on volatility the expected signs are
γ₁ + ω₁ > 0 and γ₁ < 0.
As a further extension, explanatory variables can be included in the variance equation.
Some empirical investigations show that the number or the volume of trades have a sig-
nificant effect on the conditional variance (see Lamoureux and Lastrapes, 1990). After
including such explanatory variables the GARCH parameters frequently become smaller
or insignificant.
In the GARCH-in-the-mean (GARCH-M) model the conditional variance or standard
deviation is used as an explanatory variable in the equation for the conditional mean:

Y_t = μ + λσ_t² + ε_t,

where any GARCH model can be specified for σ_t². A significant parameter λ would support
the hypothesis that expected returns of an asset contain a risk premium that is proportional
to the variance (or standard deviation) of that asset's returns. However, according to
financial theory (e.g. the CAPM) the risk premium of an asset has to be determined in
the context of a portfolio of many assets.
Exercise 26: Use the log returns defined in exercise 17 and estimate a
TARCH model to test for asymmetry in the conditional variance.
Obtain a daily financial time series from finance.yahoo.com and retrieve the
trading volume, too. Add volume as explanatory variable to the GARCH
equation. Hint: Rescale the volume series (e.g. divide by 10⁶ or a greater
number), and/or divide by the price or index to convert volume into number
of trades.
Use the log returns defined in exercise 17 and estimate a GARCH-M model.
3 Vector time series models
3.1 Vector-autoregressive models
3.1.1 Formulation of VAR models
Multivariate time series analysis deals with more than one series and accounts for feedback
among the series. The models can be viewed as extensions or generalizations of univariate
ARMA models. A basic model of multivariate analysis is the vector-autoregressive
(VAR) model.116
VAR models have their origin mainly in macroeconomic modeling, where simultaneous
(structural) equation models developed in the fifties and sixties turned out to have inferior
forecasting performance. There were also concerns about the validity of the theories
underlying the structural models. Simple, small-scale VAR models were found to provide
suitable tools for analyzing the impacts of policy changes or external shocks. VAR models
are mainly applied in the context of Granger causality tests and impulse-response analyses
(see Greene, 2003, p.592). In addition, they are the basis for vector error correction models
(see section 3.2).
The standard form or reduced form of a first order VAR model – VAR(1) – for two
processes Y_t and X_t is given by

Y_t = ν_y + φ_{yy}Y_{t−1} + φ_{yx}X_{t−1} + ε_{yt}
X_t = ν_x + φ_{xy}Y_{t−1} + φ_{xx}X_{t−1} + ε_{xt},

where ε_{yt} and ε_{xt} are white-noise disturbances which may be correlated. A VAR(1) process
can be written in matrix form as

Y_t = V + Φ₁Y_{t−1} + ε_t        ε_t ∼ N(0, Σ),

where Y_t is a column vector which contains all k series in the model. V is a vector of
constants. Φ₁ is a k×k matrix containing the autoregressive coefficients for lag 1. ε_t is a
column vector of disturbance terms assumed to be normally distributed with covariance
Σ. In the two-variable VAR(1) model formulated above Y_t, V, Φ₁ and ε_t are given by

Y_t = [Y_t; X_t]        V = [ν_y; ν_x]        Φ₁ = [φ_{yy}, φ_{yx}; φ_{xy}, φ_{xx}]        ε_t = [ε_{yt}; ε_{xt}].
Σ is related to the correlation matrix of disturbances C and the vector of standard errors
σ by Σ_{ij} = C_{ij}σ_iσ_j.
The moving average (MA) representation of a VAR(1) model exists, if the VAR process
is stationary. This requires that all eigenvalues of Φ₁ have modulus less than one (see
Lutkepohl, 1993, p.10). In this case

Y_t = μ + ∑_{i=0}^∞ Φ₁^i ε_{t−i} = μ + ∑_{i=0}^∞ Ψ_i ε_{t−i},
116 The general case of vector ARMA models will not be presented in this text; see Tsay (2002), p.322 for
details.
where Φ₁^i denotes the matrix power of order i, Ψ_i is the MA coefficient matrix for lag i,
and μ = (I − Φ₁)⁻¹V. The autocovariance of Y_t for lag ℓ is given by

∑_{i=0}^∞ Φ₁^{ℓ+i} Σ (Φ₁^i)′ = ∑_{i=0}^∞ Ψ_{ℓ+i} Σ Ψ_i′.
Extensions to higher order VAR models are possible (see Lutkepohl, 1993, p.11).
The VAR model in standard form only contains lagged variables on the right hand side.
This raises the question whether and how contemporaneous dependencies between Y_t and
X_t are taken into account. To answer this question we consider the following example:

Y_t = ω₀X_t + ω₁X_{t−1} + α₁Y_{t−1} + U_t
X_t = β₁X_{t−1} + W_t.

These equations can be formulated as a VAR(1) model in structural form117:

[1, −ω₀; 0, 1]·[Y_t; X_t] = [α₁, ω₁; 0, β₁]·[Y_{t−1}; X_{t−1}] + [U_t; W_t].
The structural form may include contemporaneous relations represented by the coefficient
matrix on the left side of the equation. Substituting X_t from the second equation into the
first equation yields

Y_t = (ω₀β₁ + ω₁)X_{t−1} + α₁Y_{t−1} + ω₀W_t + U_t,

or in matrix form:

[Y_t; X_t] = [α₁, ω₀β₁ + ω₁; 0, β₁]·[Y_{t−1}; X_{t−1}] + [ω₀W_t + U_t; W_t].

Formulating this VAR(1) model in reduced form

[Y_t; X_t] = [φ_{yy}, φ_{yx}; φ_{xy}, φ_{xx}]·[Y_{t−1}; X_{t−1}] + [ε_{yt}; ε_{xt}]

yields the following identities:

φ_{yy} = α₁        φ_{yx} = ω₀β₁ + ω₁        φ_{xy} = 0        φ_{xx} = β₁
σ_y² = ω₀²σ_W² + σ_U²        σ_x² = σ_W²        cov[ε_{yt}, ε_{xt}] = ω₀σ_W².
Thus, if Y_t and X_t are contemporaneously related the disturbance terms ε_{yt} and ε_{xt} of
the reduced form are correlated. This correlation depends on the parameter ω₀ in the
structural equation. In example 51 the correlation between the residuals is 0.41, which
can be used to estimate the parameter ω₀. In general, it is not possible to uniquely
determine the parameters of the structural form from the (estimated) parameters of a VAR
model in reduced form. For this purpose, suitable assumptions about the dependencies in
the structural form must be made.
117 Note that appropriate estimation of structural forms depends on the specific formulation. For example,
if Y_t also appeared as a regressor in the equation for X_t, separately estimating each equation would lead
to inconsistent estimates because of the associated endogeneity (simultaneous equation bias). The same
applies in the present formulation if U_t and W_t are correlated.
3.1.2 Estimating and forecasting VAR models
The joint estimation of two or more regression equations (system of equations) is beyond
the scope of this text. In general, possible dependencies across equations need to be taken
into account using GLS or ML. As a major advantage, VAR models in reduced form can be
estimated by applying least-squares separately to each equation of the model. OLS yields
consistent and asymptotically efficient estimates. None of the series in a VAR model is
exogenous as defined in a regression context. A necessary condition is that the series are
stationary (i.e. ARt has to hold), and the residuals in each equation are white-noise. If the
residuals are autocorrelated, additional lags are added to the model. The number of lags
can also be selected on the basis of information criteria like AIC or SC. No precautions
are necessary if the residuals are correlated across equations. Since a VAR model can be
viewed as a seemingly unrelated regression (SUR) with identical regressors, OLS has the
same properties as GLS (see Greene, 2003, p.343).
The VAR model should only include variables with the same order of integration. If the
series are integrated the VAR model is fitted to (first) differences.118 In section 3.2 we
will present a test for integration of several series that can be interpreted as a multivariate
version of the DF test.
Lags with insignificant coefficients are usually not eliminated from the VAR model. This
may have a negative effect on the forecasts from VAR models since (in most cases) too
many parameters are estimated. This inefficiency leads to an unnecessary increase in the
variance of forecasts. However, if some coefficients are restricted to zero, least-square
estimates are not efficient any more. In this case, the VAR model can be estimated by
(constrained) maximum likelihood119.
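A sketch of this equation-by-equation least-squares estimation using only numpy (in practice a dedicated routine such as the VAR class in statsmodels would be used; the function name here is illustrative).

import numpy as np

def fit_var_ols(Y, p):
    # Y is an (n x k) array; estimate a reduced-form VAR(p) by OLS, one equation at a time.
    # Because the regressors are identical in every equation, this amounts to a single
    # multivariate least-squares fit; B has one row of coefficients per equation.
    n, k = Y.shape
    rows = []
    for t in range(p, n):
        lags = [Y[t - i] for i in range(1, p + 1)]
        rows.append(np.concatenate([[1.0], *lags]))
    X = np.array(rows)                  # constant and p lags of all k series
    Z = Y[p:]                           # left-hand side observations
    B, *_ = np.linalg.lstsq(X, Z, rcond=None)
    residuals = Z - X @ B
    return B.T, residuals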
Figure 11: VAR(2) model for one-month (Y 1M) and five-year (Y 5Y) interest rates.
[EViews output: sample (adjusted) 1964:04–1993:12, 357 observations, t-statistics in parentheses;
equations D(Y_1M) and D(Y_5Y).]
Figure 12: Out-of-sample forecasts of one-month (Y 1M) and five-year (Y 5Y) interest rates
using the VAR model in Figure 11 (estimation until 12/93).
Example 51: We consider the monthly interest rates of US treasury bills for maturi-
ties of one month (y_t^{1M}, Y 1M) and five years (y_t^{5Y}, Y 5Y)120. Both series are integrated
and we fit a VAR(2) model to the first differences. The VAR(2) model was selected
by AIC. The significance of estimated parameters can be used to draw conclusions
about the dependence structure among the series. The estimation results in Figure 11
show a feedback relationship. The one-month rate depends on the five-year rate and
the five-year rate depends on the one-month rate (with a lag of two periods). However,
the dependence of the one-month rate on the five-year rate is much stronger (as can
be seen from R²).
Figure 12 shows dynamic out-of-sample forecasts (starting in January 1994) of the
two interest rates. The forecasts converge rapidly to weakly ascending and descending
linear trend lines. Their slope is determined by the (insignificant) constant terms.
120 Source: CRSP database, Government Bond file; see file us-tbill.wf1; monthly data from January
1964 to December 1993; 360 observations.
Exercise 27: Use the data in the file ccva.wf1 which is taken from Campbell
et al. (2003). Fit a VAR model using all series in the file and interpret the
results.
Fit a VAR model using only data from 1893 to 1981. Obtain dynamic forecasts
for all series until 1997 and interpret the results.
3.2 Cointegration and error correction models
Time series models for integrated series are usually based on applying ARMA or VAR
models to (first) differences. However, it was frequently argued that differencing may
eliminate valuable information about the relationship among integrated series. We now
consider the case that two or more integrated series are related in terms of differences and
levels.
3.2.1 Cointegration
Two121 processes Y_t and X_t are cointegrated of first order if
1. each process is integrated of order one122 and
2. Z_t = Y_t − α − βX_t is stationary: Z_t ∼ I(0).

Y_t = α + βX_t + Z_t        (49)

is the cointegration regression or cointegrating equation.
Suppose there is an equilibrium relation between Y_t and X_t. Then Z_t represents the extent
of disequilibrium in the system. If Z_t is not stationary, it can move 'far away' from zero
'for a long time'. If Z_t is stationary, Z_t will 'stay close' to zero or frequently return to
zero (i.e. it is mean-reverting). This is consistent with the view that both processes are
controlled by a common (unobserved) stationary process. This process prevents Y_t and
X_t from moving 'too far away' from each other.
123 For details and an empirical example see Brooks et al. (2001).
124 Futures are standardized, transferable, exchange-traded contracts that require delivery of a commodity,
bond, currency, or stock index, at a specified price, on a specified future date. Unlike options, futures convey
an obligation to buy.
3.2.3 Example 53: The expectation hypothesis of the term structure
The (unbiased) expectation hypothesis of the term structure of interest rates (EHT)
states that investors are risk neutral, and bonds with different maturities are perfect
substitutes. Accordingly, interest rate differentials cannot become too large since otherwise
arbitrage opportunities would exist. In efficient markets such possibilities are quickly
recognized and lead to a corresponding reduction of interest rate differentials. This is
true even if the assumption of risk neutrality is dropped and liquidity premia are taken
into account.125 According to the EHT, a long-term interest rate can be expressed as a
weighted average of current and expected short-term interest rates. Let R_t(τ) be the spot
rate of a zero bond with maturity τ > 1 and S_t = R_t(1) a short-term rate (e.g. the one-month
rate). The EHT states that

R_t(τ) = (1/τ)·∑_{j=0}^{τ−1} E[S_{t+j}|I_t] + λ(τ),

where λ(τ) is a time-invariant but maturity dependent term premium. For instance, the
relation between three- and one-month interest rates is given by

R_t(3) = (1/3)·(S_t + E_t[S_{t+1}] + E_t[S_{t+2}]) + λ(3).
If we consider the spread between the long and the short rate we find

R_t(3) − S_t = (1/3)·(E_t[S_{t+1} − S_t] + E_t[S_{t+2} − S_t]) + λ(3).

Usually, interest rates are considered to be integrated processes. Thus, the terms on the
right hand side are (first and higher order) differences of integrated processes and should
therefore be stationary. This implies that the spread R_t(3) − S_t is also stationary since
both sides of the equation must have the same order of integration.
More generally, we now consider the linear combination β₁R_t(3) + β₂S_t which can be writ-
ten as (ignoring the term premium)

β₁R_t(3) + β₂S_t = (β₁ + β₂)S_t + (β₁/3)·(E_t[S_{t+1} − S_t] + E_t[S_{t+2} − S_t]).

The linear combination β₁R_t(3) + β₂S_t will only be stationary if the non-stationary series
(β₁ + β₂)S_t drops from the right-hand side. Thus, the right hand side will be station-
ary if β₁ + β₂ = 0, e.g. if β₁ = 1 and β₂ = −1. Empirically, the EHT implies that the resid-
uals from the cointegration regression between R_t(3) and S_t should be stationary and
Z_t ≈ R_t(3) − S_t.
125 For theoretical details see Ingersoll (1987), p.389; for an empirical example see Engsted and Tanggaard
(1994).
3.2.4 The Engle-Granger procedure
Engle and Granger (1987) have developed an approach to specify and estimate error
correction models which is only based on least-square regressions. The procedure consists
of the following steps:
1. Test whether each series is integrated of the same order.
2. Estimate the cointegration regression (49) and compute z_t = y_t − c − bx_t. In general,
fitting a regression model to the levels of integrated series may lead to the so-called
spurious regression problem126. However, if cointegration holds, the parameter
estimate b converges (with increasing sample size) faster to β than usual (this is
also called super-consistency). If a VAR model is fitted to the levels of integrated
series a sufficient number of lags should be included, such that the residuals are
white-noise. This should avoid the spurious regression problem.
3. Test whether z_t is stationary. For that purpose use an ADF test without intercept
since z_t has zero mean:

Δz_t = g·z_{t−1} + ∑_{j=1}^p c_j Δz_{t−j} + e_t.
The t-statistic of g must not be compared to the usual critical values (e.g. those in
Table 4 or those supplied by EViews). Since zt is an estimated rather than observed
time series, the critical values in Table 5127 must be used. These critical values also
depend on k (the number of series which are tested for cointegration).
If zt is stationary we conclude that yt and xt are cointegrated, and a VEC model
for the cointegrated time series is estimated. If zt is integrated a VAR model using
differences of y_t and x_t is appropriate. A code sketch of steps 2 and 3 follows below.
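A Python sketch of steps 2 and 3 (assuming statsmodels for the regressions; the resulting t-statistic still has to be compared with the critical values in Table 5, which are not built into the code).

import numpy as np
import statsmodels.api as sm  # assumed available

def engle_granger(y, x, p=1):
    # step 2: cointegration regression y_t = c + b*x_t + z_t
    coint = sm.OLS(y, sm.add_constant(x)).fit()
    z = np.asarray(coint.resid)

    # step 3: ADF-type regression on z_t without an intercept (z has zero mean)
    dz = np.diff(z)
    regressors = [z[p:-1]]                          # z_{t-1}
    for j in range(1, p + 1):
        regressors.append(dz[p - j:len(dz) - j])    # lagged differences of z
    adf_reg = sm.OLS(dz[p:], np.column_stack(regressors)).fit()
    t_stat_g = adf_reg.tvalues[0]                   # compare with Table 5
    return coint.params, t_stat_g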
Example 54: We illustrate the Engle-Granger procedure by using the two interest
series y_t = y_t^{1M} and x_t = y_t^{5Y} from example 51. The assignment of the symbols y_t and x_t
to the two time series is only used to clarify the exposition. It implies no assumptions
about the direction of dependence, and usually128 has no effect on the results. Details
can be found in the file us-tbill.wf1.
Both interest rate series are assumed to be integrated, although the ADF test statistic
for y_t^{1M} is −2.98, which is less than the critical value −2.87 at the 5% level. The
OLS estimate of the cointegration regression is given by y_t = −0.845 + 0.92x_t + z_t. The
t-statistic of g in a unit-root test of z_t (using p=1) is −4.48. No direct comparison
with the values in Table 5 is possible (n=360, k=2, p=1). However, −4.48 is far less
than the critical values in case of n=200 and α=0.01, so that the unit-root hypothesis
for z_t can be rejected at the 1% level. We conclude that z_t is stationary and there is
cointegration among the two interest series.
The estimated VEC model is presented in Figure 13. The upper panel of the table
shows the cointegration equation and defines z_t (CointEq1): z_t = y_t − 0.932x_t + 0.926.
This equation is estimated by maximum likelihood, and thus differs slightly from the
126 For details see Granger and Newbold (1974).
127 Source: Engle and Yoo (1987), p.157.
128 For details see Hamilton (1994), p.589.
Table 5: Critical values of the ADF t-statistic for the cointegration test.

                 p=0                        p=4
 k    n     0.01    0.05    0.10      0.01    0.05    0.10
 2   50    −4.32   −3.67   −3.28     −4.12   −3.29   −2.90
 2  100    −4.07   −3.37   −3.03     −3.73   −3.17   −2.91
 2  200    −4.00   −3.37   −3.02     −3.78   −3.25   −2.98
 3   50    −4.84   −4.11   −3.73     −4.54   −3.75   −3.36
 3  100    −4.45   −3.93   −3.59     −4.22   −3.62   −3.32
 3  200    −4.35   −3.78   −3.47     −4.34   −3.78   −3.51
 4   50    −4.94   −4.35   −4.02     −4.61   −3.98   −3.67
 4  100    −4.75   −4.22   −3.89     −4.61   −4.02   −3.71
 4  200    −4.70   −4.18   −3.89     −4.72   −4.13   −3.83

p is the number of lags in the ADF regression. k is the number
of series. α is the significance level.
OLS estimates mentioned above. The lower panel shows the error correction model.
p=2 was based on the results of the VAR model in example 51. Both (changes in)
interest rates depend significantly on the error correction term z_{t−1}. Thus, the changes
of each time series depend on the interest rate levels, and differences between their
levels, respectively. The dependencies on past interest rate changes already known
from example 51 are confirmed.
The negative sign of the coefficient −0.1065 of z_{t−1} in the equation for y_t^{1M} can be
interpreted as follows. If the five-year interest rates are much greater than the inter-
est rates for one month, z_{t−1} is negative (according to the cointegration regression
z_t = y_t^{1M} − 0.932y_t^{5Y} + 0.926). Multiplication of this negative value with the negative co-
efficient −0.1065 has a positive effect (c.p.) on the expected changes in y_t^{1M}, and
therefore leads to increasing short-term interest rates. This implies a tendency to
reduce (or correct) large differences in interest rates. These results agree with the
EHT. In efficient markets spreads among interest rates cannot become too large. The
positive coefficient 0.041 in the equation for y_t^{5Y} can be interpreted in a similar way. A
negative z_{t−1} leads to negative expected changes (c.p.) in y_t^{5Y}, and therefore leads to
a decline of the long-term interest rates. In addition, these corrections depend on past
changes of both interest rates. Whereas the dependence on lagged changes could be
called short-term adjustment, the response to z_{t−1} is a long-term adjustment effect.
Figure 14 shows out-of-sample forecasts (starting January 1994) of the two interest
rate series using the VEC model from Figure 13. In contrast to forecasts based on the
VAR model (see Figure 12), these forecasts do not diverge. This may be explained by
the additional error correction term.
The Engle-Granger procedure has two drawbacks. First, if k > 2 at most (k−1) cointe-
gration relations are (theoretically) possible. It is not straightforward how to test for
cointegration in this case. Second, even when k=2 the cointegration regression between
y_t and x_t can also be estimated in reverse using

x_t = c′ + b′y_t + z_t′.
Figure 13: Cointegration regression and error correction model for one-month (Y 1M) and
five-year (Y 5Y) interest rates.
[EViews output: sample (adjusted) 1964:04–1993:12, 357 observations, t-statistics in parentheses.
Cointegrating equation: Y_1M(−1) 1.000000, Y_5Y(−1) −0.932143 (−9.54095), C 0.925997.]
Figure 14: Out-of-sample forecasts of one-month (Y 1M) and five-year (Y 5Y) interest rates
using the VEC model in Figure 13 (estimation until 12/93).
In principle, the formulation is arbitrary. However, since z_t and z_t′ are not129 identical,
unit-root tests can lead to different results130. Engle and Granger suggest to test both vari-
ants.131
Exercise 28: Choose two time series which you expect to be cointegrated.
Use the Engle-Granger procedure to test the series for cointegration. Depend-
ing on the outcome, fit an appropriate VAR or VEC model to the series, and
interpret the results.
129 Suppose we estimate the equation y = b₀ + bx + e. The estimated slope is given by b = s_yx/s_x². If we
estimate x = c₀ + cy + u (reverse regression) the estimate c will not be equal to 1/b. c = s_yx/s_y², which is
different from 1/b except for the special case s_y² = s_yx²/s_x².
130 Unit-root tests of z_t and z_t′ are equivalent only asymptotically.
131 For details see Hamilton (1994), p.589.
3.2.5 The Johansen procedure
The Johansen procedure132 can be used to overcome the drawbacks of the Engle-Granger
approach. In addition, it offers the possibility to test whether a VEC model, a VAR model
in levels, or a VAR model in (first) differences is appropriate.
The Johansen approach is based on a VAR(p+1) model of k (integrated) variables:

Y_t = V + Φ₁Y_{t−1} + ⋯ + Φ_{p+1}Y_{t−p−1} + ε_t.

This model can be reformulated to obtain the following VEC representation:

ΔY_t = V + ΠY_{t−1} + ∑_{i=1}^p C_i ΔY_{t−i} + ε_t,        (51)

where

Π = ∑_{i=1}^{p+1} Φ_i − I        C_i = −∑_{j=i+1}^{p+1} Φ_j,

and I is a k×k unit matrix. Comparing equation (51) to the ADF test regression

Δy_t = ν + γy_{t−1} + ∑_{i=1}^p c_i Δy_{t−i} + ε_t
The Johansen test involves estimating133 the VEC model (51) and testing how many
eigenvalues of the estimated matrix Π are significant. Two different types of tests are
available. Their critical values are tabulated and depend on p (the number of lags in the
VEC model) and on assumptions about constant terms. To determine the order p of the
VEC model VAR models with increasing order are fitted to the levels of the series. p is
chosen such that a VAR(p+1) model fitted to the levels has minimum AIC or SC. Setting
p larger than necessary is less harmful than choosing a value of p that is too small. If
a level VAR(1) has minimum AIC or SC (i.e. p=0) this may indicate that the series are
stationary. In this case the Johansen test can be carried out using p=1 to confirm this
(preliminary) evidence.
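In practice the test is carried out in econometric software. A sketch using the coint_johansen function from statsmodels (assumed API); the det_order argument selects among the assumptions about deterministic terms discussed below (−1: no deterministic terms, 0: constant, 1: linear trend), and p is the number of lagged differences in the VEC model.

from statsmodels.tsa.vector_ar.vecm import coint_johansen  # assumed available

def johansen_rank(Y, p, det_order=-1):
    # trace test: increase the rank as long as the statistic exceeds the 95% critical value
    result = coint_johansen(Y, det_order, p)
    rank = 0
    for r in range(Y.shape[1]):
        if result.lr1[r] > result.cvt[r, 1]:   # lr1: trace statistics, cvt: critical values
            rank = r + 1
        else:
            break
    return rank, result.lr1, result.cvt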
The following five assumptions about constant terms and trends in the cointegrating equa-
tion (49) and in the error correction model (50) can be distinguished:134
1. There are no constant terms in the cointegrating equation and the VEC model:
α = ν_y = ν_x = 0.135
2. The cointegrating equation has a constant term α ≠ 0, but the VEC model does not
have constant terms: ν_y, ν_x = 0.136
3. The cointegrating equation and the VEC model have constant terms: α, ν_y, ν_x ≠ 0.137
ν_y, ν_x ≠ 0 is equivalent to assuming a 'linear trend in the data' because a constant
term in the VEC model for ΔY_t corresponds to a drift in the levels Y_t.
4. The cointegrating equation has a constant and a linear trend (Y_t = α + δt + βX_t + Z_t).
This case accounts for the possibility that the imbalance between Y_t and X_t may
linearly increase or decrease. Accordingly, the difference in the levels need not nec-
essarily approach zero or α, but may change in a deterministic way. The VEC model
has constant terms: ν_y, ν_x ≠ 0.138
5. The cointegrating equation has a constant term α ≠ 0 and a linear trend. The VEC
model has constants (ν_y, ν_x ≠ 0) and a linear trend. The presence of a linear trend
in addition to the drift corresponds to a quadratic trend in the level of the series.139
The conclusions about cointegration will usually depend on the assumptions about con-
stant terms and trends. This choice may be supported by inspecting graphs of the series or
by economic reasoning (for instance, a quadratic trend in interest rates may be excluded
133 For details about the maximum likelihood estimation of VEC models see Hamilton (1994), p.635.
134 We only consider the simplest case of two series.
135 EViews: VAR assumes no deterministic trend in data: No intercept or trend in CE or test VAR.
136 EViews: Assume no deterministic trend in data: intercept (no trend) in CE - no intercept in VAR.
137 EViews: Allow for linear deterministic trend in data: Intercept (no trend) in CE and test VAR.
138 EViews: Allow for linear deterministic trend in data: Intercept and trend in CE - no trend in VAR.
139 EViews: Allow for quadratic deterministic trend in data: Intercept and trend in CE - linear trend in VAR.
a priori). If it is difficult to decide which assumption is most reasonable, the Johansen
test can be carried out under all five assumptions. The results can be used to select an
assumption that is well supported by the data.
Figure 15: Johansen test for cointegration among y_t^{1M} and y_t^{5Y}.
[EViews output: sample 1964:01–1993:12, 357 observations; test assumption: no deterministic
trend in the data; series Y_1M Y_5Y; lags interval 1 to 2.]
Example 56: Fitting VAR models to the levels of y_t^{1M} and y_t^{5Y} indicates that p=1
should be used to estimate the VEC model for the Johansen test. However, we choose
p=2 to obtain results that are comparable to example 54. Below we will obtain
test results using p=1. Figure 15 shows the results of the test. The assumption No
deterministic trend in the data was used because it appears most plausible in economic
terms, and is supported by the results obtained in example 54. EViews provides an
interpretation of the test results: L.R. test indicates 1 cointegrating equation(s) at 5%
significance level. The null hypothesis 'no cointegration' (None) is rejected at the 1%
level. The hypothesis of at most one cointegration relation cannot be rejected. This
confirms the conclusion drawn in example 54 that cointegration among y_t^{1M} and y_t^{5Y}
exists.
Figure 16 contains a summary of the results for various assumptions and p=1. The last
line indicates which rank can be concluded on the basis of the likelihood-ratio test for
each assumption, using a 5% level. The conclusion r=1 is drawn for all assumptions,
except the third.
In addition, AIC and SC for every possible rank and every assumption are provided.
Note that the specified rank in the row L.R. Test is based on the estimated eigenval-
ues. The rank is not determined on the basis of AIC or SC, and therefore need not
correspond to these criteria (e.g., under assumption 2, SC points at r=0).
For a given rank, the values in a row can be compared to find out which assumption
about the data is most plausible. Since the alternatives within a line are nested,
the precondition for a selection on the basis of AIC and SC is met. If conclusions
about the cointegration rank are not unique, and/or no assumption about constant
terms and trends is particularly justified, AIC and SC may be used heuristically in
order to search for a global minimum across assumptions and ranks. As it turns out
both criteria agree in pointing at assumption 1. This corresponds to the result that
the intercept terms in the VEC model are not significant (see Figure 13). Therefore,
assuming a drift in interest rates is not compatible with the data and could hardly be
justified using economic reasoning.
Exercise 29: Choose two time series which you expect to be cointegrated.
Use the Johansen procedure to test the series for cointegration. Depending on
the outcome of the test, fit an appropriate VAR or VEC model to the series,
and interpret the results.
3.2.6 Cointegration among more than two series
Example 57:140 The purchasing power parity (PPP) states that the currencies of
two countries are in equilibrium when their purchasing power is the same in each
country. In the long run the exchange rate should equal the ratio of the two countries'
price levels. There may be short-term deviations from this relation which should
disappear rather quickly. According to the theory, the real exchange rate is given by

Q_t = F_t·P_t^f / P_t^d,

where F_t is the nominal exchange rate in domestic currency per unit of foreign cur-
rency, P_t^d is the domestic price level, and P_t^f is the foreign price level. Taking loga-
rithms yields the linear relation

ln F_t + ln P_t^f − ln P_t^d = ln Q_t.

The PPP holds if the logs of F_t, P_t^d and P_t^f are cointegrated with cointegration vector
β = (1 1 −1)′, and the log of Q_t is stationary.
Example 58: Applying the EHT to more than two interest rates implies that all
spreads between long- and short-term interest rates (R_t(τ₁) − S_t, R_t(τ₂) − S_t, etc.)
should be stationary. In a VEC model with k interest rates this implies k−1 cointegra-
tion relations. For instance, if k=4 and Y_t = (S_t, R_t(τ₁), R_t(τ₂), R_t(τ₃))′ the k×(k−1)
cointegration matrix β is given by

β = [ −1  −1  −1
       1   0   0
       0   1   0
       0   0   1 ].        (52)
Extending example 51, we add the one-year interest rate (y_t^{1Y}, Y 1Y) to the one-month
and five-year rates. Fitting VAR models to the levels indicates that lagged differences
of order one are sufficient. The results from the Johansen test clearly indicate the
presence of two cointegration relations (see file us-tbill.wf1). The upper panel
of Figure 17 shows the so-called triangular representation (see Hamilton, 1994,
p.576) of the two cointegration vectors used by EViews to identify β. Since any
linear combination of the cointegrating relations is also a cointegrating relation, this
representation can be transformed to obtain the structure of β in equation (52). For
simplicity, we set the coefficients in the row of Y 5Y(−1) in Figure 17 equal to −1,
and ignore the constants. The representation in Figure 17 implies that the spreads
y_t^{1M} − y_t^{5Y} and y_t^{1Y} − y_t^{5Y} are stationary. Using a suitable transformation matrix we
obtain
[ −0.5  −0.5   0.5       [  1   0        [ −1  −1
    1     0     0    ·      0   1    =      1   0
    0     1     0  ]       −1  −1 ]         0   1 ].
The transformed matrix now implies that the spreads y_t^{1Y} − y_t^{1M} and y_t^{5Y} − y_t^{1M} are
stationary. The lower panel in Figure 17 shows significant speed-of-adjustment coef-
ficients in all cases. The effects of lagged differences are clearly less important.
140 For empirical examples see Hamilton (1994), p.582 or Chen (1995).
Figure 17: VEC model for one-month, and one- and five-year interest rates.
[EViews output: sample (adjusted) 1964:03–1993:12, 358 observations, t-statistics in parentheses.]
Exercise 30: Choose three time series which you expect to be cointegrated.
Use the Johansen procedure to test the series for cointegration. Depending on
the outcome, fit an appropriate VAR or VEC model to the series, and interpret
the results.
3.3 State space modeling and the Kalman filter141
3.3.1 The state space formulation
The objective of state-space modeling is to estimate (the parameters of) an unobservable
vector process α_t (k×1) on the basis of an observable process y_t (which may, in general,
be a vector process, too). Two equations are distinguished. For a single observation t the
measurement, signal or observation equation is given by

y_t = c_t + z_t′α_t + ε_t,
This model can be viewed as a random walk with time-varying drift \mu_t. If \sigma_\nu^2 = 0 the drift is constant.
The stochastic volatility model is another model that can be formulated in state space form. Volatility is unobservable and is treated as the state variable. We define h_t = \ln \sigma_t^2 with transition equation

h_t = d + T h_{t-1} + \eta_t.

The observed returns are defined as y_t = \sigma_t \xi_t where \xi_t \sim N(0, 1). If we define g_t = \ln y_t^2 and \epsilon_t = \ln \xi_t^2 the observation equation can be written as

g_t = h_t + \epsilon_t.
141 For a more comprehensive treatment of this topic see Harvey (1984, 1989), Hamilton (1994), chapter 13,
or Wang (2003), chapter 7.
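To make the mapping into state space form concrete, the following minimal simulation sketch generates returns from the model and forms the linearized observation g_t; the parameter values are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n, d, T_coef, sigma_eta = 1000, -0.5, 0.95, 0.2      # illustrative parameter values

h = np.zeros(n)
for t in range(1, n):                                 # transition: h_t = d + T h_{t-1} + eta_t
    h[t] = d + T_coef * h[t - 1] + sigma_eta * rng.standard_normal()

xi = rng.standard_normal(n)                           # xi_t ~ N(0, 1)
y = np.exp(h / 2) * xi                                # returns, with sigma_t = exp(h_t / 2)
g = np.log(y**2)                                      # linearized observation g_t = h_t + ln(xi_t^2)

print("sample mean of ln(xi_t^2):", round(np.mean(g - h), 3))   # close to its theoretical mean of about -1.27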
3.3.2 The Kalman filter
The Kalman filter is a recursive procedure to estimate \alpha_t. Assume for the time being that all vectors and matrices except the state vector are known. The recursion proceeds in two steps. In the prediction step \alpha_t is estimated using the available information in t-1. This estimate a_{t|t-1} is used to obtain the prediction y_{t|t-1} for the observable process y_t. In the updating step the actual observation y_t is compared to y_{t|t-1}. Based on the prediction error y_t - y_{t|t-1} the original estimate of the state vector is updated to obtain the (final) estimate a_t.
The conditional expectation of \alpha_t is given by

a_{t|t-1} = E_{t-1}[\alpha_t] = d_t + T_t a_{t-1},

and the covariance of the prediction error is

P_{t|t-1} = E_{t-1}[(\alpha_t - a_{t|t-1})(\alpha_t - a_{t|t-1})'] = T_t P_{t-1} T_t' + Q.

Given the estimate a_{t|t-1} for \alpha_t we can estimate the conditional mean of y_t from

y_{t|t-1} = E_{t-1}[y_t] = c_t + z_t' a_{t|t-1}.
The initial state vector \alpha_0 can also be estimated or set to 'reasonable' values. The diagonal elements of the initial covariance matrix P_0 are usually set to large values (e.g. 10^4), depending on the accuracy of prior information about \alpha_0.
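A minimal sketch of one step of this recursion is given below. The prediction equations follow the formulas above; the updating (gain) formulas are the standard ones and are only described verbally in the text, so they should be read as an assumption about the omitted algebra.

import numpy as np

def kalman_step(a_prev, P_prev, y_t, d, T, Q, c, z, H):
    # prediction step: a_{t|t-1}, P_{t|t-1} and the forecast y_{t|t-1}
    a_pred = d + T @ a_prev
    P_pred = T @ P_prev @ T.T + Q
    y_pred = c + z @ a_pred
    # updating step: correct the prediction with the observed prediction error
    F = z @ P_pred @ z.T + H                  # variance of the prediction error
    K = P_pred @ z.T @ np.linalg.inv(F)       # Kalman gain
    a_t = a_pred + K @ (y_t - y_pred)
    P_t = P_pred - K @ z @ P_pred
    return a_t, P_t, y_pred

# initialization as described above: diffuse prior with large diagonal elements of P_0
a0 = np.zeros((2, 1))
P0 = 1e4 * np.eye(2)

# one step with illustrative system matrices (random-walk state, single observation)
T = np.eye(2); Q = 0.01 * np.eye(2); d = np.zeros((2, 1))
z = np.array([[1.0, 0.0]]); c = np.zeros((1, 1)); H = np.array([[1.0]])
a1, P1, y_pred = kalman_step(a0, P0, np.array([[0.3]]), d, T, Q, c, z, H)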
The stochastic volatility model cannot be estimated by ML under a normality assumption, since the observation error \epsilon_t = \ln \xi_t^2 is not normally distributed. Harvey et al. (1994) and Ruiz (1994) have proposed a QML approach for this purpose.
Example 59: Estimating a time-varying beta-factor excluding a constant term is a very simple application of the Kalman filter (see Bos and Newbold (1984) for a more comprehensive study). The system and observation equations are given by

\beta_t = \beta_{t-1} + \eta_t,    x_t^i = \beta_t x_t^m + \epsilon_t.

In other words we assume that the beta-factor evolves like a random walk without drift. Details of the Kalman filter recursion and ML estimation can be found in the file kalman.xls. Note that the final, updated estimate of the state vector is equal to the LS estimate using the entire sample.
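As a complement to kalman.xls, a minimal scalar sketch of this filter is shown below; the variance values, starting values and the simulated data are illustrative assumptions, not the values used in the spreadsheet.

import numpy as np

def filter_beta(x_i, x_m, q=1e-4, h=1e-2, a0=1.0, p0=1e4):
    # scalar Kalman filter for beta_t = beta_{t-1} + eta_t, x_t^i = beta_t x_t^m + eps_t
    a, p = a0, p0
    betas = np.empty(len(x_i))
    for t, (y_t, z_t) in enumerate(zip(x_i, x_m)):
        a_pred, p_pred = a, p + q                 # prediction (random walk without drift)
        f = z_t * p_pred * z_t + h                # prediction error variance
        k = p_pred * z_t / f                      # Kalman gain
        a = a_pred + k * (y_t - z_t * a_pred)     # updating
        p = p_pred - k * z_t * p_pred
        betas[t] = a
    return betas

# illustrative simulated data: the true beta drifts slowly around one
rng = np.random.default_rng(1)
x_m = 0.05 * rng.standard_normal(500)
beta_true = 1 + 0.01 * np.cumsum(rng.standard_normal(500))
x_i = beta_true * x_m + 0.02 * rng.standard_normal(500)
print("final filtered beta:", round(filter_beta(x_i, x_m)[-1], 3))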
where

A_j(T) = \left( \frac{2\gamma_{j,1} \exp(\gamma_{j,2} T/2)}{\gamma_{j,4}} \right)^{\gamma_{j,3}},   (56)

B_j(T) = \frac{2(\exp(\gamma_{j,1} T) - 1)}{\gamma_{j,4}},   (57)

\gamma_{j,1} = \sqrt{(\kappa_j + \lambda_j)^2 + 2\sigma_j^2},  \gamma_{j,2} = \kappa_j + \lambda_j + \gamma_{j,1},  \gamma_{j,3} = 2\kappa_j\theta_j/\sigma_j^2,  \gamma_{j,4} = \gamma_{j,2}(\exp(\gamma_{j,1} T) - 1) + 2\gamma_{j,1}.
\mu_t and Q_t are determined in such a way that the first two moments of the approximate normal and the exact transition density are equal. The elements of the K-dimensional vector \mu_t are defined as

\mu_{t,j} = \theta_j [1 - \exp(-\kappa_j \Delta t)] + \exp(-\kappa_j \Delta t) X_{t-1,j}.

The elements of a_t and b_t are defined as

a_{t,i} = -\sum_{j=1}^{K} \frac{\ln A_j(T_{t,i})}{T_{t,i}}   (i = 1, \ldots, m),

b_{t,ij} = \frac{B_j(T_{t,i})}{T_{t,i}}   (i = 1, \ldots, m;  j = 1, \ldots, K).
T_t is an m \times 1 vector of maturities associated with the vector of yields. H is the variance-covariance matrix of \epsilon_t with constant dimension m \times m. It is assumed to be a diagonal matrix but each diagonal element h_i (i = 1, \ldots, m) may be different such that the variance of errors may depend on maturity.
The Kalman filter recursion consists of the following equations:

x_{t|t-1,j} = \theta_j [1 - \exp(-\kappa_j \Delta t)] + \exp(-\kappa_j \Delta t) x_{t-1|t-1,j},

\hat{y}_t = a_t + b_t x_{t|t-1}.
The Kalman filter requires initial values for t=0 for the factors and their variance-covariance matrix. We set the initial values for X_{t,j} and P_t equal to their unconditional moments: X_{0,j} = \theta_j and the diagonal elements of P_0 are 0.5 \theta_j \sigma_j^2 / \kappa_j. The initial values for the parameters \{\kappa_j, \theta_j, \sigma_j, \lambda_j, h_i\} can be based on random samples of the parameter vector. Further details and results from an empirical example can be found in Geyer and Pichler (1999).
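A minimal sketch of the pricing functions in equations (56) and (57) and of the resulting measurement coefficients is given below; the parameter values are purely illustrative, and the sign convention for a_t follows the yield formula used above.

import numpy as np

def cir_AB(T, kappa, theta, sigma, lam):
    # A_j(T) and B_j(T) for one factor, following equations (56) and (57)
    g1 = np.sqrt((kappa + lam)**2 + 2 * sigma**2)          # gamma_{j,1}
    g2 = kappa + lam + g1                                  # gamma_{j,2}
    g3 = 2 * kappa * theta / sigma**2                      # gamma_{j,3}
    g4 = g2 * (np.exp(g1 * T) - 1) + 2 * g1                # gamma_{j,4}
    A = (2 * g1 * np.exp(g2 * T / 2) / g4)**g3
    B = 2 * (np.exp(g1 * T) - 1) / g4
    return A, B

maturities = np.array([0.25, 1.0, 5.0, 10.0])              # T_t in years (illustrative)
kappa, theta, sigma, lam = 0.5, 0.04, 0.1, -0.1            # illustrative one-factor parameters

A, B = cir_AB(maturities, kappa, theta, sigma, lam)
a_t = -np.log(A) / maturities                              # intercepts of the measurement equation
b_t = B / maturities                                       # factor loadings
print(a_t)
print(b_t)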
Bibliography
Albright, S. C., Winston, W., and Zappe, C. J. (2002). Managerial Statistics. Duxbury.
Baxter, M. and Rennie, A. (1996). Financial Calculus. Cambridge University Press.
Blattberg, R. and Gonedes, N. (1974). A comparison of the stable and Student distributions as statistical models for stock prices. Journal of Business, 47:244–280.
Bollerslev, T., Chou, R., and Kroner, K. F. (1992). ARCH modeling in finance. Journal of Econometrics, 52:5–59.
Bos, T. and Newbold, P. (1984). An empirical investigation of the possibility of stochastic systematic risk in the market model. Journal of Business, 57:35–41.
Box, G. and Jenkins, G. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, revised edition.
Brooks, C., Rew, A., and Ritson, S. (2001). A trading strategy based on the lead-lag relationship between the spot index and futures contract for the FTSE 100. International Journal of Forecasting, 17:31–44.
Campbell, J. Y., Chan, Y. L., and Viceira, L. M. (2003). A multivariate model of strategic asset allocation. Journal of Financial Economics, 67:41–80.
Campbell, J. Y., Lo, A. W., and MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton University Press.
Chan, K., Karolyi, G. A., Longstaff, F. A., and Sanders, A. B. (1992). An empirical comparison of alternative models of the short-term interest rate. Journal of Finance, 47:1209–1227.
Chatfield, C. (1989). The Analysis of Time Series. Chapman and Hall, 4th edition.
Chen, B. (1995). Long-run purchasing power parity: Evidence from some European monetary system countries. Applied Economics, 27:377–383.
Chen, N.-F., Roll, R., and Ross, S. A. (1986). Economic forces and the stock market. Journal of Business, 59:383–403.
Cochrane, J. H. (2001). Asset Pricing. Princeton University Press.
Coen, P., Gomme, F., and Kendall, M. G. (1969). Lagged relationships in economic forecasting. Journal of the Royal Statistical Society Series A, 132:133–152.
Cox, J., Ingersoll, J. E., and Ross, S. A. (1985). A theory of the term structure of interest rates. Econometrica, 53:385–407.
Dhillon, U., Shilling, J., and Sirmans, C. (1987). Choosing between fixed and adjustable rate mortgages. Journal of Money, Credit and Banking, 19:260–267.
Enders, W. (2004). Applied Econometric Time Series. Wiley, 2nd edition.
Engel, C. (1996). The forward discount anomaly and the risk premium: A survey of recent evidence. Journal of Empirical Finance, 3:123–238.
Engle, R. F. and Granger, C. W. (1987). Co-integration and error correction: representation, estimation, and testing. Econometrica, 55:251–276.
Engle, R. F. and Yoo, B. (1987). Forecasting and testing in co-integrated systems. Journal of Econometrics, 35:143–159.
Engsted, T. and Tanggaard, C. (1994). Cointegration and the US term structure. Journal of Banking and Finance, 18:167–181.
Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47:427–465.
Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy, 81:607–636.
Fielitz, B. and Rozelle, J. (1983). Stable distributions and the mixtures of distributions hypotheses for common stock returns. Journal of the American Statistical Association, 78:28–36.
Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley.
Geyer, A. and Pichler, S. (1999). A state-space approach to estimate and test multifactor Cox-Ingersoll-Ross models of the term structure. Journal of Financial Research, 22:107–130.
Gourieroux, C. and Jasiak, J. (2001). Financial Econometrics. Princeton University Press.
Granger, C. W. and Newbold, P. (1971). Some comments on a paper of Coen, Gomme and Kendall. Journal of the Royal Statistical Society Series A, 134:229–240.
Granger, C. W. and Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2:111–120.
Greene, W. H. (2000). Econometric Analysis. Prentice Hall, 4th edition.
Greene, W. H. (2003). Econometric Analysis. Prentice Hall, 5th edition.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50:1029–1054.
Hansen, L. P. and Singleton, K. J. (1996). Efficient estimation of linear asset-pricing models with moving average errors. Journal of Business and Economic Statistics, 14:53–68.
Harvey, A. C. (1984). A unified view of statistical forecasting procedures. Journal of Forecasting, 3:245–275.
Harvey, A. C. (1989). Forecasting, structural time series models and the Kalman filter. Cambridge University Press.
Harvey, A. C., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. Review of Economic Studies, 61:247–264.
Hastings, N. and Peacock, J. (1975). Statistical Distributions. Butterworth.
Hayashi, F. (2000). Econometrics. Princeton University Press.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47:153–161.
Ingersoll, J. E. (1987). Theory of Financial Decision Making. Rowman & Littlefield.
Jarrow, R. A. and Rudd, A. (1983). Option Pricing. Dow Jones-Irwin.
Kiefer, N. (1988). Economic duration data and hazard functions. Journal of Economic Literature, 26:646–679.
Kiel, K. A. and McClain, K. T. (1995). House prices during siting decision stages: The case of an incinerator from rumor through operation. Journal of Environmental Economics and Management, 28:241–255.
Kirby, C. (1997). Measuring the predictable variation in stock and bond returns. Review of Financial Studies, 10:579–630.
Kmenta, J. (1971). Elements of Econometrics. Macmillan.
Kon, S. J. (1984). Models of stock returns – a comparison. Journal of Finance, 39:147–165.
Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., and Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 52:159–178.
Lamoureux, C. and Lastrapes, W. (1990). Heteroscedasticity in stock return data: volume versus GARCH effects. Journal of Finance, 45:221–229.
Levitt, S. D. (1997). Using electoral cycles in police hiring to estimate the effect of police on crime. American Economic Review, 87(4):270–290.
Lütkepohl, H. (1993). Introduction to Multiple Time Series Analysis. Springer.
Mills, T. C. (1993). The Econometric Modelling of Financial Time Series. Cambridge University Press.
Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments. Journal of Economic Perspectives, 20(4):111–132.
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55:703–708.
Papoulis, A. (1984). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 2nd edition.
Roberts, M. R. and Whited, T. M. (2012). Endogeneity in Empirical Corporate Finance, volume 2A of Handbook of the Economics of Finance. Elsevier.
Roll, R. and Ross, S. A. (1980). An empirical investigation of the arbitrage pricing theory. Journal of Finance, 35:1073–1103.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13:341–360.
Ruiz, E. (1994). Quasi-maximum likelihood estimation of stochastic volatility models. Journal of Econometrics, 63:289–306.
SAS (1995). Stock Market Analysis using the SAS System. SAS Institute.
Shanken, J. (1992). On the estimation of beta-pricing models. Review of Financial Studies, 5:1–33.
Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3):557–586.
Stambaugh, R. F. (1999). Predictive regressions. Journal of Financial Economics, 54:375–421.
Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics, 20(4):518–529.
Studenmund, A. (2001). Using Econometrics. Addison Wesley Longman.
Thomas, R. (1997). Modern Econometrics. Addison Wesley.
Tsay, R. S. (2002). Analysis of Financial Time Series. Wiley.
Valkanov, R. (2003). Long-horizon regressions: Theoretical results and applications. Journal of Financial Economics, 68:201–232.
Verbeek, M. (2004). Modern Econometrics. Wiley, 2nd edition.
Wang, P. (2003). Financial Econometrics. Routledge.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. The MIT Press.
Wooldridge, J. M. (2003). Introductory Econometrics. Thomson, 2nd edition.
Yogo, M. (2004). Estimating the elasticity of intertemporal substitution when instruments are weak. The Review of Economics and Statistics, 86(3):797–810.