
Economics 717 – Fall 2019

Professor Jeffrey Smith


Instrumental Variables Lecture
Version of October 8, 2019

Motivation

Standard instrumental variable (IV) methods deal with the endogeneity of a regressor
within a linear regression framework.

The instrument, or exclusion restriction, is a variable that, in loose terms, affects the
endogenous variable but does not affect the outcome variable other than through its effect
on the endogenous variable.

Instrumental variable methods are also used in measurement error contexts.

In the last 20 years, there has been a revolution in the interpretation of IV estimates as the
literature has begun to investigate them in contexts where the effect of the endogenous
variable varies across units in ways that may, or may not, be related to the instrument.

Understanding this recent literature is both fascinating and key to getting an IV paper
accepted at a good journal.

Outline

Standard case with one instrument and common effects

Standard case with multiple instruments

Details of the standard case

Heterogeneous effects and a binary instrument

Heterogeneous effects and continuous instruments

Example: Project Star

To begin, consider the example of Project Star in Tennessee, which was a random
assignment evaluation of (ideally) changing the size of primary school classes in
government schools from 24 to 18. The outcome variable of interest was test scores.

Define the following notation:

Y_i = average test score for class “i”

X_i ∈ {18, 24} is the (intended) number of students in class “i”

Z_i ∈ {0, 1} equals 1 for a small class (the treatment) and 0 for a large class (the control)

The reduced form model gives the direct impact of being assigned to the treatment group
on outcomes:

Y_i = Π_0 + Π_1 Z_i + v_i ,

where Π1 is of direct policy interest.

However, we are also interested in a more structural parameter, namely the derivative of
test scores with respect to class size. That parameter appears in the “structural” equation

Y_i = β_0 + β_1 X_i + ε_i ,

where β_1 = dY/dX.

In terms of the notation above, the coefficient from the reduced form model may be
written as

Π_1 = (dX/dZ)(dY/dX) .

We might think about estimating the structural model directly, given random assignment,
but there may be some practical concerns that intervene. In Project Star, random
assignment was defeated at the margin by parents in some cases. Thus, we may want
instead to estimate the structural parameter of interest indirectly using IV.

We can obtain the first component by estimating the first stage model

X_i = θ_0 + θ_1 Z_i + u_i .

If random assignment worked perfectly, then the estimated coefficient would equal minus
six; given the empirical problems, it will lie between zero and minus six.

The IV estimate is then given by the ratio of the reduced form estimate from the outcome
equation to the coefficient from the first stage, or

β_1 = dY/dX = Π_1 / θ_1 .

This estimate is consistent despite the partial disruption of random assignment, while the
estimate based on direct estimation of the structural equation would not be, because the
error term in that equation will be correlated with the class size variable if disruptions of
random assignment are not random with respect to test scores.

This estimator is sometimes called the “indirect least squares” estimator.
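
For example (hypothetical numbers, not the actual Project Star estimates): if the first stage
gives θ̂_1 = −4.5 students and the reduced form gives Π̂_1 = 2.7 test score points, then the
indirect least squares estimate of the effect of class size is

β̂_1 = Π̂_1 / θ̂_1 = 2.7 / (−4.5) = −0.6 ,

or about 0.6 points of test score lost per additional student.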

Just-Identified Case

Basic model

Now consider the case of non-experimental data. To keep things simple, suppose that we
have just one covariate, and it is endogenous. Thus, we have that

y_i = β_0 + β_1 x_i + ε_i ,

where cov(x_i, ε_i) ≠ 0 .

Because of the endogeneity problem, we cannot impose the second of the OLS
orthogonality conditions (equivalently, the second of the OLS first order conditions).

Suppose instead that we have an instrument, a variable such that cov( zi , ε i ) = 0 but
cov( zi , xi ) ≠ 0 .

Note that in the general case with more covariates, all that is required is that the
covariance of zi and ε i conditional on all of the xi equals zero.

Deriving the estimator

With such a variable, we can utilize an alternative orthogonality condition based on
cov(z_i, ε_i) = 0 to derive an estimator. To see this, note that the sample analogue
covariance condition is that

Σ_{i=1}^{n} z_i (y_i − β̂_0 − β̂_1 x_i) = 0 .

This is the same as the analogous OLS first order condition, but with zi in place of xi .

Proceeding with the usual derivation but with this replacement leads to

β̂_IV = Σ_{i=1}^{n} (y_i − ȳ)(z_i − z̄) / Σ_{i=1}^{n} (x_i − x̄)(z_i − z̄) .

Using the notation from the Project Star example, this falls out easily. Recall that

θ̂_1 = Σ_{i=1}^{n} (z_i − z̄)(x_i − x̄) / Σ_{i=1}^{n} (z_i − z̄)²

and that

Π̂_1 = Σ_{i=1}^{n} (z_i − z̄)(y_i − ȳ) / Σ_{i=1}^{n} (z_i − z̄)² .

As in the Project Star example, the ratio then gives

β̂_1 = Π̂_1 / θ̂_1 .
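
A minimal simulation sketch in Stata (my own illustration, with hypothetical variable names
and data-generating process): the ratio of the reduced form coefficient to the first stage
coefficient reproduces the ivreg estimate.

clear
set seed 717
set obs 5000
gen z = rnormal()
gen e = rnormal()
gen x = 1 + 0.8*z + e + rnormal()    // x is endogenous: correlated with e
gen y = 2 + 0.5*x + e                // true beta_1 = 0.5
regress y z                          // reduced form: Pi_1-hat
regress x z                          // first stage: theta_1-hat
ivreg y (x = z)                      // equals Pi_1-hat / theta_1-hat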

The Wald estimator

A useful special case occurs when both the endogenous variable and the instrument are
binary. In this case, the IV formula reduces to a simple form called the Wald estimator.

Consider the reduced form outcome model. In the case of a binary instrument, the
coefficient on the instrument is simply the mean difference in outcomes between units
with the instrument equal to one and units with the instrument equal to zero. In notation,
the reduced form coefficient equals

Π̂_1 = Ê(Y | Z = 1) − Ê(Y | Z = 0) ,

where the expectations are estimated by sample means.

Similarly, the first stage regression of the endogenous variable on the instrument is a
linear probability model in this case. The coefficient on the instrument is just the mean
difference in the probability that the endogenous dummy variable equals one for units
with the instrument equal to one and units with the instrument equal to zero. In notation,
the first stage coefficient equals

θ̂_1 = P̂r(D = 1 | Z = 1) − P̂r(D = 1 | Z = 0) ,

where D is the endogenous binary variable.

Combining the two estimates using the formula above gives the Wald estimator

Δ_IV = Π_1 / θ_1 = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .

A simple example

Consider now a simple example, which will use distance from the training center as an
instrument in an effort to measure the effect of training on some outcome Y.

To begin, consider a training center in one town that serves two towns. Those in the
same town as the training center have a short distance to go, while those in the other town
have a long distance to go.

Suppose that the impact of training, ∆ , on those who take it is 10, while the outcome in
the absence of training, denoted by Y0 , is 100.

The tuition for the training course is five.

The cost of travel for those in the second town varies across persons. For those in the
near town, the cost is zero for everyone. In the far town, for those with a car the cost is
essentially zero; for those without one the cost is 12. Assume that half of the eligible
persons have a car and that there are 200 eligible persons in each town.

Assume also that everyone knows their cost of training and their benefits from training,
and participates only when the benefits exceed the costs.

Let Z = 1 denote residence in the near town and Z = 0 denote residence in the far town
and let D = 1 denote participation in training and D = 0 denote non-participation.

Using our standard notation, let

Pr(D = 1 | Z = 1) = 1.0

and

Pr(D = 1 | Z = 0) = 0.5 .

Thus, there is a difference in participation probabilities between the near town and the far
town, which is to say that our instrument is correlated with the endogenous variable.

At the same time, by construction the instrument is not correlated with the untreated
outcome (which is a constant in this example). Thus, the instrument meets the two
criteria for validity.

E(Y | Z = 1) = Y_0 + Δ Pr(D = 1 | Z = 1) = 100 + 10(1.0) = 110

E(Y | Z = 0) = Y_0 + Δ Pr(D = 1 | Z = 0) = 100 + 10(0.5) = 105

The IV estimator in this simple case is given by:

Δ_IV = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .

The intuition behind the formula is that the denominator serves to scale up the mean
difference in outcomes to account for the fact that the change in the instrument changes
treatment status for some but not all individuals.

Inserting the numbers from the example into the formula gives:

Δ̂_IV = (110 − 105) / (1.0 − 0.5) = 5 / 0.5 = 10 ,

which is the correct answer.

The IV estimator isolates variation in the endogenous variable due to specific exogenous
factors. In this example, that exogenous variation is all of the variation.
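
A minimal Stata sketch of this example (my own illustration; the variable names are made
up): 200 people in each town, everyone in the near town trains, only the car owners in the
far town train, and the impact is 10 for every trainee.

clear
set obs 400
gen near = _n <= 200                 // Z: 1 = near town, 0 = far town
gen car = mod(_n, 2)                 // half of the people in each town have a car
gen d = near | car                   // train only when the impact exceeds the cost
gen y = 100 + 10*d                   // Y0 = 100, common impact of 10
ivreg y (d = near)                   // Wald estimate equals 10 exactly

Because there is no noise in this constructed example, the point estimate is exactly 10 and
the reported standard errors are degenerate.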

Multiple instruments and two-stage least squares

So far we have considered the case with only one instrument for one endogenous
variable. Suppose instead that we have multiple instruments, z1 ,..., zM .

Identification

In this case, the model is said to be “over-identified”, because only a single instrument is
required to identify a model with one endogenous variable.

The single instrument case is referred to as a “just-identified” model.

If there are M instruments and one endogenous variable, then there are M − 1 “over-
identifying restrictions” which can be tested. We discuss how to do that later on in the
lecture.

Choosing a combination of instruments

Assuming each instrument is individually correlated with the endogenous variable, and
uncorrelated with the error term, we have many potential IV estimators. Each instrument
by itself will produce consistent estimates, as will any linear combination of the
instruments.

Which combination is best?

It turns out that using the linear projection of the endogenous variable on all of the
exogenous variables, both those in the model and the outside instruments, minimizes the
variance of the resulting estimates. This is not surprising given that the linear projection
corresponds to the least squares estimator of a regression of the endogenous variable on
the exogenous variables, and that, within the class of linear estimators, least squares
maximizes the covariance between the endogenous variable and its fitted value by
minimizing the sum of squared errors.

Using the linear projection of the endogenous variables on the exogenous variables as the
instrument leads to the two-stage least squares estimator. We will formally derive this
estimator shortly. For the moment, the basic idea is the following:

First stage: regress the endogenous variable on the exogenous variables. In the notation
from Wooldridge, this means estimating the model

x_K = δ_0 + δ_1 x_1 + ... + δ_{K−1} x_{K−1} + θ_1 z_1 + ... + θ_M z_M + r_K ,

where xK is the endogenous variable, x1 ,..., xK −1 are the exogenous variables in the
outcome equation and z1 ,..., zM are the outside instruments.

In the population,

x_K* = δ_0 + δ_1 x_1 + ... + δ_{K−1} x_{K−1} + θ_1 z_1 + ... + θ_M z_M

is uncorrelated with u by construction, because it is a linear combination of variables
assumed to be uncorrelated with u.

In practice, the OLS fitted values are used. These are given by

x̂_K = δ̂_0 + δ̂_1 x_1 + ... + δ̂_{K−1} x_{K−1} + θ̂_1 z_1 + ... + θ̂_M z_M .

Second stage: estimate the outcome equation using the projection of the endogenous
variable in place of the endogenous variable itself.

That is, in the second stage you estimate the model:

y = β_0 + β_1 x_1 + ... + β_{K−1} x_{K−1} + β_K x̂_K + v ,

where the error term is now uncorrelated with all of the independent variables, thus
giving consistent (but not unbiased) estimates.
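
A sketch of these two steps done by hand in Stata (hypothetical variable names; x3 is the
endogenous regressor and z1 and z2 are the excluded instruments). The point estimates
match 2SLS, but the standard errors from the manual second stage are wrong because they
ignore the fact that the fitted values are estimated.

regress x3 x1 x2 z1 z2         // first stage
predict x3hat, xb              // first-stage fitted values
regress y x1 x2 x3hat          // second stage: same coefficients as 2SLS
ivreg y x1 x2 (x3 = z1 z2)     // one step, with correct standard errors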

Proof

The proof that two-stage least squares is the most efficient estimator (conditional on
using a linear combination of the instruments) is in Wooldridge. It is not that
illuminating, so I omit it here.

Formal derivation with matrices

We now give an abbreviated formal treatment of the IV estimator; for a longer treatment
see the Wooldridge book.

Assumptions

Two key assumptions underlie the two-stage least squares estimator. They are:

Assumption 2SLS1: For some (N x L) matrix Z, E ( Z ′U ) = 0 .

In Wooldridge’s notation, Z includes the exogenous variables in the structural outcome
equation as well as the excluded instruments. As a result, L equals (K + M) minus the
number of endogenous variables. U is the vector of error terms in the structural outcome
equation.

Thus, this first assumption is the key one – which is that the instruments (including the
exogenous X) are uncorrelated with the error term.

The second assumption has to do with making sure that there are enough exogenous
variables.

Assumption 2SLS2: (a) rank E ( Z ′Z ) = L ; (b) rank E ( Z ′X ) = K .

As before, Z is (N x L), so that Z′Z is (L x L). Similarly, X is (N x K), so that Z′X is
(L x N)(N x K) = (L x K).

Part (a) is not very important; it just means that you do not have perfect multicollinearity
among the exogenous variables.

Part (b) is more important. It means that the excluded instruments are related to the
endogenous structural equation variables conditional on the included exogenous
variables. This formalizes the intuition that the excluded instruments need to be related
to the endogenous variables, and makes it clear that this has to be true even conditional
on the included exogenous variables.

Note that if Z = X , Assumptions 2SLS1 and 2SLS2 become two of the standard
assumptions for the OLS model: that the X matrix is of full rank and that the X are
uncorrelated with the population residuals.

In the simple case with one endogenous variable, the condition in (b) can be tested by
regressing the endogenous variable on all of the exogenous variables. The necessary

condition is that the coefficient on at least one of the excluded instruments does not equal
zero.

A general test is also possible in the case of multiple endogenous variables. In practice, it
is often easier to test the necessary, but not sufficient, condition that at least one of the
excluded instruments is significant in each first stage. If this necessary condition fails,
then assumption (b) is problematic.

Deriving the estimator

Start with the derivation of the IV estimator, which flows out of the basic orthogonality
condition. Impose

E(Z′U) = E[Z′(Y − Xβ)] = 0 .

Rewrite this to yield

E(Z′Y) = E(Z′X)β  ⇒  β_IV = E(Z′X)⁻¹ E(Z′Y) .

The consistent estimator is given by replacing the expected values by their sample
analogues, so that

β̂_IV = (Z′X)⁻¹ (Z′Y) .

This estimator only works when Z has the same number of columns as X, which is to say
that the number of excluded instruments is the same as the number of endogenous
variables.

The two stage least squares estimator handles the general case where the number of
excluded instruments is greater than or equal to the number of endogenous variables. It
uses the projection of the X onto all of the exogenous variables Z (the exogenous X and
the excluded instruments) as the instruments in the IV estimator.

Let X̂ = Z(Z′Z)⁻¹Z′X be the linear projection of X onto the Z, where θ̂_1 = (Z′Z)⁻¹Z′X are
the estimated coefficients from the first stage.

Replacing Z with X̂ in the IV estimator, and noting that (X̂′X) = (X̂′X̂) by construction,
because the first-stage residual is necessarily orthogonal to the fitted values X̂, yields

β̂_2SLS = (X̂′X̂)⁻¹ (X̂′Y) .

This is just the least squares estimator with X̂ in place of X, which is where the name
“two-stage least squares” comes from.

As noted above, the 2SLS estimator is efficient in the class of estimators based on linear
functions of the instruments.

Consistency

The 2SLS estimator is consistent. The proof is in Wooldridge and proceeds along very
similar lines to the proof for OLS. The key element is that the plim of Z′U/N equals zero by
assumption.

Asymptotic distribution

Wooldridge presents the asymptotic variance under the following homoskedasticity
assumption.

Assumption 2SLS3: E(UU′ | Z) = σ²I, where σ² = E(u_i²) .

The conditioning on Z follows from the fact that X includes endogenous variables. Recall
that Z includes the exogenous variables in X as well as the excluded instruments.

The assumption of homoskedasticity, combined with the original assumptions 2SLS1 and
2SLS2, allow the following result:

Theorem 5.2: Under assumptions 2SLS1-2SLS3,

√N (β̂ − β) ~ AN(0, σ² [E(X′Z) E(Z′Z)⁻¹ E(Z′X)]⁻¹) ,

where AN denotes the asymptotic normal distribution.

Estimating the variance

The variance is fairly straightforward to estimate.

The first thing that is required is an estimate of σ 2 . That is given by

σ̂² = Σ_{i=1}^{N} û_i² / (N − K) ,

where û_i = Y_i − X_i β̂ .

Note that these uˆi are residuals obtained using the actual X, rather than the predicted X
from the first stage. Thus, they are not the residuals from the second stage equation that
you actually estimate as part of the two-stage least squares procedure. The reason for this

is that you do not want to include the estimation error in X in the estimate of the residual
variance.
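
In Stata, my understanding is that predict with the residuals option after ivreg already
behaves this way, computing Y − Xβ̂ with the observed endogenous variable rather than its
first-stage fitted value (hypothetical variable names below).

ivreg y x1 (x2 = z1 z2)
predict uhat, residuals                   // y - Xb using the observed x2, not x2-hat
gen uhat2 = uhat^2
quietly summarize uhat2
display "sigma2-hat = " r(sum)/(_N - 3)   // N - K, with K = 3 here (constant, x1, x2)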

Though it is not obvious that this should be so,

(X̂′X̂)⁻¹

is a consistent estimator of [E(X′Z) E(Z′Z)⁻¹ E(Z′X)]⁻¹ .

To see why this should be so, recall that the first stage predicted values are given by

X̂ = Z Π̂_1 = Z(Z′Z)⁻¹Z′X .

This comprises the last five matrices in the component of the variance we want to
estimate.

Then note that (X′X̂) = (X̂′X̂) because the first-stage residual is necessarily orthogonal to
the fitted values X̂ (it is orthogonal to everything in Z).

Combining the two parts, we obtain the variance estimator for the 2SLS estimates given
by:

v̂ar(β̂_2SLS) = σ̂² (X̂′X̂)⁻¹ .

This is just the OLS variance estimate with X̂ in place of X. This is not surprising, given
that 2SLS is just OLS using the predicted X in place of the original X.

Comparing this estimator to the OLS estimator also makes it clear why you want
instruments that do a good job of capturing the variance of the endogenous variable. If
they do not, then the variation in X̂ will be (potentially much) less than that in X. When
this occurs, the 2SLS variance will be much larger than the OLS variance because the
denominator will be smaller, with the imprecision concentrated on the coefficients of the
endogenous variables.

Show the corresponding formulae for the simple linear regression model to make the
point.

Control function version of two-stage least squares

The control function version of the 2SLS estimator just includes the estimated residual
from the first stage as a covariate in the second stage.

This turns out to work in the same way as the inverse Mills ratio in the treatment effects
version of the bivariate normal estimator. It captures the expected value of the error term

and thereby solves the selection or endogeneity problem when thought about as an
omitted variable problem.

Keep in mind that this is a generated regressor, so that the estimated standard errors are
more complicated than usual.
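
A sketch of the control function version in Stata (hypothetical variable names; x2 is
endogenous, z1 and z2 are the excluded instruments). The coefficient on x2 matches the
2SLS estimate; because vhat is a generated regressor, the reported standard errors are not
correct, so in practice one would bootstrap the two steps together.

regress x2 x1 z1 z2            // first stage
predict vhat, residuals        // estimated first-stage residual
regress y x1 x2 vhat           // control function: coefficient on x2 matches 2SLS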

Asymptotic bias

To consider the asymptotic bias it is simplest to stick with the case of one instrument and
one endogenous variable. The points all carry over to the general case.

In the simple case, we have from

β̂_{1,IV} = Σ_{i=1}^{N} (z_i − z̄)(y_i − ȳ) / Σ_{i=1}^{N} (z_i − z̄)(x_i − x̄) = β_1 + Σ_{i=1}^{N} (z_i − z̄)u_i / Σ_{i=1}^{N} (z_i − z̄)(x_i − x̄)
(which skips a few steps that are the same as in the OLS case) that

plim β̂_{1,IV} = β_1 + σ_zu / σ_xz .

Thus, the asymptotic bias depends on the relative strength of the population relationships
between the instrument and the error term and between the instrument and the
endogenous variable.

Although it is assumed that σ_zu = 0, in practice it is likely small but not exactly zero in the population. As
such, you want σ xz to be as large as possible so that any correlation between the
instrument and the error term results in a small bias. If the instruments are very weak, so
that σ xz is small, then even a small correlation can lead to a large bias.

It is interesting to compare the asymptotic biases of OLS and IV. The asymptotic bias of
OLS is given by

plim β̂_{1,OLS} = β_1 + σ_xu / σ_xx .

The derivation is, not surprisingly, exactly parallel to that for the IV estimator.

The ratio of the asymptotic bias of IV to that of OLS then equals

(σ_zu / σ_xz) / (σ_xu / σ_xx) = ρ_zu / (ρ_xu ρ_zx) .

Thus, the relative asymptotic bias depends on three terms: the correlation between the
instrument and the error term, the correlation between the endogenous variable and the
error term and the correlation between the instrument and the endogenous variable.

For IV to be preferred to OLS, the absolute value of the ratio of the biases should be less
than one. For that to happen, you need that

| ρ_zu | < | ρ_xu ρ_zx | .

Thus, IV is likely to have a lower bias than OLS when the instrument is not very
correlated with the residual, when the endogenous variable is highly correlated with the
residual, and when the instrument is strongly correlated with the endogenous variable.
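
For example (hypothetical numbers): with ρ_zu = 0.02, ρ_xu = 0.2, and ρ_zx = 0.4, the ratio
of asymptotic biases is 0.02 / (0.2 × 0.4) = 0.25, so IV has only one quarter of the asymptotic
bias of OLS. With a much weaker instrument, say ρ_zx = 0.05, the ratio is
0.02 / (0.2 × 0.05) = 2.0, and OLS is preferred on asymptotic bias grounds.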

Of course, considering both the bias and the variance would tip the balance a bit more
toward OLS.

This is most of what you need to know about picking a good instrument.

Finite sample bias and weak instruments

Finite sample bias in IV estimators has received a lot of attention in the literature in
recent years, motivated in large part by the Bound, Jaeger and Baker (1995) JASA paper
that showed that the Angrist and Krueger (1991) QJE paper on the effect of schooling on
earnings using compulsory schooling laws as instruments, despite using samples in the
hundreds of thousands, was still subject to substantial finite sample bias due to the large
number of weak instruments they employ.

Finite sample bias results from the fact that the first stage coefficients are estimated with
error in finite samples.

My discussion here mainly follows Bound, Jaeger and Baker (1995) JASA with some
additional bits from Angrist and Pischke (2009).

Using power series methods, Buse (1992) derives the following expression for the bias of
the instrumental variables estimator in samples of size N:

bias(β̂_IV, N) ≈ (Q − 2) σ_uv / (θ′Z_e′Z_e θ) ,

where Q is the number of excluded instruments, σ uv is the covariance of the unobserved
components in the first stage and the outcome equation, and Z e here refers to the
excluded instruments.

A related derivation in Angrist and Pischke (2009) gets to the same place, but with Q in
place of (Q – 2). Note that “ ≈ ” is important here.

The matrices in the denominator have dimensions (1 x M)(M x N)(N x M)(M x 1) and so
the matrix multiplication results in a scalar.

Three aspects of this formula deserve note:

1. All else equal, the bias increases in Q, though adding another instrument should also
increase the denominator. In this sense, the formula provides some guide to whether or
not you want to include a given instrument. It is clear that adding a sufficiently weak
instrument, one that does little to increase the denominator, can increase the finite sample
bias by increasing Q even if it decreases the asymptotic bias by increasing the covariance
between the instruments and the endogenous variable.

2. All else equal the finite sample bias decreases in N (as the name suggests that it
should), because the denominator increases with N while the probability limit of the
numerator does not change with N. But as Bound, Jaeger and Baker (1995) JASA show,
the bias can remain substantively large even in what are very large samples by the
standards of empirical research in economics.

3. This formula provides a direct motivation for the use of the first-stage F statistic on the
excluded instruments as a signal of potential weak instrument problems.

To see this, first re-write the formula as:

bias(β̂_IV) ≈ (σ_uv / σ_v²) [σ_v² / (θ′Z_e′Z_e θ)] (Q − 2) .

The term τ = θ′Z_e′Z_e θ / σ_v² is called the concentration ratio. The numerator here is the
increment to the explained sum of squares of the first stage that results from the excluded
instruments. If the excluded instruments are the only variables in the first stage, then this
is literally the explained sum of squares from the first stage.

The first term is approximately the bias of OLS when Z explains little of X so that X is
approximately equal to v.

The last two terms are approximately the inverse of the first-stage F-statistic on the
excluded instruments. To see this recall that the F-statistic plays off the difference in
residual variance between the full model and the model with the restrictions imposed:

F = [(SSR_r − SSR_ur) / M] / [SSR_ur / (N − K − 1)] .

Note further that SSR_ur ≈ (N − K − 1) σ_v² .

Stock, Wright and Yogo (2002) JBES suggest that a value of the first-stage F-statistic on
the excluded instruments exceeding 10 generally suffices to avoid problems with finite
sample bias.
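
A sketch of how to obtain this diagnostic in Stata (hypothetical variable names; x2 is the
endogenous variable and z1 and z2 are the excluded instruments): run the first stage
yourself and jointly test the excluded instruments.

regress x2 x1 z1 z2            // first stage
test z1 z2                     // F-statistic on the excluded instruments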

Staiger and Stock (1997) Econometrica, using alternative asymptotics, also show that the
finite sample bias is inversely related to the first-stage F-statistic.

Bias is towards the (probability limit of the) OLS estimate. The intuition here is that with
a random first stage you would, in a loose sense, get back the OLS estimates because the
random instrument will, in a finite sample, be correlated with random bits of the
endogenous variable.

Another way to think of this is that the problem relates to over-fitting in the first stage in
finite samples.

As the ratio of exogenous variation to randomness increases in the first stage, the
component of the endogenous variable that the first stage picks out becomes increasingly
exogenous, and the IV estimate migrates away from OLS towards its asymptotic value.

Getting the standard errors right when there is finite sample bias is also an issue, as the
usual TSLS variance matrix performs poorly in this context. Imbens and Wooldridge
(2007) present estimates suggesting that the TSLS variance matrix performs reasonably
well in the just identified case without excessive endogeneity. For the case of multiple
instruments, see the Monte Carlo analysis in Flores-Lagunes (2007) Journal of Applied
Econometrics.

Angrist and Pischke (2009) note that alternative estimators such as the LIML estimator
and Jackknife Instrumental Variables (JIVE) may perform better in contexts with weak
instruments (and, in the latter case, heteroskedasticity).

Murray (2006) Journal of Economic Perspectives provides a low-tech overview.

Instrumental variables in Stata

The ivreg command in Stata estimates models using instrumental variables or two-
stage least squares.

The format of the command is

ivreg <depvar> <exogenous x> (<endogenous x> = <excluded
instruments>)

Thus, to estimate

y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + β_3 x_{3i} + u_i ,

where x3 is endogenous and z is an instrument for x3 , you would say

ivreg y x1 x2 (x3 = z)

The first option to the ivreg command (that is, appending “, first” to the command) causes
Stata to display the results of both the first stage and second stage regressions.
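
For example, to also display the first stage results for the model above:

ivreg y x1 x2 (x3 = z), first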

Testing exogeneity

Introduction

Sometimes it is useful to test whether or not a potentially endogenous variable actually is
endogenous. In the literature you will occasionally see papers that test the null
hypothesis of exogeneity of the potentially endogenous variable. If this null is not
rejected, the analysis proceeds under the assumption of exogeneity and does OLS. If the
null is rejected, the analysis proceeds to do IV.

This standard test compares the OLS and 2SLS estimates. Under the null that the
potentially endogenous variable is actually exogenous, these two estimates differ only
due to sampling error, but OLS is more efficient in the statistical sense. In contrast,
under the alternative hypothesis that the variable is actually endogenous, they should
differ, because the 2SLS estimator is consistent under the alternative but OLS is not.

Formal discussion

There are a number of different ways to perform this test. Following the Wooldridge
text, this discussion highlights two of the most common.

Method 1: Using the regression residuals from the first stage

This method is quite straightforward. It consists of taking the residuals from the first
stage estimation and including them in the second stage along with the endogenous
variable.

In notation, the first stage is given by

X_i = Z_i γ + v_i .

The estimated residuals are given by

v̂_i = X_i − Z_i γ̂ .

The second stage for the test is given by

Y_i = X_i β + v̂_i β_v + u_i .

The test is just the standard t-test of the null that the coefficient on the estimated residuals
from the first stage equals zero. In notation, the null is given by

H_0: β_v = 0 .

The intuition behind the test is that under the null of exogeneity, the residuals from the
first stage are just noise, and so should have a zero coefficient. In contrast, under the
alternative hypothesis of endogeneity, the first-stage residuals represent the “bad”
variation in the endogenous variable – the part that is correlated with the error term in the
outcome equation – just as they do in the control function version of the 2SLS estimator.
In this case, their coefficient should be non-zero.

This test is easily generalized to the case of multiple endogenous variables. In this case,
there is a vector of estimated residuals – one from the first stage for each of the
endogenous variables – and the test takes the form of an F-test of the null that all of the
coefficients on the estimated residuals equal zero.
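
A sketch of Method 1 in Stata (hypothetical variable names, mirroring the control function
regression shown earlier): the test is the t-test (or, with several endogenous variables, the
F-test) on the first-stage residuals in the augmented outcome equation.

regress x2 x1 z1 z2            // first stage
predict vhat, residuals
regress y x1 x2 vhat           // outcome equation augmented with the residual
test vhat                      // H0: coefficient on vhat equals zero (exogeneity)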

Method 2: A chi-squared test on the coefficient of interest

The second method focuses solely on the coefficient of interest estimated using OLS and
IV. Typically the coefficient of interest is the coefficient on the endogenous variable.

Call the coefficient of interest βˆk . The test then consists of examining the statistic

(β̂_{k,2SLS} − β̂_{k,OLS})² / [se(β̂_{k,2SLS})² − se(β̂_{k,OLS})²] ,

which has an asymptotic χ²(1) distribution.

The denominator is the variance of the difference in estimated coefficients under the null.
To see this recall that under the null the additional variance component in the 2SLS
estimates is just noise. Thus, under the null the covariance is just the OLS variance. The
standard formula for the variance of a difference then yields the result in the
denominator.
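
For example (hypothetical numbers): if β̂_{k,2SLS} = 0.90 with standard error 0.20 and
β̂_{k,OLS} = 0.60 with standard error 0.10, the statistic is
(0.90 − 0.60)² / (0.20² − 0.10²) = 0.09 / 0.03 = 3.0, which falls short of the five percent χ²(1)
critical value of 3.84, so the null of exogeneity is not rejected.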

The intuition here is that under the null, the coefficients differ only due to sampling error.
The 2SLS estimator is inefficient, and so has a higher asymptotic variance, because it
uses a less efficient instrument for the endogenous variable under the null. Under the null
of exogeneity, the efficient instrument for the potentially endogenous variable is the
variable itself.

Issues with the variance of the difference

In practice, the estimated variance of the difference will sometimes not be positive
definite; the ordering is asymptotic and may not hold in particular finite samples. This is
one reason to prefer the regression-based tests. In this situation Stata reports an error
message and takes a generalized inverse so that a positive value results.

Remarks

In Cambridge, Massachusetts, this is called the Hausman test. In Chicago, it is called the
Durbin-Wu-Hausman test. You can take your pick. Using all three names is probably
the risk-averse choice; another alternative is to simply call it the “standard exogeneity
test” or some such.

The null here is exogeneity. Thus, the test is, in the usual classical statistical sense, being
conservative about concluding endogeneity. If theory (or even common sense) or
evidence from other studies suggests endogeneity, this may suffice to proceed with 2SLS
regardless of the results of the test.

In the past, and sometimes in the present, researchers would adopt a strategy of testing
exogeneity and then reporting the OLS estimates if the test “passed” and the IV estimates
if the test “failed”. The OLS or IV standard errors are invalid here due to pre-test bias.
Correct standard errors should take full account of the two-step procedure that includes
the test. Guggenberger (2010) Econometric Theory discusses these issues.

Draw the decision tree corresponding to this testing strategy.

This test assumes you have a valid instrument. This is a strong assumption in many
cases. The implication of this is that if you fail to reject the null, it may be because the
instrument is invalid, rather than because the variable is in fact exogenous.

The discussions in the Greene, Wooldridge and Cameron and Trivedi books emphasize
different issues associated with the test. Greene also presents an alternative regression
based version of the test due to Wu (1973). This version of the test includes the predicted
values of the endogenous variables in the outcome equation along with the original X.
The test consists of an F test of the joint null that the coefficients on all of the predicted
values equal zero. The intuition is that under the null the differences between the
predicted values and the original values are just noise, and so should not be related to the
dependent variable.

The bottom line: if there is a reason to expect endogeneity but the test fails, it is nice to
report both the OLS and IV estimates, the test results, and to cite the Guggenberger
(2010) Econometric Theory paper and interpret the findings from the analysis
accordingly.

The D-W-H test is in fact far more general. The basic idea of comparing two estimates,
one that is consistent both under the null and the alternative and one that is consistent
only under the null, can be applied in many econometric contexts. We will see the test
again in the context of fixed and random effects in panel data models.

Durbin-Wu-Hausman test in Stata

The implementation of the Durbin-Wu-Hausman test in Stata differs between Version 7
and Versions 8 and 9. In Version 7, do the following

ivreg y x1 (x2 = z1)

where y denotes the dependent variable, x1 denotes the exogenous variables included in
the outcome equation, x2 denotes the endogenous variables included in the outcome
equation and z1 denotes the excluded instruments.

Next do

hausman, save

This tells Stata to save the IV results for later comparison with the OLS results. Then
generate the OLS results by

regress y x1 x2

Finally, have Stata compare the OLS and IV results and do the Hausman test by using the
hausman command a second time.

hausman, constant sigmamore

The constant option tells Stata that the constant has the same interpretation in the two
models and so should be included in the test. The hausman command is very general
and also applies to circumstances where this is not the case, as in the comparison of
random effect and fixed effect models.

The sigmamore option tells Stata to use the same estimate of the covariance matrix for
both covariance matrices in the test – namely the covariance matrix from the efficient
estimator under the null. As noted in the Stata manual, this is the appropriate covariance
matrix for tests of exogeneity and over-identifying restrictions in IV models.

In Stata Versions 8 and 9, the sequence of commands is:

ivreg y x1 (x2 = z1)
estimates store ivname
regress y x1 x2
estimates store olsname
hausman ivname olsname, constant sigmamore

where “ivname” and “olsname” can be any names (other than Stata command names) that
you give to the saved IV and OLS estimates, respectively.

Stata uses the second version of the test, generalized to matrices.

Testing over-identifying restrictions

Motivation

In some cases, you may have more candidate instruments than you do endogenous
variables. In this case, because you only need as many instruments as there are
endogenous variables for identification there are over-identifying restrictions that can be
tested. These restrictions consist of the exogeneity of the “extra” instruments.

Intuition

The intuition behind the test is to compare 2SLS estimates obtained using all of the
instruments with those using just a necessary subset of the instruments. The original
version of the test in Hausman does exactly this.

As this test is computationally burdensome, Wooldridge recommends a simple
regression-based alternative. In the alternative test, the 2SLS residuals from using all of
the candidate instruments are regressed on all of the exogenous variables. Under the null
that all of the instruments are valid, this regression should have a population r-squared
value of zero. Under the alternative, the population r-squared is non-zero because some
of the instruments are correlated with the error term, which is estimated by the 2SLS
residuals.

Formal test

First obtain the 2SLS estimates using all of the exogenous variables as instruments.

Using these estimates, generate the predicted outcome equation residuals. These
residuals are generated using the observed value of the endogenous variable, not its
predicted value from the first stage.

Now regress the predicted residuals on all of the exogenous variables, which means all of
the candidate instruments as well as the exogenous variables from the outcome equation.

Under the null that E(Z′U) = 0, NR² has a chi-squared distribution with Q_1 degrees of
freedom, where Q_1 is the number of “extra” instruments. For example, if there is one
endogenous variable and three candidate instruments, then Q_1 = 3 − 1 = 2 .

Under the alternative hypothesis that not all of the instruments are valid, the test statistic
will be large. Thus, the null is rejected for appropriately large values of the test statistic.
The test statistic is large under the alternative because if one or more of the instruments is
invalid, they will account for some of the variance in the predicted residuals.

The key here is that at least one of the instruments must be valid so that it can be a test of
over-identifying restrictions and not of just-identifying restrictions.

Testing over-identifying restrictions in Stata

Doing the regression-based test in Stata uses standard commands.
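
A sketch of that regression-based test (hypothetical variable names; one endogenous
variable x2 and three candidate instruments, so Q_1 = 2):

ivreg y x1 (x2 = z1 z2 z3)
predict uhat, residuals              // uses the observed x2, not its fitted value
regress uhat x1 z1 z2 z3
display "N*R2 = " e(N)*e(r2)
display "p-value = " chi2tail(2, e(N)*e(r2))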

Doing the original Hausman version of the test in Stata uses the same general structure as
the exogeneity test.

The following sequence of commands performs the test:

ivreg y x1 (x2 = z1)
estimates store ivconsist
ivreg y x1 (x2 = z1 z2)
estimates store iveffic
hausman ivconsist iveffic, constant sigmamore

where y denotes the dependent variable, x1 denotes the exogenous variables included in
the outcome equation, x2 denotes the endogenous variables included in the outcome
equation, z1 denotes the excluded instruments that are assumed to be valid under both the
null and the alternative and where “ivconsist” and “iveffic” are just names given to the
saved estimates from the consistent and efficient models, respectively.

There is also a command called overid available on the internet that does the
regression version of the test described above (and in the Wooldridge book) in one step,
immediately following the estimation of the ivreg command.

To install this command and learn about it, type

net install overid

within Stata, and then

help overid

Note once again that the regression version of the test does not test the validity of one
instrument conditional on the others; rather, it tests the joint null that all of the
instruments are valid, leaving you to sort out which ones may be causing problems. As in
the exogeneity test, this is a bit of an odd choice for the null.

Heterogeneous treatment effects: binary endogenous variable and binary instrument

Motivation

The recent literature on instrumental variables (and, indeed, the recent literature on
treatment effects in general) emphasizes the role of heterogeneous treatment effects.

The implicit baseline model here is a model of common treatment effects. That is, there
is a dummy variable for the treatment, and everyone has the same value for the
coefficient on this variable. If one is feeling ambitious, the dummy variable might be
interacted with various observed characteristics to estimate subgroup effects.

Similar issues arise with random coefficient models for continuous independent variables
of interest. The related literature is an active area of research.

In the case of IV, the key issue that has excited the most attention is the potential for
instruments that are uncorrelated with the outcome equation error term to nonetheless be
correlated with the person-specific component of the impact.

The linear model

Consider the following equation

Y_i = β X_i + δ_i D_i + ε_i .

This equation embodies heterogeneous treatment effects because the coefficient on the
treatment dummy has an “i” subscript.

Now let δ_i = δ + δ_i* ,

where we decompose the individual treatment effect into the mean treatment effect
(conditional on treatment) and the individual deviation from the mean.

We can then re-write the model as

Y_i = β X_i + δ D_i + (δ_i* D_i + ε_i) ,

where (δ i* Di + ε i ) is the composite error term. For the untreated units, it consists only of
the usual error term, while for the treated units it consists of both the usual error term and
the idiosyncratic component of the impact.

In the common coefficient model, the condition for a valid instrument is

E(ε_i | X_i, D_i, Z_i) = 0 .

In contrast, the corresponding condition for the heterogeneous treatment effect world is
given by

E[(δ_i* D_i + ε_i) | X_i, D_i, Z_i] = 0 .

This is a different and stronger condition that will not be fulfilled by many standard
instruments.

To see why consider a typical IV strategy in a treatment effects context. Here variation
in the cost of treatment gets used to generate exogenous variation in the probability of
treatment. If the component of costs (e.g. distance from the training center or college,
presence of a bus strike) that generates the variation is uncorrelated with the outcome
equation residual, then this constitutes a valid instrument in the common effect world.

However, it is easy to show that such instruments are not likely to be valid in the
heterogeneous treatment effects world, if agents know both their costs and their likely
impacts from the program.

For example, suppose that participation is determined by

D_i* = γ_0 + γ_1 δ_i + γ_2 Z_i + υ_i ,

where Di = 1 iff Di* > 0 and Di = 0 otherwise, and where Z i is a variable that measures
the cost of participation.

In this case, units with high idiosyncratic impacts will participate even when their costs
are high, but units with low idiosyncratic impacts will participate only when their costs
are low. As a result (and this is the key point) conditional on D, the value of the
instrument is correlated with the unit-specific component of the impact. Put differently,
the standard model of participation – based on the model in Heckman and Robb (1985) –
implies that cost-based instruments are likely invalid in a heterogeneous treatment effects
world.

In terms of the notation, in this case E[(δ_i − δ) | X_i, Z_i, D_i] ≠ 0 .
Put differently, in this case there are two kinds of selection, one on the outcome equation
unobserved component in the untreated state and one on the unobserved component of
the impact.

However, all is not lost. When the instrument is correlated with the unit-specific
component of the impact, but not with the original outcome equation error term, IV still
estimates an interpretable and (sometimes) useful parameter under some additional
assumptions. It is to this parameter that we now turn.

LATE

It is easiest to see what is going on with local average treatment effects if we stick to the
very simple case of a binary instrument, a binary treatment and no covariates.

Draw a 4 x 2 table with the four groups as the rows and columns for the values of the
binary instrument.

The table divides up the population into four groups:

Never takers (NT): D = 0 for both values of Z;
Defiers (DF): D = 0 when Z = 1 and D = 1 when Z = 0;
Compliers (C): D = 1 when Z = 1 and D = 0 when Z = 0;
Always takers (AT): D = 1 for both values of Z.

The compliers are the key group here. They are the units that respond to the instrument
by taking the treatment when they otherwise would not. For example, if costs are the
instrument, these units do not participate when costs are high but do participate when
costs are low.

Now consider the formula for the Wald estimator once again.

Δ_IV = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .

Using the four groups, we can rewrite the terms in the numerator as:

E(Y | Z = 1) = E(Y_1 | AT) Pr(AT) + E(Y_1 | C) Pr(C) + E(Y_0 | DF) Pr(DF) + E(Y_0 | NT) Pr(NT)

and

E(Y | Z = 0) = E(Y_1 | AT) Pr(AT) + E(Y_0 | C) Pr(C) + E(Y_1 | DF) Pr(DF) + E(Y_0 | NT) Pr(NT) .
The denominator terms become

Pr(D = 1 | Z = 1) = Pr(D = 1 | Z = 1, AT) Pr(AT) + Pr(D = 1 | Z = 1, C) Pr(C) = Pr(AT) + Pr(C)

and

Pr(D = 1 | Z = 0) = Pr(D = 1 | Z = 0, AT) Pr(AT) + Pr(D = 1 | Z = 0, DF) Pr(DF) = Pr(AT) + Pr(DF) .

Some of these terms cancel out in the subtraction, leaving

Δ_IV = { [E(Y_1 | C) Pr(C) + E(Y_0 | DF) Pr(DF)] − [E(Y_0 | C) Pr(C) + E(Y_1 | DF) Pr(DF)] } / [Pr(C) − Pr(DF)] .

This is a bit of a mess, but still provides some insight. The IV estimator in this case is a
weighted average of the treatment effect on the compliers and the negative of the
treatment effect on the defiers.

To get to the LATE, two assumptions are required.

First, we assume monotonicity. That is, we assume that the instrument can only increase
or decrease the probability of participation. Thus, if increasing the instrument increases
the probability of participation for some units, it does so for all. This assumption fits in
very well for cost-based instruments, which theory suggests should have such a
monotonic effect.

In terms of the notation, monotonicity means that Pr( DF ) = 0 . In words, it means that
there are no defiers.

Ed Vytlacil’s Econometrica paper shows that the monotonicity assumption is equivalent
to assuming an index function model; this result modifies some of the claims in the
Angrist, Imbens and Rubin (1996) JASA paper.

The second assumption is that there are some compliers, which is to say that Pr(C ) > 0 .
The behavior of some individuals must be affected by the instrument.

Imposing these assumptions on the formula gives

Δ_IV = [E(Y_1 | C) Pr(C) − E(Y_0 | C) Pr(C)] / Pr(C) = E(Y_1 | C) − E(Y_0 | C) = LATE .

Thus, in a heterogeneous treatment effects world, where the instrument is correlated with
the impact and where monotonicity holds, the IV estimator gives the Local Average
Treatment Effect (LATE), defined as the mean impact of treatment on the compliers.

In statistics the LATE is called the CACE = Complier Average Causal Effect.
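
A minimal simulation sketch in Stata (my own illustration, with made-up group shares and
impacts): always takers, compliers, and never takers have different treatment effects, and
IV recovers the complier effect only.

clear
set seed 717
set obs 100000
gen z = runiform() < 0.5
gen u = runiform()
gen group = cond(u < 0.2, 1, cond(u < 0.7, 2, 3))   // 1 = AT, 2 = C, 3 = NT
gen d = (group == 1) | (group == 2 & z == 1)        // no defiers by construction
gen effect = cond(group == 1, 20, cond(group == 2, 10, 5))
gen y = 100 + effect*d + rnormal()
ivreg y (d = z)                // estimate is close to 10, the complier effect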

The LATE is a well-defined economic parameter. In some cases, as when the available
policy option consists of moving the instrument, it may be the parameter of greatest
interest. Thus, if we want to give a $500 tax credit for university attendance, the policy
parameter of interest is the impact of university on individuals who attend with the tax
credit but do not attend without it.

The LATE provides no information about the impact of treatment on the always-takers,
which could be large or small. If the mean impact of treatment on the treated is the
primary parameter of interest, this is a very important omission.

Discuss how this works in an experiment with dropouts and substitutes, using the JTPA
experiment numbers – 60 percent treated in the treatment group and 40 percent treated in
the control group. Make the point that the impact estimate when you do the Wald/Bloom
estimator is a LATE. It is the impact on the 20 percent who receive treatment because
they were randomly assigned into the treatment group rather than the control group.

A simple example

Return now to our simple example of the training center serving two towns, but change the
parameters as follows:

Outcome in the absence of training: 100
Tuition for the training course: 5
Cost of travel from near town: 0
Cost of travel from far town: 10
Number of persons in each town: 100

Change the impacts so that they vary across persons. In particular, assume that:

Impact of training for one half of the population: 20
Impact of training for the other half of the population: 10

Assume that the impact of training is independent of what town you live in.

As before, Z = 1 denotes residence in the near town and Z = 0 denotes residence in the
far town.

Now we have the following:

Pr(D = 1 | Z = 1) = 1.0
Pr(D = 1 | Z = 0) = 0.5

The latter follows from the fact that in the far town only those persons with an impact of
20 will take the training, because only for them does the impact exceed the combined
travel and tuition costs.

Similarly,

E(Y | Z = 1) = Y_0 + E(Δ | Z = 1, D = 1) Pr(D = 1 | Z = 1) = 100 + 10(0.5) + 20(0.5) = 115 ,

and

E(Y | Z = 0) = Y_0 + E(Δ | Z = 0, D = 1) Pr(D = 1 | Z = 0) = 100 + 20(0.5) = 110 .

Inserting the numbers from the example into the formula gives:

Δ̂_IV = (115 − 110) / (1.0 − 0.5) = 5 / 0.5 = 10 .

This is the same answer as before, but now it is a local average treatment effect. The
marginal group, the group that enters training when the instrument changes, consists of
persons with impacts of 10. They participate in the near town but not in the far town.

The LATE does not equal either the ATE or TT.

In particular,

Δ_ATE = E(Δ | Z = 1) Pr(Z = 1) + E(Δ | Z = 0) Pr(Z = 0) ,

and

Δ_TT = E(Δ | Z = 1, D = 1) Pr(Z = 1 | D = 1) + E(Δ | Z = 0, D = 1) Pr(Z = 0 | D = 1) .

That is, these parameters are weighted averages of the mean impacts in each town, where
the weights depend on the number of assumed participants in each town.

We have Pr(Z = 1) = Pr(Z = 0) = 0.5 by assumption.

We have E(Δ | Z = 1) = 15 and E(Δ | Z = 0) = 15 by assumption.

We have E(Δ | Z = 1, D = 1) = 15 and E(Δ | Z = 0, D = 1) = 20 due to the differences in
participation between the near and far towns.

Finally, Pr(Z = 1 | D = 1) = 0.67 while Pr(Z = 0 | D = 1) = 0.33 .

Thus,

Δ_ATE = (15)(0.5) + (15)(0.5) = 15 ,

and

Δ_TT = (0.67)(15) + (0.33)(20) = 16.67 .

As expected, we have that Δ_TT > Δ_ATE .

When are there heterogeneous treatment effects?

You can bound the variance of the impacts (conditional on having solved the selection
problem) using the methods in e.g. Heckman, Smith and Clements (1997) Review of
Economic Studies and Djebbari and Smith (2008) Journal of Econometrics. It is also
possible to formally test the null of a zero variance.

Institutional knowledge is helpful here. If the treatment is itself heterogeneous, as in the
case of most training programs, this suggests that the treatment effects are likely to be
heterogeneous as well.

Looking for variation in the average treatment effect across subgroups is also
informative. If mean impacts vary a lot with observed characteristics, this suggests that
they likely vary with unobserved characteristics as well.

Heterogeneous treatment effects: continuous endogenous variable and binary or
continuous instruments

The same issues arise in the more general contexts with a continuous endogenous
variable and either binary or continuous instruments. These contexts are beyond the
scope of this course, but the same general result is found. What is estimated is a
weighted average effect where the weights depend on the particular instrument used and
the effects for those units whose behavior is changed by the instrument.

See Angrist and Imbens (2005) JASA as well as the many recent papers on the “correlated
random coefficient model” by Heckman and various co-authors.

Heterogeneous treatment effects: multiple instruments

In a world with heterogeneous treatment effects, the use of multiple instruments affects
the nature of the parameter being estimated if the instruments are associated with the
treatment effects.

Consider the case of two valid instruments, each of which is uncorrelated with the level
but correlated with the impact. Each instrument, used separately, produces an estimate of
a different LATE. Put differently, using each instrument estimates a different parameter.

Using both instruments together in 2SLS then produces a hybrid of the two LATEs.
Again, a detailed treatment is beyond the scope of this course, but it is important to be
aware that, when the world switches from homogeneous to heterogeneous treatment effects,
using multiple instruments changes from a simple and obviously useful way to (perhaps)
improve the efficiency of the estimator into a complicated decision regarding the nature
and definition of the parameter being estimated.

When different instruments estimate different LATEs, this also complicates the
interpretation of Durbin-Wu-Hausman tests. Getting a different estimate when using two
instruments rather than one of the two now no longer necessarily signals that the marginal
instrument is invalid. It may signal instead that the two instruments yield different
LATEs. See the recent Heckman, Schmierer and Urzua (2010) Journal of Econometrics
paper on this point.

Where do instruments come from?

Theory combined with clever data collection

Theory can be used to come up with a variable that may affect participation but not
outcomes other than through participation. Think about which variables might affect
treatment choice but not have a direct effect on outcomes in the context of a formal or
informal theory of the phenomenon being analyzed. Then find a data set, or collect some
data, that embody these variables along with the outcomes and other exogenous
covariates. In many contexts such variables will represent aspects of the costs of being
treated that are more or less exogenously assigned.

Examples include:

Distance from the training center
Distance from college at birth as in Card (1995)
Living in a town where a new college opens or does not open

Most of these instruments yield LATEs rather than ATET.

Exogenous variation in policies or program implementation

A second source of instruments is exogenous variation in policy or program
implementation. This can include variation over time or variation across space. The key
is making the case that the variation really is exogenous.

Examples:

BC disability person
Random assignment to judges of varying severity
Assignment to different caseworkers within a job training center
Variation in the budget across local offices of a job training program within a local labor
market as in Frölich and Lechner (2004)

These again will often yield LATEs rather than ATETs.

Nature

The natural world sometimes creates exogenous variation. Thus, learning about the
natural world can have a payoff in economics.

Examples include:

Miscarriages and single motherhood, as in Hotz, Mullin and Sanders (1997)
Temperature and fertility
Weather and fish
Sex of first two children for third birth
Twins at second birth for third child

Deliberate creation

A final source of instruments is deliberate creation. Running an experiment is one way of
creating a really good instrument. When experiments are infeasible, Angrist and others
have suggested randomized encouragement as a method for creating a good instrument.
The trick in this case is finding an outreach technique that has a substantial effect on the
probability of participation.
