Economics 717 Fall 2019 Lecture: Instrumental Variables (IV)
Motivation
Standard instrumental variable (IV) methods deal with the endogeneity of a regressor
within a linear regression framework.
The instrument, or exclusion restriction, is a variable that, in loose terms, affects the
endogenous variable but does not affect the outcome variable other than through its effect
on the endogenous variable.
In the last 20 years, there has been a revolution in the interpretation of IV estimates as the
literature has begun to investigate them in contexts where the effect of the endogenous
variable varies across units in ways that may, or may not, be related to the instrument.
Understanding this recent literature is both fascinating and key to getting an IV paper
accepted at a good journal.
Outline
To begin, consider the example of Project Star in Tennessee, which was a random
assignment evaluation of (ideally) changing the size of primary school classes in
government schools from 24 to 18. The outcome variable of interest was test scores.
The reduced form model gives the direct impact of being assigned to the treatment group
on outcomes:
Yi = Π0 + Π1 Zi + vi .
However, we are also interested in a more structural parameter, namely the derivative of
test scores with respect to class size. That parameter appears in the “structural” equation
Yi = β0 + β1 Xi + εi ,

where β1 = dY/dX.
In terms of the notation above, the coefficient from the reduced form model may be
written as
Π1 = (dX/dZ)(dY/dX) .
We might think about estimating the structural model directly, given random assignment,
but there may be some practical concerns that intervene. In Project Star, random
assignment was defeated at the margin by parents in some cases. Thus, we may want
instead to estimate the structural parameter of interest indirectly using IV.
We can obtain the first component by estimating the first stage model
Xi = θ0 + θ1 Zi + ui .
If random assignment had worked perfectly, the estimated coefficient would equal minus six; given the empirical problems, it will lie between zero and minus six.
The IV estimate is then given by the ratio of the reduced form estimate from the outcome
equation to the coefficient from the first stage, or
β1 = dY/dX = Π1 / θ1 .
This estimate is consistent despite the partial disruption of random assignment, while the
estimate based on direct estimation of the structural equation would not be, because the
error term in that equation will be correlated with the class size variable if disruptions of
random assignment are not random with respect to test scores.
This estimator is sometimes called the “indirect least squares” estimator.
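To make the indirect least squares logic concrete, here is a minimal Stata sketch of the Project Star setup. All numbers and variable names are invented for illustration; the key features are that assignment is random, that some parents defeat it in a way correlated with unobserved ability, and that the ILS/IV estimate is the ratio of the reduced form to the first stage.

* Hypothetical simulation of the Project Star logic (all numbers invented)
clear
set seed 717
set obs 2000
generate z = (runiform() < 0.5)               // random assignment to small classes
generate u = rnormal(0, 5)                    // unobserved ability
generate push = (u > 0) & (runiform() < 0.3)  // some high-ability parents defeat assignment
generate small = (z | push)                   // actual small-class status
generate x = 24 - 6*small                     // class size: 18 or 24
generate y = 100 - 0.5*x + u                  // test scores; true dY/dX = -0.5
regress y z                                   // reduced form: Pi_1
regress x z                                   // first stage: theta_1, between 0 and -6
ivreg y (x = z)                               // ILS/IV recovers roughly -0.5; OLS on x does not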
Just-Identified Case
Basic model
Now consider the case of non-experimental data. To keep things simple, suppose that we
have just one covariate, and it is endogenous. Thus, we have that
yi = β0 + β1 xi + εi ,

where cov(xi, εi) ≠ 0 .
Because of the endogeneity problem, we cannot impose the second of the OLS orthogonality conditions (equivalently, the second of the OLS first order conditions).
Suppose instead that we have an instrument, a variable such that cov(zi, εi) = 0 but cov(zi, xi) ≠ 0 .

Note that in the general case with more covariates, all that is required is that the covariance of zi and εi conditional on all of the xi equals zero.
Instead, we impose the sample moment condition (summing over i = 1, …, n)

Σᵢ zi (yi − β̂0 − β̂1 xi) = 0 .
This is the same as the analogous OLS first order condition, but with zi in place of xi .
Proceeding with the usual derivation but with this replacement leads to
β̂IV = Σᵢ (yi − ȳ)(zi − z̄) / Σᵢ (xi − x̄)(zi − z̄) .
Using the notation from the Project Star example, this falls out easily. Recall that
θ̂1 = Σᵢ (zi − z̄)(xi − x̄) / Σᵢ (zi − z̄)²

and that

Π̂1 = Σᵢ (zi − z̄)(yi − ȳ) / Σᵢ (zi − z̄)² .

Taking the ratio gives

β̂1 = Π̂1 / θ̂1 .
A useful special case occurs when both the endogenous variable and the instrument are
binary. In this case, the IV formula reduces to a simple form called the Wald estimator.
Consider the reduced form outcome model. In the case of a binary instrument, the
coefficient on the instrument is simply the mean difference in outcomes between units
with the instrument equal to one and units with the instrument equal to zero. In notation,
the reduced form coefficient equals

Π̂1 = Ê(Y | Z = 1) − Ê(Y | Z = 0) .
Similarly, the first stage regression of the endogenous variable on the instrument is a
linear probability model in this case. The coefficient on the instrument is just the mean
difference in the probability that the endogenous dummy variable equals one for units
with the instrument equal to one and units with the instrument equal to zero. In notation,
the first stage coefficient equals
θ̂1 = Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0) .
Combining the two estimates using the formula above gives the Wald estimator
∆IV = Π1 / θ1 = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .
A simple example
Consider now a simple example, which will use distance from the training center as an
instrument in an effort to measure the effect of training on some outcome Y.
To begin, consider a training center in one town that serves two towns. Those in the
same town as the training center have a short distance to go, while those in the other town
have a long distance to go.
Suppose that the impact of training, ∆ , on those who take it is 10, while the outcome in
the absence of training, denoted by Y0 , is 100.
The cost of travel for those in the second town varies across persons. For those in the
near town, the cost is zero for everyone. In the far town, for those with a car the cost is
essentially zero; for those without one the cost is 12. Assume that half of the eligible
persons have a car and that there are 200 eligible persons in each town.
Assume also that everyone knows their cost of training and their benefits from training,
and participates only when the benefits exceed the costs.
Let Z = 1 denote residence in the near town and Z = 0 denote residence in the far town
and let D = 1 denote participation in training and D = 0 denote non-participation.
Given these assumptions, everyone in the near town participates, while in the far town only those with cars do. Thus Pr(D = 1 | Z = 1) = 1.0 and Pr(D = 1 | Z = 0) = 0.5 .
Thus, there is a difference in participation probabilities between the near town and the far
town, which is to say that our instrument is correlated with the endogenous variable.
At the same time, by construction the instrument is not correlated with the untreated outcome (which is a constant in this example). Thus, the instrument meets the two criteria for validity.
The mean outcomes in the two towns are

E(Y | Z = 1) = Y0 + ∆ Pr(D = 1 | Z = 1) = 100 + 10(1.0) = 110

and

E(Y | Z = 0) = Y0 + ∆ Pr(D = 1 | Z = 0) = 100 + 10(0.5) = 105 .

Recall the Wald formula:

∆IV = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .
The intuition behind the formula is that the denominator scales up the mean difference in outcomes to account for the fact that the change in the instrument changes treatment status for some but not all individuals.
Inserting the numbers from the example into the formula gives:
∆̂IV = (110 − 105) / (1.0 − 0.5) = 5 / 0.5 = 10 .
The IV estimator isolates variation in the endogenous variable due to specific exogenous
factors. In this example, that exogenous variation is all of the variation.
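In Stata, with hypothetical variables d (training) and z (near town) in hand, the Wald estimator can be computed directly from the four cell means, and it coincides with IV:

* Wald estimator from cell means (variable names hypothetical)
quietly summarize y if z == 1
scalar ey1 = r(mean)                  // E(Y | Z = 1)
quietly summarize y if z == 0
scalar ey0 = r(mean)                  // E(Y | Z = 0)
quietly summarize d if z == 1
scalar pd1 = r(mean)                  // Pr(D = 1 | Z = 1)
quietly summarize d if z == 0
scalar pd0 = r(mean)                  // Pr(D = 1 | Z = 0)
display (ey1 - ey0) / (pd1 - pd0)     // here: (110 - 105)/(1.0 - 0.5) = 10
ivreg y (d = z)                       // gives the same number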
So far we have considered the case with only one instrument for one endogenous
variable. Suppose instead that we have multiple instruments, z1 ,..., zM .
Identification
In this case, the model is said to be “over-identified”, because only a single instrument is
required to identify a model with one endogenous variable.
If there are M instruments and one endogenous variable, then there are M − 1 “over-
identifying restrictions” which can be tested. We discuss how to do that later on in the
lecture.
Assuming each instrument is individually correlated with the endogenous variable, and
uncorrelated with the error term, we have many potential IV estimators. Each instrument
by itself will produce consistent estimates, as will any linear combination of the
instruments.
Which combination is best?
It turns out that using the linear projection of the endogenous variable on all of the
exogenous variables, both those in the model and the outside instruments, minimizes the
variance of the resulting estimates. This is not surprising given that the linear projection
corresponds to the least squares estimator of a regression of the endogenous variable on
the exogenous variables, and that, within the class of linear estimators, least squares
maximizes the covariance between the two by minimizing the sum of squared errors.
Using the linear projection of the endogenous variables on the exogenous variables as the
instrument leads to the two-stage least squares estimator. We will formally derive this
estimator shortly. For the moment, the basic idea is the following:
First stage: regress the endogenous variable on the exogenous variables. In the notation
from Wooldridge, this means estimating the model
xK = δ0 + δ1 x1 + ... + δK−1 xK−1 + θ1 z1 + ... + θM zM + rK ,
where xK is the endogenous variable, x1 ,..., xK −1 are the exogenous variables in the
outcome equation and z1 ,..., zM are the outside instruments.
In the population, the instrument is the linear projection of xK on all of the exogenous variables. In practice, the OLS fitted values are used. These are given by

x̂K = δ̂0 + δ̂1 x1 + ... + δ̂K−1 xK−1 + θ̂1 z1 + ... + θ̂M zM .
Second stage: estimate the outcome equation using the projection of the endogenous
variable in place of the endogenous variable itself.
y = β0 + β1 x1 + ... + βK−1 xK−1 + βK x̂K + v ,
where the error term is now uncorrelated with all of the independent variables, thus
giving consistent (but not unbiased) estimates.
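As a sketch (variable names hypothetical, with x3 endogenous and z1, z2 the excluded instruments), the two steps look like this in Stata. Note that doing it by hand gives the right coefficients but the wrong standard errors, for reasons discussed below:

* Two-stage least squares "by hand" (names hypothetical)
regress x3 x1 x2 z1 z2       // first stage
predict x3hat, xb            // OLS fitted values of the endogenous variable
regress y x1 x2 x3hat        // second stage: consistent point estimates
* the SEs from this second stage are wrong; ivreg y x1 x2 (x3 = z1 z2)
* produces identical coefficients with correct standard errors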
Proof
The proof that two-stage least squares is the most efficient estimator (conditional on
using a linear combination of the instruments) is in Wooldridge. It is not that
illuminating, so I omit it here.
We now give an abbreviated formal treatment of the IV estimator; for a longer treatment
see the Wooldridge book.
Assumptions
Two key assumptions underlie the two-stage least squares estimator. They are:

Assumption 2SLS.1 (exogeneity): E(Z′u) = 0, where Z collects all of the exogenous variables (the exogenous x and the excluded instruments).

Assumption 2SLS.2 (rank condition): (a) rank E(Z′Z) = L and (b) rank E(Z′X) = K, where L is the number of variables in Z and K the number in X.

Thus, the first assumption is the key one – the instruments (including the exogenous X) are uncorrelated with the error term.

The second assumption has to do with making sure that there are enough exogenous variables.
Part (a) is not very important; it just means that you do not have perfect multicollinearity
among the exogenous variables.
Part (b) is more important. It means that the excluded instruments are related to the
endogenous structural equation variables conditional on the included exogenous
variables. This formalizes the intuition that the excluded instruments need to be related
to the endogenous variables, and makes it clear that this has to be true even conditional
on the included exogenous variables.
Note that if Z = X , Assumptions 2SLS1 and 2SLS2 become two of the standard
assumptions for the OLS model: that the X matrix is of full rank and that the X are
uncorrelated with the population residuals.
In the simple case with one endogenous variable, the condition in (b) can be tested by
regressing the endogenous variable on all of the exogenous variables. The necessary
condition is that the coefficient on at least one of the excluded instruments does not equal
zero.
A general test is also possible in the case of multiple endogenous variables. In practice, it
is often easier to test the necessary, but not sufficient, condition that at least one of the
excluded instruments is significant in each first stage. If this necessary condition fails,
then assumption (b) is problematic.
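In Stata, this check amounts to an F-test on the excluded instruments in each first stage (a sketch with hypothetical names):

* Necessary condition for 2SLS.2(b): excluded instruments matter in the first stage
regress x3 x1 x2 z1 z2    // first stage for the endogenous variable x3
test z1 z2                // H0: excluded instruments jointly zero; rejection is necessary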
Start with the derivation of the IV estimator, which flows out of the basic orthogonality
condition. Impose
E(Z′U) = E[Z′(Y − Xβ)] = 0 .

Solving gives

E(Z′Y) = E(Z′X)β ⇒ βIV = [E(Z′X)]⁻¹ E(Z′Y) .
The consistent estimator is given by replacing the expected values by their sample
analogues, so that
β̂IV = (Z′X)⁻¹(Z′Y) .
This estimator only works when Z has the same number of columns as X, which is to say
that the number of excluded instruments is the same as the number of endogenous
variables.
The two stage least squares estimator handles the general case where the number of
excluded instruments is greater than or equal to the number of endogenous variables. It
uses the projection of the X onto all of the exogenous variables Z (the exogenous X and
the excluded instruments) as the instruments in the IV estimator.
Substituting X̂ for Z in the IV formula gives β̂2SLS = (X̂′X)⁻¹(X̂′Y) = (X̂′X̂)⁻¹(X̂′Y). This is just the least squares estimator with X̂ in place of X, which is where the name "two-stage least squares" comes from.
As noted above, the 2SLS estimator is efficient in the class of estimators based on linear
functions of the instruments.
Consistency
The 2SLS estimator is consistent. The proof is in Wooldridge and proceeds along very similar lines to the proof for OLS. The key element is that the plim of Z′U/N equals zero by assumption.
Asymptotic distribution
The conditioning on Z follows from the fact that X includes endogenous variables. Recall
that Z includes the exogenous variables in X as well as the excluded instruments.
The assumption of homoskedasticity, combined with the original assumptions 2SLS1 and
2SLS2, allow the following result:
√N (β̂ − β) ∼ AN(0, σ²[E(X′Z) E(Z′Z)⁻¹ E(Z′X)]⁻¹) ,
where the error variance is estimated by

σ̂² = [1/(N − K)] Σᵢ ûᵢ² , with ûᵢ = Yᵢ − Xᵢβ̂ .
Note that these uˆi are residuals obtained using the actual X, rather than the predicted X
from the first stage. Thus, they are not the residuals from the second stage equation that
you actually estimate as part of the two-stage least squares procedure. The reason for this
is that you do not want to include the estimation error in X in the estimate of the residual
variance.
The estimator of the variance matrix uses (X̂′X̂)⁻¹. To see why this should be so, recall that the first stage predicted values are given by

X̂ = ZΠ̂ = Z(Z′Z)⁻¹Z′X ,

so that X̂′X̂ = X′Z(Z′Z)⁻¹Z′Z(Z′Z)⁻¹Z′X = X′Z(Z′Z)⁻¹Z′X. This comprises the last five matrices in the component of the variance we want to estimate.

Then note that (X′X̂) = (X̂′X̂) because the residual from the first stage is necessarily uncorrelated with X̂.
Combining the two parts, we obtain the variance estimator for the 2SLS estimates given
by:
var(β̂2SLS) = σ̂²(X̂′X̂)⁻¹ .
This is just the OLS variance estimate with X̂ in place of X. This is not surprising, given
that 2SLS is just OLS using the predicted X in place of the original X.
Comparing this estimator to the OLS estimator also makes it clear why you want
instruments that do a good job of capturing the variance of the endogenous variable. If
they do not, then the variation in X̂ will be (potentially much) less than that in X. When
this occurs, the 2SLS variance will be much larger than the OLS variance because the
denominator will be smaller, with the imprecision concentrated on the coefficients of the
endogenous variables.
Show the corresponding formulae for the simple linear regression model to make the
point.
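In outline (a sketch, assuming homoskedasticity), the simple-model formulas are

var(β̂1,OLS) = σ̂² / Σᵢ (xi − x̄)²   versus   var(β̂1,2SLS) = σ̂² / Σᵢ (x̂i − x̄)² = σ̂² / [R²xz Σᵢ (xi − x̄)²] ,

where R²xz is the R-squared from the first stage (the fitted values x̂i capture only the fraction R²xz of the variation in x). As the first stage weakens, R²xz falls toward zero and the 2SLS variance blows up relative to OLS.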
The control function version of the 2SLS estimator just includes the estimated residual
from the first stage as a covariate in the second stage.
This turns out to work in the same way as the inverse Mills ratio in the treatment effects version of the bivariate normal estimator. It captures the expected value of the error term and thereby solves the selection or endogeneity problem when thought about as an omitted variable problem.
Keep in mind that this is a generated regressor, so that the estimated standard errors are
more complicated than usual.
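A minimal Stata sketch (hypothetical names, x3 endogenous):

* Control function version of 2SLS (names hypothetical)
regress x3 x1 x2 z           // first stage
predict vhat, residuals      // first-stage residuals
regress y x1 x2 x3 vhat      // second stage with the residual as a control
* the coefficient on x3 matches 2SLS; the reported SEs ignore that vhat is estimated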
Asymptotic bias
To consider the asymptotic bias it is simplest to stick with the case of one instrument and
one endogenous variable. The points all carry over to the general case.
β̂1,IV = Σᵢ (zi − z̄)(yi − ȳ) / Σᵢ (zi − z̄)(xi − x̄) = β1 + Σᵢ (zi − z̄)ui / Σᵢ (zi − z̄)(xi − x̄) ,

from which it follows (skipping a few steps that are the same as in the OLS case) that
plim β̂1,IV = β1 + σzu/σxz .
Thus, the asymptotic bias depends on the relative strength of the population relationships
between the instrument and the error term and between the instrument and the
endogenous variable.
Although it is assumed that σzu = 0, it is likely not zero but small in the population. As such, you want σxz to be as large as possible so that any correlation between the instrument and the error term results in a small bias. If the instruments are very weak, so that σxz is small, then even a small correlation can lead to a large bias.
It is interesting to compare the asymptotic biases of OLS and IV. The asymptotic bias of
OLS is given by
plim β̂1,OLS = β1 + σxu/σxx .
The derivation is, not surprisingly, exactly parallel to that for the IV estimator.
Taking the ratio of the two asymptotic biases gives

(σzu/σxz) / (σxu/σxx) = ρzu / (ρxu ρzx) .
Thus, the relative asymptotic bias depends on three terms: the correlation between the
instrument and the error term, the correlation between the endogenous variable and the
error term and the correlation between the instrument and the endogenous variable.
For IV to be preferred to OLS, the absolute value of the ratio of the biases should be less
than one. For that to happen, you need that
|ρzu| < |ρxu ρzx| .
Thus, IV is likely to have a lower bias than OLS when the instrument is not very
correlated with the residual, when the endogenous variable is highly correlated with the
residual, and when the instrument is strongly correlated with the endogenous variable.
Of course, considering both the bias and the variance would tip the balance a bit more
toward OLS.
This is most of what you need to know about picking a good instrument.
Finite sample bias in IV estimators has received a lot of attention in the literature in recent years, motivated in large part by the Bound, Jaeger and Baker (1995) JASA paper. They showed that the Angrist and Krueger (1991) QJE paper on the effect of schooling on earnings, which uses compulsory schooling laws as instruments, was still subject to substantial finite sample bias due to the large number of weak instruments it employs, despite samples in the hundreds of thousands.
Finite sample bias results from the fact that the first stage coefficients are estimated with
error in finite samples.
My discussion here mainly follows Bound, Jaeger and Baker (1995) JASA with some
additional bits from Angrist and Pischke (2009).
Using power series methods, Buse (1992) derives the following expression for the bias of
the instrumental variables estimator in samples of size N:
bias(β̂IV, N) ≈ (Q − 2) σuv / (θ′Ze′Zeθ) ,

where Q is the number of excluded instruments, σuv is the covariance of the unobserved components in the first stage and the outcome equation, and Ze here refers to the excluded instruments.
A related derivation in Angrist and Pischke (2009) gets to the same place, but with Q in
place of (Q – 2). Note that “ ≈ ” is important here.
The matrices in the denominator have dimensions (1 x M)(M x N)(N x M)(M x 1) and so
the matrix multiplication results in a scalar.
1. All else equal, the bias increases in Q, though adding another instrument should also
increase the denominator. In this sense, the formula provides some guide to whether or
not you want to include a given instrument. It is clear that adding a sufficiently weak
instrument, one that does little to increase the denominator, can increase the finite sample
bias by increasing Q even if it decreases the asymptotic bias by increasing the covariance
between the instruments and the endogenous variable.
2. All else equal the finite sample bias decreases in N (as the name suggests it should), because the denominator increases with N while the probability limit of the numerator does not change with N. But as Bound, Jaeger and Baker (1995) JASA show, the bias can remain substantively large even in what are very large samples by the standards of empirical research in economics.
3. This formula provides a direct motivation for the use of the first-stage F statistic on the
excluded instruments as a signal of potential weak instrument problems.
Rewriting the bias by multiplying and dividing by the first-stage error variance σv² gives

bias(β̂IV) ≈ (σuv/σv²) (Q − 2) σv² / (θ′Ze′Zeθ) .

The term τ = θ′Ze′Zeθ / σv² is called the concentration ratio. The numerator here is the increment to the explained sum of squares of the first stage that results from the excluded instruments. If the excluded instruments are the only variables in the first stage, then this is literally the explained sum of squares from the first stage.
The first term is approximately the bias of OLS when Z explains little of X so that X is
approximately equal to v.
The last two terms are approximately the inverse of the first-stage F-statistic on the
excluded instruments. To see this recall that the F-statistic plays off the difference in
residual variance between the full model and the model with the restrictions imposed:
F = [(SSRr − SSRur)/M] / [SSRur/(N − K − 1)] .
Stock, Wright and Yogo (2002) JBES suggest that a value of the first-stage F-statistic on
the excluded instruments exceeding 10 generally suffices to avoid problems with finite
sample bias.
Staiger and Stock (1997) Econometrica, using alternative asymptotics, also show that the finite sample bias is inversely related to the first-stage F-statistic.
Bias is towards the (probability limit of the) OLS estimate. The intuition here is that with
a random first stage you would, in a loose sense, get back the OLS estimates because the
random instrument will, in a finite sample, be correlated with random bits of the
endogenous variable.
Another way to think of this is that the problem relates to over-fitting in the first stage in
finite samples.
As the ratio of exogenous variation to randomness increases in the first stage, the
component of the endogenous variable that the first stage picks out becomes increasingly
exogenous, and the IV estimate migrates away from OLS towards its asymptotic value.
Getting the standard errors right when there is finite sample bias is also an issue, as the
usual TSLS variance matrix performs poorly in this context. Imbens and Wooldridge
(2007) present estimates suggesting that the TSLS variance matrix performs reasonably
well in the just identified case without excessive endogeneity. For the case of multiple
instruments, see the Monte Carlo analysis in Flores-Lagunes (2007) Journal of Applied
Econometrics.
Angrist and Pischke (2009) note that alternative estimators such as the LIML estimator
and Jackknife Instrumental Variables (JIVE) may perform better in contexts with weak
instruments (and, in the latter case, heteroskedasticity).
The ivreg command in Stata estimates models using instrumental variables or two-
stage least squares.
ivreg <depvar> <exogenous x> (<endogenous x> = <excluded instruments>)
Thus, to estimate a model in which y is the outcome, x1 and x2 are exogenous, x3 is endogenous and z is the excluded instrument, type

ivreg y x1 x2 (x3 = z)
Adding the first option to the ivreg command causes Stata to display the results of both the first stage and second stage regressions.
Testing exogeneity
Introduction
This standard test compares the OLS and 2SLS estimates. Under the null that the
potentially endogenous variable is actually exogenous, these two estimates differ only
due to sampling error, but OLS is more efficient in the statistical sense. In contrast,
under the alternative hypothesis that the variable is actually endogenous, they should
differ, because the 2SLS estimator is consistent under the alternative but OLS is not.
Formal discussion
There are a number of different ways to perform this test. Following the Wooldridge
text, this discussion highlights two of the most common.
The first method is quite straightforward. It consists of taking the residuals from the first stage estimation and including them in the second stage along with the endogenous variable. The first stage is

Xi = Ziγ + vi .
The estimated residuals are given by

v̂i = Xi − Ziγ̂ .

The augmented second stage is then

Yi = Xiβ + v̂i βv + ui .
The test is just the standard t-test of the null that the coefficient on the estimated residuals
from the first stage equals zero. In notation, the null is given by
H0: βv = 0 .
The intuition behind the test is that under the null of exogeneity, the residuals from the
first stage are just noise, and so should have a zero coefficient. In contrast, under the
alternative hypothesis of endogeneity, the first-stage residuals represent the “bad”
variation in the endogenous variable – the part that is correlated with the error term in the
outcome equation – just as they do in the control function version of the 2SLS estimator.
In this case, their coefficient should be non-zero.
This test is easily generalized to the case of multiple endogenous variables. In this case,
there is a vector of estimated residuals – one from the first stage for each of the
endogenous variables – and the test takes the form of an F-test of the null that all of the
coefficients on the estimated residuals equal zero.
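A sketch of this generalization with two endogenous variables (names hypothetical):

* Regression-based exogeneity test with two endogenous variables (names hypothetical)
regress x3 x1 z1 z2
predict v3hat, residuals
regress x4 x1 z1 z2
predict v4hat, residuals
regress y x1 x3 x4 v3hat v4hat
test v3hat v4hat             // F-test of the joint null that both residual coefficients are zero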
The second method focuses solely on the coefficient of interest estimated using OLS and
IV. Typically the coefficient of interest is the coefficient on the endogenous variable.
Call the coefficient of interest β̂k. The test then consists of examining the statistic

t = (β̂k,2SLS − β̂k,OLS) / [v̂ar(β̂k,2SLS) − v̂ar(β̂k,OLS)]^(1/2) ,

which is asymptotically standard normal under the null.
The denominator is the variance of the difference in estimated coefficients under the null.
To see this recall that under the null the additional variance component in the 2SLS
estimates is just noise. Thus, under the null the covariance is just the OLS variance. The
standard formula for the variance of a difference then yields the result in the
denominator.
The intuition here is that under the null, the coefficients differ only due to sampling error.
The 2SLS estimator is inefficient, and so has a higher asymptotic variance, because it
uses a less efficient instrument for the endogenous variable under the null. Under the null
of exogeneity, the efficient instrument for the potentially endogenous variable is the
variable itself.
In practice, the estimated variance of the difference will sometimes not be positive
definite; the ordering is asymptotic and may not hold in particular finite samples. This is
one reason to prefer the regression-based tests. In this situation Stata reports an error
message and takes a generalized inverse so that a positive value results.
Remarks
In Cambridge, Massachusetts, this is called the Hausman test. In Chicago, it is called the
Durbin-Wu-Hausman test. You can take your pick. Using all three names is probably
the risk-averse choice; another alternative is to simply call it the “standard exogeneity
test” or some such.
The null here is exogeneity. Thus, the test is, in the usual classical statistical sense, being
conservative about concluding endogeneity. If theory (or even common sense) or
evidence from other studies suggests endogeneity, this may suffice to proceed with 2SLS
regardless of the results of the test.
In the past, and sometimes in the present, researchers would adopt a strategy of testing exogeneity and then reporting the OLS estimates if the test "passed" and the IV estimates if the test "failed". The OLS or IV standard errors are invalid here due to pre-test bias. Correct standard errors should take full account of the two-step procedure that includes the test. Guggenberger (2010) Econometric Theory discusses these issues.
This test assumes you have a valid instrument. This is a strong assumption in many
cases. The implication of this is that if you fail to reject the null, it may be because the
instrument is invalid, rather than because the variable is in fact exogenous.
The discussions in the Greene, Wooldridge and Cameron and Trivedi books emphasize
different issues associated with the test. Greene also presents an alternative regression
based version of the test due to Wu (1973). This version of the test includes the predicted
values of the endogenous variables in the outcome equation along with the original X.
The test consists of an F test of the joint null that the coefficients on all of the predicted
values equal zero. The intuition is that under the null the differences between the
predicted values and the original values are just noise, and so should not be related to the
dependent variable.
The bottom line: if there is a reason to expect endogeneity but the test fails to reject, it is nice to report both the OLS and IV estimates along with the test results, to cite the Guggenberger (2010) Econometric Theory paper, and to interpret the findings from the analysis accordingly.
The D-W-H test is in fact far more general. The basic idea of comparing two estimates,
one that is consistent both under the null and the alternative and one that is consistent
only under the null, can be applied in many econometric contexts. We will see the test
again in the context of fixed and random effects in panel data models.
To implement the test in Stata, first estimate the IV model:

ivreg y x1 (x2 = z1)

where y denotes the dependent variable, x1 denotes the exogenous variables included in the outcome equation, x2 denotes the endogenous variables included in the outcome equation and z1 denotes the excluded instruments.
Next do
hausman, save
This tells Stata to save the IV results for later comparison with the OLS results. Then
generate the OLS results by
regress y x1 x2
Finally, have Stata compare the OLS and IV results and do the Hausman test by using the
hausman command a second time.
The constant option tells Stata that the constant has the same interpretation in the two
models and so should be included in the test. The hausman command is very general
and also applies to circumstances where this is not the case, as in the comparison of
random effect and fixed effect models.
The sigmamore option tells Stata to use the same estimate of the covariance matrix for
both covariance matrices in the test – namely the covariance matrix from the efficient
estimator under the null. As noted in the Stata manual, this is the appropriate covariance
matrix for tests of exogeneity and over-identifying restrictions in IV models.
ivreg y x1 (x2 = z1)
estimates store ivname
regress y x1 x2
estimates store olsname
hausman ivname olsname, constant sigmamore
where “ivname” and “olsname” can be any names (other than Stata command names) that
you give to the saved IV and OLS estimates, respectively.
Motivation
In some cases, you may have more candidate instruments than you do endogenous
variables. In this case, because you only need as many instruments as there are
endogenous variables for identification there are over-identifying restrictions that can be
tested. These restrictions consist of the exogeneity of the “extra” instruments.
Intuition
The intuition behind the test is to compare 2SLS estimates obtained using all of the instruments with those obtained using just a necessary subset of the instruments. The original version of the test in Hausman does exactly this.
Formal test
First obtain the 2SLS estimates using all of the exogenous variables as instruments.
Using these estimates, generate the predicted outcome equation residuals. These
residuals are generated using the observed value of the endogenous variable, not its
predicted value from the first stage.
Now regress the predicted residuals on all of the exogenous variables, which means all of
the candidate instruments as well as the exogenous variables from the outcome equation.
Under the null that E(Z′U) = 0, NR² has a chi-squared distribution with Q1 degrees of freedom, where Q1 is the number of "extra" instruments. For example, if there is one endogenous variable and three candidate instruments, then Q1 = 3 − 1 = 2 .
Under the alternative hypothesis that not all of the instruments are valid, the test statistic
will be large. Thus, the null is rejected for appropriately large values of the test statistic.
The test statistic is large under the alternative because if one or more of the instruments is
invalid, they will account for some of the variance in the predicted residuals.
The key here is that at least one of the instruments must be valid so that it can be a test of
over-identifying restrictions and not of just-identifying restrictions.
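A sketch of the regression version in Stata, with one endogenous variable and three candidate instruments (names hypothetical):

* Regression version of the over-identification test (names hypothetical)
ivreg y x1 (x2 = z1 z2 z3)
predict uhat, residuals                        // built with the actual x2, not its fitted value
regress uhat x1 z1 z2 z3
display "NR2 = " e(N)*e(r2)
display "p-value = " chi2tail(2, e(N)*e(r2))   // Q1 = 3 - 1 = 2 degrees of freedom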
Doing the original Hausman version of the test in Stata uses the same general structure as
the exogeneity test.
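In outline, with z2 standing in for the "extra" instruments (a sketch):

ivreg y x1 (x2 = z1 z2)
estimates store iveffic
ivreg y x1 (x2 = z1)
estimates store ivconsist
hausman ivconsist iveffic, constant sigmamore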
where y denotes the dependent variable, x1 denotes the exogenous variables included in the outcome equation, x2 denotes the endogenous variables included in the outcome equation, z1 denotes the excluded instruments that are assumed to be valid under both the null and the alternative, z2 stands in for the "extra" instruments being tested, and where "ivconsist" and "iveffic" are just names given to the saved estimates from the consistent and efficient models, respectively.
There is also a command called overid available on the internet that does the
regression version of the test described above (and in the Wooldridge book) in one step,
immediately following the estimation of the ivreg command.
help overid
Note once again that the regression version of the test does not test the validity of one instrument conditional on the others; rather, it tests the joint null that all of the instruments are valid, leaving you to sort out which ones may be causing problems. As in the exogeneity test, the null is a bit of an odd choice.
Motivation
The recent literature on instrumental variables (and, indeed, the recent literature on
treatment effects in general) emphasizes the role of heterogeneous treatment effects.
The implicit baseline model here is a model of common treatment effects. That is, there
is a dummy variable for the treatment, and everyone has the same value for the
coefficient on this variable. If one is feeling ambitious, the dummy variable might be
interacted with various observed characteristics to estimate subgroup effects.
Similar issues arise with random coefficient models for continuous independent variables
of interest. The related literature is an active area of research.
In the case of IV, the key issue that has excited the most attention is the potential for
instruments that are uncorrelated with the outcome equation error term to nonetheless be
correlated with the person-specific component of the impact.
Yi = βXi + δi Di + εi .
This equation embodies heterogeneous treatment effects because the coefficient on the
treatment dummy has an “i” subscript.
Now let δi = δ + δi* ,

where we decompose the individual treatment effect into the mean treatment effect (conditional on treatment) and the individual deviation from the mean. Substituting into the outcome equation gives

Yi = βXi + δDi + (δi* Di + εi) ,
where (δ i* Di + ε i ) is the composite error term. For the untreated units, it consists only of
the usual error term, while for the treated units it consists of both the usual error term and
the idiosyncratic component of the impact.
In the common effect world, the instrument needs to satisfy

E(εi | Xi, Di, Zi) = 0 .
In contrast, the corresponding condition for the heterogeneous treatment effect world is
given by
E[(δi* Di + εi) | Xi, Di, Zi] = 0 .
This is a different and stronger condition that will not be fulfilled by many standard
instruments.
To see why consider a typical IV strategy in a treatment effects context. Here variation
in the cost of treatment gets used to generate exogenous variation in the probability of
treatment. If the component of costs (e.g. distance from the training center or college,
presence of a bus strike) that generates the variation is uncorrelated with the outcome
equation residual, then this constitutes a valid instrument in the common effect world.
However, it is easy to show that such instruments are not likely to be valid in the
heterogeneous treatment effects world, if agents know both their costs and their likely
impacts from the program.
To see this, suppose that participation is governed by the latent index

Di* = γ0 + γ1 δi + γ2 Zi + υi ,

where Di = 1 iff Di* > 0 and Di = 0 otherwise, and where Zi is a variable that measures the cost of participation.
In this case, units with high idiosyncratic impacts will participate even when their costs
are high, but units with low idiosyncratic impacts will participate only when their costs
are low. As a result (and this is the key point) conditional on D, the value of the
instrument is correlated with the unit-specific component of the impact. Put differently,
the standard model of participation – based on the model in Heckman and Robb (1985) –
implies that cost-based instruments are likely invalid in a heterogeneous treatment effects
world.
Put differently, in this case there are two kinds of selection, one on the outcome equation
unobserved component in the untreated state and one on the unobserved component of
the impact.
However, all is not lost. When the instrument is correlated with the unit-specific
component of the impact, but not with the original outcome equation error term, IV still
estimates an interpretable and (sometimes) useful parameter under some additional
assumptions. It is to this parameter that we now turn.
LATE
It is easiest to see what is going on with local average treatment effects if we stick to the
very simple case of a binary instrument, a binary treatment and no covariates.
Draw a 4 x 2 table with the four groups – always-takers (AT), compliers (C), defiers (DF) and never-takers (NT), defined by how treatment status responds to the instrument – as the rows and columns for the values of the binary instrument.
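Filling in treatment status by group gives:

Group                 D when Z = 0    D when Z = 1
Always-takers (AT)          1               1
Compliers (C)               0               1
Defiers (DF)                1               0
Never-takers (NT)           0               0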
The compliers are the key group here. They are the units that respond to the instrument
by taking the treatment when they otherwise would not. For example, if costs are the
instrument, these units do not participate when costs are high but do participate when
costs are low.
Now consider the formula for the Wald estimator once again.
∆IV = [E(Y | Z = 1) − E(Y | Z = 0)] / [Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)] .
Using the four groups, we can rewrite the terms in the numerator as:
E(Y | Z = 1) = E(Y1 | AT)Pr(AT) + E(Y1 | C)Pr(C) + E(Y0 | DF)Pr(DF) + E(Y0 | NT)Pr(NT)

and

E(Y | Z = 0) = E(Y1 | AT)Pr(AT) + E(Y0 | C)Pr(C) + E(Y1 | DF)Pr(DF) + E(Y0 | NT)Pr(NT) .
The denominator terms become
Pr(D = 1 | Z = 1) = Pr(D = 1 | Z = 1, AT)Pr(AT) + Pr(D = 1 | Z = 1, C)Pr(C) = Pr(AT) + Pr(C)

and

Pr(D = 1 | Z = 0) = Pr(D = 1 | Z = 0, AT)Pr(AT) + Pr(D = 1 | Z = 0, DF)Pr(DF) = Pr(AT) + Pr(DF) .
This is a bit of a mess, but still provides some insight. The IV estimator in this case is a
weighted average of the treatment effect on the compliers and the negative of the
treatment effect on the defiers.
First, we assume monotonicity. That is, we assume that the instrument can only increase
or decrease the probability of participation. Thus, if increasing the instrument increases
the probability of participation for some units, it does so for all. This assumption fits in
very well for cost-based instruments, which theory suggests should have such a
monotonic effect.
In terms of the notation, monotonicity means that Pr( DF ) = 0 . In words, it means that
there are no defiers.
The second assumption is that there are some compliers, which is to say that Pr(C ) > 0 .
The behavior of some individuals must be affected by the instrument.
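Under these two assumptions, the defier terms drop out of the numerator and denominator, and the always-taker and never-taker terms difference away, so the Wald formula collapses to

∆IV = [E(Y1 − Y0 | C) Pr(C)] / Pr(C) = E(Y1 − Y0 | C) .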
Thus, in a heterogeneous treatment effects world, where the instrument is correlated with
the impact and where monotonicity holds, the IV estimator gives the Local Average
Treatment Effect (LATE), defined as the mean impact of treatment on the compliers.
In statistics the LATE is called the CACE = Complier Average Causal Effect.
The LATE is a well-defined economic parameter. In some cases, as when the available
policy option consists of moving the instrument, it may be the parameter of greatest
interest. Thus, if we want to give a $500 tax credit for university attendance, the policy
parameter of interest is the impact of university on individuals who attend with the tax
credit but do not attend without it.
The LATE provides no information about the impact of treatment on the always-takers,
which could be large or small. If the mean impact of treatment on the treated is the
primary parameter of interest, this is a very important omission.
Discuss how this works in an experiment with dropouts and substitutes, using the JTPA
experiment numbers – 60 percent treated in the treatment group and 40 percent treated in
the control group. Make the point that the impact estimate when you do the Wald/Bloom
estimator is a LATE. It is the impact on the 20 percent who receive treatment because
they were randomly assigned into the treatment group rather than the control group.
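In numbers: the complier share is 0.60 − 0.40 = 0.20, so the Wald/Bloom estimate equals the treatment-control difference in mean outcomes divided by 0.2, i.e., five times the intention-to-treat effect.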
A simple example
Return now to our simple example of the training center and the two towns, but change the parameters as follows:

Change the impacts so that they vary across persons. In particular, assume that half of the persons in each town have an impact of 10 and half have an impact of 20. Assume that the impact of training is independent of what town you live in.
As before, Z = 1 denotes residence in the near town and Z = 0 denotes residence in the far town. The participation probabilities remain Pr(D = 1 | Z = 1) = 1.0 and Pr(D = 1 | Z = 0) = 0.5. The latter follows from the fact that in the far town only those persons with an impact of 20 will take the training, because only for them does the impact exceed the combined travel and tuition costs.
Thus,

E(Y | Z = 1) = Y0 + E(∆ | Z = 1, D = 1) Pr(D = 1 | Z = 1) = 100 + 15(1.0) = 115 .

Similarly,

E(Y | Z = 0) = Y0 + E(∆ | Z = 0, D = 1) Pr(D = 1 | Z = 0) = 100 + 20(0.5) = 110 .

Inserting the numbers from the example into the formula gives:
∆̂IV = (115 − 110) / (1.0 − 0.5) = 5 / 0.5 = 10 .
This is the same answer as before, but now it is a local average treatment effect. The
marginal group, the group that enters training when the instrument changes, consists of
persons with impacts of 10. They participate in the near town but not in the far town.
Note that the LATE here differs from both the average treatment effect (ATE) and the mean impact on the treated (TT). In particular,

∆ATE = E(∆ | Z = 1) Pr(Z = 1) + E(∆ | Z = 0) Pr(Z = 0)

and

∆TT = E(∆ | Z = 1, D = 1) Pr(Z = 1 | D = 1) + E(∆ | Z = 0, D = 1) Pr(Z = 0 | D = 1) .
That is, these parameters are weighted averages of the mean impacts in each town, where
the weights depend on the number of assumed participants in each town.
We have E (∆ | Z =1, D =1) =15 and E (∆ | Z =0, D =1) =20 due to the differences in
participation between the near and far town.
Thus,

∆ATE = (15)(0.5) + (15)(0.5) = 15 ,
and

∆TT = (15)(2/3) + (20)(1/3) ≈ 16.7 ,

where the weights reflect the 200 near-town participants and the 100 far-town participants.
You can bound the variance of the impacts (conditional on having solved the selection
problem) using the methods in e.g. Heckman, Smith and Clements (1997) Review of
Economic Studies and Djebbari and Smith (2008) Journal of Econometrics. It is also
possible to formally test the null of a zero variance.
Looking for variation in the average treatment effect across subgroups is also
informative. If mean impacts vary a lot with observed characteristics, this suggests that
they likely vary with unobserved characteristics as well.
The same issues arise in the more general contexts with a continuous endogenous
variable and either binary or continuous instruments. These contexts are beyond the
scope of this course, but the same general result is found. What is estimated is a
weighted average effect where the weights depend on the particular instrument used and
the effects for those units whose behavior is changed by the instrument.
See Angrist and Imbens (1995) JASA as well as the many recent papers on the "correlated random coefficient model" by Heckman and various co-authors.
In a world with heterogeneous treatment effects, the use of multiple instruments affects
the nature of the parameter being estimated if the instruments are associated with the
treatment effects.
Consider the case of two valid instruments, each of which is uncorrelated with the level
but correlated with the impact. Each instrument, used separately, produces an estimate of
a different LATE. Put differently, using each instrument estimates a different parameter.
Using both instruments together in 2SLS then produces a hybrid of the two LATEs.
Again, a detailed treatment is beyond the scope of this course, but it is important to be aware that when the world switches from homogeneous to heterogeneous treatment effects, using multiple instruments changes from a simple and obviously useful way to (perhaps) improve the efficiency of the estimator into a complicated decision regarding the nature and definition of the parameter being estimated.
When different instruments estimate different LATEs, this also complicates the
interpretation of Durbin-Wu-Hausman tests. Getting a different estimate when using two
instruments rather than one of the two now no longer necessarily signals that the marginal
instrument is invalid. It may signal instead that the two instruments yield different
LATEs. See the recent Heckman, Schmierer and Urzua (2010) Journal of Econometrics
paper on this point.
Theory

Theory can be used to come up with a variable that may affect participation but not outcomes other than through participation. Think about which variables might affect treatment choice but not have a direct effect on outcomes in the context of a formal or informal theory of the phenomenon being analyzed. Then find a data set, or collect some data, that embody these variables along with the outcomes and other exogenous covariates. In many contexts such variables will represent aspects of the costs of being treated that are more or less exogenously assigned.
Examples include:
BC disability pension
Random assignment to judges of varying severity
Assignment to different caseworkers within a job training center
Variation in the budget across local offices of a job training program within a local labor
market as in Frölich and Lechner (2004)
Nature
The natural world sometimes creates exogenous variation. Thus, learning about the
natural world can have a payoff in economics.
Examples include:
Deliberate creation