Chapter 1: Endogeneity in Single-Equation Linear Models
1.1 Introduction
So far in our models we have assumed that the set of explanatory variables, X, is exogenous. By exogeneity we mean, conceptually, that all variables in X are generated independently of the dependent variable y. For example, consider that we are trying to estimate the amount of money donated to charity by a household. Such an amount depends on the marginal tax rate that the individual is paying, since if the individual itemizes her deductions she has an incentive to give in order to pay lower taxes. But it is also clear that the marginal tax rate depends on the amount donated, because a large enough gift will bring the household to a lower tax bracket. Money donations and the marginal tax rate are thus generated in the same process, and the marginal tax rate is an endogenous variable if used as an independent variable in a model to explain money donations.
The consequence of all explanatory variables being exogenous is that the errors are independent of, and thus unrelated to, the regressors, i.e. E[ε_i | x_i] = 0. In the presence of endogenous regressors, the regressors provide information about the expected value of the errors, i.e. E[ε_i | x_i] = η_i. This implies that the errors and the regressors are correlated, i.e. E[x_i ε_i] = γ, where γ ≠ 0.
In the presence of endogenous regressors, the OLS estimator is biased and inconsistent, so the OLS estimate β̂_j no longer measures the marginal effect on y of the j-th regressor, x_j. The expected value of the OLS estimator is

E[β̂ | X] = β + (X′X)⁻¹X′η.    (1.1)
To understand this last point better, consider once again that we are trying to estimate how much money households give to charity. This amount is related to the level of social awareness of the individuals in the household. Unfortunately such a variable is not observable, so its effect is collected in the error, ε_i. The problem is that the level of social awareness is normally correlated with age, level of education, and other demographic variables that you would include as exogenous regressors in the model, but that would actually turn out to be endogenous. These regressors would be correlated with the errors, and the estimated coefficients on those regressors would measure not only the marginal effect of the regressors but also some of the effect caused by the unobservable level of social awareness, making the OLS estimates biased.
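A quick way to see this bias is through a small simulation. The following Stata sketch is purely illustrative: the data, variable names, and coefficients are hypothetical, not from the charity example.

    * u plays the role of the unobserved factor (e.g., social awareness).
    clear
    set seed 12345
    set obs 1000
    generate u = rnormal()                 // unobserved factor, absorbed into the error
    generate x = 0.8*u + rnormal()         // regressor correlated with the error
    generate y = 1 + 2*x + u + rnormal()   // the true marginal effect of x is 2
    regress y x                            // the estimated slope is biased away from 2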
The standard way to address the problem that endogeneity presents is through the use of instrumental variables (IV) estimators. Instrumental variables are a set of variables, Z, that satisfy the following two properties:

1. Exogeneity: the instruments are unrelated to the errors, i.e. E[z_i ε_i] = 0.

2. Relevance: the instruments are correlated with the endogenous regressors.
The difficulty is in finding actual valid instruments, which are sometimes very hard to come by. In the first example of an endogenous variable we considered, the marginal tax rate, a common approach is to use the “first dollar rate”, which is the marginal tax rate that the individual would face if she did not give any money to charity, and thus would not itemize her deductions. You can see how the second example presents a harder case, since it is difficult to find a variable that is correlated with social awareness, is not already included in the model, and could thus be used as an instrument for social awareness.
Before we consider the IV estimators we can use, let us set up the problem we are going to be considering. The model we want to estimate is

y = Xβ + ε.    (1.2)

Equation (1.2) is called the structural equation. Let X = (X1 X2), where X1 is the set of endogenous variables and X2 is the set of exogenous variables. We also assume that there is a matrix of exogenous variables Z = (Z1 X2) that are relevant.
1.2.1 Identification
Depending on the dimension of Z relative to X, i.e. how many instruments there are compared to regressors, we have three cases of identification.
If dim(Z) < dim(X), i.e. there are fewer instrumental variables than endogenous regressors, we have the not-identified, or underidentified, case. In this case there is no consistent IV estimator. This case often happens in practice, and there is very little you can do about it.
When dim(Z) = dim(X), i.e. the number of instruments equals the number of endogenous regressors, we have the just-identified case. In this case we use the IV estimator (next section).
Finally, when dim(Z) > dim(X), i.e. there are more instruments than endogenous regressors, we have the overidentified case. In this case we consider the two-stage least squares (2SLS) estimator.¹

¹ Cameron and Trivedi (2009, ch. 6) also covers the generalized method of moments (GMM) estimator. It does not, however, provide the estimate of the asymptotic variance with homoskedastic errors.
1.3 The IV Estimator

The IV estimator is used for the just-identified case, where dim(Z) = dim(X). Let us consider the exogeneity assumption E[z_i ε_i] = 0. From the structural equation (1.2), ε_i = y_i − x_i′β, so the exogeneity condition becomes E[z_i (y_i − x_i′β)] = 0. The sample analogue of this condition is (1/n) Z′(y − Xβ̂) = 0, and solving it for β̂ gives the IV estimator
β̂_IV = (Z′X)⁻¹Z′y.    (1.3)

Substituting the structural equation (1.2) for y in equation (1.3), the estimator can also be written as

β̂_IV = β + (Z′X)⁻¹Z′ε.    (1.4)
Equation (1.4) shows that the IV estimator is an unbiased estimator of the population's vector of coefficients. We now consider the variance of the estimator.
V[β̂_IV | Z] = E[(β̂_IV − β)(β̂_IV − β)′ | Z]
            = E[(Z′X)⁻¹Z′εε′Z(X′Z)⁻¹ | Z]
            = (Z′X)⁻¹Z′ E[εε′ | Z] Z(X′Z)⁻¹
            = σ²(Z′X)⁻¹Z′Z(X′Z)⁻¹.    (1.5)
The last equality uses the assumption of homoskedastic errors,³ that is, E[εε′ | Z] = σ²I. Letting ε̂ = y − Xβ̂_IV and σ̂² = ε̂′ε̂/n, equation (1.5) can be estimated by replacing σ² with σ̂².²

² See Greene (2012, pp. 226–227).
³ We will cover heteroskedasticity in the next chapter. Cameron and Trivedi (2009, p. 176) presents the robust estimates of the asymptotic variance for the different estimators.
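Before moving to the overidentified case, here is a minimal Stata/Mata sketch of equations (1.3) and (1.5). The variable names are taken from the assignment in Section 1.7 and are used purely for illustration; in practice you would rely on ivregress (Section 1.8) rather than this hand computation.

    use Endogeneity.dta, clear
    mata:
        y = st_data(., "ldrugexp")
        X = (st_data(., ("hi_empunion", "totchr", "age", "female", "blhisp", "linc")), J(st_nobs(), 1, 1))
        Z = (st_data(., ("ssiratio",    "totchr", "age", "female", "blhisp", "linc")), J(st_nobs(), 1, 1))
        b_iv = luinv(Z'X) * Z'y                       // equation (1.3)
        e    = y - X*b_iv                             // residuals
        s2   = (e'e) / rows(y)                        // sigma-hat squared
        V    = s2 * luinv(Z'X) * (Z'Z) * luinv(X'Z)   // equation (1.5) with sigma-hat squared
        b_iv, sqrt(diagonal(V))                       // coefficients and standard errors
    end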
1.4 2SLS

Consider now the case where we have more instruments than endogenous variables, so dim(Z) > dim(X). Let Z have L variables and X have K variables, where L > K. It is clear that β̂_IV is no longer valid because Z′X is now L × K, and since L > K, this matrix is not invertible. What can we do in this situation? One solution is to drop L − K variables from Z and use the IV estimator we covered in the previous section.
Even though there may be ways of making the elimination of those “extra” variables non-arbitrary, the 2SLS estimator is much more efficient. The estimator is

β̂_2SLS = [X′Z(Z′Z)⁻¹Z′X]⁻¹ X′Z(Z′Z)⁻¹Z′y.    (1.7)
As was the case with the IV estimator, the 2SLS is also an unbiased estimator of the
population’s vector of coefficients. To consider the variance of the 2SLS estimator, let
Q_Z = Z′Z, Q_ZX = Z′X, and Q_XZ = X′Z. The variance of the estimator can then be expressed as

V[β̂_2SLS | Z] = E[(β̂_2SLS − β)(β̂_2SLS − β)′ | Z]
             = E[(Q_XZ Q_Z⁻¹ Q_ZX)⁻¹ Q_XZ Q_Z⁻¹ Z′εε′Z Q_Z⁻¹ Q_ZX (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹ | Z]
             = (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹ Q_XZ Q_Z⁻¹ Z′ E[εε′ | Z] Z Q_Z⁻¹ Q_ZX (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹
             = σ² (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹ Q_XZ Q_Z⁻¹ Q_Z Q_Z⁻¹ Q_ZX (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹
             = σ² (Q_XZ Q_Z⁻¹ Q_ZX)⁻¹.
Substituting back the values for the different Q matrices, we have that the variance of
the 2SLS estimator is
V[β̂_2SLS | Z] = σ² [X′Z(Z′Z)⁻¹Z′X]⁻¹.    (1.9)
The estimate of the errors is now ε̂ = y − Xβ̂_2SLS. Letting, once more, σ̂² = ε̂′ε̂/n, the estimate of equation (1.9) is

V̂[β̂_2SLS] = σ̂² [X′Z(Z′Z)⁻¹Z′X]⁻¹.    (1.10)
It can easily be shown that equations (1.7) and (1.9) simplify to equations (1.3) and
(1.5), respectively, when L = K, i.e. when Z and X have the same number of variables:
the just-identified case. You should check this on your own. This means that the 2SLS
estimator equals the IV estimator in the just-identified case.
The 2SLS estimator is called that way because it can be implemented as two successive OLS estimations, one in each step. The process is the following (a Stata sketch is given after the list):

1. Regress each variable in X on Z by OLS and keep the fitted values, collected in X̂ = Z(Z′Z)⁻¹Z′X.

2. Regress y on X̂ by OLS; the resulting coefficient vector is the 2SLS estimator.
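A minimal Stata sketch of the two steps, with hypothetical variable names (y: dependent variable, x1: endogenous regressor, w1 and w2: exogenous regressors, z1 and z2: instruments). Note that the standard errors of the manual second stage are not correct because they ignore that x1hat is itself estimated, so in practice you should use ivregress 2sls (see Section 1.8).

    regress x1 z1 z2 w1 w2   // stage 1: OLS of the endogenous variable on Z
    predict x1hat, xb        // fitted values of the endogenous variable
    regress y x1hat w1 w2    // stage 2: OLS of y on the fitted values and the exogenous regressors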
Let us check that the formula for the resulting estimator is the same as that in equation (1.7). As Greene (2012, p. 231) says, “if any column of X also appears in Z, then that column of X is reproduced exactly in X̂.” This means that you can think of X̂ as the prediction of a system of OLS estimations of X on Z. Let β̂_1S be the OLS estimate of such a system. Since it is an OLS regression, β̂_1S = (Z′Z)⁻¹Z′X, and X̂ = Zβ̂_1S = Z(Z′Z)⁻¹Z′X.
This means that the OLS estimate of the second stage is β̂_2S = (X̂′X̂)⁻¹X̂′y. Note that X̂′ = X′Z(Z′Z)⁻¹Z′, and X̂′X̂ = X′Z(Z′Z)⁻¹Z′X after some simplification.
Therefore, the second-stage OLS estimate is β̂_2S = [X′Z(Z′Z)⁻¹Z′X]⁻¹ X′Z(Z′Z)⁻¹Z′y, which is the same as equation (1.7).
1.5 Specification Tests

So far we have considered how OLS estimation with endogenous regressors is biased, and the estimators that can be used depending on how the model is identified. In this section we consider two specification tests. The first one tests for endogeneity of the regressors, and the second one tests the validity of the instruments in the overidentified case.
The first test, a Hausman test of endogeneity, compares the OLS and IV estimates. To understand this test, notice that in the presence of endogenous regressors the OLS coefficient estimator β̂ is inconsistent, while the IV and 2SLS estimators are consistent. However, if there are no endogenous regressors, all estimators are consistent, but the OLS estimator is much more efficient, i.e. it has lower standard errors. In this presentation I will refer to the IV estimator β̂_IV. The exposition and implementation are also valid for the 2SLS estimator, so you can think of β̂_IV as representing both the IV and the 2SLS estimators.
Under the null hypothesis that the regressors are exogenous, the test statistic is

H = (β̂ − β̂_IV)′ V̂[β̂_IV − β̂]⁻¹ (β̂ − β̂_IV).    (1.11)

The test statistic in equation (1.11) can be simplified. First, notice that

V̂[β̂_IV − β̂] = V̂[β̂_IV] + V̂[β̂] − 2Ĉ[β̂, β̂_IV],    (1.12)

where Ĉ[·] represents the estimate of the asymptotic covariance between the elements in the brackets.
Second, and paraphrasing Greene (2012, p. 235), “the covariance between an efficient estimator, β̂_E, of a parameter vector, β, and its difference from an inefficient estimator, β̂_I, of the same parameter vector, β̂_E − β̂_I, is zero.” This is saying that

Ĉ[β̂_E, β̂_E − β̂_I] = V̂[β̂_E] − Ĉ[β̂_E, β̂_I] = 0.    (1.13)
For our case under the null hypothesis, the efficient estimator is the OLS estimator, β̂_E = β̂, and the inefficient estimator is the IV estimator, β̂_I = β̂_IV. So applying equation (1.13) in equation (1.12), we have that

V̂[β̂_IV − β̂] = V̂[β̂_IV] − V̂[β̂].    (1.14)
Finally, using equation (1.14) in equation (1.11), we have the test statistic
H = (β̂ − β̂_IV)′ {V̂[β̂_IV] − V̂[β̂]}⁻¹ (β̂ − β̂_IV).    (1.15)
This test statistic is distributed χ²(J), where J is the number of endogenous variables you are testing for. To understand why, notice that in OLS we are assuming that there is no correlation between the explanatory variables and the errors. If there are J variables that are correlated with the errors, i.e. endogenous, OLS would in fact be placing restrictions on J correlations, forcing them to be zero.
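In Stata, the contrast in equation (1.15) can be computed with the hausman command. The sketch below uses the assignment's variable names purely for illustration:

    ivregress 2sls ldrugexp (hi_empunion = ssiratio) totchr age female blhisp linc
    estimates store iv
    regress ldrugexp hi_empunion totchr age female blhisp linc
    estimates store ols
    hausman iv ols, sigmamore     // chi2(J) statistic; sigmamore bases both variance
                                  // estimates on the efficient (OLS) estimator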
The test for endogeneity can be implemented as a two-step, regression-based Durbin-Wu-Hausman test. The first step is the same as in the 2SLS two-step process, i.e. run an OLS regression of each endogenous variable on the exogenous variables, Z, and estimate the errors. Group the estimated errors in a matrix Ê, and perform another OLS regression of the following augmented model:

y = Xβ + Êρ + ν.    (1.16)
Having estimated the augmented model by OLS, we test for the joint significance of ρ, i.e. whether the restrictions ρ = 0 hold. Notice that the number of elements of ρ is the number of variables whose endogeneity we are testing. Rejecting the null hypothesis means that at least one of the variables is endogenous, but it will not tell you which one. You can, of course, perform individual Wald tests of significance for the different ρ_j in ρ to see whether a particular variable is endogenous or not.
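A minimal sketch of this regression-based implementation in Stata, again with the assignment's variable names (one endogenous regressor and one instrument assumed):

    regress hi_empunion ssiratio totchr age female blhisp linc   // first step: endogenous variable on Z
    predict vhat, residuals                                      // estimated errors (a column of E-hat)
    regress ldrugexp hi_empunion totchr age female blhisp linc vhat
    test vhat                     // Wald test of rho = 0; rejection points to endogeneity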
The second test is a test of the overidentifying restrictions; its purpose is to see whether the instruments are valid. In the just-identified case it is not possible to test for the validity of the instrument, only whether there is endogeneity. In the overidentified case we can test whether the additional (L − K) instruments are valid. The idea behind the test is a comparison between the 2SLS estimator and the IV estimator of the just-identified version of the model. If all the instruments are valid, both estimates should only differ because of sampling error.
The null hypothesis of the test is that the orthogonality condition E [zi εi ] = 0 is true.
If the null cannot be rejected then the additional instruments are valid. If we can reject
the null hypothesis we can accept the alternative that at least one of the instruments is
not valid. Notice that the test does not tell us which is invalid, only that at least one is
invalid.
The test is based on the sample moment m̄ = (1/n) Σᵢ z_i ε̂_2SLS,i = (1/n) Z′ε̂_2SLS, which should be close to zero if the orthogonality condition holds. What we need, then, is an estimate of its variance. Under the assumptions of the model,

V[m̄] = (σ²/n²) Z′Z.    (1.18)

This can easily be estimated using a sample estimator of σ².
Greene (2012, p. 238) mentions that a more favored estimate of the variance is
V̂[m̄] = (1/n²) Σᵢ ε̂²_2SLS,i z_i z_i′.    (1.19)
The test for overidentification, under the assumption of homoskedastic errors, has a very easy implementation. First run the 2SLS model and keep the residuals, v̂. Then regress v̂ on Z; the test statistic is n × R², and it follows a χ² distribution with L − K degrees of freedom.
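A sketch of this nR² implementation in Stata, assuming the assignment's four instruments for one endogenous regressor (so L − K = 3 overidentifying restrictions):

    ivregress 2sls ldrugexp (hi_empunion = ssiratio lowincome multlc firmsz) totchr age female blhisp linc
    predict uhat, residuals
    regress uhat ssiratio lowincome multlc firmsz totchr age female blhisp linc
    display "nR2 = " e(N)*e(r2) ", p-value = " chi2tail(3, e(N)*e(r2))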
1.6 Weak Instruments

So far we have concentrated on testing the identification condition, i.e. whether the instruments are exogenous. However, for the instruments to be valid they must also be relevant, i.e. correlated with the endogenous variables. The relevance assumption implies that

plim (1/n) Z′X = Q_ZX, a finite, nonzero, L × K matrix with rank K.    (1.20)

Conceptually, given that the instruments are exogenous, if the condition in (1.20) is satisfied then the IV and 2SLS estimators are consistent. While this is true, consider the case of weak instruments, where (1/n)Z′X is close to zero (the null matrix). In principle the problem with weak instruments is that they produce a large asymptotic variance, so we lose precision. However, Nelson and Startz (1990b,a) and Hahn and Hausman (2003) list two implications:

1. the 2SLS estimator is badly biased toward the OLS estimator, which is known to be inconsistent; and

2. the standard first-order asymptotics will not give an accurate framework for statistical inference.

The problem of weak instruments therefore goes beyond a lack of precision. Bound et al. (1995) present some evidence that this problem extends beyond “small sample problems.”
In order to see whether we may be dealing with weak instruments, there are several things we can do. The simplest is to look at the correlations of the endogenous regressors with the instruments. This allows us to see which instruments have low correlations and may be trouble.
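For instance, with the assignment's variables this is a one-line check (illustrative):

    correlate hi_empunion ssiratio lowincome multlc firmsz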
When there is just one endogenous variable, a common check is based on the F statistic of the Wald test of joint significance of the instruments in the first-stage regression of 2SLS. The rule of thumb is that an F statistic of less than 10 indicates weak instruments.
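A sketch of this rule-of-thumb check in Stata, with the assignment's variable names (one endogenous regressor assumed):

    regress hi_empunion ssiratio lowincome multlc firmsz totchr age female blhisp linc   // first stage
    test ssiratio lowincome multlc firmsz   // first-stage F; a value below about 10 suggests weak instruments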
When we have several endogenous variables, checking each one of them separately with this method is not sufficient, because it does not account for the collinearity among the endogenous variables. Shea (1997) developed a partial R²_k statistic, where k indexes the endogenous regressor, for which Godfrey (1999) gives a simplified expression. The corresponding F statistic can be calculated as F = [R²_k/(L − 1)]/[(1 − R²_k)/(n − L)], assuming that Z has a constant.⁴
Stock and Yogo (2005) present two tests using the same statistic. The test statistic
depends on whether we have just one endogenous variable or more. If there is only
one endogenous variable, the test statistic is the F statistic from the Wald test of joint
significance of the instruments in the first stage regression we mentioned before. With
more than one endogenous variable, since there is more than one first-stage regression,
the test statistic is the minimum eigenvalue of a matrix analog of the F statistic that
is defined in Stock and Yogo (2005, p. 84) (the Cragg-Donald statistic from Cragg and
Donald (1993)).
⁴ See Greene (2012, p. 250).
The first test they propose asks whether the instruments are weak in the sense that the estimate will be highly biased. The critical values for this test are provided in Stock and Yogo (2005, Table 5.1, p. 100), and they depend on the number of endogenous variables, n, the number of exclusion restrictions (I understand this as the number of instruments used), K2, and the bias toleration you want with respect to the OLS estimator. For example, for a model with one endogenous variable (n = 1), three instruments (K2 = 3), and a 5% bias toleration over OLS, the critical value would be 13.91. We would then reject the null hypothesis that the instruments are weak because of the bias when the statistic discussed before (since n = 1, in this case it would be the F statistic) is larger than 13.91. Unfortunately the tables only provide critical values when the model has two or more overidentifying restrictions, i.e. L − K ≥ 2.
The second test they propose is related to how weak instruments can distort the size of Wald tests of significance on the parameters in finite samples. The critical values for this test are provided in Stock and Yogo (2005, Table 5.2, p. 101). This table gives critical values for K2 ≥ 1, so you can perform the test for both the just-identified and overidentified cases. The critical values are based on a level of significance of 5% for the Wald tests. We need to select how much distortion we want to tolerate on the Wald tests in order to set r. If we want a distortion of 5%, since the Wald test is assumed to be performed at the 5% level of significance, r = 5% + 5% = 10%. If we are willing to accept a distortion of 10%, then r = 5% + 10% = 15%. The other two numbers we need are K2 and n, which represent the same quantities as in the first test. So, for example, if we are willing to tolerate a distortion of 5%, and we have 1 endogenous variable and 2 instruments, the critical value would be 19.93. We would reject the null that the instruments are weak (in the sense that they can distort the size of the Wald tests of significance) if the test statistic is greater than 19.93.
1.7 Assignment
The dataset we are going to use for this assignment is the one used in Cameron and
Trivedi (2009, Ch. 6), an extract from the Medical Expenditure Panel Survey (MEPS)
of individuals 65 years and older who qualify for health care under the US Medicare
program. Medicare does not cover all medical expenses. For example, co-payments for
medical services and expenses of prescribed pharmaceutical drugs were not covered for
the time period studied here. About half of the eligible individuals therefore purchase supplementary insurance in the private market that provides coverage against various out-of-pocket expenses.
The files containing the data for this assignment are Endogeneity.dta (Stata format) and Endogeneity.txt (a comma-separated values text file with the names of the variables in the first row). Table 1.1 presents the description of the variables in the dataset.
For this assignment we want to estimate a model that explains the expenditure on prescription drugs of the individuals. Since the variable is highly skewed to the right even for non-zero values (you should check this), we are going to use the natural log of the variable for the model (ldrugexp). The explanatory variables are going to be hi_empunion, totchr, age, female, blhisp, and linc. The following are the tasks you are responsible for:
1. So you can compare to later estimates, run the OLS estimation of the model.
2. The variable hi_empunion is suspected to be endogenous. Calculate the sample correlation between hi_empunion and ssiratio. Does ssiratio seem to be a relevant instrument? Estimate an IV model using ssiratio as the instrument. Compare the values of the coefficients, standard errors, and 95% confidence intervals of the coefficients between this estimation and the OLS estimation you did in part 1. Test for endogeneity using the Durbin-Wu-Hausman implementation of the test. Does the test confirm that hi_empunion is endogenous? Perform the Stock and Yogo (2005) test for weak instruments (remember we have only one endogenous variable, so the test statistic here is the F statistic of the Wald test of joint significance of the instruments in the first-stage regression). Should we be concerned about using ssiratio as an instrument with respect to size (remember that we cannot test the bias in the just-identified case)?
3. The variables lowincome, multlc, and firmsz can also be used as instruments. Estimate the sample correlation between hi_empunion and each of these variables. Do these variables seem to be relevant instruments? Estimate a 2SLS model using all available instruments (including ssiratio and the ones mentioned here). Compare the values of the coefficients, standard errors, and 95% confidence intervals for the coefficients between this estimation and the OLS and IV estimations you did in parts 1 and 2, respectively. Test for endogeneity using the Durbin-Wu-Hausman implementation. Does hi_empunion seem to be endogenous? Test for the validity of the overidentifying instruments using the implementation of the test discussed here. Are the additional instruments valid? Finally, perform the Stock and Yogo (2005) test for weak instruments (remember we have only one endogenous variable, so the test statistic here is the F statistic of the Wald test of joint significance of the instruments in the first-stage regression). Should we be concerned about using ssiratio as an instrument with respect to bias and/or size?
As a general rule for this and all assignments, before doing any estimation requested
in the list above, you should describe the variables you are going to use in the analysis, and what type of relationship you expect a priori between the dependent variable and the independent variables. The writeup should have the following format:
a. Introduction Introduce the analysis you are going to perform. What’s the point of
the model you want the estimates for? What potential (yes, they are potential at
this stage) issues may the data have, and what are you going to do about them?
b. Data description Summary of the different variables to be used in the analysis. Analysis of the relationship between the dependent variable and each of the independent variables (in this case, of the structural model).
c. Estimations Present the estimations you have performed, and explain the results.
What coefficients are significant? What tests have you performed? Why have you
performed them? What do they show?
d. Conclusions Present the conclusions of your estimations here. Remember that even though the assignment may be helping you learn about a certain topic, the purpose of a model estimation is always the same: describing the dependent variable and making inferences about the population coefficients!
1.8 Stata Commands

The Stata command for performing linear IV estimations is ivregress. This command also allows for estimation through a generalized method of moments (GMM) estimator and a limited-information maximum likelihood estimator. For the purposes of the course, we will use the 2SLS estimator. Notice that for a just-identified case the 2SLS estimator is identical to the IV estimator, so in Stata we also use 2SLS for the just-identified case.
The tests of endogeneity and overidentification are easily performed after estimation of the model with the estat endogenous and estat overid commands, respectively. Cameron and Trivedi (2009, Section 6.3, p. 177) present a worked-out example of how to perform the estimations and tests in Stata. The example there corrects the estimates of the standard errors for heteroskedasticity, but since that is the topic of the next chapter, for now you can mimic what they do without adjusting for heteroskedasticity.
The post-estimation command estat firststage provides the F statistic for the joint significance of the instruments used in the first-stage regression and Shea (1997)'s partial R²_k statistics, as well as the minimum eigenvalue test statistic together with the critical values for both of the tests proposed by Stock and Yogo (2005).
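A worked sketch of the full sequence for the assignment's model; the exact specification is illustrative, and the file and variable names are those given in Section 1.7:

    use Endogeneity.dta, clear
    ivregress 2sls ldrugexp (hi_empunion = ssiratio lowincome multlc firmsz) totchr age female blhisp linc
    estat endogenous      // Durbin-Wu-Hausman tests of endogeneity
    estat overid          // test of the overidentifying restrictions
    estat firststage      // first-stage F, Shea partial R2, and Stock-Yogo critical values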
Bibliography
Bound, John, David A. Jaeger, and Regina M. Baker, “Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable is Weak,” Journal of the American Statistical Association, June 1995, 90 (430), 443–450.
Godfrey, Leslie G., “Instrument Relevance in Multivariate Linear Models,” The Re-
view of Economics and Statistics, August 1999, 81 (3), 550–552.
Greene, William H., Econometric Analysis, 7 ed., Upper Saddle River, NJ USA:
Prentice Hall, 2012.
Hahn, Jinyong and Jerry Hausman, “Weak Instruments: Diagnosis and Cures in
Empirical Econometrics,” The American Economic Review, May 2003, 93 (2), 118–
125.
Nelson, Charles R. and Richard Startz, “Some Further Results on the Exact Small Sample Properties of the Instrumental Variable Estimator,” Econometrica, July 1990, 58 (4), 967–976.
Stock, James H. and Motohiro Yogo, “Testing for Weak Instruments in Linear IV
Regression,” in Donald W. K. Andrews and James H. Stock, eds., Identification and In-
ference for Econometric Models: Essays in Honor of Thomas Rothenberg, Cambridge:
Cambridge University Press, 2005, pp. 80–108.