Omitted Variable Bias C-T 4.7
[Figure 4.1. Quantile regression: fitted lines for the 10th percentile, median, and 90th percentile, plotted against the actual data.]
in estimated slopes as q increases, as is evident in Figure 4.1. Koenker and Bassett (1982)
developed quantile regression as a means to test for heteroskedastic errors when the
dgp is the linear model. For such a case a fanning out of the quantile regression lines
is interpreted as evidence of heteroskedasticity. Another interpretation is that the con-
ditional mean is nonlinear in x with increasing slope and this leads to quantile slope
coefficients that increase with quantile q.
More detailed illustrations of quantile regression are given in Buchinsky (1994) and
Koenker and Hallock (2001).
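The fanning out of quantile regression lines under a heteroskedastic linear dgp can be reproduced in a short sketch. This is not code from the text: the dgp, the seed, and the function name `quantreg` are illustrative assumptions. It uses the standard linear-programming formulation of the asymmetric absolute (pinball) loss minimization.

```python
import numpy as np
from scipy.optimize import linprog

def quantreg(X, y, q):
    """Quantile regression via the standard LP formulation:
    minimize q*u_plus + (1-q)*u_minus  s.t.  X b + u_plus - u_minus = y."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), q * np.ones(n), (1 - q) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(6, 12, n)                              # x range as in Figure 4.1
y = 1.0 + 0.5 * x + 0.2 * x * rng.standard_normal(n)   # heteroskedastic linear dgp
X = np.column_stack([np.ones(n), x])

# Estimated slopes rise with q: the quantile regression lines fan out
slopes = {q: quantreg(X, y, q)[1] for q in (0.1, 0.5, 0.9)}
print(slopes)
```

Under this dgp the qth conditional quantile of y has slope 0.5 + 0.2 z_q, where z_q is the standard normal quantile, so the estimated slopes increase with q even though the conditional mean is linear.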
The term “model misspecification” in its broadest sense means that one or more of the
assumptions made on the data generating process are incorrect. Misspecifications may
occur individually or in combination, but analysis is simpler if only the consequences
of a single misspecification are considered.
In the following discussion we emphasize misspecifications that lead to inconsis-
tency of the least-squares estimator and loss of identifiability of parameters of inter-
est. The least-squares estimator may nonetheless continue to have a meaningful inter-
pretation, only one different from that intended under the assumption of a correctly
specified model. Specifically, the estimator may converge asymptotically to a param-
eter that differs from the true population value, a concept defined in Section 4.7.5 as
the pseudo-true value.
The issues raised here for consistency of OLS are relevant to other estimators in
other models. Consistency can then require stronger assumptions than those needed
4.7. MODEL MISSPECIFICATION
is erroneously specified. The question is whether the OLS estimator can be given any
meaningful interpretation, even though the dgp is in fact nonlinear.
The usual way to interpret regression coefficients is through the true micro relation-
ship, which here is
E[yi |xi ] = g(xi ).
In this case β̂_OLS does not measure the micro response of E[yi |xi] to a change in xi, as it does not converge to ∂g(xi)/∂xi. So the usual interpretation of β̂_OLS is not possible.
White (1980b) showed that the OLS estimator converges to the value of β that minimizes the mean-squared prediction error Ex[(g(x) − x′β)²].
LINEAR MODELS
Hence prediction from OLS gives the best linear predictor of the nonlinear regression function when mean-squared error is the loss function. This useful property has already been noted in Section 4.2.3, but it adds little to the interpretation of β̂_OLS.
In summary, if the true regression function is nonlinear, OLS is not useful for indi-
vidual prediction. OLS can still be useful for prediction of aggregate changes, giving
the sample average change in E[y|x] due to change in x (see Stoker, 1982). However,
microeconometric analyses usually seek models that are meaningful at the individual
level.
Much of this book presents alternatives to the linear model that are more likely to
be correctly specified. For example, Chapter 14 on binary outcomes presents model
specifications that ensure that predicted probabilities are restricted to lie between 0
and 1. Also, models and methods that rely on minimal distributional assumptions are
preferred because there is then less scope for misspecification.
4.7.3. Endogeneity
Endogeneity is formally defined in Section 2.3. A broad definition is that a regressor
is endogenous when it is correlated with the error term. If any one regressor is endogenous then in general OLS estimates of all regression parameters are inconsistent (unless the exogenous regressors are uncorrelated with the endogenous regressor).
Leading examples of endogeneity, dealt with extensively in this book in both linear
and nonlinear model settings, include simultaneous equations bias (Section 2.4), omit-
ted variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measure-
ment error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section
observational data are used, and economists are very concerned with this complication.
A quite general approach to control for endogeneity is the instrumental variables
method, presented in Sections 4.8 and 4.9 and in Sections 6.4 and 6.5. This method
cannot always be applied, however, as necessary instruments may not be available.
Other methods to control for endogeneity, reviewed in Section 2.8, include con-
trol for confounding variables, differences in differences if repeated cross-section or
panel data are available (see Chapter 21), fixed effects if panel data are available and
endogeneity arises owing to a time-invariant omitted variable (see Section 21.6), and
regression-discontinuity design (see Section 25.6).
4.7.4. Omitted Variables Bias

Suppose the dgp is

y = x′β + zα + v, (4.38)
where x and z are regressors, with z a scalar regressor for simplicity, and v is an error
term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of
y on x and z will yield consistent parameter estimates of β and α.
Suppose instead that y is regressed on x alone, with z omitted owing to unavailability. Then the term zα is moved into the error term. The estimated model is

y = x′β + u, (4.39)

where the error term u = (zα + v). As before v is uncorrelated with x, but if z is correlated with x then the error term (zα + v) will be correlated with the regressors x, and the OLS estimator will be inconsistent for β.
There is enough structure in this example to determine the direction of the inconsis-
tency. Stacking all observations in an obvious manner gives the dgp y = Xβ + zα + v.
Substituting this into β̂_OLS = (X′X)−1X′y yields

β̂_OLS = β + (N−1X′X)−1(N−1X′z)α + (N−1X′X)−1(N−1X′v).
Under the usual assumption that X is uncorrelated with v, the final term has probability
limit zero. X is correlated with z, however, and
plim β̂_OLS = β + δα, (4.40)
where
δ = plim (N−1X′X)−1N−1X′z
is the probability limit of the OLS estimator in regression of the omitted regressor (z)
on the included regressors (X).
This inconsistency is called omitted variables bias, where common terminology
states that various misspecifications lead to bias even though formally they lead to
inconsistency. The inconsistency exists as long as δ ≠ 0, that is, as long as the omitted
variable is correlated with the included regressors. In general the inconsistency could
be positive or negative and could even lead to a sign reversal of the OLS coefficient.
For the returns to schooling example, the correlation between schooling and ability
is expected to be positive, so δ > 0, and the return to ability is expected to be positive,
so α > 0. It follows that δα > 0, so the omitted variables bias is positive in this ex-
ample. OLS of earnings on schooling alone will overstate the effect of education on
earnings.
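The result in (4.40) can be verified numerically. In this sketch the values of β and α and the dgp for schooling and ability are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta, alpha = 1.0, 0.5          # returns to schooling and to ability (illustrative)
x = rng.standard_normal(n)                      # schooling (standardized)
z = 0.6 * x + 0.8 * rng.standard_normal(n)      # ability, correlated with schooling
y = 2.0 + beta * x + alpha * z + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])

# Short regression: y on x alone, omitting z
b_short = np.linalg.lstsq(X, y, rcond=None)[0]

# delta = slope from regressing the omitted z on the included x
delta = np.linalg.lstsq(X, z, rcond=None)[0][1]

# b_short[1] should be close to beta + delta*alpha, not to beta
print(b_short[1], beta + delta * alpha)
```

Here δ ≈ 0.6 and α = 0.5, so the short-regression slope converges to roughly 1.3 rather than 1.0: schooling and ability are positively correlated, and OLS on schooling alone overstates the return to schooling.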
A related form of misspecification is inclusion of irrelevant regressors. For ex-
ample, the regression may be of y on x and z, even though the dgp is more simply
y = x β + v. In this case it is straightforward to show that OLS is consistent, but there
is a loss of efficiency.
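That consistency survives while efficiency suffers can be seen in a small Monte Carlo; all dgp values below are illustrative assumptions. When the irrelevant regressor z is correlated with x, the variance of the slope estimate is inflated by roughly 1/(1 − ρ²), where ρ = Corr(x, z).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
slopes_short, slopes_long = [], []
for _ in range(reps):
    x = rng.standard_normal(n)
    z = 0.8 * x + 0.6 * rng.standard_normal(n)   # irrelevant but correlated with x
    y = 1.0 + 1.0 * x + rng.standard_normal(n)   # dgp: z does not enter
    Xs = np.column_stack([np.ones(n), x])        # correct specification
    Xl = np.column_stack([np.ones(n), x, z])     # includes the irrelevant z
    slopes_short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    slopes_long.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

# Both estimators are centered on the true slope of 1,
# but including z inflates the sampling variance
print(np.mean(slopes_short), np.mean(slopes_long))
print(np.std(slopes_short), np.std(slopes_long))
```

With ρ = 0.8 the variance inflation factor is about 1/(1 − 0.64) ≈ 2.8, so the standard deviation of the slope estimate grows by roughly two-thirds when the irrelevant regressor is included.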
Controlling for omitted variables bias is necessary if parameter estimates are to be
given a causal interpretation. Since too many regressors cause little harm, but too few
regressors can lead to inconsistency, microeconometric models estimated from large
data sets tend to include many regressors. If omitted variables are still present then one
of the methods given at the end of Section 4.7.3 is needed.
and enough assumptions have been made to ensure that the regressors xi are uncorre-
lated with the error term (u i + xi (βi − β)). OLS regression of y on x will therefore
consistently estimate β, though note that the error is heteroskedastic even if u i is ho-
moskedastic.
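Both claims, that OLS remains consistent for the mean slope while the composite error u_i + x_i(β_i − β) becomes heteroskedastic, can be checked in a simulation sketch; the dgp values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
beta_bar = 1.0
x = rng.standard_normal(n)
beta_i = beta_bar + 0.5 * rng.standard_normal(n)   # random slopes with mean beta_bar
y = 2.0 + beta_i * x + rng.standard_normal(n)      # u_i itself is homoskedastic

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# OLS recovers the mean slope, but the composite error u_i + x_i(beta_i - beta_bar)
# has conditional variance 1 + 0.25*x^2: heteroskedastic even though u_i is not
small, large = np.abs(x) < 0.5, np.abs(x) > 1.5
print(b[1], e[small].var(), e[large].var())
```

The residual variance is visibly larger for observations with large |x|, which is why heteroskedasticity-robust standard errors are appropriate here.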
For panel data a standard model is the random effects model (see Section 21.7) that
lets the intercept vary across individuals while the slope coefficients are not random.
For nonlinear models a similar result need not hold, and random parameter models
can be preferred as they permit a richer parameterization. Random parameter models
are consistent with existence of heterogeneous responses of individuals to changes in
x. A leading example is random parameters logit in Section 15.7.
More serious complications can arise when the regression parameters βi for an
individual are related to observed individual characteristics. Then OLS estimation can
lead to inconsistent parameter estimation. An example is the fixed effects model for
panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In
this example, but not in all such examples, alternative consistent estimators for a subset
of the regression parameters are available.