
LINEAR MODELS

[Figure 4.2 appears here: "Regression Lines as Quantile Varies" — scatter plot of the actual data with fitted lines for the 90th percentile, median, and 10th percentile; axes: log household medical expenditure and log household total expenditure.]

Figure 4.2: Quantile regression estimated lines for q = 0.1, q = 0.5, and q = 0.9 from regression of natural logarithm of medical expenditure on natural logarithm of total expenditure. Data for 5006 Vietnamese households with positive medical expenditures in 1997.

in estimated slopes as q increases, as is evident in Figure 4.1. Koenker and Bassett (1982)
developed quantile regression as a means to test for heteroskedastic errors when the
dgp is the linear model. In that case a fanning out of the quantile regression lines
is interpreted as evidence of heteroskedasticity. Another interpretation is that the
conditional mean is nonlinear in x with increasing slope, leading to quantile slope
coefficients that increase with the quantile q.
More detailed illustrations of quantile regression are given in Buchinsky (1994) and
Koenker and Hallock (2001).
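The fanning out under heteroskedasticity can be illustrated with a small simulation. The following Python sketch is not from the text: the dgp, seed, and the simple IRLS solver are illustrative assumptions (production work would use a dedicated quantile regression routine). Because the error spread grows with x, the estimated slope rises with q.

```python
import numpy as np

def quantile_reg(X, y, q, iters=60, eps=1e-4):
    """Quantile regression sketch via iteratively reweighted least squares.

    Minimizes sum_i rho_q(y_i - x_i'b), rho_q(r) = r*(q - 1[r<0]),
    by approximating |r| with r^2 / |r_old| at each iteration."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting values
    for _ in range(iters):
        r = y - X @ beta
        c = np.where(r > 0, q, 1.0 - q)           # asymmetric check weights
        w = c / np.maximum(np.abs(r), eps)
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1.0, 5.0, n)
# Heteroskedastic dgp: error spread grows with x, so the fitted
# quantile lines fan out and the slope increases with q.
y = 1.0 + 0.5 * x + x * rng.normal(0.0, 0.3, n)
X = np.column_stack([np.ones(n), x])

slopes = {q: quantile_reg(X, y, q)[1] for q in (0.1, 0.5, 0.9)}
```

Under this dgp the conditional q-quantile has slope 0.5 + 0.3·Φ⁻¹(q), so the three estimated slopes should be ordered and straddle the median slope of 0.5.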

4.7. Model Misspecification

The term “model misspecification” in its broadest sense means that one or more of the
assumptions made on the data generating process are incorrect. Misspecifications may
occur individually or in combination, but analysis is simpler if only the consequences
of a single misspecification are considered.
In the following discussion we emphasize misspecifications that lead to inconsis-
tency of the least-squares estimator and loss of identifiability of parameters of inter-
est. The least-squares estimator may nonetheless continue to have a meaningful inter-
pretation, only one different from that intended under the assumption of a correctly
specified model. Specifically, the estimator may converge asymptotically to a param-
eter that differs from the true population value, a concept defined in Section 4.7.5 as
the pseudo-true value.
The issues raised here for consistency of OLS are relevant to other estimators in
other models. Consistency can then require stronger assumptions than those needed
for consistency of OLS, so that inconsistency resulting from model misspecification is
more likely.

4.7.1. Inconsistency of OLS


The most serious consequence of a model misspecification is inconsistent estimation
of the regression parameters β. From Section 4.4, the two key conditions needed to
demonstrate consistency of the OLS estimator are (1) the dgp is y = Xβ + u and (2)
the dgp is such that plim N⁻¹X'u = 0. Then

    β̂_OLS = β + (N⁻¹X'X)⁻¹(N⁻¹X'u) →p β,        (4.37)

where the first equality follows if y = Xβ + u (see (4.12)) and the second uses
plim N⁻¹X'u = 0.
The OLS estimator is likely to be inconsistent if model misspecification leads to
either specification of the wrong model for y, so that condition 1 is violated, or corre-
lation of regressors with the error, so that condition 2 is violated.
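A minimal simulation sketch (illustrative, not from the text) of these two conditions: when the dgp really is y = Xβ + u and the error is generated independently of the regressors, the OLS estimate settles on β as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 2.0])   # illustrative true parameters

def ols_estimate(n):
    # Condition 1: the dgp really is y = X beta + u.
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    u = rng.normal(size=n)    # Condition 2: u drawn independently of X
    y = X @ beta + u
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Estimates tighten around beta = (1, 2) as the sample grows.
b_small, b_large = ols_estimate(100), ols_estimate(100_000)
```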

4.7.2. Functional Form Misspecification


A linear specification of the conditional mean function is merely an approximation in
R^K to the true unknown conditional mean function, which lies in a parameter space of
indeterminate dimension. Even if the correct regressors are chosen, it is possible that
the conditional mean is incorrectly specified.
Suppose the dgp is one with a nonlinear regression function
y = g(x) + v,

where the dependence of g(x) on unknown parameters is suppressed, and assume
E[v|x] = 0. The linear regression model

    y = x'β + u

is erroneously specified. The question is whether the OLS estimator can be given any
meaningful interpretation, even though the dgp is in fact nonlinear.
The usual way to interpret regression coefficients is through the true micro relation-
ship, which here is
E[yi |xi ] = g(xi ).

In this case β̂_OLS does not measure the micro response of E[yi|xi] to a change in xi, as
it does not converge to ∂g(xi)/∂xi. So the usual interpretation of β̂_OLS is not possible.
White (1980b) showed that the OLS estimator converges to the value of β that
minimizes the mean-squared prediction error

    Ex[(g(x) − x'β)²].


Hence prediction from OLS is the best linear predictor of the nonlinear regression
function if mean-squared error is used as the loss function. This useful property
has already been noted in Section 4.2.3, but it adds little to the interpretation of β̂_OLS.
In summary, if the true regression function is nonlinear, OLS is not useful for indi-
vidual prediction. OLS can still be useful for prediction of aggregate changes, giving
the sample average change in E[y|x] due to change in x (see Stoker, 1982). However,
microeconometric analyses usually seek models that are meaningful at the individual
level.
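The best-linear-predictor result can be checked numerically. The dgp below (g(x) = x², x uniform on (0, 1)) is an illustrative assumption, chosen because for it the minimizer of Ex[(g(x) − x'β)²] works out cleanly: slope b* = Cov(x, x²)/Var(x) = (1/12)/(1/12) = 1 and intercept a* = E[x²] − b*E[x] = −1/6.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = x**2 + rng.normal(0.0, 0.1, n)    # true regression function g(x) = x^2
X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS converges to the best linear predictor of g(x), not to any
# pointwise derivative of g: here (a*, b*) = (-1/6, 1).
```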
Much of this book presents alternatives to the linear model that are more likely to
be correctly specified. For example, Chapter 14 on binary outcomes presents model
specifications that ensure that predicted probabilities are restricted to lie between 0
and 1. Also, models and methods that rely on minimal distributional assumptions are
preferred because there is then less scope for misspecification.

4.7.3. Endogeneity
Endogeneity is formally defined in Section 2.3. A broad definition is that a regressor
is endogenous when it is correlated with the error term. If any one regressor is en-
dogenous then in general OLS estimates of all regression parameters are inconsistent
(unless the exogenous regressor is uncorrelated with the endogenous regressor).
Leading examples of endogeneity, dealt with extensively in this book in both linear
and nonlinear model settings, include simultaneous equations bias (Section 2.4), omit-
ted variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measure-
ment error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section
observational data are used, and economists are very concerned with this complication.
A quite general approach to control for endogeneity is the instrumental variables
method, presented in Sections 4.8 and 4.9 and in Sections 6.4 and 6.5. This method
cannot always be applied, however, as necessary instruments may not be available.
Other methods to control for endogeneity, reviewed in Section 2.8, include con-
trol for confounding variables, differences in differences if repeated cross-section or
panel data are available (see Chapter 21), fixed effects if panel data are available and
endogeneity arises owing to a time-invariant omitted variable (see Section 21.6), and
regression-discontinuity design (see Section 25.6).

4.7.4. Omitted Variables


Omission of a variable in a linear regression equation is often the first example of
inconsistency of OLS presented in introductory courses. Such omission may be the
consequence of an erroneous exclusion of a variable for which data are available or of
exclusion of a variable that is not directly observed. For example, omission of ability in
a regression of earnings (or more usually its natural logarithm) on schooling is usually
due to unavailability of a comprehensive measure of ability.
Let the true dgp be

    y = x'β + zα + v,        (4.38)


where x and z are regressors, with z a scalar regressor for simplicity, and v is an error
term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of
y on x and z will yield consistent parameter estimates of β and α.
Suppose instead that y is regressed on x alone, with z omitted owing to unavailability.
Then the term zα is moved into the error term. The estimated model is

    y = x'β + (zα + v),        (4.39)

where the error term is now (zα + v). As before, v is uncorrelated with x, but if z is
correlated with x the error term (zα + v) will be correlated with the regressors x. The
OLS estimator will be inconsistent for β if z is correlated with x.
There is enough structure in this example to determine the direction of the inconsistency.
Stacking all observations in the obvious manner gives the dgp y = Xβ + zα + v.
Substituting this into β̂_OLS = (X'X)⁻¹X'y yields

    β̂_OLS = β + (N⁻¹X'X)⁻¹(N⁻¹X'z)α + (N⁻¹X'X)⁻¹(N⁻¹X'v).

Under the usual assumption that X is uncorrelated with v, the final term has probability
limit zero. X is correlated with z, however, and

    plim β̂_OLS = β + δα,        (4.40)

where

    δ = plim[(N⁻¹X'X)⁻¹ N⁻¹X'z]

is the probability limit of the OLS estimator in the regression of the omitted regressor (z)
on the included regressors (X).
This inconsistency is called omitted variables bias, where common terminology
states that various misspecifications lead to bias even though formally they lead to
inconsistency. The inconsistency exists as long as δ ≠ 0, that is, as long as the omitted
variable is correlated with the included regressors. In general the inconsistency could
be positive or negative and could even lead to a sign reversal of the OLS coefficient.
For the returns to schooling example, the correlation between schooling and ability
is expected to be positive, so δ > 0, and the return to ability is expected to be positive,
so α > 0. It follows that δα > 0, so the omitted variables bias is positive in this ex-
ample. OLS of earnings on schooling alone will overstate the effect of education on
earnings.
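A small simulation confirms formula (4.40); the numerical values of β, α, and δ below are assumptions for illustration. Regressing y on x alone converges to β + δα rather than β.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, alpha, delta = 1.0, 0.5, 0.8    # illustrative values
x = rng.normal(size=n)
z = delta * x + rng.normal(size=n)    # plim of the z-on-x regression is delta
v = rng.normal(size=n)
y = beta * x + alpha * z + v          # dgp (4.38) with scalar x and z

# OLS of y on x alone (all variables have mean zero, so no intercept needed):
b_short = (x @ y) / (x @ x)
# plim b_short = beta + delta * alpha = 1.0 + 0.8 * 0.5 = 1.4 > beta,
# matching the positive bias predicted for the schooling example.
```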
A related form of misspecification is inclusion of irrelevant regressors. For example,
the regression may be of y on x and z, even though the dgp is more simply
y = x'β + v. In this case it is straightforward to show that OLS is consistent, but there
is a loss of efficiency.
Controlling for omitted variables bias is necessary if parameter estimates are to be
given a causal interpretation. Since too many regressors cause little harm, but too few
regressors can lead to inconsistency, microeconometric models estimated from large
data sets tend to include many regressors. If omitted variables are still present then one
of the methods given at the end of Section 4.7.3 is needed.

4.7.5. Pseudo-True Value


In the omitted variables example the least-squares estimator is subject to confounding
in the sense that it does not estimate β, but instead estimates a function of β, δ, and α.
The OLS estimate cannot be used as an estimate of β, which, for example, measures
the effect of an exogenous change in a regressor x such as schooling holding all other
regressors including ability constant.
From (4.40), however, β̂_OLS is a consistent estimator of the function (β + δα) and
has a meaningful interpretation. The probability limit β* = (β + δα) of β̂_OLS is
referred to as the pseudo-true value corresponding to β̂_OLS; see Section 5.7.1 for a
formal definition.
Furthermore, one can obtain the distribution of β̂_OLS even though it is inconsistent
for β. The estimated asymptotic variance of β̂_OLS measures dispersion around
(β + δα) and is given by the usual estimator, for example by s²(X'X)⁻¹ if the error in
(4.38) is homoskedastic.

4.7.6. Parameter Heterogeneity


The presentation to date has permitted regressors and error terms to vary across indi-
viduals but has restricted the regression parameters β to be the same across individuals.
Instead, suppose that the dgp is

    yi = xi'βi + ui,        (4.41)

with subscript i on the parameters. This is an example of parameter heterogeneity,
where the marginal effect ∂E[yi|xi]/∂xi = βi is now permitted to differ across individuals.
The random coefficients model or random parameters model specifies βi to be
independently and identically distributed over i with distribution that does not depend
on the observables xi. Let the common mean of βi be denoted β. The dgp can be
rewritten as

    yi = xi'β + (ui + xi'(βi − β)),

and enough assumptions have been made to ensure that the regressors xi are uncorrelated
with the error term (ui + xi'(βi − β)). OLS regression of y on x will therefore
consistently estimate β, though note that the error is heteroskedastic even if ui is
homoskedastic.
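A short simulation sketch (dgp values assumed for illustration): with βi drawn independently of xi, OLS of y on x recovers the mean coefficient β, while the composite error is heteroskedastic by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(1.0, 1.0, n)
beta_i = 2.0 + rng.normal(0.0, 0.5, n)   # beta_i iid, independent of x_i
u = rng.normal(size=n)                   # homoskedastic baseline error
y = x * beta_i + u

X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# The composite error u_i + x_i*(beta_i - 2.0) is uncorrelated with x_i,
# so OLS recovers the mean coefficient beta = 2.0, but its variance
# 1 + 0.25*x_i^2 grows with |x_i|: heteroskedastic even though u_i is not.
```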
For panel data a standard model is the random effects model (see Section 21.7) that
lets the intercept vary across individuals while the slope coefficients are not random.
For nonlinear models a similar result need not hold, and random parameter models
can be preferred as they permit a richer parameterization. Random parameter models
are consistent with existence of heterogeneous responses of individuals to changes in
x. A leading example is random parameters logit in Section 15.7.
More serious complications can arise when the regression parameters βi for an
individual are related to observed individual characteristics. Then OLS estimation can
lead to inconsistent parameter estimation. An example is the fixed effects model for
panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In
this example, but not in all such examples, alternative consistent estimators for a subset
of the regression parameters are available.

4.8. Instrumental Variables

A major complication that is emphasized in microeconometrics is the possibility of


inconsistent parameter estimation caused by endogenous regressors. Then regression
estimates measure only the magnitude of association, rather than the magnitude and
direction of causation, both of which are needed for policy analysis.
The instrumental variables estimator provides a way to nonetheless obtain consis-
tent parameter estimates. This method, widely used in econometrics and rarely used
elsewhere, is conceptually difficult and easily misused.
We provide a lengthy expository treatment that defines an instrumental variable and
explains how the instrumental variables method works in a simple setting.

4.8.1. Inconsistency of OLS


Consider the scalar regression model with dependent variable y and single regressor x.
The goal of regression analysis is to estimate the conditional mean function E[y|x]. A
linear conditional mean model, without intercept for notational convenience, specifies
E[y|x] = βx. (4.42)
This model without intercept subsumes the model with intercept if dependent and
regressor variables are deviations from their respective means. Interest lies in obtaining
a consistent estimate of β as this gives the change in the conditional mean given an
exogenous change in x. For example, interest may lie in the effect on earnings caused
by an increase in schooling attributed to exogenous reasons, such as an increase in the
minimum age at which students leave school, that are not a choice of the individual.
The OLS regression model specifies
y = βx + u, (4.43)

where u is an error term. Regression of y on x yields the OLS estimate β̂ of β.


Standard regression results make the assumption that the regressors are uncorrelated
with the errors in the model (4.43). Then the only effect of x on y is a direct effect via
the term βx. We have the following path analysis diagram:

    x −→ y ←− u

where there is no association between x and u. So x and u are independent causes
of y.
However, in some situations there may be an association between regressors and
errors. For example, consider regression of log-earnings (y) on years of schooling (x).
The error term u embodies all factors other than schooling that determine earnings,