Topic 3 - Endogeneity

Econ 7IE


ECON7IE

Topic 3
Endogeneity
WHY IS THIS TOPIC IMPORTANT?
• We commonly need to estimate models where:
– One or more important factors cannot be measured
– Some of the data may be inaccurate
– There are multiple causal relationships, not just X → Y

• These are all examples of the presence of endogeneity


– Its effect on the reliability of regression model results is a key issue in empirical
research

• In this topic, we’ll learn what endogeneity is, how it affects the reliability of OLS results,
and what methods can be used to overcome it

Part 1

The Problem of Endogeneity

We consider the case of an endogenous explanatory variable, which arises when
one of the Classical Linear Regression Model assumptions is violated.
1.1 DEFINITION OF ENDOGENEITY
• Consider the regression model
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝑢
• If any 𝑋𝑗 is correlated with 𝑢 for any reason, then:
– 𝑋𝑗 is an endogenous explanatory variable
• Three key statistical / economic reasons why 𝑋𝑗 and 𝑢 may be correlated:
a) Omitted variables that are correlated with 𝑋𝑗
b) Measurement error in 𝑋𝑗
c) Simultaneity (or bi-directional causality) between 𝑋𝑗 and 𝑌

• We will:
– Try to identify sources of endogeneity in models
– Derive expressions for the consequences of endogeneity
– See how we can estimate models to overcome this issue
a) Omitted variable
• An important explanatory variable is omitted from the regression
– And it is correlated with any of the included X variables
• Why might a variable be omitted from a model?
• E.g.
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + ⋯ + 𝑢

b) Measurement error
• An explanatory variable is measured with error i.e. is inaccurate:
– Some variables are inherently difficult to measure e.g. income
– May need to use a proxy when true variable is unavailable
• E.g.
𝑙𝑒𝑖𝑠𝑢𝑟𝑒 𝑡𝑖𝑚𝑒 = 𝛽1 + 𝛽2 ℎℎ𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽3 𝑚𝑎𝑙𝑒 + ⋯ + 𝑢
c) Simultaneity
• One (or more) explanatory variables are jointly determined with Y
– i.e. X affects Y, and Y affects X
• Common in macro models
• Also occurs with many other complex economic processes
• E.g. effect of inflation on trade openness:
𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 = 𝛽1 + 𝛽2 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 + 𝛽3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝛽4 𝑙𝑛𝑙𝑎𝑛𝑑 + 𝑢1
𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 = 𝛼1 + 𝛼2 𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 + 𝛼3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝑢2
• Possible to suspect/identify simultaneity even when only given one equation:
– If we suspect feedback from Y to X
• All demand and supply models suffer from simultaneity:
– Equilibrium price and quantity are determined simultaneously
– Through the interaction of demand and supply
Class Exercise Question 1
• Identify the source of endogeneity related to the first X variable in each
of the following models:
a) Omitted variables that are correlated with 𝑋𝑗
b) Measurement error in 𝑋𝑗
c) Simultaneity (or bi-directional causality) between 𝑋𝑗 and 𝑌
• In some cases, more than one source may apply!

1. 𝑚𝑢𝑟𝑑𝑒𝑟 𝑟𝑎𝑡𝑒 = 𝛽1 + 𝛽2 𝑝𝑜𝑙𝑖𝑐𝑒 + 𝛽3 𝑖𝑛𝑐𝑜𝑚𝑒 + 𝑢


2. 𝑒𝑚𝑝𝑙𝑜𝑦𝑒𝑑 = 𝛽1 + 𝛽2 𝑖𝑚𝑚𝑖𝑔_𝑠𝑡𝑎𝑡𝑢𝑠 + 𝛽3 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 + 𝑢
3. ℎ𝑒𝑎𝑙𝑡ℎ_𝑠𝑡𝑎𝑡𝑢𝑠 = 𝛽1 + 𝛽2 𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽3 𝑎𝑔𝑒 + ⋯ + 𝑢
4. 𝑔𝑟𝑜𝑤𝑡ℎ = 𝛽1 + 𝛽2 𝑖𝑛𝑠𝑡𝑖𝑡𝑢𝑡𝑖𝑜𝑛𝑎𝑙_𝑞𝑢𝑎𝑙𝑖𝑡𝑦 + 𝛽3 𝑐𝑎𝑝𝑖𝑡𝑎𝑙 + 𝛽4 𝑙𝑎𝑏𝑜𝑢𝑟 + 𝑢
5. 𝑙𝑛ℎ𝑤𝑎𝑔𝑒 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑒𝑥𝑝 + ⋯ + 𝑢
6. 𝑞𝑢𝑎𝑛𝑡𝑖𝑡𝑦𝑇𝑉𝑠 = 𝛽1 + 𝛽2 𝑝𝑟𝑖𝑐𝑒𝑇𝑉𝑠 + 𝛽3 𝑖𝑛𝑐𝑜𝑚𝑒 + ⋯ + 𝑢
1.2 SUMMARY THUS FAR
• Endogeneity is present in a lot of models

• We need to be able to:


– Explain its source ✓
– Understand its effect on our ability to estimate reliable parameters
– Correct any resulting econometric problems

1.3 STANDARD ASSUMPTIONS FOR THE
CLASSICAL LINEAR REGRESSION MODEL (CLRM)
• These assumptions are required:
– For OLS estimators to be unbiased estimators of population parameters

• Assumptions relate to statistical properties of estimators:


– Somewhat abstract!
– Describe properties of estimators when random sampling is done repeatedly
– Have nothing to do with a particular sample
– i.e. not meaningful to discuss properties of estimates obtained from a single sample

• Assumption CLRM1:
– The model is linear in the parameters
• Assumption CLRM2:
– The dataset is a random sample drawn from the population
• Assumption CLRM3:
– There is no perfect multicollinearity
• Assumption CLRM4:
– The error terms must be uncorrelated with all the X variables
– i.e. there is no endogeneity

• When CLRM4 holds: we have exogenous explanatory variables


• But if any 𝑋𝑗 is correlated with 𝑢 for any reason, then 𝑋𝑗 is an endogenous explanatory
variable

Assumption CLRM4: Zero conditional mean
$$E(u \mid X_2, X_3, \dots, X_k) = 0$$
or
$$\operatorname{cov}(u, X_j) = 0, \quad j = 2, \dots, k$$
• CLRM4 is more likely to hold when fewer factors are in the error term
– i.e. When the model is better specified
• BUT CLRM4 can fail due to three sources discussed previously
• We cannot know for sure whether the average value of the unobserved factors is
unrelated to the explanatory variables.
• But this is the most important assumption:

Exogeneity is the key assumption to enable a causal interpretation of the regression results
WHY?
1.4 RESULT: CONSISTENCY OF OLS
Under assumptions CLRM1-CLRM4:
OLS estimator 𝑏𝑗 is consistent for 𝛽𝑗 for all 𝑗 = 2, … , 𝑘

• What is consistency?
– It is an asymptotic or large sample property
– Let 𝑏𝑗 be the OLS estimator of 𝛽𝑗 for some j.
– For each N, 𝑏𝑗 has a probability distribution (representing its possible values in
different random samples of size N).
– If this estimator is consistent, then the distribution of 𝑏𝑗 becomes more and more
tightly distributed around 𝛽𝑗 as the sample size grows.
– As N tends to infinity, the distribution of 𝑏𝑗 collapses to the single point 𝛽𝑗 :

$$\operatorname{plim} b_j = \beta_j$$
Say: 𝛽𝑗 is the probability limit of 𝑏𝑗
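Fig C3. Sampling distributions of 𝑏𝑗 for increasing sample sizes (N = 4, N = 16, N = 40)
[Figure: density 𝑓(𝑏𝑗) on the vertical axis against 𝑏𝑗 on the horizontal axis, with 𝛽𝑗 marked on the horizontal axis.]
Source: Wooldridge (2013)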
Why does consistency matter?
• Virtually all economists agree:
– consistency is a minimal requirement for an estimator

• Nobel Prize-winning econometrician Clive W. J. Granger:


– “If you can’t get it right as N goes to infinity, you shouldn’t be in this business.”

Showing the consistency of OLS
• In general, we need matrix algebra to show this.
• But, we can illustrate it for a simple model with a single X
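• The formula (estimator) for the slope coefficient is given by:
$$b_2 = \frac{\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, Y_i}{\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)^2}$$
• Substituting 𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖2 + 𝑢𝑖 and rearranging gives:
$$b_2 = \beta_2 + \frac{\frac{1}{N}\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, u_i}{\frac{1}{N}\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)^2}$$
• By the law of large numbers, the numerator and denominator converge in probability to
$\operatorname{cov}(X_2, u)$ and $\operatorname{var}(X_2)$, i.e.
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} = \beta_2$$
because $\operatorname{cov}(X_2, u) = 0$ under CLRM4.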
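The collapse of the sampling distribution can be illustrated with a small simulation. This is only a sketch under assumed values (𝛽1 = 1, 𝛽2 = 2, standard normal X and u, all hypothetical and not taken from the lecture); the function names are ours.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def simulate_slopes(n, reps, beta1=1.0, beta2=2.0, seed=0):
    """Draw `reps` random samples of size n with X independent of u (CLRM4 holds)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [beta1 + beta2 * xi + rng.gauss(0, 1) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

def spread(values):
    """Standard deviation of the simulated slope estimates."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

# Columns: sample size, average slope, spread of the slopes across samples.
for n in (4, 16, 40, 400):
    s = simulate_slopes(n, reps=1000)
    print(n, round(sum(s) / len(s), 3), round(spread(s), 3))
```

The spread of the estimates should shrink as N grows while their centre stays near the assumed 𝛽2, which is the behaviour sketched in Fig C3.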
1.5 CONSEQUENCE OF VIOLATING ASSUMPTION CLRM4
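• Given that:
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \qquad (1.1)$$
• Then the inconsistency (or asymptotic bias) is:
$$\operatorname{plim} b_2 - \beta_2 = \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)}$$
– If $\operatorname{cov}(X_2, u) = 0$: OLS is consistent (no asymptotic bias)
– If $\operatorname{cov}(X_2, u) < 0$: OLS is inconsistent and biased downwards
– If $\operatorname{cov}(X_2, u) > 0$: OLS is inconsistent and biased upwards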

• If the covariance is small, the inconsistency might be negligible


– But we cannot estimate 𝑐𝑜𝑣 𝑋2 , 𝑢 since 𝑢 is unobserved
• We need to use our knowledge of the relationship being estimated
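Although 𝑐𝑜𝑣(𝑋2, 𝑢) cannot be estimated from real data, eq.(1.1) can be checked in a simulation where u is known. The sketch below assumes a hypothetical design, X2 = w + gamma·u with w and u standard normal, so that cov(X2, u) = gamma and var(X2) = 1 + gamma²; the value 𝛽2 = 2 is also an assumption.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def endogenous_sample(n, gamma, beta1=1.0, beta2=2.0, seed=0):
    """Generate data in which X2 = w + gamma*u, so X2 is correlated with u."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        xi = rng.gauss(0, 1) + gamma * u
        x.append(xi)
        y.append(beta1 + beta2 * xi + u)
    return x, y

def predicted_plim(gamma, beta2=2.0):
    """Eq.(1.1): beta2 + cov(X2, u) / var(X2) for this design."""
    return beta2 + gamma / (1 + gamma ** 2)

# Columns: gamma, OLS slope in a large sample, value predicted by eq.(1.1).
for gamma in (-0.5, 0.0, 0.5):
    x, y = endogenous_sample(50_000, gamma)
    print(gamma, round(ols_slope(x, y), 3), round(predicted_plim(gamma), 3))
```

A negative gamma should pull the slope below 2 and a positive gamma should push it above 2, matching the three cases listed above.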
• We will examine each of the three potential causes of endogeneity
• i.e. of violating assumption CLRM4
1. Omitted variables
2. Measurement error
3. Bidirectional causality (simultaneity)

• We will look at:


– Why is 𝑢 correlated with 𝑋𝑗 in each case?
– What is the nature of the resulting asymptotic bias in each case?
– What is the general econometric method of solving the endogeneity issue?
• Instrumental variables

2. OMITTED VARIABLES
• Suppose the true model is:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢
• but instead we estimate:
𝑌 = 𝑏1 + 𝑏2 𝑋2 + 𝑣
– E.g. 𝑌 is earnings, 𝑋2 is years of education, and 𝑋3 is ability
– Does 𝑏2 measure the true return to education, 𝛽2 ?
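• From eq.(1.1), where the error term of the estimated model is $v = \beta_3 X_3 + u$:
$$\begin{aligned}
\operatorname{plim} b_2 &= \beta_2 + \frac{\operatorname{cov}(X_2, v)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3 + u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3) + \operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \beta_3 \frac{\operatorname{cov}(X_2, X_3)}{\operatorname{var}(X_2)}
\end{aligned}$$
– The last step uses $\operatorname{cov}(X_2, u) = 0$, which holds in the true model.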
• Therefore 𝑏2 is asymptotically unbiased only if either:
➢ 𝛽3 = 0 (i.e. there is no omitted variable), or
➢ 𝑋2 and 𝑋3 are uncorrelated.
• If neither of these two occurs, then b2 is biased and inconsistent,
– direction of asymptotic bias depends on sign of 𝛽3 𝑐𝑜𝑣 𝑋2 , 𝑋3 .
• In the example:
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
– what is the direction of the bias of the return to education, when ability is
unobserved?
• Determining direction of bias is more complex with multiple Xs:
– Depends on their relationships with each other and with the omitted factor
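The omitted-variable formula can be checked numerically. The sketch below uses made-up values (𝛽2 = 1, 𝛽3 = 2, and schooling = 0.5·ability + noise); these numbers are assumptions for illustration, not estimates from any dataset.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def short_regression_slope(n, beta2=1.0, beta3=2.0, delta=0.5, seed=0):
    """Regress earnings on schooling only, leaving ability in the error term."""
    rng = random.Random(seed)
    school, earn = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        s = delta * ability + rng.gauss(0, 1)   # schooling correlated with ability
        school.append(s)
        earn.append(beta2 * s + beta3 * ability + rng.gauss(0, 1))
    return ols_slope(school, earn)

def predicted_plim(beta2=1.0, beta3=2.0, delta=0.5):
    """beta2 + beta3 * cov(X2, X3) / var(X2), with cov = delta and var = delta**2 + 1."""
    return beta2 + beta3 * delta / (delta ** 2 + 1)

print(round(short_regression_slope(50_000), 3), round(predicted_plim(), 3))
```

With a positive 𝛽3 and a positive covariance between schooling and ability, the short regression overstates the assumed return to schooling in this design.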

Now try Exercise 3, Question 2!


3. MEASUREMENT ERROR
• Suppose that the true model is given by
𝑌 = 𝛽1 + 𝛽2 𝑋2∗ + 𝑢
• But 𝑋2∗ cannot be measured accurately: we only have an imperfect measure 𝑋2
– E.g. 𝑋2∗ is actual income, but 𝑋2 is reported income
• What is the effect on our ability to estimate 𝛽2 ?
• The measurement error in the population is simply
𝑒2 = 𝑋2 − 𝑋2∗
• We make the classical errors-in-variables (CEV) assumption: the measurement error is
uncorrelated with the true (unobserved) 𝑋2∗
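• Simplifying eq.(1.1) for 𝑝𝑙𝑖𝑚 𝑏2, using various properties of variance and covariance in this
context (the error term of the estimated model is $v = u - \beta_2 e_2$), gives:
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, v)}{\operatorname{var}(X_2)} = \beta_2\, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}$$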
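The intermediate steps behind this simplification can be spelled out. The sketch below assumes, in addition to the CEV assumption, that the measurement error is uncorrelated with u and that the true regressor is uncorrelated with u; the symbol v for the composite error is introduced here only for illustration.

```latex
\begin{aligned}
Y &= \beta_1 + \beta_2 (X_2 - e_2) + u = \beta_1 + \beta_2 X_2 + v,
   \qquad v = u - \beta_2 e_2 \\
\operatorname{cov}(X_2, v) &= \operatorname{cov}(X_2^* + e_2,\; u - \beta_2 e_2)
   = -\beta_2 \operatorname{var}(e_2) \\
\operatorname{var}(X_2) &= \operatorname{var}(X_2^*) + \operatorname{var}(e_2) \\
\operatorname{plim} b_2 &= \beta_2 - \beta_2\,
   \frac{\operatorname{var}(e_2)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}
   = \beta_2\, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}
\end{aligned}
```

The observed regressor is correlated with the composite error only through the measurement error itself, which is why the bias term is proportional to the noise variance.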
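• In the presence of measurement error:
$$\operatorname{plim} b_2 = \beta_2\, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}$$
– $\operatorname{var}(X_2^*)$ is the ‘signal’, i.e. the true information contained in $X_2^*$
– $\operatorname{var}(e_2)$ is the ‘noise’, i.e. the measurement error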
• Therefore, the OLS estimate 𝑏2 is biased towards zero (this is called attenuation bias).
– The larger the degree of measurement error, the greater is the attenuation bias.

• Issue is more complex in models with multiple Xs:


– Generally, measurement error in a single variable causes inconsistency in all
estimators
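Attenuation bias is easy to see in simulated data. The following sketch assumes a true slope of 2 and a true regressor with unit variance (both hypothetical); the noise standard deviation `sd_e` controls the size of the measurement error.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def slope_with_noisy_x(n, sd_e, beta2=2.0, seed=0):
    """Y depends on the true X*, but we regress Y on the mismeasured X = X* + e."""
    rng = random.Random(seed)
    x_obs, y = [], []
    for _ in range(n):
        x_true = rng.gauss(0, 1)
        x_obs.append(x_true + rng.gauss(0, sd_e))   # classical measurement error
        y.append(1.0 + beta2 * x_true + rng.gauss(0, 1))
    return ols_slope(x_obs, y)

def attenuation_factor(sd_e, var_true=1.0):
    """Signal / (signal + noise) from the plim formula."""
    return var_true / (var_true + sd_e ** 2)

# Columns: noise sd, OLS slope in a large sample, slope predicted by the formula.
for sd_e in (0.0, 0.5, 1.0):
    print(sd_e, round(slope_with_noisy_x(50_000, sd_e), 3),
          round(2.0 * attenuation_factor(sd_e), 3))
```

When the noise variance equals the signal variance (`sd_e = 1`), the formula predicts that the slope is halved.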
4. SIMULTANEITY
• Simultaneity arises when some of the Xs are jointly determined with
the dependent variable in the same economic model.
– There is bidirectional causality between X and Y
• We should view the equation we are interested in estimating as part of a system of
relationships:
– multiple causal relationships.

• Some examples:
– Models of demand and supply i.e. market equilibrium
• For commodities
• For an input into production e.g. labour
– Models of the macroeconomy
Example 1: Demand and supply
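• Consider a system of supply and demand for a commodity:
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$   (4.1)
Supply: $Q = \beta_1 + \beta_2 P + u_2$   (4.2)
• In equilibrium, equate demand and supply:
$$\begin{aligned}
\alpha_1 + \alpha_2 P + \alpha_3 Y + u_1 &= \beta_1 + \beta_2 P + u_2 \\
\alpha_2 P - \beta_2 P &= \beta_1 + u_2 - (\alpha_1 + \alpha_3 Y + u_1) \\
P(\alpha_2 - \beta_2) &= \beta_1 - \alpha_1 - \alpha_3 Y + u_2 - u_1 \\
P &= \frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2} - \frac{\alpha_3}{\alpha_2 - \beta_2}\, Y + \frac{u_2 - u_1}{\alpha_2 - \beta_2} \qquad (4.3)
\end{aligned}$$
– In (4.3) the first term is the intercept, the coefficient on Y is the slope, and the last term is the error term
• Thus P is a function of u1: i.e. X variable correlated with error term
• P is an endogenous explanatory variable
– It is simultaneously determined with Q
– Cannot meaningfully estimate (4.1) using OLS: the estimator of $\alpha_2$ will be inconsistent.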
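The correlation between P and u1 can be made explicit from (4.3), writing the demand slope as 𝛼2 and the supply slope as 𝛽2. The sketch assumes that Y is uncorrelated with u1 and that the two structural errors are uncorrelated with each other; neither assumption is stated in the slides.

```latex
\operatorname{cov}(P, u_1)
  = \operatorname{cov}\!\left(\frac{u_2 - u_1}{\alpha_2 - \beta_2},\, u_1\right)
  = -\frac{\operatorname{var}(u_1)}{\alpha_2 - \beta_2}
```

With a downward-sloping demand curve (𝛼2 < 0) and an upward-sloping supply curve (𝛽2 > 0), the denominator is negative, so under these assumptions the covariance is positive.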
4.1 SIMULTANEITY BIAS

• Demand and supply equations (4.1) and (4.2) are known as structural equations:
– They describe the structure of the economy:
• Derivable from economic theory
• Have a causal interpretation
• In the structural equations:
– Price and quantity are determined simultaneously:
• price affects quantity and quantity affects price
– P and Q are endogenous variables, while Y is exogenous
– Estimation by OLS will lead to biased and inconsistent coefficient estimates
• Explanatory variables are correlated with error term
• Determining the direction of the bias is generally complicated in models with multiple X
variables.
Avoiding simultaneity bias
• Equations such as (4.3) are known as reduced form equations:
– Endogenous variables are expressed as a function only of all exogenous variables
(and a constant)
– Can derive a similar equation for Q
• Write the reduced form equations as:
$$P = \pi_{11} + \pi_{21} Y + v_1 \qquad (4.3a)$$
$$Q = \pi_{12} + \pi_{22} Y + v_2 \qquad (4.4)$$
• These reduced form equations can be estimated by OLS:
– All the RHS variables are exogenous
• But:
– We don’t care about the values of the 𝜋 parameters
– The parameters of interest are 𝛼1 , 𝛼2 and 𝛼3 , and 𝛽1 and 𝛽2 (from the structural
equations)
4.2 IDENTIFICATION OF STRUCTURAL EQUATIONS
• In OLS, we can identify the value of the parameters if:
– each explanatory variable is uncorrelated with the error term
• This condition does not hold when there is endogeneity

• We can sometimes still identify (or consistently estimate) the parameters in a structural
equation
– Similarly for cases of omitted variables or measurement error.

• Do we have enough information to retrieve the original coefficients (𝛼s and 𝛽s) from
the 𝜋s?
– Answer depends on having additional exogenous variables
– i.e. exogenous variables that are not in the equation of interest
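For the demand and supply system (4.1) and (4.2), the link between the 𝜋s and the structural coefficients can be written out. This is a sketch obtained by substituting the reduced form for P into the supply equation; it assumes that 𝜋21 is non-zero, i.e. that income really does shift demand.

```latex
\begin{aligned}
Q &= \beta_1 + \beta_2 (\pi_{11} + \pi_{21} Y + v_1) + u_2
   \;\Rightarrow\; \pi_{12} = \beta_1 + \beta_2 \pi_{11}, \quad \pi_{22} = \beta_2 \pi_{21} \\
\beta_2 &= \frac{\pi_{22}}{\pi_{21}}, \qquad \beta_1 = \pi_{12} - \beta_2 \pi_{11}
\end{aligned}
```

The supply coefficients can therefore be recovered from the four 𝜋s, because Y is excluded from the supply equation. The demand equation has three unknown coefficients but only two remaining restrictions, so they cannot be recovered.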
Three possible situations
1. An equation is unidentified
– We cannot get the structural coefficients from the reduced form estimates
– E.g. the demand equation $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$
– There are no additional exogenous variables in the model

2. An equation is exactly identified


– We can get unique structural form coefficient estimates
– E.g. the supply equation $Q = \beta_1 + \beta_2 P + u_2$

3. An equation is over-identified
– More than one set of structural coefficients could be obtained from the reduced form
– Example given later
Condition for a structural equation to be identified
• A structural equation satisfies the order condition if:
– number of exogenous variables excluded from the equation is
– at least as large as the number of right-hand side endogenous variables
• This is a necessary (but not sufficient) condition for identification
• The rank condition is a sufficient condition
– but requires matrix algebra: beyond scope of this module
• Express the order condition as:
$$K - k_0 \geq m_0$$
• where K = total no. of exogenous variables in the equation system (i.e. overall model)
k0 = no. of exogenous variables in the structural equation
m0 = no. of endogenous variables on RHS of structural equation
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$   (4.1)
Supply: $Q = \beta_1 + \beta_2 P + u_2$   (4.2)

Are each of these structural equations identified?


For the model as a whole: K=
Demand equation: k0 = ; m0 =
K – k0 =

Supply equation: k0 = ; m0 =
K – k0 =

• Therefore we can get consistent estimates of the parameters in the supply equation
– but not in the demand equation.
Example 2: Keynesian macro model
• For a closed economy:
$$C = \beta_1 + \beta_2 Y + \beta_3 r + u_1 \qquad (4.5)$$
$$I = \gamma_1 + \gamma_2 r + u_2 \qquad (4.6)$$
$$Y \equiv C + I + G \qquad (4.7)$$
• Three equations in the system:
– therefore three endogenous (dependent) variables
• Assume all other variables are exogenous

• Is equation (4.5) identified?


– For the model as a whole: K=
– For equation (4.5): k0 = ; m0 =
– Therefore:
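The order-condition bookkeeping can be sketched as a small helper. The counts passed in below are our reading of the models (the constant is not counted as an exogenous variable, and r and G are taken as the exogenous variables of the macro model); the function name is hypothetical. Since the order condition is only necessary, the labels describe the order condition alone.

```python
def order_condition(K, k0, m0):
    """Classify an equation by the order condition K - k0 >= m0.

    K  : exogenous variables in the whole system
    k0 : exogenous variables included in this equation
    m0 : endogenous variables on the right-hand side of this equation
    """
    excluded = K - k0
    if excluded < m0:
        return "unidentified"
    if excluded == m0:
        return "exactly identified"
    return "over-identified"

# Demand (4.1): Y is the only exogenous variable and it appears in the equation.
print("demand (4.1):", order_condition(K=1, k0=1, m0=1))
# Supply (4.2): Y is excluded, P is the single RHS endogenous variable.
print("supply (4.2):", order_condition(K=1, k0=0, m0=1))
# Consumption (4.5): r and G are exogenous, r is included, Y is endogenous.
print("consumption (4.5):", order_condition(K=2, k0=1, m0=1))
```

Under this reading, the demand equation fails the order condition, while the supply equation and equation (4.5) satisfy it with exactly as many excluded exogenous variables as needed.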

• Various issues with such a simple macro model:
1. Difficult to argue that interest rates and government spending are exogenous
2. Model would be estimated with time series data, but is static:
• We expect adjustment lags

• Can adapt the model to deal with issue 2, e.g.


𝐶𝑡 = 𝛽1 + 𝛽2 𝑌𝑡 + 𝛽3 𝑟𝑡 + 𝛽4 𝐶𝑡−1 + 𝑢1
𝐼𝑡 = 𝛾1 + 𝛾2 𝑟𝑡 + 𝛾3 𝑌𝑡−1 + 𝑢2
• Then the lagged values can be treated as exogenous:
– They are referred to as predetermined variables
– Including lags helps with identification (as well as better modelling dynamic
behaviour)

Now try Exercise 3, Questions 3.1 and 3.2!


Part 2

Estimation in the Presence of Endogeneity:


The use of instrumental variables

We focus on how to address endogeneity, and various associated statistical tests
5. ESTIMATION: INSTRUMENTAL VARIABLE TECHNIQUE
• Recall:
– We cannot use OLS directly on the structural equations
– Because the endogenous explanatory variable/s are correlated with the errors

• One solution:
– Don’t use the endogenous Xs
– Rather, use some other variables instead
• We want these other variables to be:
– (highly) correlated with the endogenous Xs, but
– NOT correlated with the errors

• They are called INSTRUMENTS (IVs)


• Here, we express the use of instruments more formally:
• Consider the equation:
$$Y_1 = \beta_1 + \beta_2 X + \beta_3 Y_2 + u$$
where X is exogenous and Y2 is endogenous (correlated with u).
• The method of instrumental variables requires that we find a variable Z which is an
instrument for Y2
• Z must be:
1) strongly correlated with Y2
Instrument relevance: $\operatorname{corr}(Z, Y_2) \neq 0$
but
2) not correlated with u
Instrument exogeneity: $\operatorname{corr}(Z, u) = 0$
• If the instrument is good (i.e. satisfies the two conditions above):
– we can use it to consistently estimate the parameters in the equation of interest.
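Why the two conditions deliver a consistent estimate can be seen in a simplified version of the equation. The sketch below drops the exogenous variable X (an assumption made only to keep the algebra short) and takes covariances with Z on both sides.

```latex
\begin{aligned}
Y_1 &= \beta_1 + \beta_3 Y_2 + u \\
\operatorname{cov}(Z, Y_1) &= \beta_3 \operatorname{cov}(Z, Y_2) + \operatorname{cov}(Z, u) \\
\beta_3 &= \frac{\operatorname{cov}(Z, Y_1)}{\operatorname{cov}(Z, Y_2)}
  \quad \text{if } \operatorname{cov}(Z, u) = 0 \text{ and } \operatorname{cov}(Z, Y_2) \neq 0
\end{aligned}
```

Exogeneity removes the last covariance term, and relevance guarantees that the denominator is not zero; replacing the population covariances with sample covariances gives the simple IV estimator.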
5.1 WHERE DO THE INSTRUMENTS COME FROM?
• Depends on the source of endogeneity
• Simultaneity:
– Provided we have a model with multiple equations:
– Instruments are the excluded exogenous variables from other equations
• Including any predetermined variables
• Omitted variable and measurement error:
– More challenging:
• There aren’t additional equations with extra variables
– Need to make an argument for choice of instrument/s, and justify
– Similarly for cases of simultaneity with only one equation
• Panel data often provides instruments from previous time periods
– See Topics 5 and 6 for more information
Some examples of instruments: 1

• We want to estimate the causal effect of skipping class on academic performance:


𝑚𝑎𝑟𝑘 = 𝛽1 + 𝛽2 𝑎𝑏𝑠𝑒𝑛𝑡 + 𝛽3 𝑝𝑟𝑒𝑣𝑚𝑎𝑟𝑘𝑠 + 𝛽4 𝑚𝑜𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛 + 𝑢
– But motivation is an omitted variable
– We suspect it is correlated with absenteeism

• Proposed IV:
– Use distance between living location and campus as instrument for absent
• Motivation:
– Relevance: longer commute → higher probability of being absent (e.g. due to transport
problems)
– Exogeneity: distance not expected to be correlated with motivation
Some examples of instruments: 2
• We want to estimate the causal effect of education on earnings:
log(𝑤𝑎𝑔𝑒) = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
• Proposed IV 1: Parents’ education
– Relevance: parents’ education is correlated with child’s education in many samples
(true for SA?)
– Exogeneity: but likely to be correlated with child’s ability
• Proposed IV 2: Number of siblings
– Relevance: having more siblings is typically associated with lower education per child
(true for SA?)
– Exogeneity: likely to be uncorrelated with child’s ability
• Need to make similar arguments for measurement error cases
The statistical reliability of the results depends on having good IVs
5.2 TWO-STAGE LEAST SQUARES (2SLS)
• Two-stage least squares (2SLS) provides a method for using multiple
instrumental variables.
• 2SLS proceeds as follows:
– Stage 1:
• Regress each endogenous variable that appears on the RHS of the structural
equation on all of its instruments (together with the exogenous variables included in the equation)
– In simultaneous equations, this is the reduced form equation
• Predict the value of each endogenous variable, $\hat{Z}$
– Stage 2:
• Use the predicted value of each endogenous variable in place of the variable
itself
• Standard errors have to be corrected in Stage 2
• Interpret the resulting coefficients and perform hypothesis tests as usual.
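The two stages can be sketched by hand on simulated data. The design below is hypothetical: one endogenous regressor x, one instrument z, no other exogenous variables, and a true slope of 2. It is meant to show the mechanics only; the second-stage standard errors from a manual procedure like this are not correct, which is why dedicated 2SLS routines are used in practice.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

def simulate(n, seed=0):
    """x depends on the instrument z and on the structural error u (endogeneity)."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi, u = rng.gauss(0, 1), rng.gauss(0, 1)
        xi = 0.8 * zi + 0.5 * u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(1.0 + 2.0 * xi + u)
    return z, x, y

def two_stage_least_squares(z, x, y):
    """Stage 1: regress x on z and predict x. Stage 2: regress y on the prediction."""
    a, b = ols_fit(z, x)
    x_hat = [a + b * zi for zi in z]
    return ols_fit(x_hat, y)

z, x, y = simulate(20_000)
print("OLS slope :", round(ols_fit(x, y)[1], 3))
print("2SLS slope:", round(two_stage_least_squares(z, x, y)[1], 3))
```

In this design the OLS slope should settle above the assumed value of 2, while the 2SLS slope should be close to it.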
Stata example
Consider a demand and supply model for a food product:
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 PS + \alpha_4 INC + u_1$
Supply: $Q = \beta_1 + \beta_2 P + \beta_3 PF + u_2$
• Q is quantity; P is price; PS is price of a substitute; INC is per capita income; PF is price of
factor of production
• Endogenous: Q and P; exogenous: PS, INC and PF.
• The demand equation, estimated by OLS:
. regress q p ps inc

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  3,    26) =    8.52
       Model |   305.92719     3   101.97573           Prob > F      =  0.0004
    Residual |  311.209627    26   11.969601           R-squared     =  0.4957
-------------+------------------------------           Adj R-squared =  0.4375
       Total |  617.136817    29  21.2805799           Root MSE      =  3.4597

------------------------------------------------------------------------------
           q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           p |   .0232954   .0768423     0.30   0.764    -.1346562     .181247
          ps |   .7100395   .2143246     3.31   0.003      .269489     1.15059
         inc |   .0764442   1.190855     0.06   0.949    -2.371393    2.524282
       _cons |   1.091045    3.71158     0.29   0.771    -6.538218    8.720308
------------------------------------------------------------------------------

Note: if price and quantity are simultaneously determined, then the coefficient on p is likely to be biased.
. ivregress 2sls q (p = ps inc pf) ps inc, first

First-stage regressions
-----------------------

                                                  Number of obs   =         30
                                                  F(   3,     26) =      69.19
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.8887
                                                  Adj R-squared   =     0.8758
                                                  Root MSE        =     6.5975

------------------------------------------------------------------------------
           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ps |   1.708147   .3508806     4.87   0.000     .9869017    2.429393
         inc |   7.602491   1.724336     4.41   0.000     4.058068    11.14691
          pf |   1.353906   .2985062     4.54   0.000     .7403175    1.967494
       _cons |  -32.51242   7.984235    -4.07   0.000    -48.92425   -16.10059
------------------------------------------------------------------------------

Note: this stage creates an instrument for the potentially endogenous variable, price.

Instrumental variables (2SLS) regression          Number of obs   =         30
                                                  Wald chi2(3)    =      20.43
                                                  Prob > chi2     =     0.0001
                                                  R-squared       =          .
                                                  Root MSE        =     4.5895

------------------------------------------------------------------------------
           q |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           p |  -.3744591   .1533755    -2.44   0.015    -.6750695   -.0738486
          ps |   1.296033   .3306669     3.92   0.000     .6479381    1.944128
         inc |   5.013977   2.125875     2.36   0.018      .847339    9.180615
       _cons |  -4.279471   5.161076    -0.83   0.407    -14.39499    5.836052
------------------------------------------------------------------------------
Instrumented:  p
Instruments:   ps inc pf

Note: Stage 2 uses the instrument in place of price in the regression. After dealing with the endogeneity, price has a significant negative effect on quantity demanded.
5.3 TESTING FOR INSTRUMENT VALIDITY
• Estimates produced using IV are consistent only when the IV used is valid
• Illustrate properties of IV estimation if Z is a poor IV:
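$$\operatorname{plim} b_{2,IV} = \beta_2 + \frac{\operatorname{corr}(Z, u)}{\operatorname{corr}(Z, X_2)} \cdot \frac{\sigma_u}{\sigma_{X_2}}$$
– Numerator $\operatorname{corr}(Z, u)$: instrument exogeneity, should be close to zero
– Denominator $\operatorname{corr}(Z, X_2)$: instrument relevance, should be large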
• If Z is not exogenous: estimates are inconsistent
• If relevance of Z is weak:
– Can have large asymptotic bias (and high std errors)
– Even if 𝑐𝑜𝑟𝑟(𝑍, 𝑢) is small
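A worked comparison shows how a weak instrument can do worse than OLS. The correlations below are invented purely for illustration; the OLS expression follows from eq.(1.1) by writing the covariance as a correlation times the two standard deviations.

```latex
\begin{aligned}
\operatorname{plim} b_{2,OLS} - \beta_2 &= \operatorname{corr}(X_2, u)\, \frac{\sigma_u}{\sigma_{X_2}}
  = 0.2\, \frac{\sigma_u}{\sigma_{X_2}}
  && \text{(assumed } \operatorname{corr}(X_2, u) = 0.2\text{)} \\
\operatorname{plim} b_{2,IV} - \beta_2 &= \frac{\operatorname{corr}(Z, u)}{\operatorname{corr}(Z, X_2)}\, \frac{\sigma_u}{\sigma_{X_2}}
  = \frac{0.02}{0.05}\, \frac{\sigma_u}{\sigma_{X_2}} = 0.4\, \frac{\sigma_u}{\sigma_{X_2}}
  && \text{(assumed } \operatorname{corr}(Z, u) = 0.02,\ \operatorname{corr}(Z, X_2) = 0.05\text{)}
\end{aligned}
```

In this hypothetical case the instrument is almost exogenous, yet the IV inconsistency is twice that of OLS because the instrument is so weakly related to the endogenous regressor.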
1) Instrument relevance:
• Straightforward to assess:
– Examine the first stage of 2SLS
• Focus on significance of the IVs, rather than all exogenous variables.
– IVs should be significantly related to the endogenous X:
• Use t-test for one IV, or F-test for multiple IVs
– Rule of thumb: for a single endogenous explanatory variable, the F-statistic in the
first stage should be greater than 10.
. ivregress 2sls q (p = ps inc pf) ps inc, first
First-stage regressions
-----------------------
Number of obs = 30
F( 3, 26) = 69.19
Prob > F = 0.0000
R-squared = 0.8887
Adj R-squared = 0.8758
Root MSE = 6.5975
------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
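For the demand equation, our reading is that pf is the only excluded instrument, since ps and inc also appear in the equation itself. With a single instrument, the F-statistic for its significance is the square of its t-statistic, so the rule of thumb can be checked from the pf row of the first stage:

```latex
F_{pf} = t_{pf}^2 = \left(\frac{1.353906}{0.2985062}\right)^2 \approx 4.54^2 \approx 20.6 > 10
```

The reported F(3, 26) = 69.19 tests all three regressors jointly, including ps and inc, so it is not the statistic that the rule of thumb refers to under this reading.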
2) Instrument exogeneity:
• If the coefficients are exactly identified:
– There is no statistical test for this assumption.
– Researcher must use knowledge and judgement of the research question at hand.

• If equation is over-identified (i.e. extra IVs), can conduct a test

Test for over-identifying restrictions
• Suppose that we have q more instruments than we need:
– i.e. q = (K – k0) – (m0) > 0
– Recall that IVs must be excluded exogenous variables
– E.g. one endogenous X (m0 = 1), and three proposed IVs (K – k0 = 3)
• q = 3 – 1 = 2 over-identifying restrictions.
• Then we can test whether the 2SLS residuals are correlated with q linear functions of
the instruments

• Procedure for testing over-identifying restrictions:


1) Estimate structural equation by 2SLS; obtain residuals, $\hat{u}_1$.
2) Regress $\hat{u}_1$ on all exogenous variables. Obtain $R_1^2$.
3) Test statistic $= nR_1^2 \sim \chi^2$ with df = q
4) If $nR_1^2 > \chi^2_{crit}$, reject $H_0$: IVs uncorrelated with the structural error $u_1$
5) If $H_0$ is rejected, conclude that at least some of the IVs are not exogenous.
• Recall that our model is:
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 PS + \alpha_4 INC + u_1$
Supply: $Q = \beta_1 + \beta_2 P + \beta_3 PF + u_2$
• q = (K – k0) – (m0) = (no. of proposed IVs) – (no. of endogenous Xs)
– Demand equation: q = (3-2) – (1) = 0
– Supply equation: q = (3-1) – (1) = 1
. ivregress 2sls q (p = ps inc pf) pf

Instrumental variables (2SLS) regression Number of obs = 30


Wald chi2(2) = 211.69
Prob > chi2 = 0.0000
R-squared = 0.9019
Root MSE = 1.4207

------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | .3379816 .0236408 14.30 0.000 .2916465 .3843166
pf | -1.000909 .0782929 -12.78 0.000 -1.154361 -.8474581
_cons | 20.0328 1.160349 17.26 0.000 17.75856 22.30704
------------------------------------------------------------------------------
Instrumented: p
Instruments: pf ps inc

. predict u, resid
. reg u pf ps inc

Source | SS df MS Number of obs = 30


-------------+---------------------------------- F(3, 26) = 0.47
Model | 3.0948454 3 1.03161513 Prob > F = 0.7080
Residual | 57.4597199 26 2.20998923 R-squared = 0.0511
-------------+---------------------------------- Adj R-squared = -0.0584
Total | 60.5545653 29 2.08808846 Root MSE = 1.4866

------------------------------------------------------------------------------
u | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pf | .0363318 .067262 0.54 0.594 -.1019273 .1745909
ps | .0790798 .0790635 1.00 0.326 -.0834376 .2415971
inc | -.4023461 .3885424 -1.04 0.310 -1.201007 .3963143
_cons | -1.149104 1.799078 -0.64 0.529 -4.847162 2.548953
------------------------------------------------------------------------------

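• Then $nR^2 = 30 \times 0.0511 = 1.533$
• $\chi^2_{crit}(\alpha = 0.05;\ df = q = 1) = 3.841$
• $nR^2 < \chi^2_{crit}$, therefore cannot reject $H_0$
• Therefore there is no evidence that the instruments used are correlated with the error, so we treat them as exogenous.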
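The arithmetic of this test can be reproduced in a few lines. The sketch below takes n = 30 and R² = 0.0511 from the output above and handles only the case of one over-identifying restriction (df = 1), where the chi-squared tail probability can be obtained from the complementary error function; the function names are ours.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def overid_test(n, r_squared, crit=3.841):
    """Return (nR^2 statistic, p-value for df = 1, whether H0 is rejected at 5%)."""
    stat = n * r_squared
    return stat, chi2_sf_df1(stat), stat > crit

stat, p_value, reject = overid_test(30, 0.0511)
print("nR^2 =", round(stat, 3), "| p-value =", round(p_value, 3), "| reject H0:", reject)
```

Reporting the p-value alongside the comparison with the critical value makes it easier to see how far the statistic is from the rejection region.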

Now try Exercise 3, Question 3.3!


5.4 TESTING FOR ENDOGENEITY
• It is ‘costly’ to use IV if there is no endogeneity:
– IV is less efficient (has larger standard errors) than OLS.

• Statistical Properties of OLS and IV:


        Endogeneity      No endogeneity
OLS     Inconsistent     Consistent and efficient
IV      Consistent       Consistent but inefficient

• In the presence of endogeneity:


– Only IV is consistent
– BUT may have bias in small samples
• Recall: consistency is an asymptotic property
A. Regression-based Test
• Consider the equation:
$$Y_1 = \beta_1 + \beta_2 X + \beta_3 Y_2 + u$$
where X is exogenous and Y2 may be endogenous.
• Estimate the reduced form equation for Y2
– i.e. regress Y2 on all the truly exogenous variables
– and obtain the residuals, e.

• Now include these residuals in the model of interest:
$$Y_1 = \beta_1 + \beta_2 X + \beta_3 Y_2 + \theta e + u$$
• Hypotheses: H0: θ = 0, i.e. Y2 is exogenous
H1: θ ≠ 0, i.e. Y2 is endogenous
• Thus a standard t-test on the coefficient on e in the above regression:
– constitutes a test of the null hypothesis of Y2 being exogenous.
B. Hausman Test
• Estimate the model by both OLS and IV:
– Compare (statistically) the coefficient values and their variances.

• H0: no endogeneity bias (both OLS and IV estimators will be consistent, but
OLS is more efficient)
• H1: endogeneity (only IV will be consistent – the difference between the OLS and IV
coefficients will not converge to zero as n → ∞)

• If there is a systematic difference in the OLS and IV estimates:


– the explanatory variable/s is/are endogenous.
• The test statistic is based on the differences between all of the coefficients:
– follows a chi-squared distribution (with df = number of instrumented variables).

Stata example
A. Regression-based test:
To test whether price is endogenous in the demand equation, estimate the
reduced form equation for price, then include its residuals in the demand equation:
Reduced form equation: regress the potentially endogenous variable, p, on all exogenous variables in the model:
. reg p ps inc pf
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 3, 26) = 69.19
Model | 9034.77551 3 3011.59184 Prob > F = 0.0000
Residual | 1131.69721 26 43.5268157 R-squared = 0.8887
-------------+------------------------------ Adj R-squared = 0.8758
Total | 10166.4727 29 350.568025 Root MSE = 6.5975

------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
Predict the residuals from the reduced form equation:
. predict e, resid
Include the residuals as an extra variable in the demand equation:
. regress q p ps inc e
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 4, 25) = 60.88
Model | 559.677099 4 139.919275 Prob > F = 0.0000
Residual | 57.4597181 25 2.29838873 R-squared = 0.9069
-------------+------------------------------ Adj R-squared = 0.8920
Total | 617.136817 29 21.2805799 Root MSE = 1.516

------------------------------------------------------------------------------
q | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | -.3744591 .0506639 -7.39 0.000 -.4788032 -.2701149
ps | 1.296033 .1092277 11.87 0.000 1.071074 1.520992
inc | 5.013977 .702231 7.14 0.000 3.567705 6.460249
e | .7124655 .0678067 10.51 0.000 .5728149 .852116
_cons | -4.279471 1.704836 -2.51 0.019 -7.790645 -.7682958
------------------------------------------------------------------------------

• p-value on residuals = 0.000: reject H0 at all conventional levels of significance
• Therefore reject H0: θ = 0 (p is exogenous)
• Therefore price is endogenous in the demand equation.
B. Hausman test:
Command for the Hausman test, comparing the two sets of estimates:
. hausman IV OLS, cons sigmamore
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| IV OLS Difference S.E.
-------------+----------------------------------------------------------------
p | -.3744591 .0232954 -.3977545 .0863877
ps | 1.296033 .7100395 .5859938 .1272711
inc | 5.013977 .0764442 4.937533 1.072376
_cons | -4.279471 1.091045 -5.370516 1.166414
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from ivregress
B = inconsistent under Ha, efficient under Ho; obtained from regress
Test: Ho: difference in coefficients not systematic

chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
        = 21.20
Prob>chi2 = 0.0000

• H0: no endogeneity bias
• Reject H0 at all conventional levels of significance
• Therefore endogeneity does exist in the demand equation:
– We must estimate the equation using IV, not OLS.
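With one instrumented variable the statistic has a single degree of freedom, and in this output it can be recovered from any one row of the table as the squared ratio of the difference to its standard error. The sketch below copies the Difference and S.E. columns from the output above; it is a check on the reported arithmetic, not a general implementation of the Hausman test.

```python
import math

# (b - B, S.E.) for each coefficient, copied from the hausman output above.
rows = {
    "p": (-0.3977545, 0.0863877),
    "ps": (0.5859938, 0.1272711),
    "inc": (4.937533, 1.072376),
    "_cons": (-5.370516, 1.166414),
}

def hausman_single(diff, se):
    """Squared ratio of the coefficient difference to its standard error."""
    return (diff / se) ** 2

stats = {name: hausman_single(d, s) for name, (d, s) in rows.items()}
# Tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(stats["p"] / 2.0))

for name, value in stats.items():
    print(name, round(value, 2))
print("p-value:", p_value)
```

Every row gives essentially the same value, in line with the reported chi2(1) statistic.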
6. CONCLUSION

• Endogeneity is one of the key issues in empirical econometrics:


– It violates an assumption that is required to have unbiased, consistent estimators
– It means that relationships can no longer be interpreted as causal

• The way in which endogeneity is discussed and dealt with is a crucial determinant of:
– Reliability of empirical estimates
– Whether an empirical paper is published
– Success of empirical dissertations for advanced degrees

• In this topic, we’ve gone through some key tools for dealing with this issue:
– It remains a complex conceptual and empirical issue which is difficult to grapple with.
