s10 IV Handout
Instrumental Variables
Matthew Blackwell
November 5, 2015
1. IV setup
2. IV with constant treatment effects
3. IV with heterogeneous treatment effects
4. IV extensions
1/ IV setup
Where are we? Where are we going?
Basic IV setup with DAGs

[DAG: 𝑍 → 𝐷 → 𝑌, with an unobserved confounder 𝑈 affecting both 𝐷 and 𝑌; the absent 𝑍 → 𝑌 arrow is the exclusion restriction]
An IV is only as good as its assumptions

[DAG: 𝑍 → 𝐷 → 𝑌; the exclusion restriction rules out any direct 𝑍 → 𝑌 path]
IVs in the field
• Angrist (1990): draft lottery as an IV for military service (income as outcome)
• Acemoglu et al. (2001): settler mortality as an IV for institutional quality (GDP/capita as outcome)
• Levitt (1997): being an election year as an IV for police force size (crime as outcome)
• Kern & Hainmueller (2009): having West German TV reception in East Berlin as an instrument for West German TV watching (outcome is support for the East German regime)
• Nunn & Wantchekon (2011): historical distance of an ethnic group to the coast as an instrument for the slave raiding of that ethnic group (outcome is trust attitudes today)
• Acharya, Blackwell, Sen (2015): cotton suitability as an IV for proportion slave in 1860 (outcome is white attitudes today)
2/ IV with constant treatment effects
IV with constant effects
𝑌𝑖 (𝑑, 𝑢) = 𝛼 + 𝜏𝑑 + 𝛾𝑢 + 𝜂𝑖
𝑌𝑖 = 𝛼 + 𝜏𝐷𝑖 + 𝛾𝑈𝑖 + 𝜂𝑖
The role of the instrument
Cov(𝛾𝑈𝑖 + 𝜂𝑖 , 𝑍𝑖 ) = 0
IV estimator with constant effects
𝑌𝑖 = 𝛼 + 𝜏𝐷𝑖 + 𝛾𝑈𝑖 + 𝜂𝑖
Weak instruments
• Natural estimator:

𝜏̂𝐼𝑉 = Ĉov(𝑌𝑖, 𝑍𝑖) / Ĉov(𝐷𝑖, 𝑍𝑖)

• What happens with a weak first stage? Can show that this estimator converges in probability to:

𝜏̂𝐼𝑉 →𝑝 𝜏 + Cov(𝑍𝑖, 𝑈𝑖) / Cov(𝑍𝑖, 𝐷𝑖)

• If Cov(𝑍𝑖, 𝐷𝑖) is small, then even very small violations of the exclusion restriction (Cov(𝑍𝑖, 𝑈𝑖) ≠ 0) can lead to large inconsistencies and finite-sample bias.
• Important to convey the strength of the first stage via a 𝑡-test, or an 𝐹-test with multiple instruments.
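A minimal numpy sketch of this weak-instrument problem on entirely simulated (hypothetical) data: a small first-stage coefficient combined with a tiny correlation between 𝑍 and 𝑈 noticeably inflates the ratio-of-covariances estimate above the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hypothetical data-generating process with a mild exclusion violation:
# Z is weakly related to D, and slightly correlated with the confounder U.
U = rng.normal(size=n)
Z = 0.02 * U + rng.normal(size=n)       # Cov(Z, U) != 0: small violation
D = 0.05 * Z + U + rng.normal(size=n)   # weak first stage: Cov(Z, D) small
Y = 2.0 * D + U + rng.normal(size=n)    # true tau = 2

def iv_ratio(y, d, z):
    """Ratio-of-covariances IV estimator: Cov(y, z) / Cov(d, z)."""
    return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

tau_hat = iv_ratio(Y, D, Z)
# The small Cov(Z, U) is divided by the small Cov(Z, D), so the estimate
# lands well above the true value of 2 even with half a million draws.
```

The same code with a strong first stage (say, a coefficient of 1 on 𝑍 in the 𝐷 equation) would shrink the bias term dramatically, which is the point of the slide.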
Wald Estimator

𝜏𝑊𝑎𝑙𝑑 = (𝐸[𝑌𝑖|𝑍𝑖 = 1] − 𝐸[𝑌𝑖|𝑍𝑖 = 0]) / (𝐸[𝐷𝑖|𝑍𝑖 = 1] − 𝐸[𝐷𝑖|𝑍𝑖 = 0])
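The Wald logic can be illustrated with simulated data (all parameter values hypothetical): the reduced-form difference in mean outcomes divided by the first-stage difference in mean take-up recovers the constant treatment effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical encouragement design: binary instrument Z shifts binary D,
# while the confounder U affects both take-up and the outcome.
Z = rng.integers(0, 2, size=n)
U = rng.normal(size=n)
D = (0.8 * Z + U + rng.normal(size=n) > 0.5).astype(float)
Y = 1.5 * D + U + rng.normal(size=n)    # constant effect tau = 1.5

def wald(y, d, z):
    """Wald estimator: reduced form divided by first stage."""
    itt_y = y[z == 1].mean() - y[z == 0].mean()   # reduced form
    itt_d = d[z == 1].mean() - d[z == 0].mean()   # first stage
    return itt_y / itt_d

tau_wald = wald(Y, D, Z)   # close to the true effect of 1.5
```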
What about covariates?
𝔼[𝑍𝑖 𝜈𝑖 ] = 0 𝔼[𝑍𝑖 𝜀𝑖 ] = 0
𝔼[𝑋𝑖 𝜈𝑖 ] = 0 𝔼[𝑋𝑖 𝜀𝑖 ] = 0
• …but 𝐷𝑖 is endogenous: 𝔼[𝐷𝑖 𝜀𝑖 ] ≠ 0
Getting the reduced form
• We can plug the treatment equation, 𝐷𝑖 = 𝑋𝑖′𝛼 + 𝛾𝑍𝑖 + 𝜈𝑖, into the outcome equation, 𝑌𝑖 = 𝑋𝑖′𝛽 + 𝜏𝐷𝑖 + 𝜀𝑖, to obtain the reduced form:

𝑌𝑖 = 𝑋𝑖′(𝛽 + 𝜏𝛼) + 𝜏𝛾𝑍𝑖 + (𝜀𝑖 + 𝜏𝜈𝑖)
Two-stage least squares
• Estimate 𝛼̂ and 𝛾̂ from OLS and form fitted values:

𝔼̂[𝐷𝑖|𝑋𝑖, 𝑍𝑖] = 𝐷̂𝑖 = 𝑋𝑖′𝛼̂ + 𝛾̂𝑍𝑖.

• Regress 𝑌𝑖 on 𝑋𝑖 and 𝐷̂𝑖. Add and subtract 𝜏𝐷̂𝑖:

𝑌𝑖 = 𝑋𝑖′𝛽 + 𝜏𝐷̂𝑖 + [𝜀𝑖 + 𝜏(𝐷𝑖 − 𝐷̂𝑖)]

• Key question: is 𝐷̂𝑖 uncorrelated with the error?
• 𝐷̂𝑖 is just a function of 𝑋𝑖 and 𝑍𝑖, so it is uncorrelated with 𝜀𝑖.
• We also know that 𝐷̂𝑖 is uncorrelated with (𝐷𝑖 − 𝐷̂𝑖), since OLS fitted values are orthogonal to their residuals.
Two-stage least squares
• Heuristic procedure:
1. Run regression of treatment on covariates and instrument
2. Construct fitted values of treatment
3. Run regression of outcome on covariates and fitted values
• Note that this isn't how we actually estimate 2SLS because the standard errors are all wrong.
• The computer wants to calculate the standard errors based on 𝜀∗𝑖:

𝜀∗𝑖 = 𝑌𝑖 − 𝑋𝑖′𝛽̂ − 𝜏̂𝐷̂𝑖

when the correct residuals use the actual treatment:

𝜀𝑖̂ = 𝑌𝑖 − 𝑋𝑖′𝛽̂ − 𝜏̂𝐷𝑖
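A sketch of the heuristic two-step procedure in numpy (simulated data with hypothetical parameter values), including the contrast between the residuals the second-stage OLS computes and the correct residuals that plug the actual treatment back in:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical setup: one covariate X, instrument Z, endogenous D.
U = rng.normal(size=n)
X = rng.normal(size=n)
Z = rng.normal(size=n)
D = 0.5 * X + 1.0 * Z + U + rng.normal(size=n)
Y = 1.0 + 2.0 * D + 0.7 * X + U + rng.normal(size=n)   # true tau = 2

# Step 1-2: first-stage regression and fitted values of the treatment.
W1 = np.column_stack([np.ones(n), X, Z])
alpha_hat = np.linalg.lstsq(W1, D, rcond=None)[0]
D_hat = W1 @ alpha_hat

# Step 3: second-stage regression of Y on covariates and fitted values.
W2 = np.column_stack([np.ones(n), X, D_hat])
beta_hat = np.linalg.lstsq(W2, Y, rcond=None)[0]
tau_hat = beta_hat[2]   # close to the true effect of 2

# Residuals the naive second stage "sees" (built from D_hat) versus the
# correct 2SLS residuals (built from the actual treatment D).
resid_naive = Y - W2 @ beta_hat
resid_correct = Y - np.column_stack([np.ones(n), X, D]) @ beta_hat
```

The naive residual variance is much larger than the correct one here, which is exactly why the two-step shortcut gets the standard errors wrong.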
Nunn & Wantchekon IV example
General 2SLS
𝑌𝑖 = 𝑋𝑖′ 𝛽 + 𝜀𝑖
𝔼[𝑍𝑖 𝜀𝑖 ] = 0
Nasty Matrix Algebra
• The matrix Π collects the coefficients from the population regression of 𝑋𝑖 on 𝑍𝑖, so Π′𝑍𝑖 gives the part of 𝑋𝑖 explained by the instruments:

Π = (𝔼[𝑍𝑖𝑍𝑖′])⁻¹𝔼[𝑍𝑖𝑋𝑖′]  (projection matrix)
𝑋̃𝑖 = Π′𝑍𝑖  (fitted values)

• To derive the 2SLS estimator, take the fitted values, Π′𝑍𝑖, and multiply both sides of the outcome equation by them:

𝑌𝑖 = 𝑋𝑖′𝛽 + 𝜀𝑖
Π′𝑍𝑖𝑌𝑖 = Π′𝑍𝑖𝑋𝑖′𝛽 + Π′𝑍𝑖𝜀𝑖
𝔼[Π′𝑍𝑖𝑌𝑖] = 𝔼[Π′𝑍𝑖𝑋𝑖′]𝛽 + 𝔼[Π′𝑍𝑖𝜀𝑖]
𝔼[Π′𝑍𝑖𝑌𝑖] = 𝔼[Π′𝑍𝑖𝑋𝑖′]𝛽 + Π′𝔼[𝑍𝑖𝜀𝑖]
𝔼[Π′𝑍𝑖𝑌𝑖] = 𝔼[Π′𝑍𝑖𝑋𝑖′]𝛽
𝔼[𝑋̃𝑖𝑌𝑖] = 𝔼[𝑋̃𝑖𝑋𝑖′]𝛽
𝛽 = (𝔼[𝑋̃𝑖𝑋𝑖′])⁻¹𝔼[𝑋̃𝑖𝑌𝑖]
How to estimate the parameters
• Collect 𝑋𝑖 into an 𝑛 × 𝑘 matrix 𝐗 = (𝑋1′, … , 𝑋𝑛′)
• Collect 𝑍𝑖 into an 𝑛 × 𝑙 matrix 𝐙 = (𝑍1′, … , 𝑍𝑛′)
• Let 𝐗̂ = 𝐙(𝐙′𝐙)⁻¹𝐙′𝐗 be the matrix of fitted values for 𝐗.
• Matrix party trick: 𝐗′𝐙/𝑛 = (1/𝑛) ∑𝑖 𝑋𝑖𝑍𝑖′ →𝑝 𝔼[𝑋𝑖𝑍𝑖′].
• Take the population formula for the parameters and plug in these sample analogues:

𝛽̂ = (𝐗̂′𝐗)⁻¹𝐗̂′𝐲
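This matrix formula translates directly into code; a minimal numpy sketch on simulated, just-identified data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical just-identified setup: X = (1, D), Z = (1, Z1).
U = rng.normal(size=n)
Z1 = rng.normal(size=n)
D = Z1 + U + rng.normal(size=n)          # strong first stage
Y = 2.0 * D + U + rng.normal(size=n)     # true coefficient on D is 2

X = np.column_stack([np.ones(n), D])     # regressors (incl. endogenous D)
Zm = np.column_stack([np.ones(n), Z1])   # instrument matrix

# X_hat = Z (Z'Z)^{-1} Z'X, then beta_hat = (X_hat'X)^{-1} X_hat'y.
X_hat = Zm @ np.linalg.solve(Zm.T @ Zm, Zm.T @ X)
beta_hat = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
```

With one instrument per endogenous regressor this reduces to the ratio-of-covariances estimator from earlier.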
Asymptotics for 2SLS

𝛽̂ = (𝐗̂′𝐗)⁻¹𝐗̂′𝐲
  = (𝐗̂′𝐗)⁻¹𝐗̂′(𝐗𝛽 + 𝜀)

• Using the matrix party trick and that 𝐗̂′𝐗 = 𝐗̂′𝐗̂, we have

𝛽̂ = (𝐗̂′𝐗)⁻¹𝐗̂′𝐗𝛽 + (𝐗̂′𝐗)⁻¹𝐗̂′𝜀
  = 𝛽 + (𝐗̂′𝐗̂)⁻¹𝐗̂′𝜀
  = 𝛽 + [𝑛⁻¹ ∑𝑖 𝑋̂𝑖𝑋̂𝑖′]⁻¹ 𝑛⁻¹ ∑𝑖 𝑋̂𝑖𝜀𝑖

• Consistent because 𝑛⁻¹ ∑𝑖 𝑋̂𝑖𝜀𝑖 →𝑝 𝔼[𝑋̂𝑖𝜀𝑖] = 0.
Asymptotic variance for 2SLS

√𝑛(𝛽̂ − 𝛽) = (𝑛⁻¹ ∑𝑖 𝑋̂𝑖𝑋̂𝑖′)⁻¹ (𝑛⁻¹/² ∑𝑖 𝑋̂𝑖𝜀𝑖)

• By the CLT, 𝑛⁻¹/² ∑𝑖 𝑋̂𝑖𝜀𝑖 converges in distribution to 𝑁(0, 𝐵), where 𝐵 = 𝔼[𝜀𝑖²𝑋̂𝑖𝑋̂𝑖′].
• By the LLN, 𝑛⁻¹ ∑𝑖 𝑋̂𝑖𝑋̂𝑖′ →𝑝 𝔼[𝑋̂𝑖𝑋̂𝑖′].
• Thus, we have that √𝑛(𝛽̂ − 𝛽) has asymptotic variance:

(𝔼[𝑋̂𝑖𝑋̂𝑖′])⁻¹ 𝔼[𝜀𝑖²𝑋̂𝑖𝑋̂𝑖′] (𝔼[𝑋̂𝑖𝑋̂𝑖′])⁻¹
Overidentification tests
• Sargan-Hausman test:
▶ Under the null that all instruments are valid, using all instruments versus a subset should only differ by sampling variation.
▶ Regress the 2SLS residuals, 𝜀𝑖̂, on the instruments 𝑍𝑖 and calculate 𝑅𝑢² from this regression.
▶ Under the null (and homoskedasticity), 𝑁𝑅𝑢² ∼ 𝜒²𝑙−𝑘.
▶ The degrees of freedom depend on how many overidentifying restrictions there are.
• If we reject the null hypothesis in these overidentification tests, it means that the exclusion restrictions for our instruments are probably incorrect.
• Note that the test won't tell us which instruments are invalid, just that at least one is.
• These overidentification tests depend heavily on the constant effects assumption.
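A sketch of the Sargan statistic on simulated data with one overidentifying restriction (two hypothetical instruments, both valid by construction, so the statistic should look like a single draw from a 𝜒² distribution with 1 degree of freedom):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Two hypothetical instruments for one endogenous regressor: l - k = 1.
U = rng.normal(size=n)
Z1 = rng.normal(size=n)
Z2 = rng.normal(size=n)
D = Z1 + 0.5 * Z2 + U + rng.normal(size=n)
Y = 2.0 * D + U + rng.normal(size=n)

Zm = np.column_stack([np.ones(n), Z1, Z2])   # instrument matrix (l = 3)
X = np.column_stack([np.ones(n), D])         # regressor matrix (k = 2)

# 2SLS using both instruments, then the 2SLS residuals.
X_hat = Zm @ np.linalg.solve(Zm.T @ Zm, Zm.T @ X)
beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
resid = Y - X @ beta

# Sargan statistic: n * R^2 from regressing the residuals on Zm.
g = np.linalg.lstsq(Zm, resid, rcond=None)[0]
r2 = (Zm @ g).var() / resid.var()
sargan = n * r2   # compare to chi^2 with l - k = 1 df under the null
```

Making one instrument invalid (e.g., adding `0.3 * Z2` directly to `Y`) would push the statistic far into the rejection region.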
3/ IV with heterogeneous treatment effects
Instrumental Variables and Potential Outcomes
• Basic idea of IV:
▶ 𝐷𝑖 not randomized, but 𝑍𝑖 is
▶ 𝑍𝑖 only affects 𝑌𝑖 through 𝐷𝑖
Key assumptions
1. Randomization
2. Exclusion Restriction
3. First-stage relationship
4. Monotonicity
Randomization
𝐸[𝑌𝑖 |𝑍𝑖 = 1] − 𝐸[𝑌𝑖 |𝑍𝑖 = 0] = 𝐸[𝑌𝑖 (𝐷𝑖 (1), 1) − 𝑌𝑖 (𝐷𝑖 (0), 0)]
Exclusion Restriction

𝑌𝑖(𝑑, 𝑧) = 𝑌𝑖(𝑑, 𝑧′) ≡ 𝑌𝑖(𝑑) for all 𝑑, 𝑧, 𝑧′: the instrument affects the outcome only through the treatment.
The linear model with heterogeneous effects
First Stage

𝔼[𝐷𝑖(1) − 𝐷𝑖(0)] ≠ 0: the instrument must actually move the treatment on average.
Monotonicity
𝐷𝑖 (1) − 𝐷𝑖 (0) ≥ 0
Monotonicity means no defiers
Local Average Treatment Effect (LATE)
Proof of the LATE theorem
• Under the exclusion restriction and randomization,
𝐸[𝑌𝑖 |𝑍𝑖 = 1]−𝐸[𝑌𝑖 |𝑍𝑖 = 0] = 𝐸[𝑌𝑖 (1)−𝑌𝑖 (0)|𝐷𝑖 (1) > 𝐷𝑖 (0)] Pr[𝐷𝑖 (1) > 𝐷𝑖 (0)]
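The LATE theorem can be checked numerically by simulating the principal strata directly (shares and effect sizes below are hypothetical): the Wald ratio recovers the complier effect, not the population average effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Hypothetical strata: compliers, always-takers, never-takers (no defiers,
# so monotonicity holds by construction).
types = rng.choice(["c", "a", "n"], size=n, p=[0.5, 0.3, 0.2])
Z = rng.integers(0, 2, size=n)
D = np.where(types == "a", 1, np.where(types == "n", 0, Z))

# Heterogeneous effects: compliers have effect 2, everyone else 0.5,
# and always-takers have a different baseline outcome level.
effect = np.where(types == "c", 2.0, 0.5)
Y = rng.normal(size=n) + (types == "a") + effect * D

late_hat = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (
    D[Z == 1].mean() - D[Z == 0].mean()
)
# late_hat is near the complier effect (2.0), while the population
# average effect is only 1.25 in this design.
```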
Is the LATE useful?
Randomized trials with one-sided noncompliance
• Will the LATE ever be equal to a usual causal quantity?
• When non-compliance is one-sided, then the LATE is equal to
the ATT.
• Think of a randomized experiment:
▶ Randomized treatment assignment = instrument (𝑍𝑖 )
▶ Non-randomized actual treatment taken = treatment (𝐷𝑖 )
• One-sided noncompliance: only those assigned to the treatment (control) can actually take the treatment (control). Here we focus on the case where 𝐷𝑖(0) = 0 for all 𝑖.
Benefits of one-sided noncompliance
• One-sided noncompliance ⇝ no “always-takers” and since
there are no defiers,
▶ Treated units must be compliers.
▶ ATT is the same as the LATE.
• Thus, we know that:

𝐸[𝑌𝑖|𝑍𝑖 = 1] − 𝐸[𝑌𝑖|𝑍𝑖 = 0]
= 𝔼[𝑌𝑖(0) + (𝑌𝑖(1) − 𝑌𝑖(0))𝐷𝑖|𝑍𝑖 = 1] − 𝔼[𝑌𝑖(0)|𝑍𝑖 = 0]  (exclusion restriction + one-sided noncompliance)
= 𝔼[𝑌𝑖(0)|𝑍𝑖 = 1] + 𝔼[(𝑌𝑖(1) − 𝑌𝑖(0))𝐷𝑖|𝑍𝑖 = 1] − 𝔼[𝑌𝑖(0)|𝑍𝑖 = 0]
= 𝔼[𝑌𝑖(0)] + 𝔼[(𝑌𝑖(1) − 𝑌𝑖(0))𝐷𝑖|𝑍𝑖 = 1] − 𝔼[𝑌𝑖(0)]  (randomization)
= 𝔼[(𝑌𝑖(1) − 𝑌𝑖(0))𝐷𝑖|𝑍𝑖 = 1]
= 𝔼[𝑌𝑖(1) − 𝑌𝑖(0)|𝐷𝑖 = 1, 𝑍𝑖 = 1] Pr[𝐷𝑖 = 1|𝑍𝑖 = 1]  (law of iterated expectations + binary treatment)
= 𝔼[𝑌𝑖(1) − 𝑌𝑖(0)|𝐷𝑖 = 1] Pr[𝐷𝑖 = 1|𝑍𝑖 = 1]  (one-sided noncompliance)
• Noting that Pr[𝐷𝑖 = 1|𝑍𝑖 = 0] = 0, the Wald estimator is just the ATT:

(𝐸[𝑌𝑖|𝑍𝑖 = 1] − 𝐸[𝑌𝑖|𝑍𝑖 = 0]) / Pr[𝐷𝑖 = 1|𝑍𝑖 = 1] = 𝐸[𝑌𝑖(1) − 𝑌𝑖(0)|𝐷𝑖 = 1]
• Thus, under the additional assumption of one-sided noncompliance, we can estimate the ATT using the usual IV approach.
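A simulated illustration (hypothetical parameters): with one-sided noncompliance and confounded take-up, the Wald ratio matches the ATT while the naive treated-vs-untreated comparison is biased.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Hypothetical trial: only units assigned to treatment (Z = 1) can take it,
# and take-up depends on the confounder U, so D is not ignorable.
U = rng.normal(size=n)
Z = rng.integers(0, 2, size=n)
takeup = (U + rng.normal(size=n) > 0).astype(int)
D = Z * takeup                       # one-sided noncompliance: D_i(0) = 0

# Heterogeneous effects: would-be takers (takeup = 1) have effect 2, so
# the ATT is 2 even though the population average effect is smaller.
effect = 1.0 + takeup
Y = 1.0 + effect * D + U + rng.normal(size=n)

att_hat = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (
    D[Z == 1].mean() - D[Z == 0].mean()
)
naive = Y[D == 1].mean() - Y[D == 0].mean()  # biased upward through U
```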
4/ IV extensions
Falsification tests
[DAG: 𝑍 → 𝐷 → 𝑌 with unobserved confounder 𝑈; the missing 𝑍 → 𝑌 arrow is the exclusion restriction being probed]
Size, characteristics of the compliers
Pr[𝐷𝑖 (1) > 𝐷𝑖 (0)] = 𝐸[𝐷𝑖 (1)−𝐷𝑖 (0)] = 𝐸[𝐷𝑖 |𝑍𝑖 = 1]−𝐸[𝐷𝑖 |𝑍𝑖 = 0]
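This first-stage quantity is straightforward to estimate; a simulated check with known (hypothetical) strata shares:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Hypothetical strata: 40% compliers, 35% never-takers, 25% always-takers.
types = rng.choice(["c", "n", "a"], size=n, p=[0.4, 0.35, 0.25])
Z = rng.integers(0, 2, size=n)
D = np.where(types == "a", 1, np.where(types == "n", 0, Z))

# First-stage difference in means estimates Pr[D_i(1) > D_i(0)],
# the complier share (0.4 in this design).
complier_share = D[Z == 1].mean() - D[Z == 0].mean()
```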
Multiple instruments

𝐷̂𝑖 = 𝜋1𝑍1𝑖 + 𝜋2𝑍2𝑖.
2SLS as weighted average
Covariates and heterogeneous effects
• It might be the case that the above assumptions only hold
conditional on some covariates, 𝑋𝑖 . That is, instead of
randomization, we might have conditional ignorability: