Exploring Marginal Treatment Effects: Flexible estimation using Stata


Martin Eckhoff Andresen∗

Abstract
In settings that exhibit selection on both levels and gains, Marginal
Treatment Effects (MTE) allow us to go beyond local average treatment
effects and estimate the whole distribution of effects. This paper surveys
the theory behind Marginal Treatment Effects and introduces the new Stata
package mtefe that uses several estimation methods to estimate MTE mod-
els. This package provides important improvements and flexibility over ex-
isting packages such as margte (Brave and Walstrum, 2014), and calculates
various treatment effect parameters based on the results. Use of the package
is illustrated using a range of examples.

JEL-codes: I26, C26, C36, C87

Keywords: Heterogeneity, marginal treatment effects, instrumental variables, margte, mtefe, Stata, software


∗ Statistics Norway. E-mail to [email protected]
1 Introduction¹
Well known instrumental variables methods solve problems of selection on levels,
estimating local average treatment effects for instrument compliers even with non-
random selection into treatment. In the more reasonable case with selection into
treatment based both on levels and gains, however, local average treatment effects
may represent average treatment effects for very particular subpopulations, and we
learn little about the distribution of treatment effects in the population at large.
Marginal treatment effects allow us to go beyond local average treatment ef-
fects in settings that exhibit this sort of selection. A marginal treatment effect
is the average treatment effect for people with a particular resistance to treat-
ment, or alternatively at a particular margin of indifference. Thus, MTEs capture
heterogeneity in the treatment effect along the unobserved dimension we call re-
sistance to treatment. This is precisely what generates selection on unobserved
gains: People who choose treatment because they have a particularly low resis-
tance might have different gains than those with high resistance. Usually at the
cost of stronger assumptions than required under standard IV, we can estimate
the full distribution of (marginal) treatment effects and back out the parameters of
interest. These include average treatment effects, average treatment effects on the
treated and untreated and policy-relevant treatment effects from a hypothetical
policy that shifts the propensity to choose treatment.
The most common way of estimating MTEs is via the method of Local In-
strumental Variables (Heckman and Vytlacil, 2007). Alternatively, MTEs can be
estimated using the separate approach (Brinch et al., 2017; Heckman and Vytlacil,
2007) or, in the baseline case with joint normal errors, maximum likelihood. When
estimating MTEs using the separate approach or maximum likelihood, similarities
to selection models are apparent: We are basically estimating a traditional selec-
tion model. Marginal Treatment Effects thus represent old ideas in new wrapping,
but still provide substantial contributions to the interpretation and estimation
of models in settings with essential heterogeneity. Unfortunately, commands for
flexibly estimating MTEs and using their output are not easily available in popular
software such as Stata. Brave and Walstrum (2014) document the package margte
that estimates MTEs, but this package has some important limitations.
¹ The author wishes to thank Edwin Leuven, Christian Brinch, Magne Mogstad, Thomas
Cornelissen, Martin Hüber, Katrine Vetlesen Løken, Simen Markussen, Scott Brave, Thomas
Walstrum, Yiannis Spanos and Jonathan Norris for advice, comments, help and testing at various
stages of this project. Brave and Walstrum also deserve a huge thanks for Stata package margte
that clearly inspired mtefe. All errors remain my own. While carrying out this research I have
been associated with the center for Equality, Social Organization, and Performance (ESOP) at
the Department of Economics at the University of Oslo. ESOP is supported by the Research
Council of Norway through its Centres of Excellence funding scheme, project number 179552.

This paper presents the new Stata package mtefe, which estimates parametric
and semiparametric marginal treatment effect models. These include the joint
normal, polynomial and semiparametric models also covered by margte (Brave and
Walstrum, 2014), and mtefe additionally offers spline functions in the MTE for
increased flexibility. mtefe can estimate all models using either local IV or the
separate approach, as well as maximum likelihood for the joint normal model, while
the estimation method in margte depends on the model. Furthermore, mtefe allows
for fixed effects using Stata’s categorical variables. This is important for isolating
exogenous variation in many applications and is faster than generating dummies
manually. mtefe also supports frequency and probability weights, which are
important for obtaining population estimates in many data sets.
In addition, mtefe exploits the full potential of marginal treatment effects by
calculating treatment effect parameters as weighted averages of the MTE curve,
shedding light on why for example the LATE differs from the ATE. Other im-
provements include calculation of analytic standard errors (that admittedly ignore
the uncertainty in the propensity score estimation), large improvements in com-
putational speed when estimating semiparametric models and reestimation of the
propensity score when bootstrapping for more appropriate inference.
Although Marginal Treatment Effects are quickly becoming part of the toolbox of
the applied econometrician, applied work using them often imposes restrictive
parametric assumptions. The Monte Carlo simulations in Appendix B illustrate how
conclusions might be sensitive to these choices, at least under the particular data
generating processes used. This illustrates the importance of correct functional
form. Researchers should probe the sensitivity of their results to functional form
assumptions when using marginal treatment effects, and should base the choice of
functional form on detailed knowledge about the case at hand, on what might
constitute the unobservables, and on sound economic arguments for the nature of the
relationship between these unobservables and the outcome.
This paper proceeds as follows: Section 2 reviews the theory behind MTEs,
identifying assumptions, estimation methods and derivation of treatment parameter
weights. Section 3 presents the mtefe command, its most important options
and a range of examples of use, while section 4 concludes. Appendix A details
the estimation algorithm of mtefe, while Appendix B contains the Monte Carlo
simulations.

2 Estimation of marginal treatment effects


Marginal treatment effects are based on the generalized Roy model:

\[ Y_j = \mu_j(X) + U_j \quad \text{for } j = 0, 1 \tag{1} \]
\[ Y = D Y_1 + (1 - D) Y_0 \tag{2} \]
\[ D = \mathbf{1}[\mu_D(Z) > V] \quad \text{where } Z = (X, Z_-) \tag{3} \]

Y1 and Y0 are the potential outcomes in the treated and untreated state, such as
log wages with and without a college degree. They are both modeled as functions
of observables X, which may contain fixed effects.
Equation 3 is the selection equation, and can be interpreted as a latent index,
since 1 is the indicator function. It is a reduced form way of modeling selection
into treatment as a function of observables X and instruments Z− that affect the
probability of treatment, but not the potential outcomes.
The unobservable V in the choice equation is a negative shock to the latent
index determining treatment. It is often interpreted as unobserved resistance to or
negative preference for treatment. As long as the unobservable V has a continuous
distribution, we can rewrite the selection equation as P (Z) > UD , where UD
represents the quantiles of V and P (Z) the propensity score. UD , by construction,
has a uniform distribution in the population.²
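The normalization can be checked numerically. The following Python sketch is an illustration only: it assumes that V is standard normal and draws a hypothetical latent index mu_d at random, and neither choice comes from the model above. It confirms that the rules D = 1[µD(Z) > V] and D = 1[P(Z) > UD] select the same individuals, and that UD, the quantile of V, has a mean close to one half, as a uniform variable should.

```python
import random
from statistics import NormalDist, mean

def simulate_selection(n=10_000, seed=1):
    """Compare the latent-index and propensity-score forms of the selection rule."""
    rng = random.Random(seed)
    std_normal = NormalDist()          # assumed distribution of V (illustration only)
    same_choice = True
    u_d = []
    for _ in range(n):
        mu_d = rng.gauss(0.0, 1.0)     # hypothetical latent index mu_D(Z)
        v = rng.gauss(0.0, 1.0)        # unobserved resistance V
        p = std_normal.cdf(mu_d)       # propensity score P(Z) = F_V(mu_D(Z))
        u = std_normal.cdf(v)          # U_D = F_V(V), the quantile of V
        same_choice = same_choice and ((mu_d > v) == (p > u))
        u_d.append(u)
    return same_choice, mean(u_d)

same_choice, mean_u = simulate_selection()
print(same_choice, round(mean_u, 3))
```

Because the distribution function of V is strictly increasing, applying it to both sides of the inequality leaves the treatment decision unchanged; this is all the rewriting of the selection equation relies on.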
Identification of this model requires the following assumption:
Assumption 1: Conditional independence (U0 , U1 , V ) ⊥ Z− | X
The model described by Equations 1-3 together with Assumption 1 implies and
is implied by the standard assumptions in Imbens and Angrist (1994) necessary
to interpret IV as a local average treatment effect. Vytlacil (2002) shows how the
standard IV assumptions of relevance, exclusion and monotonicity³ are equivalent
to some representation of a choice equation as in Equation 3. Thus, the model
described above is no more restrictive than the LATE model used in standard IV
analysis.
In principle, it is possible to estimate marginal treatment effects with no further
assumptions. However, this requires full support of the propensity scores in both
treated and untreated samples for all values of X. In practice, this is rarely feasible,
as shown for example in Carneiro et al. (2011). Instead, most applied papers
(Carneiro et al., 2011; Carneiro and Lee, 2009; Cornelissen et al., 2016b; Felfe and
Lalive, 2017; Maestas et al., 2013) proceed by making a stronger assumption:
Assumption 2: Separability E(Uj | X, V ) = E(Uj | V )
² This normalization also ensures that UD is uniform within cells of X. For details, see
Matzkin (2007); Mogstad and Torgovitsky (2018).
³ Or rather, uniformity, as it is a condition across people, not across realizations of Z: This
assumption requires that Pr(D = 1|Z = z) ≥ Pr(D = 1|Z = z′) or the other way around for all
people, but it doesn’t require full monotonicity in the classical sense.

This can be a strong assumption, and must be carefully evaluated by the researcher
for each application. Nonetheless, it is clearly less restrictive than for example a
joint normal distribution of (U0 , U1 , V ), as assumed by traditional selection models.
This paper, along with the applied MTE literature⁴ and the mtefe command,
proceeds by imposing this stronger assumption, as well as working with linear
versions of µj (X) = Xβj and µD (Z) = γZ.
This assumption has two important consequences. First, the MTE is identified
over the common support, unconditional on X. Second, the marginal treatment
effects are additively separable in UD and X. This implies that the shape of the
MTE is independent of X: The intercept, but not the slope, is a function of X.⁵
Using this model, the return to treatment is simply the difference between the
outcomes in the treated and untreated states. Marginal Treatment Effects were
introduced by Björklund and Moffitt (1987), later generalized by Heckman and
coauthors (1999; 2001; 2005; 2007), and using the above assumption and index
structure can be defined as

\[
\mathrm{MTE}(x, u) \equiv E(Y_1 - Y_0 \mid X = x, U_D = u)
= \underbrace{x(\beta_1 - \beta_0)}_{\text{heterogeneity in observables}}
+ \underbrace{E(U_1 - U_0 \mid U_D = u)}_{k(u):\ \text{heterogeneity in unobservables}}
\tag{4}
\]

They measure average gains in outcomes for people with particular values of X
and the unobserved resistance to treatment UD . Alternatively, the MTE can be
interpreted as the mean return to treatment for individuals at a particular margin
of indifference. The above expression shows how the separability assumption allows
us to separate the treatment effect into one part varying with observables and
another part that varies across the unobserved resistance to treatment.
MTEs are closely related to local average treatment effects. A local average
treatment effect (LATE) is the average effect of treatment for people who are
shifted into (or out of) treatment when the instrument is exogenously shifted
from z to z′. In the above choice model, these people have UD in the interval
(P(z), P(z′)). Note how, when z′ − z is infinitesimally small, so that P(z′) approaches P(z),
the LATE converges to the MTE. A marginal treatment effect is thus a limit form
of LATE (Heckman and Vytlacil, 2007).
⁴ Some papers apply a stronger full independence assumption: (U0, U1, V) ⊥ X. This implies
the separability in Assumption 2, but is stricter than necessary. In particular, the separability
assumption places no restrictions on the dependence between V and X. See also Mogstad and
Torgovitsky (2018, in particular section 6) for a review and a detailed discussion of the differences
between these assumptions.
⁵ Note, however, that you are free to specify fixed effects, so that a fully saturated model may
be specified.

[Figure 1: Identification of marginal treatment effects. Panel (a), Local instrumental variables: MTE (left axis) and E(Y | p) (right axis). Panel (b), Separate approach: E(Y1 | u), E(Y0 | u) and MTE. Horizontal axis in both panels: unobserved resistance, propensity scores, with P(z) and P(z′) marked.]

2.1 Estimation using Local Instrumental Variables


One way of estimating marginal treatment effects is using the method of local
instrumental variables developed by Heckman and Vytlacil (1999; 2001; 2005).
This method identifies the MTE as the derivative of the conditional expectation
of Y with respect to the propensity score.
Consider first the intuition for why the derivative identifies the MTE using
Figure 1a. Abstracting from covariates, the dotted line represents the expected Y
for each value of the propensity score, which can be estimated from data given a
propensity score model. When p increases, the probability of getting the treatment
increases. If the treatment effect is constant, E(Y | p) is linear in p. In contrast,
under essential heterogeneity, the increase in p has the additional effect that the
expectations of the error terms change, resulting in nonlinearities in E(Y | p).
Local IV therefore identifies the MTEs from nonlinearities in the expectation of Y
given p.
Now consider two particular instrument values z and z′, generating propensity
scores P(z) = 0.4 and P(z′) = 0.8. This shift of the instrument induces people with
unobserved resistance P(z) < UD ≤ P(z′) to switch into treatment. The change
in E(Y | p) relative to the change in propensity scores, [E(Y | p′) − E(Y | p)]/(p′ − p),
identifies the average of the MTE for UD in the interval (p, p′). The slope of the blue
line depicted in the figure is the average derivative over this interval, which is equal
to the average of the MTE in the same interval, the red line. Take this reasoning to the
limit by looking at closer and closer values of p and p′, and the above expression
collapses to the derivative of E(Y | p). At the point where p′ is just an infinitesimal
increase over p, the derivative identifies the MTE at the point where UD = p.
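The limiting argument can be made concrete with a small numerical sketch in Python. It is an illustration under stated assumptions, not part of mtefe: it abstracts from covariates, uses the joint normal model's expressions k(u) = (ρ1 − ρ0)Φ⁻¹(u) and K(p) = −(ρ1 − ρ0)ϕ(Φ⁻¹(p)), and picks a hypothetical value for ρ1 − ρ0. The slope of K between two propensity scores is the average MTE over that interval, and it approaches k(u) as the interval shrinks.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHO_DIFF = -0.5   # hypothetical value of rho_1 - rho_0

def k(u):
    """Unobserved part of the MTE in the joint normal model."""
    return RHO_DIFF * STD_NORMAL.inv_cdf(u)

def K(p):
    """Integral of k(u) from 0 to p, which equals -(rho_1 - rho_0) * phi(Phi^{-1}(p))."""
    return -RHO_DIFF * STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p))

def average_mte(p_low, p_high):
    """Slope of K between two propensity scores: the average of k on (p_low, p_high)."""
    return (K(p_high) - K(p_low)) / (p_high - p_low)

# Shrink the interval above p = 0.4 and compare with the pointwise value k(0.4).
for width in (0.4, 0.1, 0.01, 0.0001):
    print(width, average_mte(0.4, 0.4 + width), k(0.4))
```

With the wide interval (0.4, 0.8) the average differs markedly from the value at 0.4, which is the sense in which a LATE for a large instrument shift can be far from the MTE at either margin.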
Formally, using p = P(Z), the index structure and the separability assumption,
it can be shown that

\[
\begin{aligned}
E(Y \mid X = x, P(Z) = p) &= E[Y_0 + D(Y_1 - Y_0) \mid X = x, P(Z) = p] \\
&= x\beta_0 + x(\beta_1 - \beta_0)p + \underbrace{p\,E[U_1 - U_0 \mid U_D \le p]}_{K(p)}
\end{aligned}
\tag{5}
\]

where we can normalize E(Uj) = 0 as long as X includes an intercept. K(p)
is a nonlinear function of p that captures heterogeneity along the unobservable
resistance to treatment UD : If people with different resistance to treatment have
different expectation of the error terms, the MTE will be nonlinear. The K(p)
notation follows Brinch et al. (2017).
Taking the derivative of this expression with respect to p and evaluating it at
u, we get the MTE

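\[
\begin{aligned}
\left.\frac{\partial E(Y \mid X = x, P(Z) = p)}{\partial p}\right|_{p=u}
&= x(\beta_1 - \beta_0) + \left.\frac{\partial\,[p\,E(U_1 - U_0 \mid U_D \le p)]}{\partial p}\right|_{p=u} \\
\mathrm{MTE}(x, u) &= x(\beta_1 - \beta_0) + \underbrace{E[U_1 - U_0 \mid U_D = u]}_{k(u)}
\end{aligned}
\tag{6}
\]

where k(u) = E(U1 − U0 | UD = u).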


The above suggests the following estimation procedure: We start by identifying
the selection into treatment based on Equation 3, using a probability model such as
probit, logit, the linear probability model or even a semiparametric binary choice
model. With this estimate of P (Z) in hand, what remains is to make an assumption
about the unknown function K(p) = pE(U1 − U0 | UD ≤ p),⁶ estimate the conditional
expectation of Y from equation (5) and form its derivative to get the MTE. The
different functional form assumptions commonly made on K(p) are summarized
in Table 1. Alternatively, if we are unwilling to make parametric assumptions, the
MTE can be estimated using semiparametric methods as summarized in Table 2
by estimating Equation 5 as a partially linear model.
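The procedure can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by mtefe: it assumes no covariates, treats the propensity score as known instead of estimating it in a first stage, and imposes a linear k(u) = π1(u − 1/2), so that E(Y | p) is quadratic in p. All parameter values and function names are hypothetical. The MTE is then the derivative of the fitted quadratic.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def local_iv_quadratic(n=100_000, seed=7, effect=0.2, pi1=2.0):
    """Simulate a linear-MTE model and recover it from a quadratic fit of Y on p."""
    rng = random.Random(seed)
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for _ in range(n):
        p = rng.random()                      # propensity score, taken as known here
        u_d = rng.random()                    # unobserved resistance U_D
        d = 1 if p > u_d else 0               # selection rule D = 1[P(Z) > U_D]
        y0 = rng.gauss(0.0, 0.1)
        y1 = effect + pi1 * (u_d - 0.5) + rng.gauss(0.0, 0.1)
        y = d * y1 + (1 - d) * y0
        regressors = (1.0, p, p * p)
        for i in range(3):                    # accumulate the normal equations
            xty[i] += regressors[i] * y
            for j in range(3):
                xtx[i][j] += regressors[i] * regressors[j]
    a, b, c = solve_linear_system(xtx, xty)   # fitted E(Y | p) = a + b p + c p^2
    return lambda u: b + 2.0 * c * u          # MTE(u) = dE(Y | p)/dp evaluated at p = u

mte = local_iv_quadratic()
print(mte(0.25), mte(0.5), mte(0.75))
```

In this simulated design the true curve is MTE(u) = 0.2 + 2(u − 0.5), so the printed values should lie near −0.3, 0.2 and 0.7, up to sampling error.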

2.2 Estimation using the separate approach


Alternatively, as suggested by Heckman and Vytlacil (2007), MTEs can be esti-
mated using the separate approach. This has the benefit of estimating all the
parameters of both the potential outcomes, so that we can plot these over the
distribution of UD .
⁶ Or usually directly on k(u), and then exploit that $K(p) = \int_0^p k(u)\,du$.

Consider first the intuition using Figure 1b. Abstracting from covariates, the
solid line depicts the true E(Y1 | UD = u), the expected value of the outcome in the
treated state for a given value of u. Likewise, the dashed line depicts the expected
value of the outcome in the untreated state. Imagine next comparing two particular
values of the instrument, z vs. z′. Individuals with Z = z have a predicted
propensity score of P(z) = 0.4, while individuals with Z = z′ have P(z′) = 0.8. We will
have both treated and untreated individuals in both these groups. For untreated
individuals with Z = z, we know that their resistance UD > P(z) = 0.4. The
average outcome among these people therefore informs us of the average of the
dashed line between 0.4 and 1, depicted in blue. In contrast, the treated individuals
with Z = z have resistance UD ≤ P(z), and therefore inform us about the average
of Y1 for people with UD between 0 and 0.4, depicted in red. Likewise, the people
with Z = z′ inform us about the average of Y1 from 0 to 0.8 (orange) and the
average of Y0 from 0.8 to 1 (green). With a continuous instrument, we will have
propensity scores all over the (0, 1) interval, and we can identify nonlinear functions
in u, but this illustration shows that a binary instrument identifies a linear MTE
model: We need only four averages from the groups comprised of the treated and
untreated individuals with the instrument switched on and off (Brinch et al., 2017).
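The four-averages argument can be written out directly. The Python sketch below is illustrative and uses hypothetical numbers: it assumes E(Yj | UD = u) = aj + bj·u and a uniform UD, so that the treated with propensity score p have mean outcome a1 + b1·p/2 and the untreated have mean outcome a0 + b0·(1 + p)/2. Two propensity scores then give two equations per state, which is enough to solve for both lines.

```python
def linear_mte_from_four_means(p_low, p_high, y1_low, y1_high, y0_low, y0_high):
    """Recover a linear MTE from four group means under a binary instrument.

    y1_* are mean outcomes of the treated and y0_* of the untreated at the
    two propensity scores p_low and p_high.
    """
    # Treated with score p have U_D <= p, so their mean is a1 + b1 * p / 2.
    b1 = 2.0 * (y1_high - y1_low) / (p_high - p_low)
    a1 = y1_low - b1 * p_low / 2.0
    # Untreated with score p have U_D > p, so their mean is a0 + b0 * (1 + p) / 2.
    b0 = 2.0 * (y0_high - y0_low) / (p_high - p_low)
    a0 = y0_low - b0 * (1.0 + p_low) / 2.0
    return a1 - a0, b1 - b0            # intercept and slope of MTE(u)

# Hypothetical potential-outcome lines used to generate the four means.
a1, b1, a0, b0 = 1.0, -0.8, 0.3, 0.4
means = (
    a1 + b1 * 0.4 / 2, a1 + b1 * 0.8 / 2,   # treated with Z = z and Z = z'
    a0 + b0 * 1.4 / 2, a0 + b0 * 1.8 / 2,   # untreated with Z = z and Z = z'
)
intercept, slope = linear_mte_from_four_means(0.4, 0.8, *means)
print(intercept, slope)
```

The recovered intercept and slope equal a1 − a0 and b1 − b0 up to floating-point error. Any curvature in the true functions would not be detectable from these four numbers, which is why a binary instrument identifies at most a linear MTE.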
In practice, this means specifying some function for the conditional expecta-
tions of the error terms, and then estimating the conditional expectations of Y1
and Y0 in the sample of treated and untreated separately:

\[ E(Y_1 \mid X = x, D = 1) = x\beta_1 + E(U_1 \mid U_D \le p) = x\beta_1 + K_1(p) \tag{7} \]
\[ E(Y_0 \mid X = x, D = 0) = x\beta_0 + E(U_0 \mid U_D > p) = x\beta_0 + K_0(p) \tag{8} \]

Where I follow the notation in Brinch et al. (2017) and control selection via the
control functions Kj (p). Notice that this is the specification we are using when
estimating selection models: Depending on the specification of Kj (p), this amounts
to for example Heckman selection or a semiparametric selection model.
To estimate the MTE, we estimate the conditional expectation of Y in the
sample of treated and untreated separately using the regression

\[ Y_j = X\beta_j + K_j(p) + \epsilon \tag{9} \]

Based on the assumptions on the unknown functions kj (u), we can infer the
functional form of the control function Kj (p) as summarized in Table 1. Alterna-
tively, when using semiparametric methods, estimate Equation 7-8 using partially
linear models as summarized in Table 2. In the implementation, this is estimated
in a stacked regression to allow for some coefficients to be restricted to be the same
in the treated and untreated state. After estimating Kj (p), we can construct the
kj (u) functions and find the MTE estimate from

\[
\begin{aligned}
\mathrm{MTE}(x, u) &= E(Y_1 \mid X = x, U_D = u) - E(Y_0 \mid X = x, U_D = u) \\
&= x(\beta_1 - \beta_0) + k_1(u) - k_0(u)
\end{aligned}
\tag{10}
\]
where kj(u) = E(Uj | UD = u)
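The step from the control functions Kj(p) to the kj(u) functions uses k1(u) = K1(u) + uK1′(u) and k0(u) = K0(u) − (1 − u)K0′(u), as noted under Table 2. The Python sketch below checks these relations numerically for the joint normal model's control functions from Table 1. It is only an illustration: the values of ρ1 and ρ0 are hypothetical, and the derivative is approximated by a finite difference where an actual implementation would use the estimated slope.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHO1, RHO0 = 0.3, 0.6     # hypothetical covariances of U_1 and U_0 with V

def K1(p):
    """E(U_1 | U_D <= p) in the joint normal model."""
    return -RHO1 * STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p)) / p

def K0(p):
    """E(U_0 | U_D > p) in the joint normal model."""
    return RHO0 * STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p)) / (1.0 - p)

def slope(f, p, h=1e-5):
    """Central finite-difference approximation to f'(p)."""
    return (f(p + h) - f(p - h)) / (2.0 * h)

def k1(u):
    """k_1(u) = K_1(u) + u K_1'(u)."""
    return K1(u) + u * slope(K1, u)

def k0(u):
    """k_0(u) = K_0(u) - (1 - u) K_0'(u)."""
    return K0(u) - (1.0 - u) * slope(K0, u)

# Compare k_1(u) - k_0(u) with the closed form (rho_1 - rho_0) * Phi^{-1}(u).
for u in (0.1, 0.5, 0.9):
    z = STD_NORMAL.inv_cdf(u)
    print(u, k1(u) - k0(u), (RHO1 - RHO0) * z)
```

The two printed columns agree up to the finite-difference error, which confirms that the separate approach and the Normal column of Table 1 describe the same MTE.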

2.3 Estimation using maximum likelihood


In the case where we assume joint normality of (U0 , U1 , V ), the MTE can be esti-
mated using maximum likelihood like a Heckman selection model. It is straightfor-
ward to infer the log likelihood function. For details, see appendix A.4 or Lokshin
and Sajaia (2004).

2.4 Treatment effect parameters and weights


Marginal treatment effects unify most common treatment effect parameters and
allow us to recover the parameters from the MTEs. Given that we have estimated
MTE(x, u), we only need estimates of the distribution of UD given x in a particular
population to obtain estimates of the average treatment effect for that population.
In practice, common treatment effect parameters can be expressed as some
weighted average of a particular MTE curve (Heckman and Vytlacil, 2007). For
the average treatment effect conditional on the event A = a that defines the
population relevant for the parameter of interest and X = x, we have

\[
T_A(x) = E(Y_1 - Y_0 \mid A, X = x) = \int_0^1 \mathrm{MTE}(x, u_D)\,\omega_A(x, u_D)\,du_D \tag{11}
\]
\[
\text{where } \omega_A(x, u) = f_{U_D \mid A, X = x}(u) \tag{12}
\]
and f is the conditional density of UD. Because of the additive separability given
by Assumption 2 and the linear forms of µj(X), this simplifies to

\[
T_A(x) = x(\beta_1 - \beta_0) + \int_0^1 k(u_D)\,\omega_A(u_D)\,du_D
\quad \text{where } \omega_A(u) = f_{U_D \mid A}(u)
\]
In principle, we can calculate the treatment effect parameters TA(x) for any
value of x, but we are usually interested in the unconditional parameter in the
population. In general, we need to integrate over all values of X,⁷ but because of
⁷ Note, however, that in cases where the propensity score model is misspecified, the weighted
average of the conditional weights may differ from the unconditional weights. See e.g. Carneiro
et al. (2011,1).

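Table 1: Parametric MTE models

Assumptions
  Normal: $(U_0, U_1, V) \sim N(0, \Sigma)$, $\Sigma = \begin{pmatrix} \sigma_0^2 & & \\ \rho_{01} & \sigma_1^2 & \\ \rho_0 & \rho_1 & 1 \end{pmatrix}$
  Polynomial: k(u) or kj(u) as Lth order polynomials with mean 0
  Polynomial with splines: k(u) or kj(u) as Lth order polynomials (L ≥ 2) with Q knots for quadratic and higher-order terms at (h1, ..., hQ) and mean 0

$k(u) = E(U_1 - U_0 \mid U_D = u)$
  Normal: $(\rho_1 - \rho_0)\Phi^{-1}(u)$
  Polynomial: $\sum_{l=1}^{L} \pi_l \left(u^l - \frac{1}{l+1}\right)$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_l^q \left((u - h_q)^l\,\mathbf{1}(u \ge h_q) - \frac{(1 - h_q)^{l+1}}{l+1}\right)$

$K(p) = p\,E(U_1 - U_0 \mid U_D \le p)$
  Normal: $-(\rho_1 - \rho_0)\,\phi(\Phi^{-1}(p))$
  Polynomial: $\sum_{l=1}^{L} \pi_l\,\frac{p(p^l - 1)}{l+1}$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_l^q\,\frac{\mathbf{1}(p \ge h_q)(p - h_q)^{l+1} - (1 - h_q)^{l+1}p}{l+1}$

$k_1(u) = E(U_1 \mid U_D = u)$
  Normal: $\rho_1\Phi^{-1}(u)$
  Polynomial: $\sum_{l=1}^{L} \pi_{1l} \left(u^l - \frac{1}{l+1}\right)$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{1l}^q \left(\mathbf{1}(u \ge h_q)(u - h_q)^l - \frac{(1 - h_q)^{l+1}}{l+1}\right)$

$k_0(u) = E(U_0 \mid U_D = u)$
  Normal: $\rho_0\Phi^{-1}(u)$
  Polynomial: $\sum_{l=1}^{L} \pi_{0l} \left(u^l - \frac{1}{l+1}\right)$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{0l}^q \left(\mathbf{1}(u \ge h_q)(u - h_q)^l - \frac{(1 - h_q)^{l+1}}{l+1}\right)$

$K_1(p) = E(U_1 \mid U_D \le p)$
  Normal: $-\rho_1\,\frac{\phi(\Phi^{-1}(p))}{p}$
  Polynomial: $\sum_{l=1}^{L} \pi_{1l}\,\frac{p^l - 1}{l+1}$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{1l}^q\,\frac{\mathbf{1}(p \ge h_q)(p - h_q)^{l+1} - p(1 - h_q)^{l+1}}{p(l+1)}$

$K_0(p) = E(U_0 \mid U_D > p)$
  Normal: $\rho_0\,\frac{\phi(\Phi^{-1}(p))}{1 - p}$
  Polynomial: $\sum_{l=1}^{L} \pi_{0l}\,\frac{p(1 - p^l)}{(1 - p)(l+1)}$
  Polynomial with splines: the polynomial expression plus $\sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{0l}^q\,\frac{(1 - h_q)^{l+1}p - \mathbf{1}(p \ge h_q)(p - h_q)^{l+1}}{(1 - p)(l+1)}$

All models
  $\mathrm{MTE}(x, u) = E(Y_1 - Y_0 \mid U_D = u, X = x) = x(\beta_1 - \beta_0) + k(u) = x(\beta_1 - \beta_0) + k_1(u) - k_0(u)$
  $Y_1(x, u) = E(Y_1 \mid U_D = u, X = x) = x\beta_1 + k_1(u)$
  $Y_0(x, u) = E(Y_0 \mid U_D = u, X = x) = x\beta_0 + k_0(u)$

Note: Expressions for conditional expectations with different assumptions for the joint distribution of the error terms. $\mathbf{1}(A)$ is the indicator function for the event A. Note that $\pi_l = \pi_{1l} - \pi_{0l}$ and equivalently for the spline coefficients. Calculated as $K(p) = \int_0^p k(u)\,du$, $K_1(p) = \frac{1}{p}\int_0^p k_1(u)\,du$ and $K_0(p) = \frac{1}{1-p}\int_p^1 k_0(u)\,du$.
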
Table 2: Semiparametric MTE models

Estimating equation
  Local IV: $Y = X\beta_0 + X(\beta_1 - \beta_0)p + K(p) + \epsilon$
  Separate approach: $Y_j = X\beta_j + K_j(p) + \epsilon$

Double residual regression (Robinson, 1988)
  Local IV: local polynomial regressions of Y, X and X × p on p give residuals $e_Y$, $e_X$ and $e_{X \times p}$
  Separate approach: separate local polynomial regressions of Y and X on p in treated and untreated samples give residuals $e_{Y_j}$ and $e_{X_j}$; construct $e_Y = D e_{Y_1} + (1 - D) e_{Y_0}$ and similarly for $e_X$

Estimate β0, β1 − β0 using regression
  Local IV: $e_Y = e_X\beta_0 + e_{X \times p}(\beta_1 - \beta_0) + \epsilon$
  Separate approach: $e_Y = e_X\beta_0 + D(\beta_1 - \beta_0)e_X + \epsilon$

Construct residual
  Local IV: $\tilde{Y} = Y - X\hat{\beta}_0 - X(\widehat{\beta_1 - \beta_0})p$
  Separate approach: $\tilde{Y} = Y - X\hat{\beta}_0 - X(\widehat{\beta_1 - \beta_0})D$

Estimate K
  Local IV: local polynomial regression of $\tilde{Y}$ on p, saving level $\hat{K}(p)$ and slope $\hat{K}'(p)$
  Separate approach: separate local polynomial regressions of $\tilde{Y}$ on p in treated and untreated samples, saving level $\hat{K}_j(p)$ and slope $\hat{K}_j'(p)$

Construct k
  Local IV: $\hat{k}(u) = \hat{K}'(u)$
  Separate approach: $\hat{k}_1(u) = \hat{K}_1(u) + u\hat{K}_1'(u)$ and $\hat{k}_0(u) = \hat{K}_0(u) - (1 - u)\hat{K}_0'(u)$

Construct MTE
  Local IV: $\widehat{\mathrm{MTE}}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \hat{k}(u)$
  Separate approach: $\widehat{\mathrm{MTE}}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \hat{k}_1(u) - \hat{k}_0(u)$

Note: Steps in the estimation of semiparametric MTE models using Local IV or the separate approach. To see the relation between kj(u) and Kj(p), note that
$K_1(p) = E(U_1 \mid U_D \le p) = \frac{1}{p}\int_0^p E(U_1 \mid U_D = u)\,du \Rightarrow K_1'(p) = -\frac{1}{p}K_1(p) + \frac{1}{p}k_1(p)$, which leads to $k_1(u) = K_1(u) + uK_1'(u)$. We can find similar expressions for k0(u).
In principle it is possible to combine the semiparametric and the polynomial approach by first estimating the β coefficients from the polynomial model and
then using semiparametric methods to find K. This is the semiparametric polynomial MTE model, implemented if polynomial() and semiparametric are
specified together in mtefe. Although this is computationally far less complex, there is little theoretical reason to think that the semiparametric estimate of this model should be
any better than the MTE constructed from the parametric estimates.
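Table 3: Unconditional treatment effect parameters and weights

Parameter | Description | Event A | $\hat{\kappa}_i$ | $\hat{\omega}(u)$
ATE | Average treatment effect | 1 | $1$ | $\frac{1}{s}$
ATT | ATE on the treated | $U_D \le p$ | $\frac{p_i}{E(p)}$ | $\frac{P(p > u)}{s\,E(p)}$
ATUT | ATE on the untreated | $U_D > p$ | $\frac{1 - p_i}{1 - E(p)}$ | $\frac{1 - P(p > u)}{s\,(1 - E(p))}$
LATE | Local ATE | $P(z) < U_D \le P(z')$ | $\frac{p(z_i') - p(z_i)}{E(p') - E(p)}$ | $\frac{P(p(z') > u) - P(p(z) > u)}{s\,(\bar{p}' - \bar{p})}$
2SLS | Weighted average LATE | | $\frac{(\hat{\upsilon}_i - E(\hat{\upsilon}))(D_i - \bar{D})}{\mathrm{cov}(D, \hat{\upsilon})}$ | $\frac{(E(\hat{\upsilon} \mid p > u) - E(\hat{\upsilon}))\,P(p > u)}{s \times \mathrm{cov}(D, \hat{\upsilon})}$
PRTE | Policy-relevant ATE | $p < U_D \le p'$ | $\frac{p_i' - p_i}{E(p') - E(p)}$ | $\frac{P(p' > u) - P(p > u)}{s\,(E(p') - E(p))}$
MPRTE1 | Marginal PRTE 1 | $|\gamma Z - V| < \epsilon$ | $\frac{f_V(\gamma Z)}{E(f_V(\gamma Z))}$ | $\frac{f_p(u)\,f_V(F_V^{-1}(u))}{E(f_V(\gamma Z))}$
MPRTE2 | Marginal PRTE 2 | $|p - U| < \epsilon$ | $1$ | $f_p(u)$
MPRTE3 | Marginal PRTE 3 | $|\frac{p}{U} - 1| < \epsilon$ | $1$ | $\frac{u\,f_p(u)}{E(p)}$

Note: Weights for common treatment effect parameters. Discrete distribution of UD with s
points.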

the additive separability implied by Assumption 2, we can estimate the average x
in the population of interest separately and calculate the unconditional treatment
effect parameters as

\[ T_A = E(Y_1 - Y_0 \mid A) = x_A(\beta_1 - \beta_0) + \int_0^1 k(u_D)\,\omega_A(u_D)\,du_D \tag{13} \]
\[ \text{where } \omega_A(u) = f_{U_D \mid A}(u) \tag{14} \]
\[ \text{and } x_A = E(X \mid A) \tag{15} \]

We can estimate xA using the weighted average $\frac{1}{N}\sum_i \kappa_i^A x_i$, where $\kappa_i^A$ is an
estimate of the relative probability that event A happens to person i, $\frac{P(A \mid Z = z_i)}{P(A)}$.
In practice, we can estimate both the weighted average xA and ωA (u) by using
sample analogs from data on p, Z and D as summarized in Table 3.
The average treatment effect on the treated (ATT) is the average effect of
treatment for the subpopulation that chooses treatment. When calculating the
weighted average of the X in this population, this parameter will weight people
with high propensity scores more, precisely because they have a higher probability
of choosing treatment. Likewise, the weights ωATT will weight points at the lower
end of the UD distribution higher, because a larger share of the population at these
values of UD will choose treatment: they have lower resistance.
In contrast, the average treatment effect on the untreated (ATUT) weights in-
dividuals with low propensity scores higher when calculating the weighted average
of the X. This is precisely because these people, all else the same, have a higher
probability of being untreated. ωATUT weights high values of UD more because these
people have high resistance.
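The ATE, ATT and ATUT weights in Table 3 can be computed from nothing more than a sample of propensity scores. The Python sketch below is an illustration with hypothetical inputs: it abstracts from covariates, evaluates the weights on a grid of s interval midpoints (the choice of grid points is an assumption), and applies them to a downward-sloping MTE curve, that is, one with positive selection on gains.

```python
def treatment_parameters(mte, scores, s=100):
    """ATE, ATT and ATUT as weighted averages of an MTE curve on s grid points."""
    n = len(scores)
    mean_p = sum(scores) / n
    grid = [(j + 0.5) / s for j in range(s)]                      # assumed grid: midpoints
    share_above = [sum(p > u for p in scores) / n for u in grid]  # estimate of P(p > u)
    w_ate = [1.0 / s for _ in grid]
    w_att = [q / (s * mean_p) for q in share_above]
    w_atut = [(1.0 - q) / (s * (1.0 - mean_p)) for q in share_above]
    average = lambda w: sum(wi * mte(u) for wi, u in zip(w, grid))
    return average(w_ate), average(w_att), average(w_atut), sum(w_att), sum(w_atut)

# Hypothetical inputs: a downward-sloping MTE curve and nine propensity scores.
mte_curve = lambda u: 0.2 - 2.0 * (u - 0.5)
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ate, att, atut, sum_att, sum_atut = treatment_parameters(mte_curve, scores)
print(ate, att, atut)
```

Because the ATT weights concentrate on low values of UD, where this MTE curve is high, the ATT exceeds the ATE, which in turn exceeds the ATUT. Both sets of weights sum to one over the grid.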
A local average treatment effect (LATE) is the average effect of treatment for
people who are shifted into (or out of) treatment when the instrument is shifted
from z to z′. These are people with UD in the interval (P(z), P(z′)). To be in the
complier group, an individual must have resistance between P(z) and P(z′). Note
how, when z′ − z is infinitesimally small, the LATE converges to the MTE, and a
Marginal Treatment Effect is thus a limit form of LATE (Heckman and Vytlacil,
2007).
Similar to the way we can estimate the probability of being a complier in a
traditional IV analysis, we can estimate the weights of the linear IV or 2SLS pa-
rameter. With a continuous instrument, IV is a weighted average over all possible
LATEs comprised of all z − z 0 pairs (Angrist and Imbens, 1995). Heckman and
Vytlacil (2007) show the derivation of these weights, see also section A.6 in the
appendix and Cornelissen et al. (2016a). In sum, the IV parameter uses a weighted
average of X, in which people with large positive or negative values of υ̂, a measure
of how much the instrument affects propensity scores for each individual, are
given more weight. These are precisely the people who have higher probabilities
of having their treatment status determined by the instrument. Likewise, values
of UD where people have υ̂ above the average, and thus are more likely to be
compliers, get a higher weight ωIV (u).
Assuming policy invariance (see Heckman and Vytlacil (2007) for a formal
definition), we can calculate the policy-relevant treatment effect (PRTE) for a
counterfactual policy that manipulates propensity scores. This is the expected
treatment effect for the people that are shifted into treatment by the new policy
relative to the baseline. If the policy is a shift in the instruments themselves, it is
natural to use the estimated first stage to evaluate the shift in propensity scores,
but propensity scores could be manipulated directly as well. Note that if the policy
is a particular set of instrument values z′ and the baseline another set z, PRTE
and LATE are the same. In practice, the PRTE parameter weights the treatment
effect of people who are affected more strongly by the alternative policy relative
to the baseline.
Lastly, we can use the estimated MTEs to calculate Marginal Policy Relevant
Treatment Effects (MPRTE), which can be interpreted as average effects of making
marginal shifts to the propensity scores. MPRTEs are fundamentally easier to
identify than PRTEs (Carneiro et al., 2010), in particular because they do not
require full support, as marginal changes to propensity scores will not drive the
scores outside the common support.
Carneiro et al. (2011) suggest three ways to define distance to the margin.
The first MPRTE, labeled MPRTE1 in Table 3, defines the distance in terms of

the differences between the index γZ and the resistance V , and corresponds to
a marginal change in a variable entering the first stage, such as an instrument.
MPRTE2 defines the margin as having propensity scores close to the normalized
resistance UD, and corresponds to a policy that would increase all propensity scores
by a small amount. MPRTE3 defines marginal as the relative distance between
the propensity score and UD, and corresponds to a policy that increases all propensity
scores by a small fraction.
What we can see from the expressions in Table 3 is that an absolute shift in
propensity scores uses the observed density of the propensity scores as the weight
distribution, while a relative shift unsurprisingly will place more weight on the
upper part of the UD distribution, precisely because a relative shift increases the
propensity scores of people with high initial propensity scores more.

3 The mtefe package


As outlined in the introduction, mtefe contains a range of improvements over ear-
lier MTE packages such as margte. The basic syntax of the mtefe command mimics
that of Stata’s ivregress and other instrumental variables estimators, while the
syntax for many of its options resembles the same options in margte. All independent
variable lists also accept Stata’s categorical variables syntax i.varname.
fweights and pweights are supported.
     
mtefe depvar [indepvars] (depvar_t = varlist_iv) [if] [in] [weight] [, polynomial(#)
splines(numlist) semiparametric restricted(varlist) separate mlikelihood
link(string) trimsupport(#) prte(varname) vce(vcetype) bootreps(#) norepeat1
level(#) degree(#) ybwidth(#) xbwidth(#) gridpoints(#) kernel(string)
first second noplot savefirst(string) savepropensity(newvarname) savekp
saveweights(string) mtexs(string)]

The most important options that determine which model is fit and how are:

polynomial(#) specifies polynomial MTE models of degree #. In contrast to
margte, and to ensure consistency between estimation procedures, this is the
degree of k(u) and in turn the MTE, not the degree of K(p). Although it is the
most restrictive model, the joint normal model is the default if polynomial() is
not specified, to ensure consistency with margte.

semiparametric fits K(p) or Kj(p) using semiparametric methods as described
in Appendix A. This amounts to the semiparametric model if polynomial() is
not specified, or the semiparametric polynomial model otherwise.

splines(numlist) adds second order and higher splines to k(u) at the points
specified by numlist. Use only for polynomial models of degree ≥ 2. All
points in numlist must be in the interval (0, 1).

separate estimates the model using the separate approach rather than Local IV.

mlikelihood estimates the model using maximum likelihood rather than Local
IV. Only appropriate for the joint normal model.

link(lpm|probit|logit) specifies the first stage model. Default is probit.

restricted(varlist) specifies that variables in varlist are included in both first
and second stage, but are restricted to have the same effect in the two states.

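
As a rough illustration of how these options combine, the calls below fit several of the models described above on the simulated example data. This is a sketch only: it assumes the mtefe package is installed, uses only options listed in the syntax diagram, and the particular combinations (for instance a spline knot at 0.5 together with a logit first stage) are illustrative choices rather than recommendations.

```stata
* simulated example data from the package, as used in the examples below
mtefe_gendata, obs(10000) districts(10)

* default: joint normal model, probit first stage, local IV
mtefe lwage exp exp2 i.district (col=distCol)

* the same joint normal model, estimated by maximum likelihood
mtefe lwage exp exp2 i.district (col=distCol), mlikelihood

* second-order polynomial k(u) with a spline knot at u = 0.5 and a logit first stage
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) splines(0.5) link(logit)

* semiparametric polynomial model: polynomial() combined with semiparametric
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) semiparametric gridpoints(100)
```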
The command allows a range of other options, for which I refer the reader to the
help file. In addition, the mtefe package contains the command mtefeplot, which
plots one or more marginal treatment effect curves, optionally including treatment
parameter weights, based on stored or saved MTE estimates from mtefe. The
command mtefe_gendata generates the data used in the following examples and Monte
Carlo simulations.
By default, mtefe reports analytical standard errors for the coefficients, treat-
ment effect parameters and MTEs. These ignore the uncertainty in the estimation
of the propensity scores by treating these as fixed in the second stage of the esti-
mation, as well as the means of X and the treatment effect parameter weights that
are used for estimating treatment effects. For matching, Abadie and Imbens (2016)
show that ignoring the uncertainty in the propensity score increases the standard
errors for the average treatment effect, while the impact on other treatment ef-
fect parameters is ambiguous. To the best of my knowledge, we do not know
how this omission affects the standard errors in MTE applications, and careful
researchers should therefore bootstrap the standard errors using the bootreps()
option, which re-estimates the propensity scores, the mean of X and the treatment
effect parameter weights for each bootstrap repetition.
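
A minimal sketch of the bootstrap, assuming the mtefe package is installed. The number of replications (50) is an arbitrary illustrative choice that is too low for applied work, and the second call uses norepeat1, which according to Appendix A.5 skips re-estimation of the first stage in each replication.

```stata
mtefe_gendata, obs(10000) districts(10)

* bootstrap that re-estimates the propensity score, the mean of X and the
* treatment effect parameter weights in every replication
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) bootreps(50)

* for comparison: bootstrap that treats the first stage as fixed
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) bootreps(50) norepeat1
```

Comparing the two sets of standard errors gives an informal sense of how much the first stage uncertainty matters in a given application.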

3.1 Example output from mtefe


To illustrate the use of mtefe and for Monte Carlo simulations in Appendix B,
imagine the following problem. We are interested in the monetary returns to a
college education. Unfortunately, college education is endogenous, for example be-
cause higher ability people do better in the labor market and take more education.
Furthermore, people choose education based partly on knowledge about their own
gains from college. Thus, the problem exhibits both selection on levels and gains.
Consider distance to college, thought to be a cost shifter for college education.
Although traditional in the returns to education literature, there are reasons to
doubt the exclusion restriction for this instrument (Carneiro and Heckman, 2002).
To fix ideas, suppose that the average distance to college in a district, a measure
of rurality, is correlated with average outcomes in the labor market. For example,
more rural labor markets could provide worse average employment opportunities,
particularly for college jobs. If these differences work at the district level, how-
ever, the instrument is valid conditional on district fixed effects that control for
the average variation in distance to college. Thus, the remaining within-district
variation in distance to college is a valid instrument for college attendance.
To implement this thought experiment, I draw the average labor market quality
for college and non-college jobs and the average distance to college for each district
from a joint normal distribution. The observed distance to college is equal to
this district-level average plus some random normal variation, so that the within-
district variation in distance is a valid instrument. In addition, I generate the error
terms U0 , U1 and V from either a joint normal or a polynomial error structure,
where the three errors are correlated and thus generate selection on both levels
and gains. Controls X include experience uniformly distributed on (0, 30) and
its square. These affect both the selection equation and the outcomes in the two
states. The full data generating process is described in Appendix B.
Using the data generating process outlined above, the code in Figure 2 uses
mtefe_gendata to generate data with a normal error structure and mtefe to
estimate the joint normal MTE model by local IV. mtefe first reports the
estimated coefficients β0, β1 − β0 and ρ1 − ρ0, then the MTE estimates for each point of
support and the treatment effect parameters, as shown in Figure 2. Lastly, mtefe
reports the p-values for two statistical tests: a joint test of the β1 − β0, which can
be interpreted as a test of whether the treatment effect differs across X, and a test
of essential heterogeneity. The latter is a joint test of all coefficients in k(u).8
Based on the output from mtefe it is straightforward to evaluate the impact
of covariates. Average differences in outcomes across covariates can be interpreted
directly from the β0, just like a regular control variable. For instance, the coefficient
for experience in the first panel of the output table in Figure 2 indicates that one
more year of experience translates into approximately 3.6% higher wages, although
the effect is nonlinear and we cannot say that it is the extra experience that causes
the higher wages without strong exogeneity assumptions on X.
Similarly, the β1 − β0 can be interpreted as differences in treatment effects
across covariate values, just like an interaction between treatment status and a
covariate in an OLS regression. The coefficient on experience in the second panel
of Figure 2 thus indicates that a person with one more year of experience has 3.9%
lower gains from college, but again we cannot give this a causal interpretation and
we have to account for the nonlinear effect.
8 Or, in the case of semiparametric models, a test of whether all MTEs are the same.

. mtefe_gendata, obs(10000) districts(10)
.
. mtefe lwage exp exp2 i.district (col=distCol)
Parametric normal MTE model Observations : 10000
Treatment model: Probit
Estimation method: Local IV

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

beta0
exp .0358398 .0064408 5.56 0.000 .0232145 .0484651
exp2 -.0008453 .0002019 -4.19 0.000 -.0012411 -.0004496

district
2 .2352456 .0680412 3.46 0.001 .1018712 .36862
3 .6294914 .0701091 8.98 0.000 .4920634 .7669194
4 .0131179 .0597721 0.22 0.826 -.1040474 .1302832
5 .0338606 .0705835 0.48 0.631 -.1044974 .1722186
6 .1699366 .0605086 2.81 0.005 .0513275 .2885458
7 -.1899241 .060115 -3.16 0.002 -.3077617 -.0720865
8 -.1842254 .0676843 -2.72 0.007 -.3169003 -.0515504
9 -.7908301 .0578436 -13.67 0.000 -.9042153 -.677445
10 -.4432749 .0597237 -7.42 0.000 -.5603455 -.3262044

_cons 3.164706 .0650331 48.66 0.000 3.037228 3.292184

beta1-beta0
exp -.0386384 .010241 -3.77 0.000 -.0587128 -.018564
exp2 .0012967 .0003288 3.94 0.000 .0006523 .0019412

district
2 .265112 .107039 2.48 0.013 .0552939 .4749301
(output omitted )
10 .3143661 .1072555 2.93 0.003 .1041237 .5246085

_cons .4255863 .0983572 4.33 0.000 .2327863 .6183863

k
mills -.4790282 .0611081 -7.84 0.000 -.5988124 -.359244

effects
ate .3283373 .0242932 13.52 0.000 .2807177 .3759568
att .5369432 .0388809 13.81 0.000 .4607287 .6131576
atut .1195067 .0384691 3.11 0.002 .0440995 .194914
late .3279726 .0245142 13.38 0.000 .2799198 .3760254
mprte1 .3463148 .0256971 13.48 0.000 .2959433 .3966862
mprte2 .3309428 .024298 13.62 0.000 .2833137 .3785719
mprte3 -.016257 .0498984 -0.33 0.745 -.1140679 .0815538

Test of observable heterogeneity, p-value 0.0000


Test of essential heterogeneity, p-value 0.0000

Note: Analytical standard errors ignore the facts that the propensity score,
(output omitted )

Figure 2: Example output from mtefe


In addition, mtefe plots the MTE curve with associated confidence intervals as
well as the density of the estimated propensity scores separately for the treated and
untreated individuals, so that the researcher can evaluate the common support.
Examples of these plots are found in Figures 3a and 3b. We see that the estimated
MTE in this example is downward sloping, with relatively high treatment effects
above 1 at the beginning of the UD distribution, eventually declining to negative
effects as low as -.5 at the right end of the distribution. This implies an average
treatment effect of around .33, as reported in Figure 2, and the downward sloping
pattern is consistent with positive selection on unobservable gains as predicted by a
simple Roy model.
Alternatively, we can use the polynomial MTE model and the separate es-
timation approach or the semiparametric model to relax the joint normal as-
sumption. mtefe estimates these models if you specify the polynomial(2) or the
semiparametric option,9 respectively. When estimating MTE models, it is useful
to plot several MTE curves in the same figure. This can be done using mtefeplot,
specifying the names of the saved or stored MTE estimates. An example of this
plot is provided in Figure 3c for the normal, polynomial and semiparametric MTE
models. MTEs are downward sloping and relatively similar in all three specifica-
tions.
As discussed in Section 2.4, mtefe also estimates the treatment parameter
weights and the parameters themselves. One way of illustrating why, for example,
the Local Average Treatment Effect10 differs from the average treatment effect is
by depicting the weights that LATE puts on different parts of the X and UD
distribution. This can be done using mtefeplot with the late option. The resulting
plot is found in Figure 3e. As the MTE curve at the average of X and the MTE
curve for compliers practically overlap, it does not seem like the people induced
to enroll because of the instrument have different values of X. This is hardly
surprising given the data generating process. Instead, the weight distribution reveals
that compliers have a much higher probability of having unobserved resistance in
the middle part of the distribution. These people have MTEs slightly below the
average, so when taking a weighted average over the MTE curve for compliers, we
find that the local average treatment effect is lower than the average treatment
effect. The compliers to the instrument are people with slightly lower than average
gains from college.
Furthermore, we can exploit the fact that when using the separate estimation
procedure, we identify both k1(u) and k0(u). After estimating the polynomial
model with the separate option, so that the separate approach is used rather than
local IV, we can plot the resulting potential outcomes using the separate option of
mtefeplot. As the MTE is simply the difference between Y1 and Y0, we can investigate
whether the downward sloping trend in the MTE is generated by upward
sloping Y0, downward sloping Y1, or a combination. From Figure 3d we see that Y0
is relatively flat while Y1 is clearly downward sloping. This indicates that people
who have low resistance to treatment do much better than their high-resistance
counterparts with a college degree, but relatively similarly without. Therefore, these
people have higher effects of treatment.

9 For semiparametric models, the option gridpoints(#) will greatly improve computational
speed by performing the first local linear regression at # points equally spaced over the support
of p rather than at each and every observed value of p in the population.

10 The term is used somewhat ambiguously here to refer to the linear IV estimate as a weighted
average over all possible LATEs from all combinations of two values of the instrument, in line
with a lot of the literature using continuous instruments.

As a last example, consider a hypothetical policy that mandates a maximum
distance of 40 miles to the closest college. To estimate the effect of such a policy,
we predict the propensity scores from the probit model using the adjusted distance
to college:
. qui probit col distCol exp exp2 i.district
. gen temp=distCol
. replace distCol=40 if distCol>40
. predict double p_policy
. replace distCol=temp
. mtefe lwage exp exp2 i.district (col=distCol), pol(2) prte(p_policy)
(output omitted )

Results of this exercise are depicted in Figure 3f. First, note how the MTE curve
for the policy compliers practically overlaps the MTE curve at the mean. Policy
compliers do not seem to have X-values that give them different gains from college
than the average. Next, notice the distribution of weights: the compliers to the
policy come exclusively from the lower part of the UD distribution. This is not
surprising, given that the people most affected by the reform are people with low
propensity scores (driven by high distance to college) before the reform. To be
affected by the reform, these people have to be untreated under the baseline, and
thus have UDs above those low propensity scores. Compared to the people with
higher propensity scores, the average UD for these people is relatively low,
generating high weights on the lower part of the UD distribution. Thus, the potential
compliers to the policy have low UDs, and consequently high treatment effects.
Therefore, the expected gain from the reform is larger than the average treatment
effect.
Lastly, we might be interested in the pattern of selection on observable gains. Is
it the case that covariates that positively impact treatment choices also positively
impact the gains from treatment? This is the case if γ × (β1 − β0 ) > 0, which
happens either if both coefficients are positive or both are negative. As an example,
there is positive selection on the covariate experience in the example reported in
Figure 2: More experience is associated both with less college (not shown) and
with lower treatment effects.

. mtefe lwage exp exp2 i.district (col=distCol)
. est sto normal
. mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) separate
. est sto polynomial
. mtefe lwage exp exp2 i.district (col=distCol), semiparametric gridpoints(100)
. est sto semipar
. mtefeplot normal polynomial semipar, memory names("Normal" "Polynomial" "Semiparametric")
. mtefeplot polynomial, separate
. mtefeplot polynomial, late memory

[Figure 3 panels; the horizontal axis is unobserved resistance to treatment unless noted]
(a) Common support plot, probit: density of the propensity score for treated and untreated
(b) Estimated MTE curve, normal: MTE with 95% CI and ATE
(c) MTE estimates, different models: normal, polynomial and semiparametric
(d) MTEs and potential outcomes, polynomial: MTE, ATE, Y0 and Y1
(e) MTE for compliers with LATE weights: MTE, MTE (LATE), 2SLS (LATE), ATE and LATE weights
(f) Estimated effects of a hypothetical policy: MTE, MTE (PRTE), PRTE, ATE and PRTE weights

Figure 3: Example figures from mtefe and mtefeplot


3.2 Postestimation using mtefe
Since mtefe saves results in e(), estimates are readily available for postestimation.
As an example, expected treatment effects can be predicted for every individual
with characteristics X, D and p, using that

\begin{align*}
E(Y_1 - Y_0 \mid X, D, p) &= x(\beta_1 - \beta_0) + D\,E(U_1 - U_0 \mid U_D \le p) + (1 - D)\,E(U_1 - U_0 \mid U_D > p) \tag{16} \\
&= x(\beta_1 - \beta_0) + K(p)\frac{D - p}{p(1 - p)}
\end{align*}

where I use the fact that both U1 and U0 are normalized to have mean zero.
As all these objects are estimated by mtefe, we can predict treatment effects for
each individual. Use the options savekp and savepropensity() to save the propensity
scores and the relevant variables of the K(p) function, and then estimate expected
treatment effects from the expression above. The resulting variable contains the
expected treatment effect given treatment status, propensity scores and X for each
individual. Summarizing this predicted treatment effect among the treated and
untreated separately closely matches the ATT and ATUT, respectively.
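
The second equality in (16) can be spelled out as follows. The sketch assumes that K(p) = p E(U1 − U0 | UD ≤ p), which is the control function implied by the local IV regression in Appendix A.2, and uses that U1 − U0 has mean zero.

```latex
\begin{align*}
0 = E(U_1 - U_0) &= p\,E(U_1 - U_0 \mid U_D \le p) + (1 - p)\,E(U_1 - U_0 \mid U_D > p) \\
\Rightarrow \quad E(U_1 - U_0 \mid U_D \le p) &= \frac{K(p)}{p}, \qquad
E(U_1 - U_0 \mid U_D > p) = -\frac{K(p)}{1 - p} \\
D\,\frac{K(p)}{p} - (1 - D)\,\frac{K(p)}{1 - p}
  &= K(p)\,\frac{D(1 - p) - (1 - D)p}{p(1 - p)} = K(p)\,\frac{D - p}{p(1 - p)}
\end{align*}
```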
This procedure highlights the relationship between selection models and MTE
- but notice how both the ATT and ATUT could be recovered without specifying
Kj (p), only the expected difference K(p).
When using the separate approach, potential outcomes can be predicted di-
rectly because both K(p) and K0 (p) are estimated:

\begin{align*}
E(Y_1 \mid X, D, p) &= x\beta_1 + K_1(p)\frac{D - p}{1 - p} \tag{17} \\
E(Y_0 \mid X, D, p) &= x\beta_0 + K_0(p)\frac{p - D}{p} \\
\text{where } K_1(p) &= \frac{K(p) - (1 - p)K_0(p)}{p}
\end{align*}

These expressions can be calculated by predicting Kj(p) and constructing the
expected value of the outcome for each individual from the expressions above. The
difference between these predicted outcomes closely matches the predicted treatment
effect from above and the treatment effect parameters calculated by mtefe as weighted
averages over the appropriate MTE curves.

4 Conclusion
Marginal treatment effects are increasingly becoming a part of the toolbox of
the applied econometrician when the problem at hand exhibits selection on both
levels and unobserved gains. In contrast to traditional linear IV analysis, marginal
treatment effects uncover the distribution of treatment effects rather than a local
average treatment effect, which is often of little interest. In practice, this typically
comes at the cost of stricter assumptions.
This paper has outlined the Marginal Treatment Effects framework, its
relationship to traditional selection models and three different methods for
estimating these models. The Stata package mtefe documented here contains a
range of improvements over existing packages. Among these are support for fixed
effects, estimation using local IV, the separate approach or maximum likelihood,
a larger number of parametric and semiparametric MTE models, support
for weights, speed improvements when running semiparametric models, more
appropriate bootstrap inference and improved graphical output. In addition, mtefe
calculates common treatment effect parameters such as local average treatment
effects, average treatment effects on the treated and untreated and policy-relevant
treatment effects for a user-specified shift in propensity scores. This allows the user
to exploit the potential of the marginal treatment effects framework in understanding
treatment effect heterogeneity.
The Monte Carlo simulations detailed in Appendix B show that marginal treatment
effect estimation can be sensitive to functional form specifications and that
wrong functional form assumptions may result in too high rates of rejection. It
should be noted that the two data-generating processes used and the functional
form choices made in these simulations are very restrictive, and it is perhaps not
surprising that a second-order polynomial cannot approximate a normal very well
or the other way around. Nonetheless, they illustrate two things: First, applied
researchers should base the choice of functional form on detailed knowledge about
the case at hand, what might plausibly constitute the unobservable dimension and
economic arguments for how these factors might affect the outcome. Second,
researchers should strive to probe the robustness of their results to the functional
form choices, using less restrictive models to guide the choice of specification.

References
Abadie, A. and Imbens, G. W. (2016). Matching on the estimated propensity
score. Econometrica, 84(2):781–807.
Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of
average causal effects in models with variable treatment intensity. Journal
of the American Statistical Association, 90(430):431–442.
Björklund, A. and Moffitt, R. (1987). The estimation of wage gains and welfare
gains in self-selection. The Review of Economics and Statistics, 69(1):42–49.
Brave, S. and Walstrum, T. (2014). Estimating marginal treatment effects using
parametric and semiparametric methods. Stata Journal, 14(1):191–217.
Brinch, C. N., Mogstad, M., and Wiswall, M. (2017). Beyond LATE with a discrete
instrument. Journal of Political Economy, 125(4):985–1039.
Carneiro, P. and Heckman, J. J. (2002). The evidence on credit constraints in
post-secondary schooling. The Economic Journal, 112(482):705–734.
Carneiro, P., Heckman, J. J., and Vytlacil, E. (2010). Evaluating Marginal Policy
Changes and the Average Effect of Treatment for Individuals at the Margin.
Econometrica, 78(1):377–394.
Carneiro, P., Heckman, J. J., and Vytlacil, E. J. (2011). Estimating marginal
returns to education. American Economic Review, 101(6):2754–81.
Carneiro, P. and Lee, S. (2009). Estimating distributions of potential outcomes
using local instrumental variables with an application to changes in college
enrollment and wage inequality. Journal of Econometrics, 149(2):191–208.
Carneiro, P., Lokshin, M., and Umapathi, N. (2017). Average and marginal returns
to upper secondary schooling in Indonesia. Journal of Applied Econometrics,
32(1):16–36.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U. (2016a). From
LATE to MTE: Alternative methods for the evaluation of policy interven-
tions. Labour Economics, 41:47 – 60. SOLE/EALE conference issue 2015.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U. (2016b). Who
benefits from universal child care? Estimating marginal returns to early
child care attendance. Forthcoming in Journal of Political Economy.
Felfe, C. and Lalive, R. (2017). Does early child care affect children’s development?
Working paper.
Heckman, J. J., Urzua, S., and Vytlacil, E. (2006a). Estimation of treatment effects
under essential heterogeneity. http://jenni.uchicago.edu/underiv/
documentation_2006_03_20.pdf. Supplement to "Understanding Instrumental
Variables in Models with Essential Heterogeneity".
Heckman, J. J., Urzua, S., and Vytlacil, E. (2006b). Understanding instrumental
variables in models with essential heterogeneity. The Review of Economics
and Statistics, 88(3):389–432.

Heckman, J. J. and Vytlacil, E. J. (1999). Local instrumental variables and latent
variable models for identifying and bounding treatment effects. Proceed-
ings of the National Academy of Sciences of the United States of America,
96(8):4730–4734.
Heckman, J. J. and Vytlacil, E. J. (2001). Local instrumental variables. In Hsiao,
C., Morimune, K., and Powell, J., editors, Nonlinear Statistical Modeling:
Proceedings of the Thirteenth International Symposium in Economic The-
ory and Econometrics. Essays in Honor of Takeshi Amemiya, pages 1–46.
Cambridge Univ. Press, New York.
Heckman, J. J. and Vytlacil, E. J. (2005). Structural equations, treatment effects,
and econometric policy evaluation. Econometrica, 73(3):669–738.
Heckman, J. J. and Vytlacil, E. J. (2007). Chapter 71 econometric evaluation
of social programs, part II: Using the marginal treatment effect to organize
alternative econometric estimators to evaluate social programs, and to fore-
cast their effects in new environments. volume 6, Part B of Handbook of
Econometrics, pages 4875 – 5143. Elsevier.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local
average treatment effects. Econometrica, 62(2):467–475.
Klein, R. W. and Spady, R. H. (1993). An efficient semiparametric estimator for
binary response models. Econometrica, 61(2):387–421.
Lokshin, M. and Sajaia, Z. (2004). Maximum likelihood estimation of endogenous
switching regression models. Stata Journal, 4(3):282–289.
Maestas, N., Mullen, K. J., and Strand, A. (2013). Does disability insurance receipt
discourage work? Using examiner assignment to estimate causal effects of
SSDI receipt. American Economic Review, 103(5):1797–1829.
Matzkin, R. L. (2007). Chapter 73 nonparametric identification. volume 6 of
Handbook of Econometrics, pages 5307 – 5368. Elsevier.
Mogstad, M. and Torgovitsky, A. (2018). Identification and extrapolation with
instrumental variables. Working paper, prepared for Annual Review of
Economics.
Robinson, P. (1988). Root-N-consistent semiparametric regression. Econometrica,
56(4):931–54.
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An
equivalence result. Econometrica, 70(1):331–341.

A Estimation algorithms
The mtefe package is inspired by Heckman et al. (2006a; 2006b) and Brave and
Walstrum (2014). The following sections describe the main steps of the estimation
using this program.

A.1 First stage and common support


1. Estimate the first stage using a linear probability model, a probit or a logit
model of D on Z. In principle, this step could be made more flexible by
using for example a semiparametric single index model such as Klein and
Spady (1993), not currently implemented in mtefe.

2. Predict the propensity score p̂

(a) If using the linear probability model, manually adjust propensity scores
below 0 to 0 and above 1 to 1

3. Construct the common support matrix

(a) If trimming the sample using trimsupport()


i. Estimate the density of the propensity scores separately in the two
samples
ii. Remove points of support with the lowest densities until the spec-
ified share of each sample has been removed
iii. Construct the common support as the points of overlapping support
between the two samples
iv. Remove observations with estimated propensity scores outside the
common support from the estimation sample
v. Re-estimate the baseline propensity score model on the trimmed
sample.
(b) If not trimming
i. If using a parametric model use the full support in .01 intervals
from .01 to .99.
ii. If using a semiparametric model, use all points of overlapping sup-
port in the treated and untreated samples, from 0.01 to 0.99 in
intervals of .01

4. Plot the distribution of propensity scores in the treated and untreated sam-
ples, including the trimming limits if used, to visualize the common support.

5. Compute the parameter weights for the ATT, ATUT, LATE, MPRTEs and, if
specified, the PRTE, as described in Section 2.4.
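
The first steps above can be sketched with built-in Stata commands. This is a simplified illustration on hypothetical toy data (the variables x, z and d are made up for the example), and it replaces the density-based common support matrix used by mtefe with the crude overlap of the two propensity score ranges.

```stata
clear
set seed 2018
set obs 2000
* hypothetical toy data: covariate x, instrument z, binary treatment d
generate x = rnormal()
generate z = rnormal()
generate d = (0.5*x + z + rnormal() > 0)

* steps 1 and 2 with a linear probability model
regress d x z
predict double p_lpm, xb
* step 2(a): LPM predictions can fall outside [0,1], so clamp them
replace p_lpm = 0 if p_lpm < 0
replace p_lpm = 1 if p_lpm > 1

* steps 1 and 2 with a probit, which keeps predictions inside (0,1)
probit d x z
predict double p_probit, pr

* step 3, simplified: overlap of the propensity score ranges in the two samples
summarize p_probit if d == 1, meanonly
scalar s_lo1 = r(min)
scalar s_hi1 = r(max)
summarize p_probit if d == 0, meanonly
scalar s_lo = max(scalar(s_lo1), r(min))
scalar s_hi = min(scalar(s_hi1), r(max))
display "overlapping support: [" scalar(s_lo) ", " scalar(s_hi) "]"
generate byte insupport = inrange(p_probit, scalar(s_lo), scalar(s_hi))
```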

A.2 Local IV estimation


1. Construct K(p), which depends on your choice of parametric or semipara-
metric model - see Tables 1 and 2.

2. Estimate the conditional expectation of Y given X and p as

\[ Y = X\beta_0 + X(\beta_1 - \beta_0)p + K(p) + \epsilon \]

3. From these estimates, recover k(u) = K′(u)

4. Construct the MTE as the derivative of the conditional expectation:

\[ \widehat{MTE}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \hat{k}(u) \]
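
The local IV steps can be illustrated by hand for the simplest polynomial case. The sketch below uses hypothetical toy data (not the data generating process of Appendix B) in which k(u) = π(u − 1/2) with π = −1, so that K(p) = π(p² − p)/2; all variable names are made up for the example, and mtefe itself is not called.

```stata
clear
set seed 12345
set obs 5000
* hypothetical toy data: covariate x, instrument z, normalized resistance ud
generate x  = rnormal()
generate z  = rnormal()
generate ud = runiform()
generate d  = (0.5*x + z > invnormal(ud))
generate u0 = rnormal(0, 0.2)
generate u1 = u0 - (ud - 0.5)          // implies k(u) = -(u - 1/2)
generate y  = 1 + 0.5*x + u0 + d*(0.3 + 0.2*x + u1 - u0)

* first stage and propensity score
probit d x z
predict double p, pr

* step 1: K(p) for a linear k(u), the integral of (u - 1/2) from 0 to p
generate double Kp = (p^2 - p)/2
generate double xp = x*p

* step 2: E(Y | X, p) = X*b0 + X*(b1 - b0)*p + pi*K(p)
regress y x p xp Kp

* steps 3 and 4: MTE(x, u) = x*(b1 - b0) + pi*(u - 1/2), here at x = 0, u = 0.25
scalar mte_low = _b[p] + _b[Kp]*(0.25 - 0.5)
display "MTE(x = 0, u = 0.25) = " scalar(mte_low)
```

The coefficient on p plays the role of the constant in β1 − β0, because X includes a constant term that is also interacted with p.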

A.3 Separate approach


1. Construct K1 (p) and K0 (p), depending on your specification, see Tables 1
and 2.

2. Estimate the conditional mean of Y from the stacked regression

\[ Y = X\beta_0 + D X(\beta_1 - \beta_0) + K_D(p) + \epsilon \]

\[ \text{where } K_D(p) = \begin{cases} K_1(p) & \text{if } D = 1 \\ K_0(p) & \text{if } D = 0 \end{cases} \]

3. From these estimates, recover the kj(u) functions

4. Construct the estimates of the potential outcomes and the MTE as

\[ \hat{Y}_j(x, u) = x\hat{\beta}_j + \hat{k}_j(u) \]

\[ \widehat{MTE}(x, u) = \hat{Y}_1(x, u) - \hat{Y}_0(x, u) \]
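
A hand-rolled version of the separate approach for a linear kj(u) = πj(u − 1/2) may help fix ideas. It is a sketch on hypothetical toy data with made-up variable names, in which the true values are π1 = −1 and π0 = 0; the control functions follow from K1(p) = E(U1 | UD ≤ p) and K0(p) = E(U0 | UD > p).

```stata
clear
set seed 12345
set obs 5000
* hypothetical toy data: covariate x, instrument z, normalized resistance ud
generate x  = rnormal()
generate z  = rnormal()
generate ud = runiform()
generate d  = (0.5*x + z > invnormal(ud))
generate u0 = rnormal(0, 0.2)          // k0(u) = 0
generate u1 = u0 - (ud - 0.5)          // k1(u) = -(u - 1/2)
generate y  = 1 + 0.5*x + u0 + d*(0.3 + 0.2*x + u1 - u0)

probit d x z
predict double p, pr

* step 1: control functions for a linear k_j(u)
generate double K1 = (p - 1)/2         // E(u - 1/2 | u <= p)
generate double K0 = p/2               // E(u - 1/2 | u >  p)

* step 2: one regression per treatment state
regress y x K1 if d == 1
scalar b1_cons = _b[_cons]
scalar pi1     = _b[K1]
regress y x K0 if d == 0
scalar b0_cons = _b[_cons]
scalar pi0     = _b[K0]

* steps 3 and 4: MTE(x, u) = Y1(x, u) - Y0(x, u), here at x = 0 and u = 0.25
scalar mte_low = (scalar(b1_cons) - scalar(b0_cons)) ///
    + (scalar(pi1) - scalar(pi0))*(0.25 - 0.5)
display "MTE(x = 0, u = 0.25) = " scalar(mte_low)
```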

A.4 Maximum likelihood estimation
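Relevant only for the joint normal model, the mlikelihood option implements
the maximum likelihood estimator described in Lokshin and Sajaia (2004). The
individual log likelihood contribution is

\[
\ell_i = D_i\left[\ln F(\eta_{1i}) + \ln\left(\frac{1}{\sigma_1} f\left(\frac{U_{1i}}{\sigma_1}\right)\right)\right]
+ (1 - D_i)\left[\ln F(-\eta_{0i}) + \ln\left(\frac{1}{\sigma_0} f\left(\frac{U_{0i}}{\sigma_0}\right)\right)\right]
\]

\[ \text{where } \eta_{ji} = \left(\gamma Z_i + \rho_j \frac{U_{ji}}{\sigma_j}\right)\frac{1}{\sqrt{1 - \rho_j^2}} \]

where f is the standard normal density and F the standard normal CDF.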
This log likelihood can be maximized to give the coefficients γ, β0 , β1 , σ0 , σ1 , ρ0 , ρ1 .
These parameter estimates can be used to construct the MTE and treatment effect
parameters as detailed in Table 1.

A.5 Standard errors


When using parametric models, mtefe by default calculates standard errors for the
MTEs and the treatment effect parameters from the estimated variance of β1 − β0
and the parameters of k(u). This ignores the fact that the propensity score and
the means of X themselves are estimated objects. Little is known about the effect
of this omission, although Abadie and Imbens (2016) show that ignoring the fact
that the propensity scores are estimated when matching increases the standard
errors of the Average Treatment Effect, while the effect on other parameters is
ambiguous. Alternatively, and preferably, standard errors should be estimated using
the bootstrap. Implement this using the bootreps(#) option, or use
vce(cluster clustvar) if a cluster bootstrap is appropriate. This procedure
re-estimates the propensity score, the mean of X and the treatment effect parameter
weights for each bootstrap replication, and so takes into account the uncertainty
from the first stage, unless the option norepeat1 is specified.

A.6 IV weights
To estimate the IV weights, we need measures of the impact of the instrument on
each individual conditional on X. We are looking at partitioning the linear first
stage regression

\[ D = X\beta_D + \gamma Z_- + \epsilon \]

1. Remove the impact of covariates on D and Z− by regressing them separately
on X and saving the residuals in D̃ and Z̃−.

2. Regress the residualized treatment D̃ on Z̃− (without a constant) to recover
the first stage estimate of the impact of the instrument on treatment
conditional on controls by the Frisch-Waugh-Lovell theorem.

3. The predicted values from this regression, υ̂, contain the individual impact
of the instrument on treatment conditional on controls. Note how we can
recover the first stage estimate from cov(D, υ̂), the reduced form estimate from
cov(Y, υ̂) and the traditional 2SLS estimate from cov(Y, υ̂)/cov(D, υ̂).

4. Compute the weights

\[ \hat{\kappa}_i^{LATE} = \frac{(\hat{\upsilon}_i - \bar{\hat{\upsilon}})(D_i - \bar{D})}{cov(D, \hat{\upsilon})} \]

Notice how κLATE weights treated individuals with positive υ̂ and untreated
individuals with negative υ̂ more. These are individuals who are more strongly
affected by the instruments and so more likely to be compliers.

5. Compute

\[ \hat{\omega}_{LATE}(u) = \frac{\left(E(\hat{\upsilon} \mid p > u) - E(\hat{\upsilon})\right) P(p > u)}{s \times cov(D, \hat{\upsilon})} \]

by replacing expectations and probabilities with sample analogs. This puts
more weight on the values of UD above which people have higher υ̂, precisely
the people who are more affected by the instruments and are more likely to
be compliers.

This procedure recovers the weights from Cornelissen et al. (2016a).
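
Steps 1 to 4 above can be sketched with built-in Stata commands, and the claim that the ratio of covariances reproduces 2SLS can be checked against ivregress. The data and variable names below are hypothetical and serve only to illustrate the mechanics; the weights in step 5 are not computed.

```stata
clear
set seed 42
set obs 5000
* hypothetical toy data: covariate x, instrument z, unobserved resistance v
generate x = rnormal()
generate z = rnormal() + 0.3*x
generate v = rnormal()
generate d = (0.5*x + z > v)
generate y = 1 + 0.5*x + d*(0.3 - 0.5*v) + rnormal(0, 0.2)

* step 1: partial X out of the treatment and the instrument
regress d x
predict double d_res, residuals
regress z x
predict double z_res, residuals

* step 2: first stage on the residualized variables, without a constant
regress d_res z_res, noconstant

* step 3: fitted values measure the individual impact of the instrument
predict double vhat, xb
correlate y vhat, covariance
scalar cov_yv = r(cov_12)
correlate d vhat, covariance
scalar cov_dv = r(cov_12)
scalar iv_manual = scalar(cov_yv)/scalar(cov_dv)

* the ratio of covariances should equal the 2SLS coefficient on d
ivregress 2sls y x (d = z)
display "manual: " scalar(iv_manual) "   ivregress: " _b[d]

* step 4: individual LATE weights
summarize vhat, meanonly
scalar vbar = r(mean)
summarize d, meanonly
scalar dbar = r(mean)
generate double kappa = (vhat - scalar(vbar))*(d - scalar(dbar))/scalar(cov_dv)
summarize kappa
```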

B Monte Carlo simulations


To investigate the properties of the different MTE estimators and models imple-
mented by the mtefe package, I simulate data from a data generating process
based on the example described above and detailed in section B.1. For each of
the data-generating processes (joint normal or polynomial error structure) I draw
10,000 observations for each of 500 repetitions, randomly allocating each observa-
tion to one of 10 districts. The first stage model that determines selection into
treatment is either a linear probability model or a probit.

For each repetition, I calculate the difference between the estimated and the
true parameter, the estimated analytic and bootstrapped standard errors and the
rejection rates for the true parameter. I estimate all models using the separate
approach and local IV to compare any differences.
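
The summary statistics collected in each experiment can be sketched as follows. This is a minimal illustration, assuming that the reported difference is the mean across replications and that rejection is based on a two-sided 5% test with a normal critical value; neither detail is spelled out in the text. The toy estimator (a sample mean) and the name `monte_carlo_summary` are hypothetical.

```python
import numpy as np

def monte_carlo_summary(estimates, std_errors, truth, crit=1.96):
    """Summarize replications: mean difference, sd, mean s.e., rejection rate."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    diff = estimates - truth
    reject = np.abs(diff / std_errors) > crit    # test of the true parameter
    return {
        "diff": diff.mean(),                     # average estimate minus truth
        "sd": estimates.std(ddof=1),             # spread of the estimates
        "se": std_errors.mean(),                 # average estimated s.e.
        "r": reject.mean(),                      # share of rejections
    }

# Toy experiment: estimate a mean in 500 replications of 1,000 observations
rng = np.random.default_rng(1)
truth, n, reps = 0.4, 1000, 500
est, se = [], []
for _ in range(reps):
    y = truth + rng.normal(size=n)
    est.append(y.mean())
    se.append(y.std(ddof=1) / np.sqrt(n))

summary = monte_carlo_summary(est, se, truth)
print(summary)
```

For a well behaved estimator, the standard deviation of the estimates and the average estimated standard error should be close, and the rejection rate should be near the nominal level.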

B.1 Data generating process


This algorithm describes the data generating process implemented by mtefe_gendata,
which is part of the mtefe package and is used in the Monte Carlo simulations and
examples in this paper. This DGP generates selection on both levels and gains, as
well as a continuous instrument that is valid only conditional on fixed effects.

1. In a sample of N individuals, randomly allocate each to one of G districts
with equal probability

2. Draw average labor market quality Πj in the college and non-college labor
markets and average distance to college in each district AvgDist from a joint
normal distribution:

$$(\Pi_0, \Pi_1, AvgDist) \sim N(0, \Sigma_f), \qquad \Sigma_f = \begin{pmatrix} .1 & .05 & -.05 \\ .05 & .1 & -.1 \\ -.05 & -.1 & 10 \end{pmatrix}$$

In the simulations, I draw these once rather than repeat for each simula-
tion to have a true value to compare the estimated coefficients to, but by
default mtefe_gendata draws these for each replication unless you specify
the parameters.

3. Generate individual distance to college distCol as AvgDist plus N (40, 10)

4. Draw exp from U (0, 30) and construct exp2 as its square

5. Construct the error terms U0 , U1 , V from one of two error structures:

Normal Draw randomly from a joint normal model:

$$(U_0, U_1, V) \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} .5 & .3 & -.1 \\ .3 & .5 & -.5 \\ -.1 & -.5 & 1 \end{pmatrix}$$

Polynomial Generate Uj as second order polynomials of UD with mean
zero:

(a) Draw UD from U (0, 1) and generate V = F^{-1}(UD)

(b) Generate $U_j = \sum_{l=1}^{2} \pi_{jl} \left(U_D^l - \frac{1}{l+1}\right) + \epsilon$, where $\epsilon \sim N(0, 0.2)$
i. Using π11 = 0.5, π12 = −0.1
ii. And π01 = 2, π02 = −1

6. Construct the potential outcomes as

$$Y_j = X\beta_j + \Pi_j + U_j$$

with X = (exp, exp2, 1), β0 = (0.025, −0.0004, 3.2) and β1 = (0.01, 0, 3.6).

7. Determine treatment using one of two binary choice models:

Probit D = 1[γZ > V], with Z = (distCol, exp, exp2, 1) and
γ = (−0.125, −0.08, 0.002, 5.59)

LPM D = 1[γZ > UD], with Z = (distCol, exp, exp2, 1) and
γ = (−0.015, −0.01, 0.00025, 1.17375)

8. Determine the observed outcome lwage as DY1 + (1 − D)Y0
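
The eight steps above can be sketched in Python as follows. This is an illustrative re-implementation with a probit first stage, not the code of mtefe_gendata. It rests on several assumptions that the text leaves open: N(a, b) is read as mean a and variance b, F is taken to be the standard normal CDF, and the noise term in the polynomial errors is drawn independently for U0 and U1. The function name `gendata_sketch` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def gendata_sketch(n=10_000, g=10, errors="normal", seed=0):
    """Illustrative version of steps 1-8 with a probit first stage."""
    rng = np.random.default_rng(seed)
    district = rng.integers(0, g, size=n)                        # step 1
    sigma_f = np.array([[0.10, 0.05, -0.05],
                        [0.05, 0.10, -0.10],
                        [-0.05, -0.10, 10.0]])
    pi0, pi1, avg_dist = rng.multivariate_normal(
        np.zeros(3), sigma_f, size=g).T                          # step 2
    # Assumption: N(a, b) denotes mean a and variance b throughout.
    dist_col = avg_dist[district] + rng.normal(40, np.sqrt(10), size=n)  # step 3
    exp = rng.uniform(0, 30, size=n)                             # step 4
    exp2 = exp ** 2
    if errors == "normal":                                       # step 5
        sigma = np.array([[0.5, 0.3, -0.1],
                          [0.3, 0.5, -0.5],
                          [-0.1, -0.5, 1.0]])
        u0, u1, v = rng.multivariate_normal(np.zeros(3), sigma, size=n).T
    else:
        # Clip to keep the inverse CDF finite at the boundaries.
        u_d = np.clip(rng.uniform(size=n), 1e-12, 1 - 1e-12)
        # Assumption: F is the standard normal CDF, matching the probit.
        v = np.array([NormalDist().inv_cdf(p) for p in u_d])
        coefs = {0: (2.0, -1.0), 1: (0.5, -0.1)}                 # pi_j1, pi_j2

        def poly(j):
            eps = rng.normal(0, np.sqrt(0.2), size=n)
            return sum(c * (u_d ** l - 1 / (l + 1))
                       for l, c in enumerate(coefs[j], start=1)) + eps

        u0, u1 = poly(0), poly(1)
    X = np.column_stack([exp, exp2, np.ones(n)])                 # step 6
    y0 = X @ np.array([0.025, -0.0004, 3.2]) + pi0[district] + u0
    y1 = X @ np.array([0.01, 0.0, 3.6]) + pi1[district] + u1
    Z = np.column_stack([dist_col, exp, exp2, np.ones(n)])       # step 7
    D = (Z @ np.array([-0.125, -0.08, 0.002, 5.59]) > v).astype(int)
    lwage = D * y1 + (1 - D) * y0                                # step 8
    return {"district": district, "distCol": dist_col, "exp": exp,
            "exp2": exp2, "D": D, "y0": y0, "y1": y1,
            "u0": u0, "u1": u1, "v": v, "lwage": lwage}

data = gendata_sketch(n=2000, seed=1)
print(data["D"].mean(), data["lwage"].mean())
```

Because average distance is drawn jointly with district labor market quality, distance to college is correlated with the district effects in the outcome equations, which is why the instrument is valid only conditional on district fixed effects.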

B.2 A well specified baseline
To establish a baseline, I first investigate the case where both the first stage and
the polynomial model for k(u) are correctly specified.
The results from these exercises are displayed in Table 4. Unsurprisingly, and
in line with theory and Monte Carlo evidence from Brave and Walstrum (2014),
mtefe does a good job of estimating both the coefficients of the outcome equations
and the MTEs when both the first stage and the parametric model are correctly
specified. The differences between the estimated and true coefficients are centered
at 0 and have low standard deviations. The estimated analytical standard errors come
close to the standard deviations of the coefficients. Furthermore, even though the
analytical standard errors ignore the fact that the propensity score itself is an
estimated object, they fall very close to the bootstrapped standard errors that
account for this by re-estimating the propensity scores for each bootstrap
replication. Note that bootstrapped standard errors aren't always higher than analytic
standard errors. Rejection rates vary somewhat from parameter to parameter, but
lie around .05 as they should.
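
As a rough guide to how much variation around .05 is expected from simulation noise alone, the Monte Carlo sampling error of a rejection rate can be worked out under the assumption (not stated in the text) that rejections across the 500 replications are independent Bernoulli draws with probability .05:

```latex
\[
\operatorname{se}(\hat{r}) = \sqrt{\frac{0.05 \times 0.95}{500}} \approx 0.0097,
\qquad
0.05 \pm 1.96 \times 0.0097 \approx [0.031,\ 0.069].
\]
```

Under this assumption, rejection rates between roughly .03 and .07 are consistent with a correctly sized 5% test.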
Comparing the results from Local IV estimates with the results from the separate
approach reveals one important difference: across the model specifications,
estimates seem to be more precise using the separate approach. Both the standard
deviations of the coefficients and the estimated standard errors are lower than
those from Local IV, as illustrated for a range of parameters in Figure 4. This
is surprising, given that more parameters are estimated when using the separate
approach than Local IV. I have no good explanation for this result, but it warrants
more research.

B.3 Misspecification of k(u)


The above experiment focused on the case where both the first stage model and the
parametric model for k(u) were correctly specified. To illustrate how the estimators
perform when one of the two is misspecified, I keep the assumption that the first stage
model is correctly specified, but use a misspecified MTE model. In Table 5, I
generate data from the normal model and estimate the MTE using the polynomial
model of second order and vice versa.
Both models do a relatively good job of estimating the coefficients of the outcome
equations, and rejection rates for the true β0 and β1 − β0 coefficients are
close to .05 as they should be. Note, however, that the bootstrap now outperforms the
analytic standard errors, yielding rejection rates closer to 5% than the analytic
standard errors in most cases.
The MTE estimates, and in turn the ATE, are severely overrejected. Bootstrapping
the standard errors and accounting for the uncertainty in the estimates of p

Table 4: Monte Carlo estimates: A well specified baseline

True model Normal Normal Polynomial Polynomial
Estimated model Normal Normal Polynomial Polynomial
Procedure Local IV Separate Local IV Separate
Coefficient Truth diff s.e. r diff s.e. r diff s.e. r diff s.e. r

β0
exp 0.025 -0.000 0.006 0.042 -0.000 0.004 0.088 0.000 0.003 0.078 0.000 0.002 0.084
(0.006) 0.006 0.050 (0.005) 0.004 0.066 (0.003) 0.003 0.044 (0.002) 0.002 0.050
exp2 -0.0004 0.000 0.000 0.040 0.000 0.000 0.066 -0.000 0.000 0.070 -0.000 0.000 0.096
(0.000) 0.000 0.050 (0.000) 0.000 0.056 (0.000) 0.000 0.032 (0.000) 0.000 0.044
3.district 0.34 -0.003 0.062 0.036 -0.001 0.042 0.074 -0.001 0.025 0.062 -0.002 0.016 0.104
(0.058) 0.058 0.054 (0.045) 0.044 0.060 (0.026) 0.026 0.058 (0.020) 0.019 0.054
6.district 0.34 -0.004 0.059 0.060 -0.003 0.040 0.072 0.000 0.023 0.078 -0.001 0.015 0.102
(0.061) 0.058 0.058 (0.044) 0.042 0.060 (0.026) 0.026 0.056 (0.019) 0.019 0.054
10.district -0.51 -0.004 0.060 0.052 -0.002 0.041 0.062 0.001 0.024 0.064 0.000 0.016 0.096
(0.059) 0.057 0.064 (0.043) 0.043 0.058 (0.025) 0.025 0.062 (0.019) 0.019 0.060
constant 3.2 0.006 0.065 0.042 0.002 0.043 0.078 -0.001 0.026 0.084 0.000 0.017 0.098
(0.063) 0.062 0.046 (0.047) 0.045 0.064 (0.028) 0.028 0.072 (0.021) 0.020 0.062

β1 − β0
exp -0.015 0.001 0.010 0.042 0.001 0.006 0.076 -0.000 0.004 0.038 0.000 0.002 0.058
(0.010) 0.010 0.044 (0.006) 0.006 0.070 (0.004) 0.004 0.048 (0.002) 0.002 0.062
exp2 0.0004 -0.000 0.000 0.038 -0.000 0.000 0.070 0.000 0.000 0.040 -0.000 0.000 0.050
(0.000) 0.000 0.046 (0.000) 0.000 0.056 (0.000) 0.000 0.052 (0.000) 0.000 0.056
3.district -0.35 0.003 0.106 0.028 0.001 0.060 0.062 -0.000 0.042 0.034 0.001 0.023 0.052
(0.098) 0.098 0.052 (0.062) 0.061 0.060 (0.039) 0.039 0.048 (0.024) 0.023 0.054
6.district 1.00 0.007 0.109 0.078 0.005 0.061 0.038 -0.002 0.043 0.068 0.001 0.023 0.062
(0.114) 0.112 0.066 (0.058) 0.062 0.036 (0.047) 0.047 0.046 (0.024) 0.023 0.066
10.district -0.07 0.006 0.107 0.044 0.001 0.060 0.050 -0.001 0.043 0.024 0.001 0.023 0.032
(0.104) 0.100 0.066 (0.061) 0.061 0.050 (0.037) 0.038 0.042 (0.022) 0.023 0.034
constant 0.4 -0.010 0.098 0.030 -0.005 0.058 0.062 0.002 0.041 0.042 -0.000 0.023 0.070
(0.091) 0.094 0.036 (0.060) 0.059 0.050 (0.039) 0.039 0.052 (0.025) 0.024 0.062

k(u) (ρ row: Normal model columns; π rows: Polynomial model columns)
ρ1 − ρ0 -0.4 0.002 0.062 0.048 -0.000 0.032 0.048
(0.059) 0.059 0.062 (0.032) 0.032 0.050
π11 − π01 -1.5 -0.051 0.434 0.048 -0.014 0.253 0.028
(0.431) 0.451 0.038 (0.236) 0.249 0.038


π12 − π02 0.9 0.046 0.433 0.048 0.010 0.248 0.020
(0.424) 0.439 0.050 (0.218) 0.231 0.046

MTE
MTE(x̄, 0.05) 1.02 -0.003 0.103 0.042 0.001 0.056 0.052 0.009 0.070 0.048 0.004 0.045 0.078
0.74 (0.098) 0.099 0.052 (0.056) 0.057 0.050 (0.071) 0.075 0.034 (0.048) 0.050 0.046
MTE(x̄, 0.1) 0.88 -0.003 0.081 0.038 0.001 0.045 0.042 0.007 0.052 0.054 0.003 0.035 0.100
0.67 (0.077) 0.078 0.046 (0.046) 0.047 0.044 (0.054) 0.057 0.030 (0.039) 0.040 0.044
MTE(x̄, 0.25) 0.63 -0.001 0.047 0.028 0.000 0.030 0.044 0.002 0.021 0.068 0.002 0.017 0.104
0.49 (0.044) 0.045 0.044 (0.030) 0.031 0.038 (0.023) 0.023 0.050 (0.020) 0.020 0.056
MTE(x̄, 0.5) 0.36 0.000 0.025 0.026 0.000 0.021 0.050 -0.002 0.027 0.060 -0.000 0.016 0.032
0.29 (0.023) 0.023 0.036 (0.022) 0.022 0.034 (0.027) 0.028 0.048 (0.015) 0.016 0.042
MTE(x̄, 0.75) 0.09 0.002 0.050 0.040 -0.000 0.031 0.058 -0.000 0.022 0.012 -0.000 0.018 0.024
0.19 (0.048) 0.047 0.060 (0.032) 0.031 0.054 (0.018) 0.020 0.018 (0.016) 0.018 0.032
MTE(x̄, 0.9) -0.15 0.003 0.085 0.044 -0.001 0.047 0.054 0.003 0.056 0.032 -0.000 0.037 0.014
0.19 (0.082) 0.080 0.058 (0.048) 0.046 0.058 (0.050) 0.051 0.048 (0.031) 0.033 0.032
MTE(x̄, 0.95) -0.29 0.004 0.106 0.044 -0.001 0.057 0.046 0.005 0.074 0.030 0.000 0.047 0.016
0.20 (0.103) 0.101 0.060 (0.058) 0.057 0.052 (0.068) 0.068 0.054 (0.039) 0.041 0.028

ATE(x̄) 0.37 0.000 0.025 0.026 0.000 0.021 0.050 0.002 0.014 0.054 0.001 0.010 0.086
0.36 (0.023) 0.023 0.036 (0.022) 0.022 0.034 (0.014) 0.015 0.048 (0.012) 0.012 0.042

Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors (underlined in the original table) appear on the second line of each entry; results on the first line are
analytic. True values show the normal (top) and polynomial (bottom) models. First stage is
probit. 500 replications, each with a sample size of 10,000 in 10 districts.

[Figure 4: kernel density plots. Panel (a) Normal model: beta1−beta0:exp, k:mills, effect:ate, mte:u25, mte:u50, mte:u75. Panel (b) Polynomial model: k:p1, k:p2, effect:ate, mte:u25, mte:u50, mte:u75.]

Note: Kernel density plots of estimated coefficients using local IV (solid) and the separate approach (dashed) for
selected parameters of the MTE models. Well specified baseline model. 500 Monte Carlo simulations from the
normal (a) and polynomial (b) models.

Figure 4: Efficiency of the two estimation methods


produces lower rejection rates, but these are still above 5%. The problem seems to
be driven by inconsistent estimates rather than by standard errors that are too low.
The result is large rates of overrejection when the functional form is misspecified:
average rejection rates are far above .05 for a 5% test. The polynomial model
seems to do a better job of approximating the normal model than the other way
around, as is apparent from the lower rejection rates.
This illustrates the importance of using the correct parametric model and of
working with flexible specifications. In this case, a higher order polynomial or a
polynomial with splines could have generated a better fit to the underlying model.
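
To see why a second order polynomial struggles with data from the normal model, the sketch below compares the normal k(u) from the DGP, (ρ1 − ρ0)Φ^{-1}(u) with ρ1 − ρ0 = −0.4, with its least squares approximation in the mean-zero polynomial basis. This is an illustration only: the fit is taken over an evenly spaced grid of u, which is an assumption and not how mtefe estimates the model.

```python
import numpy as np
from statistics import NormalDist

# Grid of unobserved resistance values, kept away from 0 and 1
u = np.linspace(0.01, 0.99, 99)

# k(u) under the normal model: (rho1 - rho0) * Phi^{-1}(u)
k_normal = -0.4 * np.array([NormalDist().inv_cdf(x) for x in u])

# Second order polynomial in the mean-zero basis used by the polynomial model
basis = np.column_stack([u - 1 / 2, u ** 2 - 1 / 3])
coef, *_ = np.linalg.lstsq(basis, k_normal, rcond=None)
k_poly = basis @ coef

# Approximation error, which is largest in the tails of u
gap = k_normal - k_poly
for point in (0.05, 0.5, 0.95):
    i = np.argmin(np.abs(u - point))
    print(f"u={u[i]:.2f}  normal={k_normal[i]: .3f}  poly={k_poly[i]: .3f}")
```

The normal k(u) bends sharply as u approaches 0 or 1, while the fitted polynomial on this grid is essentially linear, so the two diverge most in the tails. A higher order polynomial or splines add the curvature needed there.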

B.4 Misspecification of P (Z)


Next, I move on to investigate the specification of the first stage. To focus on the
problem of misspecification of the first stage, assume that the parametric model
for k(u) is correctly specified as a second order polynomial. Instead, I let the
functional form of the selection equation be either linear or probit when generating
data, and use the other functional form when estimating the model.
The results from this exercise are provided in Table 6. The β parameters are
estimated reasonably well, with rejection rates around .05, but there is
larger variation across parameters than in the previous experiments. The MTEs,
ATE and parameters of the k(u) function are severely overrejected. Again, this
does not seem to be driven by underestimation of standard errors if we compare
them to the standard deviation of the estimated coefficients. Rather, it seems to
be the differences between estimated and true coefficients that do not center at 0.
As an alternative misspecification, consider the case where the functional form
is correct, but where the experience controls are not included in either the first
or the second stage estimation. This should affect the precision of the coefficients
in the propensity score model, but since the omitted regressors are orthogonal to
everything else, it should not bias the estimates of other parameters except the
constant. The results from this exercise are found in Table 7, and fortunately we
see that the results are not sensitive to this sort of omission of variables as long as
the exclusion restriction holds.
Summing up, specification of the propensity score generally receives relatively
little attention in applied papers, but turns out to be important. Careful researchers
should evaluate the robustness of their MTE models using various and flexible first
stage models.
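
The kind of first stage misspecification studied here can be illustrated with a small sketch: treatment is generated from the probit selection equation of the DGP, and the propensity score is then fitted with a linear probability model. This is a simplified illustration, assuming no district effects and reading N(40, 10) as mean 40 and variance 10; it is not the estimation routine used by mtefe.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 5000

# Covariates and instrument (district effects left out for simplicity)
dist_col = rng.normal(40, np.sqrt(10), size=n)
exp = rng.uniform(0, 30, size=n)
Z = np.column_stack([dist_col, exp, exp ** 2, np.ones(n)])

# True first stage: probit with the coefficients from the DGP
gamma_probit = np.array([-0.125, -0.08, 0.002, 5.59])
index = Z @ gamma_probit
std_normal = NormalDist()
p_true = np.array([std_normal.cdf(v) for v in index])
D = (index > rng.normal(size=n)).astype(float)

# Misspecified first stage: linear probability model fitted by OLS
coef, *_ = np.linalg.lstsq(Z, D, rcond=None)
p_lpm = Z @ coef

max_gap = np.abs(p_lpm - p_true).max()
share_outside = np.mean((p_lpm < 0) | (p_lpm > 1))
print(max_gap, share_outside, np.corrcoef(p_lpm, p_true)[0, 1])
```

Even when the fitted and true propensity scores are highly correlated, the gaps between them feed directly into the estimated k(u) and MTE curves, because both are functions of the propensity score.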

Table 5: Monte Carlo estimates: Misspecification of k(u)

True model Polynomial Polynomial Normal Normal
Estimated model Normal Normal Polynomial Polynomial
Procedure Local IV Separate Local IV Separate
Coefficient Truth diff s.e. r diff s.e. r diff s.e. r diff s.e. r

β0
exp 0.025 -0.000 0.003 0.086 -0.000 0.002 0.126 -0.000 0.006 0.042 -0.000 0.004 0.068
(0.003) 0.003 0.054 (0.002) 0.002 0.068 (0.006) 0.006 0.050 (0.005) 0.004 0.050
exp2 -0.0004 0.000 0.000 0.084 0.000 0.000 0.128 0.000 0.000 0.038 0.000 0.000 0.078
(0.000) 0.000 0.052 (0.000) 0.000 0.068 (0.000) 0.000 0.046 (0.000) 0.000 0.058
3.district 0.34 0.002 0.025 0.052 0.001 0.016 0.092 -0.001 0.062 0.040 0.000 0.042 0.088
(0.026) 0.025 0.054 (0.018) 0.019 0.044 (0.059) 0.058 0.054 (0.046) 0.044 0.066
6.district 0.34 -0.001 0.023 0.086 -0.004 0.015 0.110 -0.005 0.059 0.070 -0.003 0.040 0.084
(0.026) 0.026 0.060 (0.018) 0.018 0.062 (0.062) 0.058 0.074 (0.045) 0.043 0.076
10.district -0.51 -0.001 0.024 0.050 -0.002 0.016 0.094 -0.002 0.060 0.050 -0.000 0.041 0.054
(0.025) 0.025 0.040 (0.018) 0.019 0.034 (0.059) 0.057 0.060 (0.044) 0.043 0.046
constant 3.2 0.007 0.026 0.076 0.031 0.016 0.480 0.011 0.066 0.034 -0.005 0.045 0.060
(0.028) 0.027 0.070 (0.019) 0.019 0.364 (0.064) 0.063 0.044 (0.047) 0.047 0.046

β1 − β0
exp -0.015 -0.000 0.004 0.062 0.000 0.002 0.054 0.000 0.010 0.058 0.000 0.006 0.060
(0.004) 0.004 0.064 (0.002) 0.002 0.056 (0.010) 0.010 0.064 (0.006) 0.006 0.052
exp2 0.0004 0.000 0.000 0.056 -0.000 0.000 0.058 -0.000 0.000 0.036 -0.000 0.000 0.060
(0.000) 0.000 0.052 (0.000) 0.000 0.058 (0.000) 0.000 0.052 (0.000) 0.000 0.052
3.district -0.35 -0.002 0.042 0.036 0.000 0.023 0.034 0.000 0.106 0.030 -0.001 0.060 0.044
(0.039) 0.039 0.048 (0.022) 0.023 0.040 (0.097) 0.098 0.046 (0.061) 0.061 0.042
6.district 1.00 0.001 0.043 0.066 0.003 0.023 0.050 0.005 0.109 0.046 0.002 0.061 0.064
(0.046) 0.047 0.040 (0.023) 0.023 0.054 (0.109) 0.112 0.040 (0.064) 0.062 0.064
10.district -0.07 0.001 0.043 0.028 0.002 0.023 0.052 0.001 0.107 0.040 -0.001 0.060 0.058
(0.040) 0.038 0.056 (0.023) 0.023 0.054 (0.099) 0.100 0.056 (0.060) 0.061 0.052
constant 0.4 -0.023 0.039 0.078 -0.030 0.022 0.248 0.003 0.102 0.038 0.035 0.060 0.088
(0.038) 0.037 0.102 (0.022) 0.023 0.242 (0.101) 0.097 0.056 (0.060) 0.062 0.082

k(u) (ρ row: Normal model columns; π rows: Polynomial model columns)
ρ1 − ρ0 N/A* -0.199 0.025 -0.115 0.012
(0.024) 0.024 (0.012) 0.012
π11 − π01 N/A* -1.333 1.091 -2.513 0.664
(1.074) 1.079 (0.628) 0.672


π12 − π02 N/A* 0.104 1.087 1.296 0.651
(1.058) 1.066 (0.620) 0.647

MTE
MTE(x̄, 0.05) 1.02 -0.073 0.041 0.460 -0.215 0.021 1.000 -0.087 0.175 0.086 0.079 0.117 0.102
0.74 (0.041) 0.042 0.438 (0.023) 0.023 1.000 (0.173) 0.174 0.088 (0.112) 0.122 0.082
MTE(x̄, 0.1) 0.88 -0.077 0.032 0.672 -0.189 0.017 1.000 -0.008 0.132 0.050 0.109 0.092 0.220
0.67 (0.033) 0.034 0.644 (0.019) 0.019 1.000 (0.131) 0.131 0.050 (0.089) 0.096 0.194
MTE(x̄, 0.25) 0.63 -0.020 0.019 0.196 -0.080 0.011 1.000 0.040 0.052 0.110 0.043 0.044 0.166
0.49 (0.020) 0.020 0.148 (0.013) 0.014 1.000 (0.051) 0.050 0.142 (0.046) 0.046 0.140
MTE(x̄, 0.5) 0.36 0.052 0.010 1.000 0.048 0.008 1.000 -0.004 0.068 0.056 -0.073 0.043 0.398
0.29 (0.011) 0.011 1.000 (0.010) 0.010 1.000 (0.068) 0.067 0.054 (0.043) 0.043 0.398
MTE(x̄, 0.75) 0.09 0.012 0.020 0.084 0.065 0.012 1.000 -0.035 0.056 0.092 -0.026 0.047 0.088
0.19 (0.019) 0.018 0.106 (0.012) 0.012 1.000 (0.053) 0.052 0.114 (0.047) 0.048 0.082
MTE(x̄, 0.9) -0.15 -0.106 0.034 0.894 -0.002 0.018 0.030 0.034 0.142 0.046 0.160 0.098 0.372
0.19 (0.032) 0.031 0.922 (0.017) 0.016 0.054 (0.132) 0.133 0.060 (0.096) 0.096 0.384
MTE(x̄, 0.95) -0.29 -0.187 0.042 0.990 -0.052 0.022 0.646 0.122 0.186 0.090 0.300 0.124 0.682
0.20 (0.040) 0.040 0.994 (0.021) 0.020 0.716 (0.174) 0.176 0.104 (0.122) 0.121 0.692

ATE(x̄) 0.37 -0.021 0.010 0.588 -0.025 0.008 0.822 0.005 0.036 0.822 0.033 0.026 0.822
0.36 (0.011) 0.011 0.482 (0.010) 0.010 0.720 (0.031) 0.034 0.042 (0.025) 0.028 0.186

Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors (underlined in the original table) appear on the second line of each entry; results on the first line are
analytic. 500 replications, each with a sample size of 10,000 in 10 districts. *
Coefficients reported for k(u) rather than differences.

Table 6: Monte Carlo estimates: Misspecification of P (Z)

True first stage model Probit Probit LPM LPM
Estimated first stage model LPM LPM Probit Probit
Procedure Local IV Separate Local IV Separate
Coefficient Truth diff s.e. r diff s.e. r diff s.e. r diff s.e. r

β0
exp 0.025 0.001 0.003 0.102 0.000 0.002 0.092 -0.000 0.005 0.058 0.000 0.001 0.100
(0.003) 0.003 0.070 (0.002) 0.002 0.050 (0.005) 0.005 0.042 (0.002) 0.002 0.060
exp2 -0.0004 -0.000 0.000 0.102 -0.000 0.000 0.088 0.000 0.000 0.056 -0.000 0.000 0.104
(0.000) 0.000 0.070 (0.000) 0.000 0.050 (0.000) 0.000 0.050 (0.000) 0.000 0.058
3.district 0.34 0.024 0.028 0.132 -0.001 0.016 0.110 -0.002 0.047 0.028 -0.000 0.014 0.050
(0.029) 0.029 0.128 (0.019) 0.019 0.068 (0.041) 0.041 0.052 (0.015) 0.016 0.028
6.district 0.34 -0.071 0.026 0.736 -0.000 0.015 0.126 0.004 0.045 0.080 0.000 0.014 0.106
(0.030) 0.030 0.664 (0.019) 0.019 0.060 (0.050) 0.052 0.040 (0.016) 0.016 0.044
10.district -0.51 0.002 0.027 0.048 -0.000 0.016 0.084 0.001 0.046 0.016 -0.000 0.014 0.100
(0.028) 0.028 0.044 (0.019) 0.019 0.046 (0.037) 0.038 0.050 (0.016) 0.016 0.050
constant 3.2 0.006 0.029 0.056 -0.010 0.017 0.116 -0.067 0.070 0.142 -0.021 0.027 0.172
(0.030) 0.031 0.050 (0.020) 0.021 0.060 (0.068) 0.071 0.130 (0.032) 0.033 0.084

β1 − β0
exp -0.015 -0.003 0.005 0.092 -0.000 0.002 0.038 0.001 0.009 0.056 0.000 0.002 0.044
(0.005) 0.005 0.094 (0.002) 0.002 0.040 (0.008) 0.009 0.058 (0.002) 0.002 0.044
exp2 0.0004 0.000 0.000 0.086 0.000 0.000 0.042 -0.000 0.000 0.052 -0.000 0.000 0.048
(0.000) 0.000 0.076 (0.000) 0.000 0.040 (0.000) 0.000 0.048 (0.000) 0.000 0.048
3.district -0.35 -0.049 0.050 0.148 0.001 0.023 0.054 0.002 0.091 0.024 -0.001 0.020 0.030
(0.045) 0.046 0.176 (0.023) 0.023 0.056 (0.078) 0.077 0.056 (0.018) 0.020 0.024
6.district 1.00 0.143 0.051 0.776 -0.000 0.023 0.048 -0.007 0.092 0.094 0.000 0.020 0.048
(0.056) 0.054 0.742 (0.024) 0.023 0.052 (0.105) 0.110 0.050 (0.020) 0.020 0.048
10.district -0.07 -0.007 0.050 0.036 -0.001 0.023 0.062 -0.003 0.091 0.008 -0.000 0.020 0.044
(0.045) 0.046 0.050 (0.024) 0.023 0.052 (0.071) 0.071 0.052 (0.019) 0.020 0.046
constant 0.4 -0.006 0.047 0.038 0.013 0.024 0.092 0.099 0.123 0.102 0.025 0.038 0.120
(0.044) 0.045 0.048 (0.024) 0.025 0.074 (0.114) 0.120 0.100 (0.040) 0.043 0.070

k(u)
π11 − π01 -1.5 2.612 0.411 1.000 1.466 0.291 1.000 -1.808 1.459 0.236 -0.481 0.496 0.176
(0.411) 0.428 1.000 (0.282) 0.297 1.000 (1.474) 1.514 0.206 (0.541) 0.554 0.122
π12 − π02 0.9 -2.488 0.410 1.000 -1.435 0.288 1.000 1.654 1.463 0.202 0.412 0.480 0.136
(0.398) 0.415 1.000 (0.261) 0.278 1.000 (1.446) 1.501 0.180 (0.494) 0.506 0.120

MTE
MTE(x̄, 0.05) 0.74 -0.361 0.067 1.000 -0.173 0.049 0.926 0.368 0.275 0.274 0.105 0.107 0.184
(0.070) 0.073 1.000 (0.056) 0.057 0.852 (0.286) 0.290 0.232 (0.126) 0.130 0.100
MTE(x̄, 0.1) 0.67 -0.249 0.051 1.000 -0.110 0.038 0.774 0.290 0.215 0.276 0.084 0.087 0.180
(0.054) 0.056 0.998 (0.045) 0.045 0.680 (0.224) 0.227 0.234 (0.104) 0.108 0.094
MTE(x̄, 0.25) 0.49 0.012 0.022 0.116 0.034 0.017 0.498 0.106 0.080 0.254 0.033 0.042 0.180
(0.023) 0.024 0.084 (0.022) 0.022 0.366 (0.086) 0.087 0.204 (0.052) 0.055 0.072
MTE(x̄, 0.5) 0.29 0.198 0.026 1.000 0.132 0.020 1.000 -0.036 0.037 0.174 -0.010 0.018 0.126
(0.026) 0.027 1.000 (0.017) 0.019 1.000 (0.038) 0.039 0.136 (0.020) 0.020 0.084
MTE(x̄, 0.75) 0.19 0.074 0.024 0.900 0.050 0.019 0.758 0.029 0.090 0.026 -0.001 0.046 0.050
(0.021) 0.021 0.934 (0.018) 0.018 0.768 (0.079) 0.086 0.042 (0.044) 0.046 0.046
MTE(x̄, 0.9) 0.19 -0.150 0.056 0.776 -0.086 0.041 0.580 0.167 0.232 0.090 0.029 0.092 0.046
(0.049) 0.049 0.866 (0.036) 0.036 0.668 (0.213) 0.229 0.086 (0.087) 0.091 0.050
MTE(x̄, 0.95) 0.20 -0.250 0.073 0.944 -0.145 0.052 0.820 0.230 0.295 0.098 0.043 0.113 0.046
(0.065) 0.065 0.972 (0.045) 0.046 0.880 (0.273) 0.292 0.102 (0.107) 0.111 0.050

ATE(x̄) 0.37 -0.005 0.015 0.056 0.014 0.011 0.274 0.099 0.092 0.188 0.024 0.032 0.134
0.36 (0.015) 0.016 0.046 (0.013) 0.014 0.168 (0.090) 0.094 0.158 (0.036) 0.038 0.072

Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors (underlined in the original table) appear on the second line of each entry; results on the first line are
analytic. 500 replications, each with a sample size of 10,000 in 10 districts. Error
structure is polynomial.

Table 7: Monte Carlo estimates: Omissions in the specification of P (Z)

True first stage model LPM LPM Probit Probit
Estimated first stage model LPM LPM Probit Probit
Procedure Local IV Separate Local IV Separate
Coefficient Truth diff s.e. r diff s.e. r diff s.e. r diff s.e. r

β0
3.district 0.34 -0.003 0.051 0.018 0.000 0.015 0.088 -0.000 0.026 0.072 -0.000 0.017 0.096
(0.044) 0.045 0.036 (0.018) 0.018 0.056 (0.028) 0.027 0.058 (0.020) 0.020 0.052
6.district 0.34 -0.005 0.049 0.096 -0.000 0.015 0.088 -0.001 0.025 0.076 -0.001 0.016 0.108
(0.058) 0.055 0.066 (0.018) 0.018 0.056 (0.028) 0.028 0.056 (0.019) 0.020 0.054
10.district -0.51 -0.001 0.050 0.020 0.000 0.015 0.096 -0.002 0.025 0.064 -0.002 0.016 0.124
(0.041) 0.042 0.046 (0.018) 0.018 0.056 (0.027) 0.026 0.056 (0.020) 0.020 0.066
constant 3.2 0.258 0.055 0.994 0.259 0.025 1.000 0.254 0.021 1.000 0.255 0.014 1.000
(0.053) 0.055 0.992 (0.031) 0.032 1.000 (0.023) 0.022 1.000 (0.017) 0.017 1.000

β1 − β0
3.district -0.35 0.005 0.098 0.016 -0.001 0.022 0.060 0.002 0.045 0.042 0.001 0.024 0.042
(0.083) 0.085 0.038 (0.022) 0.022 0.066 (0.043) 0.042 0.058 (0.024) 0.024 0.048
6.district 1.00 0.009 0.099 0.110 0.000 0.022 0.052 0.002 0.046 0.058 0.001 0.025 0.034
(0.121) 0.115 0.076 (0.023) 0.022 0.050 (0.049) 0.050 0.042 (0.023) 0.024 0.042
10.district -0.07 0.002 0.098 0.008 -0.001 0.022 0.036 0.002 0.045 0.030 0.002 0.024 0.058
(0.075) 0.079 0.042 (0.021) 0.022 0.038 (0.043) 0.041 0.050 (0.025) 0.024 0.064
constant 0.4 -0.111 0.100 0.166 -0.113 0.037 0.842 -0.104 0.034 0.872 -0.107 0.020 1.000
(0.089) 0.094 0.206 (0.039) 0.042 0.784 (0.033) 0.032 0.896 (0.020) 0.020 1.000

k(u)
π11 − π01 -1.5 0.106 1.162 0.058 0.197 0.501 0.086 0.028 0.458 0.054 0.136 0.269 0.074
(1.191) 1.211 0.050 (0.532) 0.567 0.058 (0.471) 0.476 0.050 (0.260) 0.271 0.076
π12 − π02 0.9 -0.109 1.167 0.064 -0.202 0.488 0.064 -0.038 0.457 0.048 -0.147 0.264 0.058
(1.183) 1.196 0.060 (0.490) 0.519 0.050 (0.455) 0.465 0.044 (0.238) 0.251 0.074

MTE
MTE(x̄, 0.05) 0.74 -0.016 0.220 0.062 -0.033 0.107 0.100 0.000 0.074 0.082 -0.015 0.048 0.092
(0.227) 0.234 0.038 (0.123) 0.132 0.040 (0.079) 0.079 0.054 (0.054) 0.054 0.054
MTE(x̄, 0.1) 0.67 -0.012 0.172 0.068 -0.025 0.086 0.100 0.001 0.055 0.080 -0.010 0.037 0.106
(0.178) 0.184 0.036 (0.102) 0.108 0.038 (0.061) 0.060 0.054 (0.043) 0.044 0.058
MTE(x̄, 0.25) 0.49 -0.001 0.068 0.068 -0.006 0.042 0.124 0.003 0.022 0.070 0.003 0.018 0.112
(0.072) 0.075 0.048 (0.052) 0.055 0.048 (0.024) 0.024 0.054 (0.022) 0.022 0.050
MTE(x̄, 0.5) 0.29 0.005 0.035 0.086 0.006 0.020 0.098 0.003 0.028 0.070 0.009 0.017 0.076
(0.039) 0.037 0.066 (0.022) 0.022 0.074 (0.029) 0.030 0.062 (0.016) 0.017 0.092
MTE(x̄, 0.75) 0.19 -0.003 0.079 0.030 -0.008 0.046 0.050 -0.002 0.023 0.064 -0.002 0.019 0.072
(0.073) 0.074 0.044 (0.046) 0.046 0.052 (0.023) 0.021 0.082 (0.021) 0.019 0.076
MTE(x̄, 0.9) 0.19 -0.014 0.191 0.050 -0.028 0.093 0.054 -0.007 0.060 0.026 -0.018 0.040 0.064
(0.184) 0.186 0.056 (0.090) 0.092 0.056 (0.055) 0.055 0.056 (0.037) 0.036 0.100
MTE(x̄, 0.95) 0.20 -0.019 0.241 0.054 -0.037 0.114 0.054 -0.009 0.079 0.022 -0.025 0.051 0.060
(0.234) 0.236 0.058 (0.110) 0.112 0.058 (0.073) 0.073 0.042 (0.046) 0.045 0.096

ATE(x̄) 0.37 -0.004 0.071 0.050 -0.011 0.033 0.074 0.000 0.015 0.060 -0.003 0.011 0.108
0.36 (0.069) 0.073 0.044 (0.035) 0.038 0.040 (0.016) 0.016 0.056 (0.013) 0.013 0.054

Note: Monte Carlo experiments described in section B, where exp and exp2 are omitted from the model. Standard deviations in
parentheses. Results based on bootstrapped standard errors (underlined in the original table) appear on the second line of
each entry; results on the first line are analytic. 500 replications, each with a
sample size of 10,000 in 10 districts. Error structure is polynomial.
