Exploring Marginal Treatment Effects: Flexible Estimation Using Stata
Abstract
In settings that exhibit selection on both levels and gains, Marginal
Treatment Effects (MTE) allow us to go beyond local average treatment
effects and estimate the whole distribution of effects. This paper surveys
the theory behind Marginal Treatment Effects and introduces the new Stata
package mtefe that uses several estimation methods to estimate MTE mod-
els. This package provides important improvements and flexibility over ex-
isting packages such as margte (Brave and Walstrum, 2014), and calculates
various treatment effect parameters based on the results. Use of the package
is illustrated using a range of examples.
∗
Statistics Norway. E-mail to [email protected]
1 Introduction¹
Well-known instrumental variables methods solve problems of selection on levels,
estimating local average treatment effects for instrument compliers even with non-
random selection into treatment. In the more reasonable case with selection into
treatment based both on levels and gains, however, local average treatment effects
may represent average treatment effects for very particular subpopulations, and we
learn little about the distribution of treatment effects in the population at large.
Marginal treatment effects allow us to go beyond local average treatment ef-
fects in settings that exhibit this sort of selection. A marginal treatment effect
is the average treatment effect for people with a particular resistance to treat-
ment, or alternatively at a particular margin of indifference. Thus, MTEs capture
heterogeneity in the treatment effect along the unobserved dimension we call re-
sistance to treatment. This is precisely what generates selection on unobserved
gains: People who choose treatment because they have a particularly low resis-
tance might have different gains than those with high resistance. Usually at the
cost of stronger assumptions than required under standard IV, we can estimate
the full distribution of (marginal) treatment effects and back out the parameters of
interest. These include average treatment effects, average treatment effects on the
treated and untreated and policy-relevant treatment effects from a hypothetical
policy that shifts the propensity to choose treatment.
The most common way of estimating MTEs is via the method of Local In-
strumental Variables (Heckman and Vytlacil, 2007). Alternatively, MTEs can be
estimated using the separate approach (Brinch et al., 2017; Heckman and Vytlacil,
2007) or, in the baseline case with joint normal errors, maximum likelihood. When
estimating MTEs using the separate approach or maximum likelihood, similarities
to selection models are apparent: We are basically estimating a traditional selec-
tion model. Marginal Treatment Effects thus represent old ideas in new wrapping,
but still provide substantial contributions to the interpretation and estimation
of models in settings with essential heterogeneity. Unfortunately, commands for
flexibly estimating MTEs and using their output are not readily available in popular
software such as Stata. Brave and Walstrum (2014) document the package margte
that estimates MTEs, but this package has some important limitations.
1
The author wishes to thank Edwin Leuven, Christian Brinch, Magne Mogstad, Thomas
Cornelissen, Martin Hüber, Katrine Vetlesen Løken, Simen Markussen, Scott Brave, Thomas
Walstrum, Yiannis Spanos and Jonathan Norris for advice, comments, help and testing at various
stages of this project. Brave and Walstrum also deserve huge thanks for the Stata package margte,
which clearly inspired mtefe. All errors remain my own. While carrying out this research I have
been associated with the center for Equality, Social Organization, and Performance (ESOP) at
the Department of Economics at the University of Oslo. ESOP is supported by the Research
Council of Norway through its Centres of Excellence funding scheme, project number 179552.
This paper presents the new Stata package mtefe, which estimates parametric
and semiparametric marginal treatment effect models. These include the joint
normal, polynomial and semiparametric models of Brave and Walstrum (2014),
and additionally allow spline functions in the MTE for increased
flexibility. mtefe can estimate all models using either local IV or the separate
approach, as well as maximum likelihood for the joint normal model, while the
estimation method in margte depends on the model. Furthermore, mtefe allows
for fixed effects using Stata’s categorical variables, which is important for isolating
exogenous variation in many applications and provides gains in computational speed over
generating dummies manually. mtefe also supports frequency and probability weights,
which are important for obtaining population estimates in many data sets.
In addition, mtefe exploits the full potential of marginal treatment effects by
calculating treatment effect parameters as weighted averages of the MTE curve,
shedding light on why for example the LATE differs from the ATE. Other im-
provements include calculation of analytic standard errors (that admittedly ignore
the uncertainty in the propensity score estimation), large improvements in com-
putational speed when estimating semiparametric models and reestimation of the
propensity score when bootstrapping for more appropriate inference.
Although MTEs are quickly becoming part of the toolbox of the applied econometrician,
applied work using marginal treatment effects often imposes restrictive parametric
assumptions. The Monte Carlo simulations in Appendix B illustrate how
conclusions might be sensitive to these choices, at least under the particular data
generating processes used. This illustrates the importance of correct functional
form. Researchers using marginal treatment effects should probe the robustness of
their results to functional form assumptions. They should also base the choice of
functional form on detailed knowledge of the case at hand, on what might constitute
the unobservables, and on sound economic arguments for the nature of the
relationship between these unobservables and the outcome.
This paper proceeds as follows: Section 2 reviews the theory behind MTEs,
identifying assumptions, estimation methods and derivation of treatment parameter
weights. Section 3 presents the mtefe command, its most important options
and a range of examples of use, while Section 4 concludes. Appendix A details
the estimation algorithm of mtefe, while Appendix B contains the Monte Carlo
simulations.
Yj = µj (X) + Uj for j = 0, 1 (1)
Y = DY1 + (1 − D)Y0 (2)
D = 1 [µD (Z) > V ] where Z = (X, Z− ) (3)
Y1 and Y0 are the potential outcomes in the treated and untreated state, such as
log wages with and without a college degree. They are both modeled as functions
of observables X, which may contain fixed effects.
Equation 3 is the selection equation, and can be interpreted as a latent index,
since 1 is the indicator function. It is a reduced form way of modeling selection
into treatment as a function of observables X and instruments Z− that affect the
probability of treatment, but not the potential outcomes.
The unobservable V in the choice equation is a negative shock to the latent
index determining treatment. It is often interpreted as unobserved resistance to or
negative preference for treatment. As long as the unobservable V has a continuous
distribution, we can rewrite the selection equation as P (Z) > UD , where UD
represents the quantiles of V and P (Z) the propensity score. UD , by construction,
has a uniform distribution in the population.²
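The normalization can be made explicit. The following sketch assumes that V has a strictly increasing distribution function conditional on X, written F_{V|X} here (notation introduced for this illustration only), and uses Assumption 1 below for the last equality.

```latex
\begin{align*}
D = 1 &\iff \mu_D(Z) > V \\
      &\iff F_{V \mid X}\bigl(\mu_D(Z)\bigr) > F_{V \mid X}(V) \\
      &\iff P(Z) > U_D,
\end{align*}
where $U_D \equiv F_{V \mid X}(V)$ is uniform on $(0, 1)$ by the probability
integral transform, and
$P(Z) \equiv F_{V \mid X}\bigl(\mu_D(Z)\bigr) = \Pr(D = 1 \mid Z)$.
```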
Identification of this model requires the following assumption:
Assumption 1: Conditional independence (U0 , U1 , V ) ⊥ Z− | X
The model described by Equations 1-3 together with Assumption 1 implies and
is implied by the standard assumptions in Imbens and Angrist (1994) necessary
to interpret IV as a local average treatment effect. Vytlacil (2002) shows how the
standard IV assumptions of relevance, exclusion and monotonicity³ are equivalent
to some representation of a choice equation as in Equation 3. Thus, the model
described above is no more restrictive than the LATE model used in standard IV
analysis.
In principle, it is possible to estimate marginal treatment effects with no further
assumptions. However, this requires full support of the propensity scores in both
treated and untreated samples for all values of X. In practice, this is rarely feasible,
as shown for example in Carneiro et al. (2011). Instead, most applied papers
(Carneiro et al., 2011; Carneiro and Lee, 2009; Cornelissen et al., 2016b; Felfe and
Lalive, 2017; Maestas et al., 2013) proceed by imposing a stronger assumption:
Assumption 2: Separability E(Uj | X, V ) = E(Uj | V )
2
This normalization also ensures that UD is uniform within cells of X. For details, see
Matzkin (2007) and Mogstad and Torgovitsky (2018).
3
Or rather, uniformity, as it is a condition across people, not across realizations of Z: This
assumption requires that Pr(D = 1|Z = z) ≥ Pr(D = 1|Z = z′) or the other way around for all
people, but it does not require full monotonicity in the classical sense.
This can be a strong assumption, and must be carefully evaluated by the researcher
for each application. Nonetheless, it is clearly less restrictive than for example a
joint normal distribution of (U0 , U1 , V ), as assumed by traditional selection models.
This paper, along with the applied MTE literature⁴ and the mtefe command,
proceeds by imposing this stronger assumption, as well as working with linear
versions of µj (X) = Xβj and µD (Z) = γZ.
This assumption has two important consequences. First, the MTE is identified
over the common support, unconditional on X. Second, the marginal treatment
effects are additively separable in UD and X. This implies that the shape of the
MTE is independent of X: The intercept, but not the slope, is a function of X.⁵
Using this model, the return to treatment is simply the difference between the
outcomes in the treated and untreated states. Marginal Treatment Effects were
introduced by Björklund and Moffitt (1987), later generalized by Heckman and
coauthors (1999; 2001; 2005; 2007), and using the above assumption and index
structure can be defined as
\[
\mathrm{MTE}(x, u) \equiv E(Y_1 - Y_0 \mid X = x, U_D = u)
= \underbrace{x(\beta_1 - \beta_0)}_{\text{heterogeneity in observables}}
+ \underbrace{E(U_1 - U_0 \mid U_D = u)}_{k(u)\text{: heterogeneity in unobservables}} \tag{4}
\]
They measure average gains in outcomes for people with particular values of X
and the unobserved resistance to treatment UD . Alternatively, the MTE can be
interpreted as the mean return to treatment for individuals at a particular margin
of indifference. The above expression shows how the separability assumption allows
us to separate the treatment effect into one part varying with observables and
another part that varies across the unobserved resistance to treatment.
MTEs are closely related to local average treatment effects. A local average
treatment effect (LATE) is the average effect of treatment for people who are
shifted into (or out of) treatment when the instrument is exogenously shifted
from z to z′. In the above choice model, these people have UD in the interval
(P(z), P(z′)). Note how, when z − z′ is infinitesimally small, so that P(z′) → P(z),
the LATE converges to the MTE. A marginal treatment effect is thus a limit form
of LATE (Heckman and Vytlacil, 2007).
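The limit argument can be written out as follows. This is a sketch under the model in Equations 1-3, taking P(z) < P(z′) and an MTE that is continuous in u; the notation LATE(x; z, z′) is introduced here for illustration only.

```latex
\begin{align*}
\mathrm{LATE}(x; z, z') &= E\bigl(Y_1 - Y_0 \mid X = x,\; P(z) \le U_D < P(z')\bigr) \\
  &= \frac{1}{P(z') - P(z)} \int_{P(z)}^{P(z')} \mathrm{MTE}(x, u)\, du, \\
\lim_{P(z') \to P(z)} \mathrm{LATE}(x; z, z') &= \mathrm{MTE}\bigl(x, P(z)\bigr).
\end{align*}
```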
4
Some papers apply a stronger full independence assumption: (U0 , U1 , V ) ⊥ X . This implies
the separability in Assumption 2, but is stricter than necessary. In particular, the separability
assumption places no restrictions on the dependence between V and X. See also Mogstad and
Torgovitsky (2018, in particular section 6) for a review and a detailed discussion of the differences
between these assumptions.
5
Note, however, that you are free to specify fixed effects, so that a fully saturated model may
be specified.
[Figure (graphic not reproduced): two panels, each with horizontal axis "Unobserved resistance, propensity scores" running from 0 to 1, with P(z) and P(z′) marked.]
it can be shown that
treated state for a given value of u. Likewise, the dashed line depicts the expected
value of the outcome in the untreated state. Imagine next comparing two particular
values of the instrument, z vs. z′. Individuals with Z = z have a predicted
propensity score of P(z) = 0.4, while individuals with Z = z′ have P(z′) = 0.8. We will
have both treated and untreated individuals in both these groups. For untreated
individuals with Z = z, we know that their resistance UD > P(z) = 0.4. The
average outcome among these people therefore informs us about the average of the
dashed line between 0.4 and 1, depicted in blue. In contrast, the treated individuals
with Z = z have resistance UD ≤ P(z), and therefore inform us about the average
of Y1 for people with UD between 0 and 0.4, depicted in red. Likewise, the people
with Z = z′ inform us about the average of Y1 from 0 to 0.8 (orange) and the
average of Y0 from 0.8 to 1 (green). With a continuous instrument, we will have
propensity scores all over the (0, 1) interval, and we can identify nonlinear functions
in u, but this illustration shows that a binary instrument identifies a linear MTE
model: We need only four averages from the groups comprising the treated and
untreated individuals with the instrument switched on and off (Brinch et al., 2017).
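To see why four averages suffice, consider a sketch with no covariates and linear k_j(u) = π_j(u − 1/2), which is the L = 1 case of the polynomial model in Table 1. The intercepts μ_1, μ_0 and slopes π_1, π_0 are notation introduced for this illustration.

```latex
\begin{align*}
E(Y_1 \mid D = 1, P = p) &= \mu_1 + E(U_1 \mid U_D \le p) = \mu_1 + \pi_1 \frac{p - 1}{2}, \\
E(Y_0 \mid D = 0, P = p) &= \mu_0 + E(U_0 \mid U_D > p) = \mu_0 + \pi_0 \frac{p}{2}.
\end{align*}
Differencing each line between $p = 0.8$ and $p = 0.4$ identifies the slopes,
\begin{align*}
\pi_1 &= \frac{E(Y_1 \mid D = 1, P = 0.8) - E(Y_1 \mid D = 1, P = 0.4)}{0.2}, \\
\pi_0 &= \frac{E(Y_0 \mid D = 0, P = 0.8) - E(Y_0 \mid D = 0, P = 0.4)}{0.2},
\end{align*}
after which either average in each state gives the intercept, and
\[
\mathrm{MTE}(u) = (\mu_1 - \mu_0) + (\pi_1 - \pi_0)\Bigl(u - \tfrac{1}{2}\Bigr).
\]
```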
In practice, this means specifying some function for the conditional expectations
of the error terms, and then estimating the conditional expectations of Y1
and Y0 in the samples of treated and untreated separately:
\begin{align}
E(Y_1 \mid X = x, D = 1, P(Z) = p) &= x\beta_1 + E(U_1 \mid U_D \le p) = x\beta_1 + K_1(p) \tag{7} \\
E(Y_0 \mid X = x, D = 0, P(Z) = p) &= x\beta_0 + E(U_0 \mid U_D > p) = x\beta_0 + K_0(p) \tag{8}
\end{align}
where I follow the notation in Brinch et al. (2017) and control for selection via the
control functions Kj(p). Notice that this is the specification we are using when
estimating selection models: Depending on the specification of Kj(p), this amounts
to, for example, a Heckman selection model or a semiparametric selection model.
To estimate the MTE, we estimate the conditional expectation of Y in the
samples of treated and untreated separately using the regression
\[
Y_j = X\beta_j + K_j(p) + \epsilon_j \quad \text{for } j = 0, 1 \tag{9}
\]
Based on the assumptions on the unknown functions kj(u), we can infer the
functional form of the control function Kj(p) as summarized in Table 1. Alternatively,
when using semiparametric methods, we estimate Equations 7-8 using partially
linear models as summarized in Table 2. In the implementation, this is estimated
in a stacked regression to allow some coefficients to be restricted to be the same
in the treated and untreated state. After estimating Kj(p), we can construct the
kj(u) functions and find the MTE estimate from
\begin{align}
\mathrm{MTE}(x, u) &= E(Y_1 \mid X = x, U_D = u) - E(Y_0 \mid X = x, U_D = u) \tag{10} \\
  &= x(\beta_1 - \beta_0) + k_1(u) - k_0(u) \notag
\end{align}
where $k_j(u) = E(U_j \mid U_D = u)$.
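\begin{align}
T_A(x) &= E(Y_1 - Y_0 \mid A, X = x) = \int_0^1 \mathrm{MTE}(x, u_D)\, \omega_A(x, u_D)\, du_D \tag{11} \\
\text{where } \omega_A(x, u) &= f_{U_D \mid A, X = x}(u) \tag{12}
\end{align}
where f is the conditional density of UD. Because of the additive separability given
by Assumption 2 and the linear forms of µj(X), this simplifies to
\begin{align*}
T_A(x) &= x(\beta_1 - \beta_0) + \int_0^1 k(u_D)\, \omega_A(u_D)\, du_D \\
\text{where } \omega_A(u) &= f_{U_D \mid A}(u)
\end{align*}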
In principle, we can calculate the treatment effect parameters TA(x) for any
value of x, but we are usually interested in the unconditional parameter in the
population. In general, we need to integrate over all values of X,⁷ but because of
7
Note, however, that in cases where the propensity score model is misspecified, the weighted
average of the conditional weights may differ from the unconditional weights. See e.g. Carneiro
et al. (2011,1).
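Table 1: Parametric MTE models

Joint normal: $(U_0, U_1, V) \sim N(0, \Sigma)$, with
\[
\Sigma = \begin{pmatrix} \sigma_0^2 & & \\ \rho_{01} & \sigma_1^2 & \\ \rho_0 & \rho_1 & 1 \end{pmatrix}
\]
Polynomial: $k(u)$ or $k_j(u)$ as $L$th order polynomials with mean 0.
Polynomial with splines: $k(u)$ or $k_j(u)$ as $L$th order polynomials ($L \ge 2$) with $Q$ knots for quadratic and higher-order terms at $(h_1, \ldots, h_Q)$ and mean 0.

\[
\begin{array}{llll}
\hline
 & \text{Definition} & \text{Joint normal} & \text{Polynomial} \\
\hline
k(u) & E(U_1 - U_0 \mid U_D = u) & (\rho_1 - \rho_0)\,\Phi^{-1}(u) & \sum_{l=1}^{L} \pi_l \left(u^l - \frac{1}{l+1}\right) \\
K(p) & p\,E(U_1 - U_0 \mid U_D \le p) & -(\rho_1 - \rho_0)\,\phi(\Phi^{-1}(p)) & \sum_{l=1}^{L} \pi_l\, \frac{p(p^l - 1)}{l+1} \\
k_1(u) & E(U_1 \mid U_D = u) & \rho_1\,\Phi^{-1}(u) & \sum_{l=1}^{L} \pi_{1l} \left(u^l - \frac{1}{l+1}\right) \\
k_0(u) & E(U_0 \mid U_D = u) & \rho_0\,\Phi^{-1}(u) & \sum_{l=1}^{L} \pi_{0l} \left(u^l - \frac{1}{l+1}\right) \\
K_1(p) & E(U_1 \mid U_D \le p) & -\rho_1\, \frac{\phi(\Phi^{-1}(p))}{p} & \sum_{l=1}^{L} \pi_{1l}\, \frac{p^l - 1}{l+1} \\
K_0(p) & E(U_0 \mid U_D > p) & \rho_0\, \frac{\phi(\Phi^{-1}(p))}{1-p} & \sum_{l=1}^{L} \pi_{0l}\, \frac{p(1 - p^l)}{(1-p)(l+1)} \\
\hline
\end{array}
\]

Polynomial with splines: each entry equals the corresponding polynomial entry above plus the following spline term.

\[
\begin{array}{ll}
\hline
 & \text{Additional spline term} \\
\hline
k(u) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{l}^{q} \left( (u - h_q)^l\, 1(u \ge h_q) - \frac{(1-h_q)^{l+1}}{l+1} \right) \\
K(p) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{l}^{q}\, \frac{1(p \ge h_q)(p - h_q)^{l+1} - (1-h_q)^{l+1}\, p}{l+1} \\
k_1(u) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{1l}^{q} \left( 1(u \ge h_q)(u - h_q)^l - \frac{(1-h_q)^{l+1}}{l+1} \right) \\
k_0(u) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{0l}^{q} \left( 1(u \ge h_q)(u - h_q)^l - \frac{(1-h_q)^{l+1}}{l+1} \right) \\
K_1(p) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{1l}^{q}\, \frac{1(p \ge h_q)(p - h_q)^{l+1} - p\,(1-h_q)^{l+1}}{p\,(l+1)} \\
K_0(p) & \sum_{l=2}^{L}\sum_{q=1}^{Q} \pi_{0l}^{q}\, \frac{(1-h_q)^{l+1}\, p - 1(p \ge h_q)(p - h_q)^{l+1}}{(1-p)(l+1)} \\
\hline
\end{array}
\]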
\[
\begin{array}{lll}
\hline
\text{Step} & \text{Local IV} & \text{Separate approach} \\
\hline
\text{Construct residual} & \tilde{Y} = Y - X\widehat{\beta_0} - X(\widehat{\beta_1 - \beta_0})\,p & \tilde{Y} = Y - X\widehat{\beta_0} - X(\widehat{\beta_1 - \beta_0})\,D \\
\text{Construct } \widehat{k} & \widehat{k}(u) = \widehat{K}'(u) & \widehat{k}_1(u) = \widehat{K}_1(u) + u\,\widehat{K}_1'(u) \\
 & & \widehat{k}_0(u) = \widehat{K}_0(u) - (1 - u)\,\widehat{K}_0'(u) \\
\text{Construct MTE} & \widehat{\mathrm{MTE}}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \widehat{k}(u) & \widehat{\mathrm{MTE}}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \widehat{k}_1(u) - \widehat{k}_0(u) \\
\hline
\end{array}
\]
Note: Steps in the estimation of semiparametric MTE models using Local IV or the separate approach. To see the relation between kj(u) and Kj(p), note that
$K_1(p) = E(U_1 \mid U_D \le p) = \frac{1}{p}\int_0^p E(U_1 \mid U_D = u)\,du \Rightarrow K_1'(p) = -\frac{1}{p}K_1(p) + \frac{1}{p}k_1(p)$, which leads to $k_1(u) = K_1(u) + uK_1'(u)$. We can find similar expressions for
k0(u).
In principle it is possible to combine the semiparametric and the polynomial approach by first estimating the β coefficients from the polynomial model and
then using semiparametric methods to find K. This is the semiparametric polynomial MTE model, implemented if polynomial() and semiparametric are
specified together in mtefe. Although computationally far less complex, there is little theoretical reason to think that the semiparametric estimate of this model should be
any better than the MTE constructed from the parametric estimates.
Table 3: Unconditional treatment effect parameters and weights
Note: Weights for common treatment effect parameters. Discrete distribution of UD with s
points.
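\begin{align}
T_A &= E(Y_1 - Y_0 \mid A) = x_A(\beta_1 - \beta_0) + \int_0^1 k(u_D)\, \omega_A(u_D)\, du_D \tag{13} \\
\text{where } \omega_A(u) &= f_{U_D \mid A}(u) \tag{14} \\
\text{and } x_A &= E(X \mid A) \tag{15}
\end{align}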
probability to be untreated. ωATUT weights high values of UD more because these
people have high resistance.
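For reference, the weights for the ATE, ATT and ATUT can be written in terms of the distribution of the propensity score P = P(Z). These are the standard continuous-case expressions, stated here as a sketch; the discrete versions computed by mtefe are those in Table 3.

```latex
\begin{align*}
\omega_{ATE}(u)  &= 1, \\
\omega_{ATT}(u)  &= \frac{\Pr(P > u)}{E(P)}, \\
\omega_{ATUT}(u) &= \frac{\Pr(P \le u)}{E(1 - P)}.
\end{align*}
```

Because Pr(P > u) is decreasing in u, the ATT weights fall with resistance, while the ATUT weights rise with it. Each set of weights integrates to one over the unit interval.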
A local average treatment effect (LATE) is the average effect of treatment for
people who are shifted into (or out of) treatment when the instrument is shifted
from z to z′. These are people with UD in the interval (P(z), P(z′)): To be in the
complier group, an individual must have resistance between P(z) and P(z′). Note
how, when z − z′ is infinitesimally small, the LATE converges to the MTE, and a
marginal treatment effect is thus a limit form of LATE (Heckman and Vytlacil,
2007).
Similar to the way we can estimate the probability of being a complier in a
traditional IV analysis, we can estimate the weights of the linear IV or 2SLS
parameter. With a continuous instrument, IV is a weighted average over all possible
LATEs comprising all z − z′ pairs (Angrist and Imbens, 1995). Heckman and
Vytlacil (2007) show the derivation of these weights; see also Section A.6 in the
appendix and Cornelissen et al. (2016a). In sum, the IV parameter uses a weighted
average of X, in which people with large positive or negative values of υ̂, a measure
of how much the instrument affects propensity scores for each individual, are
given more weight. These are precisely the people who have higher probabilities
of having their treatment status determined by the instrument. Likewise, values
of UD where people have υ̂ above the average, and thus are more likely to be
compliers, get a higher weight ωIV(u).
Assuming policy invariance (see Heckman and Vytlacil (2007) for a formal
definition), we can calculate the policy-relevant treatment effect (PRTE) for a
counterfactual policy that manipulates propensity scores. This is the expected
treatment effect for the people that are shifted into treatment by the new policy
relative to the baseline. If the policy is a shift in the instruments themselves, it is
natural to use the estimated first stage to evaluate the shift in propensity scores,
but propensity scores could be manipulated directly as well. Note that if the policy
is a particular set of instrument values z′ and the baseline another set z, PRTE
and LATE are the same. In practice, the PRTE parameter weights the treatment
effect of people who are affected more strongly by the alternative policy relative
to the baseline.
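In terms of weights, the PRTE can be sketched as follows, using the standard continuous-case expression. Here P is the baseline propensity score, P′ the propensity score under the alternative policy, and F denotes a distribution function; P′ and F are notation introduced for this illustration, and the expression assumes E(P′) ≠ E(P).

```latex
\[
\omega_{PRTE}(u) = \frac{F_{P}(u) - F_{P'}(u)}{E(P') - E(P)}
\]
```

Values of u at which the policy moves the most propensity scores from below u to above u receive the highest weight.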
Lastly, we can use the estimated MTEs to calculate Marginal Policy Relevant
Treatment Effects (MPRTE), which can be interpreted as average effects of making
marginal shifts to the propensity scores. MPRTEs are fundamentally easier to
identify than PRTEs (Carneiro et al., 2010), in particular because they do not
require full support, as marginal changes to propensity scores will not drive the
scores outside the common support.
Carneiro et al. (2011) suggest three ways to define distance to the margin.
The first MPRTE, labeled MPRTE1 in Table 3, defines the distance in terms of
the difference between the index γZ and the resistance V, and corresponds to
a marginal change in a variable entering the first stage, such as an instrument.
MPRTE2 defines the margin as having propensity scores close to the normalized
resistance UD, and corresponds to a policy that would increase all propensity scores
by a small amount. MPRTE3 defines marginal as the relative distance between
the propensity score and UD, and corresponds to a policy that increases all propensity
scores by a small fraction.
What we can see from the expressions in Table 3 is that an absolute shift in
propensity scores uses the observed density of the propensity scores as the weight
distribution, while a relative shift unsurprisingly will place more weight on the
upper part of the UD distribution, precisely because a relative shift increases the propensity
scores of people with high initial propensity scores more.
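The absolute and relative shifts can be sketched using the continuous-case weights in Carneiro et al. (2011), where f_P denotes the density of the propensity score and α a small policy parameter. Both symbols are notation for this illustration, and the discrete versions computed by mtefe are those in Table 3.

```latex
\begin{align*}
\text{MPRTE2: } P_\alpha &= P + \alpha,      & \omega_{MPRTE2}(u) &= f_P(u), \\
\text{MPRTE3: } P_\alpha &= (1 + \alpha)\,P, & \omega_{MPRTE3}(u) &= \frac{u\, f_P(u)}{E(P)}.
\end{align*}
```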
splines(numlist ) adds second order and higher splines to k(u) at the points
specified by numlist. Use only for polynomial models of degree ≥ 2. All
points in numlist must be in the interval (0, 1).
separate estimates the model using the separate approach rather than local IV.
mlikelihood estimates the model using maximum likelihood rather than local
IV. Only appropriate for the joint normal model.
link(lpm|probit|logit) specifies the first stage model. Default is probit.
restricted(varlist ) specifies that variables in varlist are included in both first
and second stage, but are restricted to have the same effect in the two states.
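As an illustration, the options above might be combined as follows, using the variables from the simulated example data used later in the paper. Only options documented in this paper appear; the knot locations are arbitrary choices for the sketch, and the help file remains the authority on the exact syntax.

```stata
* Second-order polynomial MTE with spline knots, estimated by the separate approach
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) splines(0.25 0.5 0.75) separate

* Joint normal model (the default model) estimated by maximum likelihood
mtefe lwage exp exp2 i.district (col=distCol), mlikelihood

* Joint normal model estimated by local IV with a logit first stage
mtefe lwage exp exp2 i.district (col=distCol), link(logit)
```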
The command allows a range of other options, for which I refer the reader to the
help file. The mtefe package also contains the command mtefeplot, which
plots one or more marginal treatment effect curves, optionally including treatment
parameter weights, based on stored or saved MTE estimates from mtefe. The
command mtefe_gendata generates the data in the following examples and Monte
Carlo simulations.
By default, mtefe reports analytical standard errors for the coefficients, treat-
ment effect parameters and MTEs. These ignore the uncertainty in the estimation
of the propensity scores by treating these as fixed in the second stage of the esti-
mation, as well as the means of X and the treatment effect parameter weights that
are used for estimating treatment effects. For matching, Abadie and Imbens (2016)
show that ignoring the uncertainty in the propensity score increases the standard
errors for the average treatment effect, while the impact on other treatment ef-
fect parameters is ambiguous. To the best of my knowledge, we do not know
how this omission affects the standard errors in MTE applications, and careful
researchers should therefore bootstrap the standard errors using the bootreps()
option, which re-estimates the propensity scores, the mean of X and the treatment
effect parameter weights for each bootstrap repetition.
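A minimal sketch of such a call is shown below. It assumes that bootreps() takes the number of bootstrap repetitions as its argument; the value 100 and the choice of a second-order polynomial model are arbitrary for the illustration.

```stata
* Bootstrapped standard errors, re-estimating the propensity score in each repetition
mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) bootreps(100)
```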
doubt the exclusion restriction for this instrument (Carneiro and Heckman, 2002).
To fix ideas, suppose that the average distance to college in a district, a measure
of rurality, is correlated with average outcomes in the labor market. For example,
more rural labor markets could provide worse average employment opportunities,
particularly for college jobs. If these differences work at the district level, how-
ever, the instrument is valid conditional on district fixed effects that control for
the average variation in distance to college. Thus, the remaining within-district
variation in distance to college is a valid instrument for college attendance.
To implement this thought experiment, I draw the average labor market quality
for college and non-college jobs and the average distance to college for each district
from a joint normal distribution. The observed distance to college is equal to
this district-level average plus some random normal variation, so that the within-
district variation in distance is a valid instrument. In addition, I generate the error
terms U0 , U1 and V from either a joint normal or a polynomial error structure,
where the three errors are correlated and thus generate selection on both levels
and gains. Controls X include experience uniformly distributed on (0, 30) and
its square. These affect both the selection equation and the outcomes in the two
states. The full data generating process is described in Appendix B.
Using the data generating process outlined above, the code in Figure 2 uses
mtefe_gendata and mtefe to generate data with a normal error structure and
estimate the joint normal MTE model using local IV. mtefe first reports the
estimated coefficients β0, β1 − β0 and ρ1 − ρ0, then the MTE estimates for each point of
support and the treatment effect parameters, as shown in Figure 2. Lastly, mtefe
reports the p-values for two statistical tests: A joint test of the β1 − β0, which can
be interpreted as a test of whether the treatment effect differs across X, and a test
of essential heterogeneity. The latter is a joint test of all coefficients in k(u).⁸
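For the joint normal model, the MTE curve at mean X can be traced out from two reported numbers. Table 1 gives k(u) = (ρ1 − ρ0)Φ⁻¹(u), and since k(u) has mean zero, the MTE at mean X equals the ATE plus (ρ1 − ρ0)Φ⁻¹(u). The sketch below plugs in the ate and mills estimates from Figure 2, on the assumption that the mills coefficient is the estimate of ρ1 − ρ0; invnormal() is Stata's inverse standard normal distribution function.

```stata
* MTE at mean X for low (u = 0.1), median (u = 0.5) and high (u = 0.9) resistance
display .3283373 + (-.4790282)*invnormal(0.1)
display .3283373 + (-.4790282)*invnormal(0.5)
display .3283373 + (-.4790282)*invnormal(0.9)

* The whole curve over the interior of the unit interval
twoway function y = .3283373 - .4790282*invnormal(x), range(0.01 0.99)
```

The three values are roughly 0.94, 0.33 and −0.29, which is a downward-sloping MTE curve: gains are largest for people with low resistance to treatment.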
Based on the output from mtefe it is straightforward to evaluate the impact
of covariates. Average differences in outcomes across covariates can be interpreted
directly from the β0, just like the coefficients on regular control variables. For instance, the coefficient
for experience in the first panel of the output table in Figure 2 indicates that one
more year of experience translates into approximately 3.6% higher wages at zero experience, although
the effect is nonlinear and we cannot say that it is the extra experience that causes
the higher wages without strong exogeneity assumptions on X.
Similarly, the β1 − β0 can be interpreted as differences in treatment effects
across covariate values, just like an interaction between treatment status and a
covariate in an OLS regression. The coefficient on experience in the second panel
of Figure 2 thus indicates that a person with one more year of experience has approximately 3.9%
lower gains from college at zero experience, but again we cannot give this a causal interpretation and
we have to account for the nonlinear effect.
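Because experience enters with a squared term, its implied marginal effect depends on the level of experience: the derivative of b_exp·exp + b_exp2·exp² with respect to exp is b_exp + 2·b_exp2·exp. The sketch below evaluates this at 15 years, an arbitrary illustrative value at the midpoint of the (0, 30) range, using the coefficients printed in Figure 2.

```stata
* Marginal effect of experience on the untreated outcome (beta0 panel) at exp = 15
display .0358398 + 2*(-.0008453)*15

* Marginal effect of experience on the gain from treatment (beta1-beta0 panel) at exp = 15
display -.0386384 + 2*(.0012967)*15
```

At 15 years the first expression is about 0.010 and the second is close to zero (about 0.0003), so the coefficient on the linear term alone describes the effect only at zero experience.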
8
Or, in the case of semiparametric models, a test of whether all MTEs are the same.
. mtefe_gendata, obs(10000) districts(10)
.
. mtefe lwage exp exp2 i.district (col=distCol)
Parametric normal MTE model Observations : 10000
Treatment model: Probit
Estimation method: Local IV
beta0
exp .0358398 .0064408 5.56 0.000 .0232145 .0484651
exp2 -.0008453 .0002019 -4.19 0.000 -.0012411 -.0004496
district
2 .2352456 .0680412 3.46 0.001 .1018712 .36862
3 .6294914 .0701091 8.98 0.000 .4920634 .7669194
4 .0131179 .0597721 0.22 0.826 -.1040474 .1302832
5 .0338606 .0705835 0.48 0.631 -.1044974 .1722186
6 .1699366 .0605086 2.81 0.005 .0513275 .2885458
7 -.1899241 .060115 -3.16 0.002 -.3077617 -.0720865
8 -.1842254 .0676843 -2.72 0.007 -.3169003 -.0515504
9 -.7908301 .0578436 -13.67 0.000 -.9042153 -.677445
10 -.4432749 .0597237 -7.42 0.000 -.5603455 -.3262044
beta1-beta0
exp -.0386384 .010241 -3.77 0.000 -.0587128 -.018564
exp2 .0012967 .0003288 3.94 0.000 .0006523 .0019412
district
2 .265112 .107039 2.48 0.013 .0552939 .4749301
(output omitted )
10 .3143661 .1072555 2.93 0.003 .1041237 .5246085
k
mills -.4790282 .0611081 -7.84 0.000 -.5988124 -.359244
effects
ate .3283373 .0242932 13.52 0.000 .2807177 .3759568
att .5369432 .0388809 13.81 0.000 .4607287 .6131576
atut .1195067 .0384691 3.11 0.002 .0440995 .194914
late .3279726 .0245142 13.38 0.000 .2799198 .3760254
mprte1 .3463148 .0256971 13.48 0.000 .2959433 .3966862
mprte2 .3309428 .024298 13.62 0.000 .2833137 .3785719
mprte3 -.016257 .0498984 -0.33 0.745 -.1140679 .0815538
Note: Analytical standard errors ignore the facts that the propensity score,
(output omitted )
model using the separate option to use the separate approach rather than local
IV, we can plot the resulting potential outcomes using the separate option of
mtefeplot. As the MTE is simply the difference between Y1 and Y0 , we can inves-
tigate whether the downward sloping trend in the MTE is generated by upward
sloping Y0 , downward sloping Y1 , or a combination. From Figure 3d we see that Y0
is relatively flat while Y1 is clearly downward sloping. This indicates that people
who have low resistance to treatment do much better than their high-resistance
counterparts with a college degree, but relatively similar without. Therefore, these
people have higher effects of treatment.
As a last example, consider a hypothetical policy that mandates a maximum
distance of 40 miles to the closest college. To estimate the effect of such a policy,
we predict the propensity scores from the probit model using the adjusted distance
to college:
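. * First stage: probit of college on the instrument, controls and district fixed effects
. qui probit col distCol exp exp2 i.district
. * Cap distance at 40 miles, predict the policy propensity scores, then restore the data
. gen temp=distCol
. replace distCol=40 if distCol>40
. predict double p_policy
. replace distCol=temp
. drop temp
. * Second-order polynomial MTE model with the PRTE based on p_policy
. mtefe lwage exp exp2 i.district (col=distCol), pol(2) prte(p_policy)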
(output omitted )
Results of this exercise are depicted in Figure 3f. First, note how the MTE curve
for the policy compliers practically overlaps the MTE curve at the mean. Policy
compliers do not seem to have X-values that give them different gains from college
than the average. Next, notice the distribution of weights: The compliers to the
policy come exclusively from the lower part of the UD distribution. This is not
surprising, given that the people most affected by the reform are people with low
propensity scores (driven by high distance to college) before the reform. To be
affected by the reform, these people have to be untreated under the baseline, and
thus have UD's above those low propensity scores. Compared to the people with
higher propensity scores, the average UD for these people is relatively low,
generating high weights on the lower part of the UD distribution. Thus, the potential
compliers to the policy have low UD's, and consequently high treatment effects.
Therefore, the expected gain from the reform is larger than the average treatment
effect.
Lastly, we might be interested in the pattern of selection on observable gains. Is
it the case that covariates that positively impact treatment choices also positively
impact the gains from treatment? This is the case if γ × (β1 − β0 ) > 0, which
happens either if both coefficients are positive or both are negative. As an example,
there is positive selection on the covariate experience in the example reported in
Figure 2: More experience is associated both with less college (not shown) and
with lower treatment effects.
. mtefe lwage exp exp2 i.district (col=distCol)
. est sto normal
. mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) separate
. est sto polynomial
. mtefe lwage exp exp2 i.district (col=distCol), semiparametric gridpoints(100)
. est sto semipar
. mtefeplot normal polynomial semipar, memory names("Normal" "Polynomial" "Semiparametric")
. mtefeplot polynomial, separate
. mtefeplot polynomial, late memory
[Figure 3. Panels: (a) Common support plot, probit; (b) Estimated MTE curve,
normal; (c) MTE estimates, different models; (d) MTEs and potential outcomes,
polynomial; (e) MTE for compliers with LATE weights; (f) Estimated effects of a
hypothetical policy. Horizontal axis: propensity score in (a), unobserved
resistance to treatment in (b)-(f).]
$$
\begin{aligned}
E(Y_1 - Y_0 \mid X, D, p)
&= x(\beta_1 - \beta_0) + D\,E(U_1 - U_0 \mid U_D \le p) + (1 - D)\,E(U_1 - U_0 \mid U_D > p) \\
&= x(\beta_1 - \beta_0) + K(p)\,\frac{D - p}{p(1 - p)}
\end{aligned}
\tag{16}
$$
where I use the fact that both U_1 and U_0 are normalized to have mean zero.
As all these objects are estimated by mtefe, we can predict treatment effects for
each individual. Use the options savekp and savepropensity() to save the propensity
scores and the relevant variables of the K(p) function, and then compute expected
treatment effects from the expression above. The resulting variable contains the
expected treatment effect given treatment status, propensity score and X for each
individual. Averages of this predicted treatment effect among the treated and
the untreated closely match the ATT and ATUT, respectively.
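As an illustration of this calculation, the following Python sketch evaluates the expression in equation (16) for a few made-up individuals. It assumes a hypothetical linear MTE, k(u) = pi1 (u − 0.5), so that K(p) has a closed form; the function names, the value of pi1 and the data are all assumptions and do not correspond to variables saved by mtefe.

```python
# Sketch only: expected treatment effect from equation (16) under a
# hypothetical linear MTE, k(u) = pi1 * (u - 0.5), which has mean zero.

def K(p, pi1):
    """K(p) = integral of k(u) from 0 to p for the assumed linear k(u)."""
    return pi1 * (p * p - p) / 2.0

def expected_te(x_gain, d, p, pi1):
    """x(beta1 - beta0) + K(p) * (D - p) / (p * (1 - p))."""
    return x_gain + K(p, pi1) * (d - p) / (p * (1.0 - p))

# Hypothetical individuals: (x(beta1 - beta0), treatment status, propensity score)
people = [(0.40, 1, 0.30), (0.40, 0, 0.30), (0.25, 1, 0.80), (0.25, 0, 0.60)]
pi1 = -1.0  # assumed slope: gains fall with resistance to treatment

te = [expected_te(xg, d, p, pi1) for xg, d, p in people]
treated = [t for t, (_, d, _) in zip(te, people) if d == 1]
untreated = [t for t, (_, d, _) in zip(te, people) if d == 0]
att_like = sum(treated) / len(treated)        # average among the treated
atut_like = sum(untreated) / len(untreated)   # average among the untreated
```

With a downward-sloping k(u), the average among the treated exceeds the average among the untreated, which is the pattern of selection on unobserved gains.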
This procedure highlights the relationship between selection models and MTE.
Notice, however, that both the ATT and ATUT can be recovered without specifying
K_j(p); only the expected difference K(p) is needed.
When using the separate approach, potential outcomes can be predicted directly
because both K(p) and K_0(p) are estimated:
$$
\begin{aligned}
E(Y_1 \mid X, D, p) &= x\beta_1 + K_1(p)\,\frac{D - p}{1 - p} \\
E(Y_0 \mid X, D, p) &= x\beta_0 + K_0(p)\,\frac{p - D}{p}
\end{aligned}
\tag{17}
$$
where $K_1(p) = \dfrac{K(p) - (1 - p)K_0(p)}{p}$.
These expressions can be calculated by predicting K_j(p) and constructing the
expected value of the outcome for each individual from the expressions above. The
difference between these predicted outcomes closely matches the predicted treatment
effect from above, as well as the treatment effect parameters calculated by mtefe
as weighted averages over the appropriate MTE curves.
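The link between the two sets of expressions can be checked numerically. The Python sketch below uses hypothetical values for K(p), K_0(p), the propensity score and the linear indices (none of them come from mtefe output) and confirms that the difference between the predicted potential outcomes in (17) reproduces the expected treatment effect in (16) for both treatment states.

```python
# Sketch only: equation (17) with hypothetical values for K(p) and K0(p).

def k1_from(k, k0, p):
    """K1(p) = (K(p) - (1 - p) * K0(p)) / p."""
    return (k - (1.0 - p) * k0) / p

def expected_outcomes(xb1, xb0, d, p, k, k0):
    """Return (E[Y1 | X, D, p], E[Y0 | X, D, p]) from equation (17)."""
    k1 = k1_from(k, k0, p)
    ey1 = xb1 + k1 * (d - p) / (1.0 - p)
    ey0 = xb0 + k0 * (p - d) / p
    return ey1, ey0

def expected_te(xb1, xb0, d, p, k):
    """Equation (16): x(beta1 - beta0) + K(p) * (D - p) / (p * (1 - p))."""
    return (xb1 - xb0) + k * (d - p) / (p * (1.0 - p))

# Hypothetical inputs: x*beta1, x*beta0, propensity score, K(p), K0(p)
xb1, xb0, p, k, k0 = 3.9, 3.5, 0.3, 0.105, -0.08

gaps = []
for d in (0, 1):
    ey1, ey0 = expected_outcomes(xb1, xb0, d, p, k, k0)
    gaps.append(abs((ey1 - ey0) - expected_te(xb1, xb0, d, p, k)))
```

The entries of gaps are zero up to floating point error, because the K_0(p) terms cancel once K_1(p) is substituted.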
4 Conclusion
Marginal treatment effects are increasingly becoming a part of the toolbox of
the applied econometrician when the problem at hand exhibits selection on both
levels and unobserved gains. In contrast to traditional linear IV analysis, marginal
treatment effects uncover the distribution of treatment effects rather than a local
average treatment effect, which is often of little interest. In practice, this typically
comes at the cost of stricter assumptions.
This paper has outlined the marginal treatment effects framework, its relationship
to traditional selection models, and three different methods for estimating these
models. The Stata package mtefe documented here contains a range of improvements
over existing packages: support for fixed effects; estimation using local IV, the
separate approach and maximum likelihood; a larger number of parametric and
semiparametric MTE models; support for weights; speed improvements when running
semiparametric models; more appropriate bootstrap inference; and improved graphical
output. In addition, mtefe calculates common treatment effect parameters such as
local average treatment effects, average treatment effects on the treated and
untreated, and policy-relevant treatment effects for a user-specified shift in
propensity scores. This allows the user to exploit the potential of the marginal
treatment effects framework in understanding treatment effect heterogeneity.
The Monte Carlo simulations detailed in Appendix B show that marginal treatment
effect estimation can be sensitive to functional form specifications and that
wrong functional form assumptions may result in excessive rejection rates. It
should be noted that the two data-generating processes used and the functional
form choices made in these simulations are very restrictive, and it is perhaps not
surprising that a second-order polynomial cannot approximate a normal very well,
or the other way around. Nonetheless, they illustrate two things: First, applied
researchers should base the choice of functional form on detailed knowledge about
the case at hand, what might plausibly constitute the unobservable dimension, and
economic arguments for how these factors might affect the outcome. Second,
researchers should strive to probe the robustness of their results to the
functional form choices, using less restrictive models to guide the choice of
specification.
References
Abadie, A. and Imbens, G. W. (2016). Matching on the estimated propensity
score. Econometrica, 84(2):781–807.
Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of
average causal effects in models with variable treatment intensity. Journal
of the American Statistical Association, 90(430):431–442.
Björklund, A. and Moffitt, R. (1987). The estimation of wage gains and welfare
gains in self-selection. The Review of Economics and Statistics, 69(1):42–49.
Brave, S. and Walstrum, T. (2014). Estimating marginal treatment effects using
parametric and semiparametric methods. Stata Journal, 14(1):191–217.
Brinch, C. N., Mogstad, M., and Wiswall, M. (2017). Beyond LATE with a discrete
instrument. Journal of Political Economy, 125(4):985–1039.
Carneiro, P. and Heckman, J. J. (2002). The evidence on credit constraints in
post-secondary schooling. The Economic Journal, 112(482):705–734.
Carneiro, P., Heckman, J. J., and Vytlacil, E. (2010). Evaluating Marginal Policy
Changes and the Average Effect of Treatment for Individuals at the Margin.
Econometrica, 78(1):377–394.
Carneiro, P., Heckman, J. J., and Vytlacil, E. J. (2011). Estimating marginal
returns to education. American Economic Review, 101(6):2754–2781.
Carneiro, P. and Lee, S. (2009). Estimating distributions of potential outcomes
using local instrumental variables with an application to changes in college
enrollment and wage inequality. Journal of Econometrics, 149(2):191–208.
Carneiro, P., Lokshin, M., and Umapathi, N. (2017). Average and marginal returns
to upper secondary schooling in Indonesia. Journal of Applied Econometrics,
32(1):16–36.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U. (2016a). From
LATE to MTE: Alternative methods for the evaluation of policy interventions.
Labour Economics, 41:47–60. SOLE/EALE conference issue 2015.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U. (2016b). Who
benefits from universal child care? Estimating marginal returns to early
child care attendance. Forthcoming in Journal of Political Economy.
Felfe, C. and Lalive, R. (2017). Does early child care affect children’s development?
Working paper.
Heckman, J. J., Urzua, S., and Vytlacil, E. (2006a). Estimation of treatment
effects under essential heterogeneity.
http://jenni.uchicago.edu/underiv/documentation_2006_03_20.pdf. Supplement to
"Understanding Instrumental Variables in Models with Essential Heterogeneity".
Heckman, J. J., Urzua, S., and Vytlacil, E. (2006b). Understanding instrumental
variables in models with essential heterogeneity. The Review of Economics
and Statistics, 88(3):389–432.
Heckman, J. J. and Vytlacil, E. J. (1999). Local instrumental variables and latent
variable models for identifying and bounding treatment effects. Proceed-
ings of the National Academy of Sciences of the United States of America,
96(8):4730–4734.
Heckman, J. J. and Vytlacil, E. J. (2001). Local instrumental variables. In Hsiao,
C., Morimune, K., and Powell, J., editors, Nonlinear Statistical Modeling:
Proceedings of the Thirteenth International Symposium in Economic The-
ory and Econometrics. Essays in Honor of Takeshi Amemiya, pages 1–46.
Cambridge Univ. Press, New York.
Heckman, J. J. and Vytlacil, E. J. (2005). Structural equations, treatment effects,
and econometric policy evaluation. Econometrica, 73(3):669–738.
Heckman, J. J. and Vytlacil, E. J. (2007). Chapter 71 Econometric evaluation
of social programs, part II: Using the marginal treatment effect to organize
alternative econometric estimators to evaluate social programs, and to
forecast their effects in new environments. Volume 6, Part B of Handbook of
Econometrics, pages 4875–5143. Elsevier.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local
average treatment effects. Econometrica, 62(2):467–475.
Klein, R. W. and Spady, R. H. (1993). An efficient semiparametric estimator for
binary response models. Econometrica, 61(2):387–421.
Lokshin, M. and Sajaia, Z. (2004). Maximum likelihood estimation of endogenous
switching regression models. Stata Journal, 4(3):282–289.
Maestas, N., Mullen, K. J., and Strand, A. (2013). Does disability insurance receipt
discourage work? Using examiner assignment to estimate causal effects of
SSDI receipt. American Economic Review, 103(5):1797–1829.
Matzkin, R. L. (2007). Chapter 73 Nonparametric identification. Volume 6 of
Handbook of Econometrics, pages 5307–5368. Elsevier.
Mogstad, M. and Torgovitsky, A. (2018). Identification and extrapolation with
instrumental variables. Working paper, prepared for Annual Review of
Economics.
Robinson, P. (1988). Root-N-consistent semiparametric regression. Econometrica,
56(4):931–954.
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An
equivalence result. Econometrica, 70(1):331–341.
A Estimation algorithms
The mtefe package is inspired by Heckman et al. (2006a; 2006b) and Brave and
Walstrum (2014). The following sections describe the main steps of the estimation
using this program.
(a) If using the linear probability model, manually adjust propensity scores
below 0 to 0 and above 1 to 1.
4. Plot the distribution of propensity scores in the treated and untreated
samples, including the trimming limits if used, to visualize the common support.
5. Compute weights for the ATT, ATUT, LATE and MPRTE parameters and, if
specified, the PRTE parameter, as described in section 2.4.
$$\widehat{MTE}(x, u) = x(\widehat{\beta_1 - \beta_0}) + \widehat{k}(u)$$
$$\hat{Y}_j(x, u) = x\hat{\beta}_j + \widehat{k_j}(u)$$
A.4 Maximum likelihood estimation
Relevant only for the joint normal model, the mlikelihood option implements
the maximum likelihood estimator described in Lokshin and Sajaia (2004). The
individual log likelihood contribution is
$$
\ell_i = D_i\left[\ln F(\eta_{1i}) + \ln\left(\frac{1}{\sigma_1} f\left(\frac{U_{1i}}{\sigma_1}\right)\right)\right]
+ (1 - D_i)\left[\ln F(-\eta_{0i}) + \ln\left(\frac{1}{\sigma_0} f\left(\frac{U_{0i}}{\sigma_0}\right)\right)\right]
$$
$$
\text{where } \eta_{ji} = \left(\gamma Z_i + \rho_j \frac{U_{ji}}{\sigma_j}\right)\frac{1}{\sqrt{1 - \rho_j^2}}
$$
and where f is the standard normal density and F the standard normal CDF.
This log likelihood can be maximized to give the coefficients γ, β0, β1, σ0, σ1,
ρ0, ρ1. These parameter estimates can be used to construct the MTE and treatment
effect parameters as detailed in Table 1.
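To make the structure of the likelihood concrete, the following Python sketch evaluates the contribution of a single observation exactly as written in the expression above. It is an illustration only and not the code used by mtefe; the function and argument names are hypothetical, with xb1 and xb0 standing for the linear indices of the two outcome equations and gz for γZ.

```python
import math

def norm_pdf(z):
    """Standard normal density f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF F."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik_i(d, y, xb1, xb0, gz, sigma1, sigma0, rho1, rho0):
    """Log likelihood contribution of one observation.

    d is treatment status, y the observed outcome, xb1 and xb0 the linear
    indices x*beta_j, and gz the first stage index gamma*Z.
    """
    if d == 1:
        u, sigma, rho, sign = y - xb1, sigma1, rho1, 1.0
    else:
        u, sigma, rho, sign = y - xb0, sigma0, rho0, -1.0
    eta = (gz + rho * u / sigma) / math.sqrt(1.0 - rho * rho)
    return math.log(norm_cdf(sign * eta)) + math.log(norm_pdf(u / sigma) / sigma)

# Example call with made-up values for a treated observation.
example = loglik_i(d=1, y=4.1, xb1=3.9, xb0=3.5, gz=0.2,
                   sigma1=0.7, sigma0=0.7, rho1=-0.3, rho0=0.2)
```

Summing loglik_i over observations gives the objective that a numerical optimizer would maximize over the parameters.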
A.6 IV weights
To estimate the IV weights, we need measures of the impact of the instrument on
each individual conditional on X. We are looking at partitioning the linear first
stage regression
$$D = X\beta_D + \gamma Z_{-} + \varepsilon$$
1. Remove the impact of covariates on D and Z− by regressing them separately
on X and saving the residuals in D̃ and Z̃−.
3. The predicted values from this regression, υ̂, contain the individual impact
of the instrument on treatment conditional on controls. Note how we can
recover the first stage estimate from cov(D, υ̂), the reduced form estimate
from cov(Y, υ̂), and the traditional 2SLS estimate from cov(Y, υ̂)/cov(D, υ̂).
$$
\hat{\kappa}_i^{LATE} = \frac{(\hat{\upsilon}_i - \bar{\hat{\upsilon}})(D_i - \bar{D})}{\mathrm{cov}(D, \hat{\upsilon})}
$$
Notice how κ^LATE puts more weight on treated individuals with positive υ̂ and on
untreated individuals with negative υ̂; these are individuals who are more
strongly affected by the instruments and so more likely to be compliers.
5. Compute
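The covariance identities used in these steps can be verified on simulated data. The Python sketch below is an illustration under assumptions: the data-generating process and all coefficients are made up, there is a single control and a single instrument, and the regression of the residualized treatment on the residualized instrument is assumed to be the step that produces υ̂.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def residualize(v, x):
    """Residuals from an OLS regression of v on a constant and x."""
    slope = cov(v, x) / cov(x, x)
    mv, mx = mean(v), mean(x)
    return [vi - mv - slope * (xi - mx) for vi, xi in zip(v, x)]

random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]                # one control
z = [0.5 * xi + random.gauss(0, 1) for xi in x]           # instrument
d = [1 if 0.8 * zi + 0.3 * xi + random.gauss(0, 1) > 0 else 0
     for zi, xi in zip(z, x)]                             # treatment
y = [1.0 * di + 0.5 * xi + random.gauss(0, 1) for di, xi in zip(d, x)]

d_res, z_res = residualize(d, x), residualize(z, x)       # step 1
gamma_hat = cov(d_res, z_res) / cov(z_res, z_res)         # assumed regression
v_hat = [gamma_hat * zi for zi in z_res]                  # predicted values

cov_dv = cov(d, v_hat)
iv_from_v = cov(y, v_hat) / cov_dv                        # ratio from step 3
iv_direct = cov(residualize(y, x), z_res) / cov(d_res, z_res)

d_bar, v_bar = mean(d), mean(v_hat)
kappa = [(vi - v_bar) * (di - d_bar) / cov_dv for vi, di in zip(v_hat, d)]
```

Here iv_from_v coincides with the usual IV ratio on residualized variables, and the weights in kappa average to one by construction.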
For each repetition, I calculate the difference between the estimated and the
true parameter, the estimated analytic and bootstrapped standard errors and the
rejection rates for the true parameter. I estimate all models using the separate
approach and local IV to compare any differences.
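The bookkeeping behind these summary statistics can be sketched in a few lines of Python. The example is a stand-in and not the simulation in this appendix: the estimator is a simple sample mean rather than an MTE model, and the true value, sample size and number of replications are hypothetical. It only shows how the error, the standard deviation of the estimates and the rejection rate for the true parameter are computed.

```python
import math
import random

random.seed(12345)
true_mean, n_obs, n_reps = 0.4, 200, 500   # hypothetical design

errors, rejections = [], 0
for _ in range(n_reps):
    sample = [random.gauss(true_mean, 1.0) for _ in range(n_obs)]
    est = sum(sample) / n_obs
    var = sum((s - est) ** 2 for s in sample) / (n_obs - 1)
    se = math.sqrt(var / n_obs)             # analytic standard error
    errors.append(est - true_mean)
    # Two-sided 5 percent test of the true value, normal critical value.
    if abs(est - true_mean) / se > 1.96:
        rejections += 1

bias = sum(errors) / n_reps
sd_of_estimates = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n_reps - 1))
rejection_rate = rejections / n_reps
```

For a correctly specified estimator with valid standard errors, the rejection rate should lie close to .05, which is the benchmark used when reading the tables below.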
2. Draw average labor market quality Πj in the college and non-college labor
markets and average distance to college in each district AvgDist from a joint
normal distribution:
$$
(\Pi_0, \Pi_1, AvgDist) \sim N(0, \Sigma_f), \qquad
\Sigma_f = \begin{pmatrix} .1 & .05 & -.05 \\ .05 & .1 & -.1 \\ -.05 & -.1 & 10 \end{pmatrix}
$$
In the simulations, I draw these once rather than repeat for each simula-
tion to have a true value to compare the estimated coefficients to, but by
default mtefe_gendata draws these for each replication unless you specify
the parameters.
4. Draw exp from U(0, 30) and construct exp2 as its square.
$$
(U_0, U_1, V) \sim N(0, \Sigma), \qquad
\Sigma = \begin{pmatrix} .5 & .3 & -.1 \\ .3 & .5 & -.5 \\ -.1 & -.5 & 1 \end{pmatrix}
$$
Polynomial Generate Uj as second order polynomials of UD with mean
zero:
$$Y_j = X\beta_j + \Pi_j + U_j$$
$$X = (exp, \; exp2, \; 1), \qquad \beta_0 = (0.025, \; -0.0004, \; 3.2), \qquad \beta_1 = (0.01, \; 0, \; 3.6)$$
Probit
$$D = 1[\gamma Z > V]$$
LPM
$$D = 1[\gamma Z > U_D]$$
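The joint normal error structure and the outcome equations can be mimicked with the Python sketch below. It is not mtefe_gendata: the covariance matrix Σ and the coefficients β0 and β1 are taken from the text, while the first stage index gamma_z and the labor market effects pi0 and pi1 are hypothetical inputs, since their values are not reproduced here. The observed outcome is assumed to be Y_1 for the treated and Y_0 for the untreated.

```python
import math
import random

SIGMA = [[0.5, 0.3, -0.1],
         [0.3, 0.5, -0.5],
         [-0.1, -0.5, 1.0]]   # covariance of (U0, U1, V) from the text

def cholesky(a):
    """Lower-triangular L with L L' = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

L = cholesky(SIGMA)

def draw_errors(rng):
    """One draw of (U0, U1, V) ~ N(0, SIGMA)."""
    e = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(3)]

BETA0 = (0.025, -0.0004, 3.2)   # coefficients on (exp, exp2, 1)
BETA1 = (0.01, 0.0, 3.6)

def simulate_one(rng, gamma_z, pi0=0.0, pi1=0.0):
    """One observation under the probit first stage, D = 1[gamma*Z > V].

    gamma_z, pi0 and pi1 are hypothetical inputs, not values from the paper.
    """
    exp = rng.uniform(0.0, 30.0)
    x = (exp, exp * exp, 1.0)
    u0, u1, v = draw_errors(rng)
    y0 = sum(b * xi for b, xi in zip(BETA0, x)) + pi0 + u0
    y1 = sum(b * xi for b, xi in zip(BETA1, x)) + pi1 + u1
    d = 1 if gamma_z > v else 0
    return d, (y1 if d == 1 else y0), y1 - y0

rng = random.Random(2024)
sample = [simulate_one(rng, gamma_z=0.0) for _ in range(5000)]
share_treated = sum(d for d, _, _ in sample) / len(sample)
```

With gamma_z set to zero, roughly half of the simulated sample is treated; in a full simulation gamma_z would vary with the instrument.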
B.2 A well specified baseline
To establish a baseline, I first investigate the case where both the first stage and
the polynomial model for k(u) are correctly specified.
The results from these exercises are displayed in Table 4. Unsurprisingly, and
in line with theory and Monte Carlo evidence from Brave and Walstrum (2014),
mtefe does a good job at estimating both the coefficients of the outcome equations
and the MTEs when both the first stage and the parametric model are correctly
specified. The differences between the estimated and true coefficients center on 0
and have low standard deviations. The estimated analytical standard errors come
close to the standard deviations of the coefficients. Furthermore, even though the
analytical standard errors ignore the fact that the propensity score itself is an
estimated object, they fall very close to the bootstrapped standard errors that
account for this by re-estimating the propensity scores for each bootstrap
replication. Note that bootstrapped standard errors are not always higher than
analytic standard errors. Rejection rates vary somewhat from parameter to
parameter, but lie around .05 as they should.
Comparing the results from Local IV estimates with the results from the separate
approach reveals one important difference: Across the model specifications,
estimates seem to be more precise using the separate approach. Both the standard
deviations of the coefficients and the estimated standard errors are lower than
those from Local IV, as illustrated for a range of parameters in Figure 4. This
is surprising, given that more parameters are estimated when using the separate
approach than Local IV. I have no good explanation for this result, but it warrants
more research.
Table 4: Monte Carlo estimates: A well specified baseline
exp 0.025 -0.000 0.006 0.042 -0.000 0.004 0.088 0.000 0.003 0.078 0.000 0.002 0.084
(0.006) 0.006 0.050 (0.005) 0.004 0.066 (0.003) 0.003 0.044 (0.002) 0.002 0.050
exp2 -0.004 0.000 0.000 0.040 0.000 0.000 0.066 -0.000 0.000 0.070 -0.000 0.000 0.096
(0.000) 0.000 0.050 (0.000) 0.000 0.056 (0.000) 0.000 0.032 (0.000) 0.000 0.044
3.district 0.34 -0.003 0.062 0.036 -0.001 0.042 0.074 -0.001 0.025 0.062 -0.002 0.016 0.104
(0.058) 0.058 0.054 (0.045) 0.044 0.060 (0.026) 0.026 0.058 (0.020) 0.019 0.054
β0
6.district 0.34 -0.004 0.059 0.060 -0.003 0.040 0.072 0.000 0.023 0.078 -0.001 0.015 0.102
(0.061) 0.058 0.058 (0.044) 0.042 0.060 (0.026) 0.026 0.056 (0.019) 0.019 0.054
10.district -0.51 -0.004 0.060 0.052 -0.002 0.041 0.062 0.001 0.024 0.064 0.000 0.016 0.096
(0.059) 0.057 0.064 (0.043) 0.043 0.058 (0.025) 0.025 0.062 (0.019) 0.019 0.060
constant 3.2 0.006 0.065 0.042 0.002 0.043 0.078 -0.001 0.026 0.084 0.000 0.017 0.098
(0.063) 0.062 0.046 (0.047) 0.045 0.064 (0.028) 0.028 0.072 (0.021) 0.020 0.062
exp -0.015 0.001 0.010 0.042 0.001 0.006 0.076 -0.000 0.004 0.038 0.000 0.002 0.058
(0.010) 0.010 0.044 (0.006) 0.006 0.070 (0.004) 0.004 0.048 (0.002) 0.002 0.062
exp2 0.0004 -0.000 0.000 0.038 -0.000 0.000 0.070 0.000 0.000 0.040 -0.000 0.000 0.050
(0.000) 0.000 0.046 (0.000) 0.000 0.056 (0.000) 0.000 0.052 (0.000) 0.000 0.056
3.district -0.35 0.003 0.106 0.028 0.001 0.060 0.062 -0.000 0.042 0.034 0.001 0.023 0.052
(0.098) 0.098 0.052 (0.062) 0.061 0.060 (0.039) 0.039 0.048 (0.024) 0.023 0.054
β1 − β0
6.district 1.00 0.007 0.109 0.078 0.005 0.061 0.038 -0.002 0.043 0.068 0.001 0.023 0.062
(0.114) 0.112 0.066 (0.058) 0.062 0.036 (0.047) 0.047 0.046 (0.024) 0.023 0.066
10.district -0.07 0.006 0.107 0.044 0.001 0.060 0.050 -0.001 0.043 0.024 0.001 0.023 0.032
(0.104) 0.100 0.066 (0.061) 0.061 0.050 (0.037) 0.038 0.042 (0.022) 0.023 0.034
constant 0.4 -0.010 0.098 0.030 -0.005 0.058 0.062 0.002 0.041 0.042 -0.000 0.023 0.070
(0.091) 0.094 0.036 (0.060) 0.059 0.050 (0.039) 0.039 0.052 (0.025) 0.024 0.062
MTE(x̄, 0.05) 1.02 -0.003 0.103 0.042 0.001 0.056 0.052 0.009 0.070 0.048 0.004 0.045 0.078
0.74 (0.098) 0.099 0.052 (0.056) 0.057 0.050 (0.071) 0.075 0.034 (0.048) 0.050 0.046
MTE(x̄, 0.1) 0.88 -0.003 0.081 0.038 0.001 0.045 0.042 0.007 0.052 0.054 0.003 0.035 0.100
0.67 (0.077) 0.078 0.046 (0.046) 0.047 0.044 (0.054) 0.057 0.030 (0.039) 0.040 0.044
MTE(x̄, 0.25) 0.63 -0.001 0.047 0.028 0.000 0.030 0.044 0.002 0.021 0.068 0.002 0.017 0.104
0.49 (0.044) 0.045 0.044 (0.030) 0.031 0.038 (0.023) 0.023 0.050 (0.020) 0.020 0.056
MTE(x̄, 0.5) 0.36 0.000 0.025 0.026 0.000 0.021 0.050 -0.002 0.027 0.060 -0.000 0.016 0.032
MTE
0.29 (0.023) 0.023 0.036 (0.022) 0.022 0.034 (0.027) 0.028 0.048 (0.015) 0.016 0.042
MTE(x̄, 0.75) 0.09 0.002 0.050 0.040 -0.000 0.031 0.058 -0.000 0.022 0.012 -0.000 0.018 0.024
0.19 (0.048) 0.047 0.060 (0.032) 0.031 0.054 (0.018) 0.020 0.018 (0.016) 0.018 0.032
MTE(x̄, 0.9) -0.15 0.003 0.085 0.044 -0.001 0.047 0.054 0.003 0.056 0.032 -0.000 0.037 0.014
0.19 (0.082) 0.080 0.058 (0.048) 0.046 0.058 (0.050) 0.051 0.048 (0.031) 0.033 0.032
MTE(x̄, 0.95) -0.29 0.004 0.106 0.044 -0.001 0.057 0.046 0.005 0.074 0.030 0.000 0.047 0.016
0.20 (0.103) 0.101 0.060 (0.058) 0.057 0.052 (0.068) 0.068 0.054 (0.039) 0.041 0.028
ATE(x̄) 0.37 0.000 0.025 0.026 0.000 0.021 0.050 0.002 0.014 0.054 0.001 0.010 0.086
0.36 (0.023) 0.023 0.036 (0.022) 0.022 0.034 (0.014) 0.015 0.048 (0.012) 0.012 0.042
Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors are underlined, otherwise analytic. True values show the normal (top) and polynomial (bottom) models. First
stage is probit. 500 replications, each with a sample size of 10,000 in 10 districts.
[Figure: panels labelled beta1−beta0:exp, mte:u25, k:mills, mte:u50, k:p1 and
mte:u25; caption and axis titles not recoverable.]
Table 5: Monte Carlo estimates: Misspecification of k(u)
exp 0.025 -0.000 0.003 0.086 -0.000 0.002 0.126 -0.000 0.006 0.042 -0.000 0.004 0.068
(0.003) 0.003 0.054 (0.002) 0.002 0.068 (0.006) 0.006 0.050 (0.005) 0.004 0.050
exp2 -0.004 0.000 0.000 0.084 0.000 0.000 0.128 0.000 0.000 0.038 0.000 0.000 0.078
(0.000) 0.000 0.052 (0.000) 0.000 0.068 (0.000) 0.000 0.046 (0.000) 0.000 0.058
3.district 0.34 0.002 0.025 0.052 0.001 0.016 0.092 -0.001 0.062 0.040 0.000 0.042 0.088
(0.026) 0.025 0.054 (0.018) 0.019 0.044 (0.059) 0.058 0.054 (0.046) 0.044 0.066
β0
6.district 0.34 -0.001 0.023 0.086 -0.004 0.015 0.110 -0.005 0.059 0.070 -0.003 0.040 0.084
(0.026) 0.026 0.060 (0.018) 0.018 0.062 (0.062) 0.058 0.074 (0.045) 0.043 0.076
10.district -0.51 -0.001 0.024 0.050 -0.002 0.016 0.094 -0.002 0.060 0.050 -0.000 0.041 0.054
(0.025) 0.025 0.040 (0.018) 0.019 0.034 (0.059) 0.057 0.060 (0.044) 0.043 0.046
constant 3.2 0.007 0.026 0.076 0.031 0.016 0.480 0.011 0.066 0.034 -0.005 0.045 0.060
(0.028) 0.027 0.070 (0.019) 0.019 0.364 (0.064) 0.063 0.044 (0.047) 0.047 0.046
exp -0.015 -0.000 0.004 0.062 0.000 0.002 0.054 0.000 0.010 0.058 0.000 0.006 0.060
(0.004) 0.004 0.064 (0.002) 0.002 0.056 (0.010) 0.010 0.064 (0.006) 0.006 0.052
exp2 0.0004 0.000 0.000 0.056 -0.000 0.000 0.058 -0.000 0.000 0.036 -0.000 0.000 0.060
(0.000) 0.000 0.052 (0.000) 0.000 0.058 (0.000) 0.000 0.052 (0.000) 0.000 0.052
3.district -0.35 -0.002 0.042 0.036 0.000 0.023 0.034 0.000 0.106 0.030 -0.001 0.060 0.044
(0.039) 0.039 0.048 (0.022) 0.023 0.040 (0.097) 0.098 0.046 (0.061) 0.061 0.042
β1 − β0
6.district 1.00 0.001 0.043 0.066 0.003 0.023 0.050 0.005 0.109 0.046 0.002 0.061 0.064
(0.046) 0.047 0.040 (0.023) 0.023 0.054 (0.109) 0.112 0.040 (0.064) 0.062 0.064
10.district -0.07 0.001 0.043 0.028 0.002 0.023 0.052 0.001 0.107 0.040 -0.001 0.060 0.058
(0.040) 0.038 0.056 (0.023) 0.023 0.054 (0.099) 0.100 0.056 (0.060) 0.061 0.052
constant 0.4 -0.023 0.039 0.078 -0.030 0.022 0.248 0.003 0.102 0.038 0.035 0.060 0.088
(0.038) 0.037 0.102 (0.022) 0.023 0.242 (0.101) 0.097 0.056 (0.060) 0.062 0.082
MTE(x̄, 0.05) 1.02 -0.073 0.041 0.460 -0.215 0.021 1.000 -0.087 0.175 0.086 0.079 0.117 0.102
0.74 (0.041) 0.042 0.438 (0.023) 0.023 1.000 (0.173) 0.174 0.088 (0.112) 0.122 0.082
MTE(x̄, 0.1) 0.88 -0.077 0.032 0.672 -0.189 0.017 1.000 -0.008 0.132 0.050 0.109 0.092 0.220
0.67 (0.033) 0.034 0.644 (0.019) 0.019 1.000 (0.131) 0.131 0.050 (0.089) 0.096 0.194
MTE(x̄, 0.25) 0.63 -0.020 0.019 0.196 -0.080 0.011 1.000 0.040 0.052 0.110 0.043 0.044 0.166
0.49 (0.020) 0.020 0.148 (0.013) 0.014 1.000 (0.051) 0.050 0.142 (0.046) 0.046 0.140
MTE(x̄, 0.5) 0.36 0.052 0.010 1.000 0.048 0.008 1.000 -0.004 0.068 0.056 -0.073 0.043 0.398
MTE
0.29 (0.011) 0.011 1.000 (0.010) 0.010 1.000 (0.068) 0.067 0.054 (0.043) 0.043 0.398
MTE(x̄, 0.75) 0.09 0.012 0.020 0.084 0.065 0.012 1.000 -0.035 0.056 0.092 -0.026 0.047 0.088
0.19 (0.019) 0.018 0.106 (0.012) 0.012 1.000 (0.053) 0.052 0.114 (0.047) 0.048 0.082
MTE(x̄, 0.9) -0.15 -0.106 0.034 0.894 -0.002 0.018 0.030 0.034 0.142 0.046 0.160 0.098 0.372
0.19 (0.032) 0.031 0.922 (0.017) 0.016 0.054 (0.132) 0.133 0.060 (0.096) 0.096 0.384
MTE(x̄, 0.95) -0.29 -0.187 0.042 0.990 -0.052 0.022 0.646 0.122 0.186 0.090 0.300 0.124 0.682
0.20 (0.040) 0.040 0.994 (0.021) 0.020 0.716 (0.174) 0.176 0.104 (0.122) 0.121 0.692
ATE(x̄) 0.37 -0.021 0.010 0.588 -0.025 0.008 0.822 0.005 0.036 0.822 0.033 0.026 0.822
0.36 (0.011) 0.011 0.482 (0.010) 0.010 0.720 (0.031) 0.034 0.042 (0.025) 0.028 0.186
Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors are underlined, otherwise analytic. 500 replications, each with a sample size of 10,000 in 10 districts. *
Coefficients reported for k(u) rather than differences.
Table 6: Monte Carlo estimates: Misspecification of P (Z)
exp 0.025 0.001 0.003 0.102 0.000 0.002 0.092 -0.000 0.005 0.058 0.000 0.001 0.100
(0.003) 0.003 0.070 (0.002) 0.002 0.050 (0.005) 0.005 0.042 (0.002) 0.002 0.060
exp2 -0.004 -0.000 0.000 0.102 -0.000 0.000 0.088 0.000 0.000 0.056 -0.000 0.000 0.104
(0.000) 0.000 0.070 (0.000) 0.000 0.050 (0.000) 0.000 0.050 (0.000) 0.000 0.058
3.district 0.34 0.024 0.028 0.132 -0.001 0.016 0.110 -0.002 0.047 0.028 -0.000 0.014 0.050
(0.029) 0.029 0.128 (0.019) 0.019 0.068 (0.041) 0.041 0.052 (0.015) 0.016 0.028
β0
6.district 0.34 -0.071 0.026 0.736 -0.000 0.015 0.126 0.004 0.045 0.080 0.000 0.014 0.106
(0.030) 0.030 0.664 (0.019) 0.019 0.060 (0.050) 0.052 0.040 (0.016) 0.016 0.044
10.district -0.51 0.002 0.027 0.048 -0.000 0.016 0.084 0.001 0.046 0.016 -0.000 0.014 0.100
(0.028) 0.028 0.044 (0.019) 0.019 0.046 (0.037) 0.038 0.050 (0.016) 0.016 0.050
constant 3.2 0.006 0.029 0.056 -0.010 0.017 0.116 -0.067 0.070 0.142 -0.021 0.027 0.172
(0.030) 0.031 0.050 (0.020) 0.021 0.060 (0.068) 0.071 0.130 (0.032) 0.033 0.084
exp -0.015 -0.003 0.005 0.092 -0.000 0.002 0.038 0.001 0.009 0.056 0.000 0.002 0.044
(0.005) 0.005 0.094 (0.002) 0.002 0.040 (0.008) 0.009 0.058 (0.002) 0.002 0.044
exp2 0.0004 0.000 0.000 0.086 0.000 0.000 0.042 -0.000 0.000 0.052 -0.000 0.000 0.048
(0.000) 0.000 0.076 (0.000) 0.000 0.040 (0.000) 0.000 0.048 (0.000) 0.000 0.048
3.district -0.35 -0.049 0.050 0.148 0.001 0.023 0.054 0.002 0.091 0.024 -0.001 0.020 0.030
(0.045) 0.046 0.176 (0.023) 0.023 0.056 (0.078) 0.077 0.056 (0.018) 0.020 0.024
β1 − β0
6.district 1.00 0.143 0.051 0.776 -0.000 0.023 0.048 -0.007 0.092 0.094 0.000 0.020 0.048
(0.056) 0.054 0.742 (0.024) 0.023 0.052 (0.105) 0.110 0.050 (0.020) 0.020 0.048
10.district -0.07 -0.007 0.050 0.036 -0.001 0.023 0.062 -0.003 0.091 0.008 -0.000 0.020 0.044
(0.045) 0.046 0.050 (0.024) 0.023 0.052 (0.071) 0.071 0.052 (0.019) 0.020 0.046
constant 0.4 -0.006 0.047 0.038 0.013 0.024 0.092 0.099 0.123 0.102 0.025 0.038 0.120
(0.044) 0.045 0.048 (0.024) 0.025 0.074 (0.114) 0.120 0.100 (0.040) 0.043 0.070
π11 − π01 -1.5 2.612 0.411 1.000 1.466 0.291 1.000 -1.808 1.459 0.236 -0.481 0.496 0.176
(0.411) 0.428 1.000 (0.282) 0.297 1.000 (1.474) 1.514 0.206 (0.541) 0.554 0.122
k(u)
π12 − π02 0.9 -2.488 0.410 1.000 -1.435 0.288 1.000 1.654 1.463 0.202 0.412 0.480 0.136
(0.398) 0.415 1.000 (0.261) 0.278 1.000 (1.446) 1.501 0.180 (0.494) 0.506 0.120
MTE(x̄, 0.05) 0.74 -0.361 0.067 1.000 -0.173 0.049 0.926 0.368 0.275 0.274 0.105 0.107 0.184
(0.070) 0.073 1.000 (0.056) 0.057 0.852 (0.286) 0.290 0.232 (0.126) 0.130 0.100
MTE(x̄, 0.1) 0.67 -0.249 0.051 1.000 -0.110 0.038 0.774 0.290 0.215 0.276 0.084 0.087 0.180
(0.054) 0.056 0.998 (0.045) 0.045 0.680 (0.224) 0.227 0.234 (0.104) 0.108 0.094
MTE(x̄, 0.25) 0.49 0.012 0.022 0.116 0.034 0.017 0.498 0.106 0.080 0.254 0.033 0.042 0.180
(0.023) 0.024 0.084 (0.022) 0.022 0.366 (0.086) 0.087 0.204 (0.052) 0.055 0.072
MTE(x̄, 0.5) 0.29 0.198 0.026 1.000 0.132 0.020 1.000 -0.036 0.037 0.174 -0.010 0.018 0.126
MTE
(0.026) 0.027 1.000 (0.017) 0.019 1.000 (0.038) 0.039 0.136 (0.020) 0.020 0.084
MTE(x̄, 0.75) 0.19 0.074 0.024 0.900 0.050 0.019 0.758 0.029 0.090 0.026 -0.001 0.046 0.050
(0.021) 0.021 0.934 (0.018) 0.018 0.768 (0.079) 0.086 0.042 (0.044) 0.046 0.046
MTE(x̄, 0.9) 0.19 -0.150 0.056 0.776 -0.086 0.041 0.580 0.167 0.232 0.090 0.029 0.092 0.046
(0.049) 0.049 0.866 (0.036) 0.036 0.668 (0.213) 0.229 0.086 (0.087) 0.091 0.050
MTE(x̄, 0.95) 0.20 -0.250 0.073 0.944 -0.145 0.052 0.820 0.230 0.295 0.098 0.043 0.113 0.046
(0.065) 0.065 0.972 (0.045) 0.046 0.880 (0.273) 0.292 0.102 (0.107) 0.111 0.050
ATE(x̄) 0.37 -0.005 0.015 0.056 0.014 0.011 0.274 0.099 0.092 0.188 0.024 0.032 0.134
0.36 (0.015) 0.016 0.046 (0.013) 0.014 0.168 (0.090) 0.094 0.158 (0.036) 0.038 0.072
Note: Monte Carlo experiments described in section B. Standard deviations in parentheses. Results based on bootstrapped
standard errors are underlined, otherwise analytic. 500 replications, each with a sample size of 10,000 in 10 districts. Error
structure is polynomial.
Table 7: Monte Carlo estimates: Omissions in the specification of P (Z)
3.district 0.34 -0.003 0.051 0.018 0.000 0.015 0.088 -0.000 0.026 0.072 -0.000 0.017 0.096
(0.044) 0.045 0.036 (0.018) 0.018 0.056 (0.028) 0.027 0.058 (0.020) 0.020 0.052
6.district 0.34 -0.005 0.049 0.096 -0.000 0.015 0.088 -0.001 0.025 0.076 -0.001 0.016 0.108
(0.058) 0.055 0.066 (0.018) 0.018 0.056 (0.028) 0.028 0.056 (0.019) 0.020 0.054
β0
10.district -0.51 -0.001 0.050 0.020 0.000 0.015 0.096 -0.002 0.025 0.064 -0.002 0.016 0.124
(0.041) 0.042 0.046 (0.018) 0.018 0.056 (0.027) 0.026 0.056 (0.020) 0.020 0.066
constant 3.2 0.258 0.055 0.994 0.259 0.025 1.000 0.254 0.021 1.000 0.255 0.014 1.000
(0.053) 0.055 0.992 (0.031) 0.032 1.000 (0.023) 0.022 1.000 (0.017) 0.017 1.000
3.district -0.35 0.005 0.098 0.016 -0.001 0.022 0.060 0.002 0.045 0.042 0.001 0.024 0.042
(0.083) 0.085 0.038 (0.022) 0.022 0.066 (0.043) 0.042 0.058 (0.024) 0.024 0.048
6.district 1.00 0.009 0.099 0.110 0.000 0.022 0.052 0.002 0.046 0.058 0.001 0.025 0.034
(0.121) 0.115 0.076 (0.023) 0.022 0.050 (0.049) 0.050 0.042 (0.023) 0.024 0.042
β1 − β0
10.district -0.07 0.002 0.098 0.008 -0.001 0.022 0.036 0.002 0.045 0.030 0.002 0.024 0.058
(0.075) 0.079 0.042 (0.021) 0.022 0.038 (0.043) 0.041 0.050 (0.025) 0.024 0.064
constant 0.4 -0.111 0.100 0.166 -0.113 0.037 0.842 -0.104 0.034 0.872 -0.107 0.020 1.000
(0.089) 0.094 0.206 (0.039) 0.042 0.784 (0.033) 0.032 0.896 (0.020) 0.020 1.000
π11 − π01 -1.5 0.106 1.162 0.058 0.197 0.501 0.086 0.028 0.458 0.054 0.136 0.269 0.074
(1.191) 1.211 0.050 (0.532) 0.567 0.058 (0.471) 0.476 0.050 (0.260) 0.271 0.076
k(u)
π12 − π02 0.9 -0.109 1.167 0.064 -0.202 0.488 0.064 -0.038 0.457 0.048 -0.147 0.264 0.058
(1.183) 1.196 0.060 (0.490) 0.519 0.050 (0.455) 0.465 0.044 (0.238) 0.251 0.074
MTE(x̄, 0.05) 0.74 -0.016 0.220 0.062 -0.033 0.107 0.100 0.000 0.074 0.082 -0.015 0.048 0.092
(0.227) 0.234 0.038 (0.123) 0.132 0.040 (0.079) 0.079 0.054 (0.054) 0.054 0.054
MTE(x̄, 0.1) 0.67 -0.012 0.172 0.068 -0.025 0.086 0.100 0.001 0.055 0.080 -0.010 0.037 0.106
(0.178) 0.184 0.036 (0.102) 0.108 0.038 (0.061) 0.060 0.054 (0.043) 0.044 0.058
MTE(x̄, 0.25) 0.49 -0.001 0.068 0.068 -0.006 0.042 0.124 0.003 0.022 0.070 0.003 0.018 0.112
(0.072) 0.075 0.048 (0.052) 0.055 0.048 (0.024) 0.024 0.054 (0.022) 0.022 0.050
MTE(x̄, 0.5) 0.29 0.005 0.035 0.086 0.006 0.020 0.098 0.003 0.028 0.070 0.009 0.017 0.076
MTE
(0.039) 0.037 0.066 (0.022) 0.022 0.074 (0.029) 0.030 0.062 (0.016) 0.017 0.092
MTE(x̄, 0.75) 0.19 -0.003 0.079 0.030 -0.008 0.046 0.050 -0.002 0.023 0.064 -0.002 0.019 0.072
(0.073) 0.074 0.044 (0.046) 0.046 0.052 (0.023) 0.021 0.082 (0.021) 0.019 0.076
MTE(x̄, 0.9) 0.19 -0.014 0.191 0.050 -0.028 0.093 0.054 -0.007 0.060 0.026 -0.018 0.040 0.064
(0.184) 0.186 0.056 (0.090) 0.092 0.056 (0.055) 0.055 0.056 (0.037) 0.036 0.100
MTE(x̄, 0.95) 0.20 -0.019 0.241 0.054 -0.037 0.114 0.054 -0.009 0.079 0.022 -0.025 0.051 0.060
(0.234) 0.236 0.058 (0.110) 0.112 0.058 (0.073) 0.073 0.042 (0.046) 0.045 0.096
ATE(x̄) 0.37 -0.004 0.071 0.050 -0.011 0.033 0.074 0.000 0.015 0.060 -0.003 0.011 0.108
0.36 (0.069) 0.073 0.044 (0.035) 0.038 0.040 (0.016) 0.016 0.056 (0.013) 0.013 0.054
Note: Monte Carlo experiments described in section B, where exp and exp2 are omitted from the model. Standard deviations in
parentheses. Results based on bootstrapped standard errors are underlined, otherwise analytic. 500 replications, each with a
sample size of 10,000 in 10 districts. Error structure is polynomial.