Imbens & Wooldridge (2009) Recent Developments in The Econometrics of Program Evaluation
Published in the Journal of Economic Literature.
Many empirical questions in economics and other social sciences depend on causal effects of programs or policies. In the last two decades, much research has been done on the econometric and statistical analysis of such causal effects. This recent theoretical literature has built on, and combined features of, earlier work in both the statistics and econometrics literatures. It has by now reached a level of maturity that makes it an important tool in many areas of empirical research in economics, including labor economics, public finance, development economics, industrial organization, and other areas of empirical microeconomics. In this review, we discuss some of the recent developments. We focus primarily on practical issues for empirical researchers, as well as provide a historical overview of the area and give references to more technical research.
and Jeffrey A. Smith (1999), Heckman and Edward Vytlacil (2007a, 2007b), and Jaap H. Abbring and Heckman (2007) provide an excellent overview of the important theoretical work by Heckman and his coauthors in this area.

The central problem studied in this literature is that of evaluating the effect of the exposure of a set of units to a program, or treatment, on some outcome. In economic studies, the units are typically economic agents such as individuals, households, markets, firms, counties, states, or countries but, in other disciplines where evaluation methods are used, the units can be animals, plots of land, or physical objects. The treatments can be job search assistance programs, educational programs, vouchers, laws or regulations, medical drugs, environmental exposure, or technologies. A critical feature is that, in principle, each unit can be exposed to multiple levels of the treatment. Moreover, this literature is focused on settings with observations on units exposed, and not exposed, to the treatment, with the evaluation based on comparisons of units exposed and not exposed.1 For example, an individual may enroll or not in a training program, or he or she may receive or not receive a voucher, or be subject to a particular regulation or not. The object of interest is a comparison of the two outcomes for the same unit when exposed, and when not exposed, to the treatment. The problem is that we can at most observe one of these outcomes because the unit can be exposed to only one level of the treatment. Paul W. Holland (1986) refers to this as the fundamental problem of causal inference. In order to evaluate the effect of the treatment, we therefore always need to compare distinct units receiving the different levels of the treatment. Such a comparison can involve different physical units or the same physical unit at different times.

1 As opposed to studies where the causal effect of fundamentally new programs is predicted through direct identification of preferences and production functions.

The problem of evaluating the effect of a binary treatment or program is a well studied problem with a long history in both econometrics and statistics. This is true both in the theoretical literature as well as in the more applied literature. The econometric literature goes back to early work by Orley Ashenfelter (1978) and subsequent work by Ashenfelter and David Card (1985), Heckman and Richard Robb (1985), LaLonde (1986), Thomas Fraker and Rebecca Maynard (1987), Card and Daniel G. Sullivan (1988), and Charles F. Manski (1990). Motivated primarily by applications to the evaluation of labor market programs in observational settings, the focus in the econometric literature is traditionally on endogeneity, or self-selection, issues. Individuals who choose to enroll in a training program are by definition different from those who choose not to enroll. These differences, if they influence the response, may invalidate causal comparisons of outcomes by treatment status, possibly even after adjusting for observed covariates. Consequently, many of the initial theoretical studies focused on the use of traditional econometric methods for dealing with endogeneity, such as fixed effect methods from panel data analyses, and instrumental variables methods. Subsequently, the econometrics literature has combined insights from the semiparametric literature to develop new estimators for a variety of settings, requiring fewer functional form and homogeneity assumptions.

The statistics literature starts from a different perspective. This literature originates in the analysis of randomized experiments by Ronald A. Fisher (1935) and Jerzy Splawa-Neyman (1990). From the early 1970s, Rubin (1973a, 1973b, 1974, 1977, 1978), in a series of papers, formulated the now dominant approach to the analysis of causal effects in observational studies. Rubin proposed the
will discuss several of them. One approach (Rosenbaum and Rubin 1983b; Rosenbaum 1995) consists of sensitivity analyses, where robustness of estimates to specific limited departures from unconfoundedness is investigated. A second approach, developed by Manski (1990, 2003, 2007), consists of bounds analyses, where ranges of estimands consistent with the data and the limited assumptions the researcher is willing to make are derived and estimated. A third approach, instrumental variables, relies on the presence of additional treatments, the so-called instruments, that satisfy specific exogeneity and exclusion restrictions. The formulation of this method in the context of the potential outcomes framework is presented in Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996). A fourth approach applies to settings where, in its pure form, overlap is completely absent because the assignment is a deterministic function of covariates, but comparisons can be made exploiting continuity of average outcomes as a function of covariates. This setting, known as the regression discontinuity design, has a long tradition in statistics (see William R. Shadish, Thomas D. Cook, and Donald T. Campbell 2002 and Cook 2008 for historical perspectives), but has recently been revived in the economics literature through work by Wilbert van der Klaauw (2002), Hahn, Todd, and van der Klaauw (2001), David S. Lee (2001), and Jack R. Porter (2003). Finally, a fifth approach, referred to as difference-in-differences, relies on the presence of additional data in the form of samples of treated and control units before and after the treatment. An early application is Ashenfelter and Card (1985). Recent theoretical work includes Abadie (2005), Bertrand, Duflo, and Mullainathan (2004), Stephen G. Donald and Kevin Lang (2007), and Susan Athey and Imbens (2006).

In this review, we will discuss in detail some of the new methods that have been developed in this literature. We will pay particular attention to the practical issues raised by the implementation of these methods. At this stage, the literature has matured to the extent that it has much to offer the empirical researcher. Although the evaluation problem is one where identification problems are important, there is currently a much better understanding of which assumptions are most useful, as well as a better set of methods for inference given different sets of assumptions.

Most of this review will be limited to settings with binary treatments. This is in keeping with the literature, which has largely focused on the binary treatment case. There are some extensions of these methods to multivalued, and even continuous, treatments (e.g., Imbens 2000; Michael Lechner 2001; Lechner and Ruth Miquel 2005; Richard D. Gill and James M. Robins 2001; Hirano and Imbens 2004), and some of these extensions will be discussed in the current review. But the work in this area is ongoing, and much remains to be done here.

The running example we will use throughout the paper is that of a job market training program. Such programs have been among the leading applications in the economics literature, starting with Ashenfelter (1978) and including LaLonde (1986) as a particularly influential study. In such settings, a number of individuals do, or do not, enroll in a training program, with labor market outcomes, such as yearly earnings or employment status, as the main outcome of interest. An individual not participating in the program may have chosen not to do so, or may have been ineligible for various reasons. Understanding the choices made, and constraints faced, by the potential participants is a crucial component of any analysis. In addition to observing participation status and outcome measures, we typically observe individual background characteristics, such as education levels and age, as well as information regarding prior labor market histories, such as earnings at various
levels of aggregation (e.g., yearly, quarterly, or monthly). In addition, we may observe some of the constraints faced by the individuals, including measures used to determine eligibility, as well as measures of general labor market conditions in the local labor markets faced by potential participants.

2. The Rubin Causal Model: Potential Outcomes, the Assignment Mechanism, and Interactions

In this section, we describe the essential elements of the modern approach to program evaluation, based on the work by Rubin. Suppose we wish to analyze a job training program using observations on $N$ individuals, indexed by $i = 1, \ldots, N$. Some of these individuals were enrolled in the training program. Others were not enrolled, either because they were ineligible or chose not to enroll. We use the indicator $W_i$ to indicate whether individual $i$ enrolled in the training program, with $W_i = 0$ if individual $i$ did not, and $W_i = 1$ if individual $i$ did, enroll in the program. We use $W$ to denote the $N$-vector with $i$-th element equal to $W_i$, and $N_0$ and $N_1$ to denote the number of control and treated units, respectively. For each unit, we also observe a $K$-dimensional column vector of covariates or pretreatment variables, $X_i$, with $X$ denoting the $N \times K$ matrix with $i$-th row equal to $X_i'$.

2.1 Potential Outcomes

The first element of the RCM is the notion of potential outcomes. For individual $i$, for $i = 1, \ldots, N$, we postulate the existence of two potential outcomes, denoted by $Y_i(0)$ and $Y_i(1)$. The first, $Y_i(0)$, denotes the outcome that would be realized by individual $i$ if he or she did not participate in the program. Similarly, $Y_i(1)$ denotes the outcome that would be realized by individual $i$ if he or she did participate in the program. Individual $i$ can either participate or not participate in the program, but not both, and thus only one of these two potential outcomes can be realized. Prior to the assignment being determined, both are potentially observable, hence the label potential outcomes. If individual $i$ participates in the program, $Y_i(1)$ will be realized and $Y_i(0)$ will ex post be a counterfactual outcome. If, on the other hand, individual $i$ does not participate in the program, $Y_i(0)$ will be realized and $Y_i(1)$ will be the ex post counterfactual. We will denote the realized outcome by $Y_i$, with $Y$ the $N$-vector with $i$-th element equal to $Y_i$. The preceding discussion implies that

$$Y_i = Y_i(W_i) = Y_i(0)\,(1 - W_i) + Y_i(1)\,W_i = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}$$

The potential outcomes are tied to the specific manipulation that would have made one of them the realized outcome. The more precise the specification of the manipulation, the more well-defined the potential outcomes are.

This distinction between the pair of potential outcomes $(Y_i(0), Y_i(1))$ and the realized outcome $Y_i$ is the hallmark of modern statistical and econometric analyses of treatment effects. We offer some comments on it. The potential outcomes framework has important precursors in a variety of other settings. Most directly, in the context of randomized experiments, the potential outcome framework was introduced by Splawa-Neyman (1990) to derive the properties of estimators and confidence intervals under repeated sampling.

The potential outcomes framework also has important antecedents in econometrics. Specifically, it is interesting to compare the distinction between potential outcomes $Y_i(0)$ and $Y_i(1)$ and the realized outcome $Y_i$ in Rubin's approach to Trygve Haavelmo's (1943) work on simultaneous equations models (SEMs). Haavelmo discusses identification of
supply and demand models. He makes a distinction between "any imaginable price $\pi$" as the argument in the demand and supply functions, $q^d(\pi)$ and $q^s(\pi)$, and the "actual price $p$," which is the observed equilibrium price satisfying $q^s(p) = q^d(p)$. The supply and demand functions play the same role as the potential outcomes in Rubin's approach, with the equilibrium price similar to the realized outcome. Curiously, Haavelmo's notational distinction between equilibrium and potential prices has gotten blurred in many textbook discussions of simultaneous equations. In such discussions, the starting point is often the general formulation $Y\Gamma + XB = U$ for $N \times M$ vectors of realized outcomes $Y$, $N \times L$ matrices of exogenous covariates $X$, and an $N \times M$ matrix of unobserved components $U$. A nontrivial byproduct of the potential outcomes approach is that it forces users of SEMs to articulate what the potential outcomes are, thereby leading to better applications of SEMs. A related point is made in Pearl (2000).

Another area where potential outcomes are used explicitly is in the econometric analyses of production functions. Similar to the potential outcomes framework, a production function $g(x, \varepsilon)$ describes production levels that would be achieved for each value of a vector of inputs, some observed ($x$) and some unobserved ($\varepsilon$). Observed inputs may be chosen partly as a function of (expected) values of unobserved inputs. Only for the level of inputs actually chosen do we observe the level of the output. Potential outcomes are also used explicitly in labor market settings by A. D. Roy (1951). Roy models individuals choosing from a set of occupations. Individuals know what their earnings would be in each of these occupations and choose the occupation (treatment) that maximizes their earnings. Here we see the explicit use of the potential outcomes, combined with a specific selection/assignment mechanism, namely, choosing the treatment with the highest potential outcome.

The potential outcomes framework has a number of advantages over a framework based directly on realized outcomes. The first advantage of the potential outcome framework is that it allows us to define causal effects before specifying the assignment mechanism, and without making functional form or distributional assumptions. The most common definition of the causal effect at the unit level is as the difference $Y_i(1) - Y_i(0)$, but we may wish to look at ratios $Y_i(1)/Y_i(0)$, or other functions. Such definitions do not require us to take a stand on whether the effect is constant or varies across the population. Further, defining individual-specific treatment effects using potential outcomes does not require us to assume endogeneity or exogeneity of the assignment mechanism. By contrast, the causal effects are more difficult to define in terms of the realized outcomes. Often, researchers write down a regression function $Y_i = \alpha + \tau W_i + \varepsilon_i$. This regression function is then interpreted as a structural equation, with $\tau$ as the causal effect. Left unclear is whether the causal effect is constant or not, and what the properties of the unobserved component, $\varepsilon_i$, are. The potential outcomes approach separates these issues, and allows the researcher to first define the causal effect of interest without considering probabilistic properties of the outcomes or assignment.

The second advantage of the potential outcome approach is that it links the analysis of causal effects to explicit manipulations. Considering the two potential outcomes forces the researcher to think about scenarios under which each outcome could be observed, that is, to consider the kinds of experiments that could reveal the causal effects. Doing so clarifies the interpretation of causal effects. For illustration, consider a couple of recent examples from the economics literature. First, consider the causal effects of gender or ethnicity on outcomes of job applications. Simple comparisons of
economic outcomes by ethnicity are difficult to interpret. Are they the result of discrimination by employers, or are they the result of differences between applicants, possibly arising from discrimination at an earlier stage of life? Now, one can obtain unambiguous causal interpretations by linking comparisons to specific manipulations. A recent example is the study by Bertrand and Mullainathan (2004), who compare callback rates for job applications submitted with names that suggest African-American or Caucasian ethnicity. Their study has a clear manipulation (a name change) and therefore a clear causal effect. As a second example, consider some recent economic studies that have focused on causal effects of individual characteristics such as beauty (e.g., Daniel S. Hamermesh and Jeff E. Biddle 1994) or height. Do the differences in earnings by ratings on a beauty scale represent causal effects? One possible interpretation is that they represent causal effects of plastic surgery. Such a manipulation would make differences causal, but it appears unclear whether cross-sectional correlations between beauty and earnings in a survey from the general population represent causal effects of plastic surgery.

A third advantage of the potential outcome approach is that it separates the modeling of the potential outcomes from that of the assignment mechanism. Modeling the realized outcome is complicated by the fact that it combines the potential outcomes and the assignment mechanism. The researcher may have very different sources of information to bear on each. For example, in the labor market program example we can consider the outcome, say, earnings, in the absence of the program: $Y_i(0)$. We can model this in terms of individual characteristics and labor market histories. Similarly, we can model the outcome given enrollment in the program, again conditional on individual characteristics and labor market histories. Then finally we can model the probability of enrolling in the program given the earnings in both treatment arms conditional on individual characteristics. This sequential modeling will lead to a model for the realized outcome, but it may be easier than directly specifying a model for the realized outcome.

A fourth advantage of the potential outcomes approach is that it allows us to formulate probabilistic assumptions in terms of potentially observable variables, rather than in terms of unobserved components. In this approach, many of the critical assumptions will be formulated as (conditional) independence assumptions involving the potential outcomes. Assessing their validity requires the researcher to consider the dependence structure if all potential outcomes were observed. By contrast, models in terms of realized outcomes often formulate the critical assumptions in terms of errors in regression functions. To be specific, consider again the regression function $Y_i = \alpha + \tau W_i + \varepsilon_i$. Typically (conditional independence) assumptions are made on the relationship between $\varepsilon_i$ and $W_i$. Such assumptions implicitly bundle a number of assumptions, including functional form assumptions and substantive exogeneity assumptions. This bundling makes the plausibility of these assumptions more difficult to assess.

A fifth advantage of the potential outcome approach is that it clarifies where the uncertainty in the estimators comes from. Even if we observe the entire (finite) population (as is increasingly common with the growing availability of administrative data sets), so that we can estimate population averages with no uncertainty, causal effects will be uncertain because for each unit at most one of the two potential outcomes is observed. One may still use super population arguments to justify approximations to the finite sample distributions, but such arguments are not required to motivate the existence of uncertainty about the causal effect.
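As a rough illustration of this last point, the following Python sketch builds a small finite population in which both potential outcomes are known by construction, forms the realized outcome as $Y_i = Y_i(0)(1 - W_i) + Y_i(1) W_i$, and re-randomizes the assignment many times over the same fixed population. The distributions, population size, and function names are assumptions made for illustration only and are not taken from the review; complete randomization with a fixed number of treated units is likewise an assumed design.

```python
import random
import statistics


def simulate_population(n, seed=0):
    """Draw a fixed finite population of (Y_i(0), Y_i(1)) pairs (hypothetical values)."""
    rng = random.Random(seed)
    y0 = [rng.gauss(10.0, 2.0) for _ in range(n)]
    # Heterogeneous unit-level effects Y_i(1) - Y_i(0), centred on 1.0.
    y1 = [a + rng.gauss(1.0, 1.0) for a in y0]
    return y0, y1


def realized_outcomes(y0, y1, w):
    """Y_i = Y_i(0) * (1 - W_i) + Y_i(1) * W_i: only one potential outcome is seen."""
    return [a * (1 - wi) + b * wi for a, b, wi in zip(y0, y1, w)]


def difference_in_means(y, w):
    """Average realized outcome of treated units minus that of control units."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return statistics.mean(treated) - statistics.mean(control)


def randomization_draws(y0, y1, n_treated, n_draws, seed=1):
    """Re-randomize the assignment over the same population and re-estimate."""
    rng = random.Random(seed)
    n = len(y0)
    estimates = []
    for _ in range(n_draws):
        treated_ids = set(rng.sample(range(n), n_treated))
        w = [1 if i in treated_ids else 0 for i in range(n)]
        y = realized_outcomes(y0, y1, w)
        estimates.append(difference_in_means(y, w))
    return estimates


y0, y1 = simulate_population(200)
# Known here only because the population is simulated; never observable in practice.
true_ate = statistics.mean(b - a for a, b in zip(y0, y1))
estimates = randomization_draws(y0, y1, n_treated=100, n_draws=500)

print(f"finite-population average effect: {true_ate:.3f}")
print(f"mean of estimates across assignments: {statistics.mean(estimates):.3f}")
print(f"std. dev. of estimates across assignments: {statistics.stdev(estimates):.3f}")
```

The population never changes between draws, so any spread in the estimates comes solely from which potential outcome happens to be revealed for each unit. No superpopulation is invoked at any point in the sketch.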
independence.4 Although the analysis of data with such assignment mechanisms is not as straightforward as that of randomized experiments, there are now many practical methods available for this case. We review them in section 5.

4 E.g., Lechner 2001; A. Colin Cameron and Pravin K. Trivedi 2005.

The third class of assignment mechanisms contains all remaining assignment mechanisms with some dependence on potential outcomes.5 Many of these create substantive problems for the analysis, for which there is no general solution. There are a number of special cases that are by now relatively well understood, and we discuss these in section 6. The most prominent of these cases are instrumental variables, regression discontinuity, and differences-in-differences. In addition, we discuss two general methods that also relax the unconfoundedness assumption but do not replace it with additional assumptions. The first relaxes the unconfoundedness assumption in a limited way and investigates the sensitivity of the estimates to such violations. The second drops the unconfoundedness assumption entirely and establishes bounds on estimands of interest. The latter is associated with the work by Manski (1990, 1995, 2007).

5 This includes some mechanisms where the dependence on potential outcomes does not create any problems in the analyses. Most prominent in this category are sequential assignment mechanisms. For example, one could randomly assign the first ten units to the treatment or control group with probability 1/2. From then on one could skew the assignment probability to the treatment with the most favorable outcomes so far. For example, if the active treatment looks better than the control treatment based on the first N units, then the (N + 1)th unit is assigned to the active treatment with probability 0.8 and vice versa. Such assignment mechanisms are not very common in economics settings, and we ignore them in this discussion.

2.3 Interactions and General Equilibrium Effects

In most of the literature, it is assumed that treatments received by one unit do not affect outcomes for another unit. Only the level of the treatment applied to the specific individual is assumed to potentially affect outcomes for that particular individual. In the statistics literature, this assumption is referred to as the Stable-Unit-Treatment-Value-Assumption (Rubin 1978). In this paper, we mainly focus on settings where this assumption is maintained. In the current section, we discuss some of the literature motivated by concerns about this assumption.

This lack-of-interaction assumption is very plausible in many biomedical applications. Whether one individual receives or does not receive a new treatment for a stroke is unlikely to have a substantial impact on health outcomes for any other individual. However, there are also many cases in which such interactions are a major concern and the assumption is not plausible. Even in the early experimental literature, with applications to the effect of various fertilizers on crop yields, researchers were cognizant of potential problems with this assumption. In order to minimize leaking of fertilizer applied to one plot into an adjacent plot, experimenters used guard rows to physically separate the plots that were assigned different fertilizers. A different concern arises in epidemiological applications when the focus is on treatments such as vaccines for contagious diseases. In that case, it is clear that the vaccination of one unit can affect the outcomes of others in their proximity, and such effects are a large part of the focus of the evaluation.

In economic applications, interactions between individuals are also a serious concern. It is clear that a labor market program that affects the labor market outcomes for one individual potentially has an effect on the labor market outcomes for others. In a world with a fixed number of jobs, a training program could only redistribute the jobs, and ignoring this constraint on the number of jobs by using a partial, instead of a general, equilibrium analysis could lead one to
erroneously conclude that extending the program to the entire population would raise aggregate employment. Such concerns have rarely been addressed in the recent program evaluation literature. Exceptions include Heckman, Lance Lochner, and Christopher Taber (1999), who provide some simulation evidence for the potential biases that may result from ignoring these issues.

In practice these general equilibrium effects may, or may not, be a serious problem. The indirect effect on one individual of exposure to the treatment of a few other units is likely to be much smaller than the direct effect of the exposure of the first unit itself. Hence, with most labor market programs both small in scope and with limited effects on the individual outcomes, it appears unlikely that general equilibrium effects are substantial and they can probably be ignored for most purposes.

One general solution to these problems is to redefine the unit of interest. If the interactions between individuals are at an intermediate level, say a local labor market, or a classroom, rather than global, one can analyze the data using the local labor market or classroom as the unit and changing the no-interaction assumption to require the absence of interactions among local labor markets or classrooms. Such aggregation is likely to make the no-interaction assumption more plausible, albeit at the expense of reduced precision.

An alternative solution is to directly model the interactions. This involves specifying which individuals interact with each other, and possibly relative magnitudes of these interactions. In some cases it may be plausible to assume that interactions are limited to individuals within well-defined, possibly overlapping groups, with the intensity of the interactions equal within this group. This would be the case in a world with a fixed number of jobs in a local labor market. Alternatively, it may be that interactions occur in broader groups but decline in importance depending on some distance metric, either geographical distance or proximity in some economic metric.

The most interesting literature in this area views the interactions not as a nuisance but as the primary object of interest. This literature, which includes models of social interactions and peer effects, has been growing rapidly in the last decade, following the early work by Manski (1993). See Manski (2000a) and William Brock and Steven N. Durlauf (2000) for recent surveys. Empirical work includes Jeffrey R. Kling, Jeffrey B. Liebman, and Katz (2007), who look at the effect of households moving to neighborhoods with higher average socioeconomic status; Bruce I. Sacerdote (2001), who studies the effect of college roommate behavior on a student's grades; Edward L. Glaeser, Sacerdote, and Jose A. Scheinkman (1996), who study social interactions in criminal behavior; Anne C. Case and Lawrence F. Katz (1991), who look at neighborhood effects on disadvantaged youths; Bryan S. Graham (2008), who infers interactions from the effect of class size on the variation in grades; and Angrist and Lang (2004), who study the effect of desegregation programs on students' grades. Many identification and inferential questions remain unanswered in this literature.

3. What Are We Interested In? Estimands and Hypotheses

In this section, we discuss some of the questions that researchers have asked in this literature. A key feature of the current literature, and one that makes it more important to be precise about the questions of interest, is the accommodation of general heterogeneity in treatment effects. In contrast, in many early studies it was assumed that the effect of a treatment was constant, implying that the effect of various policies could be captured by a single parameter. The essentially unlimited heterogeneity in the effects of the
treatment allowed for in the current literature implies that it is generally not possible to capture the effects of all policies of interest in terms of a few summary statistics. In practice researchers have reported estimates of the effects of a few focal policies. In this section we describe some of these estimands. Most of these estimands are average treatment effects, either for the entire population or for some subpopulation, although some correspond to other features of the joint distribution of potential outcomes.

Most of the empirical literature has focused on estimation. Much less attention has been devoted to testing hypotheses regarding the properties or presence of treatment effects. Here we discuss null and alternative hypotheses that may be of interest in settings with heterogeneous effects. Finally, we discuss some of the recent literature on decision-theoretic approaches to program evaluation that ties estimands more closely to optimal policies.

3.1 Average Treatment Effects

The econometric literature has largely focused on average effects of the treatment. The two most prominent average effects are defined over an underlying population. In cases where the entire population can be sampled, population treatment effects rely on the notion of a superpopulation, where the current population that is available is viewed as just one of many possibilities. In either case, the sample of size $N$ is viewed as a random sample from a large (super-)population, and interest is in the average effect in the superpopulation.6 The most popular treatment effect is the Population Average Treatment Effect (PATE), the population expectation of the unit-level causal effect,

$$\tau_{PATE} = E[Y_i(1) - Y_i(0)].$$

If the policy under consideration would expose all units to the treatment or none at all, this is the most relevant quantity. Another popular estimand is the Population Average Treatment effect on the Treated (PATT), the average over the subpopulation of treated units:

$$\tau_{PATT} = E[Y_i(1) - Y_i(0) \mid W_i = 1].$$

6 For simplicity, we restrict ourselves to random sampling. Some data sets are obtained by stratified sampling. Most of the estimators we consider can be adjusted for stratified sampling. See, for example, Wooldridge (1999, 2007) on inverse probability weighting of averages and objective functions.

In many observational studies, $\tau_{PATT}$ is a more interesting estimand than the overall average effect. As an example, consider the case where a well defined population was exposed to a treatment, say a job training program. There may be various possibilities for a comparison group, including subjects drawn from public use data sets. In that case, it is generally not interesting to consider the effect of the program for the comparison group: for many members of the comparison group (e.g., individuals with stable, high-wage jobs) it is difficult and uninteresting to imagine their being enrolled in the labor market program. (Of course, the problem of averaging across units that are unlikely to receive future treatments can be mitigated by more carefully constructing the comparison group to be more like the treatment group, making $\tau_{PATE}$ a more meaningful parameter. See the discussion below.) A second case where $\tau_{PATT}$ is the estimand of most interest is in the setting of a voluntary program where those not enrolled will never be required to participate in the program. A specific example is the effect of serving in the military where an interesting question concerns the foregone earnings for those who served (Angrist 1998).

In practice, there is typically little motivation presented for the focus on the overall
average effect or the average effect for the treated. Take a job training program. The overall average effect would be the parameter of interest if the policy under consideration is a mandatory exposure to the treatment versus complete elimination. It is rare that these are the alternatives, with more typically exemptions granted to various subpopulations. Similarly the average effect for the treated would be informative about the effect of entirely eliminating the current program. More plausible regime changes would correspond to a modest extension of the program to other jurisdictions, or a contraction to a more narrow population.

A somewhat subtle issue is that we may wish to separate the extrapolation from the sample to the superpopulation from the problem of inference for the sample at hand. This suggests that, rather than focusing on PATE or PATT, we might first focus on the average causal effect conditional on the covariates in the sample,

have to be particularly concerned with the distinction between the two estimands at the estimation stage. However, there is an important difference between the population and conditional estimands at the inference stage. If there is heterogeneity in the effect of the treatment, we can estimate the sample average treatment effect $\tau_{CATE}$ more precisely than the population average treatment effect $\tau_{PATE}$. When one estimates the variance of an estimator $\hat{\tau}$ (which can serve as an estimate for $\tau_{PATE}$ or $\tau_{CATE}$), one therefore needs to be explicit about whether one is interested in the variance relative to the population or to the conditional average treatment effect. We will return to this issue in section 5.

A more general class of estimands includes average causal effects for subpopulations and weighted average causal effects. Let $A$ be a subset of the covariate space $\mathbb{X}$, and let $\tau_{CATE,A}$ denote the conditional average causal effect for the subpopulation with $X_i \in A$:
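In the potential-outcome notation of this section, the two conditional estimands just described can be written as follows. This is a sketch in standard form; $N_A$, the number of sample units with $X_i \in A$, is notation assumed here for the illustration.

$$\tau_{CATE} = \frac{1}{N} \sum_{i=1}^{N} E[Y_i(1) - Y_i(0) \mid X_i], \qquad \tau_{CATE,A} = \frac{1}{N_A} \sum_{i: X_i \in A} E[Y_i(1) - Y_i(0) \mid X_i].$$

Both average the unit-level expected effects over the covariate values actually observed in the sample, rather than over the superpopulation distribution of the covariates.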
the entire distribution of outcomes, or solely about average outcomes, and may also take into account costs associated with participation. If the administrator knew exactly the conditional distribution of the potential outcomes given the covariate information this would be a simple problem: the administrator would simply compare the expected welfare for different rules and choose the one with the highest value. However, the administrator does not have this knowledge and needs to make a decision given uncertainty about these distributions. In these settings, it is clearly important that the statistical model allows for heterogeneity in the treatment effects.

Graham, Imbens, and Ridder (2006) extend the type of problems studied in this literature by incorporating resource constraints. They focus on problems that include as a special case the problem of allocating a fixed number of slots in a program to a set of individuals on the basis of observable characteristics of these individuals given a random sample of individuals for whom outcome and covariate information is available.

4. Randomized Experiments

Experimental evaluations have traditionally been rare in economics. In many cases ethical considerations, as well as the reluctance of administrators to deny services to randomly selected individuals after they have been deemed eligible, have made it difficult to get approval for, and implement, randomized evaluations. Nevertheless, the few experiments that have been conducted, including some of the labor market training programs, have generally been influential, sometimes extremely so. More recently, many exciting and thought-provoking experiments have been conducted in development economics, raising new issues of design and analysis (see Duflo, Rachel Glennerster, and Kremer 2008 for a review).

With experimental data the statistical analysis is generally straightforward. Differencing average outcomes by treatment status or, equivalently, regressing the outcome on an intercept and an indicator for the treatment, leads to an unbiased estimator for the average effect of the treatment. Adding covariates to the regression function typically improves precision without jeopardizing consistency because the randomization implies that in large samples the treatment indicator and the covariates are independent. In practice, researchers have rarely gone beyond basic regression methods. In principle, however, there are additional methods that can be useful in these settings. In section 4.2, we review one important experimental technique, randomization-based inference, including Fisher's method for calculating exact p-values, that deserves wider usage in social sciences. See Rosenbaum (1995) for a textbook discussion.

4.1 Randomized Experiments in Economics

Randomized experiments have a long tradition in biostatistics. In this literature they are often viewed as the only credible approach to establishing causality. For example, the United States Food and Drug Administration typically requires evidence from randomized experiments in order to approve new drugs and medical procedures. A first comment concerns the fact that even randomized experiments rely to some extent on substantive knowledge. It is only once the researcher is willing to limit interactions between units that randomization can establish causal effects. In settings with potentially unrestricted interactions between units, randomization by itself cannot solve the identification problems required for establishing causality. In biomedical settings, where such interaction effects are often arguably absent, randomized experiments are therefore particularly attractive. Moreover, in biomedical settings it is often possible to
keep the units ignorant of their treatment status, further enhancing the interpretation of the estimated effects as causal effects of the treatment, and thus improving the external validity.

In the economics literature randomization has played a much less prominent role. At various times social experiments have been conducted, but they have rarely been viewed as the sole method for establishing causality, and in fact they have sometimes been regarded with some suspicion concerning the relevance of the results for policy purposes (e.g., Heckman and Smith 1995; see Gary Burtless 1995 for a more positive view of experiments in social sciences). Part of this may be due to the fact that for the treatments of interest to economists, e.g., education and labor market programs, it is generally impossible to do blind or double-blind experiments, creating the possibility of placebo effects that compromise the internal validity of the estimates. Nevertheless, this suspicion often downplays the fact that many of the concerns that have been raised in the context of randomized experiments, including those related to missing data and external validity, are often equally present in observational studies.

Among the early social experiments in economics were the negative income tax experiments in Seattle and Denver in the early 1970s, formally referred to as the Seattle and Denver Income Maintenance Experiments (SIME and DIME). In the 1980s, a number of papers called into question the reliability of econometric and statistical methods for estimating causal effects in observational studies. In particular, LaLonde (1986) and Fraker and Maynard (1987), using data from the National Supported Work (NSW) programs, suggested that widely used econometric methods were unable to replicate the results from experimental evaluations. These influential conclusions encouraged government agencies to insist on the inclusion of experimental evaluation components in job training programs.

Examples of such programs include the Greater Avenues to INdependence (GAIN) programs (e.g., James Riccio and Daniel Friedlander 1992), the WIN programs (e.g., Judith M. Gueron and Edward Pauly 1991; Friedlander and Gueron 1992; Friedlander and Philip K. Robins 1995), the Self-Sufficiency Project in Canada (Card and Dean R. Hyslop 2005, and Card and Robins 1996), and the Statistical Assistance for Programme Selection in Switzerland (Stefanie Behncke, Markus Frölich, and Lechner 2006). Like the NSW evaluation, these experiments have been useful not merely in establishing the effects of particular programs but also in providing fertile testing grounds for new statistical evaluation methods.

Recently there has been a large number of exciting and innovative experiments, mainly in development economics but also in other areas, including public finance (Duflo and Emmanuel Saez 2003; Duflo et al. 2006; Raj Chetty, Adam Looney, and Kory Kroft forthcoming). The experiments in development economics include many educational experiments (e.g., T. Paul Schultz 2001; Orazio Attanasio, Costas Meghir, and Ana Santiago 2005; Duflo and Rema Hanna 2005; Banerjee et al. 2007; Duflo 2001; Miguel and Kremer 2004). Others study topics as wide-ranging as corruption (Benjamin A. Olken 2007; Claudio Ferraz and Frederico Finan 2008) or gender issues in politics (Raghabendra Chattopadhyay and Duflo 2004). In a number of these experiments, economists have been involved from the beginning in the design of the evaluations, leading to closer connections between the substantive economic questions and the design of the experiments, thus improving the ability of these studies to lead to conclusive answers to interesting questions. These experiments have also led to renewed interest in questions of optimal design. Some of these issues are discussed in Duflo, Glennerster, and Kremer (2008), Miriam
Bruhn and David McKenzie (2008), and Imbens et al. (2008).

4.2 Randomization-Based Inference and Fisher's Exact P-Values

Fisher (1935) was interested in calculating p-values for hypotheses regarding the effect of treatments. The aim is to provide exact inferences for a finite population of size N. This finite population may be a random sample from a large superpopulation, but that is not exploited in the analysis. The inference is nonparametric in that it does not make functional form assumptions regarding the effects; it is exact in that it does not rely on large sample approximations. In other words, the p-values coming out of this analysis are exact and valid irrespective of the sample size.

The most common null hypothesis in Fisher's framework is that of no effect of the treatment for any unit in this population, against the alternative that, at least for some units, there is a non-zero effect:

$$H_0: Y_i(0) = Y_i(1), \quad \forall i = 1, \ldots, N, \qquad \text{against } H_a: \exists i \text{ such that } Y_i(0) \neq Y_i(1).$$

It is not important that the null hypothesis is that the effects are all zero. What is essential is that the null hypothesis is sharp, that is, the null hypothesis specifies the value of all unobserved potential outcomes for each unit. A more general null hypothesis could be that $Y_i(0) = Y_i(1) + c$ for some prespecified $c$, or that $Y_i(0) = Y_i(1) + c_i$ for some set of prespecified $c_i$. Importantly, this framework cannot accommodate null hypotheses such as the average effect of the treatment is zero, against the alternative hypothesis of a nonzero average effect, or

$$H_0': \frac{1}{N} \sum_i (Y_i(1) - Y_i(0)) = 0, \qquad \text{against } H_a': \frac{1}{N} \sum_i (Y_i(1) - Y_i(0)) \neq 0.$$

Whether the null of no effect for any unit versus the null of no effect on average is more interesting was the subject of a testy exchange between Fisher (who focused on the first) and Neyman (who thought the latter was the interesting hypothesis, and who stated that the first was only of academic interest) in Splawa-Neyman (1990). Putting the argument about its ultimate relevance aside, Fisher's test is a powerful tool for establishing whether a treatment has any effect. It is not essential in this framework that the probabilities of assignment to the treatment group are equal for all units. It is crucial, however, that the probability of any particular assignment vector is known. These probabilities may differ by unit provided the probabilities are known.

The implication of Fisher's framework is that, under the null hypothesis, we know the exact value of all the missing potential outcomes. Thus there are no nuisance parameters under the null hypothesis. As a result, we can deduce the distribution of any statistic, that is, any function of the realized values of $(Y_i, W_i)_{i=1}^{N}$, generated by the randomization. For example, suppose the statistic is the average difference between treated and control outcomes, $T(\mathbf{W}, \mathbf{Y}) = \bar{Y}_1 - \bar{Y}_0$, where $\bar{Y}_w = \sum_{i: W_i = w} Y_i / N_w$, for $w = 0, 1$. Now suppose we had assigned a different set of units to the treatment. Denote the vector of alternative treatment assignments by $\tilde{\mathbf{W}}$. Under the null hypothesis we know all the potential outcomes and thus we can deduce what the value of the statistic would have been under that alternative assignment, namely $T(\tilde{\mathbf{W}}, \mathbf{Y})$. We can infer the value of the statistic for all possible values of the assignment vector $\tilde{\mathbf{W}}$, and since we know the distribution of $\tilde{\mathbf{W}}$ we can deduce the distribution of $T(\tilde{\mathbf{W}}, \mathbf{Y})$. The distribution generated by the randomization of the treatment assignment is referred to as the randomization distribution. The p-value of the statistic is then calculated as the probability of a value for the statistic that is at
least as large, in absolute value, as that of the observed statistic, $T(\mathbf{W}, \mathbf{Y})$.

In moderately large samples, it is typically not feasible to calculate the exact p-values for these tests. In that case, one can approximate the p-value by basing it on a large number of draws from the randomization distribution. Here the approximation error is of a very different nature than that in typical large sample approximations: it is controlled by the researcher, and if more precision is desired one can simply increase the number of draws from the randomization distribution.

In the form described above, with the statistic equal to the difference in averages by treatment status, the results are typically not that different from those using Wald tests based on large sample normal approximations to the sampling distribution of the difference in means $\bar{Y}_1 - \bar{Y}_0$, as long as the sample size is moderately large. The Fisher approach to calculating p-values is much more interesting with other choices for the statistic. For example, as advocated by Rosenbaum in a series of papers (Rosenbaum 1984a, 1995), a generally attractive choice is the difference in average ranks by treatment status. First the outcome is converted into ranks (typically with, in case of ties, all possible rank orderings averaged), and then the test is applied using the average difference in ranks by treatment status as the statistic. The test is still exact, with its exact distribution under the null hypothesis known as the Wilcoxon distribution. Naturally, the test based on ranks is less sensitive to outliers than the test based on the difference in means.

If the focus is on establishing whether the treatment has some effect on the outcomes, rather than on estimating the average size of the effect, such rank tests are much more likely to provide informative conclusions than standard Wald tests based on differences in averages by treatment status. To illustrate this point, we took data from eight randomized evaluations of labor market programs. Four of the programs are from the WIN demonstration programs. The four evaluations took place in Arkansas, Baltimore, San Diego, and Virginia. See Gueron and Pauly (1991), Friedlander and Gueron (1992), Greenberg and Michael Wiseman (…), and Friedlander and Robins (1995) for more detailed discussions of each of these evaluations. The second set of four programs is from the GAIN programs in California. The four locations are Alameda, Los Angeles, Riverside, and San Diego. See Riccio and Friedlander (1992), Riccio, Friedlander, and Freedman (1994), and Dehejia (20…) for more details on these programs and their evaluations. In each location, we take as the outcome total earnings for the first (GAIN) or second (WIN) year following the program, and we focus on the subsample of individuals who had positive earnings at some point prior to the program. We calculate three p-values for each location. The first p-value is based on the normal approximation to the t-statistic calculated as the difference in average outcomes for treated and control individuals divided by the estimated standard error. The second p-value is based on randomization inference using the difference in average outcomes by treatment status. And the third p-value is based on the randomization distribution using the difference in average ranks by treatment status as the statistic. The results are in table 1.

In all eight cases, the p-values based on the t-test are very similar to those based on randomization inference. This outcome is not surprising given the reasonably large sample sizes, ranging from 71 (Arkansas, WIN) to 4,779 (San Diego, GAIN). However, in a number of cases, the p-value for the rank test is fairly different from that based on the level difference. In both sets of four locations there is one location where the rank test suggests a clear rejection at the
TABLE 1
P-VALUES FOR FISHER EXACT TESTS: RANKS VERSUS LEVELS
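The three kinds of p-values compared in table 1 (normal approximation, randomization inference on levels, and randomization inference on ranks) can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it assumes complete randomization with a fixed number of treated units, approximates the randomization distribution by random draws instead of full enumeration, and all function names and the simulated earnings data are hypothetical.

```python
import math
import numpy as np


def diff_in_means(w, y):
    """T(W, Y): average treated outcome minus average control outcome."""
    return y[w == 1].mean() - y[w == 0].mean()


def average_ranks(y):
    """Ranks 1..N; tied values share the average of the ranks they span."""
    order = np.argsort(y, kind="stable")
    ranks = np.empty(len(y), dtype=float)
    ranks[order] = np.arange(1, len(y) + 1)
    for value in np.unique(y):
        tied = y == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def randomization_p_value(w, y, n_draws=2000, seed=0):
    """Approximate Fisher p-value for the sharp null Y_i(0) = Y_i(1) for all i.

    Assumption of this sketch: complete randomization with a fixed number of
    treated units, so each permutation of w is an equally likely assignment.
    """
    rng = np.random.default_rng(seed)
    observed = abs(diff_in_means(w, y))
    hits = 0
    for _ in range(n_draws):
        w_alt = rng.permutation(w)  # alternative assignment vector
        # Under the sharp null, the outcomes y do not change with the assignment.
        if abs(diff_in_means(w_alt, y)) >= observed:
            hits += 1
    return hits / n_draws


def normal_p_value(w, y):
    """Two-sided p-value from the large-sample normal approximation."""
    y1, y0 = y[w == 1], y[w == 0]
    se = math.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return math.erfc(abs(y1.mean() - y0.mean()) / se / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical skewed "earnings": many zeros plus a long right tail.
    rng = np.random.default_rng(1)
    w = rng.permutation(np.repeat([0, 1], 100))
    base = np.where(rng.random(200) < 0.4, 0.0, rng.lognormal(1.0, 1.0, 200))
    y = base * (1 + 0.3 * w)
    print("normal approximation: ", normal_p_value(w, y))
    print("randomization, levels:", randomization_p_value(w, y))
    print("randomization, ranks: ", randomization_p_value(w, average_ranks(y)))
```

Raising `n_draws` shrinks the simulation error of the two randomization p-values, which is the sense in which that error is under the researcher's control.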
5 percent level whereas the level-based test would suggest that the null hypothesis of no effect should not be rejected at the 5 percent level. In the WIN (San Diego) evaluation, the p-value goes from 0.068 (levels) to 0.024 (ranks), and in the GAIN (San Diego) evaluation, the p-value goes from 0.136 (levels) to 0.018 (ranks). It is not surprising that the tests give different results. Earnings data are very skewed. A large proportion of the populations participating in these programs have zero earnings even after conditioning on positive past earnings, and the earnings distribution for those with positive earnings is skewed. In those cases, a rank-based test is likely to have more power against alternatives that shift the distribution toward higher earnings than tests based on the difference in means.

As a general matter it would be useful in randomized experiments to include such results for rank-based p-values, as a generally applicable way of establishing whether the treatment has any effect. As with all omnibus tests, one should use caution in interpreting a rejection, as the test can pick up interesting changes in the distribution (such as a mean or median effect) but also less interesting changes (such as higher moments about the mean).

5. Estimation and Inference under Unconfoundedness

Methods for estimation of average treatment effects under unconfoundedness are the most widely used in this literature. The central paper in this literature, which introduces the key assumptions, is Rosenbaum and Rubin (1983b), although the literature goes further back (e.g., William G. Cochran 1968; Cochran and Rubin 1973; Rubin 1977). Often the unconfoundedness assumption, which requires that conditional on observed covariates there are no unobserved factors that are associated both with the assignment and with the potential outcomes, is controversial. Nevertheless, in practice, where often data have been collected in order to make this assumption more plausible, there are many cases where there is no clearly superior alternative, and the only alternative is to abandon the attempt to get precise inferences. In this section, we discuss some of these methods and the issues related to them. A general theme of this literature is that the concern is more with biases than with efficiency.

Among the many recent economic applications relying on assumptions of this type are Blundell et al. (2001), Angrist (1998), Card and Hyslop (2005), Card and Brian P.
McCall (1996), V. Joseph Hotz, Imbens, and Jacob A. Klerman (2006), Card and Phillip B. Levine (1994), Card, Carlos Dobkin, and Nicole Maestas (2004), Hotz, Imbens, and Julie H. Mortimer (2005), Lechner (2002a), Abadie and Javier Gardeazabal (2003), and Bloom (2005).

This setting is closely related to that underlying standard multiple regression analysis with a rich set of controls. See, for example, Burt S. Barnow, Glen G. Cain, and Arthur S. Goldberger (1980). Unconfoundedness implies that we have a sufficiently rich set of predictors for the treatment indicator, contained in the vector of covariates $X_i$, such that adjusting for differences in these covariates leads to valid estimates of causal effects. Combined with linearity assumptions of the conditional expectations of the potential outcomes given covariates, the unconfoundedness assumption justifies linear regression. But in the last fifteen years the literature has moved away from the earlier emphasis on regression methods. The main reason is that, although locally linearity of the regression functions may be a reasonable approximation, in many cases the estimated average treatment effects based on regression methods can be severely biased if the linear approximation is not accurate globally. To assess the potential problems with (global) regression methods, it is useful to report summary statistics of the covariates by treatment status. In particular, one may wish to report, for each covariate, the difference in averages by treatment status, scaled by the square root of the sum of the variances, as a scale-free measure of the difference in distributions. To be specific, one may wish to report the normalized difference

in the subsample with treatment $W_i = w$. Imbens and Rubin (forthcoming) suggest as a rule of thumb that with a normalized difference exceeding one quarter, linear regression methods tend to be sensitive to the specification. Note the difference with the often reported t-statistic for the null hypothesis of equal means,

$$T = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{S_0^2/N_0 + S_1^2/N_1}}. \tag{4}$$

The reason for focusing on the normalized difference, (3), rather than on the t-statistic, (4), as a measure of the degree of difficulty in the statistical problem of adjusting for differences in covariates, comes from their relation to the sample size. Clearly, simply increasing the sample size does not make the problem of inference for the average treatment effect inherently more difficult. However, quadrupling the sample size leads, in expectation, to a doubling of the t-statistic. In contrast, increasing the sample size does not systematically affect the normalized difference. In the landmark LaLonde (1986) paper the normalized difference in means exceeds unity for many of the covariates, immediately showing that standard regression methods are unlikely to lead to credible results for those data, even if one views unconfoundedness as a reasonable assumption.

As a result of the concerns with the sensitivity of results based on linear regression methods to seemingly minor changes in specification, the literature has moved to more sophisticated methods for adjusting for differences in covariates. Some of these more sophisticated methods use the propensity score, the conditional probability of receiving
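The contrast between the normalized difference (3) and the t-statistic (4) can be checked numerically. The sketch below assumes the normalized difference is the difference in covariate means divided by the square root of the sum of the two within-arm sample variances, as described in the text; the function names and simulated covariate are hypothetical, and tiling the sample four times is a deterministic stand-in for drawing a sample four times as large.

```python
import numpy as np


def normalized_difference(x, w):
    """(mean_1 - mean_0) / sqrt(S_0^2 + S_1^2), with sample variances by arm."""
    x1, x0 = x[w == 1], x[w == 0]
    return (x1.mean() - x0.mean()) / np.sqrt(x0.var(ddof=1) + x1.var(ddof=1))


def t_statistic(x, w):
    """(mean_1 - mean_0) / sqrt(S_0^2 / N_0 + S_1^2 / N_1), as in (4)."""
    x1, x0 = x[w == 1], x[w == 0]
    return (x1.mean() - x0.mean()) / np.sqrt(
        x0.var(ddof=1) / len(x0) + x1.var(ddof=1) / len(x1)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.repeat([0, 1], [600, 400])
    x = rng.normal(loc=0.3 * w, scale=1.0)  # hypothetical covariate
    for copies in (1, 4):
        xs, ws = np.tile(x, copies), np.tile(w, copies)
        print(copies, normalized_difference(xs, ws), t_statistic(xs, ws))
```

With four copies of the data the t-statistic roughly doubles while the normalized difference is essentially unchanged, which is why the latter is the more useful gauge of how hard the covariate adjustment problem is.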
nonparametric versions of the regression estimators) in fact achieve the semiparametric efficiency bound; thus, they would tend to be similar in large samples. Choices among them typically rely on small sample arguments, which are rarely formalized, and which do not uniformly favor one estimator over another. Most estimators currently in use can be written as the difference of a weighted average of the treated and control outcomes, with the weights in both groups adding up to one:

$$\hat{\tau} = \sum_{i=1}^{N} \lambda_i \cdot Y_i, \quad \text{with } \sum_{i: W_i = 1} \lambda_i = 1, \quad \sum_{i: W_i = 0} \lambda_i = -1.$$

The estimators differ in the way the weights $\lambda_i$ depend on the full vector of assignments and matrix of covariates (including those of other units). For example, some estimators implicitly allow the weights to be negative for the treated units and positive for control units, whereas others do not. In addition, some depend on essentially all other units whereas others depend only on units with similar covariate values. Nevertheless, despite the commonalities of the estimators and large sample equivalence results, in practice the performance of the estimators can be quite different, particularly in terms of robustness and bias. Little is known about finite sample properties. The few simulation studies include Zhong Zhao (2004), Frölich (2004a), and Matias Busso, John DiNardo, and Justin McCrary (2008). On a more positive note, some understanding has been reached regarding the sensitivity of specific estimators to particular configurations of the data, such as limited overlap in covariate distributions. Currently, the best practice is to combine linear regression with either propensity score or matching methods in ways that explicitly rely on local, rather than global, linear approximations to the regression functions.

An ongoing discussion concerns the role of the propensity score, $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$, introduced by Rosenbaum and Rubin (1983b), and indeed whether there is any role for this concept. See for recent contributions to this discussion Hahn (1998), Imbens (2004), Angrist and Hahn (2004), Peter C. Austin (2008a, 2008b), Dehejia (2005a), Smith and Todd (2001, 2005), Heckman, Ichimura, and Todd (1998), Frölich (2004a, 2004b), B. B. Hansen (2008), Jennifer Hill (2008), Robins and Ya'acov Ritov (1997), Rubin (1997, 2006), and Elizabeth A. Stuart (2008).

In this section, we first discuss the key assumptions underlying an analysis based on unconfoundedness. We then review some of the efficiency bound results for average treatment effects. Next, in sections 5.3 to 5.5, we briefly review the basic methods relying on regression, propensity score methods, and matching. Although still fairly widely used, we do not recommend these methods in practice. In sections 5.6 to 5.8, we discuss three of the combination methods that we view as more attractive and recommend in practice. We discuss estimating variances in section 5.9. Next we discuss implications of lack of overlap in the covariate distributions. In particular, we discuss two general methods for constructing samples with improved covariate balance, both relying heavily on the propensity score. In section 5.11, we describe methods that can be used to assess the plausibility of the unconfoundedness assumption, even though this assumption is not directly testable. We discuss methods for testing for the presence of average treatment effects and for the presence of treatment effect heterogeneity under unconfoundedness in section 5.12.

5.1 Identification

The key assumption is unconfoundedness, introduced by Rosenbaum and Rubin (1983b),
where the second equality follows by unconfoundedness: $E[Y_i(w) \mid W_i = w, X_i]$ does not depend on $w$. By the overlap assumption, we can estimate both terms in the last line, and therefore we can identify $\tau(x)$. Given that we can identify $\tau(x)$ for all $x$, we can identify the expected value across the population distribution of the covariates,

$$\tau_{PATE} = E[\tau(X_i)], \tag{6}$$

as well as $\tau_{PATT}$ and other estimands.

5.2 Efficiency Bounds

Before discussing specific estimation methods, it is useful to see what we can learn about the parameters of interest, given just the strong ignorability of treatment assignment assumption, without functional form or distributional assumptions. In order to do so, we need some additional notation. Let $\sigma_0^2(x) = V(Y_i(0) \mid X_i = x)$ and $\sigma_1^2(x) = V(Y_i(1) \mid X_i = x)$ denote the conditional variances of the potential outcomes given the covariates. Hahn (1998) derives the lower bounds for asymptotic variances of $\sqrt{N}$-consistent estimators for $\tau_{PATE}$ as

$$V_{PATE} = E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + (\tau(X_i) - \tau)^2\right], \tag{7}$$

where $p = E[e(X_i)]$ is the unconditional treatment probability. Interestingly, this lower bound holds irrespective of whether the propensity score is known or not. The form of this variance bound is informative. It is no surprise that $\tau_{PATE}$ is more difficult to estimate the larger are the variances $\sigma_0^2(x)$ and $\sigma_1^2(x)$. However, as shown by the presence of the third term, it is also more difficult to estimate $\tau_{PATE}$, the more variation there is in the average treatment effect conditional on the covariates. If we focus instead on estimating $\tau_{CATE}$, the conditional average treatment effect, the third term drops out, and the variance bound for $\tau_{CATE}$ is

$$V_{CATE} = E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)}\right]. \tag{8}$$

Still, the role of heterogeneity in the treatment effect is potentially important. Suppose we actually had prior knowledge that the average treatment effect conditional on the covariates is constant, or $\tau(x) = \tau_{PATE}$ for all $x$. Given this assumption, the model is closely related to the partial linear model (Peter M. Robinson 1988; James H. Stock 1989). Given this prior knowledge, the variance bound is

$$V_{const} = \left(E\left[\left(\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)}\right)^{-1}\right]\right)^{-1}. \tag{9}$$

This variance bound can be much lower than (8) if there is variation in the propensity score. Knowledge of lack of variation in the treatment effect can be very valuable, or conversely, allowing for general heterogeneity in the treatment effect can be expensive in terms of precision.

In addition to the conditional variances of the counterfactual outcomes, a third important determinant of the efficiency bound is the propensity score. Because it enters into (7) in the denominator, the presence of units with the propensity score close to zero or one will make it difficult to obtain precise estimates of the average effect of the treatment. One approach to address this problem, developed by Crump et al. (2009) and discussed in more detail in section 5.10, is to drop observations with the propensity score close to zero and one, and focus on the average effect of the treatment in the subpopulation with propensity scores away from zero. Suppose we focus on $\tau_{CATE,A}$, the average of $\tau(X_i)$ for $X_i \in A$. Then the variance bound is
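The ordering of the bounds (7), (8), and (9) can be illustrated numerically by replacing the expectations with averages over simulated covariate draws. The sketch below is hypothetical throughout: the function name, the logistic propensity score, the unit conditional variances, and the linear $\tau(x)$ are assumptions made for the example. The quantity labeled (9) is the bound that would apply if $\tau(x)$ were known to be constant, and is computed here only for comparison.

```python
import numpy as np


def efficiency_bounds(e, var1, var0, tau_x):
    """Sample analogues of the variance bounds (7), (8), and (9).

    e, var1, var0, tau_x hold e(X_i), sigma_1^2(X_i), sigma_0^2(X_i), and
    tau(X_i) at draws of X_i; expectations are replaced by sample averages.
    """
    core = var1 / e + var0 / (1.0 - e)
    v_cate = core.mean()                                     # (8)
    v_pate = v_cate + np.mean((tau_x - tau_x.mean()) ** 2)   # (7)
    v_const = 1.0 / np.mean(1.0 / core)                      # (9)
    return v_pate, v_cate, v_const


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    e = 1.0 / (1.0 + np.exp(-1.5 * x))   # assumed propensity score
    ones = np.ones_like(x)               # assumed unit conditional variances
    tau_x = 1.0 + 0.5 * x                # assumed heterogeneous effect
    v_pate, v_cate, v_const = efficiency_bounds(e, ones, ones, tau_x)
    print("V_PATE:", v_pate, "V_CATE:", v_cate, "V_const:", v_const)
```

Because the harmonic mean never exceeds the arithmetic mean, the value for (9) cannot exceed the value for (8), and the gap widens as the propensity score varies more across units.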
By definition, the average treatment effect conditional on $X = x$ is $\tau(x) = \mu_1(x) - \mu_0(x)$. As we discussed in the identification subsection, under the unconfoundedness assumption, $\mu_0(x) = E[Y_i \mid W_i = 0, X_i = x]$ and $\mu_1(x) = E[Y_i \mid W_i = 1, X_i = x]$, which means we can estimate $\mu_0(\cdot)$ using regression methods for the untreated subsample and $\mu_1(\cdot)$ using the treated subsample. Given consistent estimators $\hat{\mu}_0(\cdot)$ and $\hat{\mu}_1(\cdot)$, a consistent estimator for either $\tau_{PATE}$ or $\tau_{CATE}$ is

8 There is a somewhat subtle issue in estimating treatment effects from stratified samples or samples with missing values of the covariates. If the missingness or stratification are determined by outcomes on the covariates, $X_i$, and the conditional means are correctly specified, then the missing data or stratification can be ignored for the purposes of estimating the regression parameters; see, for example, Wooldridge (1999, 2007). However, sample selection or stratification based on $X_i$ cannot be ignored in estimating, say, $\tau_{PATE}$, because $\tau_{PATE}$ equals the expected difference in regression functions across the population distribution of $X_i$. Therefore, consistent estimation of $\tau_{PATE}$ requires applying inverse probability weights or sampling weights to the average in (11).
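The regression estimator described here, which fits one regression per treatment arm and then averages the difference in predicted outcomes over all units, can be sketched as follows. This is an illustration under assumptions: it uses linear least squares for both arms and a simple random sample (no sampling weights), and the function names and simulated data are hypothetical.

```python
import numpy as np


def fit_linear(x, y):
    """Least squares fit of y on an intercept and x; returns the coefficients."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, x):
    """Predicted outcomes from the fitted intercept and slopes."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def regression_ate(y, w, x):
    """Average over all units of mu_hat_1(X_i) - mu_hat_0(X_i)."""
    coef1 = fit_linear(x[w == 1], y[w == 1])  # treated subsample
    coef0 = fit_linear(x[w == 0], y[w == 0])  # untreated subsample
    return float(np.mean(predict(coef1, x) - predict(coef0, x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    x = rng.normal(size=n)
    w = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)
    y = 1.0 + 2.0 * x + w * (3.0 + 0.5 * x) + rng.normal(size=n)
    print("regression estimate of the average effect:", regression_ate(y, w, x))
```

Averaging over all units targets the overall average effect; averaging the same difference over treated units only would instead target the effect for the treated.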
Standard error is only valid for $\tau_{CATE}$ and not for $\tau_{PATE}$.)

A different representation of $\hat\tau_{reg}$ is useful in order to illustrate some of the concerns with regression estimators in this setting. Suppose we do use the linear model in (12). It can be shown that

(14) $\hat\tau_{reg} = \bar Y_1 - \bar Y_0 - \left( \frac{N_0}{N_0 + N_1} \cdot \hat\beta_1 + \frac{N_1}{N_0 + N_1} \cdot \hat\beta_0 \right)' (\bar X_1 - \bar X_0).$

To adjust for differences in covariates between treated and control units, the simple difference in average outcomes, $\bar Y_1 - \bar Y_0$, is adjusted by the difference in average covariates, $\bar X_1 - \bar X_0$, multiplied by the weighted average of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ in the two treatment regimes. This is a useful representation. It shows that if the averages of the covariates in the two treatment arms are very different, then the adjustment to the simple mean difference can be large. We can see that even more clearly by inspecting the predicted outcome for the treated units had they been subject to the control treatment:

$\hat E[Y_i(0) \mid W_i = 1] = \bar Y_0 + \hat\beta_0' (\bar X_1 - \bar X_0).$

The regression parameter $\hat\beta_0$ is estimated on the control sample, where the average of the covariates is equal to $\bar X_0$. It therefore likely provides a good approximation to the conditional mean function around that value. However, this estimated regression function is then used to predict outcomes in the treated sample, where the average of the covariates is equal to $\bar X_1$. If these covariate averages are very different, and thus the regression model is used to predict outcomes far away from where the parameters were estimated, the results can be sensitive to minor changes in the specification. Unless the linear approximation to the regression function is globally accurate, regression may lead to severe biases. Another way of interpreting this problem is as a multicollinearity problem. If the averages of the covariates in the two treatment arms are very different, the correlation between the covariates and the treatment indicator is relatively high. Although conventional least squares standard errors take the degree of multicollinearity into account, they do so conditional on the specification of the regression function. Here the concern is that any misspecification may be exacerbated by the collinearity problem. As noted in the introduction to section 5, an easy way to establish the severity of this problem is to inspect the normalized differences $(\bar x_1 - \bar x_0)/\sqrt{s_0^2 + s_1^2}$.

In the case of the standard regression estimator it is straightforward to derive and to estimate the variance when we view the estimator as an estimator of $\tau_{CATE}$. Assuming the linear regression model is correctly specified, we have

(15) $\sqrt{N}(\hat\tau_{reg} - \tau_{CATE}) \xrightarrow{d} \mathcal{N}(0, V_0 + V_1),$

where $V_w = N \cdot E[(\hat\alpha_w - \alpha_w)^2]$,

which can be obtained directly from standard regression output. Estimating the variance when we view the estimator as an estimator of $\tau_{PATE}$ requires adding a term capturing the variation in the treatment effect conditional on the covariates. The form is then

$\sqrt{N}(\hat\tau_{reg} - \tau_{PATE}) \xrightarrow{d} \mathcal{N}(0, V_0 + V_1 + V_\tau),$

where the third term in the normalized variance is

$V_\tau = (\beta_1 - \beta_0)' \, E[(X_i - E[X_i])(X_i - E[X_i])'] \, (\beta_1 - \beta_0),$

which can be estimated as
have been explored. The first relies on local smoothing, and the second on increasingly flexible global approximations. We discuss both in turn.

Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) consider local smoothing methods to estimate the two regression functions. The first method they consider is kernel regression. Given a kernel $K(\cdot)$, and a bandwidth $h$, the kernel estimator for $\mu_w(x)$ is

$\hat\mu_w(x) = \sum_{i: W_i = w} Y_i \cdot \lambda_i$, with weight $\lambda_i = K\!\left(\frac{X_i - x}{h}\right) \Big/ \sum_{j: W_j = w} K\!\left(\frac{X_j - x}{h}\right).$

Although the rate of convergence of the kernel estimator to the regression function is slower than the conventional parametric rate $N^{-1/2}$, the rate of convergence of the implied estimator for the average treatment effect, $\hat\tau_{reg}$ in (11), is the regular parametric rate under regularity conditions. These conditions include smoothness of the regression functions and require the use of higher order kernels (with the order of the kernel depending on the dimension of the covariates). In

$(\hat\alpha(x), \hat\beta(x)) = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \lambda_i \cdot (Y_i - \alpha - \beta'(X_i - x))^2,$

with the same weights $\lambda_i$ as in the standard kernel estimator. The regression function at $x$ is then estimated as $\hat\mu(x) = \hat\alpha(x)$. In order to achieve convergence at the best possible rate for $\hat\tau_{reg}$, one needs to use higher order kernels, although the order required is less than that for the standard kernel estimator.

For both the standard kernel estimator and the local linear estimator an important choice is that of the bandwidth $h$. In practice, researchers have used ad hoc methods for bandwidth selection. Formal results on bandwidth selection from the literature on nonparametric regression are not directly applicable. Those results are based on minimizing a global criterion such as the expected value of the squared difference between the estimated and true regression function, with the expectation taken with respect to the marginal distribution of the covariates. Thus, they focus on estimating the regression function well everywhere. Here the focus is on a particular scalar functional of the regression function, and it is not clear whether the
Before we turn to propensity score methods, we should comment on estimating the average treatment effects on the treated, $\tau_{PATT}$ and $\tau_{CATT}$. In this case, $\hat\tau(X_i)$ gets averaged across observations with $W_i = 1$, rather than across the entire sample as in (11). Because $\hat\mu_1(x)$ is estimated on the treated subsample, in estimating PATT or CATT there is no problem if $\mu_1(x)$ is poorly estimated at covariate values that are common in the control group but scarce in the treatment group. But we must have a good estimate of $\mu_0(x)$ at covariate values common in the treatment group, and this is not ensured because we can only use the control group to obtain $\hat\mu_0(x)$. Nevertheless, in many settings $\mu_0(x)$ can be estimated well over the entire range of the covariates because the control group often includes units that are similar to those in the treatment group. By contrast, often there are numerous control group units (for example, high-income workers in the context of a job training program) that are quite different from any units in the treatment group, making the ATE parameters considerably more difficult to estimate than ATT parameters. (Further, the ATT parameters are more interesting from a policy perspective in such cases, unless one redefines the population to exclude some units that are unlikely to ever be in the treatment group.)

5.4 Methods Based on the Propensity Score

The first set of alternatives to regression estimators relies on estimates of the propensity score. These methods were introduced in Rosenbaum and Rubin (1983b). An early economic discussion is in Card and Sullivan (1988). Rosenbaum and Rubin show that, under unconfoundedness, independence of potential outcomes and treatment indicators also holds after conditioning solely on the propensity score, $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$:

$W_i \perp (Y_i(0), Y_i(1)) \mid X_i \;\Longrightarrow\; W_i \perp (Y_i(0), Y_i(1)) \mid e(X_i).$

The basic insight is that for any binary variable $W_i$, and any random vector $X_i$, it is true (without assuming unconfoundedness) that

$X_i \perp W_i \mid e(X_i).$

Hence, within subpopulations with the same value for the propensity score, covariates are independent of the treatment indicator and thus cannot lead to biases (the same way in a regression framework omitted variables that are uncorrelated with included covariates do not introduce bias). Since under unconfoundedness all biases can be removed by adjusting for differences in covariates, this means that within subpopulations homogeneous in the propensity score there are no biases in comparisons between treated and control units.

Given the Rosenbaum-Rubin result, it is sufficient, under the maintained assumption of unconfoundedness, to adjust solely for differences in the propensity score between treated and control units. This result can be exploited in a number of ways. Here we discuss three of these that have been used in practice. The first two of these methods exploit the fact that the propensity score can be viewed as a covariate that is sufficient to remove biases in estimation of average treatment effects. For this purpose, any one-to-one function of the propensity score could also be used. The third method further uses the fact that the propensity score is the conditional probability of receiving the treatment.

The first method simply uses the propensity score in place of the covariates in regression analysis. Define $\nu_w(e) = E[Y_i \mid W_i = w, e(X_i) = e]$. Unconfoundedness in combination with the Rosenbaum-Rubin result implies that $\nu_w(e) = E[Y_i(w) \mid e(X_i) = e]$. Then we can estimate $\nu_w(e)$ very generally using kernel or series estimation on the propensity score, something which is greatly simplified by the fact that the propensity score is a scalar. Heckman, Ichimura, and Todd (1998) consider local smoothers and Hahn (1998)
calculations, researchers have often used five strata, although depending on the sample size and the joint distribution of the data, fewer or more blocks will generally lead to a lower expected mean squared error.

The variance for this estimator is typically calculated conditional on the strata indicators, and assuming random assignment within the strata. That is, for stratum $j$, the estimator is $\hat\tau_j$, and its variance is estimated as $\hat V_j = \hat V_{j0} + \hat V_{j1}$, where

$\hat V_{jw} = \frac{s_{jw}^2}{N_{jw}}$, where $s_{jw}^2 = \frac{1}{N_{jw}} \sum_{i: B_{ij} = 1, W_i = w} (Y_i - \bar Y_{jw})^2.$

The overall variance is then estimated as

$\hat V(\hat\tau_{block}) = \sum_{j} (\hat V_{j0} + \hat V_{j1}) \cdot \left( \frac{N_{j0} + N_{j1}}{N} \right)^2.$

This variance estimator is appropriate for $\tau_{CATE}$, although it ignores biases arising from variation in the propensity score within strata.

The third method exploiting the propensity score is based on weighting. Recall that $\tau_{PATE} = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$. We consider the two terms separately. Because $W_i \cdot Y_i = W_i \cdot Y_i(1)$, we have

$E\left[\frac{W_i \cdot Y_i}{e(X_i)}\right] = E\left[\frac{W_i \cdot Y_i(1)}{e(X_i)}\right] = E\left[ E\left[ \frac{W_i \cdot Y_i(1)}{e(X_i)} \,\Big|\, X_i \right] \right] = E\left[ \frac{E(W_i \mid X_i) \cdot E(Y_i(1) \mid X_i)}{e(X_i)} \right] = E\left[ \frac{e(X_i) \cdot E(Y_i(1) \mid X_i)}{e(X_i)} \right] = E[E(Y_i(1) \mid X_i)] = E[Y_i(1)],$

where the second and final equalities follow by iterated expectations and the third equality holds by unconfoundedness. The implication is that weighting the treated population by the inverse of the propensity score recovers the expectation of the unconditional response under treatment. A similar calculation shows $E[((1 - W_i) Y_i)/(1 - e(X_i))] = E[Y_i(0)]$, and together these imply

(16) $\tau_{PATE} = E\left[ \frac{W_i \cdot Y_i}{e(X_i)} - \frac{(1 - W_i) \cdot Y_i}{1 - e(X_i)} \right].$

Equation (16) suggests an obvious estimator for $\tau_{PATE}$:

(17) $\tilde\tau_{weight} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{W_i \cdot Y_i}{e(X_i)} - \frac{(1 - W_i) \cdot Y_i}{1 - e(X_i)} \right),$

which, as a sample average from a random sample, is consistent for $\tau_{PATE}$ and $\sqrt{N}$ asymptotically normally distributed. The estimator in (17) is essentially due to D. G. Horvitz and D. J. Thompson (1952).9

In practice, (17) is not a feasible estimator because it depends on the propensity score function $e(\cdot)$, which is rarely known. A surprising result is that, even if we know the propensity score, $\tilde\tau_{weight}$ does not achieve the efficiency bound given in (7). It turns out to be better, in terms of large sample efficiency, to weight using the estimated rather than the true propensity score. Hirano, Imbens, and Ridder (2003) establish conditions under which replacing $e(\cdot)$ with a logistic sieve estimator results in a weighted propensity score estimator that achieves the variance bound. The estimator is practically simple to compute, as estimation of the propensity score involves a straightforward logit estimation

9 Because the Horvitz-Thompson estimator is based on sample averages, adjustments for stratified sampling are straightforward if one is provided sampling weights.
involving flexible functions of the covariates. Theoretically, the number of terms in the approximation should increase with the sample size. In the second step, given the estimated propensity score $\hat e(x)$, one estimates

(18) $\hat\tau_{ipw} = \sum_{i=1}^{N} \frac{W_i \cdot Y_i}{\hat e(X_i)} \Big/ \sum_{i=1}^{N} \frac{W_i}{\hat e(X_i)} \;-\; \sum_{i=1}^{N} \frac{(1 - W_i) \cdot Y_i}{1 - \hat e(X_i)} \Big/ \sum_{i=1}^{N} \frac{1 - W_i}{1 - \hat e(X_i)}.$

We refer to this as the inverse probability weighting (IPW) estimator. See Hirano, Imbens, and Ridder (2003) for intuition as to why estimating the propensity score leads to a more efficient estimator, asymptotically, than knowing the propensity score.

Ichimura and Oliver Linton (2005) studied $\hat\tau_{ipw}$ when $\hat e(\cdot)$ is obtained via kernel regression, and they consider the problem of optimal bandwidth choice when the object of interest is $\tau_{PATE}$. More recently, Li, Racine, and Wooldridge (forthcoming) consider kernel estimation for discrete as well as continuous covariates. The estimator proposed by Li, Racine, and Wooldridge achieves the variance lower bound. See Hirano, Imbens, and Ridder (2003) and Wooldridge (2007) for methods for estimating the variance for these estimators.

Note that the blocking estimator can also be interpreted as a weighting estimator. Consider observations in block $j$. Within the block, the $N_{j1}$ treated observations all get equal weight $1/N_{j1}$. In the estimator for the overall average treatment effect, this block gets weight $(N_{j0} + N_{j1})/N$, so we can write $\hat\tau = \sum_{i=1}^{N} \lambda_i \cdot Y_i$, where for treated observations in block $j$ the weight normalized by $N$ is $N \cdot \lambda_i = (N_{j0} + N_{j1})/N_{j1}$, and for control observations it is $N \cdot \lambda_i = -(N_{j0} + N_{j1})/N_{j0}$. Implicitly this estimator is based on an estimate of the propensity score in block $j$ equal to $N_{j1}/(N_{j0} + N_{j1})$. Compared to the IPW estimator, the propensity score is smoothed within the block. This has the advantage of avoiding particularly large weights, but comes at the expense of introducing bias if the propensity score is correctly specified.

A particular concern with IPW estimators arises again when the covariate distributions are substantially different for the two treatment groups. That implies that the propensity score gets close to zero or one for some values of the covariates. Small or large values of the propensity score raise a number of issues. One concern is that alternative parametric models for the binary data, such as probit and logit models that can provide similar approximations in terms of estimated probabilities over the middle ranges of their arguments, tend to be more different when the probabilities are close to zero or one. Thus the choice of model and specification becomes more important, and it is often difficult to make well motivated choices in treatment effect settings. A second concern is that for units with propensity scores close to zero or one, the weights can be large, making those units particularly influential in the estimates of the average treatment effects, and thus making the estimator imprecise. These concerns are less serious than those regarding regression estimators because at least the IPW estimates will accurately reflect uncertainty. Still, these concerns make the simple IPW estimators less attractive. (As for regression cases, the problem can be less severe for the ATT parameters because propensity score values close to zero play no role. Problems for estimating ATT arise when some units, as described by their observed covariates, are almost certain to receive treatment.)

5.5 Matching

Matching estimators impute the missing potential outcomes using only the outcomes of a few nearest neighbors of the opposite treatment group. In that sense, matching is similar to nonparametric kernel regression, with the number of neighbors playing the role of the bandwidth in the kernel regression. A
formal difference with kernel methods is that the asymptotic distribution for matching estimators is derived conditional on the implicit bandwidth, that is, the number of neighbors, often fixed at a small number, e.g., one. Using such asymptotics, the implicit estimate $\hat\mu_w(x)$ is (close to) unbiased, but not consistent, for $\mu_w(x)$. In contrast, the kernel regression estimators discussed in the previous section implied consistency of $\hat\mu_w(x)$.

Matching estimators have the attractive feature that the smoothing parameters are easily interpretable. Given the matching metric, the researcher only has to choose the number of matches. Using only a single match leads to the most credible inference with the least bias, at the cost of sacrificing some precision. This sits well with the focus in the literature on reducing bias rather than variance. It also can make the matching estimator easier to use than those estimators that require more complex choices of smoothing parameters, and this may be another explanation for its popularity.

Matching estimators have been widely studied in practice and theory (e.g., X. Gu and Rosenbaum 1993; Rosenbaum 1989, 1995, 2002; Rubin 1973b, 1979; Rubin and Neal Thomas 1992a, 1992b, 1996, 2000; Heckman, Ichimura, and Todd 1998; Dehejia and Sadek Wahba 1999; Abadie and Imbens 2006; Alexis Diamond and Jasjeet S. Sekhon 2008; Sekhon forthcoming; Sekhon and Richard Grieve 2008; Rosenbaum and Rubin 1985; Stefano M. Iacus, Gary King, and Giuseppe Porro 2008). Most often they have been applied in settings where (1) the interest is in the average treatment effect for the treated, and (2) there is a large reservoir of potential controls, although recent work (Abadie and Imbens 2006) shows that matching estimators can be modified to estimate the overall average effect. The setting with many potential controls allows the researcher to match each treated unit to one or more distinct controls, hence the label "matching without replacement." Given the matched pairs, the treatment effect within a pair is estimated as the difference in outcomes, and the overall average as the average of the within-pair difference. Exploiting the representation of the estimator as a difference in two sample means, inference is based on standard methods for differences in means or methods for paired randomized experiments, ignoring any remaining bias. Fully efficient matching algorithms that take into account the effect of a particular choice of match for treated unit $i$ on the pool of potential matches for unit $j$ are computationally cumbersome. In practice, researchers use greedy algorithms that sequentially match units. Most commonly the units are ordered by the value of the propensity score with the highest propensity score units matched first. See Gu and Rosenbaum (1993) and Rosenbaum (1995) for discussions.

Abadie and Imbens (2006) study formal asymptotic properties of matching estimators in a different setting, where both treated and control units are (potentially) matched and matching is done with replacement. Code for the Abadie-Imbens estimator is available in Matlab and Stata (see Abadie et al. 2004).10 Formally, given a sample, $\{(Y_i, X_i, W_i)\}_{i=1}^{N}$, let $\ell_1(i)$ be the nearest neighbor to $i$, that is, $\ell_1(i)$ is equal to the nonnegative integer $j$, for $j \in \{1, \ldots, N\}$, if $W_j \neq W_i$, and

$\| X_j - X_i \| = \min_{k: W_k \neq W_i} \| X_k - X_i \|.$

More generally, let $\ell_m(i)$ be the index that satisfies $W_{\ell_m(i)} \neq W_i$ and that is the $m$-th closest to unit $i$:

$\sum_{l: W_l \neq W_i} 1\{ \| X_l - X_i \| \leq \| X_{\ell_m(i)} - X_i \| \} = m,$

10 See Sascha O. Becker and Andrea Ichino (2002) and Edwin Leuven and Barbara Sianesi (2003) for alternative Stata implementations of matching estimators.
where $1\{\cdot\}$ is the indicator function, equal to one if the expression in brackets is true and zero otherwise. In other words, $\ell_m(i)$ is the index of the unit in the opposite treatment group that is the $m$-th closest to unit $i$ in terms of the distance measure based on the norm $\| \cdot \|$. Let $J_M(i) \subset \{1, \ldots, N\}$ denote the set of indices for the first $M$ matches for unit $i$: $J_M(i) = \{\ell_1(i), \ldots, \ell_M(i)\}$. Now impute the missing potential outcomes as the average of the outcomes for the matches, by defining $\hat Y_i(0)$ and $\hat Y_i(1)$ as

$\hat Y_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \frac{1}{M} \sum_{j \in J_M(i)} Y_j & \text{if } W_i = 1, \end{cases}$

$\hat Y_i(1) = \begin{cases} \frac{1}{M} \sum_{j \in J_M(i)} Y_j & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1. \end{cases}$

The simple matching estimator discussed in Abadie and Imbens is then

(19) $\hat\tau_{match} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat Y_i(1) - \hat Y_i(0) \right).$

Abadie and Imbens show that the bias of this estimator is of order $O(N^{-1/K})$, where $K$ is the dimension of the covariates. Hence, if one studies the asymptotic distribution of the estimator by normalizing by $\sqrt{N}$ (as can be justified by the fact that the variance of the estimator is of order $O(1/N)$), the bias does not disappear if the dimension of the covariates is equal to two, and will dominate the large sample variance if $K$ is at least three. To put this result in perspective, it is useful to relate it to bias properties of estimators based on kernel regression. Kernel estimators can be viewed as matching estimators where all observations within some bandwidth $h_N$ receive some weight. As the sample size $N$ increases, the bandwidth $h_N$ shrinks, but sufficiently slowly to ensure that the number of units receiving non-zero weights diverges. If all the weights are positive, the bias for kernel estimators would generally be worse. In order to achieve root-$N$ consistency, it is therefore critical that some weights are negative through the device of higher order kernels, with the exact order required dependent on the dimension of the covariates (see, e.g., Heckman, Ichimura, and Todd 1998). In practice, however, researchers have not used higher order kernels, and so bias concerns for nearest-neighbor matching estimators are even more relevant for kernel matching methods.

There are three caveats to the Abadie-Imbens bias result. First, it is only the continuous covariates that should be counted in the dimension of the covariates. With discrete covariates the matching will be exact in large samples, and as a result such covariates do not contribute to the order of the bias. Second, if one matches only the treated, and the number of potential controls is much larger than the number of treated units, one can justify ignoring the bias by appealing to an asymptotic sequence where the number of potential controls increases faster with the sample size than the number of treated units. Specifically, if the number of controls, $N_0$, and the number of treated, $N_1$, satisfy $N_1 / N_0^{4/K} \to 0$, then the bias disappears in large samples after normalization by $\sqrt{N_1}$. Third, even though the order of the bias may be high, the actual bias may still be small if the coefficients in the leading term are small. This is possible if the biases for different units are at least partially offsetting. For example, the leading term in the bias relies on the regression function being nonlinear, and the density of the covariates having a nonzero slope. If either the regression function is well approximated by a linear function, or the density is approximately flat, the bias may be fairly limited.

Abadie and Imbens (2006) also show that matching estimators are generally not efficient. Even in the case where the bias is of low enough order to be dominated by the variance, the estimators do not reach the efficiency bound given a fixed number
of matches. To reach the bound the number of matches would need to increase with the sample size. If $M \to \infty$, with $M/N \to 0$, then the matching estimator is essentially like a nonparametric regression estimator. However, it is not clear that using an approximation based on a sequence with an increasing number of matches improves the accuracy of the approximation. Given that in an actual data set one uses a specific number of matches, $M$, it would appear appropriate to calculate the asymptotic variance conditional on that number, rather than approximate the distribution as if this number is large. Calculations in Abadie and Imbens show that the efficiency loss from even a very small number of matches is quite modest, and so the concerns about the inefficiency of matching estimators may not be very relevant in practice. Little is known about the optimal number of matches, or about data-dependent ways of choosing it.

All of the distance metrics used in practice standardize the covariates in some manner. Abadie and Imbens use a diagonal matrix with each diagonal element equal to the inverse of the corresponding covariate variance. The most common metric is the Mahalanobis metric, which is based on the inverse of the full covariance matrix. Zhao (2004), in an interesting discussion of the choice of metrics, suggests some alternatives that depend on the correlation between covariates, treatment assignment, and outcomes. So far there is little experience with any metrics beyond inverse-of-the-variances and the Mahalanobis metrics. Zhao (2004) reports the results of some simulations using his proposed metrics, finding no clear winner given his specific design.

5.6 Combining Regression and Propensity Score Weighting

In sections 5.3 and 5.4, we describe methods for estimating average causal effects based on two strategies: the first is based on estimating $\mu_w(x) = E[Y_i(w) \mid X_i = x]$ for $w = 0, 1$ and averaging the difference as in (11), and the second is based on estimating the propensity score $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$ and using that to weight the outcomes as in (18). For each approach, we have discussed estimators that achieve the asymptotic efficiency bound. If we have large sample sizes, relative to the dimension of $X_i$, we might think our nonparametric estimators of the conditional means or propensity score are sufficiently accurate to invoke the asymptotic efficiency results described above.

In other cases, however, we might choose flexible parametric models without being confident that they necessarily approximate the means or propensity score well. As we discussed earlier, one reason for viewing estimators of conditional means or propensity scores as flexible parametric models is that it greatly simplifies standard error calculations for treatment effect estimates. In such cases, one might want to adopt a strategy that combines regression and propensity score methods in order to achieve some robustness to misspecification of the parametric models. It may be helpful to think about the analogy to omitted variable bias. Suppose we are interested in the coefficient on $W_i$ in the (long) linear regression of $Y_i$ on a constant, $W_i$ and $X_i$. Suppose we omit $X_i$ from the long regression, and just run the short regression of $Y_i$ on a constant and $W_i$. The bias in the estimate from the short regression is equal to the product of the coefficient on $X_i$ in the long regression, and the coefficient on $X_i$ in a regression of $W_i$ on a constant and $X_i$. Weighting can be interpreted as removing the correlation between $W_i$ and $X_i$, and regression as removing the direct effect of $X_i$. Weighting therefore removes the bias from omitting $X_i$ from the regression. As a result, combining regression and weighting can lead to additional robustness by both removing the correlation between the omitted covariates, and by reducing the correlation between
the omitted and included variables. This is the idea behind the doubly-robust estimators developed in Robins and Rotnitzky (1995), Robins, Rotnitzky and Lue Ping Zhao (1995), and Mark J. van der Laan and Robins (2003).

Suppose we model the two regression functions as $\mu_w(x) = \alpha_w + \beta_w'(x - \bar X)$, for $w = 0, 1$ (where we abuse notation a bit and insert the sample averages of the covariates for their population means). More generally, we may use a nonlinear model for the conditional expectation, or just a more flexible linear approximation. Suppose we model the propensity score as $e(x) = p(x; \gamma)$, for example as $p(x; \gamma) = \exp(\gamma_0 + x'\gamma_1)/(1 + \exp(\gamma_0 + x'\gamma_1))$. In the first step, we estimate $\gamma$ by maximum likelihood and obtain the estimated propensity scores as $\hat e(X_i) = p(X_i; \hat\gamma)$. In the second step, we use linear regression, where we weight the objective function by the inverse probability of treatment or non-treatment. Specifically, to estimate $(\alpha_0, \beta_0)$ and $(\alpha_1, \beta_1)$, we would solve the weighted least squares problems

(20) $\min_{\alpha_0, \beta_0} \sum_{i: W_i = 0} \frac{(Y_i - \alpha_0 - \beta_0'(X_i - \bar X))^2}{1 - p(X_i; \hat\gamma)}$

and

$\min_{\alpha_1, \beta_1} \sum_{i: W_i = 1} \frac{(Y_i - \alpha_1 - \beta_1'(X_i - \bar X))^2}{p(X_i; \hat\gamma)}.$

Given the estimated conditional mean functions, we estimate $\tau_{PATE}$, using the expression for $\hat\tau_{reg} = \hat\alpha_1 - \hat\alpha_0$ as in equation (13). But what is the motivation for weighting by the inverse propensity score when we did not use such weighting in section 5.3? The motivation is the double robustness result due to Robins and Rotnitzky (1995); see also Daniel O. Scharfstein, Rotnitzky, and Robins (1999).

First, suppose that the conditional expectation is indeed linear, or $E[Y_i(w) \mid X_i = x] = \alpha_w + \beta_w'(x - \bar X)$. Then, as discussed in the treatment effect context by Wooldridge (2007), weighting the objective function by any nonnegative function of $X_i$ does not affect consistency of least squares.11 As a result, even if the logit model for the propensity score is misspecified, the binary response MLE $\hat\gamma$ still has a well-defined probability limit, say $\gamma^*$, and the IPW estimator that uses weights $1/p(X_i; \hat\gamma)$ for treated observations and $1/(1 - p(X_i; \hat\gamma))$ for control observations is asymptotically equivalent to the estimator that uses weights based on $\gamma^*$.12 It does not matter that for some $x$, $e(x) \neq p(x; \gamma^*)$. This is the first part of the double robustness result: if the parametric conditional means for $E[Y(w) \mid X = x]$ are correctly specified, the model for the propensity score can be arbitrarily misspecified for the true propensity score. Equation (20) still leads to a consistent estimator for $\tau_{PATE}$.

When the conditional means are correctly specified, weighting will generally hurt in terms of asymptotic efficiency. The optimal weight is the inverse of the variance, and in general there is no reason to expect that weighting by the inverse of (one minus) the propensity score gives a good approximation to that. Specifically, under homoskedasticity of $Y_i(w)$ so that $\sigma_w^2 = \sigma_w^2(x)$, in the context of least squares, the IPW estimator of $(\alpha_w, \beta_w)$ is less efficient than the unweighted estimator; see Wooldridge (2007). The motivation for propensity score weighting is different: it offers a robustness advantage for estimating $\tau_{PATE}$.

The second part of the double robustness result assumes that the logit model (or an alternative binary response model) is correctly specified for the propensity score, so that $e(x) = p(x; \gamma^*)$, but allows the conditional mean functions to be misspecified.

11 More generally, it does not affect the consistency of any quasi-likelihood method that is robust for estimating the parameters of the conditional mean. These are likelihoods in the linear exponential family, as described in C. Gourieroux, A. Monfort, and A. Trognon (1984a, 1984b).

12 See Wooldridge (2007).
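The two-step procedure around (20), a logit fit for the propensity score followed by inverse-probability-weighted least squares in each treatment arm, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the estimator as implemented by any of the cited authors: it assumes a single scalar covariate, fits the logit by Newton-Raphson, and uses hypothetical function names (`fit_logit`, `wls_intercept`, `doubly_robust_ate`). Treated units get weight 1/p and control units 1/(1 - p), following the weights described in the text.

```python
import math
from statistics import fmean


def fit_logit(x, w, iters=50):
    """Newton-Raphson MLE for pr(W=1 | x) = 1 / (1 + exp(-(g0 + g1*x)))."""
    g0 = g1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(g0 + g1 * xi))) for xi in x]
        # Score vector of the log likelihood.
        s0 = sum(wi - pi for wi, pi in zip(w, p))
        s1 = sum((wi - pi) * xi for wi, pi, xi in zip(w, p, x))
        # Entries of the 2x2 information matrix.
        v = [pi * (1.0 - pi) for pi in p]
        h00 = sum(v)
        h01 = sum(vi * xi for vi, xi in zip(v, x))
        h11 = sum(vi * xi * xi for vi, xi in zip(v, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det
        d1 = (h00 * s1 - h01 * s0) / det
        g0, g1 = g0 + d0, g1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return g0, g1


def wls_intercept(y, xc, lam):
    """Weighted least squares of y on a constant and xc; returns the intercept."""
    sw = sum(lam)
    xb = sum(l * xi for l, xi in zip(lam, xc)) / sw
    yb = sum(l * yi for l, yi in zip(lam, y)) / sw
    sxx = sum(l * (xi - xb) ** 2 for l, xi in zip(lam, xc))
    sxy = sum(l * (xi - xb) * (yi - yb) for l, xi, yi in zip(lam, xc, y))
    return yb - (sxy / sxx) * xb


def doubly_robust_ate(y, w, x):
    """alpha1_hat - alpha0_hat from the two weighted regressions in (20)."""
    g0, g1 = fit_logit(x, w)
    p = [1.0 / (1.0 + math.exp(-(g0 + g1 * xi))) for xi in x]
    xbar = fmean(x)  # covariates enter in deviations from the sample mean
    treated = [(yi, xi - xbar, 1.0 / pi)
               for yi, wi, xi, pi in zip(y, w, x, p) if wi == 1]
    control = [(yi, xi - xbar, 1.0 / (1.0 - pi))
               for yi, wi, xi, pi in zip(y, w, x, p) if wi == 0]
    a1 = wls_intercept(*zip(*treated))
    a0 = wls_intercept(*zip(*control))
    return a1 - a0
```

Because the covariate is centered at the full-sample mean, each intercept is the fitted mean outcome at the average covariate value, so their difference plays the role of the treatment effect estimate. The sketch does not guard against perfect separation in the logit step or against estimated propensity scores very close to zero or one.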
The result is that in that case $\hat\alpha_w \to E[Y_i(w)]$, and thus $\hat\tau = \hat\alpha_1 - \hat\alpha_0 \to E[Y_i(1)] - E[Y_i(0)] = \tau_{PATE}$ and the estimator is still consistent. Let the weight for control observations be $\lambda_i = (1 - p(X_i; \gamma^*))^{-1} \big/ \sum_{j: W_j = 0} (1 - p(X_j; \gamma^*))^{-1}$. Then the least squares estimator for $\alpha_0$ is

(21) $\hat\alpha_0 = \sum_{i=1}^{N} (1 - W_i) \cdot \lambda_i \cdot \left( Y_i - \hat\beta_0'(X_i - \bar X) \right).$

The weights imply that $E[(1 - W_i)\lambda_i Y_i] = E[Y_i(0)]$ and $E[(1 - W_i)\lambda_i (X_i - \bar X)] = E[X_i - \bar X] = 0$, and as a result $\hat\alpha_0 \to E[Y_i(0)]$. Similarly, the average of the predicted values for $Y_i(1)$ converges to $E[Y_i(1)]$, and so the resulting estimator $\hat\tau_{ipw} = \hat\alpha_1 - \hat\alpha_0$ is consistent for $\tau_{PATE}$ and $\tau_{CATE}$ irrespective of the shape of the regression functions. This is the second part of the double robustness result, at least for linear regression.

For certain kinds of responses, including binary responses, fractional responses, and count responses, linearity of $E[Y_i(w) \mid X_i = x]$ is a poor assumption. Using linear conditional expectations for limited dependent variables effectively abdicates the first part of the double robustness result. Instead, we should use coherent models of the conditional means, as well as a sensible model for the propensity score, with the hope that the mean functions, propensity score, or both are correctly specified. Beyond specifying logically coherent models for $E[Y_i(w) \mid X_i = x]$ so that the first part of double robustness has a chance, for the second part we need to choose functional forms and estimators with the following property: even when the mean functions are misspecified, $E[Y_i(w)] = E[\mu_w(X_i, \delta_w^*)]$, where $\delta_w^*$ is the probability limit of $\hat\delta_w$. Fortunately, for the common kinds of limited dependent variables used in applications, such functional forms and estimators exist; see Wooldridge (2007) for further discussion.

Once we estimate $\tau$ based on (20), how should we obtain a standard error? The normalized variance still has the form $V_0 + V_1$, where $V_w = E[(\hat\alpha_w - \mu_w)^2]$. One option is to exploit the representation of $\hat\alpha_0$ as a weighted average of $Y_i - \hat\beta_0'(X_i - \bar X)$, and use the naive variance estimator based on weighted least squares with known weights: and similar for $\hat V_1$. In general, we may again want to adjust for the estimation of the parameters in $\hat\gamma$. See Wooldridge (2007) for details.

Although combining weighting and regression is more attractive than either weighting or regression on their own, it still requires at least one of the two specifications to be accurate globally. It has been used regularly in the epidemiology literature, partly through the efforts of Robins and his coauthors, but has not been widely used in the economics literature.

5.7 Subclassification and Regression

We can also combine subclassification with regression. The advantage relative to weighting and regression is that we do not use global approximations to the regression function. The idea is that within stratum $j$, we estimate the average treatment effect by regressing the outcome on a constant, an indicator for the treatment, and the covariates, instead of simply taking the difference in averages by treatment status as in section 5.4. The latter can be viewed as a regression estimate based on a regression with only an intercept and the treatment indicator. The further regression adjustment simply adds (some of) the covariates to that regression. The key difference with using regression in the full sample is that, within a stratum, the propensity score varies relatively little. As a
result, the covariate distributions are simi regression is not used to extrapolate far out
lar, and the regression function is not used to of sample.
extrapolate far out of sample. The idea behind the regression adjustment
To be precise, we estimate on the observa is to replace Yf(0) and Yf(l) by
tions with Btj = 1, the regression function
\Yi ifWi = 0,
t,(0) =
Y^^+r^Wi + ^-re, i ? LjejM(i) OTj+A)(x? - x;.)) if w< = 1,
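As an illustration of the regression-adjusted imputation of $\hat Y_i(0)$ above, the following is a minimal Python sketch, not the authors' implementation. It assumes Euclidean distance to form the matching set $\mathcal{J}_M(i)$ and estimates $\hat\beta_0$ by least squares on all control units; both choices, and the function name `impute_control_outcome`, are assumptions made here for illustration.

```python
import numpy as np

def impute_control_outcome(Y, W, X, M=1):
    """Return Y_hat(0) for every unit (hypothetical helper).

    Controls keep their observed outcome. For each treated unit, the outcomes
    of its M nearest controls are shifted by beta0' (X_i - X_j) and averaged.
    """
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=int)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    controls = np.flatnonzero(W == 0)

    # Assumption: slope beta0 from an OLS fit of Y on (1, X) among controls.
    design = np.column_stack([np.ones(len(controls)), X[controls]])
    coef, *_ = np.linalg.lstsq(design, Y[controls], rcond=None)
    beta0 = coef[1:]

    Y0_hat = Y.copy()
    for i in np.flatnonzero(W == 1):
        # Assumption: Euclidean distance defines the M closest controls.
        dist = np.linalg.norm(X[controls] - X[i], axis=1)
        nearest = controls[np.argsort(dist)[:M]]
        adjusted = Y[nearest] + (X[i] - X[nearest]) @ beta0
        Y0_hat[i] = adjusted.mean()
    return Y0_hat

# Made-up data: control outcomes are exactly linear in X, so the adjustment
# removes the matching discrepancy X_i - X_j completely.
X_demo = np.array([0.0, 1.0, 2.0, 3.0, 0.4, 2.6])
W_demo = np.array([0, 0, 0, 0, 1, 1])
Y_demo = np.array([1.0, 3.0, 5.0, 7.0, 10.0, 20.0])
Y0_demo = impute_control_outcome(Y_demo, W_demo, X_demo, M=2)
```

Without the adjustment term, the imputed value for a treated unit would simply be the average outcome of its matches, which inherits any difference in covariates between the unit and its matches.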
TABLE 2
Balance Improvements in the Lalonde Data (Dehejia-Wahba Sample)
CPS Controls NSW Treated Normalized Difference Treated-Controls
covariate space to minimize the asymptotic variance of the efficient estimator of the average treatment effect for that set. Under some conditions (in particular homoskedasticity), they show that the optimal set $A^*$ depends only on the value of the propensity score. This method suggests discarding observations with a propensity score less than $\alpha$ away from the two extremes, zero and one:

\[ A^* = \{ x \in \mathbb{X} \mid \alpha < e(x) < 1 - \alpha \}, \]

where $\alpha$ satisfies a condition based on the marginal distribution of the propensity score:

\[ \frac{1}{\alpha (1 - \alpha)} = 2 \cdot E\left[ \frac{1}{e(X)(1 - e(X))} \,\middle|\, \frac{1}{e(X)(1 - e(X))} \le \frac{1}{\alpha (1 - \alpha)} \right]. \]

Based on empirical examples and numerical calculations with beta distributions for the propensity score, Crump et al. (2009) suggest that the rule-of-thumb fixing $\alpha$ at 0.10 gives good results.

To illustrate this method, table 3 presents summary statistics for data from Imbens, Rubin and Sacerdote (2001) on lottery players, including "winners" who won big prizes, and "losers" who did not. Even though winning the lottery is obviously random, variation in the number of tickets bought, and nonresponse, creates imbalances in the covariate distributions. In the full sample (sample size N = 496), some of the covariates differ by as much as 0.64 standard deviations. Following the Crump et al. calculations leads to a bound of 0.0914. Discarding the observations with an estimated propensity score outside the interval [0.0914, 0.9086] leads to a sample size of 388. In this subsample, the largest normalized difference is 0.35, about half of what it is in the full sample, with this improvement obtained by dropping approximately 20 percent of the original sample.

A potentially controversial feature of all these methods is that they change what is being estimated. Instead of estimating $\tau_{PATE}$, the Crump et al. (2009) approach estimates $\tau_{CATE,A}$. This results in reduced external validity, but it is likely to improve internal validity.

5.11 Assessing the Unconfoundedness Assumption

The unconfoundedness assumption used in section 5 is not testable. It states that the conditional distribution of the outcome under the control given receipt of the active treatment and covariates, is identical to the distribution of the control outcome conditional on being in the control and covariates. A similar assumption is made for the distribution of the treatment outcome. Yet since the data are completely uninformative about the distribution of $Y_i(0)$ for those who received the active treatment and of $Y_i(1)$ for those receiving the control, the data can never reject the unconfoundedness assumption. Nevertheless, there are often indirect ways of assessing this assumption. The most important of these were developed in Rosenbaum (1987) and Heckman and Hotz (1989). Both methods rely on testing the null hypothesis that an average causal effect is zero, where the particular average causal effect is known to equal zero. If the testing procedure rejects the null hypothesis, this is interpreted as weakening the support for the unconfoundedness assumption. These tests can be divided into two groups.

The first set of tests focuses on estimating the causal effect of a treatment that is known not to have an effect. It relies on the presence of two or more control groups (Rosenbaum 1987). Suppose one has two potential control groups, for example eligible nonparticipants and ineligibles, as in Heckman, Ichimura and
TABLE 3
Balance Improvements in the Lottery Data
Losers Winners Normalized Difference Treated-Controls
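The condition defining $\alpha$ in the Crump et al. (2009) trimming rule can be solved numerically from estimated propensity scores. The sketch below is one way to do this, not the authors' code: with $g = 1/(e(1-e))$, it iterates the sample analog of the condition, $\gamma = 2 \cdot \text{mean}(g \mid g \le \gamma)$, starting from no trimming, and then converts $\gamma = 1/(\alpha(1-\alpha))$ back to $\alpha$. The function name, the fixed-point iteration, and the evenly spaced example scores are assumptions for illustration.

```python
import numpy as np

def crump_trimming_threshold(e_hat):
    """Return alpha such that units with alpha < e_hat < 1 - alpha are kept.

    Hypothetical helper. Requires estimated propensity scores strictly
    between 0 and 1.
    """
    e_hat = np.asarray(e_hat, dtype=float)
    if np.any(e_hat <= 0.0) or np.any(e_hat >= 1.0):
        raise ValueError("propensity scores must lie strictly in (0, 1)")
    g = 1.0 / (e_hat * (1.0 - e_hat))

    # Iterate gamma <- 2 * mean(g | g <= gamma). The sequence is
    # non-increasing and stops once the retained set no longer changes.
    gamma = np.inf
    while True:
        gamma_new = 2.0 * g[g <= gamma].mean()
        if gamma_new == gamma:
            break
        gamma = gamma_new

    # Invert gamma = 1 / (alpha * (1 - alpha)), taking the root below 1/2.
    return 0.5 - np.sqrt(0.25 - 1.0 / gamma)

# Made-up example: evenly spaced scores between 0.01 and 0.99.
e_demo = np.linspace(0.01, 0.99, 99)
alpha_demo = crump_trimming_threshold(e_demo)
keep_demo = (e_demo > alpha_demo) & (e_demo < 1.0 - alpha_demo)
```

If no unit needs to be trimmed, the iteration stops at twice the overall mean of $g$, and the implied $\alpha$ lies below every observed score, so the full sample is retained.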
and this is not testable. Instead we focus on testing an implication of the stronger conditional independence relation

(24) $Y_i(0), Y_i(1) \perp G_i \mid X_i$.

This independence condition implies (23), but in contrast to that assumption, it also implies testable restrictions. In particular, we focus on the implication that

Next, we turn to implementation of the tests. We can simply test whether there is a difference in average values of $Y_i$ between the two control groups, after adjusting for differences in $X_i$. That is, we effectively test whether

\[ E\big[ E[Y_i \mid G_i = -1, X_i] - E[Y_i \mid G_i = 0, X_i] \big] = 0. \]

More generally we may wish to test
the groups adjusted for differences in $X_i$ are zero, or test whether the average difference is zero for all values of the covariates (e.g., Crump et al. 2008).

5.12 Testing

Most of the focus in the evaluation literature has been on estimating average treatment effects. Testing has largely been limited to the null hypothesis that the average effect is zero. In that case testing is straightforward since many estimators exist for the average treatment effect that are approximately normally distributed in large samples with zero asymptotic bias. In addition there is some testing based on the Fisher approach using the randomization distribution. In many cases, however, there are other null hypotheses of interest. Crump et al. (2008) develop tests of the null hypotheses of zero average effects conditional on the covariates, and of a constant average effect conditional on the covariates. Formally, in the first case the null hypothesis

Part of their motivation is that in many cases there is substantive interest in whether the program is beneficial for some groups, even if on average it does not affect outcomes.13 They show that in some data sets they reject the null hypothesis (30) even though they cannot reject the null hypothesis of a zero average effect.

Taking the motivation in Crump et al. (2008) one step further, one may also be interested in testing the null hypothesis that the conditional distribution of $Y_i(0)$ given $X_i = x$ is the same as the conditional distribution of $Y_i(1)$ given $X_i = x$. Under the maintained hypothesis of unconfoundedness, this is equivalent to testing the null hypothesis that

\[ H_0: Y_i \perp W_i \mid X_i, \]

against the alternative hypothesis that $Y_i$ is not independent of $W_i$ given $X_i$. Tests of this type can be implemented using the methods of Linton and Pedro Gozalo (2003). There have been no applications of these tests in the program evaluation literature.

5.13 Selection of Covariates

13 A second motivation is that it may be impossible to obtain precise estimates for $\tau_{PATE}$ even in cases where one can convincingly reject some of the hypotheses regarding $\tau(x)$.
parameters should change with the sample size. For example, using regression estimators, one would have to choose the bandwidth if using kernel estimators, or the number of terms in the series if using series estimators. The program evaluation literature does not provide much guidance as to how to choose these smoothing parameters in practice. More generally, the nonparametric estimation literature has little to offer in this regard. Most of the results in this literature offer optimal choices for smoothing parameters if the criterion is integrated squared error. In the current setting the interest is in a scalar parameter, and the choice of smoothing parameter that is optimal for the regression function itself need not be close to optimal for the average treatment effect.

Hirano and Imbens (2001) consider an estimator that combines weighting with the propensity score and regression. In their application they have a large number of covariates, and they suggest deciding which ones to include on the basis of t-statistics. They find that the results are fairly insensitive to the actual cutoff point if they use the weight/regression estimator, but find more sensitivity if they only use weighting or regression. They do not provide formal properties for these choices.

Ichimura and Linton (2005) consider inverse probability weighting estimators and analyze the formal problem of bandwidth selection with the focus on the average treatment effect. Imbens, Newey and Ridder (2005) look at series regression estimators and analyze the choice of the number of terms to be included, again with the objective being the average treatment effect. Imbens and Rubin (forthcoming) discuss some stepwise covariate selection methods for finding a specification for the propensity score.

It is clear that more work needs to be done in this area, both for the case where the choice is which covariates to include from a large set of potential covariates, and in the case where the choice concerns functional form and functions of a small set of covariates.

6. Selection on Unobservables

In this section we discuss a number of methods that relax the pair of assumptions made in section 5. Unlike in the setting under unconfoundedness, there is not a unified set of methods for this case. In a number of special cases there are well-understood methods, but there are many cases without clear recommendations. We will highlight some of the controversies and different approaches. First we discuss some methods that simply drop the unconfoundedness assumption. Next, in section 6.2, we discuss sensitivity analyses that relax the unconfoundedness assumption in a more limited manner. In section 6.3, we discuss instrumental variables methods. Then, in section 6.4 we discuss regression discontinuity designs, and in section 6.5 we discuss difference-in-differences methods.

6.1 Bounds

In a series of papers and books, Manski (1990, 1995, 2003, 2005, 2007) has developed a general framework for inference in settings where the parameters of interest are not identified. Manski's key insight is that even if in large samples one cannot infer the exact value of the parameter, one may be able to rule out some values that one could not rule out a priori. Prior to Manski's work, researchers had typically dismissed models that are not point-identified as not useful in practice. This framework is not restricted to causal settings, and the reader is referred to Manski (2007) for a general discussion of the approach. Here we limit the discussion to program evaluation settings.

We start by discussing Manski's perspective in a very simple case. Suppose we have no covariates and a binary outcome $Y_i \in \{0, 1\}$. Let the goal be inference for the
average effect in the population, $\tau_{PATE}$. We can decompose the population average treatment effect as

\[ \tau_{PATE} = E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + E[Y_i(1) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0) \]
\[ \qquad - \Big[ E[Y_i(0) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0) \Big]. \]

Of the eight components of this expression, we can estimate six. The data contain no information about the remaining two, $E[Y_i(1) \mid W_i = 0]$ and $E[Y_i(0) \mid W_i = 1]$. Because the outcome is binary, and before seeing any data, we can deduce that these two conditional expectations must lie inside the interval [0,1], but we cannot say any more without additional assumptions. This implies that without additional assumptions we can be sure that

\[ \tau_{PATE} \in [\tau_l, \tau_u], \]

where we can express the lower and upper bound in terms of estimable quantities,

\[ \tau_l = E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) - \mathrm{pr}(W_i = 1) - E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0), \]

and

\[ \tau_u = E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + \mathrm{pr}(W_i = 0) - E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0). \]

In other words, we can bound the average treatment effect. In this example the bounds are tight, meaning that without additional assumptions we cannot rule out any value inside the bounds. See Manski et al. (1992) for an empirical example of these particular bounds.

In this specific case the bounds are not particularly informative. The width of the bounds, the difference $\tau_u - \tau_l$, with $\tau_l$ and $\tau_u$ given above, is always equal to one, implying we can never rule out a zero average treatment effect. (In some sense this is obvious: if we refrain from making any assumptions regarding the treatment effects we cannot rule out that the treatment effect is zero for any unit.) In general, however, we can add some assumptions, short of making the type of assumption as strong as unconfoundedness that gets us back to the point-identified case. With such weaker assumptions we may be able to tighten the bounds and obtain informative results, without making the strong assumptions that strain credibility. The presence of covariates increases the scope for additional assumptions that may tighten the bounds. Examples of such assumptions include those in the spirit of instrumental variables, where some covariates are known not to affect the potential outcomes (e.g., Manski 2007), or monotonicity assumptions where expected outcomes are monotonically related to covariates or treatments (e.g., Manski and John V. Pepper 2000). For an application of these methods, see Hotz, Charles H. Mullin, and Seth G. Sanders (1997). We return to some of these settings in section 6.3.

This discussion has focused on identification and demonstrated what can be learned in large samples. In practice these bounds need to be estimated, which leads to additional uncertainty regarding the estimands. A fast developing literature (e.g., Horowitz and Manski 2000; Imbens and Manski 2004; Chernozhukov, Hong, and Elie Tamer 2007; Arie Beresteanu and Francesca Molinari 2006; Romano and Azeem M. Shaikh 2006a, 2006b; Ariel Pakes et al. 2006; Adam M. Rosen 2006; Donald W. K. Andrews and
Gustavo Soares 2007; Ivan A. Canay 2007; and Jörg Stoye 2007) discusses construction of confidence intervals in general settings with partial identification. One point of contention in this literature has been whether the focus should be on confidence intervals for the parameter of interest ($\tau_{PATE}$ in this case), or for the identified set. Imbens and Manski (2004) develop confidence sets for the parameter. In large samples, and at a 95 percent confidence level, the Imbens-Manski confidence intervals amount to taking the lower bound minus 1.645 times the standard error of the lower bound and the upper bound plus 1.645 times its standard error. The reason for using 1.645 rather than 1.96 is to take account of the fact that, even in the limit, the width of the confidence set will not shrink to zero, and therefore one only needs to be concerned with one-sided errors. Chernozhukov, Hong, and Tamer (2007) focus on confidence sets that include the entire partially identified set itself with fixed probability. For a given confidence level, the latter approach generally leads to larger confidence sets than the Imbens-Manski approach. See also Romano and Shaikh (2006a, 2006b) for subsampling approaches to inference in these settings.

6.2 Sensitivity Analysis

Unconfoundedness has traditionally been seen as an all or nothing assumption: either it is satisfied and one proceeds accordingly using the methods appropriate under unconfoundedness, such as matching, or the assumption is deemed implausible and one considers alternative methods. The latter include the bounds approach discussed in section 6.1, as well as approaches relying on alternative assumptions, such as instrumental variables, which will be discussed in section 6.3. However, there is an important alternative that has received much less attention in the economics literature. Instead of completely relaxing the unconfoundedness assumption, the idea is to relax it slightly. More specifically, violations of unconfoundedness are interpreted as evidence of the presence of unobserved covariates that are correlated, both with the potential outcomes and with the treatment indicator. The size of bias these violations of unconfoundedness can induce depends on the strength of these correlations. Sensitivity analyses investigate whether results obtained under the maintained assumption of unconfoundedness can be changed substantially, or even overturned entirely, by modest violations of the unconfoundedness assumption.

To be specific, consider a job training program with voluntary enrollment. Suppose that we have monthly labor market histories for a two year period prior to the program. We may be concerned that individuals choosing to enroll in the program are more motivated to find a job than those that choose not to enroll in the program. This unobserved motivation may be related to subsequent earnings both in the presence and in the absence of training. Conditioning on the recent labor market histories of individuals may limit the bias associated with this unobserved motivation, but it need not eliminate it entirely. However, we may be willing to limit how highly correlated unobserved motivation is with the enrollment decision and the earnings outcomes in the two regimes, conditional on the labor market histories. For example, if we compare two individuals with the same labor market history for the last two years, e.g., not employed the last six months and working the eighteen months before, and both with one two-year old child, it may be reasonable to assume that these cannot differ radically in their unobserved motivation given that their recent labor market outcomes have been so similar. The sensitivity analyses developed by Rosenbaum and Rubin (1983a) formalize this idea and provide a
tool for making such assessments. Imbens (2003) applies this sensitivity analysis to data from labor market training programs.

The second approach is associated with work by Rosenbaum (1995). Similar to the Rosenbaum-Rubin approach, Rosenbaum's method relies on an unobserved covariate that generates the deviations from unconfoundedness. The analysis differs in that sensitivity is measured using only the relation between the unobserved covariate and the treatment assignment, with the focus on the correlation required to overturn, or change substantially, p-values of statistical tests of no effect of the treatment.

6.2.1 The Rosenbaum-Rubin Approach to Sensitivity Analysis

The starting point is that unconfoundedness is satisfied only conditional on the observed covariates $X_i$ and an unobserved scalar covariate $U_i$:

\[ Y_i(0), Y_i(1) \perp W_i \mid X_i, U_i. \]

This set up in itself is not restrictive, although once parametric assumptions are made the assumption of a scalar unobserved covariate $U_i$ is restrictive.

Now consider both the conditional distribution of the potential outcomes given observed and unobserved covariates and the conditional probability of assignment given observed and unobserved covariates. Rather than attempting to estimate both these conditional distributions, the idea behind the sensitivity analysis is to specify the form and the amount of dependence of these conditional distributions on the unobserved covariate, and estimate only the dependence on the observed covariate. Conditional on the specification of the first part estimation of the latter is typically straightforward. The idea is then to vary the amount of dependence of the conditional distributions on the unobserved covariate and assess how much this changes the point estimate of the average treatment effect.

Typically the sensitivity analysis is done in fully parametric settings, although since the models can be arbitrarily flexible, this is not particularly restrictive. Following Rosenbaum and Rubin (1983b), we illustrate this approach in a setting with binary outcomes. See Imbens (2003) and Lee (2005b) for examples in economics. Rosenbaum and Rubin (1983a) fix the marginal distribution of the unobserved covariate to be binomial with $p = \mathrm{pr}(U_i = 1)$, and assume independence of $U_i$ and $X_i$. They specify a logistic distribution for the treatment assignment:

\[ \mathrm{pr}(W_i = 1 \mid X_i = x, U_i = u) = \frac{\exp(\alpha_0 + \alpha_1' x + \alpha_2 \cdot u)}{1 + \exp(\alpha_0 + \alpha_1' x + \alpha_2 \cdot u)}. \]

They also specify logistic regression functions for the two potential outcomes:

\[ \mathrm{pr}(Y_i(w) = 1 \mid X_i = x, U_i = u) = \frac{\exp(\beta_{w0} + \beta_{w1}' x + \beta_{w2} \cdot u)}{1 + \exp(\beta_{w0} + \beta_{w1}' x + \beta_{w2} \cdot u)}. \]

For the subpopulation with $X_i = x$ and $U_i = u$, the average treatment effect is

\[ E[Y_i(1) - Y_i(0) \mid X_i = x, U_i = u] = \frac{\exp(\beta_{10} + \beta_{11}' x + \beta_{12} \cdot u)}{1 + \exp(\beta_{10} + \beta_{11}' x + \beta_{12} \cdot u)} - \frac{\exp(\beta_{00} + \beta_{01}' x + \beta_{02} \cdot u)}{1 + \exp(\beta_{00} + \beta_{01}' x + \beta_{02} \cdot u)}. \]

The average treatment effect $\tau_{CATE}$ can be expressed in terms of the parameters of this model and the distribution of the observable covariates by averaging over $X_i$, and integrating out the unobserved covariate $U$:
\[ \tau_{CATE} = \tau(p, \alpha_2, \beta_{02}, \beta_{12}, \alpha_0, \alpha_1, \beta_{00}, \beta_{01}, \beta_{10}, \beta_{11}) \]
\[ = \frac{1}{N} \sum_{i=1}^{N} \Bigg[ p \cdot \left( \frac{\exp(\beta_{10} + \beta_{11}' X_i + \beta_{12})}{1 + \exp(\beta_{10} + \beta_{11}' X_i + \beta_{12})} - \frac{\exp(\beta_{00} + \beta_{01}' X_i + \beta_{02})}{1 + \exp(\beta_{00} + \beta_{01}' X_i + \beta_{02})} \right) \]
\[ \qquad + (1 - p) \cdot \left( \frac{\exp(\beta_{10} + \beta_{11}' X_i)}{1 + \exp(\beta_{10} + \beta_{11}' X_i)} - \frac{\exp(\beta_{00} + \beta_{01}' X_i)}{1 + \exp(\beta_{00} + \beta_{01}' X_i)} \right) \Bigg]. \]

We do not know the values of the parameters $(p, \alpha, \beta)$, but the data are somewhat informative about them. One conventional approach would be to attempt to estimate all parameters, and then use those estimates to obtain an estimate for the average treatment effect. Given the specific parametric model this may be possible, although in general this would be difficult given the inclusion of unobserved covariates in the basic model. A second approach, as discussed in section 6.1, is to derive bounds on $\tau$ given the model and the data. A sensitivity analysis offers a third approach.

The Rosenbaum-Rubin sensitivity analysis proceeds by dividing the parameters into two sets. The first set includes the parameters that would be set to boundary values under unconfoundedness, $(\alpha_2, \beta_{02}, \beta_{12})$, plus the parameter $p$ capturing the marginal distribution of the unobserved covariate $U_i$. Together we refer to these as the sensitivity parameters, $\theta_{sens} = (p, \alpha_2, \beta_{02}, \beta_{12})$. The second set consists of the remaining parameters, $\theta_{other} = (\alpha_0, \alpha_1, \beta_{00}, \beta_{01}, \beta_{10}, \beta_{11})$. The idea is that $\theta_{sens}$ is difficult to estimate. Estimates of the other parameters under unconfoundedness could be obtained by fixing $\alpha_2 = \beta_{02} = \beta_{12} = 0$ and $p$ at an arbitrary value. The data are not directly informative about the effect of an unobserved covariate in the absence of functional form assumptions, and so attempts to estimate $\theta_{sens}$ are therefore unlikely to be effective. Given $\theta_{sens}$, however, estimating the remaining parameters is considerably easier. In the second step the plan is therefore to fix the first set of parameters and estimate the others by maximum likelihood, and then translate this into an estimate for $\tau$. Thus, for fixed $\theta_{sens}$, we first estimate the remaining parameters through maximum likelihood:

\[ \hat\theta_{other}(\theta_{sens}) = \arg\max_{\theta_{other}} L(\theta_{other} \mid \theta_{sens}), \]

\[ \hat\tau(\theta_{sens}) = \tau(\theta_{sens}, \hat\theta_{other}(\theta_{sens})). \]

Finally, in the third step, we consider the range of values of the function $\hat\tau(\theta_{sens})$ for a reasonable set of values for the sensitivity parameters $\theta_{sens}$, and obtain a set of values for $\tau_{CATE}$.

The key question is how to choose the set of reasonable values for the sensitivity parameters. If we do not wish to restrict this set at all, we end up with unrestricted bounds along the lines of section 6.1. The power from the sensitivity approach comes from the researcher's willingness to put real limits on the values of the sensitivity parameters $(p, \alpha_2, \beta_{02}, \beta_{12})$. Among these parameters it is difficult to put real limits on $p$, and typically it is fixed at 1/2, with little sensitivity to its choice. The more interesting parameters are $(\alpha_2, \beta_{02}, \beta_{12})$. Let us assume that the effect of the unobserved covariate is the same in both treatment arms, $\beta_2 = \beta_{02} = \beta_{12}$, so that there are only two parameters left to fix, $\alpha_2$ and $\beta_2$. Imbens (2003) suggests linking the parameters to the effects of the observed covariates on assignment and potential outcomes. Specifically he suggests to calculate the partial correlations between observed covariates and the treatment and potential outcomes, and then as a benchmark look at the sensitivity to an unobserved
covariate that has partial correlations with treatment and potential outcomes as high as any of the observed covariates. For example, Imbens considers, in the labor market training example, what the effect would be of omitting unobserved motivation, if in fact motivation had as much explanatory power for future earnings and for treatment choice as did earnings in the year prior to the training program. A bounds analysis, in contrast, would implicitly allow unobserved motivation to completely determine both selection into the program and future earnings. Even though putting hard limits on the effect of motivation on earnings and treatment choice may be difficult, it may be reasonable to put some limits on it, and the Rosenbaum-Rubin sensitivity analysis provides a useful framework for doing so.

6.2.2 Rosenbaum's Method for Sensitivity Analysis

Rosenbaum (1995) developed a slightly different approach. The advantage of his approach is that it requires fewer tuning parameters than the Rosenbaum-Rubin approach. Specifically, it only requires the researcher to consider the effect unobserved confounders may have on the probability of treatment assignment. Rosenbaum's focus is on the effect the presence of unobserved covariates could have on the p-value for the test of no effect of the treatment based on the unconfoundedness assumption, in contrast to the Rosenbaum-Rubin focus on point estimates for average treatment effects. Consider two units $i$ and $j$ with the same value for the covariates, $x_i = x_j$. If the unconfoundedness assumption conditional on $X_i$ holds, both units must have the same probability of assignment to the treatment, $e(x_i) = e(x_j)$. Now suppose unconfoundedness only holds conditional on both $X_i$ and a binary unobserved covariate $U_i$. In that case the assignment probabilities for these two units may differ. Rosenbaum suggests bounding the ratio of the odds ratios $e(x_i)/(1 - e(x_i))$ and $e(x_j)/(1 - e(x_j))$:

\[ \frac{1}{\Gamma} \le \frac{e(x_i) \, (1 - e(x_j))}{(1 - e(x_i)) \, e(x_j)} \le \Gamma. \]

If $\Gamma = 1$, we are back in the setting of unconfoundedness. If we allow $\Gamma = \infty$, we are not restricting the association between the treatment indicator and the potential outcomes. Rosenbaum investigates how much the odds would have to be different in order to substantially change the p-value. Or, starting from the other side, he investigates for fixed values of $\Gamma$ what the implications are on the p-value.

For example, suppose that a test of the null hypothesis of no effect has a p-value of 0.0001 under the assumption of unconfoundedness. If the data suggest it would take the presence of an unobserved covariate that changes the odds of participation by a factor of ten in order to increase that p-value to 0.05, then one would likely consider the result to be very robust. If instead a small change in the odds of participation, say with $\Gamma = 1.5$, would be sufficient for a change of the p-value to 0.05, the study would be less robust.

6.3 Instrumental Variables

In this section, we review the recent literature on instrumental variables. We focus on the part of the literature concerned with heterogenous effects. In the current section, we limit the discussion to the case with a binary endogenous variable. The early literature focused on identification of the population average treatment effect and the average effect on the treated. Identification of these estimands ran into serious problems once researchers wished to allow for unrestricted heterogeneity in the effect of the treatment. In an important early result, Bloom (1984) showed that if eligibility for the program is used as an instrument, one
can identify the average effect of the treatment for those who received the treatment. Key for the Bloom result is that the instrument changes the probability of receiving the treatment to zero. In order to identify the average effect on the overall population, the instrument would also need to shift the probability of receiving the treatment to one. This type of identification is sometimes referred to as identification at infinity (Gary Chamberlain 1986; Heckman 1990) in settings with a continuous instrument. The practical usefulness of such identification results is fairly limited outside of cases where eligibility is randomized. Finding a credible instrument is typically difficult enough, without also requiring that the instrument shifts the probability of the treatment close to zero and one. In fact, the focus of the current literature on instruments that can credibly be expected to satisfy exclusion restrictions makes it even more difficult to find instruments that even approximately satisfy these support conditions. Imbens and Angrist (1994) got around this problem by changing the focus to average effects for the subpopulation that is affected by the instrument.

Initially we focus on the case with a binary instrument. This case provides some of the clearest insight into the identification problems. In that case the identification at infinity arguments are obviously not satisfied and so one cannot (point-)identify the population average treatment effect.

6.3.1 A Binary Instrument

Imbens and Angrist adopt a potential outcome notation for the receipt of the treatment, as well as for the outcome itself. Let $Z_i$ denote the value of the instrument for individual $i$. Let $W_i(0)$ and $W_i(1)$ denote the level of the treatment received if the instrument takes on the values 0 and 1 respectively. As before, let $Y_i(0)$ and $Y_i(1)$ denote the potential values for the outcome of interest. The observed treatment is, analogously to the relation between the observed outcome $Y_i$ and the potential outcomes $Y_i(0)$ and $Y_i(1)$, is

\[ W_i = W_i(0) \cdot (1 - Z_i) + W_i(1) \cdot Z_i = \begin{cases} W_i(0) & \text{if } Z_i = 0, \\ W_i(1) & \text{if } Z_i = 1. \end{cases} \]

Exogeneity of the instrument is captured by the assumption that all potential outcomes are independent of the instrument, or

\[ (Y_i(0), Y_i(1), W_i(0), W_i(1)) \perp Z_i. \]

Formulating exogeneity in this way is attractive compared to conventional residual-based definitions, as it does not require the researcher to specify a regression function in order to define the residuals. This assumption captures two properties of the instrument. First, it captures random assignment of the instrument so that causal effects of the instrument on the outcome and treatment received can be estimated consistently. This part of the assumption, which is implied by explicit randomization of the instrument, as for example in the seminal draft lottery study by Angrist (1990), is not sufficient for causal interpretations of instrumental variables methods. The second part of the assumption captures an exclusion restriction that there is no direct effect of the instrument on the outcome. This second part is captured by the absence of $z$ in the definition of the potential outcome $Y_i(w)$. This part of the assumption is not implied by randomization of the instrument and it has to be argued on a case by case basis. See Angrist, Imbens, and Rubin (1996) for more discussion on the distinction between these two assumptions, and for a formulation that separates them.

Imbens and Angrist introduce a new concept, the compliance type of an individual. The type of an individual describes the level of the treatment that an individual would receive given each value of the instrument.
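The relation between the observed treatment and the two potential treatments can be sketched in a few lines of Python. This is an illustration only: the arrays are made-up values, and the helper name `observed_treatment` is hypothetical.

```python
import numpy as np

def observed_treatment(W0, W1, Z):
    """Observed treatment W_i = W_i(0) * (1 - Z_i) + W_i(1) * Z_i."""
    W0, W1, Z = (np.asarray(a, dtype=int) for a in (W0, W1, Z))
    return W0 * (1 - Z) + W1 * Z

# Made-up potential treatments for four individuals, one row per individual:
# (W_i(0), W_i(1)) = (0, 0), (0, 1), (1, 0), (1, 1).
W0_demo = np.array([0, 0, 1, 1])
W1_demo = np.array([0, 1, 0, 1])
Z_demo = np.array([1, 1, 0, 0])   # instrument value actually received
W_demo = observed_treatment(W0_demo, W1_demo, Z_demo)
```

Only one of the two potential treatments is ever revealed for each individual, which is why the pair $(W_i(0), W_i(1))$ cannot be read off the observed data.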
In other words, it is captured by the pair of values $(W_i(0), W_i(1))$. With both the treatment and instrument binary, there are four types of responses for the potential treatment. It is useful to define the compliance types explicitly:

$$
T_i = \begin{cases}
\text{never-taker} & \text{if } W_i(0) = W_i(1) = 0, \\
\text{complier} & \text{if } W_i(0) = 0,\ W_i(1) = 1, \\
\text{defier} & \text{if } W_i(0) = 1,\ W_i(1) = 0, \\
\text{always-taker} & \text{if } W_i(0) = W_i(1) = 1.
\end{cases}
$$

The labels never-taker, complier, defier, and always-taker (e.g., Angrist, Imbens, and Rubin 1996) refer to the setting of a randomized experiment with noncompliance, where the instrument is the (random) assignment to the treatment and the endogenous regressor is an indicator for the actual receipt of the treatment. Compliers are in that case individuals who (always) comply with their assignment, that is, take the treatment if assigned to it and do not take it if assigned to the control group. One cannot infer from the observed data $(Z_i, W_i, Y_i)$ whether a particular individual is a complier or not. It is important not to confuse compliers (who comply with their actual assignment and would have complied with the alternative assignment) with individuals who are observed to comply with their actual assignment: that is, individuals who complied with the assignment they actually received, $Z_i = W_i$. For such individuals we do not know what they would have done had their assignment been different, that is, we do not know the value of $W_i(1 - Z_i)$.

Imbens and Angrist then invoke an additional assumption they refer to as monotonicity. Monotonicity requires that $W_i(1) \geq W_i(0)$ for all individuals, or that increasing the level of the instrument does not decrease the level of the treatment. This assumption is equivalent to ruling out the presence of defiers, and it is therefore sometimes referred to as the "no-defiance" assumption (Alexander Balke and Pearl 1994; Pearl 2000). Note that in the Bloom set up with one-sided noncompliance both always-takers and defiers are absent by assumption.

Under these two assumptions, independence of all four potential outcomes $(Y_i(0), Y_i(1), W_i(0), W_i(1))$ and the instrument $Z_i$, and monotonicity, Imbens and Angrist show that one can identify the average effect of the treatment for the subpopulation of compliers. Before going through their argument, it is useful to see why we cannot generally identify the average effect of the treatment for other subpopulations. Clearly, one cannot identify the average effect of the treatment for never-takers because they are never observed receiving the treatment, and so $E[Y_i(1) \mid T_i = n]$ is not identified. Likewise, always-takers are never observed without the treatment, so $E[Y_i(0) \mid T_i = a]$ is not identified. Thus, only compliers are observed in both treatment groups, so only for this group is there any chance of identifying the average treatment effect. In order to understand the positive component of the Imbens-Angrist result, that we can identify the average effect for compliers, it is useful to consider the subpopulations defined by instrument and treatment. Table 4 shows the information we have about the individual's type given the monotonicity assumption. Consider individuals with $(Z_i = 1, W_i = 0)$. Because of monotonicity such individuals can only be never-takers. Similarly, individuals with $(Z_i = 0, W_i = 1)$ can only be always-takers. However, consider individuals with $(Z_i = 0, W_i = 0)$. Such individuals can be either compliers or never-takers. We cannot infer the type of such individuals from the observed data alone. Similarly, individuals with $(Z_i = 1, W_i = 1)$ can be either compliers or always-takers.

The intuition for the identification result is as follows. The first step is to see that we can infer the population proportions of the three remaining subpopulations, never-takers, always-takers and compliers (using the fact that the monotonicity assumption rules out the presence of defiers). Call these
TABLE 4
Type by Observed Variables

                 Z_i = 0                   Z_i = 1
W_i = 0          Never-taker/Complier      Never-taker
W_i = 1          Always-taker              Always-taker/Complier
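The logic of Table 4 can be illustrated with a small simulation. The sketch below is not taken from the paper: the type shares, effect sizes, and function names (`simulate_iv`, `type_shares`, `wald_late`) are assumptions chosen for illustration. It reads the never-taker and always-taker shares off the two unambiguous cells of the table, and computes the standard instrumental variables (Wald) ratio for a binary instrument, which under independence and monotonicity should be close to the average effect for compliers.

```python
import numpy as np

def simulate_iv(n=200_000, seed=0):
    """Simulate a randomized binary instrument with no defiers (hypothetical design)."""
    rng = np.random.default_rng(seed)
    # Assumed type shares: 30% never-takers, 50% compliers, 20% always-takers.
    types = rng.choice(np.array(["n", "c", "a"]), size=n, p=[0.3, 0.5, 0.2])
    z = rng.integers(0, 2, size=n)        # instrument Z_i
    w0 = (types == "a").astype(int)       # W_i(0): only always-takers are treated
    w1 = (types != "n").astype(int)       # W_i(1) >= W_i(0), so monotonicity holds
    w = np.where(z == 1, w1, w0)          # observed treatment W_i = W_i(Z_i)
    # Assumed heterogeneous effects: 2.0 for compliers, 5.0 for always-takers,
    # -1.0 for never-takers. Only the complier effect is identified.
    effect = np.select([types == "c", types == "a"], [2.0, 5.0], default=-1.0)
    y0 = rng.normal(size=n) + 1.0 * (types == "a")
    y = y0 + w * effect                   # observed outcome
    return z, w, y

def type_shares(z, w):
    """Population shares implied by Table 4 under monotonicity."""
    pi_a = w[z == 0].mean()               # (Z=0, W=1) can only be always-takers
    pi_n = 1.0 - w[z == 1].mean()         # (Z=1, W=0) can only be never-takers
    pi_c = 1.0 - pi_a - pi_n              # compliers are the remainder
    return pi_n, pi_c, pi_a

def wald_late(z, w, y):
    """Ratio of the difference in mean outcomes to the difference in treatment rates."""
    numerator = y[z == 1].mean() - y[z == 0].mean()
    denominator = w[z == 1].mean() - w[z == 0].mean()
    return numerator / denominator

z, w, y = simulate_iv()
pi_n, pi_c, pi_a = type_shares(z, w)
print(f"shares (n, c, a): {pi_n:.3f}, {pi_c:.3f}, {pi_a:.3f}")
print(f"Wald estimate: {wald_late(z, w, y):.3f}")
```

In this design the ratio recovers the complier effect rather than the population average effect, because the never-taker and always-taker effects never enter the difference in mean outcomes by instrument value.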
The only quantities not consistently estimable are the average effects for never-takers and always-takers. Even for those we have some information. For example, we can write $E[Y_i(1) - Y_i(0) \mid T_i = n] = E[Y_i(1) \mid T_i = n] - E[Y_i(0) \mid T_i = n]$. The second term we can estimate, and the data are completely uninformative about the first term. Hence, if there are natural bounds on $Y_i(1)$ (for example, if the outcome is binary), we can use that to bound $E[Y_i(1) \mid T_i = n]$, and then in turn use that to bound $\tau_{\mathrm{PATE}}$. These bounds are tight. See Manski (1990), Toru Kitagawa (2008), and Balke and Pearl (1994).

6.3.2 Multivalued Instruments and Weighted Local Average Treatment Effects

The previous discussion was in terms of a single binary instrument. In that case there is no other average effect of the treatment that can be estimated consistently other than the local average treatment effect, $\tau_{\mathrm{LATE}}$. With a multivalued instrument, or with multiple binary instruments (still maintaining the setting of a binary treatment; for extensions of the local average treatment effect concept to the multiple treatment case, see Angrist and Imbens (1995) and Card (2001)), we can estimate a variety of local average treatment effects. Let $\mathbb{Z} = \{z_1, \ldots, z_K\}$ denote the set of values for the instruments. Initially we take the set of values to be finite. Then for each pair $(z_k, z_l)$ with $\mathrm{pr}(W_i = 1 \mid Z_i = z_k) > \mathrm{pr}(W_i = 1 \mid Z_i = z_l)$ one can define a local average treatment effect:

Imbens and Angrist show that the standard instrumental variables estimand, using $g(Z_i)$ as an instrument for $W_i$, is equal to a particular weighted average:

$$
\frac{E[Y_i \cdot (g(Z_i) - E[g(Z_i)])]}{E[W_i \cdot (g(Z_i) - E[g(Z_i)])]} = \tau_{\mathrm{LATE}},
$$

for a particular set of nonnegative weights as long as $E[W_i \mid g(Z_i) = g]$ increases in $g$.

Heckman and Vytlacil (2006) and Heckman, Sergio Urzua, and Vytlacil (2006) study the case with a continuous instrument. They use an additive latent single index setup where the treatment received is equal to

$$
W_i = 1\{h(Z_i) + V_i \geq 0\},
$$

where $h(\cdot)$ is strictly monotonic, and the latent type $V_i$ is independent of $Z_i$. In general, in the presence of multiple instruments, this latent single index framework imposes substantive restrictions.14 Without loss of generality we can take the marginal distribution of $V_i$ to be uniform. Given this framework, Heckman, Urzua, and Vytlacil (2006) define the marginal treatment effect as a function of the latent type $v$ of an individual,

$$
\tau_{\mathrm{MTE}}(v) = E[Y_i(1) - Y_i(0) \mid V_i = v].
$$

In the single continuous instrument case, $\tau_{\mathrm{MTE}}(v)$ is, under some differentiability and invertibility conditions, equal to a limit of local average treatment effects:
$$
\tau_{\mathrm{MTE}}(v) = \lim_{z \downarrow h^{-1}(v)} \tau_{\mathrm{LATE}}(h^{-1}(v), z).
$$

A parametric version of this concept goes back to work by Anders Björklund and Robert Moffitt (1987). All average treatment effects, including the overall average effect, the average effect for the treated, and any local average treatment effect can now be expressed in terms of integrals of this marginal treatment effect, as shown in Heckman and Vytlacil (2005). For example, $\tau_{\mathrm{PATE}} = \int_0^1 \tau_{\mathrm{MTE}}(v)\,dv$. A complication in practice is that not necessarily all the marginal treatment effects can be estimated. For example, if the instrument is binary, $Z_i \in \{0, 1\}$, then for individuals with $V_i < \min(-h(0), -h(1))$, it follows that $W_i = 0$, and for these never-takers we cannot estimate $\tau_{\mathrm{MTE}}(v)$. Any average effect that requires averaging over such values of $v$ is therefore also not point-identified. Moreover, average effects that can be expressed as integrals of $\tau_{\mathrm{MTE}}(v)$ may be identified even if some of the $\tau_{\mathrm{MTE}}(v)$ that are being integrated over are not identified. Again, in a binary instrument example with $\mathrm{pr}(W_i = 1 \mid Z_i = 1) = 1$, and $\mathrm{pr}(W_i = 1 \mid Z_i = 0) = 0$, the average treatment effect $\tau_{\mathrm{PATE}}$ is identified, but $\tau_{\mathrm{MTE}}(v)$ is not identified for any value of $v$.

6.4 Regression Discontinuity Designs

Regression discontinuity (RD) methods have been around for a long time in the psychology and applied statistics literature, going back to the early 1960s. For discussions and references from this literature, see Donald L. Thistlethwaite and Campbell (1960), William M. K. Trochim (2001), Shadish, Cook, and Campbell (2002), and Cook (2008). Except for some important foundational work by Goldberger (1972a, 1972b), it is only recently that these methods have attracted much attention in the economics literature. For some of the recent applications, see Van Der Klaauw (2002, 2008a), Lee (2008), Angrist and Victor Lavy (1999), DiNardo and Lee (2004), Kenneth Y. Chay and Michael Greenstone (2005), Card, Alexandre Mas, and Jesse Rothstein (2007), Lee, Enrico Moretti, and Matthew J. Butler (2004), Jens Ludwig and Douglas L. Miller (2007), Patrick J. McEwan and Joseph S. Shapiro (2008), Sandra E. Black (1999), Susan Chen and van der Klaauw (2008), Ginger Zhe Jin and Phillip Leslie (2003), Thomas Lemieux and Kevin Milligan (2008), Per Pettersson-Lidbom (2007, 2008), and Pettersson-Lidbom and Björn Tyrefors (2007). Key theoretical and conceptual contributions include the interpretation of estimates for fuzzy regression discontinuity designs allowing for general heterogeneity of treatment effects (Hahn, Todd, and van der Klaauw 2001), adaptive estimation methods (Yixiao Sun 2005), methods for bandwidth selection tailored to the RD setting (Ludwig and Miller 2005; Imbens and Karthik Kalyanaraman 2008) and various tests for discontinuities in means and distributions of nonaffected variables (Lee 2008; McCrary 2008) and for misspecification (Lee and Card 2008). For recent reviews in the economics literature, see van der Klaauw (2008b), Imbens and Lemieux (2008), and Lee and Lemieux (2008).

The basic idea behind the RD design is that assignment to the treatment is determined, either completely or partly, by the value of a predictor (the forcing variable $X_i$) being on either side of a common threshold. This generates a discontinuity, sometimes of size one, in the conditional probability of receiving the treatment as a function of this particular predictor. The forcing variable is often itself associated with the potential outcomes, but this association is assumed to be smooth. As a result any discontinuity of the conditional distribution of the outcome as a function of this covariate at the threshold is interpreted as evidence of a causal effect of the treatment. The design often arises from administrative decisions, where the incentives for individuals to participate in a program are rationed
for reasons of resource constraints, and clear transparent rules, rather than discretion, by administrators are used for the allocation of these incentives.

It is useful to distinguish between two general settings, the sharp and the fuzzy regression discontinuity designs (e.g., Trochim 1984, 2001; Hahn, Todd, and van der Klaauw 2001; Imbens and Lemieux 2008; van der Klaauw 2008b; Lee and Lemieux 2008).

6.4.1 The Sharp Regression Discontinuity Design

In the sharp regression discontinuity (SRD) design, the assignment $W_i$ is a deterministic function of one of the covariates, the forcing (or treatment-determining) variable $X_i$:

$$
W_i = 1[X_i \geq c],
$$

where $1[\cdot]$ is the indicator function, equal to one if the event in brackets is true and zero otherwise. All units with a covariate value of at least $c$ are in the treatment group (and participation is mandatory for these individuals), and all units with a covariate value less than $c$ are in the control group (members of this group are not eligible for the treatment). In the SRD design, we focus on estimation of

averaging we make a smoothness assumption that the two conditional expectations $E[Y_i(w) \mid X_i = x]$, for $w = 0, 1$, are continuous in $x$. Under this assumption, $E[Y_i(0) \mid X_i = c] = \lim_{x \uparrow c} E[Y_i(0) \mid X_i = x] = \lim_{x \uparrow c} E[Y_i \mid X_i = x]$, implying that

$$
\tau_{\mathrm{SRD}} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x],
$$

where this expression [...] a deterministic function [...] of the SRD). The s[...] one of estimating [...] nonparametrically at [...] discuss the statistical [...] section 6.4.4.

6.4.2 The Fuzzy Regression Discontinuity Design

In the fuzzy regression discontinuity (FRD) design, the probability of receiving the treatment need not change from zero to one at the threshold. Instead the design only requires a discontinuity in the probability of assignment to the treatment at the threshold:

$$
\lim_{x \downarrow c} \mathrm{pr}(W_i = 1 \mid X_i = x) \neq \lim_{x \uparrow c} \mathrm{pr}(W_i = 1 \mid X_i = x),
$$
to use local smoothing methods such as kernel regression rather than global smoothing methods such as sieves or series regression because the latter will generally be sensitive to behavior of the regression function away from the threshold. Local smoothing methods are generally well understood (e.g., Charles J. Stone 1977; Herman J. Bierens 1987; Härdle 1990; Adrian Pagan and Aman Ullah 1999). For a particular choice of the kernel, $K(\cdot)$, e.g., a rectangular kernel $K(z) = 1[-h < z < h]$, or a Gaussian kernel $K(z) = \exp(-z^2/2)/\sqrt{2\pi}$, the regression function at $x$, $m(x) = E[Y_i \mid X_i = x]$ is estimated as

$$
\hat{m}(x) = \sum_{i=1}^{N} Y_i \cdot \lambda_i, \quad \text{with weights } \lambda_i = \frac{K(X_i - x)}{\sum_{j=1}^{N} K(X_j - x)}.
$$

An important difference with the primary focus in the nonparametric regression literature is that in the RD setting we are interested in the value of the regression functions at boundary points. Standard kernel regression methods do not work well in such cases. More attractive methods for this case are local linear regression (Fan and Gijbels 1996; Porter 2003; Burkhardt Seifert and Theo Gasser 1996, 2000; Ichimura and Todd 2007), where locally a linear regression function, rather than a constant regression function, is fitted. This leads to an estimator for the regression function at $x$ equal to

$$
\hat{m}(x) = \hat{\alpha}, \quad \text{where } (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \lambda_i \cdot (Y_i - \alpha - \beta \cdot (X_i - x))^2,
$$

with the same weights $\lambda_i$ as before. In that case the main remaining choice concerns the bandwidth, denoted by $h$. Suppose one uses a rectangular kernel, $K(z) = 1[-h < z < h]$ (and typically the results are relatively robust with respect to the choice of the kernel). The choice of bandwidth then amounts to dropping all observations such that $X_i \notin [c - h, c + h]$. The question becomes how to choose the bandwidth $h$.

Most standard methods for choosing bandwidths in nonparametric regression, including both cross-validation and plug-in methods, are based on criteria that integrate the squared error over the entire distribution of the covariates: $\int_z (\hat{m}(z) - m(z))^2 f_X(z)\,dz$. For our purposes this criterion does not reflect the object of interest. We are specifically interested in the regression function at a single point; moreover, this point is a boundary point. Thus we would like to choose $h$ to minimize $E[(\hat{m}(c) - m(c))^2]$ (using the data with $X_i < c$ only, or using the data with $X_i > c$ only). If the density of the forcing variable is high at the threshold, a bandwidth selection procedure based on global criteria may lead to a bandwidth that is much larger than is appropriate.

There are few attempts to formalize and standardize the choice of a bandwidth for such cases. Ludwig and Miller (2005) and Imbens and Lemieux (2008) discuss cross-validation methods that target more directly the object of interest in RD designs. Assuming the density of $X_i$ is continuous at the threshold, and that the conditional variance of $Y_i$ given $X_i$ is continuous and equal to $\sigma^2$ at the threshold, Imbens and Kalyanaraman (2009) show that the optimal bandwidth depends on the second derivatives of the regression function at the threshold and has the form

[display equation for the optimal bandwidth $h_{\mathrm{opt}}$, garbled in the source]

where $p$ is the fraction of observations with $X_i \geq c$, and $C_K$ is a constant that depends on the kernel. For a rectangular kernel $K(z) = 1[-h < z < h]$, the constant equals $C_K = 2.70$.
Imbens and Kalyanaraman propose and implement a plug in method for the bandwidth.16 If one uses a rectangular kernel, and given a choice for the bandwidth, estimation for the SRD and FRD designs can be based on ordinary least squares and two stage least squares, respectively. If the bandwidth goes to zero sufficiently fast, so that the asymptotic bias can be ignored, one can also base inference on these methods. (See HTV and Imbens and Lemieux 2008.)

6.4.5 Specification Checks

There are two important concerns in the application of RD designs, be they sharp or fuzzy. These concerns can sometimes be assuaged by investigating various implications of the identification argument underlying the regression discontinuity design.

A first concern about RD designs is the possibility of other changes at the same threshold value of the covariate. For example, the same age limit may affect eligibility for multiple programs. If all the programs whose eligibility changes at the same cutoff value affect the outcome of interest, an RD analysis may mistakenly attribute the combined effect to the treatment of interest. The second concern is that of manipulation by the individuals of the covariate value that underlies the assignment mechanism. The latter is less of a concern when the forcing variable is a fixed, immutable characteristic of an individual such as age. It is a particular concern when eligibility criteria are known to potential participants and are based on variables that are affected by individual choices. For example, if eligibility for financial aid depends on test scores that are graded by teachers who know the cutoff values, there may be a tendency to push grades high enough to make students eligible. Alternatively if thresholds are known to students they may take the test multiple times in an attempt to raise their score above the threshold.

There are two sets of specification checks that researchers can typically perform to at least partly assess the empirical relevance of these concerns. Although the proposed procedures do not directly test null hypotheses that are required for the RD approach to be valid, it is typically difficult to argue for the validity of the approach when these null hypotheses do not hold. First, one may look for discontinuities in average value of the covariates around the threshold. In most cases, the reason for the discontinuity in the probability of the treatment does not suggest a discontinuity in the average value of covariates. Finding a discontinuity in other covariates typically casts doubt on the assumptions underlying the RD design. Specifically, for covariates $Z_i$, the test would look at the difference

$$
\tau_Z = \lim_{x \downarrow c} E[Z_i \mid X_i = x] - \lim_{x \uparrow c} E[Z_i \mid X_i = x].
$$

Second, McCrary ([...] null hypothesis of [...] of the covariate t[...] at the threshold [...] alternative of a jump in t[...] point. A discontinuity [...] covariate at the p[...] discontinuity in th[...] occurs is suggestive [...] manipulation assumption [...] on the difference

$$
\tau_{f(x)} = \lim_{x \downarrow c} f_X(x) - \lim_{x \uparrow c} f_X(x).
$$

In both cases a substantially [...] significant difference in th[...] limits suggest that there m[...] with the RD approach. In pr[...] useful than formal statistical t[...] analyses of the type disc[...] 6.4.3 where histogram-type [...] conditional expectation of [...] of the marginal density $f_X(x$[...]

16 Code in Matlab and Stata for calculating the optimal bandwidth is available on their website.
6.5 Difference-in-Differences Methods

Since the seminal work by Ashenfelter (1978) and Ashenfelter and Card (1985), the use of Difference-In-Differences (DID) methods has become widespread in empirical economics. Influential applications include Philip J. Cook and George Tauchen (1982, 1984), Card (1990), Bruce D. Meyer, W. Kip Viscusi, and David L. Durbin (1995), Card and Krueger (1993, 1994), Nada Eissa and Liebman (1996), Blundell, Alan Duncan, and Meghir (1998), and many others. The DID approach is often associated with so-called "natural experiments," where policy changes can be used to effectively define control and treatment groups. See Angrist and Krueger (1999), Angrist and Pischke (2009), and Blundell and Thomas MaCurdy (1999) for textbook discussions.

The simplest setting is one where outcomes are observed for units observed in one of two groups, in one of two time periods. Only units in one of the two groups, in the second time period, are exposed to a treatment. There are no units exposed to the treatment in the first period, and units from the control group are never observed to be exposed to the treatment. The average gain over time in the non-exposed (control) group is subtracted from the gain over time in the exposed (treatment) group. This double differencing removes biases in second period comparisons between the treatment and control group that could be the result of permanent differences between those groups, as well as biases from comparisons over time in the treatment group that could be the result of time trends unrelated to the treatment. In general this allows for the endogenous adoption of the new treatment (see Timothy Besley and Case 2000 and Athey and Imbens 2006). We discuss here the conventional set up, and recent work on inference (Bertrand, Duflo, and Mullainathan 2004; Hansen 2007a, 2007b; Donald and Lang 2007), as well as the recent extensions by Athey and Imbens (2006) who develop a functional form-free version of the difference-in-differences methodology, and Abadie, Diamond, and Jens Hainmueller (2007), who develop a method for constructing an artificial control group from multiple nonexposed groups.

6.5.1 Repeated Cross Sections

The standard model for the DID approach is as follows. Individual $i$ belongs to a group, $G_i \in \{0, 1\}$ (where group 1 is the treatment group), and is observed in time period $T_i \in \{0, 1\}$. For $i = 1, \ldots, N$, a random sample from the population, individual $i$'s group identity and time period can be treated as random variables. In the standard DID model, we can write the outcome for individual $i$ in the absence of the intervention, $Y_i(0)$, as

(33) $Y_i(0) = \alpha + \beta \cdot T_i + \gamma \cdot G_i + \varepsilon_i,$

with unknown parameters $\alpha$, $\beta$, and $\gamma$. We ignore the potential presence of other covariates, which introduce no special complications. The second coefficient in this specification, $\beta$, represents the time component common to both groups. The third coefficient, $\gamma$, represents a group-specific, time-invariant component. The fourth term, $\varepsilon_i$, represents unobservable characteristics of the individual. This term is assumed to be independent of the group indicator and have the same distribution over time, i.e., $\varepsilon_i \perp (G_i, T_i)$, and is normalized to have mean zero.

An alternative set up leading to the same estimator allows for a time-invariant individual-specific fixed effect, $\gamma_i$, potentially correlated with $G_i$, and models $Y_i(0)$ as

(34) $Y_i(0) = \alpha + \beta \cdot T_i + \gamma_i + \varepsilon_i.$

(See, e.g., Angrist and Krueger 1999.) This generalization of the standard model does
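not affect the standard DID estimand, and it will be subsumed as a special case of the model we propose.

The equation for the outcome without the treatment is combined with an equation for the outcome given the treatment: $Y_i(1) = Y_i(0) + \tau_{\mathrm{DID}}$. The standard DID estimand is under this model equal to the double difference described above, the gain over time in the treatment group minus the gain over time in the control group:

$$
\tau_{\mathrm{DID}} = \left(E[Y_i \mid G_i = 1, T_i = 1] - E[Y_i \mid G_i = 1, T_i = 0]\right) - \left(E[Y_i \mid G_i = 0, T_i = 1] - E[Y_i \mid G_i = 0, T_i = 0]\right).
$$

6.5.2 Multiple Groups and Multiple Periods

With multiple time periods and multiple groups we can use a natural extension of the two-group two-time-period model for the outcome in the absence of the intervention. Let $T$ and $G$ denote the number of time periods and groups respectively. Then: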
period, irrespective of the group membership. The distribution of $U_i$ is allowed to vary across groups, but not over time within groups, so that $U_i \perp T_i \mid G_i$. Athey and Imbens call the resulting model the changes-in-changes (CIC) model.

The standard DID model in (33) adds three additional assumptions to the CIC model, namely

(39) $U_i - E[U_i \mid G_i] \perp G_i$ (additivity)

(40) $h_0(u, t) = \phi(u + \delta \cdot t),$ (single index model)

for a strictly increasing function $\phi(\cdot)$, and

the treatment group, no assumptions are required about how the intervention affects outcomes.

The average effect of the treatment for the second period treatment group is $\tau_{\mathrm{CIC}} = E[Y_i(1) - Y_i(0) \mid G_i = 1, T_i = 1]$. Because the first term of this expression is equal to $E[Y_i(1) \mid G_i = 1, T_i = 1] = E[Y_i \mid G_i = 1, T_i = 1]$, it can be estimated directly from the data. The difficulty is in estimating the second term. Under the assumptions of monotonicity of $h_0(u, t)$ in $u$, and conditional independence of $T_i$ and $U_i$ given $G_i$, Athey and Imbens show that in fact the full distribution of $Y(0)$ given $G_i = T_i = 1$ is identified through the equality
For example, an individual may remain in a training program for a number of periods. In each period the assignment to the program is assumed to be unconfounded, given permanent characteristics and outcomes up to that point. In the last two cases we briefly discuss multivalued endogenous treatments. In the fourth case, we look at settings with a discrete multivalued treatment in the presence of endogeneity. We allow the treatment to be continuous in the final case. The last two cases tie in closely with the simultaneous equations literature, where, somewhat separately from the program evaluation literature, there has been much recent work on nonparametric identification and estimation. Especially in the discrete case, many of the results in this literature are negative in the sense that, without unattractive restrictions on heterogeneity or functional form, few objects of interest are point-identified. Some of the literature has turned toward establishing bounds. This is an area with much ongoing work and considerable scope for further research.

7.1 Multivalued Discrete Treatments with

example, with three treatments, it may be that no units are exposed to treatment level 2 if $X_i$ is in some subset of the covariate space. The insights from the binary case directly extend to this multiple (but few) treatment case. If the number of treatments is relatively large, one may wish to smooth across treatment levels in order to improve precision of the inferences.

7.2 Continuous Treatments with Unconfounded Treatment Assignment

In the case where the treatment takes on many values, Imbens (2000), Lechner (2001, 2004), Hirano and Imbens (2004), and Carlos A. Flores (2005) extended some of the propensity score methodology under unconfoundedness. The key maintained assumption is that adjusting for pre-treatment differences removes all biases, and thus solves the problem of drawing causal inferences. This is formalized by using the concept of weak unconfoundedness, introduced by Imbens (2000). Assignment to treatment $W_i$ is weakly unconfounded, given pre-treatment variables $X_i$, if
Although in substantive terms the weak unconfoundedness assumption is not very different from the assumption used by Rosenbaum and Rubin (1983b), it is important that one does not need the stronger assumption to validate estimation of the expected value of $Y_i(w)$ by adjusting for $X_i$: under weak unconfoundedness, we have $E[Y_i(w) \mid X_i] = E[Y_i(w) \mid W_i = w, X_i] = E[Y_i \mid W_i = w, X_i]$, and expected outcomes can then be estimated by averaging these conditional means: $E[Y_i(w)] = E[E[Y_i(w) \mid X_i]]$. In practice, it can be difficult to estimate $E[Y_i(w)]$ in this manner when the dimension of $X_i$ is large, or if $w$ takes on many values, because the first step requires estimation of the expectation of $Y_i(w)$ given the treatment level and all pretreatment variables. It was this difficulty that motivated Rosenbaum and Rubin (1983b) to develop the propensity score methodology.

Imbens (2000) introduces the generalized propensity score for the multiple treatment case. It is the conditional probability of receiving a particular level of the treatment given the pretreatment variables:

$$
r(w, x) = \mathrm{pr}(W_i = w \mid X_i = x).
$$

In the continuous case, where, say, $W_i$ takes values in the unit interval, $r(w, x) = f_{W \mid X}(w \mid x)$. Suppose assignment to treatment $W_i$ is weakly unconfounded given pretreatment variables $X_i$. Then, by the same argument as in the binary treatment case, assignment is weakly unconfounded given the generalized propensity score, as $\delta \to 0$,

$$
1\{w - \delta \leq W_i \leq w + \delta\} \perp Y_i(w) \mid r(w, X_i),
$$

for all $w$. This is the point where using the weak form of the unconfoundedness assumption is important. There is, in general, no scalar function of the covariates such that the level of the treatment $W_i$ is independent of the set of potential outcomes $\{Y_i(w)\}_{w \in [0,1]}$, unless additional structure is imposed on the assignment mechanism; see for example, Marshall M. Joffe and Rosenbaum (1999).

Because weak unconfoundedness given all pretreatment variables implies weak unconfoundedness given the generalized propensity score, one can estimate average outcomes by conditioning solely on the generalized propensity score. If assignment to treatment is weakly unconfounded given pretreatment variables $X$, then two results follow. First, for all $w$,

$$
\beta(w, r) = E[Y_i(w) \mid r(w, X_i) = r] = E[Y_i \mid W_i = w, r(W_i, X_i) = r],
$$

which can be estimated using data on $Y_i$, $W_i$, and $r(W_i, X_i)$. Second, the average outcome given a particular level of the treatment, $E[Y_i(w)]$, can be estimated by appropriately averaging $\beta(w, r)$:

$$
E[Y_i(w)] = E[\beta(w, r(w, X_i))].
$$

As with the implementation of the binary treatment propensity score methodology, the implementation of the generalized propensity score method consists of three steps. In the first step the score $r(w, x)$ is estimated. With a binary treatment the standard approach (Rosenbaum and Rubin 1984; Rosenbaum 1995) is to estimate the propensity score using a logistic regression. More generally, if the treatments correspond to ordered levels of a treatment, such as the dose of a drug or the time over which a treatment is applied, one may wish to impose smoothness of the score in $w$. For continuous $W_i$, Hirano and Imbens (2004) use a lognormal distribution. In the second step, the conditional expectation $\beta(w, r) = E[Y_i \mid W_i = w, r(W_i, X_i) = r]$ is estimated. Again, the implementation may be different in the case where the levels of the treatment are qualitatively distinct than in the case where smoothness of the conditional expectation function in $w$ is appropriate.
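The three-step generalized propensity score procedure (estimate the score, estimate the conditional expectation of the outcome given treatment level and score, then average that expectation at $r(w, X_i)$ over all units) can be sketched for the simplest case of a few qualitatively distinct treatment levels and a discrete covariate. Everything specific below is an assumption made for illustration rather than the method of any cited paper: the simulated design, the use of cell frequencies and cell means in place of parametric models, and the function names (`simulate_multivalued`, `estimate_gps`, `average_outcomes_gps`).

```python
import numpy as np

def simulate_multivalued(n=100_000, seed=0):
    """Hypothetical design: three treatment levels, assignment depends on X only."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 3, size=n)
    # Row x gives pr(W = 0, 1, 2 | X = x); all entries positive (overlap).
    probs = np.array([[0.6, 0.3, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.1, 0.3, 0.6]])
    cum = probs.cumsum(axis=1)[x]
    u = rng.random(n)
    w = (u[:, None] > cum[:, :2]).sum(axis=1)   # inverse-CDF draw of the level
    y = w + x + rng.normal(size=n)              # Y_i(w) = w + X_i + noise
    return y, w, x

def estimate_gps(w, x, levels):
    """Step 1: r_hat(level, X_i), estimated by the treatment frequency in each X cell."""
    scores = {}
    for level in levels:
        r = np.empty(len(x))
        for cell in np.unique(x):
            in_cell = x == cell
            r[in_cell] = np.mean(w[in_cell] == level)
        scores[level] = r
    return scores

def average_outcomes_gps(y, w, x, levels):
    """Steps 2 and 3: estimate beta(level, r), then average it at r(level, X_i)."""
    scores = estimate_gps(w, x, levels)
    means = {}
    for level in levels:
        r = scores[level]
        at_level = w == level
        fitted = np.empty(len(y))
        for score in np.unique(r):
            same_score = r == score
            # Step 2: mean outcome among units with W = level and this score value.
            fitted[same_score] = y[at_level & same_score].mean()
        # Step 3: average over all units, evaluating the score at the level of
        # interest, r(level, X_i), not at the received treatment, r(W_i, X_i).
        means[level] = fitted.mean()
    return means

y, w, x = simulate_multivalued()
levels = [0, 1, 2]
gps_means = average_outcomes_gps(y, w, x, levels)
naive_means = {level: y[w == level].mean() for level in levels}
print("GPS-adjusted:", {k: round(v, 3) for k, v in gps_means.items()})
print("Unadjusted:  ", {k: round(v, 3) for k, v in naive_means.items()})
```

In this simulated design the average potential outcome at level $w$ is $w + 1$, so the adjusted means can be compared with that benchmark, while the unadjusted means by treatment level also reflect the dependence of assignment on $X_i$. With continuous treatments or many covariates the cell-based steps would be replaced by parametric or smoothed estimates.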
Here, some form of linear or nonlinear regression may be used. In the third step the average response at treatment level $w$ is estimated as the average of the estimated conditional expectation, $\hat{\beta}(w, r(w, X_i))$, averaged over the distribution of the pretreatment variables, $X_1, \ldots, X_N$. Note that to get the average $E[Y_i(w)]$, the second argument in the conditional expectation $\beta(w, r)$ is evaluated at $r(w, X_i)$, not at $r(W_i, X_i)$.

7.2.1 Dynamic Treatments with Unconfounded Treatment Assignment

Multiple-valued treatments can arise because at any point in time individuals can be assigned to multiple different treatment arms, or because they can be assigned sequentially to different treatments. Gill and Robins (2001) analyze this case, where they assume that at any point in time an unconfoundedness assumption holds. Lechner and Miquel (2005) (see also Lechner 1999, and Lechner, Miquel, and Conny Wunsch 2004) study a related case, where again a sequential unconfoundedness assumption is maintained to identify the average effects of interest. Abbring and Gerard J. van den Berg (2003) study settings with duration data. These methods hold great promise but, until now, there have been few substantive applications.

7.3 Multivalued Discrete Endogenous Treatments

In settings with general heterogeneity in the effects of the treatment, the case with more than two treatment levels is considerably more challenging than the binary case. There are few studies investigating identification in these settings. Angrist and Imbens (1995) and Angrist, Kathryn Graddy and Imbens (2000) study the interpretation of the standard instrumental variable estimand, the ratio of the covariances of outcome and instrument and treatment and instrument. They show that in general, with a valid instrument, the instrumental variables estimand can still be interpreted as an average causal effect, but with a complicated weighting scheme. There are essentially two levels of averaging going on. First, at each level of the treatment we can only get the average effect of a unit increase in the treatment for compliers at that level. In addition, there is averaging over all levels of the treatment, with the weights equal to the proportion of compliers at that level.

Imbens (2007) studies, in more detail, the case where the endogenous treatment takes on three values and shows the limits to identification in the case with heterogeneous treatment effects.

7.4 Continuous Endogenous Treatments

Perhaps surprisingly, there are many more results for the case with continuous endogenous treatments than for the discrete case that do not impose restrictive assumptions. Much of the focus has been on triangular systems, with a single unobserved component of the equation determining the treatment:

$$
W_i = h(Z_i, \eta_i),
$$

where $\eta_i$ is scalar, and an essentially unrestricted outcome equation:

$$
Y_i = g(W_i, \varepsilon_i),
$$

where $\varepsilon_i$ may be a vector. Blundell and James L. Powell (2003, 2004), Chernozhukov and Hansen (2005), Imbens and Newey (forthcoming), and Andrew Chesher (2003) study various versions of this setup. Imbens and Newey (forthcoming) show that if $h(z, \eta)$ is strictly monotone in $\eta$, then one can identify average effects of the treatment subject to support conditions on the instrument. They suggest a control function approach to estimation. First $\eta$ is normalized to have a uniform distribution on $[0, 1]$ (e.g., Rosa L. Matzkin 2003). Then $\eta_i$ is estimated
Security Administrative Records." American Economic Review, 80(3): 313-36.
Angrist, Joshua D. 1998. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica, 66(2): 249-88.
Angrist, Joshua D. 2004. "Treatment Effect Heterogeneity in Theory and Practice." Economic Journal, 114(494): C52-83.
Angrist, Joshua D., Eric Bettinger, and Michael Kremer. 2006. "Long-Term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia." American Economic Review, 96(3): 847-62.
Angrist, Joshua D., Kathryn Graddy, and Guido W. Imbens. 2000. "The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish." Review of Economic Studies, 67(3): 499-527.
Angrist, Joshua D., and Jinyong Hahn. 2004. "When to Control for Covariates? Panel Asymptotics for Estimates of Treatment Effects." Review of Economics and Statistics, 86(1): 58-72.
Angrist, Joshua D., and Guido W. Imbens. 1995. "Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity." Journal of the American Statistical Association, 90(430): 431-42.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association, 91(434): 444-55.
Angrist, Joshua D., and Alan B. Krueger. 1999. "Empirical Strategies in Labor Economics." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1277-1366. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Angrist, Joshua D., and Kevin Lang. 2004. "Does School Integration Generate Peer Effects? Evidence from Boston's Metco Program." American Economic Review, 94(5): 1613-34.
Angrist, Joshua D., and Victor Lavy. 1999. "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." Quarterly Journal of Economics, 114(2): 533-75.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.
Ashenfelter, Orley. 1978. "Estimating the Effect of Training Programs on Earnings." Review of Economics and Statistics, 60(1): 47-57.
Ashenfelter, Orley, and David Card. 1985. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics, 67(4): 648-60.
Athey, Susan, and Guido W. Imbens. 2006. "Identification and Inference in Nonlinear Difference-in-Differences Models." Econometrica, 74(2): 431-97.
Athey, Susan, and Scott Stern. 1998. "An Empirical Framework for Testing Theories About Complementarity in Organizational Design." National Bureau of Economic Research Working Paper 6600.
Attanasio, Orazio, Costas Meghir, and Ana Santiago. 2005. "Education Choices in Mexico: Using a Structural Model and a Randomized Experiment to Evaluate Progresa." Institute for Fiscal Studies Centre for the Evaluation of Development Policies Working Paper EWP05/01.
Austin, Peter C. 2008a. "A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003." Statistics in Medicine, 27(12): 2037-49.
Austin, Peter C. 2008b. "Discussion of 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003': Rejoinder." Statistics in Medicine, 27(12): 2066-69.
Balke, Alexander, and Judea Pearl. 1994. "Nonparametric Bounds of Causal Effects from Partial Compliance Data." University of California Los Angeles Cognitive Systems Laboratory Technical Report R-199.
Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics, 122(3): 1235-64.
Barnow, Burt S., Glen G. Cain, and Arthur S. Goldberger. 1980. "Issues in the Analysis of Selectivity Bias." In Evaluation Studies, Volume 5, ed. Ernst W. Stromsdorfer and George Farkas, 43-59. San Francisco: Sage.
Becker, Sascha O., and Andrea Ichino. 2002. "Estimation of Average Treatment Effects Based on Propensity Scores." Stata Journal, 2(4): 358-77.
Behncke, Stefanie, Markus Frölich, and Michael Lechner. 2006. "Statistical Assistance for Programme Selection - For a Better Targeting of Active Labour Market Policies in Switzerland." University of St. Gallen Department of Economics Discussion Paper 2006-09.
Beresteanu, Arie, and Francesca Molinari. 2006. "Asymptotic Properties for a Class of Partially Identified Models." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/06.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics, 119(1): 249-75.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review, 94(4): 991-1013.
Besley, Timothy, and Anne C. Case. 2000. "Unnatural Experiments? Estimating the Incidence of Endogenous Policies." Economic Journal, 110(467): F672-94.
Bierens, Herman J. 1987. "Kernel Estimators of Regression Functions." In Advances in Econometrics: Fifth World Congress, Volume 1, ed. Truman F. Bewley, 99-144. Cambridge and New York: Cambridge University Press.
Bitler, Marianne, Jonah Gelbach, and Hilary Hoynes. 2006. "What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments." American Economic Review, 96(4): 988-1012.
Björklund, Anders, and Robert Moffitt. 1987. "The Estimation of Wage Gains and Welfare Gains in Self-Selection." Review of Economics and Statistics, 69(1): 42-49.
Black, Sandra E. 1999. "Do Better Schools Matter? Parental Valuation of Elementary Education." Quarterly Journal of Economics, 114(2): 577-99.
Bloom, Howard S. 1984. "Accounting for No-Shows in Experimental Evaluation Designs." Evaluation Review, 8(2): 225-46.
Bloom, Howard S., ed. 2005. Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.
Blundell, Richard, and Monica Costa Dias. 2002. "Alternative Approaches to Evaluation in Empirical Microeconomics." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/02.
Blundell, Richard, Monica Costa Dias, Costas Meghir, and John Van Reenen. 2001. "Evaluating the Employment Impact of a Mandatory Job Search Assistance Program." Institute for Fiscal Studies Working Paper W01/20.
Blundell, Richard, Alan Duncan, and Costas Meghir. 1998. "Estimating Labor Supply Responses Using Tax Reforms." Econometrica, 66(4): 827-61.
Blundell, Richard, Amanda Gosling, Hidehiko Ichimura, and Costas Meghir. 2004. "Changes in the Distribution of Male and Female Wages Accounting for Employment Composition Using Bounds." Institute for Fiscal Studies Working Paper W04/25.
Blundell, Richard, and Thomas MaCurdy. 1999. "Labor Supply: A Review of Alternative Approaches." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1559-1695. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Blundell, Richard, and James L. Powell. 2003. "Endogeneity in Nonparametric and Semiparametric Regression Models." In Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress, Volume 2, ed. Mathias Dewatripont, Lars Peter Hansen, and Stephen J. Turnovsky, 312-57. Cambridge and New York: Cambridge University Press.
Blundell, Richard, and James L. Powell. 2004. "Endogeneity in Semiparametric Binary Response Models." Review of Economic Studies, 71(3): 655-79.
Brock, William, and Steven N. Durlauf. 2000. "Interactions-Based Models." National Bureau of Economic Research Technical Working Paper 258.
Bruhn, Miriam, and David McKenzie. 2008. "In Pursuit of Balance: Randomization in Practice in Development Field Experiments." World Bank Policy Research Working Paper 4752.
Burtless, Gary. 1995. "The Case for Randomized Field Trials in Economic and Policy Research." Journal of Economic Perspectives, 9(2): 63-84.
Busso, Matias, John DiNardo, and Justin McCrary. 2008. "Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects." Unpublished.
Caliendo, Marco. 2006. Microeconometric Evaluation of Labour Market Policies. Heidelberg: Springer, Physica-Verlag.
Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge and New York: Cambridge University Press.
Canay, Ivan A. 2007. "EL Inference for Partially Identified Models: Large Deviations Optimality and Bootstrap Validity." Unpublished.
Card, David. 1990. "The Impact of the Mariel Boatlift on the Miami Labor Market." Industrial and Labor Relations Review, 43(2): 245-57.
Card, David. 2001. "Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems." Econometrica, 69(5): 1127-60.
Card, David, Carlos Dobkin, and Nicole Maestas. 2004. "The Impact of Nearly Universal Insurance Coverage on Health Care Utilization and Health: Evidence from Medicare." National Bureau of Economic Research Working Paper 10365.
Card, David, and Dean R. Hyslop. 2005. "Estimating the Effects of a Time-Limited Earnings Subsidy for Welfare-Leavers." Econometrica, 73(6): 1723-70.
Card, David, and Alan B. Krueger. 1993. "Trends in Relative Black-White Earnings Revisited." American Economic Review, 83(2): 85-91.
Card, David, and Alan B. Krueger. 1994. "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." American Economic Review, 84(4): 772-93.
Card, David, and Phillip B. Levine. 1994. "Unemployment Insurance Taxes and the Cyclical and Seasonal Properties of Unemployment." Journal of Public Economics, 53(1): 1-29.
Card, David, Alexandre Mas, and Jesse Rothstein. 2007. "Tipping and the Dynamics of Segregation." National Bureau of Economic Research Working Paper 13052.
Card, David, and Brian P. McCall. 1996. "Is Workers' Compensation Covering Uninsured Medical Costs? Evidence from the 'Monday Effect.'" Industrial and Labor Relations Review, 49(4): 690-706.
Card, David, and Philip K. Robins. 1996. "Do Financial Incentives Encourage Welfare Recipients to Work? Evidence from a Randomized Evaluation of the Self-Sufficiency Project." National Bureau of Economic Research Working Paper 5701.
Card, David, and Daniel G. Sullivan. 1988. "Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment." Econometrica, 56(3): 497-530.
Case, Anne C., and Lawrence F. Katz. 1991. "The Company You Keep: The Effects of Family and Neighborhood on Disadvantaged Youths." National Bureau of Economic Research Working Paper 3705.
Chamberlain, Gary. 1986. "Asymptotic Efficiency in Semi-parametric Models with Censoring." Journal of Econometrics, 32(2): 189-218.
Chattopadhyay, Raghabendra, and Esther Duflo. 2004. "Women as Policy Makers: Evidence from a Randomized Policy Experiment in India." Econometrica, 72(5): 1409-43.
Chay, Kenneth Y., and Michael Greenstone. 2005. "Does Air Quality Matter? Evidence from the Housing Market." Journal of Political Economy, 113(2): 376-424.
Chen, Susan, and Wilbert van der Klaauw. 2008. "The Work Disincentive Effects of the Disability Insurance Program in the 1990s." Journal of Econometrics, 142(2): 757-84.
Chen, Xiaohong. 2007. "Large Sample Sieve Estimation of Semi-nonparametric Models." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5549-5632. Amsterdam and Oxford: Elsevier, North-Holland.
Chen, Xiaohong, Han Hong, and Alessandro Tarozzi. 2008. "Semiparametric Efficiency in GMM Models with Auxiliary Data." Annals of Statistics, 36(2): 808-43.
Chernozhukov, Victor, and Christian B. Hansen. 2005. "An IV Model of Quantile Treatment Effects." Econometrica, 73(1): 245-61.
Chernozhukov, Victor, Han Hong, and Elie Tamer. 2007. "Estimation and Confidence Regions for Parameter Sets in Econometric Models." Econometrica, 75(5): 1243-84.
Chesher, Andrew. 2003. "Identification in Nonseparable Models." Econometrica, 71(5): 1405-41.
Chetty, Raj, Adam Looney, and Kory Kroft. Forthcoming. "Salience and Taxation: Theory and Evidence." American Economic Review.
Cochran, William G. 1968. "The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies." Biometrics, 24(2): 295-314.
Cochran, William G., and Donald B. Rubin. 1973. "Controlling Bias in Observational Studies: A Review." Sankhya, 35(4): 417-46.
Cook, Thomas D. 2008. "'Waiting for Life to Arrive': A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics." Journal of Econometrics, 142(2): 636-54.
Cook, Philip J., and George Tauchen. 1982. "The Effect of Liquor Taxes on Heavy Drinking." Bell Journal of Economics, 13(2): 379-90.
Cook, Philip J., and George Tauchen. 1984. "The Effect of Minimum Drinking Age Legislation on Youthful Auto Fatalities, 1970-1977." Journal of Legal Studies, 13(1): 169-90.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. "Dealing with Limited Overlap in Estimation of Average Treatment Effects." Biometrika, 96: 187-99.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2008. "Nonparametric Tests for Treatment Effect Heterogeneity." Review of Economics and Statistics, 90(3): 389-405.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge and New York: Cambridge University Press.
Dehejia, Rajeev H. 2003. "Was There a Riverside Miracle? A Hierarchical Framework for Evaluating Programs with Grouped Data." Journal of Business and Economic Statistics, 21(1): 1-11.
Dehejia, Rajeev H. 2005a. "Practical Propensity Score Matching: A Reply to Smith and Todd." Journal of Econometrics, 125(1-2): 355-64.
Dehejia, Rajeev H. 2005b. "Program Evaluation as a Decision Problem." Journal of Econometrics, 125(1-2): 141-73.
Dehejia, Rajeev H., and Sadek Wahba. 1999. "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs." Journal of the American Statistical Association, 94(448): 1053-62.
Diamond, Alexis, and Jasjeet S. Sekhon. 2008. "Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." Unpublished.
DiNardo, John, and David S. Lee. 2004. "Economic Impacts of New Unionization on Private Sector Employers: 1984-2001." Quarterly Journal of Economics, 119(4): 1383-1441.
Doksum, Kjell. 1974. "Empirical Probability Plots and Statistical Inference for Nonlinear Models in the Two-Sample Case." Annals of Statistics, 2(2): 267-77.
Donald, Stephen G., and Kevin Lang. 2007. "Inference with Difference-in-Differences and Other Panel Data." Review of Economics and Statistics, 89(2): 221-33.
Duflo, Esther. 2001. "Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment." American Economic Review, 91(4): 795-813.
Duflo, Esther, William Gale, Jeffrey B. Liebman, Peter Orszag, and Emmanuel Saez. 2006. "Saving Incentives for Low- and Middle-Income Families: Evidence from a Field Experiment with H&R Block." Quarterly Journal of Economics, 121(4): 1311-46.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Volume 4, ed. T. Paul Schultz and John Strauss, 3895-3962. Amsterdam and Oxford: Elsevier, North-Holland.
Duflo, Esther, and Rema Hanna. 2005. "Monitoring Works: Getting Teachers to Come to School." National Bureau of Economic Research Working Paper 11880.
Duflo, Esther, and Emmanuel Saez. 2003. "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment." Quarterly Journal of Economics, 118(3): 815-42.
Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York and London: Chapman and Hall.
Eissa, Nada, and Jeffrey B. Liebman. 1996. "Labor Supply Response to the Earned Income Tax Credit." Quarterly Journal of Economics, 111(2): 605-37.
Engle, Robert F., David F. Hendry, and Jean-Francois Richard. 1983. "Exogeneity." Econometrica, 51(2): 277-304.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. London: Chapman and Hall.
Ferraz, Claudio, and Frederico Finan. 2008. "Exposing Corrupt Politicians: The Effects of Brazil's Publicly Released Audits on Electoral Outcomes." Quarterly Journal of Economics, 123(2): 703-45.
Firpo, Sergio. 2007. "Efficient Semiparametric Estimation of Quantile Treatment Effects." Econometrica, 75(1): 259-76.
Fisher, Ronald A. 1935. The Design of Experiments, First edition. London: Oliver and Boyd.
Flores, Carlos A. 2005. "Estimation of Dose-Response Functions and Optimal Doses with a Continuous Treatment." Unpublished.
Fraker, Thomas, and Rebecca Maynard. 1987. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." Journal of Human Resources, 22(2): 194-227.
Friedlander, Daniel, and Judith M. Gueron. 1992. "Are High-Cost Services More Effective than Low-Cost Services?" In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 143-98. Cambridge and London: Harvard University Press.
Friedlander, Daniel, and Philip K. Robins. 1995. "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods." American Economic Review, 85(4): 923-37.
Frölich, Markus. 2004a. "Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators." Review of Economics and Statistics, 86(1): 77-90.
Frölich, Markus. 2004b. "A Note on the Role of the Propensity Score for Estimating Average Treatment Effects." Econometric Reviews, 23(2): 167-74.
Gill, Richard D., and James M. Robins. 2001. "Causal Inference for Complex Longitudinal Data: The Continuous Case." Annals of Statistics, 29(6): 1785-1811.
Glaeser, Edward L., Bruce Sacerdote, and Jose A. Scheinkman. 1996. "Crime and Social Interactions." Quarterly Journal of Economics, 111(2): 507-48.
Goldberger, Arthur S. 1972a. "Selection Bias in Evaluating Treatment Effects: Some Formal Illustrations." Unpublished.
Goldberger, Arthur S. 1972b. "Selection Bias in Evaluating Treatment Effects: The Case of Interaction." Unpublished.
Gourieroux, C., A. Monfort, and A. Trognon. 1984a. "Pseudo Maximum Likelihood Methods: Applications to Poisson Models." Econometrica, 52(3): 701-20.
Gourieroux, C., A. Monfort, and A. Trognon. 1984b. "Pseudo Maximum Likelihood Methods: Theory." Econometrica, 52(3): 681-700.
Graham, Bryan S. 2008. "Identifying Social Interactions through Conditional Variance Restrictions." Econometrica, 76(3): 643-60.
Graham, Bryan S., Guido W. Imbens, and Geert Ridder. 2006. "Complementarity and Aggregate Implications of Assortative Matching: A Nonparametric Analysis." Unpublished.
Greenberg, David, and Michael Wiseman. 1992. "What Did the OBRA Demonstrations Do?" In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 25-75. Cambridge and London: Harvard University Press.
Gu, X., and Paul R. Rosenbaum. 1993. "Comparison of Multivariate Matching Methods: Structures, Distances and Algorithms." Journal of Computational and Graphical Statistics, 2(4): 405-20.
Gueron, Judith M., and Edward Pauly. 1991. From Welfare to Work. New York: Russell Sage Foundation.
Haavelmo, Trygve. 1943. "The Statistical Implications of a System of Simultaneous Equations." Econometrica, 11(1): 1-12.
Hahn, Jinyong. 1998. "On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects." Econometrica, 66(2): 315-31.
Hahn, Jinyong, Petra E. Todd, and Wilbert van der Klaauw. 2001. "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design." Econometrica, 69(1): 201-09.
Ham, John C., and Robert J. LaLonde. 1996. "The Effect of Sample Selection and Initial Conditions in Duration Models: Evidence from Experimental Data on Training." Econometrica, 64(1): 175-205.
Hamermesh, Daniel S., and Jeff E. Biddle. 1994. "Beauty and the Labor Market." American Economic Review, 84(5): 1174-94.
Hansen, B. B. 2008. "The Essential Role of Balance Tests in Propensity-Matched Observational Studies: Comments on 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003' by Peter Austin." Statistics in Medicine, 27(12): 2050-54.
Hansen, Christian B. 2007a. "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large." Journal of Econometrics, 141(2): 597-620.
Hansen, Christian B. 2007b. "Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects." Journal of Econometrics, 140(2): 670-94.
Hanson, Samuel, and Adi Sunderam. 2008. "The Variance of Average Treatment Effect Estimators in the Presence of Clustering." Unpublished.
Härdle, Wolfgang. 1990. Applied Nonparametric Regression. Cambridge; New York and Melbourne: Cambridge University Press.
Heckman, James J. 1990. "Varieties of Selection Bias." American Economic Review, 80(2): 313-18.
Heckman, James J., and V. Joseph Hotz. 1989. "Choosing among Alternative Nonexperimental
Methods for Estimating the Impact of Social Programs: The Case of Manpower Training." Journal of the American Statistical Association, 84(408): 862-74.
Heckman, James J., Hidehiko Ichimura, Jeffrey A. Smith, and Petra E. Todd. 1998. "Characterizing Selection Bias Using Experimental Data." Econometrica, 66(5): 1017-98.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. "Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme." Review of Economic Studies, 64(4): 605-54.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1998. "Matching as an Econometric Evaluation Estimator." Review of Economic Studies, 65(2): 261-94.
Heckman, James J., Robert J. LaLonde, and Jeffrey A. Smith. 1999. "The Economics and Econometrics of Active Labor Market Programs." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1865-2097. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Heckman, James J., Lance Lochner, and Christopher Taber. 1999. "Human Capital Formation and General Equilibrium Treatment Effects: A Study of Tax and Tuition Policy." Fiscal Studies, 20(1): 25-40.
Heckman, James J., and Salvador Navarro-Lozano. 2004. "Using Matching, Instrumental Variables, and Control Functions to Estimate Economic Choice Models." Review of Economics and Statistics, 86(1): 30-57.
Heckman, James J., and Richard Robb Jr. 1985. "Alternative Methods for Evaluating the Impact of Interventions." In Longitudinal Analysis of Labor Market Data, ed. James J. Heckman and Burton Singer, 156-245. Cambridge; New York and Sydney: Cambridge University Press.
Heckman, James J., and Jeffrey A. Smith. 1995. "Assessing the Case for Social Experiments." Journal of Economic Perspectives, 9(2): 85-110.
Heckman, James J., and Jeffrey A. Smith. 1997. "Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts." Review of Economic Studies, 64(4): 487-535.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. "Understanding Instrumental Variables in Models with Essential Heterogeneity." Review of Economics and Statistics, 88(3): 389-432.
Heckman, James J., and Edward Vytlacil. 2005. "Structural Equations, Treatment Effects, and Econometric Policy Evaluation." Econometrica, 73(3): 669-738.
Heckman, James J., and Edward Vytlacil. 2007a. "Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4779-4874. Amsterdam and Oxford: Elsevier, North-Holland.
Heckman, James J., and Edward Vytlacil. 2007b. "Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast Their Effects in New Environments." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4875-5143. Amsterdam and Oxford: Elsevier, North-Holland.
Hill, Jennifer. 2008. "Discussion of Research Using Propensity-Score Matching: Comments on 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003' by Peter Austin." Statistics in Medicine, 27(12): 2055-61.
Hirano, Keisuke, and Guido W. Imbens. 2001. "Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization." Health Services and Outcomes Research Methodology, 2(3-4): 259-78.
Hirano, Keisuke, and Guido W. Imbens. 2004. "The Propensity Score with Continuous Treatments." In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. Andrew Gelman and Xiao-Li Meng, 73-84. Hoboken, N.J.: Wiley.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score." Econometrica, 71(4): 1161-89.
Hirano, Keisuke, Guido W. Imbens, Donald B. Rubin, and Xiao-Hua Zhou. 2000. "Assessing the Effect of an Influenza Vaccine in an Encouragement Design." Biostatistics, 1(1): 69-88.
Hirano, Keisuke, and Jack R. Porter. 2008. "Asymptotics for Statistical Treatment Rules." http://www.u.arizona.edu/~hirano/hp3_2008_08_10.pdf.
Holland, Paul W. 1986. "Statistics and Causal Inference." Journal of the American Statistical Association, 81(396): 945-60.
Horowitz, Joel L. 2001. "The Bootstrap." In Handbook of Econometrics, Volume 5, ed. James J. Heckman and Edward Leamer, 3159-3228. Amsterdam; London and New York: Elsevier Science, North-Holland.
Horowitz, Joel L., and Charles F. Manski. 2000. "Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data." Journal of the American Statistical Association, 95(449): 77-84.
Horvitz, D. G., and D. J. Thompson. 1952. "A Generalization of Sampling without Replacement from a Finite Universe." Journal of the American Statistical Association, 47(260): 663-85.
Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2006. "Evaluating the Differential Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program." Journal of Labor Economics, 24(3): 521-66.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. "Predicting the Efficacy of Future Training Programs Using Past Experiences at Other
Locations." Journal of Econometrics, 125(1-2): 241-70.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders. 1997. "Bounding Causal Effects Using Data from a Contaminated Natural Experiment: Analysing the Effects of Teenage Childbearing." Review of Economic Studies, 64(4): 575-603.
Iacus, Stefano M., Gary King, and Giuseppe Porro. 2008. "Matching for Causal Inference without Balance Checking." Unpublished.
Ichimura, Hidehiko, and Oliver Linton. 2005. "Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators." In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. Donald W. K. Andrews and James H. Stock, 149-70. Cambridge and New York: Cambridge University Press.
Ichimura, Hidehiko, and Petra E. Todd. 2007. "Implementing Nonparametric and Semiparametric Estimators." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5369-5468. Amsterdam and Oxford: Elsevier, North-Holland.
Imbens, Guido W. 2000. "The Role of the Propensity Score in Estimating Dose-Response Functions." Biometrika, 87(3): 706-10.
Imbens, Guido W. 2003. "Sensitivity to Exogeneity Assumptions in Program Evaluation." American Economic Review, 93(2): 126-32.
Imbens, Guido W. 2004. "Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review." Review of Economics and Statistics, 86(1): 4-29.
Imbens, Guido W. 2007. "Non-additive Models with Endogenous Regressors." In Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, Volume 3, ed. Richard Blundell, Whitney K. Newey, and Torsten Persson, 17-46. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., and Joshua D. Angrist. 1994. "Identification and Estimation of Local Average Treatment Effects." Econometrica, 62(2): 467-75.
Imbens, Guido W., and Karthik Kalyanaraman. 2009. "Optimal Bandwidth Choice for the Regression Discontinuity Estimator." National Bureau of Economic Research Working Paper 14726.
Imbens, Guido W., Gary King, David McKenzie, and Geert Ridder. 2008. "On the Benefits of Stratification in Randomized Experiments." Unpublished.
Imbens, Guido W., and Thomas Lemieux. 2008. "Regression Discontinuity Designs: A Guide to Practice." Journal of Econometrics, 142(2): 615-35.
Imbens, Guido W., and Charles F. Manski. 2004. "Confidence Intervals for Partially Identified Parameters." Econometrica, 72(6): 1845-57.
Imbens, Guido W., and Whitney K. Newey. Forthcoming. "Identification and Estimation of Triangular Simultaneous Equations Models without Additivity." National Bureau of Economic Research Technical Econometrica.
Imbens, Guido W., Whitney K. Newey, and Geert Ridder. 2005. "Mean-Squared-Error Calculations for Average Treatment Effects." Unpublished.
Imbens, Guido W., and Donald B. Rubin. 1997a. "Bayesian Inference for Causal Effects in Randomized Experiments with Noncompliance." Annals of Statistics, 25(1): 305-27.
Imbens, Guido W., and Donald B. Rubin. 1997b. "Estimating Outcome Distributions for Compliers in Instrumental Variables Models." Review of Economic Studies, 64(4): 555-74.
Imbens, Guido W., and Donald B. Rubin. Forthcoming. Causal Inference in Statistics and the Social Sciences. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., Donald B. Rubin, and Bruce I. Sacerdote. 2001. "Estimating the Effect of Unearned Income on Labor Earnings, Savings, and Consumption: Evidence from a Survey of Lottery Players." American Economic Review, 91(4): 778-94.
Jin, Ginger Zhe, and Phillip Leslie. 2003. "The Effect of Information on Product Quality: Evidence from Restaurant Hygiene Grade Cards." Quarterly Journal of Economics, 118(2): 409-51.
Joffe, Marshall M., and Paul R. Rosenbaum. 1999. "Invited Commentary: Propensity Scores." American Journal of Epidemiology, 150(4): 327-33.
Kitagawa, Toru. 2008. "Identification Bounds for the Local Average Treatment Effect." Unpublished.
Kling, Jeffrey R., Jeffrey B. Liebman, and Lawrence F. Katz. 2007. "Experimental Analysis of Neighborhood Effects." Econometrica, 75(1): 83-119.
Lalive, Rafael. 2008. "How Do Extended Benefits Affect Unemployment Duration? A Regression Discontinuity Approach." Journal of Econometrics, 142(2): 785-806.
LaLonde, Robert J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review, 76(4): 604-20.
Lechner, Michael. 1999. "Earnings and Employment Effects of Continuous Off-the-Job Training in East Germany after Unification." Journal of Business and Economic Statistics, 17(1): 74-90.
Lechner, Michael. 2001. "Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption." In Econometric Evaluation of Labour Market Policies, ed. Michael Lechner and Friedhelm Pfeiffer, 43-58. Heidelberg and New York: Physica; Mannheim: Centre for European Economic Research.
Lechner, Michael. 2002a. "Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies." Review of Economics and Statistics, 84(2): 205-20.
Lechner, Michael. 2002b. "Some Practical Issues in the Evaluation of Heterogeneous Labour Market Programmes by Matching Methods." Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(1): 59-82.
Lechner, Michael. 2004. "Sequential Matching
Estimation of Dynamic Causal Models." University Evidence from a Regression Discontinuity Design."
of St. Gallen Department of Economics Discussion National Bureau of Economic Research Working
Paper 2004-06. Paper 11702.
Lechner, Michael, and Ruth Miquel. 2005. "Identi Ludwig, Jens, and Douglas L. Miller. 2007. "Does
fication of the Effects of Dynamic Treatments By Head Start Improve Children's Life Chances? Evi
Sequential Conditional Independence Assump dence from a Regression Discontinuity Design."
tions." University of St. Gallen Department of Eco Quarterly Journal of Economics, 122(1): 159-208.
nomics Discussion Paper 2005-17. Manski, Charles F. 1990. "Nonparametric Bounds on
Lechner, Michael, Ruth Miquel, and Conny Wunsch. Treatment Effects." American Economic Review,
2004. "Long-Run Effects of Public Sector Spon 80(2): 319-23.
sored Training in West Germany." Institute for the Manski, Charles F. 1993. "Identification of Endogenous
Study of Labor Discussion Paper 1443. Social Effects: The Reflection Problem." Review of
Lee, David S. 2001. "The Electoral Advantage to Economic Studies, 60(3): 531-42.
Incumbency and the Voters' Valuation of Politicians' Manski, Charles F. 1995. Identification Problems in
Experience: A Regression Discontinuity Analysis of the Social Sciences. Cambridge and London: Har
Elections to the U.S. ... " National Bureau of Eco vard University Press.
nomic Research Working Paper 8441. Manski, Charles F. 2000a. "Economic Analysis of
Lee, David S. 2008. "Randomized Experiments from Social Interactions." Journal of Economic Perspec
Non-random Selection in U.S. House Elections." tives, 14(3): 115-36.
Journal of Econometrics, 142(2): 675-97. Manski, Charles F. 2000b. "Identification Problems
Lee, David S., and David Card. 2008. "Regression and Decisions under Ambiguity: Empirical Analysis
Discontinuity Inference with Specification Error." of Treatment Response and Normative Analysis of
Journal of Econometrics, 142(2): 655-74. Treatment Choice." Journal of Econometrics, 95(2):
Lee, David S., and Thomas Lemieux. 2008. "Regression 415-42.
Discontinuity Designs in Economics." Unpublished. Manski, Charles F. 2001. "Designing Programs for
Lee, David S., Enrico Moretti, and Matthew J. Butler. Heterogeneous Populations: The Value of Covariate
2004. "Do Voters Affect or Elect Policies? Evidence Information." American Economic Review, 91(2):
from the U.S. House." Quarterly Journal of Eco 103-06.
nomics, 119(3): 807-59. Manski, Charles F. 2002. "Treatment Choice under
Lee, Myoung-Jae. 2005a. Micro-Econometrics for Ambiguity Induced By Inferential Problems." Jour
Policy, Program, and Treatment Effects. Oxford and nal of Statistical Planning and Inference, 105(1):
New York: Oxford University Press. 67-82.
Lee, Myoung-Jae. 2005b. "Treatment Effect and Sen Manski, Charles F. 2003. Partial Identification of
sitivity Analysis for Self-Selected Treatment and Probability Distributions. New York and Heidel
Selectively Observed Response." Unpublished. berg: Springer.
Lehmann, Erich L. 1974. Nonparametrics: Statistical Manski, Charles F. 2004. "Statistical Treatment Rules
Methods Based on Ranks. San Francisco: Holden for Heterogeneous Populations." Econometrica,
Day. 72(4): 1221-46.
Lemieux, Thomas, and Kevin Milligan. 2008. "Incen Manski, Charles F. 2005. Social Choice with Partial
tive Effects of Social Assistance: A Regression Dis Knowledge of Treatment Response. Princeton and
continuity Approach." Journal of Econometrics, Oxford: Princeton University Press.
142(2): 807-28. Manski, Charles F. 2007. Identification for Prediction
Leuven, Edwin, and Barbara Sianesi. 2003. and Decision. Cambridge and London: Harvard
"PSMATCH2: Stata Module to Perform Full University Press.
Mahalanobis and Propensity Score Matching, Manski, Charles F., and John V. Pepper. 2000. "Mono
Common Support Graphing, and Covariate Imbal tone Instrumental Variables: With an Application
ance Testing." http://ideas.repec.org/c/boc/bocode/ to the Returns to Schooling." Econometrica, 68(4):
s432001.html. 997-1010.
Li, Qi, Jeffrey S. Racine, and Jeffrey M. Wooldridge. Manski, Charles F., Gary D. Sandefur, Sara McLana
Forthcoming. "Efficient Estimation of Average han, and Daniel Powers. 1992. "Alternative Esti
Treatment Effects with Mixed Categorical and Con mates of the Effect of Family Structure during
tinuous Data." Journal of Business and Economic Adolescence on High School Graduation." Journal
Statistics. of the American Statistical Association, 87(417):
Linton, Oliver, and Pedro Gozalo. 2003. "Conditional 25-37.
Independence Restrictions: Testing and Estima Matzkin, Rosa L. 2003. "Nonparametric Estimation
tion." Unpublished. of Nonadditive Random Functions." Econometrica,
Little, Roderick J. A., and Donald B. Rubin. 1987. 71(5): 1339-75.
Statistical Analysis with Missing Data. New York: McCrary, Justin. 2008. "Manipulation of the Running
Wiley. Variable in the Regression Discontinuity Design:
Ludwig, Jens, and Douglas L. Miller. 2005. "Does A Density Test." Journal of Econometrics, 142(2):
Head Start Improve Children's Life Chances? 698-714.
McEwan, Patrick J., and Joseph S. Shapiro. 2008. "The Quade, D. 1982. "Nonparametric Analysis of Covari
Benefits of Delayed Primary School Enrollment: ance By Matching." Biometrics, 38(3): 597-611.
Discontinuity Estimates Using Exact Birth Dates." Racine, Jeffrey S., and Qi Li. 2004. "Nonparametric
Journal of Human Resources, 43(1): 1-29. Estimation of Regression Functions with Both Cat
Mealli, Fabrizia, Guido W. Imbens, Salvatore Ferro, egorical and Continuous Data." Journal of Econo
and Annibale Biggeri. 2004. "Analyzing a Random metrics, 119(1): 99-130.
ized Trial on Breast Self-Examination with Noncom Riccio, James, and Daniel Friedlander. 1992. GAIN:
pliance and Missing Outcomes." Biostatistics, 5(2): Program Strategies, Participation Patterns, and
207-22. First-Year Impacts in Six Counties. New York:
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin. Manpower Demonstration Research Corporation.
1995. "Workers' Compensation and Injury Duration: Riccio, James, Daniel Friedlander, and Stephen Freed
Evidence from a Natural Experiment." American man. 1994. GAIN: Benefits, Costs, and Three-Year
Economic Review, 85(3): 322-40. Impacts of a Welfare-to-Work Program. New York:
Miguel, Edward, and Michael Kremer. 2004. "Worms: Manpower Demonstration Research Corporation.
Identifying Impacts on Education and Health in the Robins, James M., and Ya'acov Ritov. 1997. "Toward
Presence of Treatment Externalities." Economet a Curse of Dimensionality Appropriate (CODA)
rica, 72(1): 159-217. Asymptotic Theory for Semi-parametric Models."
Morgan, Stephen L., and Christopher Winship. 2007. Statistics in Medicine, 16(3): 285-319.
Counterfactuals and Causal Inference: Methods and Robins, James M., and Andrea Rotnitzky. 1995. "Semi
Principles for Social Research. Cambridge and New parametric Efficiency in Multivariate Regression
York: Cambridge University Press. Models with Missing Data." Journal of the American
Moulton, Brent R. 1990. "An Illustration of a Pitfall Statistical Association, 90(429): 122-29.
in Estimating the Effects of Aggregate Variables on Robins, James M., Andrea Rotnitzky, and Lue Ping
Micro Units." Review of Economics and Statistics, Zhao. 1995. "Analysis of Semiparametric Regression
72(2): 334-38. Models for Repeated Outcomes in the Presence of
Moulton, Brent R., and William C. Randolph. 1989. Missing Data." Journal of the American Statistical
"Alternative Tests of the Error Components Model." Association, 90(429): 106-21.
Econometrica, 57(3): 685-93. Robinson, Peter M. 1988. "Root-N-Consistent Semi
Newey, Whitney K. 1994a. "Kernel Estimation of parametric Regression." Econometrica, 56(4):
Partial Means and a General Variance Estimator." 931-54.
Econometric Theory, 10(2): 233-53. Romano, Joseph P., and Azeem M. Shaikh. 2006a.
Newey, Whitney K. 1994b. "Series Estimation of "Inference for Identifiable Parameters in Partially
Regression Functionals." Econometric Theory, Identified Econometric Models." Stanford University
10(1): 1-28. Department of Statistics Technical Report 2006-9.
Olken, Benjamin A. 2007. "Monitoring Corruption: Romano, Joseph P., and Azeem M. Shaikh. 2006b.
Evidence from a Field Experiment in Indonesia." "Inference for the Identified Set in Partially Identi
Journal of Political Economy, 115(2): 200-249. fied Econometric Models." Unpublished.
Pagan, Adrian, and Aman Ullah. 1999. Nonparamet Rosen, Adam M. 2006. "Confidence Sets for Partially
ric Econometrics. Cambridge; New York and Mel Identified Parameters That Satisfy a Finite Number
bourne: Cambridge University Press. of Moment Inequalities." Institute for Fiscal Studies
Pakes, Ariel, Jack R. Porter, Kate Ho, and Joy Ishii. Centre for Microdata Methods and Practice Work
2006. "Moment Inequalities and Their Application." ing Paper CWP25/06.
Institute for Fiscal Studies Centre for Microdata Rosenbaum, Paul R. 1984a. "Conditional Permutation
Methods and Practice Working Paper CWP16/07. Tests and the Propensity Score in Observational
Pearl, Judea. 2000. Causality: Models, Reasoning, and Studies." Journal of the American Statistical Asso
Inference. Cambridge; New York and Melbourne: ciation, 79(387): 565-74.
Cambridge University Press. Rosenbaum, Paul R. 1984b. "The Consequences of
Pettersson-Lidbom, Per. 2007. "The Policy Conse Adjustment for a Concomitant Variable That Has
quences of Direct versus Representative Democracy: Been Affected By the Treatment." Journal of the
A Regression-Discontinuity Approach." Unpublished. Royal Statistical Society: Series A (Statistics in Soci
Pettersson-Lidbom, Per. 2008. "Does the Size of the ety), 147(5): 656-66.
Legislature Affect the Size of Government? Evidence Rosenbaum, Paul R. 1987. "The Role of a Second Con
from Two Natural Experiments." Unpublished. trol Group in an Observational Study." Statistical
Pettersson-Lidbom, Per, and Björn Tyrefors. 2007. "Do Rosenbaum, Paul R. 1987. "The Role of a Second Con
Parties Matter for Economic Outcomes? A Regres Rosenbaum, Paul R. 1989. "Optimal Matching for
sion-Discontinuity Approach." Unpublished. Observational Studies." Journal of the American
Politis, Dimitris N., Joseph P. Romano, and Michael Statistical Association, 84(408): 1024-32.
Wolf. 1999. Subsampling. New York: Springer- Rosenbaum, Paul R. 1995. Observational Studies. New
Verlag. York; Heidelberg and London: Springer.
Porter, Jack R. 2003. "Estimation in the Regression Rosenbaum, Paul R. 2002. "Covariance Adjustment
Discontinuity Model." Unpublished. in Randomized Experiments and Observational
Studies." Statistical Science, 17(3): 286-327. Linear Propensity Score Methods with Normal Dis
Rosenbaum, Paul R., and Donald B. Rubin. 1983a. tributions." Biometrika, 79(4): 797-809.
"Assessing Sensitivity to an Unobserved Binary Rubin, Donald B., and Neal Thomas. 1996. "Matching
Covariate in an Observational Study with Binary Using Estimated Propensity Scores: Relating Theory
Outcome." Journal of the Royal Statistical Society: to Practice." Biometrics, 52(1): 249-64.
Series B (Statistical Methodology), 45(2): 212-18. Rubin, Donald B., and Neal Thomas. 2000. "Com
Rosenbaum, Paul R., and Donald B. Rubin. 1983b. bining Propensity Score Matching with Additional
"The Central Role of the Propensity Score in Obser Adjustments for Prognostic Covariates." Journal
vational Studies for Causal Effects." Biometrika, of the American Statistical Association, 95(450):
70(1): 41-55. 573-85.
Rosenbaum, Paul R., and Donald B. Rubin. 1984. Sacerdote, Bruce. 2001. "Peer Effects with Random
"Reducing Bias in Observational Studies Using Sub Assignment: Results for Dartmouth Roommates."
classification on the Propensity Score." Journal of the Quarterly Journal of Economics, 116(2): 681-704.
American Statistical Association, 79(387): 516-24. Scharfstein, Daniel O., Andrea Rotnitzky, and James
Rosenbaum, Paul R., and Donald B. Rubin. 1985. M. Robins. 1999. "Adjusting for Nonignorable Drop
"Constructing a Control Group Using Multivariate Out Using Semiparametric Nonresponse Models."
Matched Sampling Methods That Incorporate the Journal of the American Statistical Association,
Propensity Score." American Statistician, 39(1): 94(448): 1096-1120.
33-38. Schultz, T. Paul. 2001. "School Subsidies for the Poor:
Rotnitzky, Andrea, and James M. Robins. 1995. "Semi Evaluating the Mexican Progresa Poverty Program."
parametric Regression Estimation in the Presence of Yale University Economic Growth Center Discus
Dependent Censoring." Biometrika, 82(4): 805-20. sion Paper 834.
Roy, A. D. 1951. "Some Thoughts on the Distribution of Seifert, Burkhardt, and Theo Gasser. 1996. "Finite
Earnings." Oxford Economic Papers, 3(2): 135-46. Sample Variance of Local Polynomials: Analysis and
Rubin, Donald B. 1973a. "Matching to Remove Bias in Solutions." Journal of the American Statistical Asso
Observational Studies." Biometrics, 29(1): 159-83. ciation, 91(433): 267-75.
Rubin, Donald B. 1973b. "The Use of Matched Sam Seifert, Burkhardt, and Theo Gasser. 2000. "Data
pling and Regression Adjustment to Remove Bias in Adaptive Ridging in Local Polynomial Regression."
Observational Studies." Biometrics, 29(1): 184-203. Journal of Computational and Graphical Statistics,
Rubin, Donald B. 1974. "Estimating Causal Effects 9(2): 338-60.
of Treatments in Randomized and Nonrandomized Sekhon, Jasjeet S. Forthcoming. "Multivariate and Pro
Studies." Journal of Educational Psychology, 66(5): pensity Score Matching Software with Automated
688-701. Balance Optimization: The Matching Package for
Rubin, Donald B. 1976. "Inference and Missing Data." R." Journal of Statistical Software.
Biometrika, 63(3): 581-92. Sekhon, Jasjeet S., and Richard Grieve. 2008. "A New
Rubin, Donald B. 1977. "Assignment to Treatment Non-parametric Matching Method for Bias Adjust
Group on the Basis of a Covariate." Journal of Edu ment with Applications to Economic Evaluations."
cational Statistics, 2(1): 1-26. http://sekhon.berkeley.edu/papers/GeneticMatch
Rubin, Donald B. 1978. "Bayesian Inference for Causal ing_SekhonGrieve.pdf.
Effects: The Role of Randomization." Annals of Sta Shadish, William R., Thomas D. Cook, and Donald T.
tistics, 6(1): 34-58. Campbell. 2002. Experimental and Quasi-Exper
Rubin, Donald B. 1979. "Using Multivariate Matched imental Designs for Generalized Causal Inference.
Sampling and Regression Adjustment to Control Bias Boston: Houghton Mifflin.
in Observational Studies." Journal of the American Smith, Jeffrey A., and Petra E. Todd. 2001. "Recon
Statistical Association, 74(366): 318-28. ciling Conflicting Evidence on the Performance of
Rubin, Donald B. 1987. Multiple Imputation for Non Propensity-Score Matching Methods." American
response in Surveys. New York: Wiley. Economic Review, 91(2): 112-18.
Rubin, Donald B. 1990. "Formal Modes of Statistical Smith, Jeffrey A., and Petra E. Todd. 2005. "Does
Inference for Causal Effects." Journal of Statistical Matching Overcome Lalonde's Critique of Nonex
Planning and Inference, 25(3): 279-92. perimental Estimators?" Journal of Econometrics,
Rubin, Donald B. 1997. "Estimating Causal Effects 125(1-2): 305-53.
from Large Data Sets Using Propensity Scores." Splawa-Neyman, Jerzy. 1990. "On the Application of
Annals of Internal Medicine, 127(5 Part 2): 757-63. Probability Theory to Agricultural Experiments.
Rubin, Donald B. 2006. Matched Sampling for Causal Essays on Principles. Section 9." Statistical Science,
Effects. Cambridge and New York: Cambridge Uni 5(4): 465-72. (Orig. pub. 1923.)
versity Press. Stock, James H. 1989. "Nonparametric Policy Analy
Rubin, Donald B., and Neal Thomas. 1992a. "Affinely sis." Journal of the American Statistical Association,
Invariant Matching Methods with Ellipsoidal Distri 84(406): 567-75.
butions." Annals of Statistics, 20(2): 1079-93. Stone, Charles J. 1977. "Consistent Nonparametric
Rubin, Donald B., and Neal Thomas. 1992b. Regression." Annals of Statistics, 5(4): 595-620.
"Characterizing the Effect of Matching Using Stoye, Jörg. 2007. "More on Confidence Intervals for