
Recent Developments in the Econometrics of Program Evaluation

Author(s): Guido W. Imbens and Jeffrey M. Wooldridge


Source: Journal of Economic Literature, Vol. 47, No. 1 (Mar., 2009), pp. 5-86
Published by: American Economic Association

Stable URL: https://www.jstor.org/stable/27647134

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at https://about.jstor.org/terms

The American Economic Association is collaborating with JSTOR to digitize, preserve and extend access to the Journal of Economic Literature.

This content downloaded from 130.56.64.101 on Mon, 01 Aug 2022 12:02:36 UTC. All use subject to https://about.jstor.org/terms.
Journal of Economic Literature 2009, 47:1, 5-86
http://www.aeaweb.org/articles.php?doi=10.1257/jel.47.1.5

Recent Developments in the Econometrics of Program Evaluation

Guido W. Imbens and Jeffrey M. Wooldridge*

Many empirical questions in economics and other social sciences depend on causal effects of programs or policies. In the last two decades, much research has been done on the econometric and statistical analysis of such causal effects. This recent theoretical literature has built on, and combined features of, earlier work in both the statistics and econometrics literatures. It has by now reached a level of maturity that makes it an important tool in many areas of empirical research in economics, including labor economics, public finance, development economics, industrial organization, and other areas of empirical microeconomics. In this review, we discuss some of the recent developments. We focus primarily on practical issues for empirical researchers, as well as provide a historical overview of the area and give references to more technical research.

1. Introduction

Many empirical questions in economics and other social sciences depend on causal effects of programs or policies. In the last two decades, much research has been done on the econometric and statistical analysis of such causal effects. This recent theoretical literature has built on, and combined features of, earlier work in both the statistics and econometrics literatures. It has by now reached a level of maturity that makes it an important tool in many areas of empirical research in economics and suitable for a review. In this article, we attempt to present such a review. We will focus on practical issues for empirical researchers, as well as provide a historical overview of the area and give references to more technical research. This review complements and extends other reviews and discussions, including those by Richard Blundell and Monica Costa Dias (2002), Guido W. Imbens (2004), and Joshua D. Angrist and Alan B. Krueger (1999) and the books by Paul R. Rosenbaum (1995), Judea Pearl (2000), Myoung-Jae Lee (2005a), Donald B. Rubin (2006), Marco Caliendo (2006), Angrist and Jörn-Steffen Pischke (2009), Howard S. Bloom (2005), Stephen L. Morgan and Christopher Winship (2007), Jeffrey M. Wooldridge (2002), and Imbens and Rubin (forthcoming). In addition, the reviews in James J. Heckman, Robert J. LaLonde,

* Imbens: Harvard University and NBER. Wooldridge: Michigan State University. Financial support for this research was generously provided through NSF grants SES 0136789, 0452590 and 08. We are grateful for comments by Esther Duflo, Caroline Hoxby, Roger Gordon, Jonathan Beauchamp, Larry Katz, Eduardo Morales, and two anonymous referees.


and Jeffrey A. Smith (1999), Heckman and Edward Vytlacil (2007a, 2007b), and Jaap H. Abbring and Heckman (2007) provide an excellent overview of the important theoretical work by Heckman and his coauthors in this area.

The central problem studied in this literature is that of evaluating the effect of the exposure of a set of units to a program, or treatment, on some outcome. In economic studies, the units are typically economic agents such as individuals, households, markets, firms, counties, states, or countries but, in other disciplines where evaluation methods are used, the units can be animals, plots of land, or physical objects. The treatments can be job search assistance programs, educational programs, vouchers, laws or regulations, medical drugs, environmental exposure, or technologies. A critical feature is that, in principle, each unit can be exposed to multiple levels of the treatment. Moreover, this literature is focused on settings with observations on units exposed, and not exposed, to the treatment, with the evaluation based on comparisons of units exposed and not exposed.1 For example, an individual may enroll or not in a training program, or he or she may receive or not receive a voucher, or be subject to a particular regulation or not. The object of interest is a comparison of the two outcomes for the same unit when exposed, and when not exposed, to the treatment. The problem is that we can at most observe one of these outcomes because the unit can be exposed to only one level of the treatment. Paul W. Holland (1986) refers to this as the fundamental problem of causal inference. In order to evaluate the effect of the treatment, we therefore always need to compare distinct units receiving the different levels of the treatment. Such a comparison can involve different physical units or the same physical unit at different times.

The problem of evaluating the effect of a binary treatment or program is a well studied problem with a long history in both econometrics and statistics. This is true both in the theoretical literature as well as in the more applied literature. The econometric literature goes back to early work by Orley Ashenfelter (1978) and subsequent work by Ashenfelter and David Card (1985), Heckman and Richard Robb (1985), LaLonde (1986), Thomas Fraker and Rebecca Maynard (1987), Card and Daniel G. Sullivan (1988), and Charles F. Manski (1990). Motivated primarily by applications to the evaluation of labor market programs in observational settings, the focus in the econometric literature is traditionally on endogeneity, or self-selection, issues. Individuals who choose to enroll in a training program are by definition different from those who choose not to enroll. These differences, if they influence the response, may invalidate causal comparisons of outcomes by treatment status, possibly even after adjusting for observed covariates. Consequently, many of the initial theoretical studies focused on the use of traditional econometric methods for dealing with endogeneity, such as fixed effect methods from panel data analyses, and instrumental variables methods. Subsequently, the econometrics literature has combined insights from the semiparametric literature to develop new estimators for a variety of settings, requiring fewer functional form and homogeneity assumptions.

The statistics literature starts from a different perspective. This literature originates in the analysis of randomized experiments by Ronald A. Fisher (1935) and Jerzy Splawa-Neyman (1990). From the early 1970s, Rubin (1973a, 1973b, 1974, 1977, 1978), in a series of papers, formulated the now dominant approach to the analysis of causal effects in observational studies. Rubin proposed the

1 As opposed to studies where the causal effect of fundamentally new programs is predicted through direct identification of preferences and production functions.


interpretation of causal statements as comparisons of so-called potential outcomes: pairs of outcomes defined for the same unit given different levels of exposure to the treatment, with the researcher only observing the potential outcome corresponding to the level of the treatment received. Models are developed for the pair of potential outcomes rather than solely for the observed outcome. Rubin's formulation of the evaluation problem, or the problem of causal inference, labeled the Rubin Causal Model (RCM) by Holland (1986), is by now standard in both the statistics and econometrics literature. One of the attractions of the potential outcomes setup is that from the outset it allows for general heterogeneity in the effects of the treatment. Such heterogeneity is important in practice, and it is important theoretically as it is often the motivation for the endogeneity problems that concern economists. One additional advantage of the potential outcomes setup is that the parameters of interest can be defined, and the assumptions stated, without reference to particular statistical models.

Of particular importance in Rubin's approach is the relationship between treatment assignment and the potential outcomes. The simplest case for analysis is when assignment to treatment is randomized and, thus, independent of covariates as well as the potential outcomes. In such classical randomized experiments, it is straightforward to obtain estimators for the average effect of the treatment with attractive properties under repeated sampling, e.g., the difference in means by treatment status. Randomized experiments have been used in some areas in economics. In the 1970s, negative income tax experiments received widespread attention. In the late 1980s, following an influential paper by LaLonde (1986) that concluded econometric methods were unable to replicate experimental results, more emphasis was put on experimental evaluations of labor market programs, although more recently this emphasis seems to have weakened a bit. In the last couple of years, some of the most interesting experiments have been conducted in development economics (e.g., Edward Miguel and Michael Kremer 2004; Esther Duflo 2001; Angrist, Eric Bettinger, and Kremer 2006; Abhijit V. Banerjee et al. 2007) and behavioral economics (e.g., Marianne Bertrand and Sendhil Mullainathan 2004). Nevertheless, experimental evaluations remain relatively rare in economics. More common is the case where economists analyze data from observational studies. Observational data generally create challenges in estimating causal effects but, in one important special case, variously referred to as unconfoundedness, exogeneity, ignorability, or selection on observables, questions regarding identification and estimation of the policy effects are fairly well understood. All these labels refer to some form of the assumption that adjusting treatment and control groups for differences in observed covariates, or pretreatment variables, removes all biases in comparisons between treated and control units. This case is of great practical relevance, with many studies relying on some form of this assumption. The semiparametric efficiency bound has been calculated for this case (Jinyong Hahn 1998) and various semiparametric estimators have been proposed (Hahn 1998; Heckman, Hidehiko Ichimura, and Petra E. Todd 1998; Keisuke Hirano, Imbens, and Geert Ridder 2003; Xiaohong Chen, Han Hong, and Alessandro Tarozzi 2008; Imbens, Whitney K. Newey, and Ridder 2005; Alberto Abadie and Imbens 2006). We discuss the current state of this literature, and the practical recommendations coming out of it, in detail in this review.

Without unconfoundedness, there is no general approach to estimating treatment effects. Various methods have been proposed for special cases and, in this review, we


will discuss several of them. One approach (Rosenbaum and Rubin 1983b; Rosenbaum 1995) consists of sensitivity analyses, where robustness of estimates to specific limited departures from unconfoundedness is investigated. A second approach, developed by Manski (1990, 2003, 2007), consists of bounds analyses, where ranges of estimands consistent with the data and the limited assumptions the researcher is willing to make are derived and estimated. A third approach, instrumental variables, relies on the presence of additional treatments, the so-called instruments, that satisfy specific exogeneity and exclusion restrictions. The formulation of this method in the context of the potential outcomes framework is presented in Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996). A fourth approach applies to settings where, in its pure form, overlap is completely absent because the assignment is a deterministic function of covariates, but comparisons can be made exploiting continuity of average outcomes as a function of covariates. This setting, known as the regression discontinuity design, has a long tradition in statistics (see William R. Shadish, Thomas D. Cook, and Donald T. Campbell 2002 and Cook 2008 for historical perspectives), but has recently been revived in the economics literature through work by Wilbert van der Klaauw (2002), Hahn, Todd, and van der Klaauw (2001), David S. Lee (2001), and Jack R. Porter (2003). Finally, a fifth approach, referred to as difference-in-differences, relies on the presence of additional data in the form of samples of treated and control units before and after the treatment. An early application is Ashenfelter and Card (1985). Recent theoretical work includes Abadie (2005), Bertrand, Duflo, and Mullainathan (2004), Stephen G. Donald and Kevin Lang (2007), and Susan Athey and Imbens (2006).

In this review, we will discuss in detail some of the new methods that have been developed in this literature. We will pay particular attention to the practical issues raised by the implementation of these methods. At this stage, the literature has matured to the extent that it has much to offer the empirical researcher. Although the evaluation problem is one where identification problems are important, there is currently a much better understanding of which assumptions are most useful, as well as a better set of methods for inference given different sets of assumptions.

Most of this review will be limited to settings with binary treatments. This is in keeping with the literature, which has largely focused on the binary treatment case. There are some extensions of these methods to multivalued, and even continuous, treatments (e.g., Imbens 2000; Michael Lechner 2001; Lechner and Ruth Miquel 2005; Richard D. Gill and James M. Robins 2001; Hirano and Imbens 2004), and some of these extensions will be discussed in the current review. But the work in this area is ongoing, and much remains to be done here.

The running example we will use throughout the paper is that of a job market training program. Such programs have been among the leading applications in the economics literature, starting with Ashenfelter (1978) and including LaLonde (1986) as a particularly influential study. In such settings, a number of individuals do, or do not, enroll in a training program, with labor market outcomes, such as yearly earnings or employment status, as the main outcome of interest. An individual not participating in the program may have chosen not to do so, or may have been ineligible for various reasons. Understanding the choices made, and constraints faced, by the potential participants is a crucial component of any analysis. In addition to observing participation status and outcome measures, we typically observe individual background characteristics, such as education levels and age, as well as information regarding prior labor market histories, such as earnings at various


levels of aggregation (e.g., yearly, quarterly, or monthly). In addition, we may observe some of the constraints faced by the individuals, including measures used to determine eligibility, as well as measures of general labor market conditions in the local labor markets faced by potential participants.

2. The Rubin Causal Model: Potential Outcomes, the Assignment Mechanism, and Interactions

In this section, we describe the essential elements of the modern approach to program evaluation, based on the work by Rubin. Suppose we wish to analyze a job training program using observations on N individuals, indexed by i = 1, ..., N. Some of these individuals were enrolled in the training program. Others were not enrolled, either because they were ineligible or chose not to enroll. We use the indicator W_i to indicate whether individual i enrolled in the training program, with W_i = 0 if individual i did not, and W_i = 1 if individual i did, enroll in the program. We use W to denote the N-vector with i-th element equal to W_i, and N_0 and N_1 to denote the number of control and treated units, respectively. For each unit, we also observe a K-dimensional column vector of covariates or pretreatment variables, X_i, with X denoting the N × K matrix with i-th row equal to X_i'.

2.1 Potential Outcomes

The first element of the RCM is the notion of potential outcomes. For individual i, for i = 1, ..., N, we postulate the existence of two potential outcomes, denoted by Y_i(0) and Y_i(1). The first, Y_i(0), denotes the outcome that would be realized by individual i if he or she did not participate in the program. Similarly, Y_i(1) denotes the outcome that would be realized by individual i if he or she did participate in the program. Individual i can either participate or not participate in the program, but not both, and thus only one of these two potential outcomes can be realized. Prior to the assignment being determined, both are potentially observable, hence the label potential outcomes. If individual i participates in the program, Y_i(1) will be realized and Y_i(0) will ex post be a counterfactual outcome. If, on the other hand, individual i does not participate in the program, Y_i(0) will be realized and Y_i(1) will be the ex post counterfactual. We will denote the realized outcome by Y_i, with Y the N-vector with i-th element equal to Y_i. The preceding discussion implies that

Y_i = Y_i(W_i) = Y_i(0) · (1 − W_i) + Y_i(1) · W_i = { Y_i(0) if W_i = 0; Y_i(1) if W_i = 1 }.

The potential outcomes are tied to the specific manipulation that would have made one of them the realized outcome. The more precise the specification of the manipulation, the more well-defined the potential outcomes are.

This distinction between the pair of potential outcomes (Y_i(0), Y_i(1)) and the realized outcome Y_i is the hallmark of modern statistical and econometric analyses of treatment effects. We offer some comments on it. The potential outcomes framework has important precursors in a variety of other settings. Most directly, in the context of randomized experiments, the potential outcome framework was introduced by Splawa-Neyman (1990) to derive the properties of estimators and confidence intervals under repeated sampling.

The potential outcomes framework also has important antecedents in econometrics. Specifically, it is interesting to compare the distinction between potential outcomes Y_i(0) and Y_i(1) and the realized outcome Y_i in Rubin's approach to Trygve Haavelmo's (1943) work on simultaneous equations models (SEMs). Haavelmo discusses identification of


supply and demand models. He makes a distinction between "any imaginable price π" as the argument in the demand and supply functions, q^d(π) and q^s(π), and the "actual price p," which is the observed equilibrium price satisfying q^d(p) = q^s(p). The supply and demand functions play the same role as the potential outcomes in Rubin's approach, with the equilibrium price similar to the realized outcome. Curiously, Haavelmo's notational distinction between equilibrium and potential prices has gotten blurred in many textbook discussions of simultaneous equations. In such discussions, the starting point is often the general formulation YΓ + XB = U for N × M vectors of realized outcomes Y, N × L matrices of exogenous covariates X, and an N × M matrix of unobserved components U. A nontrivial byproduct of the potential outcomes approach is that it forces users of SEMs to articulate what the potential outcomes are, thereby leading to better applications of SEMs. A related point is made in Pearl (2000).

Another area where potential outcomes are used explicitly is in the econometric analyses of production functions. Similar to the potential outcomes framework, a production function g(x, e) describes production levels that would be achieved for each value of a vector of inputs, some observed (x) and some unobserved (e). Observed inputs may be chosen partly as a function of (expected) values of unobserved inputs. Only for the level of inputs actually chosen do we observe the level of the output. Potential outcomes are also used explicitly in labor market settings by A. D. Roy (1951). Roy models individuals choosing from a set of occupations. Individuals know what their earnings would be in each of these occupations and choose the occupation (treatment) that maximizes their earnings. Here we see the explicit use of the potential outcomes, combined with a specific selection/assignment mechanism, namely, choosing the treatment with the highest potential outcome.

The potential outcomes framework has a number of advantages over a framework based directly on realized outcomes. The first advantage of the potential outcomes framework is that it allows us to define causal effects before specifying the assignment mechanism, and without making functional form or distributional assumptions. The most common definition of the causal effect at the unit level is as the difference Y_i(1) − Y_i(0), but we may wish to look at ratios Y_i(1)/Y_i(0), or other functions. Such definitions do not require us to take a stand on whether the effect is constant or varies across the population. Further, defining individual-specific treatment effects using potential outcomes does not require us to assume endogeneity or exogeneity of the assignment mechanism. By contrast, the causal effects are more difficult to define in terms of the realized outcomes. Often, researchers write down a regression function Y_i = α + τ · W_i + ε_i. This regression function is then interpreted as a structural equation, with τ as the causal effect. Left unclear is whether the causal effect is constant or not, and what the properties of the unobserved component, ε_i, are. The potential outcomes approach separates these issues, and allows the researcher to first define the causal effect of interest without considering probabilistic properties of the outcomes or assignment.

The second advantage of the potential outcomes approach is that it links the analysis of causal effects to explicit manipulations. Considering the two potential outcomes forces the researcher to think about scenarios under which each outcome could be observed, that is, to consider the kinds of experiments that could reveal the causal effects. Doing so clarifies the interpretation of causal effects. For illustration, consider a couple of recent examples from the economics literature. First, consider the causal effects of gender or ethnicity on outcomes of job applications. Simple comparisons of


economic outcomes by ethnicity are difficult to interpret. Are they the result of discrimination by employers, or are they the result of differences between applicants, possibly arising from discrimination at an earlier stage of life? Now, one can obtain unambiguous causal interpretations by linking comparisons to specific manipulations. A recent example is the study by Bertrand and Mullainathan (2004), who compare callback rates for job applications submitted with names that suggest African-American or Caucasian ethnicity. Their study has a clear manipulation, a name change, and therefore a clear causal effect. As a second example, consider some recent economic studies that have focused on causal effects of individual characteristics such as beauty (e.g., Daniel S. Hamermesh and Jeff E. Biddle 1994) or height. Do the differences in earnings by ratings on a beauty scale represent causal effects? One possible interpretation is that they represent causal effects of plastic surgery. Such a manipulation would make differences causal, but it appears unclear whether cross-sectional correlations between beauty and earnings in a survey from the general population represent causal effects of plastic surgery.

A third advantage of the potential outcomes approach is that it separates the modeling of the potential outcomes from that of the assignment mechanism. Modeling the realized outcome is complicated by the fact that it combines the potential outcomes and the assignment mechanism. The researcher may have very different sources of information to bear on each. For example, in the labor market program example we can consider the outcome, say earnings, in the absence of the program: Y_i(0). We can model this in terms of individual characteristics and labor market histories. Similarly, we can model the outcome given enrollment in the program, again conditional on individual characteristics and labor market histories. Then finally we can model the probability of enrolling in the program given the earnings in both treatment arms conditional on individual characteristics. This sequential modeling will lead to a model for the realized outcome, but it may be easier than directly specifying a model for the realized outcome.

A fourth advantage of the potential outcomes approach is that it allows us to formulate probabilistic assumptions in terms of potentially observable variables, rather than in terms of unobserved components. In this approach, many of the critical assumptions will be formulated as (conditional) independence assumptions involving the potential outcomes. Assessing their validity requires the researcher to consider the dependence structure if all potential outcomes were observed. By contrast, models in terms of realized outcomes often formulate the critical assumptions in terms of errors in regression functions. To be specific, consider again the regression function Y_i = α + τ · W_i + ε_i. Typically (conditional independence) assumptions are made on the relationship between ε_i and W_i. Such assumptions implicitly bundle a number of assumptions, including functional form assumptions and substantive exogeneity assumptions. This bundling makes the plausibility of these assumptions more difficult to assess.

A fifth advantage of the potential outcomes approach is that it clarifies where the uncertainty in the estimators comes from. Even if we observe the entire (finite) population (as is increasingly common with the growing availability of administrative data sets), we can estimate population averages without uncertainty, but causal effects will be uncertain because for each unit at most one of the two potential outcomes is observed. One may still use superpopulation arguments to justify approximations to the finite sample distributions, but such arguments are not required to motivate the existence of uncertainty about the causal effect.
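To make the potential-outcomes bookkeeping of section 2.1 concrete, the sketch below simulates a small population of pairs (Y_i(0), Y_i(1)) for a hypothetical training program and applies the identity Y_i = Y_i(0)(1 − W_i) + Y_i(1)W_i. All numerical values are illustrative assumptions, not estimates from any actual program; the point is only that the unit-level effects are defined before any assignment is specified, while the realized data reveal exactly one potential outcome per unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative potential outcomes for a hypothetical training program:
# y0 is earnings without training, y1 is earnings with training.
y0 = rng.normal(20.0, 5.0, size=n)
y1 = y0 + rng.normal(2.0, 1.0, size=n)   # heterogeneous unit-level effects

# Unit-level causal effects, defined without reference to any assignment.
tau = y1 - y0

# An assignment vector W determines which potential outcome is realized:
w = rng.integers(0, 2, size=n)
y = y0 * (1 - w) + y1 * w                # Y_i = Y_i(0)(1 - W_i) + Y_i(1) W_i

# The fundamental problem of causal inference: each realized outcome equals
# one of the two potential outcomes, so tau is never observed directly.
assert np.all((y == y0) | (y == y1))
```

Note that `tau` exists in the simulation only because both potential outcomes were generated; with real data, only `y` and `w` would be available, which is precisely what makes the assignment mechanism central.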


2.2 The Assignment Mechanism

The second ingredient of the RCM is the assignment mechanism. This is defined as the conditional probability of receiving the treatment, as a function of potential outcomes and observed covariates. We distinguish three classes of assignment mechanisms, in order of increasing complexity of the required analysis.

The first class of assignment mechanisms is that of randomized experiments. In randomized experiments, the probability of assignment to treatment does not vary with potential outcomes, and is a known function of covariates. The leading case is that of a completely randomized experiment where, in a population of N units, N_1 < N randomly chosen units are assigned to the treatment and the remaining N_0 = N − N_1 units are in the control group. There are important variations on this example, such as pairwise randomization, where initially units are matched in pairs, and in a second stage one unit in each pair is randomly assigned to the treatment. Another variant is a general stratified experiment, where randomization takes place within a finite number of strata. In any case, there are in practice few experiments in economics, and most of those are of the completely randomized experiment variety, so we shall limit our discussion to this type of experiment. It should be noted though that if one has the opportunity to design a randomized experiment, and if pretreatment variables are available, stratified experiments are at least as good as completely randomized experiments, and typically better, in terms of expected mean squared error, even in finite samples. See Imbens et al. (2008) for more details. The use of formal randomization has become more widespread in the social sciences in recent years, sometimes as a formal design for an evaluation and sometimes as an acceptable way of allocating scarce resources. The analysis of such experiments is often straightforward. In practice, however, researchers have typically limited themselves to simple mean differences by assignment. Such analyses are valid, but often they are not the most powerful tools available to exploit the randomization. We discuss the analysis of randomized experiments, including more powerful randomization-based methods for inference, in section 4.

The second class of assignment mechanisms maintains the restriction that the assignment probabilities do not depend on the potential outcomes, or

W_i ⊥ (Y_i(0), Y_i(1)) | X_i,

where A ⊥ B | C denotes conditional independence of A and B given C. However, in contrast to randomized experiments, the assignment probabilities are no longer assumed to be a known function of the covariates. The precise form of this critical assumption, not tied to functional form or distributional assumptions, was first presented in Rosenbaum and Rubin (1983b). Following Rubin (1990) we refer to this assignment mechanism as unconfounded assignment. Somewhat confusingly, this assumption, or variations on it, is also referred to in the literature by various other labels. These include selection on observables,2 exogeneity,3 and conditional

2 Although Heckman, Ichimura, and Todd (1997, page 611) write that "In the language of Heckman and Robb (1985), matching assumes that selection is on observables" (their italics), the original definition in Heckman and Robb (1985, page 163) is not equivalent to unconfoundedness. In the context of a single cross-section version of their two-equation selection model, Y_i = X_i'β + W_i α + ε_i and W_i = 1{Z_i'γ + v_i > 0}, they define selection bias to refer to the case where E[ε_i W_i] ≠ 0, and selection on observables to the case where selection bias is present and caused by correlation between ε_i and Z_i'γ, rather than by correlation between ε_i and v_i.

3 Although X_i is not exogenous for E[Y_i(1) − Y_i(0)], according to the definitions in Robert F. Engle, David F. Hendry, and Jean-François Richard (1983), because knowledge of its marginal distribution contains information about E[Y_i(1) − Y_i(0)], standard usage of the term "exogenous" does appear to capture the notion of unconfoundedness, e.g., Manski et al. (1992), and Imbens (2004).

This content downloaded from


130.56.64.101 on Mon, 01 Aug 2022 12:02:36 UTC
All use subject to https://about.jstor.org/terms
Imbens and Wooldridge: Econometrics of Program Evaluation 13

independence.4 Although the analysis of data with such assignment mechanisms is not as straightforward as that of randomized experiments, there are now many practical methods available for this case. We review them in section 5.

The third class of assignment mechanisms contains all remaining assignment mechanisms with some dependence on potential outcomes.5 Many of these create substantive problems for the analysis, for which there is no general solution. There are a number of special cases that are by now relatively well understood, and we discuss these in section 6. The most prominent of these cases are instrumental variables, regression discontinuity, and differences-in-differences. In addition, we discuss two general methods that also relax the unconfoundedness assumption but do not replace it with additional assumptions. The first relaxes the unconfoundedness assumption in a limited way and investigates the sensitivity of the estimates to such violations. The second drops the unconfoundedness assumption entirely and establishes bounds on estimands of interest. The latter is associated with the work by Manski (1990, 1995, 2007).

2.3 Interactions and General Equilibrium Effects

In most of the literature, it is assumed that treatments received by one unit do not affect outcomes for another unit. Only the level of the treatment applied to the specific individual is assumed to potentially affect outcomes for that particular individual. In the statistics literature, this assumption is referred to as the Stable-Unit-Treatment-Value-Assumption (Rubin 1978). In this paper, we mainly focus on settings where this assumption is maintained. In the current section, we discuss some of the literature motivated by concerns about this assumption.

This lack-of-interaction assumption is very plausible in many biomedical applications. Whether one individual receives or does not receive a new treatment for a stroke is unlikely to have a substantial impact on health outcomes for any other individual. However, there are also many cases in which such interactions are a major concern and the assumption is not plausible. Even in the early experimental literature, with applications to the effect of various fertilizers on crop yields, researchers were cognizant of potential problems with this assumption. In order to minimize leaking of fertilizer applied to one plot into an adjacent plot, experimenters used guard rows to physically separate the plots that were assigned different fertilizers. A different concern arises in epidemiological applications when the focus is on treatments such as vaccines for contagious diseases. In that case, it is clear that the vaccination of one unit can affect the outcomes of others in their proximity, and such effects are a large part of the focus of the evaluation.

In economic applications, interactions between individuals are also a serious concern. It is clear that a labor market program that affects the labor market outcomes for one individual potentially has an effect on the labor market outcomes for others. In a world with a fixed number of jobs, a training program could only redistribute the jobs, and ignoring this constraint on the number of jobs by using a partial, instead of a general, equilibrium analysis could lead one to

4 E.g., Lechner 2001; A. Colin Cameron and Pravin K. Trivedi 2005.

5 This includes some mechanisms where the dependence on potential outcomes does not create any problems in the analyses. Most prominent in this category are sequential assignment mechanisms. For example, one could randomly assign the first ten units to the treatment or control group with probability 1/2. From then on one could skew the assignment probability to the treatment with the most favorable outcomes so far. For example, if the active treatment looks better than the control treatment based on the first N units, then the (N + 1)th unit is assigned to the active treatment with probability 0.8 and vice versa. Such assignment mechanisms are not very common in economics settings, and we ignore them in this discussion.

14 Journal of Economic Literature, Vol. XLVII (March 2009)

erroneously conclude that extending the program to the entire population would raise aggregate employment. Such concerns have rarely been addressed in the recent program evaluation literature. Exceptions include Heckman, Lance Lochner, and Christopher Taber (1999), who provide some simulation evidence for the potential biases that may result from ignoring these issues.

In practice these general equilibrium effects may, or may not, be a serious problem. The indirect effect on one individual of exposure to the treatment of a few other units is likely to be much smaller than the direct effect of the exposure of the first unit itself. Hence, with most labor market programs both small in scope and with limited effects on the individual outcomes, it appears unlikely that general equilibrium effects are substantial and they can probably be ignored for most purposes.

One general solution to these problems is to redefine the unit of interest. If the interactions between individuals are at an intermediate level, say a local labor market, or a classroom, rather than global, one can analyze the data using the local labor market or classroom as the unit and changing the no-interaction assumption to require the absence of interactions among local labor markets or classrooms. Such aggregation is likely to make the no-interaction assumption more plausible, albeit at the expense of reduced precision.

An alternative solution is to directly model the interactions. This involves specifying which individuals interact with each other, and possibly relative magnitudes of these interactions. In some cases it may be plausible to assume that interactions are limited to individuals within well-defined, possibly overlapping groups, with the intensity of the interactions equal within this group. This would be the case in a world with a fixed number of jobs in a local labor market. Alternatively, it may be that interactions occur in broader groups but decline in importance depending on some distance metric, either geographical distance or proximity in some economic metric.

The most interesting literature in this area views the interactions not as a nuisance but as the primary object of interest. This literature, which includes models of social interactions and peer effects, has been growing rapidly in the last decade, following the early work by Manski (1993). See Manski (2000a) and William Brock and Steven N. Durlauf (2000) for recent surveys. Empirical work includes Jeffrey R. Kling, Jeffrey B. Liebman, and Katz (2007), who look at the effect of households moving to neighborhoods with higher average socioeconomic status; Bruce I. Sacerdote (2001), who studies the effect of college roommate behavior on a student's grades; Edward L. Glaeser, Sacerdote, and Jose A. Scheinkman (1996), who study social interactions in criminal behavior; Anne C. Case and Lawrence F. Katz (1991), who look at neighborhood effects on disadvantaged youths; Bryan S. Graham (2008), who infers interactions from the effect of class size on the variation in grades; and Angrist and Lang (2004), who study the effect of desegregation programs on students' grades. Many identification and inferential questions remain unanswered in this literature.

3. What Are We Interested In? Estimands and Hypotheses

In this section, we discuss some of the questions that researchers have asked in this literature. A key feature of the current literature, and one that makes it more important to be precise about the questions of interest, is the accommodation of general heterogeneity in treatment effects. In contrast, in many early studies it was assumed that the effect of a treatment was constant, implying that the effect of various policies could be captured by a single parameter. The essentially unlimited heterogeneity in the effects of the


treatment allowed for in the current literature implies that it is generally not possible to capture the effects of all policies of interest in terms of a few summary statistics. In practice researchers have reported estimates of the effects of a few focal policies. In this section we describe some of these estimands. Most of these estimands are average treatment effects, either for the entire population or for some subpopulation, although some correspond to other features of the joint distribution of potential outcomes.

Most of the empirical literature has focused on estimation. Much less attention has been devoted to testing hypotheses regarding the properties or presence of treatment effects. Here we discuss null and alternative hypotheses that may be of interest in settings with heterogeneous effects. Finally, we discuss some of the recent literature on decision-theoretic approaches to program evaluation that ties estimands more closely to optimal policies.

3.1 Average Treatment Effects

The econometric literature has largely focused on average effects of the treatment. The two most prominent average effects are defined over an underlying population. In cases where the entire population can be sampled, population treatment effects rely on the notion of a superpopulation, where the current population that is available is viewed as just one of many possibilities. In either case, the sample of size N is viewed as a random sample from a large (super-)population, and interest is in the average effect in the superpopulation.6 The most popular treatment effect is the Population Average Treatment Effect (PATE), the population expectation of the unit-level causal effect,

τ_PATE = E[Y_i(1) − Y_i(0)].

If the policy under consideration would expose all units to the treatment or none at all, this is the most relevant quantity. Another popular estimand is the Population Average Treatment effect on the Treated (PATT), the average over the subpopulation of treated units:

τ_PATT = E[Y_i(1) − Y_i(0) | W_i = 1].

In many observational studies, τ_PATT is a more interesting estimand than the overall average effect. As an example, consider the case where a well defined population was exposed to a treatment, say a job training program. There may be various possibilities for a comparison group, including subjects drawn from public use data sets. In that case, it is generally not interesting to consider the effect of the program for the comparison group: for many members of the comparison group (e.g., individuals with stable, high-wage jobs) it is difficult and uninteresting to imagine their being enrolled in the labor market program. (Of course, the problem of averaging across units that are unlikely to receive future treatments can be mitigated by more carefully constructing the comparison group to be more like the treatment group, making τ_PATE a more meaningful parameter. See the discussion below.) A second case where τ_PATT is the estimand of most interest is in the setting of a voluntary program where those not enrolled will never be required to participate in the program. A specific example is the effect of serving in the military, where an interesting question concerns the foregone earnings for those who served (Angrist 1998).

In practice, there is typically little motivation presented for the focus on the overall

6 For simplicity, we restrict ourselves to random sampling. Some data sets are obtained by stratified sampling. Most of the estimators we consider can be adjusted for stratified sampling. See, for example, Wooldridge (1999, 2007) on inverse probability weighting of averages and objective functions.


average effect or the average effect for the treated. Take a job training program. The overall average effect would be the parameter of interest if the policy under consideration is a mandatory exposure to the treatment versus complete elimination. It is rare that these are the alternatives, with more typically exemptions granted to various subpopulations. Similarly the average effect for the treated would be informative about the effect of entirely eliminating the current program. More plausible regime changes would correspond to a modest extension of the program to other jurisdictions, or a contraction to a more narrow population.

A somewhat subtle issue is that we may wish to separate the extrapolation from the sample to the superpopulation from the problem of inference for the sample at hand. This suggests that, rather than focusing on PATE or PATT, we might first focus on the average causal effect conditional on the covariates in the sample,

τ_CATE = (1/N) Σ_{i=1}^{N} E[Y_i(1) − Y_i(0) | X_i],

and, similarly, the average over the subsample of treated units:

τ_CATT = (1/N_1) Σ_{i: W_i = 1} E[Y_i(1) − Y_i(0) | X_i].

If the effect of the treatment or intervention is constant (Y_i(1) − Y_i(0) = τ for some constant τ), all four estimands, τ_PATE, τ_PATT, τ_CATE, and τ_CATT, are obviously identical. However, if there is heterogeneity in the effect of the treatment, the estimands may all be different. The difference between τ_PATE and τ_CATE (and between τ_PATT and τ_CATT) is relatively subtle. Most estimators that are attractive for the population treatment effect are also attractive for the corresponding conditional average treatment effect, and vice versa. Therefore, we do not have to be particularly concerned with the distinction between the two estimands at the estimation stage. However, there is an important difference between the population and conditional estimands at the inference stage. If there is heterogeneity in the effect of the treatment, we can estimate the sample average treatment effect τ_CATE more precisely than the population average treatment effect τ_PATE. When one estimates the variance of an estimator τ̂ (which can serve as an estimate for τ_PATE or τ_CATE), one therefore needs to be explicit about whether one is interested in the variance relative to the population or to the conditional average treatment effect. We will return to this issue in section 5.

A more general class of estimands includes average causal effects for subpopulations and weighted average causal effects. Let A be a subset of the covariate space X, and let τ_CATE,A denote the conditional average causal effect for the subpopulation with X_i ∈ A:

τ_CATE,A = (1/N_A) Σ_{i: X_i ∈ A} E[Y_i(1) − Y_i(0) | X_i],

where N_A is the number of units with X_i ∈ A. Richard K. Crump et al. (2009) argue for considering such estimands. Their argument is not based on the intrinsic interest of these subpopulations. Rather, they show that such estimands may be much easier to estimate than τ_CATE (or τ_CATT). Instead of solely reporting an imprecisely estimated average effect for the overall population, they suggest it may be informative to also report a precise estimate for the average effect of some subpopulation. They then propose a particular set A for which the average effect is most easily estimable. See section 5.10.2 for more details. The Crump et al. estimates would not necessarily have as much external validity as estimates for the overall population, but they may be much more informative for the sample at hand. In any case, in many instances the larger policy questions concern


extensions of the interventions or treatments to other populations, so that external validity may be elusive irrespective of the estimand.

In settings with selection on unobservables the enumeration of the estimands of interest becomes more complicated. A leading case is instrumental variables. In the presence of heterogeneity in the effect of the treatment one can typically not identify the average effect of the treatment even in the presence of valid instruments. There are two new approaches in the recent literature. One is to focus on bounds for well-defined estimands such as the average effect τ_PATE or τ_CATE. Manski (1990, 2003) developed this approach in a series of papers. An alternative is to focus on estimands that can be identified under weaker conditions than those required for the average treatment effect. Imbens and Angrist (1994) show that one can, under much weaker conditions than required for identification of τ_PATE, identify the average effect for the subpopulation of units whose treatment status is affected by the instrument. They refer to this subpopulation as the compliers. This does not directly fit into the classification above since the subpopulation is not defined solely in terms of covariates. We discuss this estimand in more detail in section 6.3.

3.2 Quantile and Distributional Treatment Effects and Other Estimands

An alternative class of estimands consists of quantile treatment effects. These have only recently been studied and applied in the economics literature, although they were introduced in the statistics literature in the 1970s. Kjell Doksum (1974) and Erich L. Lehmann (1974) define

(1) τ_q = F^{-1}_{Y(1)}(q) − F^{-1}_{Y(0)}(q),

as the q-th quantile treatment effect. There are some important issues in interpreting these quantile treatment effects. First, note that these quantile effects are defined as differences between quantiles of the two marginal potential outcome distributions, and not as quantiles of the unit level effect,

(2) τ'_q = F^{-1}_{Y(1)−Y(0)}(q).

In general, the quantile of the difference, τ'_q, differs from the difference in the quantiles, τ_q, unless there is perfect rank correlation between the potential outcomes Y_i(0) and Y_i(1) (the leading case of this is the constant additive treatment effect). The quantile treatment effects, τ_q, have received much more attention, and in our view rightly so, than the quantiles of the treatment effect, τ'_q. There are two issues regarding the choice between a focus on the difference in quantiles versus quantiles of the difference. The first issue is substantive. Suppose a policy maker is faced with the choice of assigning all members of a subpopulation, homogenous in covariates X_i, to the treatment group, or assigning all of them to the control group. The resulting outcome distribution is either f_{Y(0)}(y) or f_{Y(1)}(y), assuming the subpopulation is large. Hence the choice should be governed by preferences of the policymaker over these distributions (which can often be summarized by differences in the quantiles), and not depend on aspects of the joint distribution f_{Y(0),Y(1)}(y,z) that do not affect the two marginal distributions. (See Heckman and Smith 1997 for a somewhat different view.) The second issue is statistical. In general the τ'_q are not (point-)identified without assumptions on the rank correlation between the potential outcomes, even with data from a randomized experiment. In a randomized experiment, one can identify f_{Y(0)}(y) and f_{Y(1)}(y) (and any functional thereof) but not the joint distribution f_{Y(0),Y(1)}(y,z). Note that this issue does not arise if we look at average effects because the mean of the difference is equal to the difference of the means: E[Y_i(1) − Y_i(0)] = E[Y_i(1)] − E[Y_i(0)].
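The distinction between the population average effect and the average effect for the treated discussed in section 3.1 can be made concrete with a small simulation. The sketch below is our own invented illustration (the data-generating process and all numbers are hypothetical, not from the paper): unit-level effects depend on a covariate, and treatment is more likely for high-covariate units, so the treated subpopulation has an above-average effect.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # size of the simulated (super)population

# Hypothetical data-generating process with heterogeneous effects:
# the unit-level causal effect Y_i(1) - Y_i(0) equals 1 + X_i.
X = rng.normal(size=N)
Y0 = X + rng.normal(size=N)
Y1 = Y0 + 1.0 + X

# Selection on the covariate: high-X units are more likely to be
# treated, so treated units have above-average effects.
W = rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X))

tau_pate = (Y1 - Y0).mean()     # E[Y_i(1) - Y_i(0)], close to 1
tau_patt = (Y1 - Y0)[W].mean()  # E[Y_i(1) - Y_i(0) | W_i = 1], larger

print(round(tau_pate, 2), round(tau_patt, 2))
```

Note that the sketch computes both estimands from both potential outcomes, which are never jointly observed in practice; the point is only that, under heterogeneity and selection, the two estimands differ.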
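The gap between the difference in quantiles in equation (1) and the quantile of the difference in equation (2) is also easy to see by simulation. In the sketch below (again an invented illustration), the two marginal distributions differ only by a location shift of one, so τ_q is one at every q, while imperfect rank correlation makes the quantile of the unit-level difference much larger in the upper tail.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Potential outcomes: marginals are N(0,1) and N(1,1), but the rank
# correlation is imperfect (correlation 0.5 between Y(0) and Y(1)).
Y0 = rng.normal(size=n)
Y1 = 1.0 + 0.5 * Y0 + np.sqrt(0.75) * rng.normal(size=n)

q = 0.9
tau_q = np.quantile(Y1, q) - np.quantile(Y0, q)  # difference in quantiles, near 1
tau_q_prime = np.quantile(Y1 - Y0, q)            # quantile of the difference, near 2.3

print(round(tau_q, 2), round(tau_q_prime, 2))
```

With perfect rank correlation, Y_i(1) = Y_i(0) + 1, the two quantities would coincide; and as the text notes, only the marginal distributions, and hence τ_q, are identified from a randomized experiment.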


A complication facing researchers interested in quantile treatment effects is that the difference in a marginal quantile, τ_q, is in general not equal to the average difference in the conditional quantiles, where the latter are defined as

τ_q(x) = F^{-1}_{Y(1)|X}(q|x) − F^{-1}_{Y(0)|X}(q|x).

In other words, even if we succeed in estimating τ_q(x), we cannot simply average τ_q(X_i) across i to consistently estimate τ_q. Marianne Bitler, Jonah Gelbach, and Hilary Hoynes (2006) estimate quantile treatment effects in a randomized evaluation of a job training program. Sergio Firpo (2007) develops methods for estimating τ_q in observational studies given unconfoundedness. Abadie, Angrist, and Imbens (2002) and Victor Chernozhukov and Christian B. Hansen (2005) study quantile treatment effects in instrumental variables settings.

3.3 Testing

The literature on hypothesis testing in program evaluation is relatively limited. Most of the testing in applied work has focused on the null hypothesis that the average effect of interest is zero. Because many of the commonly used estimators for average treatment effects are asymptotically normally distributed with zero asymptotic bias, it follows that standard confidence intervals (the point estimate plus or minus a constant times the standard error) can be used for testing such hypotheses. However, there are other interesting hypotheses to consider.

One question of interest is whether there is any effect of the program, that is whether the distribution of Y_i(1) differs from that of Y_i(0). This is equivalent to the hypothesis that not just the mean, but all moments, are identical in the two treatment groups. Abadie (2002) studies such tests in settings with randomized experiments as well as settings with instrumental variables using Kolmogorov-Smirnov type testing procedures.

A second set of questions concerns treatment effect heterogeneity. Even if the average effect is zero, it may be important to establish whether a targeted implementation of the intervention, with only those who can expect to benefit from the intervention assigned to it, could improve average outcomes. In addition, in cases where there is not sufficient information to obtain precise inferences for the average causal effect τ_PATE, it may still be possible to establish whether there are any subpopulations with an average effect positive or different from zero, or whether there are subpopulations with an average effect exceeding some threshold. It may also be interesting to test whether there is any evidence of heterogeneity in the treatment effect by observable characteristics. This bears heavily on the question whether the estimands are useful for extrapolation to other populations which may differ in terms of some observable characteristics. Crump et al. (2008) study these questions in settings with unconfounded treatment assignment.

3.4 Decision-Theoretic Questions

Recently, a small but innovative literature has started to move away from the focus on summary statistics of the distribution of treatment effects or potential outcomes to directly address policies of interest. This is very much a literature in progress. Manski (2000b, 2001, 2002, 2004), Rajeev H. Dehejia (2005b), and Hirano and Porter (2008) study the problem faced by program administrators who can assign individuals to the active treatment or to the control group. These administrators have available two pieces of information. First, covariate information for these individuals, and second, information about the efficacy of the treatment based on a finite sample of other individuals for whom both outcome and covariate information is available. The administrator may care about


the entire distribution of outcomes, or solely about average outcomes, and may also take into account costs associated with participation. If the administrator knew exactly the conditional distribution of the potential outcomes given the covariate information this would be a simple problem: the administrator would simply compare the expected welfare for different rules and choose the one with the highest value. However, the administrator does not have this knowledge and needs to make a decision given uncertainty about these distributions. In these settings, it is clearly important that the statistical model allows for heterogeneity in the treatment effects.

Graham, Imbens, and Ridder (2006) extend the type of problems studied in this literature by incorporating resource constraints. They focus on problems that include as a special case the problem of allocating a fixed number of slots in a program to a set of individuals on the basis of observable characteristics of these individuals given a random sample of individuals for whom outcome and covariate information is available.

4. Randomized Experiments

Experimental evaluations have traditionally been rare in economics. In many cases ethical considerations, as well as the reluctance of administrators to deny services to randomly selected individuals after they have been deemed eligible, have made it difficult to get approval for, and implement, randomized evaluations. Nevertheless, the few experiments that have been conducted, including some of the labor market training programs, have generally been influential, sometimes extremely so. More recently, many exciting and thought-provoking experiments have been conducted in development economics, raising new issues of design and analysis (see Duflo, Rachel Glennerster, and Kremer 2008 for a review).

With experimental data the statistical analysis is generally straightforward. Differencing average outcomes by treatment status or, equivalently, regressing the outcome on an intercept and an indicator for the treatment, leads to an unbiased estimator for the average effect of the treatment. Adding covariates to the regression function typically improves precision without jeopardizing consistency because the randomization implies that in large samples the treatment indicator and the covariates are independent. In practice, researchers have rarely gone beyond basic regression methods. In principle, however, there are additional methods that can be useful in these settings. In section 4.2, we review one important experimental technique, randomization-based inference, including Fisher's method for calculating exact p-values, that deserves wider usage in social sciences. See Rosenbaum (1995) for a textbook discussion.

4.1 Randomized Experiments in Economics

Randomized experiments have a long tradition in biostatistics. In this literature they are often viewed as the only credible approach to establishing causality. For example, the United States Food and Drug Administration typically requires evidence from randomized experiments in order to approve new drugs and medical procedures. A first comment concerns the fact that even randomized experiments rely to some extent on substantive knowledge. It is only once the researcher is willing to limit interactions between units that randomization can establish causal effects. In settings with potentially unrestricted interactions between units, randomization by itself cannot solve the identification problems required for establishing causality. In biomedical settings, where such interaction effects are often arguably absent, randomized experiments are therefore particularly attractive. Moreover, in biomedical settings it is often possible to


keep the units ignorant of their treatment status, further enhancing the interpretation of the estimated effects as causal effects of the treatment, and thus improving the external validity.

In the economics literature randomization has played a much less prominent role. At various times social experiments have been conducted, but they have rarely been viewed as the sole method for establishing causality, and in fact they have sometimes been regarded with some suspicion concerning the relevance of the results for policy purposes (e.g., Heckman and Smith 1995; see Gary Burtless 1995 for a more positive view of experiments in social sciences). Part of this may be due to the fact that for the treatments of interest to economists, e.g., education and labor market programs, it is generally impossible to do blind or double-blind experiments, creating the possibility of placebo effects that compromise the internal validity of the estimates. Nevertheless, this suspicion often downplays the fact that many of the concerns that have been raised in the context of randomized experiments, including those related to missing data, and external validity, are often equally present in observational studies.

Among the early social experiments in economics were the negative income tax experiments in Seattle and Denver in the early 1970s, formally referred to as the Seattle and Denver Income Maintenance Experiments (SIME and DIME). In the 1980s, a number of papers called into question the reliability of econometric and statistical methods for estimating causal effects in observational studies. In particular, LaLonde (1986) and Fraker and Maynard (1987), using data from the National Supported Work (NSW) programs, suggested that widely used econometric methods were unable to replicate the results from experimental evaluations. These influential conclusions encouraged government agencies to insist on the inclusion of experimental evaluation components in job training programs. Examples of such programs include the Greater Avenues to INdependence (GAIN) programs (e.g., James Riccio and Daniel Friedlander 1992), the WIN programs (e.g., Judith M. Gueron and Edward Pauly 1991; Friedlander and Gueron 1992; Friedlander and Philip K. Robins 1995), the Self-Sufficiency Project in Canada (Card and Dean R. Hyslop 2005, and Card and Robins 1996), and the Statistical Assistance for Programme Selection in Switzerland (Stefanie Behncke, Markus Frölich, and Lechner 2006). Like the NSW evaluation, these experiments have been useful not merely in establishing the effects of particular programs but also in providing fertile testing grounds for new statistical evaluation methods.

Recently there has been a large number of exciting and innovative experiments, mainly in development economics but also in other areas, including public finance (Duflo and Emmanuel Saez 2003; Duflo et al. 2006; Raj Chetty, Adam Looney, and Kory Kroft forthcoming). The experiments in development economics include many educational experiments (e.g., T. Paul Schultz 2001; Orazio Attanasio, Costas Meghir, and Ana Santiago 2005; Duflo and Rema Hanna 2005; Banerjee et al. 2007; Duflo 2001; Miguel and Kremer 2004). Others study topics as wide-ranging as corruption (Benjamin A. Olken 2007; Claudio Ferraz and Frederico Finan 2008) or gender issues in politics (Raghabendra Chattopadhyay and Duflo 2004). In a number of these experiments, economists have been involved from the beginning in the design of the evaluations, leading to closer connections between the substantive economic questions and the design of the experiments, thus improving the ability of these studies to lead to conclusive answers to interesting questions. These experiments have also led to renewed interest in questions of optimal design. Some of these issues are discussed in Duflo, Glennerster, and Kremer (2008), Miriam
This content downloaded from


130.56.64.101 on Mon, 01 Aug 2022 12:02:36 UTC
All use subject to https://about.jstor.org/terms
Imbens and Wooldridge: Econometrics of Program Evaluation 21

Bruhn and David McKenzie (2008), and Imbens et al. (2008).

4.2 Randomization-Based Inference and Fisher's Exact P-Values

Fisher (1935) was interested in calculating p-values for hypotheses regarding the effect of treatments. The aim is to provide exact inferences for a finite population of size N. This finite population may be a random sample from a large superpopulation, but that is not exploited in the analysis. The inference is nonparametric in that it does not make functional form assumptions regarding the effects; it is exact in that it does not rely on large sample approximations. In other words, the p-values coming out of this analysis are exact and valid irrespective of the sample size.

The most common null hypothesis in Fisher's framework is that of no effect of the treatment for any unit in this population, against the alternative that, at least for some units, there is a non-zero effect:

H_0 : Y_i(0) = Y_i(1), for all i = 1, ..., N,

against H_a : there exists i such that Y_i(0) ≠ Y_i(1).

It is not important that the null hypothesis is that the effects are all zero. What is essential is that the null hypothesis is sharp, that is, the null hypothesis specifies the value of all unobserved potential outcomes for each unit. A more general null hypothesis could be that Y_i(0) = Y_i(1) + c for some prespecified c, or that Y_i(0) = Y_i(1) + c_i for some set of prespecified c_i. Importantly, this framework cannot accommodate null hypotheses such as the average effect of the treatment is zero, against the alternative hypothesis of a nonzero average effect, or

H'_0 : (1/N) Σ_{i=1}^N (Y_i(1) − Y_i(0)) = 0,

against H'_a : (1/N) Σ_{i=1}^N (Y_i(1) − Y_i(0)) ≠ 0.

Whether the null of no effect for any unit versus the null of no effect on average is more interesting was the subject of a testy exchange between Fisher (who focused on the first) and Neyman (who thought the latter was the interesting hypothesis, and who stated that the first was only of academic interest) in Splawa-Neyman (1990). Putting the argument about its ultimate relevance aside, Fisher's test is a powerful tool for establishing whether a treatment has any effect. It is not essential in this framework that the probabilities of assignment to the treatment group are equal for all units. It is crucial, however, that the probability of any particular assignment vector is known. These probabilities may differ by unit provided the probabilities are known.

The implication of Fisher's framework is that, under the null hypothesis, we know the exact value of all the missing potential outcomes. Thus there are no nuisance parameters under the null hypothesis. As a result, we can deduce the distribution of any statistic, that is, any function of the realized values of (Y_i, W_i)_{i=1}^N, generated by the randomization. For example, suppose the statistic is the average difference between treated and control outcomes, T(W, Y) = Ȳ_1 − Ȳ_0, where Ȳ_w = Σ_{i:W_i=w} Y_i / N_w, for w = 0, 1. Now suppose we had assigned a different set of units to the treatment. Denote the vector of alternative treatment assignments by W̃. Under the null hypothesis we know all the potential outcomes and thus we can deduce what the value of the statistic would have been under that alternative assignment, namely T(W̃, Y). We can infer the value of the statistic for all possible values of the assignment vector W̃, and since we know the distribution of W̃ we can deduce the distribution of T(W̃, Y). The distribution generated by the randomization of the treatment assignment is referred to as the randomization distribution. The p-value of the statistic is then calculated as the probability of a value for the statistic that is at

22 Journal of Economic Literature, Vol. XLVII (March 2009)

least as large, in absolute value, as that of the observed statistic, T(W, Y).

In moderately large samples, it is typically not feasible to calculate the exact p-values for these tests. In that case, one can approximate the p-value by basing it on a large number of draws from the randomization distribution. Here the approximation error is of a very different nature than that in typical large sample approximations: it is controlled by the researcher, and if more precision is desired one can simply increase the number of draws from the randomization distribution.

In the form described above, with the statistic equal to the difference in averages by treatment status, the results are typically not that different from those using Wald tests based on large sample normal approximations to the sampling distribution of the difference in means Ȳ_1 − Ȳ_0, as long as the sample size is moderately large. The Fisher approach to calculating p-values is much more interesting with other choices for the statistic. For example, as advocated by Rosenbaum in a series of papers (Rosenbaum 1984a, 1995), a generally attractive choice is the difference in average ranks by treatment status. First the outcome is converted into ranks (typically with, in case of ties, all possible rank orderings averaged), and then the test is applied using the average difference in ranks by treatment status as the statistic. The test is still exact, with its exact distribution under the null hypothesis known as the Wilcoxon distribution. Naturally, the test based on ranks is less sensitive to outliers than the test based on the difference in means.

If the focus is on establishing whether the treatment has some effect on the outcomes, rather than on estimating the average size of the effect, such rank tests are much more likely to provide informative conclusions than standard Wald tests based on differences in averages by treatment status. To illustrate this point, we took data from eight randomized evaluations of labor market programs. Four of the programs are from the WIN demonstration programs. The four evaluations took place in Arkansas, Baltimore, San Diego, and Virginia. See Gueron and Pauly (1991), Friedlander and Gueron (1992), David Greenberg and Michael Wiseman (1992), and Friedlander and Robins (1995) for more detailed discussions of each of these evaluations. The second set of four programs are from the GAIN programs in California. The four locations are Alameda, Los Angeles, Riverside, and San Diego. See Riccio and Friedlander (1992), Riccio, Friedlander, and Freedman (1994), and Dehejia (2003) for more details on these programs and their evaluations. In each location, we take as the outcome total earnings for the first (GAIN) or second (WIN) year following the program, and we focus on the subsample of individuals who had positive earnings at some point prior to the program. We calculate three p-values for each location. The first p-value is based on the normal approximation to the t-statistic calculated as the difference in average outcomes for treated and control individuals divided by the estimated standard error. The second p-value is based on randomization inference using the difference in average outcomes by treatment status. And the third p-value is based on the randomization distribution using the difference in average ranks by treatment status as the statistic. The results are in table 1.

In all eight cases, the p-values based on the t-test are very similar to those based on randomization inference. This outcome is not surprising given the reasonably large sample sizes, ranging from 71 (Arkansas, WIN) to 4,779 (San Diego, GAIN). However, in a number of cases, the p-value for the rank test is fairly different from that based on the level difference. In both sets of four locations there is one location where the rank test suggests a clear rejection at the


TABLE 1
P-VALUES FOR FISHER EXACT TESTS: RANKS VERSUS LEVELS

                               Sample size                  p-values
Program   Location        Controls   Treated     t-test   FET (levels)   FET (ranks)
GAIN      Alameda              601       597      0.835      0.836          0.890
GAIN      Los Angeles         1400      2995      0.544      0.531          0.561
GAIN      Riverside           1040      4405      0.000      0.000          0.000
GAIN      San Diego           1154      6978      0.057      0.068          0.018
WIN       Arkansas              37        34      0.750      0.753          0.805
WIN       Baltimore            260       222      0.339      0.339          0.286
WIN       San Diego            257       264      0.136      0.137          0.024
WIN       Virginia             154       331      0.960      0.957          0.249

5 percent level whereas the level-based test would suggest that the null hypothesis of no effect should not be rejected at the 5 percent level. In the WIN (San Diego) evaluation, the p-value goes from 0.137 (levels) to 0.024 (ranks), and in the GAIN (San Diego) evaluation, the p-value goes from 0.068 (levels) to 0.018 (ranks). It is not surprising that the tests give different results. Earnings data are very skewed. A large proportion of the populations participating in these programs have zero earnings even after conditioning on positive past earnings, and the earnings distribution for those with positive earnings is skewed. In those cases, a rank-based test is likely to have more power against alternatives that shift the distribution toward higher earnings than tests based on the difference in means.

As a general matter it would be useful in randomized experiments to include such results for rank-based p-values, as a generally applicable way of establishing whether the treatment has any effect. As with all omnibus tests, one should use caution in interpreting a rejection, as the test can pick up interesting changes in the distribution (such as a mean or median effect) but also less interesting changes (such as higher moments about the mean).
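As a concrete sketch of the machinery just described, the following Python fragment computes exact Fisher p-values by enumerating the full randomization distribution, both for the difference in means and for the difference in average ranks. The data are hypothetical, earnings-like numbers, not the GAIN or WIN samples; for a sample this small the enumeration is exact rather than approximated by draws.

```python
from itertools import combinations

def avg_ranks(y):
    """Convert outcomes to ranks 1..N; ties get the average of their positions."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0.0] * len(y)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y[order[j + 1]] == y[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def fisher_pvalue(z, w):
    """Exact p-value for the sharp null Y_i(0) = Y_i(1) for all i, using the
    statistic T = Zbar_1 - Zbar_0 and enumerating all C(N, N_1) assignments."""
    n, n1 = len(z), sum(w)
    total = sum(z)

    def diff(treated):
        s1 = sum(z[i] for i in treated)
        return s1 / n1 - (total - s1) / (n - n1)

    t_obs = abs(diff([i for i, wi in enumerate(w) if wi]))
    draws = [abs(diff(tr)) for tr in combinations(range(n), n1)]
    return sum(t >= t_obs - 1e-12 for t in draws) / len(draws)

# Hypothetical, earnings-like outcomes: many zeros and a few large values.
y = [0, 0, 0, 0, 3000, 0, 0, 50, 0, 400, 9000, 200]
w = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p_levels = fisher_pvalue(y, w)
p_ranks = fisher_pvalue(avg_ranks(y), w)
```

Because the outcomes are heavily skewed, the rank statistic is far less dominated by the few large earnings values than the difference in means, which is the pattern behind the level-versus-rank contrasts in table 1.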


5. Estimation and Inference under Unconfoundedness

Methods for estimation of average treatment effects under unconfoundedness are the most widely used in this literature. The central paper in this literature, which introduces the key assumptions, is Rosenbaum and Rubin (1983b), although the literature goes further back (e.g., William G. Cochran 1968; Cochran and Rubin 1973; Rubin 1977). Often the unconfoundedness assumption, which requires that conditional on observed covariates there are no unobserved factors that are associated both with the assignment and with the potential outcomes, is controversial. Nevertheless, in practice, where often data have been collected in order to make this assumption more plausible, there are many cases where there is no clearly superior alternative, and the only alternative is to abandon the attempt to get precise inferences. In this section, we discuss some of these methods and the issues related to them. A general theme of this literature is that the concern is more with biases than with efficiency.

Among the many recent economic applications relying on assumptions of this type are Blundell et al. (2001), Angrist (1998), Card and Hyslop (2005), Card and Brian P. McCall (1996), V. Joseph Hotz, Imbens, and Jacob A. Klerman (2006), Card and Phillip B. Levine (1994), Card, Carlos Dobkin, and Nicole Maestas (2004), Hotz, Imbens, and Julie H. Mortimer (2005), Lechner (2002a), Abadie and Javier Gardeazabal (2003), and Bloom (2005).

This setting is closely related to that underlying standard multiple regression analysis with a rich set of controls. See, for example, Burt S. Barnow, Glen G. Cain, and Arthur S. Goldberger (1980). Unconfoundedness implies that we have a sufficiently rich set of predictors for the treatment indicator, contained in the vector of covariates X_i, such that adjusting for differences in these covariates leads to valid estimates of causal effects. Combined with linearity assumptions on the conditional expectations of the potential outcomes given covariates, the unconfoundedness assumption justifies linear regression. But in the last fifteen years the literature has moved away from the earlier emphasis on regression methods. The main reason is that, although locally linearity of the regression functions may be a reasonable approximation, in many cases the estimated average treatment effects based on regression methods can be severely biased if the linear approximation is not accurate globally. To assess the potential problems with (global) regression methods, it is useful to report summary statistics of the covariates by treatment status. In particular, one may wish to report, for each covariate, the difference in averages by treatment status, scaled by the square root of the sum of the variances, as a scale-free measure of the difference in distributions. To be specific, one may wish to report the normalized difference

(3) Δ_X = (X̄_1 − X̄_0) / √(S_0² + S_1²),

where for w = 0, 1, S_w² = Σ_{i:W_i=w} (X_i − X̄_w)² / (N_w − 1) is the sample variance of X_i in the subsample with treatment W_i = w. Imbens and Rubin (forthcoming) suggest as a rule of thumb that with a normalized difference exceeding one quarter, linear regression methods tend to be sensitive to the specification. Note the difference with the often reported t-statistic for the null hypothesis of equal means,

(4) T = (X̄_1 − X̄_0) / √(S_0²/N_0 + S_1²/N_1).

The reason for focusing on the normalized difference, (3), rather than on the t-statistic, (4), as a measure of the degree of difficulty in the statistical problem of adjusting for differences in covariates comes from their relation to the sample size. Clearly, simply increasing the sample size does not make the problem of inference for the average treatment effect inherently more difficult. However, quadrupling the sample size leads, in expectation, to a doubling of the t-statistic. In contrast, increasing the sample size does not systematically affect the normalized difference. In the landmark LaLonde (1986) paper the normalized difference in means exceeds unity for many of the covariates, immediately showing that standard regression methods are unlikely to lead to credible results for those data, even if one views unconfoundedness as a reasonable assumption.
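A small simulation makes the scale behavior of (3) and (4) concrete. The covariate draws below are synthetic and purely illustrative, not from any of the studies cited; "quadrupling the sample" is mimicked by literal duplication.

```python
import math
import random

def normalized_difference(x0, x1):
    """The scale-free imbalance measure in (3)."""
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    s0 = sum((v - m0) ** 2 for v in x0) / (len(x0) - 1)
    s1 = sum((v - m1) ** 2 for v in x1) / (len(x1) - 1)
    return (m1 - m0) / math.sqrt(s0 + s1)

def t_statistic(x0, x1):
    """The two-sample t-statistic in (4)."""
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    s0 = sum((v - m0) ** 2 for v in x0) / (len(x0) - 1)
    s1 = sum((v - m1) ** 2 for v in x1) / (len(x1) - 1)
    return (m1 - m0) / math.sqrt(s0 / len(x0) + s1 / len(x1))

rng = random.Random(1)
x0 = [rng.gauss(0.0, 1.0) for _ in range(200)]   # controls
x1 = [rng.gauss(0.3, 1.0) for _ in range(200)]   # treated, shifted covariate

# Quadrupling the sample barely moves the normalized difference,
# but roughly doubles the t-statistic.
nd_small, t_small = normalized_difference(x0, x1), t_statistic(x0, x1)
nd_large, t_large = normalized_difference(x0 * 4, x1 * 4), t_statistic(x0 * 4, x1 * 4)
```

With equal group sizes n per arm the two measures are linked exactly by T = Δ_X · √n, which is why one diverges with the sample size while the other does not.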


As a result of the concerns with the sensitivity of results based on linear regression methods to seemingly minor changes in specification, the literature has moved to more sophisticated methods for adjusting for differences in covariates. Some of these more sophisticated methods use the propensity score—the conditional probability of receiving the treatment—in various ways. Others rely on pairwise matching of treated units to control units, using values of the covariates to match. Although these estimators appear at first sight to be quite different, many (including nonparametric versions of the regression estimators) in fact achieve the semiparametric efficiency bound; thus, they would tend to be similar in large samples. Choices among them typically rely on small sample arguments, which are rarely formalized, and which do not uniformly favor one estimator over another. Most estimators currently in use can be written as the difference of a weighted average of the treated and control outcomes, with the weights in both groups adding up to one:

τ̂ = Σ_{i=1}^N λ_i Y_i, with Σ_{i:W_i=1} λ_i = 1, and Σ_{i:W_i=0} λ_i = −1.

The estimators differ in the way the weights λ_i depend on the full vector of assignments and the matrix of covariates (including those of other units). For example, some estimators implicitly allow the weights to be negative for the treated units and positive for control units, whereas others do not. In addition, some depend on essentially all other units whereas others depend only on units with similar covariate values. Nevertheless, despite the commonalities of the estimators and large sample equivalence results, in practice the performance of the estimators can be quite different, particularly in terms of robustness and bias. Little is known about finite sample properties. The few simulation studies include Zhong Zhao (2004), Frölich (2004a), and Matias Busso, John DiNardo, and Justin McCrary (2008). On a more positive note, some understanding has been reached regarding the sensitivity of specific estimators to particular configurations of the data, such as limited overlap in covariate distributions. Currently, the best practice is to combine linear regression with either propensity score or matching methods in ways that explicitly rely on local, rather than global, linear approximations to the regression functions.

An ongoing discussion concerns the role of the propensity score, e(x) = pr(W_i = 1 | X_i = x), introduced by Rosenbaum and Rubin (1983b), and indeed whether there is any role for this concept. See for recent contributions to this discussion Hahn (1998), Imbens (2004), Angrist and Hahn (2004), Peter C. Austin (2008a, 2008b), Dehejia (2005a), Smith and Todd (2001, 2005), Heckman, Ichimura, and Todd (1998), Frölich (2004a, 2004b), B. B. Hansen (2008), Jennifer Hill (2008), Robins and Ya'acov Ritov (1997), Rubin (1997, 2006), and Elizabeth A. Stuart (2008).

In this section, we first discuss the key assumptions underlying an analysis based on unconfoundedness. We then review some of the efficiency bound results for average treatment effects. Next, in sections 5.3 to 5.5, we briefly review the basic methods relying on regression, propensity score methods, and matching. Although still fairly widely used, we do not recommend these methods in practice. In sections 5.6 to 5.8, we discuss three of the combination methods that we view as more attractive and recommend in practice. We discuss estimating variances in section 5.9. Next we discuss implications of lack of overlap in the covariate distributions. In particular, we discuss two general methods for constructing samples with improved covariate balance, both relying heavily on the propensity score. In section 5.11, we describe methods that can be used to assess the plausibility of the unconfoundedness assumption, even though this assumption is not directly testable. We discuss methods for testing for the presence of average treatment effects and for the presence of treatment effect heterogeneity under unconfoundedness in section 5.12.
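As a minimal check of the weighting representation given earlier in this section, note that the simplest member of the class—the raw difference in means—corresponds to weights λ_i = 1/N_1 on treated units and λ_i = −1/N_0 on controls. The numbers below are a toy example, not any particular estimator from the literature.

```python
# Toy data; any outcome vector and assignment would do.
y = [2.0, 3.0, 5.0, 4.0, 7.0, 6.0]
w = [0, 0, 0, 1, 1, 1]
n1 = sum(w)
n0 = len(w) - n1

# Weights: 1/N_1 on treated units, -1/N_0 on controls.
lam = [1 / n1 if wi else -1 / n0 for wi in w]
tau_weighted = sum(li * yi for li, yi in zip(lam, y))

# The same number computed directly as a difference in means.
tau_diff = (sum(yi for yi, wi in zip(y, w) if wi) / n1
            - sum(yi for yi, wi in zip(y, w) if not wi) / n0)
```

The weights sum to one over the treated units and to minus one over the controls, as required; more sophisticated estimators differ only in how λ_i depends on the assignments and covariates.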


5.1 Identification

The key assumption is unconfoundedness, introduced by Rosenbaum and Rubin (1983b):

Assumption 1 (Unconfoundedness)

W_i ⊥ (Y_i(0), Y_i(1)) | X_i.

The unconfoundedness assumption is often controversial, as it assumes that beyond the observed covariates X_i there are no (unobserved) characteristics of the individual associated both with the potential outcomes and the treatment.⁷ Nevertheless, this kind of assumption is used routinely in multiple regression analysis. In fact, suppose we assume that the treatment effect, τ, is constant, so that, for each random draw i, τ = Y_i(1) − Y_i(0). Further, assume that Y_i(0) = α + β′X_i + ε_i, where ε_i = Y_i(0) − E[Y_i(0) | X_i] is the residual capturing the unobservables affecting the response in the absence of treatment. Then, with the observed outcome defined as Y_i = (1 − W_i)·Y_i(0) + W_i·Y_i(1), we can write

Y_i = α + τ·W_i + β′X_i + ε_i,

and unconfoundedness is equivalent to independence of ε_i and W_i, conditional on X_i.

Imbens (2004) discusses some economic models that imply unconfoundedness. These models assume agents choose to participate in a program if the benefits, equal to the difference in potential outcomes, exceed the costs associated with participation. It is important here that there is a distinction between the objective of the participant (net benefits), and the outcome that is the focus of the researcher (gross benefits). (See Athey and Scott Stern 1998 for some discussion.) Unconfoundedness is implied by independence of the costs and benefits, conditional on observed covariates.

The second assumption used to identify treatment effects is that for all possible values of the covariates, there are both treated and control units.

Assumption 2 (Overlap)

0 < pr(W_i = 1 | X_i = x) < 1, for all x.

We call this the overlap assumption as it implies that the support of the conditional distribution of X_i given W_i = 0 overlaps completely with that of the conditional distribution of X_i given W_i = 1.

With a random sample (W_i, X_i)_{i=1}^N we can estimate the propensity score e(x) = pr(W_i = 1 | X_i = x), and this can provide some guidance for determining whether the overlap assumption holds. Of course, common parametric models, such as probit and logit, ensure that all estimated probabilities are strictly between zero and one, and so examining the fitted probabilities from such models can be misleading. We discuss approaches for improving overlap in section 5.10.

The combination of unconfoundedness and overlap was referred to by Rosenbaum and Rubin (1983b) as strong ignorability. There are various ways to establish identification of various average treatment effects under strong ignorability.

⁷ Unconfoundedness generally fails if the covariates themselves are affected by treatment. Wooldridge (2005) provides a simple example where treatment is randomized with respect to the counterfactual outcomes but not with respect to the covariates. Unconfoundedness is easily shown to fail.
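For a discrete covariate, the overlap assumption can be inspected directly, sidestepping the probit/logit caveat above: estimate e(x) by the treated fraction within each covariate cell and flag cells at the boundaries. A minimal sketch with made-up data:

```python
from collections import defaultdict

def cell_propensity_scores(x, w):
    """Estimate e(x) = pr(W = 1 | X = x) cell by cell for a discrete covariate."""
    counts = defaultdict(lambda: [0, 0])      # cell -> [n_control, n_treated]
    for xi, wi in zip(x, w):
        counts[xi][wi] += 1
    return {cell: n1 / (n0 + n1) for cell, (n0, n1) in counts.items()}

# Hypothetical data: cell "c" contains no control units at all.
x = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"]
w = [0, 1, 1, 0, 0, 1, 1, 1, 1, 1]
ehat = cell_propensity_scores(x, w)
no_overlap = sorted(cell for cell, e in ehat.items() if e in (0.0, 1.0))
```

With a continuous covariate one would instead fit a flexible model for e(x), keeping in mind that fitted probabilities from probit or logit are mechanically interior to (0, 1).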


Perhaps the easiest is to note that τ(x) = E[Y_i(1) − Y_i(0) | X_i = x] is identified for x in the support of the covariates:

(5) τ(x) = E[Y_i(1) | X_i = x] − E[Y_i(0) | X_i = x]
         = E[Y_i(1) | W_i = 1, X_i = x] − E[Y_i(0) | W_i = 0, X_i = x]
         = E[Y_i | W_i = 1, X_i = x] − E[Y_i | W_i = 0, X_i = x],

where the second equality follows by unconfoundedness: E[Y_i(w) | W_i = w, X_i] does not depend on w. By the overlap assumption, we can estimate both terms in the last line, and therefore we can identify τ(x). Given that we can identify τ(x) for all x, we can identify the expected value across the population distribution of the covariates,

(6) τ_PATE = E[τ(X_i)],

as well as τ_PATT and other estimands.

5.2 Efficiency Bounds

Before discussing specific estimation methods, it is useful to see what we can learn about the parameters of interest, given just the strong ignorability of treatment assignment assumption, without functional form or distributional assumptions. In order to do so, we need some additional notation. Let σ₀²(x) = V(Y_i(0) | X_i = x) and σ₁²(x) = V(Y_i(1) | X_i = x) denote the conditional variances of the potential outcomes given the covariates. Hahn (1998) derives the lower bound for the asymptotic variance of √N-consistent estimators of τ_PATE as

(7) V_PATE = E[ σ₁²(X_i)/e(X_i) + σ₀²(X_i)/(1 − e(X_i)) + (τ(X_i) − τ)² ],

where p = E[e(X_i)] is the unconditional treatment probability. Interestingly, this lower bound holds irrespective of whether the propensity score is known or not. The form of this variance bound is informative. It is no surprise that τ_PATE is more difficult to estimate the larger are the variances σ₀²(x) and σ₁²(x). However, as shown by the presence of the third term, it is also more difficult to estimate τ_PATE the more variation there is in the average treatment effect conditional on the covariates. If we focus instead on estimating τ_CATE, the conditional average treatment effect, the third term drops out, and the variance bound for τ_CATE is

(8) V_CATE = E[ σ₁²(X_i)/e(X_i) + σ₀²(X_i)/(1 − e(X_i)) ].

Still, the role of heterogeneity in the treatment effect is potentially important. Suppose we actually had prior knowledge that the average treatment effect conditional on the covariates is constant, or τ(x) = τ_PATE for all x. Given this assumption, the model is closely related to the partial linear model (Peter M. Robinson 1988; James H. Stock 1989). Given this prior knowledge, the variance bound is

(9) V_const = ( E[ ( σ₁²(X_i)/e(X_i) + σ₀²(X_i)/(1 − e(X_i)) )⁻¹ ] )⁻¹.

This variance bound can be much lower than (8) if there is variation in the propensity score. Knowledge of lack of variation in the treatment effect can be very valuable and, conversely, allowing for general heterogeneity in the treatment effect can be expensive in terms of precision.

In addition to the conditional variances of the counterfactual outcomes, a third important determinant of the efficiency bound is the propensity score. Because it enters (7) in the denominator, the presence of units with the propensity score close to zero or one will make it difficult to obtain precise estimates of the average effect of the treatment. One approach to address this problem, developed by Crump et al. (2009) and discussed in more detail in section 5.10, is to drop observations with the propensity score close to zero and one, and focus on the average effect of the treatment in the subpopulation with propensity scores away from zero.


Suppose we focus on τ_CATE,A, the average of τ(X_i) for X_i ∈ A. Then the variance bound is

(10) V_A = (1 / pr(X_i ∈ A)) · E[ σ₁²(X_i)/e(X_i) + σ₀²(X_i)/(1 − e(X_i)) | X_i ∈ A ].

By excluding from the set A subsets of the covariate space where the propensity score is close to zero or one, we may be able to estimate τ_CATE,A more precisely than τ_CATE. (If we are instead interested in τ_CATT, we only need to worry about covariate values where e(x) is close to one.)

Having displayed these lower bounds on variances for the average treatment effects, a natural question is: Are there estimators that achieve these lower bounds that do not require parametric models or functional form restrictions on either the conditional means or the propensity score? The answer in general is yes, and we now consider different classes of estimators in turn.

5.3 Regression Methods

To describe the general approach to regression methods for estimating average treatment effects, define μ₀(x) and μ₁(x) to be the two regression functions for the potential outcomes:

μ₀(x) = E[Y_i(0) | X_i = x] and μ₁(x) = E[Y_i(1) | X_i = x].

By definition, the average treatment effect conditional on X = x is τ(x) = μ₁(x) − μ₀(x). As we discussed in the identification subsection, under the unconfoundedness assumption, μ₀(x) = E[Y_i | W_i = 0, X_i = x] and μ₁(x) = E[Y_i | W_i = 1, X_i = x], which means we can estimate μ₀(·) using regression methods for the untreated subsample and μ₁(·) using the treated subsample. Given consistent estimators μ̂₀(·) and μ̂₁(·), a consistent estimator for either τ_PATE or τ_CATE is

(11) τ̂_reg = (1/N) Σ_{i=1}^N ( μ̂₁(X_i) − μ̂₀(X_i) ).

Given parametric models for μ₀(·) and μ₁(·), estimation and inference are straightforward.⁸ In the simplest case, we assume each conditional mean can be expressed as functions linear in parameters, say

(12) μ₀(x) = α₀ + β₀′(x − ψ_X), μ₁(x) = α₁ + β₁′(x − ψ_X),

where we take deviations from the overall population covariate mean ψ_X so that the treatment effect is the difference in intercepts. (Naturally, as in any regression context, we can replace x with general functions of x.) Of course, we rarely know the population mean of the covariates, so in estimation we replace ψ_X with the sample average across all units, X̄. Then τ̂_reg is simply

(13) τ̂_reg = α̂₁ − α̂₀.

This estimator is also obtained from the coefficient on the treatment indicator W_i in the regression of Y_i on 1, W_i, X_i, and W_i·(X_i − X̄). Standard errors can be obtained from standard least squares regression output.

⁸ There is a somewhat subtle issue in estimating treatment effects from stratified samples or samples with missing values of the covariates. If the missingness or stratification are determined by outcomes on the covariates, X_i, and the conditional means are correctly specified, then the missing data or stratification can be ignored for the purposes of estimating the regression parameters; see, for example, Wooldridge (1999, 2007). However, sample selection or stratification based on X_i cannot be ignored in estimating, say, τ_PATE, because τ_PATE equals the expected difference in regression functions across the population distribution of X_i. Therefore, consistent estimation of τ_PATE requires applying inverse probability weights or sampling weights to the average in (11).


(This standard error is only valid for τ_CATE and not for τ_PATE.)

A different representation of τ̂_reg is useful in order to illustrate some of the concerns with regression estimators in this setting. Suppose we do use the linear model in (12), μw(x) = αw + βw′x. It can be shown that

(14)  τ̂_reg = Ȳ₁ − Ȳ₀ − (X̄₁ − X̄₀)′ ((N₁/N)·β̂₀ + (N₀/N)·β̂₁).

To adjust for differences in covariates between treated and control units, the simple difference in average outcomes, Ȳ₁ − Ȳ₀, is adjusted by the difference in average covariates, X̄₁ − X̄₀, multiplied by the weighted average of the regression coefficients β̂₀ and β̂₁ in the two treatment regimes. This is a useful representation. It shows that if the averages of the covariates in the two treatment arms are very different, then the adjustment to the simple mean difference can be large. We can see that even more clearly by inspecting the predicted outcome for the treated units had they been subject to the control treatment:

Ê[Yi(0) | Wi = 1] = Ȳ₀ + β̂₀′(X̄₁ − X̄₀).

The regression parameter β̂₀ is estimated on the control sample, where the average of the covariates is equal to X̄₀. It therefore likely provides a good approximation to the conditional mean function around that value. However, this estimated regression function is then used to predict outcomes in the treated sample, where the average of the covariates is equal to X̄₁. If these covariate averages are very different, and thus the regression model is used to predict outcomes far away from where the parameters were estimated, the results can be sensitive to minor changes in the specification. Unless the linear approximation to the regression function is globally accurate, regression may lead to severe biases. Another way of interpreting this problem is as a multicollinearity problem. If the averages of the covariates in the two treatment arms are very different, the correlation between the covariates and the treatment indicator is relatively high. Although conventional least squares standard errors take the degree of multicollinearity into account, they do so conditional on the specification of the regression function. Here the concern is that any misspecification may be exacerbated by the collinearity problem. As noted in the introduction to section 5, an easy way to establish the severity of this problem is to inspect the normalized differences (X̄₁ − X̄₀)/√(S₀² + S₁²).

In the case of the standard regression estimator it is straightforward to derive and to estimate the variance when we view the estimator as an estimator of τ_CATE. Assuming the linear regression model is correctly specified, we have

(15)  √N (τ̂_reg − τ_CATE) →d N(0, V₀ + V₁), where V_w = N·E[(α̂_w − α_w)²],

which can be obtained directly from standard regression output. Estimating the variance when we view the estimator as an estimator of τ_PATE requires adding a term capturing the variation in the treatment effect conditional on the covariates. The form is then

√N (τ̂_reg − τ_PATE) →d N(0, V₀ + V₁ + V_τ),

where the third term in the normalized variance is

V_τ = (β₁ − β₀)′ E[(Xi − E[Xi])(Xi − E[Xi])′] (β₁ − β₀),

which can be estimated by replacing the expectations with sample averages.
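The representation in (14) can be checked numerically: with separate within-arm least squares fits, averaging the imputed differences over the full sample as in (11) is algebraically identical to the covariate-adjusted difference in means in (14). A sketch under a toy design of our own:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=(n, 2))
w = (rng.uniform(size=n) < 0.4).astype(int)
y = x @ np.array([1.0, -1.0]) + w * (2.0 + x @ np.array([0.5, 0.0])) + rng.normal(size=n)

def ols(xmat, yvec):
    # intercept and slope vector from a least squares fit
    b, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(yvec)), xmat]), yvec, rcond=None)
    return b[0], b[1:]

a0, b0 = ols(x[w == 0], y[w == 0])
a1, b1 = ols(x[w == 1], y[w == 1])

# estimator (11): average imputed difference over the full sample
tau_reg = np.mean((a1 + x @ b1) - (a0 + x @ b0))

# representation (14): difference in means, adjusted by the covariate gap
n1, n0 = w.sum(), n - w.sum()
xbar1, xbar0 = x[w == 1].mean(axis=0), x[w == 0].mean(axis=0)
tau_14 = (y[w == 1].mean() - y[w == 0].mean()
          - (xbar1 - xbar0) @ (n1 / n * b0 + n0 / n * b1))
```

The two quantities agree up to floating-point error, which makes concrete why a large covariate gap X̄₁ − X̄₀ translates into a large, specification-sensitive adjustment.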

30 Journal of Economic Literature, Vol. XLVII (March 2009)

The resulting estimate is

V̂_τ = (β̂₁ − β̂₀)′ [(1/N) Σ_{i=1}^N (Xi − X̄)(Xi − X̄)′] (β̂₁ − β̂₀).

In practice, this additional term is rarely incorporated, and researchers instead report the variance corresponding to τ_CATE. In cases where the slope coefficients do not differ substantially across the two regimes (equivalently, where the coefficients on the interaction terms Wi·(Xi − X̄) are "small"), this last term is likely to be swamped by the variances in (15).

In many cases, researchers have sought to go beyond simple parametric models for the regression functions. Two general directions have been explored. The first relies on local smoothing, and the second on increasingly flexible global approximations. We discuss both in turn.

Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) consider local smoothing methods to estimate the two regression functions. The first method they consider is kernel regression. Given a kernel K(·) and a bandwidth h, the kernel estimator for μw(x) is

μ̂w(x) = Σ_{i:Wi=w} λi·Yi, with weights λi = K((Xi − x)/h) / Σ_{j:Wj=w} K((Xj − x)/h).

Although the rate of convergence of the kernel estimator to the regression function is slower than the conventional parametric rate N^(−1/2), the rate of convergence of the implied estimator for the average treatment effect, τ̂_reg in (11), is the regular parametric rate under regularity conditions. These conditions include smoothness of the regression functions and require the use of higher order kernels (with the order of the kernel depending on the dimension of the covariates). In practice, researchers have not used higher order kernels and, with positive kernels, the bias for kernel estimators is a more severe problem than for the matching estimators discussed in section 5.5.

Kernel regression of this type can be interpreted as locally fitting a constant regression function. A general alternative is to fit locally a polynomial regression function. The leading case of this is local linear regression (J. Fan and I. Gijbels 1996), applied to estimation of average treatment effects by Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998). Define α̂(x) and β̂(x) as the local least squares estimates, based on locally fitting a linear regression function:

(α̂(x), β̂(x)) = argmin_{α,β} Σ_{i=1}^N λi·(Yi − α − β′(Xi − x))²,

with the same weights λi as in the standard kernel estimator. The regression function at x is then estimated as μ̂(x) = α̂(x). In order to achieve convergence at the best possible rate for τ̂_reg, one needs to use higher order kernels, although the order required is less than that for the standard kernel estimator.

For both the standard kernel estimator and the local linear estimator an important choice is that of the bandwidth h. In practice, researchers have used ad hoc methods for bandwidth selection. Formal results on bandwidth selection from the literature on nonparametric regression are not directly applicable. Those results are based on minimizing a global criterion such as the expected value of the squared difference between the estimated and true regression function, with the expectation taken with respect to the marginal distribution of the covariates. Thus, they focus on estimating the regression function well everywhere. Here the focus is on a particular scalar functional of the regression function.
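A minimal version of the kernel-regression variant of (11), using a Gaussian kernel (the paper does not commit to a kernel or bandwidth, so both choices below, like the function names and toy data, are ours for illustration):

```python
import numpy as np

def nw_fit(xg, x, y, h):
    """Nadaraya-Watson kernel regression estimate of E[Y|X = xg] with a
    Gaussian kernel and bandwidth h (a minimal, positive-kernel sketch)."""
    k = np.exp(-0.5 * ((xg[:, None] - x[None, :]) / h) ** 2)
    return (k * y).sum(axis=1) / k.sum(axis=1)

def kernel_adjustment(y, w, x, h=0.3):
    """Imputation estimator (11) with mu_w(.) estimated by kernel
    regression on each treatment arm separately."""
    mu0 = nw_fit(x, x[w == 0], y[w == 0], h)
    mu1 = nw_fit(x, x[w == 1], y[w == 1], h)
    return np.mean(mu1 - mu0)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=500)
w = (rng.uniform(size=500) < 0.5).astype(int)
y = np.sin(x) + 2.0 * w          # constant effect of 2 on a nonlinear baseline
tau_hat = kernel_adjustment(y, w, x)
```

Because the smoothing bias of the positive kernel is nearly the same in both arms here, it largely cancels in the difference; the text's point is that with covariate distributions that differ across arms, or with higher-dimensional covariates, it would not.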


It is not clear whether the conventional methods for bandwidth choice have good properties in this case. Although formal results are given for the case with continuous regressors, modifications have been developed that allow for both continuous and discrete covariates (Jeffrey S. Racine and Qi Li 2004). All such methods require choosing the degree of smoothing (often known as bandwidths), and there has not been much work on choosing bandwidths for the particular problem of estimating average treatment effects, where the parameter of interest is effectively the average of a regression function, and not the entire function. See Imbens (2004) for more discussion. Although the estimators based on local smoothing have not been shown to attain the variance efficiency bound, it is likely that they can be constructed to do so under sufficient smoothness conditions.

An alternative to local smoothing methods are global smoothing methods, such as series or sieve estimators. Such estimators are parametric for a given sample size, with the number of parameters, and the flexibility of the model, increasing with the sample size. One attraction of such methods is that often estimation and inference can proceed as if the model is completely parametric. The amount of smoothing is determined by the number of terms in the series, and the large-sample analysis is carried out with the number of terms growing as a function of the sample size. Again, little is known about how to choose the number of terms when interest lies in average treatment effects. For the average treatment case, Hahn (1998), Imbens, Newey, and Ridder (2005), Andrea Rotnitzky and Robins (1995), and Chen, Hong, and Tarozzi (2008) have developed estimators of this type. Hahn shows that estimators in this class can achieve the variance lower bounds for estimating τ_PATE. For a simple version of such an estimator, suppose that Xi is a scalar. Then we can approximate μw(x) by a K-th order polynomial

μ̂_{w,K}(x) = Σ_{k=0}^K β̂_{w,k}·x^k.

We then estimate the β_{w,k} by least squares regression, and estimate the average treatment effect using (11). This is a version of the estimator discussed in Imbens, Newey, and Ridder (2005) and Chen, Hong, and Tarozzi (2008), with formal results given for the case with general Xi. Imbens, Newey, and Ridder (2005) also discuss methods for choosing the number of terms in the series based on the expected squared error of the average treatment effect.

If the outcome is binary, or more generally of a limited dependent variable form, a series approximation to the regression function is not necessarily attractive. An alternative is to use increasingly flexible approximations based on models that exploit the structure of the outcome data. For example, with binary outcomes, Hirano, Imbens, and Ridder (2003) show how using a polynomial approximation to the log odds ratio leads to an attractive estimator for the conditional mean. See Chen (2007) for a general discussion of such models. One can imagine that, in cases with nonnegative response variables, exponential regression functions, perhaps derived from specific models such as the Tobit model (when the response can pile up at zero), combined with polynomial approximations for the linear index function, might be useful.

Generally, methods based on global approximations suffer from the same drawbacks as linear regression. If the covariate distributions are substantially different in the two treatment groups, estimates based on such methods rely, perhaps more than desired, on extrapolation. Using these methods in cases with substantial differences in covariate distributions is therefore not recommended (except possibly in cases where the sample has been trimmed so that the covariates across the two treatment groups have considerable overlap).


Before we turn to propensity score methods, we should comment on estimating the average treatment effects on the treated, τ_PATT and τ_CATT. In this case, τ̂(Xi) gets averaged across observations with Wi = 1, rather than across the entire sample as in (11). Because μ1(x) is estimated on the treated subsample, in estimating PATT or CATT there is no problem if μ̂1(x) is poorly estimated at covariate values that are common in the control group but scarce in the treatment group. But we must have a good estimate of μ0(x) at covariate values common in the treatment group, and this is not ensured, because we can only use the control group to obtain μ̂0(x). Nevertheless, in many settings μ0(x) can be estimated well over the entire range of the covariates, because the control group often includes units that are similar to those in the treatment group. By contrast, often there are numerous control group units (for example, high-income workers in the context of a job training program) that are quite different from any units in the treatment group, making the ATE parameters considerably more difficult to estimate than ATT parameters. (Further, the ATT parameters are more interesting from a policy perspective in such cases, unless one redefines the population to exclude some units that are unlikely to ever be in the treatment group.)

5.4 Methods Based on the Propensity Score

The first set of alternatives to regression estimators relies on estimates of the propensity score. These methods were introduced in Rosenbaum and Rubin (1983b). An early economic discussion is in Card and Sullivan (1988). Rosenbaum and Rubin show that, under unconfoundedness, independence of potential outcomes and treatment indicators also holds after conditioning solely on the propensity score, e(x) ≡ pr(Wi = 1 | Xi = x):

Wi ⊥ (Yi(0), Yi(1)) | Xi   ⟹   Wi ⊥ (Yi(0), Yi(1)) | e(Xi).

The basic insight is that for any binary variable Wi and any random vector Xi, it is true (without assuming unconfoundedness) that

Wi ⊥ Xi | e(Xi).

Hence, within subpopulations with the same value for the propensity score, covariates are independent of the treatment indicator and thus cannot lead to biases (in the same way that, in a regression framework, omitted variables that are uncorrelated with included covariates do not introduce bias). Since under unconfoundedness all biases can be removed by adjusting for differences in covariates, this means that within subpopulations homogeneous in the propensity score there are no biases in comparisons between treated and control units.

Given the Rosenbaum-Rubin result, it is sufficient, under the maintained assumption of unconfoundedness, to adjust solely for differences in the propensity score between treated and control units. This result can be exploited in a number of ways. Here we discuss three of these that have been used in practice. The first two of these methods exploit the fact that the propensity score can be viewed as a covariate that is sufficient to remove biases in estimation of average treatment effects. For this purpose, any one-to-one function of the propensity score could also be used. The third method further uses the fact that the propensity score is the conditional probability of receiving the treatment.

The first method simply uses the propensity score in place of the covariates in regression analysis. Define νw(e) ≡ E[Yi | Wi = w, e(Xi) = e]. Unconfoundedness in combination with the Rosenbaum-Rubin result implies that νw(e) = E[Yi(w) | e(Xi) = e]. Then we can estimate νw(e) very generally using kernel or series estimation on the propensity score, something which is greatly simplified by the fact that the propensity score is a scalar. Heckman, Ichimura, and Todd (1998) consider local smoothers, and Hahn (1998) considers a series estimator.


In either case we have the consistent estimator

τ̂_regprop = (1/N) Σ_{i=1}^N (ν̂₁(ê(Xi)) − ν̂₀(ê(Xi))),

which is simply the average of the differences in predicted values for the treated and untreated outcomes. Interestingly, Hahn shows that, unlike when we use regression to adjust for the full set of covariates, the series regression estimator based on adjusting for the known propensity score does not achieve the efficiency bound.

Although methods of this type have been used in practice, probably because of their simplicity, regression on simple functions of the propensity score is not recommended. Because the propensity score does not have a substantive meaning, it is difficult to motivate a low order polynomial as a good approximation to the conditional expectation. For example, a linear model in the propensity score is unlikely to provide a good approximation to the conditional expectation: individuals with propensity scores of 0.45 and 0.50 are likely to be much more similar than individuals with propensity scores equal to 0.01 and 0.06. Moreover, no formal asymptotic properties have been derived for the case with the propensity score unknown.

The second method, variously referred to as blocking, subclassification, or stratification, also adjusts for differences in the propensity score in a way that can be interpreted as regression, but in a more flexible manner. Originally suggested by Rosenbaum and Rubin (1983b), the idea is to partition the sample into strata by (discretized) values of the propensity score, and then analyze the data within each stratum as if the propensity score were constant and the data could be interpreted as coming from a completely randomized experiment. This can be interpreted as approximating the conditional mean of the potential outcomes by a step function. To be more precise, let 0 = c₀ < c₁ < c₂ < … < c_J = 1 be boundary values. Then define B_ij, for i = 1, …, N and j = 1, …, J − 1, as the indicators

B_ij = 1 if c_{j−1} < ê(Xi) ≤ c_j, and 0 otherwise,

and B_iJ = 1 − Σ_{j=1}^{J−1} B_ij. Now estimate within stratum j the average treatment effect τ_j = E[Yi(1) − Yi(0) | B_ij = 1] as

τ̂_j = Ȳ_{j1} − Ȳ_{j0}, where Ȳ_{jw} = (1/N_{jw}) Σ_{i:Wi=w} B_ij·Yi, and N_{jw} = Σ_{i:Wi=w} B_ij.

If J is sufficiently large and the differences c_j − c_{j−1} small, there is little variation in the propensity score within a stratum, and one can analyze the data as if the propensity score is constant, and thus as if the data within a block were generated by a completely randomized experiment (with the assignment probabilities constant within a stratum but varying between strata). The average treatment effect is then estimated as the weighted average of the within-stratum estimates:

τ̂_block = Σ_{j=1}^J τ̂_j · (N_{j0} + N_{j1})/N.

With J large, the implicit step function approximation to the regression functions νw(e) will be accurate. Cochran (1968) shows in a Gaussian example that with five equal sized blocks the remaining bias is less than 5 percent of the bias in the simple difference between average outcomes among treated and controls.


Motivated by Cochran's calculations, researchers have often used five strata, although depending on the sample size and the joint distribution of the data, fewer or more blocks will generally lead to a lower expected mean squared error.

The variance for this estimator is typically calculated conditional on the strata indicators, and assuming random assignment within the strata. That is, for stratum j the estimator is τ̂_j, and its variance is estimated as V̂_j = V̂_{j0} + V̂_{j1}, where

V̂_{jw} = s²_{jw}/N_{jw}, with s²_{jw} = (1/(N_{jw} − 1)) Σ_{i:B_ij=1, Wi=w} (Yi − Ȳ_{jw})².

The overall variance is then estimated as

V̂(τ̂_block) = Σ_{j=1}^J (V̂_{j0} + V̂_{j1}) · ((N_{j0} + N_{j1})/N)².

This variance estimator is appropriate for τ_CATE, although it ignores biases arising from variation in the propensity score within strata.

The third method exploiting the propensity score is based on weighting. Recall that τ_PATE = E[Yi(1) − Yi(0)] = E[Yi(1)] − E[Yi(0)]. We consider the two terms separately. Because Wi·Yi = Wi·Yi(1), we have

E[Wi·Yi/e(Xi)] = E[Wi·Yi(1)/e(Xi)]
  = E[E(Wi·Yi(1)/e(Xi) | Xi)]
  = E[E(Wi | Xi)·E(Yi(1) | Xi)/e(Xi)]
  = E[e(Xi)·E(Yi(1) | Xi)/e(Xi)]
  = E[E(Yi(1) | Xi)] = E[Yi(1)],

where the second and final equalities follow by iterated expectations, and the third equality holds by unconfoundedness. The implication is that weighting the treated population by the inverse of the propensity score recovers the expectation of the unconditional response under treatment. A similar calculation shows E[((1 − Wi)·Yi)/(1 − e(Xi))] = E[Yi(0)], and together these imply

(16)  τ_PATE = E[Wi·Yi/e(Xi) − ((1 − Wi)·Yi)/(1 − e(Xi))].

Equation (16) suggests an obvious estimator for τ_PATE:

(17)  τ̂_weight = (1/N) Σ_{i=1}^N [Wi·Yi/e(Xi) − ((1 − Wi)·Yi)/(1 − e(Xi))],

which, as a sample average from a random sample, is consistent for τ_PATE and asymptotically normally distributed. The estimator in (17) is essentially due to D. G. Horvitz and D. J. Thompson (1952).9

In practice, (17) is not a feasible estimator because it depends on the propensity score function e(·), which is rarely known. One surprising result is that, even if we know the propensity score, τ̂_weight does not achieve the efficiency bound given in (7). It turns out to be better, in terms of large sample efficiency, to weight using the estimated rather than the true propensity score. Hirano, Imbens, and Ridder (2003) establish conditions under which replacing e(·) with a logistic sieve estimator results in a weighted propensity score estimator that achieves the variance bound. The estimator is practically simple to compute, as estimation of the propensity score involves a straightforward logit estimation involving flexible functions of the covariates.

9 Because the Horvitz-Thompson estimator is based on sample averages, adjustments for stratified sampling are straightforward if one is provided sampling weights.
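The identity behind (16)-(17) can be verified on a small constructed population in which treated shares match e(x) exactly within covariate cells, so the Horvitz-Thompson average recovers the true effect without error (the construction, like the function name, is ours):

```python
import numpy as np

def ht_weighting(y, w, e):
    """Horvitz-Thompson form (17): weight treated outcomes by 1/e(X) and
    control outcomes by 1/(1 - e(X)), using the true propensity score."""
    return np.mean(w * y / e - (1 - w) * y / (1 - e))

# tiny population, treated shares matching e(x) exactly within cells:
# x = 0: e = 1/4 (1 of 4 treated);  x = 1: e = 1/2 (2 of 4 treated)
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = np.array([1, 0, 0, 0, 1, 1, 0, 0])
e = np.where(x == 0, 0.25, 0.5)
y = x + w * 1.0                     # Y(0) = x, Y(1) = x + 1, so tau = 1
tau_ht = ht_weighting(y, w, e)
```

The treated term averages 1.5 (= E[Y(1)]) and the control term 0.5 (= E[Y(0)]) despite the imbalanced assignment, exactly as the derivation above predicts.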


Theoretically, the number of terms in the approximation should increase with the sample size. In the second step, given the estimated propensity score ê(x), one estimates

(18)  τ̂_ipw = [Σ_{i=1}^N Wi·Yi/ê(Xi)] / [Σ_{i=1}^N Wi/ê(Xi)] − [Σ_{i=1}^N (1 − Wi)·Yi/(1 − ê(Xi))] / [Σ_{i=1}^N (1 − Wi)/(1 − ê(Xi))].

We refer to this as the inverse probability weighting (IPW) estimator. See Hirano, Imbens, and Ridder (2003) for intuition as to why estimating the propensity score leads to a more efficient estimator, asymptotically, than knowing the propensity score.

Ichimura and Oliver Linton (2005) studied τ̂_IPW when e(·) is obtained via kernel regression, and they consider the problem of optimal bandwidth choice when the object of interest is τ_PATE. More recently, Li, Racine, and Wooldridge (forthcoming) consider kernel estimation for discrete as well as continuous covariates. The estimator proposed by Li, Racine, and Wooldridge achieves the variance lower bound. See Hirano, Imbens, and Ridder (2003) and Wooldridge (2007) for methods for estimating the variance of these estimators.

Note that the blocking estimator can also be interpreted as a weighting estimator. Consider observations in block j. Within the block, the N_{j1} treated observations all get equal weight 1/N_{j1}. In the estimator for the overall average treatment effect, this block gets weight (N_{j0} + N_{j1})/N, so we can write τ̂ = (1/N) Σ_{i=1}^N λi·Yi, where for treated observations in block j the weight, normalized by N, is N·λi = (N_{j0} + N_{j1})/N_{j1}, and for control observations it is N·λi = −(N_{j0} + N_{j1})/N_{j0}. Implicitly this estimator is based on an estimate of the propensity score in block j equal to N_{j1}/(N_{j0} + N_{j1}). Compared to the IPW estimator, the propensity score is smoothed within the block. This has the advantage of avoiding particularly large weights, but comes at the expense of introducing bias if the propensity score is correctly specified.

A particular concern with IPW estimators arises again when the covariate distributions are substantially different for the two treatment groups. That implies that the propensity score gets close to zero or one for some values of the covariates. Small or large values of the propensity score raise a number of issues. One concern is that alternative parametric models for the binary data, such as probit and logit models, which can provide similar approximations in terms of estimated probabilities over the middle ranges of their arguments, tend to be more different when the probabilities are close to zero or one. Thus the choice of model and specification becomes more important, and it is often difficult to make well motivated choices in treatment effect settings. A second concern is that for units with propensity scores close to zero or one, the weights can be large, making those units particularly influential in the estimates of the average treatment effects, and thus making the estimator imprecise. These concerns are less serious than those regarding regression estimators, because at least the IPW estimates will accurately reflect uncertainty. Still, these concerns make the simple IPW estimators less attractive. (As for regression cases, the problem can be less severe for the ATT parameters, because propensity score values close to zero play no role. Problems for estimating ATT arise when some units, as described by their observed covariates, are almost certain to receive treatment.)

5.5 Matching

Matching estimators impute the missing potential outcomes using only the outcomes of a few nearest neighbors of the opposite treatment group. In that sense, matching is similar to nonparametric kernel regression, with the number of neighbors playing the role of the bandwidth in the kernel regression.
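A sketch of the normalized (ratio) form in (18); dividing each weighted sum by the sum of its weights means that with a constant estimated propensity score the estimator collapses to the simple difference in sample means, which the toy check below exploits (names and data are ours):

```python
import numpy as np

def ipw_normalized(y, w, e_hat):
    """Ratio (normalized) IPW estimator (18): within each arm the inverse
    probability weights are rescaled to average to one."""
    t1 = np.sum(w * y / e_hat) / np.sum(w / e_hat)
    t0 = np.sum((1 - w) * y / (1 - e_hat)) / np.sum((1 - w) / (1 - e_hat))
    return t1 - t0

rng = np.random.default_rng(6)
y = rng.normal(size=50)
w = (rng.uniform(size=50) < 0.3).astype(int)

# with a constant e_hat, the weights cancel in each ratio
tau_norm = ipw_normalized(y, w, np.full(50, 0.7))
diff_means = y[w == 1].mean() - y[w == 0].mean()
```

The normalization also guards against the unnormalized estimator's sensitivity to a few very small or very large estimated propensity scores, though the text's substantive concerns about limited overlap remain.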


A formal difference with kernel methods is that the asymptotic distribution for matching estimators is derived conditional on the implicit bandwidth, that is, the number of neighbors, often fixed at a small number, e.g., one. Using such asymptotics, the implicit estimate μ̂w(x) is (close to) unbiased, but not consistent, for μw(x). In contrast, the kernel regression estimators discussed in the previous section implied consistency of μ̂w(x).

Matching estimators have the attractive feature that the smoothing parameters are easily interpretable. Given the matching metric, the researcher only has to choose the number of matches. Using only a single match leads to the most credible inference with the least bias, at the cost of sacrificing some precision. This sits well with the focus in the literature on reducing bias rather than variance. It also can make the matching estimator easier to use than those estimators that require more complex choices of smoothing parameters, and this may be another explanation for its popularity.

Matching estimators have been widely studied in practice and theory (e.g., X. Gu and Rosenbaum 1993; Rosenbaum 1989, 1995, 2002; Rubin 1973b, 1979; Rubin and Neal Thomas 1992a, 1992b, 1996, 2000; Heckman, Ichimura, and Todd 1998; Dehejia and Sadek Wahba 1999; Abadie and Imbens 2006; Alexis Diamond and Jasjeet S. Sekhon 2008; Sekhon forthcoming; Sekhon and Richard Grieve 2008; Rosenbaum and Rubin 1985; Stefano M. Iacus, Gary King, and Giuseppe Porro 2008). Most often they have been applied in settings where (1) the interest is in the average treatment effect for the treated, and (2) there is a large reservoir of potential controls, although recent work (Abadie and Imbens 2006) shows that matching estimators can be modified to estimate the overall average effect. The setting with many potential controls allows the researcher to match each treated unit to one or more distinct controls, hence the label "matching without replacement." Given the matched pairs, the treatment effect within a pair is estimated as the difference in outcomes, and the overall average as the average of the within-pair differences. Exploiting the representation of the estimator as a difference in two sample means, inference is based on standard methods for differences in means or methods for paired randomized experiments, ignoring any remaining bias. Fully efficient matching algorithms, which take into account the effect of a particular choice of match for treated unit i on the pool of potential matches for unit j, are computationally cumbersome. In practice, researchers use greedy algorithms that sequentially match units. Most commonly the units are ordered by the value of the propensity score, with the highest propensity score units matched first. See Gu and Rosenbaum (1993) and Rosenbaum (1995) for discussions.

Abadie and Imbens (2006) study formal asymptotic properties of matching estimators in a different setting, where both treated and control units are (potentially) matched, and matching is done with replacement. Code for the Abadie-Imbens estimator is available in Matlab and Stata (see Abadie et al. 2004).10 Formally, given a sample, {(Yi, Xi, Wi)}_{i=1}^N, let ℓ₁(i) be the nearest neighbor to i, that is, ℓ₁(i) is equal to the nonnegative integer j, for j ∈ {1, …, N}, if Wj ≠ Wi and

‖Xj − Xi‖ = min_{k:Wk≠Wi} ‖Xk − Xi‖.

More generally, let ℓm(i) be the index that satisfies W_{ℓm(i)} ≠ Wi and that is the m-th closest to unit i:

Σ_{l:Wl≠Wi} 1{‖Xl − Xi‖ ≤ ‖X_{ℓm(i)} − Xi‖} = m.

10 See Sascha O. Becker and Andrea Ichino (2002) and Edwin Leuven and Barbara Sianesi (2003) for alternative Stata implementations of matching estimators.
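The match index ℓm(i) can be computed directly from its definition (a sketch; the Euclidean norm and stable tie-breaking are our choices, and the text's more general metrics would replace them):

```python
import numpy as np

def ell(i, m, x, w):
    """Return the index of the m-th closest unit to i in the opposite
    treatment group, i.e. the ell_m(i) of the text."""
    x = np.asarray(x, dtype=float).reshape(len(w), -1)
    opp = np.flatnonzero(w != w[i])                 # candidate matches
    d = np.linalg.norm(x[opp] - x[i], axis=1)       # distances to unit i
    return opp[np.argsort(d, kind="stable")[m - 1]]

# unit 0 is treated at x = 0; controls sit at x = 1 (unit 1) and x = 3 (unit 2)
x = np.array([[0.0], [1.0], [3.0], [3.5]])
w = np.array([1, 0, 0, 1])
nn1 = ell(0, 1, x, w)   # nearest control to unit 0
nn2 = ell(0, 2, x, w)   # second-closest control to unit 0
```

The set J_M(i) of the text is then simply {ell(i, 1, ...), ..., ell(i, M, ...)}.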


Here 1{·} is the indicator function, equal to one if the expression in brackets is true and zero otherwise. In other words, ℓm(i) is the index of the unit in the opposite treatment group that is the m-th closest to unit i in terms of the distance measure based on the norm ‖·‖. Let J_M(i) ⊂ {1, …, N} denote the set of indices for the first M matches for unit i: J_M(i) = {ℓ₁(i), …, ℓ_M(i)}. Now impute the missing potential outcomes as the average of the outcomes for the matches, by defining Ŷi(0) and Ŷi(1) as

Ŷi(0) = Yi if Wi = 0, and (1/M) Σ_{j∈J_M(i)} Yj if Wi = 1;
Ŷi(1) = (1/M) Σ_{j∈J_M(i)} Yj if Wi = 0, and Yi if Wi = 1.

The simple matching estimator discussed in Abadie and Imbens is then

(19)  τ̂_sm = (1/N) Σ_{i=1}^N (Ŷi(1) − Ŷi(0)).

Abadie and Imbens show that the bias of this estimator is of order O(N^(−1/K)), where K is the dimension of the covariates. Hence, if one studies the asymptotic distribution of the estimator by normalizing by √N (as can be justified by the fact that the variance of the estimator is of order O(1/N)), the bias does not disappear if the dimension of the covariates is equal to two, and will dominate the large sample variance if K is at least three. To put this result in perspective, it is useful to relate it to bias properties of estimators based on kernel regression. Kernel estimators can be viewed as matching estimators where all observations within some bandwidth h_N receive some weight. As the sample size N increases, the bandwidth h_N shrinks, but sufficiently slowly to ensure that the number of units receiving non-zero weight diverges. If all the weights are positive, the bias for kernel estimators would generally be worse. In order to achieve root-N consistency, it is therefore critical that some weights are negative, through the device of higher order kernels, with the exact order required dependent on the dimension of the covariates (see, e.g., Heckman, Ichimura, and Todd 1998). In practice, however, researchers have not used higher order kernels, and so bias concerns for nearest-neighbor matching estimators are even more relevant for kernel matching methods.

There are three caveats to the Abadie-Imbens bias result. First, it is only the continuous covariates that should be counted in the dimension of the covariates. With discrete covariates the matching will be exact in large samples, and as a result such covariates do not contribute to the order of the bias. Second, if one matches only the treated, and the number of potential controls is much larger than the number of treated units, one can justify ignoring the bias by appealing to an asymptotic sequence where the number of potential controls increases faster with the sample size than the number of treated units. Specifically, if the number of controls, N₀, and the number of treated, N₁, satisfy N₁/N₀^(4/K) → 0, then the bias disappears in large samples after normalization by √N₁. Third, even though the order of the bias may be high, the actual bias may still be small if the coefficients in the leading term are small. This is possible if the biases for different units are at least partially offsetting. For example, the leading term in the bias relies on the regression function being nonlinear, and the density of the covariates having a nonzero slope. If either the regression function is well approximated by a linear function, or the density is approximately flat, the bias may be fairly limited.

Abadie and Imbens (2006) also show that matching estimators are generally not efficient. Even in the case where the bias is of low enough order to be dominated by the variance, the estimators do not reach the efficiency bound given a fixed number of matches.


of matches. To reach the bound the number of matches would need to increase with the sample size. If M → ∞, with M/N → 0, then the matching estimator is essentially like a nonparametric regression estimator. However, it is not clear that using an approximation based on a sequence with an increasing number of matches improves the accuracy of the approximation. Given that in an actual data set one uses a specific number of matches, M, it would appear appropriate to calculate the asymptotic variance conditional on that number, rather than approximate the distribution as if this number is large. Calculations in Abadie and Imbens show that the efficiency loss from even a very small number of matches is quite modest, so the concerns about the inefficiency of matching estimators may not be very relevant in practice. Little is known about the optimal number of matches, or about data-dependent ways of choosing it.

All of the distance metrics used in practice standardize the covariates in some manner. Abadie and Imbens use a diagonal matrix with each diagonal element equal to the inverse of the corresponding covariate variance. The most common metric is the Mahalanobis metric, which is based on the inverse of the full covariance matrix. Zhao (2004), in an interesting discussion of the choice of metrics, suggests some alternatives that depend on the correlation between covariates, treatment assignment, and outcomes. So far there is little experience with any metrics beyond the inverse-of-the-variances and Mahalanobis metrics. Zhao (2004) reports the results of some simulations using his proposed metrics, finding no clear winner given his specific design.

5.6 Combining Regression and Propensity Score Weighting

In sections 5.3 and 5.4, we describe methods for estimating average causal effects based on two strategies: the first is based on estimating μw(x) = E[Yi(w)|Xi = x] for w = 0, 1 and averaging the difference as in (11), and the second is based on estimating the propensity score e(x) = pr(Wi = 1|Xi = x) and using that to weight the outcomes as in (18). For each approach, we have discussed estimators that achieve the asymptotic efficiency bound. If we have large sample sizes, relative to the dimension of Xi, we might think our nonparametric estimators of the conditional means or propensity score are sufficiently accurate to invoke the asymptotic efficiency results described above.

In other cases, however, we might choose flexible parametric models without being confident that they necessarily approximate the means or propensity score well. As we discussed earlier, one reason for viewing estimators of conditional means or propensity scores as flexible parametric models is that it greatly simplifies standard error calculations for treatment effect estimates. In such cases, one might want to adopt a strategy that combines regression and propensity score methods in order to achieve some robustness to misspecification of the parametric models. It may be helpful to think about the analogy to omitted variable bias. Suppose we are interested in the coefficient on Wi in the (long) linear regression of Yi on a constant, Wi, and Xi. Suppose we omit Xi from the long regression, and just run the short regression of Yi on a constant and Wi. The bias in the estimate from the short regression is equal to the product of the coefficient on Xi in the long regression and the coefficient on Wi in a regression of Xi on a constant and Wi. Weighting can be interpreted as removing the correlation between Wi and Xi, and regression as removing the direct effect of Xi. Weighting therefore removes the bias from omitting Xi from the regression. As a result, combining regression and weighting can lead to additional robustness, both by removing the correlation between the omitted covariates and the treatment, and by reducing the correlation between

Imbens and Wooldridge: Econometrics of Program Evaluation 39

the omitted and included variables. This is the idea behind the doubly-robust estimators developed in Robins and Rotnitzky (1995), Robins, Rotnitzky, and Lue Ping Zhao (1995), and Mark J. van der Laan and Robins (2003).

Suppose we model the two regression functions as μw(x) = αw + βw'(x − X̄), for w = 0, 1 (where we abuse notation a bit and insert the sample averages of the covariates for their population means). More generally, we may use a nonlinear model for the conditional expectation, or just a more flexible linear approximation. Suppose we model the propensity score as e(x) = p(x; γ), for example as p(x; γ) = exp(γ0 + x'γ1)/(1 + exp(γ0 + x'γ1)). In the first step, we estimate γ by maximum likelihood and obtain the estimated propensity scores as ê(Xi) = p(Xi; γ̂). In the second step, we use linear regression, where we weight the objective function by the inverse probability of treatment or non-treatment. Specifically, to estimate (α0, β0) and (α1, β1), we would solve the weighted least squares problems

(20)  min over (α0, β0) of  Σ_{i:Wi=0} (Yi − α0 − β0'(Xi − X̄))² / (1 − p(Xi; γ̂))

and

      min over (α1, β1) of  Σ_{i:Wi=1} (Yi − α1 − β1'(Xi − X̄))² / p(Xi; γ̂).

Given the estimated conditional mean functions, we estimate τPATE using the expression for τ̂reg = α̂1 − α̂0 as in equation (13). But what is the motivation for weighting by the inverse propensity score when we did not use such weighting in section 5.3? The motivation is the double robustness result due to Robins and Rotnitzky (1995); see also Daniel O. Scharfstein, Rotnitzky, and Robins (1999).

First, suppose that the conditional expectation is indeed linear, or E[Yi(w)|Xi = x] = αw + βw'(x − X̄). Then, as discussed in the treatment effect context by Wooldridge (2007), weighting the objective function by any nonnegative function of Xi does not affect consistency of least squares.11 As a result, even if the logit model for the propensity score is misspecified, the binary response MLE γ̂ still has a well-defined probability limit, say γ*, and the IPW estimator that uses weights 1/p(Xi; γ̂) for treated observations and 1/(1 − p(Xi; γ̂)) for control observations is asymptotically equivalent to the estimator that uses weights based on γ*.12 It does not matter that for some x, e(x) ≠ p(x; γ*). This is the first part of the double robustness result: if the parametric conditional means for E[Y(w)|X = x] are correctly specified, the model for the propensity score can be arbitrarily misspecified for the true propensity score. Equation (20) still leads to a consistent estimator for τPATE.

When the conditional means are correctly specified, weighting will generally hurt in terms of asymptotic efficiency. The optimal weight is the inverse of the variance, and in general there is no reason to expect that weighting by the inverse of (one minus) the propensity score gives a good approximation to that. Specifically, under homoskedasticity of Yi(w), so that σw²(x) = σw², the IPW estimator of (αw, βw) in the context of least squares is less efficient than the unweighted estimator; see Wooldridge (2007). The motivation for propensity score weighting is different: it offers a robustness advantage for estimating τPATE.

The second part of the double robustness result assumes that the logit model (or an alternative binary response model) is correctly specified for the propensity score, so that e(x) = p(x; γ*), but allows the conditional mean functions to be misspecified.

11 More generally, it does not affect the consistency of any quasi-likelihood method that is robust for estimating the parameters of the conditional mean. These are likelihoods in the linear exponential family, as described in C. Gourieroux, A. Monfort, and A. Trognon (1984a, 1984b).
12 See Wooldridge (2007).
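The two-step procedure in equation (20) can be sketched numerically. The following is an illustrative Python sketch on simulated data, not code from the paper: the logit MLE is computed by a simple gradient ascent, and all function and variable names are our own.

```python
import numpy as np

def weighted_reg_ate(Y, W, X, n_iter=500, lr=0.5):
    """Two-step estimator in the spirit of equation (20): logit MLE for the
    propensity score, then weighted least squares in each arm with weights
    1/(1 - p-hat) for controls and 1/p-hat for treated; returns
    tau-hat = alpha1-hat - alpha0-hat."""
    N = len(Y)
    Xd = np.column_stack([np.ones(N), X])
    gamma = np.zeros(Xd.shape[1])
    for _ in range(n_iter):                       # logit MLE by gradient ascent
        p = 1.0 / (1.0 + np.exp(-Xd @ gamma))
        gamma += lr * Xd.T @ (W - p) / N
    p = 1.0 / (1.0 + np.exp(-Xd @ gamma))
    Xc = np.column_stack([np.ones(N), X - X.mean(axis=0)])  # centered covariates
    alpha = {}
    for w, wt in ((0, 1.0 / (1.0 - p)), (1, 1.0 / p)):
        m = (W == w)
        XtWX = Xc[m].T @ (Xc[m] * wt[m, None])
        XtWy = Xc[m].T @ (Y[m] * wt[m])
        alpha[w] = np.linalg.solve(XtWX, XtWy)[0]  # intercept estimates E[Y(w)]
    return alpha[1] - alpha[0]

rng = np.random.default_rng(42)
N = 4000
X = rng.normal(size=(N, 2))
e = 1.0 / (1.0 + np.exp(-0.8 * X[:, 0]))          # true propensity score
W = (rng.uniform(size=N) < e).astype(float)
Y = 1.0 + 2.0 * W + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=N)  # true ATE = 2
tau_hat = weighted_reg_ate(Y, W, X)
```

Because the linear model for E[Yi(w)|Xi = x] is correct in this simulation, the estimate would remain consistent even if the logit specification were wrong, which is the first part of the double robustness result.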


The result is that in that case α̂w → E[Yi(w)], and thus τ̂ = α̂1 − α̂0 → E[Yi(1)] − E[Yi(0)] = τPATE, and the estimator is still consistent. Let the weight for control observations be

  λi = (1 − p(Xi; γ*))⁻¹ / Σ_{j:Wj=0} (1 − p(Xj; γ*))⁻¹.

Then the least squares estimator for α0 is

(21)  α̂0 = Σ_{i=1}^{N} (1 − Wi) λi (Yi − β̂0'(Xi − X̄)).

The weights imply that E[(1 − Wi) λi Yi] = E[Yi(0)] and E[(1 − Wi) λi (Xi − X̄)] = E[Xi − X̄] = 0, and as a result α̂0 → E[Yi(0)]. Similarly, the average of the predicted values for Yi(1) converges to E[Yi(1)], and so the resulting estimator τ̂IPW = α̂1 − α̂0 is consistent for τPATE and τCATE irrespective of the shape of the regression functions. This is the second part of the double robustness result, at least for linear regression.

For certain kinds of responses, including binary responses, fractional responses, and count responses, linearity of E[Yi(w)|Xi = x] is a poor assumption. Using linear conditional expectations for limited dependent variables effectively abdicates the first part of the double robustness result. Instead, we should use coherent models of the conditional means, as well as a sensible model for the propensity score, with the hope that the mean functions, the propensity score, or both are correctly specified. Beyond specifying logically coherent models for E[Yi(w)|Xi = x] so that the first part of double robustness has a chance, for the second part we need to choose functional forms and estimators with the following property: even when the mean functions are misspecified, E[Yi(w)] = E[μ(Xi, δw*)], where δw* is the probability limit of δ̂w. Fortunately, for the common kinds of limited dependent variables used in applications, such functional forms and estimators exist; see Wooldridge (2007) for further discussion.

Once we estimate τ̂ based on (20), how should we obtain a standard error? The normalized variance still has the form V0 + V1, where Vw = E[(α̂w − μw)²] and μw = E[Yi(w)]. One option is to exploit the representation of α̂0 as a weighted average of Yi − β̂0'(Xi − X̄), and use the naive variance estimator based on weighted least squares with known weights:

(22)  V̂0 = Σ_{i:Wi=0} λi² (Yi − β̂0'(Xi − X̄) − α̂0)²,

and similarly for V̂1. In general, we may again want to adjust for the estimation of the parameters in γ̂; see Wooldridge (2007) for details.

Although combining weighting and regression is more attractive than either weighting or regression on their own, it still requires at least one of the two specifications to be accurate globally. It has been used regularly in the epidemiology literature, partly through the efforts of Robins and his coauthors, but has not been widely used in the economics literature.

5.7 Subclassification and Regression

We can also combine subclassification with regression. The advantage relative to weighting and regression is that we do not use global approximations to the regression function. The idea is that within stratum j, we estimate the average treatment effect by regressing the outcome on a constant, an indicator for the treatment, and the covariates, instead of simply taking the difference in averages by treatment status as in section 5.4. The latter can be viewed as a regression estimate based on a regression with only an intercept and the treatment indicator. The further regression adjustment simply adds (some of) the covariates to that regression. The key difference with using regression in the full sample is that, within a stratum, the propensity score varies relatively little. As a


result, the covariate distributions are similar, and the regression function is not used to extrapolate far out of sample.

To be precise, we estimate, on the observations with Bij = 1, the regression function

  Yi = αj + τj Wi + βj'Xi + εi,

by least squares, obtaining the estimates τ̂j and estimated variances V̂j. Dropping Xi from this regression leads to τ̂j = Ȳj1 − Ȳj0, which is the blocking estimator we discussed in section 5.4. We average the estimated stratum-specific average treatment effects, weighted by the relative stratum size:

  τ̂ = Σ_{j=1}^{J} (Nj/N) τ̂j,

with estimated variance

  V̂(τ̂) = Σ_{j=1}^{J} (Nj/N)² V̂j.

With a modest number of strata, this already leads to an estimator that is considerably more flexible and robust than either subclassification alone or regression alone. It is probably one of the more attractive estimators in practice. Imbens and Rubin (forthcoming) suggest data-dependent methods for choosing the number of strata.

5.8 Matching and Regression

Once we have the N pairs (Ŷi(0), Ŷi(1)), the simple matching estimator given in (19) averages the difference. This estimator may still be biased due to discrepancies between the covariates of the matched observations and their matches. One can attempt to reduce this bias by using regression methods. This use of regression is very different from using regression methods on the full sample. Here the covariate distributions are likely to be similar in the matched sample, and so regression is not used to extrapolate far out of sample.

The idea behind the regression adjustment is to replace Ŷi(0) and Ŷi(1) by

  Ỹi(0) = Yi                                           if Wi = 0,
        = (1/M) Σ_{j∈JM(i)} (Yj + β̂0'(Xi − Xj))        if Wi = 1,

and

  Ỹi(1) = (1/M) Σ_{j∈JM(i)} (Yj + β̂1'(Xi − Xj))        if Wi = 0,
        = Yi                                           if Wi = 1,

where the average of the matched outcomes is adjusted by the difference in covariates relative to the matched observation. The only question left is how to estimate the regression coefficients β0 and β1. For various methods, see D. Quade (1982), Rubin (1979), and Abadie and Imbens (2006). The methods differ in whether the difference in outcomes is modeled as linear in the difference in covariates, or the original conditional outcome distributions are approximated by linear regression functions, and in what sample the regression functions are estimated.

Here is one simple regression adjustment. To be clear, it is useful to introduce some additional notation. Given the set of matching indices JM(i), define

  X̄i(0) = Xi                          if Wi = 0,
        = (1/M) Σ_{j∈JM(i)} Xj         if Wi = 1,

  X̄i(1) = (1/M) Σ_{j∈JM(i)} Xj         if Wi = 0,
        = Xi                          if Wi = 1,

and let β̂w be based on a regression of Ŷi(w) on a constant and X̄i(w):

  (α̂w, β̂w')' = ( Σ_{i=1}^{N} [ 1, X̄i(w)' ; X̄i(w), X̄i(w) X̄i(w)' ] )⁻¹ ( Σ_{i=1}^{N} [ Ŷi(w) ; X̄i(w) Ŷi(w) ] ).
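A minimal sketch of this kind of regression-adjusted matching, for the average effect on the treated with M = 1: each matched control outcome is shifted by β̂0'(Xi − Xj) before differencing. Estimating β̂0 on the control sample is just one of the choices discussed above; the simulated data and all names are our own.

```python
import numpy as np

def bias_corrected_att(Y, W, X):
    """One-nearest-neighbor matching for the treated units, with the
    regression adjustment of section 5.8 applied to the imputed Yi(0)."""
    Xt, Yt = X[W == 1], Y[W == 1]
    Xc, Yc = X[W == 0], Y[W == 0]
    # beta0-hat from least squares of Y on a constant and X among controls.
    Zc = np.column_stack([np.ones(len(Yc)), Xc])
    beta0 = np.linalg.lstsq(Zc, Yc, rcond=None)[0][1:]
    effects = []
    for x, y in zip(Xt, Yt):
        j = int(np.argmin(((Xc - x) ** 2).sum(axis=1)))  # nearest control
        y0_tilde = Yc[j] + beta0 @ (x - Xc[j])           # adjusted imputed Yi(0)
        effects.append(y - y0_tilde)
    return float(np.mean(effects))

rng = np.random.default_rng(3)
N = 1500
X = rng.uniform(-1, 1, size=(N, 2))
W = (rng.uniform(size=N) < 0.3).astype(int)
Y = 1.0 + X @ np.array([1.0, -1.0]) + 2.0 * W + 0.2 * rng.normal(size=N)  # ATT = 2
att_hat = bias_corrected_att(Y, W, X)
```

Because the outcome is linear in the covariates here, the adjustment removes the matching discrepancy bias exactly up to sampling noise.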


Like the combination of subclassification and regression, this leads to relatively robust estimators. Abadie and Imbens (2008a) find that the method works well in simulations based on the LaLonde data.

5.9 A General Method for Estimating Variances

For some of the estimators discussed in the previous sections, particular variance estimators have been used. Assuming that a particular parametric model is valid, one can typically use standard methods based on likelihood theory or generalized method of moments theory. Often, these methods rely on consistent estimation of components of the variance. Here we discuss two general methods for estimating variances that apply to all estimators.

The first approach is to use bootstrapping (Bradley Efron and Robert J. Tibshirani 1993; A. C. Davison and D. V. Hinkley 1997; Joel L. Horowitz 2001). Bootstrapping has been widely used in the treatment effects literature, as it is straightforward to implement. It has rarely been formally justified, although in many cases it is likely to be valid given that many of the estimators are asymptotically linear. However, in some cases it is known that bootstrapping is not valid. Abadie and Imbens (2008a) show that, for a fixed number of matches, bootstrapping is not valid for matching estimators. It is likely that the problems that invalidate the bootstrap disappear if the number of matches increases with the sample size (thus, the bootstrap might be valid for kernel estimators). Nevertheless, because in practice researchers often use a small number of matches, or nonnegative kernels, it is not clear whether the bootstrap is an effective method for obtaining standard errors and constructing confidence intervals. In cases where bootstrapping is not valid, often subsampling (Dimitris N. Politis, Joseph P. Romano, and Michael Wolf 1999) remains valid, but this has not been applied in practice.

There is an alternative, general, method for estimating variances of treatment effect estimators, developed by Abadie and Imbens (2006), that does not require additional nonparametric estimation. First, recall that most estimators are of the form

  τ̂ = Σ_{i=1}^{N} λi Yi,  with  Σ_{i:Wi=1} λi = 1  and  Σ_{i:Wi=0} λi = −1,

with the weights λi generally functions of all covariates and all treatment indicators. Conditional on the covariates and the treatment indicators (and thus fixing τCATE), the variance of such an estimator is

  V(τ̂ | X1, …, XN, W1, …, WN) = Σ_{i=1}^{N} λi² σ²Wi(Xi).

In order to use this representation, we need estimates of σ²Wi(Xi) for all i. Fortunately, these need not be consistent estimates, as long as the estimation errors are not too highly correlated, so that the weighted average of the estimates is consistent for the weighted average of the variances. This is similar to the way robust (Huber–Eicker–White) standard errors allow for general forms of heteroskedasticity without having to consistently estimate the conditional variance function.

Abadie and Imbens (2006) suggested using a matching estimator for σ²Wi(Xi). The idea behind this matching variance estimator is that if we can find two treated units with Xi = x, we can estimate σ1²(x) as σ̂1²(x) = (Yi − Yj)²/2. In general, it is difficult to find exact matches but, again, this is not necessary. Instead, one uses the closest match within the set of units with the same treatment status. Let v(i) be the unit closest to i with the same treatment indicator (Wv(i) = Wi), so that

  ‖Xv(i) − Xi‖ = min_{j≠i: Wj=Wi} ‖Xj − Xi‖.

Then we can estimate σ²Wi(Xi) as


  σ̂²Wi(Xi) = (Yi − Yv(i))²/2.

This way we can estimate σ²Wi(Xi) for all units. Note that these are not consistent estimators of the conditional variances. As the sample size increases, the bias of these estimators will disappear, just as we saw that the bias of the matching estimator for the average treatment effect disappears under similar conditions.

We then use these estimates of the conditional variance to estimate the variance of the estimator:

  V̂ = Σ_{i=1}^{N} λi² σ̂²Wi(Xi).

An extension to allow for clustering has been developed by Samuel Hanson and Adi Sunderam (2008).

5.10 Overlap in Covariate Distributions

In practice, a major concern in applying methods under the assumption of unconfoundedness is lack of overlap in the covariate distributions. In fact, once one is committed to the unconfoundedness assumption, this may well be the main problem facing the analyst. The overlap issue was highlighted in papers by Dehejia and Wahba (1999) and Heckman, Ichimura, and Todd (1998). Dehejia and Wahba reanalyzed data on a job training program originally analyzed by LaLonde (1986). LaLonde (1986) had attempted to replicate results from an experimental evaluation of a job training program, the National Supported Work (NSW) program, using a comparison group constructed from two public use data sets, the Panel Study of Income Dynamics (PSID) and the Current Population Survey (CPS). The NSW program targeted individuals who were disadvantaged, with very poor labor market histories. As a result, they were very different from the raw comparison groups constructed by LaLonde from the CPS and PSID. LaLonde partially addressed this problem by limiting his raw comparison samples based on single covariate criteria (e.g., limiting it to individuals with zero earnings in the year prior to the program). Dehejia and Wahba looked at this problem more systematically and found that a major concern is the lack of overlap in the covariate distributions.

Traditionally, overlap in the covariate distributions was assessed by looking at summary statistics of the covariate distributions by treatment status. As discussed before in the introduction to section 5, it is particularly useful to report differences in average covariates normalized by the square root of the sum of the within-treatment-group variances. In table 2, we report, for the LaLonde data, averages and standard deviations of the basic covariates, and the normalized difference. For four out of the ten covariates the means are more than a standard deviation apart. This immediately suggests that the technical task of adjusting for differences in the covariates is a challenging one. Although reporting normalized differences in covariates by treatment status is a sensible starting point, inspecting differences one covariate at a time is not generally sufficient. Even if all these differences are small, there may still be areas with limited overlap. Formally, we are concerned with regions in the covariate space where the density of covariates in one treatment group is zero and the density in the other treatment group is not. This corresponds to the propensity score being equal to zero or one. Therefore, a more direct way of assessing the overlap in covariate distributions is to inspect histograms of the estimated propensity score by treatment status.

Once it has been established that overlap is a concern, several strategies can be used. We briefly discuss two of the earlier specific suggestions, and then describe in more detail two general methods. In practice, researchers have often simply dropped observations with propensity score close to zero or one, with the actual cutoff value chosen in an ad hoc fashion. Dehejia and Wahba (1999) focus on the average effect for the treated. After estimating the propensity score, they find the smallest


TABLE 2
Balance Improvements in the LaLonde Data (Dehejia–Wahba Sample)

                     CPS Controls    NSW Treated        Normalized Difference Treated-Controls
                     (15992)         (185)           All        e(Xi) >= ê1   P-score   Maha
Covariate            mean (s.d.)     mean (s.d.)     (16177)    (6286)        (370)     (370)

Age                  33.23 (11.05)   25.82 (7.16)    -0.56      -0.25         -0.08     -0.16
Education            12.03 (2.87)    10.35 (2.01)    -0.48      -0.30         -0.02     -0.09
Married              0.71 (0.45)     0.19 (0.39)     -0.87      -0.46         -0.01     -0.20
Nodegree             0.30 (0.46)     0.71 (0.46)      0.64       0.42          0.08      0.18
Black                0.07 (0.26)     0.84 (0.36)      1.72       1.45         -0.02      0.00
Hispanic             0.07 (0.26)     0.06 (0.24)     -0.04      -0.22         -0.02      0.00
Earn '74             14.02 (9.57)    2.10 (4.89)     -1.11      -0.40         -0.07      0.00
Earn '74 positive    0.88 (0.32)     0.29 (0.46)     -1.05      -0.72         -0.07      0.00
Earn '75             13.65 (9.27)    1.53 (3.22)     -1.23      -0.35         -0.02     -0.01
Earn '75 positive    0.89 (0.31)     0.40 (0.49)     -0.84      -0.54         -0.09      0.00
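The normalized differences reported in the table can be computed as in the following small sketch. Note that the denominator is the square root of the sum of the two within-group sample variances, not a pooled standard deviation; the function name is ours.

```python
import numpy as np

def normalized_difference(x_treated, x_control):
    """(mean_t - mean_c) / sqrt(s_t^2 + s_c^2), the balance measure
    reported in tables 2 and 3."""
    num = np.mean(x_treated) - np.mean(x_control)
    den = np.sqrt(np.var(x_treated, ddof=1) + np.var(x_control, ddof=1))
    return num / den

# Toy check: means 2 and 1, each sample variance 1 -> 1/sqrt(2) ~ 0.707.
nd = normalized_difference(np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 2.0]))
```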

value of the estimated propensity score among the treated units, ê1 = min_{i:Wi=1} ê(Xi). They then drop all control units with an estimated propensity score lower than this threshold ê1. The idea behind this suggestion is that control units with very low values for the propensity score may be so different from treated units that including them in the analysis is likely to be counterproductive. (In effect, the population over which the treatment effects are calculated is redefined.) A concern is that the results may be sensitive to the choice of the specific threshold ê1. If, for example, one used as the threshold the K-th order statistic of the estimated propensity score among the treated (Lechner 2002a, 2002b), the results might change considerably. In the sixth column of table 2, we report the normalized difference (normalized using the same denominator, equal to the square root of the sum of the within-treatment-group sample variances) after removing 9,891 (out of a total 16,177) control observations whose estimated propensity score was smaller than the smallest value of the estimated propensity score among the treated, ê1 = 0.00051. This improves the covariate balance, but many of the normalized differences are still substantial.

Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) develop a different method. They focus on estimation of the set where the density of the propensity score conditional on the treatment is bounded away from zero for both treatment regimes. Specifically, they first estimate the density functions f(e | W = w), for w = 0, 1, nonparametrically. They then evaluate the estimated density f̂(ê(Xi) | Wi = 0) for all N values Xi, and the same for the estimated density f̂(ê(Xi) | Wi = 1) for all N values Xi. Given these 2N values, they calculate the 2Nq order statistic of these 2N estimated densities. Denote this order statistic by f̂q. Then, for each unit i, they compare the estimated density f̂(ê(Xi) | Wi = 0) to f̂q, and f̂(ê(Xi) | Wi = 1) to f̂q. If either of those estimated densities is below the order statistic, the observation gets dropped from the analysis. Smith and Todd (2005) implement this method with q = 0.02, but provide no motivation for the choice of the threshold.
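The rule just described can be sketched as follows, using a Gaussian kernel density estimate for f(e | W = w). The bandwidth, the simulated data, and the function names are our illustrative choices, not part of the original procedure.

```python
import numpy as np

def density_trim(e_hat, W, q=0.02, bw=0.05):
    """Keep unit i only if both f-hat(e_i | W=0) and f-hat(e_i | W=1) exceed
    the 2Nq order statistic of all 2N estimated density values."""
    def kde(points, at):
        z = (at[:, None] - points[None, :]) / bw
        return np.exp(-0.5 * z ** 2).mean(axis=1) / (bw * np.sqrt(2 * np.pi))
    f0 = kde(e_hat[W == 0], e_hat)   # f-hat(e_i | W = 0) at every unit
    f1 = kde(e_hat[W == 1], e_hat)   # f-hat(e_i | W = 1) at every unit
    cutoff = np.sort(np.concatenate([f0, f1]))[int(2 * len(e_hat) * q)]
    return (f0 > cutoff) & (f1 > cutoff)

rng = np.random.default_rng(7)
e_hat = rng.uniform(0.02, 0.98, size=500)   # estimated propensity scores
W = rng.uniform(size=500) < e_hat           # treatment indicators
keep = density_trim(e_hat, W, q=0.02)
```

Units near the extremes of the score distribution, where one arm's density is thin, are the ones dropped.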


5.10.1 Matching to Improve Overlap in Covariate Distributions

A systematic method for dropping control units who are different from the treated units is to construct a matched sample. This approach has been pushed by Rubin in a series of studies; see Rubin (2006). It is designed for settings where the interest is in the average effect for the treated (e.g., as in the LaLonde application). It relies on the control sample being larger than the treated sample, and works especially well when the control sample is much larger.

First, the treated observations are ordered, typically by decreasing values of the estimated propensity score, since treated observations with high values of the propensity score are generally more difficult to match. Then the first treated unit (e.g., the one with the highest value for the estimated propensity score) is matched to the nearest control unit. Next, the second treated unit is matched to the nearest control unit, excluding the control unit that was used as a match for the first treated unit. Matching without replacement all treated units in this manner leads to a sample of 2 × N1 units (where N1 is the size of the original treated subsample), half of them treated and half of them control units. Note that the matching is not necessarily used here as the final analysis. We do not propose to estimate the average treatment effect for the treated by averaging the differences within the pairs. Instead, this is intended as a preliminary analysis, with the goal being the construction of a sample with more overlap. Given a more balanced sample, one can use any of the previously discussed methods for estimating the average effect of the treatment, including regression, propensity score methods, or matching. Using those methods on the balanced sample is likely to reduce bias relative to using the simple difference in averages by treatment status.

The last two columns in table 2 report the balance in the ten covariates after constructing a matched sample in this fashion. In both cases the treated units were matched in reverse order of the estimated propensity score. The seventh column is based on matching on the estimated propensity score, and the last column is based on matching on all the covariates, using the Mahalanobis metric (the inverse of the covariance matrix of the covariates). Matching, either on the estimated propensity score or on the full set of covariates, dramatically improves the balance. Whereas before some of the covariates differed by as much as 1.7 times a standard deviation, now the normalized differences are all less than one tenth of a standard deviation. The remaining differences are not negligible, however. For example, average differences in 1974 earnings are still on the order of $700, which, given the experimental estimate from the LaLonde (1986) paper of about $2,000, is substantial. As a result, simple estimators such as the average of the within-matched-pair differences are not likely to lead to credible estimates. Nevertheless, maintaining unconfoundedness, this matched sample is sufficiently well balanced that one may be able to obtain credible and robust estimates from it in a way that the original sample would not allow.

5.10.2 Trimming to Improve Overlap in Covariate Distributions

Matching without replacement, as described above, does not work if the estimand of interest is the overall average treatment effect. For that case, Crump et al. (2009) suggest an easily implementable way of selecting the subpopulation with overlap, consistent with the current practice of dropping observations with propensity score values close to zero or one. Their method is generally applicable and in particular does not require that the control sample is larger than the treated sample. They consider estimation of the average treatment effect for the subpopulation with Xi ∈ A. They suggest choosing the set A from the set of all subsets of the


covariate space to minimize the asymptotic variance of the efficient estimator of the average treatment effect for that set. Under some conditions (in particular homoskedasticity), they show that the optimal set A* depends only on the value of the propensity score. This method suggests discarding observations with a propensity score less than α away from the two extremes, zero and one:

  A* = {x ∈ X | α ≤ e(x) ≤ 1 − α},

where α satisfies a condition based on the marginal distribution of the propensity score:

  1/(α(1 − α)) = 2 · E[ 1/(e(X)(1 − e(X))) | 1/(e(X)(1 − e(X))) ≤ 1/(α(1 − α)) ].

Based on empirical examples and numerical calculations with beta distributions for the propensity score, Crump et al. (2009) suggest that the rule of thumb fixing α at 0.10 gives good results.

To illustrate this method, table 3 presents summary statistics for data from Imbens, Rubin, and Sacerdote (2001) on lottery players, including "winners" who won big prizes, and "losers" who did not. Even though winning the lottery is obviously random, variation in the number of tickets bought, and nonresponse, creates imbalances in the covariate distributions. In the full sample (sample size N = 496), some of the covariates differ by as much as 0.64 standard deviations. Following the Crump et al. calculations leads to a bound of 0.0914. Discarding the observations with an estimated propensity score outside the interval [0.0914, 0.9086] leads to a sample size of 388. In this subsample, the largest normalized difference is 0.35, about half of what it is in the full sample, with this improvement obtained by dropping approximately 20 percent of the original sample.

A potentially controversial feature of all these methods is that they change what is being estimated. Instead of estimating τPATE, the Crump et al. (2009) approach estimates τCATE,A. This results in reduced external validity, but it is likely to improve internal validity.

5.11 Assessing the Unconfoundedness Assumption

The unconfoundedness assumption used in section 5 is not testable. It states that the conditional distribution of the outcome under the control treatment, given receipt of the active treatment and covariates, is identical to the distribution of the control outcome conditional on being in the control group and covariates. A similar assumption is made for the distribution of the treatment outcome. Yet, since the data are completely uninformative about the distribution of Yi(0) for those who received the active treatment and of Yi(1) for those receiving the control, the data can never reject the unconfoundedness assumption. Nevertheless, there are often indirect ways of assessing this assumption. The most important of these were developed in Rosenbaum (1987) and Heckman and Hotz (1989). Both methods rely on testing the null hypothesis that an average causal effect is zero, where the particular average causal effect is known to equal zero. If the testing procedure rejects the null hypothesis, this is interpreted as weakening the support for the unconfoundedness assumption. These tests can be divided into two groups.

The first set of tests focuses on estimating the causal effect of a treatment that is known not to have an effect. It relies on the presence of two or more control groups (Rosenbaum 1987). Suppose one has two potential control groups, for example eligible nonparticipants and ineligibles, as in Heckman, Ichimura, and


TABLE 3
Balance Improvements in the Lottery Data
Losers Winners Normalized Difference Treated-Controls

(259) (237) All 0.0914 < e(Xi) < 0.9086


Covariate (s.d.) (s.d.) (496) (388)
Year Won 1996.4 (1.0) 1996.1 (1.3) -0.19 -0.13
Tickets Bought 2.19 (1.77) 4.57 (3.28) 0.64 0.33
Age 53.2 (12.9) 47.0 (13.8) -0.33 -0.19
Male 0.67 (0.47) 0.58 (0.49) -0.13 -0.09
Years of Schooling 14.4 (2.0) 13.0 (2.2) -0.50 -0.35
Working Then 0.77 (0.42) 0.80 (0.40) 0.06 -0.02
Earnings Year-6 15.6 (14.5) 12.0 (11.8) -0.19 -0.10
Earnings Year-5 16.0 (15.0) 12.1 (12.0) -0.20 -0.12
Earnings Year-4 16.2 (15.4) 12.0 (12.1) -0.21 -0.15
Earnings Year-3 16.6 (16.3) 12.8 (12.7) -0.18 -0.14
Earnings Year-2 17.6 (16.9) 13.5 (13.0) -0.19 -0.15
Earnings Year-1 18.0 (17.2) 14.5 (13.6) -0.16 -0.14
Pos Earnings Year-6 0.69 (0.46) 0.70 (0.46) 0.02 0.05
Pos Earnings Year-5 0.68 (0.47) 0.74 (0.44) 0.10 0.07
Pos Earnings Year-4 0.69 (0.46) 0.73 (0.44) 0.07 0.02
Pos Earnings Year-3 0.68 (0.47) 0.73 (0.44) 0.09 0.02
Pos Earnings Year-2 0.68 (0.47) 0.74 (0.44) 0.10 0.04
Pos Earnings Year-1 0.69 (0.46) 0.74 (0.44) 0.07 0.02
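The balance measures in table 3 can be reproduced mechanically. The sketch below, in Python with hypothetical variable names, is an illustration rather than the authors' code: it computes the normalized difference used in the table (the difference in means divided by the square root of the sum of the two sample variances, which matches the tabulated values, e.g., -0.33 for Age) and an overlap mask that discards units with an estimated propensity score outside [0.0914, 0.9086].

```python
import numpy as np

def normalized_difference(x_treat, x_control):
    # Difference in means scaled by sqrt(s_t^2 + s_c^2),
    # the normalization used in table 3.
    x_treat = np.asarray(x_treat, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    num = x_treat.mean() - x_control.mean()
    den = np.sqrt(x_treat.var(ddof=1) + x_control.var(ddof=1))
    return num / den

def overlap_mask(e_hat, alpha=0.0914):
    # Keep units whose estimated propensity score lies in
    # [alpha, 1 - alpha], as in the Crump et al. trimming rule.
    e_hat = np.asarray(e_hat, dtype=float)
    return (e_hat >= alpha) & (e_hat <= 1.0 - alpha)
```

Recomputing the normalized differences on the trimmed subsample (the mask applied to both groups) yields the last column of the table.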

Todd (1997). One can estimate a "pseudo" average treatment effect by analyzing the data from these two control groups as if one of them is the treatment group. In that case, the treatment effect is known to be zero, and statistical evidence of a non-zero effect implies that at least one of the control groups is invalid. Again, not rejecting the test does not imply the unconfoundedness assumption is valid (as both control groups could suffer the same bias), but nonrejection in the case where the two control groups could potentially have different biases makes it more plausible that the unconfoundedness assumption holds. The key for the power of this test is to have available control groups that are likely to have different biases, if they have any at all. Comparing ineligibles and eligible nonparticipants as in Heckman, Ichimura and Todd (1997) is a particularly attractive comparison. Alternatively one may use geographically distinct comparison groups, for example from areas bordering on different sides of the treatment group.

To be more specific, let Gi be an indicator variable denoting the membership of the group, taking on three values, Gi ∈ {-1, 0, 1}. For units with Gi = -1 or 0, the treatment indicator Wi is equal to 0:

    Wi = 0 if Gi ∈ {-1, 0},
         1 if Gi = 1.

Unconfoundedness only requires that

(23)    Yi(0), Yi(1) ⊥ Wi | Xi,

48 Journal of Economic Literature, Vol. XLVII (March 2009)

and this is not testable. Instead we focus on testing an implication of the stronger conditional independence relation

(24)    Yi(0), Yi(1) ⊥ Gi | Xi.

This independence condition implies (23), but in contrast to that assumption, it also implies testable restrictions. In particular, we focus on the implication that

(25)    Yi(0) ⊥ Gi | Xi, Gi ∈ {-1, 0}
        ⇔  Yi ⊥ Gi | Xi, Gi ∈ {-1, 0},

because Gi ∈ {-1, 0} implies that Yi = Yi(0).

Because condition (24) is slightly stronger than unconfoundedness, the question is whether there are interesting settings where the weaker condition of unconfoundedness holds, but not the stronger condition. To discuss this question, it is useful to consider two alternative conditional independence conditions, both of which are implied by (24):

(26)    (Yi(0), Yi(1)) ⊥ Wi | Xi, Gi ∈ {-1, 1},

and

(27)    (Yi(0), Yi(1)) ⊥ Wi | Xi, Gi ∈ {0, 1}.

If (26) holds, then we can estimate the average causal effect by invoking the unconfoundedness assumption using only the first control group. Similarly, if (27) holds, then we can estimate the average causal effect by invoking the unconfoundedness assumption using only the second control group. The point is that it is difficult to envision a situation where unconfoundedness based on the two comparison groups holds (that is, (23) holds), but it does not hold using only one of the two comparison groups at a time. In practice, it seems likely that if unconfoundedness holds, then so would the stronger condition (24), and we have the testable implication (25).

Next, we turn to implementation of the tests. We can simply test whether there is a difference in average values of Yi between the two control groups, after adjusting for differences in Xi. That is, we effectively test whether

    E[ E[Yi | Gi = -1, Xi] - E[Yi | Gi = 0, Xi] ] = 0.

More generally we may wish to test

    E[Yi | Gi = -1, Xi = x] - E[Yi | Gi = 0, Xi = x] = 0

for all x in the support of Xi, using the methods discussed in Crump et al. (2008b). We can also include transformations of the basic outcomes in the procedure to test for differences in other aspects of the conditional distributions.

A second set of tests of unconfoundedness focuses on estimating the causal effect of the treatment on a variable known to be unaffected by it, typically because its value is determined prior to the treatment itself. Such a variable can be time-invariant, but the most interesting case is in considering the treatment effect on a lagged outcome. If it is not zero, this implies that the treated observations are distinct from the controls; namely that the distribution of Yi(0) for the treated units is not comparable to the distribution of Yi(0) for the controls. If the treatment effect is instead zero, it is more plausible that the unconfoundedness assumption holds. Of course this does not directly test the unconfoundedness assumption; in this setting, being able to reject the null of no effect does not directly reflect on the hypothesis of interest, unconfoundedness. Nevertheless, if the variables used in this proxy test are closely related to the outcome of interest, the test arguably has more power. For these tests it is clearly helpful to have a number of lagged outcomes.
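The first implementation above, comparing two control groups after adjusting for covariates, can be sketched as a regression-based placebo test. The code below is our own illustrative sketch (function and variable names are hypothetical, and a linear adjustment stands in for the more general methods cited in the text): it drops the treated units, regresses the outcome on an indicator for one of the two control groups plus the covariates, and reports the coefficient on that indicator and its t-statistic. A large t-statistic is evidence against (25) and weakens the case for unconfoundedness.

```python
import numpy as np

def pseudo_effect_test(y, g, X):
    # Among the two control groups (g = -1 or 0), regress the outcome
    # on a pseudo "treatment" indicator for g = -1 plus covariates,
    # and return the indicator's OLS coefficient and t-statistic.
    y = np.asarray(y, dtype=float)
    g = np.asarray(g)
    X = np.asarray(X, dtype=float)
    keep = (g == -1) | (g == 0)          # drop the treated units (g = 1)
    y, g, X = y[keep], g[keep], X[keep]
    d = (g == -1).astype(float)          # pseudo treatment indicator
    Z = np.column_stack([np.ones(len(y)), d, X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    n, k = Z.shape
    sigma2 = resid @ resid / (n - k)     # homoskedastic error variance
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    return beta[1], t_stat
```

Under the null of (25) the t-statistic should be small; a robust (heteroskedasticity-consistent) variance or the nonparametric tests of Crump et al. (2008b) would be natural refinements.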


First partition the vector of covariates Xi into two parts, a (scalar) pseudo outcome, denoted by Xi^p, and the remainder, denoted by Xi^r, so that Xi = (Xi^p, Xi^r')'. Now we will assess whether the following conditional independence relation holds:

(28)    Xi^p ⊥ Wi | Xi^r.

The two issues are, first, the interpretation of this condition and its relationship to the unconfoundedness assumption, and second, the implementation of the test.

The first issue concerns the link between the conditional independence relation in (28) and original unconfoundedness. This link, by necessity, is indirect, as unconfoundedness cannot be tested directly. Here we lay out the arguments for the connection. First consider a related condition:

(29)    Yi(0), Yi(1) ⊥ Wi | Xi^r.

If this modified unconfoundedness condition were to hold, one could use the adjustment methods using only the subset of covariates Xi^r. In practice, though not necessarily, this is a stronger condition than the original unconfoundedness condition, which requires conditioning on the full vector Xi. One has to be careful here, because it is theoretically possible that conditional on a subset of the covariates unconfoundedness holds, but at the same time unconfoundedness does not hold conditional on the full set of covariates. In practice this situation is rare though. For example, it is difficult to imagine in an evaluation of a labor market program that unconfoundedness would hold given age and the level of education, but not if one additionally conditions on gender. Generally, making subpopulations more homogeneous in pretreatment variables tends to improve the plausibility of unconfoundedness.

The modified unconfoundedness condition (29) is not testable for the same reasons the original unconfoundedness assumption is not testable. Nevertheless, if one has a proxy for either of the potential outcomes, and in particular a proxy that is observed irrespective of the treatment status, one can test independence for that proxy variable. We use the pseudo outcome Xi^p as such a proxy variable. That is, we view Xi^p as a proxy for, say, Yi(0), and assess (29) by testing (28).

The most convincing applications of these assessments are settings where the two links are plausible. One of the leading examples is where Xi contains multiple lagged measures of the outcome. For example, in the evaluation of the effect of a labor market program on annual earnings, one might have observations on earnings for, say, six years prior to the program. Denote these lagged outcomes by Y_{i,-1}, ..., Y_{i,-6}, where Y_{i,-1} is the most recent and Y_{i,-6} is the most distant preprogram earnings measure. One could implement the above ideas using earnings for the most recent preprogram year Y_{i,-1} as the pseudo outcome Xi^p, so that the vector of remaining pretreatment variables Xi^r would still include the five prior years of preprogram earnings Y_{i,-2}, ..., Y_{i,-6} (ignoring additional pretreatment variables). In that case one might reasonably argue that if unconfoundedness holds given six years of preprogram earnings, it is plausible that it would also hold given only five years of preprogram earnings. Moreover, under unconfoundedness Yi(0) is independent of Wi given Y_{i,-1}, ..., Y_{i,-6}, which would suggest that it is plausible that Y_{i,-1} is independent of Wi given Y_{i,-2}, ..., Y_{i,-6}. Given those arguments, one can plausibly assess unconfoundedness by testing whether

    Y_{i,-1} ⊥ Wi | Y_{i,-2}, ..., Y_{i,-6}.

The implementation of the tests is the same as in the first set of tests for assessing unconfoundedness. We can simply test whether estimates of the average difference between


the groups adjusted for differences in Xi^r are zero, or test whether the average difference is zero for all values of the covariates (e.g., Crump et al. 2008).

5.12 Testing

Most of the focus in the evaluation literature has been on estimating average treatment effects. Testing has largely been limited to the null hypothesis that the average effect is zero. In that case testing is straightforward since many estimators exist for the average treatment effect that are approximately normally distributed in large samples with zero asymptotic bias. In addition there is some testing based on the Fisher approach using the randomization distribution. In many cases, however, there are other null hypotheses of interest. Crump et al. (2008) develop tests of the null hypotheses of zero average effects conditional on the covariates, and of a constant average effect conditional on the covariates. Formally, in the first case the null hypothesis is

(30)    H0 : τ(x) = 0, ∀x,

against the alternative hypothesis

        Ha : τ(x) ≠ 0, for some x.

Recall that τ(x) = E[Yi(1) - Yi(0) | Xi = x] is the average effect for the subpopulation with covariate value x. The second hypothesis studied by Crump et al. (2008) is

(31)    H0 : τ(x) = τ_PATE, ∀x,

against the alternative hypothesis

        Ha : τ(x) ≠ τ_PATE, for some x.

Part of their motivation is that in many cases there is substantive interest in whether the program is beneficial for some groups, even if on average it does not affect outcomes.13 They show that in some data sets they reject the null hypothesis (30) even though they cannot reject the null hypothesis of a zero average effect.

Taking the motivation in Crump et al. (2008) one step further, one may also be interested in testing the null hypothesis that the conditional distribution of Yi(0) given Xi = x is the same as the conditional distribution of Yi(1) given Xi = x. Under the maintained hypothesis of unconfoundedness, this is equivalent to testing the null hypothesis

        H0 : Yi ⊥ Wi | Xi,

against the alternative hypothesis that Yi is not independent of Wi given Xi. Tests of this type can be implemented using the methods of Linton and Pedro Gozalo (2003). There have been no applications of these tests in the program evaluation literature.

5.13 Selection of Covariates

A very important set of decisions in implementing all of the methods described in this section involves the choice of covariates to be included in the regression functions or the propensity score. Except for warnings about including covariates that are themselves influenced by the treatment (for example, Heckman and Salvador Navarro-Lozano 2004; Wooldridge 2005), the literature has not been very helpful. Consequently, researchers have just included all covariates linearly, without much systematic effort to find more compelling specifications. Most of the technical results using nonparametric methods include rates at which the smoothing

13 A second motivation is that it may be impossible to obtain precise estimates for τ_PATE even in cases where one can convincingly reject some of the hypotheses regarding τ(x).


parameters should change with the sample size. For example, using regression estimators, one would have to choose the bandwidth if using kernel estimators, or the number of terms in the series if using series estimators. The program evaluation literature does not provide much guidance as to how to choose these smoothing parameters in practice. More generally, the nonparametric estimation literature has little to offer in this regard. Most of the results in this literature offer optimal choices for smoothing parameters if the criterion is integrated squared error. In the current setting the interest is in a scalar parameter, and the choice of smoothing parameter that is optimal for the regression function itself need not be close to optimal for the average treatment effect.

Hirano and Imbens (2001) consider an estimator that combines weighting with the propensity score and regression. In their application they have a large number of covariates, and they suggest deciding which ones to include on the basis of t-statistics. They find that the results are fairly insensitive to the actual cutoff point if they use the weight/regression estimator, but find more sensitivity if they only use weighting or regression. They do not provide formal properties for these choices.

Ichimura and Linton (2005) consider inverse probability weighting estimators and analyze the formal problem of bandwidth selection with the focus on the average treatment effect. Imbens, Newey and Ridder (2005) look at series regression estimators and analyze the choice of the number of terms to be included, again with the objective being the average treatment effect. Imbens and Rubin (forthcoming) discuss some stepwise covariate selection methods for finding a specification for the propensity score.

It is clear that more work needs to be done in this area, both for the case where the choice is which covariates to include from a large set of potential covariates, and in the case where the choice concerns functional form and functions of a small set of covariates.

6. Selection on Unobservables

In this section we discuss a number of methods that relax the pair of assumptions made in section 5. Unlike in the setting under unconfoundedness, there is not a unified set of methods for this case. In a number of special cases there are well understood methods, but there are many cases without clear recommendations. We will highlight some of the controversies and different approaches. First we discuss some methods that simply drop the unconfoundedness assumption. Next, in section 6.2, we discuss sensitivity analyses that relax the unconfoundedness assumption in a more limited manner. In section 6.3, we discuss instrumental variables methods. Then, in section 6.4 we discuss regression discontinuity designs, and in section 6.5 we discuss difference-in-differences methods.

6.1 Bounds

In a series of papers and books, Manski (1990, 1995, 2003, 2005, 2007) has developed a general framework for inference in settings where the parameters of interest are not identified. Manski's key insight is that even if in large samples one cannot infer the exact value of the parameter, one may be able to rule out some values that one could not rule out a priori. Prior to Manski's work, researchers had typically dismissed models that are not point-identified as not useful in practice. This framework is not restricted to causal settings, and the reader is referred to Manski (2007) for a general discussion of the approach. Here we limit the discussion to program evaluation settings.

We start by discussing Manski's perspective in a very simple case. Suppose we have no covariates and a binary outcome Yi ∈ {0, 1}. Let the goal be inference for the


average effect in the population, τ_PATE. We can decompose the population average treatment effect as

    τ_PATE = E[Yi(1) | Wi = 1] · pr(Wi = 1)
           + E[Yi(1) | Wi = 0] · pr(Wi = 0)
           - [ E[Yi(0) | Wi = 1] · pr(Wi = 1)
           +   E[Yi(0) | Wi = 0] · pr(Wi = 0) ].

Of the eight components of this expression, we can estimate six. The data contain no information about the remaining two, E[Yi(1) | Wi = 0] and E[Yi(0) | Wi = 1]. Because the outcome is binary, and before seeing any data, we can deduce that these two conditional expectations must lie inside the interval [0, 1], but we cannot say any more without additional assumptions. This implies that without additional assumptions we can be sure that

    τ_l ≤ τ_PATE ≤ τ_u,

where we can express the lower and upper bound in terms of estimable quantities,

    τ_l = E[Yi(1) | Wi = 1] · pr(Wi = 1)
        - pr(Wi = 1) - E[Yi(0) | Wi = 0] · pr(Wi = 0),

and

    τ_u = E[Yi(1) | Wi = 1] · pr(Wi = 1)
        + pr(Wi = 0) - E[Yi(0) | Wi = 0] · pr(Wi = 0).

In other words, we can bound the average treatment effect. In this example the bounds are tight, meaning that without additional assumptions we cannot rule out any value inside the bounds. See Manski et al. (1992) for an empirical example of these particular bounds.

In this specific case the bounds are not particularly informative. The width of the bounds, the difference τ_u - τ_l, with τ_l and τ_u given above, is always equal to one, implying we can never rule out a zero average treatment effect. (In some sense this is obvious: if we refrain from making any assumptions regarding the treatment effects, we cannot rule out that the treatment effect is zero for any unit.) In general, however, we can add some assumptions, short of making the type of assumption as strong as unconfoundedness that gets us back to the point-identified case. With such weaker assumptions we may be able to tighten the bounds and obtain informative results, without making the strong assumptions that strain credibility. The presence of covariates increases the scope for additional assumptions that may tighten the bounds. Examples of such assumptions include those in the spirit of instrumental variables, where some covariates are known not to affect the potential outcomes (e.g., Manski 2007), or monotonicity assumptions where expected outcomes are monotonically related to covariates or treatments (e.g., Manski and John V. Pepper 2000). For an application of these methods, see Hotz, Charles H. Mullin, and Seth G. Sanders (1997). We return to some of these settings in section 6.3.

This discussion has focused on identification and demonstrated what can be learned in large samples. In practice these bounds need to be estimated, which leads to additional uncertainty regarding the estimands. A fast developing literature (e.g., Horowitz and Manski 2000; Imbens and Manski 2004; Chernozhukov, Hong, and Elie Tamer 2007; Arie Beresteanu and Francesca Molinari 2006; Romano and Azeem M. Shaikh 2006a, 2006b; Ariel Pakes et al. 2006; Adam M. Rosen 2006; Donald W. K. Andrews and


Gustavo Soares 2007; Ivan A. Canay 2007; and Jörg Stoye 2007) discusses construction of confidence intervals in general settings with partial identification. One point of contention in this literature has been whether the focus should be on confidence intervals for the parameter of interest (τ_PATE in this case), or for the identified set. Imbens and Manski (2004) develop confidence sets for the parameter. In large samples, and at a 95 percent confidence level, the Imbens-Manski confidence intervals amount to taking the lower bound minus 1.645 times the standard error of the lower bound and the upper bound plus 1.645 times its standard error. The reason for using 1.645 rather than 1.96 is to take account of the fact that, even in the limit, the width of the confidence set will not shrink to zero, and therefore one only needs to be concerned with one-sided errors. Chernozhukov, Hong, and Tamer (2007) focus on confidence sets that include the entire partially identified set itself with fixed probability. For a given confidence level, the latter approach generally leads to larger confidence sets than the Imbens-Manski approach. See also Romano and Shaikh (2006a, 2006b) for subsampling approaches to inference in these settings.

6.2 Sensitivity Analysis

Unconfoundedness has traditionally been seen as an all or nothing assumption: either it is satisfied and one proceeds accordingly using the methods appropriate under unconfoundedness, such as matching, or the assumption is deemed implausible and one considers alternative methods. The latter include the bounds approach discussed in section 6.1, as well as approaches relying on alternative assumptions, such as instrumental variables, which will be discussed in section 6.3. However, there is an important alternative that has received much less attention in the economics literature. Instead of completely relaxing the unconfoundedness assumption, the idea is to relax it slightly. More specifically, violations of unconfoundedness are interpreted as evidence of the presence of unobserved covariates that are correlated, both with the potential outcomes and with the treatment indicator. The size of the bias these violations of unconfoundedness can induce depends on the strength of these correlations. Sensitivity analyses investigate whether results obtained under the maintained assumption of unconfoundedness can be changed substantially, or even overturned entirely, by modest violations of the unconfoundedness assumption.

To be specific, consider a job training program with voluntary enrollment. Suppose that we have monthly labor market histories for a two year period prior to the program. We may be concerned that individuals choosing to enroll in the program are more motivated to find a job than those that choose not to enroll in the program. This unobserved motivation may be related to subsequent earnings both in the presence and in the absence of training. Conditioning on the recent labor market histories of individuals may limit the bias associated with this unobserved motivation, but it need not eliminate it entirely. However, we may be willing to limit how highly correlated unobserved motivation is with the enrollment decision and the earnings outcomes in the two regimes, conditional on the labor market histories. For example, if we compare two individuals with the same labor market history for the last two years, e.g., not employed the last six months and working the eighteen months before, and both with one two-year old child, it may be reasonable to assume that these cannot differ radically in their unobserved motivation given that their recent labor market outcomes have been so similar. The sensitivity analyses developed by Rosenbaum and Rubin (1983a) formalize this idea and provide a


tool for making such assessments. Imbens (2003) applies this sensitivity analysis to data from labor market training programs.

The second approach is associated with work by Rosenbaum (1995). Similar to the Rosenbaum-Rubin approach, Rosenbaum's method relies on an unobserved covariate that generates the deviations from unconfoundedness. The analysis differs in that sensitivity is measured using only the relation between the unobserved covariate and the treatment assignment, with the focus on the correlation required to overturn, or change substantially, p-values of statistical tests of no effect of the treatment.

6.2.1 The Rosenbaum-Rubin Approach to Sensitivity Analysis

The starting point is that unconfoundedness is satisfied only conditional on the observed covariates Xi and an unobserved scalar covariate Ui:

    Yi(0), Yi(1) ⊥ Wi | Xi, Ui.

This set up in itself is not restrictive, although once parametric assumptions are made the assumption of a scalar unobserved covariate Ui is restrictive.

Now consider both the conditional distribution of the potential outcomes given observed and unobserved covariates and the conditional probability of assignment given observed and unobserved covariates. Rather than attempting to estimate both these conditional distributions, the idea behind the sensitivity analysis is to specify the form and the amount of dependence of these conditional distributions on the unobserved covariate, and estimate only the dependence on the observed covariate. Conditional on the specification of the first part, estimation of the latter is typically straightforward. The idea is then to vary the amount of dependence of the conditional distributions on the unobserved covariate and assess how much this changes the point estimate of the average treatment effect.

Typically the sensitivity analysis is done in fully parametric settings, although since the models can be arbitrarily flexible, this is not particularly restrictive. Following Rosenbaum and Rubin (1983b), we illustrate this approach in a setting with binary outcomes. See Imbens (2003) and Lee (2005b) for examples in economics. Rosenbaum and Rubin (1983a) fix the marginal distribution of the unobserved covariate to be binomial with p = pr(Ui = 1), and assume independence of Ui and Xi. They specify a logistic distribution for the treatment assignment:

    pr(Wi = 1 | Xi = x, Ui = u)
        = exp(α0 + α1'x + α2·u) / (1 + exp(α0 + α1'x + α2·u)).

They also specify logistic regression functions for the two potential outcomes:

    pr(Yi(w) = 1 | Xi = x, Ui = u)
        = exp(βw0 + βw1'x + βw2·u) / (1 + exp(βw0 + βw1'x + βw2·u)).

For the subpopulation with Xi = x and Ui = u, the average treatment effect is

    E[Yi(1) - Yi(0) | Xi = x, Ui = u]
        = exp(β10 + β11'x + β12·u) / (1 + exp(β10 + β11'x + β12·u))
        - exp(β00 + β01'x + β02·u) / (1 + exp(β00 + β01'x + β02·u)).

The average treatment effect τ_CATE can be expressed in terms of the parameters of this model and the distribution of the observable covariates by averaging over Xi, and integrating out the unobserved covariate Ui:


    τ_CATE = f(p, α2, β02, β12, α0, α1, β00, β01, β10, β11)

        = (1/N) Σ_{i=1}^{N} [
              p · ( exp(β10 + β11'Xi + β12) / (1 + exp(β10 + β11'Xi + β12))
                  - exp(β00 + β01'Xi + β02) / (1 + exp(β00 + β01'Xi + β02)) )
        + (1 - p) · ( exp(β10 + β11'Xi) / (1 + exp(β10 + β11'Xi))
                  - exp(β00 + β01'Xi) / (1 + exp(β00 + β01'Xi)) ) ].

We do not know the values of the parameters (p, α, β), but the data are somewhat informative about them. One conventional approach would be to attempt to estimate all parameters, and then use those estimates to obtain an estimate for the average treatment effect. Given the specific parametric model this may be possible, although in general this would be difficult given the inclusion of unobserved covariates in the basic model. A second approach, as discussed in section 6.1, is to derive bounds on τ given the model and the data. A sensitivity analysis offers a third approach.

The Rosenbaum-Rubin sensitivity analysis proceeds by dividing the parameters into two sets. The first set includes the parameters that would be set to boundary values under unconfoundedness, (α2, β02, β12), plus the parameter p capturing the marginal distribution of the unobserved covariate Ui. Together we refer to these as the sensitivity parameters, θ_sens = (p, α2, β02, β12). The second set consists of the remaining parameters, θ_other = (α0, α1, β00, β01, β10, β11). The idea is that θ_sens is difficult to estimate. Estimates of the other parameters under unconfoundedness could be obtained by fixing α2 = β02 = β12 = 0 and p at an arbitrary value. The data are not directly informative about the effect of an unobserved covariate in the absence of functional form assumptions, and so attempts to estimate θ_sens are therefore unlikely to be effective. Given θ_sens, however, estimating the remaining parameters is considerably easier. In the second step the plan is therefore to fix the first set of parameters and estimate the others by maximum likelihood, and then translate this into an estimate for τ. Thus, for fixed θ_sens, we first estimate the remaining parameters through maximum likelihood:

    θ̂_other(θ_sens) = argmax_{θ_other} L(θ_other | θ_sens),

where L(·) is the logarithm of the likelihood function. Then we consider the function

    τ(θ_sens) = τ(θ_sens, θ̂_other(θ_sens)).

Finally, in the third step, we consider the range of values of the function τ(θ_sens) for a reasonable set of values for the sensitivity parameters θ_sens, and obtain a set of values for τ_CATE.

The key question is how to choose the set of reasonable values for the sensitivity parameters. If we do not wish to restrict this set at all, we end up with unrestricted bounds along the lines of section 6.1. The power from the sensitivity approach comes from the researcher's willingness to put real limits on the values of the sensitivity parameters (p, α2, β02, β12). Among these parameters it is difficult to put real limits on p, and typically it is fixed at 1/2, with little sensitivity to its choice. The more interesting parameters are (α2, β02, β12). Let us assume that the effect of the unobserved covariate is the same in both treatment arms, β2 = β02 = β12, so that there are only two parameters left to fix, α2 and β2. Imbens (2003) suggests linking the parameters to the effects of the observed covariates on assignment and potential outcomes. Specifically, he suggests to calculate the partial correlations between observed covariates and the treatment and potential outcomes, and then as a benchmark look at the sensitivity to an unobserved


covariate that has partial correlations with treatment and potential outcomes as high as any of the observed covariates. For example, Imbens considers, in the labor market training example, what the effect would be of omitting unobserved motivation, if in fact motivation had as much explanatory power for future earnings and for treatment choice as did earnings in the year prior to the training program. A bounds analysis, in contrast, would implicitly allow unobserved motivation to completely determine both selection into the program and future earnings. Even though putting hard limits on the effect of motivation on earnings and treatment choice may be difficult, it may be reasonable to put some limits on it, and the Rosenbaum-Rubin sensitivity analysis provides a useful framework for doing so.

6.2.2 Rosenbaum's Method for Sensitivity Analysis

Rosenbaum (1995) developed a slightly different approach. The advantage of his approach is that it requires fewer tuning parameters than the Rosenbaum-Rubin approach. Specifically, it only requires the researcher to consider the effect unobserved confounders may have on the probability of treatment assignment. Rosenbaum's focus is on the effect the presence of unobserved covariates could have on the p-value for the test of no effect of the treatment based on the unconfoundedness assumption, in contrast to the Rosenbaum-Rubin focus on point estimates for average treatment effects. Consider two units i and j with the same value for the covariates, xi = xj. If the unconfoundedness assumption conditional on Xi holds, both units must have the same probability of assignment to the treatment, e(xi) = e(xj). Now suppose unconfoundedness only holds conditional on both Xi and a binary unobserved covariate Ui. In that case the assignment probabilities for these two units may differ. Rosenbaum suggests bounding the ratio of the odds ratios e(xi)/(1 − e(xi)) and e(xj)/(1 − e(xj)):

    1/Γ ≤ [e(xi) (1 − e(xj))] / [(1 − e(xi)) e(xj)] ≤ Γ.

If Γ = 1, we are back in the setting with unconfoundedness. If we allow Γ = ∞, we are not restricting the association between the treatment indicator and the potential outcomes. Rosenbaum investigates how much the odds would have to be different in order to substantially change the p-value. Or, starting from the other side, he investigates for fixed values of Γ what the implications are for the p-value.

For example, suppose that a test of the null hypothesis of no effect has a p-value of 0.0001 under the assumption of unconfoundedness. If the data suggest it would take the presence of an unobserved covariate that changes the odds of participation by a factor of ten in order to increase that p-value to 0.05, then one would likely consider the result to be very robust. If instead a small change in the odds of participation, say with Γ = 1.5, would be sufficient for a change in the p-value to 0.05, the study would be less robust.

6.3 Instrumental Variables

In this section, we review the recent literature on instrumental variables. We focus on the part of the literature concerned with heterogeneous effects. In the current section, we limit the discussion to the case of a binary endogenous variable. The literature focused on identification of the population average treatment effect and the average effect on the treated. Identification of these estimands ran into serious problems once researchers wished to allow for unrestricted heterogeneity in the effect of the treatment. In an important early result, Bloom (1984) showed that if eligibility for the program is used as an instrument, one

Imbens and Wooldridge: Econometrics of Program Evaluation 57

can identify the average effect of the treatment for those who received the treatment. Key for the Bloom result is that the instrument changes the probability of receiving the treatment to zero. In order to identify the average effect on the overall population, the instrument would also need to shift the probability of receiving the treatment to one. This type of identification is sometimes referred to as identification at infinity (Gary Chamberlain 1986; Heckman 1990) in settings with a continuous instrument. The practical usefulness of such identification results is fairly limited outside of cases where eligibility is randomized. Finding a credible instrument is typically difficult enough, without also requiring that the instrument shifts the probability of the treatment close to zero and one. In fact, the focus of the current literature on instruments that can credibly be expected to satisfy exclusion restrictions makes it even more difficult to find instruments that even approximately satisfy these support conditions. Imbens and Angrist (1994) got around this problem by changing the focus to average effects for the subpopulation that is affected by the instrument.

Initially we focus on the case with a binary instrument. This case provides some of the clearest insight into the identification problems. In that case the identification at infinity arguments are obviously not satisfied and so one cannot (point-)identify the population average treatment effect.

6.3.1 A Binary Instrument

Imbens and Angrist adopt a potential outcome notation for the receipt of the treatment, as well as for the outcome itself. Let Zi denote the value of the instrument for individual i. Let Wi(0) and Wi(1) denote the level of the treatment received if the instrument takes on the values 0 and 1 respectively. As before, let Yi(0) and Yi(1) denote the potential values for the outcome of interest. The observed treatment is, analogously to the relation between the observed outcome Yi and the potential outcomes Yi(0) and Yi(1),

    Wi = Wi(0) · (1 − Zi) + Wi(1) · Zi = { Wi(0) if Zi = 0; Wi(1) if Zi = 1 }.

Exogeneity of the instrument is captured by the assumption that all potential outcomes are independent of the instrument, or

    (Yi(0), Yi(1), Wi(0), Wi(1)) ⊥ Zi.

Formulating exogeneity in this way is attractive compared to conventional residual-based definitions, as it does not require the researcher to specify a regression function in order to define the residuals. This assumption captures two properties of the instrument. First, it captures random assignment of the instrument so that causal effects of the instrument on the outcome and treatment received can be estimated consistently. This part of the assumption, which is implied by explicit randomization of the instrument, as for example in the seminal draft lottery study by Angrist (1990), is not sufficient for causal interpretations of instrumental variables methods. The second part of the assumption captures an exclusion restriction that there is no direct effect of the instrument on the outcome. This second part is captured by the absence of z in the definition of the potential outcome Yi(w). This part of the assumption is not implied by randomization of the instrument and it has to be argued on a case by case basis. See Angrist, Imbens, and Rubin (1996) for more discussion on the distinction between these two assumptions, and for a formulation that separates them.

Imbens and Angrist introduce a new concept, the compliance type of an individual. The type of an individual describes the level of the treatment that an individual would receive given each value of the instrument.
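The compliance type is just a labeling of the pair of potential treatments (Wi(0), Wi(1)); the text defines the four labels explicitly just below. As a minimal sketch (the function name is ours):

```python
def compliance_type(w0: int, w1: int) -> str:
    """Label a unit by its pair of potential treatments (Wi(0), Wi(1)),
    i.e., the treatment it would receive under each value of the instrument."""
    labels = {(0, 0): "never-taker",
              (0, 1): "complier",
              (1, 0): "defier",
              (1, 1): "always-taker"}
    return labels[(w0, w1)]
```

Only one of Wi(0), Wi(1) is ever observed for a given unit, so the type itself is not observable from the data; this is the point of Table 4 below.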


In other words, it is captured by the pair of values (Wi(0), Wi(1)). With both the treatment and instrument binary, there are four types of responses for the potential treatment. It is useful to define the compliance types explicitly:

    Ti = never-taker   if Wi(0) = Wi(1) = 0,
         complier      if Wi(0) = 0, Wi(1) = 1,
         defier        if Wi(0) = 1, Wi(1) = 0,
         always-taker  if Wi(0) = Wi(1) = 1.

The labels never-taker, complier, defier, and always-taker (e.g., Angrist, Imbens, and Rubin 1996) refer to the setting of a randomized experiment with noncompliance, where the instrument is the (random) assignment to the treatment and the endogenous regressor is an indicator for the actual receipt of the treatment. Compliers are in that case individuals who (always) comply with their assignment, that is, take the treatment if assigned to it and not take it if assigned to the control group. One cannot infer from the observed data (Zi, Wi, Yi) whether a particular individual is a complier or not. It is important not to confuse compliers (who comply with their actual assignment and would have complied with the alternative assignment) with individuals who are observed to comply with their actual assignment: that is, individuals who complied with the assignment they actually received, Zi = Wi. For such individuals we do not know what they would have done had their assignment been different, that is, we do not know the value of Wi(1 − Zi).

Imbens and Angrist then invoke an additional assumption they refer to as monotonicity. Monotonicity requires that Wi(1) ≥ Wi(0) for all individuals, or that increasing the level of the instrument does not decrease the level of the treatment. This assumption is equivalent to ruling out the presence of defiers, and it is therefore sometimes referred to as the "no-defiance" assumption (Alexander Balke and Pearl 1994; Pearl 2000). Note that in the Bloom set up with one-sided noncompliance both always-takers and defiers are absent by assumption.

Under these two assumptions, independence of all four potential outcomes (Yi(0), Yi(1), Wi(0), Wi(1)) and the instrument Zi, and monotonicity, Imbens and Angrist show that one can identify the average effect of the treatment for the subpopulation of compliers. Before going through their argument, it is useful to see why we cannot generally identify the average effect of the treatment for other subpopulations. Clearly, one cannot identify the average effect of the treatment for never-takers because they are never observed receiving the treatment, and so E[Yi(1) | Ti = n] is not identified. (By the same argument, always-takers are never observed without the treatment, so E[Yi(0) | Ti = a] is not identified.) Thus, only compliers are observed in both treatment groups, so only for this group is there any chance of identifying the average treatment effect. In order to understand the positive component of the Imbens-Angrist result, that we can identify the average effect for compliers, it is useful to consider the subpopulations defined by instrument and treatment. Table 4 shows the information we have about the individual's type given the monotonicity assumption. Consider individuals with (Zi = 1, Wi = 0). Because of monotonicity such individuals can only be never-takers. Similarly, individuals with (Zi = 0, Wi = 1) can only be always-takers. However, consider individuals with (Zi = 0, Wi = 0). Such individuals can be either compliers or never-takers. We cannot infer the type of such individuals from the observed data alone. Similarly, individuals with (Zi = 1, Wi = 1) can be either compliers or always-takers.

The intuition for the identification result is as follows. The first step is to see that we can infer the population proportions of the three remaining subpopulations, never-takers, always-takers, and compliers (using the fact that the monotonicity assumption rules out the presence of defiers). Call these


TABLE 4
Type by Observed Variables

                  Zi = 0                     Zi = 1
  Wi = 0   Never-taker / Complier     Never-taker
  Wi = 1   Always-taker               Always-taker / Complier

population shares Pt = pr(Ti = t), for t ∈ {n, a, c}. Consider the subpopulation with Zi = 0. Within this subpopulation we observe Wi = 1 only for always-takers. Hence the conditional probability of Wi = 1 given Zi = 0 is equal to the population share of always-takers: Pa = pr(Wi = 1 | Zi = 0). Similarly, in the subpopulation with Zi = 1 we observe Wi = 0 only for never-takers. Hence the population share of never-takers is equal to the conditional probability of Wi = 0 given Zi = 1: Pn = pr(Wi = 0 | Zi = 1). The population share of compliers is then obtained by subtracting the population shares of never-takers and always-takers from one: Pc = 1 − Pn − Pa.

The second step uses the distribution of Yi given (Zi, Wi). We can infer the distribution of Yi | Wi = 0, Ti = n from the subpopulation with (Zi, Wi) = (1, 0), since all these individuals are known to be never-takers. Then we use the distribution of Yi | Zi = 0, Wi = 0. This is a mixture of the distribution of Yi | Wi = 0, Ti = n and the distribution of Yi | Wi = 0, Ti = c, with mixture probabilities equal to the relative population shares, Pn/(Pc + Pn) and Pc/(Pc + Pn), respectively. Since we already inferred the population shares of the never-takers and compliers as well as the distribution of Yi | Wi = 0, Ti = n, we can back out the conditional distribution of Yi | Wi = 0, Ti = c. Similarly we can infer the conditional distribution of Yi | Wi = 1, Ti = c. The difference between the means of these two conditional distributions is the Local Average Treatment Effect (LATE) (Imbens and Angrist, 1994):

    τ_LATE = E[Yi(1) − Yi(0) | Wi(0) = 0, Wi(1) = 1]
           = E[Yi(1) − Yi(0) | Ti = complier].

In practice one need not estimate the local average treatment effect by decomposing the mixture distributions directly. Imbens and Angrist show that LATE equals the standard instrumental variables estimand, the ratio of the covariance of Yi and Zi and the covariance of Wi and Zi:

    τ_LATE = (E[Yi | Zi = 1] − E[Yi | Zi = 0]) / (E[Wi | Zi = 1] − E[Wi | Zi = 0])
           = E[Yi · (Zi − E[Zi])] / E[Wi · (Zi − E[Zi])],

which can be estimated using two-stage least-squares. For applications using parametric models with covariates, see Hirano et al. (2000) and Fabrizia Mealli et al. (2004).

Earlier we argued that one cannot consistently estimate the average effect for either never-takers or always-takers in this setting. Nevertheless, we can still use the bounds approach from Manski (1990, 1995) to bound the average effect for the full population. To understand the nature of the bound, it is useful to decompose the average effect τ_PATE by compliance type (maintaining monotonicity, so there are no defiers):


    τ_PATE = Pn · E[Yi(1) − Yi(0) | Ti = n]
           + Pa · E[Yi(1) − Yi(0) | Ti = a]
           + Pc · E[Yi(1) − Yi(0) | Ti = c].

The only quantities not consistently estimable are the average effects for never-takers and always-takers. Even for those we have some information. For example, we can write E[Yi(1) − Yi(0) | Ti = n] = E[Yi(1) | Ti = n] − E[Yi(0) | Ti = n]. The second term we can estimate, and the data are completely uninformative about the first term. Hence, if there are natural bounds on Yi(1) (for example, if the outcome is binary), we can use that to bound E[Yi(1) | Ti = n], and then in turn use that to bound τ_PATE. These bounds are tight. See Manski (1990), Toru Kitagawa (2008), and Balke and Pearl (1994).

6.3.2 Multivalued Instruments and Weighted Local Average Treatment Effects

The previous discussion was in terms of a single binary instrument. In that case there is no average effect of the treatment that can be estimated consistently other than the local average treatment effect, τ_LATE. With a multivalued instrument, or with multiple binary instruments (still maintaining the setting of a binary treatment; see Angrist and Imbens (1995) and Card (2001) for extensions of the local average treatment effect concept to the multiple treatment case), we can estimate a variety of local average treatment effects. Let Z = {z1, ..., zK} denote the set of values for the instruments. Initially we take the set of values to be finite. Then for each pair (zk, zl) with pr(Wi = 1 | Zi = zk) > pr(Wi = 1 | Zi = zl) one can define a local average treatment effect:

    τ_LATE(zk, zl) = E[Yi(1) − Yi(0) | Wi(zl) = 0, Wi(zk) = 1].

We can combine these to estimate any weighted average of these local average treatment effects:

    τ_LATE,λ = Σ_{k,l} λ_{k,l} · τ_LATE(zk, zl).

Imbens and Angrist show that the standard instrumental variables estimand, using g(Zi) as an instrument for Wi, is equal to a particular weighted average:

    E[Yi · (g(Zi) − E[g(Zi)])] / E[Wi · (g(Zi) − E[g(Zi)])] = τ_LATE,λ,

for a particular set of nonnegative weights, as long as E[Wi | g(Zi) = g] increases in g.

Heckman and Vytlacil (2006) and Heckman, Sergio Urzua, and Vytlacil (2006) study the case with a continuous instrument. They use an additive latent single index setup where the treatment received is equal to

    Wi = 1{h(Zi) + Vi > 0},

where h(·) is strictly monotone, and the latent type Vi is independent of Zi. In general, in the presence of multiple instruments, this latent single index framework imposes substantive restrictions.^14 Without loss of generality we can take the marginal distribution of Vi to be uniform. Given this framework, Heckman, Urzua, and Vytlacil (2006) define the marginal treatment effect as a function of the latent type v of an individual,

    τ_MTE(v) = E[Yi(1) − Yi(0) | Vi = v].

In the single continuous instrument case, τ_MTE(v) is, under some differentiability and invertibility conditions, equal to a limit of local average treatment effects:

^14 See Vytlacil (2002) for a discussion in the case with binary instruments, where the latent index setup implies no loss of generality.
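In samples, the binary-instrument identification argument of the preceding pages reduces to a handful of conditional means: the type shares come from pr(Wi = 1 | Zi), and the LATE is the ratio of the two reduced-form differences. A stdlib-only sketch (the function name is ours):

```python
def late_wald(z, w, y):
    """Type shares and the LATE from binary instrument z, binary treatment w,
    and outcome y, via the moment conditions in the text:
      Pa = pr(W=1 | Z=0), Pn = pr(W=0 | Z=1), Pc = 1 - Pa - Pn,
      tau_LATE = (E[Y|Z=1] - E[Y|Z=0]) / (E[W|Z=1] - E[W|Z=0])."""
    mean = lambda v: sum(v) / len(v)
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    w1 = mean([wi for zi, wi in zip(z, w) if zi == 1])
    w0 = mean([wi for zi, wi in zip(z, w) if zi == 0])
    p_a, p_n = w0, 1 - w1          # always-taker and never-taker shares
    p_c = 1 - p_a - p_n            # complier share
    return p_a, p_n, p_c, (y1 - y0) / (w1 - w0)
```

With covariates, the same ratio is the two-stage least-squares estimand mentioned in the text.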


    τ_MTE(v) = lim_{z↓h⁻¹(−v)} τ_LATE(h⁻¹(−v), z).

A parametric version of this concept goes back to work by Anders Björklund and Robert Moffitt (1987). All average treatment effects, including the overall average effect, the average effect for the treated, and any local average treatment effect, can now be expressed in terms of integrals of this marginal treatment effect, as shown in Heckman and Vytlacil (2005). For example, τ_PATE = ∫₀¹ τ_MTE(v) dv. A complication in practice is that not necessarily all the marginal treatment effects can be estimated. For example, if the instrument is binary, Zi ∈ {0, 1}, then for individuals with Vi < min(−h(0), −h(1)) it follows that Wi = 0, and for these never-takers we cannot estimate τ_MTE(v). Any average effect that requires averaging over such values of v is therefore also not point-identified. Moreover, average effects that can be expressed as integrals of τ_MTE(v) may be identified even if some of the τ_MTE(v) that are being integrated over are not identified. Again, in a binary instrument example with pr(Wi = 1 | Zi = 1) = 1 and pr(Wi = 1 | Zi = 0) = 0, the average treatment effect τ_PATE is identified, but τ_MTE(v) is not identified for any value of v.

6.4 Regression Discontinuity Designs

Regression discontinuity (RD) methods have been around for a long time in the psychology and applied statistics literature, going back to the early 1960s. For discussions and references from this literature, see Donald L. Thistlethwaite and Campbell (1960), William M. K. Trochim (2001), Shadish, Cook, and Campbell (2002), and Cook (2008). Except for some important foundational work by Goldberger (1972a, 1972b), it is only recently that these methods have attracted much attention in the economics literature. For some of the recent applications, see Van Der Klaauw (2002, 2008a), Lee (2008), Angrist and Victor Lavy (1999), DiNardo and Lee (2004), Kenneth Y. Chay and Michael Greenstone (2005), Card, Alexandre Mas, and Jesse Rothstein (2007), Lee, Enrico Moretti, and Matthew J. Butler (2004), Jens Ludwig and Douglas L. Miller (2007), Patrick J. McEwan and Joseph S. Shapiro (2008), Sandra E. Black (1999), Susan Chen and van der Klaauw (2008), Ginger Zhe Jin and Phillip Leslie (2003), Thomas Lemieux and Kevin Milligan (2008), Per Pettersson-Lidbom (2007, 2008), and Pettersson-Lidbom and Björn Tyrefors (2007). Key theoretical and conceptual contributions include the interpretation of estimates for fuzzy regression discontinuity designs allowing for general heterogeneity of treatment effects (Hahn, Todd, and van der Klaauw 2001), adaptive estimation methods (Yixiao Sun 2005), methods for bandwidth selection tailored to the RD setting (Ludwig and Miller 2005; Imbens and Karthik Kalyanaraman 2008), and various tests for discontinuities in means and distributions of nonaffected variables (Lee 2008; McCrary 2008) and for misspecification (Lee and Card 2008). For recent reviews in the economics literature, see van der Klaauw (2008b), Imbens and Lemieux (2008), and Lee and Lemieux (2008).

The basic idea behind the RD design is that assignment to the treatment is determined, either completely or partly, by the value of a predictor (the forcing variable Xi) being on either side of a common threshold. This generates a discontinuity, sometimes of size one, in the conditional probability of receiving the treatment as a function of this particular predictor. The forcing variable is often itself associated with the potential outcomes, but this association is assumed to be smooth. As a result, any discontinuity of the conditional distribution of the outcome as a function of this covariate at the threshold is interpreted as evidence of a causal effect of the treatment. The design often arises from administrative decisions, where the incentives for individuals to participate in a program are rationed


for reasons of resource constraints, and clear transparent rules, rather than discretion by administrators, are used for the allocation of these incentives.

It is useful to distinguish between two general settings, the sharp and the fuzzy regression discontinuity designs (e.g., Trochim 1984, 2001; Hahn, Todd, and van der Klaauw 2001; Imbens and Lemieux 2008; van der Klaauw 2008b; Lee and Lemieux 2008).

6.4.1 The Sharp Regression Discontinuity Design

In the sharp regression discontinuity (SRD) design, the assignment Wi is a deterministic function of one of the covariates, the forcing (or treatment-determining) variable Xi:

    Wi = 1[Xi ≥ c],

where 1[·] is the indicator function, equal to one if the event in brackets is true and zero otherwise. All units with a covariate value of at least c are in the treatment group (and participation is mandatory for these individuals), and all units with a covariate value less than c are in the control group (members of this group are not eligible for the treatment). In the SRD design, we focus on estimation of

    (32)  τ_SRD = E[Yi(1) − Yi(0) | Xi = c].

(Naturally, if the treatment effect is constant, then τ_SRD = τ_PATE.) Writing this expression as E[Yi(1) | Xi = c] − E[Yi(0) | Xi = c], we focus on identification of the two terms separately. By design there are no units with Xi = c for whom we observe Yi(0). To estimate E[Yi(w) | Xi = c] without making functional form assumptions, we exploit the possibility of observing units with covariate values arbitrarily close to c.^15 In order to justify this averaging we make a smoothness assumption that the two conditional expectations E[Yi(w) | Xi = x], for w = 0, 1, are continuous in x. Under this assumption, E[Yi(0) | Xi = c] = lim_{x↑c} E[Yi(0) | Xi = x] = lim_{x↑c} E[Yi | Xi = x], implying that

    τ_SRD = lim_{x↓c} E[Yi | Xi = x] − lim_{x↑c} E[Yi | Xi = x],

where this expression exploits the fact that Wi is a deterministic function of Xi (the defining feature of the SRD). The statistical problem becomes one of estimating regression functions nonparametrically at a boundary point. We discuss the statistical methods in section 6.4.4.

6.4.2 The Fuzzy Regression Discontinuity Design

In the fuzzy regression discontinuity (FRD) design, the probability of receiving the treatment need not change from zero to one at the threshold. Instead the design only requires a discontinuity in the probability of assignment to the treatment at the threshold:

    lim_{x↓c} pr(Wi = 1 | Xi = x) ≠ lim_{x↑c} pr(Wi = 1 | Xi = x).

In practice, the discontinuity needs to be sufficiently large that typically it can be seen easily in simple graphical analyses. These discontinuities can arise if incentives to participate in a program change discontinuously at a threshold, without these incentives being powerful enough to move all units from nonparticipation to participation.

In this design we look at the ratio of the jump in the regression of the outcome on the covariate to the jump in the regression of the treatment indicator on the covariate

^15 Although in principle the first term in the difference in (32) would be straightforward to estimate if we actually observed individuals with Xi = c, with continuous covariates we also need to estimate this term by averaging over units with covariate values close to c.


as an average causal effect of the treatment. Formally, the functional of interest is

    τ_FRD = [lim_{x↓c} E[Yi | Xi = x] − lim_{x↑c} E[Yi | Xi = x]]
          / [lim_{x↓c} E[Wi | Xi = x] − lim_{x↑c} E[Wi | Xi = x]].

Hahn, Todd, and van der Klaauw exploit the instrumental variables connection to interpret the fuzzy regression discontinuity design when the effect of the treatment varies by unit. They define a complier to be a unit whose participation is affected by the cutoff point. That is, a complier is someone with a value of the forcing variable Xi close to c who would participate if c were chosen to be just below Xi, and not participate if c were chosen to be just above Xi. Hahn, Todd, and van der Klaauw then exploit that structure to show that, in combination with a monotonicity assumption,

    τ_FRD = E[Yi(1) − Yi(0) | unit i is a complier and Xi = c].

The estimand τ_FRD is an average effect of the treatment, but only averaged for units with Xi = c (by regression discontinuity), and only for compliers (people who are affected by the threshold). Clearly the analysis generally does not have much external validity. It is only valid for the subpopulation of compliers at the threshold, and it is only valid for the subpopulation with Xi = c. Nevertheless, the FRD analysis may do well in terms of internal validity.

It is useful to compare the RD method in this setting with standard methods based on unconfoundedness. In contrast to the SRD case, an unconfoundedness-based analysis is possible in the FRD setting because some treated observations will have Xi < c, and some control observations will have Xi > c. Ignoring the FRD setting, that is, ignoring the discontinuity in E[Wi | Xi = x] at x = c, and acting as if unconfoundedness holds, would lead to estimating the average treatment effect at Xi = c based on the expression

    τ_unconf = E[Yi | Xi = c, Wi = 1] − E[Yi | Xi = c, Wi = 0],

which equals E[Yi(1) − Yi(0) | Xi = c] under unconfoundedness. In fact, under unconfoundedness one can estimate the average effect E[Yi(1) − Yi(0) | Xi = x] at values of x different from c. However, an interesting result is that if unconfoundedness holds, the FRD approach also estimates E[Yi(1) − Yi(0) | Xi = c], as long as the potential outcomes have smooth expectations as a function of the forcing variable around x = c. A special case of this is discussed in Hahn, Todd, and van der Klaauw (2001), who assume only that treatment is unconfounded with respect to the individual-specific gain. Therefore, in principle, there are situations where even if one believes that unconfoundedness holds, one may wish to use the FRD approach. In particular, even if we maintain unconfoundedness, a standard analysis based on τ_unconf can be problematic because the potential discontinuities in the regression functions (at x = c) under the FRD design invalidate the usual statistical methods that treat the regression functions as continuous at x = c.

Although unconfoundedness in the FRD setting is possible, its failure makes it difficult to interpret τ_unconf. By contrast, provided monotonicity holds, the FRD parameter, τ_FRD, identifies the average treatment effect for compliers at x = c. In other words, approaches that exploit the FRD nature of the design identify an (arguably) interesting parameter both when unconfoundedness holds and in a leading case (monotonicity) when unconfoundedness fails.
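The FRD estimand above is the outcome jump divided by the treatment jump at c. Replacing the one-sided limits with simple one-sided averages gives a minimal Wald-type sketch (the function name and the crude bandwidth choice are ours; proper estimation is discussed in section 6.4.4):

```python
def frd_wald(x, w, y, c, h):
    """Fuzzy-RD estimate: ratio of the jump in the outcome regression to the
    jump in the treatment regression at the cutoff c, using means within h."""
    mean = lambda v: sum(v) / len(v)
    hi = [i for i, xi in enumerate(x) if c <= xi < c + h]
    lo = [i for i, xi in enumerate(x) if c - h <= xi < c]
    jump_y = mean([y[i] for i in hi]) - mean([y[i] for i in lo])
    jump_w = mean([w[i] for i in hi]) - mean([w[i] for i in lo])
    return jump_y / jump_w
```

In the sharp case the treatment jump is one and the ratio collapses to the SRD difference.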


6.4.3 Graphical Methods

Graphical analyses are typically an integral part of any RD analysis. RD designs suggest that the effect of the treatment of interest can be measured by the value of the discontinuity in the conditional expectation of the outcome as a function of the forcing variable at a particular point. Inspecting the estimated version of this conditional expectation is a simple yet powerful way to visualize the identification strategy. Moreover, to assess the credibility of the RD strategy, it can be useful to inspect additional graphs, as discussed below in section 6.4.5. For strikingly clear examples of such plots, see Lee, Moretti, and Butler (2004), Rafael Lalive (2008), and Lee (2008).

The main plot in a SRD setting is a histogram-type estimate of the average value of the outcome by the forcing variable. For some binwidth h, and for some number of bins K0 and K1 to the left and right of the cutoff value, respectively, construct bins (bk, bk+1], for k = 1, ..., K = K0 + K1, where bk = c − (K0 − k + 1) · h. Then calculate the number of observations in each bin, and the average outcome in the bin:

    Nk = Σ_{i=1}^{N} 1[bk < Xi ≤ bk+1],

    Ȳk = (1/Nk) · Σ_{i=1}^{N} Yi · 1[bk < Xi ≤ bk+1].

The key plot is that of the Ȳk, for k = 1, ..., K, against the midpoints of the bins, (bk + bk+1)/2. The question is whether around the threshold c (by construction on the edge of one of the bins) there is any evidence of a jump in the conditional mean of the outcome. The formal statistical analyses discussed below are essentially just sophisticated versions of this, and if the basic plot does not show any evidence of a discontinuity, there is relatively little chance that the more sophisticated analyses will lead to robust and credible estimates with statistically and substantively significant magnitudes.

In addition to inspecting whether there is a jump at this value of the covariate, one should inspect the graph to see whether there are any other jumps in the conditional expectation of Yi given Xi that are comparable in size to, or larger than, the discontinuity at the cutoff value. If so, and if one cannot explain such jumps on substantive grounds, it would call into question the interpretation of the jump at the threshold as the causal effect of the treatment.

In order to optimize the visual clarity it is recommended to calculate averages that are not smoothed across the cutoff point c. In addition, it is recommended not to artificially smooth on either side of the threshold in a way that implies that the only discontinuity in the estimated regression function is at c. One should therefore use nonsmooth methods such as the histogram-type estimators described above rather than smooth methods such as kernel estimators.

In a FRD setting, one should also calculate

    W̄k = (1/Nk) · Σ_{i=1}^{N} Wi · 1[bk < Xi ≤ bk+1],

and plot the W̄k against the bin centers, in the same way as described above.

6.4.4 Estimation and Inference

The object of interest in regression discontinuity designs is a difference in two regression functions at a particular point (in the SRD case), and the ratio of two differences of regression functions (in the FRD case). These estimands are identified without functional form assumptions, and in general one might therefore like to use nonparametric regression methods that allow for flexible functional forms. Because we are interested in the behavior of the regression functions around a single value of the covariate, it is attractive


to use local smoothing methods such as kernel regression rather than global smoothing methods such as sieves or series regression, because the latter will generally be sensitive to behavior of the regression function away from the threshold. Local smoothing methods are generally well understood (e.g., Charles J. Stone 1977; Herman J. Bierens 1987; Härdle 1990; Adrian Pagan and Aman Ullah 1999). For a particular choice of the kernel, K(·), e.g., a rectangular kernel K(z) = 1[−h < z < h], or a Gaussian kernel K(z) = exp(−z²/2)/√(2π), the regression function at x, m(x) = E[Y_i | X_i = x], is estimated as

m̂(x) = Σ_{i=1}^N Y_i · λ_i,  with weights  λ_i = K(X_i − x) / Σ_{j=1}^N K(X_j − x).

An important difference with the primary focus in the nonparametric regression literature is that in the RD setting we are interested in the value of the regression functions at boundary points. Standard kernel regression methods do not work well in such cases. More attractive methods for this case are local linear regression (Fan and Gijbels 1996; Porter 2003; Burkhardt Seifert and Theo Gasser 1996, 2000; Ichimura and Todd 2007), where locally a linear regression function, rather than a constant regression function, is fitted. This leads to an estimator for the regression function at x equal to

m̂(x) = α̂,  where  (α̂, β̂) = argmin_{α,β} Σ_{i=1}^N λ_i · (Y_i − α − β·(X_i − x))²,

with the same weights λ_i as before. In that case the main remaining choice concerns the bandwidth, denoted by h. Suppose one uses a rectangular kernel, K(z) = 1[−h < z < h] (and typically the results are relatively robust with respect to the choice of the kernel). The choice of bandwidth then amounts to dropping all observations outside the interval [c − h, c + h]. The question becomes how to choose the bandwidth h.

Most standard methods for choosing bandwidths in nonparametric regression, including both cross-validation and plug-in methods, are based on criteria that integrate the squared error over the entire distribution of the covariates: ∫_z (m̂(z) − m(z))² f(z) dz. For our purposes this criterion does not reflect the object of interest. We are specifically interested in the regression function at a single point; moreover, this point is a boundary point. Thus we would like to choose h to minimize E[(m̂(c) − m(c))²] (using the data with X_i < c only, or using the data with X_i > c only). If the density of the forcing variable is high at the threshold, a bandwidth selection procedure based on global criteria may lead to a bandwidth that is much smaller than is appropriate.

There are few attempts to formalize the choice of a bandwidth in such cases. Ludwig and Miller (2007) and Imbens and Lemieux (2008) discuss cross-validation methods that target more directly the object of interest in RD settings. Assuming the density of X_i is continuous at the threshold, and that the conditional variance of Y_i given X_i is continuous and equal to σ² at the threshold, Imbens and Kalyanaraman (2009) show that the optimal bandwidth depends on the second derivatives of the regression function at the threshold and has the form

h_opt = N^{−1/5} · C_K · [ σ² / ( f_X(c) · ( p · lim_{x↓c} (∂²m/∂x²(x))² + (1 − p) · lim_{x↑c} (∂²m/∂x²(x))² ) ) ]^{1/5},

where p is the fraction of observations with X_i ≥ c, and C_K is a constant that depends on the kernel. For a rectangular kernel K(z) = 1[−h < z < h], the constant equals C_K = 2.70.

66 Journal of Economic Literature, Vol. XLVII (March 2009)

Imbens and Kalyanaraman propose and implement a plug-in method for the bandwidth.16 If one uses a rectangular kernel, and given a choice for the bandwidth, estimation for the SRD and FRD designs can be based on ordinary least squares and two stage least squares, respectively. If the bandwidth goes to zero sufficiently fast, so that the asymptotic bias can be ignored, one can also base inference on these methods. (See HTV and Imbens and Lemieux 2008.)

16 Code in Matlab and Stata for calculating the optimal bandwidth is available on their website.

6.4.5 Specification Checks

There are two important concerns in the application of RD designs, be they sharp or fuzzy. These concerns can sometimes be assuaged by investigating various implications of the identification argument underlying the regression discontinuity design.

A first concern about RD designs is the possibility of other changes at the same threshold value of the covariate. For example, the same age limit may affect eligibility for multiple programs. If all the programs whose eligibility changes at the same cutoff value affect the outcome of interest, an RD analysis may mistakenly attribute the combined effect to the treatment of interest. The second concern is that of manipulation by the individuals of the covariate value that underlies the assignment mechanism. The latter is less of a concern when the forcing variable is a fixed, immutable characteristic of an individual such as age. It is a particular concern when eligibility criteria are known to potential participants and are based on variables that are affected by individual choices. For example, if eligibility for financial aid depends on test scores that are graded by teachers who know the cutoff values, there may be a tendency to push grades high enough to make students eligible. Alternatively, if thresholds are known to students, they may take the test multiple times in an attempt to raise their score above the threshold.

There are two sets of specification checks that researchers can typically perform to at least partly assess the empirical relevance of these concerns. Although the proposed procedures do not directly test null hypotheses that are required for the RD approach to be valid, it is typically difficult to argue for the validity of the approach when these null hypotheses do not hold. First, one may look for discontinuities in the average value of the covariates around the threshold. In most cases, the reason for the discontinuity in the probability of the treatment does not suggest a discontinuity in the average value of covariates. Finding a discontinuity in other covariates typically casts doubt on the assumptions underlying the RD design. Specifically, for covariates Z_i, the test would look at the difference

τ_Z = lim_{x↓c} E[Z_i | X_i = x] − lim_{x↑c} E[Z_i | X_i = x].

Second, McCrary (2008) suggests testing the null hypothesis of continuity of the density of the covariate that underlies the assignment at the threshold, against the alternative of a jump in the density at that point. A discontinuity in the density of this covariate at the point where the discontinuity in the treatment occurs is suggestive of manipulation of the forcing variable, and here the test is based on the difference

τ_f = lim_{x↓c} f_X(x) − lim_{x↑c} f_X(x).

In both cases a substantial and statistically significant difference in the two limits suggests that there may be problems with the RD approach. In practice, more useful than formal statistical tests are graphical analyses of the type discussed in section 6.4.3, where histogram-type estimates of the conditional expectation of the covariates and of the marginal density f_X(x) are inspected for discontinuities at the threshold.
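Both checks can be illustrated with a crude local-means version on simulated data. The bandwidth and the data-generating process below are illustrative assumptions; in applied work one would use local linear estimates of τ_Z and the McCrary density test rather than raw bin differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data in which the RD design is valid: the covariate Z and the
# density of the forcing variable X are both smooth at the threshold c,
# so both placebo statistics should be close to zero.
n, c, h = 5000, 0.0, 0.1
x = rng.uniform(-1, 1, n)
z = 2.0 + 0.3 * x + rng.normal(0, 1, n)

right = (x >= c) & (x < c + h)
left = (x >= c - h) & (x < c)

# Analogue of tau_Z: difference in local covariate means across the threshold.
tau_z = z[right].mean() - z[left].mean()

# Crude analogue of tau_f: difference in density estimates from bin counts.
tau_f = right.sum() / (n * h) - left.sum() / (n * h)
```

A large value of either statistic, relative to its sampling variability, would cast doubt on the design, mirroring the graphical checks described in the text.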


6.5 Difference-in-Differences Methods

Since the seminal work by Ashenfelter (1978) and Ashenfelter and Card (1985), the use of Difference-In-Differences (DID) methods has become widespread in empirical economics. Influential applications include Philip J. Cook and George Tauchen (1982, 1984), Card (1990), Bruce D. Meyer, W. Kip Viscusi, and David L. Durbin (1995), Card and Krueger (1993, 1994), Nada Eissa and Liebman (1996), Blundell, Alan Duncan, and Meghir (1998), and many others. The DID approach is often associated with so-called "natural experiments," where policy changes can be used to effectively define control and treatment groups. See Angrist and Krueger (1999), Angrist and Pischke (2009), and Blundell and Thomas MaCurdy (1999) for textbook discussions.

The simplest setting is one where outcomes are observed for units observed in one of two groups, in one of two time periods. Only units in one of the two groups, in the second time period, are exposed to a treatment. There are no units exposed to the treatment in the first period, and units from the control group are never observed to be exposed to the treatment. The average gain over time in the non-exposed (control) group is subtracted from the gain over time in the exposed (treatment) group. This double differencing removes biases in second period comparisons between the treatment and control group that could be the result from permanent differences between those groups, as well as biases from comparisons over time in the treatment group that could be the result of time trends unrelated to the treatment. In general this allows for the endogenous adoption of the new treatment (see Timothy Besley and Case 2000 and Athey and Imbens 2006). We discuss here the conventional set up, and recent work on inference (Bertrand, Duflo, and Mullainathan 2004; Hansen 2007a, 2007b; Donald and Lang 2007), as well as the recent extensions by Athey and Imbens (2006), who develop a functional form-free version of the difference-in-differences methodology, and Abadie, Diamond, and Jens Hainmueller (2007), who develop a method for constructing an artificial control group from multiple nonexposed groups.

6.5.1 Repeated Cross Sections

The standard model for the DID approach is as follows. Individual i belongs to a group, G_i ∈ {0, 1} (where group 1 is the treatment group), and is observed in time period T_i ∈ {0, 1}. For i = 1, …, N, a random sample from the population, individual i's group identity and time period can be treated as random variables. In the standard DID model, we can write the outcome for individual i in the absence of the intervention, Y_i(0), as

(33) Y_i(0) = α + β·T_i + γ·G_i + ε_i,

with unknown parameters α, β, and γ. We ignore the potential presence of other covariates, which introduce no special complications. The second coefficient in this specification, β, represents the time component common to both groups. The third coefficient, γ, represents a group-specific, time-invariant component. The fourth term, ε_i, represents unobservable characteristics of the individual. This term is assumed to be independent of the group indicator and have the same distribution over time, i.e., ε_i ⊥ (G_i, T_i), and is normalized to have mean zero.

An alternative set up leading to the same estimator allows for a time-invariant individual-specific fixed effect, γ_i, potentially correlated with G_i, and models Y_i(0) as

(34) Y_i(0) = α + β·T_i + γ_i + ε_i.

(See, e.g., Angrist and Krueger 1999.) This generalization of the standard model does


not affect the standard DID estimand, and it will be subsumed as a special case of the model we propose.

The equation for the outcome without the treatment is combined with an equation for the outcome given the treatment: Y_i(1) = Y_i(0) + τ_DID. The standard DID estimand is under this model equal to

(35) τ_DID = E[Y_i(1)] − E[Y_i(0)]
  = (E[Y_i | G_i = 1, T_i = 1] − E[Y_i | G_i = 1, T_i = 0])
  − (E[Y_i | G_i = 0, T_i = 1] − E[Y_i | G_i = 0, T_i = 0]).

In other words, the population average difference over time in the control group (G_i = 0) is subtracted from the population average difference over time in the treatment group (G_i = 1) to remove biases associated with a common time trend unrelated to the intervention.

We can estimate τ_DID simply using least squares methods on the regression function for the observed outcome,

Y_i = α + β·T_i + γ·G_i + τ_DID·W_i + ε_i,

where the treatment indicator W_i is equal to the interaction of the group and time indicators, W_i = T_i·G_i. Thus the treatment effect is estimated through the coefficient on the interaction between the indicators for the second time period and the treatment group. This leads to

τ̂_DID = (Ȳ_11 − Ȳ_10) − (Ȳ_01 − Ȳ_00),

where Ȳ_gt = Σ_{i: G_i = g, T_i = t} Y_i / N_gt is the average outcome among units in group g and time period t.

6.5.2 Multiple Groups and Multiple Periods

With multiple time periods and multiple groups we can use a natural extension of the two-group two-time-period model for the outcome in the absence of the intervention. Let T and G denote the number of time periods and groups respectively. Then:

(36) Y_i(0) = α + Σ_{t=1}^T β_t·1[T_i = t] + Σ_{g=1}^G γ_g·1[G_i = g] + ε_i,

with separate parameters for each group and time period, γ_g and β_t, for g = 1, …, G and t = 1, …, T, where the initial time period coefficient and first group coefficient have implicitly been normalized to zero. This model is then combined with the additive model for the treatment effect, Y_i(1) = Y_i(0) + τ_DID, implying that the parameters of this model can still be estimated by ordinary least squares based on the regression function

(37) Y_i = α + Σ_{t=1}^T β_t·1[T_i = t] + Σ_{g=1}^G γ_g·1[G_i = g] + τ_DID·W_i + ε_i,

where W_i is now an indicator for unit i being in a group and time period that was exposed to the treatment.

This model with more than two time periods, or more than two groups, imposes testable restrictions on the data. For example, if groups g_1 and g_2 are both not exposed to the treatment in periods t_1 and t_2, then under this model the double difference (Ȳ_{g_1 t_1} − Ȳ_{g_1 t_2}) − (Ȳ_{g_2 t_1} − Ȳ_{g_2 t_2}) should estimate zero, which can be tested using conventional methods; this possibility is exploited in the next subsection. In the two-period, two-group setting there are no


testable restrictions on the four group/period means.

6.5.3 Standard Errors in the Multiple Group and Multiple Period Case

Recently there has been attention called to the concern that ordinary least squares standard errors for the DID estimator may not be accurate in the presence of correlations between outcomes within groups and between time periods. This is a particular case of clustering where the regressor of interest does not vary within clusters. See Brent R. Moulton (1990), Moulton and William C. Randolph (1989), and Wooldridge (2002) for a general discussion. The specific problem has been analyzed recently by Donald and Lang (2007), Bertrand, Duflo, and Mullainathan (2004), and Hansen (2007a, 2007b).

The starting point of these analyses is a particular structure on the error term ε_i:

ε_i = η_{G_i T_i} + ν_i,

where ν_i is an individual-level idiosyncratic error term, and η_{gt} is a group/time-specific component. The unit level error term ν_i is independent across all units, E[ν_i·ν_j] = 0 if i ≠ j and E[ν_i²] = σ_ν². Now suppose we also assume that η_{gt} ~ N(0, σ_η²), and all the η_{gt} are independent. Let us focus initially on the two-group, two-time-period case. With a large number of units in each group and time period, Ȳ_gt → α + β_t + γ_g + 1[g = t = 1]·τ_DID + η_{gt}, so that

τ̂_DID = (Ȳ_11 − Ȳ_10) − (Ȳ_01 − Ȳ_00) → τ_DID + (η_11 − η_10) − (η_01 − η_00) ~ N(τ_DID, 4·σ_η²).

Thus, in this case with two groups and two time periods, the conventional DID estimator is not consistent. In fact, no consistent estimator exists because there is no way to eliminate the influence of the four unobserved components η_{gt}. In this two-group, two-time-period case the problem is even worse than the absence of a consistent estimator, because one cannot even establish whether there is a clustering problem: the data are not informative about the value of σ_η². If we have data from more than two groups or from more than two time periods we can typically estimate σ_η², and thus, at least under the normality and independence assumptions for η_{gt}, construct confidence intervals for τ_DID. Consider, for example, the case with three groups and two time periods. If groups G_i = 0, 1 are both not treated in the second period, then (Ȳ_11 − Ȳ_10) − (Ȳ_01 − Ȳ_00) ~ N(0, 4·σ_η²), which can be used to obtain an unbiased estimator for σ_η². See Donald and Lang (2007) for details.

Bertrand, Duflo, and Mullainathan (2004) and Hansen (2007a, 2007b) focus on the case with multiple (more than two) time periods. In that case we may wish to relax the assumption that the η_{gt} are independent over time. Note that with data from only two time periods there is no information in the data that allows one to establish whether the η_{gt} are correlated over time. The typical generalization is to allow for an autoregressive structure on the η_{gt}, for example,

η_{gt} = α·η_{g,t−1} + ω_{gt},

with a serially uncorrelated ω_{gt}. More generally, with T time periods, one can allow for an autoregressive process of order T − 2. Using simulations and real data calculations based on data for fifty states and multiple time periods, Bertrand, Duflo, and Mullainathan (2004) show that corrections to the conventional standard errors taking into account the clustering and autoregressive structure make a substantial difference. Hansen (2007a, 2007b) provides additional large sample results under sequences where the number of time periods increases with the sample size.
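The unbiasedness of the placebo-based estimator of σ_η² described above is easy to verify by simulation. In the Python sketch below the value of σ_η², the number of replications, and the omission of the unit-level noise ν_i are all illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(4)

# Group/period shocks eta_gt ~ N(0, s2_eta) for two untreated groups and two
# periods (unit-level noise nu_i is ignored, as in the large-sample limit).
# The placebo double difference is N(0, 4*s2_eta), so its square divided by
# four is an unbiased estimator of s2_eta.
s2_eta, n_rep = 0.25, 20_000
eta = rng.normal(0.0, np.sqrt(s2_eta), size=(n_rep, 2, 2))
dd = (eta[:, 1, 1] - eta[:, 1, 0]) - (eta[:, 0, 1] - eta[:, 0, 0])
s2_hat = float(np.mean(dd ** 2 / 4))
```

With the simulated shocks, s2_hat is close to the true value 0.25, in line with the Donald and Lang (2007) argument that extra untreated groups make σ_η² estimable.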


6.5.4 Panel Data

Now suppose we have panel data, in the two-period, two-group case. Here we have N individuals, indexed i = 1, …, N, for whom we observe (G_i, Y_i0, Y_i1, X_i0, X_i1), where G_i is, as before, group membership, X_it is the covariate value for unit i at time t, and Y_it is the outcome for unit i at time t.

One option is to proceed with estimation exactly as before, essentially ignoring the fact that the observations in different time periods come from the same unit. We can now interpret the estimator as the ordinary least squares estimator based on the regression function for the difference in outcomes:

Y_i1 − Y_i0 = β + τ_DID·G_i + ε_i,

which leads to the double difference as the estimator for τ_DID: τ̂_DID = (Ȳ_11 − Ȳ_10) − (Ȳ_01 − Ȳ_00). This estimator is identical to that discussed in the context of repeated cross sections, and so does not exploit directly the panel nature of the data.

A second, and very different, approach with panel data, which does exploit the specific features of the panel data, would be to assume unconfoundedness given lagged outcomes. Let us look at the differences between these two approaches in a simple setting, without covariates, and assuming linearity. In that case the DID approach suggests the regression of Y_i1 − Y_i0 on the group indicator, leading to τ̂_DID. The unconfoundedness assumption would suggest the regression of the difference Y_i1 − Y_i0 on the group indicator and the lagged outcome Y_i0:

Y_i1 − Y_i0 = β + τ_unconf·G_i + δ·Y_i0 + ε_i.

While it appears that the analysis based on unconfoundedness is necessarily less restrictive because it allows a free coefficient on Y_i0, this is not the case. The DID assumption implies that adjusting for lagged outcomes actually compromises the comparison, because Y_i0 may in fact be correlated with ε_i. In the end, the two approaches make fundamentally different assumptions. One needs to choose between them based on substantive knowledge. When the estimated coefficient on the lagged outcome is close to zero, obviously there will be little difference between the point estimates. In addition, using the formula for omitted variable bias in least squares estimation, the results will be very similar if the average outcomes in the treatment and control groups are similar in the first period. Finally, note that in the repeated cross-section case the choice between the DID and unconfoundedness approaches did not arise because the unconfoundedness approach is not feasible: it is not possible to adjust for lagged outcomes when we do not have the same units available in both periods.

As a practical matter, the DID approach appears less attractive than the unconfoundedness-based approach in the context of panel data. It is difficult to see how making treated and control units comparable on lagged outcomes will make the causal interpretation of their difference less credible, as suggested by the DID assumptions.

6.5.5 The Changes-in-Changes Model

Now we return to the setting with two groups, two time periods, and repeated cross sections. Athey and Imbens (2006) generalize the standard model in several ways. They relax the additive linear model by assuming that, in the absence of the intervention, the outcomes satisfy

(38) Y_i(0) = h_0(U_i, T_i),

with h_0(u, t) increasing in u. The random variable U_i represents all unobservable characteristics of individual i, and (38) incorporates the idea that the outcome of an individual with U_i = u will be the same in a given time

period, irrespective of the group membership. The distribution of U_i is allowed to vary across groups, but not over time within groups, so that U_i ⊥ T_i | G_i. Athey and Imbens call the resulting model the changes-in-changes (CIC) model.

The standard DID model in (33) adds three additional assumptions to the CIC model, namely

(39) U_i − E[U_i | G_i] ⊥ G_i (additivity),

(40) h_0(u, t) = φ(u + δ·t) (single index model),

for a strictly increasing function φ(·), and

(41) φ(·) is the identity function (identity transformation).

In the CIC extension, the treatment group's distribution of unobservables may be different from that of the control group in arbitrary ways. In the absence of treatment, all differences between the two groups can be interpreted as coming from differences in the conditional distribution of U given G. The model further requires that the changes over time in the distribution of each group's outcome (in the absence of treatment) arise solely from the fact that h_0(u, 0) differs from h_0(u, 1), that is, the relation between unobservables and outcomes changes over time. Like the standard model, the Athey-Imbens approach does not rely on tracking individuals over time. Although the distribution of U_i is assumed not to change over time within groups, the model does not make any assumptions about whether a particular individual has the same realization U_i in each period. Thus, the estimators derived by Athey and Imbens will be the same whether one observes a panel of individuals over time or a repeated cross section. Just as in the standard DID approach, if one only wishes to estimate the effect of the intervention on the treatment group, no assumptions are required about how the intervention affects outcomes.

The average effect of the treatment for the second period treatment group is τ_CIC = E[Y_i(1) − Y_i(0) | G_i = 1, T_i = 1]. Because the first term of this expression is equal to E[Y_i(1) | G_i = 1, T_i = 1] = E[Y_i | G_i = 1, T_i = 1], it can be estimated directly from the data. The difficulty is in estimating the second term. Under the assumptions of monotonicity of h_0(u, t) in u, and conditional independence of T_i and U_i given G_i, Athey and Imbens show that in fact the full distribution of Y_i(0) given G_i = T_i = 1 is identified through the equality

(42) F_{Y(0),11}(y) = F_{Y,10}(F_{Y,00}^{−1}(F_{Y,01}(y))),

where F_{Y,gt}(y) denotes the distribution function of Y_i given G_i = g and T_i = t. The expected outcome for the second period treatment group under the control treatment is

E[Y_i(0) | G_i = 1, T_i = 1] = E[F_{Y,01}^{−1}(F_{Y,00}(Y_{i,10}))].

To analyze the counterfactual effect of the intervention on the control group, Athey and Imbens assume that, in the presence of the intervention,

Y_i(1) = h_1(U_i, T_i),

for some function h_1(u, t) that is increasing in u. That is, the effect of the treatment at a given time is the same for individuals with the same U_i = u, irrespective of the group. No further assumptions are required on the functional form of h_1, so the treatment effect, equal to h_1(u, 1) − h_0(u, 1) for individuals with unobserved component u, can differ across individuals. Because the distribution of the unobserved component U can vary across groups, the average return to the policy intervention can vary across groups as well.


6.5.6 The Abadie-Diamond-Hainmueller Artificial Control Group Approach

Abadie, Diamond, and Hainmueller (2007) develop a very interesting alternative approach to the setting with multiple control groups. See also Abadie and Gardeazabal (2003). Here we discuss a simple version of their approach, with T + 1 time periods, and G + 1 groups, one treated in the final period, and G not treated in any period. The Abadie-Diamond-Hainmueller idea is to construct an artificial control group that is more similar to the treatment group in the initial period than any of the control groups on their own. Let G_i = G denote the treated group, and G_i = 0, …, G − 1 denote the G control groups. The outcome for the final period treatment group in the absence of the treatment will be estimated as a weighted average of period T outcomes in the G control groups,

E[Y_i(0) | T_i = T, G_i = G] = Σ_{g=0}^{G−1} λ_g · Ȳ_{gT},

with weights λ_g satisfying Σ_{g=0}^{G−1} λ_g = 1 and λ_g ≥ 0. The weights are chosen to make the weighted control group resemble the treatment group prior to the treatment. That is, the weights λ_g are chosen to minimize the difference between the treatment group and the weighted average of the control groups prior to the treatment, namely,

‖ ( Ȳ_{G,0} − Σ_{g=0}^{G−1} λ_g·Ȳ_{g,0}, …, Ȳ_{G,T−1} − Σ_{g=0}^{G−1} λ_g·Ȳ_{g,T−1} ) ‖,

where ‖·‖ denotes a measure of distance. One can also add group level covariates to the criterion to determine the weights. These group-level covariates may be averages of individual level covariates, or quantiles of the distribution of within group covariates. The idea is that the future path of the artificial control group, consisting of the λ-weighted average of all the control groups, mimics the path that would have been observed in the treatment group in the absence of the treatment. Applications in Abadie, Diamond, and Hainmueller (2007) to estimation of the effect of smoking legislation in California and the effect of reunification on West Germany are very promising.

7. Multivalued and Continuous Treatments

Most of the recent econometric program evaluation literature has focused on the case with a binary treatment. As a result this case is now understood much better than it was a decade or two ago. However, much less is known about settings with multivalued, discrete, or continuous treatments. Such cases are common in practice. Social programs are rarely homogenous. Typically individuals are assigned to various activities and regimes, often sequentially, tailored to their specific circumstances and characteristics.

To provide some insight into the issues arising in settings with multivalued treatments we discuss in this review five separate cases. First, the simplest setting, where the treatment is discrete and one is willing to assume unconfoundedness of the treatment assignment. In that case straightforward extensions of the binary treatment case can be used to obtain estimates and inferences for causal effects. Second, we look at the case with a continuous treatment under unconfoundedness. In that case, the definition of the propensity score requires some modification, but many of the insights from the binary treatment case still carry over. Third, we look at the case where units can be exposed to a sequence of binary treatments.


For example, an individual may remain in a training program for a number of periods. In each period the assignment to the program is assumed to be unconfounded, given permanent characteristics and outcomes up to that point. In the last two cases we briefly discuss multivalued endogenous treatments. In the fourth case, we look at settings with a discrete multivalued treatment in the presence of endogeneity. We allow the treatment to be continuous in the final case. The last two cases tie in closely with the simultaneous equations literature, where, somewhat separately from the program evaluation literature, there has been much recent work on nonparametric identification and estimation. Especially in the discrete case, many of the results in this literature are negative in the sense that, without unattractive restrictions on heterogeneity or functional form, few objects of interest are point-identified. Some of the literature has turned toward establishing bounds. This is an area with much ongoing work and considerable scope for further research.

7.1 Multivalued Discrete Treatments with Unconfounded Treatment Assignment

If there are a few different levels of the treatment, rather than just two, essentially all of the methods discussed before go through in the unconfoundedness case. Suppose, for example, that the treatment can be one of three levels, say W_i ∈ {0, 1, 2}. In order to estimate the effect of treatment level 2 relative to treatment level 1, one can simply put aside the data for units exposed to treatment level 0 if one is willing to assume unconfoundedness. More specifically, one can estimate the average outcome for each treatment level conditional on the covariates, E[Y_i(w) | X_i = x], using data on units exposed to treatment level w, and average these over the (estimated) marginal distribution of the covariates, F_X(x). In practice, the overlap assumption may be more likely to be violated with more than two treatments. For example, with three treatments, it may be that no units are exposed to treatment level 2 if X_i is in some subset of the covariate space. The insights from the binary case directly extend to this multiple (but few) treatment case. If the number of treatments is relatively large, one may wish to smooth across treatment levels in order to improve precision of the inferences.

7.2 Continuous Treatments with Unconfounded Treatment Assignment

In the case where the treatment takes on many values, Imbens (2000), Lechner (2001, 2004), Hirano and Imbens (2004), and Carlos A. Flores (2005) extended some of the propensity score methodology under unconfoundedness. The key maintained assumption is that adjusting for pre-treatment differences removes all biases, and thus solves the problem of drawing causal inferences. This is formalized by using the concept of weak unconfoundedness, introduced by Imbens (2000). Assignment to treatment W_i is weakly unconfounded, given pre-treatment variables X_i, if

W_i ⊥ Y_i(w) | X_i,

for all w. Compare this to the stronger assumption made by Rosenbaum and Rubin (1983b) in the binary case:

W_i ⊥ (Y_i(0), Y_i(1)) | X_i,

which requires the treatment W_i to be independent of the entire set of potential outcomes. Instead, weak unconfoundedness requires only pairwise independence of the treatment with each of the potential outcomes. A similar assumption is used in Robins and Rotnitzky (1995). The definition of weak unconfoundedness is also similar to that of "missing at random" (Rubin 1976, 1987; Roderick J. A. Little and Rubin 1987) in the missing data literature.


Although in substantive terms the weak unconfoundedness assumption is not very different from the assumption used by Rosenbaum and Rubin (1983b), it is important that one does not need the stronger assumption to validate estimation of the expected value of Yi(w) by adjusting for Xi: under weak unconfoundedness, we have

E[Yi(w) | Xi] = E[Yi(w) | Wi = w, Xi] = E[Yi | Wi = w, Xi],

and expected outcomes can then be estimated by averaging these conditional means: E[Yi(w)] = E[E[Yi(w) | Xi]]. In practice, it can be difficult to estimate E[Yi(w)] in this manner when the dimension of Xi is large, or if w takes on many values, because the first step requires estimation of the expectation of Yi(w) given the treatment level and all pretreatment variables. It was this difficulty that motivated Rosenbaum and Rubin (1983b) to develop the propensity score methodology.

Imbens (2000) introduces the generalized propensity score for the multiple treatment case. It is the conditional probability of receiving a particular level of the treatment given the pretreatment variables:

r(w, x) = pr(Wi = w | Xi = x).

In the continuous case, where, say, Wi takes values in the unit interval, r(w, x) = f_W|X(w | x). Suppose assignment to treatment Wi is weakly unconfounded given pretreatment variables Xi. Then, by the same argument as in the binary treatment case, assignment is weakly unconfounded given the generalized propensity score: as δ → 0,

1{w − δ < Wi < w + δ} ⊥ Yi(w) | r(w, Xi),

for all w. This is the point where using the weak form of the unconfoundedness assumption is important. There is, in general, no scalar function of the covariates such that the level of the treatment Wi is independent of the set of potential outcomes {Yi(w)}, w ∈ [0, 1], unless additional structure is imposed on the assignment mechanism; see, for example, Marshall M. Joffe and Rosenbaum (1999).

Because weak unconfoundedness given all pretreatment variables implies weak unconfoundedness given the generalized propensity score, one can estimate average outcomes by conditioning solely on the generalized propensity score. If assignment to treatment is weakly unconfounded given pretreatment variables Xi, then two results follow. First, for all w,

β(w, r) = E[Yi(w) | r(w, Xi) = r] = E[Yi | Wi = w, r(Wi, Xi) = r],

which can be estimated using data on Yi, Wi, and r(Wi, Xi). Second, the average outcome given a particular level of the treatment, E[Yi(w)], can be estimated by appropriately averaging β(w, r):

E[Yi(w)] = E[β(w, r(w, Xi))].

As with the implementation of the binary treatment propensity score methodology, the implementation of the generalized propensity score method consists of three steps. In the first step the score r(w, x) is estimated. With a binary treatment the standard approach (Rosenbaum and Rubin 1984; Rosenbaum 1995) is to estimate the propensity score using a logistic regression. More generally, if the treatments correspond to ordered levels of a treatment, such as the dose of a drug or the time over which a treatment is applied, one may wish to impose smoothness of the score in w. For continuous Wi, Hirano and Imbens (2004) use a lognormal distribution. In the second step, the conditional expectation β(w, r) = E[Yi | Wi = w, r(Wi, Xi) = r] is estimated. Again, the implementation may be different in the case where the levels of the treatment are qualitatively distinct than in the case where smoothness of the conditional expectation function in w is appropriate.
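As a concrete illustration of this procedure, the sketch below simulates a continuous treatment and estimates the dose-response function E[Yi(w)] via the generalized propensity score. The normal linear model for the score and the quadratic specification for β(w, r) are illustrative assumptions in the spirit of Hirano and Imbens (2004), not the exact specifications used in that paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: the covariate X drives both the continuous treatment W
# and the outcome Y, so unadjusted comparisons across levels of W are biased.
X = rng.normal(size=n)
W = 1.0 + 0.5 * X + rng.normal(size=n)
Y = W + 0.5 * X + rng.normal(size=n)   # true dose-response: E[Y(w)] = w

# Step 1: estimate the score r(w, x), here from a normal linear model for W | X.
b1, b0 = np.polyfit(X, W, 1)
sig2 = np.var(W - (b0 + b1 * X))

def gps(w, x):
    mu = b0 + b1 * x
    return np.exp(-(w - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)

# Step 2: estimate beta(w, r) = E[Y | W = w, r(W, X) = r], here by least
# squares on a quadratic approximation in (W, R).
R = gps(W, X)
design = lambda w, r: np.column_stack(
    [np.ones_like(w), w, w ** 2, r, r ** 2, w * r])
coef, *_ = np.linalg.lstsq(design(W, R), Y, rcond=None)

# Step 3: average the fitted beta(w, r(w, X_i)) over the sample distribution
# of X. Note the score is evaluated at r(w, X_i), not at r(W_i, X_i).
def dose_response(w):
    r_w = gps(w, X)
    return (design(np.full(n, float(w)), r_w) @ coef).mean()

for w in (0.0, 1.0, 2.0):
    print(w, round(dose_response(w), 2))
```

Under weak unconfoundedness, the printed averages should track the true dose-response E[Y(w)] = w more closely than a raw comparison of outcomes across treatment levels; the quality of the approximation depends on the specifications chosen in the first two steps.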

Imbens and Wooldridge: Econometrics of Program Evaluation 75

Here, some form of linear or nonlinear regression may be used. In the third step the average response at treatment level w is estimated as the average of the estimated conditional expectation, β̂(w, r̂(w, Xi)), averaged over the distribution of the pretreatment variables, X1, ..., XN. Note that to get the average E[Yi(w)], the second argument in the conditional expectation β(w, r) is evaluated at r(w, Xi), not at r(Wi, Xi).

7.2.1 Dynamic Treatments with Unconfounded Treatment Assignment

Multiple-valued treatments can arise because at any point in time individuals can be assigned to multiple different treatment arms, or because they can be assigned sequentially to different treatments. Gill and Robins (2001) analyze this case, where they assume that at any point in time an unconfoundedness assumption holds. Lechner and Miquel (2005) (see also Lechner 1999, and Lechner, Miquel, and Conny Wunsch 2004) study a related case, where again a sequential unconfoundedness assumption is maintained to identify the average effects of interest. Abbring and Gerard J. van den Berg (2003) study settings with duration data. These methods hold great promise but, until now, there have been few substantive applications.

7.3 Multivalued Discrete Endogenous Treatments

In settings with general heterogeneity in the effects of the treatment, the case with more than two treatment levels is considerably more challenging than the binary case. There are few studies investigating identification in these settings. Angrist and Imbens (1995) and Angrist, Kathryn Graddy, and Imbens (2000) study the interpretation of the standard instrumental variables estimand, the ratio of the covariance of outcome and instrument to the covariance of treatment and instrument. They show that in general, with a valid instrument, the instrumental variables estimand can still be interpreted as an average causal effect, but with a complicated weighting scheme. There are essentially two levels of averaging going on. First, at each level of the treatment we can only get the average effect of a unit increase in the treatment for compliers at that level. In addition, there is averaging over all levels of the treatment, with the weights equal to the proportion of compliers at that level.

Imbens (2007) studies, in more detail, the case where the endogenous treatment takes on three values and shows the limits to identification in the case with heterogeneous treatment effects.

7.4 Continuous Endogenous Treatments

Perhaps surprisingly, there are many more results for the case with continuous endogenous treatments than for the discrete case that do not impose restrictive assumptions. Much of the focus has been on triangular systems, with a single unobserved component of the equation determining the treatment:

Wi = h(Zi, ηi),

where ηi is scalar, and an essentially unrestricted outcome equation:

Yi = g(Wi, εi),

where εi may be a vector. Blundell and James L. Powell (2003, 2004), Chernozhukov and Hansen (2005), Imbens and Newey (forthcoming), and Andrew Chesher (2003) study various versions of this setup. Imbens and Newey (forthcoming) show that if h(z, η) is strictly monotone in η, then one can identify average effects of the treatment subject to support conditions on the instrument. They suggest a control function approach to estimation. First, η is normalized to have a uniform distribution on [0, 1] (e.g., Rosa L. Matzkin 2003). Then ηi is estimated

as η̂i = F̂_W|Z(Wi | Zi). In the second stage, Yi is regressed nonparametrically on Wi and η̂i. Chesher (2003) studies local versions of this problem.

When the treatment equation has an additive form, say Wi = h1(Zi) + ηi, where ηi is independent of Zi, Blundell and Powell (2003, 2004) derive nonparametric control function methods for estimating the average structural function, E[g(w, εi)]. The general idea is to first obtain residuals, η̂i = Wi − ĥ1(Zi), from a nonparametric regression. Next, a nonparametric regression of Yi on Wi and η̂i is used to recover m(w, η) = E(Yi | Wi = w, ηi = η). Blundell and Powell show that the average structural function is generally identified as E[m(w, ηi)], which is easily estimated by averaging out η̂i across the sample.

8. Conclusion

Over the last two decades, there has been a proliferation of the literature on program evaluation. This includes theoretical econometrics work, as well as empirical work. Important features of the modern literature are the convergence of the statistical and econometric literatures, with the Rubin potential outcomes framework now the dominant framework. The modern literature has stressed the importance of relaxing functional form and distributional assumptions, and has allowed for general heterogeneity in the effects of the treatment. This has led to renewed interest in identification questions, leading to unusual and controversial estimands such as the local average treatment effect (Imbens and Angrist 1994), as well as to the literature on partial identification (Manski 1990). It has also borrowed heavily from the semiparametric literature, using both efficiency bound results (Hahn 1998) and methods for inference based on series and kernel estimation (Newey 1994a, 1994b). It has by now matured to the point that it is of great use for practitioners.

References

Abadie, Alberto. 2002. "Bootstrap Tests of Distributional Treatment Effects in Instrumental Variable Models." Journal of the American Statistical Association, 97(457): 284-92.

Abadie, Alberto. 2003. "Semiparametric Instrumental Variable Estimation of Treatment Response Models." Journal of Econometrics, 113(2): 231-63.

Abadie, Alberto. 2005. "Semiparametric Difference-in-Differences Estimators." Review of Economic Studies, 72(1): 1-19.

Abadie, Alberto, Joshua D. Angrist, and Guido W. Imbens. 2002. "Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings." Econometrica, 70(1): 91-117.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2007. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." National Bureau of Economic Research Working Paper 12831.

Abadie, Alberto, David Drukker, Jane Leber Herr, and Guido W. Imbens. 2004. "Implementing Matching Estimators for Average Treatment Effects in Stata." Stata Journal, 4(3): 290-311.

Abadie, Alberto, and Javier Gardeazabal. 2003. "The Economic Costs of Conflict: A Case Study of the Basque Country." American Economic Review, 93(1): 113-32.

Abadie, Alberto, and Guido W. Imbens. 2006. "Large Sample Properties of Matching Estimators for Average Treatment Effects." Econometrica, 74(1): 235-67.

Abadie, Alberto, and Guido W. Imbens. 2008a. "Bias-Corrected Matching Estimators for Average Treatment Effects." Unpublished.

Abadie, Alberto, and Guido W. Imbens. Forthcoming. "Estimation of the Conditional Variance in Paired Experiments." Annales d'Economie et de Statistique.

Abadie, Alberto, and Guido W. Imbens. 2008b. "On the Failure of the Bootstrap for Matching Estimators." Econometrica, 76(6): 1537-57.

Abbring, Jaap H., and James J. Heckman. 2007. "Econometric Evaluation of Social Programs, Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5145-5303. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.

Abbring, Jaap H., and Gerard J. van den Berg. 2003. "The Nonparametric Identification of Treatment Effects in Duration Models." Econometrica, 71(5): 1491-1517.

Andrews, Donald W. K., and Gustavo Soares. 2007. "Inference for Parameters Defined By Moment Inequalities Using Generalized Moment Selection." Cowles Foundation Discussion Paper 1631.

Angrist, Joshua D. 1990. "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social

Security Administrative Records." American Economic Review, 80(3): 313-36.

Angrist, Joshua D. 1998. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica, 66(2): 249-88.

Angrist, Joshua D. 2004. "Treatment Effect Heterogeneity in Theory and Practice." Economic Journal, 114(494): C52-83.

Angrist, Joshua D., Eric Bettinger, and Michael Kremer. 2006. "Long-Term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia." American Economic Review, 96(3): 847-62.

Angrist, Joshua D., Kathryn Graddy, and Guido W. Imbens. 2000. "The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish." Review of Economic Studies, 67(3): 499-527.

Angrist, Joshua D., and Jinyong Hahn. 2004. "When to Control for Covariates? Panel Asymptotics for Estimates of Treatment Effects." Review of Economics and Statistics, 86(1): 58-72.

Angrist, Joshua D., and Guido W. Imbens. 1995. "Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity." Journal of the American Statistical Association, 90(430): 431-42.

Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association, 91(434): 444-55.

Angrist, Joshua D., and Alan B. Krueger. 1999. "Empirical Strategies in Labor Economics." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1277-1366. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.

Angrist, Joshua D., and Kevin Lang. 2004. "Does School Integration Generate Peer Effects? Evidence from Boston's Metco Program." American Economic Review, 94(5): 1613-34.

Angrist, Joshua D., and Victor Lavy. 1999. "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." Quarterly Journal of Economics, 114(2): 533-75.

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.

Ashenfelter, Orley. 1978. "Estimating the Effect of Training Programs on Earnings." Review of Economics and Statistics, 6(1): 47-57.

Ashenfelter, Orley, and David Card. 1985. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics, 67(4): 648-60.

Athey, Susan, and Guido W. Imbens. 2006. "Identification and Inference in Nonlinear Difference-in-Differences Models." Econometrica, 74(2): 431-97.

Athey, Susan, and Scott Stern. 1998. "An Empirical Framework for Testing Theories About Complementarity in Organizational Design." National Bureau of Economic Research Working Paper 6600.

Attanasio, Orazio, Costas Meghir, and Ana Santiago. 2005. "Education Choices in Mexico: Using a Structural Model and a Randomized Experiment to Evaluate Progresa." Institute for Fiscal Studies Centre for the Evaluation of Development Policies Working Paper EWP05/01.

Austin, Peter C. 2008a. "A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003." Statistics in Medicine, 27(12): 2037-49.

Austin, Peter C. 2008b. "Discussion of 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003': Rejoinder." Statistics in Medicine, 27(12): 2066-69.

Balke, Alexander, and Judea Pearl. 1994. "Nonparametric Bounds of Causal Effects from Partial Compliance Data." University of California Los Angeles Cognitive Systems Laboratory Technical Report R-199.

Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics, 122(3): 1235-64.

Barnow, Burt S., Glen G. Cain, and Arthur S. Goldberger. 1980. "Issues in the Analysis of Selectivity Bias." In Evaluation Studies, Volume 5, ed. Ernst W. Stromsdorfer and George Farkas, 43-59. San Francisco: Sage.

Becker, Sascha O., and Andrea Ichino. 2002. "Estimation of Average Treatment Effects Based on Propensity Scores." Stata Journal, 2(4): 358-77.

Behncke, Stefanie, Markus Frölich, and Michael Lechner. 2006. "Statistical Assistance for Programme Selection - For a Better Targeting of Active Labour Market Policies in Switzerland." University of St. Gallen Department of Economics Discussion Paper 2006-09.

Beresteanu, Arie, and Francesca Molinari. 2006. "Asymptotic Properties for a Class of Partially Identified Models." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/06.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics, 119(1): 249-75.

Bertrand, Marianne, and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review, 94(4): 991-1013.

Besley, Timothy, and Anne C. Case. 2000. "Unnatural Experiments? Estimating the Incidence of Endogenous Policies." Economic Journal, 110(467): F672-94.

Bierens, Herman J. 1987. "Kernel Estimators of Regression Functions." In Advances in Econometrics: Fifth World Congress, Volume 1, ed. Truman F. Bewley, 99-144. Cambridge and New York: Cambridge University Press.

Bitler, Marianne, Jonah Gelbach, and Hilary Hoynes. 2006. "What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments." American Economic Review, 96(4): 988-1012.

Björklund, Anders, and Robert Moffitt. 1987. "The Estimation of Wage Gains and Welfare Gains in Self-Selection." Review of Economics and Statistics, 69(1): 42-49.

Black, Sandra E. 1999. "Do Better Schools Matter? Parental Valuation of Elementary Education." Quarterly Journal of Economics, 114(2): 577-99.

Bloom, Howard S. 1984. "Accounting for No-Shows in Experimental Evaluation Designs." Evaluation Review, 8(2): 225-46.

Bloom, Howard S., ed. 2005. Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.

Blundell, Richard, and Monica Costa Dias. 2002. "Alternative Approaches to Evaluation in Empirical Microeconomics." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/02.

Blundell, Richard, Monica Costa Dias, Costas Meghir, and John Van Reenen. 2001. "Evaluating the Employment Impact of a Mandatory Job Search Assistance Program." Institute for Fiscal Studies Working Paper W01/20.

Blundell, Richard, Alan Duncan, and Costas Meghir. 1998. "Estimating Labor Supply Responses Using Tax Reforms." Econometrica, 66(4): 827-61.

Blundell, Richard, Amanda Gosling, Hidehiko Ichimura, and Costas Meghir. 2004. "Changes in the Distribution of Male and Female Wages Accounting for Employment Composition Using Bounds." Institute for Fiscal Studies Working Paper W04/25.

Blundell, Richard, and Thomas MaCurdy. 1999. "Labor Supply: A Review of Alternative Approaches." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1559-1695. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.

Blundell, Richard, and James L. Powell. 2003. "Endogeneity in Nonparametric and Semiparametric Regression Models." In Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress, Volume 2, ed. Mathias Dewatripont, Lars Peter Hansen, and Stephen J. Turnovsky, 312-57. Cambridge and New York: Cambridge University Press.

Blundell, Richard, and James L. Powell. 2004. "Endogeneity in Semiparametric Binary Response Models." Review of Economic Studies, 71(3): 655-79.

Brock, William, and Steven N. Durlauf. 2000. "Interactions-Based Models." National Bureau of Economic Research Technical Working Paper 258.

Bruhn, Miriam, and David McKenzie. 2008. "In Pursuit of Balance: Randomization in Practice in Development Field Experiments." World Bank Policy Research Working Paper 4752.

Burtless, Gary. 1995. "The Case for Randomized Field Trials in Economic and Policy Research." Journal of Economic Perspectives, 9(2): 63-84.

Busso, Matias, John DiNardo, and Justin McCrary. 2008. "Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects." Unpublished.

Caliendo, Marco. 2006. Microeconometric Evaluation of Labour Market Policies. Heidelberg: Springer, Physica-Verlag.

Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge and New York: Cambridge University Press.

Canay, Ivan A. 2007. "EL Inference for Partially Identified Models: Large Deviations Optimality and Bootstrap Validity." Unpublished.

Card, David. 1990. "The Impact of the Mariel Boatlift on the Miami Labor Market." Industrial and Labor Relations Review, 43(2): 245-57.

Card, David. 2001. "Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems." Econometrica, 69(5): 1127-60.

Card, David, Carlos Dobkin, and Nicole Maestas. 2004. "The Impact of Nearly Universal Insurance Coverage on Health Care Utilization and Health: Evidence from Medicare." National Bureau of Economic Research Working Paper 10365.

Card, David, and Dean R. Hyslop. 2005. "Estimating the Effects of a Time-Limited Earnings Subsidy for Welfare-Leavers." Econometrica, 73(6): 1723-70.

Card, David, and Alan B. Krueger. 1993. "Trends in Relative Black-White Earnings Revisited." American Economic Review, 83(2): 85-91.

Card, David, and Alan B. Krueger. 1994. "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." American Economic Review, 84(4): 772-93.

Card, David, and Phillip B. Levine. 1994. "Unemployment Insurance Taxes and the Cyclical and Seasonal Properties of Unemployment." Journal of Public Economics, 53(1): 1-29.

Card, David, Alexandre Mas, and Jesse Rothstein. 2007. "Tipping and the Dynamics of Segregation." National Bureau of Economic Research Working Paper 13052.

Card, David, and Brian P. McCall. 1996. "Is Workers' Compensation Covering Uninsured Medical Costs? Evidence from the 'Monday Effect.'" Industrial and Labor Relations Review, 49(4): 690-706.

Card, David, and Philip K. Robins. 1996. "Do Financial Incentives Encourage Welfare Recipients to Work? Evidence from a Randomized Evaluation of the Self-Sufficiency Project." National Bureau of Economic Research Working Paper 5701.

Card, David, and Daniel G. Sullivan. 1988. "Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment." Econometrica, 56(3): 497-530.

Case, Anne C., and Lawrence F. Katz. 1991. "The Company You Keep: The Effects of Family and Neighborhood on Disadvantaged Youths." National Bureau of Economic Research Working Paper 3705.

Chamberlain, Gary. 1986. "Asymptotic Efficiency in Semiparametric Models with Censoring." Journal of Econometrics, 32(2): 189-218.

Chattopadhyay, Raghabendra, and Esther Duflo. 2004. "Women as Policy Makers: Evidence from a Randomized Policy Experiment in India." Econometrica, 72(5): 1409-43.

Chay, Kenneth Y., and Michael Greenstone. 2005. "Does Air Quality Matter? Evidence from the Housing Market." Journal of Political Economy, 113(2): 376-424.

Chen, Susan, and Wilbert van der Klaauw. 2008. "The Work Disincentive Effects of the Disability Insurance Program in the 1990s." Journal of Econometrics, 142(2): 757-84.

Chen, Xiaohong. 2007. "Large Sample Sieve Estimation of Semi-nonparametric Models." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5549-5632. Amsterdam and Oxford: Elsevier, North-Holland.

Chen, Xiaohong, Han Hong, and Alessandro Tarozzi. 2008. "Semiparametric Efficiency in GMM Models with Auxiliary Data." Annals of Statistics, 36(2): 808-43.

Chernozhukov, Victor, and Christian B. Hansen. 2005. "An IV Model of Quantile Treatment Effects." Econometrica, 73(1): 245-61.

Chernozhukov, Victor, Han Hong, and Elie Tamer. 2007. "Estimation and Confidence Regions for Parameter Sets in Econometric Models." Econometrica, 75(5): 1243-84.

Chesher, Andrew. 2003. "Identification in Nonseparable Models." Econometrica, 71(5): 1405-41.

Chetty, Raj, Adam Looney, and Kory Kroft. Forthcoming. "Salience and Taxation: Theory and Evidence." American Economic Review.

Cochran, William G. 1968. "The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies." Biometrics, 24(2): 295-314.

Cochran, William G., and Donald B. Rubin. 1973. "Controlling Bias in Observational Studies: A Review." Sankhya, 35(4): 417-46.

Cook, Thomas D. 2008. "'Waiting for Life to Arrive': A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics." Journal of Econometrics, 142(2): 636-54.

Cook, Philip J., and George Tauchen. 1982. "The Effect of Liquor Taxes on Heavy Drinking." Bell Journal of Economics, 13(2): 379-90.

Cook, Philip J., and George Tauchen. 1984. "The Effect of Minimum Drinking Age Legislation on Youthful Auto Fatalities, 1970-1977." Journal of Legal Studies, 13(1): 169-90.

Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. "Dealing with Limited Overlap in Estimation of Average Treatment Effects." Biometrika, 96: 187-99.

Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2008. "Nonparametric Tests for Treatment Effect Heterogeneity." Review of Economics and Statistics, 90(3): 389-405.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge and New York: Cambridge University Press.

Dehejia, Rajeev H. 2003. "Was There a Riverside Miracle? A Hierarchical Framework for Evaluating Programs with Grouped Data." Journal of Business and Economic Statistics, 21(1): 1-11.

Dehejia, Rajeev H. 2005a. "Practical Propensity Score Matching: A Reply to Smith and Todd." Journal of Econometrics, 125(1-2): 355-64.

Dehejia, Rajeev H. 2005b. "Program Evaluation as a Decision Problem." Journal of Econometrics, 125(1-2): 141-73.

Dehejia, Rajeev H., and Sadek Wahba. 1999. "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs." Journal of the American Statistical Association, 94(448): 1053-62.

Diamond, Alexis, and Jasjeet S. Sekhon. 2008. "Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." Unpublished.

DiNardo, John, and David S. Lee. 2004. "Economic Impacts of New Unionization on Private Sector Employers: 1984-2001." Quarterly Journal of Economics, 119(4): 1383-1441.

Doksum, Kjell. 1974. "Empirical Probability Plots and Statistical Inference for Nonlinear Models in the Two-Sample Case." Annals of Statistics, 2(2): 267-77.

Donald, Stephen G., and Kevin Lang. 2007. "Inference with Difference-in-Differences and Other Panel Data." Review of Economics and Statistics, 89(2): 221-33.

Duflo, Esther. 2001. "Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment." American Economic Review, 91(4): 795-813.

Duflo, Esther, William Gale, Jeffrey B. Liebman, Peter Orszag, and Emmanuel Saez. 2006. "Saving Incentives for Low- and Middle-Income Families: Evidence from a Field Experiment with H&R Block." Quarterly Journal of Economics, 121(4): 1311-46.

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Volume 4, ed. T. Paul Schultz and John Strauss, 3895-3962. Amsterdam and Oxford: Elsevier, North-Holland.

Duflo, Esther, and Rema Hanna. 2005. "Monitoring Works: Getting Teachers to Come to School." National Bureau of Economic Research Working Paper 11880.

Duflo, Esther, and Emmanuel Saez. 2003. "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment." Quarterly Journal of Economics, 118(3): 815-42.

Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York and London: Chapman and Hall.

Eissa, Nada, and Jeffrey B. Liebman. 1996. "Labor Supply Response to the Earned Income Tax Credit." Quarterly Journal of Economics, 111(2): 605-37.

Engle, Robert F., David F. Hendry, and Jean-Francois Richard. 1983. "Exogeneity." Econometrica, 51(2): 277-304.

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. London: Chapman and Hall.

Ferraz, Claudio, and Frederico Finan. 2008. "Exposing Corrupt Politicians: The Effects of Brazil's Publicly Released Audits on Electoral Outcomes." Quarterly Journal of Economics, 123(2): 703-45.

Firpo, Sergio. 2007. "Efficient Semiparametric Estimation of Quantile Treatment Effects." Econometrica, 75(1): 259-76.

Fisher, Ronald A. 1935. The Design of Experiments. First edition. London: Oliver and Boyd.

Flores, Carlos A. 2005. "Estimation of Dose-Response Functions and Optimal Doses with a Continuous Treatment." Unpublished.

Fraker, Thomas, and Rebecca Maynard. 1987. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." Journal of Human Resources, 22(2): 194-227.

Friedlander, Daniel, and Judith M. Gueron. 1992. "Are High-Cost Services More Effective than Low-Cost Services?" In Evaluating Welfare Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 143-98. Cambridge and London: Harvard University Press.

Friedlander, Daniel, and Philip K. Robins. 1995. "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods." American Economic Review, 85(4): 923-37.

Frölich, Markus. 2004a. "Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators." Review of Economics and Statistics, 86(1): 77-90.

Frölich, Markus. 2004b. "A Note on the Role of the Propensity Score for Estimating Average Treatment Effects." Econometric Reviews, 23(2): 167-74.

Gill, Richard D., and James M. Robins. 2001. "Causal Inference for Complex Longitudinal Data: The Continuous Case." Annals of Statistics, 29(6): 1785-1811.

Glaeser, Edward L., Bruce Sacerdote, and Jose A. Scheinkman. 1996. "Crime and Social Interactions." Quarterly Journal of Economics, 111(2): 507-48.

Goldberger, Arthur S. 1972a. "Selection Bias in Evaluating Treatment Effects: Some Formal Illustrations." Unpublished.

Goldberger, Arthur S. 1972b. "Selection Bias in Evaluating Treatment Effects: The Case of Interaction." Unpublished.

Gourieroux, C., A. Monfort, and A. Trognon. 1984a. "Pseudo Maximum Likelihood Methods: Applications to Poisson Models." Econometrica, 52(3): 701-20.

Gourieroux, C., A. Monfort, and A. Trognon. 1984b. "Pseudo Maximum Likelihood Methods: Theory." Econometrica, 52(3): 681-700.

Graham, Bryan S. 2008. "Identifying Social Interactions through Conditional Variance Restrictions." Econometrica, 76(3): 643-60.

Graham, Bryan S., Guido W. Imbens, and Geert Ridder. 2006. "Complementarity and Aggregate Implications of Assortative Matching: A Nonparametric Analysis." Unpublished.

Greenberg, David, and Michael Wiseman. 1992. "What Did the OBRA Demonstrations Do?" In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 25-75. Cambridge and London: Harvard University Press.

Gu, X., and Paul R. Rosenbaum. 1993. "Comparison of Multivariate Matching Methods: Structures, Distances and Algorithms." Journal of Computational and Graphical Statistics, 2(4): 405-20.

Gueron, Judith M., and Edward Pauly. 1991. From Welfare to Work. New York: Russell Sage Foundation.

Haavelmo, Trygve. 1943. "The Statistical Implications of a System of Simultaneous Equations." Econometrica, 11(1): 1-12.

Hahn, Jinyong. 1998. "On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects." Econometrica, 66(2): 315-31.

Hahn, Jinyong, Petra E. Todd, and Wilbert van der Klaauw. 2001. "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design." Econometrica, 69(1): 201-09.

Ham, John C., and Robert J. LaLonde. 1996. "The Effect of Sample Selection and Initial Conditions in Duration Models: Evidence from Experimental Data on Training." Econometrica, 64(1): 175-205.

Hamermesh, Daniel S., and Jeff E. Biddle. 1994. "Beauty and the Labor Market." American Economic Review, 84(5): 1174-94.

Hansen, B. B. 2008. "The Essential Role of Balance Tests in Propensity-Matched Observational Studies: Comments on 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003' by Peter Austin." Statistics in Medicine, 27(12): 2050-54.

Hansen, Christian B. 2007a. "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large." Journal of Econometrics, 141(2): 597-620.

Hansen, Christian B. 2007b. "Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects." Journal of Econometrics, 140(2): 670-94.

Hanson, Samuel, and Adi Sunderam. 2008. "The Variance of Average Treatment Effect Estimators in the Presence of Clustering." Unpublished.

Härdle, Wolfgang. 1990. Applied Nonparametric Regression. Cambridge; New York and Melbourne: Cambridge University Press.

Heckman, James J. 1990. "Varieties of Selection Bias." American Economic Review, 80(2): 313-18.

Heckman, James J., and V. Joseph Hotz. 1989. "Choosing among Alternative Nonexperimental

This content downloaded from


130.56.64.101 on Mon, 01 Aug 2022 12:02:36 UTC
All use subject to https://about.jstor.org/terms
Imbens and Wooldridge: Econometrics of Program Evaluation 81

Methods for Estimating the Impact of Social Programs: The Case of Manpower Training." Journal of the American Statistical Association, 84(408): 862-74.
Heckman, James J., Hidehiko Ichimura, Jeffrey A. Smith, and Petra E. Todd. 1998. "Characterizing Selection Bias Using Experimental Data." Econometrica, 66(5): 1017-98.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. "Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme." Review of Economic Studies, 64(4): 605-54.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1998. "Matching as an Econometric Evaluation Estimator." Review of Economic Studies, 65(2): 261-94.
Heckman, James J., Robert J. LaLonde, and Jeffrey A. Smith. 1999. "The Economics and Econometrics of Active Labor Market Programs." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1865-2097. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Heckman, James J., Lance Lochner, and Christopher Taber. 1999. "Human Capital Formation and General Equilibrium Treatment Effects: A Study of Tax and Tuition Policy." Fiscal Studies, 20(1): 25-40.
Heckman, James J., and Salvador Navarro-Lozano. 2004. "Using Matching, Instrumental Variables, and Control Functions to Estimate Economic Choice Models." Review of Economics and Statistics, 86(1): 30-57.
Heckman, James J., and Richard Robb Jr. 1985. "Alternative Methods for Evaluating the Impact of Interventions." In Longitudinal Analysis of Labor Market Data, ed. James J. Heckman and Burton Singer, 156-245. Cambridge; New York and Sydney: Cambridge University Press.
Heckman, James J., and Jeffrey A. Smith. 1995. "Assessing the Case for Social Experiments." Journal of Economic Perspectives, 9(2): 85-110.
Heckman, James J., and Jeffrey A. Smith. 1997. "Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts." Review of Economic Studies, 64(4): 487-535.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. "Understanding Instrumental Variables in Models with Essential Heterogeneity." Review of Economics and Statistics, 88(3): 389-432.
Heckman, James J., and Edward Vytlacil. 2005. "Structural Equations, Treatment Effects, and Econometric Policy Evaluation." Econometrica, 73(3): 669-738.
Heckman, James J., and Edward Vytlacil. 2007a. "Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4779-4874. Amsterdam and Oxford: Elsevier, North-Holland.
Heckman, James J., and Edward Vytlacil. 2007b. "Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast Their Effects in New Environments." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4875-5143. Amsterdam and Oxford: Elsevier, North-Holland.
Hill, Jennifer. 2008. "Discussion of Research Using Propensity-Score Matching: Comments on 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003' by Peter Austin." Statistics in Medicine, 27(12): 2055-61.
Hirano, Keisuke, and Guido W. Imbens. 2001. "Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization." Health Services and Outcomes Research Methodology, 2(3-4): 259-78.
Hirano, Keisuke, and Guido W. Imbens. 2004. "The Propensity Score with Continuous Treatments." In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. Andrew Gelman and Xiao-Li Meng, 73-84. Hoboken, N.J.: Wiley.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score." Econometrica, 71(4): 1161-89.
Hirano, Keisuke, Guido W. Imbens, Donald B. Rubin, and Xiao-Hua Zhou. 2000. "Assessing the Effect of an Influenza Vaccine in an Encouragement Design." Biostatistics, 1(1): 69-88.
Hirano, Keisuke, and Jack R. Porter. 2008. "Asymptotics for Statistical Treatment Rules." http://www.u.arizona.edu/~hirano/hp3_2008_08_10.pdf.
Holland, Paul W. 1986. "Statistics and Causal Inference." Journal of the American Statistical Association, 81(396): 945-60.
Horowitz, Joel L. 2001. "The Bootstrap." In Handbook of Econometrics, Volume 5, ed. James J. Heckman and Edward Leamer, 3159-3228. Amsterdam; London and New York: Elsevier Science, North-Holland.
Horowitz, Joel L., and Charles F. Manski. 2000. "Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data." Journal of the American Statistical Association, 95(449): 77-84.
Horvitz, D. G., and D. J. Thompson. 1952. "A Generalization of Sampling without Replacement from a Finite Universe." Journal of the American Statistical Association, 47(260): 663-85.
Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2006. "Evaluating the Differential Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program." Journal of Labor Economics, 24(3): 521-66.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. "Predicting the Efficacy of Future Training Programs Using Past Experiences at Other

Journal of Economic Literature, Vol. XLVII (March 2009)

Locations." Journal of Econometrics, 125(1-2): 241-70.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders. 1997. "Bounding Causal Effects Using Data from a Contaminated Natural Experiment: Analysing the Effects of Teenage Childbearing." Review of Economic Studies, 64(4): 575-603.
Iacus, Stefano M., Gary King, and Giuseppe Porro. 2008. "Matching for Causal Inference without Balance Checking." Unpublished.
Ichimura, Hidehiko, and Oliver Linton. 2005. "Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators." In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. Donald W. K. Andrews and James H. Stock, 149-70. Cambridge and New York: Cambridge University Press.
Ichimura, Hidehiko, and Petra E. Todd. 2007. "Implementing Nonparametric and Semiparametric Estimators." In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5369-5468. Amsterdam and Oxford: Elsevier, North-Holland.
Imbens, Guido W. 2000. "The Role of the Propensity Score in Estimating Dose-Response Functions." Biometrika, 87(3): 706-10.
Imbens, Guido W. 2003. "Sensitivity to Exogeneity Assumptions in Program Evaluation." American Economic Review, 93(2): 126-32.
Imbens, Guido W. 2004. "Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review." Review of Economics and Statistics, 86(1): 4-29.
Imbens, Guido W. 2007. "Non-additive Models with Endogenous Regressors." In Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, Volume 3, ed. Richard Blundell, Whitney K. Newey, and Torsten Persson, 17-46. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., and Joshua D. Angrist. 1994. "Identification and Estimation of Local Average Treatment Effects." Econometrica, 62(2): 467-75.
Imbens, Guido W., and Karthik Kalyanaraman. 2009. "Optimal Bandwidth Choice for the Regression Discontinuity Estimator." National Bureau of Economic Research Working Paper 14726.
Imbens, Guido W., Gary King, David McKenzie, and Geert Ridder. 2008. "On the Benefits of Stratification in Randomized Experiments." Unpublished.
Imbens, Guido W., and Thomas Lemieux. 2008. "Regression Discontinuity Designs: A Guide to Practice." Journal of Econometrics, 142(2): 615-35.
Imbens, Guido W., and Charles F. Manski. 2004. "Confidence Intervals for Partially Identified Parameters." Econometrica, 72(6): 1845-57.
Imbens, Guido W., and Whitney K. Newey. Forthcoming. "Identification and Estimation of Triangular Simultaneous Equations Models without Additivity." National Bureau of Economic Research Technical Working Paper. Econometrica.
Imbens, Guido W., Whitney K. Newey, and Geert Ridder. 2005. "Mean-Squared-Error Calculations for Average Treatment Effects." Unpublished.
Imbens, Guido W., and Donald B. Rubin. 1997a. "Bayesian Inference for Causal Effects in Randomized Experiments with Noncompliance." Annals of Statistics, 25(1): 305-27.
Imbens, Guido W., and Donald B. Rubin. 1997b. "Estimating Outcome Distributions for Compliers in Instrumental Variables Models." Review of Economic Studies, 64(4): 555-74.
Imbens, Guido W., and Donald B. Rubin. Forthcoming. Causal Inference in Statistics and the Social Sciences. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., Donald B. Rubin, and Bruce I. Sacerdote. 2001. "Estimating the Effect of Unearned Income on Labor Earnings, Savings, and Consumption: Evidence from a Survey of Lottery Players." American Economic Review, 91(4): 778-94.
Jin, Ginger Zhe, and Phillip Leslie. 2003. "The Effect of Information on Product Quality: Evidence from Restaurant Hygiene Grade Cards." Quarterly Journal of Economics, 118(2): 409-51.
Joffe, Marshall M., and Paul R. Rosenbaum. 1999. "Invited Commentary: Propensity Scores." American Journal of Epidemiology, 150(4): 327-33.
Kitagawa, Toru. 2008. "Identification Bounds for the Local Average Treatment Effect." Unpublished.
Kling, Jeffrey R., Jeffrey B. Liebman, and Lawrence F. Katz. 2007. "Experimental Analysis of Neighborhood Effects." Econometrica, 75(1): 83-119.
Lalive, Rafael. 2008. "How Do Extended Benefits Affect Unemployment Duration? A Regression Discontinuity Approach." Journal of Econometrics, 142(2): 785-806.
LaLonde, Robert J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review, 76(4): 604-20.
Lechner, Michael. 1999. "Earnings and Employment Effects of Continuous Off-the-Job Training in East Germany after Unification." Journal of Business and Economic Statistics, 17(1): 74-90.
Lechner, Michael. 2001. "Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption." In Econometric Evaluation of Labour Market Policies, ed. Michael Lechner and Friedhelm Pfeiffer, 43-58. Heidelberg and New York: Physica; Mannheim: Centre for European Economic Research.
Lechner, Michael. 2002a. "Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies." Review of Economics and Statistics, 84(2): 205-20.
Lechner, Michael. 2002b. "Some Practical Issues in the Evaluation of Heterogeneous Labour Market Programmes by Matching Methods." Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(1): 59-82.
Lechner, Michael. 2004. "Sequential Matching


Estimation of Dynamic Causal Models." University of St. Gallen Department of Economics Discussion Paper 2004-06.
Lechner, Michael, and Ruth Miquel. 2005. "Identification of the Effects of Dynamic Treatments By Sequential Conditional Independence Assumptions." University of St. Gallen Department of Economics Discussion Paper 2005-17.
Lechner, Michael, Ruth Miquel, and Conny Wunsch. 2004. "Long-Run Effects of Public Sector Sponsored Training in West Germany." Institute for the Study of Labor Discussion Paper 1443.
Lee, David S. 2001. "The Electoral Advantage to Incumbency and the Voters' Valuation of Politicians' Experience: A Regression Discontinuity Analysis of Elections to the U.S. ..." National Bureau of Economic Research Working Paper 8441.
Lee, David S. 2008. "Randomized Experiments from Non-random Selection in U.S. House Elections." Journal of Econometrics, 142(2): 675-97.
Lee, David S., and David Card. 2008. "Regression Discontinuity Inference with Specification Error." Journal of Econometrics, 142(2): 655-74.
Lee, David S., and Thomas Lemieux. 2008. "Regression Discontinuity Designs in Economics." Unpublished.
Lee, David S., Enrico Moretti, and Matthew J. Butler. 2004. "Do Voters Affect or Elect Policies? Evidence from the U.S. House." Quarterly Journal of Economics, 119(3): 807-59.
Lee, Myoung-Jae. 2005a. Micro-Econometrics for Policy, Program, and Treatment Effects. Oxford and New York: Oxford University Press.
Lee, Myoung-Jae. 2005b. "Treatment Effect and Sensitivity Analysis for Self-Selected Treatment and Selectively Observed Response." Unpublished.
Lehmann, Erich L. 1974. Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day.
Lemieux, Thomas, and Kevin Milligan. 2008. "Incentive Effects of Social Assistance: A Regression Discontinuity Approach." Journal of Econometrics, 142(2): 807-28.
Leuven, Edwin, and Barbara Sianesi. 2003. "PSMATCH2: Stata Module to Perform Full Mahalanobis and Propensity Score Matching, Common Support Graphing, and Covariate Imbalance Testing." http://ideas.repec.org/c/boc/bocode/s432001.html.
Li, Qi, Jeffrey S. Racine, and Jeffrey M. Wooldridge. Forthcoming. "Efficient Estimation of Average Treatment Effects with Mixed Categorical and Continuous Data." Journal of Business and Economic Statistics.
Linton, Oliver, and Pedro Gozalo. 2003. "Conditional Independence Restrictions: Testing and Estimation." Unpublished.
Little, Roderick J. A., and Donald B. Rubin. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Ludwig, Jens, and Douglas L. Miller. 2005. "Does Head Start Improve Children's Life Chances? Evidence from a Regression Discontinuity Design." National Bureau of Economic Research Working Paper 11702.
Ludwig, Jens, and Douglas L. Miller. 2007. "Does Head Start Improve Children's Life Chances? Evidence from a Regression Discontinuity Design." Quarterly Journal of Economics, 122(1): 159-208.
Manski, Charles F. 1990. "Nonparametric Bounds on Treatment Effects." American Economic Review, 80(2): 319-23.
Manski, Charles F. 1993. "Identification of Endogenous Social Effects: The Reflection Problem." Review of Economic Studies, 60(3): 531-42.
Manski, Charles F. 1995. Identification Problems in the Social Sciences. Cambridge and London: Harvard University Press.
Manski, Charles F. 2000a. "Economic Analysis of Social Interactions." Journal of Economic Perspectives, 14(3): 115-36.
Manski, Charles F. 2000b. "Identification Problems and Decisions under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice." Journal of Econometrics, 95(2): 415-42.
Manski, Charles F. 2001. "Designing Programs for Heterogeneous Populations: The Value of Covariate Information." American Economic Review, 91(2): 103-06.
Manski, Charles F. 2002. "Treatment Choice under Ambiguity Induced By Inferential Problems." Journal of Statistical Planning and Inference, 105(1): 67-82.
Manski, Charles F. 2003. Partial Identification of Probability Distributions. New York and Heidelberg: Springer.
Manski, Charles F. 2004. "Statistical Treatment Rules for Heterogeneous Populations." Econometrica, 72(4): 1221-46.
Manski, Charles F. 2005. Social Choice with Partial Knowledge of Treatment Response. Princeton and Oxford: Princeton University Press.
Manski, Charles F. 2007. Identification for Prediction and Decision. Cambridge and London: Harvard University Press.
Manski, Charles F., and John V. Pepper. 2000. "Monotone Instrumental Variables: With an Application to the Returns to Schooling." Econometrica, 68(4): 997-1010.
Manski, Charles F., Gary D. Sandefur, Sara McLanahan, and Daniel Powers. 1992. "Alternative Estimates of the Effect of Family Structure during Adolescence on High School Graduation." Journal of the American Statistical Association, 87(417): 25-37.
Matzkin, Rosa L. 2003. "Nonparametric Estimation of Nonadditive Random Functions." Econometrica, 71(5): 1339-75.
McCrary, Justin. 2008. "Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test." Journal of Econometrics, 142(2): 698-714.


McEwan, Patrick J., and Joseph S. Shapiro. 2008. "The Benefits of Delayed Primary School Enrollment: Discontinuity Estimates Using Exact Birth Dates." Journal of Human Resources, 43(1): 1-29.
Mealli, Fabrizia, Guido W. Imbens, Salvatore Ferro, and Annibale Biggeri. 2004. "Analyzing a Randomized Trial on Breast Self-Examination with Noncompliance and Missing Outcomes." Biostatistics, 5(2): 207-22.
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin. 1995. "Workers' Compensation and Injury Duration: Evidence from a Natural Experiment." American Economic Review, 85(3): 322-40.
Miguel, Edward, and Michael Kremer. 2004. "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities." Econometrica, 72(1): 159-217.
Morgan, Stephen L., and Christopher Winship. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge and New York: Cambridge University Press.
Moulton, Brent R. 1990. "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units." Review of Economics and Statistics, 72(2): 334-38.
Moulton, Brent R., and William C. Randolph. 1989. "Alternative Tests of the Error Components Model." Econometrica, 57(3): 685-93.
Newey, Whitney K. 1994a. "Kernel Estimation of Partial Means and a General Variance Estimator." Econometric Theory, 10(2): 233-53.
Newey, Whitney K. 1994b. "Series Estimation of Regression Functionals." Econometric Theory, 10(1): 1-28.
Olken, Benjamin A. 2007. "Monitoring Corruption: Evidence from a Field Experiment in Indonesia." Journal of Political Economy, 115(2): 200-249.
Pagan, Adrian, and Aman Ullah. 1999. Nonparametric Econometrics. Cambridge; New York and Melbourne: Cambridge University Press.
Pakes, Ariel, Jack R. Porter, Kate Ho, and Joy Ishii. 2006. "Moment Inequalities and Their Application." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP16/07.
Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge; New York and Melbourne: Cambridge University Press.
Pettersson-Lidbom, Per. 2007. "The Policy Consequences of Direct versus Representative Democracy: A Regression-Discontinuity Approach." Unpublished.
Pettersson-Lidbom, Per. 2008. "Does the Size of the Legislature Affect the Size of Government? Evidence from Two Natural Experiments." Unpublished.
Pettersson-Lidbom, Per, and Björn Tyrefors. 2007. "Do Parties Matter for Economic Outcomes? A Regression-Discontinuity Approach." Unpublished.
Politis, Dimitris N., Joseph P. Romano, and Michael Wolf. 1999. Subsampling. New York: Springer-Verlag.
Porter, Jack R. 2003. "Estimation in the Regression Discontinuity Model." Unpublished.
Quade, D. 1982. "Nonparametric Analysis of Covariance By Matching." Biometrics, 38(3): 597-611.
Racine, Jeffrey S., and Qi Li. 2004. "Nonparametric Estimation of Regression Functions with Both Categorical and Continuous Data." Journal of Econometrics, 119(1): 99-130.
Riccio, James, and Daniel Friedlander. 1992. GAIN: Program Strategies, Participation Patterns, and First-Year Impacts in Six Counties. New York: Manpower Demonstration Research Corporation.
Riccio, James, Daniel Friedlander, and Stephen Freedman. 1994. GAIN: Benefits, Costs, and Three-Year Impacts of a Welfare-to-Work Program. New York: Manpower Demonstration Research Corporation.
Robins, James M., and Ya'acov Ritov. 1997. "Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models." Statistics in Medicine, 16(3): 285-319.
Robins, James M., and Andrea Rotnitzky. 1995. "Semiparametric Efficiency in Multivariate Regression Models with Missing Data." Journal of the American Statistical Association, 90(429): 122-29.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1995. "Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data." Journal of the American Statistical Association, 90(429): 106-21.
Robinson, Peter M. 1988. "Root-N-Consistent Semiparametric Regression." Econometrica, 56(4): 931-54.
Romano, Joseph P., and Azeem M. Shaikh. 2006a. "Inference for Identifiable Parameters in Partially Identified Econometric Models." Stanford University Department of Statistics Technical Report 2006-9.
Romano, Joseph P., and Azeem M. Shaikh. 2006b. "Inference for the Identified Set in Partially Identified Econometric Models." Unpublished.
Rosen, Adam M. 2006. "Confidence Sets for Partially Identified Parameters That Satisfy a Finite Number of Moment Inequalities." Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP25/06.
Rosenbaum, Paul R. 1984a. "Conditional Permutation Tests and the Propensity Score in Observational Studies." Journal of the American Statistical Association, 79(387): 565-74.
Rosenbaum, Paul R. 1984b. "The Consequences of Adjustment for a Concomitant Variable That Has Been Affected By the Treatment." Journal of the Royal Statistical Society: Series A (Statistics in Society), 147(5): 656-66.
Rosenbaum, Paul R. 1987. "The Role of a Second Control Group in an Observational Study." Statistical Science, 2(3): 292-306.
Rosenbaum, Paul R. 1989. "Optimal Matching for Observational Studies." Journal of the American Statistical Association, 84(408): 1024-32.
Rosenbaum, Paul R. 1995. Observational Studies. New York; Heidelberg and London: Springer.
Rosenbaum, Paul R. 2002. "Covariance Adjustment in Randomized Experiments and Observational


Studies." Statistical Science, 17(3): 286-327.
Rosenbaum, Paul R., and Donald B. Rubin. 1983a. "Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 45(2): 212-18.
Rosenbaum, Paul R., and Donald B. Rubin. 1983b. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, 70(1): 41-55.
Rosenbaum, Paul R., and Donald B. Rubin. 1984. "Reducing Bias in Observational Studies Using Subclassification on the Propensity Score." Journal of the American Statistical Association, 79(387): 516-24.
Rosenbaum, Paul R., and Donald B. Rubin. 1985. "Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score." American Statistician, 39(1): 33-38.
Rotnitzky, Andrea, and James M. Robins. 1995. "Semiparametric Regression Estimation in the Presence of Dependent Censoring." Biometrika, 82(4): 805-20.
Roy, A. D. 1951. "Some Thoughts on the Distribution of Earnings." Oxford Economic Papers, 3(2): 135-46.
Rubin, Donald B. 1973a. "Matching to Remove Bias in Observational Studies." Biometrics, 29(1): 159-83.
Rubin, Donald B. 1973b. "The Use of Matched Sampling and Regression Adjustment to Remove Bias in Observational Studies." Biometrics, 29(1): 184-203.
Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology, 66(5): 688-701.
Rubin, Donald B. 1976. "Inference and Missing Data." Biometrika, 63(3): 581-92.
Rubin, Donald B. 1977. "Assignment to Treatment Group on the Basis of a Covariate." Journal of Educational Statistics, 2(1): 1-26.
Rubin, Donald B. 1978. "Bayesian Inference for Causal Effects: The Role of Randomization." Annals of Statistics, 6(1): 34-58.
Rubin, Donald B. 1979. "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies." Journal of the American Statistical Association, 74(366): 318-28.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Rubin, Donald B. 1990. "Formal Modes of Statistical Inference for Causal Effects." Journal of Statistical Planning and Inference, 25(3): 279-92.
Rubin, Donald B. 1997. "Estimating Causal Effects from Large Data Sets Using Propensity Scores." Annals of Internal Medicine, 127(5 Part 2): 757-63.
Rubin, Donald B. 2006. Matched Sampling for Causal Effects. Cambridge and New York: Cambridge University Press.
Rubin, Donald B., and Neal Thomas. 1992a. "Affinely Invariant Matching Methods with Ellipsoidal Distributions." Annals of Statistics, 20(2): 1079-93.
Rubin, Donald B., and Neal Thomas. 1992b. "Characterizing the Effect of Matching Using Linear Propensity Score Methods with Normal Distributions." Biometrika, 79(4): 797-809.
Rubin, Donald B., and Neal Thomas. 1996. "Matching Using Estimated Propensity Scores: Relating Theory to Practice." Biometrics, 52(1): 249-64.
Rubin, Donald B., and Neal Thomas. 2000. "Combining Propensity Score Matching with Additional Adjustments for Prognostic Covariates." Journal of the American Statistical Association, 95(450): 573-85.
Sacerdote, Bruce. 2001. "Peer Effects with Random Assignment: Results for Dartmouth Roommates." Quarterly Journal of Economics, 116(2): 681-704.
Scharfstein, Daniel O., Andrea Rotnitzky, and James M. Robins. 1999. "Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models." Journal of the American Statistical Association, 94(448): 1096-1120.
Schultz, T. Paul. 2001. "School Subsidies for the Poor: Evaluating the Mexican Progresa Poverty Program." Yale University Economic Growth Center Discussion Paper 834.
Seifert, Burkhardt, and Theo Gasser. 1996. "Finite-Sample Variance of Local Polynomials: Analysis and Solutions." Journal of the American Statistical Association, 91(433): 267-75.
Seifert, Burkhardt, and Theo Gasser. 2000. "Data Adaptive Ridging in Local Polynomial Regression." Journal of Computational and Graphical Statistics, 9(2): 338-60.
Sekhon, Jasjeet S. Forthcoming. "Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R." Journal of Statistical Software.
Sekhon, Jasjeet S., and Richard Grieve. 2008. "A New Non-parametric Matching Method for Bias Adjustment with Applications to Economic Evaluations." http://sekhon.berkeley.edu/papers/GeneticMatching_SekhonGrieve.pdf.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Smith, Jeffrey A., and Petra E. Todd. 2001. "Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods." American Economic Review, 91(2): 112-18.
Smith, Jeffrey A., and Petra E. Todd. 2005. "Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?" Journal of Econometrics, 125(1-2): 305-53.
Splawa-Neyman, Jerzy. 1990. "On the Application of Probability Theory to Agricultural Experiments. Essays on Principles. Section 9." Statistical Science, 5(4): 465-72. (Orig. pub. 1923.)
Stock, James H. 1989. "Nonparametric Policy Analysis." Journal of the American Statistical Association, 84(406): 567-75.
Stone, Charles J. 1977. "Consistent Nonparametric Regression." Annals of Statistics, 5(4): 595-620.
Stoye, Jörg. 2007. "More on Confidence Intervals for


Partially Identified Parameters." Unpublished.
Stuart, Elizabeth A. 2008. "Developing Practical Recommendations for the Use of Propensity Scores: Discussion of 'A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003' by Peter Austin." Statistics in Medicine, 27(12): 2062-65.
Sun, Yixiao. 2005. "Adaptive Estimation of the Regression Discontinuity Model." Unpublished.
Thistlethwaite, Donald L., and Donald T. Campbell. 1960. "Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment." Journal of Educational Psychology, 51(6): 309-17.
Trochim, William M. K. 1984. Research Design for Program Evaluation: The Regression-Discontinuity Approach. Thousand Oaks, Calif.: Sage Publications.
Trochim, William M. K. 2001. "Regression-Discontinuity Design." In International Encyclopedia of the Social and Behavioral Sciences, Volume 20, ed. Neil J. Smelser and Paul B. Baltes, 12940-45. Oxford: Elsevier Science.
Van der Klaauw, Wilbert. 2002. "Estimating the Effect of Financial Aid Offers on College Enrollment: A Regression-Discontinuity Approach." International Economic Review, 43(4): 1249-87.
Van der Klaauw, Wilbert. 2008a. "Breaking the Link between Poverty and Low Student Achievement: An Evaluation of Title I." Journal of Econometrics, 142(2): 731-56.
Van der Klaauw, Wilbert. 2008b. "Regression-Discontinuity Analysis: A Survey of Recent Developments in Economics." Labour, 22(2): 219-45.
Van der Laan, Mark J., and James M. Robins. 2003. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, Physica-Verlag.
Vytlacil, Edward. 2002. "Independence, Monotonicity, and Latent Index Models: An Equivalence Result." Econometrica, 70(1): 331-41.
Wooldridge, Jeffrey M. 1999. "Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples." Econometrica, 67(6): 1385-1406.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge and London: MIT Press.
Wooldridge, Jeffrey M. 2005. "Violating Ignorability of Treatment By Controlling for Too Many Factors." Econometric Theory, 21(5): 1026-28.
Wooldridge, Jeffrey M. 2007. "Inverse Probability Weighted Estimation for General Missing Data Problems." Journal of Econometrics, 141(2): 1281-1301.
Zhao, Zhong. 2004. "Using Matching to Estimate Treatment Effects: Data Requirements, Matching Metrics, and Monte Carlo Evidence." Review of Economics and Statistics, 86(1): 91-107.
