Causal Inference in the Social Sciences

Guido W. Imbens
Department of Economics and Graduate School of Business, Stanford University, Stanford, California, USA; email: [email protected]

https://doi.org/10.1146/annurev-statistics-033121-114601

Copyright © 2024 by the author(s). This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

Knowledge of causal effects is of great importance to decision makers in a wide variety of settings. In many cases, however, these causal effects are not known to the decision makers and need to be estimated from data. This fundamental problem has been known and studied for many years in many disciplines. In the past thirty years, however, the amount of empirical as well as methodological research in this area has increased dramatically, and so has its scope. It has become more interdisciplinary, and the focus has been more specifically on methods for credibly estimating causal effects in a wide range of both experimental and observational settings. This work has greatly impacted empirical work in the social and biomedical sciences. In this article, I review some of this work and discuss open questions.
1. INTRODUCTION
Knowledge of causal effects is of great importance to decision makers in a wide variety of settings,
including policy makers in government and nongovernment organizations and decision makers in
the private sector. In many cases, these causal effects are not known to the decision makers and need
to be estimated from data. This fundamental problem has been known and studied for many years
in many disciplines. In the past thirty years, however, the amount of methodological and empirical
research has increased substantially. Its scope has also changed dramatically. Just to illustrate how
much the area has grown in the past thirty years, consider Figure 1, similar to figures for different
methodological terms in the work of Currie et al. (2020). This figure shows the fraction of working
papers in empirical economics published by the National Bureau of Economic Research (a widely
used working paper series in economics) as well as the fraction of published papers in leading
economics journals that use the term “causality” or related terms such as “causal.” Whereas before
1990, the percentage of papers using the term “causality” in empirical papers in economics was
modest—between 10% and 15%—and relatively constant, starting around 1990, the percentage
began to increase rapidly, so that by 2015, a full 50% of empirical papers in economics used the
term “causality.”
During these past thirty years, the study of statistical problems related to the estimation
of causal effects has become more interdisciplinary, with methodological contributions and in-
sights from statistics, econometrics, political science, computer science, epidemiology, biomedical
science, and others. The focus in this literature has been specifically on methods for credibly
estimating causal effects in both experimental and observational settings. This work has greatly
impacted empirical work in a variety of disciplines, including social and biomedical sciences. In
this article, I review some of this work and discuss open questions.
This review focuses on four areas. The first is the work on the analysis and design of ran-
domized controlled trials (RCTs) (Section 3). This is the traditional setting where researchers
in statistics have studied the estimation of causal effects since the seminal contributions by
Fisher and Neyman in the 1920s and 1930s, often in biomedical and agricultural settings. More
Figure 1
Fraction of papers using the term "causal" or "causality," by year (1980–2015), in top 5 economics journals and in NBER empirical working papers; motivated by similar figures in Currie et al. (2020). Abbreviation: NBER, National Bureau of Economic Research.
recently, researchers in statistics and social and computer sciences have developed innovative new
experimental designs, partly motivated by the dramatic increase in online experiments by tech
companies that now run hundreds of thousands of experiments annually (Gupta et al. 2019). In
fact, there are now multiple companies dedicated to running online randomized experiments.
These new experimental designs include sophisticated adaptive designs, as well as designs taking
into account complex interactions between units. The second area discussed is the analysis of
observational studies under unconfoundedness (Section 4). This is the most common setting for
observational studies. Following Rosenbaum & Rubin (1983b), researchers often make assump-
tions that justify adjusting for observed confounders through regression methods, matching,
inverse-propensity-score weighting, and variations thereon. The recent literature in this area has
focused on analyzing heterogeneous effects in this setting, as well as allowing for the presence of
high-dimensional confounders. The third area I discuss is methods for analyzing observational
studies in settings where unconfoundedness is not a plausible assumption, and in fact not even a
reasonable starting point (Section 5). This is an area that has been studied in econometrics since
the 1930s (Tinbergen 1930, Haavelmo 1943). In this literature a number of methods have been
developed that allow for credible estimation of causal effects in specific settings without uncon-
foundedness assumptions. The most popular of these methods include instrumental variables,
difference-in-differences (DID) methods, synthetic control (SC), and regression discontinuity
designs. Finally, I discuss methods for combining observational and experimental data leveraging
the strengths of both in order to address the shortcomings in either of them (Section 6).
This review is not comprehensive. One important area that I do not discuss in detail concerns
dynamic models, which have received much attention in epidemiology (Robins 1989, 1997; Robins
et al. 2000) and in the older econometric literature on panel data (Chamberlain 1984), but less in
the recent econometric literature, with an exception in Han (2021).
There have been a number of general books on causality and causal inference in statistics
and social sciences in the past two decades, including those of Rubin (2006), Imbens & Rubin
(2015), Cunningham (2018), Pearl (2000), Rosenbaum (2002, 2010), Morgan & Winship (2015),
and Huntington-Klein (2021), but the pace of the research means it is difficult for these to be
up-to-date. There are also a number of reviews in journals. Surveys with a focus on social sciences
include those by Imbens & Wooldridge (2009), Abadie & Cattaneo (2018), and Keele (2015).
In addition to the treatment Wi and the outcome Yi , we may observe additional variables for each
unit. Some of these include pretreatment variables, known to the researcher not to be affected by
the treatment. We denote such variables by Xi for unit i.
Often the interest is in some average effect of the treatment, such as the sample average
treatment effect,
$$\tau^{\text{sample}} \equiv \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(T) - Y_i(C) \bigr).$$
Alternatively we may be interested in the average effect in the population, τ pop ≡ E[Yi (T ) − Yi (C)],
if the sample can be viewed as a random sample from some population of interest, or the average
effect in some subpopulation, for example, the average effect for the treated, τ treated = E[Yi (T ) −
Yi (C)|Wi = T ]. The difference between the sample average treatment effect and the population
average treatment effect is subtle. Typically there are no implications for estimation: The best
estimator for τ sample is typically also the best estimator for τ pop if the sole information is in the
form of a random sample from the population. But there are implications for inference: We can-
not estimate τ pop as precisely as τ sample because there is an additional layer of uncertainty. These
issues have received some attention in the recent literature in the discussions of design-based ver-
sus model- or sampling-based uncertainty. For more information, readers are directed to Imbens
(2004), Rosenbaum (2010), Imbens & Rubin (2015), and Abadie et al. (2020).
In the recent literature, there is additional emphasis on heterogeneity in treatment effects by
characteristics of the population. Although there was always interest in differences in treatment
effects by prespecified groups, the use of large data sets, both from online experiments and from
observational studies based on administrative data, in combination with modern machine learning
methods, has led researchers to develop effective methods for estimating either the conditional
average treatment effect (CATE),
$$\tau(x) = E\bigl[\, Y_i(T) - Y_i(C) \mid X_i = x \,\bigr],$$
or summary statistics thereof (Chernozhukov et al. 2018, Wager & Athey 2018). Beyond esti-
mating the CATE, researchers have also focused on estimating policy functions that capture the
optimal assignment as a function of pretreatment variables (e.g., Athey & Wager 2021).
economists, and other social scientists routinely do targeted experiments (Harrison & List 2004,
Kalla & Broockman 2018).
Although the original experimental designs going back to Fisher and Neyman continue to be
widely used, interesting new designs have been developed in recent years. This is partly the result
of the recent interest in conducting randomized trials in academic settings, and partly because of
the interest from private sector organizations in experimental evaluations, often in online settings.
The canonical RCT, going back to Neyman and Fisher [Splawa-Neyman 1990 (1923), Fisher
1937], focuses on the case with a population of N units, each characterized by a pair of potential
outcomes (Yi (C), Yi (T )), as mentioned previously. This notation rules out the presence of spillover
effects, where exposing one unit to the treatment affects outcomes for other units. The absence of
such spillover effects is plausible in many of the traditional biomedical settings for experimenta-
tion, such as drug trials or agricultural experiments. However, even in biomedical settings, there
are exceptions, such as experiments involving infectious diseases. Concern about the presence of
spillovers in modern social science settings is widespread and has motivated new experimental
designs and analysis methods that we discuss in Section 3.3. The interest in traditional RCTs is in
the causal effects τ i = Yi (T ) − Yi (C), with the focus often on the average effect for the N units in
the sample or study population:
$$\tau^{\text{sample}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(T) - Y_i(C) \bigr).$$
Out of this population with N units, NT units are drawn at random and assigned to the treatment
group, and the remaining NC = N − NT units are assigned to the control group, with Wi ∈ {C, T} denoting the treatment.
In this setting, Fisher (1937) focused on testing sharp null hypotheses regarding the causal
effects. The most common hypothesis is that the treatment had no effect on the outcomes
whatsoever:
$$H_0: Y_i(C) = Y_i(T) \ \forall\, i, \quad \text{against the alternative} \quad H_a: \exists\, i \ \text{s.t.}\ Y_i(C) \neq Y_i(T).$$
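To make the logic of such a test concrete, here is a minimal sketch of a Fisher randomization test on simulated data (all numbers and variable names are illustrative, not from the article): under the sharp null the full outcome schedule is fixed, so re-drawing the assignment vector traces out the exact randomization distribution of any test statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated completely randomized experiment (illustrative parameters):
# N units, N_T treated, constant treatment effect of 1.0.
N, N_T = 100, 50
y = rng.normal(size=N)                    # outcomes under control
w = np.zeros(N, dtype=bool)
w[rng.choice(N, N_T, replace=False)] = True
y[w] += 1.0                               # add the treatment effect

def diff_in_means(y, w):
    return y[w].mean() - y[~w].mean()

t_obs = diff_in_means(y, w)

# Under the sharp null Y_i(C) = Y_i(T) for all i, the observed outcomes are
# fixed regardless of assignment, so re-drawing the assignment vector many
# times simulates the exact randomization distribution of the statistic.
draws = 10_000
t_null = np.empty(draws)
for b in range(draws):
    w_b = np.zeros(N, dtype=bool)
    w_b[rng.choice(N, N_T, replace=False)] = True
    t_null[b] = diff_in_means(y, w_b)

p_value = np.mean(np.abs(t_null) >= abs(t_obs))
```

The p-value is exact up to simulation error, with no appeal to asymptotics or modeling assumptions, which is the appeal of Fisher's approach.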
Although approximations to such exact testing procedures are still widely used, in social sciences,
it is rarely the case that tests of null hypotheses of no effects, either on average or for all units,
are of primary interest. While such questions may be of substantial interest in the development
of new drugs, in many settings decision makers are most interested in the magnitudes of effects,
and whether these effects are substantively meaningful. In the experimental setting, that makes
the results of Splawa-Neyman [1990 (1923)] more relevant. Neyman focused on estimating the
overall average effect using the difference in means:
$$\hat\tau = \bar{Y}_T - \bar{Y}_C, \qquad \text{where } \bar{Y}_w = \frac{1}{N_w} \sum_{i: W_i = w} Y_i, \quad w \in \{C, T\}.$$

Neyman also derived the sampling variance of this estimator over the randomization distribution,

$$V = \frac{1}{(N-1)N_C} \sum_{i=1}^{N} \bigl( Y_i(C) - \bar{Y}(C) \bigr)^2 + \frac{1}{(N-1)N_T} \sum_{i=1}^{N} \bigl( Y_i(T) - \bar{Y}(T) \bigr)^2 - \frac{1}{(N-1)N} \sum_{i=1}^{N} \bigl( Y_i(T) - Y_i(C) - (\bar{Y}(T) - \bar{Y}(C)) \bigr)^2,$$

whose third term involves the unobservable unit-level effects and so cannot be estimated from the data, leading to conservative variance estimators that drop it.
These basic results continue to be the basis of recent experimentation in biomedical, social
science, and industry settings. Common modifications include the exploitation of unit-level co-
variates or pretreatment variables to increase the precision of the estimators. The presence of the
covariates can be used to improve the design of the experiments through stratification or, in the
limit, pairing of similar units (Athey & Imbens 2017). Although their incorporation in the design
stage of an experiment is to be preferred (Rubin 2008), their presence can also be used in the anal-
ysis stage of the experiment through ex post adjustment, through regression or other methods. For
most estimators (e.g., regression estimators), the accuracy or validity of the model does not affect
the bias of the estimator (although this result is asymptotic and does not necessarily hold in finite
samples) but can lead to substantial improvements of the asymptotic precision of the estimators
(Lin 2013).
to the posterior probability that that treatment arm is the best one. Suppose the outcome is binary,
and the joint prior distribution for the K success probabilities p_k is flat, that is, the product of independent Beta distributions with parameters α_k = β_k = 1. Then, if, after N_k units are assigned to treatment k initially, we see M_k successes and N_k − M_k failures, the posterior distribution for the success probability p_k is a Beta distribution with parameters α_k = M_k + 1 and β_k = N_k − M_k + 1:

$$p_k \mid \text{data} \sim \text{Beta}(M_k + 1,\, N_k - M_k + 1),$$
independent across treatment arms. Given the joint posterior distribution for the success
probabilities, we can infer the posterior probability that treatment arm k is the best one, pr(p_k = max_{m=1,…,K} p_m), and we assign the next unit to treatment arm k with that probability. As a
result, we assign few units to treatment arms that initially perform poorly and that are therefore
judged unlikely to be the optimal arm. At the end of the experiment, we therefore may not be able
to infer the precise success probabilities for the inferior arms, but that is not the goal here: We
want to learn which arm is optimal, and our losses are related to differences in efficacy between
the chosen arm and the optimal arm.
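A minimal sketch of this Beta-Bernoulli Thompson sampling scheme (success probabilities and sample sizes are made up): drawing one sample from each posterior and assigning to the argmax is equivalent to assigning arm k with its posterior probability of being the best.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up true success probabilities for K = 3 treatment arms.
p_true = np.array([0.30, 0.50, 0.55])
K = len(p_true)
successes = np.zeros(K)
failures = np.zeros(K)
pulls = np.zeros(K, dtype=int)

n_rounds = 5000
for _ in range(n_rounds):
    # One draw from each Beta(M_k + 1, N_k - M_k + 1) posterior; taking the
    # argmax assigns arm k with its posterior probability of being the best.
    draws = rng.beta(successes + 1, failures + 1)
    k = int(np.argmax(draws))
    reward = rng.random() < p_true[k]
    successes[k] += reward
    failures[k] += 1 - reward
    pulls[k] += 1
```

Most traffic ends up on the two better arms; the clearly inferior arm is assigned rarely, so its success probability is estimated imprecisely, consistent with the goal described above.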
The second approach to updating the assignment probabilities is the upper confidence bounds
(UCB) approach (Lai & Robbins 1985, Lattimore & Szepesvári 2020). Here we calculate, after
some initial assignments for each treatment arm, a confidence interval for each of the success
probabilities, with confidence level α—say, the empirical success rate plus and minus 1.96 times the
standard error for a 95% confidence interval. The next unit is assigned to the treatment arm that
has the highest value for the UCB, the highest value for the empirical success rate plus 1.96 times
the standard error. With each treatment assignment, we slowly increase the level of the confidence
intervals toward one so that every treatment arm still receives some traffic.
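The UCB idea can be sketched as follows. Note that this sketch uses the standard UCB1 bonus sqrt(2 log t / n_k), a theoretically motivated variant of the fixed 1.96-standard-error interval described above, in which the implied confidence level rises with t; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

p_true = np.array([0.30, 0.50, 0.55])   # made-up success probabilities
K = len(p_true)
n = np.zeros(K)                          # pulls per arm
m = np.zeros(K)                          # successes per arm

n_rounds = 5000
for t in range(1, n_rounds + 1):
    if t <= K:
        k = t - 1                        # one initial pull per arm
    else:
        # Empirical rate plus an exploration bonus that grows slowly with
        # total time t and shrinks with the arm's own sample size, so every
        # arm keeps receiving some traffic.
        ucb = m / n + np.sqrt(2.0 * np.log(t) / n)
        k = int(np.argmax(ucb))
    m[k] += rng.random() < p_true[k]
    n[k] += 1
```

As with Thompson sampling, the inferior arm is pulled only often enough to keep its upper bound below the leaders' bounds.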
In both the Thompson sampling and UCB approaches, we increasingly de-emphasize treat-
ment arms once we are confident that they are not the best in the set. The simple bandit algorithms
lead to substantial improvements over standard experiments, but there are many subtle issues re-
garding their use as well as modifications geared toward more complex settings that are important
in practice. The first issue concerns inference. Simply using the average outcomes as an estimator
for the expected outcome and its standard deviation scaled by the square root of the number of
units as the standard error introduces biases. This is easy to see in a simple example: Suppose there
are two stages, where in the first stage, 100 units are assigned to one of two treatment arms, and
in the second stage, the next 100 units are all assigned to the treatment arm with the highest em-
pirical success rate. If the true expected outcomes are equal, the empirical success rate for the arm
with the lowest initial success rate is biased downward. Second, interesting complications arise
in settings with covariates, where the assignment probabilities depend both on earlier outcomes
and on the characteristics of the incoming observations, in what are known as contextual bandits
(Dimakopoulou et al. 2018). Third, concerns arise in settings where the expected outcomes may
change over time, so-called nonstationary bandits. Changes can be in the form of stochastic trends
or seasonal (day of the week or month of the year) effects. In that case, one needs to slow down
the exploitation part of the algorithm in order to ensure that there is a sufficient number of units
for each of the treatment arms so that we do not erroneously discard the good arms (Besbes et al.
2014, Liu et al. 2023).
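The downward bias in the two-stage example above is easy to confirm by simulation; a minimal sketch with both arms having the same true success probability (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two arms with the SAME true success probability; stage 1 splits 100 units
# evenly, and stage 2 sends the next 100 units to the stage-1 leader.
p = 0.5
reps = 20_000
loser_means = np.empty(reps)
for r in range(reps):
    s1 = rng.binomial(50, p, size=2)    # stage-1 successes per arm
    loser = int(np.argmin(s1))          # arm with the lower initial rate
    # Stage 2 adds data only for the winner, so the loser's final naive
    # estimate is just its stage-1 empirical rate.
    loser_means[r] = s1[loser] / 50.0

# The naive empirical success rate of the stage-1 loser is biased downward
# even though both arms have the same true success probability.
bias = float(loser_means.mean() - p)
```

The bias arises purely from selecting on the noisy stage-1 comparison, which is why adaptive designs require adjusted inference.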
estimated average treatment effect in each market declines with the fraction of treated individuals
in that labor market. This is not surprising if part of the effect of the program comes through
making the treated unemployed more attractive hires compared with control individuals, in a
setting where the number of open positions in each labor market is approximately fixed.
The appropriate experimental design in the presence of spillovers, and the analysis of data from
such experiments, depends on the precise nature of the spillovers. This has led to an extensive
literature studying cases relevant in particular contexts, in both experimental and observational
settings. A common theme is to limit the spillovers through exposure mappings (Aronow & Samii
2017) that measure what components of the full treatment vector matter for a particular unit.
One leading case is the setting where the population is partitioned into subsets, referred to
as strata or clusters, such that the spillovers are limited to units within the cluster (Hudgens &
Halloran 2008). Examples include the labor market setting in Crépon et al. (2013), but also
educational settings where treatments applied to one student may affect all students in the same
classroom (but not students in other classrooms), or rideshare companies where treatments
applied to one customer affect all customers in the same market at that time (but not customers
in other markets).
Another important setting is that of networks where spillovers or interactions arise through
network links (Athey et al. 2018a, Basse et al. 2019). Here challenges are more substantial than in
the stratified case because, depending on the nature of the spillovers, treating one unit may affect
units it is not connected to. Bond et al. (2012) present a well-known example, where treating some
individuals in a way that makes them more likely to vote affects the voting behavior of their friends
as well as individuals beyond their direct friends.
An alternative setup that allows for general spillovers is based on bipartite graphs (Pouget-
Abadie et al. 2019, Zigler & Papadogeorgou 2021). In much of the experimental design literature, the starting point is a single set of N units, to which the treatment could be applied and for which potential outcomes are defined. In the bipartite approach, by contrast, there is no longer a simple one-to-one correspondence between the units on which the treatments are defined and the units on which the outcomes are measured.
units T, with binary treatment indicators, {Wi, i ∈ T}, and a set of outcome units Q with observed responses {Yj, j ∈ Q}, together with a bipartite graph on the vertex sets T and Q that describes which treatments affect which outcomes.
Bajari et al. (2021, 2023) and Johari et al. (2022) consider a setting with two or more populations
where the treatments are assigned to, and outcomes are measured on, pairs or tuples of units. This
setting is a natural one in marketplaces, for example, rideshare companies such as Uber and Lyft,
or rental markets such as Airbnb. Treatments—say, changes in the interaction between drivers and
riders such as default tipping policies—are assigned to pairs of drivers and riders, in contrast to
traditional experiments where treatments are assigned to all drivers (for all riders) or to riders (for
all drivers). By creating variation in the share of treated drivers for each rider, and the other way
around, the researcher has the ability to learn about the interaction between the two sides of the
market and the spillovers that result from those interactions.
4. OBSERVATIONAL STUDIES WITH UNCONFOUNDEDNESS
Although experiments have become more prominent in social sciences over the past thirty years,
observational studies continue to be the mainstay of the social sciences. Although randomized experiments have superior internal validity, there are three aspects in which observational studies often have the advantage: (a) more detailed information on units, (b) larger sample sizes, and (c) improved representativeness or external validity.
The most important approach to observational studies is the one in which, within homogenous
subpopulations, the treatment assignment is assumed to be as good as random, or unconfounded
(Rosenbaum & Rubin 1983b), so that within such subpopulations, we can analyze the data as if
they arose from a randomized experiment. Formally, with Xi denoting a vector of pretreatment
variables or covariates, the key assumption is
$$W_i \perp\!\!\!\perp \bigl( Y_i(0), Y_i(1) \bigr) \mid X_i. \qquad 1.$$
In addition, there is typically an overlap assumption that guarantees that the assignment probability is bounded away from zero and one,

$$0 < e(x) \equiv \text{pr}(W_i = T \mid X_i = x) < 1 \quad \text{for all } x, \qquad 2.$$

where e(·) is the propensity score that plays a key role in this setting. Combined, these two assumptions are referred to as strong ignorability (Rosenbaum & Rubin 1983b). There is a large
theoretical literature developing methods for estimation and inference in this setting (reviews in-
clude Rosenbaum 1984, Rubin 2006, Stuart 2010, Imbens 2015, Zubizarreta et al. 2023), as well
as a huge empirical literature that relies on some form of this assumption, variously referred to
as ignorable treatment assignment (Rosenbaum & Rubin 1983b), exogeneity (Imbens 2004), and
unconfoundedness (Rubin 1978).
In the graphical tradition (Pearl 1995, 2000; Peters et al. 2017; Pearl & Mackenzie 2018), the
key unconfoundedness assumption can be expressed as in Figure 2, with n common ancestors for
both the treatment and outcome. In this setting, there is no need to specify the causal links, absent
or present, between these ancestors. Most common estimators are not affected by the presence
or absence of those links. What is important, though, is that none of these variables are affected
by the treatment or outcome—in the directed acyclic graph (DAG) terminology, none of them
are descendants of the outcome or the treatment. Typically, the credibility of that assumption is
based on the notion that these variables precede the treatment (Rosenbaum 1984, 2010). In prac-
tice, using variables causally affected by the treatment or outcome is the most common mistake in
Figure 2
Unconfounded treatment assignment: X1 , . . . , Xn are exogenous (pretreatment variables). Adjusting for them
in an unconfoundedness-based analysis removes all biases in comparisons between treated and control units.
Figure 3
M-bias: Although X may be a pretreatment, using it as a conditioning variable in an unconfoundedness
analysis introduces bias.
in Equation 1 more plausible. At the same time, including more covariates into this conditioning
set makes the overlap assumption in Equation 2 more controversial and thus makes the practical
challenges in adjusting effectively for all the covariates more severe. Finding methods that are
effective in settings with a substantial number of covariates has been one of the main goals of this
literature.
Define the conditional expectation of the outcome given treatment and covariates, µ(w, x) ≡ E[Yi | Wi = w, Xi = x], and the conditional means of the potential outcomes, µ_w(x) ≡ E[Yi(w) | Xi = x], for w ∈ {C, T}. Under the unconfoundedness assumption, these two conditional expectations are equal: µ(w, x) = µ_w(x).
Given unconfoundedness and overlap, in combination with some smoothness assumptions on the
propensity score and the conditional outcome expectations, one can estimate the population av-
erage treatment effect at the parametric rate. The semiparametric efficiency bound (Newey 1990,
Bickel et al. 1993, Hahn 1998) is
$$V = E\left[ \frac{\bigl(Y_i(T) - \mu_T(X_i)\bigr)^2}{e(X_i)} + \frac{\bigl(Y_i(C) - \mu_C(X_i)\bigr)^2}{1 - e(X_i)} + \bigl( \mu_T(X_i) - \mu_C(X_i) - \tau^{\text{pop}} \bigr)^2 \right].$$
This setting is remarkable for two reasons. First, there is a huge theoretical literature with many
proposed estimators for this setting, ostensibly quite different, yet many (though not all) of them
are semiparametrically efficient. Second, at the same time, there is a vast empirical literature where
many different estimators are frequently used in practice. This setting is one of the leading exam-
ples where semiparametric efficiency bounds and corresponding estimators have been studied in
the econometric literature, and as a result, many insights that carry over to other semiparametric
problems have been obtained. I want to divide this literature into four subliteratures corresponding
to specific classes of estimators, described in detail in the following subsections.
4.1.1. Matching estimators. The first set of estimators uses one-to-one nearest neighbor
matching, and extensions thereof. For each treated (control) unit, one or more control units are
selected that are similar in terms of pretreatment variables, by minimizing some metric. Formally,
for a treated unit i (a unit with Wi = T), the match j(i) is the control unit solving

$$\min_{j: W_j = C} \| X_i - X_j \|,$$
for some metric ∥ · ∥, often the Mahalanobis metric based on the inverse of the full-sample co-
variance matrix. The difference between the outcome for the treated (control) unit and its match,
Yi − Yj(i) (or the average outcome for its matches if there are multiple matches), is then used as
an estimate of the treatment effect for that unit. These unit-level estimates are then averaged to
get an estimate of the overall average effect, or the average effect for the treated. There is a large
literature discussing large sample properties of matching estimators, computational concerns, and
variations on the basic versions with additional bias adjustment (Abadie & Imbens 2006, Rubin
2006, Rosenbaum 2020, Zubizarreta et al. 2023).
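A minimal sketch of one-to-one nearest-neighbor matching with replacement for the average effect on the treated, using the Mahalanobis metric, on simulated data satisfying unconfoundedness (the data-generating process and all parameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated observational data satisfying unconfoundedness: treatment
# probability and outcome both depend on X; the true effect is 2.0.
N = 2000
X = rng.normal(size=(N, 2))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))        # true propensity score
W = rng.random(N) < e
Y = X[:, 0] + 0.5 * X[:, 1] + 2.0 * W + rng.normal(size=N)

# Mahalanobis metric: whiten X so that Euclidean distance in Z equals the
# Mahalanobis distance based on the inverse full-sample covariance matrix.
L = np.linalg.cholesky(np.linalg.inv(np.cov(X.T)))
Z = X @ L

# One-to-one nearest-neighbor matching with replacement, treated to control.
Zt, Zc = Z[W], Z[~W]
Yt, Yc = Y[W], Y[~W]
d2 = ((Zt[:, None, :] - Zc[None, :, :]) ** 2).sum(axis=2)
match = d2.argmin(axis=1)                  # index j(i) of the closest control

# Average the unit-level matched differences: the effect on the treated.
tau_treated = float((Yt - Yc[match]).mean())
```

A naive comparison of raw treated and control means would be biased upward here, because units with large X[:, 0] are both more likely to be treated and have higher outcomes; matching removes most of that bias.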
An advantage of matching estimators is that they are intuitive and easy to explain. However,
there are two drawbacks associated with simple matching estimators. First, with a fixed number
of matches, these matching estimators are never fully efficient. In order to improve the preci-
sion, and in fact to reach the efficiency bound, one needs to let the number of matches increase
4.1.2. Regression estimators. The second class of estimators first estimates the conditional
expectation µ(w, x), followed by averaging the differences over the sample of the estimated
conditional expectations:
$$\hat\tau^{\text{reg}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \hat\mu(T, X_i) - \hat\mu(C, X_i) \bigr).$$
The estimator for conditional expectation itself can be based on a parametric specification—say,
a simple linear model, arguably still the most common estimator for estimating average treat-
ment effects under unconfoundedness—or a more flexible approach such as kernel regression
or sieve methods. Modern implementations have used machine learning methods such as deep
neural nets or random forests for this component. Hahn (1998) shows that regression estima-
tors based on sufficiently flexible specifications of the regression function are semiparametrically
efficient.
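A sketch of the regression-adjustment estimator with a linear outcome specification, on the same style of simulated data; the linear model happens to be correctly specified in this simulation, which is an assumption of the sketch, not a property of the method.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data as before; the true effect is 2.0 and the outcome really
# is linear in X, so the linear specification is correct here.
N = 2000
X = rng.normal(size=(N, 2))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))
W = rng.random(N) < e
Y = X[:, 0] + 0.5 * X[:, 1] + 2.0 * W + rng.normal(size=N)

def ols_predict(X_fit, y_fit, X_new):
    """Fit OLS with an intercept and predict on new covariate values."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

# Estimate mu(T, x) on the treated and mu(C, x) on the controls, then
# average the predicted differences over the full sample.
mu_T_hat = ols_predict(X[W], Y[W], X)
mu_C_hat = ols_predict(X[~W], Y[~W], X)
tau_reg = float((mu_T_hat - mu_C_hat).mean())
```

The same template works with any flexible regression method in place of OLS, which is how the machine learning implementations mentioned above fit in.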
4.1.3. Propensity score estimators. In applications where the number of conditioning vari-
ables, that is, the dimension of the pretreatment variables Xi , is substantial, estimating the
conditional expectation µ(w, x) can be a challenge. A celebrated result from Rosenbaum & Rubin
(1983b) shows there are alternatives to estimating this conditional expectation if the goal is to
estimate the population average treatment effect τ pop . They show that under unconfoundedness,
as in Equation 1, it is also true that
$$W_i \perp\!\!\!\perp \bigl( Y_i(0), Y_i(1) \bigr) \mid e(X_i). \qquad 3.$$
Here, one only needs to condition on a scalar function of the covariates, known as the propensity
score. More generally, one can condition on any balancing score such that conditioning on this
balancing score makes Xi and Wi independent, with the propensity score and one-to-one functions
thereof the lowest dimensional balancing scores.
The Rosenbaum–Rubin propensity score result can be exploited in a number of ways. Two of
these treat the propensity score simply as a scalar pretreatment variable that needs to be adjusted
for in the two methods described in the previous two subsections. That is, one can use the match-
ing estimators from Section 4.1.1 by matching on the propensity score rather than by matching
on all the pretreatment variables. Matching on a scalar avoids the bias concerns that arise with
matching estimators where one matches on multiple variables. This method is fairly widely used.
For a discussion of the formal properties, readers are directed to Abadie & Imbens (2016). Al-
ternatively, one can use the regression estimators from Section 4.1.2, where instead of using the
basic covariates, one estimates the conditional expectation of the outcomes given the treatment
and the propensity score, µ̃(w, e) = E[Yi (w)|e(Xi ) = e]. This method is not widely used, partly be-
cause there is no natural functional form for this conditional expectation—e.g., there is no reason
to expect this conditional expectation to be linear in the propensity score.
A third, more direct, method for using the propensity score result directly exploits the inter-
pretation of the propensity score as the probability of being exposed to the treatment, rather than
viewing it simply as a balancing score. Specifically, it exploits the results that
$$E\!\left[ \frac{\mathbf{1}_{W_i = T}\, Y_i}{e(X_i)} \right] = E[Y_i(T)], \qquad E\!\left[ \frac{\mathbf{1}_{W_i = C}\, Y_i}{1 - e(X_i)} \right] = E[Y_i(C)].$$
This result is then used by reweighting the units by the inverse of the probability of the treatment
received:
$$\hat\tau = \sum_{i=1}^{N} \frac{Y_i\, \mathbf{1}_{W_i = T}}{e(X_i)} \Bigg/ \sum_{i=1}^{N} \frac{\mathbf{1}_{W_i = T}}{e(X_i)} \; - \; \sum_{i=1}^{N} \frac{Y_i\, \mathbf{1}_{W_i = C}}{1 - e(X_i)} \Bigg/ \sum_{i=1}^{N} \frac{\mathbf{1}_{W_i = C}}{1 - e(X_i)},$$
with the weights scaled to sum to one within each treatment group. The formal properties of this Horvitz–
Thompson type estimator (Horvitz & Thompson 1952) are studied by Hirano et al. (2003), who
show that when using suitably flexible estimators of the propensity score, the resulting estimator
reaches the semiparametric efficiency bound. The perhaps surprising insight is that estimating
the propensity score here is critical: Weighting by the inverse of the true propensity score does
not lead to an efficient estimator. A simple example makes this clear. Suppose the propensity score is constant, equal to p for all units. Then, weighting by the inverse of the true propensity score leads to τ̂ = (1/N) Σ_i 1{Wi=T} Yi/p − (1/N) Σ_i 1{Wi=C} Yi/(1 − p), whereas weighting by the estimated propensity score p̂ = (1/N) Σ_i 1{Wi=T} ensures that we have a weighted average of treated and control outcomes with the weights summing to one.
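The normalized reweighting estimator takes only a few lines to implement. The sketch below simulates its own data (the data-generating process, variable names, and sample size are illustrative assumptions, not from the article) and weights treated units by the inverse of the propensity score and controls by the inverse of one minus the propensity score, with weights scaled to sum to one within each group:

```python
import math
import random

random.seed(0)

# Simulated illustration: one covariate X, a confounded binary treatment W,
# and an outcome Y with a true average treatment effect of 2.0.
N = 20000
X = [random.gauss(0, 1) for _ in range(N)]
e_true = [1 / (1 + math.exp(-x)) for x in X]          # logistic propensity score
W = [1 if random.random() < p else 0 for p in e_true]
Y = [x + 2.0 * w + random.gauss(0, 1) for x, w in zip(X, W)]

def ipw_hajek(Y, W, e):
    """Normalized inverse-propensity-weighting estimator: weights are scaled
    to sum to one within each treatment group."""
    wt_t = [w / p for w, p in zip(W, e)]
    wt_c = [(1 - w) / (1 - p) for w, p in zip(W, e)]
    mean_t = sum(y * a for y, a in zip(Y, wt_t)) / sum(wt_t)
    mean_c = sum(y * a for y, a in zip(Y, wt_c)) / sum(wt_c)
    return mean_t - mean_c

tau_hat = ipw_hajek(Y, W, e_true)   # close to the true effect of 2.0
```

In practice, e_true would be replaced by a flexibly estimated propensity score, which, as noted above, can improve rather than hurt efficiency.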
4.1.4. Doubly robust estimators. Matching, regression, and inverse-propensity-score weight-
ing estimators are all widely used in empirical work. In addition, most of them are, in principle,
semiparametrically efficient. Nevertheless, the current state of the literature suggests that the most
attractive estimators for the average treatment effect combine estimates of the propensity score
and estimates of the conditional expectation. Formally, such estimators rely on less restrictive as-
sumptions on the smoothness of the conditional outcome expectations and the propensity score.
They do so by formally requiring lower rates of convergence of the corresponding estimators
compared with regression and inverse propensity score estimators. They also have the double ro-
bustness property that when either the conditional outcome expectations or the propensity score
is estimated consistently, the estimator for the average treatment effect is consistent. Estimators
using such combinations of estimators for the conditional outcome expectations and the propen-
sity score were first introduced in a series of papers by Robins and coauthors (e.g., Robins et al.
1994, 2000), who focused on the double robustness property. These estimators build on the semi-
parametric efficiency bound literature (Newey 1990, Bickel et al. 1993). They are also related to
the literature on targeted maximum likelihood (Van der Laan & Rose 2011), where the focus is
more on the efficiency properties than the robustness. More recently, various specific estimators
have been proposed for average treatment effects in this setting (e.g., Chernozhukov et al. 2017,
Athey et al. 2018b).
A systematic way to generate semiparametrically efficient estimators in many settings is to use the influence function. First define

ψ(y, w, x) = µ(T, x) − µ(C, x) + (y − µ(w, x)) 1{w=T} / e(x) − (y − µ(w, x)) 1{w=C} / (1 − e(x))
           = y 1{w=T} / e(x) − y 1{w=C} / (1 − e(x)) + (e(x) − 1{w=T}) / (e(x)(1 − e(x))) · (µ(T, x)(1 − e(x)) + µ(C, x) e(x)),
so that ψ(y, w, x) − τ is the influence function. Then, we have

τ̂_dr = (1/N) Σ_{i=1}^N ψ̂(Yi, Wi, Xi),

where ψ̂ evaluates ψ at the estimates µ̂(w, x) and ê(x). The correction term in the second expression has expectation zero when evaluated at the true propensity score e(x), even if the conditional outcome expectations are misspecified. Although the double robustness property may in itself not be a compelling reason for using the estimator, formal arguments show that the doubly robust estimators have good properties even when µ̂(w, x) and ê(x) converge relatively slowly to their population counterparts. Because the convergence rates depend on the number of covariates, doubly robust estimators are particularly attractive in settings with many covariates.
Readers are directed to Chernozhukov et al. (2017) and Athey et al. (2018b) for formal properties
given general estimators for the conditional expectations and propensity score.
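The influence-function estimator can be sketched directly. In the Python sketch below (the simulated data and the deliberately crude, constant outcome models are my own illustrative assumptions), the outcome models ignore the covariate entirely, yet the estimator recovers the effect because the propensity score used is the true one:

```python
import math
import random

random.seed(1)

# Simulated data with a true treatment effect of 2.0 and a covariate that
# affects both assignment and outcome.
N = 20000
X = [random.gauss(0, 1) for _ in range(N)]
e = [1 / (1 + math.exp(-x)) for x in X]               # true propensity score
W = [1 if random.random() < p else 0 for p in e]
Y = [x + 2.0 * w + random.gauss(0, 1) for x, w in zip(X, W)]

# Deliberately crude outcome models mu(T, x) and mu(C, x): constants that
# ignore the covariate, i.e., badly misspecified on purpose.
mu_t = sum(y for y, w in zip(Y, W) if w == 1) / sum(W)
mu_c = sum(y for y, w in zip(Y, W) if w == 0) / (N - sum(W))

def psi(y, w, p, mt, mc):
    """Influence-function (doubly robust) term: outcome-model contrast plus
    inverse-propensity-weighted residual corrections."""
    return mt - mc + w * (y - mt) / p - (1 - w) * (y - mc) / (1 - p)

tau_dr = sum(psi(y, w, p, mu_t, mu_c)
             for y, w, p in zip(Y, W, e)) / N
# Despite the misspecified outcome models, tau_dr is close to 2.0 because
# the propensity score used here is the true one.
```

The symmetric experiment, a correct outcome model combined with a misspecified propensity score, illustrates the other half of the double robustness property.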
4.2. Overlap
In practice, a big concern with estimators for average treatment effects under unconfoundedness is
possible violations of the overlap assumptions. This concern became clear after LaLonde (1986),
where the treatment group and control group were very far apart in terms of covariate distri-
butions. Researchers have proposed various methods for dealing with this that change the focus
from the average effect in the population to some other weighted average. Crump et al. (2009)
propose changing the estimand by dropping units with a propensity score close to zero or one, with the threshold determined by minimizing the asymptotic variance. Formally, let α be the solution to

1 / (α(1 − α)) = E[ 1 / (e(Xi)(1 − e(Xi))) | α < e(Xi) < 1 − α ].
Crump et al. (2009) suggest changing the estimand to
τ = E[ τ(Xi) | α < e(Xi) < 1 − α ].
Li et al. (2018), focusing on the same objective function, modify this by weighting the units opti-
mally by a function of the pretreatment variables, which leads to weighting units by the product
of the propensity score and one minus the propensity score, leading to the estimand
τ = E[ e(Xi)(1 − e(Xi)) τ(Xi) ] / E[ e(Xi)(1 − e(Xi)) ].
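The trimming step itself is simple to implement. In the sketch below, the cutoff α is fixed at 0.1 purely for illustration, rather than solved for from the variance condition as Crump et al. (2009) propose; the data tuples are invented:

```python
# Improve overlap by dropping units with extreme (estimated) propensity
# scores. Crump et al. (2009) solve for the cutoff alpha; here alpha = 0.1
# is fixed purely for illustration.
def trim(data, alpha=0.1):
    """data: list of (y, w, e) tuples; keep units with alpha < e < 1 - alpha."""
    return [(y, w, e) for (y, w, e) in data if alpha < e < 1 - alpha]

sample = [(1.0, 1, 0.02), (0.5, 0, 0.45), (2.0, 1, 0.60), (0.0, 0, 0.99)]
kept = trim(sample)
print(len(kept))  # 2: the two units with extreme propensity scores are dropped
```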
It had long been recognized that estimating the entire function τ(·) requires more data than typically were available, especially
in settings where even the main effects are hard to detect because of a combination of the effects
being small and the sample sizes being modest. Motivated by the availability of large data sets with
rich detail, interest grew in effective methods for uncovering heterogeneity in treatment effects.
The earlier literature included some attempts to estimate τ (x) through series methods (Crump
et al. 2008), which were not effective in settings with a substantial number of covariates. The more
recent literature instead used machine learning methods, adapted to estimating causal effects. In
standard settings where supervised machine learning methods are used for estimating conditional
expectations, one has observations on outcomes that are unbiased for the conditional expectations
that are being estimated. That makes cross-validation methods based on leaving out some units
very effective. That does not directly work in settings where the focus is on estimating average
causal effects because there is no direct observation of the causal effects.
Athey & Imbens (2016) proposed constructing regression trees that were tailored to estimat-
ing CATEs in experimental settings. One proposal was based on the insight that a Horvitz–
Thompson-type transformation of the outcome, Yi 1Wi =T /e(Xi ) − Yi 1Wi =C /(1 − e(Xi )), has condi-
tional expectation given Xi = x equal to τ (x). Thus, if we first transform the outcome, and note
that this transformation is known in the experimental case, then we can directly use methods for
supervised learning, such as regression trees. Wager & Athey (2018) generalized these methods to
random forests, which, importantly, allowed for honest inference without prespecifying which co-
variates are used to create subpopulations. These causal random forests estimators are now widely
used in practice.
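The transformed-outcome idea can be sketched in a few lines. In the Python sketch below (the simulated experiment and the single split on the sign of the covariate, standing in for a regression tree, are illustrative assumptions), the assignment probability is known, as in the experimental case the text describes:

```python
import random

random.seed(2)

# Randomized experiment with known assignment probability e(x) = 0.5 and
# heterogeneous effects: tau(x) = 1 for x > 0 and tau(x) = 0 otherwise.
N = 40000
X = [random.gauss(0, 1) for _ in range(N)]
W = [1 if random.random() < 0.5 else 0 for _ in range(N)]
Y = [w * (1.0 if x > 0 else 0.0) + random.gauss(0, 1) for x, w in zip(X, W)]

# Horvitz-Thompson transformation: Ystar has conditional mean tau(x), so a
# supervised learner can be trained on (X, Ystar) directly. Here a single
# split on the sign of X stands in for a regression tree.
e = 0.5
Ystar = [y * w / e - y * (1 - w) / (1 - e) for y, w in zip(Y, W)]

left = [ys for ys, x in zip(Ystar, X) if x <= 0]
right = [ys for ys, x in zip(Ystar, X) if x > 0]
tau_left = sum(left) / len(left)       # near the true value 0
tau_right = sum(right) / len(right)    # near the true value 1
```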
Athey & Wager (2021), Dehejia (2005), Hirano & Porter (2009), Manski (2004), and Kitagawa
& Tetenov (2018) change the focus away from estimating the CATE τ (x) to estimating policy
rules that assign units to the treatment based on the values of their pretreatment variables. The
goal in this literature is to find estimators for the optimal policy rules that perform well when the
CATE function is unknown. Athey & Wager (2021) show that this problem reduces to one that is
very similar to that of estimating average treatment effects. Critical is the complexity of the class
of policy rules that the researcher optimizes over.
A similar decomposition can be derived for the expectation of Yi(C), leading to bounds on the population average treatment effect,

E[Yi 1{Wi=T}] − E[Yi 1{Wi=C}] − pr(Wi = T) ≤ τ ≤ E[Yi 1{Wi=T}] − E[Yi 1{Wi=C}] + pr(Wi = C).

These bounds are generically wide: Their width is, in this case, always equal to one, because the width of the bounds for E[Yi(T)] is pr(Wi = C) and the width of the bounds for E[Yi(C)] is pr(Wi = T). This, in turn, implies that the bounds for the average treatment effect will always include zero.
The Rosenbaum–Rubin sensitivity analysis starts with a modified unconfoundedness
assumption:
Wi ⊥⊥ (Yi(0), Yi(1)) | Xi, Ui.
The difference between this independence condition and the unconfoundedness condition in
Equation 1 is that the second confounder, Ui , is not observed. Without loss of generality, we can
take Ui to be binary. We cannot consistently estimate the average effect of the treatment under
this assumption because Ui is not observed. We therefore augment it with additional assumptions
on the relation between the potential outcomes, treatment assignment, and covariates given the
unobserved confounder. In spirit, this is similar to the standard analysis of omitted variable bias in
linear regression models. Suppose we are interested in the coefficient on Wi in a (long) regression,
Yi = β0 + βW 1Wi =T + βX Xi + βU Ui + εi ,
Yi = α0 + αW 1Wi =T + αX Xi + ηi ,
Ui = δ0 + δW 1Wi =T + δX Xi + νi .
Imbens (2003) formulates a version of this sensitivity analysis for the case without covariates and binary outcomes. Suppose the unobserved confounder Ui is binary. Then, we can
model the probability of assignment in a parametric, logistic framework through the log odds
ratio:
ln( pr(Wi = T | Ui = u) / pr(Wi = C | Ui = u) ) = α0 + αU · u.
Similarly, we model the potential outcome distribution as
ln( pr(Yi(w) = 1 | Ui = u) / pr(Yi(w) = 0 | Ui = u) ) = βw0 + βwU · u.
Now, given fixed values for the sensitivity parameters (αU, βCU, βTU) and the data, we can first estimate the remaining parameters α0 and βw0 and thus the average treatment effect. This defines a function τ̂(αU, βCU, βTU, data). Because the sensitivity parameters are defined in terms of log odds ratios, a willingness to put limits on the effect of the unobserved confounder on the log odds ratio then determines the range of this function. The key question is how to choose a reasonable range of values for the sensitivity parameters (αU, βCU, βTU).
Imbens (2003) suggests limiting the range of plausible values for the sensitivity parameters by
inspecting the association between observed confounders and assignment and potential outcomes.
Specifically, the suggestion is to find the strongest association, in a logistic model, between the
observed confounders and the assignment, and the strongest association between the observed
confounders and the potential outcomes, and assume that the unobserved confounders do not
have a stronger association with either assignment or potential outcomes than that. This work has
been extended by Oster (2019), Cinelli & Hazlett (2020), Manski (1990), Masten et al. (2020), and
Chernozhukov et al. (2022).
Rosenbaum (2002) develops a sensitivity analysis that does not require assumptions on the as-
sociation between the unobserved confounder and the potential outcomes. In the example without
covariates, his approach bounds the log odds ratio for the assignment probabilities:

−ln Γ ≤ ln( pr(Wi = T) / pr(Wi = C) ) ≤ ln Γ,

and then explores the range of possible estimates for τ̂ associated with the limit Γ. Implicitly,
this allows the association between the unobserved confounder and the potential outcomes to be
arbitrarily strong.
Figure 4
There is a large literature in econometrics that focuses on methods that do not start from unconfoundedness assumptions.
There is no general solution for this case. As discussed before, and illustrated by the
Manski bounds, simply dropping the unconfoundedness assumption implies average treatment
effects are no longer identified, although informative bounds can be derived in some cases. Much
of the econometrics literature has focused on special cases where additional information of some
kind is available. This can be in the form of additional variables with specific causal structure,
or in the form of additional assumptions placed on either the assignment mechanism or the po-
tential outcome distributions, so that either the overall average treatment effect or some other
estimand is identified. Much of the empirical work relies on a small set of what Angrist & Krueger
(1999) call identification strategies. Occasionally, new strategies are proposed and make it into
the toolkit of empirical researchers in social sciences. Here, we describe three of the leading
strategies: instrumental variables, fixed effect and DID methods, and regression discontinuity
designs.
Figure 5
Directed acyclic graph with mediation.
The first difference is that the treatment in the instrumental variables case corresponds to the mediator in the mediation setting, and the instrument in the instrumental variables case corresponds to the treatment in the mediation setting. In addition, and this is critical, there is no direct effect of the instrument on the outcome. Finally, there is an unobserved confounder that makes a direct comparison between treatment and outcome impossible, and that is the motivation to look for an instrument.
One of the most celebrated applications where this structure is plausible (and which is, in fact,
referenced in Angrist’s Nobel citation by the prize committee) is the draft lottery example of
Angrist (1990), where the treatment is military service status, the outcome is earnings later in
life, and the instrument is draft eligibility determined by the draft lottery number. In this case, it
appears plausible that the effect of the lottery number on earnings is entirely, or at least largely,
mediated by military service. A second classic example is Angrist & Krueger (1991), where the
focus is on estimating the effect of years of education on earnings, using compulsory schooling
laws as instruments. Another class of examples includes randomized experiments with imperfect
compliance. Again, the key assumption is that the effect of the random assignment to treatment
is entirely mediated by the receipt of the treatment.
One cannot simply compare outcomes by treatment status in an as-treated analysis because of
the presence of an unobserved confounder U. We also cannot simply drop those who are observed
not to comply with their treatment assignment in a per-protocol analysis. However, because there
is no unobserved confounder for the relation between the additional variable, the instrument, and
the treatment, we can estimate the average causal effect of the instrument on the treatment, the
intention-to-treat effect. Similarly, there is no unobserved confounder for the relation between
the instrument and the outcome as there is in Figure 6 where the instrument assumptions do
not hold. In addition, there is no direct effect of the instrument on the outcome, no arrow from
Z to Y, as there is in Figure 7. This implies we can estimate the average causal effect of the
instrument on the outcome. These two intention-to-treat effects, however, are not the primary
estimand in these settings. Instead, we are interested in the causal effect of the treatment on the
outcome.
In terms of potential outcomes, the instrumental variables analysis starts with two sets of po-
tential outcomes: For the treatment, we have Wi (z) for each value of the instrument, and for the
outcomes, we have Yi(w, z), indexed by both the treatment and the instrument. The first key assumption is that the potential outcomes are all independent of the instrument Zi, an unconfoundedness-type assumption. Second, the potential outcomes Yi(w, z) do not actually vary
Figure 6
Directed acyclic graph with violation of the exogeneity assumption for the instrument.
Figure 7
Directed acyclic graph with violation of the exclusion restriction because of a direct effect of the instrument on the outcome.
by the instrument, so we can drop the z argument and write Yi (w). This is the exclusion restriction
captured in the graph by the absence of a direct link between the instrument and the outcome,
with the treatment itself acting as a mediator.
These assumptions by themselves are not sufficient to identify the average effect of the treat-
ment on the outcome, as pointed out by Heckman (1990) and Manski (1990). To make further
progress, Imbens & Angrist (1994) added one more assumption, what they labeled monotonicity.
This requires that the causal effect of the instrument on the treatment is in the same direction for
all units. That is, changing the instrument from 0 to 1 can leave the treatment status for a unit
unchanged, or it can move the unit from the untreated state to the treated state, but it cannot move
the unit from the treated state to the untreated state. Formally, the assumption requires that for
all i, Wi (1) ≥ Wi (0). Even adding this monotonicity assumption does not allow for point identifi-
cation of the average effect of the treatment, but the combination of assumptions does allow for
the identification of the average effect of the treatment for the subpopulation of units for whom
changing the instrument from 0 to 1 moves them from untreated to treated. This subpopulation
is generally referred to as the compliers, and the average effect for this group is the local average
treatment effect (LATE) (Imbens & Angrist 1994, Angrist et al. 1996, Imbens 2014),
τ LATE = E[Yi (T ) − Yi (C)|Wi (1) = T, Wi (0) = C].
This identification result is unusual because the resulting estimand, the LATE, is unconventional.
There is no particular reason why the subpopulation it refers to, the compliers, is necessarily an
interesting subpopulation. It may be, and in the Angrist draft lottery example it arguably is, but
that is not the reason for focusing on it. The main reason is that it is the only subpopulation
for which we can identify the average effect of the treatment. We may be more interested in the
overall average effect, but we cannot identify that without substantially stronger assumptions, e.g.,
constant treatment effects.
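Under these assumptions, the LATE is estimated by the Wald ratio of the two intention-to-treat effects. The sketch below simulates a population of compliers and never-takers (the complier share, effect size, and variable names are illustrative assumptions, not from the article):

```python
import random

random.seed(3)

# Binary instrument Z, treatment W, outcome Y. Compliers (about half of the
# population) take the treatment only when Z = 1; the rest never take it, so
# monotonicity holds by construction. The effect for compliers is 2.0.
N = 50000
data = []
for _ in range(N):
    z = random.randint(0, 1)
    complier = random.random() < 0.5
    w = 1 if (complier and z == 1) else 0
    y = 2.0 * w + random.gauss(0, 1)
    data.append((z, w, y))

def mean(vals):
    return sum(vals) / len(vals)

# Wald estimator: the ratio of the two intention-to-treat effects.
itt_y = mean([y for z, w, y in data if z == 1]) - mean([y for z, w, y in data if z == 0])
itt_w = mean([w for z, w, y in data if z == 1]) - mean([w for z, w, y in data if z == 0])
late_hat = itt_y / itt_w    # estimates the LATE of 2.0 for compliers
```

When the denominator itt_w is small, the instrument is weak, and, as discussed below, the usual Normal-approximation inference for this ratio can break down.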
Note that the monotonicity assumption is difficult to capture in the graphical representation.
However, it is plausible in many applications, and similar shape restrictions (monotonicity of de-
mand or supply functions, convexity of preferences or production functions, decreasing returns to
scale) play an important role in econometric identification strategies (e.g., Matzkin 1994).
Concerns with the exclusion restriction have often driven researchers to use instruments for
which that assumption may be plausible, but that have only limited effects on the treatment. (In the
limit, a random number would by definition satisfy the exclusion restriction, but it would not have
a causal effect on the treatment, so there would be no compliers.) This has led to concerns about
the properties of instrumental variables estimators when the instruments are weak, in the sense of
having only a weak correlation with the treatment. In that case, instrumental variables estimators
have poor properties, and the standard Normal-distribution-based confidence intervals may not
be valid (Staiger & Stock 1997, Andrews et al. 2019).
Researchers have also studied quantile regression estimators in instrumental variables settings,
including Abadie et al. (2002) and Chernozhukov & Hansen (2005).
Here, the four terms are all averages—e.g., YT,post is the average outcome for units in the treatment
group who were observed in the posttreatment period. The DID estimator can be motivated by a
TWFE model for the potential outcomes that has an additive fixed effect for the group (treatment
versus control) and an additive fixed effect for the time period (post versus pre):
Yit(C) = βi + γt + εit.    (4)
Combined with an additive treatment, Yit (T ) = Yit (C) + τ , this leads to a regression in terms of
the realized outcome of the form
Yit = βi + γt + Wit τ + εit .
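In the basic two-group, two-period case this reduces to simple arithmetic on four group means; the numbers in the sketch below are invented for illustration:

```python
# Two-by-two difference-in-differences from four group means. The numbers
# are made up: a common trend of 1.0 and a treatment effect of 3.0.
y_t_pre, y_t_post = 5.0, 9.0   # treated: 5 -> 5 + 1 (trend) + 3 (effect)
y_c_pre, y_c_post = 2.0, 3.0   # control: 2 -> 2 + 1 (trend)

tau_did = (y_t_post - y_t_pre) - (y_c_post - y_c_pre)
print(tau_did)  # 3.0
```

The control group's change over time identifies the common trend, which is then differenced out of the treated group's change.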
This setup extends naturally to allow for multiple time periods where we can still use the
specification in Equation 4. This includes settings where the treated units do not all receive the
treatment in the same period. For example, a common setting is that with staggered adoption
(Athey & Imbens 2021), where units enter the treatment group at different times, but once they
are in the treatment group they remain there for all subsequent periods. This setting is also known
as a stepped wedge design. In this case, the standard TWFE estimator may estimate a weighted
average of treatment effects with some of the weights negative. This has led to a number of alter-
native estimators (Callaway & Sant’Anna 2020, Goodman-Bacon 2021, Sun & Abraham 2021).
Recent extensions are discussed by de Chaisemartin & d’Haultfœuille (2020), Freyaldenhoven
et al. (2019), Imai & Kim (2019), Liu et al. (2022), Sant’Anna & Zhao (2020), and Roth et al.
(2022).
Abadie & Gardeazabal (2003) and Abadie et al. (2010) focused on a different special case of
this setup, where a single unit was treated from a point in time onwards. In what they called the
synthetic control (SC) approach, the central idea was to approximate the treated unit by a synthetic
version consisting of a convex combination of the control units. If unit N is treated in period T,
its counterfactual outcome ŶNT(0) is estimated as

ŶNT(0) = Σ_{i=1}^{N−1} ωi YiT,

where the weights ωi are nonnegative and sum to one, chosen so that the convex combination matches the treated unit's pretreatment outcomes as closely as possible.
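The weight-selection step can be illustrated with a toy example. The sketch below uses a coarse grid search over the one-dimensional simplex for two control units, standing in for the constrained least squares used in practice; all numbers are invented:

```python
# Synthetic control weights for a single treated unit and two control units,
# found by a coarse grid search over the one-dimensional simplex (a stand-in
# for the constrained least squares used in practice).
treated_pre = [3.0, 4.0, 5.0]          # treated unit, pretreatment periods
control1_pre = [2.0, 3.0, 4.0]
control2_pre = [6.0, 7.0, 8.0]

def pre_mse(omega):
    """Squared pretreatment fit error for the convex weights (omega, 1 - omega)."""
    synth = [omega * a + (1 - omega) * b
             for a, b in zip(control1_pre, control2_pre)]
    return sum((t - s) ** 2 for t, s in zip(treated_pre, synth))

best_omega = min((pre_mse(k / 1000), k / 1000) for k in range(1001))[1]
# The treated unit equals 0.75 * control1 + 0.25 * control2 in every
# pretreatment period, so the search recovers weights (0.75, 0.25).
```

The counterfactual posttreatment outcome is then the same convex combination applied to the controls' posttreatment outcomes.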
Various modifications and extensions of the basic SC method have been proposed. Two of the
most important ones are, first, allowing for an intercept in the objective function to allow for permanent stable differences between the treated unit and the convex combination of the other units,
and second, allowing for some negative weights (Doudchenko & Imbens 2016, Ferman & Pinto
2021). Both imply that the SC can be outside of the convex hull of the control units. A Bayesian
approach was introduced by Brodersen et al. (2015). Chernozhukov et al. (2021) discuss appli-
cations of conformal inference to SC settings. Xu (2017) and Athey et al. (2021) discuss general
factor models and their relation to SC methods.
Arkhangelsky et al. (2021) combine the DID and SC approaches in the synthetic DID
estimator. Here, the SC weights are combined with a TWFE model for the outcomes:
(τ̂, α̂, β̂) = argmin_{α,β,τ} Σ_{i=1}^N Σ_{t=1}^T (Yit − αi − βt − τ Wit)² ωi λt,
where the SC weights ωi are calculated as before. In addition, time weights λt are used to put more
emphasis on time periods that are similar to the time periods where the treatment occurs. This
combination of an outcome model with the SC weighting leads to more robust estimates.
Assuming that the conditional distribution, and in particular the conditional expectation, of the
potential outcomes given Xi is smooth in the covariate, the average effect of the treatment for
units with Xi = c is
τ = lim_{x→c} E[Yi(T) − Yi(C) | Xi = x] = lim_{x↓c} E[Yi | Xi = x] − lim_{x↑c} E[Yi | Xi = x].
We can then estimate the conditional expectation E[Yi |Xi = x] at the two limits to get an estimator
for the average causal effect τ at the threshold.
This setting arises very naturally in social science applications, and the identification strategy
often has great credibility there. In education settings, administrators often use thresholds for
test scores to offer different menus of options for students, including access to selective schools
(Abdulkadıroğlu et al. 2022) or required attendance in summer programs (Matsudaira 2008). Elec-
tions also create sharp thresholds that allow for the evaluation of the effect of incumbency (Lee
et al. 2004).
Since the work by Hahn et al. (2001) and Porter (2003), the most common estimator for τ
is based on local linear regression using observations close to, on the left and on the right of,
the threshold c. The local linear regression has largely replaced the global polynomial methods,
which are sensitive to the choice of degree of the polynomial as documented by Gelman & Imbens
(2018). The key choice in implementation is the bandwidth for the local linear estimator. Since the
rediscovery of regression discontinuity designs in the early 2000s, in the social science literature,
various algorithms have been proposed for this choice (Imbens & Kalyanaraman 2012, Calonico
et al. 2014).
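A minimal version of the local linear estimator fits a least-squares line on each side of the threshold within a bandwidth and takes the gap between the two intercepts. In the sketch below, the simulated design, the rectangular kernel, and the fixed bandwidth are illustrative assumptions; in practice the bandwidth would come from one of the algorithms just cited:

```python
import random

random.seed(4)

# Sharp regression discontinuity: running variable X, threshold c = 0, a
# jump of 2.0 at the threshold, and linear conditional expectations.
N = 20000
X = [random.uniform(-1, 1) for _ in range(N)]
Y = [0.5 * x + (2.0 if x >= 0 else 0.0) + random.gauss(0, 0.5) for x in X]

def ols_intercept(xs, ys):
    """Intercept of a simple least-squares line fit to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx

# Local linear estimator with a rectangular kernel: fit a line on each side
# of the threshold, using only observations within bandwidth h, and take the
# difference of the two intercepts at the threshold.
h = 0.3
right = [(x, y) for x, y in zip(X, Y) if 0 <= x < h]
left = [(x, y) for x, y in zip(X, Y) if -h < x < 0]
xs_r, ys_r = zip(*right)
xs_l, ys_l = zip(*left)
tau_rd = ols_intercept(xs_r, ys_r) - ols_intercept(xs_l, ys_l)   # near 2.0
```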
A second case of interest is the fuzzy regression discontinuity design. Here, the probability of
exposure to the treatment does not jump from zero to one at the threshold. Rather, there is a
discontinuity at the threshold, but only a limited one. In this case, the estimand is the jump in the average outcome at the threshold, scaled by the magnitude of the jump in the probability of receipt of the treatment. The interpretation is now that this estimates, under some regularity conditions, the
average effect of the treatment, for a subpopulation from the population of units with covariate
values at the threshold. This subpopulation is like the compliers in the instrumental variables
setting, consisting of units who are affected by being on the right versus the left of the threshold.
The estimator for the fuzzy regression discontinuity setting is simply the ratio of two estimates
of the differences in regression functions at the threshold. The numerator is the estimated magnitude of the discontinuity in the expected value of the outcome, and the denominator is the estimated magnitude of the discontinuity in the expected value of the treatment. This setting is common,
though perhaps understudied, in biomedical settings. Guidelines for treating patients often use
somewhat arbitrary thresholds based on tests or age for recommending in favor of or against
particular procedures.
Recently there has been important work on alternatives to local linear estimators. This work
focuses on the characterization of the estimator as a weighted average of the outcomes to the
right minus a weighted average of the outcomes to the left of the threshold. Choices of estimators
correspond to choices of weight functions. This literature has attempted to characterize optimal
choices for weights by minimizing the expected squared error under the worst-case scenario for
the conditional expectation of the outcome given the covariate. These approaches deal very ef-
fectively with settings with discrete as well as continuous covariates and make the reliance on
smoothness assumptions explicit (see Armstrong & Kolesár 2018, Imbens & Wager 2018).
To assess the plausibility of regression discontinuity designs, some specific procedures have
been proposed. These are useful in cases with concerns that the value of the running variable might
have been manipulated. Given such manipulation, one would expect to see that the distribution
of the running variable is discontinuous around the threshold. The McCrary test (McCrary 2008)
formalizes this. One can also test whether the expected value for other covariates is discontinuous
at the threshold for the running variable.
A practical concern with regression discontinuity estimators is their limited external validity.
They are only valid for units close to the threshold, and in the fuzzy case only for the complier
subpopulation of those units. Angrist & Rokkanen (2015) and Bertanha & Imbens (2020) assess
the plausibility of extrapolating to larger subpopulations, in particular away from the threshold.
This involves inspecting discontinuities in the conditional expectation of the outcomes given
treatment status as a function of the covariate at the threshold. Smoothness of this conditional
expectation implies that control compliers and never-takers are not substantially different, and
6.1. Surrogacy
Gupta et al. (2019) discuss as one of the main challenges in online experimentation the problem
of estimating long-term causal effects from short experiments. Experimenters often want to act
on results quickly but may wish to optimize for long-term outcomes. During the experiment they
may be able to measure a number of short-term outcomes that are all related to the primary (long-
term) outcome. The question arises of how to combine these multiple short-term outcomes into
a single variable that can be used to decide on the efficacy of the intervention. Athey et al. (2020a)
suggest combining the short-term variables into a single predictor of the long-term outcome.
They consider a setting with two samples, one experimental, where we observe the treatment and
the short-term outcomes, the surrogates, but not the primary outcome, as illustrated in Figure 8b,
and an observational sample, where we observe the surrogates and the primary outcome, but not
the treatment, as illustrated in Figure 8a. This can be motivated by the assumption that the short-
term variables are valid surrogates in the sense of Prentice (1989). The key component of the
assumption is that all causal paths from the treatment to the outcome go through at least one
of the surrogates, so that there is no direct causal effect of the treatment on the outcome, only
indirect effects through the surrogates, as illustrated in Figure 8. In the case where all the effects
are linear, this leads to a Baron & Kenny (1986)–type approach where the causal effects on the
surrogates are weighted by their coefficients in a predictive regression.
Figure 8
(a) Data generation for observational data. We observe in the observational data the surrogates and the
primary outcome, but not the treatment. (b) Data generation for experimental data. For the experimental
data, we observe the treatment and the surrogates but not the primary outcome.
Figure 9
(a) Observational data. For the observational data, we observe all three variables, the treatment, the
secondary outcome, and the primary outcome. (b) Experimental data. For the experimental sample, we
observe only the treatment and the secondary outcome.
treatment and the secondary outcome. They then adjust for the estimated unobserved confounder
in the observational data to estimate the average effect of the treatment in the observational study.
7. CONCLUSION
The literature on causal inference in statistics and the social sciences has been a fast-growing one in the past twenty years. With close interactions between methodologists and empirical researchers in a variety of disciplines, including political science, economics, statistics, and computer science, the credibility of empirical work has improved substantially. This trend shows no sign of slowing down, with important advances being made in the literature on spillovers in observational studies and on estimation and inference for dynamic effects in panel data.
DISCLOSURE STATEMENT
The author is not aware of any affiliations, memberships, funding, or financial holdings that might
be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
This work was partially supported by the Office of Naval Research under grant N00014-17-1-
2131.
LITERATURE CITED
Abadie A, Angrist J, Imbens G. 2002. Instrumental variables estimates of the effect of subsidized training on
the quantiles of trainee earnings. Econometrica 70(1):91–117
Abadie A, Athey S, Imbens G, Wooldridge J. 2020. Sampling-based versus design-based uncertainty in
regression analysis. Econometrica 88:265–96
Abadie A, Cattaneo M. 2018. Econometric methods for program evaluation. Annu. Rev. Econ. 10:465–503
Abadie A, Diamond A, Hainmueller J. 2010. Synthetic control methods for comparative case studies:
estimating the effect of California’s tobacco control program. J. Am. Stat. Assoc. 105(490):493–505
Abadie A, Gardeazabal J. 2003. The economic costs of conflict: a case study of the Basque Country. Am. Econ.
Rev. 93:113–32
Abadie A, Imbens G. 2006. Large sample properties of matching estimators for average treatment effects.
Econometrica 74(1):235–67
Abadie A, Imbens G. 2011. Bias-corrected matching estimators for average treatment effects. J. Bus. Econ. Stat.
29(1):1–11
Abadie A, Imbens G. 2016. Matching on the estimated propensity score. Econometrica 84(2):781–807
Angrist J, Krueger A. 1999. Empirical strategies in labor economics. In Handbook of Labor Economics, Vol. 3, ed.
OC Ashenfelter, D Card, pp. 1277–366. Amsterdam: Elsevier
Angrist J, Rokkanen M. 2015. Wanna get away? Regression discontinuity estimation of exam school effects
away from the cutoff. J. Am. Stat. Assoc. 110(512):1331–44
Arkhangelsky D, Athey S, Hirshberg D, Imbens G, Wager S. 2021. Synthetic difference-in-differences. Am.
Econ. Rev. 111(12):4088–118
Armstrong T, Kolesár M. 2018. Optimal inference in a class of regression models. Econometrica 86(2):655–83
Aronow P, Samii C. 2017. Estimating average causal effects under general interference, with application to a
social network experiment. Ann. Appl. Stat. 11(4):1912–47
Athey S, Bahati M, Doudchenko N, Imbens G, Khosravi K. 2021. Matrix completion methods for causal panel
data models. J. Am. Stat. Assoc. 116(536):1716–30
Athey S, Chetty R, Imbens G, Kang H. 2020a. Estimating treatment effects using multiple surrogates: the role
of the surrogate score and the surrogate index. arXiv:1603.09326 [stat.ME]
Athey S, Chetty R, Imbens G. 2020b. Combining experimental and observational data to estimate treatment
effects on long term outcomes. arXiv:2006.09676 [stat.ME]
Athey S, Eckles D, Imbens GW. 2018a. Exact p-values for network interference. J. Am. Stat. Assoc.
113(521):230–40
Athey S, Imbens G. 2016. Recursive partitioning for heterogeneous causal effects. PNAS 113(27):7353–60
Athey S, Imbens G. 2017. The econometrics of randomized experiments. In Handbook of Economic Field
Experiments, Vol. 1, ed. AV Banerjee, E Duflo, pp. 73–140. Amsterdam: Elsevier
Athey S, Imbens G. 2021. Design-based analysis in difference-in-differences settings with staggered adoption.
J. Econom. 226(1):62–79
Athey S, Imbens G, Wager S. 2018b. Approximate residual balancing. J. R. Stat. Soc. Ser. B 80(4):597–623
Athey S, Wager S. 2021. Policy learning with observational data. Econometrica 89(1):133–61
Bajari P, Burdick B, Imbens G, Masoero L, McQueen J, et al. 2021. Multiple randomization designs.
arXiv:2112.13495 [stat.ME]
Bajari P, Burdick B, Imbens G, Masoero L, McQueen J, et al. 2023. Experimental design in marketplaces. Stat.
Sci. 38(3):458–76
Banerjee A. 2020. Field experiments and the practice of economics. Am. Econ. Rev. 110(7):1937–51
Baron R, Kenny D. 1986. The moderator–mediator variable distinction in social psychological research:
conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol. 51(6):1173–82
Basse GW, Feller A, Toulis P. 2019. Randomization tests of causal effects under interference. Biometrika
106(2):487–94
Bertanha M, Imbens G. 2020. External validity in fuzzy regression discontinuity designs. J. Bus. Econ. Stat.
38(3):593–612
Besbes O, Gur Y, Zeevi A. 2014. Stochastic multi-armed-bandit problem with non-stationary rewards. In
NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, ed.
Z Ghahramani, M Welling, C Cortes, ND Lawrence, KQ Weinberger, pp. 199–207. Cambridge, MA:
MIT Press
Bickel P, Klaassen C, Ritov Y, Wellner J. 1993. Efficient and Adaptive Estimation for Semiparametric Models.
Baltimore, MD: Johns Hopkins Univ. Press
Black S. 1999. Do better schools matter? Parental valuation of elementary education. Q. J. Econ. 114(2):577–99
Bond RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, et al. 2012. A 61-million-person experiment in social
influence and political mobilization. Nature 489(7415):295–98
Brodersen K, Gallusser F, Koehler J, Remy N, Scott S. 2015. Inferring causal impact using Bayesian structural
time-series models. Ann. Appl. Stat. 9(1):247–74
Callaway B, Sant’Anna P. 2020. Difference-in-differences with multiple time periods. J. Econom. 225(2):200–30
Calonico S, Cattaneo M, Titiunik R. 2014. Robust nonparametric confidence intervals for regression-
discontinuity designs. Econometrica 82(6):2295–326
Chamberlain G. 1984. Panel data. In Handbook of Econometrics, Vol. 2, ed. Z Griliches, MD Intriligator,
pp. 1247–318. Amsterdam: Elsevier
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W. 2017. Double/debiased/Neyman
machine learning of treatment effects. Am. Econ. Rev. 107(5):261–65
Lin W. 2013. Agnostic notes on regression adjustments to experimental data: reexamining Freedman’s critique.
Ann. Appl. Stat. 7(1):295–318
Lin Z, Ding P, Han F. 2021. Estimation based on nearest neighbor matching: from density ratio to average
treatment effect. arXiv:2112.13506 [math.ST]
Liu L, Wang Y, Xu Y. 2022. A practical guide to counterfactual estimators for causal inference with time-series
cross-sectional data. Am. J. Political Sci. In press
Liu Y, Van Roy B, Xu K. 2023. A definition of non-stationary bandits. arXiv:2302.12202 [cs.LG]
Manski CF. 1990. Nonparametric bounds on treatment effects. Am. Econ. Rev. 80(2):319–23
Manski CF. 2003. Partial Identification of Probability Distributions. New York: Springer
Manski CF. 2004. Statistical treatment rules for heterogeneous populations. Econometrica 72(4):1221–46
Masten M, Poirier A, Zhang L. 2020. Assessing sensitivity to unconfoundedness: estimation and inference.
arXiv:2012.15716 [econ.EM]
Matsudaira J. 2008. Mandatory summer school and student achievement. J. Econom. 142(2):829–50
Matzkin R. 1994. Restrictions of economic theory in nonparametric methods. In Handbook of Econometrics,
Vol. 4, ed. RF Engle, DL McFadden, pp. 2523–58. Amsterdam: Elsevier
McCrary J. 2008. Testing for manipulation of the running variable in the regression discontinuity design.
J. Econom. 142(2):698–714
Morgan S, Winship C. 2015. Counterfactuals and Causal Inference. Cambridge, UK: Cambridge Univ. Press
Newey W. 1990. Semiparametric efficiency bounds. J. Appl. Econom. 5(2):99–135
Oster E. 2019. Unobservable selection and coefficient stability: theory and evidence. J. Bus. Econ. Stat.
37(2):187–204
Pearl J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669–88
Pearl J. 2000. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge Univ. Press
Pearl J, Mackenzie D. 2018. The Book of Why: The New Science of Cause and Effect. New York: Basic Books
Peters J, Janzing D, Schölkopf B. 2017. Elements of Causal Inference: Foundations and Learning Algorithms.
Cambridge, MA: MIT Press
Porter J. 2003. Estimation in the regression discontinuity model. Work. Pap., Dep. Econ., Harvard Univ.,
Cambridge, MA. https://users.ssc.wisc.edu/∼jrporter/reg_discont_2003.pdf
Pouget-Abadie J, Aydin K, Schudy W, Brodersen K, Mirrokni V. 2019. Variance reduction in bipartite ex-
periments through correlation clustering. In NIPS’19: Proceedings of the 33rd International Conference on
Neural Information Processing Systems, ed. H Wallach, H Larochelle, A Beygelzimer, F d’Alché-Buc, E Fox,
R Garnett, pp. 13309–19. Red Hook, NY: Curran
Prentice R. 1989. Surrogate endpoints in clinical trials: definition and operational criteria. Stat. Med. 8(4):431–
40
Robins J. 1989. The analysis of randomized and non-randomized AIDS treatment trials using a new ap-
proach to causal inference in longitudinal studies. In Health Service Research Methodology: A Focus on AIDS,
pp. 113–59. Washington, DC: US Public Health Serv.
Robins J. 1997. Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications
to Causality, ed. M Berkane, pp. 69–117. New York: Springer
Robins J, Hernán MA, Brumback B. 2000. Marginal structural models and causal inference in epidemiology.
Epidemiology 11(5):550–60
Robins J, Rotnitzky A, Zhao L. 1994. Estimation of regression coefficients when some regressors are not
always observed. J. Am. Stat. Assoc. 89(427):846–66
Robins PK. 1985. A comparison of the labor supply findings from the four negative income tax experiments.
J. Hum. Resourc. 20(4):567–82
Rosenbaum PR. 1984. The consequences of adjustment for a concomitant variable that has been affected by
the treatment. J. R. Stat. Soc. Ser. A 147(5):656–66
Rosenbaum PR. 2002. Observational Studies. New York: Springer
Rosenbaum PR. 2010. Design of Observational Studies. New York: Springer
Rosenbaum PR. 2020. Modern algorithms for matching in observational studies. Annu. Rev. Stat. Appl. 7:143–
76
Rosenbaum PR, Rubin DB. 1983a. Assessing sensitivity to an unobserved binary covariate in an observational
study with binary outcome. J. R. Stat. Soc. Ser. B 45(2):212–18