From prediction to prescription: Machine learning and Causal Inference
Judith Abécassis, Elise Dumas, Julie Alberge, Gaël Varoquaux
November 8, 2024
Abstract
The increasing accumulation of medical data brings the hope of data-driven medical
decision-making, but its increasing complexity –as text or images in electronic health
records– calls for complex models, such as machine learning. Here, we review how
machine learning can be used to inform decisions for individualized interventions, a causal
question. Going from prediction to causal effects is challenging as no individual is seen as
both treated and not. We detail how some data can support some causal claims and how
to build causal estimators with machine learning. Beyond variable selection to adjust for
confounding bias, we cover the broader notions of study design that make or break causal
inference. As the problems span across diverse scientific communities, we use didactic
yet statistically precise formulations to bridge machine learning to epidemiology.
Contents
1 Introduction: from prediction to individualized interventions
1 Introduction: from prediction to individualized interventions
Data is getting richer and more complex: monitoring devices are cheaper, administrative records are
ubiquitous, and notes are consolidated into databases. This trend brings the hope of better data-driven
decision-making, such as individualized medicine. Machine learning (ML) tools are central to this hope,
as they can build predictions from such complex data. Yet prediction often does not suffice: inferring
causal effects is needed to decide on how to intervene. Causal effects characterize how an intervention
in a system modifies an outcome of interest. Not all statistical associations are causal. For instance, a
hospital visit is associated with an increase in mortality. This association results not from the hospital’s
effect but from the confounding effect of the worse baseline health of hospital-goers.
If the investigator can actually intervene, the straightforward approach to measuring causal effects is the
randomized study: as the populations with and without intervention are statistically alike, computing the
average effect of the intervention does not require sophisticated statistics. But the average effect provides
only a partial perspective: some individuals may respond to the intervention and others not. Personalized
decision makings call for individualized or conditional causal effects, which do require more elaborate
statistical approaches. Likewise, in observational settings, where a randomized intervention is not possible,
causal inference must emulate randomized allocation, and statistical modeling is crucial. This review
covers how machine learning can be used for this statistical modeling, opening the door to individualizing
interventions thanks to causal inference from complex or high-dimensional data. Compared to existing
reviews (Crown, 2019; Blakely et al., 2020; Prosperi et al., 2020; Curth et al., 2024; Liu, 2024), our
focus is on an up-to-date, didactic but statistically precise, description of machine learning for estimating
heterogeneous effects needed for personalized medicine.
The sophistication of modern machine-learning approaches and the amount of available data can capture
subtle variation of effects across individuals. But, when working from observational data –without
randomized intervention– a risk is always to capture non-causal associations resulting from current be-
haviors or policies. Causal validity requires a rigorous process. First, one must define causal questions
that can be answered from the data at hand, as explained in Section 2. Then, different machine-learning
approaches, detailed in Section 3, can be used to individualize causal effects. But pitfalls in ob-
servational data can trick the most powerful machine learning approach, and Section 4 discusses how to
avoid them with good study design.
By themselves, machine learning methods provide only predictions. Causal interpretations, estimating
the effects that ground data-driven decision-making, require more, namely a causal framework.
ATE (Average Treatment Effect): the average difference between treated and not treated across the whole population.

Computed across the whole population, the ATE summarizes the treatment effect, which can be different
for each individual. As it lumps together very different individuals, another interesting causal quantity
(estimand) is to consider a conditional average, considering relevant individual characteristics as covariates
XCATE ∈ R^dCATE, eg age, general health status, stage of diagnosis of the treated condition, etc:
Definition 2.2 (Conditional average treatment effect).
CATE: τ(xCATE) = E[Y(1) − Y(0) | XCATE = xCATE]   (2)

CATE (Conditional Average Treatment Effect): the average difference between treated and not treated for individuals with characteristics XCATE = xCATE.

The challenge of causal inference is that in practice we do not get to observe the potential outcomes: the
observed data are (Y, X, A), with X ∈ R^d consisting of all observed covariates for an individual, including the
covariates XCATE on which one can condition to define and identify the CATE estimands.
The CATE is crucial to making individualized decisions based on a specific patient's characteristics. The
above estimands are expressed in terms of potential outcomes, ie unobserved data. In some settings, these
quantities can be identified to statistical estimands of the available observed variables. In the following
sections, we detail this identification starting from the “ideal” case of the randomized controlled trial
(RCT), and then considering more complicated settings.
SUTVA states that the observed outcome corresponds to the potential outcome of the treatment actually
taken by the unit. In particular, it implies that there is only one version of the treatment and that there
is no interference between the units: the outcome of a unit does not depend on the treatment received
by another unit. Therefore, SUTVA is broken in several common cases, such as vaccination, where the
outcome of an individual depends not only on their own vaccination status but also on the degree of vaccination
in the surrounding population (herd immunity). SUTVA can also be broken by binarizing a continuous
treatment (with several possible doses) or when the treatment is ill-defined. For instance, a binary
variable indicating smoking or not is simplistic, as smoking duration and intensity may induce very
different outcomes. The problem of defining a suitable treatment with respect to the SUTVA assumption
(also called consistency) requires particular attention, as detailed in section 4.2.
In an RCT, randomization of the treatment assignment enforces the two other assumptions important
for causal inference: unconfoundedness and overlap.
Assumption 2.4. Ignorability (also called unconfoundedness) – The treatment assignment is indepen-
dent from the potential outcomes:
(Y(0), Y(1)) ⊥⊥ A.
Assumption 2.5. Overlap (also called positivity) – Any unit has a non-zero probability of receiving any
version of the treatment:
0 < P (A = 1) < 1.
With those assumptions, the causal estimand based on potential outcomes can be identified to a statistical
estimand that depends only on observed quantities:
E[Y (1)] = E[Y (1)|A = 1] from overlap and ignorability
= E[Y |A = 1] from SUTVA (3)
It follows similarly that E[Y (0)] = E[Y |A = 0] and thus the ATE is identified as such:
AT E = E[Y (1) − Y (0)] = E[Y |A = 1] − E[Y |A = 0]. (4)
Likewise, the CATE can be identified to E[Y |A = 1, XCAT E ] − E[Y |A = 0, XCAT E ] with conditional
versions of ignorability and overlap. These equalities are important, because they identify a causal
quantity, defined from the two potential outcomes impossible to observe, to quantities directly observable.
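As a concrete illustration of equation 4 in a randomized study, here is a minimal sketch, assuming NumPy arrays A and Y; the synthetic data and names are purely illustrative:

```python
import numpy as np

def difference_in_means(A: np.ndarray, Y: np.ndarray) -> float:
    """ATE estimate in an RCT (eq. 4): mean(Y | A = 1) - mean(Y | A = 0)."""
    return Y[A == 1].mean() - Y[A == 0].mean()

# Toy illustration on synthetic randomized data, with true ATE = 1.5
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=1_000)
Y = 1.5 * A + rng.normal(size=1_000)
print(difference_in_means(A, Y))
```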
[Figure: illustrative example with intervention = Intensive Care and outcome Y = number of days hospitalized.]
Recovering identifiability: conditional ignorability and overlap Alternative and less stringent
assumptions come into play: conditional versions of Overlap and Ignorability.
Assumption 2.6. Conditional ignorability The treatment is independent of the potential outcomes,
conditionally on a well-chosen set of covariates, Xident ∈ Rdident :
(Y(0), Y(1)) ⊥⊥ A | Xident.
Assumption 2.7. Conditional overlap Any unit has a non-zero probability of receiving any version of
the treatment, conditionally on covariates:
0 < P(A = 1 | Xident = xident) < 1 for all xident.
Here we denote Xident the covariates necessary for the conditional overlap and ignorability assumptions to
hold, a notation distinct from the covariates used in the CATE (definition 2.2), though some covariates
can belong to both sets; we assume that both covariate sets are observed, ie Xident ∪ XCATE ⊂ X. While the
covariates used in the CATE are chosen to analyze treatment effect heterogeneity, ensuring conditional
ignorability and overlap necessitates a meticulous selection of covariates, as elaborated in Section 2.4.
Intuitively, conditional overlap states that an individual has a comparable counterpart in the other treat-
ment group for any possible set of characteristics, making it possible to compare treated and untreated
individuals to estimate a causal effect. Conditional ignorability means that the treatment is as good as
randomly assigned among the subjects with the same characteristics. In other words, with conditional
versions of overlap and ignorability, among subjects with the same characteristics xident , the data is
similar to that obtained with an RCT, hence allowing estimation of a causal effect.
This intuition is maintained in the formal identifiability proofs: identification for the ATE proceeds
through identification of the CATE for each possible value of xident . Key to identification is that a
potential outcome, such as Y (1), can be written as an expectation of the observed outcomes Y reweighted
by the probability of treatment:
E[Y(1)] = E[ E[Y(1) | Xident = xident] ]                 from the law of total expectation
        = E[ E[Y(1) | A = 1, Xident = xident] ]          from conditional ignorability and overlap
        = E[ E[Y | A = 1, Xident = xident] ]             from SUTVA   (5)
        = E[ E[ Y 1{A=1} / P(A = 1 | Xident = xident) | Xident = xident ] ]
        = E[ Y 1{A=1} / P(A = 1 | Xident = xident) ]     (6)
Strategies to estimate causal estimands Equations 5 and 6 form the basics of estimation strategies
for causal estimands from observations:
Outcome prediction Equation 5 suggests modeling the conditional expectation of outcomes given the
covariates and using their difference to obtain the causal effect. This approach makes a link
to the intuitive causal reasoning behind mechanistic models, where changes in the input induce
changes in the output in a causal way. Note that this causal interpretation holds for predictive
models only if the SUTVA, conditional overlap, and conditional ignorability assumptions hold
(which requires a good choice of covariates) and the conditional expectations in eq 5 are well
estimated.
Inverse propensity weighting (IPW) Alternatively, equation 6 grounds the so-called IPW approaches
based on reweighing observations. The weights –the inverse of propensity scores– depend on the
values of the treatment and the covariates Xident : w = 1/P(A = 1|Xident = xident ). This
reweighing makes the covariate distributions comparable across the treated and untreated pop-
ulations. The propensity score “summarizes” the influence of the (high-dimensional) covariates
into a single number so that it can cancel out.
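To make the IPW strategy concrete, here is a minimal sketch of an ATE estimate based on equation 6, assuming scikit-learn and NumPy arrays X_ident, A, Y; the logistic propensity model and the clipping threshold are illustrative choices, not prescribed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X_ident, A, Y, clip=0.01):
    # Estimate propensity scores e(x) = P(A = 1 | Xident = x)
    e_hat = LogisticRegression(max_iter=1000).fit(X_ident, A).predict_proba(X_ident)[:, 1]
    # Guard against extreme scores (see the later discussion of overlap and trimming)
    e_hat = np.clip(e_hat, clip, 1 - clip)
    # Reweighted means: E[Y 1{A=1} / e(X)] - E[Y 1{A=0} / (1 - e(X))]
    return np.mean(A * Y / e_hat) - np.mean((1 - A) * Y / (1 - e_hat))
```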
Confounding bias and spurious causal associations The prototypical threat to ignorability is con-
founding bias, in which a third (confounding) variable explains the seemingly causal association between the
treatment A and the outcome Y seen in the data. Failure to adjust for confounding bias can lead to
a phenomenon known as Simpson’s paradox (Simpson, 1951), which occurs when a trend between the
treatment and the outcome in the whole dataset is different when considering subgroups defined by a
third (confounding) variable. Simpson’s paradox was noted in comparing treatments A and B for kidney
stones (Charig et al., 1986; Julious and Mullee, 1994). Treatment A performed better for both small and
large stones individually, but when data were combined, treatment B seemed superior. This paradox
occurs because treatment A was mainly used for large stones, which have lower success rates. Without
accounting for stone size, treatment A appeared less effective overall.
One might think that including many variables in the confounding set, Xident , is a good idea to avoid
confounding bias, making the ignorability assumption more plausible. However, by doing so, we may
introduce other biases (MacKinnon and Lamp, 2021; Schisterman et al., 2009; Cinelli et al., 2024). A
famous example of bias induced by over-adjustment is known as Berkson’s paradox. Berkson’s paradox
manifests in a very similar way to Simpson’s paradox, with a different correlation trend in the whole
population or a subset, but the underlying causal mechanism is different: if we study individuals selected
because they show degraded health –for instance hospitalized– we will find anti-correlations between
causes: if a patient was not hospitalized for a stroke, he must have had another reason (Berkson, 1946).
Such selection bias breaks causal inference and keeps happening, for instance, when investigating COVID-
19 by focusing on symptomatic individuals (Griffith et al., 2020). The use of a causal graph can provide
clear guidance to select the right adjustment set and establish whether conditional ignorability holds for
a given causal question while avoiding the caveats of over-adjustment.
What is a causal graph? A causal graph describes the causal links between the variables of the
problem, including unobserved variables. The corresponding variables are represented as nodes connected
by directed edges (arrows). A causal graph is defined as a directed acyclic graph (DAG): all edges are
directed and form no cycles. Figure 2a shows the confounding bias described above written as a (very
simple) causal graph: the treatment A, the outcome of interest Y , and a third variable X. Arrows
between them represent their causal relationships: the treatment A has a causal effect on the outcome Y ,
but the variable X, the stone size, also influences both A and Y . X is called a confounder. Intuitively, a
confounder explains part of the association between A and Y . This part of the A − Y association is thus
not causal, and to isolate the causal effect, we need to correct the non-causal effect due to X. In that
simple case, adjusting for X suffices to obtain identifiability using sec. 2.3. As we will see, using a graph
enables going from domain assumptions to the choice of the right variables in Xident.

Confounder: a variable influencing both treatment assignment and outcome. Unaccounted for, it creates a non-causal association between treatment and outcome.

Identifiability criterion using the graph To formalize and generalize the idea of a causal and a
non-causal part of the association between two variables, we can introduce the concept of path. A path
in the graph is a sequence of at least two different nodes, such that there is an edge between each node
and its successor in the sequence, regardless of its direction. If a path exists between A and Y, with all
the edges oriented in the same direction, we say that this path is causal. In the graph of Figure 2a, there
are two paths from A to Y : a causal path A → Y , and a non-causal path A ← X → Y . If a node in
the path is the target of two distinct edges in the path, it is called a collider relative to this path. In the
graph of Figure 2d, V2 is a collider on the path V1 → V2 ← V3 .
Paths can be opened or blocked. If we do not condition on any variable, a path is open unless there is a
collider on it. If we condition on a non-collider node on a path, it blocks the path, but if we condition on
a collider node, it opens the path. Overall, a path is blocked if there is at least one “blocking node” on it.
Intuitively, if the path from A to Y is open, information can be exchanged between those two variables
through this path, ie the two variables will be non-independent in the data.
We can now establish a criterion for choosing which variables to adjust for in order to enforce the
conditional ignorability hypothesis. We can identify the causal effect between an exposure A and an
outcome Y if we can adjust on a subset of variables Xident such that they block all the non-causal paths
from A to Y , and leave open all the causal ones. The procedure of finding an adjustment subset can be
automatized, for instance, in the software Dagitty (Textor et al., 2016). Satisfying conditional ignorability
is only possible if the graph structure and observed variables allow it.
How to obtain a causal graph? The causal structure encoded in the graph enables precise identifica-
tion conditions. However, it requires knowing this graph. There are two main approaches to obtaining a
causal graph. The first is expert knowledge, where the graph is built by hand, ideally with the assistance
of one or several experts of the application domain. The second is using causal discovery to construct
the graph using available data (Peters et al., 2017; Huber, 2024). In both cases, there can be uncertainty
in the graph structure, and therefore it might be relevant to assess the robustness of the causal effect
estimation to variations in the graph.
When it comes to variable selection, the perfect is the enemy of the good The criterion
for conditional ignorability obtained from the causal graph replaces classical heuristics to choose the
variables to include in the adjustment set. Simple heuristics can indeed lead to invalid analysis. A first
misguided strategy would be to adjust on all the available variables, to make sure that all confounders are
accounted for. The main risk is to select variables that should not be adjusted for, such as colliders (node
Col in Figure 2d), common effects of the treatment and the outcome (MacKinnon and Lamp, 2021), or
mediators (node M in Figure 2d) that could block a causal path and bias the estimation of the causal
effect. Adjusting for those nodes will introduce additional bias in the estimation instead of correcting for
confounding bias. An alternative way to express this is to state that one should refrain from adjusting for
post-treatment variables, ie, variables that are causally influenced by the treatment. A second misguided
strategy is to select variables by relying on correlations in the data instead of knowledge of the graph
structure. This strategy may also lead to selecting colliders or mediators. Another strategy, suboptimal
though not invalid, is to consider variables that are causes of both the action A and the outcome Y .
This set selection might lead to overadjustment in the sense that a more parsimonious subset of variables
would lead to a valid causal result. Reducing the number of variables in the model is useful because
it facilitates estimation (reducing the variance). Finally, adjustment sets with more confounders make
the conditional ignorability assumption more plausible but, at the same time, reduce overlap (D'Amour
et al., 2021). In some cases, it can be interesting to consider not adjusting on weak confounders to
preserve a stronger overlap, which could result in a smaller overall bias, though this is difficult to assess
in practice.
Once the target estimand is defined with the covariates needed for identification from the available
data, the next step is to use this data to actually estimate the causal quantity of interest. Recent
machine-learning methods bring new flexibility to such estimation beyond simple models traditionally
used –typically linear models. This flexibility is particularly welcomed to handle larger and more complex
datasets as with electronic health records. We will first outline the two main strategies for estimation,
introduced in the previous section: sample reweighting and outcome modeling, and then more advanced
methods that best use machine learning for CATE estimation. In each case, we will outline practical
considerations for the implementation of those approaches.
Individualized decision making calls for estimating the dependency of the effect on covariates, the CATE
τ(x). This estimation can be obtained by replacing the empirical mean in equation 8 by a regression on
X of reweighed targets (Wager and Athey, 2018) –or pseudo outcomes– YIPW := Y (A/ê(X) − (1 − A)/(1 − ê(X))). In
practice, any machine-learning regressor can be used, learning to predict YIPW from XCATE.
These formulas rely on inverting estimated probabilities, and this inversion will amplify estimation noise
in ê when it is close to 0 or 1. But these “extreme” propensity scores signify that for some observations,
the treatment is almost deterministic, thus violating overlap (assumption 2.5). In such regions of the
covariate space, the treatment may be very unlikely, maybe infeasible, and thus, it may be preferable to
avoid making causal claims (Li et al., 2019; Oberst et al., 2020). One solution is “trimming”, excluding
observations with ê ∉ [α, 1 − α] for a choice of α (Crump et al., 2009). This approach, however, does not
define explicitly the covariate space on which the causal estimates are valid, and a more explicit solution
may be preferable (Oberst et al., 2020).
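A minimal sketch of such trimming, assuming estimated propensity scores e_hat as a NumPy array; the threshold α = 0.05 and the names are illustrative choices:

```python
import numpy as np

def trim(X, A, Y, e_hat, alpha=0.05):
    """Keep only observations whose estimated propensity lies in [alpha, 1 - alpha]."""
    kept = (e_hat >= alpha) & (e_hat <= 1 - alpha)
    return X[kept], A[kept], Y[kept], e_hat[kept]
```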
Estimating good propensity scores ê can be estimated with any classifier predicting A from Xident.
However, here the goal is not good classification, but accurate probabilities. This goal requires selecting
models not to maximize accuracy or area under the ROC curve –Receiver Operating Characteristic
(Varoquaux and Colliot, 2023)– but using a strictly proper scoring rule (Gneiting and Raftery, 2007),
such as log-loss or Brier score. A common misconception is that it suffices to measure and correct
the calibration error. The calibration error measures whether a probabilistic classifier is overconfident or
underconfident, and simple recalibration methods can correct it: eg isotonic recalibration (Niculescu-Mizil
and Caruana, 2005) or temperature scaling (Guo et al., 2017). Yet, a classifier with zero calibration error
may be far from the conditional probability e (Perez-Lebel et al., 2022). As machine-learning models
can be systematically over or under-confident (Guo et al., 2017; Minderer et al., 2021), recalibration
techniques may be useful. But, all in all, proper scoring rules must drive every aspect of model selection,
including confirming the benefit of recalibration.
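As a sketch of such model selection, assuming scikit-learn, arrays X_ident and A, and two illustrative candidate classifiers, one can cross-validate with the (strictly proper) Brier score rather than accuracy:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbt": HistGradientBoostingClassifier(),
}
# neg_brier_score is a strictly proper scoring rule; higher (less negative) is better
scores = {
    name: cross_val_score(model, X_ident, A, scoring="neg_brier_score", cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
```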
We use the notation X = Xident ∪ XCATE, to simplify notations and as we aim at estimating the
CATE. µ̂ is estimated by fitting a regression model of the outcome given the treatment and the covariates,
eg with a base machine-learning model. 2) Then the estimate of the CATE is given by contrasting the
predictions for the two different treatment options:
τ̂(x) = µ̂(x, 1) − µ̂(x, 0)
If a single predictive model is used for the regression in equation 9, this approach is called S-learner.
It may be beneficial to fit two distinct models for the treatment and the control, an approach called
T-learner (Künzel et al., 2019):
τ̂T(x) := µ̂1(x) − µ̂0(x),   with µa(x) = E[Y | A = a, X = x]   (11)
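A minimal sketch of the S- and T-learners with a generic base regressor, assuming scikit-learn and NumPy arrays X, A, Y; the gradient-boosted-tree base model is an illustrative choice, not prescribed by the text:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def s_learner_cate(X, A, Y, x_new):
    # Single model fitted on covariates augmented with the treatment indicator
    mu = HistGradientBoostingRegressor().fit(np.column_stack([X, A]), Y)
    pred1 = mu.predict(np.column_stack([x_new, np.ones(len(x_new))]))
    pred0 = mu.predict(np.column_stack([x_new, np.zeros(len(x_new))]))
    return pred1 - pred0

def t_learner_cate(X, A, Y, x_new):
    # Two separate models, one per treatment arm
    mu1 = HistGradientBoostingRegressor().fit(X[A == 1], Y[A == 1])
    mu0 = HistGradientBoostingRegressor().fit(X[A == 0], Y[A == 0])
    return mu1.predict(x_new) - mu0.predict(x_new)
```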
Both estimators are unbiased if the base regression models are unbiased1. The S and the T-learner
correspond to two different inductive biases: the patterns that they easily capture differ (Curth and
Van der Schaar, 2021). The S-learner will foster similar response surfaces for the treated and control
populations. In the simple case of using a linear model for µ̂, if no interaction term between A and X
is added, the model imposes the same slope for treated and non-treated, capturing a constant –and not
heterogeneous– effect (fig. 3a). In the T-learner, the two response functions are not biased to resemble
each other, as they are fitted separately on distinct subparts of the data, which comes at the cost of
using less data. Which one to prefer? The S-learner is preferable if there is enough shared structure

Inductive bias: how a given machine learning method favours certain patterns over others to generalize. It can be explicitly visible in the model form (eg linear), or implicit.

1 To be precise, the no-bias results are asymptotic, ie characterize consistency.
Figure 3: Different meta learners and base machine-learning algorithms. The upper plots display
the estimations using parametric Linear Regression, including the response functions for the S-learner (a), the
T-learner (b), and the CATE estimates for all meta-learners (c). The lower plots present the same estimations,
but using Gradient Boosting Trees (GBT) for the estimation of nuisance parameters, including the S-learner (d),
the T-learner (e), and the CATE estimates for all meta-learners (f).
between µ0 (x) and µ1 (x) to simplify markedly their estimation, a common setting in medicine (Hahn
et al., 2020; Curth and Van der Schaar, 2021). Gauging this in practice is difficult, not only because one
does not know beforehand µ0 and µ1 , but also because the relevant notions of simplicity relate to the
base machine-learning model used, with their own inductive biases. For a very flexible base model, there
can be little benefit of the T-learner compared to the S-learner (fig. 3d and 3e). Indeed, in regions where
the two potential outcomes differ, a flexible S-learner will model these differently. Consider for instance
a tree-based model (random forest, gradient-boosted trees): if the treatment is very predictive of the
outcome, the first split of the trees will be on the treatment, and the two populations will subsequently
be fitted separately.
The dilemma between the S and the T learner illustrates another challenge of estimating heterogeneous
causal effects: while we can easily control the error on the response functions µ0(x) and µ1(x) separately,
we would like to minimize errors on τ (x) = µ1 (x) − µ0 (x), a difference that we never observe, as no
individual is simultaneously treated and not treated.
is computed by reversing the role of treated and controls. The final estimate is obtained by a weighted
combination of the estimates on both groups: τ̂(x) = ê(x) τ̂0(x) + (1 − ê(x)) τ̂1(x). A key aspect here is that
the weights ê favor the relative estimates where they are more trustworthy: where the treatment is likely,
ê is high, putting more weight on τ̂0, which is itself estimated via µ̂1 and is therefore less noisy where
there are many treated units.
R-decomposition and R-loss As one of the challenges is to separate out the treatment effect τ (x), the
difference between the two treatment groups, from the common effects, the variations of baseline risk, it
is useful to rewrite the problem introducing the mean outcome:
Conditional mean outcome: m(x) := E[Y | X = x]   (12)
The outcome can then be written (Robinson, 1988; Nie and Wager, 2021):
R-decomposition: Y(A) = m(X) + (A − e(X)) τ(X) + ε(X, A),   (13)
where, importantly, E[ε(X, A)|X, A] = 0. This rewriting shows that, given estimates of m and e, τ can be
readily estimated from the data by optimizing τb to minimize ε. Specifically, it suggests a risk to minimize
(Nie and Wager, 2021; Chernozhukov et al., 2018; Foster and Syrgkanis, 2023; van der Laan and Luedtke,
2014):
R-risk(τf) := E[ (Y − m(X) − (A − e(X)) τf(X))² ]   (14)
where τf is the candidate CATE that is optimized. The CATE τb(x) can then be estimated by fitting a
machine-learning model using as a loss the above R-risk. This approach can easily be implemented even
with machine-learning toolkits that do not support custom losses but do support sample weights. Any
regression model (based on a squared loss) can be adapted by fitting pseudo outcomes2 YR := (Y − m̂(X)) / (A − ê(X))
and using as sample weights3 WR := (A − ê(X))².
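A minimal sketch of this R-learner recipe, assuming the nuisance estimates m_hat and e_hat have been computed beforehand (ideally by cross-fitting, see below) and scikit-learn; the base model and names are illustrative:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def r_learner_cate(X_cate, A, Y, m_hat, e_hat, eps=1e-6):
    residual_A = A - e_hat
    # Pseudo-outcome Y_R = (Y - m(X)) / (A - e(X)), with eps added for numerical stability
    pseudo_y = (Y - m_hat) / (residual_A + eps)
    # Sample weights W_R = (A - e(X))^2
    weights = residual_A ** 2
    tau = HistGradientBoostingRegressor()
    tau.fit(X_cate, pseudo_y, sample_weight=weights)
    return tau  # tau.predict(x) then estimates the CATE
```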
Augmented IPW and DR-learner Another route that leads to a risk combining outcome model and
IPW is to consider corrections of the individual methods. Indeed, as discussed previously, in outcome
modeling, the estimations of µ do not minimize the error on τ . The theory of influence functions can
give corrections (Robins and Rotnitzky, 1995; Hahn, 1998), leading to define a pseudo-outcome, known
as AIPW:
YAIPW := µ1(X) − µ0(X) + (A / e(X)) (Y − µ1(X)) − ((1 − A) / (1 − e(X))) (Y − µ0(X))   (15)
YAIP W can be seen as providing corrections to a simple outcome-model estimator of τ (x): µ1 (x) − µ0 (x).
The corresponding correction is an IPW applied to the residuals of the outcome-model estimate. The
formula can be rewritten to expose another, symmetric, interpretation:
YAIPW = Y (A / e(X) − (1 − A) / (1 − e(X))) + ((e(X) − A) / e(X)) µ1(X) + (((1 − A) − (1 − e(X))) / (1 − e(X))) µ0(X)   (16)
Here, YAIPW appears as an IPW estimate (eq. 8), with a correction that corresponds to an outcome
model reweighed by the residual treatment probabilities.
A CATE estimate, called DR-learner, can be built by using a machine-learning model to regress YAIP W
on X (Kennedy, 2023). The DR-learner divides by propensity scores, which will create noise for regions
of extreme propensity scores (low overlap), unlike the R-learner which ignores the corresponding samples.
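A minimal sketch of the DR-learner, assuming pre-computed (ideally cross-fitted) nuisance estimates mu1_hat, mu0_hat, e_hat, and scikit-learn; the base model and names are illustrative:

```python
from sklearn.ensemble import HistGradientBoostingRegressor

def dr_learner_cate(X_cate, A, Y, mu1_hat, mu0_hat, e_hat):
    # AIPW pseudo-outcome (eq. 15)
    y_aipw = (
        mu1_hat - mu0_hat
        + A / e_hat * (Y - mu1_hat)
        - (1 - A) / (1 - e_hat) * (Y - mu0_hat)
    )
    # Regress the pseudo-outcome on the covariates of interest to obtain a CATE model
    return HistGradientBoostingRegressor().fit(X_cate, y_aipw)
```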
2 For numerical stability, it can be useful to add ϵ, typically 10−6, to the denominator. Note that not adding this ϵ to the
sample weights WR will shrink to zero samples with no reasonable matching counterparts, ie in regions without overlap,
thus letting the inductive bias of the model form (trees, linear model) fill in.
3 Without the sample weights, the regression would give a U-learner (Künzel et al., 2019; Nie and Wager, 2021), which
is unstable with extreme propensity scores.
Cross-fitting The above formulas (eq. 14 and 15) need estimates of nuisances e, µ, m, which must be
computed beforehand, using a machine-learning model. To avoid the estimation error of this first step
coupling into the second step, which regresses on YR or YAIPW, the two steps should ideally be
carried out on different samples. One approach is to split the data in two folds, fit the nuisance models
on the first half, the CATE estimators on the second half, repeat the procedure swapping the two folds,
and average the resulting CATE predictors.
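A minimal two-fold cross-fitting sketch for the nuisance estimates, assuming scikit-learn and NumPy arrays; the base models are illustrative choices:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fit_nuisances(X, A, Y, n_splits=2, seed=0):
    e_hat = np.empty_like(Y, dtype=float)
    m_hat = np.empty_like(Y, dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Nuisances fitted on one fold, predicted on the held-out fold
        e_model = HistGradientBoostingClassifier().fit(X[train], A[train])
        m_model = HistGradientBoostingRegressor().fit(X[train], Y[train])
        e_hat[test] = e_model.predict_proba(X[test])[:, 1]
        m_hat[test] = m_model.predict(X[test])
    return e_hat, m_hat
```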
Doubly robust property These CATE estimators perform well because the IPW errors (on ê) and
the outcome-modeling errors (on µ̂ or m̂) cancel out. Asymptotically, only one of the two models (IPW
or outcome model) needs to be unbiased (well specified) for the CATE estimator to be unbiased (Bang
and Robins, 2005): the corrections in equations 15 or 16 will be null. In finite samples, even unbiased
estimators of IPW and outcome models will have estimation noise. However, in the CATE estimators,
their errors are multiplied with one another, and as a result, they partly cancel out, giving fast convergence
rates: fewer samples are needed to obtain good estimates (Nie and Wager, 2021; Chernozhukov et al.,
2018; Kennedy et al., 2024).
Super learners In causal-inference settings, eg health, the data are often of a tabular nature, on which
tree models (such as gradient boosting) tend to work well (Grinsztajn et al., 2022). However, this is not
a general rule; typically, for low amounts of data or high-noise settings, linear models may be preferable.
A popular approach in causal inference is the super-learner (Van der Laan et al., 2007), which can be
implemented by using model stacking (Breiman, 1996) on a few complementary models, typically linear and
tree-based models –modern autoML relies on a related notion of model portfolio (Feurer et al., 2022). A
recent thorough benchmark of machine learning for causal inference confirmed that such stacking of a linear
model and gradient-boosted trees was a good solution for nuisance models (Doutreligne and Varoquaux,
2023).
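A minimal sketch of such a stacked nuisance model with scikit-learn; the choice of a ridge linear model plus gradient-boosted trees is an illustrative reading of the recommendation above:

```python
from sklearn.ensemble import HistGradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV

outcome_model = StackingRegressor(
    estimators=[
        ("linear", RidgeCV()),
        ("gbt", HistGradientBoostingRegressor()),
    ],
    final_estimator=RidgeCV(),  # combines the predictions of the two base learners
)
# outcome_model.fit(X, Y) can then serve as the nuisance m_hat or mu_a
```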
Causal model selection Nuisance models can be selected by cross-validating in a standard machine-
learning way. However, when selecting a model for the CATE it is important to keep in mind that the
best model is not the best predictor in the usual sense (squared loss). Rather, a good approach is to
measure the R-Risk (eq. 14) in a cross-validation loop (Doutreligne and Varoquaux, 2023). For this,
nuisance models must be fitted in the train set in parallel to the CATE model, as they are needed for the
R-Risk.
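A minimal sketch of scoring a candidate CATE model with the empirical R-risk on held-out data, assuming cross-fitted nuisance estimates m_hat and e_hat and a fitted CATE model; names are illustrative:

```python
import numpy as np

def r_risk(tau_model, X_cate, A, Y, m_hat, e_hat):
    """Empirical R-risk (eq. 14) on held-out data: lower is better."""
    tau_pred = tau_model.predict(X_cate)
    return np.mean((Y - m_hat - (A - e_hat) * tau_pred) ** 2)
```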
Different plausible machine-learning approaches can give different causal estimates (Bouvier et al., 2024;
Doutreligne et al., 2023). An empirical comparison of methods can assess the robustness of the result and
grasp the key factors influencing the result. Such an approach, sometimes called vibration analysis (Patel
et al., 2015; Doutreligne et al., 2023) can show how much the estimated effect varies across methods and
which methods achieve similar or different performances.
Using data to personalize clinical decisions requires causal inference for each possible intervention, each
with a tailored choice of causal estimand as well as an appropriate identification strategy (Hernán et al.,
2019), multiple steps that require care and background knowledge (Hoffman et al., 2024). Indeed, unlike
with classic prediction scenarios, as in machine-learning, we cannot just compare predicted values to
observed values: we have never observed the same individual both treated and not. Causal inference
can go wrong in many ways. For instance, a flawed study design (the formulation of the question and
data to include) can lead to studying an impossible scenario such as going back in time and applying a
prevention strategy to save an individual. These problems cannot be solved by accumulating large sample
sizes or deploying more sophisticated procedures, eg based on machine-learning: if you aim at the wrong
target, you'll miss, no matter how good an archer you are.

Study design: the overall strategy to answer a question, encompassing all the choices made for analyzing the collected data.

Let us consider a historical example: aspirin use and breast cancer recurrence. Several observational
studies suggested that aspirin use in breast-cancer patients may reduce the risk of breast cancer recurrence
(Chen and Holmes, 2017). As a result, aspirin was proposed as a potential treatment for breast cancer,
and a large double-blind randomized trial was conducted, in which patients were randomized to receive
either 300 mg of aspirin or a placebo daily for five years (Chen et al., 2024). Contrary to previous
observational findings, the trial results were negative, showing no reduction in breast cancer recurrence
risk with aspirin use. We use this example below to illustrate some points of attention when designing
and running a causal analysis. However, we do not intend to provide a definitive explanation for the
apparently different results between the randomized trial and the previous observational studies in the
case of aspirin and breast cancer.
We first explore potential pitfalls resulting from ill-defined causal questions and flawed study designs.
Intervention A well specified intervention is essential for SUTVA (assumption 2.3) to hold (Cole and
Frangakis, 2009). In an observational study, we may be tempted to estimate the “effect” of ever versus
never taking aspirin. This approach is problematic because we do not expect 300 mg of aspirin daily
to have the same effect on breast cancer recurrence as 500 mg of aspirin once every year, although
both use patterns would be included in the aspirin-ever group. In fact, the intervention “aspirin” is
ambiguous without at least the dosage, timing, frequency, and duration of use being specified. The
challenge when specifying the interventions is to determine the extent of detail necessary (Hernán, 2016).
While dosage, frequency, and duration are commonly acknowledged as factors which may influence the
potential outcomes, other details –such as the medication excipients, or whether it is taken during or
outside meals– might also potentially be of importance. In theory, interventions should be specified until
no different versions may result in different potential outcomes. Determining when this point is met is
non-trivial, and requires expert knowledge. The task of unambiguously defining interventions may seem
even more complex when dealing with individual attributes such as gender, ethnicity, or body mass index
(BMI) (Hernán and Taubman, 2008). Importantly, causal inference is not aimed at identifying causes
per se, but at estimating the effects of interventions: this is the “no causation without manipulation”
paradigm (Holland, 1986). Our causal framework (the potential outcomes framework) does not define,
let alone measure, the “effect” of a high BMI on an outcome. Rather, this framework allows us to
formulate and quantify the effect of specific interventions related to weight loss, such as initiating a
hypo-caloric diet program.
Control The control intervention used as a reference should also be carefully chosen and
clearly stated (Rosenbaum, 1999; Malay and Chung, 2012). For example, aspirin may decrease the risk
of breast cancer recurrence compared to no therapy but increase the risk of breast cancer recurrence
compared to chemotherapy.
Population A causal effect estimate is specific to the population in which it was estimated. Individual-
ized treatment effects as obtained with CATEs can sometimes be applied to a population different from
the study population; however, such generalization can be undermined by different interference structures or
treatment variants (Hernán and VanderWeele, 2011). For example, an estimator of the effect of aspirin on
breast cancer recurrence derived from a source population may not be directly applicable to an individual
in a target population if there is a limited supply of aspirin pills, resulting in inter-unit interference in
the target population, or if the target population uses drugs from a different pharmaceutical company,
with different excipients.
Outcome A good specification of the outcome to be evaluated is key to framing a causal question. A
first aspect is to enable apple-to-apple comparison across studies. Imagine that aspirin only delays cancer
recurrence: we might find a beneficial effect of aspirin if the outcome is recurrence at five years, but no
effect if the outcome is recurrence at twenty years. However, another fundamental aspect is the validity of
the outcome from a health standpoint. Indeed, considering only recurrence leaves out patients who have
died of another cause –a patient who dies can no longer recur. We say that death is a competing event for
cancer recurrence (Young et al., 2020). Suppose that aspirin has no specific effect on cancer recurrence
but increases the risk of hemorrhagic death. Without accounting for the competing event, aspirin might
appear beneficial for cancer recurrence because aspirin-induced hemorrhagic death precludes recurrence.
Dedicated statistical methods can handle competing events (Young et al., 2020; Stensrud et al., 2022),
in which case these must be explicitly stated for the causal estimand to be unambiguous. An alternative
but less precise solution is to consider a composite outcome (Wolbers et al., 2014).
Contrast The ATE, CATE, and all the formulas above are risk differences, based on Y (1) − Y (0). How-
ever, there are many other possible contrasts, such as the risk ratio (Y(1)/Y(0)), the odds ratio (for binary outcomes),
or the hazard ratio (Colnet et al., 2023). An effect might appear highly beneficial on a multiplicative scale
but negligible on an additive scale if the baseline risk is low (Forrow et al., 1992; Dj et al., 1993); thus
statistical guidelines recommend reporting results on both absolute (e.g., risk difference) and relative
(e.g., risk ratio or odds ratio) scales (Schulz et al., 2010; Cuschieri, 2019). In the medical literature,
odds ratios are frequent for binary outcomes, and hazard ratios for time-to-event outcomes (Holmberg
and Andersen, 2020). However, both of these measures are problematic for causal inference because
they are not collapsible: the subgroup causal effects (CATEs) cannot be aggregated into a population
causal effect (ATE) (Didelez and Stensrud, 2022; Colnet et al., 2023). In addition, causal interpretation
of hazard ratios is challenging because they have a built-in selection bias (Hernán, 2010; Stensrud and
Hernán, 2020).
Narrow definitions of the causal estimand –a well-defined population, a homogeneous intervention– help
ensure a valid causal estimation; this is known as internal validity. However, it comes at the expense
of external validity: the estimated effect is applicable to fewer situations (Pearce and Vandenbroucke,
2023). Personalized decisions require modeling heterogeneous settings and including important factors of
variability as covariates in the CATE.
4.2 Beyond simple confounding: biases can arise from the study design
Even with a precise definition of the target causal estimand and unlimited data, poor study design or data
artifacts can introduce systematic biases in the estimation that persist despite the absence of unmeasured
confounding or model misspecification (Acton et al., 2023). These biases can take many forms (Spencer
et al., 2023), and their detection requires a thorough understanding of the data collection processes and
expert knowledge of the underlying causal structure. In the following, we detail a few examples: time
alignment failure, measurement bias, and informative losses to follow-up. But many other sources of bias
exist (Spencer et al., 2023; Jager et al., 2020; Berrington de González et al., 2024).
Time alignment failures In clinical trials, the start of follow-up (or baseline time), the time of eligibility
assessment, and the time of treatment assignment are synchronized. In observational studies, however,
these three time points are not naturally defined and should be specified while designing the study. Failure
to properly align baseline, eligibility, and intervention assignment times can lead to time alignment failure
biases (Hernán et al., 2016).
Consider an observational study of aspirin versus control. Suppose the start of follow-up is set at the
time of the first breast cancer diagnosis and coincides with the assessment of eligibility criteria. Without
random assignment, one must assign patients to the treatment and control group based on their observed
patterns of aspirin use. A natural approach would be to assign patients to the treatment arm if they
had taken aspirin before breast cancer diagnosis. In this case, treatment allocation occurs before baseline
and eligibility screening. This temporal misalignment can introduce prevalent-user bias (Danaei et al.,
2012). Indeed, if aspirin is effective in preventing breast cancer progression, former aspirin users may
be less likely to develop breast cancer initially and to be included in the analysis. This might bias the
study population: patients who are diagnosed with breast cancer despite taking aspirin may represent
non-responder individuals. In contrast, individuals in the non-aspirin group would be a mix of resistant
and non-resistant patients. Overall, this would lead to underestimating the true effect of aspirin. An
alternative approach could be to define the treatment group as patients who start taking aspirin after
baseline. In this case, treatment allocation occurs after baseline and eligibility screening. This temporal
misalignment can introduce immortal-time bias (Suissa, 2008): to be assigned to the aspirin group,
patients must survive long enough to start treatment. During this “immortal” period before aspirin is
started, outcome events are attributed solely to the control group, which can falsely suggest that aspirin
is more beneficial than it actually is. Other variations of time alignment failures exist (Hernán et al.,
2016).
As accounting for time is crucial, the “PICO” framework can be extended to “PICOT”, where T stands
for “time”, specifying the duration of the intervention and other temporal patterns of the study (Riva
et al., 2012). To avoid defining treatment allocation using post-baseline information, and pitfalls such
as immortal-time bias, one solution is the cloning, censoring, and weighting approach, within the target
trial emulation framework (Hernán et al., 2016). Briefly, it consists of (i) cloning each patient at baseline
and assigning one clone to each of the considered interventions, (ii) censoring the clones the first time the
patient’s behavior is no longer consistent with the assigned intervention, and (iii) weighting the clones to
account for the selection bias due to informative censoring induced in the latter step (Maringe et al., 2020;
Matthews et al., 2022; Gaber et al., 2024; Huitfeldt et al., 2015). The estimation approaches discussed
in section 3 must then use survival models (Ishwaran et al., 2008; Van Belle et al., 2011; Wiegrebe et al.,
2024; Alberge et al., 2024).
Measurement bias Measurement bias refers to any bias that results from the process of collecting and
preparing the study variables (Hernán and Robins, 2020). We focus on bias arising from mismeasurement
of the treatment or outcome, but in principle, bias can also arise from mismeasurement of other variables,
including confounding variables. Measurement bias encompasses various pitfalls in data collection, which
have been given multiple names, including information bias, recall bias, reverse causation bias, protopathic
bias, Berkson bias, interviewer bias, observer bias, and others (Young et al., 2018; Berrington de González
et al., 2024; Jager et al., 2020).
Within electronic health records or health claims, a medical condition is generally considered as non-
existent in the absence of related care. This could lead to false negative cases and induce measurement
bias (Lanes et al., 2015). For example, in healthcare prescription claims, a patient who receives over-the-
counter aspirin will be considered an aspirin non-user, and a patient who refuses treatment for cancer
recurrence will be considered to have no recurrence. Measurement error is particularly problematic when
the measurement error of the treatment (or outcome) is related to the true value of the outcome (or
the treatment), in which case it is said to be differential. For example, in retrospective data collected
through questionnaires, patients experiencing breast cancer recurrence may be more likely to remember
and report aspirin use because they believe it may be related to their condition: this is an example
of differential measurement error for the intervention, often referred to as recall bias (Prince, 2012).
Another common form of differential measurement error for the intervention, known as reverse causation
or protopathic bias (Ri and Ar, 1980; Faillie, 2015), occurs when the intervention is given in response to
the first symptoms of the outcome before it is diagnosed or recorded in the dataset. For example, the
onset of symptoms of metastatic cancer recurrence could lead patients to take aspirin for bone pain relief
before the cancer recurrence is detected. Alternatively, differential measurement error for the outcome
can occur if aspirin users tend to visit their primary care physician more often, increasing the likelihood
that the cancer recurrence is detected.
The use of proxies for the intervention can also undermine the plausibility of the SUTVA assumption even
when the interventions of interest are well-defined (Hernán, 2016). For example, in pharmacy claims,
one can only record that the patient received a box of aspirin, not that the patient actually ingested the
pills. In this sense, the intervention measured in the dataset is ambiguous: it includes taking any number
of pills up to the number of pills in the box. Here, a clinical study would measure intention-to-treat
effects (effects of being assigned to one medication arm versus the other, whether or not the patient
complies) (Ranganathan et al., 2016). But claims data typically measure buying the treatment, rather than
its prescription, and thus capture something closer to per-protocol effects (effects of actually taking the
assigned intervention).
Measurement bias is both difficult to detect and difficult to manage. Methods to either identify or mitigate
measurement bias include, but are not limited to, improving the data collection process, discussing
the direction and magnitude of bias from expert knowledge and prior publications, parametrizing a
measurement error model, or performing sensitivity analyses by varying the study design (e.g., using lag
periods to reduce the impact of protopathic bias) (Young et al., 2018; Arfè and Corrao, 2016; Schennach,
2016).
Loss to follow-up Loss to follow-up occurs when participants in a study drop out or become unavailable
for further data collection before the study is completed, resulting in missing data for the outcome. This
type of incomplete observation, when the outcome is the time to occurrence of a specific event, can be
handled by survival analysis approaches. When informative, loss to follow-up can introduce selection
bias, even in randomized trials (Akl et al., 2012; Howe et al., 2016). Selection bias due to informative
loss to follow-up occurs when the censoring status variable acts as a collider on the pathway between
the intervention and the outcome, or as the descendant of such a collider, thereby introducing collider
bias (Hernán and Robins, 2020). For example, consider a randomized controlled trial of aspirin versus
placebo. Patients in the aspirin arm may experience side effects, leading them to withdraw from the
trial at a higher rate than those in the placebo arm. Hence, patients in the aspirin group will become
healthier over time than those in the placebo group, and the two arms will no longer be exchangeable,
undermining the benefit of the initial randomization.
In most cases, we are interested in estimating the causal effect if no one in the study population were
lost to follow-up; that is, the joint effect of aspirin versus placebo, under a second intervention that
would prevent censoring in both groups. Identification of this causal estimand involves adjustment for
informative censoring, typically with weighting methods. This requires new specific causal assumptions
to hold (Young et al., 2020).
Machine learning can model complex data, such as images or text. In a causal framework, it opens
the door to exploiting new data sources for evidence-based medical decision-making. Yet, as we have
seen, ensuring causal validity from observational data is challenging. The multiple steps, summarized
in Figure 4, all hide pitfalls. Study design and choice of covariates require a good mastery of the data and
associated modeling hypotheses. The choice of machine-learning method also matters (Bouvier et al.,
2024; Doutreligne et al., 2023) and may not be obvious: The best predictive model is not necessarily
[Figure 4: overview of a causal analysis – study design (Population, Intervention (A), Control, Outcome (Y), Time; target estimand, eg the CATE τ(x) = E[Y(1) − Y(0) | X = x]), identification (prior knowledge, untestable causal assumptions, unmeasured confounding bias), and machine-learning estimation (meta-learners such as S-, T-, X-, and R-learners combined with linear, gradient-boosted-tree, or neural-network base models), while looking for systematic biases (time alignment failure, loss to follow-up, selection bias).]
the causal model, and the best causal model is not necessarily the best predictive one (Doutreligne
and Varoquaux, 2023). No machine-learning method can “solve” causal validity with uncontrolled input
data, from which it could for instance learn non-causal “shortcuts” (Geirhos et al., 2020). Beyond
confounding bias or well-specified estimators, study design –framing well a causal question– is crucial. For
these reasons, RCTs are typically considered as more solid evidence for causal effects than observational
studies (Murad et al., 2016). It can be useful to validate machine-learning causal effects obtained from
an observational study by extracting an average effect to compare to an existing RCT (Doutreligne et al.,
2023). And yet, for routine decision-making, RCTs are also imperfect (Rothwell, 2005; Deaton and
Cartwright, 2018). Their target causal estimand may differ from that of interest: different populations
and different interventions erode external validity. Individualizing decisions requires estimating a detailed
conditional effect, which requires larger sample sizes than the typical RCT and can benefit from observing
a wide diversity of settings. Finally, decision-making must build on the information available in routine
practice, which often differs from that in a clinical study.
There is no magic bullet to building individualized decisions from data; it requires crossing information
from RCTs, from observational data at hand, with expert knowledge and machine-learning models.
Acknowledgments We thank Stefan Wager for insightful discussions, and Sarah Abécassis for graphics
work on figure 4. JA, JA, and GV acknowledge funding for the project INTERCEPT-T2D by the
European Union under the Horizon Europe Programme (Grant Agreement No 101095433), and the
PEPR SN SMATCH France 2030 ANR-22-PESN-0003
References
E. K. Acton, A. W. Willis, and S. Hennessy. Core concepts in pharmacoepidemiology: Key biases arising
in pharmacoepidemiologic studies. Pharmacoepidemiology and drug safety, 32(1):9–18, Jan. 2023. ISSN
1053-8569. . URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10204604/.
E. A. Akl et al. Potential impact on estimated treatment effects of information lost to follow-up in randomised controlled trials (LOST-IT):
systematic review. BMJ, 344:e2809, May 2012. ISSN 1756-1833. URL https://www.bmj.com/content/344/bmj.e2809.
J. Alberge, V. Maladière, O. Grisel, J. Abécassis, and G. Varoquaux. Survival models: Proper scoring
rule and stochastic optimization with competing risks. arXiv preprint arXiv:2410.16765, 2024.
A. Arfè and G. Corrao. The lag-time approach improved drug-outcome association estimates in presence
of protopathic bias. Journal of Clinical Epidemiology, 78:101–107, Oct. 2016. ISSN 1878-5921. .
H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models.
Biometrics, 61(4):962–973, 2005.
J. Berkson. Limitations of the application of fourfold table analysis to hospital data. Biometrics Bulletin,
2(3):47–53, 1946.
T. Blakely, J. Lynch, K. Simons, R. Bentley, and S. Rose. Reflection on modern methods: when worlds
collide—prediction, machine learning and causal inference. International journal of epidemiology, 49
(6):2058–2064, 2020.
J. E. Brand, X. Zhou, and Y. Xie. Recent developments in causal inference and machine learning. Annual
Review of Sociology, 49(1):81–110, 2023.
W. Y. Chen and M. D. Holmes. Role of Aspirin in Breast Cancer Survival. Current Oncology Reports,
19(7):48, June 2017. ISSN 1534-6269. . URL https://doi.org/10.1007/s11912-017-0605-6.
C. Cinelli, A. Forney, and J. Pearl. A crash course in good and bad controls. Sociological Methods &
Research, 53(3):1071–1104, 2024.
B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Risk ratio, odds ratio, risk difference... which causal
measure is easier to generalize? arXiv preprint arXiv:2303.16008, 2023.
W. H. Crown. Real-world evidence, causal inference, and machine learning. Value in Health, 22(5):
587–592, 2019.
R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Dealing with limited overlap in estimation
of average treatment effects. Biometrika, 96(1):187–199, 2009.
A. Curth and M. Van der Schaar. On inductive biases for heterogeneous treatment effect estimation.
Advances in Neural Information Processing Systems, 34:15883–15894, 2021.
A. Curth, R. W. Peck, E. McKinney, J. Weatherall, and M. van Der Schaar. Using machine learning
to individualize treatment effect estimation: Challenges and opportunities. Clinical Pharmacology &
Therapeutics, 115(4):710–719, 2024.
S. Cuschieri. The STROBE guidelines. Saudi Journal of Anaesthesia, 13(Suppl 1):S31–S34, Apr. 2019.
ISSN 1658-354X. . URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6398292/.
G. Danaei, M. Tavakkoli, and M. A. Hernán. Bias in observational studies of prevalent users: lessons for
comparative effectiveness research from a meta-analysis of statins. American Journal of Epidemiology,
175(4):250–262, Feb. 2012. ISSN 1476-6256. .
A. Deaton and N. Cartwright. Understanding and misunderstanding randomized controlled trials. Social
science & medicine, 210:2–21, 2018.
I. Díaz, H. Lee, E. Kıcıman, E. J. Schenck, M. Akacha, D. Follman, and D. Ghosh. Sensitivity analysis for
causality in observational studies for regulatory science. Journal of Clinical and Translational Science,
7(1):e267, 2023.
V. Didelez and M. J. Stensrud. On the logic of collapsibility for causal effect measures. Biometrical
Journal. Biometrische Zeitschrift, 64(2):235–242, Feb. 2022. ISSN 1521-4036. .
D. J. Malenka, J. A. Baron, S. Johansen, J. W. Wahrenberger, and J. M. Ross. The framing effect of relative and absolute risk. Journal of General Internal Medicine, 8(10), Oct. 1993. ISSN 0884-8734. URL https://pubmed.ncbi.nlm.nih.gov/8271086/.
J. Dockès, G. Varoquaux, and J.-B. Poline. Preventing dataset shift from breaking machine-learning
biomarkers. GigaScience, 10(9):giab055, 2021.
M. Doutreligne and G. Varoquaux. How to select predictive models for causal inference? arXiv preprint
arXiv:2302.00370, 2023.
A. D’Amour, P. Ding, A. Feller, L. Lei, and J. Sekhon. Overlap in observational studies with high-
dimensional covariates. Journal of Econometrics, 221(2):644–654, 2021.
J.-L. Faillie. Indication bias or protopathic bias? British Journal of Clinical Pharmacology, 80(4):779–780,
Oct. 2015. ISSN 0306-5251. . URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594717/.
M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter. Auto-sklearn 2.0: Hands-free automl
via meta-learning. Journal of Machine Learning Research, 23(261):1–61, 2022.
L. Forrow, W. C. Taylor, and R. M. Arnold. Absolutely relative: how research results are summarized
can affect treatment decisions. The American Journal of Medicine, 92(2):121–124, Feb. 1992. ISSN
0002-9343. .
D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908,
2023.
K. A. Frank, Q. Lin, R. Xu, S. Maroulis, and A. Mueller. Quantifying the robustness of causal inferences:
Sensitivity analysis for pragmatic social science. Social Science Research, 110:102815, 2023.
T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the
American statistical Association, 102(477):359–378, 2007.
S. Greenland and B. Brumback. An overview of relations among causal modelling methods. International
journal of epidemiology, 31(5):1030–1037, 2002.
L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning
on typical tabular data? Advances in neural information processing systems, 35:507–520, 2022.
J. Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment
effects. Econometrica, pages 315–331, 1998.
P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference:
Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):
965–1056, 2020.
M. A. Hernán. The Hazards of Hazard Ratios. Epidemiology (Cambridge, Mass.), 21(1):13–15, Jan. 2010.
ISSN 1044-3983. . URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653612/.
M. A. Hernán. Does water kill? A call for less casual causal inferences. Annals of Epidemiology, 26(10):
674–680, Oct. 2016. ISSN 1047-2797. . URL https://www.sciencedirect.com/science/article/
pii/S1047279716302800.
M. A. Hernán and J. M. Robins. Causal Inference: What If. Boca raton: Chapman & hall/crc. edition,
2020.
M. A. Hernán and S. L. Taubman. Does obesity shorten life? The importance of well-defined interventions
to answer causal questions. International Journal of Obesity, 32(3):S8–S14, Aug. 2008. ISSN 1476-5497. URL https://www.nature.com/articles/ijo200882.
M. A. Hernán, B. C. Sauer, S. Hernández-Díaz, R. Platt, and I. Shrier. Specifying a target trial pre-
vents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical
Epidemiology, 79:70–75, Nov. 2016. ISSN 1878-5921. .
M. A. Hernán, J. Hsu, and B. Healy. A Second Chance to Get Causal Inference Right: A
Classification of Data Science Tasks. CHANCE, 32(1):42–49, Jan. 2019. ISSN 0933-2480. URL https://doi.org/10.1080/09332480.2019.1579578.
M. A. Hernán, W. Wang, and D. E. Leaf. Target Trial Emulation: A Framework for Causal Inference
From Observational Data. JAMA, 328(24):2446–2447, Dec. 2022. ISSN 1538-3598. .
P. W. Holland. Statistics and causal inference. Journal of the American statistical Association, 81(396):
945–960, 1986.
M. J. Holmberg and L. W. Andersen. Estimating Risk Ratios and Risk Differences: Alternatives to Odds
Ratios. JAMA, 324(11):1098–1099, Sept. 2020. ISSN 0098-7484. . URL https://doi.org/10.1001/
jama.2020.12698.
C. J. Howe, S. R. Cole, B. Lau, S. Napravnik, and J. J. Eron. Selection bias due to loss to follow up in
cohort studies. Epidemiology (Cambridge, Mass.), 27(1):91–97, Jan. 2016. ISSN 1044-3983. . URL
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5008911/.
M. Huber. An introduction to causal discovery. Swiss Journal of Economics and Statistics, 160(1):14,
Oct. 2024. ISSN 2235-6282. . URL https://doi.org/10.1186/s41937-024-00131-4.
A. Huitfeldt, M. Kalager, J. M. Robins, G. Hoff, and M. A. Hernán. Methods to Estimate the Comparative
Effectiveness of Clinical Strategies that Administer the Same Intervention at Different Times. Current
epidemiology reports, 2(3):149–161, Sept. 2015. ISSN 2196-2995. . URL https://www.ncbi.nlm.nih.
gov/pmc/articles/PMC4646164/.
G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge
university press, 2015.
K. J. Jager, G. Tripepi, N. C. Chesnaye, F. W. Dekker, C. Zoccali, and V. S. Stel. Where to look for the
most frequent biases? Nephrology (Carlton, Vic.), 25(6):435–441, June 2020. ISSN 1320-5358. . URL
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7318122/.
S. A. Julious and M. A. Mullee. Confounding and simpson’s paradox. Bmj, 309(6967):1480–1481, 1994.
B. C. Kahan, J. Hindley, M. Edwards, S. Cro, and T. P. Morris. The estimands framework: a primer on
the ICH E9(R1) addendum. BMJ, 384:e076316, Jan. 2024. ISSN 1756-1833. . URL https://www.bmj.
E. H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic
Journal of Statistics, 17(2):3008–3049, 2023.
S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment
effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165,
2019.
F. Li, L. E. Thomas, and F. Li. Addressing extreme propensity scores via the overlap weights. American
journal of epidemiology, 188(1):250–257, 2019.
F. Liu. Data science methods for real-world evidence generation in real-world data. Annual Review of
Biomedical Data Science, 7, 2024.
C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with
deep latent-variable models. Advances in neural information processing systems, 30, 2017.
D. P. MacKinnon and S. J. Lamp. A unification of mediator, confounder, and collider effects. Prevention
Science, 22(8):1185–1193, 2021.
S. Malay and K. C. Chung. The Choice of Controls for Providing Validity and Evidence in Clinical
Research. Plastic and reconstructive surgery, 130(4):959–965, Oct. 2012. ISSN 0032-1052. . URL
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3461178/.
C. Maringe, S. Benitez Majano, A. Exarchakou, M. Smith, B. Rachet, A. Belot, and C. Leyrat. Reflection
on modern methods: trial emulation in the presence of immortal-time bias. Assessing the benefit of
major surgery for elderly lung cancer patients using observational data. International Journal of
Epidemiology, 49(5):1719–1729, Oct. 2020. ISSN 0300-5771. . URL https://doi.org/10.1093/ije/
dyaa057.
A. A. Matthews, G. Danaei, N. Islam, and T. Kurth. Target trial emulation: applying principles of
randomised trials to observational studies. BMJ, 378:e071108, Aug. 2022. ISSN 1756-1833. URL https://www.bmj.com/content/378/bmj-2022-071108.
M. H. Murad, N. Asi, M. Alsawas, and F. Alahdab. New evidence pyramid. BMJ Evidence-Based
Medicine, 21(4):125–127, 2016.
A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings
of the 22nd international conference on Machine learning, pages 625–632, 2005.
X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):
299–319, 2021.
C. J. Patel, B. Burford, and J. P. Ioannidis. Assessment of vibration of effects due to model specification
can demonstrate the instability of observational associations. Journal of clinical epidemiology, 68(9):
1046–1058, 2015.
N. Pearce and J. P. Vandenbroucke. Are target trial emulations the gold standard for observational
studies? Epidemiology, 34(5):614–618, 2023.
A. Perez-Lebel, M. Le Morvan, and G. Varoquaux. Beyond calibration: estimating the grouping loss of
modern neural networks. ICLR, 2022.
J. Peters, D. Janzing, and B. Schölkopf. Elements of causal inference: foundations and learning algo-
rithms. The MIT Press, 2017.
M. Prince. 9 - Epidemiology. In P. Wright, J. Stern, and M. Phelan, editors, Core Psychiatry (Third
Edition), pages 115–129. W.B. Saunders, Oxford, Jan. 2012. ISBN 978-0-7020-3397-1. . URL https:
//www.sciencedirect.com/science/article/pii/B9780702033971000094.
M. Prosperi, Y. Guo, M. Sperrin, J. S. Koopman, J. S. Min, X. He, S. Rich, M. Wang, I. E. Buchan, and
J. Bian. Causal inference and counterfactual prediction in machine learning for actionable healthcare.
Nature Machine Intelligence, 2(7):369–375, 2020.
R. I. Horwitz and A. R. Feinstein. The problem of "protopathic bias" in case-control studies. The American Journal of Medicine, 68(2), Feb. 1980. ISSN 0002-9343. URL https://pubmed.ncbi.nlm.nih.gov/7355896/.
J. J. Riva, K. M. Malik, S. J. Burnie, A. R. Endicott, and J. W. Busse. What is your research question? An
introduction to the PICOT format for clinicians. The Journal of the Canadian Chiropractic Association,
56(3):167–171, Sept. 2012. ISSN 0008-3194. URL https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3430448/.
J. M. Robins and A. Rotnitzky. Semiparametric efficiency in multivariate regression models with missing
data. Journal of the American Statistical Association, 90(429):122–129, 1995.
J. M. Robins, A. Rotnitzky, and D. O. Scharfstein. Sensitivity analysis for selection bias and unmeasured
confounding in missing data and causal inference models. In Statistical models in epidemiology, the
environment, and clinical trials, pages 1–94. Springer, 2000.
P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for
causal effects. Biometrika, 70(1):41–55, 1983.
P. M. Rothwell. External validity of randomised controlled trials:“to whom do the results of this trial
apply?”. The Lancet, 365(9453):82–93, 2005.
D. O. Scharfstein, R. Nabi, E. H. Kennedy, M.-Y. Huang, M. Bonvini, and M. Smid. Semiparametric sen-
sitivity analysis: Unmeasured confounding in observational studies. arXiv preprint arXiv:2104.08300,
2021.
S. M. Schennach. Recent Advances in the Measurement Error Literature. Annual Review of Eco-
nomics, 8:341–377, Oct. 2016. ISSN 1941-1383. URL https://www.annualreviews.org/content/journals/10.1146/annurev-economics-080315-015058.
K. F. Schulz, D. G. Altman, D. Moher, and the CONSORT Group. CONSORT 2010 Statement: updated
guidelines for reporting parallel group randomised trials. BMC Medicine, 8(1):18, Mar. 2010. ISSN
1741-7015. . URL https://doi.org/10.1186/1741-7015-8-18.
E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical
Society: Series B (Methodological), 13(2):238–241, 1951.
M. J. Stensrud and M. A. Hernán. Why Test for Proportional Hazards? JAMA, 323(14):1401–1402,
Apr. 2020. ISSN 1538-3598. .
J. Textor, B. Van der Zander, M. S. Gilthorpe, M. Liśkiewicz, and G. T. Ellison. Robust causal inference
using directed acyclic graphs: the r package ‘dagitty’. International journal of epidemiology, 45(6):
1887–1894, 2016.
V. Van Belle, K. Pelckmans, S. Van Huffel, and J. A. Suykens. Support vector methods for survival
analysis: a comparison between ranking and regression approaches. Artificial intelligence in medicine,
53(2):107–118, 2011.
M. J. van der Laan and A. R. Luedtke. Targeted learning of an optimal dynamic treatment, and statistical
inference for its mean outcome. 2014.
M. J. Van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical applications in genetics
and molecular biology, 6(1), 2007.
T. J. VanderWeele and P. Ding. Sensitivity analysis in observational research: introducing the e-value.
Annals of internal medicine, 167(4):268–274, 2017.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. Machine
learning for brain disorders, pages 601–630, 2023.
V. Veitch and A. Zaveri. Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved
confounding. Advances in neural information processing systems, 33:10999–11009, 2020.
V. Veitch, D. Sridhar, and D. Blei. Adapting text embeddings for causal inference. In Conference on
Uncertainty in Artificial Intelligence, pages 919–928. PMLR, 2020.
S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests.
Journal of the American Statistical Association, 113(523):1228–1242, 2018.
S. Wiegrebe, P. Kopper, R. Sonabend, B. Bischl, and A. Bender. Deep learning for survival analysis: a
review. Artificial Intelligence Review, 57(3):65, 2024.