Invited Commentary: Machine Learning in Causal Inference—How Do I Love Thee? Let Me Count the Ways
© The Author(s) 2021. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: [email protected].
DOI: https://doi.org/10.1093/aje/kwab048
Advance Access publication: March 6, 2021
Invited Commentary
Initially submitted November 30, 2020; accepted for publication February 4, 2021.
In this issue of the Journal, Mooney et al. (Am J Epidemiol. 2021;190(8):1476–1482) discuss machine learning
as a tool for causal research in the style of Internet headlines. Here we comment by adapting famous literary
quotations, including the one in our title (from “Sonnet 43” by Elizabeth Barrett Browning (Sonnets From the
Portuguese, Adelaide Hanscom Leeson, 1850)). We emphasize that any use of machine learning to answer
causal questions must be founded on a formal framework for both causal and statistical inference. We illustrate
the pitfalls that can occur without such a foundation. We conclude with some practical recommendations for
integrating machine learning into causal analyses in a principled way and highlight important areas of ongoing
work.
causal inference; causal models; cross-validation; double robustness; machine learning; sample-splitting; Super Learner
Editor’s note: The opinions expressed in this article are those of the authors and do not necessarily reflect the views of the American Journal of Epidemiology.

To [ML] or not to [ML]—that is [not] the question
—William Shakespeare (1)

Machine learning (ML) has become ubiquitous in public health and epidemiologic research (2, 3). Supervised learning algorithms, which estimate the expected value of an observed variable given a set of other measured variables, are commonly used to improve predictions (4–6). In epidemiology, however, we often ask causal questions—questions about what outcomes would look like under alternative hypothetical conditions (e.g., a change in how a treatment was assigned or an exposure distributed) (7). As Mooney et al. (8) discuss in their accompanying article, supervised ML also offers the promise of better answers to these causal questions.

The need for ML is clear, particularly in modern data ecosystems where we often face dozens of, if not more, confounding variables. Stratification-based approaches are typically ill-defined because of empty or sparse cells; we rarely have the knowledge to specify a correct parametric regression a priori, and data-snooping (fitting a series of estimators and selecting the “best” in an ad hoc manner) or P-hacking (conscious or not) undermines the foundations of statistical inference. In place of these unsatisfactory alternatives, ML offers a principled and prespecified way to flexibly learn from the data.

While ML is often an essential ingredient for causal inference, even the best ML algorithm may yield wildly misleading answers to causal questions if the rest of the recipe is ignored. We cannot simply accessorize our ML-based predictions with causal assumptions (e.g., no unmeasured confounding) or statistical concepts (e.g., a bootstrap) after the fact. Instead, ML algorithms must be carefully integrated within a formal framework for causal and statistical inference.

When I’d heard the learn’d [epidemiologist]
—Walt Whitman (9)

Researchers are sometimes worried that ML will supplant human expertise and experience in causal inference research
Am J Epidemiol. 2021;190(8):1483–1487
Machine Learning and the Causal Roadmap 1485
Table 1. Results From a Simulation Study Illustrating the Consequences of Neglecting the Causal Model^a

a … under the exposure E(Y1) and under no exposure E(Y0), and equal to 14.3% in this simulation (38). Over 1,000 repetitions of the data-generating process, which is compatible with Figure 1, we show the mean point estimate, bias (average deviation between point estimate and true effect), and coverage (proportion of times the calculated 95% confidence interval contained the true effect) for the following estimators: unadjusted; G-computation naively implemented (regressing the outcome on the measured past); inverse weighting naively implemented (regressing the exposure on the measured past); G-computation informed by the Causal Roadmap (regressing the outcome on exposure and confounders); and inverse weighting informed by the Roadmap (regressing the exposure on the confounders).
b (W1, W2, W3) are confounders, I is an instrumental variable, M is a preexposure variable but not a confounder,
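As a minimal sketch (not the authors’ actual simulation code), the two Roadmap-informed estimators described in the table legend can be written out as follows. Everything here is invented for illustration: the data-generating mechanism, its coefficients, the continuous outcome, and the true effect of 1.0 are assumptions, not the binary-outcome setup behind Table 1.

```python
import numpy as np

def expit(x):
    """Inverse-logit link."""
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, y, iters=20):
    """Plain Newton-Raphson logistic regression (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return beta

# Hypothetical data-generating process: three confounders W, an instrument I_
# (affects the exposure A but not the outcome Y), and a true effect of 1.0.
rng = np.random.default_rng(0)
n = 20_000
W = rng.normal(size=(n, 3))
I_ = rng.normal(size=n)
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.4 * W[:, 1] + 0.3 * W[:, 2] + 0.5 * I_))
Y = 1.0 * A + 0.8 * W[:, 0] + 0.8 * W[:, 1] - 0.5 * W[:, 2] + rng.normal(size=n)

# G-computation: regress the outcome on exposure and confounders, then
# average the predicted outcomes setting A=1 versus A=0 for everyone.
X = np.column_stack([np.ones(n), A, W])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
X1, X0 = X.copy(), X.copy()
X1[:, 1], X0[:, 1] = 1.0, 0.0
ate_gcomp = np.mean(X1 @ beta - X0 @ beta)

# Inverse weighting: regress the exposure on the confounders only
# (the instrument I_ is deliberately excluded), then reweight the outcomes.
Xp = np.column_stack([np.ones(n), W])
ps = np.clip(expit(Xp @ fit_logistic(Xp, A)), 1e-6, 1 - 1e-6)
ate_ipw = np.mean(A * Y / ps) - np.mean((1 - A) * Y / (1 - ps))

print(round(ate_gcomp, 2), round(ate_ipw, 2))  # both should land near the true effect of 1.0
```

Both estimators here exclude the instrument, mirroring the Roadmap-informed versions in the table legend; the naive versions would instead condition on the full measured past, including I and M.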
provide valid inferences when faced with realistic statistical models: models that accurately reflect our limited knowledge.

What’s in a [statistical model]?
—William Shakespeare (23)

Mooney et al. refer to errors stemming from reliance on parametric assumptions as “statistical model misspecification” (8). As before, a more apt term might be “statistical model neglect”: failure to respect our statistical knowledge during the estimation process. Such errors can be avoided by ensuring that the statistical model, formally defined as the set of all possible distributions of the observed data, only represents real knowledge—not assumptions made for convenience at the estimation stage (24). Step 4 of the Causal Roadmap guarantees that our knowledge of the data-generating process (as opposed to wished-for simplifications) is carried through to the statistical model. For example, we often assume that the observed data are generated by sampling N times from a data-generating process compatible with the causal model (24). Under this assumption, the causal model implies the statistical model, characterizing the set of possible distributions of the observed data. In practice, few or no restrictions are placed on this set, yielding a semiparametric or nonparametric statistical model, respectively. For example, the causal model in Figure 1 does not encode parametric knowledge about the functional forms of the relationships between variables. Focusing on the propensity score, for example, the causal model encodes our limited knowledge that the exposure A is some unknown function of the confounders W, the instrument I, and unmeasured factors UA. The causal model does not state, for example, that the conditional probability of the exposure A is accurately described by a main-terms logistic function of the confounders W and instrument I. Instead, such a main-terms function is just one of many possible ways that the exposure A could be generated from (W, I, UA). During statistical estimation (Roadmap step 6), using a statistical model that reflects this uncertainty provides the foundation for accurate inferences.

Not all those who wander are lost
—J.R.R. Tolkien (25)

Here is where the power and necessity of ML-based approaches become clear. Respecting the limits of our knowledge forces us to confront very large statistical models—for example, those without functional-form restrictions on the conditional probability of exposure given confounders or on the expected outcome given exposure and confounders. In doing so, we are empowered to dismiss George Box’s quotation, “All models are wrong” (26, p. 792). Instead, we can joyfully proclaim, “My statistical model correctly describes reality.” However, in doing so, we also face new challenges for statistical estimation and inference. In particular, we are forced to leave behind the familiar comforts of parametric regressions and strike out on a journey through the vast space of distributions contained in our statistical model. Respect for our statistical model means that we can, and indeed must, explore a wide range of relationships between exposure and confounders as well as relationships between the outcome, exposure, and confounders.

Supervised ML provides the means to conduct this exploration in a powerful, principled, and fully prespecified
1486 Balzer and Petersen
manner. Ensemble methods, such as Super Learner (27, 28), are particularly promising, because they use K-fold cross-validation (i.e., sample-splitting) to build the optimal weighted combination of predictions from a set of candidate algorithms. Importantly, background knowledge can again play a key role through the inclusion of expert-guided interaction terms or other features in the predictor set and parametric regressions in the algorithm set.

Do you know anything on earth which has not a dangerous side if it is mishandled and exaggerated?
—Sir Arthur Conan Doyle (29)

… algorithms considered. We recommend incorporating expert knowledge and including simple parametric regressions, together with more flexible approaches. With hierarchical or repeated-measures data, it is essential to sample-split on the independent unit (e.g., the individual in longitudinal settings). Finally, if the dependent variable is rare, we recommend stratifying on the outcome before sample-splitting to maintain roughly the same prevalence in each split. Of course, these recommendations are just that—recommendations. Implementation of any ML algorithm with real data always raises complex challenges, and careful examination of the default settings of any statistical computing package (as well as, ideally, performance evaluation using simulation) is
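To make the cross-validation machinery behind such ensembles concrete, here is a minimal sketch. Everything in it (the data, the three candidate learners, the fold scheme) is invented for illustration, and it implements only the simpler “discrete” Super Learner, which selects the single best candidate by cross-validated risk; the full Super Learner instead fits an optimal convex weighting of the candidates’ cross-validated predictions (typically via non-negative least squares).

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 600, 5
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.3 * rng.normal(size=n)

# Candidate learners: each takes training data and returns a prediction function.
def learn_mean(Xtr, ytr):
    m = ytr.mean()
    return lambda Xte: np.full(len(Xte), m)

def learn_ols(Xtr, ytr):
    b, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(Xtr)), Xtr]), ytr, rcond=None)
    return lambda Xte: np.column_stack([np.ones(len(Xte)), Xte]) @ b

def learn_ols_squares(Xtr, ytr):
    b, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(Xtr)), Xtr, Xtr**2]), ytr, rcond=None)
    return lambda Xte: np.column_stack([np.ones(len(Xte)), Xte, Xte**2]) @ b

candidates = [learn_mean, learn_ols, learn_ols_squares]

# K-fold cross-validation: every observation is predicted by a fit that never saw it.
# (Per the recommendations above: with repeated-measures data, split on the
# independent unit instead; with a rare binary outcome, stratify folds on the outcome.)
folds = np.array_split(rng.permutation(n), K)
cv_pred = np.zeros((n, len(candidates)))
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for j, learner in enumerate(candidates):
        cv_pred[test_idx, j] = learner(X[train_idx], y[train_idx])(X[test_idx])

cv_risk = ((cv_pred - y[:, None]) ** 2).mean(axis=0)  # cross-validated MSE per candidate
best = int(np.argmin(cv_risk))      # discrete Super Learner: select the best candidate...
final_fit = candidates[best](X, y)  # ...and refit it on the full data
print(np.round(cv_risk, 3), best)
```

Because the candidate set mixes a simple mean, a main-terms regression, and a more flexible basis, the sketch also illustrates the recommendation above to combine simple parametric regressions with more flexible approaches.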
6. Marcus JL, Sewell WC, Balzer LB, et al. Artificial intelligence and machine learning for HIV prevention: emerging approaches to ending the epidemic. Curr HIV/AIDS Rep. 2020;17(3):171–179.
7. Pearl J. Causal inference in statistics: an overview. Statist Surv. 2009;3:96–146.
8. Mooney SJ, Keil AP, Westreich DJ, et al. Thirteen questions about using machine learning in causal research (you won’t believe the answer to number 10!). Am J Epidemiol. 2021;190(8):1476–1482.
9. Whitman W. Drum-Taps. New York, NY: Peter Eckler; 1865.
10. Keil AP, Edwards JK. You are smarter than you think: (super) machine learning in context. Eur J Epidemiol. 2018;33(5):437–440.
… Verbeke G, et al., eds. Longitudinal Data Analysis. Boca Raton, FL: Chapman & Hall/CRC; 2009:553–597.
23. Shakespeare W. Romeo and Juliet, First Folio. London, United Kingdom: Stationers Company; 1623.
24. van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer Publishing Company; 2011.
25. Tolkien JRR. The Fellowship of the Ring. London, United Kingdom: George Allen & Unwin; 1954.
26. Box GEP. Science and statistics. J Am Stat Assoc. 1976;71(356):791–799.
27. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25.
28. Naimi AI, Balzer LB. Stacked generalization: an introduction