
American Journal of Epidemiology Vol. 190, No. 8
https://doi.org/10.1093/aje/kwab048
© The Author(s) 2021. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: [email protected].
Advance Access publication: March 6, 2021

Invited Commentary

Invited Commentary: Machine Learning in Causal Inference—How Do I Love Thee? Let Me Count the Ways

Laura B. Balzer∗ and Maya L. Petersen



∗ Correspondence to Dr. Laura B. Balzer, Department of Biostatistics and Epidemiology, School of Public Health and Health
Sciences, University of Massachusetts Amherst, 427 Arnold House, Amherst, MA 01003 (e-mail: [email protected]).

Initially submitted November 30, 2020; accepted for publication February 4, 2021.

In this issue of the Journal, Mooney et al. (Am J Epidemiol. 2021;190(8):1476–1482) discuss machine learning
as a tool for causal research in the style of Internet headlines. Here we comment by adapting famous literary
quotations, including the one in our title (from “Sonnet 43” by Elizabeth Barrett Browning (Sonnets From the
Portuguese, Adelaide Hanscom Leeson, 1850)). We emphasize that any use of machine learning to answer
causal questions must be founded on a formal framework for both causal and statistical inference. We illustrate
the pitfalls that can occur without such a foundation. We conclude with some practical recommendations for
integrating machine learning into causal analyses in a principled way and highlight important areas of ongoing
work.
causal inference; causal models; cross-validation; double robustness; machine learning; sample-splitting; Super
Learner

Abbreviations: ML, machine learning; TMLE, targeted maximum likelihood estimation.

Editor’s note: The opinions expressed in this article are those of the authors and do not necessarily reflect the views of the American Journal of Epidemiology.

To [ML] or not to [ML]—that is [not] the question
—William Shakespeare (1)

Machine learning (ML) has become ubiquitous in public health and epidemiologic research (2, 3). Supervised learning algorithms, which estimate the expected value of an observed variable given a set of other measured variables, are commonly used to improve predictions (4–6). In epidemiology, however, we often ask causal questions—questions about what outcomes would look like under alternative hypothetical conditions (e.g., a change in how a treatment was assigned or an exposure distributed) (7). As Mooney et al. (8) discuss in their accompanying article, supervised ML also offers the promise of better answers to these causal questions.

The need for ML is clear, particularly in modern data ecosystems where we often face dozens of, if not more, confounding variables. Stratification-based approaches are typically ill-defined because of empty or sparse cells; we rarely have the knowledge to specify a correct parametric regression a priori, and data-snooping (fitting a series of estimators and selecting the “best” in an ad hoc manner) or P-hacking (conscious or not) undermines the foundations of statistical inference. In place of these unsatisfactory alternatives, ML offers a principled and prespecified way to flexibly learn from the data.

While ML is often an essential ingredient for causal inference, even the best ML algorithm may yield wildly misleading answers to causal questions if the rest of the recipe is ignored. We cannot simply accessorize our ML-based predictions with causal assumptions (e.g., no unmeasured confounding) or statistical concepts (e.g., a bootstrap) after the fact. Instead, ML algorithms must be carefully integrated within a formal framework for causal and statistical inference.

When I’d heard the learn’d [epidemiologist]
—Walt Whitman (9)

Researchers are sometimes worried that ML will supplant human expertise and experience in causal inference research

Am J Epidemiol. 2021;190(8):1483–1487



(10). In contrast, such knowledge forms the essential foundation for using ML to answer causal questions. Consider, for example, the following steps of the Causal Roadmap, one of several frameworks for causal and statistical inference (11–16).

1. State the research question, including the target population, primary exposure, primary outcome, and scale of comparison.
2. Specify a causal model, such as a directed acyclic graph (17), to represent causal relationships between key variables, including potential sources of bias (e.g., confounding, selection, missing data, censoring).
3. Translate the research question into a well-defined causal parameter, a summary measure of the distribution of the counterfactual (potential) outcomes (e.g., the difference in participants’ expected outcomes under both exposed and unexposed conditions).
4. Specify what data are available in actuality and the link between the causal and statistical models.
5. Identify: Translate the causal parameter to a statistical parameter—a summary measure of the observed data distribution—by critically evaluating the assumptions encoded in the causal model (together with a statistical assumption of adequate data support).
6. Estimate: Obtain point estimates and inference for the corresponding statistical parameter (e.g., with matching, G-computation, inverse weighting, augmented inverse weighting, or targeted maximum likelihood estimation (TMLE) with Super Learner).
7. Interpret results. Causal interpretation is warranted only when identifiability assumptions hold (step 5).

ML, and more generally statistical estimation, only plays a role in step 6. (Following Mooney et al. (8), we do not address causal discovery algorithms in this commentary.) The other steps of the Causal Roadmap rely almost exclusively on human knowledge and expertise.

To illustrate, consider the directed acyclic graph and corresponding nonparametric structural equations in Figure 1. Here, W = {W1, W2, W3} represents measured confounders, I an instrumental variable, M another preexposure variable but not a confounder, A the exposure indicator, Y the outcome, U1 unmeasured common causes of M and A, and U2 unmeasured common causes of M and Y. Given the observed data O = (W, I, M, A, Y), a naive approach to evaluate the causal effect of A on Y might start by applying a sophisticated algorithm to estimate the conditional expectation of the outcome Y given its measured past (W, I, M, A), or alternatively to estimate the conditional probability of the exposure A given its measured past (W, I, M), that is, the propensity score. While such an approach is subject to a number of possible pitfalls, ignoring the causal model is arguably the most critical, because we end up estimating a statistical parameter that differs meaningfully from any interpretable causal effect, even when the “no unmeasured confounding” assumption holds, as in Figure 1. As the simulation study presented in Table 1 illustrates, this approach can yield deeply misleading inferences for the causal effect of interest.

[Figure 1 (directed acyclic graph over U1, U2, W, M, I, A, and Y) not reproduced here.] Figure 1. Example of a directed acyclic graph in which W = {W1, W2, W3} represents measured confounders, I an instrumental variable, M another preexposure variable but not a confounder, A the exposure indicator, Y the outcome, U1 unmeasured common causes of M and A, and U2 unmeasured common causes of M and Y. The unmeasured causes of the confounders UW and of the instrumental variable UI are independent of the others and are not shown. The corresponding nonparametric structural equations are W = fW(UW); I = fI(UI); M = fM(UM); A = fA(W, I, UA); and Y = fY(W, A, UY).

Mooney et al. refer to these errors as “causal model misspecification” (8). While incorrectly specified causal models can certainly lead to identification errors, perhaps a more common error is “causal model neglect”: the failure to use causal knowledge to carefully specify a target statistical parameter before proceeding to estimation. If, instead, we had followed the Causal Roadmap, the identification step would have led us to the following statistical parameter, expressed as the G-computation formula

E[E(Y | A = 1, W)] − E[E(Y | A = 0, W)]   (1)

or, equivalently in inverse-weighted form,

E[ I(A = 1)/P(A = 1 | W) · Y ] − E[ I(A = 0)/P(A = 0 | W) · Y ].   (2)

In this setting, the instrumental variable I and the M-biasing variable M can and should be ignored for purposes of estimation and inference (17–20). As Table 1 shows, both G-computation and inverse weighting exhibited minimal bias and good confidence interval coverage when their adjustment sets followed from the identification result (step 5) and when a correctly specified parametric regression was used in estimation (step 6). Of course, in practice, our knowledge is generally insufficient to enable correct specification of parametric regressions, and ML is needed to help address this challenge.

While Table 1 may seem like an extreme example, errors of “causal model neglect” commonly occur when heeding advice to adjust for all preexposure variables or when the exposure is time-varying (i.e., longitudinal) (21, 22). It may seem obvious, but it is nonetheless worth stating explicitly: Background knowledge remains the foundation of causal identification (step 5), and ML cannot uncover cause-and-effect if this foundation is weak. Instead, ML provides an essential tool in the design of statistical estimators able to


Table 1. Results From a Simulation Study Illustrating the Consequences of Neglecting the Causal Model(a)

Estimator               Mean, %   Bias, %   Coverage, %   Relevant Regression(b)
Unadjusted                 15.6       1.3          84.6   Y ~ A
Naive implementation
  G-computation            17.9       3.5          75.0   Y ~ W1 + W2 + W3 + I + M + A
  Inverse weighting        25.8      11.5          81.2   A ~ W1 + W2 + W3 + I + M
Roadmap-informed
  G-computation            14.4       0.1          96.4   Y ~ W1 + W2 + W3 + A
  Inverse weighting        14.3       0.0         100.0   A ~ W1 + W2 + W3

(a) We consider the average treatment effect—defined as the difference in the expected counterfactual outcome under the exposure E(Y1) and under no exposure E(Y0), and equal to 14.3% in this simulation (38). Over 1,000 repetitions of the data-generating process, which is compatible with Figure 1, we show the mean point estimate, bias (average deviation between point estimate and true effect), and coverage (proportion of times the calculated 95% confidence interval contained the true effect) for the following estimators: unadjusted; G-computation naively implemented (regressing the outcome on the measured past); inverse weighting naively implemented (regressing the exposure on the measured past); G-computation informed by the Causal Roadmap (regressing the outcome on exposure and confounders); and inverse weighting informed by the Roadmap (regressing the exposure on the confounders).
(b) (W1, W2, W3) are confounders, I is an instrumental variable, M is a preexposure variable but not a confounder, A is the exposure indicator, and Y is the outcome.
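To make the lesson of Table 1 concrete, here is a minimal sketch: our own stylized linear analogue with hypothetical coefficients, not the authors' simulation (their code is available on GitHub (39)). It draws data from a process compatible with Figure 1 and contrasts a naive G-computation, implemented via equation (1) with the full adjustment set (W, I, M), against a Roadmap-informed G-computation adjusting for the confounder W alone. The helper name `g_comp` and all numeric values are our own.

```python
import numpy as np

rng = np.random.default_rng(2021)
n = 20_000

# Stylized linear data-generating process compatible with Figure 1
# (hypothetical coefficients): W is a confounder, I an instrument,
# and M an M-bias variable sharing unmeasured causes U1 (of A) and
# U2 (of Y). The true average treatment effect is 0.3.
U1, U2 = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=n)
I = rng.normal(size=n)
M = U1 + U2 + rng.normal(size=n)
A = 0.5 * W + 0.5 * I + U1 + rng.normal(size=n)
Y = 0.3 * A + 0.5 * W + U2 + rng.normal(size=n)

def g_comp(Y, A, covariates):
    """Plug-in G-computation (equation 1): fit an OLS outcome
    regression of Y on (A, covariates), then contrast the mean
    predictions with A set to 1 versus A set to 0 for everyone."""
    X = np.column_stack([np.ones(len(Y)), A, covariates])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    X1, X0 = X.copy(), X.copy()
    X1[:, 1], X0[:, 1] = 1.0, 0.0
    return np.mean(X1 @ beta) - np.mean(X0 @ beta)

naive = g_comp(Y, A, np.column_stack([W, I, M]))  # adjusts for I and M too
roadmap = g_comp(Y, A, np.column_stack([W]))      # confounder W only
print(f"naive: {naive:.2f}  roadmap-informed: {roadmap:.2f}  truth: 0.30")
```

Because everything here is linear, the plug-in contrast equals the fitted coefficient of A: the naive version is biased (conditioning on M opens the path A ← U1 → M ← U2 → Y, and adjusting for the instrument I only inflates the variance), while the W-only version recovers the true effect.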

provide valid inferences when faced with realistic statistical models: models that accurately reflect our limited knowledge.

What’s in a [statistical model]?
—William Shakespeare (23)

Mooney et al. refer to errors stemming from reliance on parametric assumptions as “statistical model misspecification” (8). As before, a more apt term might be “statistical model neglect”: failure to respect our statistical knowledge during the estimation process. Such errors can be avoided by ensuring that the statistical model, formally defined as the set of all possible distributions of the observed data, only represents real knowledge—not assumptions made for convenience at the estimation stage (24). Step 4 of the Causal Roadmap guarantees that our knowledge of the data-generating process (as opposed to wished-for simplifications) is carried through to the statistical model. For example, we often assume that the observed data are generated by sampling N times from a data-generating process compatible with the causal model (24). Under this assumption, the causal model implies the statistical model, characterizing the set of possible distributions of the observed data. In practice, few or no restrictions are placed on this set, yielding a semiparametric or nonparametric statistical model, respectively. For example, the causal model in Figure 1 does not encode parametric knowledge about the functional forms of the relationships between variables. Focusing on the propensity score, for example, the causal model encodes our limited knowledge that the exposure A is some unknown function of the confounders W, the instrument I, and unmeasured factors UA. The causal model does not state, for example, that the conditional probability of the exposure A is accurately described by a main-terms logistic function of the confounders W and instrument I. Instead, such a main-terms function is just one of many possible ways that the exposure A could be generated from (W, I, UA). During statistical estimation (Roadmap step 6), using a statistical model that reflects this uncertainty provides the foundation for accurate inferences.

Not all those who wander are lost
—J.R.R. Tolkien (25)

Here is where the power and necessity of ML-based approaches become clear. Respecting the limits of our knowledge forces us to confront very large statistical models—for example, those without functional-form restrictions on the conditional probability of exposure given confounders or on the expected outcome given exposure and confounders. In doing so, we are empowered to dismiss George Box’s quotation, “All models are wrong” (26, p. 792). Instead, we can joyfully proclaim, “My statistical model correctly describes reality.” However, in doing so, we also face new challenges for statistical estimation and inference. In particular, we are forced to leave behind the familiar comforts of parametric regressions and strike out on a journey through the vast space of distributions contained in our statistical model. Respect for our statistical model means that we can, and indeed must, explore a wide range of relationships between exposure and confounders as well as relationships between the outcome, exposure, and confounders.

Supervised ML provides the means to conduct this exploration in a powerful, principled, and fully prespecified


manner. Ensemble methods, such as Super Learner (27, 28), are particularly promising, because they use K-fold cross-validation (i.e., sample-splitting) to build the optimal weighted combination of predictions from a set of candidate algorithms. Importantly, background knowledge can again play a key role through the inclusion of expert-guided interaction terms or other features in the predictor set and parametric regressions in the algorithm set.

Do you know anything on earth which has not a dangerous side if it is mishandled and exaggerated?
—Sir Arthur Conan Doyle (29)

Nonetheless, even when appropriately confined to the estimation stage of the Causal Roadmap (step 6), ML is not without dangers. First and foremost, there may be a temptation to use ML-based predictions in singly robust methods, such as G-computation or inverse weighting. This approach may well outperform singly robust methods relying on misspecified parametric regressions and, assuming that the ML algorithms are flexible enough, provides the benefit of decreasing bias as sample size increases. The decrease in bias, however, is typically too slow to offset the corresponding decrease in variance, resulting in the potential for misleading statistical inference (i.e., lower than nominal confidence interval coverage).

One way to understand this challenge is that ML-based predictions are generated on the basis of minimizing some loss function corresponding to the supervised learning task. For example, ML may be used to do the best possible job predicting the outcome for all possible values and combinations of the exposure and confounders, while the true value of the G-computation formula (equation 1) is just one number (equal to the average treatment effect under the identifiability assumptions). In other words, a full prediction function is a different estimation goal than the G-computation formula and thereby has a different optimal bias-variance tradeoff. Equally important, there is no theory to support that the central limit theorem applies to the resulting estimators. Therefore, the 95% confidence intervals resulting when using ML with G-computation or inverse weighting should be regarded with suspicion.

These challenges have inspired the development of several double-robust estimators, such as augmented inverse weighting and TMLE (24, 30–34). These approaches can combine ML-based estimates of the expected outcome and the propensity score to achieve a number of desirable asymptotic properties, including the construction of valid 95% confidence intervals under regularity conditions. Double robust estimators employing sample-splitting, such as cross-validated TMLE, can help to ensure that the conditions required for valid statistical inference are met in practice (34–36).

An additional practical challenge is selection and implementation of the ML algorithm best suited for the current problem. Approaches like Super Learner allow us to formally explore a variety of algorithms (including the same algorithm with different tuning parameters). However, the performance of an ensemble approach is driven by the set of algorithms considered. We recommend incorporating expert knowledge and including simple parametric regressions, together with more flexible approaches. With hierarchical or repeated-measures data, it is essential to sample-split on the independent unit (e.g., the individual in longitudinal settings). Finally, if the dependent variable is rare, we recommend stratifying on the outcome before sample-splitting to maintain roughly the same prevalence in each split. Of course, these recommendations are just that—recommendations. Implementation of any ML algorithm with real data always raises complex challenges, and careful examination of the default settings of any statistical computing package (as well as, ideally, performance evaluation using simulation) is warranted.

I [thoughtfully ML], therefore I am
—René Descartes (37)

In summary, recent advances in ML provide a tremendous opportunity to improve epidemiologic research by reducing or (better yet) eliminating our reliance on unrealistically restrictive statistical assumptions. However, this opportunity is only afforded when our analyses are guided by epidemiologic principles, formal causal frameworks, and statistical theory (38).

ACKNOWLEDGMENTS

Author affiliations: Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, Massachusetts, United States (Laura B. Balzer); and Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, California, United States (Maya L. Petersen).

L.B.B. thanks Thomas Hungerford and Bruce Coffin (Westover School, Middlebury, Connecticut) for inspiring the love of literature and learning in countless generations of students.

Computer code for reproducing the study results is available on GitHub (39).

Conflict of interest: none declared.

REFERENCES

1. Shakespeare W. Hamlet, First Folio. London, United Kingdom: Stationers Company; 1623.
2. Mooney SJ, Pejaver V. Big data in public health: terminology, machine learning, and privacy. Annu Rev Public Health. 2018;39:95–112.
3. Bi Q, Goodman KE, Kaminsky J, et al. What is machine learning? A primer for the epidemiologist. Am J Epidemiol. 2019;188(12):2222–2239.
4. Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol. 2013;177(5):443–452.
5. Baćak V, Kennedy EH. Principled machine learning using the super learner: an application to predicting prison violence. Sociol Methods Res. 2019;48(3):698–721.


6. Marcus JL, Sewell WC, Balzer LB, et al. Artificial intelligence and machine learning for HIV prevention: emerging approaches to ending the epidemic. Curr HIV/AIDS Rep. 2020;17(3):171–179.
7. Pearl J. Causal inference in statistics: an overview. Statist Surv. 2009;3:96–146.
8. Mooney SJ, Keil AP, Westreich DJ, et al. Thirteen questions about using machine learning in causal research (you won’t believe the answer to number 10!). Am J Epidemiol. 2021;190(8):1476–1482.
9. Whitman W. Drum-Taps. New York, NY: Peter Eckler; 1865.
10. Keil AP, Edwards JK. You are smarter than you think: (super) machine learning in context. Eur J Epidemiol. 2018;33(5):437–440.
11. Petersen ML, van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology. 2014;25(3):418–426.
12. Petersen ML, Balzer LB. Introduction to causal inference. www.ucbbiostat.com. Published August 2014. Updated December 2018. Accessed February 1, 2021.
13. Petersen ML. Commentary: applying a causal road map in settings with time-dependent confounding. Epidemiology. 2014;25(6):898–901.
14. Balzer L, Petersen M, van der Laan MJ. Tutorial for causal inference. In: Buhlmann P, Drineas P, Kane M, et al., eds. Handbook of Big Data. Boca Raton, FL: Chapman & Hall/CRC Press; 2016:361–386.
15. Tran L, Yiannoutsos CT, Musick BS, et al. Evaluating the impact of a HIV low-risk express care task-shifting program: a case study of the targeted learning roadmap. Epidemiol Methods. 2016;5(1):69–91.
16. Saddiki H, Balzer LB. A primer on causality in data science. J Société Franç Statist. 2020;161(1):67–90.
17. Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. New York, NY: Cambridge University Press; 2009.
18. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology. 2003;14(3):300–306.
19. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615–625.
20. Liu W, Brookhart MA, Schneeweiss S, et al. Implications of M bias in epidemiologic studies: a simulation study. Am J Epidemiol. 2012;176(10):938–948.
21. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods—application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.
22. Robins JM, Hernán MA. Estimation of the causal effects of time-varying exposures. In: Fitzmaurice G, Davidian M, Verbeke G, et al., eds. Longitudinal Data Analysis. Boca Raton, FL: Chapman & Hall/CRC; 2009:553–597.
23. Shakespeare W. Romeo and Juliet, First Folio. London, United Kingdom: Stationers Company; 1623.
24. van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer Publishing Company; 2011.
25. Tolkien JRR. The Fellowship of the Ring. London, United Kingdom: George Allen & Unwin; 1954.
26. Box GEP. Science and statistics. J Am Stat Assoc. 1976;71(356):791–799.
27. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25.
28. Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol. 2018;33(5):459–464.
29. Doyle AC. The Land of Mist. London, United Kingdom: Hutchinson & Co. (Publishers) Ltd.; 1926.
30. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89(427):846–866.
31. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: 1999 Proceedings of the American Statistical Association. Alexandria, VA: American Statistical Association; 2000:6–10.
32. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972.
33. van der Laan MJ, Rose S. Targeted Learning in Data Science. New York, NY: Springer Publishing Company; 2018.
34. Díaz I. Machine learning in the estimation of causal effects: targeted minimum loss-based estimation and double/debiased machine learning. Biostatistics. 2020;21(2):353–358.
35. Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: van der Laan MJ, Rose S, eds. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer Publishing Company; 2011:459–474.
36. Benkeser D, Carone M, van der Laan MJ, et al. Doubly robust nonparametric inference on the average treatment effect. Biometrika. 2017;104(4):863–880.
37. Descartes R. Discours de la Méthode pour Bien Conduire sa Raison, et Chercher la Vérité dans les Sciences. Leiden, the Netherlands: Johannes Maire; 1637.
38. Fox MP, Edwards JK, Platt R, et al. The critical importance of asking good questions: the role of epidemiology doctoral training programs. Am J Epidemiol. 2020;189(4):261–264.
39. Balzer L. MachineLearningLove. https://github.com/LauraBalzer/MachineLearningLove. Published January 24, 2021. Accessed February 2, 2021.

