Aleksander Molak - Ajit Jaokar - Causal Inference and Discovery in Python-Packt Publishing (2023)
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing or its dealers and distributors, will be held
liable for any damages caused or alleged to have been caused directly or
indirectly by this book.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80461-298-9
www.packtpub.com
To my wife, Katia. You cause me to smile. I am grateful for every day
we spend together.
Foreword
I have been following Aleksander Molak’s work on causality for a while.
Despite causality becoming a key topic for AI and increasingly also for
generative AI, most developers are not familiar with concepts such as causal
graphs and counterfactual queries.
Aleksander’s book makes the journey into the world of causality easier for
developers. The book spans both technical concepts and code and provides
recommendations for the choice of approaches and algorithms to address
specific causal scenarios.
Ajit Jaokar
Contributors
Mike Hankin is a data scientist and statistician, with a B.S. from Columbia
University and a Ph.D. from the University of Southern California
(dissertation topic: sequential testing of multiple hypotheses). He spent 5
years at Google working on a wide variety of causal inference projects. In
addition to causal inference, he works on Bayesian models, non-parametric
statistics, and deep learning (including contributing to TensorFlow/Keras). In
2021, he took a principal data scientist role at VideoAmp, where he works as
a high-level tech lead, overseeing all methodology development. On the side,
he volunteers with a schizophrenia lab at the Veterans Administration,
working on experiment design and multimodal data analysis.
Acknowledgments
There’s only one name listed on the front cover of this book, but this book
would not exist without many other people whose names you won’t find on
the cover.
I want to thank my wife, Katia, for the love, support, and understanding that
she provided me with throughout the year-long process of working on this
book.
I want to thank Shailesh Jain, who was the first person at Packt with whom I
shared the idea about this book.
The wonderful team at Packt made writing this book a much less challenging
experience than it would have been otherwise. I thank Dinesh Chaudhary for
managing the process, being open to non-standard ideas, and making the
entire journey so smooth.
I want to thank all the people who provided me with clarifications and
additional information, agreed to include their materials in the book, or
provided valuable feedback regarding parts of this book outside of the
formal review process: Kevin Hillstrom, Matheus Facure, Rob Donnelly,
Mehmet Süzen, Ph.D., Piotr Migdał, Ph.D., Quentin Gallea, Ph.D., Uri Itai,
Ph.D., prof. Judea Pearl, Alicia Curth.
I want to thank my friends, Uri Itai, Natan Katz, and Leah Bar, with whom I
analyzed and discussed some of the papers mentioned in this book.
Additionally, I want to thank Prof. Frank Harrell and Prof. Stephen Senn for
valuable exchanges on Twitter that gave me many insights into
experimentation and causal modeling as seen through the lens of biostatistics
and medical statistics.
I am grateful to the CausalPython.io community members who shared their
feedback regarding the contents of this book: Marcio Minicz; Elie Kawerk,
Ph.D.; Dr. Tony Diana; David Jensen; and Michael Wexler.
I did my best not to miss anyone from this list. Nonetheless, if I missed your
name, the next line is for you.
Thank you!
I also want to thank you for buying this book.
Congratulations on starting your causal journey today!
Table of Contents
Preface
Part 1: Causality – an Introduction
3
Regression, Observations, and Interventions
Starting simple – observational data and linear regression
Linear regression
p-values and statistical significance
Geometric interpretation of linear regression
Reversing the order
Should we always control for all available covariates?
Navigating the maze
If you don't know where you're going, you might end up somewhere else
Get involved!
To control or not to control?
Regression and structural models
SCMs
Linear regression versus SCMs
Finding the link
Regression and causal effects
Wrapping it up
References
4
Graphical Models
Graphs, graphs, graphs
Types of graphs
Graph representations
Graphs in Python
What is a graphical model?
DAG your pardon? Directed acyclic graphs in the causal wonderland
Definitions of causality
DAGs and causality
Let's get formal!
Limitations of DAGs
Sources of causal graphs in the real world
Causal discovery
Expert knowledge
Combining causal discovery and expert knowledge
Extra – is there causality beyond DAGs?
Dynamical systems
Cyclic SCMs
Wrapping it up
References
Index
This book provides a map that allows you to break into the world of
causality.
Next, we dive into the world of causal effect estimation. Starting simple, we
consistently progress toward modern machine learning methods. Step by
step, we introduce the Python causal ecosystem and harness the power of
cutting-edge algorithms.
In the last part of the book, we sneak into the secret world of causal
discovery. We explore the mechanics of how causes leave traces and
compare the main families of causal discovery algorithms to unravel the
potential of end-to-end causal discovery and human-in-the-loop learning.
We close the book with a broad outlook into the future of causal AI. We
examine challenges and opportunities and provide you with a comprehensive
list of resources to learn more.
Who this book is for
The main audience I wrote this book for consists of machine learning
engineers, data scientists, and machine learning researchers with three or
more years of experience, who want to extend their data science toolkit and
explore the new, uncharted territory of causal machine learning.
People familiar with causality who have worked with another technology
(e.g., R) and want to switch to Python can also benefit from this book, as
well as people who have worked with traditional causality and want to
expand their knowledge and tap into the potential of causal machine learning.
Finally, this book can benefit tech-savvy entrepreneurs who want to build a
competitive edge for their products and go beyond the limitations of
traditional machine learning.
What this book covers
Chapter 1, Causality: Hey, We Have Machine Learning, So Why Even
Bother?, briefly discusses the history of causality and a number of motivating
examples. This chapter introduces the notion of spuriousness and
demonstrates that some classic definitions of causality do not capture
important aspects of causal learning (which human babies know about). This
chapter provides the basic distinction between statistical and causal learning,
which is a cornerstone for the rest of the book.
Chapter 15, Epilogue, closes Part 3 of the book with a summary of what
we’ve learned, a discussion of causality in business, a sneak peek into the
(potential) future of the field, and pointers to more resources on causal
inference and discovery for those who are ready to continue their causal
journey.
To get the most out of this book
The code for this book is provided in the form of Jupyter notebooks. To run
the notebooks, you’ll need to install the required packages.
The easiest way to install them is using Conda. Conda is a great package
manager for Python. If you don’t have Conda installed on your system, the
installation instructions can be found here: https://bit.ly/InstallConda.
Note that Conda’s license might have some restrictions for commercial use.
After installing Conda, follow the environment installation instructions in the
book’s repository README.md file (https://bit.ly/InstallEnvironments).
If you want to recreate some of the plots from the book, you might need to
additionally install Graphviz. For GPU acceleration, CUDA drivers might be
needed. Instructions and requirements for Graphviz and CUDA are available
in the same README.md file in the repository
(https://bit.ly/InstallEnvironments).
The code for this book has been tested only on Windows 11 (64-bit).
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder
names, filenames, file extensions, pathnames, dummy URLs, user input, and
Twitter handles. Here is an example: “We’ll model the adjacency matrix
using the ENCOAdjacencyDistributionModule object.”
A block of code is set as follows:

preds = causal_bert.inference(
    texts=df['text'],
    confounds=df['has_photo'],
)[0]

Any command-line input or output is written as follows:

$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see
onscreen. For instance, words in menus or dialog boxes appear in bold. Here
is an example: “Select System info from the Administration panel.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book,
email us at [email protected] and mention the book title in the
subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit
www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the
internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at [email protected] with a
link to the material.
Your review is important to us and the tech community and will help us make
sure we’re delivering excellent quality content.
Do you like to read on the go but are unable to carry your print books
everywhere? Is your eBook purchase not compatible with the device of your
choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of
that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from
your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts,
newsletters, and great free content in your inbox daily
https://packt.link/free-ebook/9781804612989
3. That’s it! We’ll send your free PDF and other benefits to your email
directly
Part 1: Causality – an Introduction
Part 1 of this book will equip us with a set of tools necessary to understand
and tackle the challenges of causal inference and causal discovery.
Finally, we’ll learn about the important properties of graphical structures that
play an essential role in almost any causal endeavor.
You might ask yourself – if all this stuff works so well, why would we bother
looking into something else?
We’ll start this chapter with a brief discussion of the history of causality.
Next, we’ll consider a couple of motivations for using a causal rather than
purely statistical approach to modeling and we’ll introduce the concept of
confounding.
Finally, we’ll see examples of how a causal approach can help us solve
challenges in marketing and medicine. By the end of this chapter, you will
have a good idea of why and when causal inference can be useful. You’ll be
able to explain what confounding is and why it’s important.
CONDITIONING
If you want to learn more about different types of conditioning, check out
https://bit.ly/MoreOnConditioning or search for phrases such as classical
conditioning versus operant conditioning and names such as Ivan Pavlov and
Burrhus Skinner, respectively.
Second, most classic machine learning algorithms also work on the basis of
association. When we’re training a neural network in a supervised fashion,
we’re trying to find a function that maps input to the output. To do it
efficiently, we need to figure out which elements of the input are useful for
predicting the output. And, in most cases, association is just good enough for
this purpose.
Have you ever seen a parent trying to convince their child to stop throwing
around a toy? Some parents tend to interpret this type of behavior as rude,
destructive, or aggressive, but babies often have a different set of
motivations. They are running systematic experiments that allow them to
learn the laws of physics and the rules of social interactions (Gopnik, 2009).
Infants as young as 11 months prefer to perform experiments with objects that
display unpredictable properties (for example, objects that appear to pass
through a wall) rather than with objects that behave predictably (Stahl &
Feigenson, 2015). This
preference allows them to efficiently build models of the world.
What we can learn from babies is that we’re not limited to observing the
world, as Hume suggested. We can also interact with it. In the context of
causal inference, these interactions are called interventions, and we’ll learn
more about them in Chapter 2. Interventions are at the core of what many
consider the Holy Grail of the scientific method: randomized controlled
trial, or RCT for short.
Imagine you work at a research institute and you’re trying to understand the
causes of people drowning. Your organization provides you with a huge
database of socioeconomic variables. You decide to run a regression model
over a large set of these variables to predict the number of drownings per
day in your area of interest. When you check the results, it turns out that the
biggest coefficient you obtained is for daily regional ice cream sales.
Interesting! Ice cream usually contains large amounts of sugar, so maybe
sugar affects people’s attention or physical performance while they are in the
water.
This hypothesis might make sense, but before we move forward, let’s ask
some questions. How about other variables that we did not include in the
model? Did we add enough predictors to the model to describe all relevant
aspects of the problem? What if we added too many of them? Could adding
just one variable to the model completely change the outcome?
Figure 1.1 – Graphical representation of models with two (a) and three variables (b).
Dashed lines represent the association, solid lines represent causation. ICE = ice cream
sales, ACC = the number of accidents, and TMP = temperature.
In Figure 1.1, we can see that adding the average daily temperature to the
model removes the relationship between regional ice cream sales and daily
drownings. Depending on your background, this might or might not be
surprising to you. We’ll learn more about the mechanism behind this effect in
Chapter 3.
Figure 1.2 – Pairwise scatterplots of relations between a, b, and c. The code to recreate
the preceding plot can be found in the Chapter_01.ipynb notebook
(https://github.com/PacktPublishing/Causal-Inference-and-Discovery-in-
Python/blob/main/Chapter_01.ipynb).
In Figure 1.2, blue points signify a causal relationship while red points
signify a spurious relationship, and variables a, b, and c are related in the
following way:
b causes a and c
Hey, but in Figure 1.2 we see some relationship! Let’s unpack it!
Okay, we said that there are some spurious relationships in our data; we
added another variable to the model and it changed the model’s outcome.
That said, I was still able to make useful predictions without this variable. If
that’s true, why would I care whether the relationship is spurious or non-
spurious? Why would I care whether the relationship is causal or not?
A marketer’s dilemma
Imagine you are a tech-savvy marketer and you want to effectively allocate
your direct marketing budget. How would you approach this task? When
allocating the budget for a direct marketing campaign, we’d like to
understand what return we can expect if we spend a certain amount of money
on a given person. In other words, we’re interested in estimating the effect of
our actions on some customer outcomes (Gutierrez, Gérardy, 2017). Perhaps
we could use supervised learning to solve this problem? To answer this
question, let’s take a closer look at what exactly we want to predict.
We want to predict the difference between two outcomes for the same person i:

Y_i(1) - Y_i(0)

where:

Y_i(1) is the outcome for person i when they received the treatment (in our example, they received marketing content from us)

Y_i(0) is the outcome for the same person given they did not receive the treatment

What the formula says is that we want to take person i's outcome when this person does not receive the treatment, Y_i(0), and subtract it from the same person's outcome when they receive the treatment, Y_i(1).
An interesting thing here is that to solve this equation, we need to know what
person i's response is under treatment and under no treatment. In reality, we
can never observe the same person under two mutually exclusive conditions
at the same time. To solve the equation in the preceding formula, we need
counterfactuals.
Table 1.1 – Aggregated results for drug A and drug B:

Drug             A                 B
Blood clot    Yes     No        Yes     No
Total         27      95        23      99
Percentage    22%     78%       19%     81%
The numbers in Table 1.1 represent the number of patients diagnosed with
disease D who were administered drug A or drug B. Row 2 (Blood clot)
gives us information on whether a blood clot was found in patients or not.
Note that the percentage scores are rounded. Based on this data, which drug
would you choose? The answer seems pretty obvious. 81% of patients who
received drug B did not develop blood clots. The same was true for only
78% of patients who received drug A. The risk of developing a blood clot is
around 3% lower for patients receiving drug B compared to patients
receiving drug A.
This looks like a fair answer, but you feel skeptical. You know that blood
clots can be very risky and you want to dig deeper. You find more fine-
grained data that takes the patient’s gender into account. Let’s look at Table
1.2:
Drug             A                 B
Blood clot    Yes     No        Yes     No
Female        24      56        17      25
Male          3       39        6       74
Total         27      95        23      99
Table 1.2 – Data for drug A and drug B with gender-specific results added. F = female, M
= male. Color-coding added for ease of interpretation, with better results marked in green
and worse results marked in orange.
Something strange has happened here. We have the same numbers as before
and drug B is still preferable for all patients, but it seems that drug A works
better for females and for males! Have we just found a medical
Schrödinger’s cat
(https://en.wikipedia.org/wiki/Schr%C3%B6dinger%27s_cat) that flips the
effect of a drug when a patient’s gender is observed?
If you think that we might have messed up the numbers – don’t believe me,
just check the data for yourself. The data can be found in
data/ch_01_drug_data.csv (https://github.com/PacktPublishing/Causal-
Inference-and-Discovery-in-Python/blob/main/data/ch_01_drug_data.csv).
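To see the reversal in plain numbers, here is a quick sketch that recomputes the share of patients without a blood clot, overall and within each gender, using only the counts from the tables above:

# Counts from Table 1.2: (blood clot, no blood clot) for each drug and gender
drug_a = {'female': (24, 56), 'male': (3, 39)}
drug_b = {'female': (17, 25), 'male': (6, 74)}

def no_clot_rate(clot, no_clot):
    """Share of patients who did not develop a blood clot."""
    return no_clot / (clot + no_clot)

# Aggregated results favor drug B (roughly 78% versus 81%)...
print(f'Drug A overall: {no_clot_rate(27, 95):.3f}')
print(f'Drug B overall: {no_clot_rate(23, 99):.3f}')

# ...yet drug A comes out ahead within each gender subgroup
for gender in ['female', 'male']:
    rate_a = no_clot_rate(*drug_a[gender])
    rate_b = no_clot_rate(*drug_b[gender])
    print(f'{gender}: drug A = {rate_a:.3f}, drug B = {rate_b:.3f}')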
We could try to answer this question from a pure machine learning point of
view: perform cross-validated feature selection and pick the variables that
contribute significantly to the outcome. This solution is good enough in some
settings. For instance, it will work well when we only care about making
predictions (rather than decisions) and we know that our production data
will be independent and identically distributed; in other words, our
production data needs to have a distribution that is virtually identical (or at
least similar enough) to our training and validation data. If we want more
than this, we’ll need some sort of a (causal) world model.
Wrapping it up
“Let the data speak” is a catchy and powerful slogan, but as we’ve seen
earlier, data itself is not always enough. It’s worth remembering that in many
cases “data cannot speak for themselves” (Hernán, Robins, 2020) and we
might need more information than just observations to address some of our
questions.
In this chapter, we learned that when thinking about causality, we’re not
limited to observations, as David Hume thought. We can also experiment –
just like babies.
Unfortunately, experiments are not always available. When this is the case,
we can try to use observational data to draw a causal conclusion, but the data
itself is usually not enough for this purpose. We also need a causal model. In
the next chapter, we’ll introduce the Ladder of Causation – a neat metaphor
for understanding three levels of causation proposed by Judea Pearl.
References
Alexander, J. E., Audesirk, T. E., & Audesirk, G. J. (1985). Classical
Conditioning in the Pond Snail Lymnaea stagnalis. The American Biology
Teacher, 47(5), 295–298. https://doi.org/10.2307/4448054
Gutierrez, P., & Gérardy, J. (2017). Causal Inference and Uplift Modelling:
A Review of the Literature. Proceedings of The 3rd International Conference
on Predictive Applications and APIs in Proceedings of Machine Learning
Research, 67, 1-13
Hernán M. A., & Robins J. M. (2020). Causal Inference: What If. Boca
Raton: Chapman & Hall/CRC
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux
JUDEA PEARL
Judea Pearl is an Israeli-American researcher and computer scientist, who devoted a large
part of his career to researching causality. His original and insightful work has been
recognized by the Association for Computing Machinery (ACM), which awarded him the
Turing Award – considered by many the equivalent of the Nobel Prize in computer science.
The Ladder of Causation was introduced in Pearl’s popular book on causality, The Book of
Why (Pearl, Mackenzie, 2019).
Rung one of the ladder represents association. The activity that is related to
this level is observing. Using association, we can answer questions about
how seeing one thing changes our beliefs about another thing – for instance,
how observing a successful space launch by SpaceX changes our belief that
SpaceX stock price will go up.
Rung two represents intervention. Remember the babies from the previous
chapter? The action related to rung two is doing or intervening. Just like
babies throwing their toys around to learn about the laws of physics, we can
intervene on one variable to check how it influences some other variable.
Interventions can help us answer questions about what will happen to one
thing if we change another thing – for instance, if I go to bed earlier, will I
have more energy the following morning?
Rung three represents counterfactual reasoning. Activities associated with
rung three are imagining and understanding. Counterfactuals are useful to
answer questions about what would have happened if we had done something
differently. For instance, would I have made it to the office on time if I took
the train rather than the car?
Imagine that you’re a doctor and you consider prescribing drug D to one of
your patients. First, you might recall hearing other doctors saying that D
helped their patients. It seems that in the sample of doctors you heard talking
about D, there is an association between their patients taking the drug and
getting better. That’s rung one. We are skeptical about the rung one evidence
because it might just be the case that these doctors only treated patients with
certain characteristics (maybe just mild cases or only patients of a certain
age). To overcome the limitation of rung one, you decide to read articles
based on randomized clinical trials.
These trials were based on interventions (rung two) and – assuming that they
were properly designed – they can be used to determine the relative efficacy
of the treatment. Unfortunately, they cannot tell us whether a patient would be
better off if they had taken the treatment earlier, or which of two available
treatments with similar relative efficacy would have worked better for this
particular patient. To answer this type of question, we need rung three.
Now, let’s take a closer look at each of the rungs and their respective
mathematical apparatus.
Associations
In this section, we’ll demonstrate how to quantify associational
relationships using conditional probability. Then, we’ll briefly introduce
structural causal models. Finally, we’ll implement conditional probability
queries using Python.
We can view the mathematics of rung one from a couple of angles. In this
section, we’ll focus on the perspective of conditional probability.
CONDITIONAL PROBABILITY
Conditional probability is the probability of one event, given that another event has occurred.
A mathematical symbol that we use to express conditional probability is | (known as a pipe
or vertical bar). We read P(X|Y) as the probability of X given Y. This notation is a bit simplified
(or abused if you will). What we usually mean by P(X|Y) is P(X = x | Y = y), the probability
that the variable X takes the value x, given that the variable Y takes the value y. This notation
can also be extended to continuous cases, where we want to work with probability densities
– for example, p(X = x | Y = y).
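For completeness, recall that conditional probability can be expressed in terms of joint and marginal probabilities (the standard definition, stated here as a reminder rather than taken from the text): P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y), provided that P(Y = y) > 0.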
Imagine that you run an internet bookstore. What is the probability that a
person will buy book A, given that they bought book B? This question can be
answered using the following conditional probability query:

P(A = 1 | B = 1)

Note that the preceding formula does not give us any information on the
causal relationship between both events. We don’t know whether buying
book A caused the customer to buy book B, buying book B caused them to
buy book A, or there is another (unobserved) event that caused both. We only
get information about non-causal association between these events. To see
this clearly, we will implement our bookstore example in Python, but before
we start, we’ll briefly introduce one more important concept.
Structural causal models (SCMs) are a simple yet powerful tool to encode
causal relationships between variables. You might be surprised that we are
discussing a causal model in the section on association. Didn’t we just say
that association is usually not enough to address causal questions? That’s
true. The reason why we’re introducing an SCM now is that we will use it as
our data-generating process. After generating the data, we will pretend to
forget what the SCM was. This way, we’ll mimic a frequent real-world
scenario where the true data-generating process is unknown, and the only
thing we have is observational data.
Let’s take a small detour from our bookstore example and take a look at
Figure 2.2:
Figure 2.2 – A graphical representation of a structural causal model
As you can see, there are two types of variables (marked with dashed versus
regular lines). Arrows at the end of the lines represent the direction of the
relationship.
Nodes A, B, and C are marked with solid lines. They represent the observed
variables in our model. We call this type of variable endogenous.
Endogenous variables are always children of at least one other variable in a
model.
The other type of node (the U nodes) is marked with dashed lines. We call
these variables exogenous, and they are represented by root nodes in the
graph (they are not descendants of any other variable; Pearl, Glymour, and
Jewell, 2016). Exogenous variables are also called noise variables.
NOISE VARIABLES
Note that most causal inference and causal discovery methods require that noise variables
are uncorrelated with each other (otherwise, they become unobserved confounders). This is
one of the major difficulties in real-world causal inference, as sometimes, it’s very hard to be
sure that we have met this assumption.
Let’s return to the SCM from Figure 2.2. We’ll define the functional
relationships in this model in the following way:
A, B, and C represent the nodes in Figure 2.2, and := is an assignment
operator, also known as a walrus operator. We use it here to emphasize that
the relationship that we're describing is directional (or asymmetric), as
opposed to the regular equals sign, which suggests a symmetric relation. Finally,
f_A, f_B, and f_C represent arbitrary functions (they can be as simple as a summation or
as complex as you want). This is all we need to know about SCMs at this
stage. We will learn more about them in Chapter 3.
Let’s practice!
For this exercise, let’s recall our bookstore example from the beginning of
the section.
First, let’s define an SCM that can generate data with a non-zero probability
of buying book A, given we bought book B. There are many possible SCMs
that could generate such data. Figure 2.3 presents the model we have chosen
for this section:
Figure 2.3 – A graphical model representing the bookstore example
To precisely define causal relations that drive our SCM, let’s write a set of
equations:
A := 1 if U_0 > 0.61, otherwise A := 0

B := 1 if A + 0.5 · U_1 > 0.2, otherwise B := 0

Here, U_0 is drawn from a uniform distribution and U_1 from a standard normal distribution.
Now, let’s recreate this SCM in code. You can find the code for this exercise
in the notebook for this chapter https://bit.ly/causal-ntbk-02:
import numpy as np
from scipy import stats
2. Next, let’s define the SCM. We will use the object-oriented approach for
this purpose, although you might want to choose other ways for yourself,
which is perfectly fine:
class BookSCM:

    def __init__(self, random_seed=None):
        self.random_seed = random_seed
        self.u_0 = stats.uniform()
        self.u_1 = stats.norm()

    def sample(self, sample_size=100):
        """Samples from the SCM"""
        if self.random_seed:
            np.random.seed(self.random_seed)
        u_0 = self.u_0.rvs(sample_size)
        u_1 = self.u_1.rvs(sample_size)
        a = u_0 > .61
        b = (a + .5 * u_1) > .2
        return a, b
Let’s unpack this code. In the __init__() method of our BookSCM, we define
the distributions for and and set a random seed for reproducibility; the
.sample() method samples from and , computes values for and
(according to the formulas specified previously), and returns them.
Great! We’re now ready to generate some data and quantify an association
between the variables using conditional probability:
1. First, let’s instantiate our SCM and set the random seed to 45:
scm = BookSCM(random_seed=45)

2. Next, let's sample 100 observations from the SCM and check the shapes of the returned arrays:

buy_book_a, buy_book_b = scm.sample(100)
buy_book_a.shape, buy_book_b.shape

The output is as follows:

((100,), (100,))
The shapes are correct. We generated the data, and we’re now ready to
answer the question that we posed at the beginning of this section – what is
the probability that a person will buy book A, given that they bought book B?
proba_book_a_given_book_b = buy_book_a[buy_book_b].sum() / buy_book_a[buy_book_b].shape[0]

print(f'Probability of buying book A given B: {proba_book_a_given_book_b:0.3f}')
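One simple way to see that this number expresses an association is to compare it with the marginal probability of buying book A (a quick sketch reusing the buy_book_a sample generated above):

proba_book_a = buy_book_a.sum() / buy_book_a.shape[0]
print(f'Probability of buying book A: {proba_book_a:0.3f}')

If the conditional probability differs noticeably from the marginal one, the two variables are associated – although, as we just discussed, this alone tells us nothing about the direction of the causal relationship.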
Let’s climb to the second rung of the Ladder of Causation to see how to go
beyond some of these limitations.
The idea of intervention is very simple. We change one thing in the world and
observe whether and how this change affects another thing in the world. This
is the essence of scientific experiments. To describe interventions
mathematically, we use a special do-operator. We usually express it in
mathematical notation in the following way:

P(Y | do(X = 0))

The preceding formula denotes the probability of Y, given that we set X
to 0. The fact that we need to change X's value is critical here, and it
highlights the inherent difference between intervening and conditioning
(conditioning is the operation that we used to obtain conditional probabilities
in the previous section). Conditioning only modifies our view of the data,
while intervening affects the distribution by actively setting one (or more)
variable(s) to a fixed value (or a distribution). This is very important –
intervention changes the system, but conditioning does not. You might ask,
what does it mean that intervention changes the system? Great question!
We say that the node X is a parent of the node Y and that Y is a child of X when there’s a
direct arrow from X to Y. If there’s also an arrow from Y to Z, we say that Z is a grandchild of
X and that X is a grandparent of Z. Every child of X, all its children and their children, their
children’s children, and so on are descendants of X, which is their ancestor. For a more
formal explanation, check out Chapter 4.
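As a quick illustration of this kinship terminology, here is a minimal sketch using networkx and a hypothetical chain X → Y → Z (we'll work with graphs in Python more systematically in Chapter 4):

import networkx as nx

# A simple chain: X -> Y -> Z
g = nx.DiGraph()
g.add_edges_from([('X', 'Y'), ('Y', 'Z')])

print(list(g.successors('X')))    # children of X: ['Y']
print(list(g.predecessors('Y')))  # parents of Y: ['X']
print(nx.descendants(g, 'X'))     # descendants of X: {'Y', 'Z'}
print(nx.ancestors(g, 'Z'))       # ancestors of Z: {'X', 'Y'}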
Let’s translate interventions into code. We will use the following SCM for
this purpose:
1. First, we’ll define the sample size for our experiment and set a random
seed for reproducibility:
SAMPLE_SIZE = 100
np.random.seed(45)
2. Next, let’s build our SCM. We will also compute the correlation
coefficient between and and print out a couple of statistics:
u_0 = np.random.randn(SAMPLE_SIZE)
u_1 = np.random.randn(SAMPLE_SIZE)
a = u_0
b = 5 * a + u_1
r, p = stats.pearsonr(a, b)
print(f'Mean of B before any intervention: {b.mean():.3f}')
print(f'Variance of B before any intervention: {b.var():.3f}')
print(f'Correlation between A and B:\nr = {r:.3f}; p =
{p:.3f}\n')
3. Now, let's perform an intervention: we'll set the value of A to 1.5 and recompute B:

a = np.array([1.5] * SAMPLE_SIZE)
b = 5 * a + u_1
We said that an intervention changes the system. If that's true, the statistics
for B should change as a result of our intervention. Let's check it out:
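One way to inspect the post-intervention statistics is to mirror the print statements we used before the intervention (a sketch continuing the code above):

# Recompute the statistics for B after setting A to 1.5
print(f'Mean of B after the intervention on A: {b.mean():.3f}')
print(f'Variance of B after the intervention on A: {b.var():.3f}')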
Both the mean and variance have changed. The new mean of B is significantly
greater than the previous one. This is because the value of our intervention on
A (1.5) is much bigger than what we'd expect from the original distribution
of A (centered at 0). At the same time, the variance has shrunk. This is
because A became constant, and the only remaining variability in B comes
from its stochastic parent, U_1.
What would happen if we intervened on B instead? Let's see and print out a
couple of statistics:

a = u_0
b = np.random.randn(SAMPLE_SIZE)

r, p = stats.pearsonr(a, b)

print(f'Mean of B after the intervention on B: {b.mean():.3f}')
print(f'Variance of B after the intervention on B: {b.var():.3f}')
print(f'Correlation between A and B after intervening on B:\nr = {r:.3f}; p = {p:.3f}\n')
Let’s see.
Figure 2.4 presents the data generated according to the following set of
structural equations:
Figure 2.4 – A scatter plot of the data from a causal data-generating process
Although from the structural point of view, there’s a clear causal link
between X and Y, the correlation coefficient for this dataset is essentially
equal to 0 (you can experiment with this data in the notebook:
https://bit.ly/causal-ntbk-02). The reason for this is that the relationship
between X and Y is not monotonic, and popular correlation metrics such as
Pearson’s r or Spearman’s rho cannot capture non-monotonic relationships.
This leads us to an important realization that a lack of traditional correlation
does not imply independence between variables.
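Here is a minimal sketch of this phenomenon, assuming for illustration a simple quadratic mechanism Y := X² (not necessarily the exact data-generating process behind Figure 2.4):

import numpy as np
from scipy import stats

np.random.seed(45)

x = np.random.randn(10_000)
y = x**2  # Y is fully determined by X, yet the relationship is not monotonic

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f'Pearson r: {r:.3f}, Spearman rho: {rho:.3f}')  # both close to 0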
A number of tools for more general independence testing exist. For instance,
information-theoretic metrics such as the maximal information coefficient
(MIC) (Reshef et al., 2011; Reshef et al., 2015; Murphy, 2022, pp. 217–219)
work for non-linear, non-monotonic data out of the box. The same goes for
the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2007)
and a number of other metrics.
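As a rough illustration of the information-theoretic idea, we can estimate the mutual information between the two variables from the quadratic sketch above (using scikit-learn's estimator as a stand-in; this is not MIC or HSIC itself):

from sklearn.feature_selection import mutual_info_regression

# Mutual information picks up the dependence that Pearson's r and Spearman's rho missed
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=45)
print(f'Estimated mutual information between X and Y: {mi[0]:.3f}')  # clearly greater than 0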
Another scenario where you might see no correlation although causation is
present is when your sampling does not cover the entire support of relevant
variables. Take a look at Figure 2.5.
This data is a result of exactly the same process as the data presented in
Figure 2.4. The only difference is that we sampled according to the
following condition:
Situations such as these can happen in real life. Everything from faulty
sensors to selection bias in medical studies can remove a substantial amount
of information from observational data and push us toward wrong
conclusions.
In this section, we looked into interventions in greater detail. We introduced
the do-operator and discussed how it differs from conditioning. Finally, we
implemented a simple SCM in Python and demonstrated how interventions on
one variable affect the distribution of other variables.
If all of this makes you a little dizzy, don’t worry. For the vast majority of
people coming from a statistical background, conversion to causal thinking
takes time. This is because causal thinking requires breaking certain habits
we all pick up when we learn statistical thinking. One such habit is thinking
in terms of variables as basic entities rather than in terms of the process that
generated these variables.
Ready for more? Let’s step up to rung three – the world of counterfactuals!
The difference is that in one model, no patient is affected by the drug, while
in the second model, all patients are affected by the drug – a pretty striking
difference. As you can expect, counterfactual outcomes for both models
differ (for more details, check out Pearl (2009, pp. 33–38)).
Let’s try to formalize it. We’ll denote the fact that you drank coffee this
morning by and the fact that you now feel bad as . The
subscript informs us that the outcome, , happened in the world where you
had your coffee in the morning ( ). The quantity we want to estimate,
therefore, is the following:
We read this as the probability that you’d feel bad if you hadn’t had your
coffee, given you had your coffee and you feel bad.
Let’s unpack it:
stands for the probability that you’d feel bad (in the
alternative world) if you hadn’t had your coffee
says that you had your coffee and you felt bad (in the real
world)
Note that everything on the right side of the conditioning bar comes from the
actual observation. The expression on the left side of the conditional bar
refers to the alternative, hypothetical world.
Many people feel uncomfortable seeing this notation for the first time. The
fact that we're conditioning on the factual values (T = 1, Y = 1) to estimate a
quantity in the world where T = 0 seems pretty counterintuitive. On a deeper
level, this makes sense though. This notation makes it virtually impossible to reduce
counterfactuals to do-expressions (Pearl, Glymour, and Jewell, 2016). This
reflects the inherent relationship between interventions and counterfactuals
that we discussed earlier in this section.
COUNTERFACTUAL NOTATION
In this book, we follow the notation used by Pearl, Glymour, and Jewell (2016). There are also
other options available in the literature. For instance, Peters, Janzing, and Schölkopf (2017)
propose using a different style of notation. Another popular choice is the notation related to Donald
Rubin’s potential outcomes framework. It is worth making sure you understand the notation
well. Otherwise, it can be a major source of confusion.
Computing counterfactuals
The basic idea behind computing counterfactuals is simple in theory, but the
goal is not always easily achievable in practice. This is because computing
counterfactuals requires an SCM that is fully specified at least for all the
relevant variables. What does this mean? It means that we need to have full
knowledge of the functions that relate the relevant variables in the SCM and
full knowledge of the values of all relevant exogenous variables in the
system. Fortunately, if we know the structural equations, we can compute the noise
variables describing the subject in the abduction step (see the following
Computing counterfactuals step by step callout).
COMPUTING COUNTERFACTUALS STEP BY STEP

• Abduction: Using the observed (factual) data to compute the values of the exogenous (noise) variables consistent with it

• Modification (originally called an action): Replacing the structural equation for the
treatment with a counterfactual value

• Prediction: Using the modified SCM to compute the new value of the outcome under the
counterfactual
Let’s take our coffee example. Let denote our treatment – drinking coffee,
while will characterize you fully as an individual in our simplified world.
stands for coffee sensitivity, while stands for a lack thereof.
Additionally, let’s assume that we know the causal mechanism for reaction to
coffee. The mechanism is defined by the following SCM:
Great! We know the outcome under the actual treatment, Y_{T=1} = 1 (you drank
the coffee and you felt bad), but we don't know your characteristics (U). Can
we do something about it?

It turns out that our model allows us to unambiguously deduce the value of U
(which we'll denote as u), given we know the values of T and Y. Let's solve
for U by transforming our structural equation for Y:

U = (T + Y − 1) / (2T − 1)
We’re ready for the next step, modification. We will fix the value of our
treatment at the counterfactual of interest, ( ):
Finally, we’re ready for the last step, prediction. To make a prediction, we
need to substitute with the value(s) of your personal characteristic(s) that
we computed before:
And here is the answer – you wouldn’t feel bad if you hadn’t had your
coffee!
In the real world, we are not always able to compute counterfactuals deterministically.
Fortunately, the three-step framework that we introduced earlier generalizes well for
probabilistic settings. To learn more on how to compute probabilistic counterfactuals, please
refer to Chapter 4 of Causal inference in statistics: A primer (Pearl, Glymour, and Jewell,
2016) or an excellent YouTube video by Brady Neal (Neal, 2020).
Time to code!
The last thing we’ll do before concluding this section is to implement our
counterfactual computation in Python:

1. First, let's define a class that implements the three steps:

class CounterfactualSCM:

    def abduct(self, t, y):
        return (t + y - 1) / (2 * t - 1)

    def modify(self, t):
        return lambda u: t * u + (t - 1) * (u - 1)

    def predict(self, u, t):
        return self.modify(t)(u)
Note that each method implements the steps that we performed in the
preceding code:
.abduct() computes the value of U, given the values of the treatment t and the
actual outcome y

.modify() returns the structural equation for Y with the treatment fixed at the value t

.predict() computes the counterfactual outcome, given the exogenous value u and the treatment t
2. Let’s instantiate the SCM and assign the known treatment and outcome
values to the t and y variables respectively:
coffee = CounterfactualSCM()
t = 1
y = 1
3. Next, let’s obtain the value of by performing abduction and print out
the result:
u = coffee.abduct(t=t, y=y)
u
1.0
coffee.predict(u=u, t=0)
0.0
If you hadn’t had your coffee in the morning, you’d feel better now.
In the classic formulation of RL, an agent interacts with the environment. This
suggests that an RL agent can make interventions in the environment.
Intuitively, this possibility moves RL from an associative rung one to an
interventional rung two. Bottou et al. (2013) amplify this intuition by
proposing that causal models can be reduced to multiarmed bandit problems
– in other words, that RL bandit algorithms are special cases of rung two
causal models.
Although the idea that all RL is causal might seem intuitive at first, the reality
is more nuanced. It turns out that even for certain bandit problems, the results
might not be optimal if we do not model causality explicitly (Lee and
Bareinboim, 2018).
Wrapping it up
In this chapter, we introduced the concept of the Ladder of Causation. We
discussed each of the three rungs of the ladder: associations, interventions,
and counterfactuals. We presented mathematical apparatus to describe each
of the rungs and translated the ideas behind them into code. These ideas are
foundational for causal thinking and will allow us to understand more
complex topics further on in the book.
In the next chapter, we’ll take a look at the link between observations,
interventions, and linear regression to see the differences between rung one
and rung two from yet another perspective. Ready?
References
Berrevoets, J., Kacprzyk, K., Qian, Z., & van der Schaar, M. (2023). Causal
Deep Learning. arXiv preprint arXiv:2303.02186.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A.
(2007). A Kernel Statistical Test of Independence. NIPS.
Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J., & Silva, R. (2022). Causal
Machine Learning: A Survey and Open Problems. arXiv, abs/2206.15475
Pearl, J., & Mackenzie, D. (2019). The book of why. Penguin Books.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G.,
Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., & Sabeti, P. C. (2011).
Detecting novel associations in large data sets. Science, 334 (6062), 1518–
1524. https://doi.org/10.1126/science.1205438.
Jimenez Rezende, D., Danihelka, I., Papamakarios, G., Ke, N.R., Jiang, R.,
Weber, T., Gregor, K., Merzic, H., Viola, F., Wang, J.X., Mitrovic, J., Besse,
F., Antonoglou, I., & Buesing, L. (2020). Causally Correct Partial Models
for Reinforcement Learning. arXiv, abs/2002.02836.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt,
S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T. & Silver,
D. (2019). Mastering Atari, Go, Chess and Shogi by Planning with a
Learned Model. Nature.
Sgouritsa, E., Janzing, D., Hennig, P., & Schölkopf, B. (2015). Inference of
Cause and Effect with Unsupervised Inverse Regression. AISTATS.
Vowels, M. J., Camgoz, N. C., & Bowden, R. (2021). Shadow-Mapping for
Unsupervised Neural Causal Discovery. arXiv, 2104.08183
Wang, Z., Xiao, X., Xu, Z., Zhu, Y., & Stone, P. (2022). Causal Dynamics
Learning for Task-Independent State Abstraction. Proceedings of the 39th
International Conference on Machine Learning, in Proceedings of
Machine Learning Research, 162, 23151–23180.
Wu, X., Gong, M., Manton, J. H., Aickelin, U., & Zhu, J. (2022). On
Causality in Domain Adaptation and Semi-Supervised Learning: an
Information-Theoretic Analysis. arXiv. 2205.04641
3
Linear regression
Linear regression is a basic data-fitting algorithm that can be used to predict
the expected value of a dependent (target) variable, y, given values of some
predictor(s), X. Formally, this is written as E(Y|X).

Let's take a model with just one predictor, x. Such a model can be described
by the following formula:

ŷ_i = α + βx_i

In the preceding formula, ŷ_i is the predicted value for observation i, α is a
learned intercept term, x_i is the observed value of x, and β is the regression
coefficient for x. We call α and β model parameters.
1. First, we’ll define our data-generating process. We’ll make the process
follow the (preceding) linear regression formula and assign arbitrary
values to the (true) parameters and . We’ll choose 1.12 as the value
of and 0.93 as the value of (you can use other values if you want).
We will also add noise to the model and mark it as . We choose to be
normally distributed with zero mean and a standard deviation of one.
Additionally, we’ll scale by 0.5. With these values, our data-generating
formula becomes the following:
1. We’ll start by importing the libraries that we’re going to use in this
chapter. We’re going to use statsmodels to fit our linear regression
model:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
STATSMODELS
statsmodels is a popular statistical library in Python that offers support for R-like syntax
and R-like model summaries (in case you haven’t heard of R, it is a popular open source
statistical programming language). statsmodels is a great choice if you want to work with
traditional statistical models. The package offers convenient model summaries that contain p-
values and other useful statistics. If you come from a Scikit-learn background, you might find
the statsmodels API a bit confusing. There are several key differences between the two
libraries. One of them is the .fit() method, which in statsmodels returns an instance of
a wrapper object that can be further used to generate predictions. For more details on
statsmodels, refer to the documentation: https://bit.ly/StatsmodelsDocs.
2. Next, we’ll set a random seed for reproducibility and define the number
of samples that we’re going to generate:
np.random.seed(45)
N_SAMPLES = 5000

3. Let's define the true parameter values and generate the noise variable:

alpha = 1.12
beta = 0.93

epsilon = np.random.randn(N_SAMPLES)
4. Finally, we’ll use our model formula to generate the data:
X = np.random.randn(N_SAMPLES)
y = alpha + beta * X + 0.5 * epsilon
5. There’s one more step that we need to take before fitting the model.
statsmodels requires us to add a constant feature to the data. This is
needed to perform the intercept computations. Many libraries perform
this step implicitly; nonetheless, statsmodels wants us to do it explicitly.
To make our lives easier, the authors have provided us with a convenient
method, .add_constant(). Let’s apply it!
X = sm.add_constant(X)
Now, our X has got an extra column of ones at column index 0. Let’s print the
first five rows of X to see it:
print(X[:5, :])
[[ 1. 0.11530002]
[ 1. -0.43617719]
[ 1. -0.54138887]
[ 1. -1.64773122]
[ 1. -0.32616934]]
Now, we’re ready to fit the regression model using statsmodels and print the
summary:
model = sm.OLS(y, X)
fitted_model = model.fit()
print(fitted_model.summary())
The output of the model summary is presented in Figure 3.1. We marked the
estimated coefficients with a red ellipse:
Figure 3.1 – A summary of the results of a simple linear regression model
The null hypothesis for a given coefficient states that this coefficient is not
significantly different from zero. The null hypothesis for the model states that
the entire model is not significantly different from the null model (in the
context of simple regression analysis, the null model is usually represented
as an intercept-only model).
If you want to learn more about null hypotheses and hypothesis testing, check
out this video series from Khan Academy:
https://bit.ly/KhanHypothesisTesting.
Each blue dot in Figure 3.2 represents a single observation, while the red
line represents the best-fit line found by the linear regression algorithm. In
the case of multiple regression, the line becomes a hyperplane.
The regression model itself cannot help us understand which variable is the
cause and which is the effect. To determine this, we need some sort of
external knowledge.
Causal attributions become even more complicated in multiple regression,
where each additional predictor can influence the relationship between the
variables in the model. For instance, the learned coefficient for variable X
might be 0.63, but when we add variable Z to the model, the coefficient for X
changes to -2.34. A natural question in such cases is: if the coefficient has
changed, what is the true effect here?
Let’s take a look at this issue from the point of view of statistical control.
Let’s start with an example. When studying predictors of dyslexia, you might
be interested in understanding whether parents smoking influences the risk of
dyslexia in their children. In your model, you might want to control for
parental education. Parental education might affect how much attention
parents devote to their children’s reading and writing, and this in turn can
impact children’s skills and other characteristics. At the same time, education
level might decrease the probability of smoking, potentially leading to
confounding. But how do we actually know whether it does lead to
confounding?
In some cases, we can refer to previous research to find an answer or at least
a hint. In other cases, we can rely on our intuition or knowledge about the
world (for example, we know that a child’s current skills cannot cause the
parent’s past education). However, in many cases, we will be left without a
clear answer. This inevitable uncertainty led to the development of various
heuristics guiding the choice of variables that should be included as
statistical controls.
Some authors offer more fine-grained heuristics. For example, Becker and
colleagues (Becker et al., 2016; https://bit.ly/BeckersPaper) shared a set of
10 recommendations on how to approach statistical control. Some of their
recommendations are as follows (the original ordering is given in
parentheses):
If you are not sure about a variable, don’t use it as a control (1)
An important aspect of this story is that the thing is now broken. This suggests
that it worked properly before. This is not necessarily a valid assumption to
make from a causal point of view. Not including a variable in the model
might also lead to confounding and spuriousness. This is because there are
various patterns of independence structure possible between any three
variables. Let’s consider the structural causal model (SCM) presented in
Figure 3.5:
Figure 3.5 – An example SCM with various patterns leading to spurious associations
From the model structure, we can clearly see that x and y are causally
independent. There's no arrow between them, nor is there a directed path that
would connect them indirectly. Let's fit four models and analyze which
variables, when controlled for, lead to spurious relationships between x and y.

What is your best guess – out of the four models, which ones will correctly
capture the causal independence between x and y? I encourage you to write your
hypotheses down on a piece of paper before we reveal the answer. Ready?
Let's find out!
1. First, let's generate the data according to our SCM:

a = np.random.randn(N_SAMPLES)
x = 2 * a + 0.5 * np.random.randn(N_SAMPLES)
y = 2 * a + 0.5 * np.random.randn(N_SAMPLES)
b = 1.5 * x + 0.75 * y
Note that all the coefficients that we use to scale the variables are arbitrarily
chosen.
2. Next, let’s define four model variants and fit the models iteratively:
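A sketch of this step, assuming the four variants are no controls, controlling for a, controlling for b, and controlling for both a and b (the exact code in the book's notebook may differ), could look as follows:

import statsmodels.api as sm

# Four design matrices: no controls, control for a, for b, and for both
variants = {
    'y ~ x': np.stack([x], axis=1),
    'y ~ x + a': np.stack([x, a], axis=1),
    'y ~ x + b': np.stack([x, b], axis=1),
    'y ~ x + a + b': np.stack([x, a, b], axis=1),
}

for name, features in variants.items():
    design = sm.add_constant(features)
    result = sm.OLS(y, design).fit()
    # The coefficient and p-value for x sit at index 1 (index 0 is the constant)
    print(f'{name}: coefficient for x = {result.params[1]:.3f}, '
          f'p-value = {result.pvalues[1]:.3f}')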
Why did controlling for a work while all other schemes did not? There are
three elements to the answer:
Third, not controlling for any variable leads to the same result in terms of
the significance of x as controlling for both a and b (note that the results are
different in terms of coefficients, yet as we're now interested in the
structural properties of the system, this is of secondary importance). This
is precisely because the effects of controlling for a and controlling for b
are exactly the opposite from a structural point of view, and they cancel
each other out!
If we have full knowledge about the causal graph, the task of deciding which
variables we should control for becomes relatively easy (and after reading
the next couple of chapters, you might even find it almost trivial). If the true
causal structure is unknown, the decision is fundamentally difficult.
Causality does not give you a new angle on statistical control; it gives you
new eyes that allow you to see what’s invisible from the perspective of rung
1 of The Ladder of Causation. For a summary of what constitutes good and
bad controls, check out the excellent paper by Cinelli et al. (2022;
https://bit.ly/GoodBadControls).
SCMs
In the previous chapter, we learned that SCMs are a useful tool for encoding
causal models. They consist of a set of variables (exogenous and
endogenous) and a set of functions defining the relationships between these
variables. We saw that SCMs can be represented as graphs, with nodes
representing variables and directed edges representing functions. Finally, we
learned that SCMs can produce interventional and counterfactual
distributions.
When fitting the four models describing the SCM from Figure 3.5, we saw
that in the correct model (the one controlling for a), the estimate of the coefficient for
a was equal to 1.967. That's very close to the true coefficient, which was
equal to 2. This result shows the direction of our conclusion.
Linear regression can be used to estimate causal effects, given that we know
the underlying causal structure (which allows us to choose which variables
we should control for) and that the underlying system is linear in terms of
parameters.
Linear models can be a useful microscope for causal analysis (Pearl, 2013).
To cement our intuition regarding the link between linear regression and
SCMs, let’s build one more SCM that will be linear in terms of parameters
but non-linear in terms of data and estimate its coefficients with linear
regression:
1. As usual, let’s first start by defining the causal structure. Figure 3.6
presents a graphical representation of our model.
2. Next, let’s define the functional assignments (we’ll use the same settings
for sample size and random seed as previously):
a = np.random.randn(N_SAMPLES)
x = 2 * a + .7 * np.random.randn(N_SAMPLES)
y = 2 * a + 3 * x + .75 * x**2
3. Let’s add a constant and then initialize and fit the model:
Figure 3.7 presents the results of our regression analysis. The coefficient for
x is marked as x, the coefficient for x² as x^2, and the coefficient for a as a.
If we compare the coefficient values to the true coefficients in our SCM, we
can notice that they are exactly the same! This is because we modeled y as a
deterministic function of a and x, without adding any noise.

We can also see that the model correctly recovered the coefficient for the non-
linear term (x²). Although the relationship between x and y is non-linear,
they are related by a linear functional assignment.
To understand how to interpret the coefficients in such a model, let’s get back
to the simple univariate case first.
Let’s represent the univariate case with the following simplified formula (we
omit the intercept and noise):
If you know a thing or two about calculus, you will quickly notice that this
derivative is equal to – our coefficient. If you don’t, don’t worry.
Taking the derivative of this expression with respect to will give us the
following:
Note that we can get this result by applying a technique called the power rule
(https://bit.ly/MathPowerRule) to our equation.
Note that another perspective on the quadratic term is that it is a special
case of an interaction between x and itself. This is congruent with the
multiplicative definition of interaction that we presented earlier in this
chapter. In this light, the quadratic term can be seen as x * x.
The causal interpretation of linear regression only holds when there are no
spurious relationships in your data. This is the case in two scenarios: when
you control for a set of all necessary variables (sometimes this set can be
empty) or when your data comes from a properly designed randomized
experiment.
Wrapping it up
That was a lot of material! Congrats on reaching the end of Chapter 3!
We’re now ready to take a more detailed look at the graphical aspect of
causal models. See you in the next chapter!
References
Becker, T. E., Atinc, G., Breaugh, J. A., Carlson, K. D., Edwards, J. R., &
Spector, P. E. (2016). Statistical control in correlational studies: 10
essential recommendations for organizational researchers. Journal of
Organizational Behavior, 37(2), 157–167.
Cinelli, C., Forney, A., & Pearl, J. (2022). A Crash Course in Good and Bad
Controls. Sociological Methods & Research, 0 (0), 1-34.
Kline, R. B. (2015). Principles and Practice of Structural Equation
Modeling. Guilford Press.
Graphical Models
Welcome to Chapter 4!
So far, we have used graphs mainly to visualize our models. In this chapter,
we’ll see that from the causal point of view, graphs are much more than just a
visualization tool. We’ll start with a general refresher on graphs and basic
graph theory. Next, we’ll discuss the idea of graphical models and the role
of directed acyclic graphs (DAGs) in causality. Finally, we’ll look at how
to talk about causality beyond DAGs. By the end of this chapter, you should
have a solid understanding of what graphs are and how they relate to causal
inference and discovery.
A refresher on graphs
Let’s start!
Graphs can be defined in multiple ways. You can think of them as discrete
mathematical structures, abstract representations of real-world entities and
relations between them, or computational data structures. What all of these
perspectives have in common are the basic building blocks of graphs: nodes
(also called vertices) and edges (links) that connect the nodes.
Types of graphs
We can divide graphs into types based on several attributes. Let’s discuss the
ones that are the most relevant from the causal point of view.
Undirected versus directed
Directed graphs are graphs with directed edges, while undirected graphs
have undirected edges. Figure 4.1 presents an example of a directed and
undirected graph:
Figure 4.1 – Directed (a) and undirected (b) graphs
In certain cases, we might not have full knowledge of the orientation of all
the edges in a graph of interest.
When we know all the edges in the graph, but we are unsure about the
direction of some of them, we can use complete partially directed acyclic
graphs (CPDAGs) to represent such cases.
We’ll see in Part 3, Causal Discovery, that some causal discovery methods
can in certain cases only recover partial causal structures from the data. Such
structures can be encoded as partially directed graphs.
In general, you can also see graphs with different types of edges, denoting
different types of relations between nodes. This is often the case in
knowledge graphs, network science, and in some applications of graph
neural networks. In this book, we’ll focus almost exclusively on graphs with
one edge type.
Cyclic versus acyclic
Cyclic graphs are graphs that allow for loops. In general, loops are paths
that lead from a given node to itself. Loops can be direct (from a node to
itself; so-called self-loops) or indirect (going through other nodes).
The graph on the right side of Figure 4.2 (b) contains two types of loops. There's a self-loop at node B (from B back to itself) and there are two other, larger loops. Can you find them?
Let’s start from A. There are two paths we can take. If we move to B, we can
either stay at B (via its self-loop) or move to C. From C, we can only get
back to A.
In Figure 4.3, we can see a fully-connected graph on the left (a) and a
partially connected one on the right (b). Note that real-world causal graphs
are rarely fully connected (that would mean that everything is directly
causally related to everything else either as a cause or as an effect).
Weighted versus unweighted
Weighted graphs contain additional information on the edges. Each edge is
associated with a number (called a weight), which may represent the strength
of connection between two nodes, distance, or any other metric that is useful
in a particular case. In certain cases, weights might be restricted to be
positive (for example, when modeling distance). Unweighted graphs have
no weights on the edges; alternatively, we can see them as a special case of a
weighted graph with all edge weights set to 1. In the context of causality,
edge weights can encode the strength of the causal effect (note that it will
only make sense in the linear case with all the structural equations being
linear with no interactions).
Graph representations
Now we’re ready to talk about various ways we can represent a graph.
We’ve seen enough graphs so far to understand how to present them visually.
This representation is very intuitive and easy to work with for humans (at
least for small graphs), but not very efficient for computers.
Let’s see a couple of examples to make it clearer. We’ll start with something
very simple:
There are four 1-entries that translate to four edges in the graph: from 0 to 2,
from 1 to 3, from 3 to 0, and from 3 to 2. The non-zero entries in the matrix
and the respective edges are color-coded for your convenience.
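The matrix itself is shown as a figure; a sketch of how such a matrix could be written out with NumPy (assuming the usual convention that rows are source nodes and columns are target nodes) might look like this:

import numpy as np
import networkx as nx

# Four 1-entries = four edges: 0 -> 2, 1 -> 3, 3 -> 0, 3 -> 2
adj_matrix = np.array([
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
])

graph = nx.DiGraph(adj_matrix)
print(list(graph.edges()))    # [(0, 2), (1, 3), (3, 0), (3, 2)]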
Directed and acyclic
Have you noticed that all the matrices in our examples had zeros on the
diagonal?
This is not by accident. All the preceding matrices represent valid DAGs – a
type of graph with no cycles and with directed edges only.
Any diagonal entry in an adjacency matrix has an index of the form (i, i), denoting an edge from node i to itself.
This means that a valid DAG should always have zeros on the diagonal. Self-loops, represented by ones on the diagonal, would introduce cycles, and DAGs are acyclic by definition.
If you’re familiar with matrix linear algebra, you might have thought that the
fact that a matrix has only zeros in the diagonal provides us with important
information about its trace and eigenvalues.
Graphs in Python
Now we’re ready for some practice.
There are many options to define graphs in Python and we’re going to
practice two of them: using graph modeling language (GML) and using
adjacency matrices. Figure 4.7 presents a GML definition of a three-node
graph with two edges and the resulting graph. The code from Figure 4.7 and
the remaining part of this section is available in the notebook for Chapter 4
in the Graphs in Python section (https://bit.ly/causal-ntbk-04):
Figure 4.7 – GML definition of a graph and the resulting graph
Another popular graph language is DOT. A dedicated Python library called pydot
(https://bit.ly/PyDOTDocs) allows you to easily read, write, and manipulate DOT graphs in
pure Python. DOT is used by graphviz and can be used with NetworkX and DoWhy.
GML syntax can be parsed using the NetworkX parse_gml() function, which
returns a networkx.classes.digraph.DiGraph instance (digraph is shorthand
for directed graph). The usage is very simple:
graph = nx.parse_gml(sample_gml)
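Figure 4.7 contains the exact GML string used in the notebook; the definition below is a hypothetical stand-in (node labels and edges are assumed) that illustrates the same idea of a three-node directed graph with two edges:

import networkx as nx

# A hypothetical GML definition: three nodes, two directed edges
sample_gml = """graph [
  directed 1
  node [ id 0 label "A" ]
  node [ id 1 label "B" ]
  node [ id 2 label "C" ]
  edge [ source 0 target 1 ]
  edge [ source 1 target 2 ]
]"""

graph = nx.parse_gml(sample_gml)
print(type(graph))            # networkx.classes.digraph.DiGraph
print(list(graph.edges()))    # [('A', 'B'), ('B', 'C')]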
GML is pretty flexible, but also pretty verbose. Let’s define the same graph
using an adjacency matrix:
Figure 4.8 – Adjacency matrix of a graph and the resulting graph
Figure 4.8 presents an adjacency matrix created using NumPy and the
resulting graph. As you can see, this definition is much more compact. It’s
also directly usable (without additional parsing) by many algorithms. To
build a NetworkX graph from an adjacency matrix, we need just one line of
code:
graph = nx.DiGraph(adj_matrix)
To get some practice, create the following graph using GML and an
adjacency matrix yourself and visualize it using NetworkX:
Six nodes
Six edges: (0, 1), (0, 3), (0, 5), (3, 2), (2, 4), (4, 5)
You should get a result similar to the one presented in Figure 4.9:
Figure 4.9 – The resulting graph
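If you want to check your work, here is one possible solution sketch using the adjacency-matrix route (the GML route follows the same pattern as shown earlier):

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Six nodes, six edges: (0, 1), (0, 3), (0, 5), (3, 2), (2, 4), (4, 5)
adj_matrix = np.zeros((6, 6), dtype=int)
for source, target in [(0, 1), (0, 3), (0, 5), (3, 2), (2, 4), (4, 5)]:
    adj_matrix[source, target] = 1

graph = nx.DiGraph(adj_matrix)
nx.draw(graph, with_labels=True, node_size=1200)
plt.show()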
In this section, we discussed various types of graphs. We learned what a
DAG is and how to construct it in Python. We’ve seen that graphs can be
represented in many different ways and used two of these ways (GML and
adjacency matrices) to build our own graphs. We’re now ready to discuss
graphical models.
The basic building blocks of GCM graphs are the same as the basic elements
of any directed graph: nodes and directed edges. In a GCM, each node is
associated with a variable.
These properties of GCMs are the basis of many causal inference and
discovery methods and can also be leveraged for domain adaptation even in
the presence of unobserved confounders (Magliacane et al., 2018) – a
flexible and very practical extension of the causal discovery toolbox.
All that said, definitions of GCMs are not entirely consistent in the literature.
In most places in this book, we’ll talk about SCMs as consisting of a graph
and a set of functional assignments. This will allow us to avoid the confusion
related to the inconsistencies in the literature, while preserving the most
important elements of the graphical and functional representations.
Definitions of causality
In the first chapter, we discussed a couple of historical definitions of
causality. We started with Aristotle, then we briefly covered the ideas
proposed by David Hume. We’ve seen that Hume’s definition (as we
presented it) was focused on associations. This led us to look into how
babies learn about the world using experimentation. We‘ve seen how
experimentation allows us to go beyond the realm of observations by
interacting with the environment. The possibility of interacting with the
environment is at the heart of another definition of causality that comes from
Judea Pearl.
Pearl proposed something very simple yet powerful. His definition is short,
ignores ontological complexities of causality, and is pretty actionable. It goes
as follows: X causes Y if Y listens to X (Pearl & Mackenzie, 2019).
What does it mean that one variable listens to another? In Pearlian terms, it means that if we change X, we also observe a change in Y.
The operation of removing the incoming edges from a node that we intervene
upon is sometimes referred to as graph mutilation.
The good news is that we don’t always need interventions to alter the
information flow in the data. If we know which paths in the graph should
remain open and which should be closed, we can leverage the power of d-
separation and statistical control to close and open the paths in the graph,
which will allow us to describe causal quantities in purely statistical terms
under certain conditions. Moreover, in some scenarios, it will allow us to
learn the structure of the data-generating process from the data generated by
this process. We’ll learn more about d-separation in Chapter 6, and more
about learning the structure of the data-generating process in Part 3, Causal
Discovery.
By definition, in a DAG, there are no paths that start at a given vertex and lead back to that same vertex (either directly or indirectly).
Limitations of DAGs
DAGs seem to capture certain intuitions about causality really well. My best
guess is that most people would agree that talking about the direction of
causal influence makes sense (can you think of undirected causality?). At the
same time, I believe that fewer people would agree that causal influence
cannot be cyclic. Let’s see an example.
If we run these equations sequentially for a number of steps, we'll see that what partner 1 said at time t becomes dependent on what they themselves said at earlier time steps.
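The exact equations appear in the book as a figure; a minimal simulation sketch of such a coupled system (the coefficients below are made up purely for illustration) shows the same effect:

import numpy as np

np.random.seed(42)
n_steps = 20
p1 = np.zeros(n_steps)   # what partner 1 "says" at each time step
p2 = np.zeros(n_steps)   # what partner 2 "says" at each time step

# Made-up coupling: each partner reacts to what the other said one step earlier
for t in range(1, n_steps):
    p1[t] = 0.6 * p2[t - 1] + 0.1 * np.random.randn()
    p2[t] = -0.4 * p1[t - 1] + 0.1 * np.random.randn()

# p1 at time t depends on p2 at t - 1, which depends on p1 at t - 2:
# unrolling the feedback loop in time makes the cyclic dependence explicit
print(np.round(p1[:5], 3), np.round(p2[:5], 3))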
Another example – perhaps more intuitive for some readers – comes from the
field of economics. When demand for product P grows, the producer might
increase the supply in the hope of collecting potential profits from the market.
Increased supply might cause a price drop, which in turn can increase
demand further.
In this section, we’ll provide a brief overview of such sources and we’ll
leave a more detailed discussion for Part 3 of the book.
On a high level, we can group the ways of obtaining causal graphs into three
classes:
Causal discovery
Expert knowledge
A combination of both
Causal discovery
Causal discovery and causal structure learning are umbrella terms for
various kinds of methods used to uncover causal structure from observational
or interventional data. We devote the entirety of Part 3 of this book to this
topic.
Expert knowledge
Expert knowledge is a term covering various types of knowledge that can
help define or disambiguate causal relations between two or more variables.
Depending on the context, expert knowledge might refer to knowledge from
randomized controlled trials, laws of physics, a broad scope of experiences
in a given area, and more.
Dynamical systems
The scenario with two interacting partners that we discussed in the previous
section describes a dynamical system. This particular example is inspired by
the research by an American ex-rabbi turned psychologist called John
Gottman, who studies human romantic relationships from a dynamical
systems point of view (for an overview: Gottman & Notarius, 2000; Gottman
et al., 1999).
Cyclic SCMs
Because SCMs constitute a very useful framework for causal inference, there have been many attempts to generalize them to cyclic cases. For instance, Forré & Mooij (2017) proposed σ-separation – a generalization of d-separation for cyclic systems. Moreover, the same authors presented a causal discovery algorithm that not only works with cycles, but can also handle latent confounders (Forré & Mooij, 2018). Interestingly, Mooij & Claassen (2020) showed that FCI (which stands for fast causal inference) – a popular causal discovery algorithm – also gives correct results for data generated by cyclic systems under certain circumstances.
Wrapping it up
We started this chapter by refreshing our knowledge of graphs and learned
how to build simple graphs using Python and the NetworkX library. We
introduced GCMs and DAGs and discussed some common limitations and
challenges that we might face when using them.
Now you have the ability to translate between the visual representation of a
graph and an adjacency matrix. The basic DAG toolkit that we’ve discussed
in this chapter will allow you to work smoothly with many causal inference
and causal discovery tools and will help you represent your own problems
as graphs, which can bring a lot of clarity – even in your work with
traditional (non-causal) machine learning.
In the next chapter, we’ll learn how to use basic graphical structures to
understand the fundamental mechanics of causal inference and causal
discovery.
References
Cosentino, C., & Bates, D. (2011). Feedback control in systems biology.
CRC Press.
Forré, P., & Mooij, J. M. (2017). Markov properties for graphical models
with cycles and latent variables. arXiv preprint arXiv:1710.08775.
Gottman, J., Swanson, C., & Murray, J. (1999). The mathematics of marital
conflict: Dynamic mathematical nonlinear modeling of newlywed marital
interaction. Journal of Family Psychology, 13(1), 3.
Magliacane, S., Van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., &
Mooij, J. M. (2018). Domain adaptation by using causal inference to
predict invariant conditional distributions. Advances in neural information
processing systems, 31.
Pearl, J., & Mackenzie, D. (2019). The book of why. Penguin Books.
Uhler, C., Raskutti, G., Bühlmann, P., & Yu, B. (2013). Geometry of the
faithfulness assumption in causal inference. The Annals of Statistics, 436-
463.
5
We’ll start with a brief introduction to the mapping between distributions and
graphs. Next, we’ll learn about three basic graphical structures – forks,
chains, and colliders – and their properties.
Finally, we’ll use a simple linear example to show in practice how the
graphical properties of a system can translate to its statistical properties.
For an introduction to some of the fundamental differences between the two approaches,
check out Jake VanderPlas’ talk (https://bit.ly/BayesianVsFrequentsist). For slightly more
formal treatment, check out the Massachusetts Institute of Technology (MIT) course notes
PDF (https://bit.ly/BayesVsFreqMIT).
Let’s see how to encode conditional independence using the notation we just
introduced.
Now, let’s introduce a simple yet useful distinction. We will use the symbol,
, to denote independence in the distribution and to denote
independence in the graph. To say that and are independent in their
distributions, we’ll use the following:
On the other hand, to say that and are independent in the graph, we’ll
say:
Equipped with this new shiny notation, let’s discuss why mapping between
graphical and distributional independence might be important to us.
We say that two nodes, X and Y, are conditionally independent given (a set of) node(s) Z
when Z blocks all open paths that connect X and Y.
In other words, (in)dependence in the graph is a function of open paths between nodes in
this graph. We’ll use and enrich this intuition in the current chapter and build a more formal
perspective on top of it in Chapter 6.
The causal Markov condition states that a node, X_i, is independent of all its non-descendants (excluding its parents) given its parents. Therefore, formally, it can be presented as follows:

∀ X_i ∈ V: X_i ⫫ NonDescendants(X_i) \ Parents(X_i) | Parents(X_i), where G = (V, E)

This formula might look discouraging, but in fact, it says something relatively simple. Let's dissect and analyze it step by step:
First, we state that the node, X_i, is independent of its non-descendants (excluding its parents), given its parents, Parents(X_i).
Next, we have ∀ X_i ∈ V. The symbol ∀ means "for all". What we say here is that the statement holds for every node, X_i, in the graph G = (V, E), where G represents our graph (directed acyclic graph (DAG)) of interest, V is a set of all vertices (nodes) in this graph, and E is a set of all edges in this graph.
Putting it all together: the node, X_i, is independent of all other nodes in the graph, G, excluding the descendants and parents of this node, given its parents. If this is not entirely clear to you at this stage, that's perfectly fine. We'll reiterate this concept while discussing d-separation in the next chapter.
We’ll reiterate this concept while discussing d-separation in the next
chapter.
In the meantime, feel free to take a look at Figure 5.1 and Figure 5.2, which
depict two examples of graphs in which the causal Markov condition holds.
In both cases, controlling for the parent of the node of interest removes the association between the two nodes. Note that the relationships between these nodes are different in both figures. In Figure 5.1, the nodes have a common cause; in Figure 5.2, one of them is a grandparent of the other. If the causal Markov condition holds, the association between the nodes should be removed in both cases. Note that if there was an unobserved common cause of both nodes in any of the scenarios, controlling for the parent would not render the nodes independent, and the condition would be violated:
Figure 5.1 – Causal Markov condition – example one
Figure 5.2 – Causal Markov condition – example two
The important statement is that when the causal Markov condition holds, the following is true:

X ⫫_G Y | Z ⟹ X ⫫_P Y | Z

We read it as follows: if X and Y are independent in the graph given Z, then they are also statistically independent given Z. This is known as the global Markov property. In fact, we could show that the global Markov property, the local Markov property, and another property called the Markov factorization property are all equivalent (Lauritzen, 1996, pp. 51-52 for a formal proof; Peters et al., 2017, p. 101 for an overview).
Assumptions for causal discovery
So far, we have discussed the importance of mapping between graphical and
distributional independence structures. Now, let’s reverse the direction.
Causal discovery aims at discovering (or learning) the true causal graph from
observational and/or (partially) interventional data. In general, this task is
difficult. Nonetheless, it’s possible when certain conditions are met.
This might look familiar. Note that it's the exact reverse of the global Markov property that we discussed in the previous section:

X ⫫_P Y | Z ⟹ X ⫫_G Y | Z

The formula says that if X and Y are independent in the distribution given Z, they will also be independent in the graph given Z.
The last assumption that we will discuss in this section is called the causal
minimality condition (also known as the causal minimality assumption).
It turns out that there might be more than one graph that entails the same
distribution! That’s problematic when we want to recover causal structure
(represented as a causal graph) because the mapping between the graph and
the distribution is ambiguous. To address this issue, we use the causal
minimality condition.
Causal minimality can be seen from various perspectives (Neal, 2020, pp.
21-22; Pearl, 2009, p. 46; Peters & Schölkopf, 2017, pp. 107-108). Although
the assumption is usually perceived as a form of Ockham’s razor, its
implications have practical significance for constraint-based causal
discovery methods and their ability to recover correct causal structures.
Other assumptions
Before we conclude this section, let’s discuss one more important
assumption that is very commonly used in causal discovery and causal
inference. It’s an assumption of no hidden confounding (sometimes also
referred to as causal sufficiency). Although meeting this assumption is not
necessary for all causal methods, it’s pretty common.
Note that causal sufficiency and the causal Markov condition are related (and
have some practical overlap), but they are not identical. For further details,
check out Scheines (1996).
Now, let’s discuss three basic graphical structures that are immensely helpful
in determining sets of conditional independencies.
Ready?
Around 11 seconds later, to Mr. Huang’s dismay, his Tesla crashed into the
overturned truck’s rooftop. Fortunately, the driver, Mr. Huang, survived the
crash and came out of the accident without any serious injuries (Everington,
2020).
A chain of events
Many modern cars are equipped with some sort of collision warning or
collision prevention system. At a high level, a system like this consists of a
detector (or detector module) and an alerting system (sometimes also an
automatic driving assistance system). When there’s an obstacle on the
collision course detected by the detector, it sends a signal that activates the
alerting system. Let’s say that the detector is in state 1 when it detects an
obstacle and it’s in state 0 when it detects no obstacles.
An important fact about a system such as the one presented in Figure 5.3 is
that the existence of the obstacle does not give us any new information about
the alert when we know the detector state. The obstacle and the alert are
independent, given the detector state.
In other words, when the detector thinks there is a pedestrian in front of the
car, the alarm will set off even if there’s no pedestrian in front of the car.
Similarly, if there’s an obstacle on the road and the detector does not
recognize it as an obstacle, the alert will remain silent.
Chains
First, let’s replace the variables in the graph from Figure 5.3 with , and
. This general structure is called a chain, and you can see it in Figure 5.4:
Figure 5.4 – A chain structure
Intuitively, controlling for B closes the only open path that exists between A and C. Note that A and C become dependent when we do not control for B.
Translating this back to our object detection system: if we do not observe the detector state, the presence of the obstacle within the system's range becomes correlated with the safety response (alarm and emergency braking).
You might already see where it’s going. If we’re able to fulfill the
assumptions that we discussed in the previous section (causal Markov
condition, no hidden confounding), we can now predict conditional
independence structure in the data from the graph structure itself (and if that’s
true we can also figure out which variables we should control for in our
model to obtain valid causal effect estimates(!) – we’ll learn more on this in
the next chapter). Moreover, predicting the graph structure from the
observational data alone also becomes an option. That’s an exciting
possibility, but let’s take it slowly.
Now, let’s see what happens if we change the direction of one of the edges in
our chain structure.
Forks
Figure 5.5 represents a fork. A fork is a structure where the edge between nodes A and B is reversed compared to the chain structure:
Figure 5.5 – A fork structure
In the fork, node B becomes what we usually call a common cause of nodes A and C.
Imagine you’re driving your car, and you suddenly see a llama in the middle
of the road.
The detector recognizes the llama as an obstacle. The emergency brake kicks
in before you even realize it. At the same time, a small subcortical part of
your brain called the amygdala sends a signal to another structure, the
hypothalamus, which in turn activates your adrenal glands. This results in
an adrenaline injection into your bloodstream. Before you even notice, your
body has entered a state that is popularly referred to as a fight-or-flight
mode.
The presence of a llama on the road caused the detector to activate the
emergency brake and caused you to develop a rapid response to this
potentially threatening situation.
This makes the llama on the road a common cause of your stress response
and the detector’s response.
However, when we control for the llama on the road, your threat response
and the detector’s response become independent. You might feel stressed
because you’re running late for an important meeting or because you had a
tough conversation with your friend, and this has zero connection to the
detector state.
Note that in the real world, you would also likely react with a fight-or-flight
response to the mere fact that the emergency brakes were activated. The path,
llama → detector → brakes → fight-or-flight, will introduce a spurious
connection between llama and fight-or-flight that can be removed by
controlling for the brakes variable.
Let’s build a more formal example to clear any confusion that can come from
this connection.
Let’s take a look at Figure 5.5 once again and think about conditional
independencies. Are and unconditionally dependent?
In other words: are and dependent when we do not control for ? The
answer is yes.
Why are they dependent? They are dependent because they both inherit some
information from . At the same time, the information inherited from is all
they have in common.
Let’s take a look at a simple structural model that describes a fork structure
to see how independence manifests itself in the distribution:
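The structural model itself is given in the book as a set of equations; a minimal sketch consistent with the fork in Figure 5.5 (unit coefficients and standard normal noise are my assumptions) could be coded as follows:

import numpy as np

np.random.seed(42)
n = 10_000

# Fork: A <- B -> C (coefficients set to 1 for simplicity)
b = np.random.randn(n)
a = b + np.random.randn(n)
c = b + np.random.randn(n)

# Marginally, A and C are clearly correlated...
print(np.corrcoef(a, c)[0, 1])              # roughly 0.5

# ...but after regressing B out of both, the residuals are uncorrelated
slope_a, intercept_a = np.polyfit(b, a, 1)
slope_c, intercept_c = np.polyfit(b, c, 1)
resid_a = a - (slope_a * b + intercept_a)
resid_c = c - (slope_c * b + intercept_c)
print(np.corrcoef(resid_a, resid_c)[0, 1])  # close to zero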
It seems that chains and forks lead to the same pattern of conditional
independence, and if we want to recreate a true graph from the data, we end
up not knowing how to orient the edges in the graph!
This might sound disappointing, but before we let the disappointment take
over, let’s examine the last structure in this section.
Let’s take a look at Figure 5.6. In the collider, causal influence flows from
two different parents into a single child node:
Figure 5.6 – A collider structure
To extend our series of driving examples, let’s think about llamas on the road
and driving against the sun. I think that most people would agree it’s
reasonable to say that whether you drive against the sun is independent of
whether there’s a llama on the road.
Now, let’s assume that your detector reacts with some probability to the sun
reflexes as if there was an obstacle on the road.
The loss of independence between two variables, when we control for the
collider, sounds surprising to many people at first. It was also my
experience.
I found that for many people, examples involving real-world objects (even if
these objects are llamas) bring more confusion than clarity when it comes to
colliders.
Let’s build a slightly more abstract yet very simple example to make sure that
we clarify any confusion you might still have.
Take a look at Figure 5.6 once again to refresh the graphical representation
of the problem.
Now, let’s imagine that both and randomly generate integers between 1
and 3. Let’s also say that is a sum of and . Now, let’s take a look at
values of and when the value of . The following are the
combinations of and that lead to :
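The full table of combinations appears in the book as a figure; as an illustration, if we assume that we condition on the value B = 4, the compatible pairs can be enumerated directly:

from itertools import product

# A and C independently take integer values from 1 to 3; B = A + C.
# Which (A, C) pairs are compatible with observing B = 4?
pairs = [(a_val, c_val) for a_val, c_val in product(range(1, 4), repeat=2)
         if a_val + c_val == 4]
print(pairs)    # [(1, 3), (2, 2), (3, 1)]

Within these pairs, a larger value of A forces a smaller value of C, which is exactly the pattern described below.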
Can you see the pattern? Although A and C are unconditionally independent (there's no correlation between them as they randomly and independently generate integers), they become correlated when we observe B! The reason for this is that when we hold B constant and the value of A increases, the value of C has to decrease if we want to keep the value of B constant.
If you want to make it more visual, you can think about two identical glasses of water. If you randomly pour some water from one glass to the other, the total amount of water in both glasses will remain the same (assuming that we don't spill anything). If you repeat this n times and measure the amount of water in each glass at each stage, the amounts of water in the two glasses will be negatively correlated.
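Here is a quick simulation of the glasses intuition (the fixed total and the number of measurements are arbitrary choices of mine):

import numpy as np

np.random.seed(42)
n_measurements = 1000
total = 1.0                                             # total amount of water is fixed
glass_1 = np.random.uniform(0, total, n_measurements)   # a random split at each measurement
glass_2 = total - glass_1                               # the rest sits in the other glass

# Because the total is held constant, the two amounts are perfectly negatively correlated
print(np.corrcoef(glass_1, glass_2)[0, 1])              # -1.0 (up to floating-point error)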
I hope that this example will help you cement your intuition about colliders’
conditional independence properties.
Unfortunately, this is not always the case. Let’s discuss these scenarios now.
Ambiguous cases
As we’ve seen earlier, various graphical configurations might lead to the
same statistical independence structure. In some cases, we might get lucky
and have enough colliders in the graph to make up for it. In reality, though,
we might often not be that fortunate.
Does this mean that the discussion we have had so far leads us to the
conclusion that, in many cases, we simply cannot recover the graph from the
data?
That’s not entirely true. Even in cases where some edges cannot be oriented
using constraint-based methods, we can still obtain some useful information!
Let’s introduce the concept of the Markov equivalence class (MEC). A set
of DAGs, 𝒟 = { G 0(V, E 0) , … , G n(V, E n)}, is Markov equivalent if and
only if all DAGs in have the same skeleton and the same set of colliders
(Verma & Pearl, 1991).
A skeleton is basically an undirected version of a DAG – all the edges are in place, but we have no information about the arrows. If we add the arrowheads for all the collider structures that we've found, we will obtain a complete partially directed acyclic graph (CPDAG).
If we take the CPDAG and generate a set of all possible DAGs from it, we’ll
obtain a MEC. MECs can be pretty useful. Even if we cannot recover a full
DAG, a MEC can significantly reduce our uncertainty about the causal
structure for a given dataset.
Before we conclude this section, let’s take a look at Figure 5.8, which
presents a simple MEC:
Figure 5.8 – An example of a MEC
The graphs in Figure 5.8 have the same set of edges. If we removed the
arrows and left the edges undirected, we would obtain two identical graphs,
which is an indicator that both graphs have the same skeleton.
Great!
What we’re going to do now is to generate three datasets, each with three
variables, , , and . Each dataset will be based on a graph representing
one of the three structures: a chain, a fork, or a collider. Next, we’ll fit one
regression model per dataset, regressing on the remaining two variables,
and analyze the results. On the way, we’ll plot pairwise scatterplots for each
dataset to strengthen our intuitive understanding of a link between graphical
structures, statistical models, and visual data representations.
Let’s start with graphs. Figure 5.9 presents chain, fork, and collider
structures:
Figure 5.9 – Graphical representations of chain, fork, and collider structures
We will use the graphs from Figure 5.9 to guide our data-generating process.
Note that we omitted the noise variables for clarity of presentation.
The code for this section can be found in the Chapter_05.ipynb notebook
(https://bit.ly/causal-ntbk-05).
NOISE_LEVEL = .2
N_SAMPLES = 1000
NOISE_LEVEL will determine the standard deviation of noise variables in our
datasets. N_SAMPLES simply determines the sample size.
# Chain: A -> B -> C
a = np.random.randn(N_SAMPLES)
b = a + NOISE_LEVEL*np.random.randn(N_SAMPLES)
c = b + NOISE_LEVEL*np.random.randn(N_SAMPLES)
Let’s plot pairwise scatterplots for this dataset. We present them in Figure
5.10:
Figure 5.10 – Pairwise scatterplots for the dataset generated according to a chain-
structured graph
We can see that all the scatterplots in Figure 5.10 are very similar. The
pattern is virtually identical for each pair of variables, and the correlation is
consistently pretty strong. This reflects the characteristics of our data-
generating process, which is linear and only slightly noisy.
Note that the correlation metric that we used – Pearson’s r – can only capture linear
relationships between two variables. Metrics for non-linear relationships are also available,
but we won’t discuss them here.
# Fork: A <- B -> C
b = np.random.randn(N_SAMPLES)
a = b + NOISE_LEVEL*np.random.randn(N_SAMPLES)
c = b + NOISE_LEVEL*np.random.randn(N_SAMPLES)
In Figure 5.11, we can see pairwise scatterplots for the fork. What’s your
impression? Do they look similar to the ones in Figure 5.10?
Figure 5.11 – Pairwise scatterplots for the dataset generated according to a fork-
structured graph
Both figures (Figure 5.10 and Figure 5.11) might differ in detail (note that
we generate independent noise variables for each dataset), but the overall
pattern seems very similar between the chain and fork datasets.
Let’s see!
Generating the collider dataset
Let’s start with the data:
# Collider: A -> B <- C
a = np.random.randn(N_SAMPLES)
c = np.random.randn(N_SAMPLES)
b = a + c + NOISE_LEVEL*np.random.randn(N_SAMPLES)
This time the pattern is pretty different! The relationships between A and B and between B and C (top left and bottom left) seem to be noisier. Moreover, it seems there's no correlation between A and C (top right). Note that this result is congruent with what we said about the nature of colliders earlier in this chapter. Additionally, when we take a look at the data-generating process, this shouldn't be very surprising, because the data-generating process renders A and C entirely independent, which you can also see in the code.
In each case, we’ll regress on and . We’ll use statsmodels to fit the
regression models. The code for each of the three models is identical:
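The exact code lives in the notebook (https://bit.ly/causal-ntbk-05); a sketch consistent with the description that follows, shown here for the chain arrays a, b, and c, might look like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Predictors A and B as named columns; the outcome is C
X = pd.DataFrame(np.vstack([a, b]).T, columns=['A', 'B'])
X = sm.add_constant(X)         # adds the 'const' row visible in the summaries
model = sm.OLS(c, X).fit()
print(model.summary())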
In the first line, we create a pandas dataframe using NumPy arrays containing our predictors (A and B). This will make statsmodels automatically assign the correct variable names in the model summary.
After fitting the models (for the full flow, check out the code in the notebook
(https://bit.ly/causal-ntbk-05)), we’re ready to print the model summaries
and compare the results.
Figure 5.13 provides us with a compact summary of all three models. We
used yellow and red ellipses to mark the p-values for each of the models:
Figure 5.13 – The results of regression analysis of three basic conditional independence
structures
In Figure 5.13, there are three rows printed out for each model. The top row is marked as const, and we'll ignore it. The remaining two rows are marked with A and B, which denote our variables A and B, respectively.
For the sake of our analysis, we’ll use the customary threshold of 0.05 for p-
values. We’ll say that p-values greater than 0.05 indicate a non-significant
result (no influence above the noise level), and p-values lower than or equal
to 0.05 indicate a significant result.
We can see that for the chain model, only one predictor (B) is significant. Why is that? In the chain, controlling for B blocks the only path between A and C, so once B is included in the model, A carries no additional information about C.
When we look at the results for the fork, we see the same pattern.
Again, only one predictor is significant, and again it turns out to be B. The logic behind the difference between the pairwise scatterplots and the regression results is virtually identical to the one that we just discussed for the chain.
The model for the collider gives us a different pattern. For this model, both A and B are significant predictors of C. If you now take a look at Figure 5.12 again, you can clearly see that there's no relationship between A and C.
Before we conclude this section, I want to ask you to recall our discussion on
statistical control from Chapter 3, and think again about the question that we
asked in one of the sections – should we always control for all the available
variables? – knowing what we’ve learned in this chapter.
In this section, we saw how the properties of chains, forks, and colliders
manifest themselves in the realm of statistical analysis. We examined
pairwise relationships between variables and compared them to conditional
relationships that we observed as a result of multiple regression analysis.
Finally, we deepened our understanding of confounding.
Wrapping it up
This chapter introduced us to the three basic conditional independence
structures – chains, forks, and colliders (the latter also known as
immoralities or v-structures). We studied the properties of these structures
and demonstrated that colliders have unique properties that make constraint-
based causal discovery possible. We discussed how to deal with cases when
it’s impossible to orient all the edges in a graph and introduced the concept
of MECs. Finally, we got our hands dirty with coding the examples of all the
structures and analyzed their statistical properties using multiple linear
regression.
This chapter concludes the first, introductory part of this book. The next
chapter starts on the other side, in the fascinating land of causal inference.
We’ll go beyond simple linear cases and see a whole new zoo of models.
Ready?
References
Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
Springer.
Everington, K. (2020, Jun 2). Video shows Tesla on autopilot slam into
truck on Taiwan highway. Taiwan News.
https://www.taiwannews.com.tw/en/news/3943199.
Pearl, J., & Mackenzie, D. (2019). The Book of Why. Penguin Books.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and
Search. MIT Press.
Uhler, C., Raskutti, G., Bühlmann, P., & Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 436-463.
Verma, T., & Pearl, J. (1991). Equivalence and synthesis of causal models.
UCLA.
Part 3: Causal Discovery
In Part 3, we will start our journey into the world of causal discovery. We
will begin with an overview of the sources of causal knowledge and a
deeper look at important assumptions.
Along the way, we will show you how to inject expert knowledge into the
causal discovery process, and we will briefly discuss methods that allow us
to combine observational and interventional data to learn causal structure
more efficiently.
By the end of this chapter, you will know a broad range of causal discovery
methods. You’ll be able to implement them using Python and gCastle and
you’ll understand the mechanics and implications of combining selected
methods with pre-existing domain knowledge.
In this section, we’ll review these assumptions and discuss other, more
general assumptions that will be useful in our causal discovery journey.
Let’s start!
Gearing up
Causal discovery aims at discovering (or learning) the true causal graph from
observational (and sometimes interventional or mixed) data.
In general, this task is difficult but possible under certain conditions. Many
causal discovery methods will require that we meet a set of assumptions in
order to use them properly.
The first general assumption is one of causal sufficiency (or lack of hidden
confounding). A vast majority of causal discovery methods rely on this
assumption (although not all).
Another popular assumption for causal discovery is faithfulness.
This assumption is the reverse of the global Markov property that we use for causal inference and that we discussed back in Chapter 5. It says that if X and Y are independent in their distribution given Z, they will also be independent in the graph given Z.
Moreover, any situation where one variable influences another through two
different paths and these paths cancel out completely will lead to the
violation of faithfulness (see Chapter 5 for more details and see Brady Neal's video for a quick reference: https://bit.ly/BradyFaithfulness).
That said, the probability of encountering the latter in the real world is
extremely small (Spirtes et al., 2000, pp. 68-69), even if it feels easy to
come up with theoretical examples (Peters et al., 2017, pp. 107-108).
Minimalism is a virtue
There might exist more than one graph or structural causal model (SCM)
that entails the same distribution.
That’s a challenge for recovering the causal structure as the mapping between
the structure and the distribution becomes ambiguous.
Although not all causal discovery methods require all three assumptions that
we discussed in this section, the three are likely the most frequent over a
broad set of methods. We’ll talk more about this when discussing particular
methods in detail.
In the next section, we’ll discuss four streams of ideas that led to the
development of classic and contemporary causal discovery methods.
We use the word families rather than a more formal term such as type, as the
distinction between the four families we’ll use might be slightly vague. We
follow the categorization proposed by Glymour et al. (2019) and extend it
slightly to include more recent causal discovery methods not mentioned in
Glymour and colleagues’ paper.
The ideas that followed from this research led to two parallel research
streams focused around three academic centers – the University of
California, Los Angeles (UCLA), Carnegie Mellon University (CMU),
and Stanford (Pearl, 2009):
This section was titled The four (and a half) families. What about the
remaining half?
We’ll discuss each family in greater detail, but before we start, let’s
introduce the main library that we’ll use in this chapter – gCastle.
Introduction to gCastle
In this section, we’ll introduce gCastle (Zhang et al., 2021) – a causal
discovery library that we’ll use in this chapter. We’ll introduce four main
modules: models, synthetic data generators, visualization tools, and model
evaluation tools. By the end of this section, you’ll be able to generate a
synthetic dataset of chosen complexity, fit a causal discovery model,
visualize the results, and evaluate the model using your synthetic data as a
reference.
Hello, gCastle!
What is gCastle?
Models
Visualization tools
Erdős–Rényi
Scale-free (Barabási–Albert)
Bipartite
Hierarchical
Low-rank
Gaussian (gauss)
Exponential (exp)
Gumbel (gumbel)
Uniform (uniform)
Non-linear datasets can be generated using the following:
Let’s generate some data. On the way, we’ll also briefly discuss scale-free
graphs and the main parameters of the IIDSimulation object.
Generating the data with gCastle
Let’s start with the imports. The code for this chapter can be found in the
notebook at https://bit.ly/causal-ntbk-13.
SEED = 18
np.random.seed(SEED)
We use the .scale_free() method of the DAG object. We specify the number
of nodes to be 10 and the number of edges to be 17. We fix the random seed to
the predefined value.
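The import and graph-generation lines are available in the notebook; a sketch of what they might look like (module paths as in gCastle 1.0.x; treat the exact signatures as an assumption) is:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# gCastle building blocks used in this chapter
from castle.datasets import DAG, IIDSimulation
from castle.algorithms import PC
from castle.common import GraphDAG
from castle.metrics import MetricsDAG

SEED = 18
np.random.seed(SEED)

# A scale-free DAG with 10 nodes and 17 edges, as described above
adj_matrix = DAG.scale_free(n_nodes=10, n_edges=17, seed=SEED)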
Although scale-free networks were known at least from the 1960s (although
not necessarily under their current name), they significantly gained popularity
in the late 1990s and early 2000s with the works of Albert-László Barabási
and Réka Albert, who proposed that scale-free networks can be used to
model the internet interlink structure and other real-world systems (Barabási
and Albert, 1999; Barabási, 2009).
First, let’s transform the adjacency matrix into the NetworkX directed graph
object:
g = nx.DiGraph(adj_matrix)
plt.figure(figsize=(12, 8))
nx.draw(
G=g,
node_color=COLORS[0],
node_size=1200,
pos=nx.circular_layout(g)
)
We set the figure size using Matplotlib and pass a number of parameters to
NetworkX’s nx.draw() method. We pass g as a graph, set the node color and
size, and define the network layout to increase readability. The circular
layout should provide good visual clarity for our relatively small graph.
In Figure 13.1, at the top, we see a highly connected node with six incoming
edges. This is the effect of the preferential attachment mechanism. If we grew
our graph further, the differences between highly connected nodes and less
connected nodes would become even more visible. Looking at the graph in
Figure 13.1, you can imagine how the preferential attachment mechanism
would add more edges to the more connected nodes and fewer edges to the
less connected ones.
We are now ready to generate some data. We’ll use our adjacency matrix,
adj_matrix, and generate 10,000 observations with linear structural
assignments and Gaussian noise:
dataset = IIDSimulation(
W=adj_matrix,
n=10000,
method='linear',
sem_type='gauss'
)
dataset.X
With our dataset generated, let’s learn how to fit a gCastle causal discovery
model.
# Instantiate the PC algorithm and learn the structure from the data
pc = PC()
pc.learn(dataset.X)
# The learned adjacency matrix
pc.causal_matrix
This will print out the learned causal matrix. We won’t print it here to save
space, but you can see the matrix in this chapter’s notebook.
Now that we have learned the adjacency matrix, we are ready to compare it
with the ground truth. First, we’ll do it visually.
Comparing two graphs visually is only useful for smaller graphs. Although
our graph has only 10 nodes and 17 edges, the comparison might be already
challenging for some of us.
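The plotting call itself is in the notebook; assuming the GraphDAG helper imported above, the side-by-side comparison can be produced with a call of the following form (the same helper appears again later in this chapter):

# Plot the learned adjacency matrix next to the ground truth
GraphDAG(
    est_dag=pc.causal_matrix,
    true_dag=adj_matrix
)
plt.show()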
Figure 13.3 – Heatmaps representing the learned and the true adjacency matrices
After a short inspection, you should be able to spot four differences between
the two matrices in Figure 13.3. That’s congruent with our observation from
Figure 13.2.
Visual comparisons might be useful and even very helpful in certain cases,
yet their usefulness is limited due to the limitations of our human attention,
which can only track between three and seven elements at a time. That’s
when numerical evaluation metrics become useful.
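The creation of the metrics object is not shown above; following the same pattern used later in this chapter, it could look like this:

# Compare the learned matrix against the ground-truth adjacency matrix
metrics = MetricsDAG(
    B_est=pc.causal_matrix,
    B_true=adj_matrix
)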
The newly created metrics object computes a number of useful metrics for us
internally. We can access them through a dictionary internally stored as an
attribute called metrics. The values for all available metrics can be easily
accessed via respective dictionary keys.
metrics.metrics['F1']
As of the time of writing (gCastle, version 1.0.3), there are nine available
metrics in MetricsDAG.
metrics.metrics
{'fdr': 0.1176,
'tpr': 0.9375,
'fpr': 0.069,
'shd': 3,
'nnz': 17,
'precision': 0.8333,
'recall': 0.9375,
'F1': 0.8824,
'gscore': 0.75}
Table 13.1 lists the names and acronyms of the metrics available in MetricsDAG:

Name – Acronym
False Discovery Rate – fdr
True Positive Rate – tpr
False Positive Rate – fpr
Structural Hamming Distance – shd
Number of Non-Zero Entries – nnz
Precision – precision
Recall – recall
F1 Score – F1
G-Score – gscore
In Table 13.1, TP stands for the number of true positives, FP stands for the number of false positives, TN stands for the number of true negatives, and FN stands for the number of false negatives; some of the definitions also use the number of reversed edges.
There are two other useful quantities that you might want to compute when
benchmarking causal discovery methods: the number of undirected edges and
the structural intervention distance (SID) (Peters and Bühlmann, 2015).
def get_n_undirected(g):
    # Count undirected edges: a pair of symmetric non-zero entries
    # (g[i, j] == g[j, i] == 1) is visited twice, adding 0.5 each time,
    # which yields one edge in total
    total = 0
    for i in range(g.shape[0]):
        for j in range(g.shape[0]):
            if (g[i, j] == 1) and (g[i, j] == g[j, i]):
                total += .5
    return total
Ready?
Let’s start!
By the end of this chapter, you will have a solid understanding of how
constraint-based methods work and you’ll know how to implement the PC
algorithm in practice using gCastle.
Let’s start with a brief refresher on chains, forks, and colliders. Figure 13.4
presents the three structures:
Figure 13.4 – The three basic graphical structures
Let’s take a look at Figure 13.4 once again. Each structure consists of three
variables represented by nodes A, B, and C. For each structure, let’s imagine
a model where we regress C on A.
The graph in Figure 13.5 consists of four nodes and three directed edges.
For a more detailed view of the PC algorithm, check Glymour et al. (2019),
Spirtes et al. (2000), or Le et al. (2015).
If any of the steps seem unclear to you, feel free to go through the stages in
Figure 13.6 once again and review the PC algorithm steps that we discussed
earlier. If you need a deeper refresher on conditional independencies, review
Chapter 5.
Let’s generate some data according to our graph and fit the PC algorithm to
recover the true structure.
N = 1000
# Data-generating process: p -> r <- q, r -> s
p = np.random.randn(N)
q = np.random.randn(N)
r = p + q + .1 * np.random.randn(N)
s = .7 * r + .1 * np.random.randn(N)
# Store the data as a matrix
pc_dataset = np.vstack([p, q, r, s]).T
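The ground-truth matrix pc_dag used for the comparison below is defined in the notebook; under the structure described above (p -> r <- q and r -> s, with the variable order p, q, r, s), it could be encoded as:

# True DAG as an adjacency matrix (rows are sources, columns are targets)
pc_dag = np.array([
    [0, 0, 1, 0],    # p -> r
    [0, 0, 1, 0],    # q -> r
    [0, 0, 0, 1],    # r -> s
    [0, 0, 0, 0],
])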
pc = PC()
pc.learn(pc_dataset)
Let’s display the predicted causal matrix alongside the true DAG:
GraphDAG(
est_dag=pc.causal_matrix,
true_dag=pc_dag)
plt.show()
Figure 13.7 – The comparison between the DAG learned by the PC algorithm (left) and the
true DAG (right)
Both DAGs in Figure 13.7 look identical. It indicates that our model learned
the underlying DAG perfectly.
In this case, it’s just a pure formality; nonetheless, let’s print out the metrics:
MetricsDAG(
B_est=pc.causal_matrix,
B_true=pc_dag
).metrics
{'fdr': 0.0,
'tpr': 1.0,
'fpr': 0.0,
'shd': 0,
'nnz': 3,
'precision': 1.0,
'recall': 1.0,
'F1': 1.0,
'gscore': 1.0}
One caveat is that, in certain scenarios, the algorithm might be sensitive to the order in which we perform conditional independence tests. This sensitivity has been addressed in a variant of the PC algorithm called PC-stable (Colombo and Maathuis, 2012).
pc_stable = PC(variant='stable')
MORE ON PC-STABLE
In fact, PC-stable only addresses a part of the problem of ordering sensitivity – the variant is
only insensitive to the ordering of the tests that lead to the discovery of the skeleton but is still
sensitive to ordering when finding colliders and orienting edges. Other variants of the PC
algorithm have been proposed to address these two problems, but it turns out that, in
practice, they might lead to less useful results than simple PC-stable. Check Colombo and
Maathuis (2012) for more details.
pc_parallel = PC(variant='parallel')
The results for the stable and parallel versions of the algorithm are identical
for our example so we do not present them here (but you can find them in the
notebook).
pc_cat = PC(ci_test='chi2')
In order to use them, we need to pass a function (without calling it) to the ci_test parameter:
pc_cat = PC(ci_test=g2)
This modular design is really powerful. Notice that all causal libraries used
in this book so far – DoWhy, EconML, and gCastle – promote an open,
modular, and flexible design, which makes them well suited for future
developments and facilitates research.
The time has come to conclude this section. We’ll get back to the PC
algorithm briefly in later sections of this chapter to reveal one more hidden
gem that can be really helpful in real-world scenarios.
In this section, we learned about constraint-based causal discovery methods
and introduced the PC algorithm. We discussed the five basic steps of the PC
algorithm and applied them to our example DAG. Next, we generated data
according to the DAG and implemented the PC algorithm using gCastle.
In the next section, we’ll introduce the second major family of causal
discovery methods – score-based methods.
GES is a two-stage procedure. First, it generates the edges, then it prunes the
graph.
The algorithm starts with a blank slate – an entirely disconnected graph – and
iteratively adds new edges. At each step, it computes a score that expresses
how well a new graph models the observed distribution, and at each step, the
edge that leads to the highest score is added to the graph.
The entire first and second phases are executed in a greedy fashion (hence
the greedy part of the algorithm’s name).
GES – scoring
In order to compute the score at each step, Chickering (2003) uses the
Bayesian scoring criterion. Huang et al. (2018) proposed a more general
method based on regression in a Reproducing Kernel Hilbert Space (RKHS),
which allows for mixed data types and multidimensional variables.
GES in gCastle
Implementing the GES algorithm in gCastle is straightforward:
ges = GES(criterion='bic')
To train the model, use the .learn() method (analogously to what we’ve
done for the PC algorithm):
ges.learn(pc_dataset)
We used the dataset generated according to the graph in Figure 13.5. Let’s
plot the results to see how GES performed:
GraphDAG(
est_dag=ges.causal_matrix,
true_dag=pc_dag)
plt.show()
As we can see in Figure 13.8, the GES algorithm did a very good job and
retrieved the perfect graph.
In general, the GES algorithm can be proven to be optimal (based on the so-
called Meek Conjecture; see Chickering (2003) for details), but the
guarantees are only asymptotic.
In the limit of a large sample, GES converges to the same solution as PC. In
my personal experience, I found GES repeatedly underperforming compared
to other methods (including PC), but please note that my observations do not
have a systematic character here.
That said, symmetry only tells a part of the story. Perfectly symmetrical faces
and objects are perceived as unnatural – fearfully symmetrical. It seems that
nature likes to be asymmetrical sometimes and a large class of real-world
variables is distributed in a non-symmetrical manner.
It seems that the linear-Gaussian case is hard, but moving into either the non-
linear or non-Gaussian realm breaks the symmetry.
ANM model
ANM (which stands for additive noise model; Hoyer et al., 2008) is a causal discovery algorithm that leverages the
fact that when two variables are related non-linearly, the causal mechanism
does leave traces.
x = np.random.randn(1000)
y = x**3 + np.random.randn(1000)
plt.figure(figsize=(10, 7))
plt.scatter(x, y, alpha=.5, color=COLORS[0])
plt.xlabel('$X$')
plt.ylabel('$Y$')
plt.show()
Now, let’s fit two non-linear spline regressions to our data – one in the
causal (X -> Y) and one in the anti-causal (Y -> X) direction.
# GAM is assumed here to come from the pygam library
from pygam import GAM

n_splines = 150
# Instantiate the models
model_xy = GAM(n_splines=n_splines)    # causal direction: X -> Y
model_yx = GAM(n_splines=n_splines)    # anti-causal direction: Y -> X
# Fit the models
model_xy.fit(x.reshape(-1, 1), y)
model_yx.fit(y.reshape(-1, 1), x)
# Generate predictions
y_pred = model_xy.predict(x.reshape(-1, 1))
x_pred = model_yx.predict(y.reshape(-1, 1))
Let’s plot the data alongside the fitted regression curves. Figure 13.11
presents the results:
Figure 13.11 – Scatter plot of non-linear data with two fitted regression curves
The two lines representing the causal and the anti-causal models differ
significantly.
residuals_xy = y - y_pred
residuals_yx = x - x_pred
plt.figure(figsize=(15, 7))
plt.subplot(121)
plt.scatter(x, residuals_xy, alpha=.5, color=COLORS[0])
plt.xlabel('$X$', fontsize=14)
plt.ylabel('$Y-residuals$', fontsize=14)
plt.subplot(122)
plt.scatter(residuals_yx, y, alpha=.5, color=COLORS[0])
plt.xlabel('$X-residuals$', fontsize=14)
plt.ylabel('$Y$', fontsize=14)
plt.show()
Figure 13.12 – Scatter plots of residuals for the causal and the anti-causal model
As we can see in Figure 13.12, the residuals for both models form very
different patterns. In particular, the residuals for the anti-causal model (right
panel) form a cross-like pattern, indicating a lack of independence between
the residuals (x axis) and the predictor (y axis).
This is great news for us!
Assessing independence
It seems clear that none of the traditional correlation metrics such as
Pearson’s r or Spearman’s rho will be able to help here. The relationship in
the right panel of Figure 13.12 is non-monotonic, violating the basic
assumption of both methods.
Luckily, many methods exist that can help us with this task. One of the
methods traditionally used in ANM models is called Hilbert-Schmidt
Independence Criterion (HSIC). HSIC can easily handle patterns like the
one that we obtained for the anti-causal model.
Let’s compute HSIC for both models. We’ll use the implementation of HSIC
provided by gCastle.
# Compute HSIC
is_indep_xy = hsic_test(
x = x.reshape(-1, 1),
y = residuals_xy.reshape(-1, 1),
alpha=.05
)
is_indep_yx = hsic_test(
x = y.reshape(-1, 1),
y = residuals_yx.reshape(-1, 1),
alpha=.05
)
is_indep_xy, is_indep_yx
(1, 0)
The result says that the residuals are independent for the model in the X -> Y
direction, while they are dependent for the model in the Y -> X direction.
This is correct!
Congrats, you just learned how to implement the ANM model from scratch!
There are also other options to implement ANM. You can use another non-
linear model instead of spline regression (for instance, gCastle’s
implementation of ANM uses Gaussian process regression). You can also use
other tools to compare the residuals. For instance, an alternative approach to
independence testing is based on likelihood. Check this notebook for
implementation: https://bit.ly/ANMNotebook.
ANM only works when the data has independent additive noise; it also
requires no hidden confounding.
LiNGAM time
While ANM relies on non-linearity to break the symmetry between the causal
and anti-causal models, LiNGAM relies on non-Gaussianity.
Let’s compare residual patterns in two linear datasets: one with Gaussian
and one with non-Gaussian noise.
SAMPLE_SIZE = 1000

x_gauss = np.random.normal(0, 1, SAMPLE_SIZE)
y_gauss = x_gauss + 0.3 * np.random.normal(0, 1, SAMPLE_SIZE)

x_ngauss = np.random.uniform(0, 1, SAMPLE_SIZE)
y_ngauss = x_ngauss + 0.3 * np.random.uniform(0, 1, SAMPLE_SIZE)
Let’s take a look at Figure 13.13, which presents the Gaussian and non-
Gaussian data with fitted regression lines in causal and anti-causal
directions:
Figure 13.13 – Gaussian and non-Gaussian data with fitted regression lines
In the top panes of Figure 13.13, we see two Gaussian models with fitted
regression lines (first and third columns) and their respective residuals
(second and fourth columns). In the bottom panes, we see two non-Gaussian
models (first and third columns) and their residuals (second and fourth
columns).
ICA is an algorithm frequently used to recover the source signals from noisy
overlapping observations. A popular example of ICA usage is the source
separation in the so-called cocktail party problem, where we have a multi-
track recording of multiple speakers speaking simultaneously and we want to
separate those speakers’ voices into separate tracks.
It turns out that we can achieve it under certain assumptions using ICA. One
of the core assumptions here is the non-Gaussianity of the source signals,
which allows us to presume that there’s a bijective mapping between the
noisy recording and the source.
The second assumption is linearity. ICA can only handle linear mixing and, again, LiNGAM inherits this limitation.
The good news is that LiNGAM does not require the faithfulness assumption.
Let's fit ICA-LiNGAM to the same dataset that we used for the PC algorithm earlier in this chapter:
lingam = ICALiNGAM(random_state=SEED)
lingam.learn(pc_dataset)
Let’s plot the results. Figure 13.14 presents the comparison between the
predicted and the correct adjacency matrix:
We can see that LiNGAM has predicted two edges correctly (1 -> 2 and 2 ->
3) but it has missed the edge 0 -> 2. Additionally, the model hallucinated two
edges (1 -> 0 and 2 -> 0). These results are not very impressive, especially
for such a simple graph. That said, this is expected as we violated one of the
basic assumptions of the method.
Using legal data with LiNGAM
Let's generate data that LiNGAM can deal with. We'll keep the same causal structure that we used for pc_dataset and only update the functional forms:
# Same causal structure as pc_dataset, but with non-Gaussian (uniform) noise
a = np.random.uniform(0, 1, N)
b = np.random.uniform(3, 6, N)
c = a + b + .1 * np.random.uniform(-2, 0, N)
d = .7 * c + .1 * np.random.uniform(0, 1, N)

lingam_dataset = np.vstack([a, b, c, d]).T
lingam = ICALiNGAM(random_state=SEED)
lingam.learn(lingam_dataset)
This time, the model was able to perfectly retrieve the true graph.
A great feature of LiNGAM is that it retrieves not only the causal structure but also the strength of the relationships between variables. In this sense, LiNGAM is an end-to-end method for causal discovery and causal inference.
Let’s check how well LiNGAM did with retrieving the coefficients. We can
access the learned weighted matrix via the weight_causal_matrix attribute:
lingam.weight_causal_matrix
Tensor([[0. , 0. , 1.003, 0. ],
[0. , 0. , 0.999, 0. ],
[0. , 0. , 0. , 0.7 ],
[0. , 0. , 0. , 0. ]])
Figure 13.16 presents these results in the context of the original structural
equations. Let’s see how well the algorithm managed to recover the true
coefficients.
Figure 13.16 – LiNGAM results (right) versus the original DAG (left) and the original
structural equations (top)
In Figure 13.16 (top), we see the set of original structural equations (the
ones that we implemented). On the bottom, we see the original DAG (left)
and the learned matrix (right). The blue dashed lines map the coefficients
from the equations to the respective cells in the original DAG. The green
cells on the right contain the coefficients retrieved by LiNGAM. Dark green
emphasizes the perfectly retrieved coefficient.
Before we conclude, there are a couple of things that are good to know about
LiNGAM. First, LiNGAM uses ICA in the background, and ICA is a
stochastic method. Sometimes, it might not converge to a good solution
within the default number of steps.
You can modify the number of steps by modifying the max_iter parameter:
lingam = ICALiNGAM(
max_iter=2000,
random_state=SEED
)
DirectLiNGAM (Shimizu et al., 2011) addresses this limitation. It's guaranteed to converge to the true solution in the infinite sample regime when the model assumptions are strictly met, and the number of steps required for convergence scales linearly with the number of variables.
Of course, we never have access to an infinite number of samples in the real
world. That said, DirectLiNGAM has been shown to outperform its
counterpart on simulated finite sample data as well (Shimizu et al., 2011).
DirectLiNGAM also solves the scaling issue. The cost for the improvements
is in computational time. The time scales as a power function of the number
of variables (nodes).
d_lingam = DirectLiNGAM()
d_lingam.learn(lingam_dataset)
The results are identical to the regular model, so we skip them here. You can
see the plots and the learned weighted matrix in the notebook.
More variants of the LiNGAM model exist than the two we discussed here.
The LiNGAM framework has been extended to work with time series,
cyclical graphs (Lacerda et al., 2008), and latent variables, among others.
For implementations and references for some of these variants, check the
Python LiNGAM library (https://bit.ly/LiNGAMDocs).
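If you'd like to try the standalone lingam package, a minimal sketch of its usage (assuming its current API; see the docs linked above for details and for the other variants) could look like this:

import lingam

model = lingam.DirectLiNGAM()
model.fit(lingam_dataset)

print(model.causal_order_)       # estimated causal ordering of the variables
print(model.adjacency_matrix_)   # estimated weighted adjacency matrix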
It’s time to conclude our journey with functional-based causal discovery for
now.
We learned how to implement the ANM model from scratch using Python and the HSIC independence test from the gCastle library, and how to use the algorithms from the LiNGAM family using gCastle. Finally, we discussed the main limitations of ICA-LiNGAM and saw how to address them using DirectLiNGAM.
In the next section, we’ll introduce the most contemporary family of causal
discovery methods – gradient-based methods.
The work was titled DAGs with NO TEARS: Continuous Optimization for
Structure Learning and introduced a novel approach to causal structure
learning (though we need to say that the authors did not explicitly state that
their method is causal).
The key idea is to replace the combinatorial acyclicity check with a smooth constraint: a weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$ represents a DAG if and only if $h(W) = \mathrm{tr}\left(e^{W \circ W}\right) - d = 0$, where:
1. $W \circ W$ is the element-wise (Hadamard) product of $W$ with itself
2. $e^{W \circ W}$ is a matrix exponential
3. $\mathrm{tr}(\cdot)$ is the trace of a matrix
Let's implement this check in Python step by step.
First, we import the linalg linear algebra module from SciPy. We’ll need it
for the matrix exponential operation.
Next, we multiply the adjacency matrix element-wise by itself. Note that this step is only effective for weighted adjacency matrices. For binary matrices, it does not change anything (element-wise squaring leaves zeros and ones untouched). Finally, we compute the trace of the resulting matrix and subtract the number of rows in the adjacency matrix, which is equivalent to the number of nodes in the graph represented by this matrix (or the number of variables in our problem).
If this quantity (note that trace is a scalar value) is equal to zero, we return
True. Otherwise, we return False.
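Putting the description above together, a sketch of check_if_dag could look like this (np.isclose guards against floating-point error in the matrix exponential):

import numpy as np
from scipy import linalg

def check_if_dag(adj_matrix):
    # h(W) = tr(exp(W * W)) - d is zero if and only if the graph is acyclic
    d = adj_matrix.shape[0]
    h = np.trace(linalg.expm(adj_matrix * adj_matrix)) - d
    return np.isclose(h, 0)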
Let’s test the function on the pc_dag DAG that we created at the beginning of
this chapter:
check_if_dag(pc_dag)
Now, let's check a graph with a cycle (note the edges 0 -> 1 and 1 -> 0):
dcg = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0]
])
check_if_dag(dcg)
False
The overall loss function in NOTEARS consists of a mean squared error component with a number of additional regularization and penalty terms. The function is optimized using the augmented Lagrangian method (Nemirovski, 1999) subject to the constraint that $h(W) = 0$ (see Zheng et al., 2018, for the full definition).
The initial excitement cooled down after the publication of two research
papers that examined the NOTEARS algorithm. Kaiser and Sipos (2021)
have shown that NOTEARS’ sensitivity to scaling makes it a risky choice for
real-world causal discovery. Reisach et al. (2021) demonstrated similar
inconsistencies in the performance of gradient-based methods between
standardized and unstandardized data and pointed to a more general problem:
synthetic data used to evaluate these methods might contain unintended
regularities that can be relatively easily exploited by these models, as the
authors expressed in the title of their paper, Causal Discovery Benchmarks
May Be Easy To Game.
These changes do not seem to address all the issues raised by Reisach et al.
(2021) and it seems that GOLEM’s performance is affected by the unintended
variance patterns introduced by synthetic data-generating processes. In
particular, the model’s performance tends to degrade in the normalized data
scenario, where unfair variance patterns are minimized. Note that the same
is true for NOTEARS (check Reisach et al., 2021, for the details).
The comparison
Now, let’s generate three datasets with different characteristics and compare
the performance of the methods representing each of the families.
methods = OrderedDict({
'PC': PC,
'GES': GES,
'LiNGAM': DirectLiNGAM,
'Notears': NotearsNonlinear,
'GOLEM': GOLEM
})
We’re now ready to start the experiment. You can find the full experimental
loop in the notebook for this chapter.
We present the results in Figure 13.18. The best-performing models for each
dataset are marked in bold.
Figure 13.18 – Results of the model comparison
In the next section, we’ll see how to combine causal discovery algorithms
with expert knowledge.
By the end of this section, you’ll be able to translate expert knowledge into
the language of graphs and pass it to causal discovery algorithms.
We’ll take the linear Gaussian dataset from the previous section’s experiment
and we’ll try to improve PC’s performance by adding some external
knowledge to the algorithm.
priori_knowledge = PrioriKnowledge(n_nodes=10)

# The edge 7 -> 3 must be present; the edges 0 -> 9 and 8 -> 6 are forbidden
priori_knowledge.add_required_edges([(7, 3)])
priori_knowledge.add_forbidden_edges([(0, 9), (8, 6)])
I checked the plots from the previous section's experiment in the notebook to find the edges that the PC algorithm got wrong, and we encode some of them here.
pc_priori = PC(priori_knowledge=priori_knowledge)
pc_priori.learn(datasets['linear_gauss'].X)
Note that, this time, we pass the priori_knowledge object to the constructor.
Let’s compare the results before and after sharing our knowledge with the
algorithm. Figure 13.19 summarizes the results for us:
Figure 13.19 – Results for the PC algorithm before and after adding prior knowledge
In Figure 13.19, we see that all metrics that we record have been improved
after adding external knowledge to the algorithm. We could continue adding
and restricting more edges to improve the algorithm further.
In this section, we’ve learned how to encode and pass expert knowledge to
the PC algorithm. Currently, only one algorithm in gCastle supports external
knowledge, but as far as I know, the developers are planning to add support
for more algorithms in the future.
Wrapping it up
We started this chapter with a refresher on important causal discovery
assumptions. We then introduced gCastle. We discussed the library’s main
modules and trained our first causal discovery algorithm. Next, we discussed
the four main families of causal discovery models – constraint-based, score-
based, functional, and gradient-based – and implemented at least one model
per family using gCastle. Finally, we ran a comparative experiment and
learned how to pass expert knowledge to causal models.
In the next chapter, we’ll discuss more advanced ideas in causal discovery
and take a broader perspective on the applicability of causal discovery
methods in real-life use cases.
References
Barabási, A. L. (2009). Scale-free networks: a decade and beyond. Science,
325(5939), 412-413.
Cai, R., Wu, S., Qiao, J., Hao, Z., Zhang, K., and Zhang, X. (2021). THP:
Topological Hawkes Processes for Learning Granger Causality on Event
Sequences. ArXiv.
Enquist, M., Arak, A. (1994). Symmetry, beauty and evolution. Nature, 372,
169–172
Glymour, C., Zhang, K., and Spirtes, P. (2019). Review of Causal Discovery
Methods Based on Graphical Models. Frontiers in genetics, 10, 524.
Hoyer, P., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2008).
Nonlinear causal discovery with additive noise models. In D. Koller, D.
Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information
Processing Systems, 21. Curran Associates, Inc.
Huang, B., Zhang, K., Lin, Y., Schölkopf, B., and Glymour, C. (2018).
Generalized score functions for causal discovery. In Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, 1551-1560.
Johnston, I. G., Dingle, K., Greenbury, S. F., Camargo, C. Q., Doye, J. P. K.,
Ahnert, S. E., and Ard A. Louis. (2022). Symmetry and simplicity
spontaneously emerge from the algorithmic nature of evolution.
Proceedings of the National Academy of Sciences, 119(11), e2113883119.
Kaiser, M., and Sipos, M. (2021). Unsuitability of NOTEARS for Causal
Graph Discovery. ArXiv, abs/2104.05441.
Lacerda, G., Spirtes, P. L., Ramsey, J., and Hoyer, P. O. (2008). Discovering
Cyclic Causal Models by Independent Components Analysis. Conference on
Uncertainty in Artificial Intelligence.
Le, T.D., Hoang, T., Li, J., Liu, L., Liu, H., and Hu, S. (2015). A Fast PC
Algorithm for High Dimensional Causal Discovery with Multi-Core PCs.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16,
1483-1495.
Ng, I., Ghassami, A., and Zhang, K. (2020). On the Role of Sparsity and
DAG Constraints for Learning Linear DAGs. ArXiv, abs/2006.10201.
Rebane, G., and Pearl, J. (1987). The recovery of causal poly-trees from
statistical data. International Journal of Approximate Reasoning.
Shimizu, S., Hoyer, P.O., Hyvärinen, A., and Kerminen, A.J. (2006). A
Linear Non-Gaussian Acyclic Model for Causal Discovery. J. Mach. Learn.
Res., 7, 2003-2030.
Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio,
T., Hoyer, P.O., and Bollen, K.A. (2011). DirectLiNGAM: A Direct Method
for Learning a Linear Non-Gaussian Structural Equation Model. J. Mach.
Learn. Res., 12, 1225–1248.
Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-
climbing Bayesian network structure learning algorithm. Machine
Learning, 65(1), 31-78.
Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. (2013). Geometry of the
faithfulness assumption in causal inference. The Annals of Statistics, 436-
463.
Zhang, K., Zhu, S., Kalander, M., Ng, I., Ye, J., Chen, Z., and Pan, L. (2021).
gCastle: A Python Toolbox for Causal Discovery. ArXiv.
Zheng, X., Aragam, B., Ravikumar, P., and Xing, E.P. (2018). DAGs with NO
TEARS: Continuous Optimization for Structure Learning. Neural
Information Processing Systems.
Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E.P. (2020).
Learning Sparse Nonparametric DAGs. International Conference on
Artificial Intelligence and Statistics.
Part 2: Causal Inference
In the first chapter of Part 2, we will deepen and strengthen our
understanding of the important properties of graphical models and their
connections to statistical quantities.
In the last two chapters, we’ll introduce a number of causal estimators that
will allow us to estimate average and individualized causal effects.
In this part, we’ll dive into the world of causal inference. We’ll combine all
the knowledge that we’ve gained so far and start building on top of it. This
chapter will introduce us to two powerful concepts – d-separation and
estimands.
Combining these two concepts with what we’ve learned so far will equip us
with a flexible toolkit to compute causal effects.
Further down the road, we’ll discuss back-door and front-door criteria –
two powerful methods to identify causal effects – and introduce a more
general notion of Judea Pearl’s do-calculus. Finally, we’ll present
instrumental variables – a family of techniques broadly applied in
econometrics, epidemiology, and social sciences.
After reading this chapter and working through the exercises, you will be
able to take a simple dataset and – assuming that the data meets necessary
assumptions – compute causal effect estimates yourself.
Do-calculus
Instrumental variables
Let’s formalize this definition and make it a little bit more general at the same
time. Instead of talking about blocking a path between two nodes with
another node, we will talk about paths between sets of nodes blocked by
another set of nodes. We will denote sets of nodes with capital cursive script
letters, , , and .
Thinking in terms of sets of nodes rather than single nodes is useful when we
work with scenarios with multiple predictors and/or multiple confounders. If
you prefer thinking in terms of single nodes X, Y, and Z, imagine these nodes
as a single-element set, and the same definition will work for you!
Let’s go!
For any three disjoint sets of nodes $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$, a path between $\mathcal{X}$ and $\mathcal{Y}$ is blocked by $\mathcal{Z}$ in the following scenarios:
1. The path contains a chain or a fork whose middle node is in $\mathcal{Z}$
2. The path contains a collider whose middle node is not in $\mathcal{Z}$ and has no descendants in $\mathcal{Z}$
Figure 6.1 – The first two DAGs in the Keep ‘em d-separated game
Which nodes of DAGs a and b from Figure 6.1 should be observed in order to make the two nodes of interest d-separated? I encourage you to write your
answers down on a piece of paper and then compare them with the answers
at the end of the chapter.
OK, let’s add some more complexity! Figure 6.2 presents the third DAG in
our game:
Figure 6.2 – DAG number 3 in our Keep ‘em d-separated game
Great, let's make it more fun! Figure 6.3 presents the fourth example in Keep 'em d-separated:
DAGs similar to DAG d in Figure 6.3 are often more challenging, but I am
sure you’ll find the right answer!
OK, now, it’s time for something more advanced! Figure 6.4 presents the last
example in our game:
Figure 6.4 – The final DAG in the Keep ‘em d-separated game
I am sure that, with practice, you’ll feel more confident about them! You can
find more similar DAG games on the pgmpy web page
(https://bit.ly/CausalGames).
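If you want to check your answers programmatically, recent versions of networkx ship a d-separation test (the function is exposed as d_separated in older releases and is_d_separator in newer ones); here's a small sketch on a chain and a collider:

import networkx as nx

chain = nx.DiGraph([('A', 'B'), ('B', 'C')])
collider = nx.DiGraph([('A', 'B'), ('C', 'B')])

print(nx.d_separated(chain, {'A'}, {'C'}, set()))     # False - the path is open
print(nx.d_separated(chain, {'A'}, {'C'}, {'B'}))     # True - conditioning on B blocks the chain
print(nx.d_separated(collider, {'A'}, {'C'}, set()))  # True - a collider blocks the path by default
print(nx.d_separated(collider, {'A'}, {'C'}, {'B'}))  # False - conditioning on the collider opens it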
DAG a is a chain. To block the path between the two nodes of interest, we need to control for the middle node.
DAG b is a collider. This means that the path is already blocked by the middle node, and we don't need to do anything.
The first path is a chain. In order to block it, we need to control for its middle node. The second path is a collider, and we don't need to do anything to block it, as it's already blocked.
Let’s get another one. We can see that in DAG d, there are two overlapping
structures:
A collider,
A fork,
We can also control for an additional node – there's no benefit to this, but it also does not harm us (except that it increases model complexity).
Finally, we can control for the collider (which will open the path) and for another node on that path, which will close it again.
We can see right away that the first path represents a chain. Controlling for A closes this path. The second path contains a collider, so not controlling for anything keeps the path closed. Another solution is to control for A, B, and C (note how controlling for C closes the path that we opened by controlling for B).
With this exercise, we conclude the first section of Chapter 6. Let’s review
what we’ve learned.
Let’s imagine you just got a new job. You’re interested in estimating how
much time you’ll need to get from your home to your new office. You decide
to record your commute times over 5 days. The data you obtain looks like
this:
One thing you can do is to take the arithmetic average of these numbers,
which will give you the so-called sample mean - your estimate of the true
average commute time. You might feel that this is not enough for you and
you’d prefer to fit a distribution to this data rather than compute a point-wise
estimate (this approach is commonly used in Bayesian statistics).
The arithmetic mean of our data, as well as the parameters of the distribution
that we might have decided to fit to this data, are estimates. A computational
device that we use to obtain estimates is an estimator.
Estimators might be as simple as computing the arithmetic mean and as
complex as a 550 billion-parameter language model. Linear regression is an
estimator, and so is a neural network or a random forest model.
Let's stretch our thinking about confounding for a minute. What does it mean that a relationship between two variables is confounded?
Can you recall the example from the first chapter (ice cream, drownings, and
temperature, presented in Figure 1.1)? You can use Figure 6.6 as a quick
refresher:
Figure 6.6 – A graphical representation of the problem from Chapter 1 (Figure 1.1)
Getting back to our example from the first chapter, we could naively choose to model drownings (Y) as a function of ice cream sales (X) alone. If we used our naive model, our estimand would simply be the observational conditional distribution:

$$P(Y = y \mid X = x)$$

We already know that this is incorrect in our case. We know that the relationship between X and Y is spurious! To get the correct estimate of the causal effect of X on Y, we need to control for temperature (Z).

The correct way to model our problem is, therefore, the following:

$$P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\,P(Z = z)$$

This formula is an example of the so-called causal effect rule, which states that given a graph, G, and a set of variables, PA, that are the (causal) parents of X, the causal effect of X on Y is given by the following (Pearl, Glymour, and Jewell, 2016):

$$P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, PA = z)\,P(PA = z)$$
Note that what we're doing in our example is congruent with point 1 in our definition of d-separation – if there's a fork between the two variables, X and Y, we need to control for the middle node of this fork in order to block the path between X and Y.
All that we've done so far in this section has one essential goal – to find an estimand that allows us to compute unbiased causal effects from observational data. Although it won't always be possible, it will be possible sometimes. And sometimes, it can bring us tremendous benefits.
In this section, we learned what estimands are and how they are different
from estimators and estimates. We built a correct estimand for our ice cream
example from the first chapter and showed how this estimand is related to a
more general causal effect rule. Finally, we discussed the links between
estimands, d-separation, and confounding.
You can see that when we looked for the estimand in our ice cream example, we did precisely this – we blocked all the paths between X and Y that contained an arrow into X. Z is not a descendant of X, so we also met the second condition. Finally, we haven't opened any new spurious paths (in our very simple graph, there was not even the opportunity to do so).
Given the model presented in Figure 6.7, which nodes should we control for in order to estimate the causal effect of X on Y?
Controlling for A
Controlling for B
Controlling for both – A and B
Let’s consider a modified model from Figure 6.7, where one of the variables
is unobserved. Figure 6.8 presents this model:
Figure 6.8 – A graph with a confounding pattern and one unobserved variable
The unobserved node and its two edges, marked with dashed lines, are all unobserved, yet we assume that the overall causal structure is known (including the two unobserved edges).
In other words, we don't know anything about the unobserved variable's values or about the functional form of its influence on its children. At the same time, we assume that it exists and has no edges other than the ones presented in Figure 6.8.
Our estimand for the model presented in Figure 6.8 would be identical to the
second estimand for the fully observed model:
That's powerful! Imagine that recording this variable is the most expensive part of your data collection process. Now, understanding the back-door criterion, you can essentially just skip recording it! How cool is that?
One thing we need to remember is that to keep this estimand valid, we need to be sure that the overall causal structure holds. If we changed the structure a bit by adding a direct edge from the unobserved variable to the outcome, the preceding estimand would lose its validity.
That said, if we completely removed the unobserved variable and all its edges from the model, our
estimand would still hold. Can you explain why? (Check the Answers section
at the end of the chapter for the correct answer.)
Let’s summarize.
In this section, we learned about the back-door criterion and how it can help
us build valid causal estimands. We saw that, in some cases, we might be
able to build more than one valid estimand for a single model. We also
demonstrated that the back-door criterion can be helpful in certain cases of
unobserved confounding.
The authors argue that their results suggest a causal link between GPS usage
and spatial memory decline. We already know that something that looks
connected does not necessarily have to be connected in reality.
The authors also know this, so they decided to add a longitudinal component
to their design. This means that they observed people over a period of time,
and they noticed that those participants who used more GPS had a greater
decline in their memory.
Imagine that you decide to discuss this study with your colleague Susan. The
time component seems promising to you, but Susan is somehow critical about
the results and interpretation and proposes another hypothesis – the link
between GPS usage and spatial memory decline is purely spurious.
They seem related – argues Susan – because there’s a common cause for
using GPS and memory decline – low global motivation (Pelletier et al.,
2007). Susan argues that people with low global motivation are reluctant to
learn new things (so they are not interested in remembering new information,
including spatial information) and they try to avoid effort (hence, they prefer
to use GPS more often, as it allows them to avoid the effortful process of
decision-making while driving).
She also claims that low global motivation tends to expand – unmotivated people look for solutions that can take the burden of everyday tasks off their shoulders, and if these solutions work, they use them more often. They also become less and less interested in learning new things with age.
Let’s encode Susan’s model graphically alongside the model containing both
hypotheses (motivation and GPS usage). Figure 6.9 presents both models:
Figure 6.9 – A model presenting Susan’s hypothesis (a) and the “full” hypothesis (b)
As we can see, neither of the two models can be deconfounded because the confounder is unobserved (dashed lines).
In such a case, the back-door criterion cannot help us, but there’s another
criterion we could possibly use that relies on the concept of mediation. This
would require us to find a variable that mediates the relationship between
GPS usage and memory decline.
London cab drivers need to pass a very restrictive exam checking their
spatial knowledge and are not allowed to use any external aids in the
process.
One study (Woollett and Maguire, 2011) showed that drivers who failed this
exam did not show an increase in hippocampal volume. At the same time, in
those who passed the exam, a systematic increase in hippocampal volume
was observed. During the continuing training over a 4-year period,
hippocampal volume was associated with an improvement in spatial memory
(only in those who were in continuous training).
Let’s try to incorporate these results into our model. We’ll hypothesize that
GPS usage negatively impacts the relative volume of the hippocampus, which
in turn impacts spatial memory. Figure 6.10 presents the updated model:
The second important assumption we make is that motivation can only affect
hippocampal volume indirectly through GPS usage. This assumption is
critical in order to make the criterion that we’re going to introduce next – the
front-door criterion – useful to us.
To make the notation more readable, we'll replace the variable names with symbols: GPS usage becomes X, hippocampal volume becomes Z, spatial memory becomes Y, and motivation becomes U. Figure 6.11 presents the graph with the updated variable names:
Figure 6.11 – A model with updated variable names
First, let's take a look at the relationship between X and Z. There's one back-door path between them, X <- U -> Y <- Z, but it's already blocked. Can you see how? (Y is a collider on this path, so the path is blocked by default.)
That’s great!
Now, let's drop the do-operators from the right-hand side by substituting them according to the preceding equalities. This leads us to the following:

$$P(Y = y \mid do(X = x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\,P(x')$$

This formula is called the front-door formula (Pearl et al., 2016) or front-door adjustment.
You can think about it as seeing the same effect from two different angles or – perhaps – as being in two different places at the same time (I guess in sci-fi movies they usually call it bilocation).
Three simple steps toward the front door
In general, we can say that a set of variables, Z, satisfies the front-door criterion, given the graph, G, and a pair of variables, (X, Y), if the following applies (Pearl et al., 2016):
1. Z intercepts all directed paths from X to Y
2. There is no unblocked back-door path from X to Z
3. All back-door paths from Z to Y are blocked by X
Front-door in practice
Let’s implement a hypothetical model of our GPS example. You can find the
code for this chapter in the following notebook: https://bit.ly/causal-ntbk-06.
First, let’s define a structural causal model (SCM) that will generate
hypothetical data for us. We’re going to implement it as a Python class,
similar to what we did in Chapter 2. Let’s start with the imports:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
We import basic scientific packages. Note that this time we did not import
Statsmodels’ (Seabold and Perktold, 2010) linear regression module but,
rather, the implementation from Scikit-learn (Pedregosa et al., 2011).
We did this on purpose to leverage the simple and intuitive interface that
Scikit-learn offers. At the same time, we’ll be less interested in the well-
formatted output of the model results – a great feature of Statsmodels.
class GPSMemorySCM:
def __init__(self, random_seed=None):
self.random_seed = random_seed
self.u_x = stats.truncnorm(0, np.infty, scale=5)
self.u_y = stats.norm(scale=2)
self.u_z = stats.norm(scale=2)
self.u = stats.truncnorm(0, np.infty, scale=4)
In the .__init__() method, we define the distributions for all the exogenous variables in the model (we omitted them for readability in the preceding figures). We use a truncated normal distribution for u_x and u to restrict them to positive values, and a normal distribution for u_y and u_z.
For clarity, you might want to take a look at Figure 6.12, which shows our
full graphical model with exogenous variables:
Figure 6.12 – A full model with exogenous variables
First, we fix the random seed if a user provided a value for it.
Finally, we compute the values of the three observed variables in our model,
gps, hippocampus, and memory, which represent GPS usage, hippocampal
volume, and spatial memory change respectively.
You might have noticed that there’s an additional if statement that checks for
treatment_value. It allows us to generate interventional distribution from
the model if a value for treatment_value is provided.
The last method in our SCM implementation is .intervene(). It’s a syntactic
sugar wrapper around .sample(). The .intervene() method returns an
interventional distribution from our model:
Its purpose is to make the code cleaner and our process more explicit.
Note that passing either None or 0 as a treatment value will result in a special
case of null intervention, and the outcome will be identical to the
observational sample.
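For reference, here is a minimal sketch of these two methods, reconstructed from the description above and meant to live inside the GPSMemorySCM class. The -0.6 (GPS -> hippocampus) and 0.7 (hippocampus -> memory) coefficients match the values quoted later in this chapter; the exact noise terms and the coefficients on the unobserved confounder u are assumptions:

    def sample(self, sample_size=100, treatment_value=None):
        # Fix the random seed if the user provided one
        if self.random_seed:
            np.random.seed(self.random_seed)

        # Sample the exogenous variables
        u_x = self.u_x.rvs(sample_size)
        u_y = self.u_y.rvs(sample_size)
        u_z = self.u_z.rvs(sample_size)
        u = self.u.rvs(sample_size)

        # If a treatment value is provided, generate an interventional sample
        if treatment_value:
            gps = np.array([treatment_value] * sample_size)
        else:
            gps = u_x + 0.7 * u

        hippocampus = -0.6 * gps + u_z
        memory = 0.7 * hippocampus + 0.25 * u + u_y

        return gps, hippocampus, memory

    def intervene(self, treatment_value, sample_size=100):
        # Syntactic sugar around .sample() that returns an interventional sample
        return self.sample(
            treatment_value=treatment_value, sample_size=sample_size)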
Great! Let’s instantiate the model and generate some observational data:
scm = GPSMemorySCM()
gps_obs, hippocampus_obs, memory_obs = scm.sample(600)
treatments = []
experiment_results = []

# Sample over a range of treatments
for treatment in np.arange(1, 21):
    gps_hours, hippocampus, memory = scm.intervene(
        treatment_value=treatment, sample_size=30)
    experiment_results.append(memory)
    treatments.append(gps_hours)
Figure 6.13 presents the relationship between GPS usage and spatial
memory change in observational and interventional samples:
Let’s fit two linear regression models – one on the observational data and
one on the interventional data – and compare the results. What results do you
expect?
lr_naive = LinearRegression()
lr_naive.fit(
X=gps_obs.reshape(-1, 1),
y=memory_obs
)
You might have noticed that when we pass gps_obs to the .fit() method, we change the array's shape. This is because sklearn models require 2D arrays of shape (n_samples, n_features), where n_samples is the number of observations and n_features is the dimensionality of the design matrix (or the number of features, if you will), and our original array was 1D with shape (n_samples,) as we only have one feature.
The second model will use the same variables but generated in our
experiment, rather than recorded from observations.
Before we train it, we need to unpack our treatment (GPS usage) and
outcome (spatial memory change). This is because we generated 30 samples
for each of the 20 intervention levels, and we stored them as a nested list of
lists.
treatments_unpack = np.array(treatments).flatten()
results_unpack = np.array(experiment_results).flatten()
lr_experiment = LinearRegression()
lr_experiment.fit(
X=treatments_unpack.reshape(-1, 1),
y=results_unpack
)
Perfect. Now, let's query both models to generate predictions on test data. Figure 6.14 presents a scatterplot of the interventional distribution and the predictions from both models:
Figure 6.14 – A scatterplot of the interventional distribution and two fitted regression lines
We can see pretty clearly that the naive model does not fit the experimental
data very well.
Let’s compare the values for the regression coefficients for both models:
print(f'Naive model:\n{lr_naive.coef_}\n')
print(f'Experimental model:\n{lr_experiment.coef_}')
Let’s figure out how to get a valid causal coefficient from observational data
using the front-door criterion in three simple steps.
The linear bridge to the causal promised land
It turns out that when we're lucky enough that our model of interest is linear and the front-door criterion can be applied, we can compute a valid estimate of the causal effect of X on Y in three simple steps:
1. Fit a model predicting Z from X.
2. Fit a model predicting Y from X and Z.
3. Multiply the coefficient for X from the first model by the coefficient for Z from the second model.
1. First, let's fit the model regressing Z (hippocampal volume) on X (GPS usage):
lr_zx = LinearRegression()
lr_zx.fit(
    X=gps_obs.reshape(-1, 1),
    y=hippocampus_obs
)
2. Next, let's train the model regressing Y on X and Z. Note that in both cases, we follow the same logic that we followed in the continuous case described previously:
lr_yxz = LinearRegression()
lr_yxz.fit(
X=np.array([gps_obs, hippocampus_obs]).T,
y=memory_obs
)
3. Finally, let's multiply the relevant coefficients:
lr_zx.coef_[0] * lr_yxz.coef_[1]
We take the 0th coefficient from the first model (there's just one coefficient, for GPS usage) and the 1st coefficient from the second model (because we're interested in the effect of hippocampal volume on spatial memory given GPS usage), and we multiply them together.
−0.43713599902679
Great!
We saw that the front-door-adjusted estimate was pretty close to the estimate
obtained from the experimental data, but what actually is the true effect that
we’re trying to estimate, and how close are we?
We can answer this question pretty easily if we have a full linear SCM. The true effect in a model such as ours is equal to the product of the coefficients on the causal path from X to Y (in our case, X -> Z -> Y). The idea of multiplying the coefficients on a directed causal path can be traced back to Sewall Wright's path analysis, introduced as early as 1920 (Wright, 1920).
In our case, the true causal effect of GPS usage on spatial memory will be
-0.6 * 0.7 = -0.42. The coefficients (-0.6 and 0.7) can be read from the
definition of our SCM.
It’s time to conclude this section. We learned what the front-door criterion is.
We discussed three conditions (and one additional assumption) necessary to
make the criterion work and showed how to derive an adjustment formula
from the basic principles. Finally, we built an SCM and generated
observational and interventional distributions to show how the front-door
criterion can be used to accurately approximate experimental results from
observational data.
For example, $G_{\overline{X}\,\underline{Z}}$ will denote a DAG, G, where we removed all the incoming edges to the node X and all the outgoing edges from the node Z.
Perfect. Now, let's see the rules (Pearl, 2009, and Malina, 2020):
These rules might look pretty overwhelming! Let's try to decode their meaning.
Rule 1 tells us that we can ignore any observational (set of) variable(s), Z, when Z and the outcome, Y, are independent, given X and W, in a modified DAG, $G_{\overline{X}}$.
Rule 2 tells us that any intervention over a (set of) variable(s), Z, can be treated as an observation when Z and the outcome are independent, given X and W, in a modified DAG, $G_{\overline{X}\,\underline{Z}}$.
Finally, rule 3 tells us that any intervention over a (set of) variable(s), Z, can be ignored when Z and the outcome, Y, are independent, given X and W, in a modified DAG, $G_{\overline{X}\,\overline{Z(W)}}$.
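For reference, here is the standard formal statement of the three rules (restated from Pearl, 2009, rather than quoted from this book's figures), where $Z(W)$ denotes the nodes of $Z$ that are not ancestors of any node in $W$ in $G_{\overline{X}}$:

Rule 1 (insertion/deletion of observations):
$$P(y \mid do(x), z, w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}}$$

Rule 2 (action/observation exchange):
$$P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}}$$

Rule 3 (insertion/deletion of actions):
$$P(y \mid do(x), do(z), w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}$$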
All this might sound complicated at first, until you realize that what it requires in practice is to take your DAG, find the (set of) confounders (denoted as W in our rules), and check whether any of the rules apply. Plus, you can stack the transformations into arbitrarily long sequences if this helps!
The good part is that this work can also be automated.
It might take some time to fully digest the rules of do-calculus, and that’s OK.
Once you get familiar with them, you’ll have a very powerful tool in your
toolbox.
If you want to learn more about do-calculus, check out Pearl (2009) for formal definitions and step-by-step examples, Shpitser and Pearl (2006) for
the proof of completeness, and Stephen Malina’s blog for intuitive
understanding (Malina, 2020 – https://stephenmalina.com/post/2020-03-09-
front-door-do-calc-derivation/).
Before we conclude this section, let’s see one more popular method that can
be used to identify causal effects.
Instrumental variables
Instrumental variables (IVs) are a family of deconfounding techniques that
are hugely popular in econometrics. Let’s take a look at the DAG in Figure
6.15:
Not all is lost though! It turns out that we can use the IV technique to estimate
our causal effect of interest. Let’s see how to do it.
The three conditions of IVs
Instrumental variable methods require a special variable called an instrument to be present in the graph. We will use Z to denote the instrument. Our effect of interest is the causal effect of X on Y. For Z to be a valid instrument, three conditions need to hold:
1. Z is associated with X (relevance)
2. Z affects Y only through X (the exclusion restriction)
3. Z and Y do not share any common causes
The first condition talks about association rather than causation. The nature of the relationship between Z and X determines how much information we'll be able to extract from our instrument. Theoretically speaking, we can even use instruments that are only weakly (non-directly) associated with X (in such a case, they are called proxy instruments), yet there's a cost to this.
In certain cases, the only thing we’ll be able to obtain will be the lower and
upper bounds of the effect, and in some cases, these bounds might be very
broad and, therefore, not very useful (Hernán and Robins, 2020).
We’re lucky though!
Calculating causal effects with IVs
Let's take a look at Figure 6.15 once again. The variable Z is associated with X, it does not affect Y other than through X, and there are no common causes of Z and Y; therefore, Z meets all of the criteria for being an instrument. Moreover, in our DAG, the relationship between Z and X is causal and direct, which allows us to approximate the exact causal effect (as opposed to just bounds)!
We compute the ratio by dividing the coefficient for the first model by the
coefficient from the second model.
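The full code lives in the chapter's notebook; a self-contained sketch of the idea (the data-generating process below is a hypothetical linear example, not the book's) could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical linear setup: U is an unobserved confounder, Z an instrument
rng = np.random.default_rng(42)
n = 5000
U = rng.normal(size=n)
Z = rng.normal(size=n)
X = 0.8 * Z + 0.5 * U + rng.normal(size=n)
Y = 0.3 * X + 0.7 * U + rng.normal(size=n)

# Wald/ratio estimate: slope of Y on Z divided by slope of X on Z
slope_zy = LinearRegression().fit(Z.reshape(-1, 1), Y).coef_[0]
slope_zx = LinearRegression().fit(Z.reshape(-1, 1), X).coef_[0]
print(slope_zy / slope_zx)  # should land near the true effect of 0.3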
Et voilà!
All this makes IVs pretty flexible and broadly adopted, yet in practice, it might be difficult to find good instruments or to verify that the necessary assumptions (for instance, a lack of influence of Z on Y other than through X) are met.
There are a couple of resources if you want to dive deeper into IVs. For a
great technical and formal overview, check out Chapter 16 of the excellent
book Causal Inference: What If by Miguel Hernán and James Robins from
the Harvard School of Public Health (Hernán and Robins, 2020).
For practical advice on how to find good instruments, intuitions, and great
real-world examples, check out Chapter 7 of Scott Cunningham’s Causal
Inference: The Mixtape (Cunningham, 2021). I am sure you’ll love the latter,
in particular if you’re a hip-hop aficionado.
Finally, we learned how to use IVs in a linear case and linked to the
resources demonstrating how the method can be extended to non-linear and
non-parametric cases.
Wrapping it up
We learned a lot in this chapter, and you deserve some serious applause for coming this far!
We started with the notion of d-separation.
Then, we showed how d-separation is linked to the idea of an estimand. We
discussed what causal estimands are and what their role is in the causal
inference process.
Answer
Controlling for B (Figure 6.7) essentially removes A’s influence on X and Y.
If we remove A from the graph, it will not change anything (up to noise) in
our estimate of the relationship strength between X and Y. Note that in a graph
with a removed node A, controlling for B becomes irrelevant (it does not hurt
us to do so, but there’s no benefit to it either).
References
Carroll, R. J., Ruppert, D., Crainiceanu, C. M., Tosteson, T. D., and Karagas,
M. R. (2004). Nonlinear and Nonparametric Regression and Instrumental
Variables. Journal of the American Statistical Association, 99(467), 736-
750.
Hernán M. A., Robins J. M. (2020). Causal Inference: What If. Boca Raton:
Chapman and Hall/CRC.
Hejtmánek, L., Oravcová, I., Motýl, J., Horáček, J., and Fajnerová, I. (2018).
Spatial knowledge impairment after GPS guided navigation: Eye-tracking
study in a virtual town. International Journal of Human-Computer Studies,
116, 15-24.
Maguire, E. A., Gadian, D. G., Johnsrude, I. S., Good, C. D., Ashburner, J.,
Frackowiak, R. S., and Frith, C. D. (2000). Navigation-related structural
change in the hippocampi of taxi drivers. Proceedings of the National
Academy of Sciences of the United States of America, 97(8), 4398-4403.
Malina, S. (2020, March 9). Deriving the front-door criterion with the do-
calculus. https://stephenmalina.com/post/2020-03-09-front-door-do-calc-
derivation/.
Pelletier, L. G., Sharp, E., Blanchard, C., Lévesque, C. Vallerand, R. J., and
Guay, F. (2007). The general motivation scale (GMS): Its validity and
usefulness in predicting success and failure at self-regulation. Manuscript
in preparation. University of Ottawa.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J.,
Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
(2011). Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research, 12, 2825-2830.
This is a true milestone in our journey. In this chapter, we’ll learn how to
neatly structure the entire causal inference process using the DoWhy library
(Sharma & Kiciman, 2020). By the end of this chapter, you’ll be able to
write production-ready causal inference pipelines using linear and non-linear
estimators.
We’ll start with an introduction to DoWhy and its sister library, EconML
(Battochi et al., 2019). After that, we’ll see how to use the graph modeling
language (GML), which we introduced briefly in Chapter 4 to translate our
assumptions regarding the data-generating process into graphs. Then, we’ll
see how to compute causal estimands and causal estimates using DoWhy.
Finally, we’ll introduce refutation tests and see how to apply them to our
models. We’ll conclude the chapter with an example of a complete causal
inference process. By the end of this chapter, you’ll have a solid
understanding of the mechanics of the causal inference process and be able to
carry it out for your own problems.
Refutation tests
We’ll start with an overview of the Python causal ecosystem and then discuss
what DoWhy and EconML are.
Then, we’ll share why they are the packages of choice for this book.
Finally, we’ll dive deeper into DoWhy’s APIs and look into the integration
between DoWhy and EconML.
Yalla!
As you can see, we have many options to choose from. You might be
wondering – what’s so special about DoWhy and EconML that we’ve chosen
to use these particular packages in the book?
Why DoWhy?
There are at least six reasons why I think choosing DoWhy as the foundation
of your causal ecosystem is a great idea. Let me share them with you:
DoWhy is designed in a way that enables you to run the entire causal
inference process in four clearly defined and easily reproducible steps
With the new experimental GCM API (Blobaum et al., 2022; for more
details, see https://bit.ly/DoWhyGCM), DoWhy becomes even more
powerful! Another great thing about DoWhy is that it translates a complex
process of causal inference into a set of simple and easily reproducible
steps.
One of the great things about EconML is that it’s deeply integrated with
DoWhy. This integration allows us to call EconML estimators from within
DoWhy code without even explicitly importing EconML. This is a great
advantage, especially when you’re in the early experimental stage of your
project and your code does not have a strong structure. The integration
allows you to keep your code clean and compact and helps to avoid
excessive clutter.
Great, now that we’ve had a short introduction, let’s jump in and see how to
work with DoWhy and EconML. We’ll learn more about both libraries on the
way.
In the next four sections, we’ll see how to use the main DoWhy API to
perform the following four steps of causal inference:
We’re going to use the GPSMemorySCM class from Chapter 6 for this purpose.
The code for this chapter is in the Chapter_07.ipynb notebook
(https://bit.ly/causal-ntbk-07).
Let’s initialize our SCM, generate 1,000 observations, and store them in a
data frame:
scm = GPSMemorySCM()
gps_obs, hippocampus_obs, memory_obs = scm.sample(1000)
df = pd.DataFrame(np.vstack([gps_obs, hippocampus_obs,
memory_obs]).T, columns=['X', 'Z', 'Y'])
Note that we denoted the columns for GPS with X, hippocampal volume with
Z, and spatial memory with Y.
Figure 7.1 presents the GPS example from the previous chapter, which we’ll
model next. Note that we have omitted variable-specific noise for clarity:
Figure 7.1 – The graphical model from Chapter 6
Note that the graph in Figure 7.1 contains an unobserved variable, U. We did
not include this variable in our dataset (it’s unobserved!), but we’ll include it
in our graph. This will allow DoWhy to recognize that there’s an unobserved
confounder in the graph and find a relevant estimand for us automatically.
Great! Let’s translate the model from Figure 7.1 into a GML graph:
gml_graph = """
graph [
directed 1
node [
id "X"
label "X"
]
node [
id "Z"
label "Z"
]
node [
id "Y"
label "Y"
]
node [
id "U"
label "U"
]
edge [
source "X"
target "Z"
]
edge [
source "Z"
target "Y"
]
edge [
source "U"
target "X"
]
edge [
source "U"
target "Y"
]
]
"""
Our definition starts with the directed keyword. It tells the parser that all
edges in the graph should be directed. To obtain an undirected graph, you can
use undirected instead.
Next, we define the nodes. Each node has a unique ID and a label. Finally,
we define the edges. Each edge has a source and a target. The entire
definition is encapsulated in a Python multiline string.
model = CausalModel(
data=df,
treatment='X',
outcome='Y',
graph=gml_graph
)
In this section, we’ve learned how to create a GML graph to model our
problem and how to pass this graph into the CausalModel object. In the next
section, we’ll see how to use DoWhy to automatically find estimands for our
graph.
Back-door
Front-door
Instrumental variable
We know all of them from the previous chapter. To see a quick practical
introduction to all three methods, check out my blog post Causal Python — 3
Simple Techniques to Jump-Start Your Causal Inference Journey Today
(Molak, 2022; https://bit.ly/DoWhySimpleBlog).
Let’s see how to use DoWhy in order to find a correct estimand for our
model.
estimand = model.identify_effect()
print(estimand)
We see that DoWhy prints out three different estimands. There’s the
estimand’s name and information if a given type of estimand has been found
for our model.
For the estimand that has been found, there's information about it printed out; for estimands that haven't been found, there's a No such variable(s) found! message.
You can see that DoWhy has found only one estimand for our graph:
frontdoor. As you might remember from the previous chapter, this is the
correct one for our model!
Before we finish this section, I want to reiterate one thing. The capabilities
of DoWhy are really great, yet it will only be able to find estimands for us if
our model is identifiable using one of the three supported methods. For more
advanced identification strategies, check out the grapl-causal library created
by Max Little of the University of Birmingham and MIT
(https://bit.ly/GRAPLCausalRepo).
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='frontdoor.two_stage_regression')
We pass two arguments: the estimand identified in the previous step and the name of the method that will be used to compute the estimate.
You might recall from Chapter 6 that we needed to fit two linear regression
models, get their coefficients, and multiply them in order to obtain the final
causal effect estimate. DoWhy makes this process much easier for us.
Note that we haven’t set the random seed, so your results might slightly
differ.
Currently, DoWhy only supports one estimator for the front-door criterion
(frontdoor.two_stage_regression), but many more estimators are available
in general. We’ll introduce some of them soon.
In Figure 7.3, the blue folds denote validation sets, while the white ones
denote training sets. We train a new model at each iteration using white folds
for training and evaluate using the remaining blue fold. We collect metrics
over iterations and compute a statistic summary (most frequently, the
average).
CV is a widely adopted method to estimate prediction errors in statistical
learning (yet it turns out that it’s much less understood than we tend to think;
Bates et al., 2021).
Why?
Although CV can provide us with some information about the estimator fit,
this approach will not work in general as a causal model evaluation
technique. The reason is that CV operates on rung 1 concepts. Let’s think of
an example.
Compare this example with another one: the train always passes the station
within five minutes after I hear the morning weather forecast on the radio
(do you remember Hume’s association-based definition of causality?).
Popper says that we never observe all Xs and therefore it’s impossible to
prove a theory. He finds this problematic because he wants to see a
difference between science and non-science (this is known as the
demarcation problem). To solve this tension, Popper proposes that instead
of proving a theory, we can try to disprove it.
His logic is simple. He says that “all Xs are A” is logically equivalent to “no
X is not-A.” Therefore, if we find X that is not A, we can show that the theory
is false. Popper proposes that scientific theories are falsifiable – we can
define a set of conditions in which the theory would fail. On the other hand,
non-scientific theories are non-falsifiable – for instance, the existence of a
higher intelligent power that cannot be quantified in any systematic way.
Refutation tests aim to achieve this by modifying the model or the data. There
are two types of transformations available in DoWhy (Sharma & Kiciman,
2020):
Invariant transformations
Nullifying transformations
Invariant transformations change the data in such a way that the result
should not change the estimate. If the estimate changes significantly, the
model fails to pass the test.
Nullifying transformations change the data in a way that should cause the
estimated effect to be zero. If the result significantly differs from zero, the
model fails the test.
The basic idea behind refutation tests is to modify an element of either the
model or a dataset and see how it impacts the results.
Now, let’s see how one of them – the data subset refuter – works.
Let's refute!
Let’s apply some refutation tests to our model. Note that in DoWhy 0.8, not
all tests will work with front-door estimands.
refute_subset = model.refute_estimate(
estimand=estimand,
estimate=estimate,
method_name="data_subset_refuter",
subset_fraction=0.4)
This test removes a random subset of the data and re-estimates the causal
effect. In expectation, the new estimate (on the subset) should not
significantly differ from the original one. Let’s print out the results:
print(refute_subset)
As you can see, the original and newly estimated effects are very close, and the p-value is high, which means we have no evidence of a systematic difference between the two estimates. This result does not falsify our hypothesis and perhaps makes us a bit more confident that our model might be correct.
Full example
This section is here to help us solidify our newly acquired knowledge. We’ll
run a full causal inference process once again, step by step. We’ll introduce
some new exciting elements on the way and – finally – we’ll translate the
whole process to the new GCM API. By the end of this section, you will
have the confidence and skills to apply the four-step causal inference process
to your own problems.
Figure 7.4 presents a graphical model that we’ll use in this section:
Figure 7.4 – A graphical model that we’ll use in this section
We’ll generate 1,000 observations from an SCM following the structure from
Figure 7.4 and store them in a data frame:
SAMPLE_SIZE = 1000

S = np.random.random(SAMPLE_SIZE)
Q = 0.2*S + 0.67*np.random.random(SAMPLE_SIZE)
X = 0.14*Q + 0.4*np.random.random(SAMPLE_SIZE)
Y = 0.7*X + 0.11*Q + 0.32*S + 0.24*np.random.random(SAMPLE_SIZE)
P = 0.43*X + 0.21*Y + 0.22*np.random.random(SAMPLE_SIZE)

# Build a data frame
df = pd.DataFrame(np.vstack([S, Q, X, Y, P]).T,
                  columns=['S', 'Q', 'X', 'Y', 'P'])
1. First, we define two lists: a list of nodes (nodes) and a list of edges (edges).
2. Next, we initialize gml_string with the opening line of the GML definition (graph [directed 1).
3. Then, we run the first for loop over the nodes. At each step, we append a new line to our gml_string representing a node and specifying its ID and label (we've also added tabulation and newlines to make it more readable, but this is not necessary).
4. After the first loop is done, we run the second for loop over edges. We append a new line at each step again. This time, each line contains information about the edge source and target (see the sketch below).
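Here's a sketch reconstructing this loop (the exact whitespace handling is an assumption; compare it with the printed string that follows):

nodes = ['S', 'Q', 'X', 'Y', 'P']
edges = [
    ('S', 'Q'), ('S', 'Y'), ('Q', 'X'), ('Q', 'Y'),
    ('X', 'P'), ('Y', 'P'), ('X', 'Y'),
]

gml_string = 'graph [directed 1\n'

for node in nodes:
    gml_string += f'\tnode [id "{node}" label "{node}"]\n'

for edge in edges:
    gml_string += f'\tedge [source "{edge[0]}" target "{edge[1]}"]\n'

gml_string += ']'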
print(gml_string)
graph [directed 1
node [id "S" label "S"]
node [id "Q" label "Q"]
node [id "X" label "X"]
node [id "Y" label "Y"]
node [id "P" label "P"]
edge [source "S" target "Q"]
edge [source "S" target "Y"]
edge [source "Q" target "X"]
edge [source "Q" target "Y"]
edge [source "X" target "P"]
edge [source "Y" target "P"]
edge [source "X" target "Y"]
]
model = CausalModel(
data=df,
treatment='X',
outcome='Y',
graph=gml_string
)
Let’s visualize the model to make sure that our graph’s structure is as
expected:
model.view_model()
The graph in Figure 7.5 has the same set of nodes and edges that the graph in
Figure 7.4 has. It confirms that our GML definition is correct.
estimand = model.identify_effect()
print(estimand)
We can see that DoWhy proposed one valid estimand – backdoor. Although
our graph looks a bit complex, it contains only one back-door path.
Controlling for Q deconfounds the relationship between X and Y.
Step 3 – estimate!
Although our SCM is pretty simple and we could model it using linear
regression, we’ll use a more advanced estimator this time. Along the way,
we’ll learn how to leverage DoWhy‘s integration with other packages in the
Python machine learning ecosystem to build advanced multicomponent
estimators.
First, let’s import some models from scikit-learn. We picked LassoCV and
GradientBoostingRegressor. Both are rather overkill for our simple
problem, but we picked them on purpose so we can see DoWhy’s flexibility
and integration capabilities in action.
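Both estimators can be imported directly from scikit-learn's linear_model and ensemble modules:

from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor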
We will use a DML estimator from the EconML package. DML is a family of
methods for estimating causal effects that was originally proposed by Victor
Chernozhukov and colleagues (Chernozhukov et al., 2016).
Behind the scenes, DML fits three machine learning models to compute debiased estimates of treatment effects. First, it predicts the outcome from the controls. Then, it predicts the treatment from the controls. Finally, it fits a model that regresses the residuals from the first model (the outcome residuals) on the residuals from the second model (the treatment residuals).
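To build some intuition, here's a hand-rolled sketch of this partialling-out idea (ignoring DML's cross-fitting and using plain linear nuisance models, which is enough for our simple linear dataset; this is an illustration, not EconML's implementation):

from sklearn.linear_model import LinearRegression

# Residualize the outcome and the treatment on the back-door set {Q}
controls = df[['Q']].values
res_y = df['Y'].values - LinearRegression().fit(controls, df['Y']).predict(controls)
res_x = df['X'].values - LinearRegression().fit(controls, df['X']).predict(controls)

# Regress the outcome residuals on the treatment residuals
theta = LinearRegression(fit_intercept=False).fit(
    res_x.reshape(-1, 1), res_y).coef_[0]
print(theta)  # should land close to the true coefficient of 0.7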
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dml.DML',
method_params={
'init_params': {
'model_y': GradientBoostingRegressor(),
'model_t': GradientBoostingRegressor(),
'model_final': LassoCV(fit_intercept=False),
},
'fit_params': {}}
)
3. Then, we defined the models we want to use for each of the three stages
of DML estimation. We use regression models only as our treatment and
outcome are both continuous.
Good job! The true effect is 0.7 and so we are really close!
estimate_lr = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.linear_regression')

print(f'Estimate of causal effect (linear regression): {estimate_lr.value}')
In this comparison, a complex DML model did a slightly better job than a
simple linear regression, but it’s hard to say whether this effect is anything
more than noise (feel free to refit both models a couple of times to see how
stable these results are; you can also re-generate or bootstrap the data to get
even more reliable information).
Before we jump to refutation tests, I want to share one more thing with you.
We had enough data to feed a complex DML estimator and the results are
really good, but if you have a smaller dataset, a simpler method could be a
better choice for you. If you’re not sure what to choose, it’s always a great
idea to run a couple of quick experiments to see which methods are behaving
better for your problem (or a similar problem represented by simulated
data).
Let’s try one more refutation test. This time, we’ll replace our treatment
variable with a random placebo variable:
placebo_refuter = model.refute_estimate(
    estimand=estimand,
    estimate=estimate,
    method_name='placebo_treatment_refuter'
)
print(placebo_refuter)
It seems that our model passed both tests! Note that the first test (random
common cause) belonged to the category of invariant transformations, while
the second (placebo treatment) belonged to the category of nullifying
transformations.
At the time of writing this chapter (September 2022), there are more refutation tests available for the back-door criterion than for the front-door criterion.
To work with the GCM API, we need to import networkx and the gcm
subpackage:
import networkx as nx
from dowhy import gcm
We will reuse the data from our previous example. I’ll put the data-
generating code here as a refresher:
SAMPLE_SIZE = 1000
S = np.random.random(SAMPLE_SIZE)
Q = 0.2*S + 0.67*np.random.random(SAMPLE_SIZE)
X = 0.14*Q + 0.4*np.random.random(SAMPLE_SIZE)
Y = 0.7*X + 0.11*Q + 0.32*S + 0.24*np.random.random(SAMPLE_SIZE)
P = 0.43*X + 0.21*Y + 0.22*np.random.random(SAMPLE_SIZE)
df = pd.DataFrame(np.vstack([S, Q, X, Y, P]).T,
                  columns=['S', 'Q', 'X', 'Y', 'P'])
The next step is to generate the graph describing the structure of our data-
generating process. The GCM API uses nx.DiGraph rather than GML strings
as a graph representation.
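The graph-construction code itself is not shown here; a minimal sketch that reproduces the edges from our GML definition (we assume the variable name graph_nx used below) could look like this:
graph_nx = nx.DiGraph()
graph_nx.add_edges_from([
    ('S', 'Q'), ('S', 'Y'),
    ('Q', 'X'), ('Q', 'Y'),
    ('X', 'Y'), ('X', 'P'),
    ('Y', 'P'),
])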
To make sure that everything is as expected, let’s plot the graph (Figure 7.6):
Figure 7.6 – A networkx representation of our graph
Now, let’s define a causal model. It’s very simple with the GCM API:
causal_model = gcm.InvertibleStructuralCausalModel(graph_nx)
There are many different causal models available in the GCM API. We
picked the invertible SCM, since this is the only model that allows us to
generate counterfactuals without manually providing values for all noise
variables.
causal_model.set_causal_mechanism('S', gcm.EmpiricalDistribution())
causal_model.set_causal_mechanism('X', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
causal_model.set_causal_mechanism('Y', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
causal_model.set_causal_mechanism('P', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
causal_model.set_causal_mechanism('Q', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
We’ll learn more about additive noise models (ANMs) in Part 3, Causal
Discovery. For a detailed discussion on ANMs, check out Peters et al.
(2017).
Let’s fit the model to the data and estimate causal effects strengths:
gcm.fit(causal_model, df)
gcm.arrow_strength(causal_model, 'Y')
As you can see, the GCM API returns estimates for all the variables with incoming edges to the variable of interest (Y in our case; see Figure 7.6 for reference).
These results are different from what we've seen before. The reason is that, by default, the GCM API quantifies an arrow's strength as the change in the outcome variable's variance when we remove the edge from the source variable. This behavior can be changed, and you can define your own estimation functions. To learn more, refer to the GCM API documentation, available here: https://bit.ly/DoWhyArrowStrength.
A great feature of the GCM API is that once you create and fit a model, you can easily answer different types of causal queries using this model. For instance:
gcm.counterfactual_samples(
    causal_model,
    {'X': lambda x: .21},
    observed_data=pd.DataFrame(data=dict(
        X=[.5], Y=[.75], S=[.5], Q=[.4], P=[.34])))
To learn more about the GCM API, please refer to the documentation.
Wrapping it up
In this chapter, we discussed the Python causal ecosystem. We introduced the
DoWhy and EconML libraries and practiced the four-step causal inference
process using DoWhy’s CausalModel API. We learned how to automatically
obtain estimands and how to use different types of estimators to compute
causal effect estimates. We discussed what refutation tests are and how to use
them in practice. Finally, we introduced DoWhy’s experimental GCM API
and showed its great capabilities when it comes to answering various causal
queries. After working through this chapter, you have the basic skills to apply
causal inference to your own problems. Congratulations!
References
Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: what does
it estimate and how well does it do it?. arXiv preprint.
https://doi.org/10.48550/ARXIV.2104.00673
Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oka, P., Oprescu, M., &
Syrgkanis, V. (2019). EconML: A Python Package for ML-Based
Heterogeneous Treatment Effects Estimation.
https://github.com/microsoft/EconML
Blöbaum, P., Götz, P., Budhathoki, K., Mastakouri, A., & Janzing, D. (2022).
DoWhy-GCM: An extension of DoWhy for causal inference in graphical
causal models. arXiv.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C.,
Newey, W., & Robins, J. (2016). Double/Debiased Machine Learning for
Treatment and Causal Parameters. arXiv preprint.
https://doi.org/10.48550/ARXIV.1608.00060
Shimoni, Y., Karavani, E., Ravid, S., Bak, P., Ng, T. H., Alford, S. H.,
Meade, D., & Goldschmidt, Y. (2019). An Evaluation Toolkit to Guide
Model Selection and Cohort Definition in Causal Inference. arXiv preprint.
arXiv:1906.00442.
8
By the end of this chapter, you will have a good understanding of the
challenges that you may face when implementing causal models in real life
and possible solutions to these challenges.
In this chapter, we'll cover the following topics:
Identifiability
Positivity assumption
Exchangeability/ignorability assumption
SUTVA
Selection bias
We'll start by sketching a broader context. After that, we'll explicitly define the concept of identifiability. Finally, we'll discuss some popular challenges faced by practitioners.
In between
Ancient Greek mythology provides us with a story of Ikaros (also known as
Icarus) and his father, the Athenian inventor Daidalos (also known as
Daedalus). Daidalos wants to escape from Crete – a Greek island where he’s
trapped. He builds two sets of wings using feathers and wax. He successfully
tests his wings first before passing the other set to his son.
Before they fly, Daidalos advises his son not to fly too low, nor too high:
“for the fogs about the earth may weigh you down and the blaze from the
sun are going to melt your feathers apart” (Graves, 1955). Ikaros does not
listen. He’s excited by his freedom. Consumed by ecstatic feelings, he
ascends toward the sun. The wax in his wings starts melting and – tragically
– he falls into the sea and drowns.
It’s likely that we haven’t yet reached the Causal Promised Land and that we
will continue to learn and grow in our causal journey in the coming years.
Understanding the assumptions that we discuss in this chapter will help you navigate between these two extremes.
Let's go!
Identifiability
One of the core concepts that we’ll use in this chapter is identifiability. The
good news is that you already know what identifiability is.
We say that a causal effect (or any other causal quantity) is identifiable when
it can be computed unambiguously from a set of (passive) observations
summarized by a distribution and a causal graph (Pearl, 2009).
For an effect to be identifiable, we essentially need two things: to remove all non-causal (spurious) information flowing between the treatment and the outcome, and to preserve the causal information that we actually want to measure. The first condition is achievable by blocking all the paths that are leaking non-causal information using the rules of do-calculus and the logic of d-separation. This is often possible using the back-door criterion, the front-door criterion, instrumental variables, or the general rules of do-calculus.
The important thing is not to pretend that it’s possible when we know it’s not.
In such a case, using causal methods can bring more harm than good. One of
the main challenges is that sometimes we simply don’t know. We’ll learn
more about this in the last part of this section.
First, any estimator needs to have a large enough sample size to return
meaningful estimates. Second, we need to make sure that the probability of
every possible value of treatment in our dataset (possibly conditioned on all
important covariates) is greater than 0. This is known as the positivity
assumption and we’ll discuss it in the next section.
Now, let’s talk about a couple of challenges causal data scientists face.
There are three things that people usually do to obtain a causal graph for a problem with an unknown causal structure.
If your data comes from a natural experiment, you can also try some of the
methods from the field of econometrics, such as synthetic controls,
regression discontinuity, or difference-in-differences. We’ll briefly
discuss one of them (with a Bayesian flavor) in Chapter 11.
Nonetheless, sample size can also be an issue in the case of very simple
models. I conducted a simple experiment so we could see it in action. The
full code for this experiment is in the notebook for Chapter 8
(https://bit.ly/causal-ntbk-08). Here we only present the methodology and the
results.
Let’s take the graphical model from the previous chapter (check Figure 8.1
for a refresher).
Understanding the sample size – methodology
Let’s take four different sample sizes (30, 100, 1,000, and 10,000), and for
each sample size, generate 20 datasets from the model presented in Figure
8.1. We fit two models for each dataset: one based on simple linear
regression and one based on a DML estimator, powered by two gradient-
boosting models and a cross-validated lasso regression:
Figure 8.1 – The graphical model that we use in the experiment
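As mentioned, the full code lives in the notebook; the following simplified sketch of the linear-regression arm of the experiment (our own reconstruction, reusing the structural equations from the previous chapter and adjusting for Q) shows the general idea:
import numpy as np
from sklearn.linear_model import LinearRegression

TRUE_EFFECT = 0.7

def generate_data(n):
    # Structural equations reused from the previous chapter's example
    S = np.random.random(n)
    Q = 0.2*S + 0.67*np.random.random(n)
    X = 0.14*Q + 0.4*np.random.random(n)
    Y = 0.7*X + 0.11*Q + 0.32*S + 0.24*np.random.random(n)
    return Q, X, Y

percentage_errors = {}
for n in [30, 100, 1000, 10000]:
    errors = []
    for _ in range(20):
        Q, X, Y = generate_data(n)
        # Adjusting for Q blocks the back-door paths between X and Y
        coef = LinearRegression().fit(np.column_stack([X, Q]), Y).coef_[0]
        errors.append(100 * (coef - TRUE_EFFECT) / TRUE_EFFECT)
    percentage_errors[n] = errors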
Understanding the sample size – results
The results are presented in Figure 8.2. We see the sample size on the x axis and the percentage error on the y axis. A 100% error means that the coefficient returned by the model was twice the size of the true coefficient; a negative 100% error means that the estimated effect was 0 (the true effect minus 100% of the true effect is 0).
To make the most out of smaller sample sizes, a good idea would be to
bootstrap your results, if you can afford it.
Unverifiable assumptions
In some cases, it might be very difficult to find out whether you’re meeting
your assumptions. Let’s start with confounding.
Imagine that you’re trying to use the back-door criterion for your problem. In
certain cases, you’ll be able to rule out the possibility that there are other
unobserved variables introducing confounding between your treatment and
your outcome, but in many cases, it might be very difficult. In particular,
when your research involves human interactions, financial markets, or other
complex phenomena, making sure that the effects of interest are identifiable
can be difficult if not impossible. That’s likely one of the reasons for the huge
popularity of instrumental variable techniques in contemporary applied
econometrics. Although good instruments have a reputation for being notoriously hard to find, they might be easier to find than the certainty that we meet the back-door criterion.
An elephant in the room – hopeful or
hopeless?
Some people reading about the challenges in causal inference from
observational data might ask, “Is there anything we can do?”
I like the idea presented by James Altucher and Claudia Altucher in their
book, The Power of No: if you cannot come up with 10 creative ideas, you
should come up with 20 (Altucher & Altucher, 2014).
The first level of creativity is to use the refutation tests that we described in
the previous chapter. You already know how they work. One challenge with
these tests is that they check for the overall correctness of the model
structure, but they do not say much about how good the obtained estimate is.
The second level of creativity is sensitivity analysis, which asks how strong unobserved confounding would have to be to explain away our estimate. In other words, we can check whether our effect would still hold in the worst-case scenario. Sensitivity analysis for regression models can be
performed using the Python PySensemakr package (Cinelli and Hazlett, 2020;
https://bit.ly/PySensemakr) or even using the online Sensemakr app
(https://bit.ly/SensemakrApp).
Even better, the sensitivity analysis framework has been recently extended
for a broad class of causal models (check Chernozhukov et al. (2022) for
details).
In this section, we talked about the challenges that we can face when working
with causal models and we discussed some creative ways to overcome them.
We talked about the lack of causal graphs, insufficient sample sizes, and
uncertainty regarding assumptions. We’ve seen that although there might not
be a universal cure for all causal problems, we can definitely get creative
and gain a more realistic understanding of our position.
Positivity
In this short section, we’re going to learn about the positivity assumption,
sometimes also called overlap or common support.
First, let’s think about why this assumption is called positivity. It has to do
with (strictly) positive probabilities – in other words, probabilities greater
than zero.
Positive probabilities of what, exactly? The answer is the probability of your treatment given all relevant control variables (the variables that are necessary to identify the effect – let's call them $Z$). Formally:

$$P(T = t \mid Z = z) > 0$$

The preceding formula must hold for all values of $z$ that are present in the population of interest (Hernán & Robins, 2020) and for all values of treatment $t$.
Using the adjustment formula (Pearl et al., 2016), to compute interventional quantities from observational data, we need to control for $Z$ – and if $P(T = t \mid Z = z) = 0$ for some combination of $t$ and $z$, the conditional expectation that the adjustment requires is simply undefined for that combination.
Now imagine an extreme case where the support of $Z$ among the treated units (in other words, the values of $z$ that we observe under treatment) does not overlap at all with the support of $Z$ among the untreated units (hence the names overlap and common support; Neal, 2020). Figure 8.3 presents a graphical representation of such a case:
Figure 8.3 – Positivity violation example
In order to estimate the causal effect of the treatment given $Z$, our estimator would need to extrapolate the red dots to the left (into the range between 2 and 5, where only the other treatment group has support) and the blue dots to the right (into the range between 5 and 9). It's highly unlikely that any machine learning model would perform such an extrapolation realistically. Figure 8.4 presents possible extrapolation trajectories:
Figure 8.4 – Possible extrapolation trajectories (lines)
The red and blue lines in Figure 8.4 represent possible extrapolation
trajectories. How accurate do they seem to you?
To me, not very. They look like simple linear extrapolations, not capturing
the non-linear qualities of the original functions.
When the supports do overlap, the model has a much easier task, as it only needs to interpolate between the points.
Our example is very simple: we only have one confounder. In the real world, the data is often multidimensional, and making sure that $P(T = t \mid Z = z) > 0$ holds across the entire covariate space becomes more difficult. Think about a three-dimensional example. How about a 100-dimensional example?
Exchangeability
In this section, we’ll introduce the exchangeability assumption (also known
as the ignorability assumption) and discuss its relation to confounding.
Exchangeable subjects
The main idea behind exchangeability is the following: the treated subjects,
had they been untreated, would have experienced the same average outcome
as the untreated did (being actually untreated) and vice versa (Hernán &
Robins, 2020).
At the same time, the core idea behind it is simple: the treated and the
untreated need to share all the relevant characteristics that can influence the
outcome.
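Using potential outcomes notation, this idea is often written as (conditional) independence between the potential outcomes and the treatment assignment. This is the standard textbook formulation (e.g., Hernán & Robins, 2020), written here with the adjustment set $Z$ from the previous section:

$$Y^{t} \;\perp\!\!\!\perp\; T \qquad \text{(marginal exchangeability)}$$

$$Y^{t} \;\perp\!\!\!\perp\; T \mid Z \qquad \text{(conditional exchangeability)}$$

Both statements are required to hold for all values of the treatment $t$.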
In fact, the potential outcomes framework aims to achieve the same goals as
SCM/do-calculus-based causal inference, just using different means (see
Pearl, 2009, pp. 98-102, 243-245).
Pearl argues that both frameworks are logically equivalent and can be used
interchangeably or symbiotically, pointing out that graphical models can help
clearly address challenges that might be difficult to spot using the potential
outcomes formalism (see https://bit.ly/POvsSCM for examples). I
wholeheartedly agree with the latter.
…and more
In this short section, we’ll introduce and briefly discuss three assumptions:
the modularity assumption, stable unit treatment value assumption
(SUTVA), and the consistency assumption.
Modularity
Imagine that you’re standing on the rooftop of a tall building and you’re
dropping two apples. Halfway down, there’s a net that catches one of the
apples.
The net performs an intervention for one of the apples, yet the second apple
remains unaffected.
The SCM after the intervention (we often add a subscript M, for modified, to denote an SCM after an intervention) is presented in Figure 8.7:
Figure 8.7 – The modified SCM
At the same time, the edges that do not point into the intervened variable remain untouched (despite our intervention on one apple, the causes of the second apple still affect its trajectory), in accordance with the modularity assumption.
Removing edges from the graph when we intervene on (a set of) variable(s)
is sometimes called graph mutilation. I don’t like this name too much. A
more neutral term is graph modification.
Modularity might seem challenging to understand at first, but at its core, it’s
deeply intuitive.
Imagine a world where the modularity assumption does not hold. In this world, you'd make an intervention on your web page by changing the button shape and, as a result, your web page would also automatically be translated into Portuguese, your computer would automatically start playing Barbra Streisand songs, and your lawyer would immediately color their hair green – all without any other changes in the world besides your button shape. Such a world might be interesting, but it would be hard to navigate in terms of understanding causal mechanisms.
SUTVA
The SUTVA is another assumption coming from the potential outcomes
framework.
The assumption states that the fact that one unit (individual, subject, or
object) receives treatment does not influence any other units.
Consistency
Imagine that the treatment is to win a car. There are two levels of treatment: you either get a car or you don't.
Some people in the treatment group win a brand-new electric BMW while
others get a rusty, 20-year-old Mazda without wheels. If our outcome
variable is the level of excitement, we’d expect that on average the same
person’s level of excitement would differ between the two versions of
treatment. That would be an example of a violation of consistency as we
essentially encode two variants of treatment as one.
If you only take two things out of our discussion on consistency, let them be these: make sure your treatment is well defined, and if there are meaningfully different versions of the treatment, encode them as different treatment levels.
In the next section, we’ll shed some new light on the topic of unobserved
variables.
Note that in the field of econometrics, the term selection bias might be used
to describe any type of confounding bias.
Let’s review what Hernán and Robins (2020) call a selection bias and learn
two valuable lessons.
During World War II, the Statistical Research Group (SRG) was asked where to add armor to American planes, based on the distribution of bullet holes in the aircraft returning from combat. An interesting fact is that the damage does not look random. It seems that there's a clear pattern to it. Bullet holes are concentrated around the fuel system and fuselage, but not so much around the engines.
A person who was faced with this question was Abraham Wald, a
Hungarian-born Jewish mathematician, who worked at Columbia University
and was a part of the SRG at the time.
Instead of providing the army with the answer, Wald asked a question.
Missing holes?
What Wald meant were the holes that we’ve never observed. The ones that
were in the planes that never came back.
DAG them!
What Wald pointed out is so-called survivorship bias. It’s a type of selection
bias where the effects are estimated only on a subpopulation of survivors
with an intention to generalize them to the entire population. The problem is
that the population might differ from the selected sub-sample. This bias is
well known in epidemiology (think about disease survivors, for instance).
Let’s put a hypothetical SCM together to represent the problem that Wald was
facing.
Note that what we do by only looking at the planes that came back home is implicitly conditioning on $C$ (a variable indicating that a plane came back).
Two factors influence $C$: the number of bullets shot at the engines ($T$) and the overall damage severity ($Y$), which also depends on $T$. If the number of bullets shot at the engines is high enough, the plane is not coming back home regardless of the other damage. It can also be the case that the number of bullets shot at the engines is not that high, but the overall damage is so severe that the plane does not make it anyway.
Let’s translate our example into Python to see how this logic works on an
actual dataset.
First, let’s define the sample size and the set of structural assignments that
defines our SCM:
SAMPLE_SIZE = 1000
# A hypothetical SCM
T = np.random.uniform(20, 110, SAMPLE_SIZE)    # bullets shot at the engines
Y = T + np.random.uniform(0, 40, SAMPLE_SIZE)  # overall damage severity
C = (T + Y < 100).astype('int')                # 1 if the plane came back home
Let’s put all the variables in a pandas DataFrame for easier manipulation:
The data in blue represents the damage severity for the planes that came back
home. The data in red denotes the damage severity of the planes that did not
come back.
As you can see, the severity is much higher for the planes that did not make it back home, and there's only a small overlap between the two distributions.
Although our example is simplified (we don’t take the damage location into
account in our histogram), the conclusion is the same as Wald’s – what
matters the most are the missing holes, not the ones that we’ve observed. The
first lesson is: look at what’s missing.
If you want to understand the original solution proposed by Wald, you can
find it in Wald (1980).
The DAG in Figure 8.11 is even more interesting. There are no edges directly connecting the variables between which we observe the spurious relationship. Can you guess what's the source of spuriousness here?
In the second part of the section, we focused on selection bias and recalled
the story of Abraham Wald. We cast this story into the language of DAGs and
translated it into Python to see how selection bias works in practice.
Two lessons we learned in this section are look at what’s missing and look
further than you think is necessary.
Wrapping it up
In this chapter, we talked about the challenges that we face while using
causal inference methods in practice. We discussed important assumptions
and proposed potential solutions to some of the discussed challenges. We got
back to the topic of confounding and showed examples of selection bias.
The four most important concepts from this chapter are identifiability, the
positivity assumption, modularity, and selection bias.
Are you ready to add some machine learning sauce to all we’ve learned so
far?
References
Altucher, J., Altucher C. A. (2014). The Power of No: Because One Little
Word Can Bring Health, Abundance, and Happiness. Hay House.
Curth, A., Svensson, D., Weatherall, J., and van der Schaar, M. (2021).
Really Doing Great at Estimating CATE? A Critical Look at ML
Benchmarking Practices in Treatment Effect Estimation. Proceedings of
the Neural Information Processing Systems Track on Datasets and
Benchmarks.
Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., and Syrgkanis, V.
(2022). Long Story Short: Omitted Variable Bias in Causal Machine
Learning (Working Paper No. 30302; Working Paper Series). National
Bureau of Economic Research.
Donnelly, R. (2022, October 2). One of the big challenges with causal
inference is that we can’t easily prove which approach produced the best
estimate of the true causal effect… [Post]. LinkedIn.
https://www.linkedin.com/posts/robert-donnelly-4376579_evaluating-the-
econometric-evaluations-of-activity-6979849583241609216-7iXQ?
utm_source=share&utm_medium=member_desktop
Graves, R. (1955). Daedalus and Talus. The Greek Myths. Penguin Books.
Gordon, B. R., Moakler, R., and Zettelmeyer, F. (2022). Close Enough? A
Large-Scale Exploration of Non-Experimental Approaches to Advertising
Measurement. arXiv. https://doi.org/10.48550/ARXIV.2201.07055
Gui, H., Xu, Y., Bhasin, A., and Han, J. (2015). Network A/B Testing: From
Sampling to Estimation. Proceedings of the 24th International Conference on
World Wide Web, 399–409. https://doi.org/10.1145/2736277.2741081
Hernán M. A., Robins J. M. (2020). Causal Inference: What If. Boca Raton:
Chapman & Hall/CRC.
Intellectual Ventures. (March 10, 2016). Failing for Success: The Wright
Brothers. https://www.intellectualventures.com/buzz/insights/failing-for-
success-the-wright-brothers/
Machine Learning Street Talk. (2022, January 4). #61: Prof. YANN LECUN:
Interpolation, Extrapolation and Linearisation (w/ Dr. Randall
Balestriero) [Video]. YouTube. https://www.youtube.com/watch?
v=86ib0sfdFtw
Reisach, A.G., Seiler, C., & Weichwald, S. (2021). Beware of the Simulated
DAG! Varsortability in Additive Noise Models. arXiv, abs/2102.13647.
Saint-Jacques, G., Varshney, M., Simpson, J., and Xu, Y. (2019). Using Ego-
Clusters to Measure Network Effects at LinkedIn. arXiv, abs/1903.08755.
In this chapter, we’ll see a number of methods that can be used to estimate
causal effects in non-linear cases. We’ll start with relatively simple methods
and then move on to more complex machine learning estimators.
By the end of this chapter, you’ll have a good understanding of what methods
can be used to estimate non-linear (and possibly heterogeneous (or
individualized)) causal effects. We’ll learn about the differences between
four different ways to quantify causal effects: average treatment effect
(ATE), average treatment effect on the treated (ATT), average
treatment effect on the control (ATC), and conditional average
treatment effect (CATE).
Matching
Propensity scores
Meta-learners
The basics I – matching
In this section, we’ll discuss the basics of matching. We’ll introduce ATE,
ATT, and ATC. We’ll define a basic matching estimator and implement an
(approximate) matching estimator using DoWhy’s four-step causal process.
Some authors, including Stuart (2010) and Sizemore & Alkurdi (2019),
suggest that matching should be treated as a data preprocessing step, on top
of which any estimator can be used. This view is also emphasized by
Andrew Gelman and Jennifer Hill: “Matching refers to a variety of
procedures that restrict and reorganize the original sample” (Gelman &
Hill, 2006).
Types of matching
There are many variants of matching. First, we have exact versus inexact
(approximate) matching. The former requires that treated observations and
their respective untreated pairs have exactly the same values for all relevant
variables (which includes confounders). As you can imagine, this might be
extremely hard to achieve, especially when your units are complex entities
such as humans, social groups, or organizations (Stuart, 2010). Inexact
matching, on the other hand, allows for pairing observations that are similar.
Many different metrics can be used to determine the similarity between two
observations. One common choice is Euclidean distance or its
generalizations: Mahalanobis distance (https://bit.ly/MahalanobisDistance)
and Minkowski distance (https://bit.ly/MinkowskiDistance). Matching can
be performed in the raw feature space (directly on your input variables) or
other spaces, for instance, the propensity score space (more on propensity
scores in the next section). One thing that you need to figure out when using
inexact matching is the maximal distance that you’ll accept to match two
observations – in other words: how close is close enough.
MORE ON MATCHING
A detailed discussion of different variants of matching is beyond the scope of this book. If
this particular family of methods seems interesting to you, you can start with a great
summary article by Elizabeth A. Stuart (Stuart, 2010; https://bit.ly/StuartOnMatching) or a
(more recent) blog post by Samantha Sizemore and Raiber Alkurdi from the Humboldt
University in Berlin (Sizemore & Alkurdi, 2019; https://bit.ly/SizemoreAlkurdi). The latter
contains code examples in R, leverages some more recent machine learning methods (such
as XGBoost), and provides you with an experimental evaluation of different approaches. To
learn about matching in multiple treatment scenarios, check out Lopez & Gutman (2017); for similarities between matching and regression, see Angrist & Pischke (2008, pp. 51–59); for matching in the historical context of subclassification, see Cunningham (2021, pp. 175–198) or Facure (2020, Chapter 10).
As is the case for all other methods we have discussed so far, matching
requires that the relationship between the treatment and the outcome is
unconfounded. Therefore, we need to make sure that all confounders are
observed and present in the matching feature set if we want to obtain
unbiased estimates of causal effects. Moreover, it’s good to draw a directed
acyclic graph (DAG) to make sure we don’t introduce bias by controlling
for colliders.
Note that matching does not care about linearity in the data – it can be used
for linear as well as non-linear data.
As a reminder, ATE averages the difference between the two potential outcomes over all $n$ units:

$$ATE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i(1) - Y_i(0) \right)$$

ATT looks very similar to ATE. The key difference is that we do not sum over all observations but rather only over the treated units (units that received the treatment):

$$ATT = \frac{1}{n_{T=1}} \sum_{i \,\in\, T=1} \left( Y_i(1) - Y_i(0) \right)$$

In the preceding formula, $n_{T=1}$ represents the number of units that received the treatment (hence $T=1$), and $i \in T=1$ represents the indices of these units.
ATC is a mirror reflection of ATT; the only thing we need to change is the value of $T$:

$$ATC = \frac{1}{n_{T=0}} \sum_{i \,\in\, T=0} \left( Y_i(1) - Y_i(0) \right)$$
Matching estimators
The type of causal effect that we want to estimate will influence the definition of the matching estimator itself. The matching estimator for ATT can be defined as follows:

$$\widehat{ATT} = \frac{1}{n_{T=1}} \sum_{i \,\in\, T=1} \left( Y_i - Y_{j(i)} \right)$$

In the preceding formula, $Y_i$ stands for the value of the outcome of the i-th treated observation, and $Y_{j(i)}$ represents the outcome of a matched observation from the untreated group (I borrowed the $j(i)$ notation from Scott Cunningham (2021)).
The matching estimator for ATE can be defined in the following way:

$$\widehat{ATE} = \frac{1}{n} \sum_{i=1}^{n} (2T_i - 1)\left( Y_i - Y_{j(i)} \right)$$
This time, we iterate over all observations and find matches for them within
the other group. For each treated observation i, we match an untreated
observation, j(i), and vice versa. Note that in the ATT estimator, j(i) always
belonged to the untreated group, while i was always treated. When estimating
ATE, i can be treated or untreated, and j(i) simply belongs to the other group
(if i is treated, j(i) is not and vice versa).
Note that this formula only works for binary treatments (it needs to be further
adapted to work with multiple-level treatments; for some further ideas on
this, check out Lopez & Gutman, 2017).
Let’s see a quick example. Imagine that we only have two units in our
dataset, represented in Table 9.1:
Unit Covariate Treatment Outcome
A 33 1 9
B 33 0 1.5
Let’s compute a matching ATE for this dataset. For the first unit, we have the
following equation:
Figure 9.1 – Visual representation of matching formula computations with data from Table
9.1
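Working through both rows of Table 9.1 with the estimator defined above (each unit is matched to the only available unit from the other group) gives us:

$$\widehat{ATE} = \frac{1}{2}\Big[(2 \cdot 1 - 1)(9 - 1.5) + (2 \cdot 0 - 1)(1.5 - 9)\Big] = \frac{1}{2}(7.5 + 7.5) = 7.5$$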
Now, you might ask, what if we have observations without good matches in
our dataset? The traditional answer to this question is that we need to discard
them! In case of approximate (inexact) matching, you can try extending your
criteria of similarity in order to retain some of these observations, but keep
in mind that this will likely increase the bias in your estimates (especially if
your sample size is small).
What if we end up with more than one good match per observation? Do we
also need to discard it?
It turns out that we can be more efficient than that! One thing we can do to avoid discarding the observations is to average over them. For each i, we collect all the j(i)s that meet our similarity criteria and simply take the average of their outcomes. This is presented formally as follows:

$$\widehat{ATE} = \frac{1}{n} \sum_{i=1}^{n} (2T_i - 1)\left( Y_i - \frac{1}{|J(i)|}\sum_{j \in J(i)} Y_j \right)$$

In the preceding formula, $J(i)$ denotes the set of all matches for unit i.
Implementing matching
Great, we’re ready to get some data and see matching in action! We’ll follow
DoWhy’s four-step procedure and use the approximate matching estimator.
The code for this chapter can be found in the Chapter_09.ipynb notebook
(https://bit.ly/causal-ntbk-09).
For our analysis, we'll use a simplified synthetic dataset (n=200) inspired by the famous dataset used in Robert LaLonde's seminal paper Evaluating the Econometric Evaluations of Training Programs with Experimental Data (1986). Our dataset consists of three variables:
The subject's age
A binary treatment variable indicating whether the subject took a training course (took_a_course)
The subject's yearly earnings 18 months after the training took place (in United States Dollars (USD))
First, let’s read in the data (we omit library imports here for a better reading
experience – please refer to the notebook for the more details):
earnings_data = pd.read_csv(r'./data/ml_earnings.csv')
earnings_data.head()
earnings_data.groupby(['age', 'took_a_course']).mean()
Although we only printed out a couple of rows in Figure 9.3, you can see that
for some age groups (for instance, 36 or 38), there are observations only for
one of the values of the treatment. This means that we won’t be able to
compute the exact effects for these groups. We’ll leave it to our matching
estimator to handle this for us, but first, let’s compute the naïve estimate of
the causal effect of training on earnings using the treatment and control group
means:
treatment_avg = earnings_data.query(
    'took_a_course==1')['earnings'].mean()
cntrl_avg = earnings_data.query(
    'took_a_course==0')['earnings'].mean()
treatment_avg - cntrl_avg
6695.57088285231
The naïve estimate of the effect of our training is a bit over 6,695 USD per
year.
Let’s see whether, and, if so, how the approximate matching estimate differs.
Step 1 – representing the problem as a graph
Let’s start by constructing a graph modeling language (GML) graph:
model = CausalModel(
    data=earnings_data,
    treatment='took_a_course',
    outcome='earnings',
    graph=gml_string
)
model.view_model()
estimand = model.identify_effect()
print(estimand)
This gives us a back-door estimand; we omit the full printout here (you can find it in the notebook).
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.distance_matching',
    target_units='ate',
    method_params={'distance_metric': 'minkowski', 'p': 2})
One thing I’d like to add here is that we haven’t standardized our matching
variable (age). This is fine because we have just one variable, yet in a
multivariable case, many people would recommend normalizing or
standardizing your variables to keep their scales similar. This helps to avoid
a scenario where variables with larger values disproportionately outweigh
other variables in the distance computation. Many machine learning
resources follow this advice, and it’s often considered a standard practice in
multivariate problems involving distance computations.
estimate.value
10118.445
Now, let’s reveal the true effect size! The true effect size for our data is
10,000 USD. This means that matching worked pretty well here! The
absolute percentage error is about 1.2%, which is a great improvement over the naïve estimate and its absolute percentage error of about 33%!
Let’s see how well our matching estimator will handle our attempts to refute
it!
Step 4 – refuting the estimate
For brevity, we’ll only run one refutation test here. In the real world, we
optimally want to run as many tests as available:
refutation = model.refute_estimate(
    estimand=estimand,
    estimate=estimate,
    method_name='random_common_cause')
print(refutation)
We see that the new effect is slightly higher than the estimated one.
Nonetheless, a high p value indicates that the change is not statistically
significant. Good job!
Now, let’s see what challenges we may meet on our way when using
matching.
The basics II – propensity scores
In this section, we will discuss propensity scores and how they are
sometimes used to address the challenges that we encounter when using
matching in multidimensional cases. Finally, we’ll demonstrate why you
should not use propensity scores for matching, even if your favorite
econometrician does so.
To get an idea of how the dimensionality of our data affects the chances of finding exact matches, let's take a look at Figure 9.5:
Figure 9.5 – The probability of finding an exact match versus the dimensionality of the
dataset
In Figure 9.5, the x axis represents the dataset’s dimensionality (the number
of variables in the dataset), and the y axis represents the probability of
finding at least one exact match per row.
The blue line is the average probability, and the shaded areas represent +/-
two standard deviations.
Sounds good! Unfortunately, propensity scores come with problems of their own. One issue is that two observations that are very different in their original feature space may have the same propensity score. This may lead to matching very different observations and, hence, biasing the results.
In the binary case, the optimal propensity score would be 0.5. Let’s think
about it.
What happens in the ideal scenario when all observations have the optimal
propensity score of 0.5? The position of every observation in the propensity
score space becomes identical to any other observation.
We’ve seen this in the formulas before – we either take the best match or we
average over the best matches, yet now every point in the dataset is an
equally good match!
We can either pick one at random or average on all the observations! Note
that this kills the essence of what we want to do in matching – compare the
observations that are the most similar. This is sometimes called the PSM
paradox. If you feel like digging deeper into this topic, check out King
(2018), King & Nielsen (2019), or King’s YouTube video here:
https://bit.ly/GaryKingVideo.
You might also hope that propensity score matching frees us from worrying about hidden confounding. Unfortunately, this is not the case. Setting aside the other challenges discussed earlier, PSM requires unconfoundedness (no hidden confounding).
In this section, we learned what propensity scores are, how to use them for
matching, and why they should not be used for this purpose. We discussed
the limitations of the PSM approach and reiterated that PSM is not immune to
hidden confounding.
Imagine that we want to estimate the effect of drug D. If males and females react differently to D, and we have 2 males and 6 females in the treatment group and 12 males and 2 females in the control group, we might end up with a situation similar to the one that we've seen in Chapter 1: the drug seems to be good for people in general, yet harmful to both females and males!
This is Simpson’s paradox at its best (if you want to see how we can
deconfound the data from Chapter 1 using IPW, check out the Extras
notebook: https://bit.ly/causal-ntbk-extras-02).
Formalizing IPW
The formula for a basic IPW ATE estimator is as follows:

$$\widehat{ATE} = \frac{1}{n}\left( \sum_{i=1}^{n_{T=1}} \frac{y_i}{\hat{e}(x_i)} \;-\; \sum_{j=1}^{n_{T=0}} \frac{y_j}{1-\hat{e}(x_j)} \right)$$

In the preceding formula, we follow the convention that we used earlier in this chapter: $n_{T=1}$ represents the number of units that received the treatment and $n_{T=0}$ represents the number of units that did not receive the treatment (with $n$ being the total number of units). Lowercase $y_i$ (or $y_j$) represents the outcome for unit $i$ (or $j$, respectively), and $\hat{e}(x_i)$ is the (estimated) propensity score for unit $i$ (or $j$), defined as the probability of treatment given the characteristics of the unit:

$$\hat{e}(x_i) = \hat{P}(T_i = 1 \mid x_i)$$
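As a sanity check, you could also compute this estimator by hand; the following minimal sketch (our own illustration, not the book's code) uses logistic regression as the propensity model:
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    # Estimate the propensity scores e(x) = P(T=1 | X=x)
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    # Weight the treated by 1/e(x) and the untreated by 1/(1 - e(x))
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

# Example call on the earnings data:
# ipw_ate(earnings_data[['age']], earnings_data['took_a_course'], earnings_data['earnings'])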
Implementing IPW
For this exercise, we’ll use the same earnings dataset that we used in the
previous section. This allows us also to reuse the graph and the estimand that
we found in the previous section and the only thing we need to redo is to re-
compute the estimate using the appropriate estimator. We’ll use DoWhy’s
backdoor.propensity_score_weighting method. Behind the scenes, the estimator uses weighted least squares (WLS) regression, which weights each treated sample by the inverse of its propensity score ($\hat{e}(x_i)$) and each untreated sample by the inverse of $1-\hat{e}(x_i)$.
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.propensity_score_weighting',
    target_units='ate')
Let’s see the results:
estimate.value
10313.5668311203
Note that it’s not necessarily always the case that matching performs better
than IPW. That said, some authors have reported that IPW underperforms
compared to other methods (for instance, Elze et al., 2017; but note that this
is an observational study that might suffer from other sources of bias).
You need a model to compute your propensity scores before you can
perform weighting. Logistic regression is a common choice because its probability estimates are (usually) well calibrated. Other models might return biased probabilities, and you might want to scale their outputs using one of the available methods, such as Platt scaling or isotonic regression (Niculescu-Mizil & Caruana, 2005). Although there is an
opinion that probability scaling is not necessarily very important for
IPW, recent research by Rom Gutman, Ehud Karavani, and Yishai
Shimoni from IBM Research has shown that good calibration
significantly reduces errors in causal effect estimates, especially when
the original classifier lacks calibration (Gutman et al., 2022). This is
particularly important when you use expressive machine learning models
to estimate your propensity scores. You might want to use such models if
you want your propensity scores to reflect higher-order relationships
(interactions) in your dataset that simple logistic regression cannot
capture (unless you perform some feature engineering).
It’s time to conclude our section on propensity scores, but it’s definitely not
the last time we see them.
Let’s go!
By the end of this section, you will have a solid understanding of what CATE
is, understand the main ideas behind meta-learners, and learn how to
implement S-Learner using DoWhy and EconML on your own.
Ready?
However, it’s important to remember that people and other complex entities
(such as animals, social groups, companies, or countries) can have different
individual reactions to the same treatment.
A cancer therapy might work very well for some patients while having no
effect on others. A marketing campaign might work great for most people
unless you target software developers.
Your best friend Rachel might enjoy your edgy jokes, yet your brother Artem
might find them less appealing.
When we deal with a situation like this, ATE might hide important
information from us. Imagine that you want to estimate the influence of your
jokes on people’s moods. You tell a joke to your friend Rachel, and her mood
goes up from a 3 to a 5 on a 5-point scale.
This gives us the individual treatment effect (ITE) for Rachel, which is 2.
Now, you say the same joke to Artem, and his mood decreases from a 3 to a
1. His ITE is -2.
If we were to calculate the ATE based on these two examples, we would get
the effect of 0 since the ITEs for Rachel and Artem cancel each other out.
However, this doesn’t tell us anything about the actual impact of the joke on
Rachel or Artem individually.
In this case, the ATE hides the fact that the joke had a positive effect on
Rachel’s mood and a negative effect on Artem’s mood. To fully understand
the effect of the joke, we need to look at the ITE for each individual.
For instance, it might be the case that your jokes are perceived as funnier by
people familiar with a specific context. Let’s assume that Rachel studied
philosophy. It’s likely that your jokes about the German philosopher
Immanuel Kant will be funnier to her than they will be to Artem, who is
interested in different topics and has never heard of Kant. CATE can help us
capture these differences.
In general, CATE for a binary treatment can be defined in the following way:

$$CATE(x) = \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right]$$
The idea that people might react differently to the same content is often
presented in a matrix sometimes called the uplift model matrix that you can
see in Figure 9.6:
Figure 9.6 – The uplift model matrix
In Figure 9.6, rows represent reactions to content when the treatment (e.g., an
ad) is presented to the recipient. Columns represent reactions when no
treatment is applied.
The four colored cells represent a summary of the treatment effect dynamics.
Sure thing (green) buys regardless of the treatment. Do not disturb (red)
might buy without the treatment, but they won’t buy if treated (e.g., Czakon’s
developers). Lost cause (gray) won’t buy regardless of the treatment status,
and Persuadable (blue) would not buy without the treatment but might buy
when approached.
Marketing to the Sure thing and Lost cause groups will not hurt you directly
but won’t give you any benefit while consuming the budget.
Assuming that your data comes from an RCT, you should start by checking
whether the design of your study did not lead to leakage. Leakage refers to a
situation where some aspects of your RCT design or the randomization
process itself lead to the non-random assignment of units to the control and
experimental groups.
Note that this method is informative only when leakage variables are
observed. In real-world scenarios, it might be the case that leakage is driven
by an unobserved variable that is independent of any other observed
variable.
This can be achieved either by building an architecture that allows for modeling interactions
or by manually providing a model with interaction features. In the latter case, even a simple
regression will be sufficient to model CATE, though, at scale, manual feature generation
might be a challenging task (both epistemologically and technically).
In the former case, we need to make sure that the architecture is expressive enough to
handle feature interactions. Tree-based models deal well with this task. Classic neural
networks of sufficient depth can also model interactions, but this might be challenging
sometimes.
Finally, to obtain the effect estimate, we subtract the value of the prediction without treatment from the value of the prediction with treatment.
You can easily build it yourself using any relevant model. Nonetheless, we’ll
use DoWhy here to continue familiarizing ourselves with the API and to
leverage the power of convenient abstractions provided by the library.
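For reference, a do-it-yourself version of this recipe (our own illustrative sketch, not the book's or DoWhy's code) could look as follows, with X, t, and y as NumPy arrays:
import numpy as np
from lightgbm import LGBMRegressor

def s_learner_cate(X, t, y, X_new):
    # Fit a single model on the features plus the treatment indicator
    model = LGBMRegressor().fit(np.column_stack([X, t]), y)
    # Predict under treatment and under no treatment, then subtract
    pred_treated = model.predict(np.column_stack([X_new, np.ones(len(X_new))]))
    pred_untreated = model.predict(np.column_stack([X_new, np.zeros(len(X_new))]))
    return pred_treated - pred_untreated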
Figure 9.7 – The causal graph (DAG) for the enhanced earnings dataset
Let’s read in the data. We’ll use the train and test sets here so that we can
evaluate the model’s performance:
earnings_interaction_train = pd.read_csv(
    r'./data/ml_earnings_interaction_train.csv')
earnings_interaction_test = pd.read_csv(
    r'./data/ml_earnings_interaction_test.csv')
earnings_interaction_train.shape, earnings_interaction_test.shape
Our train set consists of 5000 observations, and the test set consists of 100
observations.
Why?
Because they say nothing about how correct our estimand is. In other words, they are not
informative when it comes to potential hidden confounding, selection bias, or any other
structurally invoked bias (you can also think about cross-validation as a rung 1 method).
That said, in our case, the traditional train-test split is useful. We know the true causal
structure, and so we’re not interested in evaluating the estimand, but rather – assuming we
have a correct estimand – we want to know how well our models estimate the effect, and –
as we know from traditional machine learning literature – cross-validation and train-test splits
can help us with that.
Figure 9.8 presents the first five rows of the training dataset:
Figure 9.8 – The first five rows of the enhanced earnings dataset (train)
Figure 9.9 – The first five rows of the enhanced earnings dataset (test)
As you can see, the test set’s structure is different. We don’t see the earnings
column. Instead, we have a new column called true_effect. Clearly,
knowing the true effect is the privilege of synthetic data, and we won’t get it
with the real-world data!
Nevertheless, we want to use it here to demonstrate important aspects of our
models’ performance.
First, let’s instantiate the CausalModel object (note we’re omitting the GML
graph construction here for brevity; check out the notebook for the full code):
model = CausalModel(
    data=earnings_interaction_train,
    treatment='took_a_course',
    outcome='earnings',
    effect_modifiers='python_proficiency',
    graph=gml_string
)
estimand = model.identify_effect()
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.SLearner',
    target_units='ate',
    method_params={
        'init_params': {
            'overall_model': LGBMRegressor(
                n_estimators=500, max_depth=10)
        },
        'fit_params': {}
    })
Note that the method_params dictionary takes two keys:
init_params
fit_params
The expected values for these keys are also dictionaries. The former defines
estimator-level details such as the base-learner model class or model-
specific parameters. In other words, everything that’s needed to initialize
your estimator.
The latter defines parameters that can be passed to the causal estimator’s
.fit() method. One example here could be the inference parameter, which
allows you to switch between bootstrap and non-bootstrap inference modes.
If you’re unsure what keys are expected by your method of choice, check out
the EconML documentation here https://bit.ly/EconMLDocs.
Let’s check whether S-Learner did a good job. Note that we’re passing the
refutation step in print for brevity, but the model behaved well under a
random common cause and placebo refuters (check out this chapter’s
notebook for details).
effect_pred = model.causal_estimator.effect(
    earnings_interaction_test.drop(
        ['true_effect', 'took_a_course'], axis=1))
Note that we’re dropping the treatment column (took_a_course) and the true
effect column.
The true effect column will be useful at the evaluation stage, which we’ll
perform in a second.
effect_true = earnings_interaction_test['true_effect'].values
Let’s compute the mean absolute percentage error (MAPE) and plot the
predicted values against the true effect:
mean_absolute_percentage_error(effect_true, effect_pred)
0.0502732092578003
Slightly above 5%! It’s a pretty decent result in absolute terms (of course
such a result might be excellent or terrible in a particular context, depending
on your use case)!
Let’s plot the true effect against predicted values. Figure 9.10 presents the
results:
Figure 9.10 – True effect (x axis) versus predicted effect (y axis) for S-Learner trained on
full data
The x axis in Figure 9.10 represents the values of the true effect for each of
the observations in the test set. The y axis represents the predicted values,
and the red line represents the results of a (hypothetical) perfect model (with
zero error).
We can see that there are some observations where our S-Learner
underestimates the effect (the three points that are way below the line), but its
overall performance looks good!
Now I want to share with you something that I feel is often overlooked or
discussed only briefly or in abstract terms by many practitioners.
Small data
Let’s train the same S-Learner once again but on a small subset of the training
data (100 observations).
You can find the code for this training and evaluation in the notebook.
For the model trained on 100 observations, the MAPE is equal to 35.9%!
That’s over seven times larger than for the original dataset. Figure 9.11
presents the results:
Figure 9.11 – The results for S-Learner trained on 100 observations
As you can see, this model is very noisy to the extent that it’s almost
unusable.
The question of defining a “safe” dataset size for S-Learner and other causal
models is difficult to answer. Power calculations for machine learning
models are often difficult, if possible at all. For our earnings dataset, 1,000
observations are enough to give us a pretty sensible model (you can check it
out in the notebook by changing the sample size in model_small and re-
running the code), but for another – perhaps more complex – dataset, this
might not be enough.
If your case resembles an A/B test, you can try one of the A/B test sample
size calculators (https://bit.ly/SampleSizeCalc). Otherwise, you can try
computing the power for an analogous linear model using statistical power
computation software (e.g., G*Power; https://bit.ly/GPowerCalc) and then
heuristically adjust the sample size to your model complexity.
That said, these methods can provide you only with very rough estimates.
S-Learner’s vulnerabilities
S-Learner is a great, flexible, and relatively easy-to-fit model. The simplicity
that makes this method so effective is also a source of its main weakness.
S-Learner treats the treatment variable (pun not intended) as any other
variable. Let’s see what we might risk here.
Because tree-based base learners might never split on a variable that adds little predictive value, the model may effectively ignore the treatment. Therefore, if the treatment effect is small, the S-Learner model may decide to ignore the treatment completely, resulting in the predicted causal effect being nullified. One heuristic solution to this problem is to use deeper trees to increase the probability of a split on the treatment, but keep in mind that this can also increase the risk of overfitting.
Other base learners with regularization, such as lasso regression, might also
learn to ignore the treatment variable.
Figure 9.12 – The graphical intuition behind the T-Learner forced split
To address this, T-Learner forces a split on the treatment: we split the data on the treatment variable, fit one base learner to the treated units and another to the untreated units, and then subtract the results of the untreated model from the results of the treated model (a from-scratch sketch follows below).
Note that now there is no chance that the treatment is ignored, as we encoded the treatment split as two separate models.
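In the same spirit as the S-Learner sketch earlier, a do-it-yourself version of this logic (again, our own illustration, with X, t, and y as NumPy arrays) could look like this:
from lightgbm import LGBMRegressor

def t_learner_cate(X, t, y, X_new):
    # One base learner per treatment arm
    model_treated = LGBMRegressor().fit(X[t == 1], y[t == 1])
    model_untreated = LGBMRegressor().fit(X[t == 0], y[t == 0])
    # Subtract the untreated predictions from the treated predictions
    return model_treated.predict(X_new) - model_untreated.predict(X_new)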
Implementing T-Learner
We’ll reuse the same graph and the same estimand as we used for S-Learner
and focus on the estimate. We’ll use the original (large) dataset.
Compared to S-Learner, there are two essential differences in the code:
First, we use EconML's TLearner class as our method (method_name).
Second, instead of just one model, we need to fit two models now, and this is reflected in the structure of the init_params dictionary: rather than the overall_model key, we use the models key, which takes a list of models as a value:
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.TLearner',
    target_units='ate',
    method_params={
        'init_params': {
            'models': [
                LGBMRegressor(n_estimators=200, max_depth=10),
                LGBMRegressor(n_estimators=200, max_depth=10)
            ]
        },
        'fit_params': {}
    })
Let’s estimate the effect and retrieve the true effect value:
effect_pred = model.causal_estimator.effect(
    earnings_interaction_test.drop(
        ['true_effect', 'took_a_course'], axis=1))
effect_true = earnings_interaction_test['true_effect'].values
Note that we repeat exactly the same steps as before. Let’s compute the
MAPE:
mean_absolute_percentage_error(effect_true, effect_pred)
The error is higher than the one for S-Learner. The MAPE is now above 8%!
Let’s look into the results by plotting the true effect versus the predicted
effect in Figure 9.13:
Figure 9.13 – The results of the T-Learner model trained on full data
The results look much worse than the ones in Figure 9.10. Didn’t we just say
that T-Learner was created as an improvement over S-Learner?
T-Learner focuses on improving just one aspect where S-Learner might (but
does not have to) fail. This improvement comes at a price. Fitting two
algorithms to two different data subsets means that each algorithm is trained
on fewer data, which can harm the fit quality.
It also makes T-Learner less data-efficient (you need twice as much data to
teach each T-Learner’s base learner a representation of quality comparable
to that in S-Learner).
One scenario where this can really matter is when the treatment groups are heavily imbalanced. This is often the case in modern-day online A/B testing, where a company
sends only a minor percentage of the traffic to a new version of the site or
service in order to minimize risk. In such a case you might want to use a
simpler model for the treatment arm with a smaller number of observations
(yes, we don’t have to use the same architecture for both base learners).
To summarize, T-Learner can be helpful when you expect that your treatment
effect might be small and S-Learner could fail to recognize it. It’s good to
remember that this meta-learner is usually more data-hungry than S-Learner,
but the differences decrease as the overall dataset size becomes larger.
Let’s start!
Hey, but wait! Did we just say estimate CATE? How can we possibly do that
if we never actually observe both potential outcomes necessary to compute
CATE?
Great question!
Let’s see.
Seatbelts fastened?
1. The first step is easy, plus you already know it. It’s precisely what we
did for T-Learner.
We split our data on the treatment variable so that we obtain two separate subsets: the first containing treated units only, and the second containing untreated units only. Next, we train two models, one on each subset. We call these models $\hat{\mu}_1$ (trained on the treated units) and $\hat{\mu}_0$ (trained on the untreated units), respectively.
The reason is twofold. We know something about the treated units! For each of them, we know one of their potential outcomes, namely the outcome under treatment, $Y_i(1)$, because that's precisely their actual outcome ($Y_i$).
Hopefully, this makes sense. We know what outcome to expect under the treatment from the treated units because we simply see this outcome recorded in the data.
The challenge is that we don’t see the other outcome for the treated units –
we don’t know how they would behave if we didn’t treat them.
Happily, we can use one of our models to estimate the outcome under no
treatment.
Let’s do it and write it down formally. We’ll call the obtained quantities imputed treatment scores and denote them $\tilde{D}$ (for the estimated difference).
We will use superscript notation to indicate whether the score was computed for the treated versus untreated units. The score for the treated units is computed in the following way:
$\tilde{D}^1_i = Y^1_i - \hat{\mu}_0(x^1_i)$
Second, we take all the untreated observations and compute the “mirror” score for them:
$\tilde{D}^0_j = \hat{\mu}_1(x^0_j) - Y^0_j$
We indexed the first formula using i and the second using j to emphasize that
the subsets of observations used in the first and second formulas are disjoint
(because we split the dataset on the treatment in the beginning).
3. In the third step, we’re going to train two more models (hold on, that’s not the end!). These models will learn to predict the imputed scores $\tilde{D}$ from the units’ feature vectors $x$.
You might ask – why do we want to predict $\tilde{D}$ from $x$ if we’ve just computed it?
Great question! The truth is that if we’re only interested in quantifying the
causal effects for an existing dataset, we don’t have to do it.
On the other hand, if we’re interested in predicting causal effects for new
observations before administering the treatment, then this becomes necessary
(because we won’t know any of the potential outcomes required for
computations in step 2 before administering the treatment).
Going back to our models. We said that we’ll train two of them. Let’s call these models $\hat{\tau}_0$ and $\hat{\tau}_1$. These second-stage models are also known in the literature as second-stage base learners. We’ll train the first model to predict $\tilde{D}^0$, and the second one to predict $\tilde{D}^1$:
$\hat{\tau}_0(x) = E[\tilde{D}^0 \mid X = x], \quad \hat{\tau}_1(x) = E[\tilde{D}^1 \mid X = x]$
We mark the new predicted quantities with a double hat because each of them is an estimate of an estimate.
Now, let’s briefly discuss the logic behind the weighting part. The final CATE estimate combines the two second-stage models using the propensity score, $g(x)$, as a weight: $\hat{\tau}(x) = g(x)\,\hat{\tau}_0(x) + (1 - g(x))\,\hat{\tau}_1(x)$. In other words, we weight the output of $\hat{\tau}_0$ by the unit’s propensity score (which is the probability of this unit getting the treatment).
Let’s recall that $\hat{\tau}_0$ is a model trained on the units that were untreated. We knew their actual outcome under no treatment, but we needed to predict (using $\hat{\mu}_1$) their potential outcome under treatment in order to train $\hat{\tau}_0$.
This logic is not bullet-proof, though. Perhaps this is the reason why the
authors propose that we can use other weighting functions instead of
propensity scores or even “choose [the weights to be] (…) 1 or 0 if the
number of treated units is very large or small compared to the number of
control units” (Künzel et al., 2019).
Let’s summarize what we’ve learned so far! X-Learner requires us to fit five models in total (if we count the propensity score model).
First-stage base learners $\hat{\mu}_0$ and $\hat{\mu}_1$ are identical to the two models we use in T-Learner. The former ($\hat{\mu}_0$) is trained solely on the untreated units, while the latter ($\hat{\mu}_1$) solely on the treated ones. Second-stage base learners are trained on the imputed scores, which combine the actual outcomes from the dataset with the outputs of the first-stage models.
1. We start exactly the same as before. We split our data on the treatment variable, which gives us two subsets: the subset of all treated observations and a subset of all untreated observations.
We fit two separate models, one to each subset of the data. These are our first-stage base learners, $\hat{\mu}_0$ and $\hat{\mu}_1$.
2. In step 2, we use the models $\hat{\mu}_0$ and $\hat{\mu}_1$ alongside the actual outcomes from the dataset in order to compute the imputed scores:
$\tilde{D}^1_i = Y^1_i - \hat{\mu}_0(x^1_i), \quad \tilde{D}^0_j = \hat{\mu}_1(x^0_j) - Y^0_j$
Again, this step is identical to what we’ve done before. Note that now we have imputed effect scores for all observations (regardless of their original treatment values); let’s call them $\tilde{D}$.
3. In step 3, we fit the final model, let’s call it $\hat{\tau}$, that learns to predict $\tilde{D}$ directly from $x$:
$\hat{\tau}(x) = E[\tilde{D} \mid X = x]$
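Before we move to EconML, here is a minimal from-scratch sketch of the procedure above for a binary treatment. This is only an illustration of the logic, not the EconML implementation; it assumes NumPy arrays X, T, and y:
from lightgbm import LGBMRegressor
from sklearn.linear_model import LogisticRegression

def xlearner_cate(X, T, y, X_new):
    # Step 1: first-stage base learners, one per treatment arm
    mu_0 = LGBMRegressor().fit(X[T == 0], y[T == 0])
    mu_1 = LGBMRegressor().fit(X[T == 1], y[T == 1])
    # Step 2: imputed treatment scores
    d_1 = y[T == 1] - mu_0.predict(X[T == 1])  # treated units
    d_0 = mu_1.predict(X[T == 0]) - y[T == 0]  # untreated units
    # Step 3: second-stage base learners predict the imputed scores from X
    tau_1 = LGBMRegressor().fit(X[T == 1], d_1)
    tau_0 = LGBMRegressor().fit(X[T == 0], d_0)
    # Combine the two CATE estimates, weighting by the propensity score
    g = LogisticRegression().fit(X, T).predict_proba(X_new)[:, 1]
    return g * tau_0.predict(X_new) + (1 - g) * tau_1.predict(X_new)
The EconML estimator that we call through DoWhy below follows the same logic, with additional conveniences such as support for multiple treatments.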
Implementing X-Learner
Now that we understand how X-Learner works, let’s fit it to the data and compare its performance against its less complex cousins.
We’ll reuse the graph and the estimand that we created previously and only
focus on the estimate:
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.metalearners.XLearner',
target_units='ate',
method_params={
'init_params': {
'models': [
LGBMRegressor(n_estimators=50,
max_depth=10),
LGBMRegressor(n_estimators=50,
max_depth=10)
],
'cate_models': [
LGBMRegressor(n_estimators=50,
max_depth=10),
LGBMRegressor(n_estimators=50,
max_depth=10)
]
},
'fit_params': {},
})
Note that we now have a new key in the init_params object: cate_models.
We use it to specify our second-stage base learners. If you don’t specify the
CATE models, EconML will use the same models that you provided as your
first-stage base learners. It’s also worth mentioning that if you want to use identical models for $\hat{\mu}_0$, $\hat{\mu}_1$, $\hat{\tau}_0$, and $\hat{\tau}_1$, it’s sufficient to specify the model only once:
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.metalearners.XLearner',
target_units='ate',
method_params={
'init_params': {
'models': LGBMRegressor(n_estimators=50,
max_depth=10),
},
'fit_params': {},
})
EconML will duplicate this model behind the scenes for you.
If we want to explicitly specify the propensity score model, this can also be
done by adding the propensity_model key to the init_params dictionary and
passing the model of choice as a value.
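For example (a sketch – identical to the previous call, with one extra key):
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.XLearner',
    target_units='ate',
    method_params={
        'init_params': {
            'models': LGBMRegressor(n_estimators=50, max_depth=10),
            # Explicit propensity score model of our choice
            'propensity_model': LogisticRegression()
        },
        'fit_params': {}
    })
With the estimator fitted, we compute the predicted effects and the MAPE exactly as before: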
effect_pred = model.causal_estimator.effect(
earnings_interaction_test.drop(['true_effect',
'took_a_course'], axis=1))
effect_true = earnings_interaction_test['true_effect'
].values
mean_absolute_percentage_error(effect_true, effect_pred)
0.036324966995778
The MAPE for X-Learner is 3.6%. That’s more than twice as good as T-
Learner’s 8.1% and around 32% better than S-Learner’s 5.2%!
We can see that most points are concentrated around the red line. We don’t see clear outliers as in the case of S-Learner (Figure 9.10), and the overall error is much smaller than in the case of T-Learner (Figure 9.13). A pleasant view.
The results for all three meta-learners on our dataset are summarized in
Table 9.2:
Estimator MAPE
S-Learner 5.2%
T-Learner 8.1%
X-Learner 3.6%
On the other hand, when your dataset is very small, X-Learner might not be a
great choice, as fitting each additional model comes with some additional
noise, and we might not have enough data to overpower this noise with signal
(in such a case, it might be better to use a simpler – perhaps linear – model
as your base learner).
Take a look at Figure 9.15, which presents the results of X-Learner trained
on just 100 observations from our training data:
Figure 9.15 – X-Learner’s results on small data
At first sight, the results in Figure 9.15 look slightly more structured than the
ones for S-Learner on small data (Figure 9.11), but the MAPE is slightly
higher (39% versus 36%).
Wrapping it up
Congrats on finishing Chapter 9!
We started with the basics and introduced the matching estimator. On the
way, we defined ATE, ATT, and ATC.
We learned that S-Learner might sometimes ignore the treatment variable and
therefore underestimate (or nullify) causal effects.
X-Learner seems like a safe bet in most cases where the sample size is
sufficiently large. S-Learner is a great starting point in many settings as it’s
simple and computationally friendly.
In the next chapter, we’ll see more architectures that can help us model
heterogeneous treatment effects.
References
Abrevaya, J., Hsu, Y., & Lieli, R.P. (2014). Estimating Conditional Average
Treatment Effects. Journal of Business & Economic Statistics, 33, 485–
505.
Elze, M. C., Gregson, J., Baber, U., Williamson, E., Sartori, S., Mehran, R.,
Nichols, M., Stone, G. W., & Pocock, S. J. (2017). Comparison of
Propensity Score Methods and Covariate Adjustment: Evaluation in 4
Cardiovascular Studies. Journal of the American College of Cardiology,
69(3), 345–357. https://doi.org/10.1016/j.jacc.2016.10.060
Facure, M. (2020). Causal Inference for the Brave and True.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html
Gelman, A., & Hill, J. (2006). Analytical methods for social research: Data
analysis using regression and multilevel/hierarchical models. Cambridge
University Press.
Gutman, R., Karavani, E., & Shimoni, Y. (2022). Propensity score models
are better when post-calibrated. arXiv.
Hernán M. A., Robins J. M. (2020). Causal Inference: What If. Boca Raton:
Chapman & Hall/CRC.
Iacus, S., King, G., & Porro, G. (2012). Causal Inference without Balance
Checking: Coarsened Exact Matching. Political Analysis, 20, 1–24.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.
(2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
NIPS.
King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used
for Matching. Political Analysis, 27 (4).
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for
estimating heterogeneous treatment effects using machine learning.
Proceedings of the National Academy of Sciences, 116(10), 4156–4165.
LaLonde, R.J. (1986). Evaluating the Econometric Evaluations of Training
Programs with Experimental Data. The American Economic Review, 76,
604–620.
Liu, Y., Dieng, A., Roy, S., Rudin, C., & Volfovsky, A. (2018). Interpretable
Almost Matching Exactly for Causal Inference. arXiv: Machine Learning.
Morucci, M., Orlandi, V., Roy, S., Rudin, C., & Volfovsky, A. (2020).
Adaptive Hyper-box Matching for Interpretable Individualized Treatment
Effect Estimation. arXiv, abs/2003.01805.
Rosenbaum, P.R., & Rubin, D.B. (1983). The central role of the propensity
score in observational studies for causal effects. Biometrika, 70, 41–55.
Sizemore, S., Alkurdi, R. (August 18, 2019). Matching Methods for Causal
Inference: A Machine Learning Update. Seminar Applied Predictive
Modeling (SS19). Humboldt-Universität zu Berlin. https://humboldt-
wi.github.io/blog/research/applied_predictive_modeling_19/matching_meth
ods/
Zou, D., Zhang, L., Mao, J., & Sheng, W. (2020). Feature Interaction based
Neural Network for Click-Through Rate Prediction. arXiv,
abs/2006.05312.
10
In this chapter, we’ll learn about doubly robust (DR) methods, double
machine learning (DML), and Causal Forests. By the end of this chapter,
you’ll have learned how these methods work and how to implement them
using EconML by applying them to real-world experimental data. You’ll also
have learned about the concept of counterfactual explanations.
Causal Forests
Counterfactual explanations
Isn’t the fact that both treatment and outcome models can be used to achieve
the same goal interesting?
Let’s see.
Adding a new node representing the propensity score to the graph is possible
thanks to the propensity score theorem that we learned about in the
previous chapter.
By controlling for X
Each of these ways will deconfound the relationship between the treatment
and the outcome.
Interestingly, although all three ways to deconfound the data are equivalent
from a graphical point of view, they might lead to different estimation errors.
In particular, although the estimands obtained by adjusting for X and for e(X) are equivalent, the estimators of these quantities might have different errors. In certain cases, one of
the models might be correct with a near-zero error, while the other might be
misspecified with arbitrary bias.
It turns out that this is exactly what DR methods do. We get a model that
automatically switches between the outcome model and the treatment model.
The bias of DR estimators scales as the product of the errors of the
treatment and outcome models (Hernán & Robins, 2020; p. 229).
Moreover, when both models (treatment and outcome) are even moderately
misspecified, DR estimators can have high bias and high variance (Kang &
Schafer, 2007).
Speaking from experience, the DR estimator will usually also have a smaller variance than the meta-learners introduced in the previous chapter. This observation is
supported by theory. The DR estimator can be decomposed as an S-Learner
with an additional adjustment term (for details, see Courthoud, 2022).
To be fair, the basic idea of double robustness dates back even earlier – to
Cassel et al.’s 1976 paper. Cassel and colleagues (1976) proposed an
estimator virtually identical to what we call the DR estimator today, with the
difference that they only considered a case with known (as opposed to
estimated) propensity scores.
1. Divide the big formula into two chunks (on the left of the main minus sign
and on the right).
The two main parts of the formula in Figure 10.2 represent models under
treatment (left; orange) and under no treatment (right; green). The purple
parts represent averaging.
Let’s say that our outcome model under treatment, $\hat{\mu}_1$, returns a perfect prediction for observation $i$, but our propensity estimate, $\hat{e}(X_i)$, is way off. In such a case, the prediction $\hat{\mu}_1(X_i)$ will be equal to $Y_i$. Therefore, the residual $Y_i - \hat{\mu}_1(X_i)$ will be 0. This will make the whole fraction expression in the orange part in Figure 10.2 equal to 0, leaving us with $\hat{\mu}_1(X_i)$ after the plus sign. Because $\hat{\mu}_1$ returned a perfect prediction, $\hat{\mu}_1(X_i)$ is a perfect estimate and we’re happy to use it.
Note that the same logic holds for the green part of the formula in Figure
10.2.
We can also show that in the case of a misspecified outcome model, the
opposite will happen – the outcome model will be weighted down to 0 and
the treatment model will take the lead in the estimate. Showing this
mechanism requires some algebraic transformations, which we will skip
here. If you’re interested in more details, Chapter 12 of Matheus Facure’s
online book contains an accessible explanation: https://bit.ly/DoublyRobust.
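To make the switching mechanism tangible, here is a minimal from-scratch sketch of a doubly robust (AIPW-style) ATE estimate for a binary treatment. It is only an illustration, not EconML's implementation; in practice the nuisance models should be cross-fitted, and NumPy arrays X, T, and y are assumed:
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import LogisticRegression

def doubly_robust_ate(X, T, y):
    # Nuisance models: propensity (treatment) model and two outcome models
    e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
    mu_1 = LGBMRegressor().fit(X[T == 1], y[T == 1]).predict(X)
    mu_0 = LGBMRegressor().fit(X[T == 0], y[T == 0]).predict(X)
    # Outcome-model predictions corrected by propensity-weighted residuals;
    # a perfect outcome model zeroes out the correction terms, while a
    # good propensity model compensates for a misspecified outcome model
    y1_dr = mu_1 + T * (y - mu_1) / e
    y0_dr = mu_0 + (1 - T) * (y - mu_0) / (1 - e)
    return np.mean(y1_dr - y0_dr)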
The code for the following experiments can be found in the notebook for
Chapter 10 (https://bit.ly/causal-ntbk-10).
We will continue with the same data that we used for meta-learners in the
previous chapter and so we’ll skip the graph creation and estimand
identification steps here. You can find the full pipeline implemented in the
notebook accompanying this chapter.
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dr.LinearDRLearner',
target_units='ate',
method_params={
'init_params': {
# Specify treatment and outcome models
'model_propensity': LogisticRegression(),
'model_regression': LGBMRegressor(n_estimators=
1000, max_depth=10)
},
'fit_params': {}
})
First, we use 'backdoor.econml.dr.LinearDRLearner' as a method name. We
specify the ATE as a target unit (but CATE estimates will still be available to
us) and pass model-specific parameters as a dictionary. Here, we only use
two parameters to specify the outcome model ('model_regression') and the
treatment model ('model_propensity'). We picked simple logistic regression
for the latter and an LGBM regressor for the former.
effect_pred = model.causal_estimator.effect(
earnings_interaction_test.drop(['true_effect',
'took_a_course'], axis=1))
effect_true = earnings_interaction_test['true_effect'
].values
Now, we’re ready to evaluate our model. Let’s check the value of the mean
absolute percentage error (MAPE):
mean_absolute_percentage_error(effect_true, effect_pred)
0.00623739241153059
This is six times lower than X-Learner’s error in the previous chapter
(3.6%)!
Let’s plot the predictions versus the true effect (Figure 10.3):
Figure 10.3 – True versus predicted effect for the linear DR-Learner model
Let’s think about what could help the model get such a good fit.
Note that the scale of the plot plays a role in forming our impressions
regarding the model’s performance. For ease of comparison, we kept the
ranges for both axes identical to the ones in the previous chapter. To an
extent, this can mask the model’s errors visually, but it does not change the
fact that the improvements over meta-learners from Chapter 9 are significant.
Let’s address one more question that some of us might have in mind – we
used a model called LinearDRLearner. Why do we call it linear if we used a
non-linear boosting model as an outcome model? The answer is that the
orthogonalization procedure that we’ll describe in the next section allows us
to model non-linearities in the data while preserving linearity in causal
parameters.
This setting is a good-enough fit for a broad category of problems, yet if you
need more flexibility, you can use the DRLearner class and choose an
arbitrary non-linear final model.
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dr.DRLearner',
target_units='ate',
method_params={
'init_params': {
'model_propensity': LogisticRegression(),
'model_regression': LGBMRegressor(
n_estimators=1000, max_depth=10),
'model_final': LGBMRegressor(
n_estimators=500, max_depth=10),
},
'fit_params': {}
})
This model’s MAPE is over 7.6% – over 10 times higher than for a simple
linear DR-Learner. Figure 10.4 presents the results:
Figure 10.4 – True versus predicted effect for the non-linear DR-Learner model
As we can see, the non-linear model has a much higher variance on the test
set. Try to decrease the number of estimators and the maximum depth in the
final model and see how it affects the error.
ForestDRLearner uses the Causal Forest algorithm (Wager & Athey, 2018) as
the final model. We’ll discuss Causal Forests in greater detail later in this
chapter. Forest DR-Learner is great at handling multidimensional data and, unlike SparseLinearDRLearner, it is not limited to linear final models.
Now, let’s see another interesting DR estimator that can bring advantages
under (limited) positivity violations (Porter et al., 2011) or smaller sample
sizes.
Note that the TMLE estimator we introduce in this chapter works for binary
treatments and binary outcomes only. The method can be extended to
continuous outcomes under certain conditions (Gruber & van der Laan,
2010).
TMLE can be used with any machine learning algorithm and in this sense can
be treated as a meta-learner.
2. Generate the following predictions (note that all of them are probabilities, not labels):
By saying fixed vector, we mean that we take a whole vector of values (in this case, a vector of predictions coming from the initial outcome model) and use this vector as an intercept, instead of using just a single number (a scalar). Specifically, the intercept value for each row in our dataset is the prediction of the model for that particular row.
2. Obtaining the slope coefficient from the preceding logistic regression model. We call this coefficient the fluctuation parameter and use $\hat{\epsilon}$ to denote it.
Next, we update the initial predictions using the fluctuation parameter. The updates are applied on the logit scale and then mapped back to probabilities with the expit function.
The expit function is the inverse of the logit function. We need it here to go
back from logits to probabilities. To learn more about logit and expit, check
out this great blog post by Kenneth Tay: https://bit.ly/LogitExpit.
7. Compute the estimate of the ATE as the average difference between the updated predictions under treatment and under no treatment (this can also be adapted for CATE).
Following the formulas, you can relatively easily implement it from scratch
yourself in Python, but you don’t have to!
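By the way, if you want to play with the logit and expit functions yourself, both ship with SciPy:
from scipy.special import expit, logit

p = 0.8
print(expit(logit(p)))  # prints a value equal to 0.8 (up to floating-point error)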
For an excellent visual introduction to TMLE, check out the series of blog
posts by Katherine Hoffman: https://bit.ly/KatHoffmanTMLE. Her great
work helped me structure this section more clearly and I hope this will also
translate into more clarity for you.
We learned that DR methods rely on two models: the outcome model and the
treatment model (also called the propensity model). Both models can be
estimated using any compatible machine learning technique – whether linear
models, tree-based methods, or neural networks.
The core strength of DR models is that even if only one model (e.g., the
outcome model) is specified correctly, they will still give consistent
estimates. When both models (outcome and treatment) are specified
correctly, DR estimators tend to have very low variance, which makes them
even more attractive.
In the next section, we’ll dive into another exciting area of causal inference,
called double machine learning, which offers some improvements over the
DR framework.
DML can be implemented using arbitrary base estimators, and in this sense,
it also belongs to the meta-learner family. At the same time, unlike S-, T- and
X-Learners, the framework comes with a strong theoretical background and
unique architectural solutions.
In this section, we’ll introduce the main concepts behind DML. After that,
we’ll apply it to our earnings dataset using DoWhy’s API. We’ll discuss
some popular myths that have arisen around DML and explore the
framework’s main limitations. Finally, we’ll compare it to DR estimators and
present some practical guidelines for when to choose DML over DR-
Learner.
In particular, the authors built DML in a way that makes it root-n consistent. We say that an estimator is consistent when its error goes down as the sample size grows. Intuitively, root-n consistency means that the estimation error goes to 0 at a rate of $1/\sqrt{n}$ as the sample size ($n$) goes to infinity.
MORE ON CONSISTENCY
Estimator consistency is important (not only in causality) because it tells us whether the
estimator will converge to the true value with a large enough sample size. Formally, we say
that a consistent estimator will converge in probability to the true value.
It turns out that there are two main sources of bias that we need to address: the first is the regularization bias introduced by flexible machine learning nuisance models, and the second comes from overfitting. Let’s start by addressing the latter.
Overfitting bias
DML solves the second problem using cross-fitting. The idea is as follows:
1. We split the data into two folds.
2. We fit the nuisance (outcome and treatment) models on the first fold and use them to compute out-of-fold predictions and the causal estimate on the second fold.
3. We swap the roles of the folds and repeat the procedure.
4. Finally, we average the two estimates and this gives us the final estimate.
1. Regress the outcome, Y, on the covariates, X, and the treatment, T, on X, and compute the residuals of both regressions.
2. Regress the outcome residuals on the treatment residuals.
It turns out that the coefficient on the treatment residuals from the last regression is equal to the treatment coefficient from the original regression equation (note that if you code it yourself, you may get minor differences between the two coefficients; they might come from numerical imprecision and/or the implementation details of the packages you use). For a proof of the FWL theorem, including R code
examples, check out https://bit.ly/FWLProof.
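As a quick numerical sanity check of this claim, we can run the two procedures side by side on simulated data (a sketch with made-up coefficients):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 10_000
X = rng.normal(size=(n, 3))
T = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)
y = 2.0 * T + X @ np.array([1.0, 0.3, -0.7]) + rng.normal(size=n)

# The treatment coefficient from the full regression of y on T and X
tau_full = LinearRegression().fit(np.column_stack([T, X]), y).coef_[0]

# FWL: residualize y and T on X, then regress residuals on residuals
y_res = y - LinearRegression().fit(X, y).predict(X)
t_res = T - LinearRegression().fit(X, T).predict(X)
tau_fwl = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]

print(tau_full, tau_fwl)  # both are ~2.0 and equal up to numerical precision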
The core strengths of DML come from its ability to model complex non-linear relationships in the data. Let’s rewrite the original linear regression equation to allow for non-linearities; this will let us demonstrate the power of DML more clearly. We will start with a so-called partially linear model:
$Y = \tau T + g(X) + \epsilon$
where $g(X)$ is an arbitrary (potentially highly non-linear) function of the covariates – the nuisance part of the model.
Moreover, this means that we can get the benefits of parametric statistical inference for the causal parameter, $\tau$, without worrying about the exact functional form of the nuisance parameter. This gives us great flexibility!
Before we conclude this subsection, let’s summarize the key points. DML
uses two machine learning models in order to estimate the causal effect.
These models are sometimes called the treatment and the outcome models
(as they were in the case of DR methods), but note that the treatment model in
DML does not estimate the propensity score, but rather models treatment
directly.
Additionally, we fit one more model to estimate the causal parameter, , from
the residuals. This model is known as the final model. In linear DML, this
model is – by design – a linear regression, but in generalized, non-parametric
DML, an arbitrary machine learning regressor can be used.
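To make this concrete, here is a minimal sketch of the partialling-out logic with cross-fitted nuisance models and a linear final stage. It illustrates the idea rather than EconML's implementation and returns a single, constant effect estimate for a binary treatment:
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict

def linear_dml_ate(X, T, y, cv=4):
    # Cross-fitted (out-of-fold) nuisance predictions to avoid overfitting bias
    y_hat = cross_val_predict(LGBMRegressor(), X, y, cv=cv)
    # For a binary treatment, the conditional mean of T given X is a probability
    t_hat = cross_val_predict(LogisticRegression(), X, T, cv=cv,
                              method='predict_proba')[:, 1]
    # Residualize the outcome and the treatment...
    y_res = y - y_hat
    t_res = T - t_hat
    # ...and fit the (linear) final model on the residuals
    final = LinearRegression(fit_intercept=False).fit(t_res.reshape(-1, 1), y_res)
    return final.coef_[0]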
DML is an estimator with low variance and bias and can provide us with
valid confidence intervals. It also natively supports both discrete and
continuous treatments.
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dml.LinearDML',
target_units='ate',
method_params={
'init_params': {
# Define outcome and treatment models
'model_y': LGBMRegressor(
n_estimators=500, max_depth=10),
'model_t': LogisticRegression(),
# Specify that treatment is discrete
'discrete_treatment': True
},
'fit_params': {}
})
effect_pred = model.causal_estimator.effect(
earnings_interaction_test.drop(['true_effect',
'took_a_course'], axis=1))
effect_true = earnings_interaction_test['true_effect'
].values
mean_absolute_percentage_error(effect_true, effect_pred)
0.0125345885989969
This error is roughly twice as high as the error for the linear DR-Learner.
It seems that DML comes with a slightly biased estimate. In particular, the
model underestimates the lower effect values (see the blue dots below the
red line for lower values of the true effect in Figure 10.5).
Let’s try to reduce the complexity of the outcome model and increase the
number of cross-fitting folds:
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dml.LinearDML',
target_units='ate',
method_params={
'init_params': {
# Define outcome and treatment models
'model_y': LGBMRegressor(n_estimators=50,
max_depth=10),
'model_t': LogisticRegression(),
# Specify that treatment is discrete
'discrete_treatment': True,
# Define the number of cross-fitting folds
'cv': 4
},
'fit_params': {
}
})
effect_pred = model.causal_estimator.effect(
earnings_interaction_test.drop(['true_effect',
'took_a_course'], axis=1))
effect_true = earnings_interaction_test['true_effect'
].values
mean_absolute_percentage_error(effect_true, effect_pred)
0.00810075098627887
That’s better, but the results are still slightly worse than for the best DR-
Learner.
These results look much better, but the model seems to systematically
overestimate the effect.
Now, you might be puzzled about this whole process. Isn’t there a
fundamental challenge to all we’re doing?
Hyperparameter tuning can improve our estimators’ fit to the data and their
generalization to new instances (note that it cannot help with the estimand).
There are two main ways to run hyperparameter tuning with DoWhy and
EconML.
The first is to wrap the models in one of the sklearn cross-validation classes,
GridSearchCV, HalvingGridSearchCV, or RandomizedSearchCV, and pass the
wrapped models into the constructor.
This time, we’ll use the LGBM classifier instead of logistic regression to
predict the treatment. We hope that with some hyperparameter tuning, we can
outperform the model with logistic regression.
Now’s also a good moment for a reminder. In DML, the treatment model does
not estimate propensity scores but rather predicts treatment values directly. In
this setting, probability calibration – important for propensity scores – is not
that critical and so using a more complex model might be beneficial, despite
the fact that its probabilities might be less well calibrated than in the case of
logistic regression.
Ready to code?
We’ll start by defining our outcome (model_y) and treatment (model_t)
models and wrapping them in a grid search wrapper:
model_y = GridSearchCV(
estimator=LGBMRegressor(),
# Define the model's parameter search space
param_grid={
'max_depth': [3, 10, 20, 100],
'n_estimators': [10, 50, 100]
},
# Define GridSearch params
cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
)
model_t = GridSearchCV(
estimator=LGBMClassifier(),
# Define the model's parameter search space
param_grid={
'max_depth': [3, 10, 20, 100],
'n_estimators': [10, 50, 100]
},
# Define GridSearch params
cv=10, n_jobs=-1, scoring='accuracy'
)
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dml.LinearDML',
target_units='ate',
method_params={
'init_params': {
# Pass models wrapped in GridSearchCV objects
'model_y': model_y,
'model_t': model_t,
# Set discrete treatment to `True`
'discrete_treatment': True,
# Define the number of cross-fitting folds
'cv': 4
},
'fit_params': {
}
})
Computing the MAPE in the same way as before gives us the following result:
0.00179346825212699
The MAPE value is only about 0.18%. This is much better than anything we’ve seen so far!
In particular, this result is roughly 3.5 times better than the result of our best DR-Learner.
The tuned model did a really good job – to the extent that such a small error might seem unrealistic.
When looking at Figure 10.7, we should remember two things. First, we’re
working with a relatively simple synthetic dataset, which should be easily
solvable by a powerful enough model. Second, for comparability, we keep
the same value ranges for both axes in the plot that we used for the previous
models. This (to an extent) hides the fact that the model has some error.
If you see results like this in real life, don’t forget about Twyman’s law. The
law has multiple variants (Kohavi et al., 2020) and states that any figure or
statistic that looks interesting is most likely wrong. Twyman’s law can be a
valuable reminder to retain healthy skepticism in our work.
As we’ve just seen, tuning hyperparameters for CATE models can bring
significant improvements. At the beginning of this section, we said that there
are two ways to tune hyperparameters with DoWhy and EconML.
The second is to tune the base learners yourself with an external hyperparameter optimization framework and pass the already-tuned models to the estimator. There are two main advantages to this second approach. The first is that you’re not
limited to the grid search or random search options available in sklearn.
With Hyperopt, Optuna, and some other frameworks, you can leverage the
power of more efficient Bayesian optimization.
A natural question is whether our CATE models can extrapolate beyond the data they were trained on. First, the answer depends on which models we use. For instance, tree-based
models do not extrapolate beyond the maximum value in the training set.
Let’s imagine that you only have one predictor in your problem, X, and all
values of X in your training dataset are bounded between 0 and 1,000.
Additionally, let’s assume that the outcome, Y, is always two times X.
This is a very simple problem, but if you test your tree-based model on data
where X ranges from 1,000 to 2,000 (that is, all values are outside the
training range), the prediction for all observations will be a constant value
that is no greater than the maximum outcome seen in the training dataset
(which would be 1000*2 = 2000).
This behavior comes from the fact that trees learn a set of rules from the data (e.g., if 4.5 < x < 5.5, then y = 10) and they would essentially need to make up rules for data that lies outside the training range. The fact that we’re using a causal DML model does not remove this limitation.
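A quick sketch illustrating this behavior with LightGBM:
import numpy as np
from lightgbm import LGBMRegressor

# Train on X in [0, 1000], where y = 2 * x
X_train = np.linspace(0, 1000, 2000).reshape(-1, 1)
y_train = 2 * X_train.ravel()
tree_model = LGBMRegressor(n_estimators=100).fit(X_train, y_train)

# Predict on X in [1000, 2000] – entirely outside the training range
X_out = np.linspace(1000, 2000, 5).reshape(-1, 1)
print(tree_model.predict(X_out))
# All predictions stay flat at roughly the largest outcome seen in
# training (~2000) – the trees cannot extrapolate the linear trend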
If we use a linear regression model or a neural network, on the other hand,
these models will extrapolate. This brings us to the second factor.
In this scenario, using DML will also not overcome the limitations of the
outcome and treatment models.
DML is a great and flexible method and tends to work very well in a broad
set of circumstances; yet – like any other method – it has its own set of
constraints. We will discuss the main ones here.
We already know that one of the assumptions behind DML is a lack of hidden
confounding. In their 2022 paper, Paul Hünermund and colleagues showed
that not meeting this assumption leads to significant biases that are similar in
magnitude to biases that we get from other, much simpler methods, such as
LASSO (Hünermund et al., 2022). These findings are consistent with the
results obtained by Gordon et al. (2022) in a large-scale study that
benchmarked DML using high-dimensional non-experimental data from the
Facebook Ads platform. Note that DML is sensitive to not only unobserved
common causes of treatment and outcomes but also other bad control
schemes that allow for non-causal information flow in the graph. This is also
true for other causal models (for a comprehensive overview of bad control
schemes, check out Cinelli et al., 2022).
While it might not be particularly surprising to some of you that DML can
lead to biased results when hidden confounding cannot be excluded, there are
two ideas I’d like to highlight here.
First, those familiar with deep learning will recognize that in certain cases,
increasing the training sample size can lead to significant model performance
improvements. In causal machine learning, this approach can also be helpful
when the base models do not have enough data to learn a useful mapping or
when the original data does not provide enough diversity to satisfy the
positivity assumption. That said, when these conditions are met, adding more
observations cannot help in reducing causal bias that arises from unmet
causal assumptions.
The reason for this is that causal bias comes from ill-defined structural
relationships between variables, not from an insufficient amount of
information within a dataset. In other words, causal bias is related to the
estimand misspecification and not to the statistical estimate.
Secondly, sometimes, in particular in big data settings, practitioners of
traditional machine learning might benefit from adding more features to the
model. For many tech companies, it’s not unusual to use hundreds or even
thousands of variables for predictive tasks. The logic behind adding more
features to a statistical model relies on the assumption that having more
features translates to more predictive power, which in turn translates to
better predictions. This reasoning might lead to beneficial outcomes when using modern machine learning techniques for purely predictive tasks. In causal modeling, however, adding features indiscriminately can introduce bad controls and open non-causal paths, harming our estimates rather than improving them.
I cannot give you a definitive answer that will guarantee that one model
always outperforms the other, but we can look into some recommendations
together. Hopefully, these recommendations will help you make decisions in
your own use cases.
The first key difference between the models is that DML works for
categorical and continuous treatments, while DR-Learner (similarly to
TMLE) is – by design – limited to categorical treatments. This alone might settle the choice for you right away.
On the other hand, DR-Learner will usually have higher variance than DML.
Jane Huang from Microsoft suggests that this might become particularly
noticeable when “there are regions of the control space (…) in which some
treatment has a small probability of being assigned” (Huang et al., 2020).
In this scenario, “DML method could potentially extrapolate better, as it
only requires good overlap on average to achieve good mean squared
error” (Huang et al., 2020). Note that TMLE could also perform well in the
latter case. Another setting where DML might outperform DR-Learner is
under sparsity in a high-dimensional setting, as demonstrated by Zachary
Clement (Clement, 2023; note that in this case, DR-Learner’s results could
be potentially improved by adding lasso regularization to the final model).
That said, given the results of our experiments in this and the previous
chapter, we might get the impression that DML outperforms other methods.
This impression is somewhat biased as we did not tune hyperparameters for
the remaining methods. On the other hand, DML is often recommended as the
go-to method for continuous treatments. DR methods cannot compete in this
category. S-Learner can be adapted to work with continuous treatment, but
the current version of EconML does not support this feature.
If you cannot benchmark a wide array of methods and you’re convinced that
your data contains heterogeneous treatment effects, my recommendation
would be to start with S-Learner, in particular if computational resources are
an issue. T-Learner and X-Learner might be good to add to the mix if your
treatment is discrete.
In the next section, we’ll discuss one more family of methods that might be
worth considering.
Causal Forest is a tree-based model that stems from the works of Susan
Athey, Julie Tibshirani, and Stefan Wager (Wager & Athey, 2018; Athey et
al., 2019). The core difference between regular random forest and Causal
Forest is that Causal Forest uses so-called causal trees. Otherwise, the
methods are similar and both use resampling, predictor subsetting, and
averaging over a number of trees.
Causal trees
What makes causal trees different from regular trees is the split criterion.
Causal trees use a criterion based on the estimated treatment effects, using
so-called honest splitting, where the splits are generated on training data,
while leaf values are estimated using a hold-out set (this logic is very similar
to cross-fitting in DML). For a more detailed overview, check out Daniel
Jacob’s article (Jacob, 2021), Mark White’s blog post
(https://bit.ly/CausalForestIntro), or the chapter on Causal Forests in White
& Green (2023). For a high-level introduction, check out Haaya Naushan’s
blog post (https://bit.ly/CausalForestHighLvl). For a deep dive, refer to
Wager & Athey (2018) or Athey et al. (2019).
Forests overflow
EconML offers a wide variety of estimators that build on top of the idea of
Causal Forests. The methods share many similarities but may differ in
significant details (for instance, how first-stage models are estimated). This
might translate to significant differences in computational costs. To
understand the differences between the different estimator classes, check out
the EconML documentation page here:
https://bit.ly/EconMLCausalForestDocs.
To start with Causal Forests, the CausalForestDML class from the dml module
will likely be the best starting point in most cases. EconML also offers a raw
version of Causal Forest – CausalForest – which can be found in the grf
module. The latter class does not estimate the nuisance parameter. You
might want to use this basic implementation in certain cases, but keep in mind
that this might lead to suboptimal results compared to CausalForestDML.
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name='backdoor.econml.dml.CausalForestDML',
target_units='ate',
method_params={
'init_params': {
'model_y': LGBMRegressor(n_estimators=50,
max_depth=10),
'model_t': LGBMClassifier(n_estimators=50,
max_depth=10),
'discrete_treatment': True,
# Define the num. of cross-fitting folds
'cv': 4
},
'fit_params': {
}
}
)
Before we conclude this section, let’s summarize the results of all CATE
models on the machine learning earnings interaction dataset from Chapter 9
and Chapter 10 in a single table. Table 10.1 presents this summary:
Estimator MAPE
S-Learner 5.02%
T-Learner 8.13%
X-Learner 3.63%
Table 10.1 – A summary of the results for all models from Chapter 9 and Chapter 10 on the
machine learning earnings interaction dataset
Ready?
Heterogeneous treatment effects with
experimental data – the uplift odyssey
Modeling treatment effects with experimental data is usually slightly different
in spirit from working with observational data. This stems from the fact that
experimental data is assumed to be unconfounded by design (assuming our
experimental design and implementation were not flawed).
The data
In this section, we’ll use the data from Kevin Hillstrom’s MineThatData
challenge (https://bit.ly/KevinsDataChallenge).
Before we start, I want to take a moment to express my gratitude for Kevin’s
generosity. Kevin agreed that we could use his dataset in the book. I
appreciate this and believe that Kevin’s decision will help us all become
better causal data scientists, one step at a time. Thank you, Kevin!
Now, let’s understand how the dataset is structured. The data comes from a
randomized email experiment on 64,000 customers.
The features include the time since the last purchase (recency), the amount of
dollars spent in the past year (history), indicators of the type of merchandise
bought in the past year (mens or womens), an indicator of whether the
customer was new in the past 12 months (newbie), what channel they
purchased from previously (channel), and what type of area they live in
(rural, suburban, urban; zip_code). Treatment is a three-level discrete
variable describing which email campaign the customer received (men’s,
women’s, or control). Finally, we have three outcome variables: visit,
conversion, and spending.
We’ll use spending as our target. It records a customer’s spending within the
two weeks following the delivery of the email campaign.
hillstrom_clean = pd.read_csv(r'./data/hillstrom_clean.csv')
with open(r'./data/hillstrom_clean_label_mapping.json', 'r') as f:
    hillstrom_labels_mapping = json.load(f)
Our treatment has three levels. I stored the mapping in a dictionary serialized
as a JSON object.
We cut the display in Figure 10.9 into two parts for readability. As you can
see, there are many sparse binary features in the data. Roughly half of them
are one-hot-encoded multi-level categorical features: zip code area and
channel.
The zip code and channel variables are represented as fully one-hot-encoded
sets of variables. This means that for each row, the set of variables
representing a channel (or zip code area) will have exactly one column with
the value 1.
This makes one of the columns redundant as its value can be unambiguously
inferred from the values of the remaining columns in the set (if channel__web
and channel__phone are both 0, channel__multichannel has to be 1 – by
definition of one-hot-encoding).
hillstrom_clean = hillstrom_clean.drop(['zip_code__urban',
'channel__web'], axis=1)
One practical way to perform this check is to split your data in half: train a
model that predicts the treatment from covariates on one half of the data and
predict treatment values on the other half. Our model’s performance should
essentially be random.
First, let’s split the data into treatment, outcome, and feature vectors:
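Here’s a sketch of what this split could look like (the exact column names of the cleaned dataset are assumptions and may differ in the notebook):
hillstrom_T_ = hillstrom_clean['treatment']
hillstrom_Y_ = hillstrom_clean['spend']  # spending in the two weeks after the campaign
hillstrom_X_ = hillstrom_clean.drop(
    ['treatment', 'visit', 'conversion', 'spend'], axis=1)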
Second, let’s check whether the treatments are distributed uniformly in our
data as expected:
sample_size = hillstrom_clean.shape[0]
hillstrom_T_.value_counts() / sample_size
1 0.334172
2 0.332922
0 0.332906
Name: treatment, dtype: float64
All values are close to 33.3%, which indicates that the treatments were
distributed uniformly.
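Third, let’s split the data in half for the confounding check (a sketch; the random seed is an arbitrary assumption):
from sklearn.model_selection import train_test_split

X_train_eda, X_test_eda, T_train_eda, T_test_eda = train_test_split(
    hillstrom_X_, hillstrom_T_, test_size=0.5, random_state=42)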
Let’s verify the split’s quality. We expect that roughly 33% of observations
should be assigned to each treatment group:
T_test_eda.value_counts() / T_test_eda.shape[0]
0 0.335156
2 0.333250
1 0.331594
Name: treatment, dtype: float64
Fourth, let’s fit the classifier that aims to predict treatment T from features X:
lgbm_eda = LGBMClassifier()
lgbm_eda.fit(X_train_eda, T_train_eda)
One side remark that I want to share with you here is that, in general, for the
LGBM classifier, it might be better for us not to one-hot encode our
categorical features (such as zip area or channel). The model handles
categorical data natively in a way that can improve the results over one-hot-
encoded features in certain cases.
That said, we’ll continue with one-hot-encoded features as they give us more
flexibility at later stages (other models might not support categorical features
natively) and we don’t expect significant improvements in our case anyway.
Fifth, let’s predict on the test data and calculate the accuracy score:
T_pred_eda = lgbm_eda.predict(X_test_eda)
accuracy_score(T_test_eda, T_pred_eda)
0.33384375
Finally, let’s check whether this accuracy is consistent with what we’d expect from random guessing by simulating random predictions:
random_scores = []
test_eda_sample_size = T_test_eda.shape[0]
for i in range(10000):
random_scores.append(
(np.random.choice(
[0, 1, 2],
test_eda_sample_size) == np.random.choice(
[0, 1, 2],
test_eda_sample_size)).mean())
np.quantile(random_scores, .025), np.quantile(random_scores,
.975)
(0.32815625, 0.33850078125)
The accuracy score that we obtained for our model lies within the boundaries of this interval. This gives us more confidence that the data is not observably confounded.
We’re almost ready to start, but before we open the toolbox, let me share
something with you.
This surprise comes partially from the fact that the world is complex, yet the
way we structure incentives in the education system and the publish-or-
perish culture also play a role here.
In this section, we’ll see how things should look when everything works
smoothly, but we’ll also see what happens when they don’t.
Now, let’s get some more context and begin our journey into the wild!
Kevin’s challenge
As we said in the beginning, the data that we’re about to use comes from an
online experiment, but this is only half of the story. The data was released as
part of a challenge organized by Kevin back in 2008. The challenge also
included a series of questions, among others – if you could eliminate 10,000
customers from the campaign, which 10,000 should that be?
This and the other questions posed by Kevin are very interesting and I am
sure they stir up a sense of excitement in anyone interested in marketing.
The fact that this question might seem simpler does not make our endeavor
easier.
First, we have more than one treatment. This makes model evaluation
more challenging than in a binary case.
Let’s take a look at the label mapping to find out how to work with the
treatment:
hillstrom_labels_mapping
Let’s check how many people bought something under both treatments:
456
This is roughly 0.7% of the entire dataset.
Let’s start by instantiating the models. First, we’ll create a function that
returns an instance of the LGBM model to make the code a bit more
readable:
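The helper itself is not reproduced in the text; a minimal sketch could look as follows (the hyperparameter values are placeholders – the notebook’s settings may differ), together with the EconML imports we’ll need:
from lightgbm import LGBMRegressor, LGBMClassifier
from sklearn.linear_model import LogisticRegression
from econml.metalearners import SLearner, TLearner, XLearner
from econml.dml import LinearDML, CausalForestDML
from econml.dr import DRLearner

def create_model(model_type):
    # Return a fresh LGBM instance with a shared set of default parameters
    params = dict(n_estimators=100, max_depth=10)
    if model_type == 'regressor':
        return LGBMRegressor(**params)
    elif model_type == 'classifier':
        return LGBMClassifier(**params)
    raise ValueError(f'Unknown model type: {model_type}')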
s_learner = SLearner(
overall_model=create_model('regressor')
)
x_learner = XLearner(
models=[
create_model('regressor'),
create_model('regressor'),
create_model('regressor'),
],
cate_models=[
create_model('regressor'),
create_model('regressor'),
create_model('regressor'),
]
)
t_learner = TLearner(
models=[
create_model('regressor'),
create_model('regressor'),
create_model('regressor'),
]
)
dml = LinearDML(
model_y=create_model('regressor'),
model_t=create_model('classifier'),
discrete_treatment=True,
cv=5
)
dr = DRLearner(
model_propensity=LogisticRegression(),
model_regression=create_model('regressor'),
model_final=create_model('regressor'),
cv=5,
)
cf = CausalForestDML(
model_y=create_model('regressor'),
model_t=create_model('classifier'),
discrete_treatment=True,
cv=5
)
This is very similar to what we were doing before when using DoWhy
wrappers. The main difference is that we now pass model parameters
directly to model constructors rather than encoding them in an intermediary
dictionary.
Note that for linear DML and Causal Forest, we set discrete_treatment to
True. We don’t do so for meta-learners and DR-Learner because these
models only allow discrete treatments (S-Learner can be generalized to
continuous treatments, but the current version of EconML does not support
that). Also note that our create_model() function returns estimators with the
same set of pre-defined parameters for each base learner.
First things first
We want to assess the performance of our models and so we’ll divide our
data into training and test sets.
We set the test size to 0.5. The reason I chose this value is that although our
dataset has 64,000 observations, only a tiny fraction of subjects actually
converted (made a purchase). Let’s see how many conversion instances we
have in each of the splits:
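The split itself is not shown here; a sketch could look as follows (the variable names and the seed are assumptions – the counts printed below come from the split used in the book’s notebook):
from sklearn.model_selection import train_test_split

(X_train, X_test, T_train, T_test,
 y_train, y_test, conv_train, conv_test) = train_test_split(
    hillstrom_X_, hillstrom_T_, hillstrom_Y_,
    hillstrom_clean['conversion'],
    test_size=0.5, random_state=42)

# Count the converted customers in each split
conv_train.sum(), conv_test.sum()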
(227, 229)
That’s only 227 converted observations in training and 229 in the test set.
This is not a very large number for machine learning methods. The trade-off
here is between having enough observations to effectively train the models
and having enough observations to effectively evaluate the models. Our case
is pretty challenging in this regard.
Nonetheless, let’s fit the models and see how much we can get out of them.
We’ll fit the models in a loop to save ourselves some space and time.
Speaking of time, we’ll measure the fitting time for each of the algorithms to
get a sense of the differences between them in terms of computational cost.
models = {
'SLearner': s_learner,
'TLearner': t_learner,
'XLearner': x_learner,
'DRLearner': dr,
'LinearDML': dml,
'CausalForestDML': cf
}
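The fitting loop could be sketched as follows (a sketch; the training-set variable names come from the split above and the exact timing code in the notebook may differ):
import time

fit_times = {}
for name, estimator in models.items():
    start = time.time()
    # All six estimators accept the outcome, treatment, and features in fit()
    estimator.fit(Y=y_train, T=T_train, X=X_train)
    fit_times[name] = time.time() - start
    print(f'{name}: {fit_times[name]:.1f} s')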
From this output, we can get a sense of the computational costs related to
each model. It would be great to repeat this experiment a couple of times to
get more reliable estimates of the differences between the models. I did this
in advance for you. Table 10.2 contains rough estimates of how much
computational time each of the methods needs compared to the S-Learner
baseline:
Model Time
S-Learner 1x
T-Learner 2x
X-Learner 5x
DR-Learner 13x
LinearDML 27x
CausalForestDML 39x
Table 10.2 – Relative training times of six different EconML CATE estimators
In Table 10.2, the right column displays the time taken by each model to
complete its training in comparison to the S-Learner baseline. For instance,
2x means that a model needed twice as much time as S-Learner to finish the
training.
All the models were trained on the three-level treatment scenario, using the
same dataset, with identical base estimators and default hyperparameters.
T-Learner needs twice as much training time as S-Learner, and Causal Forest
with DML needs a stunning 39 times more time! Just to give you an
understanding of the magnitude of this difference: if S-Learner trained for 20
minutes, Causal Forest would need 13 hours!
Our models are trained. Let’s think about how to evaluate them.
The AUUC (the area under the uplift curve) is based on a similar idea to the Qini coefficient – it measures the area under the
cumulative uplift curve. We plot uplift on the y axis against the percentage of
observed units (the cumulative percentage of our sample size) sorted by
model predictions on the x axis and calculate the area under the curve.
The AUUC and Qini are very popular and you can find many open source
implementations for these metrics (for instance, in the uplift-analysis
package: https://bit.ly/UpliftAnalysis). Both metrics were originally
designed for scenarios with binary treatments and binary outcomes. Another
popular choice for uplift model performance assessment is the uplift by
decile (or uplift by percentile) plot.
You might be wondering – why are we talking about metrics and plots that
are cumulative or per decile? The answer is related to the nature of the
problem we’re tackling: we never observe the true uplift (aka the true causal
effect).
Let’s take a look at uplift by decile and explain it step by step. It turns out
that we can generalize it naturally to continuous outcomes. That’s good news!
Uplift by decile
When we run an experiment, we can never observe all potential outcomes for
a given unit at the same time. Each unit or subject can either receive
treatment, T, or not receive it.
We divide the dataset into bins and assume we can learn something about the
true effect within each bin. We leverage this information to assess the quality
of our models.
1. First, we use the trained model to predict uplift for every observation in the evaluation set.
2. Next, we sort the predictions from the highest to the lowest predicted uplift.
3. We bin our predictions into deciles (for smaller datasets, or when it’s impossible to split the data into deciles, coarser quantiles can be used instead – for example, quartiles).
4. Within each decile, we group the observations according to the treatment they actually received.
5. Within each decile, we compute the average outcome for the units that were originally treated and the average outcome for the units that were originally in the control group.
6. Within each decile, we subtract the average outcome for untreated from
the average outcome for treated.
These differences are our estimates of the true uplift within each decile.
7. We use a bar plot to visualize these estimates against the deciles, ordered
from the top to the bottom decile (left to right).
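For a single treatment versus control, the computation behind such a plot can be sketched as follows (a simplified illustration with hypothetical argument names – the notebook’s implementation also handles multiple treatments and plotting):
import pandas as pd

def uplift_by_decile(uplift_pred, y, t):
    # t is a binary indicator: 1 for treated, 0 for control
    df = pd.DataFrame({'pred': uplift_pred, 'y': y, 't': t})
    # Bin the observations into deciles of predicted uplift
    # (decile 0 will denote the highest predictions)
    df['decile'] = 9 - pd.qcut(df['pred'].rank(method='first'),
                               q=10, labels=False)
    # Estimated true uplift per decile: mean treated outcome
    # minus mean control outcome
    uplift = (df.groupby('decile')
                .apply(lambda g: g.loc[g['t'] == 1, 'y'].mean()
                                 - g.loc[g['t'] == 0, 'y'].mean()))
    return uplift.sort_index()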
Let’s take a closer look at Figure 10.11. The x axis represents deciles. The y
axis represents our estimated average true uplift.
The true meaning of this plot comes from the fact that the deciles are not
sorted by the estimated true uplift that we actually see in the plot but rather
by our predicted uplifts (which are not visible in the plot).
We expect that the output of a good model will correlate with the true uplift –
we’d like to see a high estimated true uplift in the top deciles (on the left)
and a low estimated true uplift in the bottom deciles (on the right).
Figure 10.11 presents the results from a really good model as we can see that
higher values are on the left and lower values are on the right.
In the case of a perfect model, the values on the y axis should monotonically
go down from left to right. We see that in Figure 10.11, this is not entirely the
case, with minor deviations on the fifth and eighth ticks (note that the x axis is
0-indexed), but overall, the pattern indicates a very good model.
Uplift by decile can be a good way to quickly visually assess the model, yet
it doesn’t provide us with the means to quantitatively compare various
models.
We will use another tool for this purpose, but first, let’s plot uplift by decile
for all of our models. Figure 10.12 presents the results:
Figure 10.12 – Uplift by decile plots
Each row in Figure 10.12 corresponds to one model. The leftmost two
columns present the results for treatment 1, and the rightmost two columns for
treatment 2. The blue plots represent the evaluation on the training data, while the red ones represent the evaluation on the test data. Note that the y axes are not normalized.
We did this on purpose to more clearly see the patterns between different
models.
According to the criteria that we discussed earlier, we can say that most
models perform very well on the training data. One exception is linear DML,
with a less clear downward pattern. One reason for this might be the fact that
the last stage in linear DML is… linear, which imposes a restriction on the
model’s expressivity.
When it comes to the test set, the performance of most models drops
significantly. Most cells in the plot indicate poor performance. DR-Learner
for treatment 1 is a strong exception, but the same model for treatment 2 gives
almost a reversed pattern!
There might be a couple of reasons for this poor performance on the test set:
First, our models might be overfitting to the training data. This can be
related to the fact that there is not enough data for the models to build a
generalizable representation.
The second side of the same coin is that the architectures we used might
be too complex for the task (note that we picked hyperparameters
arbitrarily, without tuning the models at all).
The fact that uplift per decile plots do not look favorable does not
necessarily imply that our models are not useful.
Let’s compute a metric that will help us assess whether the models can bring
us some real value.
Expected response
The expected response metric was introduced by Yan Zhao and colleagues
in their 2017 paper Uplift Modeling with Multiple Treatments and General
Response Types (Zhao et al., 2017). The method works in multiple-treatment
scenarios and with continuous outcomes, which is perfect for our case.
Although the metric is focused on the outcome rather than on uplift, it’s a
valid way to evaluate uplift models. The metric computes the expected
average outcome of a model by combining information from all the
treatments. That’s very handy.
At the same time, the metric is also useful from a decision-making point of
view as it gives us a good understanding of what average return on
investment we can expect by employing a chosen model.
Intuitively, the metric works for uplift models because we use previously
unseen data to check the expected return on investment for a given model.
This information can be used to compare two or more models under
unconfoundedness and obtain information on the expected performance on out-of-sample data.
Before we explain how the expected response metric works, let’s get some
context. Uplift models can be understood as action recommendation
systems.
The idea behind the expected response is simple. For each observation in the
test set, we check whether the predicted treatment was the same as the actual
treatment. If so, we divide the value of the outcome for this observation by
the probability of the treatment and store it. Otherwise, we set the value to 0
and store it. Finally, we average over the stored values and the obtained
score is our expected response for a given model.
I put the formula for the expected response metric in the notebook for this
chapter. We’ll skip it here, but feel free to explore the notebook and play
with the implementation and the formula.
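For reference, a simplified sketch of the computation could look like this (variable names are hypothetical; the exact formula is in the notebook):
import numpy as np

def expected_response(y, t_actual, t_recommended, t_probs):
    # y             -- observed outcomes (here: spending)
    # t_actual      -- treatment each unit actually received
    # t_recommended -- treatment recommended by the uplift model
    #                  (e.g., the arm with the highest predicted outcome)
    # t_probs       -- dict: treatment level -> assignment probability
    #                  (roughly 1/3 each in the Hillstrom experiment)
    y = np.asarray(y)
    t_actual = np.asarray(t_actual)
    t_recommended = np.asarray(t_recommended)
    probs = np.array([t_probs[t] for t in t_actual])
    matched = t_actual == t_recommended
    return np.where(matched, y / probs, 0.0).mean()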
Now, let’s see the results for the expected response (the full code is in the
notebook):
We see that the metric for all models significantly dropped between the train
and test sets. The smallest difference can be observed for the linear DML
model. This is congruent with what we observed in the uplift by decile plot –
linear DML performed relatively poorly on the training data and slightly
worse on the test data.
The best model on the test set according to the expected response is DR-
Learner. This also translates – at least partially – to what we observed in the
plot. DR-Learner had a pretty clear downward trend for treatment 1. Perhaps
this good performance on treatment 1 allowed the model to compensate for
the worse performance on treatment 2.
Confidence intervals
We said before that one of the advantages of linear DML is its ability to
provide us with valid confidence intervals. How can we obtain them in
practice?
With EconML, it’s very easy. We simply call the .effect_interval() method
on the fitted estimator. The method returns a two-tuple of numpy arrays. The
first array contains the lower bounds and the second the upper bounds of the
confidence intervals.
ints = np.stack(models['LinearDML'].effect_interval(
    X=X_test, T0=0, T1=1, alpha=.05)).T
# What fraction of the confidence intervals excludes zero?
((np.sign(ints[:, 0]) == np.sign(ints[:, 1])).sum()
    / ints.shape[0])
0.100875
For roughly 10% of the test observations, the confidence interval excludes 0 (its lower and upper bounds have the same sign); for the remaining 90%, the interval contains 0. Restricting our action recommendations to the observations with clearly non-zero effects could perhaps further improve the model’s performance.
To obtain confidence intervals for methods that do not support them natively,
pass inference='bootstrap' to the model’s .fit() method. Note that this
will result in a significant increase in training time. The number of bootstrap
samples can be adjusted by using the BootstrapInference object. For more
details, check out https://bit.ly/EconMLBootstrapDocs.
If it sounds oddly familiar, it might be because we’ve already seen it (that is,
the last name) in this chapter. Nicholas Radcliffe is not only the author of the
winning submission in Kevin’s challenge but also the person who originally
proposed the Qini coefficient (and the Qini curve).
All these methods can help us draw valid conclusions from experimental
data.
Uplift (CATE and HTE) modeling techniques have been proven to bring
value in numerous industrial settings from marketing and banking to gaming
and more.
The main price that we pay for the flexibility offered by CATE machine
learning models is the difficulty of computing the sample size necessary to
establish the desired statistical power. In Chapter 9, we discussed some
ideas of how to overcome this limitation.
The main advantage of CATE machine learning models is that they give us a
way to non- or semi-parametrically distinguish the units that can benefit from
obtaining our treatment (persuadables) from those that will most likely not
benefit from it (sure things and lost causes) and those who will likely not
benefit but can additionally get hurt or hurt us (do not disturbs; check
Chapter 9 and Table 9.5 for a refresher).
We started this section with a brief discussion on using CATE models with
experimental data and the EconML workflow tailored to experimental
scenarios.
After that, we introduced the Hillstrom dataset and tested whether our data is
unconfounded under observed variables. We fitted six different models and
compared their performance from a computational cost point of view.
In the next short section, we’ll discuss the idea of using machine learning for
counterfactual explanations.
Imagine that your bank has just rejected your application. You call the bank.
You ask questions. You want to understand why. At the end
of the day, its decision impacts some of your most important plans!
The only response you get from the customer service representative is that
you did not meet the criteria. “Which criteria?” you ask. You don’t get a
satisfying answer.
You’d like to make sure that you meet the criteria the next time you re-apply,
yet it seems that no one can tell you how to improve!
First, we train a single model on our data. Then, we generate predictions for
different values of the treatment variable and perform a simple subtraction.
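A rough sketch of that recipe on toy data (the regressor and variable names are illustrative, not the chapter's exact setup):
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy features
T = rng.binomial(1, 0.5, size=1000)             # toy binary treatment
y = X[:, 0] + 2 * T + rng.normal(size=1000)     # toy outcome

# Train a single model on the features plus the treatment indicator
model = GradientBoostingRegressor().fit(np.column_stack([X, T]), y)

# Predict under treatment and under control, then subtract
pred_treated = model.predict(np.column_stack([X, np.ones_like(T)]))
pred_control = model.predict(np.column_stack([X, np.zeros_like(T)]))
effect_estimates = pred_treated - pred_control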
This idea is very flexible. What if we extended it to other variables that we
did not treat as the treatment before?
Theoretically (depending on the setting), this could mess up the causal
character of the model.
Do we care?
Why?
Interventions such as this provide us with valid results given our goal.
What if we need to change more than one feature in order to influence the
outcome because there’s an interaction between these features? What if there
are many ways to change the outcome and some are much easier, but we’ve
only found the hardest one?
To address this and other issues, Microsoft released an open source package
called Diverse Counterfactual Explanations (DiCE; Mothilal et al., 2020).
The basic intuition behind DiCE is that it searches for a set of changes in the
input features that lead to the outcome change, at the same time maximizing
the proximity to the original values and diversity (finding many different
solutions to the same problem to let us choose the best one). If you want to
learn more about DiCE, check out the introductory blog post by Amit Sharma
(https://bit.ly/DiCEIntro) and DiCE’s GitHub repository
(https://bit.ly/DiCERepo).
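A minimal sketch of DiCE's documented workflow might look as follows (the data frame, column names, classifier, and query rows are placeholders; check the repository for the current API):
import dice_ml

# d wraps the training data, m wraps a fitted scikit-learn classifier
d = dice_ml.Data(
    dataframe=train_df,                      # placeholder data frame
    continuous_features=['income', 'age'],   # placeholder column names
    outcome_name='approved'                  # placeholder outcome column
)
m = dice_ml.Model(model=clf, backend='sklearn')
explainer = dice_ml.Dice(d, m, method='random')

# Generate three diverse counterfactuals that flip the predicted class
cf = explainer.generate_counterfactuals(
    query_instances, total_CFs=3, desired_class='opposite'
)
cf.visualize_as_dataframe(show_only_changes=True)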
Note that depending on the industry sector and geopolitical region, usage of
methods such as the ones we discussed previously might be difficult due to
particular regulatory requirements. DiCE offers a smart set of tools that can
help when sensitive data is at stake.
Wrapping it up
Congratulations! You just reached the end of Chapter 10.
In the next chapter, we’ll continue our journey through the land of causal
inference with machine learning, and with Chapter 12, we’ll open the door
to the last part of the book, dedicated to causal discovery. See you!
References
Athey, S., Tibshirani, J., & Wager, S. (2018). Generalized random forests.
The Annals of Statistics, 47(2). 1148-1178.
Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., & Syrgkanis, V.
(2022). Long Story Short: Omitted Variable Bias in Causal Machine
Learning (Working Paper No. 30302; Working Paper Series). National
Bureau of Economic Research. https://doi.org/10.3386/w30302.
Cinelli, C., Forney, A., & Pearl, J. (2022). A crash course in good and bad
controls. Sociological Methods & Research, 00491241221099552.
Diemert, E., Betlei, A., Renaudin, C., & Amini, M. R. (2018). A Large Scale
Benchmark for Uplift Modeling. KDD.
Green, J. & White, M. H., II. (2023). Machine Learning for Experiments in
the Social Sciences. Cambridge University Press.
Hünermund, P., Louw, B., & Caspi, I. (2022). Double Machine Learning
and Automated Confounder Selection – A Cautionary Tale. arXiv.
https://doi.org/10.48550/ARXIV.2108.11294.
Kohavi, R., Tang, D., & Xu, Y. (2020). Twyman’s Law and Experimentation
Trustworthiness. In Trustworthy Online Controlled Experiments: A
Practical Guide to A/B Testing (pp. 39-57). Cambridge: Cambridge
University Press.
Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining machine learning
classifiers through diverse counterfactual explanations. Proceedings of the
2020 Conference on Fairness, Accountability, and Transparency, 607-617.
Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal Random Forest
for Causal Inference. Proceedings of the 36th International Conference on
Machine Learning, in Proceedings of Machine Learning Research, 97, 4932-
4941. https://proceedings.mlr.press/v97/oprescu19a.html.
Porter, K. E., Gruber, S., van der Laan, M. J., & Sekhon, J. S. (2011). The
Relative Performance of Targeted Maximum Likelihood Estimators. The
International Journal of Biostatistics, 7(1), 31. https://doi.org/10.2202/1557-
4679.1308.
Tan, X., Yang, S., Ye, W., Faries, D. E., Lipkovich, I., & Kadziola, Z. (2022).
When Doubly Robust Methods Meet Machine Learning for Estimating
Treatment Effects from Real-World Data: A Comparative Study. arXiv.
https://doi.org/10.48550/ARXIV.2204.10969
van de Geer, S.A., Buhlmann, P., Ritov, Y., & Dezeure, R. (2013). On
asymptotically optimal confidence regions and tests for high-dimensional
models. Annals of Statistics, 42, 1166-1202.
van der Laan, M. & Hejazi, N. (2019, December 24). CV-TMLE and double
machine learning. vanderlaan-lab.org. https://vanderlaan-
lab.org/2019/12/24/cv-tmle-and-double-machine-learning/
Xu, K., Li, J., Zhang, M., Du, S. S., Kawarabayashi, K., & Jegelka, S.
(2020). How Neural Networks Extrapolate: From Feedforward to Graph
Neural Networks. arXiv, abs/2009.11848.
Zhao, Y., Fang, X., & Simchi-Levi, D. (2017). Uplift Modeling with
Multiple Treatments and General Response Types. Proceedings of the 2017
SIAM International Conference on Data Mining, 588-596.
11
Before we move on, let’s take a closer look at what deep learning has to
offer in the realm of causal inference.
We’ll start by taking a step back and recalling the mechanics behind two
models that we introduced in Chapter 9 – S-Learner and T-Learner.
We’ll explore how flexible deep learning architectures can help us combine
the advantages of both models, and we’ll implement some of these
architectures using the PyTorch-based CATENets library.
Next, we’ll explore how causality and natural language processing (NLP)
intersect, and we’ll learn how to enhance modern Transformer architectures
with causal capabilities, using Huggingface Transformers and PyTorch.
After that, we’ll take a sneak peek into the world of econometrics and quasi-
experimental time series data, learning how to implement a Bayesian
synthetic control estimator using CausalPy.
Ready?
Although the core idea behind (supervised) deep learning is associative in its
nature and, as such, belongs to rung one of the Ladder of Causation, the
flexibility of the framework can be leveraged to improve and extend existing
causal inference methods.
S-Learner uses a single base-learner model and trains it with all available
data. Meanwhile, T-Learner uses treated observations and control
observations to train two separate base models.
Note that this solution also combats the greatest weakness of S-Learner –
namely, the risk that a treatment variable will not be taken into account in the
shared representation.
SNet
SNet (Curth & van Der Schaar, 2021a) is a deep learning-based architecture
that can be thought of as a generalization of TARNet (and other architectures
such as DragonNet (Shi et al., 2019) and DR-CFR (Hassanpour & Greiner,
2020)).
The S in SNet’s name comes from the fact that the information from the early
representations in the network is shared between the latter task-specific
heads.
In Figure 11.2, we can see that the signal from each of the five representation
layers (yellow) flows to a different set of regression heads. Blue arrows
indicate flow to just a single head, while green and orange arrows indicate
flow to multiple heads.
As all representations are learned jointly, the distinction between them might
not be well identified. The authors add a regularization term to the model
objective that “enforces orthogonalization of inputs to (…) different
layers” (Curth & van der Schaar, 2021).
The authors demonstrate that SNet outperforms TARNet and neural network
versions of T-Learner and DR-Learner over multiple synthetic and semi-
synthetic datasets.
We will now implement TARNet and SNet using the CATENets library.
Most of the models are also available as PyTorch (Paszke et al., 2017)
implementations (kindly contributed by Bogdan Cebere –
https://github.com/bcebere). We’ll use these implementations in the
subsequent code examples.
PYTORCH
PyTorch (Paszke et al., 2017) is an open source deep learning framework, originally
developed by the Meta AI (Facebook AI) team and currently governed by the PyTorch
Foundation. Version 2.0 was released in March 2023, introducing a number of features that
can speed up PyTorch in numerous scenarios. In recent years, PyTorch has gained
significant traction, especially in the research community. The models that we will experiment
with in this chapter (CATENets and CausalBert) use PyTorch behind the scenes. In Part 3,
Causal Discovery, we’ll build a complete PyTorch training loop when implementing
Microsoft’s DECI model.
Experiments with CATENets
We’ll use a simulated non-linear dataset with a binary treatment for our
CATENets experiments.
You can find the code for this and the next section in the Chapter_11.1.ipynb
notebook (https://bit.ly/causal-ntbk-11_1).
import numpy as np
SAMPLE_SIZE = 5000
TRAIN_SIZE = 4500
N_FEATURES = 20
X = np.random.normal(0, 1, (SAMPLE_SIZE, N_FEATURES))
T = np.random.binomial(1, 0.5, SAMPLE_SIZE)
weights = np.random.gumbel(5, 10, (SAMPLE_SIZE, N_FEATURES - 1))
y = (50 * T * np.abs(X[:, 0])**1.2) + (weights * X[:, 1:]).sum(axis=1)
y0 = (50 * 0 * np.abs(X[:, 0])**1.2) + (weights * X[:, 1:]).sum(axis=1)
y1 = (50 * 1 * np.abs(X[:, 0])**1.2) + (weights * X[:, 1:]).sum(axis=1)
effect_true = y1[TRAIN_SIZE:] - y0[TRAIN_SIZE:]
We generate 5,000 samples and split them into training and test sets
containing 4,500 and 500 observations, respectively. Our dataset has 20
features. The data is generated according to the following formula:
$y = 50 \cdot T \cdot |X_0|^{1.2} + \sum_{i=1}^{19} w_i X_i$
where $X_0, \dots, X_{19}$ are the features and $w_1, \dots, w_{19}$ are
observation-specific weights drawn from a Gumbel distribution.
As you can see in the preceding formula, the treatment effect is non-additive
and non-linear (we take the absolute value of the interacting feature and
raise it to the power of 1.2). At the same time, only 1 out of the 20 features
we generated interacts with the treatment (feature $X_0$, which corresponds
to X[:, 0] in the code snippet).
We’ll use our dataset to fit S-Learner, X-Learner, DR-Learner, and Causal
Forest as a baseline. Next, we’ll fit TARNet and SNet.
import torch
import pytorch_lightning as pl
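Before training, the notebook checks which device is available; a minimal version of such a check might look like this (the exact snippet in the notebook may differ):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)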
If you have a GPU installed in your system that is supported by PyTorch and
you have properly configured the drivers, this will print the following:
'cuda'
Otherwise, it will print the following:
'cpu'
Next, let’s set the seed for reproducibility. PyTorch Lightning offers a
convenience function to do so:
SEED = 18
pl.seed_everything(SEED)
Note that even using pl.seed_everything() does not always lead to the same
results. To enforce fully the deterministic behavior of PyTorch, we would
need to use torch.use_deterministic_algorithms(True) and freeze the
randomness in data loaders. At the time of writing, this would require
modifying environmental variables and CATENets code. We won’t go that
far and keep the results weakly reproducible. Be prepared for your results to
differ.
benchmark_models = {
    'SLearner': SLearner(overall_model=LGBMRegressor()),
    'XLearner': XLearner(models=LGBMRegressor()),
    'DRLearner': LinearDRLearner(),
    'CausalForest': CausalForestDML()
}
benchmark_results = {}
for model_name, model in benchmark_models.items():
    model.fit(
        X=X[:TRAIN_SIZE, :],
        T=T[:TRAIN_SIZE],
        Y=y[:TRAIN_SIZE]
    )
    effect_pred = model.effect(
        X[TRAIN_SIZE:]
    )
    benchmark_results[model_name] = effect_pred
We create a dictionary of models and iterate over it, fitting each model and
storing the results in another dictionary called benchmark_results. We used
LightGBM regressors for S- and X-Learner and default models for DR-
Learner and Causal Forest.
Now, let’s fit TARNet and SNet:
tarnet = TARNet(
    n_unit_in=X.shape[1],
    binary_y=False,
    n_units_out_prop=32,
    n_units_r=8,
    nonlin='selu',
)
tarnet.fit(
    X=X[:TRAIN_SIZE, :],
    y=y[:TRAIN_SIZE],
    w=T[:TRAIN_SIZE]
)
n_unit_in defines the number of features in our dataset. To keep the code
dynamic, we pass the number of columns of our feature matrix (X.shape[1])
as an argument. This parameter is mandatory.
binary_y informs the model whether the outcome is binary. Our dataset
has a continuous outcome, so we set it to False. Internally, this parameter
will control which loss function the model should use for the outcome
model (binary cross-entropy for a binary outcome and mean squared
error otherwise). This parameter is set to False by default, but I wanted
to use it explicitly because it’s an important one.
After initializing the model, we call the .fit() method and pass the data.
Note that the naming convention in CATENets differs from the one we used
in EconML:
X takes the feature matrix (it should include all confounders if the data is
observational)
y takes the outcome vector (EconML's Y)
w takes the treatment vector (EconML's T)
Note that we did not shuffle the dataset. This is because it’s randomly
generated.
effect_pred_tarnet = tarnet.predict(
    X=X[TRAIN_SIZE:, :]
).cpu().detach().numpy()
To get the predicted CATE, we use the .predict() method. Note that this
time we pass the test dataset (starting from the TRAIN_SIZE observation up to
the last one).
1. We call .cpu() to send the resulting tensor from a GPU (if you used one)
to a CPU.
2. We call .detach() to detach the tensor from the computational graph, so it
no longer tracks gradients.
3. We call .numpy() to convert the tensor into a NumPy array.
This chain is only necessary when we use a GPU, but it’s harmless when
working on a CPU-only machine.
snet = SNet(
    n_unit_in=X.shape[1],
    binary_y=False,
    n_units_out_prop=32,
    n_units_r=8,
    nonlin='selu',
)
snet.fit(
    X=X[:TRAIN_SIZE, :],
    y=y[:TRAIN_SIZE],
    w=T[:TRAIN_SIZE]
)
To get the predictions, we call the .predict() method and the CPU-detach-
NumPy chain on the result:
effect_pred_snet = snet.predict(
    X=X[TRAIN_SIZE:, :]
).cpu().detach().numpy()
Great! Now we’re ready to compare the results between all the models.
Figure 11.3 summarizes the results for TARNet, SNet, and our benchmark
methods:
Figure 11.3 – Results for the benchmark models and deep learning models (TARNet and
SNet)
As we can see in Figure 11.3, TARNet got the best results, achieving a mean
absolute percentage error (MAPE) value of 2.49. SNet performed less
favorably, with the highest variance out of all compared models. Perhaps the
model could benefit from longer training and/or a larger sample size.
TARNet is the best at approximating low-density regions (the part in the top-
right corner of each panel with a small number of points). Causal Forest
DML has the lowest variance, but it performs poorly in these low-density
regions, where TARNet does better.
Finally, the linear DR-Learner fails to capture the heterogeneity of the effect.
This is not surprising as the model – by definition – can only capture
treatment effects linear in parameters.
Note that the results of our experiment should not be treated as a universal
benchmark. We only ran one iteration for each model, but there’s also a
deeper reason.
Taking a broader perspective, perhaps we could say that this state of affairs
is not unique to modeling causality, and contemporary non-causal machine
learning faces similar challenges, yet they are harder to explicitly spot (for
example, think about validating large language models trained on virtually all
of the internet).
It’s 1916. The flames of war consume Europe. A young Austrian man of
Jewish descent arrives at a hospital in Kraków, injured in an industrial
explosion. He’s a volunteer soldier who served in an Austrian artillery
regiment.
There’s something that differentiates him from other young men in the
hospital.
He keeps them close, but the notes are not a diary. They consist of a set of
remarks on logic, ethics, language, and religion. Some of them were taken
while he was still in the trenches of the Eastern Front.
The young man’s name is Ludwig Wittgenstein, and his notes will later
become the basis of the only book he will publish in his lifetime – Tractatus
Logico-Philosophicus (Wittgenstein, 1922).
The book will become one of the most significant works of 20th-century
Western philosophy.
One of the core ideas of the book is that most philosophical problems are
created by the misuse of language. It states that fixing language by making
clear references to real-world objects and states of affairs would solve
existing philosophical problems – many by showing that they are mere
artifacts of language misuse.
The theory of meaning proposed in the book states that names that we use in
language refer to simple (atomic) objects. Atomic objects have no
complexity; they cannot be divided further or described, only named.
There’s beauty in this vision. The correspondence between language and the
world seems simple and elegant. One challenge to this view is that in natural
languages, the same word can often denote different objects, events, or even
represent entire sentences.
For instance, the Arabic word يلا (yalla), used extensively by Arabic and
Hebrew speakers, has numerous meanings that heavily depend on the context.
It can mean (among others) the following:
Let’s go!
Hurry up!
Deal!
Go away!
Come here!
Let’s do it!
This context dependence is common in natural languages. Moreover, the
changes in the meaning of a word might have a subtle character between
different usages. Wittgenstein realized this and changed his approach toward
meaning in the later stages of his philosophical career.
The new approach he took in his later works might have helped us build
systems such as ChatGPT, but let’s take it step by step.
Over the years, computational linguists and computer scientists came up with
many creative ways to squeeze information from natural language documents.
First attempts relied on bag-of-words-type approaches and information-
theoretic analyses (e.g., Zipf in 1949).
The main challenge with these approaches was that they were not able to
capture the aspects of semantic similarities between words (although some of
them could capture some notion of similarity on a document level).
These ideas are closely related to the notion promoted by English linguist
John Rupert Firth, summarized as, “You shall know a word by the company it
keeps” (Firth, 1957).
Less than five years after the publication of the word2vec paper, ELMo – a
model introduced by Matthew E. Peters and colleagues (Peters et al., 2018)
– made modeling polysemy possible. Just a couple of months later, another
model – BERT – was released by a team at Google (Devlin et al., 2018).
BERT replaced recurrent neural networks with a multi-head attention
mechanism (Vaswani et al., 2017). BERT is an example of the Transformer
architecture that also gave birth to the GPT family of models (Radford et al.,
2019, Brown et al., 2020, and OpenAI, 2023).
It has been demonstrated that the model can successfully answer various
types of causal and counterfactual questions. Figure 11.4 presents how
ChatGPT (correctly) answered the counterfactual query that we solved
analytically in Chapter 2.
Figure 11.4 – ChatGPT’s counterfactual reasoning
I find the behavior presented in Figure 11.4 impressive, yet it turns out that
under more systematic examination, the model is not entirely consistent in
this realm.
The system has been thoroughly scrutinized from the causal perspective by
two independent teams at Microsoft (led by Cheng Zhang) and TU Darmstadt
(led by Moritz Willig and Matej Zečević). Both teams arrived at similar
conclusions – the system has significant potential to answer various types of
questions (including some causal questions), but cannot be considered fully
causal (Zhang et al., 2023 and Willig et al., 2023).
Willig et al. (2023) proposed that ChatGPT learns a meta-SCM purely from
language data, which allows it to reason causally but only in a limited set of
circumstances and without proper generalization.
Zhang et al. (2023) suggested that the model could benefit from integration
with implicit or explicit causal modules. Her team has even demonstrated an
early example of GPT-4 integration with a causal end-to-end framework
DECI (we will discuss DECI in Part 3, Causal Discovery).
These results are truly impressive. The main challenge that remains is related
to model failure modes, which are difficult to predict and might occur “even
in tasks where LLMs obtain high accuracy” (Kıcıman et al., 2023). This
might be related to the fact that the meta-SCM that the model learns (as
proposed by Willig et al. (2023)) is correlational in its nature, so the model
might produce mistakes stochastically.
Keeping in mind that LLMs are not yet capable of systematic causal
reasoning, a question that naturally arises is whether there are other ways in
which they can be useful in the context of causality.
It turns out that the answer is positive.
There are three basic scenarios where we can leverage LLMs and other NLP
tools to answer causal queries when addressing the first type of question. In
each of the scenarios, we will use text as a node (or a set of nodes) in a
causal graph.
Text as a treatment
Text as an outcome
Text as a confounder
In this subsection, we will draw from a great review article by Amir Feder
and colleagues, which I wholeheartedly recommend you read if you're
interested in causality and NLP (Feder et al., 2022).
Hanna has poured her heart into the project, painstakingly choosing words
that she believes will speak to her ideal client. She’s created numerous
successful campaigns in the past, but deep down in her heart, she really
doesn’t know which aspects of her copywriting are responsible for their
success.
Hanna relies on her intuition to decide when the copy is good. She’s often
right, but quantifying the impact of the copy in a more systematic way could
help her work and scale to new niches much faster. It would also give her an
excellent asset that she could leverage in communication with internal and
external stakeholders.
One of the examples of how Hanna’s use case could be addressed comes
from Reid Pryzant et al.’s paper (2017). The authors designed a neural
network to isolate language aspects that influence sales and concluded that in
their sample (coming from a Japanese online marketplace Rakuten),
appealing to authority by using polite, informative, and seasonal language
contributed most to increased sales.
Some other works that aim at discovering the features of language that impact
the outcome of interest include the discovery of conversational tendencies
that lead to positive mental health outcomes (Zhang et al., 2020).
These works are very interesting and open a path toward new fascinating
research directions. At the same time, they come with a number of
challenges.
Moreover, the situation can be further complicated by the fact that different
readers might interpret a text differently (to see an example of a model that
takes the reader’s interpretation into account, check out Pryzant et al.
(2021)).
Text as an outcome
Yìzé is an aspiring writer. He has a prolific imagination and produces stories
with ease, but he feels that the way he writes lacks a bit of focus and clarity.
He decides to enroll in a writing course at a local community college.
One way to deal with this challenge is to train the outcome measurement
model on another sample (Egami et al., 2018). This can be relatively easily
done if we have enough samples.
Text as a confounder
Finally, text can also be a confounder.
Catori and her friend Stephen both love manga and are inspired by similar
characters. They pay attention to similar details and often find themselves
wanting to share the same story at the same moment.
Both of them post about manga on Reddit. One day, they noticed an
interesting pattern – Stephen’s posts get more upvotes, and it’s much more
likely that his post gets an upvote within the first hour after being posted than
Catori’s post does. They are curious about what causes this difference.
It has been repeatedly shown that across scientific fields, female authors are
cited less frequently than their male counterparts. This effect has been
demonstrated in neuroscience (Dworkin et al., 2020), astronomy (Caplar et
al., 2017), transplantation research (Benjamens et al., 2020), and so on.
Could the fact that Catori is perceived as female impact how other Reddit
users react to her posts?
As seasoned causal learners, we know that to answer this question, we need
to carefully consider a number of factors.
Catori’s gender might impact the topics she chooses, her style of writing, or
the frequency of using certain words. All of these choices can impact the way
other participants react to her content.
Moreover, other Reddit users do not know Catori’s true gender. They can
only infer it from her profile information, such as her username or her avatar.
Therefore, the question here is about perceived genders, rather than the true
gender’s impact on other users’ behavior.
The blue node in Figure 11.6 represents our treatment (perceived gender),
and the green one is the outcome (Upvote – a post upvote within one hour of
posting).
True gender impacts the probability that a person will use a female avatar
and impacts Text (e.g., the topic or linguistic properties). Note that True
gender is unobserved, which we indicate by dashed lines in Figure 11.6.
Note that the True gender node opens a backdoor path between the treatment
(Female avatar) and outcome (Upvote). Luckily, its impact is mediated by
Text. This means that by controlling for Text, we can block the path.
A natural question to ask here is which aspects of the text we should control
for and how can we make sure that they are present in the representation of
the text that we choose to use.
CausalBert
CausalBert is a model proposed by Victor Veitch, Dhanya Sridhar, and David
Blei from Columbia University in their paper Adapting Text Embeddings for
Causal Inference (Veitch et al., 2020). It leverages BERT architecture
(Devlin et al., 2018) and adapts it to learn causally sufficient embeddings
that allow for causal identification when text is a confounder or mediator.
CausalBert is conceptually similar to DragonNet (Shi et al., 2019) – a
descendant of the TARNet architecture. Figure 11.7 presents CausalBert
symbolically:
During the training, CausalBert adapts the pretrained layers and learns the
embedding, outcome, and treatment models jointly. This joint objective
preserves the information in the embeddings that is predictive of the
treatment and outcome and attenuates everything else. Thanks to this
mechanism, we make sure that information relevant to the downstream task is
preserved in the embeddings, making them sufficient to effectively control for
confounding.
Note that the same mechanism is helpful when the text is a partial mediator of
the treatment effect. When controlling for text in this scenario, we learn the
direct effect of the treatment on the outcome (this will exclude all the
information that travels through the mediator and only leave the information
that flows directly from the treatment to the outcome, also known as the
natural direct effect (NDE)).
Playing safe with CausalBert
The mechanism that adapts the embeddings that predict the treatment and
outcome also has a darker side. Confounding and mediation scenarios are not
the only ones in which text is correlated with both the treatment and outcome.
It will also happen when text is a common descendant of the treatment and
outcome. In such a case, the text becomes a collider, and controlling for it
opens a spurious path between the treatment and outcome, biasing the results.
To mitigate this risk, we should always make sure that none of the aspects of
the text are a descendant of the treatment and outcome; neither should any
aspect of the text be a descendant of the outcome alone (as this would nullify
any effect of the treatment on the outcome).
For instance, if a Reddit user shares a post, observes a lack of upvote within
the first 15 minutes after posting, and then edits the post in the hope of getting
an upvote, the text becomes a collider between the treatment and the
outcome.
import pandas as pd
from models.causal_bert_pytorch.CausalBert import CausalBertWrapper
We import pandas to read the data and the CausalBertWrapper class, which
implements the model and wraps it in a user-friendly API.
df = pd.read_csv('data/manga_processed.csv')
Our dataset has been generated according to the structure presented in Figure
11.6. It consists of 221 observations and 5 features:
upvote – a binary indicator of a post upvote in the 1st hour after posting
(the outcome)
Texts are short Reddit-like posts, generated using ChatGPT and prompted
indirectly to produce gender-stereotypical content. Both the text and female
avatar indicators are confounded by an unobserved true gender variable,
and the text causally impacts the outcome through linguistic features (for the
data-generating process code, check out https://bit.ly/ExtraDGPs).
Figure 11.8 presents the first five rows of the dataset before shuffling:
causal_bert = CausalBertWrapper(
    batch_size=8,
    g_weight=0.05,
    Q_weight=1.,
    mlm_weight=0.05
)
We define the batch size and three weight parameters. These three
parameters will be responsible for weighting the components of the loss
function during the training.
Here’s their meaning:
g_weight is responsible for weighting the part of the loss that comes from
the propensity score (the "P(T=1|X)" block in Figure 11.7; it's called g,
according to the convention used in Veitch et al.'s (2020) paper)
Q_weight is responsible for weighting the outcome model loss (the sub-
model that predicts the actual outcome – the “T=0” and “T=1” blocks in
Figure 11.7, called Q according to the paper’s convention)
MLM
BERT and some of its cousins are trained using the so-called MLM objective. It’s a self-
supervised training paradigm in which a random token in a training sequence is masked,
and the model tries to predict this masked token using other tokens in the sequence. Thanks
to the attention mechanism, the model can “see” the whole sequence when predicting the
masked token.
As you can see, we weighted the outcome model loss (Q_weight) much higher
than the propensity model (g_weight) or MLM loss (mlm_weight).
This setting works pretty well for our dataset, but when you work with your
own data, you should tune these parameters only if you have enough samples
to do so.
The low weighting of the MLM loss indicates that the texts in our dataset are
likely not very different from the overall language on the internet, and the
generic DistilBERT embeddings provide us with a good-enough
representation for our task.
To do so, we call the .inference() method and pass texts and additional
confounds as arguments. This method returns a tuple. We’re interested in the
0th element of this tuple:
preds = causal_bert.inference(
    texts=df['text'],
    confounds=df['has_photo'],
)[0]
To calculate the ATE, we subtract the entries in the first column from the
entries in the second column and take the average:
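A minimal sketch of that computation, assuming preds behaves like a NumPy array whose first column holds the predicted outcomes under control and whose second column holds the predicted outcomes under treatment:
# Second column minus first column, averaged over all observations
ate = (preds[:, 1] - preds[:, 0]).mean()
print(ate)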
−0.62321178301562
This conclusion is only valid if the three causal assumptions (no hidden
confounding, positivity, and consistency) are met and the model is
structurally sound (no aspects of the text are descendants of the treatment and
outcome, nor the outcome itself).
If Catori and Stephen had doubts regarding some of these assumptions, they
could run a randomized experiment on Reddit to see whether the results held.
Before we conclude this section, I want to take a step back and take a look at
the topic of fairness. At the beginning of this section, we cited a number of
papers, demonstrating a citation gender gap between female and male
authors.
In the next section, we’ll take a look at one of the ways in which we can
leverage the time dimension to draw causal conclusions when experiments
are not available.
Quasi-experiments
Randomized controlled trials (RCTs) are often considered the “gold
standard” for causal inference. One of the challenges regarding RCTs is that
we cannot carry them out in certain scenarios.
The day after the formal acquisition, on October 28, Musk tweeted, “The
bird is freed,” suggesting that the acquisition had now become a reality. The
tweet quickly gained significant attention from the public and media outlets
alike, with many people commenting and resharing it on the platform.
Significant events in politics, economy, or art and culture can spark public
attention and alter the behaviors of large groups of people, making them
search for additional information.
Let’s see.
To answer this question, we’ll take a look at Google Trends data and
compare how often users searched for Twitter before and after Elon Musk’s
tweet.
First, let’s notice what we already have. We have a treated unit, and we
believe that we know at which point in time the treatment has been assigned.
A useful control unit would provide us with information about what would
have happened to the treated unit if the treatment had not occurred. As we
already know, this alternative outcome cannot be observed. How can we
deal with this?
Whichever scenario of the two is the case, the two units will have some
predictive power to predict each other (as long as the underlying data-
generating process does not change).
REICHENBACH’S PRINCIPLE
Reichenbach’s common cause principle asserts that when two variables are correlated, there
must be either a causal relationship between the two, or a third variable exists (known as a
Reichenbachian common cause) that causes the correlation between the two.
In a finite data regime, correlations between two variables might also occur randomly. That’s
why in the text we say that it’s highly likely that the two variables are causally related or they
have a common cause, rather than saying that this is certainly true. Random correlations
between two long time series are highly unlikely, but the shorter the series, the more likely the
random correlation is.
The main idea behind synthetic control is to find units that are correlated
with our unit of interest in the pre-treatment period, learning a model that
effectively predicts the behavior of the treated unit after the treatment occurs.
The units that we use as predictors are called the donor pool.
Figure 11.9 – An outcome variable (Target) and a set of potential donor variables
Table 11.1 – The correlation coefficients between the donor variables and the outcome
Strong significant correlation coefficients indicate that these donor pool
variables will be good predictors of the outcome.
Let’s see how well we can predict the outcome from these variables using
simple linear regression. Figure 11.10 presents the actual outcome variable
(blue) and its predicted version (gray):
Figure 11.10 – The actual target and prediction based on donor variables
Figure 11.11 – An outcome variable (Target), a set of donor pool variables, and the
treatment
As we can see, our outcome variable changed significantly after the treatment
occurred, but donor variables seem to follow their previous pattern in the
post-treatment period.
Let’s predict the outcome variable from the donor pool variables after the
occurrence of the treatment.
The prediction (the gray solid line in Figure 11.12) in the post-treatment
period is our synthetic control unit.
Having the actual outcome (the blue line in Figure 11.12) and the predicted
counterfactual outcome (the gray line in Figure 11.12), we can subtract the
latter from the former to get the estimated treatment effect at any point in the
post-treatment period.
We’ll use the information on search volume for LinkedIn, TikTok, and
Instagram as our donor pool units. We sampled the data with daily granularity
between May 15, 2022 and November 11, 2022, with the treatment (the
tweet) occurring on October 28, 2022. This gives us 181 samples (days) in
total, with 166 samples (days) in the pre-treatment period. This sample size
should be more than sufficient to capture predictive regularities in the
dataset.
The synthetic control estimator learns to predict the outcome variable using
the donor pool variables in the pre-treatment period, and then it predicts the
counterfactual version of the outcome variable in the post-treatment period.
This procedure is carried out by finding a set of weights for each of the
donor pool variables that best predicts the outcome variable. The easiest
way to do this is to use linear regression.
In order to decrease the risk of overfitting, the synthetic control estimator
forces the weights for the donor pool variables to take values between 0 and
1 and enforces the condition that all the weights sum up to 1.
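Here is an illustrative sketch of that constrained fit (this is not CausalPy's implementation, just the idea: find non-negative weights that sum to 1 and minimize the pre-treatment prediction error):
import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(donors_pre, target_pre):
    # donors_pre: (n_pre_periods, n_donors), target_pre: (n_pre_periods,)
    n_donors = donors_pre.shape[1]
    loss = lambda w: np.sum((target_pre - donors_pre @ w) ** 2)
    result = minimize(
        loss,
        x0=np.full(n_donors, 1 / n_donors),            # start from equal weights
        bounds=[(0, 1)] * n_donors,                     # each weight between 0 and 1
        constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1},  # weights sum to 1
    )
    return result.x
CausalPy's estimator, which we'll use below, imposes an analogous constraint inside a Bayesian model rather than through a point-estimate optimization like this one.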
Note that to effectively predict the outcome variable from the donor pool
variables under these constraints, we need some of our donor pool variables
to take values greater than the outcome and some others to take values lower
than the outcome.
If all your donor pool variables always take values below or above the
values of your outcome, a constrained synthetic control estimator won't work.
Theoretically, in such a case, you can transform your variables, but this
comes with certain drawbacks (see Abadie, 2021 for details).
You can find the code for this section in the Chapter_11.2.ipynb notebook
(https://bit.ly/causal-ntbk-11_2). Note that this notebook uses a separate
conda environment. You’ll find installation instructions in the repository’s
description or the notebook itself (whichever is more convenient for you).
import pandas as pd
import causalpy as cp
import matplotlib.pyplot as plt
data = pd.read_csv(r'./data/gt_social_media_data.csv')
The first column stores the information about the date. The remaining four
columns contain information about the relative search volume for a given
day.
To make our work with CausalPy smooth, we’ll use date as the index of our
data frame:
data.index = pd.to_datetime(data['date'])
data = data.drop('date', axis=1)
Second, we drop the original column (the date information is now stored in
the index).
Let’s plot the data. Figure 11.14 presents the data, with the date of Musk’s
tweet marked by a black dashed line:
Figure 11.14 – Social media platforms search volume data
Note that Instagram’s search volume is typically higher than that of Twitter,
while LinkedIn and TikTok have lower search volumes compared to the
latter. Twitter’s line occasionally exceeds Instagram’s line, which is not
ideal, but as this happens rarely, we’ll accept it, assuming that it won’t
hinder the model’s ability to learn a useful representation.
To prepare the dataset for modeling, we need to store the treatment date in
the same format as the dates in the index of our data frame. We will use the
pd.to_datetime() function for this purpose:
treatment_index = pd.to_datetime('2022-10-28')
model = cp.pymc_models.WeightedSumFitter()
To define the model structure, we’ll use the R-style regression formula (you
can refer to Chapter 3 for a refresher on R-style formulas):
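A formula consistent with that description could look as follows (I'm assuming the columns are named twitter, linkedin, tiktok, and instagram; check the actual column names in the data frame):
formula = 'twitter ~ 0 + linkedin + tiktok + instagram'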
The formula says that we’ll predict twitter using the remaining three
variables. Zero at the beginning of the formula means that we’ll not fit an
intercept.
results = cp.pymc_experiments.SyntheticControl(
    data,
    treatment_index,
    formula=formula,
    model=model,
)
We pass the data, treatment date index, formula, and model to the constructor
and assign the object to the results variable.
Our donor pool predictors only explain 38.6% of the variability in the
outcome before the treatment. This means that the fit has limited quality.
The top panel presents the actual values of the outcome variable (the black
dots) and the predicted values of the outcome variable (the blue line). The
orange line represents the predicted synthetic control, and the blue shaded
area represents the estimated causal effect of the treatment. The treatment is
represented as a red vertical line.
The middle panel presents the difference between the predicted and actual
values of the outcome variable. If we had a model that predicted the outcome
perfectly, the blue line in the middle panel of Figure 11.15 would be a
straight line fixed at zero in the entire pre-treatment period. This is because a
perfect model would have zero errors.
Finally, the bottom panel presents the estimated cumulative causal impact of
the treatment.
Let’s take a look at how each of the donor pool predictors contributed to the
prediction. This can be achieved by calling the .summary() method on the
results object:
results.summary()
Note that none of the 94% highest density intervals (HDIs) contains zero,
suggesting that all predictors were significant. You can think of the HDI as a
Bayesian analog of confidence intervals (although this is a simplification; for
more details, check out Martin et al., 2021).
The model fit is not perfect (as expressed by the R² value), yet the overall
effect seems pretty large.
To increase our confidence in the results, we could further formally test the
significance of the effect (for more ideas, check out Facure, 2020 in Chapter
15 and Chernozhukov et al., 2022), but we’ll skip this procedure here.
Assuming that our model is correctly specified, we can say that we’ve found
convincing evidence that Elon Musk’s tweet increased the Google search
volume for Twitter.
That’s powerful!
Challenges
The donor pool size we used for this analysis is small, which may explain
the low R² value. As a rule of thumb, many practitioners would recommend
using between 5 and 25 variables in your donor pool. That said,
smaller donor pool sizes might have certain advantages.
For instance, we can be pretty sure that we’re not overfitting, which might
happen with larger sizes of donor pools (Abadie, 2021).
We hypothesized that Elon Musk’s tweet caused the increase in the search
volume for Twitter, yet there might be other factors at work (e.g., media
publications on Twitter’s acquisition). We might also deal with confounding
here.
The Twitter acquisition itself could have caused Musk’s tweet and increased
interest in the information about the platform. This alternative cannot be
excluded based on the data alone and might be very difficult to exclude in
general.
We should consider all the available data when working with these methods,
exactly the same way we do with any other causal method. Considering a
DAG that describes our problem, understanding structural relationships
between variables, and investigating potential confounding are good first
steps in quasi-experimental analysis.
To learn more about good practices regarding data selection and overall
synthetic control methodology, check out the excellent papers from Abadie
(2021) and Ferman et al. (2020).
Other great resources on synthetic controls include Scott Cunningham’s book
Causal Inference: The Mixtape (Cunningham, 2021) and Chapter 15 of
Matheus Facure’s online book Causal Inference for the Brave and True
(Facure, 2020).
Wrapping it up
We covered a lot in this chapter. We started by revisiting the S-Learner and
T-Learner models and demonstrated how flexible deep learning architectures
can help combine the benefits of both models. We implemented TARNet and
SNet and learned how to use the PyTorch-based CATENets library.
In the next chapter, we’ll start our adventure with causal discovery.
Benjamens, S., Banning, L. B. D., van den Berg, T. A. J., & Pol, R. A.
(2020). Gender Disparities in Authorships and Citations in
Transplantation Research. Transplantation Direct, 6(11), e614.
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin,
D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., & Zhang,
Q. (2018). JAX: composable transformations of Python+NumPy programs
[Computer software]: http://github.com/google/jax
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ...
& Amodei, D. (2020). Language models are few-shot learners. Advances in
Neural Information Processing Systems, 33, 1877-1901.
Chernozhukov, V., Wuthrich, K., & Zhu, Y. (2022). A t-test for synthetic
controls. arXiv.
Curth, A., & van der Schaar, M. (2021b). On Inductive Biases for
Heterogeneous Treatment Effect Estimation. Proceedings of the Thirty-Fifth
Conference on Neural Information Processing Systems.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-
training of deep bidirectional transformers for language understanding.
arXiv.
Dworkin, J. D., Linn, K. A., Teich, E. G., Zurn, P., Shinohara, R. T., &
Bassett, D. S. (2020). The extent and drivers of gender imbalance in
neuroscience reference lists. Nature Neuroscience, 23(8), 918-926.
Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M.
(2018). How to make causal inferences using texts. arXiv.
Facure, M. A. (2020). Causal Inference for The Brave and True.
Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-
Doughty, Z., ... & Yang, D. (2022). Causal inference in natural language
processing: Estimation, prediction, interpretation and beyond.
Transactions of the Association for Computational Linguistics, 10, 1138-
1158.
Ferman, B., Pinto, C., & Possebom, V. (2020). Cherry Picking with
Synthetic Controls. Journal of Policy Analysis and Management., 39, 510-
532.
Frohberg, J., & Binder, F. (2022). CRASS: A Novel Data Set and Benchmark
to Test Counterfactual Reasoning of Large Language Models. Proceedings
of the Thirteenth Language Resources and Evaluation Conference, 2126–
2140. https://aclanthology.org/2022.lrec-1.229
Gelman, A., Goodrich, B., Gabry, J., & Vehtari, A. (2018). R-squared for
Bayesian regression models. The American Statistician.
Hernán M. A., Robins J. M. (2020). Causal Inference: What If. Boca Raton:
Chapman & Hall/CRC.
Kıcıman, E., Ness, R., Sharma, A., & Tan, C. (2023). Causal Reasoning and
Large Language Models: Opening a New Frontier for Causality. arXiv
preprint arXiv:2305.00050.
Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-
normalizing neural networks. Advances in Neural Information Processing
Systems, 30.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet
Classification with Deep Convolutional Neural Networks. In: F. Pereira, C.
J. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural
Information Processing Systems (Vol. 25).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation
of word representations in vector space. arXiv.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in
PyTorch.
Pryzant, R., Chung, Y., & Jurafsky, D. (2017). Predicting Sales from the
Language of Product Descriptions. eCOM@SIGIR.
Pryzant, R., Card, D., Jurafsky, D., Veitch, V., & Sridhar, D. (2021). Causal
Effects of Linguistic Properties. Proceedings of the 2021 Conference of the
North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, 4095-4109.
Pryzant, R., Shen, K., Jurafsky, D., & Wagner, S. (2018). Deconfounded
Lexicon Induction for Interpretable Social Science. Proceedings of the
2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 1, 1615–1625.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... &
Sutskever, I. (2021). Learning transferable visual models from natural
language supervision. In International Conference on Machine Learning,
8748-8763. PMLR.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).
Language Models are Unsupervised Multitask Learners.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter. arXiv.
Shi, C., Blei, D., & Veitch, V. (2019). Adapting neural networks for the
estimation of treatment effects. Advances in Neural Information Processing
Systems, 32.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. In: I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R.
Garnett (eds), Advances in Neural Information Processing Systems (Vol. 30).
Veitch, V., Sridhar, D., & Blei, D. (2020). Adapting text embeddings for
causal inference. In Conference on Uncertainty in Artificial Intelligence,
919-928. PMLR.
Willig, M., Zečević, M., Dhami, D. S., Kersting, K. (2023). Causal Parrots:
Large Language Models May Talk Causality But Are Not Causal [ACM
preprint].
Zhang, C., Bauer, S., Bennett, P., Gao, J., Gong, W., Hilmkil, A., ... &
Vaughan, J. (2023). Understanding Causality with Large Language Models:
Feasibility and Opportunities. arXiv.
Most people who hear about causal discovery for the first time are truly
fascinated by this topic. Other people are skeptical, yet almost no one is
indifferent.
In this chapter, we’ll discuss three sources of causal knowledge and we’ll
think together about the relative advantages and disadvantages of using them.
By the end of this chapter, you’ll have a clear idea about different sources of
causal knowledge and will be able to discuss their core strengths and
weaknesses.
It’s difficult to assess how much of the 11 million bits can be processed
unconsciously. It is also difficult to translate these numbers directly into
babies’ processing capabilities, but it’s safe to say that infants are not able to
process all the information available in the environment most of the time.
One of the mechanisms that helps humans and other animals to free up some
of their processing capacity is called habituation. Habituation is a process
in which an organism’s response to repeated or prolonged presentations of a
stimulus decreases. In other words, we stop paying attention to things that
happen around us repeatedly over and over again in a similar way.
Have you ever moved to a new house or apartment with a loud street behind
the windows? If you had this experience, it’s likely that it was hard not to
notice the noise in the beginning.
There’s a chance it was pretty annoying to you. You might have even said to
yourself or your family or friends “Gosh, how loud this is!”
It is likely that in a month or so, you stopped noticing the noise. The noise
became a part of the background; it became essentially invisible.
Aimee Stahl and Lisa Feigenson from Johns Hopkins University have shown
groups of 11-month-old infants regular toys and toys that (seemingly)
violated the laws of physics. The altered toys went through the walls, flowed
in the air, and suddenly changed location.
Infants spent more time exploring the altered toys than the regular ones. They
also engaged with them differently.
The analyses have shown that babies learned about objects with unexpected
behavior better, explored them more, and tested relevant hypotheses for these
objects’ behavior (Stahl & Feigenson, 2015).
This and other experiments (see Gopnik, 2009) suggest that even very small
babies systematically choose actions that help them falsify hypotheses and
build relevant causal world models. This interventional active learning
approach is essential in achieving this goal.
Do you remember Malika, Illya, and Lian from the beginning of this chapter?
Malika passed a toy car to Illya, who dropped it accidentally. It turns out that
it’s much more likely that a baby will pass a dropped object to a person who
dropped it accidentally than to someone who dropped it or threw it
intentionally.
World models are critical for efficient navigation in physical and social
worlds, and we do not stop building them as grown-ups.
Scientific insights
Human babies have a lot to learn during their development – the laws of
physics, language, and social norms. After learning the basics, most children
start school. School provides most of us with a canon of current knowledge
about a broad variety of topics.
In this section, we’ll take a look at causal knowledge through the lens of
scientific methods. We’ll introduce some core scientific methodologies and
touch upon their benefits and limitations.
It seems to me that not many people who advocate for the existence of
supernatural beings are interested in formulating hypotheses such as this.
Other, more general or more vague hypotheses in similar contexts usually
turn out to be unfalsifiable.
For instance, a theory might fit the current data very well. The reason for that
might be that we simply do not have advanced enough tooling to collect the
data that could falsify this theory and so the theory’s fitness does not
automatically translate to its truthfulness. Some of these views have been
since criticized and not all of them are broadly accepted today (see the
callout box).
For instance, Thomas Kuhn has proposed an alternative model of how science develops
over time (Kuhn, 1962), and other philosophers such as Imre Lakatos or Paul Feyerabend
criticized Popper’s views on the demarcation between science and pseudo-science and his
criteria of theory rejection.
One of the (perhaps surprising) consequences of the consistent Popperism is the rejection of
Darwin’s theory of evolution as non-scientific (Popper himself called it an interesting
metaphysics project; Rosenberg & McIntyre, 2020).
French physicist Pierre Duhem and American logician Willard Van Orman Quine argued
independently that hypotheses cannot be empirically falsified in isolation because an
empirical test needs auxiliary assumptions. Popper’s theory has been also criticized as
overly romantic and non-realistic (Rosenberg & McIntyre, 2020).
These differences are reflected in the choice of scientific tools that scientists
in these fields make.
Controlled experiments
Physicists, chemists, and other representatives of the so-called hard sciences
often work with controlled experiments. The control aspect means that the
experimental environment is carefully controlled in order to minimize any
interference from the outside world or – in other words – to keep the context
variables constant. An extreme example of such control is projects such as
the Large Hadron Collider – a sophisticated experimental system consisting
of a 27-kilometer-long ring of superconducting magnets placed over 100
meters under the ground and a no less sophisticated shielding system aiming
to reduce unwanted outside influences.
RCTs are often considered the gold standard for inferring causal
relationships.
One of the challenges with RCTs pointed out by some authors is that they do
not provide us with estimates of the efficacy of treatment at the individual
level, but only at a group level (e.g., Kostis & Dobrzynski, 2020). Other
authors suggest that this is not always problematic and that often the group-
level conclusions are surprisingly transportable to individuals (e.g.,
Harrell, 2023), but as we saw in Chapter 9, group-level conclusions can
hide important information from us.
In this section, we implicitly assumed that an RCT consists of two (or more)
groups that are assigned to the treatment or control conditions in parallel.
This is not the only possible RCT design. Moreover, other designs such as
cross-over design might be more powerful. Further discussion on
experimental designs is beyond the scope of this book. Check Harrell (2023)
for a starter.
Each new hypothesis and each new experiment can potentially deepen our
understanding of real-world causal systems and this understanding can be
encoded using causal graphs (note that not all systems can be easily
described using acyclic graphs and sometimes causality might be difficult to
quantify). For instance, you might be interested in how the brightness of
colors on your website impacts sales. You might design an A/B test to test
this hypothesis. If the results indicate an influence of color brightness on
sales, you should add an edge from brightness to sales to a graph describing
the causes of sales.
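Such a conclusion can be encoded directly in a graph object, for instance with NetworkX (a minimal sketch; the variable names here are hypothetical):

import networkx as nx

# A hypothetical causal graph of sales drivers
sales_graph = nx.DiGraph()
sales_graph.add_edges_from([
    ('seasonality', 'sales'),
    ('ad_spend', 'sales'),
])

# The A/B test supported an effect of color brightness on sales,
# so we encode this piece of causal knowledge as a directed edge
sales_graph.add_edge('brightness', 'sales')

print(list(sales_graph.edges()))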
Simulations
Simulations are another way to obtain causal knowledge.
Your reaction was likely a result of many different factors, including a fast
instinctive response to a threatening stimulus (splashing muddy water), but it
was likely not entirely instinctive. You noticed a car in advance and likely
simulated what will happen. A simulation like this requires us to have a
world model.
In this case, your model of what will happen when a relatively hard object (a
tire) moves through a puddle at a sufficient speed likely comes from various
different experiences. It might have experimental components (such as you
jumping into puddles as a 2-year-old and observing the effects of your
interventions) and observational components (you seeing a person splashed
by a car and observing the consequences).
Personal experiences
Personal experiences such as the ones that led you to simulate the car
splashing you with the water can be a valid source of causal knowledge.
Humans and some other animals can effectively generalize experiences and
knowledge from one context to another.
That said, the generalizations are not always correct. A good example comes
from the realm of clinical psychology. Children growing up in dysfunctional
families might (and usually do) learn specific relational patterns (how to
relate to other people in various situations).
For example, if most of the leaders you have encountered in your life or in the media were men, the easy availability of these examples might make you subconsciously associate men with leadership and manifest itself in a conscious belief that being male is related to being better equipped to take leadership positions, although the evidence suggests otherwise (e.g., Zenger & Folkman, 2020).
An availability heuristic might lead not only to faulty world models but also
to interesting paradoxes. Multiple studies have shown that although women
are perceived as having stronger key leadership competencies than men, they
are not necessarily perceived as better leaders by the same group of people
(e.g., Pew Research Center, 2008).
Domain knowledge
Domain knowledge can rely on various sources. It might be based on
scientific knowledge, personal experiences, cultural transmission, or a
combination of all of these.
Domain experts will usually have a deep understanding of one or more areas
that they spent a significant amount of time studying and/or interacting with.
They might be able to accurately simulate various scenarios within their area
of expertise.
Some more recent methods allow for encoding expert knowledge into the
graph or learning from interventional data (with known or unknown
interventions).
Causal structure learning might be much cheaper and faster than running an
experiment, but it often turns out to be challenging in practice.
We saw that humans start to work on building world models very early in
development; yet not all world models that we build are accurate. Heuristics
that we use introduce biases that can skew our models on an individual,
organizational, or cultural level.
Unfortunately, experiments are not always available and have their own
limitations. Causal structure learning methods can be cheaper and faster than
running experiments, but they might rely on assumptions difficult to meet in
certain scenarios.
Let’s see how to implement causal discovery algorithms and how they work
in practice.
References
Gopnik, A. (2009). The philosophical baby: What children’s minds tell us
about truth, love, and the meaning of life. Farrar, Straus and Giroux.
Hall N. S. (2007). R. A. Fisher and his advocacy of randomization. Journal
of the History of Biology, 40(2), 295–325.
Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus and Giroux.
Muenssinger, J., Matuz, T., Schleger, F., Kiefer-Schmidt, I., Goelz, R.,
Wacker-Gussmann, A., Birbaumer, N., & Preissl, H. (2013). Auditory
habituation in the fetus and neonate: an fMEG study. Developmental
Science, 16(2), 287–295.
Pew Research Center. (2008). Men or Women: Who’s the Better Leader? A
Paradox in Public Attitudes. https://www.pewresearch.org/social-
trends/2008/08/25/men-or-women-whos-the-better-leader/
Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can
We Know? Princeton University Press.
Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science
of Prediction. Crown.
Zenger, J., & Folkman, J. (2020, December 30). Research: Women Are
Better Leaders During a Crisis. Harvard Business Review.
https://hbr.org/2020/12/research-women-are-better-leaders-during-a-crisis
14
We're inevitably moving towards the end of our book, but we still have
something to learn!
In this chapter, we’ll introduce methods and ideas that aim to overcome some of the limitations of the causal discovery approaches we discussed in the previous chapter. We’ll discuss an advanced deep learning causal discovery
framework, Deep End-to-end Causal Inference (DECI), and implement it
using the Microsoft open source library Causica and PyTorch.
We’ll see how to approach data with hidden confounding using the fast
causal inference (FCI) algorithm and introduce other algorithms that can be
used in similar scenarios.
We’ll close this chapter with a discussion of the challenges and open areas
for improvement in causal discovery.
By the end of this chapter, you’ll understand the basic theory behind DECI
and will be able to apply it to your own problems. You will understand when
the FCI algorithm might be useful and will be able to use it, including adding
expert knowledge.
The fact that graph search could be carried out using continuous optimization
opened up a path for integrating causal discovery with techniques coming
from other deep learning areas.
One example of a framework integrating such techniques into the realm of
causal discovery is DECI – a deep learning end-to-end causal discovery
and inference framework (Geffner et al., 2022).
DECI is a flexible model that builds on top of the core ideas of the NO
TEARS paper. It works for non-linear data with additive noise under
minimality and no hidden confounding assumptions.
In this section, we’ll discuss its architecture and major components and
apply it to a synthetic dataset, helping the model to converge by injecting
expert knowledge into the graph.
This section will have a slightly more technical character than some of the
previous chapters. We’ll focus on gaining a deeper understanding of the
framework and code.
This focus will help us get a grasp of how advanced methods in causal
discovery might be designed and how to implement them using lower-level
components.
Let’s briefly review the connection between autoregressive flows and causal
models.
A setting where variables are ordered and each variable’s value depends only on the variables that precede it resembles a structural causal model (SCM), where a node’s value depends only on the values of its parents.
This insight has been leveraged by Ilyes Khemakhem and colleagues
(Khemakhem et al., 2021), who proposed an autoregressive flow framework
with causal ordering of variables (CAREFL).
DECI further builds on this idea, by modeling the likelihood of data given the
graph in an autoregressive fashion.
Moreover, the model learns the posterior distribution over graphs rather than
just a single graph. In this sense, DECI is Bayesian. This architectural choice
allows for a very natural incorporation of expert knowledge into the graph.
We will see this in action later in this section.
A key quantity in DECI’s training objective is the DAG-ness score, which follows the NO TEARS formulation:

$$h(A) = \operatorname{tr}\left(e^{A \circ A}\right) - d$$

where $e^{A \circ A}$ is a matrix exponential, $\operatorname{tr}(\cdot)$ is a trace of a matrix, $\circ$ denotes the element-wise product, and $d$ is the number of nodes in the graph.
In other words, minimizing this component during training helps us make sure
that the recovered model will not contain cycles or bi-directional edges.
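To build an intuition for how this penalty behaves, here’s a small standalone illustration (not part of DECI’s code) comparing an acyclic and a cyclic adjacency matrix:

import numpy as np
from scipy.linalg import expm

def dagness(A):
    # tr(exp(A * A)) - d, with an element-wise square;
    # equals zero only for acyclic (binary) adjacency matrices
    d = A.shape[0]
    return np.trace(expm(A * A)) - d

acyclic = np.array([[0, 1],
                    [0, 0]])  # 0 -> 1
cyclic = np.array([[0, 1],
                   [1, 0]])   # 0 -> 1 and 1 -> 0

print(dagness(acyclic))  # 0.0
print(dagness(cyclic))   # ~1.09, i.e., > 0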
The DAG-ness score is one of the components that constitute DECI’s loss
function; but it’s not the only one. A second important component is the
sparsity score.
The sparsity score is defined using the squared Frobenius norm of the adjacency matrix. Formally, this is represented as follows:

$$\|A\|_F^2 = \sum_{i,j} |a_{ij}|^2$$
FROBENIUS NORM
The Frobenius norm is a matrix norm (https://bit.ly/MatrixNorm) defined as the square root of the sum of squares of the absolute values of the matrix entries:

$$\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$$
When $A$ is an unweighted adjacency matrix, the Frobenius norm measures the sparsity of the matrix: the sparser the matrix (that is, the more zero entries it has), the smaller the norm.
Note that if the matrix is real-valued, the absolute value function is redundant (squaring
will make the result non-negative anyway). This only matters for complex-valued matrices
(which we’re not using here, but we include the absolute value operator for completeness of
the definition).
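As a quick numerical check (not taken from the book’s notebook), the squared Frobenius norm of an unweighted adjacency matrix simply counts its edges:

import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])  # two edges

print(np.linalg.norm(A, 'fro') ** 2)  # approximately 2.0 (= the number of edges)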
The DAG-ness score and the sparsity score are used to define three terms in the model’s loss function: a DAG-ness penalty weighted by a coefficient $\alpha$, a quadratic DAG-ness penalty weighted by $\rho$, and a sparsity penalty weighted by $\lambda_s$.
While the latter coefficient is kept constant over the training, the first two are gradually increased.
The idea here is not to overly constrain the search space in the beginning (even if early graphs are not DAGs) in order to allow the algorithm to explore different trajectories. With time, we increase the values of $\alpha$ and $\rho$, effectively limiting the solutions to DAGs only.
The updates of these parameters are carried out by the scheduler of the
augmented Lagrangian optimizer (https://bit.ly/AugmentedLagrangian).
The initial values of $\alpha$, $\rho$, and $\lambda_s$ are set by the user (they play the role of hyperparameters) and can influence the trajectory of the algorithm. In some
causal discovery algorithms, including DECI, initial hyperparameter settings
can make or break the results.
Keeping this in mind, let’s move forward and see how to implement DECI.
First, we’ll import the dataclass decorator and NumPy and NetworkX
libraries:
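A minimal version of this import block looks as follows:

from dataclasses import dataclass

import networkx as nx
import numpy as np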
The dataclass decorator will help us make sure that the model configuration
is immutable and we don’t change it by mistake somewhere on the way.
We’ll use NumPy for general numerical purposes and NetworkX for graph
visualizations.
Next, we’ll import PyTorch, PyTorch Lightning, and two convenient tools –
DataLoader and TensorDict:
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tensordict import TensorDict
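We’ll also need simulation, plotting, and metric utilities. Assuming the same gCastle tooling as in the previous chapter (the exact import list is an assumption), these would be:

from castle.common import GraphDAG
from castle.datasets import DAG, IIDSimulation
from castle.metrics import MetricsDAG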
We’ll use them to generate the data, plot the results, and compute useful
metrics.
Next, we’ll import a number of modules and methods from Causica. Causica
(https://bit.ly/MicrosoftCausica) is a Microsoft-managed open source library
focused on causal machine learning. At the time of writing this chapter (Q2
2023), DECI is the only algorithm available in the library, yet the authors
have informed us that other algorithms will be added with time.
import causica.distributions as cd
We have skipped these imports here to avoid excessive clutter (check the
notebook for a full list of imports).
TENSORDICT
TensorDict is a PyTorch-specific dictionary-like class designed as a data-storing
container. The class inherits properties from PyTorch tensors, such as indexing, shape
operations, and casting to a device. TensorDict provides a useful abstraction that helps to
achieve greater modularity.
DECI training can be accelerated using GPU. The following line will check
whether a GPU is available in your system and store the relevant information
in the device variable:
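A standard PyTorch idiom for this check is the following:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

We also fix the random seed to make the results reproducible: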
SEED = 11
pl.seed_everything(SEED)
We’re ready to generate the data. We’ll generate 5,000 observations from a
simple scale-free graph:
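A minimal sketch of this step, assuming gCastle’s simulators (the node and edge counts below are illustrative, not necessarily the book’s exact settings):

adj_matrix = DAG.scale_free(n_nodes=5, n_edges=8, seed=SEED)

dataset = IIDSimulation(
    W=adj_matrix,
    n=5000,
    method='nonlinear',
    sem_type='mim'  # multiple index model (Zheng et al., 2020)
)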
Here, we follow the same process that we used in the previous chapter.
We picked non-linear data generated using the multiple index model (Zheng
et al., 2020). Figure 14.1 presents our generated DAG.
Figure 14.1 – Generated DAG
DECI configuration
As you have probably already noticed, DECI is a pretty complex model. In
order to keep the code clean and reproducible, we’ll define a set of neat
configuration objects:
@dataclass(frozen=True)
class TrainingConfig:
    noise_dist: cd.ContinuousNoiseDist = cd.ContinuousNoiseDist.SPLINE
    batch_size: int = 512
    max_epoch: int = 500
    gumbel_temp: float = 0.25
    averaging_period: int = 10
    prior_sparsity_lambda: float = 5.0
    init_rho: float = 1.0
    init_alpha: float = 0.0

training_config = TrainingConfig()
auglag_config = AugLagLRConfig()
We use the @dataclass decorator with frozen=True in order to make sure that
we don’t alter the configuration object somewhere along the way mistakenly.
We instantiate model configuration (TrainingConfig()) and optimizer
configuration (AugLagLRConfig()) and assign them to variables.
I set the batch size to 512 as I noticed that larger batches work better for
small graphs with DECI.
Preparing the data
The dataset that we generated is stored as NumPy arrays. As DECI uses
PyTorch, we need to cast it to torch.tensor objects.
data_tensors = {}

for i in range(dataset.X.shape[1]):
    data_tensors[f'x{i}'] = torch.tensor(
        dataset.X[:, i].reshape(-1, 1))

dataset_train = TensorDict(
    data_tensors,
    torch.Size([dataset.X.shape[0]]))
Let’s move the dataset to the device (the device will be cuda if a GPU accelerator has been detected on your system, and cpu otherwise):

dataset_train = dataset_train.apply(
    lambda t: t.to(dtype=torch.float32, device=device))
Finally, let’s create a PyTorch data loader, which will take care of batching,
shuffling, and smooth data serving during training for us:
dataloader_train = DataLoader(
dataset=dataset_train,
collate_fn=lambda x: x,
batch_size=training_config.batch_size,
shuffle=True,
drop_last=False,
)
DECI and expert knowledge
Thanks to its flexible architecture, DECI allows us to easily inject prior
knowledge into the training process.
Let’s pick one edge (let’s say edge (3, 0)) in our graph and pass a strong
belief about its existence to the model’s prior.
Figure 14.2 presents a plot of the true adjacency matrix with edge (3, 0)
marked in red:
Figure 14.2 – The true adjacency matrix with prior knowledge edges marked
As you can see, there’s also another spot marked in blue in Figure 14.2. This
spot represents the same edge, but pointing in the opposite direction. As we
are learning a directed graph (a DAG), the model should automatically
understand that if the edge (3, 0) exists, the edge (0, 3) does not exist. The
DAG-ness penalty in the cost function pushes the model toward this solution.
The entries in the relevance matrix inform the model of which entries should
be taken into account during the optimization.
First, we generate a zero matrix of the size of our adjacency matrix and
assign 1 to the entry (3, 0), where we believe an edge exists:
expert_matrix = torch.tensor(np.zeros(adj_matrix.shape))
expert_matrix[3, 0] = 1.
Next, in order to get the relevance matrix, we clone the expert matrix (we
want the entry (3, 0) to be taken into account by the model, so we can just
reuse the work we just did) and set the entry (0, 3) to 1:
relevance_mask = expert_matrix.clone()
relevance_mask[0, 3] = 1.
Now, the expert_matrix object contains 1 in position (3, 0), while the
relevance_mask object contains ones in positions (3, 0) and (0, 3).
Finally, we create the confidence matrix by cloning the relevance mask:

confidence_matrix = relevance_mask.clone()
We want to tell the model that we’re 100% sure that the edge (3, 0) exists.
The confidence matrix takes values between 0 and 1, so the ones in the (3,
0) and (0, 3) entries essentially tell the model that we’re completely sure
that these entries are correct.
In order to effectively pass the knowledge to the model, we need to pass all
three matrices to Causica’s ExpertGraphContainer object:
expert_knowledge = cd.ExpertGraphContainer(
dag=expert_matrix,
mask=relevance_mask,
confidence=confidence_matrix,
scale=5.
)
The last parameter that we pass to the expert knowledge container – scale –
determines the amount of contribution of the expert term to the loss.
The larger the value of scale, the heavier the expert graph will be weighted
when computing the loss, making expert knowledge more important and
harder to ignore for the model.
The main modules of DECI
DECI is largely a modular system, where particular components could be
replaced with other compatible elements. This makes the model even more
flexible.
Let’s start by defining the prior distribution over graphs:

prior = cd.GibbsDAGPrior(
    num_nodes=len(dataset_train.keys()),
    sparsity_lambda=training_config.prior_sparsity_lambda,
    expert_graph_container=expert_knowledge
)
We use the GibbsDAGPrior class to define the prior. We pass three parameters
to the class constructor: the number of nodes in the graph (which also
represents the number of features in our dataset), the sparsity lambda value
(this is the parameter, which weights the sparsity score that we discussed
earlier in this chapter), and – last but not least – the expert knowledge object.
The Gibbs prior object will later be used in the training in order to compute
the unnormalized log probability of the DAG that we’ll use to compute the
value of the loss function.
DECI’s SEM is built from three modules:

The adjacency distribution module (which models the distribution over graphs)
The functional relationships module (which models how each node depends on its parents)
The noise distribution module (which models the noise term distributions in the SEM)

We start with the adjacency distribution module:

num_nodes = len(dataset_train.keys())
adjacency_dist = cd.ENCOAdjacencyDistributionModule(num_nodes)
Next, we define the functional model. We’ll use the ICGNN graph neural
network (Park et al., 2022) for this purpose:
icgnn = ICGNN(
variables=tensordict_shapes(dataset_train),
embedding_size=8,
out_dim_g=8,
norm_layer=torch.nn.LayerNorm,
res_connection=True,
)
Two parameters deserve a comment here: embedding_size is the size of the node embeddings, and out_dim_g is the size of the embeddings that represent the parent nodes while the representations of the children nodes are computed internally.
We’ll start by creating a type dictionary. For each variable, we’ll create a
key-value pair with the variable name used as a key and its type as a value.
We’ll use Causica-specific type descriptions stored in the VariableTypeEnum
object:
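A dictionary comprehension consistent with this description would be:

types_dict = {
    var_name: VariableTypeEnum.CONTINUOUS
    for var_name in dataset_train.keys()
}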
As all the variables in our dataset are continuous, we use the same type
(VariableTypeEnum.CONTINUOUS) for all variables.
Finally, let’s create a set of noise modules for each of the variables:
noise_submodules = cd.create_noise_modules(
shapes=tensordict_shapes(dataset_train),
types=types_dict,
continuous_noise_dist=training_config.noise_dist
)
We pass the variable shapes and types to the constructor alongside the
intended noise distribution.
The information about the noise distribution type is stored in our training
configuration object. At the time of writing, DECI supports two noise
distribution types: Gaussian and spline. The latter is generally more flexible
and has been demonstrated to work better across a wide range of scenarios
(Geffner et al., 2022, p. 7), and so we’ve picked it here as our default type.
Having defined the per-variable noise submodules, we combine them into a single joint noise module:

noise_module = cd.JointNoiseModule(noise_submodules)
We now have all three SEM modules (adjacency, functional, and noise)
prepared. Let’s pass them to a common SEM super-container and send the
whole thing to a device:
sem_module = cd.SEMDistributionModule(
adjacency_module=adjacency_dist,
functional_relationships=icgnn,
noise_module=noise_module)
sem_module.to(device)
The SEM module is now ready for training. The last missing part is the
optimizer. DECI can use any PyTorch optimizer. Here, we’ll use Adam.
First, we’ll create a parameter list for all modules and then pass it to Adam’s
constructor:
modules = {
'icgnn': sem_module.functional_relationships,
'vardist': sem_module.adjacency_module,
'noise_dist': sem_module.noise_module,
}
parameter_list = [
{'params': module.parameters(), 'lr':
auglag_config.lr_init_dict[name], 'name': name}
for name, module in modules.items()
]
optimizer = torch.optim.Adam(params=parameter_list)
scheduler = AugLagLR(config=auglag_config)
auglag_loss = AugLagLossCalculator(
init_alpha=training_config.init_alpha,
init_rho=training_config.init_rho
)
In the outer for loop, we’ll iterate over the epochs, and in the inner for loop, we’ll iterate over the batches within each epoch (so every batch is processed once per epoch).
Before we start the for loops, let’s store the total number of samples in our
dataset in a num_samples variable. We’ll use it later to compute our
objective:
num_samples = len(dataset_train)
for epoch in range(training_config.max_epoch):
    for i, batch in enumerate(dataloader_train):
Within the loop, we’ll start by zeroing the gradients, so that we can make
sure that we compute fresh gradients for each batch:
optimizer.zero_grad()
Next, we’ll sample from our SEM module and calculate the probability of
the data in the batch given the current model:
sem_distribution = sem_module()
sem, *_ = sem_distribution.relaxed_sample(
torch.Size([]),
temperature=training_config.gumbel_temp
)
batch_log_prob = sem.log_prob(batch).mean()
Note that we use the .relaxed_sample() method. This method uses the
Gumbel-Softmax trick, which approximates sampling from a discrete
distribution (which is non-differentiable) with a continuous distribution
(which is differentiable).
Next, still within the batch loop, we compute the SEM distribution entropy
(https://bit.ly/EntropyDefinition). We’ll need this quantity to compute the
overall value of the loss for the model:
sem_distribution_entropy = sem_distribution.entropy()
Next, we compute the log probability of the current graph, given our prior
knowledge and the sparsity score we defined in the DECI’s internal
building blocks subsection earlier (the score is computed internally):
prior_term = prior.log_prob(sem.graph)
Next, we compute the objective and DAG-ness score, and pass everything to
our augmented Lagrangian calculator:
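Based on Causica’s public DECI example, this computation can be sketched roughly as follows (the exact composition of the objective and the calculate_dagness helper are assumptions, not code reproduced from the book):

objective = (
    -sem_distribution_entropy - prior_term
) / num_samples - batch_log_prob

constraint = calculate_dagness(sem.graph)  # assumed Causica helper

loss = auglag_loss(objective, constraint / num_samples)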
Finally, we backpropagate the loss, update the model parameters, and let the augmented Lagrangian scheduler update its coefficients:

loss.backward()
optimizer.step()
scheduler.step(
    optimizer=optimizer,
    loss=auglag_loss,
    loss_value=loss.item(),
    lagrangian_penalty=constraint.item(),
)
This concludes our training loop (in fact, in the notebook, we have a couple
more lines that print out and plot the results, but we’ve skipped them here to
avoid stacking too much code in the chapter).
Figure 14.3 – The matrix recovered by DECI and the true matrix
As we can see, DECI did a very good job and recovered the matrix perfectly.
That said, we need to remember that we made the task easier for the model
by providing some very strong priors.
I also found that a number of hyperparameter settings were crucial for keeping the model’s results reasonably stable on this and similar datasets. I used my knowledge and intuitions from previous experiments to choose the values.
First, I set the embedding sizes for ICGNN to 8. With larger, 32-dimensional
embeddings, the model seemed unstable. Second, I set the batch size to 512.
With a batch size of 128, the model had difficulties converging to a good
solution.
The DECI architecture is powerful and flexible. This comes at the price of
higher complexity. If you want to use DECI in practice, it might be a good
idea to first work with the model on other datasets with a known structure
that are similar to your problem in order to find good hyperparameter
settings.
For details on assumptions, check out Geffner et al. (2022), Section 3.2. For
theoretical guarantees and their proofs, check out Geffner et al. (2022),
Theorem 1 and Appendix A.
DECI is end-to-end
We used DECI as a causal discovery method, but in fact, it is an end-to-end
causal framework, capable of not only causal discovery but also estimating
the average treatment effect (ATE) and (to an extent) the conditional
average treatment effect (CATE).
Despite its many strengths, DECI – similar to the models that we discussed in
the previous chapter – requires that no hidden confounding is present in the
data.
In this section, we’ll learn about the FCI algorithm, which can operate when
some or all confounding variables are unobserved. We’ll implement the FCI
algorithm using the causal-learn package and – finally – discuss two more
approaches that can be helpful when our dataset contains potential
unobserved confounders.
That said, FCI output can be more informative than the standard PC output.
The reason for this is that FCI returns more edge types than just simple
directed and undirected edges.
To be precise, there are four edge types in FCI (we’re following the notation scheme used in the causal-learn package):

A --> B: A is a cause of B (and B is not an ancestor of A)
A <-> B: there is an unobserved common cause of A and B (neither is an ancestor of the other)
A o-> B: B is not an ancestor of A (the circle endpoint is left undetermined)
A o-o B: no assertion is made about the ancestral relationship between A and B
We import the fci function, which implements the FCI algorithm, and the BackgroundKnowledge and GraphNode classes, which will allow us to inject prior knowledge into the algorithm.
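Assuming the current causal-learn package layout (the exact module paths are an assumption), the imports would be:

from causallearn.search.ConstraintBased.FCI import fci
from causallearn.graph.GraphNode import GraphNode
from causallearn.utils.PCUtils.BackgroundKnowledge import BackgroundKnowledge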
N = 1000
q = np.random.uniform(0, 2, N)
w = np.random.randn(N)
x = np.random.gumbel(0, 1, N) + w
y = 0.6 * q + 0.8 * w + np.random.uniform(0, 1, N)
z = 0.5 * x + np.random.randn(N)
data = np.stack([x, y, w, z, q]).T
confounded_data = np.stack([x, y, z, q]).T
We generate a five-dimensional dataset with 1,000 observations and different
noise distributions (Gaussian, uniform, and Gumbel).
Figure 14.4 presents the structure of our dataset. The red node, W, is an
unobserved confounder.
causal-learn’s API is pretty different from the one we know from gCastle.
First, the model is represented as a function rather than a class with a
dedicated fitting method.
The model function (fci()) returns causal-learn’s native graph object and a
list of edge objects. Let’s run the algorithm and store the outputs in variables:
g, edges = fci(
dataset=confounded_data,
independence_test_method='kci'
)
Let’s inspect the adjacency representation stored in the graph attribute:

g.graph

array([[0, 2, 2, 0],
       [1, 0, 0, 1],
       [2, 0, 0, 0],
       [0, 2, 0, 0]])
For most people, this matrix is difficult to read (at least at first), and there’s
no easy way to meaningfully plot it using GraphDAG, which we’ve used so far.
Fortunately, we can iterate over the edges object to obtain a more human-
readable representation of the results.
Let’s create a mapping between the default variable names used internally by
causal-learn and the variable names that we used in our dataset:
mapping = {
'X1': 'X',
'X2': 'Y',
'X3': 'Z',
'X4': 'Q'
}
Now, let’s iterate over the edges returned by FCI and print them out:
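One simple way to obtain mapped, human-readable edges (a sketch, not necessarily the exact code from the notebook) relies on the string representation of causal-learn’s Edge objects:

for edge in edges:
    edge_str = str(edge)  # e.g., 'X1 o-> X2'
    for internal_name, our_name in mapping.items():
        edge_str = edge_str.replace(internal_name, our_name)
    print(edge_str)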
X o-> Y
X o-o Z
Q o-> Y
Y is not an ancestor of Q
Looking at the graph, all of this is true. At the same time, the output is not
very informative as it leaves us with many unknowns.
FCI and expert knowledge
FCI allows us to easily inject expert knowledge into it. In causal-learn, we
can add expert knowledge using the BackgroundKnowledge class. Note that
we’ll need to use the naming convention used by causal-learn internally (X1,
X2, etc.) in order to specify expert knowledge (see the code snippet with the
mapping dictionary in the preceding section for mapping between causal-
learn’s variable names and the variable names in our graph).
prior_knowledge = BackgroundKnowledge()

# Forbid a directed edge from X2 (Y) to X4 (Q)
prior_knowledge.add_forbidden_by_node(
    GraphNode('X2'), GraphNode('X4'))

# Require a directed edge from X1 (X) to X3 (Z)
prior_knowledge.add_required_by_node(
    GraphNode('X1'), GraphNode('X3'))
In order to identify the nodes, we use GraphNode objects and pass node names
that causal-learn uses internally (we could reverse our mapping here, but
we’ll skip it for simplicity).
Now, we’re ready to pass our prior knowledge to the fci() function:
g, edges = fci(
dataset=confounded_data,
independence_test_method='fisherz',
background_knowledge=prior_knowledge
)
Note that this time, we used Fisher’s Z-test instead of KCI. Fisher-Z is a fast but slightly less flexible conditional independence test compared to KCI. It is recommended for linear Gaussian data, but I’ve seen it work multiple times for much more complex distributions and relationships as well. This time, FCI returns the following edges:
X o-> Y
X --> Z
Q --> Y
FCI might be a good choice if we’re not sure about the existence of edges.
The algorithm can give us a number of useful hints. The edges might next be
passed in the form of expert knowledge to another algorithm for further
disambiguation (e.g., LiNGAM or DECI if the respective assumptions are
met). FCI performance can also be improved by using so-called tiers when
passing background knowledge to the algorithm (tiered background
knowledge). For details, check out Andrews et al. (2020).
In this section, we introduced causal discovery methods that can work under
hidden confounding. We discussed the FCI algorithm and implemented it
using the causal-learn library. We also learned how to pass prior knowledge
to the algorithm and introduced two other methods that can work under
different scenarios involving hidden confounding: CCANM and CORTH.
In the next section, we’ll discuss methods that can leverage information
coming from interventions.
In this short section, we’ll introduce two methods that can help us make good use of interventional data when it is available.
ENCO
Efficient Neural Causal Discovery (ENCO; Lippe et al., 2022) is a causal
discovery method for observational and interventional data. It uses
continuous optimization and – as we mentioned earlier in the section on
DECI – parametrizes edge existence and its orientation separately. ENCO is
guaranteed to converge to a correct DAG if interventions on all variables are
available, but it also performs reasonably well on partial intervention sets.
Moreover, the model works with discrete, continuous, and mixed variables
and can be extended to work with hidden confounding. The model code is
available on GitHub (https://bit.ly/EncoGitHub).
ABCI
Active Bayesian Causal Inference (ABCI; Toth et al., 2022) is a fully Bayesian framework for active causal discovery and reasoning. ABCI assumes acyclicity, no hidden confounding, and a non-linear additive noise model with homoscedastic noise. A great advantage of ABCI
is that it does not necessarily focus on estimating the entire causal graph, but
rather on a causal query of interest, and then sequentially designs
experiments that most reduce the uncertainty. This makes ABCI highly data-
efficient. As a Bayesian method, ABCI makes it easy to encode expert
knowledge in the form of a prior(s). Moreover, ABCI allows for different
types of causal queries: causal discovery, partial causal discovery, SCM
learning, and more.
The results for all causal discovery methods had limited quality in this
setting. F1 scores varied roughly between 0.01 and 0.32, depending on the
method and setting (top scores required a very large sample size, that is,
>16,000 samples).
The authors also report that both methods were sensitive to hyperparameter
changes. This is challenging in a real-world setting, where the true graph is
unknown and there’s no good general benchmark, because we don’t know
which values of hyperparameters to choose.
Shen et al. (2020) applied FCI and FGES algorithms to Alzheimer’s disease
data. The FGES algorithm provided promising results with precision varying
between 0.46 and 0.76 and recall varying between 0.6 and 0.74, depending
on how much expert knowledge was injected into the algorithm.
The latter results show how valuable adding expert knowledge to the graph
can be.
I hope that with the rising adoption of causal methods in the industry, this
will start gradually changing and we’ll start seeing more companies sharing
their experiences. This is of paramount importance as industrial use cases
provide the research community with a vital injection of motivation. The
research community gives back, providing the industry with better and more
efficient ideas, which moves the industry forward.
ENCO supports mixed types natively, but only when we provide the
algorithm with interventional data. The necessity to use interventional data
might be a serious limitation in applying this algorithm in certain use cases.
Contemporary causal discovery can be a source of valuable insights and can
definitely be helpful, yet it should be used with the awareness of its
limitations.
Wrapping it up!
In this chapter, we introduced several methods and ideas that aim to
overcome the limitations of traditional causal discovery frameworks. We
discussed DECI, an advanced deep learning causal discovery framework,
and demonstrated how it can be implemented using Causica, Microsoft’s
open source library, and PyTorch.
We explored the FCI algorithm, which can be used to handle data with
hidden confounding, and introduced other algorithms that can be used in
similar scenarios. These methods provide a strong foundation for tackling
complex causal inference problems.
After that, we discussed two frameworks, ENCO and ABCI, that allow us to
combine observational and interventional data. These frameworks extend our
ability to perform causal discovery and provide valuable tools for data
analysis.
In the next chapter, we’ll summarize everything we’ve learned so far, discuss
practical ideas of how to effectively apply some of the methods we’ve
discussed, and see how causality is being successfully implemented across
industries.
References
Andrews, B., Spirtes, P., & Cooper, G. F. (2020). On the Completeness of
Causal Discovery in the Presence of Latent Confounding with Tiered
Background Knowledge. International Conference on Artificial Intelligence
and Statistics.
Cai, R., Qiao, J., Zhang, K., Zhang, Z., & Hao, Z. (2021). Causal discovery
with cascade nonlinear additive noise models. ACM Trans. Intell. Syst.
Technol., 6(12).
Geffner, T., Antorán, J., Foster, A., Gong, W., Ma, C., Kıcıman, E., Sharma,
A., Lamb, A., Kukla, M., Pawlowski, N., Allamanis, M., & Zhang, C.
(2022). Deep End-to-end Causal Inference. arXiv.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial
networks. Communications of the ACM, 63(11), 139-144.
Goudet, O., Kalainathan, D., Caillou, P., Guyon, I., Lopez-Paz, D., & Sebag,
M. (2018). Causal generative neural networks. arXiv.
Huang, Y., Kleindessner, M., Munishkin, A., Varshney, D., Guo, P., & Wang,
J. (2021). Benchmarking of Data-Driven Causality Discovery Approaches
in the Interactions of Arctic Sea Ice and Atmosphere. Frontiers in Big Data,
4.
Kalainathan, D., Goudet, O., Guyon, I., Lopez-Paz, D., & Sebag, M. (2022).
Structural agnostic modeling: Adversarial learning of causal graphs.
arXiv.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., &
Welling, M. (2016). Improved Variational Inference with Inverse
Autoregressive Flow. Advances in Neural Information Processing Systems,
29.
Lippe, P., Cohen, T., & Gavves, E. (2021). Efficient neural causal
discovery without acyclicity constraints. arXiv.
Park, J., Song, C., & Park, J. (2022). Input Convex Graph Neural Networks:
An Application to Optimal Control and Design Optimization. Open Review.
https://openreview.net/forum?id=S2pNPZM-w-f
Shen, X., Ma, S., Vemuri, P., & Simon, G. (2020). Challenges and
opportunities with causal discovery algorithms: application to Alzheimer’s
pathophysiology. Scientific Reports, 10(1), 2975.
Soleymani, A., Raj, A., Bauer, S., Schölkopf, B., & Besserve, M. (2022).
Causal feature selection via orthogonal search. Transactions on Machine
Learning Research.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and
Search. MIT Press.
Spirtes, P., Meek, C., & Richardson, T. S. (2013). Causal inference in the
presence of latent variables and selection bias. arXiv.
Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R., & Von
Kügelgen, J. (2022). Active Bayesian Causal Inference. arXiv.
Tu, R., Zhang, K., Bertilson, B., Kjellstrom, H., & Zhang, C. (2019).
Neuropathic pain diagnosis simulator for causal discovery algorithm
evaluation. Advances in Neural Information Processing Systems, 32.
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012). Kernel-based
conditional independence test and application in causal discovery. arXiv.
Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. P. (2018). DAGs with NO
TEARS: Continuous Optimization for Structure Learning. Neural
Information Processing Systems.
Zheng, X., Dan, C., Aragam, B., Ravikumar, P., & Xing, E. P. (2020).
Learning Sparse Nonparametric DAGs. International Conference on
Artificial Intelligence and Statistics.
15
Epilogue
Congratulations on reaching the final chapter!
This is the last stop in our journey. Before we close, we’ll do a number of
things:
Discuss five steps to get the best out of your causal projects
Take a look at the intersection of causality and business and see how
organizations implement successful causal projects
Discuss where to find resources and how to learn more about causality
We defined the concept of confounding and showed how it can lead us astray
by producing spurious relationships between causally independent variables.
Next, we introduced the Ladder of Causation and its three rungs –
observations, interventions, and counterfactuals. We showed the
differences between observational and interventional distributions using
linear regression.
After that, we refreshed our knowledge of the basic graph theory and
introduced graphs as an important building block for causal models. We
discussed three basic conditional independence structures – forks, chains,
and colliders, and showed that colliders have a special status among the
three, allowing us to infer the direction of causal influence from the data.
In the next section, we’ll discuss five steps to get the best out of your causal
project that are based on some of the best practices we discussed in the
book.
Five steps to get the best out of your
causal project
In this section, we’ll discuss five steps that can help you maximize the
potential of your causal project.
For instance, one mistake that I observe in the industry is starting with very
broad questions regarding a complete causal model of a process, or even an
entire organization/organizational unit. In certain cases, building such a
complete model might be very difficult, very costly, or both.
The problem with over-scoping the project and asking too broad questions is
not unique to causality. It happens in non-causal AI projects as well. That
said, under-defined or incorrectly scoped questions can break a causal
project much faster than a traditional machine learning project.
One positive is that an organization can understand relatively early on that a
project does not bring the expected benefits and shut it down faster, avoiding
significant losses. The negative is that such an experience might produce
disappointment, leading to reluctance to move toward causal modeling. This
can be detrimental because, for many organizations, causal modeling can
bring significant benefits, which might often remain unrealized without causal
methods.
Depending on the use case, collecting and validating expert knowledge can
be a short and natural process or a long and effortful one, but its value is hard
to overestimate.
Note that not all sources of expert knowledge have to come with the same
level of confidence. It’s a good idea to keep an open mind regarding the (less
confident) sources and be prepared to discard them in the process, especially
if alternative explanations are more plausible and/or more coherent with
other trustworthy sources of information.
It’s a good practice to store and manage expert knowledge in a structured and
accessible way (e.g., following FAIR (https://bit.ly/FAIRPrinciples) or
another set of principles that are well suited for your organization and the
problem that you’re solving).
Depending on the size of the graph, the size of the project, and your organization’s characteristics, you might choose different storage options for your graphs – a repository containing one or more files that encode the graph(s) or a graph database (e.g., Neo4j or Amazon Neptune) can both be valid options.
Accessibility and the ease of updating the structure are of key importance
here.
Check identifiability
It’s likely that your first graph will contain some ambiguous edges and/or
some unobserved variables. The effect of interest might be possible to
estimate even despite this.
To check whether this is the case, you can use one of the advanced
identifiability algorithms. You can find some of these algorithms in the
Python grapl-causal library, developed by Max Little of the University of
Birmingham and MIT.
Even if your effect is not identifiable right away, you might still be able to
obtain some actionable information from your model. This is possible for a
broad array of models with sensitivity analysis tools (check Chapter 8 for
more details).
For instance, if you work in marketing or sales, you might know from
experience that even if there’s hidden confounding in your data, the maximum
impact of all hidden confounders on sales should not be greater than some
particular value.
If this is the case, you can reliably check whether your effect holds under
extreme confounding or under the most likely values of confounders.
If your effect turns out to be identifiable right away or you work with
experimental data, you can start estimating the effect using one of the methods
that we discussed in Part 2 of our book.
If you have doubts regarding some of the edges or their orientation, you can
employ one of the causal discovery methods that we discussed in Part 3 and
confront the output against expert knowledge. If you can afford interventions
on all or some variables, methods such as ENCO (Chapter 14) can lead to
very good results.
Falsifying hypotheses
When we obtain an identifiable graph, we treat it as a hypothesis, and we can
now learn the functional assignments over the graph (e.g., using the four-step
framework from Chapter 7). After learning the functional assignments over
the graph, we can generate predictions. Testing these predictions over
interventional test data is a good way to check whether the model behaves
realistically.
Note that the steps we described do not include the actual data collection
process. This is because data collection might happen at very different
stages, depending on the nature of the question, your organization’s data
maturity, and the methodology you pick.
In this section, we discussed five steps we can take in order to carry out a reliable causal project:
1. Scoping the question.
2. Collecting and validating expert knowledge.
3. Generating hypotheses.
4. Checking identifiability.
5. Falsifying hypotheses.
In the next section, we’ll see examples of causal projects carried out in the
real world.
Causality and business
In this section, we’ll describe a couple of real-world use cases where causal
systems have been successfully implemented to address business challenges
and discuss how causality intersects with business frameworks.
3. What has caused a recent change in outcome O and how do we revert the
change?
For instance, question 1 asks what actions should be taken to minimize the
characteristic of interest of product P. This question goes beyond plain
understanding of what simply correlates with this characteristic (for context,
recall our discussion on confounding from Chapter 1). To answer question 1,
we need to understand the mechanism behind this characteristic and how
changes in other variables can affect it.
After defining the questions, Geminos’ team consulted the experts and
researched relevant scientific literature. Combining information from both
sources, the team built their first hypothetical graph.
The team addressed this challenge by adjusting the graph in a way that
preserved identifiability of the most important relationships, and the client
learned which variables should be prioritized for future collection to
facilitate further causal analyses.
After a number of iterations, the client was able to answer their questions
and optimize the process beyond what they had previously been able to
achieve, using traditional machine learning, statistical and optimization tools.
The company also learned how to structure and prioritize their new data
collection efforts in order to maximize the value, which saved them potential
losses related to adapting existing systems to collect data that would not
bring clear business value.
Causal modeling is also used by Swedish audio streaming giant Spotify. The
company regularly shares their experience in applied causality through their
technical blog and top conference publications (e.g., Jeunen et al., 2022).
Spotify’s blog covers a wide variety of topics, from sensitivity analysis of a
synthetic control estimator (which we learned about in Chapter 11; see more
at https://bit.ly/SpotifySynthControl) to disentangling causal effects from sets
of interventions under hidden confounding (https://bit.ly/SpotifyHiddenBlog).
Production, sales, marketing and digital entertainment are not the only areas
that can benefit from causal modeling. Causal models are researched and
implemented across fields, from medicine to the automotive industry.
This is not surprising, as causal models often capture the aspects of problems
that we’re the most interested in solving. Causal models (in particular
structural causal models (SCMs)) are also compatible with (or easily
adaptable to) numerous business and process improvement frameworks.
Let’s see an example. Six Sigma is a set of process improvement tools and
techniques proposed by Bill Smith at Motorola. Within the Six Sigma
framework, causality is defined as an impact of some input $X$ on some output $Y$ via some mechanism $f$, formally:

$$Y = f(X)$$
Note that this is almost identical to functional assignments in SCMs, which
we discussed back in Chapter 2 (except that we skip the noise variable in
the preceding formula).
To learn more about causality from a business point of view, check out
Causal Artificial Intelligence: The Next Step in Effective, Efficient, and
Practical AI by Judith Hurwitz and John K. Thompson (due for release in Q4
2023). Now, let’s take a look at the future of causality.
Let’s start our journey into the future from where we’re currently standing.
Here are three skills that offer significant benefits for a wide range of
organizations:
Despite the relatively low entry barrier, these three ideas seem to be highly
underutilized across sectors, industries, and organization types.
Causal benchmarks
From a research perspective, one of the main challenges that we face today
in causality is a lack of widely accessible real-world datasets and universal
benchmarks – the analogs of ImageNet in computer vision.
Datasets and benchmarks can play a crucial role in advancing the field of
causality by fostering reproducibility and transparency and providing
researchers with a common point of reference.
Today’s synthetic benchmarks have properties that hinder their effective use
in causal discovery (Reisach et al., 2021) and causal inference (Curth et al.,
2021).
Some useful data simulators that can help to benchmark causal discovery
methods have been also proposed recently (see
https://bit.ly/DiagnosisSimulator).
Now, let’s discuss four potential research and application directions where
causality can bring value.
Intervening agents
The 2022/23 generative AI revolution has resulted in a number of new
applications, including Auto-GPT or AgentGPT – programs that leverage
GPT-class models behind the scenes and allow them to interact with the
environment.
Model instances (called agents) might have access to the internet or other
resources and can solve complex multistep tasks. Equipping these agents
with causal reasoning capabilities can make them much more effective and
less susceptible to confounding, especially if they are able to interact with
the environment in order to falsify their own hypotheses about causal
mechanisms.
Agents such as these could perform automated research and are a significant
step toward creating the automated scientist. Note that allowing these agents
to interact with the physical environment rather than only with the virtual one
could significantly enhance their efficiency, yet this also comes with a
number of important safety and ethical considerations.
Note that these ideas are related to some of the concepts discussed
extensively by Elias Bareinboim in his 2020 Causal Reinforcement
Learning tutorial (https://bit.ly/CausalRL).
For example, CITRIS (Lippe et al., 2022) learns low-level causal structures from videos containing interventions.
Imitation learning
When human babies learn basic world models, they often rely on
experimentation (e.g., Gopnik, 2009; also vide Chapter 1 of this book). In
parallel, children imitate other people in their environment. At later stages in
life, we start learning from others more often than performing our own
experiments. This type of learning is efficient but might be susceptible to
arbitrary biases or confounding.
Racial or gender stereotypes passed through generations are great examples
of learning incorrect world models by imitation. For instance, in modern
Western culture, it was widely believed until the late 1960s that women are
physically incapable of running a marathon. Bobbi Gibb falsified this
hypothesis by (illegally) running in the Boston Marathon in 1966. She not
only finished the race but also ran faster than roughly 66% of the male
participants (https://bit.ly/BobbiGibbStory).
Willig et al. (2023) proposed that LLMs learn a meta-SCM from text and
that this model is associational in its nature.
For instance, causally aware imitation learning can be very useful to create
virtual assistants for business, medicine, or research. These assistants can
learn from existing materials or the performance of other (virtual or physical)
assistants and then improve by validating the trustworthiness of learned
models.
Another broad research direction that intersects with some of the preceding
ideas is neuro-symbolic AI – a class of models combining associational
representation learning with symbolic reasoning modules.
Learning causality
In this section, we’ll point to the resources to learn more about causality after
finishing this book.
For many people starting with causality, their learning path begins with
excitement. The promise of causality is attractive and powerful. After
learning about the basics and realizing the challenges that any student of
causality has to face, many of us lose hope in the early stages of our journeys.
Some of us regain it, learning that solutions do exist, although not necessarily
where we initially expected to find them.
After overcoming the first challenges and going deeper into the topic, many
of us realize that there are more difficulties to come. Learning from earlier
experiences, it’s easier at this stage to realize that (many of) these difficulties
can be tackled using a creative and systematic approach.
I like the way the Swiss educator and researcher Quentin Gallea presented
the journey into learning causality in a graphical form (Figure 15.1).
Figure 15.1 – The journey into causality by Quentin Gallea
At whichever point of the curve from Figure 15.1 you find yourself currently,
being consistent will inevitably move you to the next stage.
A common problem that many of us face when learning a new topic is the
choice of the next resource after finishing a book or a course.
Here are a couple of resources that you might find useful on your journey one
day.
First, there are many great books on causality. Starting with Judea Pearl’s
classics such as The Book of Why and finishing with Hernán and Robins’
What If?, you can learn a lot about different perspectives on causal inference
and discovery. I summarized six great books on causality in one of my blog
posts here: https://bit.ly/SixBooksBlog.
Second, survey papers are a great way to get a grasp of what’s going on in
the field and what the open challenges are.
Here are three survey papers that can help you understand the current causal
landscape:
LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Twitter: @AleksanderMolak
If you want to consult a project or run a workshop on causality for your team,
drop me a line at [email protected].
Wrapping it up
It’s time to conclude our journey.
I hope finishing this book won't be the end for you, but rather the beginning of
a new causal chapter!
References
Bareinboim, E., & Pearl, J. (2016). Causal inference and the data-fusion
problem. Proceedings of the National Academy of Sciences of the United
States of America, 113(27), 7345–7352.
Berrevoets, J., Kacprzyk, K., Qian, Z., & van der Schaar, M. (2023). Causal
Deep Learning. arXiv.
Chau, S. L., Ton, J.-F., González, J., Teh, Y., & Sejdinovic, D. (2021).
BayesIMP: Uncertainty Quantification for Causal Data Fusion.
Curth, A., Svensson, D., Weatherall, J., & van der Schaar, M. (2021). Really
Doing Great at Estimating CATE? A Critical Look at ML Benchmarking
Practices in Treatment Effect Estimation. Proceedings of the Neural
Information Processing Systems Track on Datasets and Benchmarks.
Deng, Z., Zheng, X., Tian, H., & Zeng, D. D. (2022). Deep Causal Learning:
Representation, Discovery and Inference. arXiv.
Hünermund, P., & Bareinboim, E. (2023). Causal inference and data fusion
in econometrics. arXiv.
Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J., & Silva, R. (2022). Causal
machine learning: A survey and open problems. arXiv.
Kıcıman, E., Ness, R., Sharma, A., & Tan, C. (2023). Causal Reasoning and
Large Language Models: Opening a New Frontier for Causality. arXiv.
Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., & Gavves, E.
(2022). CITRIS: Causal identifiability from temporal intervened
sequences. In International Conference on Machine Learning (pp. 13557–
13603). PMLR.
Reisach, A.G., Seiler, C., & Weichwald, S. (2021). Beware of the Simulated
DAG! Varsortability in Additive Noise Models. arXiv.
Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal,
A., & Bengio, Y. (2021). Toward causal representation learning.
Proceedings of the IEEE, 109(5), 612–634.
Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R., & Von
Kügelgen, J. (2022). Active Bayesian Causal Inference. arXiv.
Vowels, M. J., Camgoz, N. C., & Bowden, R. (2022). D’ya like dags? A
survey on structure learning and causal discovery. ACM Computing
Surveys, 55(4), 1–36.
Willig, M., Zečević, M., Dhami, D. S., & Kersting, K. (2023). Causal Parrots:
Large Language Models May Talk Causality But Are Not Causal [ACM
preprint].
Index
As this ebook edition doesn't have fixed pagination, the page numbers
below are hyperlinked for reference only, based on the printed edition of
this book.
A
abduction 31
acyclic graph
assignment operator 20
associational relationships 18
associative learning 5
assumptions 74, 77, 78
B
back-door criterion 33, 97, 104
base-learners 192
Bayesian 72
beliefs 72
C
Carnegie Mellon University (CMU) 330
CATENets 277
experiments 277-284
CausalBert 292
in code 293-297
causal conclusion
example 9-11
assumptions 76, 77
discovering 328
faithfulness 328
families 329
methods 154
minimalism 329
minimality 328
causal effect
rule 104
causaLens 404
advantages 242
overflow 242
causal graphs
causal discovery 67
expert knowledge 67
sources 66
causal inference 73, 74, 328
challenges 152
conditions 74-76
causality
child interaction 5, 9
confounding 6-9
history 4, 5
need for 5
implementing 403-405
causal knowledge
habituation 316
sources 316
violation of expectation (VoE) 317
causal minimality 77
condition 77
causal ML
benefits 406
future 405
causal models
validation 135
defining 400
causation 26, 27
Causica
graphical representation 86
chain dataset
generating 87, 88
ChatGPT 286
counterfactual reasoning 287
coefficients 38
graphical representation 86
collider dataset
generating 89, 90
conditional average treatment effect (CATE) 173, 189, 219, 274, 387, 400,
406
conditional independence 73
conditioning 5
conditions 74
confounding variable 6
connected graphs
correlation 26, 27
CORTH 393
counterfactual reasoning 17
counterfactuals 10, 28
computing 30, 31
example 28, 29
fundamental problem of causal inference 30
covariates 44, 45
control-or-not-to-control 48
heuristics, controlling 45
scenarios 45-48
cyclic graph
cyclic SCMs 68
D
data
CCANM 392
CORTH 393
data-generating process 19
modules 381-384
training 384-386
deep learning
dependent variable 43
difference-in-differences 154
directed acyclic graph (DAG) 64, 98, 74, 127, 175, 216, 291, 324, 329, 374
limitations 66
directed graphs
disconnected graphs
do-calculus 119
rules 119
technique 154
versus doubly robust (DR) methods 240, 241
benefits 218
options 224
value 218
E
EconML
edges 55
ELMo 286
endogenous variable 19
equal estimates
estimators 102
exchangeability 161
exogenous variable 19
expert knowledge 67
encoding 366
F
faithfulness assumption 64, 76, 328, 388
problems 328
falsification 318
implementing 388-391
graphical representation 86
fork dataset
generating 88, 89
frequentist 72
evaluating 111
fully-connected graph 57
G
Gaussian processes (gp) 332
gCastle 331
GCM API
example 146-148
GES algorithm
implementing, in gCastle 348
scoring 347
GOLEMs 363
comparison 363-365
GOLEMs 363
NOTEARS 362
reference link 61
benefits 321
graph representations
adjacency matrix 58-60
graphs 55
in Python 60-63
representations 58
types 56
graphs, types
grapl-causal library
H
habituation 316
data 245-247
data, testing 248-250
hippocampus 109
hyperparameter tuning
with DoWhy 234-238
I
identifiability 152, 153
immoralities 92
independence structure 46
indicator function 21
conditions 120
techniques 156
interaction 38
intervention 6, 17, 23
formula 187
K
kernel-based conditional independence (KCI) 390
L
Ladder of Causation 274
LinearDRLearner 222
Linear Non-Gaussian Acyclic Model (LiNGAM) algorithm 330
geometric interpretation 42
LiNGAM 355-357
M
machine learning method 33
marginal probability 72
example 85
objective 295
reference link 72
matching 174
types 174
meta-learners 188
meta-SCM 408
minimalism 329
Minkowski distance 174
multiple regression 38
N
natural direct effect (NDE) 293
challenges 285-288
NetworkX
nodes 55
no hidden confounding 77
noise variable 19
cyclic SCMs 68
dynamical systems 67
non-linear associations 38
null hypothesis 41
types 41
O
observational data 37
observational equivalence 49
Ockham’s razor 77
orthogonality 72
overlap 157
P
partially linear model 230
PC algorithm 341
PC-stable 345
Pearson's r 88
Playtika 405
polysemy 286
Popperian 137
example 158-160
power rule 52
probability 72
pseudo-populations 186
p-values 41
pydot
reference link 61
PyMC 304
Python
graphs 60-63
PyTorch 277
Q
Qini coefficient 257
quasi-experiments 297
R
random graphs 332
randomization 17
RealCause 157
regression models
fitting 90-92
regularization 229
reproducibility, PyTorch
S
sample size
methodology 154
scale-invariant 359
scatterplots 88
GES 347
Simpson’s paradox 11
skeleton 85
implementing 189
vulnerabilities 199
SNet 276
architecture 276
Spotify 405
spuriousness 165
statistical control 92
statistical significance 41
statsmodels 39
reference link 39
structural causal model (SCM) 15, 18, 46, 49, 112, 329, 373, 405
structural model 49
survivorship bias 166-168
in code 303-308
logic 298-302
synthetic data
in gCastle 331
T
tabula rasa 347
implementing 225
TensorDict 376
formula 201
implementing 202-204
limitation 200
U
undirected graphs
unweighted graph
V
variational autoencoder (VAE) 392
W
weighted graph
word2vec 286
X
X-Learner 204
implementing 208-210
reconstructing 205-207
Y
Yule-Simpson effect 11