01 - Introduction To Causality - Causal Inference For The Brave and True
Contents
Why Bother?
Data Science is Not What it Used to Be (or it Finally Is)
Answering a Different Kind of Question
When Association IS Causation
Bias
Key Ideas
References
Contribute
Why Bother?
First and foremost, you might be wondering: what's in it for me? Here is a taste of the kind of question causal inference answers: do tablets in the classroom improve students' test scores (here, scores on ENEM, Brazil's national exam)? Let's simulate some data and take a naive look:
import pandas as pd
import numpy as np
from scipy.special import expit
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

style.use("fivethirtyeight")

np.random.seed(123)
n = 100
# Simulate 100 schools: tuition is random, and schools with higher tuition
# are more likely to give tablets to their students.
tuition = np.random.normal(1000, 300, n).round()
tablet = np.random.binomial(1, expit((tuition - tuition.mean()) / tuition.std())).astype(bool)
# Scores rise with tuition and fall with tablets, then get rescaled to a 0-1000 range.
enem_score = np.random.normal(200 - 50 * tablet + 0.7 * tuition, 200)
enem_score = (enem_score - enem_score.min()) / enem_score.max()
enem_score *= 1000

data = pd.DataFrame(dict(enem_score=enem_score, Tuition=tuition, Tablet=tablet))

plt.figure(figsize=(6,8))
sns.boxplot(y="enem_score", x="Tablet", data=data).set_title('ENEM score by Tablet in Class')
plt.show()
When Association IS Causation

To get beyond simple intuition, let's first establish some notation. This will be our everyday
language to speak about causality. Think of it as the common tongue we will use to identify
other brave and true causal warriors, and that will compose our cry in the many battles to
come.
Let's call $T_i$ the treatment intake for unit $i$: $T_i = 1$ if unit $i$ received the treatment and $T_i = 0$ otherwise.

The treatment here doesn't need to be a medicine or anything from the medical field. Instead, it is just a term we will use to denote some intervention for which we want to know the effect. In our case, the treatment is giving tablets to students. As a side note, you might sometimes see $D$ instead of $T$ to denote the treatment.
The outcome $Y_i$ is our variable of interest. We want to know if the treatment has any influence on it. In our tablet example, it would be the academic performance.
Here is where things get interesting. The fundamental problem of causal inference is that
we can never observe the same unit with and without treatment. It is as if we have two
diverging roads and we can only know what lies ahead of the one we take. As in Robert Frost's poem:
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
To wrap our heads around this, we will talk a lot in terms of potential outcomes. They are
potential because they didn’t actually happen. Instead they denote what would have
happened in the case some treatment was taken. We sometimes call the potential outcome
that happened, factual, and the one that didn’t happen, counterfactual.
As for the notation, we use an additional subscript:

$Y_{0i}$ is the potential outcome for unit $i$ without the treatment.
$Y_{1i}$ is the potential outcome for the same unit $i$ with the treatment.

Sometimes you might see potential outcomes represented as functions $Y_i(t)$, so beware: $Y_{0i}$ could be written $Y_i(0)$ and $Y_{1i}$ could be written $Y_i(1)$. Here, we will use the subscript notation most of the time.
Back to our example: $Y_{1i}$ is the test score of student $i$ if their classroom has tablets. Whether this is or is not the case doesn't matter for $Y_{1i}$; it is the same regardless. If student $i$ gets the tablet, we can observe $Y_{1i}$. If not, we can observe $Y_{0i}$. Notice how, in this last case, $Y_{1i}$ is still defined; we just can't see it. In that case, it is a counterfactual potential outcome.
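If it helps, here is one way to picture this notation in code, using the numbers from a school example we will meet shortly. This is a deity's-eye sketch: in real data we could never fill in both y0 and y1.

import numpy as np

y0 = np.array([500, 600, 800, 700])  # potential outcomes without treatment
y1 = np.array([450, 600, 600, 750])  # potential outcomes with treatment
t = np.array([0, 0, 1, 1])           # treatment intake
y = np.where(t == 1, y1, y0)         # the factual outcomes: all we can ever observe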
With potential outcomes, we can define the individual treatment effect:
$$Y_{1i} - Y_{0i}$$
Of course, due to the fundamental problem of causal inference, we can never know the
individual treatment effect because we only observe one of the potential outcomes. For the
time being, let’s focus on something easier than estimating the individual treatment effect.
Instead, let's focus on the average treatment effect, which is defined as follows:

$$ATE = E[Y_1 - Y_0]$$

where $E[...]$ is the expected value. Another easier quantity to estimate is the average treatment effect on the treated:

$$ATT = E[Y_1 - Y_0 | T = 1]$$
Now, I know we can’t see both potential outcomes, but just for the sake of argument, let’s
suppose we could. Pretend that the causal inference deity is pleased with the many statistical
battles we fought and has rewarded us with godlike powers to see the potential alternative
outcomes. With that power, say we collect data on 4 schools. We know whether they gave tablets to their students and their scores on some annual academic test. Here, tablets are the treatment, so $T = 1$ if the school provides tablets to its kids, and $Y$ will be the test score.
pd.DataFrame(dict(
i= [1,2,3,4],
Y0=[500,600,800,700],
Y1=[450,600,600,750],
T= [0,0,1,1],
Y= [500,600,600,750],
TE=[-50,0,-200,50],
))
i Y0 Y1 T Y TE
0 1 500 450 0 500 -50
1 2 600 600 0 600 0
2 3 800 600 1 600 -200
3 4 700 750 1 750 50
The $ATE$ here would be the mean of the last column, that is, of the treatment effect:

$$ATE = (-50 + 0 - 200 + 50)/4 = -50$$

This would mean that tablets reduced the academic performance of students, on average, by 50 points. The $ATT$ here would be the mean of the last column when $T = 1$:

$$ATT = (-200 + 50)/2 = -75$$
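In pandas, both quantities are one-liners. Assuming we assign the god-mode table above to a variable (call it god_mode, a name of our choosing):

god_mode = pd.DataFrame(dict(
    i= [1,2,3,4],
    Y0=[500,600,800,700],
    Y1=[450,600,600,750],
    T= [0,0,1,1],
    Y= [500,600,600,750],
    TE=[-50,0,-200,50],
))

# ATE: mean treatment effect over all units
ate = god_mode["TE"].mean()                  # -50.0
# ATT: mean treatment effect over the treated units only
att = god_mode.query("T == 1")["TE"].mean()  # -75.0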
This is saying that, for the schools that were treated, the tablets reduced the academic
performance of students, on average, by 75 points. Of course we can never know this. In
reality, the table above would look like this:
pd.DataFrame(dict(
i= [1,2,3,4],
Y0=[500,600,np.nan,np.nan],
Y1=[np.nan,np.nan,600,750],
T= [0,0,1,1],
Y= [500,600,600,750],
TE=[np.nan,np.nan,np.nan,np.nan],
))
i Y0 Y1 T Y TE
0 1 500.0 NaN 0 500 NaN
1 2 600.0 NaN 0 600 NaN
2 3 NaN 600.0 1 600 NaN
3 4 NaN 750.0 1 750 NaN
This is surely not ideal, you might say, but can't I still take the mean of the treated and compare it to the mean of the untreated? In other words, can't I just compute $ATE = (600 + 750)/2 - (500 + 600)/2 = 125$?
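In code, that naive comparison would be (with obs, a variable name of our choosing, holding the observable table above):

obs = pd.DataFrame(dict(
    i= [1,2,3,4],
    Y0=[500,600,np.nan,np.nan],
    Y1=[np.nan,np.nan,600,750],
    T= [0,0,1,1],
    Y= [500,600,600,750],
))

# Difference in observed means: association, not causation
naive = obs.query("T == 1")["Y"].mean() - obs.query("T == 0")["Y"].mean()  # 125.0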
Well, no! Notice how different the results are. You've just committed the gravest sin of mistaking association for causation. To understand why, let's look into the main enemy of causal inference.
Bias
Bias is what makes association different from causation. Fortunately, it can be easily
understood with our intuition. Let’s recap our tablets in the classroom example. When
confronted with the claim that schools that give tablets to their kids achieve higher test
scores, we can refute it by saying those schools will probably achieve higher test scores
anyway, even without the tablets. That is because they probably have more money than the
other schools; hence they can pay better teachers, afford better classrooms, etc. In other
words, it is the case that treated schools (with tablets) are not comparable with untreated
schools.
Using potential outcome notation, this is to say that the $Y_0$ of the treated is different from the $Y_0$ of the untreated. Remember that the $Y_0$ of the treated is counterfactual. We can't observe it, but we can reason about it. In this particular case, we can even leverage our understanding of how the world works to go even further. We can say that, probably, the $Y_0$ of the treated is bigger than the $Y_0$ of the untreated schools. That is because schools that can afford to give tablets to their kids can also afford other factors that contribute to better test scores. Let this sink in for a moment. It takes some time to get used to talking about potential outcomes. Reread this paragraph and make sure you understand it.
With this in mind, we can show with elementary math why association is not causation. Association is measured by $E[Y|T=1] - E[Y|T=0]$. In our example, this is the average test score for the schools with tablets minus the average test score for those without them. On the other hand, causation is measured by $E[Y_1 - Y_0]$.
Let's take the association measurement and replace the observed outcomes with the potential outcomes to see how they relate. For the treated, the observed outcome is $Y_1$; for the untreated, it is $Y_0$. Now, let's add and subtract $E[Y_0|T=1]$. This is a counterfactual outcome; it tells what the outcome of the treated would have been, had they not received the treatment.

$$E[Y|T=1] - E[Y|T=0] = E[Y_1|T=1] - E[Y_0|T=0] + E[Y_0|T=1] - E[Y_0|T=1]$$
Finally, we reorder the terms, merge some expectations, and lo and behold:

$$E[Y|T=1] - E[Y|T=0] = \underbrace{E[Y_1 - Y_0|T=1]}_{ATT} + \underbrace{E[Y_0|T=1] - E[Y_0|T=0]}_{BIAS}$$
This simple piece of math encompasses all the problems we will encounter in causal questions. I cannot stress how important it is that you understand every aspect of it. If you're ever forced to tattoo something on your arm, this equation should be a good candidate for it. It's something to hold onto dearly and understand what it is telling us, like some sacred text that can be interpreted 100 different ways. In fact, let's take a deeper look and break it down into some of its implications. First, this equation tells us why association is not causation. As we can see, association is equal to the treatment effect on the treated plus a bias term. The bias is given by how the treated and control groups differ before the treatment, that is, had neither of them received the treatment. We can now say precisely why we are suspicious when someone tells us that tablets in the classroom boost academic performance. We think that, in this example, $E[Y_0|T=0] < E[Y_0|T=1]$, that is, schools that can afford to give tablets to their kids are better than those that can't, regardless of the tablet treatment.
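To make this concrete, here is a minimal simulation of the decomposition (the numbers are illustrative, not from our schools example). Playing causal inference deity, we generate both potential outcomes and check that association equals ATT plus bias:

# Illustrative simulation: wealth confounds tablets and test scores
np.random.seed(1)
n = 10000
wealth = np.random.normal(0, 1, n)                         # hidden confounder
t = (wealth + np.random.normal(0, 1, n) > 0).astype(int)   # richer schools get tablets more often
y0 = 600 + 100 * wealth + np.random.normal(0, 10, n)       # score without tablets
y1 = y0 - 50                                               # true effect: tablets cost 50 points
y = np.where(t == 1, y1, y0)                               # all we observe in reality

association = y[t == 1].mean() - y[t == 0].mean()
att = (y1 - y0)[t == 1].mean()                             # -50.0, visible only to the deity
bias = y0[t == 1].mean() - y0[t == 0].mean()               # positive: treated started out better
print(association, att + bias)                             # the two match: association = ATT + bias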
Why does this happen? We will talk more about that once we get to confounding, but for now, you can think of bias arising because many things we can't control are changing together with the treatment. As a result, the treated and untreated schools don't differ only on the tablets. They also differ on tuition cost, location, teachers… For us to say that tablets in the classroom increase academic performance, we would need schools with and without them to be, on average, similar to each other.
plt.figure(figsize=(10,6))
sns.scatterplot(x="Tuition", y="enem_score", hue="Tablet", data=data, s=70).set_title('ENEM score by Tuition Cost')
plt.show()
Now that we understand the problem, let's look at the solution. We can also say what would be necessary to make association equal to causation. If $E[Y_0|T=0] = E[Y_0|T=1]$, then association IS CAUSATION! Understanding this is not just remembering the equation; there is a strong intuitive argument here. To say that $E[Y_0|T=0] = E[Y_0|T=1]$ is to say that treatment and control groups are comparable before the treatment. Or, had the treated not been treated, and could we observe their $Y_0$, their outcome would be the same as that of the untreated.
Also, if the treated and the untreated only differ on the treatment itself, then $E[Y_0|T=0] = E[Y_0|T=1]$, the bias term vanishes, and the causal impact on the treated is the same as the association we observe:

$$
\begin{aligned}
E[Y_1 - Y_0|T=1] &= E[Y_1|T=1] - E[Y_0|T=1] \\
&= E[Y_1|T=1] - E[Y_0|T=0] \\
&= E[Y|T=1] - E[Y|T=0]
\end{aligned}
$$
Additionally, if the treated and the untreated only differ on the treatment itself, we also have $E[Y_1|T=0] = E[Y_1|T=1]$, that is, both treated and control groups respond similarly to the treatment. Now, besides being exchangeable prior to the treatment, treated and untreated are also exchangeable after the treatment. In this case, $E[Y_1 - Y_0|T=1] = E[Y_1 - Y_0|T=0]$ and

$$E[Y|T=1] - E[Y|T=0] = ATT = ATE$$
Once again, this is so important that I think it is worth going over it again, now with pretty
pictures. If we make a simple average comparison between the treated and the untreated groups, this is what we get (blue dots didn't receive the treatment, that is, the tablet):
Notice how the difference in outcomes between the two groups can have two causes:
1. The treatment effect. The increase in test scores is caused by giving kids tablets.
2. Some of the differences in test scores can be due to tuition prices buying better education. In this case, treated and untreated differ because the treated have much higher tuition prices. Differences like this, between the treated and the untreated, are NOT the treatment itself.
The individual treatment effect is the difference between the unit’s outcome and another
theoretical outcome that the same unit would have if it got the alternative treatment. The
actual treatment effect can only be obtained if we have godlike powers to observe the
potential outcome, like the left figure below. These are the counterfactual outcomes and are
denoted in a light color.
In the right plot, we depicted the bias that we've talked about before. We get the bias if we set everyone to not receive the treatment. In this case, we are only left with the $Y_0$ potential outcome. Then, we see how the treated and untreated groups differ. If they do, something other than the treatment is causing the treated and untreated to be different. This something is the bias, and it is what shadows the actual treatment effect.
Now, contrast this with a hypothetical situation where there is no bias. Suppose that tablets
are randomly assigned to schools. In this situation, rich and poor schools have the same
chance of receiving the treatment. Treatment would be well distributed across the tuition
spectrum.
In this case, the difference in the outcome between treated and untreated IS the average
causal effect. This happens because there is no other source of difference between treatment
and untreated other than the treatment itself. All the differences we see must be attributed to
it. Another way to say this is that there is no bias.
If we set everyone to not receive the treatment, so that we only observe the $Y_0$s, we would find no difference between the treated and the untreated groups.
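We can check this by continuing the hypothetical simulation from before, now assigning tablets at random instead of by wealth:

# Same potential outcomes (y0, y1) as before, but treatment is now random
t_random = np.random.binomial(1, 0.5, n)
y_random = np.where(t_random == 1, y1, y0)

association_random = y_random[t_random == 1].mean() - y_random[t_random == 0].mean()
print(association_random)   # close to -50, the true ATE: the bias term is now roughly zero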
Key Ideas
So far, we’ve seen that association is not causation. Most importantly, we’ve seen precisely
why it isn’t and how we can make association be causation. We’ve also introduced the
potential outcome notation as a way to wrap our heads around causal reasoning. With it, we
saw statistics as two possible realities: one in which the treatment is given and another in
which it is not. But unfortunately, we can only measure one of them, which is where the fundamental problem of causal inference lies.
We will see some basic techniques to estimate the causal effect, starting with the gold standard of a randomized trial. I'll also review some statistical concepts as we go. I'll end with
a quote often used in causal inference classes, taken from a kung-fu series:
‘What happens in a man’s life is already written. A man must move through life as
his destiny wills.’ -Caine
‘Yes, yet each man is free to live as he chooses. Though they seem opposite, both
are true.’ -Old Man
References
I like to think of this book as a tribute to Joshua Angrist, Alberto Abadie and Christopher
Walters for their amazing Econometrics class. Most of the ideas here are taken from their
classes at the American Economic Association. Watching them is what’s keeping me sane
during this tough year of 2020.
Cross-Section Econometrics
Mastering Mostly Harmless Econometrics
I'd also like to reference the amazing books from Angrist. They have shown me that
Econometrics, or ‘Metrics as they call it, is not only extremely useful but also profoundly fun.
Mostly Harmless Econometrics
Mastering ‘Metrics
My final reference is Miguel Hernan and Jamie Robins’ book. It has been my trustworthy
companion in the most thorny causal questions I had to answer.
Causal Inference Book
The beer analogy was taken from the awesome Stock Series, by JL Collins. This is an absolute must-read for all of those wanting to learn how to productively invest their money.
Contribute
Causal Inference for the Brave and True is an open-source material on causal inference, the
statistics of science. Its goal is to be accessible monetarily and intellectually. It uses only free
software based on Python. If you found this book valuable and want to support it, please go to
Patreon. If you are not ready to contribute financially, you can also help by fixing typos,
suggesting edits, or giving feedback on passages you didn’t understand. Go to the book’s
repository and open an issue. Finally, if you liked this content, please share it with others who
might find it helpful and give it a star on GitHub.
By Matheus Facure Alves
© Copyright 2022.