9-622-111
REV: JULY 20, 2022

IAVOR BOJINOV

MICHAEL PARZEN

PAUL J. HAMILTON

Causal Inference

Causal inference is a vast field that seeks to address questions relating causes to effects.¹ We more
formally define causal inference as the study of how a treatment (i.e., action or intervention) affects
outcomes of interest relative to an alternate treatment. Often, one of the treatments represents a
baseline or status quo: it is then called a control. Academics have used causal reasoning for over a
century to establish many scientific findings we now consider as facts. In the past decade, there has
been a rapid increase in the adoption of causal thinking by firms, and it is now an integral part of data
science. Causal inference can be used to answer such questions as:
• What is the effect on the throughput time (outcome) of introducing a new drill to a production
line (treatment) relative to the current process (control)?

• What is the effect of changing the text on the landing page’s button (treatment) on the
clickthrough rate (outcome), relative to the current text (control)?

• Which of two hospital admission processes (treatment 1 vs treatment 2) leads to better health
results (outcome)?

The most reliable way to establish causal relationships is to run a randomized experiment. Different
fields have different names for these, including A/B tests, clinical trials, and randomized controlled trials.
A randomized experiment involves randomly assigning subjects (e.g., customers, divisions,
companies) to receive either a treatment or a control intervention. The effectiveness of the treatment is
then assessed by contrasting the outcomes of the treated subjects with the outcomes of the control subjects.

The simple idea of running experiments has had a profound impact on how managers make
decisions, as it allows them to discern their customers’ preferences, evaluate their initiatives, and
ultimately test their hypotheses. Experimentation is now an integral part of the product development
process at most technology companies. It is increasingly being adopted by non-technology companies
as well, as they recognize that experimentation allows managers to continuously challenge their
working hypotheses and perform pivots that ultimately lead to better innovations.

1 For a browser-based version of this note with integrated R code, see the Causal Inference chapter under the Inference module of
dsm.business. Sections marked with (§) contain additional content not covered in this note.

Professor Iavor Bojinov, Senior Lecturer Michael Parzen, and Research Associate Paul J. Hamilton prepared this note as the basis for class
discussion.

Copyright © 2022 President and Fellows of Harvard College. To order copies or request permission to reproduce materials, call 1-800-545-7685,
write Harvard Business School Publishing, Boston, MA 02163, or go to www.hbsp.harvard.edu. This publication may not be digitized, photocopied,
or otherwise reproduced, posted, or transmitted, without the permission of Harvard Business School.

This document is authorized for educator review use only by dr ishfaq khan, HE OTHER until Jul 2025. Copying or posting is an infringement of copyright. [email protected] or
617.783.7860

Unfortunately, we cannot always run experiments because of ethical concerns, high costs, or an
inability to control the random assignment directly. Luckily, this is a challenge that academics have
grappled with for a long time, and they have developed many different strategies for identifying causal
effects from non-experimental (i.e., observational) data. For example, we know that smoking causes
cancer even though no one ever ran a randomized experiment to measure the effect of smoking.
However, it is important to understand that causal claims from observational data are inherently less
reliable than claims derived from experimental evidence and can be subject to severe biases.

Observational Studies
The vast majority of statistics courses begin and end with the premise that correlation is not
causation. However, they rarely explain why we cannot simply assume that a correlation reflects a
causal relationship.

Let us consider a simple example. Imagine you are working as a data scientist for Musicfi, a firm
that offers on-demand music streaming services. To keep things simple, let’s assume the firm has two
account types: a free account and a premium account. Musicfi’s main measure of customer engagement
is total streaming minutes, which records how many minutes each customer spends on the service per
day. As a data scientist, you want to understand your customers, so you decide to perform a simple
analysis that compares the total streaming minutes across the two account types. To put this in causal
terms:

• Our treatment is having a premium account;

• Our control is having a free account; and

• Our outcome is total streaming minutes.


Suppose we have a random sample of 500 customers, the first few observations of which are shown
below.

Table 1 Observational Musicfi data

Age   AccountType   StreamingMinutes
33    Premium       63
38    Premium       61
24    Free          82
28    Premium       72
61    Premium       38

Source: Casewriter.

First, let’s examine a side-by-side boxplot that compares the streaming minutes of free and premium
customers.

Figure 1 Boxplot of streaming minutes for free and premium users

Source: Casewriter.

From the figure, we can see that the customers with premium accounts had lower total streaming
minutes than customers with free accounts! The average in the premium group was 71 minutes
compared to 80 minutes in the free group, meaning that (on average) customers with a premium
account listened to less music. We can use a t-test to check if this observed difference is statistically
significant:
##  Welch Two Sample t-test
##
## data:  musicfi$StreamingMinutes by musicfi$AccountType
## t = 11.774, df = 444.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.52599 10.54174
## sample estimates:
##    mean in group Free mean in group Premium
##              80.42205              71.38819
Suppose we take the above results at face value. In that case, we might incorrectly conclude that
having a premium account reduces customer engagement. But this seems unlikely! Our general
understanding suggests that buying a premium account should have the opposite effect. So what is
going on? Well, we are likely falling victim to what is often called selection bias. Selection bias occurs
when the treatment group is systematically different from the control group before the treatment has
occurred, making it hard to disentangle differences due to the treatment from those due to the
systematic difference between the two groups.

Mathematically, we can model selection bias as a third variable—often called a confounding
variable—associated with both a unit’s propensity to receive the treatment and that unit’s outcome. In
our Musicfi example, this variable could be the age of the customer: younger customers tend to listen
to more music and are less likely to purchase a premium account. Age is therefore a confounding
variable: because it is associated with both account type and streaming minutes, a simple comparison
of the two groups mixes the effect of age with the effect of the premium account.
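
A small simulation can make this mechanism concrete. The sketch below is illustrative only: the data-generating process (age range, purchase probability, noise level, and a true premium effect of +1.5 minutes) is assumed for the example and is not the casewriter's data. Even though premium truly increases listening in this simulated world, the naive comparison comes out negative because premium users are older.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 1.5  # assumed causal effect of premium on daily minutes


def simulate_customer():
    age = rng.uniform(15, 65)
    # Assumption: older customers are more likely to buy premium.
    premium = rng.random() < (age - 10) / 60
    # Assumption: streaming falls with age; premium adds TRUE_EFFECT minutes.
    minutes = 100 - 1.06 * age + TRUE_EFFECT * premium + rng.gauss(0, 3)
    return age, premium, minutes


customers = [simulate_customer() for _ in range(5000)]
premium_minutes = [m for _, p, m in customers if p]
free_minutes = [m for _, p, m in customers if not p]

naive_difference = statistics.mean(premium_minutes) - statistics.mean(free_minutes)
print(f"True effect of premium: {TRUE_EFFECT:+.1f} minutes")
print(f"Naive premium - free difference: {naive_difference:+.1f} minutes")
```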

To see this, we can compare how age is related to total streaming minutes and the account type.
First, let’s create a scatter plot of age and streaming minutes.

Figure 2 Scatter plot of age and streaming minutes

Source: Casewriter.

If we wanted to adjust for the age variable (assuming it is observed), we could run a linear
regression with both account type and age as independent variables:

## Call:
## lm(formula = StreamingMinutes ~ AccountType + Age, data = musicfi)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1032 -2.2259  0.0901  2.0055  8.6514
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        100.57573    0.41028 245.138  < 2e-16 ***
## AccountTypePremium   1.51568    0.33928   4.467 9.82e-06 ***
## Age                 -1.06136    0.01904 -55.736  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 497 degrees of freedom
## Multiple R-squared:  0.8927, Adjusted R-squared:  0.8923
## F-statistic:  2068 on 2 and 497 DF,  p-value: < 2.2e-16

From this output, we see that the coefficient on account type is positive! This means that after
controlling for age, premium users actually listen to more music, on average. Including age in the
regression removes its confounding effect on the relationship between account type and streaming
minutes.
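
The adjustment can be reproduced in miniature with a hand-rolled least-squares fit. The sketch below is an illustration under assumed values, not the note's analysis: it simulates data in which age drives both account type and listening (true premium effect of +1.5 minutes, chosen for the example) and solves the normal equations directly. The helper `ols` is hypothetical; in practice one would use a regression routine from a statistics package.

```python
import random

rng = random.Random(1)
TRUE_EFFECT = 1.5  # assumed effect of premium, chosen for illustration

rows, y = [], []
for _ in range(5000):
    age = rng.uniform(15, 65)
    premium = 1.0 if rng.random() < (age - 10) / 60 else 0.0
    rows.append([1.0, premium, age])  # intercept, treatment, confounder
    y.append(100 - 1.06 * age + TRUE_EFFECT * premium + rng.gauss(0, 3))


def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


intercept, premium_coef, age_coef = ols(rows, y)
print(f"premium coefficient: {premium_coef:.2f}, age coefficient: {age_coef:.2f}")
```

Because age is the only confounder in this simulated world, including it recovers a premium coefficient close to the assumed effect; with real data there is no such guarantee.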

However, even after controlling for age, can we be confident in interpreting this coefficient as a causal
effect? The answer is still most likely no. That is because there may exist more confounding variables
that are not a part of our data set; these are known as unobserved confounders. Even if there were no
unobserved confounders, there are much more robust methods for analyzing observational studies
than linear regression.

Randomized Experiments

Randomized experiments remove the selection problem and ensure that there are no confounding
variables (observed or unobserved). They do this by removing the individual’s opportunity to select
whether or not they receive the treatment. In the Musicfi example, suppose we ran an experiment with
500 participants where we randomly upgraded some free accounts to premium accounts. The first few
observations of this data set are shown in the table below. Now, we no longer have to adjust for age as
(on average) there will be no age difference between the treated and control subjects.

Table 2 Musicfi data from randomized experiment

Age   AccountType   StreamingMinutes
20    Premium       92
28    Free          88
20    Premium       89
24    Premium       86
22    Premium       91
15    Premium       93

Source: Casewriter.

Unlike the observational data from the previous section, there is now no significant difference in
age between the treatment and control groups.
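
The balancing property of randomization can be checked with a short simulation. The sketch below uses an assumed data-generating process (ages between 15 and 65, a true premium effect of +1.5 minutes); these values are illustrative and are not the casewriter's experiment. Because a coin flip, rather than the customer, decides who is upgraded, the age gap between groups is close to zero and the simple difference in means lands near the assumed effect.

```python
import random
import statistics

rng = random.Random(2)
TRUE_EFFECT = 1.5  # assumed effect of premium, chosen for illustration

ages = [rng.uniform(15, 65) for _ in range(20000)]
# Random assignment: a fair coin decides who is upgraded, regardless of age.
treated = [rng.random() < 0.5 for _ in ages]
minutes = [100 - 1.06 * age + TRUE_EFFECT * t + rng.gauss(0, 3)
           for age, t in zip(ages, treated)]


def mean_gap(values):
    """Mean among treated minus mean among control."""
    in_treatment = [v for v, t in zip(values, treated) if t]
    in_control = [v for v, t in zip(values, treated) if not t]
    return statistics.mean(in_treatment) - statistics.mean(in_control)


age_gap = mean_gap(ages)
effect_estimate = mean_gap(minutes)
print(f"Age gap (treated - control): {age_gap:+.2f} years")
print(f"Estimated effect: {effect_estimate:+.2f} minutes (true {TRUE_EFFECT:+.1f})")
```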

Figure 3 Boxplot of age for free and premium users from randomized experiment

Source: Casewriter.

Age is not a confounding variable as it is independent of the treatment assignment. We can now
directly attribute any differences in the outcome to the intervention. This allows us to conclude
that giving people a premium account increases streaming minutes. Below is the output of a hypothesis
test comparing streaming minutes for free and premium users, and the figure visualizes this difference
with a boxplot.
##  Welch Two Sample t-test
##
## data:  musicfiExp$StreamingMinutes by musicfiExp$AccountType
## t = -2.5057, df = 497.95, p-value = 0.01254
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.2728058 -0.2750465
## sample estimates:
##    mean in group Free mean in group Premium
##              87.55285              88.82677

Figure 4 Boxplot of streaming minutes for free and premium users from randomized experiment

Source: Casewriter.

The difference is relatively small (about 1.3 minutes on average), but the low p-value (0.013) indicates
that it is statistically significant.

Designing an Experiment

Now that we know why experimentation is necessary, let’s review the steps that businesses use to
design their experiments.

Generate a Hypothesis
The starting point of any experiment is to generate a hypothesis. For companies, these hypotheses
take the following form:

“If we [do this] then [this outcome] will increase/decrease by [this much].”

• [do this]: describes the treatment; most of the time, we implicitly assume that the control is the
current approach, and hence it is not explicitly mentioned.

• [this outcome]: describes the outcome we expect the treatment will affect.

• [this much]: describes our best guess at the likely effect of the treatment on the outcome. It is
essential to make a reasonable guess at what the effect will be; this guess will form the basis of
a calculation that will determine the number of people (sample size) necessary to detect an effect
of such a magnitude in an experiment.

Some examples:

• If we enlarge our “Buy now” button by 20% then total sales will increase by 5%.

• If we launch a simpler user interface on our home page then total sessions will increase by 10%.

• If we remove fake accounts then complaints will decrease by 2%.

Although the above examples focus on a single outcome, it is common to look at one to five primary
outcomes and possibly another ten to fifteen secondary (less important) outcomes. No matter the
number, however, it is crucial that these are specified before data collection.

Select the Study Population
After generating the hypothesis, we have to identify the group(s) of people who will take part in
the study; this is often called determining the inclusion/exclusion criteria. Choosing these criteria
depends on the context.

For the above examples we could consider the following populations:

• All U.S. cell phone users that visit the website.

• All customers using the company’s iOS mobile application with version XXX or above.

• All worldwide customers.

This step will provide us with an estimate of the overall population size. From here, we need to
determine how we are going to recruit people into our study. The next step explains how to assign
people to either treatment or control.

Define the Assignment Mechanism


The assignment mechanism is a probability distribution that specifies the likelihood of each person
in our study receiving a treatment. There are, of course, many possible assignment mechanisms for an
experiment on a population. Below, we introduce the two most popular mechanisms:

• Bernoulli Design: Each person is independently assigned with probability p to receive the
treatment and a probability 1 - p to receive the control. In other words, for every person, we
toss an independent coin with probability of heads = p. If we get heads, we assign them to
treatment, and if we get tails, we assign them to control.

• Completely Randomized Design: In this design, we specify how many people will be placed
in each condition. For example, we might specify that n₀ will get the control and n₁ will get the
treatment. Then, we randomly pick n₀ of our participants to receive the control and n₁ to receive
the treatment.

The main difference between a Bernoulli and a completely randomized design is that we guarantee
how many people receive the treatment in the latter. On the other hand, the Bernoulli is much easier to
implement if we don’t know exactly how many people will be in our study (e.g., people “arrive” one
after another).
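
The two mechanisms can be sketched in a few lines of Python. The function names and the 10-person example below are hypothetical, chosen only to illustrate the contrast: the Bernoulli design flips an independent coin per person, so the number treated varies from run to run, while the completely randomized design fixes that number in advance.

```python
import random


def bernoulli_design(n_people, p, rng):
    """Independently assign each person to treatment (1) with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n_people)]


def completely_randomized_design(n_people, n_treated, rng):
    """Assign exactly n_treated people to treatment (1), the rest to control (0)."""
    treated_ids = set(rng.sample(range(n_people), n_treated))
    return [1 if i in treated_ids else 0 for i in range(n_people)]


rng = random.Random(3)
bernoulli = bernoulli_design(10, 0.5, rng)
complete = completely_randomized_design(10, 5, rng)
print("Bernoulli:            ", bernoulli, "treated =", sum(bernoulli))
print("Completely randomized:", complete, "treated =", sum(complete))
```

Note that `bernoulli_design` can be applied one person at a time as people arrive, whereas `completely_randomized_design` needs the full list of participants up front.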

Power

After completing the steps in the previous section, businesses face an additional decision: how much
data to collect in their experiment. Of course, more data is always better, but there are usually economic
limitations that prevent one from collecting an arbitrarily large sample. Therefore, we face a trade-off:
large samples may be prohibitively expensive, but if our sample is too small we may not have enough
“power” to detect a significant difference between our treatment and control groups. That is, we may
not have enough data to reject the null when there truly is a difference between the two groups. Power
analysis allows us to calculate the minimum sample size required to detect a difference, given that one
really exists.

Using the language of hypothesis testing, Musicfi plans to test the following hypotheses, where μ₀
is the mean of the control group and μ₁ is the mean of the treatment group:

H₀: μ₀ = μ₁

Hₐ: μ₀ ≠ μ₁

Imagine that the alternative hypothesis (Hₐ) is true, i.e., the mean streaming minutes is different for
customers in the control and treatment groups. Even under this scenario, there is no guarantee that our
analysis will detect the difference because we have a sample of customers (fortunately a random
sample) and not the population of all possible customers. The question then becomes: what is the smallest
sample we can collect that will still provide a reasonable chance of detecting a difference if one exists?

To tackle this question, think through the following scenarios. What do you think would require
more data: (a) Detecting a difference in streaming time between free and premium users when the true
average difference is five minutes, or (b) Detecting a difference when the true average difference is only
two minutes?

Intuitively, the larger the true difference, the more likely that difference will manifest itself in our
random sample. Therefore, we would need more data in scenario (b) to pick up on the difference
than we would in scenario (a).
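
This intuition can be checked by simulation. The sketch below repeatedly draws hypothetical experiments and counts how often a two-sided test rejects the null. The group size (50 per group), the standard deviation (5.7 minutes), and the use of a large-sample z-test in place of a t-test are all assumptions made for illustration.

```python
import random
import statistics
from math import sqrt

NORMAL = statistics.NormalDist()


def simulated_power(true_diff, n_per_group, sd, n_sims=1000, alpha=0.05, seed=4):
    """Share of simulated experiments in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    critical = NORMAL.inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0, sd) for _ in range(n_per_group)]
        treatment = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        se = sqrt(statistics.variance(control) / n_per_group
                  + statistics.variance(treatment) / n_per_group)
        z = (statistics.mean(treatment) - statistics.mean(control)) / se
        rejections += abs(z) > critical
    return rejections / n_sims


# Assumed values: 50 customers per group and a standard deviation of 5.7 minutes.
power_five = simulated_power(true_diff=5, n_per_group=50, sd=5.7)
power_two = simulated_power(true_diff=2, n_per_group=50, sd=5.7)
print(f"Power with a 5-minute difference: {power_five:.2f}")
print(f"Power with a 2-minute difference: {power_two:.2f}")
```

With the same sample size, the five-minute difference is detected in nearly every simulated experiment, while the two-minute difference is missed in more than half of them.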

Let’s imagine that based on the business context, Musicfi does not care if the difference between
free and premium users is less than two minutes; in other words, if the true difference is two minutes
or less, Musicfi would consider that difference negligible. Therefore, they want to determine the
smallest possible sample size that still has a good chance of detecting a difference of two minutes or
greater. Note this implies that if the true difference is less than two minutes, the experiment will likely
not pick up on that difference. Determining the minimum detectable difference is an important step in
a power analysis, and one that requires input from managers who have business area expertise.

To conduct a power analysis, we need several pieces of information:

• The significance level of the test (α). We will use a significance level of 0.05.

• The power of the test (1 - β). This is the probability that our test will detect the difference given
that there is a difference in the population. A common choice for 1 - β is 0.8. Note that this implies
there is still a β = 20% chance our test will not detect the difference when a difference actually
exists.

• The treatment effect, defined as the difference between the two groups (μ₁ − μ₀). Of course,
we don’t know the true population treatment effect; this is the quantity we are trying to
estimate from the experiment. However, to determine how big our sample size should be, we
need a plausible guess of the smallest effect that we would want to detect in our study.
Typically, the estimate is based on historical data; we usually shrink the estimate closer to zero
because historical (observational) data is likely subject to unobserved confounding.

• The pooled standard deviation of the two groups, usually estimated from historical data.

For example, imagine that Musicfi would like to design an experiment that will have an 80% chance
of detecting a difference of two minutes or greater at a 5% significance level. This means that α is 0.05,
1 - β is 0.8, and μ₁ − μ₀ is 2. Assume that Musicfi estimates the pooled standard deviation from
historical data. Using this information, we can conduct a power analysis using functionality included
in nearly all statistical software packages. Shown below is the output from conducting a power analysis
with this data in the R programming language; the value d is the standardized effect size, i.e., the
two-minute difference divided by the pooled standard deviation.

##      Two-sample t test power calculation
##
##               n = 127.7748
##               d = 0.3518408
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group

These results indicate that (rounding up) we need at least 128 participants in each of the control and
treatment groups, for a total sample size of 256.
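
As a rough cross-check, the required sample size can be approximated by hand with the standard normal-approximation formula n ≈ 2(z₁₋α/₂ + z₁₋β)² / d² per group. The Python sketch below is not the calculation the R software performs (that one is based on the t distribution), so it is expected to come out slightly smaller than the reported 127.77; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (mean difference / pooled SD).
    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2


n = sample_size_per_group(effect_size=0.3518408)  # d taken from the output above
print(f"Approximate n per group: {n:.1f} (round up to {ceil(n)})")
```

The approximation lands within about one participant per group of the t-based result. The formula also shows why small effects are expensive to detect: halving the effect size roughly quadruples the required sample.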