
Part A: Regression and causality

A2: Potential outcomes and RCTs

Kirill Borusyak
ARE 213 Applied Econometrics
UC Berkeley, Fall 2024

Outline

1 The concept of potential outcomes

2 Causal parameters and their identification via RCTs

3 Limitations of the Rubin causal model and alternatives

4 Causality or prediction?
Rubin causal model
Consider some population of units i
Each unit is observed in one of several treatment conditions Di ∈ D
▶ E.g. D ∈ {0, 1}: untreated and treated
Suppose we can imagine each unit under all possible conditions (in the same
period)
▶ Causality always requires specifying alternatives
▶ Corresponding potential outcomes are {Yi (d) : d ∈ D}
⋆ e.g. (Yi (0), Yi (1)) (equivalently written as (Y0i , Y1i ))
⋆ e.g. demand function
▶ Causal effects Yi (d′ ) − Yi (d) are defined by this abstraction
▶ Writing Yi (d) encodes a possibility that Di impacts Yi
▶ Realized outcome: Yi = Yi (Di )
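A minimal simulation sketch may help fix ideas (the outcome model and all numbers are invented for illustration): every unit has both potential outcomes, but the data reveal only the realized one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Both potential outcomes exist for every unit...
y0 = rng.normal(0, 1, n)            # Y_i(0)
y1 = y0 + 2 + rng.normal(0, 1, n)   # Y_i(1): individual effect 2 + noise

# ...but each unit is observed under only one condition
d = rng.integers(0, 2, n)           # D_i, here assigned at random
y = np.where(d == 1, y1, y0)        # realized outcome Y_i = Y_i(D_i)

print(y1.mean() - y0.mean())                 # infeasible: true ATE
print(y[d == 1].mean() - y[d == 0].mean())   # feasible difference in means
```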
What can be a cause/treatment?
Is it meaningful to say “She did not get this position because she is a woman” (example
from Imbens 2020)?

What can be a cause/treatment?
Imagining each unit under all possible conditions is non-trivial:

“No causation without [imagining] manipulation” (Holland & Rubin)

1. “She did not get this position because she is a woman” ✗

▶ Gender is an attribute, not a cause; same for race
▶ “She got an orchestra job because of a gender-blind audition” (cf. Goldin and
Rouse 2000) ✓

2. “She did well on the exam because she was coached by her teacher” (Holland
1986) ✓
▶ “She did well on the exam because she studied for it” (Holland 1986) ✗

SUTVA (1)
In writing Yi (di ) we implicitly imposed SUTVA (“stable unit treatment value
assumption”)
Most common meaning: no unmodeled interference
▶ I.e., treatment statuses of other units, d−i , do not affect Yi
▶ Frequently violated: e.g. vaccines and infectious disease; information and
technology adoption; equilibrium effects via prices
Allowing for interference, we’d write Yi (d1 , . . . , dN ) for the population of size N
▶ We may be interested in own-treatment effects Yi (d′i , d−i ) − Yi (di , d−i ) and various
spillover effects, e.g. Yi (di , 1, . . . , 1) − Yi (di , 0, . . . , 0)
No interference is an exclusion restriction: Yi (di , d−i ) = Yi (di , d′−i ) ≡ Yi (di ),
∀di , d−i , d′−i

Intermediate case: e.g. Yi (d⃗i ) for exposure mapping d⃗i = (di , ∑k∈Friends(i) dk )
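A sketch of such an exposure mapping (the network and the outcome equation are made up): each unit's outcome depends only on its own treatment and the number of its treated friends, which is exactly the exclusion restriction being imposed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
d = rng.integers(0, 2, n)              # treatment assignments
friends = rng.integers(0, 2, (n, n))   # hypothetical friendship indicator matrix
np.fill_diagonal(friends, 0)           # no self-friendship

# Exposure mapping: (own treatment, number of treated friends)
treated_friends = friends @ d

# Illustrative potential outcomes under this mapping: nothing else
# about d_{-i} matters
y = 1.0 * d + 0.3 * treated_friends + rng.normal(0, 1, n)
```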
SUTVA (2)
Additional meaning of SUTVA: D summarizes everything about the intervention
that is relevant for the outcome
Example 1: “She got a high wage because she studied for many years”
▶ In writing Yi (d), we implicitly assume that school quality does not matter
▶ To think through violations, we could start from Yi (years, quality)

Example 2: D = Herfindahl index of migration origins in a destination region,
capturing migrant diversity
▶ Assumes that this index summarizes everything about exposure to migration

Defining treatment variables is imposing a causal model. Don’t take it lightly!

Effects of causes vs. causes of effects
Statistical analysis focuses on effects of causes (treatments) rather than causes of
effects (outcomes)

Causes are not clearly defined (Holland 1986, p. 959)

Outline

1 The concept of potential outcomes

2 Causal parameters and their identification via RCTs

3 Limitations of the Rubin causal model and alternatives

4 Causality or prediction?
Common causal parameters (1)
We cannot learn the causal effect Yi (1) − Yi (0) for any particular unit
▶ “Fundamental problem of causal inference”: multiple potential outcomes are never
observed at once
▶ ... but we can sometimes learn some averages
Average treatment/causal effect: ATE = E [Yi (1) − Yi (0)]
▶ ATE = E [Yi (1)] − E [Yi (0)]
▶ Yi (1) − Yi (0) is never observed, but Yi (1) and Yi (0) each are, for some but not all units
▶ Causal inference can be understood as imputing missing data: e.g. from
E [Yi (1) | Di = 1] we try to learn E [Yi (1) | Di = 0] and thus E [Yi (1)] (see the sketch below)

Conditional average treatment effect E [Yi (1) − Yi (0) | Xi = x], for predetermined
covariates Xi
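The imputation view can be made concrete with a toy data matrix (values invented): each unit reveals exactly one potential outcome, and the other is missing.

```python
import numpy as np

d = np.array([1, 0, 1, 0])
y = np.array([3.1, 1.0, 2.7, 0.4])   # realized outcomes

# Each row reveals one potential outcome; NaN marks the missing one
y1 = np.where(d == 1, y, np.nan)     # Y_i(1), observed only if D_i = 1
y0 = np.where(d == 0, y, np.nan)     # Y_i(0), observed only if D_i = 0
print(np.column_stack([d, y0, y1]))
```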
Common causal parameters (2)

Average effect on the treated: ATT = E [Yi (1) − Yi (0) | Di = 1] (a.k.a. TOT, TT)

▶ Parameter depends on how selection into treatment happened

▶ Yields the aggregate effect of the treatment: Pop Size · P(Di = 1) · ATT

Average effect on the untreated: ATU = E [Yi (1) − Yi (0) | Di = 0]

All these parameters follow from the distribution of (Y(1), Y(0), D). But are they
identified from data on (Y, D)?

Identifying ATT & ATE

ATT = E [Y1 | D = 1] − E [Y0 | D = 1]


= (E [Y1 | D = 1] − E [Y0 | D = 0]) (Difference in means)
− (E [Y0 | D = 1] − E [Y0 | D = 0]) (Selection bias)

Thus, βOLS = E [Y | D = 1] − E [Y | D = 0] = ATT + Selection bias


▶ Selection bias = 0 iff Y0 is mean-independent of D
ATE = ATT iff (Y1 − Y0 ) is mean-independent of D
▶ Simple regression identifies ATE and ATT in a randomized control trial (RCT)
where (Y0 , Y1 ) ⊥
⊥ D by design
▶ Regression with any (fixed set of) predetermined controls X also identifies ATE by
FWL or OVB logic
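A simulation sketch of the decomposition (the selection rule and numbers are invented): when units with high Y0 are more likely to take the treatment, the difference in means equals the ATT plus a positive selection bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
y0 = rng.normal(0, 1, n)
y1 = y0 + 1.0                               # constant effect: ATT = ATE = 1

# Self-selection: high-Y(0) units are more likely to be treated
d = (y0 + rng.normal(0, 1, n) > 0).astype(int)
y = np.where(d == 1, y1, y0)

diff_means = y[d == 1].mean() - y[d == 0].mean()
att = (y1 - y0)[d == 1].mean()
sel_bias = y0[d == 1].mean() - y0[d == 0].mean()
print(diff_means, att + sel_bias)           # identical; bias is positive here
```

Replacing the selection rule with d = rng.integers(0, 2, n), i.e. an RCT, drives the selection-bias term to (approximately) zero.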
Connecting to linear models
With a binary treatment, the potential outcomes model implies

Yi = Y0i (1 − Di ) + Y1i Di = β0 + β1i Di + εi

where β0 = E [Y0 ], β1i = Y1i − Y0i and εi = Y0i − E [Y0 ]


With homogeneous effects, Yi = β0 + β1 Di + εi becomes a causal model where
Y1i − Y0i ≡ β1 (regardless of whether εi ⊥⊥ Di ; think IV)
With heterogeneous effects, we can rederive our result about RCTs: if (εi , β1i ) ⊥⊥ Di
and denoting µ = E [Di ],

βOLS = E [(Di − µ) Yi ] / Var [Di ] = Cov [Di , εi ] / Var [Di ] + E [Di (Di − µ) β1i ] / Var [Di ] = E [β1i ] ≡ ATE

(the first term vanishes by independence; the second equals E [β1i ] because E [Di (Di − µ)] = Var [Di ])
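This can be checked numerically; a minimal sketch with invented parameter values, keeping (εi , β1i ) independent of Di :

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
beta1 = rng.normal(2, 1, n)            # heterogeneous effects beta_1i
eps = rng.normal(0, 1, n)             # eps_i = Y_0i - E[Y_0]
d = rng.integers(0, 2, n)             # randomized: independent of (eps, beta1)
y = 1.0 + beta1 * d + eps

b_ols = ((d - d.mean()) * y).mean() / d.var()
print(b_ols, beta1.mean())            # both approx. 2: OLS recovers the ATE
```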

RCT with ordered or continuous treatments
Consider a RCT where D takes more than two values (e.g. different dosages)
D ⊥⊥ {Y(d)}d∈D =⇒ E [Y | D = d] = E [Y(d) | D = d] = E [Y(d)]
A saturated regression of Y on dummies for all values of D (or a nonparametric
regression with continuous D) traces the average structural function E [Y(d)]
A simple regression of Y on D identifies a convexly-weighted average of
∂E [Y(d)] /∂d (or its discrete version):
βOLS = ∫_{−∞}^{∞} ω(d̃) · ∂E [Y(d̃)] / ∂ d̃ dd̃, where ω(d̃) = Cov [1[D ≥ d̃], D] / Var [D]

or βOLS = ∑_{k=1}^{K} ωk · E [Y(dk ) − Y(dk−1 )] / (dk − dk−1 ), where ωk = (dk − dk−1 ) · Cov [1 [D ≥ dk ] , D] / Var [D]
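The discrete weights ωk are straightforward to compute; a sketch under an invented dosage distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
support = np.array([0, 1, 2, 4])
d = rng.choice(support, size=1_000_000, p=[0.4, 0.3, 0.2, 0.1])

var_d = d.var()
weights = []
for k in range(1, len(support)):
    ind = (d >= support[k]).astype(float)            # 1[D >= d_k]
    cov = (ind * d).mean() - ind.mean() * d.mean()   # Cov[1[D >= d_k], D]
    weights.append((support[k] - support[k - 1]) * cov / var_d)

print(np.round(weights, 3))   # all weights are nonnegative
print(sum(weights))           # and they sum to one: summing
                              # (d_k - d_{k-1}) 1[D >= d_k] over k gives D - d_0
```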

The importance of convex weighting
Imagine E [Y(0)] = 0, E [Y(1)] = 3, E [Y(2)] = 4
▶ Higher dosage is always good (on average)
OLS of Y on D in an RCT will produce a coefficient between E [(Y(2) − Y(1)) / (2 − 1)] = 1 and E [(Y(1) − Y(0)) / (1 − 0)] = 3

An estimator without convex weighting may not: e.g.

2 · E [Y(2) − Y(1)] − 1 · E [Y(1) − Y(0)] = −1,

as if higher treatment is bad


▶ Convex weighting avoids such sign reversals; it defines a weakly causal estimand.
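Plugging in the numbers above, with a dosage distribution invented for this sketch (uniform on {0, 1, 2}), both OLS weights equal 1/2, so the coefficient lands at 0.5 · 3 + 0.5 · 1 = 2:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
d = rng.integers(0, 3, n)             # dosage uniform on {0, 1, 2}
means = np.array([0.0, 3.0, 4.0])     # E[Y(0)], E[Y(1)], E[Y(2)] from the slide
y = means[d] + rng.normal(0, 1, n)

b_ols = ((d - d.mean()) * y).mean() / d.var()
print(b_ols)                          # approx. 2, inside [1, 3] as promised

# The non-convex weighting from the slide flips the sign:
print(2 * (means[2] - means[1]) - 1 * (means[1] - means[0]))   # -1.0
```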

Distribution of gains
Heckman, Lalonde, Smith (1999) list some other interesting parameters:

1. How widely are the gains distributed?

a. The proportion of people taking the program who benefit from it:
P(Y1 > Y0 | D = 1)

b. Median gains among participants (and other quantiles)

2. Does the program help the lower tail?

a. Distribution of gains by untreated value: e.g. E [Y1 − Y0 | Y0 = ȳ, D = 1]

b. Increase in % above a threshold due to a policy:


P(Y1 > ȳ | D = 1) − P(Y0 > ȳ | D = 1)

Distribution of gains: Identification
Does an RCT identify these other parameters, e.g. median gains?

Not without extra restrictions!


E.g. imagine an RCT where Y takes values 0, 1, 2 with equal prob. in both treated
and control groups
This is consistent with (Y0 , Y1 ) taking values (0, 0) , (1, 1) , (2, 2) with equal prob.
▶ No causal effect for anyone. Median gain = 0

Or with (Y0 , Y1 ) taking values (0, 1) , (1, 2) , (2, 0) with equal prob.
▶ Median gain = 1
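The two couplings above can be written down directly (a sketch of the three-point example): the marginal distributions, and hence any RCT data, are identical, yet the median gains differ.

```python
import numpy as np

vals = np.array([0, 1, 2])            # each unit type has probability 1/3

# Coupling A: Y1 = Y0 for everyone, so every gain is 0
y0_a, y1_a = vals, vals
# Coupling B: (0,1), (1,2), (2,0), i.e. gains of (1, 1, -2)
y0_b, y1_b = vals, (vals + 1) % 3

print(sorted(y1_a), sorted(y1_b))     # same marginals for Y1 (and for Y0)
print(np.median(y1_a - y0_a))         # median gain = 0
print(np.median(y1_b - y0_b))         # median gain = 1
```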

Exception: P(Y1 > ȳ | D = 1) − P(Y0 > ȳ | D = 1) is identified — how?


Outline

1 The concept of potential outcomes

2 Causal parameters and their identification via RCTs

3 Limitations of the Rubin causal model and alternatives

4 Causality or prediction?
Criticisms by Heckman and Vytlacil 2007
1. Estimated effects cannot be transferred to new environments (limited external
validity) and to new programs never previously implemented
▶ Interventions are black boxes, with little attempt to unbundle their components
▶ Mechanisms cannot be pinned down
▶ Knowledge does not cumulate across studies (contrast with estimates of a labor
supply elasticity — a structural parameter)
⋆ Counterpoint from Angrist and Pischke (2010): “Empirical evidence on any given
causal effect is always local, derived from a particular time, place, and research
design. Invocation of a superficially general structural framework does not make the
underlying variation more representative. Economic theory often suggests general
principles, but extrapolation to new settings is always speculative. A constructive
response to the specificity of a given research design is to look for more evidence, so
that a more general picture begins to emerge.”

Criticisms by Heckman and Vytlacil 2007 (cont.)
2. Estimands need not be relevant even to analyze the observed policy
▶ Informative about whether to scrap the program entirely (ATT) and whether to
extend it by forcing it on everyone not yet covered (ATU)
▶ But not whether to extend/shrink it on the margin (Heckman et al. 1999, Sec.3.4)
▶ Or a policy change that affects the assignment mechanism, e.g. available options
▶ No analysis from the social planner’s point of view, e.g. accounting for externalities
▶ No analysis of causal parameters other than means, e.g. median gains

Optional exercise: read Heckman-Vytlacil’s Sec. 4.4. Do you agree with everything?

Roy model
Alternative “structural” approach: model self-selection explicitly
Original Roy (1951) model: self-selection based on outcome comparison
▶ D = choice of occupation (e.g. agriculture vs not) or education level
▶ Y(d) = earnings for a given occupation/education
▶ People vary by occupational productivities/returns to education, known to them
▶ They choose based on them: D = arg maxd∈D Yi (d), perhaps with homogeneous
costs
Extended Roy model: costs are heterogeneous but fully determined by observables
▶ which may or may not affect the outcome at the same time
Generalized Roy model: self-selection based on unobserved preferences
▶ D = arg maxd∈D Ri (d) where e.g. Ri (d) = Yi (d) − Ci (d) for costs Ci (d)
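A generalized-Roy sketch (all distributions invented): agents know their own gains and take the treatment when the gain exceeds an unobserved cost, so the treated are positively selected on gains and the ATT exceeds the ATE.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
y0 = rng.normal(0, 1, n)
gain = rng.normal(1, 2, n)           # heterogeneous returns, known to agents
y1 = y0 + gain
cost = rng.normal(1, 1, n)           # unobserved cost C_i(1) - C_i(0)

d = (gain > cost).astype(int)        # D = argmax of R_i(d) = Y_i(d) - C_i(d)
print((y1 - y0).mean())              # ATE = 1
print((y1 - y0)[d == 1].mean())      # ATT > ATE: self-selection on gains
```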

Roy model: Identification
What does this structure buy us?
No free lunch: “for general skill distributions [i.e., without parametric restrictions],
the [original Roy] model is not identified [from a single cross-section] and has no
empirical content” (Heckman and Honore 1990)
But with more data and restrictions one can identify the ATE and even the
distribution of (Y0 , Y1 , R1 − R0 ) =⇒ the distribution of gains
Assumptions are often parametric: e.g. Heckman correction via normality of
potential outcomes
▶ Not living up to the goal of using economic theory for identification?

Can do better with cost shifters that shift selection but not outcomes
▶ Value over traditional IV methods is not so clear?

Another alternative: Directed acyclic graphs (DAGs)
Directed acyclic graphs of Judea Pearl represent causal relationships graphically: e.g.
X = soil treatment (fumigation)
Y = crop yield
Z1 = eelworm population before the treatment
Z2 = eelworm population after the treatment
Z3 = eelworm population at the end of season

Z0 = eelworm population last season (unobserved)


B = bird population (unobserved)

“Do-calculus” allows one to verify whether the average total effect of X on Y is
identified from observing (X, Y, Z1 , Z2 , Z3 )
Popular in epidemiology but not in economics. Why?
Some limitations of DAGs
Imbens (JEL 2020) lists some pitfalls of DAGs relative to potential outcomes:

1. Economists avoid complex models with many variables


2. Randomization and manipulability have no special value in DAGs
3. Too nonparametric:
a. Not possible to incorporate additional assumptions, such as continuity (important
in RDD) and monotonicity (important in IV) (see Maiti, Plecko, Bareinboim 2024)

b. Too much focus on identification, relative to estimation and inference

4. Difficult to model interference


5. Clunky to model simultaneity, e.g. demand and supply

Outline

1 The concept of potential outcomes

2 Causal parameters and their identification via RCTs

3 Limitations of the Rubin causal model and alternatives

4 Causality or prediction?
Causality vs. prediction
Economists obsess over causality, but sometimes prediction is the relevant goal
The choice should be guided by the ultimate goal: decision making
Two scenarios (see Kleinberg et al., 2015):
1. The action/policy D ∈ {0, 1} affects the outcome Y, and the payoff (i.e., utility) π
depends on Y
▶ E.g. D = rain dance in a drought, Y = it rains

π(d) = aY(d) − bd =⇒ E [π(1) − π(0)] = aE [Y(1) − Y(0)] − b

▶ Optimal decision: D = 1 [E [Y(1) − Y(0)] ≥ b/a]


▶ This is a causal problem. Running an RCT is very helpful
▶ Better knowledge of heterogeneous causal effects E [Y(1) − Y(0) | X] based on
observed covariates X also yields better decisions D(X)
Causality vs. prediction (2)
2. Y is unaffected by D but the marginal payoff of actions, ∂π/∂D, depends on Y
▶ E.g. D = take an umbrella, Y = it rains

π(d) = aY · d − bd =⇒ E [π(1) − π(0)] = aE [Y] − b

▶ Optimal decision: D = 1 [E [Y] ≥ b/a]


▶ This is a prediction problem. Running an RCT is not helpful
▶ Better prediction E [Y | X] yields better decisions D(X)
Note: This scenario can also be recast as a causal problem:
▶ D affects Ỹ(D) = you get wet = Y · (1 − D)
▶ But we know potential outcome Ỹ(1) = 0
▶ And we have data on Ỹ(0) = Y to make a prediction of Ỹ(1) − Ỹ(0)
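The two decision rules can be put side by side (a sketch with made-up payoff parameters and made-up estimates):

```python
# Payoffs: pi = a*Y - b*D in scenario 1, pi = a*Y*D - b*D in scenario 2
a, b = 2.0, 1.0

# Scenario 1 (causal): need E[Y(1) - Y(0)], e.g. from an RCT
ate_hat = 0.7                              # hypothetical causal estimate
d_rain_dance = int(ate_hat >= b / a)       # act iff effect clears b/a

# Scenario 2 (predictive): need E[Y], e.g. from a forecasting model
p_rain_hat = 0.3                           # hypothetical prediction E[Y | X]
d_umbrella = int(p_rain_hat >= b / a)      # act iff prediction clears b/a

print(d_rain_dance, d_umbrella)            # 1 and 0 under these numbers
```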

Policy-relevant prediction problems: Examples
1. Eliminating futile hip and knee replacement surgeries
▶ Surgery has costs: monetary + painful recovery
▶ Benefits depend on life expectancy
▶ Kleinberg et al. (2015) show that 10% (1%) of patients have a predictable
probability of dying within a year of 24% (44%), for reasons unrelated to this surgery
2. Improving admissions by predicting college success
▶ Geiser and Santelices (2007) show that high-school GPA is a better predictor of
performance at UC colleges than SAT
▶ If UC had to reduce admissions, rejecting applicants with marginal GPAs would
lose fewer good students than rejecting those with marginal SAT scores
3. See Kleinberg et al. “Human Decisions and Machine Predictions” (2018) for a
more subtle example on bail decisions by judges
