A2 Causality
Kirill Borusyak
ARE 213 Applied Econometrics
UC Berkeley, Fall 2024
Rubin causal model
Consider some population of units i
Each unit is observed in one of several treatment conditions Di ∈ D
▶ E.g. D ∈ {0, 1}: untreated and treated
Suppose we can imagine each unit under all possible conditions (in the same
period)
▶ Causality always requires specifying alternatives
▶ Corresponding potential outcomes are {Yi (d) : d ∈ D}
⋆ e.g. (Yi (0), Yi (1)) (equivalently written as (Y0i , Y1i ))
⋆ e.g. a demand function: Yi (d) = quantity demanded at price d
▶ Causal effects Yi (d′ ) − Yi (d) are defined by this abstraction
▶ Writing Yi (d) encodes the possibility that Di impacts Yi
▶ Realized outcome: Yi = Yi (Di )
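As a minimal sketch of this framework (the numbers and effect sizes are made up), one can simulate a population where both potential outcomes exist for every unit but only Yi = Yi (Di ) is ever realized:

```python
# Sketch of the Rubin causal model with D in {0, 1}: each unit carries both
# potential outcomes, but the researcher observes only Y_i = Y_i(D_i).
import numpy as np

rng = np.random.default_rng(0)
n = 5
y0 = rng.normal(0, 1, n)            # potential outcomes Y_i(0)
y1 = y0 + 2 + rng.normal(0, 1, n)   # Y_i(1); unit-level effects Y_i(1) - Y_i(0) vary
d = rng.integers(0, 2, n)           # realized treatment conditions D_i
y = np.where(d == 1, y1, y0)        # realized outcome Y_i = Y_i(D_i)

# The unit-level effects y1 - y0 exist in the simulation but are never
# observable from the data (y, d) alone.
print(np.column_stack([d, y, y1 - y0]))
```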
What can be a cause/treatment?
Is it meaningful to say “She did not get this position because she is a woman” (example
from Imbens 2020)?
What can be a cause/treatment?
Imagining each unit under all possible conditions is non-trivial:
▶ “She did well on the exam because she was coached by her teacher” (Holland 1986) ✓
▶ “She did well on the exam because she studied for it” (Holland 1986) ✗
SUTVA (1)
In writing Yi (di ) we implicitly imposed SUTVA (“stable unit treatment value
assumption”)
Most common meaning: no unmodeled interference
▶ I.e., the treatment statuses of other units, d−i , do not affect Yi
▶ Frequently violated: e.g. vaccines and infectious disease; information and
technology adoption; equilibrium effects via prices
Allowing for interference, we’d write Yi (d1 , . . . , dN ) for the population of size N
▶ We may be interested in own-treatment effects Yi (d′i , d−i ) − Yi (di , d−i ) and various
spillover effects, e.g. Yi (di , 1, . . . , 1) − Yi (di , 0, . . . , 0)
No interference is an exclusion restriction: Yi (di , d−i ) = Yi (di , d′−i ) ≡ Yi (di ),
∀di , d−i , d′−i
Intermediate case: e.g. Yi (d⃗i ) for an exposure mapping d⃗i = (di , ∑k∈Friends(i) dk )
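A sketch of such an exposure mapping, with a made-up five-unit friendship network (the network and treatment statuses are purely illustrative):

```python
# Exposure mapping: unit i's effective treatment is (own status, number of
# treated friends), reducing the full vector (d_1, ..., d_N) to two numbers.
import numpy as np

d = np.array([1, 0, 1, 1, 0])                               # statuses d_1..d_5
friends = {0: [1, 2], 1: [0], 2: [0, 3, 4], 3: [2], 4: [2]}

exposure = {i: (int(d[i]), sum(int(d[k]) for k in fr))
            for i, fr in friends.items()}
print(exposure)   # e.g. unit 2: (1, 2) -- treated, with two treated friends
```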
SUTVA (2)
Additional meaning of SUTVA: D summarizes everything about the intervention
that is relevant for the outcome
Example 1: “She got a high wage because she studied for many years”
▶ In writing Yi (d), we implicitly assume that school quality does not matter
▶ To think through violations, we could start from Yi (years, quality)
Effects of causes vs. causes of effects
Statistical analysis focuses on effects of causes (treatments) rather than causes of effects (outcomes)
▶ “What caused Y?” has no well-defined answer until alternatives are specified; “what does D do to Y?” starts from explicit counterfactuals
Common causal parameters (1)
We cannot learn the causal effect Yi (1) − Yi (0) for any particular unit
▶ “Fundamental problem of causal inference”: multiple potential outcomes are never
observed at once
▶ ... but we can sometimes learn some averages
Average treatment/causal effect: ATE = E [Yi (1) − Yi (0)]
▶ ATE = E [Yi (1)] − E [Yi (0)]
▶ Yi (1) − Yi (0) is never observed, but Yi (1) and Yi (0) each are, for some (but not all) units
▶ Causal inference can be understood as imputing missing data: e.g. from E [Yi (1) | Di = 1] we try to learn E [Yi (1) | Di = 0] and thus E [Yi (1)] (see the sketch below)
Conditional average treatment effect E [Yi (1) − Yi (0) | Xi = x], for predetermined
covariates Xi
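The missing-data view in a sketch (hypothetical numbers, constant effect for simplicity): the treatment determines which column of the table of potential outcomes is observed, and estimating E [Yi (1)] amounts to imputing the missing entries:

```python
# Causal inference as missing data: each row reveals Y(1) or Y(0), never both.
import numpy as np

rng = np.random.default_rng(1)
n = 6
y0 = rng.normal(0, 1, n)
y1 = y0 + 1.0                              # constant unit-level effect of 1
d = rng.integers(0, 2, n)

obs_y0 = np.where(d == 0, y0, np.nan)      # Y(0) missing for treated rows
obs_y1 = np.where(d == 1, y1, np.nan)      # Y(1) missing for untreated rows
print(np.column_stack([d, obs_y0, obs_y1]))
# E[Y(1)] can be estimated by imputing the NaNs in the last column, e.g. with
# the treated units' mean -- valid when D is as good as randomly assigned.
```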
Common causal parameters (2)
Average effect on the treated: ATT = E [Yi (1) − Yi (0) | Di = 1] (a.k.a. TOT, TT)
▶ Yields the aggregate effect of the treatment: Pop Size · P(Di = 1) · ATT
All these parameters follow from the distribution of (Y(1), Y(0), D). But are they
identified from data on (Y, D)?
Identifying ATT & ATE
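One standard route to the answer, sketched in the notation above: decompose the naive difference in means into the ATT plus a selection-bias term.

```latex
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{observed}}
  = \underbrace{E[Y(1)-Y(0) \mid D=1]}_{\text{ATT}}
  + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
```

Under random assignment, D ⊥⊥ (Y(0), Y(1)), the selection-bias term vanishes and, moreover, ATT = ATE; without it, the naive comparison identifies neither.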
RCT with ordered or continuous treatments
Consider an RCT where D takes more than two values (e.g. different dosages)
D ⊥⊥ {Y(d)}d∈D =⇒ E [Y | D = d] = E [Y(d) | D = d] = E [Y(d)]
A saturated regression of Y on dummies for all values of D (or a nonparametric
regression with continuous D) traces the average structural function E [Y(d)]
A simple regression of Y on D identifies a convexly-weighted average of
∂E [Y(d)] /∂d (or its discrete version):
βOLS = ∫ ω(d̃) · ∂E [Y(d̃)] /∂d̃ dd̃ (integrating over all d̃), where ω(d̃) = Cov [1 [D ≥ d̃] , D] / Var [D]

or βOLS = ∑k=1,…,K ωk · E [Y(dk ) − Y(dk−1 )] / (dk − dk−1 ), where ωk = (dk − dk−1 ) · Cov [1 [D ≥ dk ] , D] / Var [D]
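A numerical sanity check of the discrete formula (my own sketch; the dosage distribution is arbitrary and the average structural function is taken from the example on the next slide):

```python
# Verify: with randomized ordered D, the OLS slope equals the convex
# combination sum_k w_k * E[Y(d_k)-Y(d_{k-1})]/(d_k - d_{k-1}).
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

def cov(a, b):                                        # population-style covariance
    return np.mean((a - a.mean()) * (b - b.mean()))

support = np.array([0, 1, 2])
d = rng.choice(support, size=n, p=[0.5, 0.3, 0.2])    # randomized dosage
asf = np.array([0.0, 3.0, 4.0])                       # E[Y(d)] for d = 0, 1, 2
y = asf[d] + rng.normal(0, 1, n)

beta = cov(y, d) / cov(d, d)                          # simple OLS slope
w = np.array([(support[k] - support[k - 1]) *
              cov((d >= support[k]).astype(float), d)
              for k in (1, 2)]) / cov(d, d)           # w_k from the formula above
slopes = np.diff(asf) / np.diff(support)              # per-segment slopes: 3 and 1
print(beta, w @ slopes, w.sum())                      # beta = w.slopes; sum(w) = 1
```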
The importance of convex weighting
Imagine E [Y(0)] = 0, E [Y(1)] = 3, E [Y(2)] = 4
▶ Higher dosage is always good (on average)
OLS of Y on D in an RCT will produce a coefficient between E [(Y(2) − Y(1)) / (2 − 1)] = 1 and E [(Y(1) − Y(0)) / (1 − 0)] = 3
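Continuing the numerical sketch above: however the randomized dosage is distributed, the coefficient stays between the two segment slopes.

```python
# The OLS coefficient is a convex combination of the slopes 3 (from 0 to 1)
# and 1 (from 1 to 2), so it always lands in [1, 3].
import numpy as np

rng = np.random.default_rng(3)
asf = np.array([0.0, 3.0, 4.0])                  # E[Y(0)], E[Y(1)], E[Y(2)]
for p in ([1/3, 1/3, 1/3], [0.1, 0.1, 0.8], [0.45, 0.10, 0.45]):
    d = rng.choice([0, 1, 2], size=500_000, p=p)
    y = asf[d] + rng.normal(0, 1, d.size)
    print(p, round(np.polyfit(d, y, 1)[0], 2))   # OLS slope, always in [1, 3]
```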
Distribution of gains
Heckman, Lalonde, Smith (1999) list some other interesting parameters:
a. The proportion of people taking the program who benefit from it:
P(Y1 > Y0 | D = 1)
Distribution of gains: Identification
Does an RCT identify these other parameters, e.g. median gains?
E.g. suppose (Y0 , Y1 ) takes values (0, 1), (1, 2), (2, 0) with equal probability
▶ Median gain = 1 (the gains Y1 − Y0 are 1, 1, −2)
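It does not: an RCT reveals the marginal distributions of Y0 and Y1 but not their joint distribution, and different joints with the same marginals give different median gains. A sketch (the second coupling is my own illustrative choice):

```python
# Two joint distributions of (Y0, Y1) with identical uniform marginals on
# {0, 1, 2} but different median gains -- so the median gain is not identified.
import numpy as np

coupling_a = [(0, 1), (1, 2), (2, 0)]   # the slide's example: gains 1, 1, -2
coupling_b = [(0, 0), (1, 1), (2, 2)]   # same marginals: gains 0, 0, 0

for name, c in [("A", coupling_a), ("B", coupling_b)]:
    gains = [y1 - y0 for y0, y1 in c]
    print(name, "median gain =", np.median(gains))   # A: 1.0, B: 0.0
```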
Criticisms by Heckman and Vytlacil 2007
1. Estimated effects cannot be transferred to new environments (limited external
validity) and to new programs never previously implemented
▶ Interventions are black boxes, with little attempt to unbundle their components
▶ Mechanisms cannot be pinned down
▶ Knowledge does not cumulate across studies (contrast with estimates of a labor
supply elasticity — a structural parameter)
⋆ Counterpoint from Angrist and Pischke (2010): “Empirical evidence on any given
causal effect is always local, derived from a particular time, place, and research
design. Invocation of a superficially general structural framework does not make the
underlying variation more representative. Economic theory often suggests general
principles, but extrapolation to new settings is always speculative. A constructive
response to the specificity of a given research design is to look for more evidence, so
that a more general picture begins to emerge.”
Criticisms by Heckman and Vytlacil 2007 (cont.)
2. Estimands need not be relevant even to analyze the observed policy
▶ Informative on whether to throw out the program entirely (ATT) and whether to extend it by forcing it on everyone not yet covered (ATU)
▶ But not whether to extend/shrink it on the margin (Heckman et al. 1999, Sec.3.4)
▶ Nor on a policy change that affects the assignment mechanism, e.g. the set of available options
▶ No analysis from the social planner’s point of view, e.g. accounting for externalities
▶ No analysis of causal parameters other than means, e.g. median gains
Roy model
Alternative “structural” approach: model self-selection explicitly
Original Roy (1951) model: self-selection based on outcome comparison
▶ D = choice of occupation (e.g. agriculture vs not) or education level
▶ Y(d) = earnings for a given occupation/education
▶ People vary in their occupational productivities/returns to education, which are known to them
▶ They choose based on these returns: D = arg maxd∈D Yi (d), perhaps with homogeneous costs
Extended Roy model: costs are heterogeneous but fully determined by observables
▶ which may or may not affect the outcome at the same time
Generalized Roy model: self-selection based on unobserved preferences
▶ D = arg maxd∈D Ri (d) where e.g. Ri (d) = Yi (d) − Ci (d) for costs Ci (d)
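A toy simulation of the original Roy model (the distributions are my own choice) showing how self-selection breaks the naive comparison even when the ATE is zero:

```python
# Original Roy model: units know (Y(0), Y(1)) and pick the sector that pays more.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
y0 = rng.normal(1.0, 1.0, n)                 # earnings in sector 0
y1 = rng.normal(1.0, 2.0, n)                 # earnings in sector 1; ATE = 0
d = (y1 > y0).astype(int)                    # D = argmax_d Y(d), zero costs
y = np.where(d == 1, y1, y0)

print(y[d == 1].mean() - y[d == 0].mean())   # strongly positive despite ATE = 0
```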
Roy model: Identification
What does this structure buy us?
No free lunch: “for general skill distributions [i.e., without parametric restrictions],
the [original Roy] model is not identified [from a single cross-section] and has no
empirical content” (Heckman and Honore 1990)
But with more data and restrictions, one can identify the ATE and even the distribution
of (Y0 , Y1 , R1 − R0 ) =⇒ distribution of gains
Assumptions are often parametric: e.g. Heckman correction via normality of
potential outcomes
▶ Not living up to the goal of using economic theory for identification?
Can do better with cost shifters that shift selection but not outcomes
▶ Value over traditional IV methods is not so clear?
Another alternative: Directed acyclic graphs (DAGs)
The directed acyclic graphs (DAGs) of Judea Pearl represent causal relationships graphically, e.g.:
X = soil treatment (fumigation)
Y = crop yield
Z1 = eelworm population before the treatment
Z2 = eelworm population after the treatment
Z3 = eelworm population at the end of season
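A DAG over these variables is just a set of directed edges. The edges below are an illustrative assumption of mine (not necessarily the exact figure Pearl draws for this example), purely to show the encoding:

```python
# One plausible arrangement of arrows for the fumigation example (edges are
# assumed for illustration, not taken from Pearl's original figure).
dag = {
    "Z1": ["X", "Z2"],   # pre-treatment eelworms drive fumigation and later counts
    "X":  ["Z2", "Y"],   # fumigation affects eelworms and possibly yield directly
    "Z2": ["Z3"],        # eelworm dynamics over the season
    "Z3": ["Y"],         # end-of-season eelworms affect crop yield
}
# Given such a graph, Pearl's rules (e.g. back-door adjustment) determine
# which conditioning sets identify the causal effect of X on Y.
```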
Causality vs. prediction
Economists obsess over causality, but sometimes prediction is the relevant goal
The choice should be guided by the ultimate goal: decision making
Two scenarios (see Kleinberg et al., 2015):
1. The action/policy D ∈ {0, 1} affects the outcome Y, and the payoff (i.e., utility) π
depends on Y
▶ E.g. D = rain dance in a drought, Y = it rains → need the causal effect of D on Y
2. D does not affect Y, but the payoff from choosing D depends on Y
▶ E.g. D = take an umbrella, Y = it rains → a good prediction of Y suffices
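Kleinberg et al. (2015) formalize the distinction with a payoff derivative; a sketch of their decomposition, adapted to this notation:

```latex
\frac{d\,\pi(D, Y)}{dD}
  = \underbrace{\frac{\partial \pi}{\partial D}}_{\text{depends on } Y \Rightarrow \text{prediction}}
  + \underbrace{\frac{\partial \pi}{\partial Y}\,\frac{\partial Y}{\partial D}}_{\text{causal effect of } D \text{ on } Y}
```

The rain dance needs the second term (the causal effect ∂Y/∂D); the umbrella needs only the first, which requires predicting Y.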
Policy-relevant prediction problems: Examples
1. Eliminating futile hip and knee replacement surgeries
▶ Surgery has costs: monetary + painful recovery
▶ Benefits depend on life expectancy
▶ Kleinberg et al. (2015) show that 10% (1%) of patients have a predictable probability of 24% (44%) of dying within a year, for reasons unrelated to this surgery
2. Improving admissions by predicting college success
▶ Geiser and Santelices (2007) show that high-school GPA is a better predictor of
performance at UC colleges than SAT
▶ If UC had to reduce admissions, rejecting applicants with marginal GPAs would lose fewer good students than rejecting those with marginal SAT scores
3. See Kleinberg et al. “Human Decisions and Machine Predictions” (2018) for a
more subtle example on bail decisions by judges