
Matching and the Propensity Score

2020 AEA Continuing Education Program


Mastering Mostly Harmless Econometrics

Alberto Abadie
MIT

Adjustment techniques for observational studies

Alternatives to regression:
Subclassification

Matching

Propensity Score Methods

What should we match on? A brief introduction to DAGs.


Covariates and outcomes

Definition (Predetermined Covariates)


Variable X is predetermined with respect to the treatment D (also
called “pretreatment”) if for each individual i, X0i = X1i, i.e., the
value of Xi does not depend on the value of Di. Such
characteristics are called covariates.
Does not imply that X and D are independent
Predetermined variables are often time invariant (sex, race,
etc.), but time invariance is not necessary

Definition (Outcomes)
Those variables, Y, that are (possibly) not predetermined are
called outcomes (for some individual i, Y0i ≠ Y1i)
In general, one should not condition on outcomes, because this
may induce bias

Identification in randomized experiments

Randomization implies:

(Y1, Y0) ⊥⊥ D, that is, (Y1, Y0) is independent of D.

Therefore:

E [Y |D = 1] − E [Y |D = 0] = E [Y1 |D = 1] − E [Y0 |D = 0]
= E [Y1 ] − E [Y0 ]
= E [Y1 − Y0 ].

Also, we have that

E [Y1 − Y0 ] = E [Y1 − Y0 |D = 1].
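
A quick numerical illustration of this identification result (a minimal simulation sketch, assuming NumPy is available; the data-generating process and variable names are invented for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes with a true average treatment effect of 2.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0 + rng.normal(scale=0.5, size=n)

# Random assignment makes (Y1, Y0) independent of D.
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, y1, y0)

# The simple difference in means therefore estimates E[Y1 - Y0].
print(y[d == 1].mean() - y[d == 0].mean())  # approximately 2
```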


The nature of confounding

Confounding may arise from common causes in observational studies:

[DAG: D → Y, with X a common cause of D and Y, V a cause of D only, and U a cause of Y only]
X is a confounder, V and U are not.
Conditional on X there is no confounding.
Correlation between Y and D conditional on X is reflective of
the effect of D on Y. That is:

(Y1, Y0) ⊥⊥ D | X.

Identification under selection on observables


Identification Assumption
1 (Y1, Y0) ⊥⊥ D | X (selection on observables)
2 0 < Pr(D = 1 | X) < 1 with probability one (common support)

Identification Result
Given selection on observables we have

E[Y1 − Y0 | X] = E[Y1 − Y0 | X, D = 1]
              = E[Y | X, D = 1] − E[Y | X, D = 0]

Therefore, under the common support condition:

αATE = E[Y1 − Y0] = ∫ E[Y1 − Y0 | X] dP(X)
                  = ∫ ( E[Y | X, D = 1] − E[Y | X, D = 0] ) dP(X)
Identification under selection on observables
Identification Assumption
1 (Y1, Y0) ⊥⊥ D | X (selection on observables)
2 0 < Pr(D = 1 | X) < 1 with probability one (common support)

Identification Result
Similarly,

αATET = E[Y1 − Y0 | D = 1]
      = ∫ ( E[Y | X, D = 1] − E[Y | X, D = 0] ) dP(X | D = 1)

To identify αATET, the selection on observables and common support
conditions can be relaxed to:
Y0 ⊥⊥ D | X
Pr(D = 1 | X) < 1 (with Pr(D = 1) > 0)

The subclassification estimator


The identification result is:

αATE  = ∫ ( E[Y | X, D = 1] − E[Y | X, D = 0] ) dP(X)
αATET = ∫ ( E[Y | X, D = 1] − E[Y | X, D = 0] ) dP(X | D = 1)

Assume X takes on K different cells {X^1, ..., X^k, ..., X^K}. Then,
the analogy principle suggests the following estimators:

α̂ATE = Σ_{k=1}^{K} ( Ȳ1^k − Ȳ0^k ) · (N^k / N);    α̂ATET = Σ_{k=1}^{K} ( Ȳ1^k − Ȳ0^k ) · (N1^k / N1)

N^k is the number of observations and N1^k the number of treated observations in cell k
Ȳ1^k is the mean outcome for the treated in cell k
Ȳ0^k is the mean outcome for the untreated in cell k
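
A minimal sketch of these analog estimators (assuming NumPy and pandas are available; the function and argument names are illustrative, not from the slides):

```python
import numpy as np
import pandas as pd

def subclassification(y, d, cell):
    """Analog subclassification estimators of the ATE and ATET.

    y: outcomes; d: 0/1 treatment indicator; cell: discrete cell label
    for each observation. Cells with only treated or only untreated
    units are skipped, which implicitly assumes overlap in the rest.
    """
    df = pd.DataFrame({"y": np.asarray(y, float),
                       "d": np.asarray(d, int),
                       "cell": cell})
    n, n1 = len(df), df["d"].sum()
    ate = atet = 0.0
    for _, g in df.groupby("cell"):
        n_k, n1_k = len(g), g["d"].sum()
        if n1_k == 0 or n1_k == n_k:
            continue  # no treated/untreated contrast available in this cell
        diff = g.loc[g["d"] == 1, "y"].mean() - g.loc[g["d"] == 0, "y"].mean()
        ate += diff * n_k / n      # weight by the cell's share of all units
        atet += diff * n1_k / n1   # weight by the cell's share of treated units
    return ate, atet
```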
Subclassification and the “curse of dimensionality”

Subclassification becomes infeasible with many covariates

Assume we have k covariates and divide each of them into 3


coarse categories (e.g., age could be “young”, “middle age” or
“old”, and income could be “low”, “medium” or “high”).

The number of subclassification cells is 3^k. For k = 10, we

obtain 3^10 = 59049

Many cells may contain only treated or untreated


observations, so we cannot use subclassification

Subclassification is also problematic if the cells are “too

coarse”. But using “finer” cells worsens the curse of
dimensionality problem: e.g., using 10 variables and 5
categories for each variable we obtain 5^10 = 9765625

Matching

We could also estimate αATET by constructing a comparison


sample of untreated units with the same characteristics as the
sample of treated units.

This can be easily accomplished by matching treated and

untreated units with the same characteristics.
Matching: An ideal example
Trainees Non-Trainees
unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500
21 32 25900
Average: 33 20724

Matching: An ideal example


Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average: 28.5 13982
21 32 25900
Average: 33 20724
Age distribution: Before matching

[Figure: histograms of age. Panel A: Trainees; Panel B: Non-Trainees. Horizontal axis: age (20 to 60); vertical axis: frequency.]

Age distribution: After matching

[Figure: histograms of age after matching. Panel A: Trainees; Panel B: matched Non-Trainees. Horizontal axis: age (20 to 60); vertical axis: frequency.]
Treatment effect estimates

Difference in average earnings between trainees and non-trainees:

Before matching:

16426 − 20724 = −4298

After matching:

16426 − 13982 = 2444

Matching

Perfect matches are often not available. In that case, a matching
estimator of αATET can be constructed as:

α̂ATET = (1/N1) Σ_{Di=1} ( Yi − Yj(i) )

where Yj(i) is the outcome of an untreated observation such that
Xj(i) is the closest value to Xi among the untreated observations.
We can also use the average of the M closest matches:

α̂ATET = (1/N1) Σ_{Di=1} ( Yi − (1/M) Σ_{m=1}^{M} Yjm(i) )

This works well when we can find good matches for each treated unit,
so M is usually small (typically, M = 1 or M = 2).
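
A minimal sketch of this estimator (NumPy only; the Euclidean metric and the function name are illustrative choices, not the slides' prescription — the appendix discusses scale-invariant alternatives):

```python
import numpy as np

def match_atet(y, d, x, M=1):
    """Nearest-neighbor matching estimator of the ATET (with replacement).

    y: outcomes (n,); d: 0/1 treatment (n,); x: covariates (n, k).
    Each treated unit is matched to its M closest untreated units in
    Euclidean distance; ties are broken arbitrarily by the sort order.
    """
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)
    y_t, y_c = y[d == 1], y[d == 0]
    x_t, x_c = x[d == 1], x[d == 0]

    # Pairwise treated-to-untreated distances: shape (n_treated, n_untreated).
    dist = np.linalg.norm(x_t[:, None, :] - x_c[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, :M]     # M closest untreated units

    y_matched = y_c[nn].mean(axis=1)         # average outcome of the M matches
    return (y_t - y_matched).mean()
```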
Matching

We can also use matching to estimate αATE. In that case, we
match in both directions:
1 If observation i is treated, we impute Y0i using untreated matches, {Yj1(i), ..., YjM(i)}
2 If observation i is untreated, we impute Y1i using treated matches, {Yj1(i), ..., YjM(i)}
The estimator is:

α̂ATE = (1/N) Σ_{i=1}^{N} (2Di − 1) ( Yi − (1/M) Σ_{m=1}^{M} Yjm(i) )
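
A sketch of the ATE version along the same lines (again NumPy with a plain Euclidean metric as an illustrative choice):

```python
import numpy as np

def match_ate(y, d, x, M=1):
    """Matching estimator of the ATE: each unit is matched to its M nearest
    neighbors in the opposite treatment arm, whose average outcome imputes
    the unit's missing potential outcome."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)
    n = len(y)
    contrib = np.empty(n)
    for i in range(n):
        pool = np.flatnonzero(d != d[i])                  # opposite treatment arm
        dist = np.linalg.norm(x[pool] - x[i], axis=1)
        imputed = y[pool[np.argsort(dist)[:M]]].mean()    # average of the M matches
        contrib[i] = (2 * d[i] - 1) * (y[i] - imputed)    # estimate of Y1i - Y0i
    return contrib.mean()
```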

Matching and the curse of dimensionality

When we match multiple variables we need to define a norm,
‖·‖, to measure matching discrepancies, ‖Xi − Xj(i)‖ (see
appendix for usual norms)

Matching discrepancies ‖Xi − Xj(i)‖ tend to increase with k,

the dimension of X

Matching discrepancies converge to zero. But they converge


very slowly if k is large

Mathematically, it can be shown that ‖Xi − Xj(i)‖ converges

to zero at the same rate as 1/N^{1/k}

It is difficult to find good matches in large dimensions: you
need many observations if k is large
Matching: Bias

α̂ATET = (1/N1) Σ_{Di=1} ( Yi − Yj(i) ),

where Xi ≈ Xj(i) and Dj(i) = 0. Let

μ0(x) = E[Y | X = x, D = 0] = E[Y0 | X = x],
μ1(x) = E[Y | X = x, D = 1] = E[Y1 | X = x],
Yi = μDi(Xi) + εi.

Then,

α̂ATET − αATET = (1/N1) Σ_{Di=1} ( μ1(Xi) − μ0(Xi) − αATET )
             + (1/N1) Σ_{Di=1} ( εi − εj(i) )
             + (1/N1) Σ_{Di=1} ( μ0(Xi) − μ0(Xj(i)) ).

Matching: Bias

We hope that we can apply a Central Limit Theorem and

√N1 (α̂ATET − αATET)

converges to a Normal distribution with zero mean. However,

E[ √N1 (α̂ATET − αATET) ] = E[ √N1 ( μ0(Xi) − μ0(Xj(i)) ) | D = 1 ].

Now, if k is large:
⇒ The difference between Xi and Xj(i) converges to zero very slowly
⇒ The difference μ0(Xi) − μ0(Xj(i)) converges to zero very slowly
⇒ E[ √N1 ( μ0(Xi) − μ0(Xj(i)) ) | D = 1 ] may not converge to zero!
⇒ E[ √N1 (α̂ATET − αATET) ] may not converge to zero!
⇒ Bias is often an issue when we match in many dimensions


Matching: Reducing bias
The bias of the matching estimator is caused by large matching
discrepancies ‖Xi − Xj(i)‖. However:

1 The matching discrepancies are observed. We can always
check in the data how well we are matching the covariates.

2 For α̂ATET we can always make the matching discrepancies
small by using a large reservoir of untreated units to select the
matches (that is, by making N0 large).

3 If the matching discrepancies are large, so we are worried
about potential biases, we can apply bias correction
techniques.

4 Partial solution: Propensity score methods (to come).

Matching with bias correction

Each treated observation contributes

µ0 (Xi ) − µ0 (Xj(i) )

to the bias.

Bias-corrected matching:

α̂ATET^BC = (1/N1) Σ_{Di=1} [ ( Yi − Yj(i) ) − ( μ̂0(Xi) − μ̂0(Xj(i)) ) ]

where μ̂0(x) is an estimate of E[Y | X = x, D = 0] (e.g., OLS).
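
A minimal sketch of bias-corrected matching (NumPy only; the OLS estimate of μ0, the Euclidean metric, and the function name are illustrative assumptions; with M > 1 the correction averages μ̂0 over the matches):

```python
import numpy as np

def match_atet_bias_corrected(y, d, x, M=1):
    """Bias-corrected nearest-neighbor matching sketch for the ATET."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)
    y_t, y_c, x_t, x_c = y[d == 1], y[d == 0], x[d == 1], x[d == 0]

    # OLS estimate of mu0(x) = E[Y | X = x, D = 0], with an intercept.
    design = np.column_stack([np.ones(len(x_c)), x_c])
    beta0, *_ = np.linalg.lstsq(design, y_c, rcond=None)
    mu0 = lambda z: np.column_stack([np.ones(len(z)), z]) @ beta0

    # M nearest untreated matches for each treated unit.
    dist = np.linalg.norm(x_t[:, None, :] - x_c[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, :M]                  # shape (n_treated, M)

    y_matched = y_c[nn].mean(axis=1)
    mu0_matched = mu0(x_c[nn].reshape(-1, x.shape[1])).reshape(nn.shape).mean(axis=1)
    correction = mu0(x_t) - mu0_matched                   # mu0_hat(Xi) - mu0_hat(Xj(i))
    return (y_t - y_matched - correction).mean()
```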
Matching bias: Implications for practice

Bias arises because of the effect of large matching discrepancies on


µ0 (Xi ) − µ0 (Xj(i) ). To minimize matching discrepancies:
1 Use a small M (e.g., M = 1). Large values of M produce
large matching discrepancies.
2 Use matching with replacement. Because matching with
replacement can use untreated units as a match more than
once, matching with replacement produces smaller matching
discrepancies than matching without replacement.
3 Try to match covariates with a large effect on µ0 (·)
particularly well.

Propensity score

The propensity score is defined as the selection probability


conditional on the confounding variables: p(X ) = P(D = 1|X ).

The selection on observables identification assumption is:

1 (Y1, Y0) ⊥⊥ D | X (selection on observables)
2 0 < Pr(D = 1 | X) < 1 (common support)

Rosenbaum and Rubin (1983) proved that selection on observables
implies:

(Y1, Y0) ⊥⊥ D | p(X)

⇒ conditioning on the propensity score is enough to have
independence between the treatment indicator and the
potential outcomes
⇒ substantial dimension reduction in the matching variables!
Matching on the propensity score

Because of the Rosenbaum-Rubin result, if (Y1, Y0) ⊥⊥ D | X, then

E[Y1 − Y0 | p(X)] = E[Y | D = 1, p(X)] − E[Y | D = 0, p(X)]

This motivates a two step procedure to estimate causal effects


under selection on observables:
1 estimate the propensity score p(X ) = P(D = 1|X ) (e.g., using
logit or probit regression)
2 do matching or subclassification on the estimated propensity
score (a minimal sketch follows below)
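
A minimal two-step sketch (assuming NumPy and scikit-learn; a statsmodels logit or probit would serve equally well for step 1, and the function name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pscore_match_atet(y, d, x, M=1):
    """Two-step sketch: (1) estimate p(X) with a logit, (2) match each
    treated unit to its M nearest untreated units on the estimated score."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)

    # Step 1: estimated propensity score p_hat(X) = P(D = 1 | X).
    # A large C makes sklearn's logit essentially unpenalized.
    pscore = LogisticRegression(C=1e6).fit(x, d).predict_proba(x)[:, 1]

    # Step 2: nearest-neighbor matching on the scalar p_hat(X).
    ps_t, ps_c = pscore[d == 1], pscore[d == 0]
    y_t, y_c = y[d == 1], y[d == 0]
    dist = np.abs(ps_t[:, None] - ps_c[None, :])
    nn = np.argsort(dist, axis=1)[:, :M]
    return (y_t - y_c[nn].mean(axis=1)).mean()
```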

Proof of the Rosenbaum and Rubin (1983) result


Assume that (Y1, Y0) ⊥⊥ D | X. Then:

P(D = 1|Y1 , Y0 , p(X )) = E [D|Y1 , Y0 , p(X )]


= E [E [D|Y1 , Y0 , X ]|Y1 , Y0 , p(X )]
= E [E [D|X ]|Y1 , Y0 , p(X )]
= E [p(X )|Y1 , Y0 , p(X )]
= p(X )

Using a similar argument, we obtain

P(D = 1|p(X )) = E [D|p(X )] = E [E [D|X ]|p(X )]


= E [p(X )|p(X )] = p(X )

⇒ P(D = 1 | Y1, Y0, p(X)) = P(D = 1 | p(X))

⇒ (Y1, Y0) ⊥⊥ D | p(X)
Propensity score: Balancing property
[Figure: two DAGs over D, Y, X; in the second, X affects D only through p(X)]

⇒ D and X are independent conditional on p(X):

D ⊥⊥ X | p(X).

So we obtain the balancing property of the propensity score:

P(X | D = 1, p(X)) = P(X | D = 0, p(X)),

⇒ conditional on the propensity score, the distribution of the
covariates is the same for treated and non-treated.
We can use this to check if our estimated propensity score actually
produces balance:

P(X | D = 1, p̂(X)) = P(X | D = 0, p̂(X))
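
One crude way to check this in practice (a sketch assuming NumPy and pandas; stratifying on quantiles of the estimated score is my choice of implementation, and the function name is illustrative):

```python
import numpy as np
import pandas as pd

def balance_table(x, d, pscore, n_strata=5):
    """Within quantile strata of the estimated propensity score, compare
    covariate means of treated vs. untreated units. Small within-stratum
    differences suggest p_hat(X) balances X."""
    x, d = np.asarray(x, float), np.asarray(d, int)
    strata = pd.qcut(pscore, n_strata, labels=False, duplicates="drop")
    rows = []
    for s in np.unique(strata):
        in_s = strata == s
        # Difference in covariate means within stratum s
        # (NaN appears if a stratum lacks treated or untreated units).
        diff = x[in_s & (d == 1)].mean(axis=0) - x[in_s & (d == 0)].mean(axis=0)
        rows.append(diff)
    return pd.DataFrame(rows, columns=[f"x{j+1}" for j in range(x.shape[1])])
```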

Matching estimators: Large sample distribution

Matching estimators have a Normal distribution in large


samples (provided that the bias is small).

Abadie and Imbens (2006, 2012) provide standard error

formulas for estimators that match on X.

Abadie and Imbens (2016) provide standard error formulas

for estimators that match on p̂(X).

The bootstrap does not work in general.


Weighting on the propensity score (IPW)
Weighting estimators that use the propensity score (“Inverse Probability
Weighting”) are based on the following result: If (Y1, Y0) ⊥⊥ D | X, then

αATE  = E[ Y (D − p(X)) / ( p(X)(1 − p(X)) ) ]
αATET = (1 / P(D = 1)) E[ Y (D − p(X)) / (1 − p(X)) ]

To prove these results, notice that:

E[ Y (D − p(X)) / ( p(X)(1 − p(X)) ) | X ]
  = E[ Y / p(X) | X, D = 1 ] p(X) + E[ −Y / (1 − p(X)) | X, D = 0 ] (1 − p(X))
  = E[Y | X, D = 1] − E[Y | X, D = 0]

And the results follow from integration over P(X) and P(X | D = 1).

Weighting on the propensity score

 
αATE  = E[ Y (D − p(X)) / ( p(X)(1 − p(X)) ) ]
αATET = (1 / P(D = 1)) E[ Y (D − p(X)) / (1 − p(X)) ]

The analogy principle suggests a two-step estimator:

1 Estimate the propensity score: p̂(X)
2 Use the estimated score to produce analog estimators:

α̂ATE  = (1/N)  Σ_{i=1}^{N} Yi ( Di − p̂(Xi) ) / ( p̂(Xi)(1 − p̂(Xi)) )
α̂ATET = (1/N1) Σ_{i=1}^{N} Yi ( Di − p̂(Xi) ) / ( 1 − p̂(Xi) )
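
These plug-in formulas translate almost directly into code; a minimal sketch given an already estimated score (NumPy only, no weight normalization or trimming, function name illustrative):

```python
import numpy as np

def ipw_estimates(y, d, pscore):
    """Plug-in IPW estimators of the ATE and ATET from an estimated
    propensity score."""
    y = np.asarray(y, float)
    d = np.asarray(d, float)
    p = np.asarray(pscore, float)

    ate = np.mean(y * (d - p) / (p * (1 - p)))
    atet = np.sum(y * (d - p) / (1 - p)) / d.sum()   # divide by N1 = # treated
    return ate, atet
```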
Weighting on the propensity score

α̂ATE  = (1/N)  Σ_{i=1}^{N} Yi ( Di − p̂(Xi) ) / ( p̂(Xi)(1 − p̂(Xi)) )
α̂ATET = (1/N1) Σ_{i=1}^{N} Yi ( Di − p̂(Xi) ) / ( 1 − p̂(Xi) )

Several improvements and variants have been proposed (e.g.,


normalizing the weights so that they sum to one, Imbens 2004).
Standard errors:
We need to adjust the s.e.’s for first-step estimation of p(X )
Parametric p(X ): Newey & McFadden (1994)
Non-parametric p(X ): Newey (1994), Hirano, Imbens, and Ridder
(2003)
Or bootstrap the entire two-step procedure
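
A sketch of that last option, bootstrapping the entire two-step procedure (assuming NumPy and scikit-learn; suitable for smooth weighting estimators, not for matching, where the bootstrap does not work in general; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate_bootstrap_se(y, d, x, n_boot=500, seed=0):
    """Bootstrap s.e. for the two-step IPW ATE estimator. The propensity
    score is re-estimated inside every replication, so first-step
    estimation error is reflected in the standard error."""
    rng = np.random.default_rng(seed)
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)

    def ipw_ate(yb, db, xb):
        p = LogisticRegression(C=1e6).fit(xb, db).predict_proba(xb)[:, 1]
        return np.mean(yb * (db - p) / (p * (1 - p)))

    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
        draws.append(ipw_ate(y[idx], d[idx], x[idx]))
    return ipw_ate(y, d, x), np.std(draws, ddof=1)
```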

Doubly robust estimators

Combine propensity-score based and regression based estimation.


Estimators that only need the propensity score or the regression
function to be correctly specified (Bang and Robins, 2005).
α̂ATE = (1/N) Σ_{i=1}^{N} [ Di (Yi − μ̂1(Xi)) / p̂(Xi) + μ̂1(Xi) ]
     − (1/N) Σ_{i=1}^{N} [ (1 − Di)(Yi − μ̂0(Xi)) / (1 − p̂(Xi)) + μ̂0(Xi) ].

Doubly robust estimators, and more generally locally robust


estimators (Chernozhukov et al., 2016) have appealing properties
in terms of bias and in terms of inference after model selection in
the first step.
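
A minimal sketch of this doubly robust (AIPW) formula (assuming NumPy and scikit-learn; logit score and OLS outcome regressions are illustrative choices, and the cross-fitting used by locally robust estimators is omitted to keep the sketch short):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(y, d, x):
    """Doubly robust (AIPW) sketch of the ATE: a logit propensity score
    combined with OLS outcome regressions fit separately on treated and
    untreated units."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)

    p = LogisticRegression(C=1e6).fit(x, d).predict_proba(x)[:, 1]
    mu1 = LinearRegression().fit(x[d == 1], y[d == 1]).predict(x)
    mu0 = LinearRegression().fit(x[d == 0], y[d == 0]).predict(x)

    treated_term = d * (y - mu1) / p + mu1
    control_term = (1 - d) * (y - mu0) / (1 - p) + mu0
    return np.mean(treated_term - control_term)
```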
Abadie and Imbens (2011)
Table 2. Experimental and nonexperimental estimates for the NSW data

M = 1          M = 4          M = 16         M = 64         M = 2490
Est. (SE)      Est. (SE)      Est. (SE)      Est. (SE)      Est. (SE)
Panel A:
Experimental estimates
Covariate matching 1.22 (0.84) 1.99 (0.74) 1.75 (0.74) 2.20 (0.70) 1.79 (0.67)
Bias-adjusted cov matching 1.16 (0.84) 1.84 (0.74) 1.54 (0.75) 1.74 (0.71) 1.72 (0.68)
Pscore matching 1.43 (0.81) 1.95 (0.69) 1.85 (0.69) 1.85 (0.68) 1.79 (0.67)
Bias-adjusted pscore matching 1.22 (0.81) 1.89 (0.71) 1.78 (0.70) 1.67 (0.69) 1.72 (0.68)
Regression estimates
Mean difference 1.79 (0.67)
Linear 1.72 (0.68)
Quadratic 2.27 (0.80)
Weighting on pscore 1.79 (0.67)
Weighting and linear regression 1.69 (0.66)
Panel B:
Nonexperimental estimates
Simple matching 2.07 (1.13) 1.62 (0.91) 0.47 (0.85) −0.11 (0.75) −15.20 (0.61)
Bias-adjusted matching 2.42 (1.13) 2.51 (0.90) 2.48 (0.83) 2.26 (0.71) 0.84 (0.63)
Pscore matching 2.32 (1.21) 2.06 (1.01) 0.79 (1.25) −0.18 (0.92) −1.55 (0.80)
Bias-adjusted pscore matching 3.10 (1.21) 2.61 (1.03) 2.37 (1.28) 2.32 (0.94) 2.00 (0.84)
Regression estimates
Mean difference −15.20 (0.66)
Linear 0.84 (0.88)
Quadratic 3.26 (1.04)
Weighting on pscore 1.77 (0.67)
Weighting and linear regression 1.65 (0.66)
NOTE: The outcome is earnings in 1978 in thousands of dollars.

What to match on: A brief introduction to DAGs


A Directed Acyclic Graph (DAG) is a set of nodes (vertices) and
directed edges (arrows) with no directed cycles.

[Example DAG with nodes Z, X, and Y]

Nodes represent variables.


Arrows represent direct causal effects (“direct” means not
mediated by other variables in the graph).
A causal DAG must include:
1 All direct causal effects among the variables in the graph
2 All common causes (even if unmeasured) of any pair of
variables in the graph
Some DAG concepts

In the DAG:
[DAG: Z → X → Y, with U → X and U → Y]

U is a parent of X and Y .
X and Y are descendants of Z .
There is a directed path from Z to Y .
There are two paths from Z to U (but no directed path).
X is a collider of the path Z → X ← U.
X is a noncollider of the path Z → X → Y .

Confounding
Confounding arises when the treatment and the outcome have
common causes.
[DAG: D → Y, with X a common cause (X → D and X → Y)]
The association between D and Y does not only reflect the
causal effect of D on Y .
Confounding creates backdoor paths, that is, paths starting
with incoming arrows. In the DAG we can see a backdoor
path from D to Y (D ← X → Y ).
However, once we “block” the backdoor path by conditioning
on the common cause, X , the association between D and Y is
only reflective of the effect of D on Y .
[Same DAG, now conditioning on X, which blocks the backdoor path D ← X → Y]
Blocked paths
A path is blocked if and only if:
It contains a noncollider that has been conditioned on,
Or, it contains a collider that has not been conditioned on and
has no descendants that have been conditioned on.
Examples:
1 Conditioning on a noncollider blocks a path:
  [diagram: path X – Z – Y with the noncollider Z conditioned on]
2 Conditioning on a collider opens a path:
  [diagram: path Z → X ← Y with the collider X conditioned on]
3 Not conditioning on a collider (or its descendants) leaves a
path blocked:
  [diagram: path Z → X ← Y with X not conditioned on]

Backdoor criterion
Suppose that:
D is a treatment,
Y is an outcome,
X1 , . . . , Xk is a set of covariates.

Is it enough to match on X1 , . . . , Xk in order to estimate the


causal effect of D on Y ? Pearl’s Backdoor Criterion
provides sufficient conditions.
Backdoor criterion: X1 , . . . , Xk satisfies the backdoor criterion
with respect to (D, Y ) if:
1 No element of X1 , . . . , Xk is a descendant of D.
2 All backdoor paths from D to Y are blocked by X1 , . . . , Xk .

If X1 , . . . , Xk satisfies the backdoor criterion with respect to


(D, Y ), then matching on X1 , . . . , Xk identifies the causal
effect of D on Y .
Implications for practice
Matching on all common causes is sufficient: There are
two backdoor paths from D to Y .

[DAG: D → Y, with X1 → D, X1 → Y, X2 → D, X2 → Y]

Conditioning on X1 and X2 blocks the backdoor paths.


Matching may work even if not all common causes are
observed: U and X1 are common causes.
[DAG with nodes U, X1, X2, D, Y; U and X1 are the common causes of D and Y]

Conditioning on X1 and X2 is enough.

Implications for practice (cont.)


Matching on an outcome may create bias: There is only
one backdoor path from D to Y .

[DAG: D → Y, with X1 → D and X1 → Y; X2 is an outcome of both D and Y]
Conditioning on X1 blocks the backdoor path. Conditioning
on X2 would open a path!
Matching on all pretreatment covariates is not always
the answer: There is one backdoor path and it is closed.
[DAG: D → Y; the only backdoor path from D to Y runs through unobserved U1 and U2 and is closed at the collider X]
No confounding. Conditioning on X would open a path!
Implications for practice (cont.)

There may be more than one set of conditioning


variables that satisfy the backdoor criterion:
[Two copies of the same DAG in which the backdoor paths from D to Y through the common causes X1 and X2 also pass through X3; the left panel conditions on X1 and X2, the right panel on X3 only]

Conditioning on the common causes, X1 and X2 , is sufficient,


as always.
But conditioning on X3 only also blocks the backdoor paths.

Appendix:
Matching Distance Metric
Matching: Distance metric
When the vector of matching covariates,

X = (X1, X2, ..., Xk)',

has more than one dimension (k > 1) we need to define a distance
metric to measure “closeness”. The usual Euclidean distance is:

‖Xi − Xj‖ = √( (Xi − Xj)'(Xi − Xj) ) = √( Σ_{n=1}^{k} (Xni − Xnj)² ).

⇒ The Euclidean distance is not invariant to changes in the scale of the X’s.
⇒ For this reason, we often use alternative distances that are invariant to
changes in scale.

Matching: Distance metric


A commonly used distance is the normalized Euclidean distance:

‖Xi − Xj‖ = √( (Xi − Xj)' V̂⁻¹ (Xi − Xj) ),

where V̂ = diag( σ̂1², σ̂2², ..., σ̂k² ).

Notice that the normalized Euclidean distance is equal to:

‖Xi − Xj‖ = √( Σ_{n=1}^{k} (Xni − Xnj)² / σ̂n² ).

⇒ Changes in the scale of Xni also affect σ̂n, so the normalized
Euclidean distance does not change.
Matching: Distance metric

Another popular scale-invariant distance is the Mahalanobis distance:

‖Xi − Xj‖ = √( (Xi − Xj)' Σ̂X⁻¹ (Xi − Xj) ),

where Σ̂X is the sample variance-covariance matrix of X.

We can also define arbitrary distances:

‖Xi − Xj‖ = √( Σ_{n=1}^{k} ωn · (Xni − Xnj)² )

(with all ωn ≥ 0) so that we assign large ωn’s to those covariates that we
want to match particularly well.
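
A minimal sketch of these three metrics (NumPy only; the function names and argument conventions are illustrative assumptions):

```python
import numpy as np

def normalized_euclidean(xi, xj, x_sample):
    """Euclidean distance with each coordinate difference divided by its
    sample standard deviation, so the metric is scale invariant."""
    sd = np.std(x_sample, axis=0, ddof=1)
    return np.sqrt(np.sum(((np.asarray(xi, float) - np.asarray(xj, float)) / sd) ** 2))

def mahalanobis(xi, xj, x_sample):
    """Mahalanobis distance based on the sample covariance matrix of X."""
    cov_inv = np.linalg.inv(np.cov(x_sample, rowvar=False))
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    return np.sqrt(diff @ cov_inv @ diff)

def weighted_euclidean(xi, xj, w):
    """Arbitrary weighted distance with nonnegative weights, so covariates
    we want to match particularly well receive large weights."""
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    return np.sqrt(np.sum(np.asarray(w, float) * diff ** 2))
```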
