
What's New in Econometrics

Lecture 1
Estimation of Average Treatment Effects Under Unconfoundedness

Guido Imbens
NBER Summer Institute, 2007
Outline
1. Introduction
2. Potential Outcomes
3. Estimands and Identification
4. Estimation and Inference
5. Assessing Unconfoundedness (not testable)
6. Overlap
7. Illustration Based on Lalonde Data
1. Introduction

We are interested in estimating the average effect of a program or treatment, allowing for heterogeneous effects, and assuming that selection can be taken care of by adjusting for differences in observed covariates.

This setting is of great applied interest.

There is a long literature in both statistics and economics. Influential economics/econometrics papers include Ashenfelter and Card (1985), Barnow, Cain and Goldberger (1980), Card and Sullivan (1988), Dehejia and Wahba (1999), Hahn (1998), Heckman and Hotz (1989), Heckman and Robb (1985), and Lalonde (1986). In the statistics literature, key work is by Rubin (1974, 1978) and Rosenbaum and Rubin (1983).
This is an unusual case with many proposed (semi-parametric) estimators (matching, regression, propensity score methods, or combinations), many of which are actually used in practice.

We discuss implementation and assessment of the critical assumptions (even though they are not testable).

In practice, concern with overlap in the covariate distributions tends to be important.

Once overlap issues are addressed, the choice of estimator is less important. Estimators combining matching and regression or weighting and regression are recommended for robustness reasons.

There is a key role for analysis of the joint distribution of the treatment indicator and covariates prior to using the outcome data.
2. Potential Outcomes (Rubin, 1974)

We observe N units, indexed by i = 1, ..., N, viewed as drawn randomly from a large population.

We postulate the existence for each unit of a pair of potential outcomes:

Y_i(0) for the outcome under the control treatment, and
Y_i(1) for the outcome under the active treatment.

Y_i(1) - Y_i(0) is the unit-level causal effect.

Covariates X_i (not affected by the treatment).

Each unit is exposed to a single treatment: W_i = 0 if unit i receives the control treatment and W_i = 1 if unit i receives the active treatment. We observe for each unit the triple (W_i, Y_i, X_i), where Y_i is the realized outcome:

Y_i \equiv Y_i(W_i) = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}
Several additional pieces of notation.

First, the propensity score (Rosenbaum and Rubin, 1983) is defined as the conditional probability of receiving the treatment,

e(x) = \Pr(W_i = 1 \mid X_i = x) = E[W_i \mid X_i = x].

Also the two conditional regression and variance functions:

\mu_w(x) = E[Y_i(w) \mid X_i = x], \qquad \sigma_w^2(x) = V(Y_i(w) \mid X_i = x).
3. Estimands and Identification

Population average treatment effects:

\tau_P = E[Y_i(1) - Y_i(0)], \qquad \tau_{P,T} = E[Y_i(1) - Y_i(0) \mid W_i = 1].

Most of the discussion in these notes will focus on \tau_P, with extensions to \tau_{P,T} available in the references.

We will also look at the sample average treatment effect (SATE):

\tau_S = \frac{1}{N} \sum_{i=1}^{N} (Y_i(1) - Y_i(0)).

The distinction between \tau_P and \tau_S does not matter for estimation, but it matters for the variance.
4. Estimation and Inference

Assumption 1 (Unconfoundedness; Rosenbaum and Rubin, 1983a)

(Y_i(0), Y_i(1)) \perp W_i \mid X_i.

Also known as the conditional independence assumption or selection on observables; in the missing data literature, missing at random.

To see the link with standard exogeneity assumptions, assume a constant effect and a linear regression function:

Y_i(0) = \alpha + \beta' X_i + \varepsilon_i, \quad \Longrightarrow \quad Y_i = \alpha + \tau W_i + \beta' X_i + \varepsilon_i,

with \varepsilon_i \perp X_i. Given the constant treatment effect assumption, unconfoundedness is equivalent to independence of W_i and \varepsilon_i conditional on X_i, which would also capture the idea that W_i is exogenous.
Motivation for Unconfoundedness Assumption (I)

The first is a statistical, data-descriptive motivation.

A natural starting point in the evaluation of any program is a comparison of average outcomes for treated and control units.

A logical next step is to adjust any difference in average outcomes for differences in exogenous background characteristics (exogenous in the sense of not being affected by the treatment).

Such an analysis may not lead to the final word on the efficacy of the treatment, but the absence of such an analysis would seem difficult to rationalize in a serious attempt to understand the evidence regarding the effect of the treatment.
Motivation for Unconfoundedness Assumption (II)

A second argument is that almost any evaluation of a treatment involves comparisons of units who received the treatment with units who did not.

The question is typically not whether such a comparison should be made, but rather which units should be compared, that is, which units best represent the treated units had they not been treated.

It is clear that settings where some of the necessary covariates are not observed will require strong assumptions to allow for identification (e.g., instrumental variables settings). Absent those assumptions, typically only bounds can be identified (e.g., Manski, 1990, 1995).
Motivation for Unconfoundedness Assumption (III)

Example of a model that is consistent with unconfoundedness: suppose we are interested in estimating the average effect of a binary input on a firm's output, Y_i = g(W_i, \varepsilon_i).

Suppose that profits are output minus costs, \pi_i(w) = g(w, \varepsilon_i) - c_i \cdot w, and the firm chooses the input to maximize expected profits:

W_i = \arg\max_w E[\pi_i(w) \mid c_i] = \arg\max_w E[g(w, \varepsilon_i) - c_i \cdot w \mid c_i],

implying

W_i = 1\{ E[g(1, \varepsilon_i) - g(0, \varepsilon_i) - c_i \mid c_i] \ge 0 \} = h(c_i).

If unobserved marginal costs c_i differ between firms, and these marginal costs are independent of the errors \varepsilon_i in the firm's forecast of output given inputs, then unconfoundedness will hold, as

(g(0, \varepsilon_i), g(1, \varepsilon_i)) \perp c_i.
Overlap

The second assumption is on the joint distribution of treatments and covariates:

Assumption 2 (Overlap)

0 < \Pr(W_i = 1 \mid X_i) < 1.

Rosenbaum and Rubin (1983a) refer to the combination of the two assumptions as strongly ignorable treatment assignment.
Identification Given Assumptions

\tau(x) \equiv E[Y_i(1) - Y_i(0) \mid X_i = x] = E[Y_i(1) \mid X_i = x] - E[Y_i(0) \mid X_i = x]
  = E[Y_i(1) \mid X_i = x, W_i = 1] - E[Y_i(0) \mid X_i = x, W_i = 0]
  = E[Y_i \mid X_i = x, W_i = 1] - E[Y_i \mid X_i = x, W_i = 0].

To make this feasible, one needs to be able to estimate the expectations E[Y_i \mid X_i = x, W_i = w] for all values of w and x in the support of these variables. This is where overlap is important.

Given identification of \tau(x),

\tau_P = E[\tau(X_i)].
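To make the identification argument concrete, here is a minimal simulation sketch (the data-generating process and all parameter values are illustrative, not from the lecture). With a discrete covariate, \tau(x) is estimated by within-cell differences in mean outcomes and then averaged over the marginal distribution of X_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DGP in which unconfoundedness and overlap hold by construction.
N = 100_000
X = rng.integers(0, 4, size=N)              # discrete covariate with 4 cells
e = 0.2 + 0.15 * X                          # true propensity score, inside (0, 1)
W = rng.binomial(1, e)                      # treatment assignment
Y = 1.0 * X + 2.0 * W + rng.normal(size=N)  # true tau(x) = 2 for every x

# tau(x) = E[Y | X = x, W = 1] - E[Y | X = x, W = 0], cell by cell
tau_x = np.array([Y[(X == x) & (W == 1)].mean() - Y[(X == x) & (W == 0)].mean()
                  for x in range(4)])
p_x = np.bincount(X, minlength=4) / N       # marginal distribution of X
print(tau_x @ p_x)                          # tau_P = E[tau(X_i)], close to 2.0
```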
Alternative Assumptions

One can instead assume only conditional mean independence:

E[Y_i(w) \mid W_i, X_i] = E[Y_i(w) \mid X_i], \quad \text{for } w = 0, 1.

Although this assumption is unquestionably weaker, in practice it is rare that a convincing case can be made for the weaker assumption without the case being equally strong for the stronger assumption.

The reason is that the weaker assumption is intrinsically tied to functional form assumptions, and as a result one cannot identify average effects on transformations of the original outcome (e.g., logarithms) without the stronger assumption.

If we are interested in \tau_{P,T}, it is sufficient to assume

Y_i(0) \perp W_i \mid X_i.
Propensity Score

Result 1. Suppose that Assumption 1 holds. Then:

(Y_i(0), Y_i(1)) \perp W_i \mid e(X_i).

One only needs to condition on a scalar function of the covariates, which is much easier in practice if X_i is high-dimensional. (The problem is that the propensity score e(x) is almost never known.)
Efficiency Bound

Hahn (1998): for any regular estimator of \tau_P, denoted by \hat\tau, with

\sqrt{N} (\hat\tau - \tau_P) \overset{d}{\longrightarrow} N(0, V),

the variance must satisfy:

V \ge E\left[ \frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + (\tau(X_i) - \tau_P)^2 \right].   (1)

Estimators exist that achieve this bound.
Estimators
A. Regression Estimators
B. Matching
C. Propensity Score Estimators
D. Mixed Estimators (recommended)
A. Regression Estimators

Estimate \mu_w(x) consistently and estimate \tau_P or \tau_S as

\hat\tau_{reg} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat\mu_1(X_i) - \hat\mu_0(X_i) \right).

Simple implementations include

\mu_w(x) = \beta' x + \tau \cdot w,

in which case the average treatment effect is equal to \tau. In this case one can estimate \tau simply by least squares estimation using the regression function

Y_i = \alpha + \beta' X_i + \tau W_i + \varepsilon_i.

More generally, one can specify separate regression functions for the two regimes, \mu_w(x) = \beta_w' x.
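A minimal sketch of the regression estimator with separate regression functions for the two regimes (the function name and the linear-in-covariates specification are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def regression_ate(Y, W, X):
    """Fit separate linear regressions mu_w(x) = alpha_w + beta_w'x in each
    treatment arm, then average the predicted difference over the full
    sample: tau_hat = (1/N) sum_i [mu1_hat(X_i) - mu0_hat(X_i)]."""
    X1 = np.column_stack([np.ones(len(Y)), np.asarray(X).reshape(len(Y), -1)])
    b0, *_ = np.linalg.lstsq(X1[W == 0], Y[W == 0], rcond=None)
    b1, *_ = np.linalg.lstsq(X1[W == 1], Y[W == 1], rcond=None)
    return float(np.mean(X1 @ b1 - X1 @ b0))
```

With the single pooled regression Y_i = \alpha + \beta'X_i + \tau W_i + \varepsilon_i, the coefficient on W_i estimates \tau directly.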
These simple regression estimators can be sensitive to differences in the covariate distributions for treated and control units.

The reason is that in that case the regression estimators rely heavily on extrapolation.

Note that \mu_0(x) is used to predict the missing outcomes for the treated. Hence on average one wishes to predict the control outcome at \bar{X}_T = \sum_i W_i X_i / N_T, the average covariate value for the treated. With a linear regression function, the average prediction can be written as

\bar{Y}_C + \hat\beta' (\bar{X}_T - \bar{X}_C).

If \bar{X}_T and \bar{X}_C are close, the precise specification of the regression function will not matter much for the average prediction. With the two averages very different, the prediction based on a linear regression function can be sensitive to changes in the specification.
B. Matching

Let \ell_m(i) be the m-th closest match, that is, the index l that satisfies W_l \ne W_i and

\sum_{j : W_j \ne W_i} 1\{ \|X_j - X_i\| \le \|X_l - X_i\| \} = m.

Let J_M(i) = \{\ell_1(i), ..., \ell_M(i)\} denote the set of the M closest matches. Then impute the missing potential outcomes as

\hat{Y}_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \frac{1}{M} \sum_{j \in J_M(i)} Y_j & \text{if } W_i = 1, \end{cases}
\qquad
\hat{Y}_i(1) = \begin{cases} \frac{1}{M} \sum_{j \in J_M(i)} Y_j & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1. \end{cases}

The simple matching estimator is

\hat\tau_M^{sm} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{Y}_i(1) - \hat{Y}_i(0) \right).   (2)
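A sketch of the simple matching estimator in equation (2), matching with replacement on the Euclidean metric (the metric and function name are illustrative choices):

```python
import numpy as np

def matching_ate(Y, W, X, M=1):
    """Impute each unit's missing potential outcome by the average outcome
    of its M closest matches in the opposite treatment arm, then average
    Y_hat_i(1) - Y_hat_i(0) over the sample (equation (2))."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    Y0, Y1 = Y.astype(float).copy(), Y.astype(float).copy()
    for i in range(len(Y)):
        others = np.flatnonzero(W != W[i])            # units with W_j != W_i
        d = np.linalg.norm(X[others] - X[i], axis=1)
        J_M = others[np.argsort(d)[:M]]               # the M closest matches
        if W[i] == 1:
            Y0[i] = Y[J_M].mean()                     # impute Y_i(0)
        else:
            Y1[i] = Y[J_M].mean()                     # impute Y_i(1)
    return float(np.mean(Y1 - Y0))
```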
Issues with Matching

The bias is of order O(N^{-1/K}), where K is the dimension of the covariates. It is important in large samples if K \ge 2 (and dominates the variance asymptotically if K \ge 3).

Not efficient (but the efficiency loss is small).

Easy to implement, robust.
C.1 Propensity Score Estimators: Weighting

E\left[ \frac{W Y}{e(X)} \right] = E\left[ E\left[ \frac{W Y_i(1)}{e(X)} \,\Big|\, X \right] \right] = E\left[ \frac{e(X) Y_i(1)}{e(X)} \right] = E[Y_i(1)],

and similarly

E\left[ \frac{(1 - W) Y}{1 - e(X)} \right] = E[Y_i(0)],

implying

\tau_P = E\left[ \frac{W \cdot Y}{e(X)} - \frac{(1 - W) \cdot Y}{1 - e(X)} \right].

With the propensity score known one can directly implement this estimator as

\hat\tau = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{W_i Y_i}{e(X_i)} - \frac{(1 - W_i) Y_i}{1 - e(X_i)} \right).   (3)
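With a known propensity score, estimator (3) is essentially one line (a sketch; in practice e(X_i) must be estimated, as discussed next):

```python
import numpy as np

def ipw_ate(Y, W, e):
    """Unnormalized weighting estimator (3); e is the propensity score
    evaluated at each X_i."""
    return float(np.mean(W * Y / e - (1 - W) * Y / (1 - e)))
```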
Implementation of the Horvitz-Thompson Estimator

Estimate e(x) flexibly (Hirano, Imbens and Ridder, 2003):

\hat\tau_{weight} = \sum_{i=1}^{N} \frac{W_i Y_i}{\hat{e}(X_i)} \Bigg/ \sum_{i=1}^{N} \frac{W_i}{\hat{e}(X_i)} \;-\; \sum_{i=1}^{N} \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \Bigg/ \sum_{i=1}^{N} \frac{1 - W_i}{1 - \hat{e}(X_i)}.

This is efficient given a nonparametric estimator for e(x).

Potentially sensitive to the estimator for the propensity score.
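A sketch of the normalized (ratio) version above, with a logistic regression standing in for the flexible score estimator; the parametric logit is an assumption of this sketch, not the sieve estimator of Hirano, Imbens and Ridder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighting_ate(Y, W, X):
    """Normalized weighting estimator: the weights are rescaled to sum to
    one within each treatment arm, which typically stabilizes the estimate
    when some estimated scores are close to 0 or 1."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    w1 = W / e_hat
    w0 = (1 - W) / (1 - e_hat)
    return float(w1 @ Y / w1.sum() - w0 @ Y / w0.sum())
```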
Matching or Regression on the Propensity Score

It is not clear what the advantages are.

Large sample properties are not known.

Simulation results are not encouraging.
D.1 Mixed Estimators: Weighting and Regression

Interpret the Horvitz-Thompson estimator as a weighted regression estimator:

Y_i = \alpha + \tau W_i + \varepsilon_i, \quad \text{with weights} \quad \lambda_i = \sqrt{ \frac{W_i}{e(X_i)} + \frac{1 - W_i}{1 - e(X_i)} }.

This weighted-least-squares representation suggests that one may add covariates to the regression function to improve precision, for example as

Y_i = \alpha + \beta' X_i + \tau W_i + \varepsilon_i,

with the same weights \lambda_i. Such an estimator is consistent as long as either the regression model or the propensity score (and thus the weights) is specified correctly. That is, in the Robins-Ritov terminology, the estimator is doubly robust.
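A sketch of this weighted-least-squares estimator; scaling the rows by \lambda_i turns the objective into \sum_i \lambda_i^2 (Y_i - \alpha - \beta'X_i - \tau W_i)^2 (the helper name is illustrative):

```python
import numpy as np

def weighted_regression_ate(Y, W, X, e_hat):
    """Doubly robust weighting-plus-regression estimator: WLS of Y on
    (1, X, W) with weights lambda_i^2 = W_i/e(X_i) + (1-W_i)/(1-e(X_i));
    the coefficient on W is the estimate of tau."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    lam = np.sqrt(W / e_hat + (1 - W) / (1 - e_hat))
    D = np.column_stack([np.ones(len(Y)), X, W])
    coef, *_ = np.linalg.lstsq(D * lam[:, None], Y * lam, rcond=None)
    return float(coef[-1])
```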
Matching and Regression

First match observations. Define

\hat{X}_i(0) = \begin{cases} X_i & \text{if } W_i = 0, \\ X_{\ell(i)} & \text{if } W_i = 1, \end{cases}
\qquad
\hat{X}_i(1) = \begin{cases} X_{\ell(i)} & \text{if } W_i = 0, \\ X_i & \text{if } W_i = 1. \end{cases}

Then adjust the within-pair difference for the within-pair difference in covariates, \hat{X}_i(1) - \hat{X}_i(0):

\hat\tau_M^{adj} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{Y}_i(1) - \hat{Y}_i(0) - \hat\beta' \left( \hat{X}_i(1) - \hat{X}_i(0) \right) \right),

using a regression estimate for \beta.

This can eliminate the bias of the matching estimator given a flexible specification of the regression function.
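A sketch of single-match (M = 1) matching with regression adjustment; the pooled linear regression used for \hat\beta here is one simple choice of adjustment, not prescribed by the lecture:

```python
import numpy as np

def matching_regression_ate(Y, W, X):
    """Adjust each within-pair difference Y_hat_i(1) - Y_hat_i(0) by
    beta_hat'(X_hat_i(1) - X_hat_i(0)), with beta_hat from a pooled
    regression of Y on X."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    beta = np.linalg.lstsq(np.column_stack([np.ones(len(Y)), X]), Y,
                           rcond=None)[0][1:]          # drop the intercept
    adj = []
    for i in range(len(Y)):
        others = np.flatnonzero(W != W[i])
        j = others[np.argmin(np.linalg.norm(X[others] - X[i], axis=1))]
        sign = 1.0 if W[i] == 1 else -1.0              # orient treated minus control
        adj.append(sign * (Y[i] - Y[j]) - sign * (beta @ (X[i] - X[j])))
    return float(np.mean(adj))
```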
Estimation of the Variance

For efficient estimators of \tau_P:

V_P = E\left[ \frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + \left( \mu_1(X_i) - \mu_0(X_i) - \tau \right)^2 \right].

Estimate all components nonparametrically, and plug in.

Alternatively, use the bootstrap. (The bootstrap does not work for the matching estimator.)
Estimation of the Variance

All the estimators of \tau_S considered here can be written, for some known weights \lambda_i(X, W), as

\hat\tau = \sum_{i=1}^{N} \lambda_i(X, W) \cdot Y_i, \qquad V(\hat\tau \mid X, W) = \sum_{i=1}^{N} \lambda_i(X, W)^2 \cdot \sigma_{W_i}^2(X_i).

To estimate \sigma_{W_i}^2(X_i) one uses the closest match within the set of units with the same treatment indicator. Let v(i) be the closest unit to i with the same treatment indicator.

The sample variance of the outcome variable for these two units can then be used to estimate \sigma_{W_i}^2(X_i):

\hat\sigma_{W_i}^2(X_i) = \left( Y_i - Y_{v(i)} \right)^2 / 2.
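A sketch of this variance estimator for a generic weight vector \lambda, for instance the implicit weights of the weighting estimator (the function name is illustrative):

```python
import numpy as np

def variance_tau_S(Y, W, X, lam):
    """Plug-in variance for tau_hat = sum_i lam_i Y_i: estimate
    sigma^2_{W_i}(X_i) from each unit and its closest neighbor v(i)
    with the same treatment indicator, then sum lam_i^2 * sigma2_i."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    idx = np.arange(len(Y))
    sig2 = np.empty(len(Y))
    for i in idx:
        same = np.flatnonzero((W == W[i]) & (idx != i))
        v = same[np.argmin(np.linalg.norm(X[same] - X[i], axis=1))]
        sig2[i] = (Y[i] - Y[v]) ** 2 / 2.0   # (Y_i - Y_v(i))^2 / 2
    return float(np.sum(lam ** 2 * sig2))
```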
5.I Assessing Unconfoundedness: Multiple Control Groups

Suppose we have a three-valued indicator T_i \in \{-1, 0, 1\} for the groups (e.g., ineligibles, eligible nonparticipants, and participants), with the treatment indicator equal to W_i = 1\{T_i = 1\}, so that

Y_i = \begin{cases} Y_i(0) & \text{if } T_i \in \{-1, 0\}, \\ Y_i(1) & \text{if } T_i = 1. \end{cases}

Suppose we extend the unconfoundedness assumption to independence of the potential outcomes and the three-valued group indicator given covariates:

Y_i(0), Y_i(1) \perp T_i \mid X_i.
Now a testable implication is

Y_i(0) \perp 1\{T_i = 0\} \mid X_i, \quad \text{for } T_i \in \{-1, 0\},

and thus

Y_i \perp 1\{T_i = 0\} \mid X_i, \quad \text{for } T_i \in \{-1, 0\}.

An implication of this independence condition is what is being tested by the tests discussed above. Whether this test has much bearing on the unconfoundedness assumption depends on whether the extension of the assumption is plausible given unconfoundedness itself.
5.II Assessing Unconfoundedness: Estimate Effects on Pseudo-Outcomes

Suppose the covariates consist of a number of lagged outcomes Y_{i,-1}, ..., Y_{i,-T} as well as time-invariant individual characteristics Z_i, so that X_i = (Y_{i,-1}, ..., Y_{i,-T}, Z_i).

Now consider the following two assumptions. The first is unconfoundedness given only T - 1 lags of the outcome:

Y_{i,0}(1), Y_{i,0}(0) \perp W_i \mid Y_{i,-1}, ..., Y_{i,-(T-1)}, Z_i,

and the second assumes stationarity and exchangeability. Then it follows that

Y_{i,-1} \perp W_i \mid Y_{i,-2}, ..., Y_{i,-T}, Z_i,

which is testable.
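A sketch of the resulting placebo test, implemented here as a simple regression of the pseudo-outcome on the treatment indicator and the remaining covariates; in practice one would use the same estimator as for the actual outcome (names are illustrative):

```python
import numpy as np

def pseudo_outcome_test(Y_lag1, W, X_rest):
    """Regress Y_{i,-1} on (1, X_rest, W), where X_rest collects
    Y_{i,-2}, ..., Y_{i,-T} and Z_i. Under the testable implication the
    coefficient on W should be close to zero."""
    X_rest = np.asarray(X_rest, dtype=float).reshape(len(W), -1)
    D = np.column_stack([np.ones(len(W)), X_rest, W])
    coef, *_ = np.linalg.lstsq(D, Y_lag1, rcond=None)
    return float(coef[-1])   # estimated "effect" on the pseudo-outcome
```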
6.I Assessing Overlap

The first method to detect lack of overlap is to plot the distributions of covariates by treatment group. In the case with one or two covariates one can do this directly. In high-dimensional cases, however, this becomes more difficult.

One can inspect pairs of marginal distributions by treatment status, but these are not necessarily informative about lack of overlap. It is possible that for each covariate the distributions for the treatment and control groups are identical, even though there are areas where the propensity score is zero or one.

A more direct method is to inspect the distribution of the propensity score in both treatment groups, which can reveal lack of overlap in the multivariate covariate distributions.
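A sketch of that check, using a logistic regression for the score and a histogram by treatment arm (the estimation and plotting choices are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

def plot_overlap(W, X):
    """Histogram of estimated propensity scores by treatment group.
    Mass near 0 or 1, or regions covered by only one group, signal
    limited overlap in the multivariate covariate distributions."""
    X = np.asarray(X, dtype=float).reshape(len(W), -1)
    e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    bins = np.linspace(0.0, 1.0, 21)
    plt.hist(e_hat[W == 1], bins=bins, alpha=0.5, label="treated")
    plt.hist(e_hat[W == 0], bins=bins, alpha=0.5, label="controls")
    plt.xlabel("estimated propensity score")
    plt.legend()
    plt.show()
```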
6.II Selecting a Subsample with Overlap

Define average effects for subsamples A:

\tau(A) = \sum_{i=1}^{N} 1\{X_i \in A\} \, \tau(X_i) \Bigg/ \sum_{i=1}^{N} 1\{X_i \in A\}.

The efficiency bound for \tau(A), assuming homoskedasticity, is

\frac{\sigma^2}{q(A)} \cdot E\left[ \frac{1}{e(X)} + \frac{1}{1 - e(X)} \,\Big|\, X \in A \right],

where q(A) = \Pr(X \in A).

Crump, Hotz, Imbens and Mitnik derive the characterization for the set A that minimizes the asymptotic variance.
The optimal set has the form

A^* = \{ x \in \mathbb{X} \mid \alpha \le e(x) \le 1 - \alpha \},

dropping observations with extreme values for the propensity score, with the cutoff value \alpha determined by the equation

\frac{1}{\alpha (1 - \alpha)} = 2 \cdot E\left[ \frac{1}{e(X)(1 - e(X))} \,\Bigg|\, \frac{1}{e(X)(1 - e(X))} \le \frac{1}{\alpha (1 - \alpha)} \right].

Note that this subsample is selected solely on the basis of the joint distribution of the treatment indicators and the covariates, and therefore does not introduce biases associated with selection based on the outcomes.

Calculations for Beta distributions for the propensity score suggest that \alpha = 0.1 approximates the optimal set well in practice.
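A sketch that solves the cutoff equation by grid search over \alpha, using the empirical distribution of estimated scores (a heuristic implementation of the display above; the grid and function name are illustrative):

```python
import numpy as np

def trimming_cutoff(e_hat, grid=np.linspace(0.001, 0.499, 499)):
    """Return the smallest alpha on the grid satisfying
    1/(alpha(1-alpha)) <= 2 E[g | g <= 1/(alpha(1-alpha))],
    where g = 1/(e(X)(1-e(X))); observations with e_hat outside
    [alpha, 1-alpha] are then dropped."""
    g = 1.0 / (e_hat * (1.0 - e_hat))
    for a in grid:
        cut = 1.0 / (a * (1.0 - a))
        kept = g[g <= cut]
        if kept.size > 0 and cut <= 2.0 * kept.mean():
            return float(a)
    return 0.0   # no trimming needed

# Usage: keep = (e_hat >= alpha) & (e_hat <= 1 - alpha); alpha = 0.1 is a
# good rule of thumb according to the calculations cited above.
```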
7. Application to Lalonde Data (Dehejia-Wahba Sample)

Table 1: Summary Statistics

               Controls (N=260)   Trainees (N=185)           CPS (N=15,992)
               mean    (s.d.)     mean    (s.d.)   dif/sd    mean    (s.d.)   dif/sd
Age            25.1    7.06       25.8    7.16      0.1      33.2    11.1     -0.7
Black          0.83    0.38       0.84    0.36      0.0      0.07    0.26      2.8
Education      10.1    1.61       10.4    2.01      0.1      12.0    2.87     -0.6
Hispanic       0.11    0.31       0.06    0.24     -0.2      0.07    0.26     -0.1
Married        0.15    0.36       0.19    0.39      0.1      0.71    0.45     -1.2
Earnings '74   2.11    5.69       2.10    4.89     -0.0      14.0    9.57     -1.2
Earnings '75   1.27    3.10       1.53    3.22      0.1      13.7    9.27     -1.3
Unempl. '74    0.75    0.43       0.71    0.46     -0.1      0.12    0.32      1.8
Unempl. '75    0.68    0.47       0.60    0.49     -0.2      0.11    0.31      1.5
Table 2: Estimates for Lalonde Data with Earnings '75 as Outcome

                       Experimental Controls       CPS Comparison Group
                       mean    (s.e.)   t-stat     mean     (s.e.)   t-stat
Simple Dif             0.27    0.30     0.9       -12.12    0.68    -17.8
OLS (parallel)         0.15    0.22     0.7       -1.15     0.36    -3.2
OLS (separate)         0.12    0.22     0.6       -1.11     0.36    -3.1
P-Score Weighting      0.15    0.30     0.5       -1.17     0.26    -4.5
P-Score Blocking       0.10    0.17     0.6       -2.80     0.56    -5.0
P-Score Regression     0.16    0.30     0.5       -1.68     0.79    -2.1
P-Score Matching       0.23    0.37     0.6       -1.31     0.46    -2.9
Matching               0.14    0.28     0.5       -1.33     0.41    -3.2
Weighting and Regr.    0.15    0.21     0.7       -1.23     0.24    -5.2
Blocking and Regr.     0.09    0.15     0.6       -1.30     0.50    -2.6
Matching and Regr.     0.06    0.28     0.2       -1.34     0.42    -3.2
Table 3: Sample Sizes for CPS Sample

           e(X_i) < 0.1    0.1 <= e(X_i) <= 0.9    0.9 < e(X_i)    All
Controls   15679           313                     0               15992
Trainees   44              141                     0               185
All        15723           454                     0               16177

Dropping observations with a propensity score less than 0.1 leads to discarding most of the controls, 15679 to be precise, leaving only 313 control observations. In addition, 44 out of the 185 treated units are dropped. Nevertheless, the improved balance suggests that we may obtain more precise estimates for the remaining sample.
Table 4: Summary Statistics for Selected CPS Sample

               Controls (N=313)    Trainees (N=141)
               mean    (s.d.)      mean    (s.d.)    dif/sd
Age            26.60   10.97       25.69   7.29      -0.09
Black          0.94    0.23        0.99    0.12       0.21
Education      10.66   2.81        10.26   2.11      -0.15
Hispanic       0.06    0.23        0.01    0.12      -0.21
Married        0.22    0.42        0.13    0.33      -0.24
Earnings '74   1.96    4.08        1.34    3.72      -0.15
Earnings '75   0.92    1.57        0.75    1.48      -0.11
Unempl. '74    0.57    0.50        0.80    0.40       0.49
Unempl. '75    0.55    0.50        0.69    0.46       0.28
Table 5: Estimates on Selected CPS Lalonde Data

                       Earnings '75 Outcome        Earnings '78 Outcome
                       mean    (s.e.)   t-stat     mean    (s.e.)   t-stat
Simple Dif            -0.17    0.16    -1.1        1.73    0.68     2.6
OLS (parallel)        -0.09    0.14    -0.7        2.10    0.71     3.0
OLS (separate)        -0.19    0.14    -1.4        2.18    0.72     3.0
P-Score Weighting     -0.16    0.15    -1.0        1.86    0.75     2.5
P-Score Blocking      -0.25    0.25    -1.0        1.73    1.23     1.4
P-Score Regression    -0.07    0.17    -0.4        2.09    0.73     2.9
P-Score Matching      -0.01    0.21    -0.1        0.65    1.19     0.5
Matching              -0.10    0.20    -0.5        2.10    1.16     1.8
Weighting and Regr.   -0.14    0.14    -1.1        1.96    0.77     2.5
Blocking and Regr.    -0.25    0.25    -1.0        1.73    1.22     1.4
Matching and Regr.    -0.11    0.19    -0.6        2.23    1.16     1.9