05 Covariates

This lecture discusses the use of covariates in Difference-in-Differences (DiD) analysis to improve the plausibility of the Parallel Trends Assumption (PTA). It introduces the Conditional Parallel Trends Assumption, which allows for covariate-specific trends, and the Strong Overlap Assumption to ensure identification of the Average Treatment Effect among Treated units (ATT). The lecture also highlights the limitations of using simple Two-Way Fixed Effect (TWFE) regression with covariates, demonstrating potential biases through a Monte Carlo simulation.


Causal Inference using Difference-in-Differences

Lecture 5: How covariates can make your DiD more plausible

Pedro H. C. Sant’Anna
Emory University

January 2025
Summary of previous lectures
Canonical DiD setup

■ So far, we have considered the canonical DiD setup.

▶ 2 time periods: t = 1 (before treatment) and t = 2 (after treatment).

▶ 2 groups: G = 2 (treated at period 2) and G = ∞ (untreated by period 2).

■ Main parameter of interest: the Average Treatment Effect among Treated units,

ATT ≡ E[Yt=2(2) | G = 2] − E[Yt=2(∞) | G = 2],

where the first term is estimable from the data and the second is the counterfactual component.
Canonical DiD setup

Identification of the ATT is achieved via two main assumptions:


Assumption (No-Anticipation)
For all units i, Yi,t (g) = Yi,t (∞) for all groups in their pre-treatment periods, i.e., for all
t < g.

Assumption (Parallel Trends Assumption)


E [Yi,t=2 (∞)|Gi = 2] − E [Yi,t=1 (∞)|Gi = 2] = E [Yi,t=2 (∞)|Gi = ∞] − E [Yi,t=1 (∞)|Gi = ∞]

PS: We are taking SUTVA for granted from now onwards (NOT without loss of generality,
though)
“Brute force” DiD estimator

■ Canonical DiD Estimator:

θ̂nDiD = (Ȳg=2,t=2 − Ȳg=2,t=1) − (Ȳg=∞,t=2 − Ȳg=∞,t=1),

where Ȳg=d,t=j is the sample mean of the outcome Y for units in group d in time period j,

Ȳg=d,t=j = (1/Ng=d,t=j) ∑_{i=1}^{N·T} Yi 1{Gi = d} 1{Ti = j},

with

Ng=d,t=j = ∑_{i=1}^{N·T} 1{Gi = d} 1{Ti = j},

Gi and Ti are the group and time dummies, respectively, and Yi is the "pooled" outcome data.
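As a concrete illustration, the estimator above is just four cell means. Here is a minimal numpy sketch (Python is an assumption here, since the lecture does not tie itself to a language; the simulated data and the name did_means are purely illustrative):

```python
import numpy as np

def did_means(y, g, t):
    """'Brute force' DiD: difference of group-by-period sample means.

    y: pooled outcomes; g: 1 for the treated group (G = 2); t: 1 for the post period.
    """
    ybar = lambda gi, ti: y[(g == gi) & (t == ti)].mean()
    return (ybar(1, 1) - ybar(1, 0)) - (ybar(0, 1) - ybar(0, 0))

# Illustration on simulated data where the true effect is 3:
rng = np.random.default_rng(0)
n = 2000
g = rng.integers(0, 2, n)
t = rng.integers(0, 2, n)
y = 1.0 + 2.0 * g + 0.5 * t + 3.0 * g * t + rng.normal(0, 1, n)
att_hat = did_means(y, g, t)
```

With four reasonably sized cells, att_hat lands close to the true effect of 3.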
“TWFE” DiD estimator

■ In practice, most of us would rely on the following TWFE regression specification to


estimate the ATT:

Yi,t = α0 + γ0·1{Gi = 2} + λ0·1{Ti = 2} + β0twfe·(1{Gi = 2} · 1{Ti = 2}) + εi,t,

where β0twfe ≡ ATT and E[εi,t | Gi, Ti] = 0 almost surely.
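The same number can be computed by OLS. A numpy sketch (names illustrative); in the saturated 2x2 case the interaction coefficient coincides exactly with the difference-in-means DiD estimator:

```python
import numpy as np

def twfe_2x2(y, g, t):
    """OLS of Y on an intercept, G, T, and G*T; returns the interaction coefficient."""
    X = np.column_stack([np.ones_like(y), g, t, g * t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]

rng = np.random.default_rng(1)
n = 2000
g = rng.integers(0, 2, n).astype(float)
t = rng.integers(0, 2, n).astype(float)
y = 1.0 + 2.0 * g + 0.5 * t + 3.0 * g * t + rng.normal(0, 1, n)

b_twfe = twfe_2x2(y, g, t)
# The saturated regression fits the four cell means exactly, so the interaction
# coefficient equals the difference of group-by-period means (up to float error).
ybar = lambda gi, ti: y[(g == gi) & (t == ti)].mean()
b_means = (ybar(1, 1) - ybar(1, 0)) - (ybar(0, 1) - ybar(0, 0))
```

The exact agreement between b_twfe and b_means is why the TWFE regression "works" in the canonical setup.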
What if the Parallel Trends Assumption is not plausible?
Conditional Parallel Trends
Unconditional parallel trends assumption

■ So far, covariates have played no role in our analysis

■ But what if units with different observed characteristics were to evolve differently in
the absence of treatment?
▶ Effect of minimum wage on employment: is it sensible to assume that, in the absence of treatment, employment in states in the NE of the US would have evolved similarly to employment in states in the South of the US?

▶ Effect of training on earnings: is it reasonable to assume that earnings among young


workers would have evolved similarly to older workers in the absence of treatment?

■ In general, the PTA may be implausible if pre-treatment characteristics that are thought to be associated with the dynamics of the outcome variable are "unbalanced" between the treated and the untreated group (Abadie, 2005).
How can we "relax" the PTA and allow for "covariate-specific" trends?
Conditional Parallel Trends Assumption

■ In order to “relax” the PTA, we can assume that it holds only after conditioning on a
vector of observed pre-treatment covariates

Assumption (Conditional Parallel Trends Assumption)


E [Yt=2 (∞)|G = 2, X] − E [Yt=1 (∞)|G = 2, X] = E [Yt=2 (∞)|G = ∞, X] − E [Yt=1 (∞)|G = ∞, X] a.s.

■ The conditional PT assumption states that, in the absence of treatment, conditional


on X, the evolution of the outcome among the treated units is, on average, the same
as the evolution of the outcome among the untreated units.

■ It allows for covariate-specific trends!

Strong overlap

■ When covariates are available, we will introduce an additional assumption stating


that every unit has a strictly positive probability of being in the untreated group.
Assumption (Strong Overlap Assumption)
The conditional probability of belonging to the treatment group, given observed
characteristics X, is uniformly bounded away from 1.

That is, for some ϵ > 0, P [G = 2|X] < 1 − ϵ almost surely.

■ The covariates X here are the same as those used to justify the conditional PT
assumption!
■ For identification purposes, we can take ϵ = 0. For (standard) inference, though, we would have problems without relying on "extrapolation"; see, e.g., Khan and Tamer (2010).
How do the conditional PTA and strong overlap help us, DiDistas?
Identification of ATT under conditional parallel trends and overlap

1) First, recall the conditional PT assumption:


E [Yt=2 (∞)|G = 2, X] − E [Yt=1 (∞)|G = 2, X] = E [Yt=2 (∞)|G = ∞, X] − E [Yt=1 (∞)|G = ∞, X] .

2) By simple manipulation, we can write it as


E [Yt=2 (∞) |G = 2, X] = E [Yt=1 (∞) |G = 2, X] + (E [Yt=2 (∞) |G = ∞, X] − E [Yt=1 (∞) |G = ∞, X])

3) Now, exploiting No-Anticipation, SUTVA, and strong overlap:

E[Yt=2(∞) | G = 2, X] = E[Yt=1(2) | G = 2, X] + (E[Yt=2(∞) | G = ∞, X] − E[Yt=1(∞) | G = ∞, X])   (by No-Anticipation)

E[Yt=2(∞) | G = 2, X] = E[Yt=1 | G = 2, X] + (E[Yt=2 | G = ∞, X] − E[Yt=1 | G = ∞, X])   (by SUTVA + overlap)
Conditional Parallel Trends and the conditional ATT

■ Let’s define the Conditional ATT: ATT(X) ≡ E [Yt=2 (2) − Yt=2 (∞)|G = 2, X].
■ Now, combining the results of previous slides, we have that, under SUTVA +
No-Anticipation + Conditional PT assumptions, it follows that:

ATT(X) = E [Yt=2 |G = 2, X] − (E [Yt=1 |G = 2, X] + (E [Yt=2 |G = ∞, X] − E [Yt=1 |G = ∞, X]))


= (E [Yt=2 |G = 2, X] − E [Yt=1 |G = 2, X]) − (E [Yt=2 |G = ∞, X] − E [Yt=1 |G = ∞, X])

■ We can identify the conditional ATT function - a very rich object!

■ This also implies that the unconditional ATT is identified - all we have to do is average ATT(X) over the distribution of X among treated units:

ATT = E[ATT(X) | G = 2]
Conditional Parallel Trends and the unconditional ATT

In terms of estimable pieces, we get that

ATT = E [(E [Yt=2 |G = 2, X] − E [Yt=1 |G = 2, X]) − (E [Yt=2 |G = ∞, X] − E [Yt=1 |G = ∞, X])|G = 2]

= (E [Yt=2 |G = 2] − E [Yt=1 |G = 2]) − E [(E [Yt=2 |G = ∞, X] − E [Yt=1 |G = ∞, X])|G = 2]

where the second equality follows from the Law of Iterated Expectations and covariates
and group indicators being stationary (which hold by construction on a balanced panel;
we will come back to this in a bit).

Can we use a simple regression here?

Usage of simple TWFE linear regressions with covariates
The temptation
TWFE DiD estimator

■ Under the unconditional PTA, we have shown that we can use the TWFE regression to recover the ATT:

Yi,t = α0 + γ0·1{Gi = 2} + λ0·1{Ti = 2} + β0twfe·(1{Gi = 2} · 1{Ti = 2}) + εi,t,

where β0twfe ≡ ATT and E[εi,t | Gi, Ti] = 0 almost surely.

■ It is very tempting to "extrapolate" and use the "more general" TWFE regression specification:

Yi,t = α̃0,1 + γ̃0·1{Gi = 2} + λ̃0·1{Ti = 2} + β̃0twfe·(1{Gi = 2} · 1{Ti = 2}) + Xi′α̃0,2 + ε̃i,t,

where E[ε̃i,t | Gi, Ti, Xi] = 0 almost surely. Whether β̃0twfe still equals the ATT is now an open question.
Is β̃0twfe "similar" to the ATT?
Usage of simple TWFE linear regressions with covariates
Simulation exercise
Monte Carlo simulation exercise

■ This is a great point to illustrate the power of simulations to assess whether "intuitive" extensions are sensible.

■ Here, knowing the "truth" helps us hold our methods accountable.

■ In this particular exercise, we will use a data generating process similar to those of Kang and Schafer (2007).

■ Sample size: n = 1,000.

■ For each design, we consider 10,000 Monte Carlo experiments.

■ Available data are {Yt=2, Yt=1, D, X}ni=1, where Di = 1{Gi = 2}.
Monte Carlo simulation exercise

■ Covariates are generated as Xj ∼ N (0, 1), j = 1, 2, 3, 4.

■ Let X = (X1 , X2 , X3 , X4 ), and


freg (X) = 210 + 27.4 · X1 + 13.7 · (X2 + X3 + X4 )
fps (X) = 0.75 · (−X1 + 0.5 · X2 − 0.25 · X3 − 0.1 · X4 )
■ Also, let

v(X, D) ∼ N(D·freg(X), 1)
εt=1 ∼ N(0, 1)
εt=2(2) ∼ N(0, 1)
εt=2(∞) ∼ N(0, 1)
U ∼ U(0, 1)
How we generate potential outcomes and group indicators

Yi,t=1(∞) = freg(Xi) + vi(Xi, Di) + εi,t=1
Yi,t=2(∞) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(∞)
Yi,t=2(2) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(2)

p(Xi) = exp(fps(Xi)) / (1 + exp(fps(Xi)))

Di = 1{p(Xi) ≥ U}

In this setup, ATT(X) = 0 a.s.


Simulation results

We estimate β̃0twfe from the following specification:

Yi,t = α̃0,1 + γ̃0·1{Gi = 2} + λ̃0·1{Ti = 2} + β̃0twfe·(1{Gi = 2} · 1{Ti = 2}) + Xi′α̃0,2 + ε̃i,t

■ Average of the estimates of β̃0twfe across the simulations: −16.36 (very biased!)

■ Coverage probability of the 95% confidence interval: 0 (does not control size!)
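The bias can be reproduced with a small Monte Carlo. A numpy sketch under the DGP above (only 200 replications rather than the 10,000 used on the slides, and all names are illustrative):

```python
import numpy as np

def simulate(n, rng):
    """One draw from the slides' DGP; treated and untreated t=2 outcomes share the
    same conditional mean, so the true ATT is 0."""
    X = rng.normal(size=(n, 4))
    f_reg = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1)
    f_ps = 0.75 * (-X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3])
    D = (1 / (1 + np.exp(-f_ps)) >= rng.uniform(size=n)).astype(float)
    v = rng.normal(D * f_reg, 1)
    y1 = f_reg + v + rng.normal(size=n)
    y2 = 2 * f_reg + v + rng.normal(size=n)
    return y1, y2, D, X

def twfe_with_covariates(y1, y2, D, X):
    """Pooled OLS of Y_it on an intercept, G, T, G*T, and X; returns the G*T coefficient."""
    n = len(D)
    Y = np.concatenate([y1, y2])
    T = np.concatenate([np.zeros(n), np.ones(n)])
    G = np.concatenate([D, D])
    Z = np.column_stack([np.ones(2 * n), G, T, G * T, np.vstack([X, X])])
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return beta[3]

rng = np.random.default_rng(123)
estimates = [twfe_with_covariates(*simulate(1000, rng)) for _ in range(200)]
avg = float(np.mean(estimates))  # strongly negative on average, though the true ATT is 0
```

Even this rough sketch delivers a large negative average estimate, in line with the −16.36 reported on the slides.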
Simulation results

Figure 1: Monte Carlo for TWFE-based estimators

[Density of the TWFE estimates across simulations; the mass lies roughly between −20 and −10, far from the true ATT of 0.]
Why is there so much bias here?
Usage of simple TWFE linear regressions with covariates
The problems of the simple TWFE specification with covariates
Simple TWFE DiD regression estimator with covariates

■ The TWFE specification is given by

Yi,t = α̃0,1 + γ̃0·1{Gi = 2} + λ̃0·1{Ti = 2} + β̃0twfe·(1{Gi = 2} · 1{Ti = 2}) + Xi′α̃0,2 + ε̃i,t,

where E[ε̃i,t | Gi, Ti, Xi] = 0 almost surely.

■ Now, let's play with its terms:

E[Yi,t | Gi = ∞, Ti = 1, Xi] = α̃0,1 + Xi′α̃0,2
E[Yi,t | Gi = ∞, Ti = 2, Xi] = α̃0,1 + λ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 1, Xi] = α̃0,1 + γ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 2, Xi] = α̃0,1 + γ̃0 + λ̃0 + β̃0twfe + Xi′α̃0,2
Simple TWFE regression estimator with covariates

■ Set of moment restrictions:

E[Yi,t | Gi = ∞, Ti = 1, Xi] = α̃0,1 + Xi′α̃0,2
E[Yi,t | Gi = ∞, Ti = 2, Xi] = α̃0,1 + λ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 1, Xi] = α̃0,1 + γ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 2, Xi] = α̃0,1 + γ̃0 + λ̃0 + β̃0twfe + Xi′α̃0,2

■ Let's analyze the implications of these moment restrictions, one by one.
Simple TWFE regression estimator with covariates

■ Set of moment restrictions:

E[Yi,t | Gi = ∞, Ti = 1, Xi] = α̃0,1 + Xi′α̃0,2
E[Yi,t | Gi = ∞, Ti = 2, Xi] = α̃0,1 + λ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 1, Xi] = α̃0,1 + γ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 2, Xi] = α̃0,1 + γ̃0 + λ̃0 + β̃0twfe + Xi′α̃0,2

■ First, notice that

E[Yi,t | Gi = ∞, Ti = 2, Xi] − E[Yi,t | Gi = ∞, Ti = 1, Xi] = λ̃0

The evolution of the outcome among untreated units does not depend on X!
Simple TWFE regression estimator with covariates

■ Set of moment restrictions:

E[Yi,t | Gi = ∞, Ti = 1, Xi] = α̃0,1 + Xi′α̃0,2
E[Yi,t | Gi = ∞, Ti = 2, Xi] = α̃0,1 + λ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 1, Xi] = α̃0,1 + γ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 2, Xi] = α̃0,1 + γ̃0 + λ̃0 + β̃0twfe + Xi′α̃0,2

■ Second, notice that

E[Yi,t | Gi = 2, Ti = 2, Xi] − E[Yi,t | Gi = 2, Ti = 1, Xi] = λ̃0 + β̃0twfe

The evolution of the outcome among treated units does not depend on X!
Simple TWFE regression estimator with covariates

■ Set of moment restrictions:

E[Yi,t | Gi = ∞, Ti = 1, Xi] = α̃0,1 + Xi′α̃0,2
E[Yi,t | Gi = ∞, Ti = 2, Xi] = α̃0,1 + λ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 1, Xi] = α̃0,1 + γ̃0 + Xi′α̃0,2
E[Yi,t | Gi = 2, Ti = 2, Xi] = α̃0,1 + γ̃0 + λ̃0 + β̃0twfe + Xi′α̃0,2

■ Lastly, notice that, under conditional PT, No-Anticipation, and SUTVA,

ATT(X) = (E[Yi,t | Gi = 2, Ti = 2, Xi] − E[Yi,t | Gi = 2, Ti = 1, Xi])
       − (E[Yi,t | Gi = ∞, Ti = 2, Xi] − E[Yi,t | Gi = ∞, Ti = 1, Xi])
       = β̃0twfe

Average treatment effects are homogeneous across covariate subpopulations!
TWFE with covariates

Key to success: Separate identification from estimation/inference!

How can we do it?
Alternative Estimands
Semi and nonparametric DiD procedures

Once we separate identification from estimation procedures, we realize that DiD with covariates has many faces!
Alternative Estimands
Regression adjustment
Regression adjustment procedure

■ The “first face” of DiD procedures is already familiar.

■ Originally proposed by Heckman, Ichimura and Todd (1997) and Heckman, Ichimura, Smith and Todd (1998).

■ The idea is to work directly from the identifying assumptions.

■ We have already seen that, under conditional PT, No-Anticipation, and SUTVA,

ATT = E[ (E[Yt=2 | G = 2, X] − E[Yt=1 | G = 2, X]) − (E[Yt=2 | G = ∞, X] − E[Yt=1 | G = ∞, X]) | G = 2 ]

    = E[ (mG=2,t=2(X) − mG=2,t=1(X)) − (mG=∞,t=2(X) − mG=∞,t=1(X)) | G = 2 ],

where mG=g,t=s(X) ≡ E[Yt=s | G = g, X].
Regression adjustment procedure

■ Our life is a bit easier once we have that

ATT = E[ (mG=2,t=2(X) − mG=2,t=1(X)) − (mG=∞,t=2(X) − mG=∞,t=1(X)) | G = 2 ].

■ Now, it is a matter of estimating the unknown regression functions mG=g,t=s(X) with your favorite estimation method - it can be parametric, semiparametric, or nonparametric!
Regression adjustment procedure

■ Our life is a bit easier once we have that

ATT = E[ (mG=2,t=2(X) − mG=2,t=1(X)) − (mG=∞,t=2(X) − mG=∞,t=1(X)) | G = 2 ].

■ For example, let µG=g,t=s(X) = X′β0,G=g,t=s be a working model for mG=g,t=s(X).

■ We can then estimate the betas in each subsample using OLS, compute the fitted values at the covariate values of the treated units, and then average the combination of these fitted values:

ÂTTnreg = Ên[ (µ̂G=2,t=2(X) − µ̂G=2,t=1(X)) − (µ̂G=∞,t=2(X) − µ̂G=∞,t=1(X)) | G = 2 ].
Regression adjustment with panel data

■ Our life can be even easier if we have access to panel data:

Assumption (Panel Data Sampling Scheme)


The data {Yi,t=1, Yi,t=2, Gi, Xi}ni=1 are a random sample from the population of interest.

■ Observing Yt=1 and Yt=2 for the same units allows us to simplify the formulas a lot!

ATT = E[ (E[Yt=2 | G = 2, X] − E[Yt=1 | G = 2, X]) − (E[Yt=2 | G = ∞, X] − E[Yt=1 | G = ∞, X]) | G = 2 ]
    = E[ E[Yt=2 − Yt=1 | G = 2, X] − E[Yt=2 − Yt=1 | G = ∞, X] | G = 2 ]
    = E[Yt=2 − Yt=1 | G = 2] − E[ E[Yt=2 − Yt=1 | G = ∞, X] | G = 2 ]
    = E[Yt=2 − Yt=1 | G = 2] − E[ m∆,G=∞(X) | G = 2 ]

■ We only have to model one conditional expectation:

m∆,G=∞(X) ≡ E[Yt=2 − Yt=1 | G = ∞, X]
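A sketch of this panel regression-adjustment estimator in numpy, using the linear working model from the lecture's simulation design (where E[Yt=2 − Yt=1 | G = ∞, X] is exactly linear in X and the true ATT is 0); the data-generating code and all names are illustrative:

```python
import numpy as np

def att_reg_panel(y1, y2, D, X):
    """Regression-adjustment DiD with panel data: fit E[Y2 - Y1 | G = inf, X]
    by OLS on untreated units, predict for treated units, then average."""
    dy = y2 - y1
    W = np.column_stack([np.ones(len(dy)), X])
    beta, *_ = np.linalg.lstsq(W[D == 0], dy[D == 0], rcond=None)
    return dy[D == 1].mean() - (W[D == 1] @ beta).mean()

# Data simulated from the lecture's design (true ATT = 0):
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 4))
f_reg = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1)
f_ps = 0.75 * (-X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3])
D = (1 / (1 + np.exp(-f_ps)) >= rng.uniform(size=n)).astype(float)
v = rng.normal(D * f_reg, 1)
y1 = f_reg + v + rng.normal(size=n)
y2 = 2 * f_reg + v + rng.normal(size=n)

att_reg_hat = att_reg_panel(y1, y2, D, X)
```

Because the unit-level term v cancels in Yt=2 − Yt=1, the OLS working model is correctly specified here and the estimate lands near 0.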
Regression adjustment with stationary repeated cross-section data

■ Sometimes, we only have access to (stationary) repeated cross-section data:

Assumption (Repeated Cross-Section Data Sampling Scheme)


The pooled repeated cross-section data {Yi, Gi, Ti, Xi}ni=1 consist of iid draws from the mixture distribution

P(Y ≤ y, X ≤ x, G = g, T = t) = 1{t = 2}·λ·P(Yt=2 ≤ y, X ≤ x, G = g | T = 2)
                              + 1{t = 1}·(1 − λ)·P(Yt=1 ≤ y, X ≤ x, G = g | T = 1),

where (y, x, g, t) ∈ R × Rk × {2, ∞} × {1, 2} and λ = P(T = 2) ∈ (0, 1).

Furthermore, (G, X) | T = 1 ∼ (G, X) | T = 2 in distribution, i.e., there are no compositional changes over time.

■ Question: Would it be possible to allow compositional changes? What would change? How would you proceed?
Regression adjustment with stationary repeated cross-section data

■ In this case, the formula can also be simplified (but not as much as in the case of panel data):

ATT = E[ (mG=2,t=2(X) − mG=2,t=1(X)) − (mG=∞,t=2(X) − mG=∞,t=1(X)) | G = 2 ]

    = (E[Y | G = 2, T = 2] − E[Y | G = 2, T = 1]) − E[ (mG=∞,t=2(X) − mG=∞,t=1(X)) | G = 2 ]

■ We have to model conditional expectations only for untreated units:

mG=∞,t=s(X) = E[Y | G = ∞, T = s, X], s = 1, 2
Regression-adjusted DiD estimators rely on researchers' ability to model the outcome evolution.
Alternative Estimands
Inverse Probability Weighting procedure
Inverse probability weighting procedures

■ The “second face” of semi/nonparametric DiD procedures avoids directly modeling the outcome evolution.

■ Instead, it models the propensity score, i.e., the probability of belonging to the group G = 2: p(X) ≡ P(G = 2|X) = P(D = 1|X), where D = 1{G = 2}.

■ Originally proposed by Abadie (2005):

ATTipw,p = E[ (D − (1 − D)·p(X)/(1 − p(X))) · (Yt=2 − Yt=1) ] / E[D],

ATTipw,rc = E[ (D − (1 − D)·p(X)/(1 − p(X))) · ((1{T = 2} − λ)/(λ(1 − λ))) · Y ] / E[D],

where λ = E[1{T = 2}].
Inverse probability weighting procedures

  
ATTipw,p = E[ (D − (1 − D)·p(X)/(1 − p(X))) · (Yt=2 − Yt=1) ] / E[D],

ATTipw,rc = E[ (D − (1 − D)·p(X)/(1 − p(X))) · ((1{T = 2} − λ)/(λ(1 − λ))) · Y ] / E[D],

where λ = E[1{T = 2}].

■ These formulas suggest a simple two-step estimation procedure, too!

1. Choose your favorite method to estimate the unknown propensity score p(X).
2. Plug the fitted propensity score values into the ATT formula, and replace the population expectations with their sample analogues.
Inverse probability weighting procedures

■ For example, let π(X) = Λ(X′γ0) ≡ exp(X′γ0)/(1 + exp(X′γ0)) be a working model for the propensity score.

■ We can estimate γ0 using the logit maximum likelihood estimator; call the estimate γ̂n.

■ Let π̂(X) = exp(X′γ̂n)/(1 + exp(X′γ̂n)).

■ Abadie's proposed ATT estimator with panel data is

ÂTTnipw,p = Ên[ (D − (1 − D)·π̂(X)/(1 − π̂(X))) · (Yt=2 − Yt=1) ] / Ên[D].
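A numpy sketch of the two-step procedure for the panel estimand (the logit is fitted by a hand-rolled Newton-Raphson as a stand-in for "your favorite method"; the data are simulated from the lecture's design, where the true ATT is 0, and all names are illustrative):

```python
import numpy as np

def fit_logit_pscore(X, D, iters=30):
    """Propensity score via logit MLE (Newton-Raphson), intercept included."""
    Z = np.column_stack([np.ones(len(D)), X])
    gamma = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ gamma))
        hessian = Z.T @ (Z * (p * (1 - p))[:, None])
        gamma += np.linalg.solve(hessian, Z.T @ (D - p))
    return 1 / (1 + np.exp(-Z @ gamma))

def att_ipw_abadie(y1, y2, D, X):
    """Abadie (2005) IPW DiD (Horvitz-Thompson type) with panel data."""
    p = fit_logit_pscore(X, D)
    w = D - (1 - D) * p / (1 - p)
    return np.mean(w * (y2 - y1)) / D.mean()

rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 4))
f_reg = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1)
f_ps = 0.75 * (-X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3])
D = (1 / (1 + np.exp(-f_ps)) >= rng.uniform(size=n)).astype(float)
v = rng.normal(D * f_reg, 1)
y1 = f_reg + v + rng.normal(size=n)
y2 = 2 * f_reg + v + rng.normal(size=n)  # true ATT is 0

att_ipw_hat = att_ipw_abadie(y1, y2, D, X)
```

The estimate is centered at the truth, but notice how much noisier it is than the regression-adjustment route, foreshadowing the variance issues discussed next.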
Hájek-based Inverse probability weighting procedures

■ One potential drawback of Abadie's IPW DiD estimator is that its weights are not "normalized", i.e., they do not sum up to one.

■ More formally, Abadie's IPW DiD estimator is of the Horvitz and Thompson (1952) type.

■ We know from the survey literature that Hájek (1971)-type estimators can be more stable, as they use "normalized" weights.

■ Building on this insight, Sant'Anna and Zhao (2020) extended Abadie (2005) and considered Hájek (1971)-type IPW DiD estimands.
Hájek-based Inverse probability weighting with panel

Sant'Anna and Zhao (2020) considered the following estimand when panel data are available:

ATTipw,p,std = E[ (wpG=2(D) − wpG=∞(D, X; p)) · (Yt=2 − Yt=1) ]

            = E[ ( D/E[D] − (p(X)·(1 − D)/(1 − p(X))) / E[p(X)·(1 − D)/(1 − p(X))] ) · (Yt=2 − Yt=1) ],

where

wpG=2(D) = D / E[D], and wpG=∞(D, X; g) = (g(X)·(1 − D)/(1 − g(X))) / E[ g(X)·(1 − D)/(1 − g(X)) ]
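A numpy sketch of the normalized (Hájek-type) panel estimator; the logit pscore step is again a hand-rolled Newton-Raphson stand-in, and the simulated data (true ATT = 0) and names are illustrative:

```python
import numpy as np

def fit_logit_pscore(X, D, iters=30):
    """Propensity score via logit MLE (Newton-Raphson), intercept included."""
    Z = np.column_stack([np.ones(len(D)), X])
    gamma = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ gamma))
        gamma += np.linalg.solve(Z.T @ (Z * (p * (1 - p))[:, None]), Z.T @ (D - p))
    return 1 / (1 + np.exp(-Z @ gamma))

def att_ipw_hajek(y1, y2, D, X):
    """Hajek-type IPW DiD with panel data: both sets of weights are normalized to mean one."""
    p = fit_logit_pscore(X, D)
    dy = y2 - y1
    w_treat = D / D.mean()
    odds = p * (1 - D) / (1 - p)
    w_comp = odds / odds.mean()
    return np.mean(w_treat * dy) - np.mean(w_comp * dy)

rng = np.random.default_rng(11)
n = 4000
X = rng.normal(size=(n, 4))
f_reg = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1)
f_ps = 0.75 * (-X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3])
D = (1 / (1 + np.exp(-f_ps)) >= rng.uniform(size=n)).astype(float)
v = rng.normal(D * f_reg, 1)
y1 = f_reg + v + rng.normal(size=n)
y2 = 2 * f_reg + v + rng.normal(size=n)  # true ATT is 0

att_std_hat = att_ipw_hajek(y1, y2, D, X)
```

Normalizing the comparison-group weights to mean one is exactly what makes this version more stable than the Horvitz-Thompson one.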
Hájek-based Inverse probability weighting with repeated cross-section

Sant'Anna and Zhao (2020) considered the following estimand when stationary RCS data are available:

ATTipw,rc,std = E[ (wrcG=2(D, T) − wrcG=∞(D, T, X; p)) · Y ],

where

wrcG=2(D, T) = wrcG=2,t=2(D, T) − wrcG=2,t=1(D, T),
wrcG=∞(D, T, X; g) = wrcG=∞,t=2(D, T, X; g) − wrcG=∞,t=1(D, T, X; g),

and, for s = 1, 2,

wrcG=2,t=s(D, T) = D·1{T = s} / E[D·1{T = s}],
wrcG=∞,t=s(D, T, X; g) = (g(X)·(1 − D)·1{T = s}/(1 − g(X))) / E[ g(X)·(1 − D)·1{T = s}/(1 − g(X)) ].
IPW-adjusted DiD estimators rely on researchers' ability to model the propensity score.
Alternative Estimands
Doubly robust DiD estimators
Doubly robust DiD procedures

■ Combine both outcome regression and IPW approaches to form more robust
estimators.

■ Originally proposed by Sant’Anna and Zhao (2020)

■ Estimators are doubly robust consistent: they are consistent for the ATT if either (but not necessarily both)

▶ the regression working models for the outcome dynamics are correctly specified, or

▶ the propensity score working model is correctly specified.
Doubly robust DiD procedure with panel

Sant'Anna and Zhao (2020) considered the following doubly robust estimand when panel data are available:

ATTdr,p = E[ (wpG=2(D) − wpG=∞(D, X; p)) · ( (Yt=2 − Yt=1) − (mG=∞,t=2(X) − mG=∞,t=1(X)) ) ],

where

wpG=2(D) = D / E[D], and wpG=∞(D, X; g) = (g(X)·(1 − D)/(1 − g(X))) / E[ g(X)·(1 − D)/(1 − g(X)) ]
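A numpy sketch of this DR panel estimand, combining the earlier OLS working model for the outcome evolution with Hájek-type IPW weights from a Newton-fitted logit (simulated data with true ATT = 0; all names are illustrative):

```python
import numpy as np

def fit_logit_pscore(X, D, iters=30):
    """Propensity score via logit MLE (Newton-Raphson), intercept included."""
    Z = np.column_stack([np.ones(len(D)), X])
    gamma = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ gamma))
        gamma += np.linalg.solve(Z.T @ (Z * (p * (1 - p))[:, None]), Z.T @ (D - p))
    return 1 / (1 + np.exp(-Z @ gamma))

def att_dr_panel(y1, y2, D, X):
    """Doubly robust DiD with panel data: Hajek-type weights applied to
    dY minus an OLS working model for E[dY | G = inf, X]."""
    dy = y2 - y1
    W = np.column_stack([np.ones(len(dy)), X])
    beta, *_ = np.linalg.lstsq(W[D == 0], dy[D == 0], rcond=None)
    m0 = W @ beta                       # outcome-regression adjustment
    p = fit_logit_pscore(X, D)
    w_treat = D / D.mean()
    odds = p * (1 - D) / (1 - p)
    w_comp = odds / odds.mean()
    return np.mean((w_treat - w_comp) * (dy - m0))

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(n, 4))
f_reg = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1)
f_ps = 0.75 * (-X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3])
D = (1 / (1 + np.exp(-f_ps)) >= rng.uniform(size=n)).astype(float)
v = rng.normal(D * f_reg, 1)
y1 = f_reg + v + rng.normal(size=n)
y2 = 2 * f_reg + v + rng.normal(size=n)  # true ATT is 0

att_dr_hat = att_dr_panel(y1, y2, D, X)
```

Subtracting the outcome-regression fit before weighting is what buys the double robustness: a mistake in either first-step model alone does not destroy consistency.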
Doubly robust DiD procedure with panel

■ Sant'Anna and Zhao (2020) also showed that ATTdr,p is semiparametrically (locally) efficient.

■ If all working models are correctly specified, the DR DiD estimator of ATTdr,p is "the most precise estimator" (minimum asymptotic variance) among all (regular) estimators that do not rely on additional functional form restrictions.

■ Sant'Anna and Zhao (2020) also discuss how to obtain further improved DR DiD estimators by "carefully" choosing first-step estimators for the regression adjustment and propensity score working models.

■ For the sake of time, we will not go into detail on these.
Doubly robust DiD procedure with repeated cross-section

Sant'Anna and Zhao (2020) considered two different doubly robust estimands when RCS data are available.

ATTdr,rc,1 = E[ (wrcG=2(D, T) − wrcG=∞(D, T, X; p)) · (Y − mrcG=∞,T(X)) ],

where mrcG=∞,T(X) ≡ 1{T = 2}·mrcG=∞,t=2(X) + 1{T = 1}·mrcG=∞,t=1(X) is the period-T outcome regression for untreated units,

wrcG=2(D, T) = wrcG=2,t=2(D, T) − wrcG=2,t=1(D, T),
wrcG=∞(D, T, X; g) = wrcG=∞,t=2(D, T, X; g) − wrcG=∞,t=1(D, T, X; g),

and, for s = 1, 2 and g = 2, ∞, mrcG=g,t=s(x) ≡ E[Y | G = g, T = s, X = x],

wrcG=2,t=s(D, T) = D·1{T = s} / E[D·1{T = s}],
wrcG=∞,t=s(D, T, X; g) = (g(X)·(1 − D)·1{T = s}/(1 − g(X))) / E[ g(X)·(1 − D)·1{T = s}/(1 − g(X)) ].
Doubly robust DiD procedure with repeated cross-section

Sant'Anna and Zhao's (2020) second DR DiD estimand also relies on outcome regression models for the treated units:

ATTdr,rc,2 = ATTdr,rc,1
  + ( E[ mrcG=2,t=2(X) − mrcG=∞,t=2(X) | D = 1 ] − E[ mrcG=2,t=2(X) − mrcG=∞,t=2(X) | D = 1, T = 2 ] )
  − ( E[ mrcG=2,t=1(X) − mrcG=∞,t=1(X) | D = 1 ] − E[ mrcG=2,t=1(X) − mrcG=∞,t=1(X) | D = 1, T = 1 ] )
Doubly robust DiD procedure with repeated cross-section

■ Both DR DiD estimators for RCS data are consistent for the ATT under the same conditions.

■ Even if the regression model for the outcome evolution of the treated group is misspecified, ATTdr,rc,2 is consistent for the ATT (provided that either the pscore or the regression models for the outcome evolution among untreated units are correctly specified).

■ However, in general, ATTdr,rc,2 is more efficient than ATTdr,rc,1.

■ In fact, Sant'Anna and Zhao (2020) showed that ATTdr,rc,2 is (locally) semiparametrically efficient.
Let’s see how these work in a
simulation exercise

Monte Carlo simulations
Simulations

■ The data generating processes are similar to those considered in the TWFE example.

■ We compare DR DiD estimators with IPW (standardized and non-standardized), outcome regression, and TWFE estimators.

■ Sample size: n = 1,000.

■ For each design, we consider 10,000 Monte Carlo experiments.

■ Available data are {Yt=2, Yt=1, D, Z}ni=1, where Di = 1{Gi = 2}.

■ We estimate the pscore assuming a logit specification, and the outcome regression models assuming a linear specification.
DGPs

■ Since we want to check the effect of model misspecifications, we will generate covariates slightly differently than before.

■ Let Zj = (Z̃j − E[Z̃j]) / √Var(Z̃j), j = 1, 2, 3, 4, where

Z̃1 = exp(X1/2)
Z̃2 = X2/(1 + exp(X1)) + 10
Z̃3 = (X1·X3/25 + 0.6)³
Z̃4 = (X2 + X4 + 20)²

and Xj ∼ N(0, 1), j = 1, 2, 3, 4.
DGPs

■ For a generic W = (W1, W2, W3, W4), let

freg(W) = 210 + 27.4·W1 + 13.7·(W2 + W3 + W4)
fps(W) = 0.75·(−W1 + 0.5·W2 − 0.25·W3 − 0.1·W4)

■ Also, let

v(X, D) ∼ N(D·freg(X), 1)
v(Z, D) ∼ N(D·freg(Z), 1)
εt=1 ∼ N(0, 1)
εt=2(2) ∼ N(0, 1)
εt=2(∞) ∼ N(0, 1)
U ∼ U(0, 1)
DGPs

■ We now consider four different DGPs.

■ DGP1:

Yi,t=1(∞) = freg(Zi) + vi(Zi, Di) + εi,t=1
Yi,t=2(∞) = 2·freg(Zi) + vi(Zi, Di) + εi,t=2(∞)
Yi,t=2(2) = 2·freg(Zi) + vi(Zi, Di) + εi,t=2(∞)
p(Zi) = exp(fps(Zi)) / (1 + exp(fps(Zi)))
Di = 1{p(Zi) ≥ U}

■ Both the pscore and the OR models are correctly specified.
DGPs

■ DGP2:

Yi,t=1(∞) = freg(Zi) + vi(Zi, Di) + εi,t=1
Yi,t=2(∞) = 2·freg(Zi) + vi(Zi, Di) + εi,t=2(∞)
Yi,t=2(2) = 2·freg(Zi) + vi(Zi, Di) + εi,t=2(∞)
p(Xi) = exp(fps(Xi)) / (1 + exp(fps(Xi)))
Di = 1{p(Xi) ≥ U}

■ Only the OR model is correctly specified.
DGPs

■ DGP3:

Yi,t=1(∞) = freg(Xi) + vi(Xi, Di) + εi,t=1
Yi,t=2(∞) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(∞)
Yi,t=2(2) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(∞)
p(Zi) = exp(fps(Zi)) / (1 + exp(fps(Zi)))
Di = 1{p(Zi) ≥ U}

■ Only the pscore model is correctly specified.
DGPs

■ DGP4:

Yi,t=1(∞) = freg(Xi) + vi(Xi, Di) + εi,t=1
Yi,t=2(∞) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(∞)
Yi,t=2(2) = 2·freg(Xi) + vi(Xi, Di) + εi,t=2(∞)
p(Xi) = exp(fps(Xi)) / (1 + exp(fps(Xi)))
Di = 1{p(Xi) ≥ U}

■ Both the pscore and the OR models are misspecified.
Table 1: Monte Carlo Simulations, DGP1: Both pscore and OR are correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -20.9518   21.1227       2.5271     0.0000      9.9061
τ̂reg            -0.0012    0.1005       0.1010     0.9500      0.3960
τ̂ipw,p           0.0257    2.7743       2.6636     0.9518     10.4412
τ̂ipw,p,std       0.0075    1.1320       1.0992     0.9476      4.3090
τ̂dr,p           -0.0014    0.1059       0.1052     0.9473      0.4124
τ̂dr,p,imp       -0.0013    0.1057       0.1043     0.9451      0.4088
Figure 2: Monte Carlo for DID estimators, DGP1: Both pscore and OR are correctly specified

[Three density panels: the Regression DID estimator and the DR DID estimators (DR Tr., DR Imp) are tightly centered at 0, while the Std. IPW DID estimator is far more dispersed (roughly −5 to 7.5).]
Table 2: Monte Carlo Simulations, DGP2: Only OR is correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -19.2859   19.4683       2.5754     0.0000     10.0955
τ̂reg            -0.0008    0.0997       0.1004     0.9492      0.3937
τ̂ipw,p           2.0100    3.2982       2.5049     0.8376      9.8193
τ̂ipw,p,std      -0.7942    1.2253       0.9241     0.8564      3.6226
τ̂dr,p           -0.0008    0.1036       0.1031     0.9469      0.4043
τ̂dr,p,imp       -0.0007    0.1042       0.1030     0.9445      0.4039
Figure 3: Monte Carlo for DID estimators, DGP2: Only OR is correctly specified

[Three density panels: the Regression and DR DID estimators (DR Tr., DR Imp) remain tightly centered at 0, while the Std. IPW DID estimator is dispersed and shifted (roughly −5 to 2.5).]
Table 3: Monte Carlo Simulations, DGP3: Only PS is correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -13.1703   13.3638       3.5611     0.0035     13.9596
τ̂reg            -1.3843    1.8684       1.2286     0.8001      4.8159
τ̂ipw,p           0.0114    3.1982       3.0043     0.9468     11.7769
τ̂ipw,p,std      -0.0299    1.4270       1.3990     0.9447      5.4840
τ̂dr,p           -0.0513    1.2142       1.1768     0.9416      4.6132
τ̂dr,p,imp       -0.0709    1.0151       0.9842     0.9423      3.8581
Figure 4: Monte Carlo for DID estimators, DGP3: Only PS is correctly specified

[Three density panels over roughly −8 to 4: the Regression DID estimator is visibly shifted below 0, while the Std. IPW and DR DID estimators (DR Tr., DR Imp) are centered near 0 but more dispersed.]
Table 4: Monte Carlo Simulations, DGP4: Both OR and PS are misspecified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -16.3846   16.5383       3.6268     0.0000     14.2169
τ̂reg            -5.2045    5.3641       1.2890     0.0145      5.0531
τ̂ipw,p          -1.0846    2.6557       2.3746     0.9487      9.3084
τ̂ipw,p,std      -3.9538    4.2154       1.4585     0.2282      5.7172
τ̂dr,p           -3.1878    3.4544       1.2946     0.3076      5.0749
τ̂dr,p,imp       -2.5291    2.7202       0.9837     0.2737      3.8561
Figure 5: Monte Carlo for DID estimators, DGP4: Both OR and PS are misspecified

[Three density panels over roughly −10 to 0: all estimators (Regression, DR, and Std. IPW) are shifted below the true ATT of 0, reflecting the misspecification of both working models.]
Monte Carlo simulations for repeated cross-section data

■ Same DGPs as before, but now we observe a sample from T = 2 or T = 1, each with probability 0.5.
Table 5: Monte Carlo Simulations, DGP1: Both the pscore and the OR are correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -20.7916   21.0985       3.5705     0.0002     13.9962
τ̂reg             0.0263    7.5878       7.5702     0.9510     29.6751
τ̂ipw,rc         -0.6619   55.9708      55.5516     0.9493    217.7621
τ̂ipw,rc,std     -0.0502    9.6477       9.5815     0.9487     37.5596
τ̂dr,rc,1         0.0129    3.0414       3.0340     0.9504     11.8934
τ̂dr,rc,2         0.0041    0.2159       0.2102     0.9441     0.8239
τ̂dr,rc,1,imp     0.0136    3.0413       3.0337     0.9507     11.8921
τ̂dr,rc,2,imp     0.0047    0.2163       0.2049     0.9371     0.8032
Figure 6: Monte Carlo for DID estimators, DGP1: Both the pscore and the OR are correctly specified

[Six density panels: the Regression (OR) and Std. IPW DID estimators are widely dispersed (roughly −40 to 40); among the DR DID estimators, the locally efficient versions (DR Trad. loc. eff., DR Imp. loc. eff.) are far more concentrated around 0 than the non-locally-efficient ones.]
Figure 7: Monte Carlo for DID estimators, DGP1: Both the pscore and the OR are correctly specified

[Densities of the four DR DID estimators (DR Trad., DR Imp., DR Trad. loc. eff., DR Imp. loc. eff.) over roughly −10 to 10; the locally efficient versions are much more concentrated around 0.]
Figure 8: Monte Carlo for DID estimators, DGP1: Both the pscore and the OR are correctly specified

[Densities of the two locally efficient DR DID estimators (DR Trad. loc. eff., DR Imp. loc. eff.) over roughly −0.5 to 1, both tightly centered at 0.]
Table 6: Monte Carlo Simulations, DGP2: Only the OR is correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -19.1783   19.5289       3.6345     0.0005     14.2472
τ̂reg            -0.0244    8.1906       8.1493     0.9481     31.9454
τ̂ipw,rc          1.8203   55.0496      54.9614     0.9491    215.4486
τ̂ipw,rc,std     -0.8119    9.8141       9.7018     0.9459     38.0310
τ̂dr,rc,1        -0.0102    3.2814       3.2651     0.9486     12.7991
τ̂dr,rc,2        -0.0002    0.2108       0.2054     0.9454      0.8051
τ̂dr,rc,1,imp    -0.0095    3.2818       3.2650     0.9488     12.7989
τ̂dr,rc,2,imp     0.0002    0.2127       0.2030     0.9403      0.7958
Figure 9: Monte Carlo for DID estimators, DGP2: Only the OR is correctly specified

[Six density panels: the Regression (OR) and Std. IPW DID estimators are widely dispersed (roughly −40 to 40); among the DR DID estimators, the locally efficient versions (DR Trad. loc. eff., DR Imp. loc. eff.) are far more concentrated around 0.]
Figure 10: Monte Carlo for DID estimators, DGP2: Only the OR is correctly specified

[Densities of the two locally efficient DR DID estimators (DR Trad. loc. eff., DR Imp. loc. eff.) over roughly −0.5 to 0.5, both tightly centered at 0.]
Table 7: Monte Carlo Simulations, DGP3: Only the PS is correctly specified

                    Bias      RMSE   Std. error   Coverage   CI length
τ̂fe            -13.1310   14.0577       5.0424     0.2598     19.7664
τ̂reg            -1.3763    8.1367       8.0046     0.9421     31.3782
τ̂ipw,rc         -0.9734   57.2618      56.9005     0.9465    223.0501
τ̂ipw,rc,std      0.0508    9.4283       9.3068     0.9431     36.4826
τ̂dr,rc,1        -0.0855    5.6917       5.6276     0.9453     22.0602
τ̂dr,rc,2        -0.0289    4.7419       4.6585     0.9416     18.2613
τ̂dr,rc,1,imp    -0.1191    4.8371       4.7970     0.9450     18.8042
τ̂dr,rc,2,imp    -0.0762    4.0623       3.9669     0.9436     15.5503
Figure 11: Monte Carlo for DID estimators, DGP3: Only the PS is correctly specified

[Six density panels over roughly −20 to 40: the Regression (OR) and Std. IPW DID estimators are widely dispersed; the DR DID estimators are more concentrated around 0, with the improved versions (DR Imp., DR Imp. loc. eff.) the most concentrated.]
Figure 12: Monte Carlo for DID estimators, DGP3: Only the PS is correctly specified
[Density plot of all DR DID estimators: DR Trad., DR Imp., DR Trad. loc. eff., DR Imp. loc. eff.]
Figure 13: Monte Carlo for DID estimators, DGP3: Only the PS is correctly specified
[Density plot of the DR & Loc. Eff. DID Estimators, comparing DR Trad. loc. eff. and DR Imp. loc. eff.]
Table 8: Monte Carlo Simulations, DGP4: Both the OR and the PS are misspecified

Estimator          Bias       RMSE     Std. error   Coverage   CI length
τ̂^fe            -16.3305    17.1263     5.1307      0.1138     20.1123
τ̂^reg            -5.3378     9.9773     8.5196      0.9075     33.3969
τ̂^ipw,rc         -1.3912    55.1777    55.6717      0.9518    218.2330
τ̂_std^ipw,rc     -4.1487    10.5195     9.6864      0.9304     37.9707
τ̂_1^dr,rc        -3.3422     7.0709     6.1963      0.9157     24.2897
τ̂_2^dr,rc        -3.2751     6.0158     4.8876      0.8863     19.1593
τ̂_1,imp^dr,rc    -2.6888     5.5642     4.8416      0.9134     18.9790
τ̂_2,imp^dr,rc    -2.6138     4.8453     3.9673      0.8923     15.5519
Figure 14: Monte Carlo for DID estimators, DGP4: Both the OR and the PS are misspecified
[Density plots of the simulated estimates. Panels: Regression DID Estimator; Std. IPW DID Estimator; DR (but not loc. eff.) DID Estimators; DR & Loc. Eff. DID Estimators; All DR DID Estimators. Legends: OR, Std. IPW, DR Trad., DR Imp., DR Trad. loc. eff., DR Imp. loc. eff.]
Figure 15: Monte Carlo for DID estimators, DGP4: Both the OR and the PS are misspecified
[Density plot of all DR DID estimators: DR Trad., DR Imp., DR Trad. loc. eff., DR Imp. loc. eff.]
Figure 16: Monte Carlo for DID estimators, DGP4: Both the OR and the PS are misspecified
[Density plot of the DR & Loc. Eff. DID Estimators, comparing DR Trad. loc. eff. and DR Imp. loc. eff.]
What are the main take-away messages?

Take-away messages
DiD procedures with covariates

■ We can include covariates in DiD to allow for covariate-specific trends

■ Covariates should not be post-treatment variables

■ There are several “correct” ways of implementing conditional DiD:


▶ Regression adjustments

▶ Inverse probability weighting

▶ Doubly Robust (augmented inverse probability weighting)

■ TWFE, though, can be severely biased.

■ DR DiD is my preferred method:


▶ More robust against model misspecifications

▶ Can be semiparametrically efficient (confidence intervals are tighter)
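To make the DR recipe concrete, here is a minimal sketch of a 2-period panel DR DiD estimator for the ATT, written in Python for illustration (the lecture's actual implementations use R/Stata packages): fit a propensity score, fit an outcome regression for the untreated evolution on comparison units only, and combine the two with Hájek-normalized weights. The simulated DGP, sample size, and all variable names below are invented for the example; both working models are correctly specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)

# Treatment take-up depends on x, so unconditional trends differ across groups
ps_true = 1 / (1 + np.exp(-0.5 * x))
d = rng.binomial(1, ps_true)

# Outcome evolution: covariate-specific trend plus a true ATT of 1
att = 1.0
dy = 1 + 2 * x + att * d + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# Propensity score via a few Newton steps of the logistic MLE
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (d - p))
ps_hat = 1 / (1 + np.exp(-X @ beta))

# Outcome regression for the untreated evolution, fit on comparison units only
gamma = np.linalg.lstsq(X[d == 0], dy[d == 0], rcond=None)[0]
mu_hat = X @ gamma

# Hájek-normalized weights and the DR DiD estimate of the ATT
w_treat = d / d.mean()
odds = ps_hat * (1 - d) / (1 - ps_hat)
w_comp = odds / odds.mean()
tau_dr = np.mean((w_treat - w_comp) * (dy - mu_hat))
print(f"DR DiD estimate: {tau_dr:.3f}")  # should be close to the true ATT of 1
```

Because the estimator is doubly robust, misspecifying either the propensity score or the outcome regression (but not both) would still leave the estimate centered at the true ATT, mirroring the DGP2 and DGP3 results above.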


Empirical application
Empirical application

■ Let’s switch to R/Stata so we can see how to do all these things!

References
Abadie, Alberto, “Semiparametric Difference-in-Differences Estimators,” The Review of
Economic Studies, 2005, 72 (1), 1–19.
Hájek, J., “Discussion of ‘An essay on the logical foundations of survey sampling, Part I’, by
D. Basu,” in V. P. Godambe and D. A. Sprott, eds., Foundations of Statistical Inference,
Toronto: Holt, Rinehart, and Winston, 1971.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Characterizing
Selection Bias Using Experimental Data,” Econometrica, 1998, 66 (5), 1017–1098.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd, “Matching As An Econometric
Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” The Review
of Economic Studies, October 1997, 64 (4), 605–654.
Horvitz, D. G. and D. J. Thompson, “A Generalization of Sampling Without Replacement
From a Finite Universe,” Journal of the American Statistical Association, 1952, 47 (260),
663–685.
Kang, Joseph D. Y. and Joseph L. Schafer, “Demystifying Double Robustness: A
Comparison of Alternative Strategies for Estimating a Population Mean from
Incomplete Data.,” Statistical Science, 2007, 22 (4), 569–573.
Khan, Shakeeb and Elie Tamer, “Irregular Identification, Support Conditions, and Inverse
Weight Estimation,” Econometrica, 2010, 78 (6), 2021–2042.
Sant’Anna, Pedro H. C. and Jun Zhao, “Doubly robust difference-in-differences estimators,”
Journal of Econometrics, November 2020, 219 (1), 101–122.