
Canonical Research Designs I:

Difference-in-Differences

Paul Goldsmith-Pinkham

March 22, 2021

Revisiting Research Design
- Recall my attempt at a definition:
- A (causal) research design is a statistical and/or economic statement of how an empirical research paper will estimate a relationship between two (or more) variables that is causal in nature – X causing Y.
- The design should have a description for how some variation in X is either caused by or approximated by a randomized experiment.
- Dinardo and Lee (2011) have a famous handbook chapter entitled "Program Evaluation and Research Design" where they make a distinction between two types of research designs
Revisiting Research Design

- D-condition designs fall clearly into the "PGP" description of a research design – knowledge of the DGP leads to a variation in the data generating our identification
- S-conditions fall less clearly into this context (as we will discuss).
- The relationship between X and Y can be clearly articulated, but how it is potentially approximated by a random experiment is less obvious
- This issue will become clear as we discuss our first topic
Estimating causal effects in real settings

- In many applications, we want to estimate the effect of a policy across groups
- However, the policy assignment is not necessarily uncorrelated with group characteristics
- How can we identify the effect of the policy without being confounded by these level differences?
- Answer: Difference-in-differences! (DinD)
First, a warning
- This literature has had a certain amount of upheaval over the past 5-6 years

- Tension: provide context for how people currently and historically have studied diff-in-diff
- But also elaborate on concerns identified in recent papers

- The key issues boil down to two questions:
1. What is the counterfactual estimand?
- Does your estimator map to your estimand? (e.g. "Are you getting at what you meant to?")
2. What are your structural assumptions and their implications?
- Do you need to assume functional forms? (e.g. "Is this really something that has an experimental analog?")

- Papers have both pointed out issues and provided solutions to almost all of the problems they've raised, so this is not something that should prevent you from using these tools
Basic setup
- Assume we have n units (i) and T time periods (t)

- Consider a binary policy D_it, and we are interested in estimating its effect on outcomes Y_it

- The inherent problem is that D_it is not necessarily randomly assigned

- The historical key (and parametric) assumption underlying the potential outcomes model (one version):

  Y_it(D_it) = α_i + γ_t + τ_i D_it,   such that   Y_it(1) − Y_it(0) = τ_i

- Implication? In the absence of the treatment, the Y_it across units evolve in parallel – their γ_t are identical. Absent the policy, units may have different levels (α_i) but their changes would evolve in parallel (a small simulation sketch follows below)
- This is a key (parametric!) identifying assumption
- Y_it(0) − Y_{i,t−k}(0) = γ_t − γ_{t−k},   Y_it(0) − Y_jt(0) = α_i − α_j
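To fix ideas, here is a minimal simulation sketch of this potential-outcomes model (all variable names and parameter values are illustrative, and an idiosyncratic noise term is added on top of the model above). Take-up is deliberately made correlated with the unit effect α_i, which is exactly the confounding that DinD is designed to handle.

```python
# Minimal simulation of the parallel-trends potential-outcomes model:
# Y_it(D) = alpha_i + gamma_t + tau_i * D_it (plus noise).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
alpha = rng.normal(0, 1, n)              # unit fixed effects: levels can differ
gamma = np.array([0.0, 0.5])             # common time effects: parallel evolution
tau = rng.normal(2.0, 0.5, n)            # heterogeneous treatment effects tau_i

# Take-up in period 1 is correlated with alpha_i, i.e. not randomly assigned.
treated = (alpha + rng.normal(0, 1, n) > 0).astype(int)

rows = []
for t in (0, 1):
    d = treated * (t == 1)               # absorbing adoption: D_i0 = 0, D_i1 = treated
    y = alpha + gamma[t] + tau * d + rng.normal(0, 0.3, n)
    rows.append(pd.DataFrame({"unit": np.arange(n), "time": t, "D": d, "y": y}))
df = pd.concat(rows, ignore_index=True)
```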
Basic 2x2 DinD setup

- Recall our typical estimand of interest is the ATE or the ATT:

  τ_ATE = E(Y_it(1) − Y_it(0)) = E(τ_i)

  τ_ATT = E(Y_it(1) − Y_it(0) | D_it = 1) = E(τ_i | D_it = 1)

- Since D is not randomly assigned and we only observe one time period, this model is inherently not identified without additional assumptions.
- Why? D_i could be correlated with α_i
- Recall that our plug-in estimator approaches need estimates for E(Y_it(1)) and E(Y_it(0))
- Where can we get unbiased estimates?

- With two time periods we can make a lot more progress!
2 × 2 DinD estimation

            t = 0                 t = 1
  D = 0:    γ_0 + α_i             γ_1 + α_i
  D = 1:    γ_0 + α_i + τ_i       γ_1 + α_i + τ_i

- Now consider the within-unit difference:

  Y_i1 − Y_i0 = (γ_1 − γ_0) + τ_i (D_i1 − D_i0)

- Hence

  E(Y_i1 − Y_i0 | D_i1 − D_i0 = 1) − E(Y_i1 − Y_i0 | D_i1 − D_i0 = 0) = E(τ_i | D_i1 − D_i0 = 1)

- Wait, you say, that's a lot more notation than I was expecting.
- Simplifying assumption: treatment only goes one way in period 1
- "absorbing adoption", e.g. D_i0 = 0

  E(Y_i1 − Y_i0 | D_i1 = 1) − E(Y_i1 − Y_i0 | D_i1 = 0) = E(τ_i | D_i1 = 1),

  i.e. the ATT (a short by-hand computation follows below)
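Continuing with the simulated panel `df` from the sketch above, the 2x2 ATT can be computed directly as the difference in mean changes between the treated and untreated groups. This is just a sketch, not a packaged estimator.

```python
# Hand-computed 2x2 DinD: difference in mean changes, treated minus untreated.
# Assumes the long panel `df` (columns: unit, time in {0, 1}, D, y) simulated above.
wide = df.pivot(index="unit", columns="time", values="y")
change = wide[1] - wide[0]                       # within-unit difference Y_i1 - Y_i0
is_treated = df.groupby("unit")["D"].max() == 1  # units whose treatment turns on
att_hat = change[is_treated].mean() - change[~is_treated].mean()
print(f"2x2 DinD estimate of the ATT: {att_hat:.3f}")
```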
An aside on our simplifying assumption

- The choice of focusing on take-up of a policy, such that D_i1 ≥ D_i0, is well-grounded in many policy settings

- However, there are cases where policies turn on, and then turn off, and this can vary across units

- This can be challenging and potentially problematic with heterogeneous effects

- Need to think carefully about whether D_i turning on is identical (but opposite sign) to D_i turning off
- Hull (2018) working paper on mover designs discusses this

- For today, will ignore this issue
Estimation using linear regression
- A simple linear regression will identify E(τ_i | D_i1 = 1) with two time periods:

  Y_it = α_i + γ_t + β D_it + e_it    (1)

- This setup is sometimes referred to as the Two-Way Fixed Effects (TWFE) estimator (a regression sketch follows below)

- Note: we could have also estimated τ directly as the difference in mean changes:

  τ̂ = ΔȲ^1 − ΔȲ^0,  where  ΔȲ^1 = n_1^{-1} Σ_{i: D_i = 1} (Y_i1 − Y_i0)  and  ΔȲ^0 = n_0^{-1} Σ_{i: D_i = 0} (Y_i1 − Y_i0)

- Intuitively, we generate a counterfactual for the treatment using the changes in the untreated units: E(Y_i1 − Y_i0 | D_i = 0)

- Necessary: two time periods! What if we have more?
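A sketch of the regression version of eq. (1), again using the simulated `df` from the earlier sketch: with two periods, the TWFE coefficient on D reproduces the difference-in-means estimate. In real applications with many units, a routine that absorbs the fixed effects (for example PanelOLS in the linearmodels package, or fixest/reghdfe outside Python) will be much faster; the plain-dummies version below is only for transparency.

```python
# TWFE regression version of eq. (1): unit and time fixed effects plus the treatment dummy.
import statsmodels.formula.api as smf

twfe = smf.ols("y ~ D + C(unit) + C(time)", data=df).fit()
print(f"TWFE estimate: {twfe.params['D']:.3f}")   # matches the 2x2 difference in mean changes
```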
Multiple time periods in basic setup
- Let's consider a policy that occurs all at t_0 (e.g. a single timing rolled out to treated units)

- More time periods help in several ways:

1. If we have multiple periods before the policy implementation, we can partially test the underlying assumptions
- Sometimes referred to as "pre-trends"
2. If we have multiple periods after the policy implementation, we can examine the timing of the effect
- Is it an immediate effect? Does it die off? Is it persistent?
- If you pool all time periods together into one "post" variable, this estimates the average effect. If the sample is not balanced, this can have unintended effects!

- How do we implement this? (an event-study sketch follows below)

  Y_it = α_i + γ_t + Σ_{t=1, t≠t_0}^{T} δ_t D_it + e_it

- One of the coefficients is fundamentally unidentified because of α_i

- All coefficients measure the effect relative to period t_0.
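A sketch of the event-study implementation on simulated data with a single adoption date. Variable names are illustrative; here the omitted reference period is the last pre-treatment period t_0 − 1, a common convention (the slide's normalization at t_0 is the same idea up to relabeling).

```python
# Event study with one adoption date: interact the treated-group indicator with
# period dummies, omitting one reference period.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, T, t0 = 200, 8, 4                          # treated units adopt the policy at t0
alpha = rng.normal(0, 1, n)
treated = (alpha + rng.normal(0, 1, n) > 0).astype(int)

rows = []
for t in range(T):
    y = alpha + 0.3 * t + 1.5 * treated * (t >= t0) + rng.normal(0, 0.3, n)
    rows.append(pd.DataFrame({"unit": np.arange(n), "time": t, "g": treated, "y": y}))
df = pd.concat(rows, ignore_index=True)

# Lead/lag dummies D_s = 1[treated group] * 1[time == s], omitting s = t0 - 1.
for s in range(T):
    if s != t0 - 1:
        df[f"D_{s}"] = ((df["g"] == 1) & (df["time"] == s)).astype(int)

rhs = " + ".join(f"D_{s}" for s in range(T) if s != t0 - 1)
es = smf.ols(f"y ~ {rhs} + C(unit) + C(time)", data=df).fit()
print(es.params.filter(like="D_"))            # pre-period deltas ~ 0, post-period ~ 1.5
```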
Pre-testing and structural assumptions

- Note that for the above model, we made a stronger assumption about trends
- The Dinardo and Lee "S-assumptions" start to bite
- We assumed that Y_it(d) − Y_{i,t−k}(d) = γ_t − γ_{t−k} for all k and d
- This is testable pre-treatment (hence the pre-test)

- This is very powerful and has helped spark the growth in DinD regressions
- Visual demonstration of "pre-trends" helps support the validity of the design
- Worth doing!

- Two key issues:
1. Pre-testing can cause statistical problems
2. What does parallel trends even mean?
Pre-testing issues (Roth 2020)
- Consider T = 3 and think about what a pre-trend test is trying to do

- Testing whether the difference relative to t = 0 for t = −1 is significant

- Unconditionally, this is reasonable. However, Roth (2020) highlights that this is a form of pre-testing, and that low power in detecting pre-trends can be problematic

- By selecting on pre-trends that "pass", we will tend to choose baseline realizations that satisfy pre-trends, but induce bias in the estimated effect
How to interpret this caution?

- First, don't panic. Examining pre-trends is still an important diagnostic

- Important to realize that selecting your design based on pre-trends is constructing your counterfactual
- Pre-tests can potentially contaminate your design

- Suggested solution from Roth (2020): incorporate robustness to pre-trends into your analysis. Rambachan and Roth (2020) present results on testing the sensitivity of DinD results to violations of parallel trends
- Brief intuition follows
Rambachan and Roth (2020) suggestion
- Intuitive proposed solution for robustness. Note the post and pre effects (the event-study δ coefficients).

- Parallel trends assumes these δ are zero, but the estimated pre-trends may not be zero.
- R&R say: we can use the info from our pre-trends to bound the post-period violations
- Use a smoothness assumption, M, on the second derivative of the differential trend (e.g. the simple case below)
This approach adds more work but also more validity

- Need to select M, and you will likely have less strong results

- However, a very powerful way to address concerns about pre-trends

- Code for applying this technique is available in R (a toy sketch of the bounding logic follows below):

  https://github.com/asheshrambachan/HonestDiD
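The HonestDiD package implements the actual inference procedure. Purely to illustrate the bounding logic, here is a toy numerical sketch for a single post period under the smoothness restriction: the differential trend may continue its pre-period slope, bending by at most M per period. All numbers are made up, and sampling uncertainty in the event-study coefficients is ignored, which the real method does not do.

```python
# Toy illustration (NOT the HonestDiD procedure) of the smoothness-based bound.
# Hypothetical event-study estimates, normalized so the last pre-period is 0.
pre_m2, pre_m1 = 0.02, 0.0      # coefficients at t = -2 and t = -1 (reference)
post_hat = 1.20                  # estimated coefficient in the first post period
M = 0.10                         # smoothness bound: slope can change by at most M

slope = pre_m1 - pre_m2          # last observed slope of the differential pre-trend
# Under the smoothness restriction, the parallel-trends violation in the first
# post period lies in [slope - M, slope + M]; subtract it from the estimate.
lower = post_hat - (slope + M)
upper = post_hat - (slope - M)
print(f"identified set for the first post-period effect: [{lower:.2f}, {upper:.2f}]")
```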
Parallel trends in what?
- A known issue that was historically not formalized is the question of what the
outcome is specified as: logs, or levels?

- Hopefully it’s clear that if something satisfies pre-trends in logs, it seems unlikely to
satisfy in levels

- Recall that this is the issue of invariance we discussed with quantile treatment effects
- In our parametric setting, if there are time trends in the outcomes, the parallel trends are
likely not to hold for all transformations of the variables.
- That could be problematic if you wanted to be agnostic about the model!

- Roth and Sant’Anna (2021) directly discuss this issue. Their suggestion:
Our results suggest that researchers who wish to point-identify the ATT should justify
one of the following: (i) why treatment is as-if randomly assigned, (ii) why the chosen
functional form is correct at the exclusion of others, or (iii) a method for inferring the
entire counterfactual distribution of untreated potential outcomes.
Cases of DinD

- 1 treatment timing, Binary treatment, 2 periods


- Card and Krueger (AER, 1994)

- 1 treatment timing, Binary treatment, T periods


- Yagan (AER, 2015)

- 1 treatment timing, Continuous treatment


- Berger, Turner and Zwick (JF, 2020)

- Staggered treatment timing, Binary treatment


- Bailey and Goodman-Bacon (AER, 2015)

Card and Krueger (1994)

- Card and Krueger (1994) study the impact of New Jersey increasing the minimum wage from $4.25 to $5.05 an hour on April 1, 1992

- Key question: what impact does this have on employment?
- Need a counterfactual for NJ, and use Pennsylvania as a control

- Collected data from 410 fast-food restaurants
- Called places and asked for employment and starting wage data
- Sample data from Feb 1992 and Nov 1992

- Hence, D_i is NJ vs PA, and t = 0 is Feb 1992 and t = 1 is Nov 1992
Stark Effect on Wages in Card and Krueger (1994)

Effect on Employment in Card and Krueger (1994)
- Despite a large increase in wages, seemingly no negative impact on employment
- In fact, a marginally significant positive impact

- Looking at raw data, this positive impact is driven by a decline in PA
- This decline is reasonable if you think that PA is a good counterfactual, since 1992 is in the middle of a recession

- A second comparison can be run with stores whose starting wage in the pre-period was above the treatment cutoff
- These stores perform similarly to PA
Key considerations for thinking about Card and Krueger (1994)

- The treatment can’t really be thought of as randomly assigned


- Treatment is completely correlated within states
- As a result, any within-state correlation of errors will be correlated with treatment status

- Given the limited number of states, time periods, and treatments, more valuable to
view this as a case study
- Under strong parametric assumptions, can infer causality!
- Card acknowledges (Card and Krueger interview with Ben Zipperer):

Yagan (2015)
- Yagan (2015) tests whether the 2003 dividend tax cut stimulated corporate investment and increased labor earnings

- Big empirical question for corporate finance and public finance

- No direct evidence on the real effects of the dividend tax cut
- Real corporate outcomes are too cyclical to distinguish tax effects from business cycle effects, and the economy boomed

- Paper uses the distinction between "C"-corp and "S"-corp designation to estimate the effect
- Key feature of the law: S-corps didn't have dividend taxation

- Identifying assumption (from the paper):

  The identifying assumption underlying this research design is not random assignment of C- versus S-status; it is that C- and S-corporation outcomes would have trended similarly in the absence of the tax cut.
Investment Effects (none)

Employee + Shareholder effects (big)

Key Takeaway + threats

- Tax reform had zero impact on differential investment and employee compensation

- Challenges orthodoxy on estimates of the cost-of-capital elasticity of investment

- What are the underlying challenges to identification?

1. Have to assume (and try to prove) that the only differential effect on S- vs. C-corporations was through the dividend tax changes
2. During 2003, could other shocks have had a differential impact?
- Yes, accelerated depreciation – but Yagan shows it impacts them similarly.

- Key point: you have to make more assumptions to conclude that a zero differential effect on investment implies a zero aggregate effect.
Berger, Turner and Zwick (2019)

- This paper studies the impact of temporary fiscal stimulus (the First-Time Home Buyer tax credit) on housing markets

- Policy was differentially targeted towards first-time home buyers

- Define program exposure as "the number of potential first-time homebuyers in a ZIP code, proxied by the share of people in that ZIP in the year 2000 who are first-time homebuyers"
- The design:

  The key threat to this design is the possibility that time-varying, place-specific shocks are correlated with our exposure measure.

- This measure is not binary – we are just comparing areas with a low share vs. high share, effectively. However, we have a dose-response framework in mind – as we increase the share, the effect size should grow.
First stage: Binary approximation

First stage: Regression coefficients

Final Outcome: Regression coefficients

Binary Approximation vs. Continuous Estimation
- Remember our main equation did not necessarily specify that D_it had to be binary:

  Y_it = α_i + γ_t + Σ_{t=1, t≠t_0}^{T} δ_t D_it + e_it    (2)

- However, if it is continuous, we are making an additional strong functional form assumption that the effect of D_it on our outcome is linear.

- We make this linear approximation all the time in our regression analysis, but it is worth keeping in mind. It is partially testable in a few ways:
- Bin the continuous D_it into quartiles {D̃_{it,k}}_{k=1}^{4} and estimate the effect across those groups (a sketch follows below):

  Y_it = α_i + γ_t + Σ_{t=1, t≠t_0}^{T} Σ_{k=1}^{4} δ_{t,k} D̃_{it,k} + e_it    (3)

- What does the ordering of δ_{t,k} look like? Is it at least monotonic?
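A sketch of the binning check on simulated data (names and values are illustrative). For brevity it interacts the exposure quartiles with a single pooled post dummy rather than with every period as in eq. (3); the estimated coefficients are measured relative to the bottom exposure quartile, and the question is whether they rise roughly monotonically.

```python
# Binning check for a continuous exposure: effects by exposure quartile.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n, T, t0 = 300, 6, 3
exposure = rng.uniform(0, 1, n)                    # continuous exposure share
alpha = rng.normal(0, 1, n)

rows = []
for t in range(T):
    y = alpha + 0.2 * t + 2.0 * exposure * (t >= t0) + rng.normal(0, 0.3, n)
    rows.append(pd.DataFrame({"unit": np.arange(n), "time": t,
                              "exposure": exposure, "y": y}))
df = pd.concat(rows, ignore_index=True)

# Cut exposure into quartiles; interact a post dummy with the top three quartiles.
# Effects are relative to the bottom quartile; the common post shock is absorbed
# by the time fixed effects, and quartile levels by the unit fixed effects.
df["q"] = pd.qcut(df["exposure"], 4, labels=[1, 2, 3, 4]).astype(int)
df["post"] = (df["time"] >= t0).astype(int)
for k in [2, 3, 4]:
    df[f"post_q{k}"] = df["post"] * (df["q"] == k).astype(int)

m = smf.ols("y ~ post_q2 + post_q3 + post_q4 + C(unit) + C(time)", data=df).fit()
print(m.params.filter(like="post_q"))   # roughly monotone (about 0.5, 1.0, 1.5 here)
```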
Berger, Turner and Zwick implementation of linearity test

Takeaway

- When you have a continuous exposure measure, it can be intuitive and useful to present binned means for "high" and "low" groups

- However, it is best to also present regression coefficients that exploit the full range of the continuous measure so that people don't think you're data mining

- Consider examining for non-monotonicities in your policy exposure measure

- This paper still has only one "shock" – one policy time period for implementation
Bailey and Goodman-Bacon (2015)
- Paper studies the impact of the rollout of Community Health Centers (CHCs) on mortality
- Idea is that CHCs can help lower mortality (esp. among the elderly) by providing accessible preventative care

- Exploit timing of implementation of CHCs:

  Our empirical strategy uses variation in when and where CHC programs were established to quantify their effects on mortality rates. The findings from two empirical tests support a key assumption of this approach—that the timing of CHC establishment is uncorrelated with other determinants of changes in mortality.

- Issue is that CHCs tend to be done in places

- Since CHCs are started in different places in different time periods, we estimate effects in event-time, e.g. relative to initial rollout.
Negative effect on mortality

Negative effect on mortality, particularly among elderly

Key takeaways

- Since the policy changes are staggered, we are less worried about the effect being driven by one confounding macro shock.

- Easier to defend a story that has effects across different timings
- Also allows us to test for heterogeneity in the time series

- Still makes the exact same identifying assumptions – parallel trends in the absence of the policy changes
But a big issue emerges when we exploit differential timing
- We have been extrapolating from the simple pre-post, treatment-control setting to broader cases
- multiple time periods of treatment

- In fact, in some applications, the policy eventually hits everyone – we are just exploiting differential timing.

- If we run the "two-way fixed effects" model for these types of DinD,

  y_it = α_i + α_t + β_DD D_it + e_it    (4)

  what comparisons are we doing once we have lots of timings?

- Key point: is our estimator mapping to our estimand?

- Well, what's our estimand?
What is our estimand with staggered timings?
- There is a huge host of papers touching on this question

- Callaway and Sant'Anna (2020) propose the following building block estimand:

  τ_ATT(g, t) = E(Y_it(1) − Y_it(0) | D_it = 1 ∀ t ≥ g),    (5)

  the ATT in period t for those units whose treatment turns on in period g (a hand-rolled sketch of this building block follows below)
- In the 2x2 case, this was exactly our effect!
- This paper assumes absorbing treatment, but this can be weakened in other papers (de Chaisemartin and d'Haultfoeuille (2020) discuss this)

- It seems very reasonable that for our overall estimand, we want some weighted combination of these ATTs

- Callaway and Sant'Anna (2020) highlight two ways to identify the above estimand:
1. Parallel trends of the treatment group with a group that is "never-treated"
2. Parallel trends of the treatment group with the group of the "not yet treated"

- Using these estimands, C&S provide a very natural set of potential ways to aggregate these estimands up
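A hand-rolled sketch of the ATT(g, t) building block using the never-treated units as the comparison group: for cohort g in period t, compare the change in outcomes from period g − 1 to t for cohort g against the same change for the never-treated. The data are simulated and the function is illustrative; in practice one would use the authors' did package in R or an equivalent implementation.

```python
# ATT(g, t) by hand, with never-treated units (g = 0) as the comparison group.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n, T = 300, 6
cohort = rng.choice([2, 4, 0], size=n)        # adoption period g (0 = never treated)
alpha = rng.normal(0, 1, n)

rows = []
for t in range(T):
    treated_now = (cohort > 0) & (t >= cohort)
    effect = np.where(treated_now, 1.0 + 0.5 * (t - cohort), 0.0)  # dynamic effects
    y = alpha + 0.3 * t + effect + rng.normal(0, 0.2, n)
    rows.append(pd.DataFrame({"unit": np.arange(n), "time": t, "g": cohort, "y": y}))
df = pd.concat(rows, ignore_index=True)

def att_gt(df, g, t):
    """ATT(g, t): change from period g-1 to t, cohort g minus never-treated."""
    base, now = df[df["time"] == g - 1], df[df["time"] == t]
    chg = now.set_index("unit")["y"] - base.set_index("unit")["y"]
    grp = base.set_index("unit")["g"]
    return chg[grp == g].mean() - chg[grp == 0].mean()

print(att_gt(df, g=2, t=3))   # true value in this simulation is 1.0 + 0.5 * (3 - 2) = 1.5
```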
Wait what happened to TWFE?
- It turns out that the logic of TWFE does not naturally extend to differential timings
- Recall from our discussion of linear regression that regression is great because it does a variance-weighted approximation:

  τ = E(σ²_D(W_i) τ(W_i)) / E(σ²_D(W_i)),   where   σ²_D(W_i) = E((D_i − E(D_i | W_i))² | W_i)

- It turns out that in the panel setting with staggered timings, these weights are not necessarily positive

- Key insight from several papers: with staggered timings + heterogeneous effects, the TWFE approach to DinD (both using a single pooled estimator, or using an event study) can put large negative weight on certain groups' estimands, and large positive weight on others (a simulation sketch follows below)
- Serious issue for interpretability
- Some example papers: Borusyak and Jaravel (2017), de Chaisemartin and D'Haultfœuille (2020), Goodman-Bacon (2019), Sun and Abraham (2020)

- Key point: this is solvable. Merely a consequence of being overly casual with the estimator definition
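Continuing with the staggered panel and the `att_gt` function from the previous sketch: the pooled TWFE coefficient can be compared with a simple average of the group-time ATTs. Because effects here grow with time since treatment, the two generally do not coincide, which is the weighting problem described above; with instantaneous, constant effects they would line up.

```python
# Pooled TWFE vs. a simple average of the ATT(g, t) building blocks.
import numpy as np
import statsmodels.formula.api as smf

df["D"] = ((df["g"] > 0) & (df["time"] >= df["g"])).astype(int)
twfe = smf.ols("y ~ D + C(unit) + C(time)", data=df).fit().params["D"]

# Average the ATT(g, t) building blocks over all treated (g, t) cells.
cells = [(g, t) for g in (2, 4) for t in range(g, 6)]
att_avg = np.mean([att_gt(df, g, t) for g, t in cells])
print(f"pooled TWFE: {twfe:.2f}   average ATT(g, t): {att_avg:.2f}")
```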
Goodman-Bacon 2x2 comparisons

- Consider two staggered treatments and a never-treated group

- What does the TWFE estimator estimate?
Goodman-Bacon 2x2 comparisons

- Four potential comparisons can be made

- It turns out that the TWFE DD estimator (pooled) is a weighted average of all 2x2 comparisons

- These weights end up putting a high degree of weight on units treated in the middle of the sample (since they have the highest variance in the treatment indicator!)
Goodman-Bacon 2x2 comparisons
- The weighting becomes problematic if the effects vary over time – if the effects are instantaneous and time-invariant, the weights are all positive

- However, time-varying effects create bad counterfactual groups, and create negative weights

- Goodman-Bacon provides a way to assess the weights in a given TWFE design
What to do with staggered timing in DinD?
- There's really no reason to use the baseline TWFE in staggered timings
- A perfect example wherein the estimator does not generate an estimate that maps to a meaningful estimand

- There are several approaches proposed in the literature that are just as good!
- Sun and Abraham (2020)
- de Chaisemartin and d'Haultfoeuille (2020)
- Borusyak and Jaravel (2017)
- Callaway and Sant'Anna (2020)

- These are all robust to this issue. I find Callaway and Sant'Anna quite intuitive, but your circumstances may vary slightly. Key piece to keep in mind that differs a bit:
- Is my treatment absorbing?

- Irrespective of the exact paper, the key point is that we are generating a counterfactual and need to be careful that our estimator does so correctly
Finally, a discussion on inference

- First, let's start with the old-school fact that you must know if you are working with panel data and DinD

- You must cluster on the unit of policy implementation if possible. See Bertrand, Duflo and Mullainathan (2004)
- Why? Outcomes and the treatment tend to be severely autocorrelated within unit
- I say "if possible" since clearly in Card and Krueger that is infeasible

- If the policy variation is implemented at the industry level, you should not cluster at the firm level

- If the policy variation is implemented at the firm level, you cannot just use plain (non-clustered) robust standard errors (a sketch follows below)
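A sketch of the clustering point with statsmodels on simulated data where the policy is assigned at the state level and there are common state-by-period shocks; names and values are illustrative. The clustered standard error is typically noticeably larger than the plain heteroskedasticity-robust one.

```python
# Cluster at the level of policy assignment (here: state), not the individual unit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_states, per_state, T = 20, 25, 6
state = np.repeat(np.arange(n_states), per_state)
treated_state = state < 10                          # policy assigned at the state level
st_shock = rng.normal(0, 0.5, (n_states, T))        # common state-by-period shocks

rows = []
for t in range(T):
    d = (treated_state & (t >= 3)).astype(int)
    y = 0.2 * t + 1.0 * d + st_shock[state, t] + rng.normal(0, 1, state.size)
    rows.append(pd.DataFrame({"state": state, "time": t, "D": d, "y": y}))
df = pd.concat(rows, ignore_index=True)

m = smf.ols("y ~ D + C(state) + C(time)", data=df)
print("plain robust SE:  ", m.fit(cov_type="HC1").bse["D"])
print("state-clustered SE:", m.fit(cov_type="cluster",
                                    cov_kwds={"groups": df["state"]}).bse["D"])
```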
Small clusters

- The Card and Krueger case was too extreme, but there are approaches for dealing with a small number of clusters

- This approach typically involves bootstrapping, and can handle a small number of treated groups relative to the overall population

- See Andreas Hagemann's work for a place to start
Uniform confidence intervals

- Finally, when considering event-study graphs, pre-trend graphs should use uniform confidence intervals, rather than pointwise confidence intervals
- Advocated for by Freyaldenhoven et al (2018)

- Code available here thanks to Ryan Kessler (a sketch of the construction follows below):

  https://github.com/paulgp/simultaneous_confidence_bands
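The linked repository provides ready-made code. As a rough illustration of the idea, here is a sketch of a plug-in sup-t (uniform) band: simulate from the estimated covariance of the event-study coefficients and replace the pointwise 1.96 with the 95th percentile of the maximal |t|-statistic. It assumes a fitted statsmodels event-study regression `es` with coefficients named `D_0`, `D_1`, ... (as in the earlier event-study sketch); this is not the exact Freyaldenhoven et al. implementation.

```python
# Plug-in sup-t (uniform) confidence band for event-study coefficients.
import numpy as np

names = [p for p in es.params.index if p.startswith("D_")]
beta = es.params[names].to_numpy()
V = es.cov_params().loc[names, names].to_numpy()
se = np.sqrt(np.diag(V))

rng = np.random.default_rng(5)
draws = rng.multivariate_normal(np.zeros(len(beta)), V, size=10_000)
max_t = np.abs(draws / se).max(axis=1)
crit = np.quantile(max_t, 0.95)          # uniform critical value (> 1.96 pointwise)

lower, upper = beta - crit * se, beta + crit * se
for nm, lo, hi in zip(names, lower, upper):
    print(f"{nm}: [{lo:.2f}, {hi:.2f}]")
```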
Conclusion
- Difference-in-differences is hugely powerful in applied settings

- Does not require random assignment, but rather implementation of policies that differentially impact different groups and are not confounded by other shocks at the same time.

- Can be a great application of big data, with convincing graphs that highlight your application

- Also allows for partial tests of identifying assumptions

- Worth carefully thinking about what your identifying assumptions are in each setting, and transparently highlighting them.

- Important to note that this always identifies a relative effect; to aggregate, you will typically need a model and additional strong assumptions (see Auclert, Dobbie and Goldsmith-Pinkham (2019) for an example in a macro setting).
My takeaways from new literature

- Beware weak tests of pre-trends. Consider using R&R’s partial identification tests to
assess robustness of results.

- Do not worry about the new literature on staggered timings if you only have one
timing!

- Think carefully about your estimand if you’re using a staggered timing DinD – what’s
your counterfactual in each case?
- Software exists for many of these papers. This is doable!

- When plotting confidence intervals in event studies, you should plot uniform
confidence intervals.
