Canonical Research Designs I:
Difference-in-Differences
Paul Goldsmith-Pinkham
March 22, 2021
Revisiting Research Design
- Recall my attempt at a definition:
- A (causal) research design is a statistical and/or economic statement of how an empirical research paper will estimate a relationship between two (or more) variables that is causal in nature – X causing Y.
- The design should have a description for how some variation in X is either caused by or approximated by a randomized experiment.
- DiNardo and Lee (2011) have a famous handbook chapter entitled “Program Evaluation and Research Design” where they make a distinction between two types of research designs: D-conditions and S-conditions
Revisiting Research Design
- D-condition designs fall clearly into the “PGP” description of a research design – knowledge of the DGP leads to variation in the data that generates our identification
- S-condition designs fall less clearly into this context (as we will discuss).
- The relationship between X and Y can be clearly articulated, but how it is potentially approximated by a random experiment is less obvious
- This issue will become clearer as we discuss our first topic
Estimating causal effects in real settings
- In many applications, we want to estimate the effect of a policy across groups
- However, the policy assignment is not necessarily uncorrelated with group characteristics
- How can we identify the effect of the policy without being confounded by these level differences?
- Answer: Difference-in-differences! (DinD)
First, a warning
- This literature has had a certain amount of upheaval over the past 5-6 years
- Tension: provide context for how people currently and historically have studied diff-in-diff
- But also elaborate on concerns identified in recent papers
- The key issues boil down to two questions:
1. What is the counterfactual estimand?
- Does your estimator map to your estimand? (e.g. “Are you getting at what you meant to?”)
2. What are your structural assumptions and their implications?
- Do you need to assume functional forms? (e.g. “Is this really something that has an experimental analog?”)
- Papers have both pointed out these issues and provided solutions to almost all of the problems they’ve raised, so this is not something that should prevent you from using these tools
Basic setup
- Assume we have n units (i) and T time periods (t)
- Consider a binary policy Dit, and we are interested in estimating its effect on outcomes Yit
- The inherent problem is that Dit is not necessarily randomly assigned
- The historical key (and parametric) assumption underlying the potential outcomes model (one version):
Yit(Dit) = αi + γt + τi Dit, so that Yit(1) − Yit(0) = τi
- Implication? In the absence of the treatment, the Yit across units evolve in parallel – their γt are identical. Absent the policy, units may have different levels (αi) but their changes would evolve in parallel
- This is a key (parametric!) identifying assumption:
Yit(0) − Yi,t−k(0) = γt − γt−k,  Yit(0) − Yjt(0) = αi − αj
Basic 2x2 DinD setup
- Recall our typical estimand of interest is the ATE or the ATT:
τATE = E(Yit(1) − Yit(0)) = E(τi)
τATT = E(Yit(1) − Yit(0) | Dit = 1) = E(τi | Dit = 1)
- Since D is not randomly assigned and we only observe one time period, this model is inherently not identified without additional assumptions.
- Why? Di could be correlated with αi
- Recall that our plug-in estimator approaches need estimates for E(Yit(1)) and E(Yit(0))
- Where can we get unbiased estimates?
- With two time periods we can make a lot more progress!
2 × 2 DinD estimation
- Potential outcomes Yit(d) by period under the model:
            t = 0               t = 1
D = 0       γ0 + αi             γ1 + αi
D = 1       γ0 + αi + τi        γ1 + αi + τi
- Now consider the within-unit difference:
Yi1 − Yi0 = (γ1 − γ0) + τi(Di1 − Di0)
- Hence
E(Yi1 − Yi0 | Di1 − Di0 = 1) − E(Yi1 − Yi0 | Di1 − Di0 = 0) = E(τi | Di1 − Di0 = 1)
- Wait, you say, that’s a lot more notation than I was expecting.
- Simplifying assumption: treatment only goes one way in period 1 (“absorbing adoption”, e.g. Di0 = 0):
E(Yi1 − Yi0 | Di1 = 1) − E(Yi1 − Yi0 | Di1 = 0) = E(τi | Di1 = 1) = τATT
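This 2x2 logic is easy to check numerically. A minimal simulation sketch (all parameter values are made up for illustration): treatment is deliberately correlated with the unit effects αi, so comparing levels across groups is biased, but comparing within-unit changes recovers τ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10_000, 2.0                    # units and a homogeneous treatment effect
alpha = rng.normal(0, 1, n)             # unit fixed effects alpha_i
d = (alpha + rng.normal(0, 1, n) > 0).astype(float)  # D correlated with alpha_i

y0 = alpha + 0.0 + rng.normal(0, 1, n)            # t = 0 (gamma_0 = 0), nobody treated yet
y1 = alpha + 0.5 + tau * d + rng.normal(0, 1, n)  # t = 1 (gamma_1 = 0.5), D = 1 treated

# Naive comparison of levels at t = 1: biased, since E(alpha | D = 1) > E(alpha | D = 0)
naive = y1[d == 1].mean() - y1[d == 0].mean()

# DinD: compare within-unit *changes* across groups; alpha_i and gamma_t difference out
did = (y1 - y0)[d == 1].mean() - (y1 - y0)[d == 0].mean()
print(f"naive: {naive:.2f}, DinD: {did:.2f}")  # naive well above 2, DinD close to 2
```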
An aside on our simplifying assumption
- The choice of focusing on take-up of a policy, such that Di1 ≥ Di0, is well-grounded in many policy settings
- However, there are cases where policies turn on and then turn off, and this can vary across units
- This can be challenging and potentially problematic with heterogeneous effects
- Need to think carefully about whether Di turning on is identical (but opposite in sign) to Di turning off
- Hull (2018) working paper on mover designs discusses this
- For today, we will ignore this issue
Estimation using linear regression
- A simple linear regression will identify E(τi | Di1 = 1) with two time periods:
Yit = αi + γt + Dit β + eit (1)
- This setup is sometimes referred to as the Two-way Fixed Effects (TWFE) estimator
- Note: we could have also estimated τ directly, with n1 treated and n0 untreated units:
τ̂ = (1/n1) Σi Di(Yi1 − Yi0) − (1/n0) Σi (1 − Di)(Yi1 − Yi0)
where the first term is the average change for the treated (∆Y1) and the second is the average change for the untreated (∆Y0)
- Intuitively, we generate a counterfactual for the treatment using the changes in the untreated units: E(Yi1 − Yi0 | Di = 0)
- Necessary: two time periods! What if we have more?
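With two periods, the TWFE regression in (1) and the differencing estimator coincide exactly. A simulated sketch of that equivalence, computing TWFE via the two-way within (demeaning) transformation rather than a regression package:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 500, 1.5
alpha = rng.normal(0, 1, n)
d = (alpha > 0).astype(float)                # treatment correlated with unit effects
y = np.column_stack([
    alpha + rng.normal(0, 0.5, n),                  # t = 0, gamma_0 = 0
    alpha + 0.3 + tau * d + rng.normal(0, 0.5, n),  # t = 1, gamma_1 = 0.3
])
D = np.column_stack([np.zeros(n), d])        # D_it: treated units switch on at t = 1

# TWFE: OLS on two-way demeaned variables (partials out alpha_i and gamma_t)
def within(m):
    return m - m.mean(1, keepdims=True) - m.mean(0, keepdims=True) + m.mean()

beta = (within(D) * within(y)).sum() / (within(D) ** 2).sum()

# Differencing estimator: DinD of within-unit changes
dy = y[:, 1] - y[:, 0]
did = dy[d == 1].mean() - dy[d == 0].mean()
print(beta, did)                             # numerically identical with T = 2
```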
Multiple time periods in basic setup
- Let’s consider a policy that occurs all at once at t0 (e.g. a single timing, rolled out to treated units)
- More time periods help in several ways:
1. If we have multiple periods before the policy implementation, we can partially test the underlying assumptions
- Sometimes referred to as “pre-trends”
2. If we have multiple periods after the policy implementation, we can examine the timing of the effect
- Is it an immediate effect? Does it die off? Is it persistent?
- If you pool all time periods together into one “post” variable, this estimates the average effect. If the sample is not balanced, this can have unintended effects!
- How do we implement this?
Yit = αi + γt + Σ_{t=1,t≠t0}^{T} δt Dit + eit,
- One of the coefficients is fundamentally unidentified because of αi
- All coefficients measure the effect relative to period t0.
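The event-study specification above can be run as one OLS with unit dummies, time dummies, and treated × period interactions, omitting t0 as the baseline. A sketch on simulated data (all numbers illustrative; here the true effect is zero through t0 and appears afterward):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, t0 = 200, 6, 3                        # t0 is the omitted baseline period
periods = np.arange(1, T + 1)
treat = (rng.random(n) < 0.5).astype(float)

# Event-study dummies: treated x period, omitting t = t0
labels = [t for t in periods if t != t0]
event_cols = [(treat[:, None] * (periods == t)[None, :]).ravel() for t in labels]

unit_d = np.repeat(np.eye(n), T, axis=0)    # unit dummies (alpha_i)
time_d = np.tile(np.eye(T)[:, 1:], (n, 1))  # time dummies (gamma_t), drop t = 1
X = np.column_stack([unit_d, time_d] + event_cols)

# Simulate: no effect through t0, then effects of 1.0, 1.5, 1.5 in t = 4, 5, 6
alpha, gamma = rng.normal(0, 1, n), np.linspace(0, 1, T)
effect = np.array([0, 0, 0, 1.0, 1.5, 1.5])
Y = (alpha[:, None] + gamma[None, :] + treat[:, None] * effect[None, :]
     + rng.normal(0, 0.3, (n, T))).ravel()

coef = np.linalg.lstsq(X, Y, rcond=None)[0]
delta = dict(zip(labels, coef[-len(labels):]))
print({t: round(v, 2) for t, v in delta.items()})  # pre-period deltas near 0, post near 1.0, 1.5, 1.5
```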
Pre-testing and structural assumptions
- Note that for the above model, we made a stronger assumption about trends
- The DiNardo and Lee “S-assumptions” start to bite
- We assumed that Yit(d) − Yi,t−k(d) = γt − γt−k for all k and d
- This is testable pre-treatment (hence the pre-test)
- This is very powerful and has helped spark the growth in DinD regressions
- Visual demonstration of “pre-trends” helps support the validity of the design
- Worth doing!
- Two key issues:
1. Pre-testing can cause statistical problems
2. What does parallel trends even mean?
Pre-testing issues (Roth 2020)
- Consider T = 3 and think about what a pre-trend test is trying to do
- Testing whether the difference relative to t = 0 for t = −1 is significant
- Unconditionally, this is reasonable. However, Roth (2020) highlights that this is a form of pre-testing, and that low power in detecting pre-trends can be problematic
- By selecting on pre-trends that “pass”, we will tend to choose baseline realizations that satisfy pre-trends, but induce bias in the estimated effect
How to interpret this caution?
- First, don’t panic. Examining pre-trends is still an important diagnostic
- Important to realize that selecting your design based on pre-trends is constructing your counterfactual
- Pre-tests will cause you to potentially contaminate your design
- Suggested solution from Roth (2020): incorporate robustness to pre-trends into your analysis. Rambachan and Roth (2020) present results on testing sensitivity of DinD results to pre-trends
- Brief intuition follows
Rambachan and Roth (2020) suggestion
- Intuitive proposed solution for robustness. Note the post and pre effects:
- Parallel trends assumes these δ are zero. But pre-trends may not be zero.
- R&R say: we can use the info from our pre-trends to bound the post-trend
- Use a smoothness assumption, M, on the second derivative of the counterfactual trend (e.g. in the simplest case, the violation of parallel trends can deviate from a linear extrapolation of the pre-trend by at most M per period)
This approach adds more work but also more validity
- Need to select M, and results will likely be less strong
- However, a very powerful way to address concerns about pre-trends
- Code for applying this technique is available in R: https://github.com/asheshrambachan/HonestDiD
Parallel trends in what?
- A known issue that was historically not formalized: is the outcome specified in logs, or in levels?
- Hopefully it’s clear that if something satisfies parallel pre-trends in logs, it is unlikely to satisfy them in levels
- Recall that this is the issue of invariance we discussed with quantile treatment effects
- In our parametric setting, if there are time trends in the outcomes, parallel trends are unlikely to hold for all transformations of the variables.
- That could be problematic if you wanted to be agnostic about the model!
- Roth and Sant’Anna (2021) directly discuss this issue. Their suggestion:
Our results suggest that researchers who wish to point-identify the ATT should justify one of the following: (i) why treatment is as-if randomly assigned, (ii) why the chosen functional form is correct at the exclusion of others, or (iii) a method for inferring the entire counterfactual distribution of untreated potential outcomes.
Cases of DinD
- 1 treatment timing, Binary treatment, 2 periods
- Card and Krueger (AER, 1994)
- 1 treatment timing, Binary treatment, T periods
- Yagan (AER, 2015)
- 1 treatment timing, Continuous treatment
- Berger, Turner and Zwick (JF, 2020)
- Staggered treatment timing, Binary treatment
- Bailey and Goodman-Bacon (AER, 2015)
Card and Krueger (1994)
- Card and Krueger (1994) study the impact of New Jersey increasing the minimum wage from $4.25 to $5.05 an hour on April 1, 1992
- Key question: what impact does this have on employment?
- Need a counterfactual for NJ, and use Pennsylvania as a control
- Collected data on 410 fast food restaurants
- Called places and asked for employment and starting wage data
- Sampled data from Feb 1992 and Nov 1992
- Hence, Di is NJ vs PA, and t = 0 is Feb 1992 and t = 1 is Nov 1992
Stark Effect on Wages in Card and Krueger (1994)
Effect on Employment in Card and Krueger (1994)
- Despite a large increase in wages, seemingly no negative impact on employment
- In fact, a marginally significant positive impact
- Looking at the raw data, this positive impact is driven by a decline in PA
- This decline is reasonable if you think that PA is a good counterfactual, since 1992 is in the middle of a recession
- A second comparison can be run with stores whose starting wage in the pre-period was above the treatment cutoff
- These stores perform similarly to PA
Key considerations for thinking about Card and Krueger (1994)
- The treatment can’t really be thought of as randomly assigned
- Treatment is completely correlated within states
- As a result, any within-state correlation of errors will be correlated with treatment status
- Given the limited number of states, time periods, and treatments, it is more valuable to view this as a case study
- Under strong parametric assumptions, can infer causality!
- Card acknowledges this (Card and Krueger interview with Ben Zipperer)
Yagan (2015)
- Yagan (2015) tests whether the 2003 dividend tax cut stimulated corporate investment and increased labor earnings
- Big empirical question for corporate finance and public finance
- No direct evidence on the real effects of dividend tax cuts
- Real corporate outcomes are too cyclical to distinguish tax effects from business cycle effects, and the economy boomed
- Paper uses the distinction between “C” corp and “S” corp designation to estimate the effect
- Key feature of law: S-corps didn’t have dividend taxation
- Identifying assumption (from paper):
The identifying assumption underlying this research design is not random assignment of C- versus S-status; it is that C- and S-corporation outcomes would have trended similarly in the absence of the tax cut.
Investment Effects (none)
Employee + Shareholder effects (big)
Key Takeaway + threats
- Tax reform had zero impact on differential investment and employee compensation
- Challenges orthodoxy on estimates of the cost-of-capital elasticity of investment
- What are the underlying challenges to identification?
1. Have to assume (and try to show) that the only differential effect on S- vs C-corporations was through dividend tax changes
2. During 2003, could other shocks differentially impact them?
- Yes, accelerated depreciation – but Yagan shows it impacts them similarly.
- Key point: you have to make more assumptions to conclude that zero differential effect on investment implies zero aggregate effect.
Berger, Turner and Zwick (2019)
- This paper studies the impact of temporary fiscal stimulus (the First-Time Home Buyer tax credit) on housing markets
- Policy was differentially targeted towards first-time home buyers
- Define program exposure as “the number of potential first-time homebuyers in a ZIP code, proxied by the share of people in that ZIP in the year 2000 who are first-time homebuyers”
- The key threat to the design (from the paper):
The key threat to this design is the possibility that time-varying, place-specific shocks are correlated with our exposure measure.
- This measure is not binary – we are effectively just comparing areas with a low share vs. a high share. However, we have a dose-response framework in mind – as we increase the share, the effect size should grow.
First stage: Binary approximation
First stage: Regression coefficients
Final Outcome: Regression coefficients
Binary Approximation vs. Continuous Estimation
- Remember, our main equation did not necessarily specify that Dit had to be binary:
Yit = αi + γt + Σ_{t=1,t≠t0}^{T} δt Dit + eit, (2)
- However, if it is continuous, we are making an additional strong functional form assumption: that the effect of Dit on our outcome is linear.
- We make this linear approximation all the time in our regression analysis, but it is worth keeping in mind. It is partially testable in a few ways:
- Bin the continuous Dit into quartiles {D̃it,k}_{k=1}^{4} and estimate the effect across those groups:
Yit = αi + γt + Σ_{t=1,t≠t0}^{T} Σ_{k=1}^{4} δt,k D̃it,k + eit. (3)
- What does the ordering of δt,k look like? Is it at least monotonic?
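The quartile-binning check in (3) is straightforward to implement. A sketch with two periods for simplicity (numbers are illustrative): the true dose response is linear, so the binned means of the within-unit changes should be monotone and roughly evenly spaced.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(0, 1, n)                    # continuous exposure D_i
# Within-unit change: common time shift (0.4) + linear dose response (slope 2.0).
# Unit effects alpha_i have already differenced out of the change.
dy = 0.4 + 2.0 * x + rng.normal(0, 0.5, n)

# Assign each unit to a quartile bin of exposure and compare mean changes
cuts = np.quantile(x, [0.25, 0.5, 0.75])
q = np.searchsorted(cuts, x, side="right")  # bin index 0..3
bin_means = np.array([dy[q == k].mean() for k in range(4)])
gaps = np.diff(bin_means)
print(bin_means.round(2))                   # increasing, with roughly equal gaps (~0.5)
```

Monotone, evenly spaced bin means are consistent with the linear functional form; uneven or non-monotone ordering would be a warning sign.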
Berger, Turner and Zwick implementation of linearity test
Takeaway
- When you have a continuous exposure measure, it can be intuitive and useful to present binned means for “high” and “low” groups
- However, it is best to also present regression coefficients that exploit the full range of the continuous measure, so that people don’t think you’re data mining
- Consider examining for non-monotonicities in your policy exposure measure
- This paper still has only one “shock” – one policy time period for implementation
Bailey and Goodman-Bacon (2015)
- Paper studies the impact of the rollout of Community Health Centers (CHCs) on mortality
- Idea is that CHCs can help lower mortality (esp. among the elderly) by providing accessible preventative care
- Exploit timing of implementation of CHCs:
Our empirical strategy uses variation in when and where CHC programs were established to quantify their effects on mortality rates. The findings from two empirical tests support a key assumption of this approach—that the timing of CHC establishment is uncorrelated with other determinants of changes in mortality.
- Issue is that CHC locations are not randomly chosen
- Since CHCs are started in different places in different time periods, we estimate effects in event-time, e.g. relative to initial rollout.
Negative effect on mortality
Negative effect on mortality, particularly among elderly
Key takeaways
- Since the policy changes are staggered, we are less worried about the effect being driven by one confounding macro shock.
- Easier to defend a story that has effects across different timings
- Also allows us to test for heterogeneity in the time series
- Still makes the exact same identifying assumptions – parallel trends in the absence of changes
But a big issue emerges when we exploit differential timing
- We have been extrapolating from the simple pre-post, treatment-control setting to broader cases
- e.g. multiple time periods of treatment
- In fact, in some applications, the policy eventually hits everyone – we are just exploiting differential timing.
- If we run the “two-way fixed effects” model for these types of DinD,
Yit = αi + γt + βDD Dit + eit (4)
what comparisons are we doing once we have lots of timings?
- Key point: is our estimator mapping to our estimand?
- Well, what’s our estimand?
What is our estimand with staggered timings?
- There is a huge host of papers touching on this question
- Callaway and Sant’Anna (2020) propose the following building-block estimand:
τATT(g, t) = E(Yit(1) − Yit(0) | Dit = 1 ∀ t ≥ g), (5)
the ATT in period t for those units whose treatment turns on in period g.
- In the 2x2 case, this was exactly our effect!
- This paper assumes absorbing treatment, but this can be weakened in other papers (de Chaisemartin and d’Haultfoeuille (2020) discuss this)
- It seems very reasonable that for our overall estimand, we want some weighted combination of these ATTs
- Callaway and Sant’Anna (2020) highlight two ways to identify the above estimand:
1. Parallel trends of the treatment group with a group that is “never-treated”
2. Parallel trends of the treatment group with the group of the “not yet treated”
- Using these building blocks, C&S provide a very natural set of potential ways to aggregate these estimands up
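The group-time estimand in (5) can be estimated with simple comparisons of means, e.g. against the not-yet-treated. A sketch on simulated staggered adoption (this shows the comparison-of-means logic only, not the full Callaway and Sant'Anna estimator with covariates or aggregation):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 3000, 4
g = rng.choice([2, 3, 99], n)                 # adoption period per unit; 99 = never treated
alpha = rng.normal(0, 1, n)
gamma = np.array([0.0, 0.2, 0.5, 0.9])        # common time effects

def true_tau(gi, t):                          # effect grows with time since adoption
    return max(t - gi + 1, 0)

Y = np.empty((n, T))
for t in range(1, T + 1):
    tau_t = np.array([true_tau(gi, t) for gi in g])
    Y[:, t - 1] = alpha + gamma[t - 1] + tau_t + rng.normal(0, 0.5, n)

def att(gg, t):
    """ATT(g, t): change from period g-1 to t, adopters at g vs. the not-yet-treated at t."""
    base, cur = gg - 2, t - 1                 # 0-indexed columns for periods g-1 and t
    treated, nyt = g == gg, g > t
    return (Y[treated, cur] - Y[treated, base]).mean() - (Y[nyt, cur] - Y[nyt, base]).mean()

print(round(att(2, 2), 2), round(att(2, 3), 2), round(att(3, 3), 2))  # true values: 1, 2, 1
```

Each ATT(g, t) uses only clean comparisons: units already treated by t never serve as controls.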
Wait, what happened to TWFE?
- It turns out that the logic of the TWFE does not naturally extend to differential timings
- Recall from our discussion of linear regression that regression is great because it does a variance-weighted approximation:
τ = E(σ²D(Wi) τ(Wi)) / E(σ²D(Wi)),  where σ²D(Wi) = E((Di − E(Di | Wi))² | Wi)
- It turns out that in the panel setting with staggered timings, these weights are not necessarily positive
- Key insight from several papers: with staggered timings + heterogeneous effects, the TWFE approach to DinD (both using a single pooled estimator, or using an event study) can put large negative weight on certain groups’ estimands, and large positive weight on others
- Serious issue for interpretability
- Some example papers: Borusyak and Jaravel (2017), de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2019), Sun and Abraham (2020)
- Key point: this is solvable. It is merely an artifact of being overly casual with the estimator definition
Goodman-Bacon 2x2 comparisons
- Consider two staggered treatments and a never-treated group
- What does the TWFE estimator estimate?
Goodman-Bacon 2x2 comparisons
- Four potential comparisons can be made
- It turns out that the (pooled) TWFE DD estimator is a weighted average of all 2x2 comparisons
- These weights end up putting a high degree of weight on units treated in the middle of the sample (since they have the highest variance in the treatment indicator!)
Goodman-Bacon 2x2 comparisons
- The weighting becomes problematic if the effects vary over time – if the effects are instantaneous and time-invariant, the weights are all positive
- However, time-varying effects create bad counterfactual groups, and create negative weights
- Goodman-Bacon provides a way to assess the weights in a given TWFE design
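The problem is easy to produce in a deterministic example. A sketch with two adoption groups and effects that grow with time since adoption (parameters illustrative; no noise, one representative unit per group, so the arithmetic is exact): here the pooled TWFE coefficient comes out at essentially zero even though every treated observation has a strictly positive effect.

```python
import numpy as np

# Two adoption groups, T = 6: "early" adopts at t = 2, "late" at t = 5.
# Treatment effects grow with time since adoption: 1, 2, 3, ... per period treated.
T = 6
t_idx = np.arange(1, T + 1)
adopt = np.array([2, 5])                          # one representative unit per group
D = (t_idx[None, :] >= adopt[:, None]).astype(float)
Y = np.maximum(t_idx[None, :] - adopt[:, None] + 1, 0).astype(float)  # alpha_i = gamma_t = 0

def within(m):                                    # two-way demeaning
    return m - m.mean(1, keepdims=True) - m.mean(0, keepdims=True) + m.mean()

beta_twfe = (within(D) * within(Y)).sum() / (within(D) ** 2).sum()
att_avg = Y[D == 1].mean()                        # average effect across treated cells

print(beta_twfe, att_avg)                         # TWFE far below the average treated effect
```

The late adopters' late periods are compared against the early adopters, whose effects are still growing, which generates the negative weighting.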
What to do with staggered timing in DinD?
- There’s really no reason to use the baseline TWFE with staggered timings
- A perfect example wherein the estimator does not generate an estimate that maps to a meaningful estimand
- There are several approaches proposed in the literature that are just as good!
- Sun and Abraham (2020)
- de Chaisemartin and d’Haultfoeuille (2020)
- Borusyak and Jaravel (2017)
- Callaway and Sant’Anna (2020)
- These are all robust to this issue. I find Callaway and Sant’Anna quite intuitive, but your circumstances may vary slightly. Key piece to keep in mind that differs a bit across papers:
- Is my treatment absorbing?
- Irrespective of the exact paper, the key point is that we are generating a counterfactual and need to be careful that our estimator does so correctly
Finally, a discussion on inference
- First, let’s start with the old-school fact that you must know if you are working with panel data and DinD
- You must cluster on the unit of policy implementation if possible. See Bertrand, Duflo and Mullainathan (2004)
- Why? Outcomes and the treatment tend to be severely autocorrelated within unit
- I say “if possible” since clearly in Card and Krueger that is infeasible
- If the policy variation is implemented at the industry level, you should not cluster at the firm level
- If the policy variation is implemented at the firm level, you cannot just use heteroskedasticity-robust standard errors
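The Bertrand, Duflo and Mullainathan point can be seen with plain sandwich algebra. A sketch with a unit-level placebo policy and serially correlated errors (in practice you would use your regression package's clustered SEs; this just shows why the unclustered ones are far too small):

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 200, 10
d = np.repeat((rng.random(n) < 0.5).astype(float), T)  # unit-level placebo "policy"
u = np.repeat(rng.normal(0, 1, n), T)                  # unit component -> serial correlation
y = u + rng.normal(0, 1, n * T)                        # no true treatment effect

X = np.column_stack([np.ones(n * T), d])
XtX_inv = np.linalg.inv(X.T @ X)
e = y - X @ (XtX_inv @ X.T @ y)                        # OLS residuals

# Heteroskedasticity-robust variance: treats all n*T observations as independent
V_r = XtX_inv @ ((X.T * e**2) @ X) @ XtX_inv

# Cluster-robust variance: sum the scores X_it * e_it within each unit first
ids = np.repeat(np.arange(n), T)
S = np.column_stack([np.bincount(ids, weights=X[:, k] * e) for k in range(2)])
V_c = XtX_inv @ (S.T @ S) @ XtX_inv

se_robust, se_cluster = np.sqrt(V_r[1, 1]), np.sqrt(V_c[1, 1])
print(round(se_robust, 3), round(se_cluster, 3))       # cluster SE is much larger
```

With T periods and within-unit correlation ρ, the clustered SE is roughly sqrt(1 + (T − 1)ρ) times the unclustered one, so ignoring clustering badly overstates precision.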
Small clusters
- The Card and Krueger case was too extreme, but there are approaches for dealing with a small number of clusters
- These approaches typically involve bootstrapping, and can handle a small number of treated groups relative to the overall population
- See Andreas Hagemann’s work for a place to start
Uniform confidence intervals
- Finally, when considering event study graphs, pre-trend graphs should use uniform confidence intervals, rather than pointwise confidence intervals
- Advocated for by Freyaldenhoven et al (2018)
- Code available here thanks to Ryan Kessler: https://github.com/paulgp/simultaneous_confidence_bands
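The difference is just the critical value: instead of 1.96, use the 95th percentile of the maximum absolute t-statistic across all event-time coefficients, simulated from the estimated covariance matrix. A sketch of that sup-t construction (the covariance matrix here is made up for illustration; the linked code implements this properly):

```python
import numpy as np

rng = np.random.default_rng(6)
k = 8                                             # number of event-time coefficients
# Illustrative covariance of the coefficient estimates: SEs of 0.2, correlation 0.5
Sigma = 0.04 * (0.5 * np.eye(k) + 0.5 * np.ones((k, k)))
se = np.sqrt(np.diag(Sigma))

# Simulate the max |t| over the k coefficients under joint normality
draws = rng.multivariate_normal(np.zeros(k), Sigma, size=100_000)
c_uniform = np.quantile(np.abs(draws / se).max(axis=1), 0.95)

print(round(c_uniform, 2))                        # > 1.96: uniform bands are wider
# Band for coefficient j: estimate_j +/- c_uniform * se_j  (vs. +/- 1.96 * se_j pointwise)
```

The uniform band covers the whole coefficient path with 95% probability, which is what a visual "no pre-trend" claim actually requires.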
Conclusion
- Difference-in-differences is hugely powerful in applied settings
- Does not require random assignment, but rather implementation of policies that differentially impact different groups and are not confounded by other shocks at the same time.
- Can be a great application of big data, with convincing graphs that highlight your application
- Also allows for partial tests of identifying assumptions
- Worth carefully thinking about what your identifying assumptions are in each setting, and transparently highlighting them.
- Important to note that this always identifies a relative effect; to aggregate, you will typically need a model and additional strong assumptions (see Auclert, Dobbie and Goldsmith-Pinkham (2019) for an example in a macro setting).
My takeaways from the new literature
- Beware weak tests of pre-trends. Consider using R&R’s partial identification tests to assess robustness of results.
- Do not worry about the new literature on staggered timings if you only have one timing!
- Think carefully about your estimand if you’re using a staggered-timing DinD – what’s your counterfactual in each case?
- Software exists for many of these papers. This is doable!
- When plotting confidence intervals in event studies, you should plot uniform confidence intervals.