Impact evaluation using Difference-in-Differences
Anders Fredriksson and Gustavo Magalhães de Oliveira
Center for Organization Studies (CORS), School of Economics, Business and
Accounting (FEA), University of São Paulo (USP), São Paulo, Brazil
Received 18 May 2019
Revised 27 July 2019
Accepted 8 August 2019

Abstract
Purpose – This paper aims to present the Difference-in-Differences (DiD) method in an accessible language
to a broad research audience from a variety of management-related fields.
Design/methodology/approach – The paper describes the DiD method, starting with an intuitive
explanation, goes through the main assumptions and the regression specification and covers the use of several
robustness methods. Recurrent examples from the literature are used to illustrate the different concepts.
Findings – By providing an overview of the method, the authors cover the main issues involved when
conducting DiD studies, including the fundamentals as well as some recent developments.
Originality/value – The paper can hopefully be of value to a broad range of management scholars
interested in applying impact evaluation methods.
Keywords Impact evaluation, Policy evaluation, Management, Causal effects,
Difference-in-Differences, Parallel trends assumption
Paper type Research paper
1. Introduction
Difference-in-Differences (DiD) is one of the most frequently used methods in impact
evaluation studies. Based on a combination of before-after and treatment-control group
comparisons, the method has an intuitive appeal and has been widely used in economics,
public policy, health research, management and other fields. After the introductory section,
this paper outlines the method, discusses its main assumptions, then provides further details
and discusses potential pitfalls. Examples of typical DiD evaluations are referred to
throughout the text, and a separate section discusses a few papers from the broader
management literature. Conclusions are also presented.
Unlike randomized experiments, which allow for a simple comparison of
treatment and control groups, DiD is an evaluation method used in non-experimental
settings. Other members of this “family” are matching, synthetic control and regression
discontinuity. The goal of these methods is to estimate the causal effects of a program when
treatment assignment is non-random; hence, there is no obvious control group[1].

In its simplest two-group, two-period form, the DiD estimate is the following double difference of group averages:

$DiD = (\bar{y}_{treatment, after} - \bar{y}_{treatment, before}) - (\bar{y}_{control, after} - \bar{y}_{control, before})$ (1)

where y is the outcome variable, the bar represents the average value (averaged over
individuals, typically indexed by i), the group is indexed by s (because in many studies,
policies are implemented at the state level) and t is time. With before and after data for
treatment and control, the data is thus divided into the four groups and the above double
difference is calculated. The information is typically presented in a 2 × 2 table, to which a third
row and a third column are added in order to calculate the after-before and treatment-control
differences and the DiD impact measure. Figure 1 illustrates how the DiD estimate is
constructed.
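To make the arithmetic concrete, the following minimal sketch in Python (all numbers invented for illustration) lays out the 2 × 2 table of group-period means, adds the third row and column of differences, and reads off the DiD estimate:

```python
import pandas as pd

# 2 x 2 table of group-period mean outcomes (illustrative numbers only)
means = pd.DataFrame(
    {"before": [10.0, 8.0], "after": [12.0, 13.0]},
    index=["control", "treatment"],
)

# Third column: after-before difference within each group
means["after-before"] = means["after"] - means["before"]

# Third row: treatment-control difference within each period
means.loc["treatment-control"] = means.loc["treatment"] - means.loc["control"]

# The DiD estimate sits in the new corner cell: (13-8) - (12-10) = 3
print(means.loc["treatment-control", "after-before"])
```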
The above calculation and illustration say nothing about the significance level of the DiD
estimate, hence regression analysis is used. In an OLS framework, the DiD estimate is
obtained as the β-coefficient in the following regression, in which As are treatment/control
group fixed effects, Bt before/after fixed effects, Ist is a dummy equaling 1 for treatment
observations in the after period (otherwise it is zero) and εist is the error term[4]:

$y_{ist} = A_s + B_t + \beta I_{st} + \varepsilon_{ist}$ (2)
In order to verify that the estimate of β will recover the DiD estimate in (1), use (2) to get

$E(y_{ist} \mid s = Control, t = Before) = A_{Control} + B_{Before}$
$E(y_{ist} \mid s = Control, t = After) = A_{Control} + B_{After}$
$E(y_{ist} \mid s = Treatment, t = Before) = A_{Treatment} + B_{Before}$
$E(y_{ist} \mid s = Treatment, t = After) = A_{Treatment} + B_{After} + \beta$
In these expressions, $E(y_{ist} \mid s, t)$ is the expected value of $y_{ist}$ in population subgroup (s, t),
which is estimated by the sample average $\bar{y}_{s,t}$. Estimating (2) and plugging the sample
counterparts of the above expressions into (1), with the hat notation representing coefficient
estimates, gives $DiD = \hat{\beta}$[5].
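This equivalence can also be checked numerically. Below is a small, self-contained simulation sketch (all parameter values invented), which estimates regression (2) in the interaction-dummy formulation mentioned in note 4 and compares the coefficient with the manual double difference in (1):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Simulated 2 x 2 individual-level data; the true treatment effect is 3.0
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),  # treatment-group dummy (A_s)
    "after": rng.integers(0, 2, n),  # after-period dummy (B_t)
})
df["y"] = (2.0 * df["treat"] + 1.5 * df["after"]
           + 3.0 * df["treat"] * df["after"] + rng.normal(0, 1, n))

# Regression (2): the interaction coefficient is the DiD estimate
beta_hat = smf.ols("y ~ treat + after + treat:after", data=df).fit().params["treat:after"]

# Manual double difference of the four sample means, as in (1)
m = df.groupby(["treat", "after"])["y"].mean()
did = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

print(beta_hat, did)  # identical up to floating-point precision
```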
The DiD model is not limited to the 2 × 2 case, and expression (2) is written in a more
general form than what was needed so far. For models with several treatment- and/or
control groups, As stands for fixed effects for each of the different groups. Similarly, with
several before- and/or after periods, each period has its own fixed effect, represented by Bt. If
the reform is implemented in all treatment groups/states at the same time, Ist switches from
zero to one in all such locations at the same time. In the general case, however, the reform is
staggered and hence implemented in different treatment groups/states s at different times t.
Ist then switches from 0 to 1 accordingly. All these cases are covered by expression (2)[6].
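For concreteness, here is a sketch of the general case with staggered adoption, using simulated data and hypothetical column names. With homogeneous treatment effects, as assumed in the simulation, the two-way fixed effects coefficient recovers the effect; see note 6 for the caveats that arise when effects are heterogeneous:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated long-format panel: 20 states x 10 years; states 0-9 adopt the
# reform in different (staggered) years, states 10-19 are never treated
df = pd.DataFrame(
    [(s, t) for s in range(20) for t in range(2000, 2010)],
    columns=["state", "year"],
)
reform_year = {s: 2004 + s % 3 if s < 10 else np.inf for s in range(20)}
df["treated_st"] = (df["year"] >= df["state"].map(reform_year)).astype(int)
df["y"] = (0.1 * df["state"] + 0.2 * (df["year"] - 2000)
           + 3.0 * df["treated_st"] + rng.normal(0, 1, len(df)))

# Expression (2) in general form: one fixed effect per state (A_s) and per
# year (B_t), plus the staggered treatment dummy I_st
twfe = smf.ols("y ~ C(state) + C(year) + treated_st", data=df).fit()
print(twfe.params["treated_st"])  # beta-hat, close to the true 3.0
```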
Individual-level control variables Xist can also be added to the regression, which
becomes:

$y_{ist} = A_s + B_t + \beta I_{st} + cX_{ist} + \varepsilon_{ist}$ (3A)
An important aspect of DiD estimation concerns the data used. Although it cannot be done
with a 2 × 2 specification (as there would be only four observations), models with many time
periods and treatment/control groups can also be analyzed with state-level (rather than
individual-level) data (e.g. US or Brazilian data, with 50 and 27 states, respectively). There
would then be no i-index in regression (3A). Perhaps the relevant data is only available at the state level (e.g.
unemployment rates from statistical institutes). Individual-level observations can also be aggregated.

[Figure 1: Illustration of the two-group two-period DiD estimate. The assumed treatment-group counterfactual equals the treatment group's pre-reform value plus the after-before difference from the control group.]

An advantage of the latter approach is that one avoids the problem (discussed
in Section 4) that the within group-period (e.g. state-year) error terms tend to be correlated
across individuals, and hence standard errors must be corrected. With either type of data,
state-level control variables, Zst, may also be included in expression (3A)[7]. A more general form
of the regression specification, with individual-level data, becomes:

$y_{ist} = A_s + B_t + \beta I_{st} + cX_{ist} + dZ_{st} + \varepsilon_{ist}$ (3B)
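In estimation terms, expression (3B) simply adds the controls to the regression formula. The fragment below continues the staggered sketch shown earlier (reusing its df and rng); the control columns are hypothetical stand-ins for Xist and Zst:

```python
# Hypothetical controls: age as an individual-level X_ist, state unemployment
# as a state-level Z_st (here just noise; a real Z_st varies only by s and t)
df["age"] = rng.integers(18, 65, len(df))
df["state_unemp"] = rng.normal(8.0, 2.0, len(df))

# Expression (3B): fixed effects, treatment dummy, and the c/d control terms
general = smf.ols(
    "y ~ C(state) + C(year) + treated_st + age + state_unemp", data=df
).fit()
print(general.params["treated_st"])
```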
[Figure 2: Graphs used to visually check the parallel trends assumption. (a) Child mortality rates, different areas of Buenos Aires, Argentina, 1990-1999 (reproduced from Galiani et al., 2005); (b) days per year not in good physical health, 2001-2009, Massachusetts and control states (from Courtemanche & Zapata, 2014).]
If the treatment group includes only high values of a control variable and the control group only low values, one is, in
fact, comparing incomparable entities. There must instead be overlap in the distribution of the
control variables between the different groups and time periods.
It should be noted that the parallel trends assumption is scale dependent, which is an
undesirable feature of the DiD method. Unless the outcome variable is constant during the
pre-reform periods, in both treatment and control, it matters if the variable is used “as is” or
if it is transformed (e.g. wages vs log wages). One approach to this issue is to use the data in
the form corresponding to the parameter one wants to estimate (Lechner, 2011), rather than
adapting the data to a format that happens to fit the parallel trends assumption.
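A made-up numerical example illustrates the point. Suppose the treatment group's pre-reform outcome moves from 100 to 110 while the control group's moves from 50 to 60: the trends are parallel in levels (both increase by 10), but not in logs, since log(110/100) ≈ 0.10 while log(60/50) ≈ 0.18. The reverse can equally well occur, so the chosen transformation can determine whether the assumption appears to hold.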
A closing remark for this section: when planning the empirical project, and before the actual
analysis, it is worth spending time carefully considering all possible data sources, whether
first-hand data needs to be collected, and so on. Perhaps data limitations are such that a robust
DiD study – including a parallel trend check – is not feasible. On the other hand, in the
process of learning about the institutional details of the intervention studied, new data
sources may appear.
4.2 Difference-in-Difference-in-Differences
Difference-in-Difference-in-Differences (DiDiD) is an extension of the DiD concept (Angrist
& Pischke, 2009), briefly illustrated here through an example. Long, Yemane, & Stockley (2010)
study the effects of the special provisions for young people in the Massachusetts health
reform. The authors use data on both young adults and slightly older adults. Through the
DiDiD method, they compare the change over time in health outcomes for young adults in
Massachusetts to young adults in a comparison state and to slightly older adults in
Massachusetts and construct a triple difference, to also control for other changes that occur
in the treatment state.
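In regression form, a DiDiD specification can be sketched as follows (a stylized version for the Massachusetts example, not necessarily the exact specification of Long et al., 2010). With dummies for Massachusetts (MAs), the young-adult age group (Younga) and the post-reform period (Postt), the regression includes all two-way interactions plus the triple interaction:

$y_{iast} = \alpha + \beta_1 MA_s + \beta_2 Young_a + \beta_3 Post_t + \beta_4 (MA_s \times Young_a) + \beta_5 (MA_s \times Post_t) + \beta_6 (Young_a \times Post_t) + \beta_7 (MA_s \times Young_a \times Post_t) + \varepsilon_{iast}$

where $\beta_7$, the coefficient on the triple interaction, is the DiDiD estimate.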
4.3 Standard errors[12]
In the basic OLS framework, observations are assumed to be independent and standard
errors homoscedastic. The standard errors of the regression coefficients then take a
particularly simple form. Such errors are typically “corrected”, however, to allow for
heteroscedasticity (Eicker-Huber-White heteroscedasticity-robust standard errors). The
second “standard” correction is to allow for clustering. Think of individual-level data from
different regions, where some regions are treated; others are not. Within a region (“cluster”),
the individuals are likely to share many characteristics: perhaps they go to the same schools,
work at the same firms, have access to the same media outlets, are exposed to similar
weather, etc. Factors such as these make observations within clusters correlated. In effect,
there is less variation than if the data had been independent random draws from the
population at large. Standard errors need to be corrected accordingly, typically implying
that the significance levels of the regression coefficients are reduced[13].
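Both corrections are one-line options in standard software. The sketch below (region structure and all numbers invented) compares heteroscedasticity-robust and cluster-robust standard errors in statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# 30 regions x 200 individuals; regions share a common shock, and the
# regressor of interest varies only at the region level
n_regions, n_per = 30, 200
df = pd.DataFrame({"region": np.repeat(np.arange(n_regions), n_per)})
df["treated"] = (df["region"] < 15).astype(int)
region_shock = np.repeat(rng.normal(0, 1, n_regions), n_per)
df["y"] = 0.5 * df["treated"] + region_shock + rng.normal(0, 1, len(df))

# Eicker-Huber-White (heteroscedasticity-robust) standard errors
hc = smf.ols("y ~ treated", data=df).fit(cov_type="HC1")

# Cluster-robust standard errors, clustering by region
cl = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["region"]}
)
print(hc.bse["treated"], cl.bse["treated"])  # the clustered SE is much larger
```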
For correct inference with DiD, a third adjustment needs to be done. With many time
periods, the data can exhibit serial correlation. This holds for many typical dependent
variables in DiD studies, such as health outcomes, and, in particular, the treatment variable
itself. The observations within each of the treatment and control groups can thus be
correlated over time. Failing to correct for this can greatly overstate significance levels,
which was the topic of the highly influential paper by Bertrand et al. (2004).
One way of handling the within-group clustering issue is to collapse the individual data
to state-level averages. Similarly, the serial correlation problem can be handled by
collapsing all pre-treatment periods to one before-period, and all post-treatment periods to
one after-period. Having checked the parallel trends assumption, one thus works with two
periods of data, at the state level (which requires many treatment and control states). A
drawback, however, is that the sample size is greatly reduced. The option to instead
continue with the individual-level data and calculate standard errors that are robust to
heteroscedasticity, within-group effects and serial correlation is provided by many
econometric software packages.
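As a sketch of the collapse remedy, assume a common reform year and a long-format DataFrame named panel with hypothetical columns y, state, year and treated (the last a time-invariant treatment-state dummy); this fragment is illustrative only:

```python
import statsmodels.formula.api as smf

REFORM_YEAR = 2005  # hypothetical common reform year

# Pool all pre-reform years into one before-period and all post-reform
# years into one after-period, averaging y to the state-period level
panel["post"] = (panel["year"] >= REFORM_YEAR).astype(int)
collapsed = (
    panel.groupby(["state", "post"], as_index=False)
         .agg(y=("y", "mean"), treated=("treated", "first"))
)

# Standard 2 x 2 DiD regression on the collapsed state-level data
two_period = smf.ols("y ~ treated + post + treated:post", data=collapsed).fit()
print(two_period.params["treated:post"])
```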
Notes
1. The reader is assumed to have basic knowledge about regression analysis (e.g. Wooldridge, 2012)
and also about the core concepts in impact evaluation, e.g. identification strategy, causal
inference, counterfactuals, randomization and treatment effects (e.g. Gertler, Martinez, Premand,
Rawlings, & Vermeersch, 2016, chapters 3-4; White & Raitzer, 2017, chapters 3-4).
2. In this text, the terms policy, program, reform, law, regulation, intervention, shock or
treatment are used interchangeably, when referring to the object being evaluated, i.e. the
treatment.
3. Lechner (2011) provides a historical account, including Snow’s study of cholera in London in the 1850s.
4. The variable denominations are similar to those in Bertrand et al. (2004). An alternative way to
specify regression (2), in the 2 × 2 case, is to use an intercept, treatment- and after dummies and a
dummy equaling the interaction between the treatment and after dummies (e.g. Wooldridge,
2012, chapter 13). The regression results are identical.
5. Angrist & Pischke (2009), Blundell & Costa Dias (2009), Lechner (2011) and Wing et al. (2018) are
examples of references that provide additional details on the correspondence between the
“potential outcomes framework”, the informal/intuitive/graphical derivation of the DiD measure
and the regression specification, as well as a discussion of population vs. sample properties.
6. Note that the interpretation of b changes somewhat if the reform is staggered (Goodman-Bacon,
2018). An even more general case, not covered in this text, is when Ist switches on and off. A
particular group/state can then go back and forth between being treated and untreated (e.g. Bertrand
et al., 2004). Again different is the case where Ist is continuous (e.g. Aragon & Rud, 2013).
7. Note that Xist and Zst are both vectors of variables. The X-variables could be e.g. gender, age and
income, i.e. three variables, each with individual level observations. Zst can be e.g. state
unemployment, variables representing racial composition, number of hospital beds, etc.,
depending on the study. The regression coefficients c and d are (row) vectors.
8. See also Wing et al. (2018, pp. 460-461) for a discussion of the related concept of event studies.
Their set-up can also be used to study short- and long-term reform effects. A slightly different
type of placebo test is to use control states only, to study if there is an effect where there should
be none (Bertrand et al., 2004).
9. In relation to this discussion, note that the Difference-in-Differences method estimates the
Average Treatment Effect on the Treated, not on the population (e.g. Blundell & Costa Dias,
2009; Lechner, 2011; White & Raitzer, 2017, chapter 5).
10. Matching (also referred to as “selection on observables”) hinges upon the Conditional
Independence Assumption (CIA) (or “unconfoundedness”), which says that, conditional on the
control variables, treatment and control would have the same expected outcome, in either
treatment state (treated/untreated). Hence the treatment group, if untreated, would have the same
expected outcome as the control group, and the selection bias disappears (e.g. Angrist &
Pischke, 2009, chapter 3). Rosenbaum & Rubin (1983) showed that if the CIA holds for a set of
variables Zs, then it also holds for the propensity score P(Zs).
11. Such a method is used for panel data. When the data are repeated cross sections, each of the three
groups treatment-before, control-before and control-after needs to be matched to the treatment-
after observations (Blundell & Costa Dias, 2000; Smith & Todd, 2005).
12. For a general discussion, refer to Angrist & Pischke (2009) and Wooldridge (2012). Abadie,
Athey, Imbens, and Wooldridge (2017), Bertrand et al. (2004) and Cameron & Miller (2015)
provide more details.
13. When there are group effects, it is important to have a large enough number of group-period cells,
in order to apply DiD, an issue further discussed in Bertrand et al. (2004).
References
Abadie, A., & Cattaneo, M. D. (2018). Econometric methods for program evaluation. Annual Review of
Economics, 10, 465–503.
Abadie, A., & Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque
Country. American Economic Review, 93, 113–132.
Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. (2017). When should you adjust standard
errors for clustering? (NBER Working Paper No. 24003). National Bureau of Economic Research
(NBER).
Aggarwal, V. A., & Hsu, D. H. (2014). Entrepreneurial exits and innovation. Management Science, 60,
867–887.
Angrist, J. D., & Krueger, A. B. (1999). Empirical strategies in labor economics. In Ashenfelter, O., &
Card, D. (Eds), Handbook of labor economics (Vol. 3, pp. 1277–1366). Amsterdam, The
Netherlands: Elsevier.
Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s companion,
Princeton, NJ: Princeton University Press.
Aragon, F. M., & Rud, J. P. (2013). Natural resources and local communities: Evidence from a Peruvian
gold mine. American Economic Journal: Economic Policy, 5, 1–25.
Ashenfelter, O. (1978). Estimating the effect of training programs on earnings. The Review of
Economics and Statistics, 60, 47–57.
Athey, S., & Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation.
Journal of Economic Perspectives, 31, 3–32.
Berger, A. N., Kick, T., & Schaeck, K. (2014). Executive board composition and bank risk taking.
Journal of Corporate Finance, 28, 48–65.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences
estimates? The Quarterly Journal of Economics, 119, 249–275.
Blundell, R., & Costa Dias, M. (2000). Evaluation methods for non-experimental data. Fiscal Studies, 21,
427–468.
Blundell, R., & Costa Dias, M. (2009). Alternative approaches to evaluation in empirical
microeconomics. Journal of Human Resources, 44, 565–640.
Bruno, V., Cornaggia, J., & Cornaggia, J. K. (2016). Does regulatory certification affect the information
content of credit ratings? Management Science, 62, 1578–1597.
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of
Human Resources, 50, 317–372.
Card, D. (1990). The impact of the Mariel boatlift on the Miami labor market. ILR Review, 43, 245–257.
Card, D., & Krueger, A. B. (1994). Wages and employment: A case study of the fast-food industry in
New Jersey and Pennsylvania. American Economic Review, 84, 772–793.
Card, D., & Krueger, A. B. (2000). Minimum wages and employment: A case study of the fast-food
industry in New Jersey and Pennsylvania: reply. American Economic Review, 90, 1397–1420.
Chen, G., Crossland, C., & Huang, S. (2014). Female board representation and corporate acquisition
intensity. Strategic Management Journal, 37, 303–313.
Conyon, M. J., Hass, L. H., Peck, S. I., Sadler, G. V., & Zhang, Z. (2019). Do compensation
consultants drive up CEO pay? Evidence from UK public firms. British Journal of
Management, 30, 10–29.
Courtemanche, C. J., & Zapata, D. (2014). Does universal coverage improve health? The Massachusetts
experience. Journal of Policy Analysis and Management, 33, 36–69.
Distelhorst, G., Hainmueller, J., & Locke, R. M. (2016). Does lean improve labor standards? Management
and social performance in the Nike supply chain. Management Science, 63, 707–728.
Duflo, E., Glennerster, R., & Kremer, M. (2008). Using randomization in development economics
research: A toolkit. In P. Schultz, & J. Strauss (Eds.), Handbook of development economics
(Vol. 4, pp. 3895–3962). Amsterdam, The Netherlands and Oxford, UK: North-Holland/Elsevier.
Flammer, C. (2015). Does product market competition foster corporate social responsibility? Strategic
Management Journal, 38, 163–183.
Flammer, C., & Kacperczyk, A. (2016). The impact of stakeholder orientation on innovation: Evidence
from a natural experiment. Management Science, 62, 1982–2001.
Galiani, S., Gertler, P., & Schargrodsky, E. (2005). Water for life: The impact of the privatization of
water services on child mortality. Journal of Political Economy, 113, 83–120.
Gertler, P. J., Martinez, S., Premand, P., Rawlings, L. B., & Vermeersch, C. M. (2016). Impact evaluation
in practice, Washington, DC: The World Bank.
Goodman-Bacon, A. (2018). Difference-in-Differences with variation in treatment timing. NBER
Working Paper No. 25018. NBER.
He, P., & Zhang, B. (2018). Environmental tax, polluting plants’ strategies and effectiveness: Evidence
from China. Journal of Policy Analysis and Management, 37, 493–520.
Holm, J. M. (2018). Successful problem solvers? Managerial performance information use to
improve low organizational performance. Journal of Public Administration Research and
Theory, 28, 303–320.
Hosken, D. S., Olson, L. M., & Smith, L. K. (2018). Do retail mergers affect competition? Evidence from
grocery retailing. Journal of Economics & Management Strategy, 27, 3–22.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program
evaluation. Journal of Economic Literature, 47, 5–86.
Iyer, R., Peydro, J. L., da-Rocha-Lopes, S., & Schoar, A. (2013). Interbank liquidity crunch and the firm
credit crunch: Evidence from the 2007-2009 crisis. Review of Financial Studies, 27, 347–372.
Khwaja, A. I., & Mian, A. (2008). Tracing the impact of bank liquidity shocks: Evidence from an
emerging market. American Economic Review, 98, 1413–1442.
Kumar, A., Bezawada, R., Rishika, R., Janakiraman, R., & Kannan, P. K. (2016). From social to sale: The effects
of firm-generated content in social media on customer behavior. Journal of Marketing, 80, 7–25.
Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations
and Trends® in Econometrics, 4, 165–224.
Lemmon, M., & Roberts, M. R. (2010). The response of corporate financing and investment to changes
in the supply of credit. Journal of Financial and Quantitative Analysis, 45, 555–587.
Long, S. K., Yemane, A., & Stockley, K. (2010). Disentangling the effects of health reform in
Massachusetts: How important are the special provisions for young adults? American Economic
Review, 100, 297–302.
Pierce, L., Snow, D. C., & McAfee, A. (2015). Cleaning house: The impact of information technology
monitoring on employee theft and productivity. Management Science, 61, 2299–2319.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational
studies for causal effects. Biometrika, 70, 41–55.
Schnabl, P. (2012). The international transmission of bank liquidity shocks: Evidence from an emerging
market. The Journal of Finance, 67, 897–932.
Singh, J., & Agrawal, A. (2011). Recruiting for ideas: How firms exploit the prior inventions of new
hires. Management Science, 57, 129–150.
Smith, J. A., & Todd, P. E. (2005). Does matching overcome LaLonde’s critique of nonexperimental
estimators? Journal of Econometrics, 125, 305–353.
Sommers, B. D., Long, S. K., & Baicker, K. (2014). Changes in mortality after Massachusetts health care
reform: A quasi-experimental study. Annals of Internal Medicine, 160, 585–594.
White, H., & Raitzer, D. A. (2017). Impact evaluation of development interventions: A practical guide,
Mandaluyong, Philippines: Asian Development Bank.
Wing, C., Simon, K., & Bello-Gomez, R. A. (2018). Designing difference in difference studies: Best
practices for public health policy research. Annual Review of Public Health, 39, 453–469.
Wooldridge, J. M. (2012). Introductory econometrics: A modern approach (5th ed.). Mason, OH: South-
Western College Publishing.
Younge, K. A., Tong, T. W., & Fleming, L. (2014). How anticipated employee mobility affects
acquisition likelihood: Evidence from a natural experiment. Strategic Management Journal, 36,
686–708.
Corresponding author
Anders Fredriksson can be contacted at: [email protected]