Pooled Cross Sections and Panel Data,
Difference in Difference
Pooled Cross Sections and Panel Data: Overview
Observations over individual units and time: Wooldridge
chapters 13 and 14.
Pooling independent cross sections across time (13.1-2).
Data structures and definitions
Cross section: Observations on a set of variables in a given period,
t, for individual units i = 1, 2, ..., n:
$(y_{it}, x_{it1}, x_{it2}, \dots, x_{itk})$
Usually think of the cross section as a random sample from some
population at time t
Two period case:
Period 1 cross section: $(y_{i1}, x_{i11}, x_{i12}, \dots, x_{i1k}), \quad i = 1, 2, \dots, n_1$
Period 2 cross section: $(y_{i2}, x_{i21}, x_{i22}, \dots, x_{i2k}), \quad i = n_1+1, n_1+2, \dots, n_1+n_2$
How are the period 1 and period 2 cross sections related?
Independent cross sections: Two independently drawn random samples:
(In general) different individual units in period 1 and period 2.
Panel data: Same n individuals appear in period 1 and in period 2.
Pooling independent cross sections across time
Independent cross sections for two periods:
Pooled data:
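$(y_{it}, x_{it1}, x_{it2}, \dots, x_{itk}), \quad i = 1, 2, \dots, n_1, n_1+1, \dots, n_1+n_2$
One extreme: Estimate the pooled model $y = X\beta + u$, giving $\hat{\beta}_{pooled}$
Other extreme: Treat the data in each cross section
separately:
$y = X\beta^{(1)} + u, \quad i = 1, 2, \dots, n_1$, giving $\hat{\beta}^{(1)}$
$y = X\beta^{(2)} + u, \quad i = n_1+1, n_1+2, \dots, n_1+n_2$, giving $\hat{\beta}^{(2)}$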
”Partial pooling”: Combine the cross sections but allow the
coefficients of some variables to change between cross
sections.
Pooling independent cross sections
Allow the coefficients of some of the variables to change over time:
A special case of structural change
Use dummy variables (W ch. 7): Time dummies (e.g. year dummies)
Two periods: Need one dummy variable, usually for second period:
$d2_i = 1$ if individual i is in the period 2 sample
$d2_i = 0$ if individual i is not in the period 2 sample
Usually: Allow intercept to change
$y_i = \beta_0 + \delta_0 d2_i + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + u_i$
Other coefficients allowed to change as well: Interaction terms.
Pooling independent cross sections: Testing
Testing: Is $\beta_1$ constant over time? Usual t-test for $\delta_1 = 0$ in
$y_i = \beta_0 + \delta_0 d2_i + \beta_1 x_{i1} + \delta_1 d2_i x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + u_i$
Allow all coefficients to change over time: No gain from pooling the
cross sections
Fully interacted regression:
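$y_i = \beta_0 + \delta_0 d2_i + \beta_1 x_{i1} + \delta_1 d2_i x_{i1} + \beta_2 x_{i2} + \delta_2 d2_i x_{i2} + \dots + \beta_k x_{ik} + \delta_k d2_i x_{ik} + u_i$
F-test for $\delta_0 = \delta_1 = \dots = \delta_k = 0$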
Easy implementation of F-statistic: SSRs from pooled and separate
regressions (”Chow test”)
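The SSR-based implementation mentioned above can be sketched as follows. This is a minimal illustration on simulated data, assuming NumPy is available; the helper names `ssr` and `chow_f` and all simulated numbers are assumptions made for the example, not part of the lecture. Here `k` counts all coefficients including the intercept, so the number of restrictions equals `k`.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X (X includes the constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_f(X1, y1, X2, y2):
    """Chow F-statistic for equal coefficients in two independent cross sections."""
    k = X1.shape[1]                       # coefficients per regression, incl. intercept
    n = X1.shape[0] + X2.shape[0]         # total observations in the pooled sample
    ssr_pooled = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))   # restricted
    ssr_separate = ssr(X1, y1) + ssr(X2, y2)                          # unrestricted
    return ((ssr_pooled - ssr_separate) / k) / (ssr_separate / (n - 2 * k))

# Simulated (hypothetical) data: two cross sections of 200 units each
rng = np.random.default_rng(0)
n1, n2 = 200, 200
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n1)
y2_same = X2 @ np.array([1.0, 2.0]) + rng.normal(size=n2)   # same coefficients
y2_diff = X2 @ np.array([3.0, 0.5]) + rng.normal(size=n2)   # structural change

f_same = chow_f(X1, y1, X2, y2_same)
f_diff = chow_f(X1, y1, X2, y2_diff)
print(f_same, f_diff)
```

With a structural change in period 2, the statistic should be far larger than when both periods share the same coefficients; it is compared with an F critical value with (k, n - 2k) degrees of freedom.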
Testing the equivalence of two regressions using dummy
variables (implementation Chow Test)
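$P_t = \beta_1 + \delta D_t + \beta_2 S_t + \gamma (S_t D_t) + e_t$
where $D_t = 1$ for the desirable neighborhood and 0 otherwise, so that
$E(P_t) = (\beta_1 + \delta) + (\beta_2 + \gamma) S_t$ for desirable neighborhood data
$E(P_t) = \beta_1 + \beta_2 S_t$ for other neighborhood data
This is equivalent to estimating a separate regression for each neighborhood:
$P_t = \alpha_1 + \alpha_2 S_t + e_t$ (desirable neighborhood, with $\alpha_1 = \beta_1 + \delta$ and $\alpha_2 = \beta_2 + \gamma$)
$P_t = \beta_1 + \beta_2 S_t + e_t$ (other neighborhood)
The joint test of $\delta = \gamma = 0$ is known as the Chow Test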
Testing the equivalence of investment
demand in two firms
Restricted Equation (No dummies)
$\widehat{INV} = 17.8720 + 0.0152 V + 0.1436 K$
(2.544) (2.452) (7.719)
$SSE_R = 16563.00$
Unrestricted Equation (Dummy applied to all parameters)
$\widehat{INV} = -9.9563 + 9.4469 D + 0.0266 V + 0.0263 (D \times V) + 0.1517 K - 0.0593 (D \times K)$
(-0.421) (0.328) (2.265) (0.767) (7.837) (-0.507)
$SSE_U = 14989.82$
$F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(T - K)} = \frac{(16563.00 - 14989.82)/3}{14989.82/(40 - 6)} = 1.1894$
$F < F_c = 2.8826$, so the restrictions are not rejected and it is concluded that the
equation is the same for both firms
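The F-statistic arithmetic above can be checked directly. The sketch below uses only the numbers reported on the slide; the variable names are illustrative, and the critical value is taken from the slide rather than computed.

```python
# Values reported on the slide
sse_r = 16563.00    # restricted SSE (no dummies)
sse_u = 14989.82    # unrestricted SSE (dummy applied to all parameters)
j = 3               # number of restrictions (D, D*V, D*K coefficients all zero)
t_obs = 40          # total observations T
k = 6               # parameters in the unrestricted equation

f_stat = ((sse_r - sse_u) / j) / (sse_u / (t_obs - k))
f_crit = 2.8826     # critical value quoted on the slide

reject = f_stat > f_crit
print(round(f_stat, 4), reject)   # 1.1894 False
```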
An alternative way of computing $SSE_U$ in
the Chow test
Estimate the simplified equation:
$INV_t = \beta_1 + \beta_2 V_t + \beta_3 K_t + e_t$
for each firm separately
$SSE_U = SSE_1 + SSE_2$
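The equality between the unrestricted SSE and the sum of the two firm-level SSEs can be verified numerically. The following is a sketch on simulated data (not the investment data used above), assuming NumPy is available; the helper `sse` and all generated values are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Sum of squared OLS residuals of y on X (X includes the constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Simulated (hypothetical) data: 20 observations for each of two firms
rng = np.random.default_rng(1)
n = 20
V = rng.uniform(500, 3000, size=2 * n)
K = rng.uniform(50, 800, size=2 * n)
D = np.repeat([1.0, 0.0], n)            # D = 1 for the first firm
inv = (5 + 0.03 * V + 0.12 * K
       + D * (10 + 0.02 * V - 0.05 * K)
       + rng.normal(scale=20, size=2 * n))

ones = np.ones(2 * n)

# Route 1: one regression with the dummy applied to all parameters
X_full = np.column_stack([ones, D, V, D * V, K, D * K])
sse_interacted = sse(X_full, inv)

# Route 2: the simplified equation estimated for each firm separately
X_simple = np.column_stack([ones, V, K])
first = D == 1
sse_separate = sse(X_simple[first], inv[first]) + sse(X_simple[~first], inv[~first])

print(sse_interacted, sse_separate)
```

The two routes should agree up to floating-point error, because the fully interacted regression fits each firm's observations with its own intercept and slopes.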
What is “Difference-in-Difference”
(D-in-D) Estimation
D-in-D estimation is a research design and
empirical process intended to assess the “true”
effect of a policy or practice intervention where
random assignment is not feasible.
The “true” effect of an intervention is the total
effect of an intervention on an outcome, net of any
changes in outcome that would occur in the
absence of the intervention.
A D-in-D is the difference between two
differences (or changes):
Difference #1: The change in outcome for an
intervention group from pre- to post-
intervention
Difference #2: The change in outcome for a
control (non-intervention) group over the
same pre- to post-intervention periods
When is D-in-D Used?
For policy or practice evaluations where
experimental conditions reasonably exist
except for randomization of subjects:
Natural Experiments – where the intervention is
established independent of the researcher (e.g.
public policy)
Quasi Experiments – where the researcher
controls the intervention but randomization isn’t
ethically or otherwise feasible.
Identifying Assumption
Whatever happened to the control group over
time is what would have happened to the
treatment group in the absence of the
program.
[Figure: outcome paths for the treatment (T) and control (C) groups, Pre vs. Post]
The figure builds up three estimates of the program effect:
Effect of program using only pre- & post- data from T group (ignoring general time trend).
Effect of program using only T & C comparison from post-intervention (ignoring pre-existing differences between T & C groups).
Effect of program difference-in-difference (taking into account pre-existing differences between T & C and general time trend).
Uses of Diff-in-Diff
Simple two-period, two-group comparison
very useful in combination with
randomization, matching, or RD
Can also do much more complicated “cohort”
analysis, comparing many groups over many
time periods
A Main Underlying Assumption
Parallel Trends – in the absence of
intervention, the unobserved differences
between intervention and control groups are
the same over time.
Relaxes assumption that intervention and control
groups are the same in every respect apart from
the intervention (randomization is supposed to
achieve this)
Intervention group would follow the outcome
“path” of control group if no intervention
Any pre-intervention outcome differences between
intervention and control groups are constant
effects that can be factored (differenced) out
How it works
Given an outcome Oit measured for pre/post
intervention time periods (t = 1, 2) and control/
intervention groups (i = 1, 2)
                Pre          Post         Change
Intervention    O21          O22          O22 - O21
Control         O11          O12          O12 - O11
Difference      O21 - O11    O22 - O12    D-in-D
D-in-D = (O22 - O21) - (O12 - O11)
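The table above amounts to simple arithmetic on four group-by-period means. A minimal sketch, using made-up mean outcomes (the numbers and the function name `did_from_means` are hypothetical):

```python
def did_from_means(o):
    """D-in-D from a dict mapping (group i, period t) to a mean outcome.

    Group 1 = control, group 2 = intervention; period 1 = pre, period 2 = post.
    """
    change_intervention = o[(2, 2)] - o[(2, 1)]   # O22 - O21
    change_control = o[(1, 2)] - o[(1, 1)]        # O12 - O11
    return change_intervention - change_control

# Hypothetical mean outcomes, keyed by (i, t)
means = {(1, 1): 0.40, (1, 2): 0.45, (2, 1): 0.30, (2, 2): 0.50}

did = did_from_means(means)

# The same number as the post-period gap minus the pre-period gap
alt = (means[(2, 2)] - means[(1, 2)]) - (means[(2, 1)] - means[(1, 1)])
print(did, alt)
```

Differencing the rows first or the columns first gives the same D-in-D, which is why the bottom-right cell of the table is well defined.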
How it Works
To estimate the D-in-D in a regression
framework, we need dummy variables that
will identify the four subject group and time
period combinations:
P(ost) = 1 in post periods, =0 in pre periods
I(ntervention) = 1 if intervention, =0 if control
P(ost)xI(ntervention) = 1 if post & intervention, =0
otherwise
Note – Control group in pre-period is “excluded”
group – will be measured by regression constant
How it works
A D-in-D regression model would look like:
Oit = B0 + B1*I + B2*P + B3*PxI + e
                Pre        Post            Change
Intervention    B0+B1      B0+B1+B2+B3     B2+B3
Control         B0         B0+B2           B2
Difference      B1         B1+B3           D-in-D
D-in-D = B3
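The claim that B3 equals the D-in-D can be checked on simulated data. This sketch assumes NumPy is available; the data, sample size, and "true" coefficients are invented for illustration. With only the three dummies as regressors, the OLS estimate of B3 coincides exactly with the difference of differences of the four cell means.

```python
import numpy as np

# Simulated (hypothetical) data
rng = np.random.default_rng(42)
n = 400
I = rng.integers(0, 2, size=n).astype(float)   # 1 if intervention, 0 if control
P = rng.integers(0, 2, size=n).astype(float)   # 1 in post period, 0 in pre period
true_b = np.array([10.0, -2.0, 1.5, 3.0])      # B0, B1, B2, B3 used to simulate

X = np.column_stack([np.ones(n), I, P, P * I])
O = X @ true_b + rng.normal(scale=1.0, size=n)

# OLS estimates of B0, B1, B2, B3
b_hat, *_ = np.linalg.lstsq(X, O, rcond=None)

def cell_mean(i, p):
    """Mean outcome for group i (0/1) in period p (0/1)."""
    return O[(I == i) & (P == p)].mean()

did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(b_hat[3], did_means)
```

The advantage of the regression form is that covariates can be added to the right-hand side and a standard error for B3 comes with the estimate.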
Getting Started
You need:
An intervention (change)
Outcome measure(s)
Comparison group(s)
Information on subject characteristics
Some Best Practices
Know your intervention
Is there clear documentation of what they are doing (fidelity)?
Are there types of individuals that are more or less likely to
respond to the intervention?
Are there likely anticipatory or shock (short-term) effects? (“wash
out” periods)
Know its environment
Can you identify those receiving intervention from those not?
Is there anything else going on that might affect the outcomes
you plan to measure?
Why is this being done now? In this particular place?
(endogeneity)
Some Best Practices
Take the parallel trend assumption seriously
Thoughtfully choose control group(s) e.g. can subjects choose to
be intervention or control? (selection)
Test for stable differences in outcomes between
control/intervention groups across pre-intervention time periods.
Minimize all observable differences (covariates/
matching/weighting methods addressing subject characteristics)
Understand and be prepared to explain the “flow” of outcomes
that result in the D-in-D – not just the D-in-D itself.
Be thorough and transparent
Seek additional ways to “test” your findings e.g. “internal” D-in-Ds
on intervention subjects more and less likely to be affected to
assure any outcome change is likely related to intervention
Report all aspects of the conduct and context of the study
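One informal way to look at the stability of pre-intervention differences is to track the intervention-minus-control gap period by period. The sketch below uses invented mean outcomes for four pre-intervention periods; it is a descriptive check only, not a formal test, and all names and numbers are hypothetical.

```python
# Hypothetical mean outcomes in four pre-intervention periods
pre_control = [0.40, 0.42, 0.43, 0.45]
pre_intervention = [0.30, 0.32, 0.34, 0.35]

# Gap between groups in each pre-period
gaps = [t - c for t, c in zip(pre_intervention, pre_control)]

# Period-to-period change in the gap; values near zero are consistent
# with parallel pre-intervention trends
gap_changes = [b - a for a, b in zip(gaps, gaps[1:])]
max_drift = max(abs(g) for g in gap_changes)

print(gaps, max_drift)
```

In practice the same idea is usually implemented in a regression, by interacting the intervention dummy with pre-period dummies and testing whether those interactions are jointly zero.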
Example:
Estimate effect of MH insurance parity in
Oregon state on receipt of MH outpatient
care within 30 days of MH inpatient stay.
Start with overall D-in-D to estimate policy effect
for all Oregonians experiencing parity
Used pooled comparison group of subjects from
states of Oregon, Washington, California
Followed with “internal” D-in-D estimating policy
effects for individuals most likely to be affected by
policy
Table 2. Estimated Average Marginal Effects: Parity vs. Non-Parity Observations¹
Measure                             Estimate    SE      P
Post-Parity Effect (D-in-D)           .114     .056    .042
Post-Parity Period (Control)         -.033     .034
Pre-Parity Period (Intervention)     -.057     .040
Psychotic Disorder Discharge Dx       .076     .031    .015
Female                                .046     .031
Spouse                               -.064     .043
Dependent                            -.147     .035    <.001
Calendar Quarter 2                   -.000     .039
Calendar Quarter 3                   -.061     .037
Calendar Quarter 4                   -.087     .039    .027
Observations                          888
Unique Subjects                       727
¹ Derived from logistic regression results
Table 4. Estimated Average Marginal Effects: Parity Observations
Meeting Pre-Parity Quantitative Limits vs. All Other¹
Measure                             Estimate    SE      P
Post-Parity Effect (Met Limits)       .203     .093    .028
Post-Parity Period (All Other)        .021     .050
Pre-Parity Period (Met Limits)       -.153     .070    .028
Psychotic Disorder Discharge Dx       .067     .051
Female                                .019     .048
Spouse                               -.042     .064
Dependent                            -.138     .054    .011
Calendar Quarter 2                    .064     .064
Calendar Quarter 3                   -.053     .055
Calendar Quarter 4                   -.070     .059
Observations                          353
Unique Subjects                       298
¹ Derived from logistic regression results
Difference in Difference for Housing price cases
Example 13.3: Effect of the location of a garbage
incinerator on house prices.
Hypothesis: Having an incinerator nearby lowers the
price of a house.
Data: Prices and characteristics of houses in different
distances to the incinerator.
Two cross-sections: 1978 and 1981.
Before and ”after” the incinerator was built in 1981.
Policy analysis with pooled cross sections
Naive approach: Use 1981 cross section to estimate the
model
$price = \beta_0 + \beta_1 nearinc + u$
price is the price of a house, nearinc is a dummy variable that
takes the value 1 if the house is located near the incinerator.
OLS estimates using 1981 cross section: $\widehat{price} = 101 - 31\, nearinc$
Is this a ”good” estimate of the causal effect on house prices
of locating the incinerator nearby?
NO! Incinerator may have been located near houses that
were already cheap in 1978.
OLS estimates using 1978 cross section: $\widehat{price} = 83 - 19\, nearinc$
Policy analysis with pooled cross sections
Difference-in-differences approach:
House prices have gone up between 1978 and 1981 for
most houses. Whether nearby and far away from the
location of the incinerator.
Relevant question: Has the change been bigger for houses
far from the incinerator?
Need to look at differences in space (nearby/far away) of
differences in time (between 1978 and 1981): Diff-in-diff.
Regression implementation:
$price = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u$
Policy analysis with pooled cross sections
Model: $price = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u$
$\delta_0$: Common change over time
$\beta_1$: Pre-incinerator difference in prices
$\delta_1$: Change in price due to incinerator
Test of the hypothesis that nearby incinerator lowers house prices:
$H_0: \delta_1 = 0$ vs. $H_1: \delta_1 < 0$
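The three coefficients can be recovered from the two single-year regressions reported earlier for this example, taking the nearinc coefficient as negative in both years (intercept 101 and slope -31 in 1981; intercept 83 and slope -19 in 1978). The sketch below only rearranges those rounded numbers; the variable names are illustrative.

```python
# Rounded estimates from the single-year regressions on the slides
b0_78, b1_78 = 83.0, -19.0     # 1978: intercept, nearinc coefficient
b0_81, b1_81 = 101.0, -31.0    # 1981: intercept, nearinc coefficient

# Implied average prices in the four year-by-location cells
far_78, near_78 = b0_78, b0_78 + b1_78
far_81, near_81 = b0_81, b0_81 + b1_81

delta0 = far_81 - far_78                           # common change over time
beta1 = near_78 - far_78                           # pre-incinerator difference
delta1 = (near_81 - near_78) - (far_81 - far_78)   # diff-in-diff

print(delta0, beta1, delta1)   # 18.0 -19.0 -12.0
```

Equivalently, delta1 is the 1981 nearinc coefficient minus the 1978 one: -31 - (-19) = -12.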
Policy analysis with pooled cross sections:
Example 13.3
$price = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u$
$H_0: \delta_1 = 0$ vs. $H_1: \delta_1 < 0$
                          Coefficient $\hat{\delta}_1$   Standard error   $R^2$
Model as above            -12                            7.5              0.17
Full set of ”controls”    -14                            5                0.67
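The one-sided test can be carried out from the coefficients and standard errors in the table. The sketch below uses the rounded table values and, as an assumption, a standard normal approximation for the 5% one-sided critical value (the slide does not state the sample size or degrees of freedom).

```python
from statistics import NormalDist

# (estimate, standard error) for the diff-in-diff coefficient, from the table
results = {
    "model as above": (-12.0, 7.5),
    "full set of controls": (-14.0, 5.0),
}

# One-sided 5% critical value under a normal approximation (about -1.645)
crit = NormalDist().inv_cdf(0.05)

t_stats = {name: est / se for name, (est, se) in results.items()}
reject = {name: t < crit for name, t in t_stats.items()}

print(t_stats)
print(reject)
```

Under this approximation, the model without controls (t = -1.6) falls just short of rejection at the 5% level, while the model with the full set of controls (t = -2.8) rejects the null in favor of a negative effect.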
Quasi-experiments and natural experiments
Mimic controlled experiments in science by finding
something that happened ”naturally” to one group of
people, but not to another.
Treated group: Houses nearby the location of the
incinerator.
Control group: Houses far away.
Comparing groups before and after the ”treatment”:
Building the incinerator