Financial Econometrics Homework 6


Financial Econometrics

Empirical Application No. 6:


Difference-In-Difference Analysis

YE HONGBO, COLEASA Alexandra, ARIES Allen Jerry


Master 2 Finance Technology Data
UFR 02: École d’économie de la Sorbonne
Université Paris 1 Panthéon-Sorbonne

1. Introduction

Panel Data Regression Model

The general model can be written as:


y_{i,t} = a_i + b_t + x'_{i,t} β + ε_{i,t}
The model represents an attribute y of an entity i at different dates t. The main hypothesis in panel
data analysis is that β is the same for all entities and all dates, and that the errors exhibit no
cross-sectional and no serial correlation. a_i is the individual effect, while b_t represents the time effect.

There are two main specifications for a_i: (i) fixed effects and (ii) random effects. The Hausman
test is used to choose between the fixed effects model and the random effects model in panel
analysis.
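As a sketch of how the Hausman test can be run in R, using the plm package and its bundled Grunfeld panel data purely for illustration (not the Card-Krueger data used later):

```r
# Hausman test: H0 = the random effects estimator is consistent (and efficient);
# rejection favours the fixed effects specification.
library(plm)

data("Grunfeld", package = "plm")  # illustrative panel data shipped with plm
fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")  # fixed effects
re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")  # random effects
phtest(fe, re)  # a small p-value favours the fixed effects model
```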

2. Problem Statement

Standard economic theory predicts that raising the minimum wage will result in higher unemployment
in a labor market with perfect competition. New Jersey (NJ), a U.S. state, increased its minimum
wage from $4.25 to $5.05. This change took effect in April 1992. In this project, the study by
David Card and Alan B. Krueger on the effect of a raise in minimum wages on employment is
replicated.

Using a DiD methodology, Card and Krueger (1994) demonstrated how the increase in the mini-
mum wage resulted in more jobs being created in the fast-food restaurant industry. In their study,
Pennsylvania (PA), a neighboring state in the United States that was not affected by the policy
change, serves as the control group. A representative sample of fast-food restaurants in NJ and
PA participated in the survey the authors conducted before and after the minimum wage increase.

3. DiD Theory

DiD analysis is a statistical/econometric technique to mimic an experimental research design
using observational study data. It involves studying the differential effect of a treatment on a
'treatment group' versus a 'control group' in a natural experiment. The main assumption is that,
although treatment and comparison groups may have different levels of the outcome prior to the
start of treatment, their trends in pre-treatment outcomes should be the same. The estimated
impact of the treatment is then the OLS estimate of the parameter:

δ = (ȳ_{1,2} − ȳ_{1,1}) − (ȳ_{2,2} − ȳ_{2,1})

Indeed, ȳ_{1,1} = α_0 + a and ȳ_{1,2} = α_0 + a + b + δ are the historical means of Y for all individuals
belonging to group 1, respectively before and after the treatment date; ȳ_{2,1} = α_0 and ȳ_{2,2} = α_0 + b are
the historical means of Y for all individuals belonging to group 2, before and after the treatment date.
Accordingly, the two before/after differences are:
(ȳ_{1,2} − ȳ_{1,1}) = b + δ and (ȳ_{2,2} − ȳ_{2,1}) = b
Finally, the difference-in-differences is:

(ȳ_{1,2} − ȳ_{1,1}) − (ȳ_{2,2} − ȳ_{2,1}) = δ

which can be estimated by OLS provided that the residuals of the mean equations have no
cross-sectional or serial correlation.
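The algebra above means δ can be computed directly from the four cell means. A minimal base-R sketch, using the employment means reported later in Section 6:

```r
# Four cell means of FTE employment (group 1 = treatment/NJ, group 2 = control/PA)
y11 <- 20.4  # treatment, before
y12 <- 21.0  # treatment, after
y21 <- 23.3  # control, before
y22 <- 21.2  # control, after

delta <- (y12 - y11) - (y22 - y21)  # difference-in-differences
delta  # 0.6 - (-2.1) = 2.7 (the unrounded means give 2.75)
```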

4. Data Descriptions and Preparations

The New Jersey-Pennsylvania Data Set in the study of Card and Krueger with 410 observations
is also used in this analysis.

Table 1: Column Location

Name Start End Format Explanation


SHEET 1 3 3.0 sheet number (unique store id)
CHAIN 5 5 1.0 chain 1=bk; 2=kfc; 3=roys; 4=wendys
CO OWNED 7 7 1.0 1 if company owned
STATE 9 9 1.0 1 if NJ; 0 if Pa


Variables used in the model:

data_mod <- data_mod %>%
  # chain value label (four main restaurants)
  mutate(chain = case_when(chain == 1 ~ "bk",
                           chain == 2 ~ "kfc",
                           chain == 3 ~ "roys",
                           chain == 4 ~ "wendys")) %>%
  # state value label
  mutate(state = case_when(state == 1 ~ "New Jersey",
                           state == 0 ~ "Pennsylvania")) %>%
  # region dummy
  mutate(region = case_when(southj == 1 ~ "southj",
                            centralj == 1 ~ "centralj",
                            northj == 1 ~ "northj",
                            shore == 1 ~ "shorej",
                            pa1 == 1 ~ "phillypa",
                            pa2 == 1 ~ "eastonpa")) %>%
  # meals value label (wave 1)
  mutate(meals = case_when(meals == 0 ~ "none",
                           meals == 1 ~ "free meals",
                           meals == 2 ~ "reduced price meals",
                           meals == 3 ~ "both free and reduced price meals")) %>%
  # meals2 value label (wave 2)
  mutate(meals2 = case_when(meals2 == 0 ~ "none",
                            meals2 == 1 ~ "free meals",
                            meals2 == 2 ~ "reduced price meals",
                            meals2 == 3 ~ "both free and reduced price meals")) %>%
  # status2 value label
  mutate(status2 = case_when(status2 == 0 ~ "refused second interview",
                             status2 == 1 ~ "answered 2nd interview",
                             status2 == 2 ~ "closed for renovations",
                             status2 == 3 ~ "closed permanently",
                             status2 == 4 ~ "closed for highway construction",
                             status2 == 5 ~ "closed due to Mall fire"))

Table 3: Distribution of Restaurants

chain New Jersey Pennsylvania


bk 41.1% 44.3%
kfc 20.5% 15.2%
roys 24.8% 21.5%
wendys 13.6% 19.0%

Table 4: Key Variables

Variable Description Formula


emptot (y) Full-time Equivalent Employment emptot = empft + nmgrs + 0.5 * emppt
pct_fte % of Full-time Employees pct_fte = (empft / emptot) * 100
wage_st Starting Wage ($/hr)
hrsopen Number of Hours Open per Day

Table 5: Key Terms


Treatment Group Control Before Treatment After Treatment
New Jersey (NJ) Pennsylvania (PA) February November

5. Descriptive Statistics

Pre-treatment Means
output:
variable New Jersey Pennsylvania
1 emptot 20.4 23.3
2 pct_fte 32.8 35.0
3 wage_st 4.61 4.63
4 hrsopen 14.4 14.5

Post-treatment Means

output:
variable New Jersey Pennsylvania
1 emptot 21.0 21.2
2 pct_fte 35.9 30.4
3 wage_st 5.08 4.62
4 hrsopen 14.4 14.7

In April 1992, the U.S. state of New Jersey (NJ) raised the minimum wage from $4.25 to $5.05.
Despite the increase in wages, full-time equivalent employment increased in New Jersey relative
to Pennsylvania. Whereas New Jersey stores were initially smaller, employment gains in New
Jersey coupled with losses in Pennsylvania led to a small and statistically insignificant interstate
difference in employment.

Figure 1: Starting Wage Distribution, February and November 1992 Comparison

6. Results of First Difference

                   Treatment (NJ)    Control (PA)      Treatment (NJ)   Control (PA)
                   before treatment  before treatment  after treatment  after treatment
Notation           Y_{1,1}           Y_{2,1}           Y_{1,2}          Y_{2,2}
Total employment   20.4              23.3              21.0             21.2

In Y_{i,j}, i indicates the group (1 = treatment, 2 = control) and j indicates before (1) or after (2)
the treatment.

                   Nov − Feb         Nov − Feb         NJ − PA           NJ − PA
                   within NJ         within PA         within February   within November
Formula            NJ_Nov − NJ_Feb   PA_Nov − PA_Feb   NJ_Feb − PA_Feb   NJ_Nov − PA_Nov
Total employment   0.5880214         -2.165584         -2.891761         -0.1381549
Employment

The difference-in-differences, (NJ_Nov − NJ_Feb) − (PA_Nov − PA_Feb) = 0.588 − (−2.166),
is therefore 2.75.

7. Counter-theory Results

According to classical economic theory, the increase in the minimum wage should have decreased
employment (the counterfactual prediction). However, full-time equivalent (FTE) employment
increased after the policy raising the minimum wage came into effect.

Visually representing the relationship between the treatment and control groups greatly aids the
understanding of the DiD technique. We start from the emptot differences for NJ and PA in
February and November computed in the preceding section.

In addition, we need to know what would have happened in New Jersey had the treatment (the
increase in the minimum wage) not taken place. This is the counterfactual outcome
(NJ counterfactual).

According to the DiD hypothesis, the trends of the treatment and control groups are the same up
until the start of the treatment. Therefore, without treatment, NJ's employment (emptot) would
have decreased by the same amount as PA's from February to November.
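Under this assumption, the counterfactual NJ mean in November is obtained by applying PA's observed change to NJ's February level. A minimal sketch with the rounded means from Section 5:

```r
# Parallel trends: apply the control group's change to the treatment group's baseline
njfeb <- 20.4   # NJ, February (pre-treatment)
pafeb <- 23.3   # PA, February
panov <- 21.2   # PA, November

nj_counterfactual_nov <- njfeb + (panov - pafeb)  # 20.4 - 2.1
nj_counterfactual_nov  # 18.3
```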

8. DiD Estimation

8.1 Stationarity Tests

All series must be I(0) before the difference-in-difference analysis. Augmented Dickey-Fuller tests
are run to verify this:

"Level Variable emptot with Drift"


statistic 1pct 5pct 10pct
tau2 -18.89631 -3.43 -2.86 -2.57
phi1 178.53521 6.43 4.59 3.78

"Level Variable emptot with Drift and Trend"


statistic 1pct 5pct 10pct
tau3 -18.98278 -3.96 -3.41 -3.12
phi2 120.11572 6.09 4.68 4.03
phi3 180.17354 8.27 6.25 5.34

The results show that the series of full-time equivalent employment is I(0).

8.2 Dummy Variables

With linear regression, this result can be achieved very easily. First, two dummy variables are
created. One indicates the start of the treatment (time): it equals zero before the treatment and
one after. The other separates the observations into a treatment and a control group (treated): it
equals one for fast-food restaurants located in NJ and zero for those located in PA.
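The two dummies can be built with ifelse(), as in the code of Section 12; a self-contained sketch on a toy data frame:

```r
# time: 0 before treatment (February), 1 after (November)
# treated: 1 for NJ restaurants, 0 for PA restaurants
df <- data.frame(
  state       = c("New Jersey", "Pennsylvania", "New Jersey", "Pennsylvania"),
  observation = c("February 1992", "February 1992", "November 1992", "November 1992")
)
df$time    <- ifelse(df$observation == "November 1992", 1, 0)
df$treated <- ifelse(df$state == "New Jersey", 1, 0)
df
```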

8.3 Model 1: interaction method


The DiD estimator is the interaction between the two dummy variables. This interaction can be
specified with the ":" operator in the formula of the function lm(), in addition to the individual
dummy variables.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.331 1.072 21.767 <2e-16 ***
time -2.166 1.516 -1.429 0.1535
treated -2.892 1.194 -2.423 0.0156 *
time:treated 2.754 1.688 1.631 0.1033
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 9.406 on 790 degrees of freedom


(26 observations deleted due to missingness)
Multiple R-squared: 0.007401, Adjusted R-squared: 0.003632
F-statistic: 1.964 on 3 and 790 DF, p-value: 0.118
Conclusion: The coefficient for 'time:treated' is the difference-in-differences estimator. The
treatment effect is positive but not statistically significant at the 5% level (p ≈ 0.10).

8.4 Residual Test for DiD Model


Durbin-Watson test
DW = 1.8398, p-value = 0.008989
H0: There is no autocorrelation
H1: True autocorrelation is greater than 0

The Durbin-Watson test statistic always takes a value between 0 and 4, where:
[0, 2): positive autocorrelation
2: no autocorrelation
(2, 4]: negative autocorrelation
According to the Durbin-Watson test, there is autocorrelation in the residuals of the DiD model.
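The statistic itself is just DW = Σ_t (e_t − e_{t−1})² / Σ_t e_t² ≈ 2(1 − ρ̂), so it can be computed directly from residuals. A base-R sketch on simulated white-noise residuals (not the model's actual residuals):

```r
set.seed(1)
e <- rnorm(100)                    # simulated residuals with no autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)    # Durbin-Watson statistic
dw                                 # close to 2, as expected under H0
```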

8.5 Model 2: the multiplication method

In this method, the DiD estimator is obtained without manually generating the interaction
variable: the formula time*treated expands to both main effects plus their interaction.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.331 1.072 21.767 <2e-16 ***
time -2.166 1.516 -1.429 0.1535
treated -2.892 1.194 -2.423 0.0156 *
time:treated 2.754 1.688 1.631 0.1033
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.406 on 790 degrees of freedom


Multiple R-squared: 0.007401, Adjusted R-squared: 0.003632
F-statistic: 1.964 on 3 and 790 DF, p-value: 0.118

Conclusion: The coefficient for 'time:treated' is the difference-in-differences estimator. The
treatment effect is positive but not statistically significant at the 5% level (p ≈ 0.10).

8.6 Residual Test for did_model2

Durbin-Watson test

data: did_model2
DW = 1.8398, p-value = 0.008989
H0: There is no autocorrelation
H1: True autocorrelation is greater than 0

The Durbin-Watson test statistic always takes a value between 0 and 4, where:
[0, 2): positive autocorrelation
2: no autocorrelation
(2, 4]: negative autocorrelation

According to the Durbin-Watson test, there is autocorrelation in the residuals of the DiD model.

Conclusion: Need to introduce control X-variables in the regression

8.7 Fixed effects


In their study the authors show a more precise calculation of the DiD estimator, which only
includes fast-food restaurants with employment responses (emptot) both before and after the
treatment (a so-called balanced sample). In R this result can be generated by computing a fixed
effects model, sometimes also called a within estimator. The R package plm is used to run this
regression with the function plm() and the argument model = "within". Beforehand, the data has
to be declared as a panel with the function pdata.frame(). The variable sheet uniquely identifies
each fast-food restaurant. Additionally, the function coeftest() from the R package lmtest is
needed to obtain the correct standard errors, which must be clustered by sheet.
t test of coefficients:

Estimate Std. Error t value Pr(>|t|)


time2 -2.2833 1.2465 -1.8319 0.06775 .
time2:treated 2.7500 1.3359 2.0585 0.04022 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

9. DiD Estimation with Time-Varying Effects

The key assumption here is the parallel trends assumption: absent treatment, the outcomes for
the control and treatment groups would follow parallel trends. A confounding variable is now
added that leads to non-parallel trends. It is assumed that the outcome y also depends on a
confounding variable x that develops differently for the control and treatment groups over time:
term estimate std.error statistic p-value
1 (Intercept) -3.71 0.0511 -72.5 0
2 time 0.978 0.0723 13.5 9.47e-38
3 treated 39.2 0.0569 688. 0
4 time:treated 50.1 0.0805 622. 0

Conclusion: The coefficient for 'time:treated' is the difference-in-differences estimator. The effect
is strongly significant, with the treatment having a positive effect when time-varying effects are
considered.
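The output above comes from a simulation. A minimal sketch of the kind of data-generating process assumed here, with hypothetical parameters: a confounder x trends differently in the treatment group, so the naive DiD estimate is biased until x is added as a control:

```r
set.seed(123)
n <- 500
treated <- rep(c(0, 1), each = n)   # group indicator
time    <- rep(c(0, 1), times = n)  # period indicator

# Confounder grows faster in the treatment group => non-parallel trends in y
x <- 1 + 2 * time + 3 * time * treated + rnorm(2 * n, sd = 0.5)
y <- 5 + 1.5 * x + rnorm(2 * n)     # true treatment effect is zero

naive <- lm(y ~ time * treated)     # interaction absorbs the confounded trend
ctrl  <- lm(y ~ time * treated + x) # controlling for x removes the bias
coef(naive)["time:treated"]         # around 1.5 * 3 = 4.5, spurious
coef(ctrl)["time:treated"]          # around 0, the true effect
```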

10. DiD Estimation with X-variables (Control Variables)

So far, this would be the standard way to run the regression if there were no confounding variables.
But since there are strong reasons to believe that the number of part-time employees might distort
the analysis, the variable EMPPT (part-time employees) is added as a control. In R, it is simply
added to the equation. Instead of a regression line, there is now a three-dimensional model (time,
treated and EMPPT), so the regression line becomes a regression plane (or surface). This is how
it looks with the data:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.58246 1.11671 13.954 <2e-16 ***
time -2.32777 1.36278 -1.708 0.0880 .
treated -2.56004 1.07323 -2.385 0.0173 *
emppt 0.39645 0.02887 13.730 <2e-16 ***
time:treated 3.09423 1.51806 2.038 0.0419 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.456 on 789 degrees of freedom


Multiple R-squared: 0.1988, Adjusted R-squared: 0.1948
F-statistic: 48.95 on 4 and 789 DF, p-value: < 2.2e-16
After controlling for part-time employees, the DiD estimator becomes significant.

11. DiD Estimation with Anticipation Effects

> grangertest(card_krueger_1994_mod$emptot, card_krueger_1994_mod$time, order = 1)

Granger causality test

Model 1: card_krueger_1994_mod$time ~ Lags(card_krueger_1994_mod$time, 1:1) +
    Lags(card_krueger_1994_mod$emptot, 1:1)
Model 2: card_krueger_1994_mod$time ~ Lags(card_krueger_1994_mod$time, 1:1)
  Res.Df Df      F Pr(>F)
1    765
2    766 -1 0.0224 0.8811

> grangertest(card_krueger_1994_mod$emptot, card_krueger_1994_mod$treated, order = 1)

Granger causality test

Model 1: card_krueger_1994_mod$treated ~ Lags(card_krueger_1994_mod$treated, 1:1) +


Lags(card_krueger_1994_mod$emptot, 1:1)
Model 2: card_krueger_1994_mod$treated ~ Lags(card_krueger_1994_mod$treated, 1:1)
Res.Df Df F Pr(>F)
1 765
2 766 -1 0.0868 0.7684

The null hypothesis cannot be rejected in either test because the p-values are larger than 0.05,
implying that past employment (EMPTOT) does not help predict the treatment variables; there
is no evidence of anticipation effects.

12. R Codes

# ---------------------------------------------------------------------------------
# Title: Financial Econometrics 6 Difference-in-difference approach
# Member: Hongbo, Allen, Alexandra
# Date: 12/Nov/2022
# ---------------------------------------------------------------------------------

# install and load the necessary libraries

library(dynlm)
library(xts)
library(TTR)
library(ggplot2)
library(tidyquant)
library(tidyverse)
library(timetk)
library(tibbletime)
library(broom)
library(quantmod)
install.packages("writexl")
library(writexl)
install.packages("readxl")
library(readxl)
install.packages("seastests")
install.packages("magrittr")
install.packages("dplyr")
library(magrittr)
library(dplyr)
install.packages("moments")
library(moments)
library(tseries)
install.packages("stats")  # note: stats ships with base R, so this call is redundant
library(stats)
install.packages("forecast")
library(forecast)
library(urca)
install.packages("tsDyn")
library(tsDyn)
install.packages("sjlabelled")
library(sjlabelled)
install.packages("ggpubr")
library(ggpubr)
install.packages("plm")
library(plm)
install.packages("lmtest")
library(lmtest)

# ------------------------------------------------------------------------------
# Section 1 Data preparation for analysis
# ------------------------------------------------------------------------------

#njmin.zip is an archive with 5 files pertaining to the New Jersey - Pennsylvania surveys used
#in Card and Krueger’s book Myth and Measurement, chapter 2:

# This assignment replicates a study by David Card and Alan B. Krueger
# about the effect of a raise in minimum wages on employment.

# Temporary file and path


tfile_path <- tempfile()
tdir_path <- tempdir()

# Download zip file


download.file("http://davidcard.berkeley.edu/data_sets/njmin.zip",
destfile = tfile_path)

# -----------------
# Raw Data
# -----------------
# Unzip
unzip(tfile_path, exdir = tdir_path)

# Read codebook

codebook <- read_lines(file = paste0(tdir_path, "/codebook"))

# Generate a vector with variable names


variable_names <- codebook %>%
  `[`(8:59) %>%                      # variable names start at element 8 (sheet)
  `[`(-c(5, 6, 13, 14, 32, 33)) %>%  # remove elements without variable names
  str_sub(1, 8) %>%                  # longest variable name has 8 characters
  str_squish() %>%                   # remove whitespace
  str_to_lower()                     # use lower case only

# Generate a vector with variable labels


variable_labels <- codebook %>%
  `[`(8:59) %>%                      # variable names start at element 8 (sheet)
  `[`(-c(5, 6, 13, 14, 32, 33)) %>%  # remove elements w/o variable names
  sub(".*\\.[0-9]", "", .) %>%
  `[`(-c(5:10)) %>%                  # these elements are combined later on
  str_squish()                       # remove white spaces

# Region
variable_labels[41] <- "region of restaurant"

# Read raw data


data_raw <- read_table2(paste0(tdir_path, "/public.dat"),
col_names = FALSE)

head(data_raw)

# -----------------
# Cleaned Data
# -----------------

# Add variable names


data_mod <- data_raw %>%
  select(-X47) %>%                           # remove empty column
  `colnames<-`(., variable_names) %>%        # assign variable names
  mutate_all(as.numeric) %>%                 # treat all variables as numeric
  mutate(sheet = ifelse(sheet == 407 & chain == 4, 408, sheet))  # fix duplicated sheet id 407

# Process data (currently wide format)


data_mod <- data_mod %>%
# chain value label
mutate(chain = case_when(chain == 1 ~ "bk",
chain == 2 ~ "kfc",
chain == 3 ~ "roys",
chain == 4 ~ "wendys")) %>%
# state value label
mutate(state = case_when(state == 1 ~ "New Jersey",
state == 0 ~ "Pennsylvania")) %>%
# Region dummy
mutate(region = case_when(southj == 1 ~ "southj",
centralj == 1 ~ "centralj",
northj == 1 ~ "northj",
shore == 1 ~ "shorej",
pa1 == 1 ~ "phillypa",
pa2 == 1 ~ "eastonpa")) %>%
# meals value label
mutate(meals = case_when(meals == 0 ~ "none",
meals == 1 ~ "free meals",
meals == 2 ~ "reduced price meals",
meals == 3 ~ "both free and reduced price meals")) %>%
# meals2 value label
mutate(meals2 = case_when(meals2 == 0 ~ "none",
meals2 == 1 ~ "free meals",
meals2 == 2 ~ "reduced price meals",
meals2 == 3 ~ "both free and reduced price meals")) %>%
# status2 value label
mutate(status2 = case_when(status2 == 0 ~ "refused second interview",
status2 == 1 ~ "answered 2nd interview",
status2 == 2 ~ "closed for renovations",
status2 == 3 ~ "closed permanently",
status2 == 4 ~ "closed for highway construction",
status2 == 5 ~ "closed due to Mall fire")) %>%
mutate(co_owned = if_else(co_owned == 1, "yes", "no")) %>%
mutate(bonus = if_else(bonus == 1, "yes", "no")) %>%
mutate(special2 = if_else(special2 == 1, "yes", "no")) %>%
mutate(type2 = if_else(type2 == 1, "phone", "personal")) %>%
select(-southj, -centralj, -northj, -shore, -pa1, -pa2) %>% # now included in region dummy
mutate(date2 = lubridate::mdy(date2)) %>% # Convert date
rename(open2 = open2r) %>% #Fit name to wave 1
rename(firstinc2 = firstin2) %>% # Fit name to wave 1
sjlabelled::set_label(variable_labels) # Add stored variable labels

# -----------------
# Transposed Data
# -----------------

# Structural variables
structure <- data_mod %>%
select(sheet, chain, co_owned, state, region)

# Wave 1 variables
wave1 <- data_mod %>%
select(-ends_with("2"), - names(structure)) %>%
mutate(observation = "February 1992") %>%
bind_cols(structure)

# Wave 2 variables
wave2 <- data_mod %>%
select(ends_with("2")) %>%
rename_all(~str_remove(., "2")) %>%
mutate(observation = "November 1992") %>%
bind_cols(structure)

# Final dataset
card_krueger_1994 <- bind_rows(wave1, wave2) %>%
select(sort(names(.))) %>% # Sort columns alphabetically
sjlabelled::copy_labels(data_mod) # Restore variable labels

# ------------
# Final Data
# ------------
card_krueger_1994_mod <- card_krueger_1994 %>%
mutate(emptot = empft + nmgrs + 0.5 * emppt,
pct_fte = empft / emptot * 100)

# ------------------------------------------------------------------------------
# Section 2 Descriptive Statistics
# ------------------------------------------------------------------------------

card_krueger_1994_mod %>%
select(chain, state) %>%
table() %>%
prop.table(margin = 2) %>%
apply(MARGIN = 2,
FUN = scales::percent_format(accuracy = 0.1)) %>%
noquote

# Pre-treatment Means
card_krueger_1994_mod %>%
filter(observation == "February 1992") %>%
group_by(state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE),
pct_fte = mean(pct_fte, na.rm = TRUE),
wage_st = mean(wage_st, na.rm = TRUE),
hrsopen = mean(hrsopen, na.rm = TRUE)) %>%
pivot_longer(cols=-state, names_to = "variable") %>%
pivot_wider(names_from = state, values_from = value)

# Post-treatment Means
card_krueger_1994_mod %>%
filter(observation == "November 1992") %>%
group_by(state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE),
pct_fte = mean(pct_fte, na.rm = TRUE),
wage_st = mean(wage_st, na.rm = TRUE),
hrsopen = mean(hrsopen, na.rm = TRUE)) %>%
pivot_longer(cols=-state, names_to = "variable") %>%
pivot_wider(names_from = state, values_from = value)

# Figure
hist.feb <- card_krueger_1994_mod %>%
filter(observation == "February 1992") %>%
ggplot(aes(wage_st, fill = state)) +
geom_histogram(aes(y=c(..count..[..group..==1]/sum(..count..[..group..==1]),
..count..[..group..==2]/sum(..count..[..group..==2]))*100),
alpha=0.5, position = "dodge", bins = 23) +
labs(title = "February 1992", x = "Wage range", y = "Percent of stores", fill = "") +
scale_fill_grey()

hist.nov <- card_krueger_1994_mod %>%


filter(observation == "November 1992") %>%
ggplot(aes(wage_st, fill = state)) +
geom_histogram(aes(y=c(..count..[..group..==1]/sum(..count..[..group..==1]),
..count..[..group..==2]/sum(..count..[..group..==2]))*100),
alpha = 0.5, position = "dodge", bins = 23) +
labs(title = "November 1992", x="Wage range", y = "Percent of stores", fill="") +
scale_fill_grey()

ggarrange(hist.feb, hist.nov, ncol = 2,


common.legend = TRUE, legend = "bottom")

# ------------------------------------------------------------------------------
# Section 3 First Difference
# ------------------------------------------------------------------------------

differences <- card_krueger_1994_mod %>%


group_by(observation, state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE))

# Treatment group (NJ) before treatment


njfeb <- differences[1,3]
njfeb
# Control group (PA) before treatment
pafeb <- differences[2,3]
pafeb
# Treatment group (NJ) after treatment
njnov <- differences[3,3]
njnov
# Control group (PA) after treatment
panov <- differences[4,3]
panov

# ------------------------------------------------------------------------------
# Section 4 Average Treatment Effect
# ------------------------------------------------------------------------------

# calculate the difference between the difference of November and February within NJ and PA
(njnov-njfeb)-(panov-pafeb)

# calculate the difference between the difference of NJ and PA within November and February
(njnov-panov)-(njfeb-pafeb)

###### Counterfactual

# Calculate counterfactual outcome


nj_counterfactual <- tibble(
observation = c("February 1992","November 1992"),
state = c("New Jersey (Counterfactual)","New Jersey (Counterfactual)"),
emptot = as.numeric(c(njfeb, njfeb-(pafeb-panov)))
)

# Data points for treatment event


intervention <- tibble(
observation = c("Intervention", "Intervention", "Intervention"),
state = c("New Jersey", "Pennsylvania", "New Jersey (Counterfactual)"),
emptot = c(19.35, 22.3, 19.35)
)

# Combine data
did_plotdata <- bind_rows(differences,
nj_counterfactual,
intervention)

######Line Plot
did_plotdata %>%
mutate(label = if_else(observation == "November 1992",
as.character(state), NA_character_)) %>%
ggplot(aes(x=observation,y=emptot, group=state)) +
geom_line(aes(color=state), size=1.2) +
geom_vline(xintercept = "Intervention", linetype="dotted",
color = "black", size=1.1) +
scale_color_brewer(palette = "Accent") +
scale_y_continuous(limits = c(17,24)) +
ggrepel::geom_label_repel(aes(label = label),
nudge_x = 0.5, nudge_y = -0.5,
na.rm = TRUE) +
guides(color=FALSE) +
labs(x="", y="FTE Employment (mean)") +
annotate(
"text",
x = "November 1992",
y = 19.6,
label = "{Difference-in-Differences}",
angle = 90,
size = 3
)

# ------------------------------------------------------------------------------
# Section 5 DiD Estimator
# ------------------------------------------------------------------------------
colnames(card_krueger_1994_mod)

emptot <- card_krueger_1994_mod$emptot

#####Stationarity Test of the Series


adf.none  <- list(emptot = ur.df(na.omit(emptot), type = 'none',  selectlags = c("BIC")))
adf.drift <- list(emptot = ur.df(na.omit(emptot), type = 'drift', selectlags = c("BIC")))
adf.trend <- list(emptot = ur.df(na.omit(emptot), type = 'trend', selectlags = c("BIC")))

print("Level Variable emptot with None")


cbind(t(adf.none$emptot@teststat), adf.none$emptot@cval)

print("Level Variable emptot with Drift")


cbind(t(adf.drift$emptot@teststat), adf.drift$emptot@cval)

print("Level Variable emptot with Drift and Trend")


cbind(t(adf.trend$emptot@teststat), adf.trend$emptot@cval)

#####Dummy Variable
card_krueger_1994_mod <- mutate(card_krueger_1994_mod,
time = ifelse(observation == "November 1992", 1, 0),
treated = ifelse(state == "New Jersey", 1, 0)
)

######DiD Estimation by interaction model

card_krueger_1994_mod$did = card_krueger_1994_mod$time * card_krueger_1994_mod$treated

did_model <- lm(emptot ~ time + treated + did, data = card_krueger_1994_mod)


summary(did_model)

# residual tests
acf(did_model$residuals)

library(lmtest)
lmtest::dwtest(did_model)

######DiD Estimation by interaction model


did_model2 <- lm(emptot ~ time*treated, data = card_krueger_1994_mod)
summary(did_model2)

acf(did_model2$residuals)

lmtest::dwtest(did_model2)

######fixed effects

# Declare as panel data


panel <- pdata.frame(card_krueger_1994_mod, "sheet")

# Within model
did.reg <- plm(emptot ~ time + treated + time:treated,
data = panel, model = "within")

# obtain clustered standard errors


coeftest(did.reg, vcov = function(x)
vcovHC(x, cluster = "group", type = "HC1"))

# ------------------------------------------------------------------------------
# Section 6 DiD Estimator with X-variables (control-variables)

# ------------------------------------------------------------------------------

# use part-time employees as control variables


summary(lm(emptot ~ time*treated + emppt, card_krueger_1994_mod))

# ------------------------------------------------------------------------------
# Section 7 DiD Estimator with Anticipation Effects
# ------------------------------------------------------------------------------

grangertest(card_krueger_1994_mod$emptot, card_krueger_1994_mod$time, order = 1)


grangertest(card_krueger_1994_mod$emptot, card_krueger_1994_mod$treated, order = 1)
