Comp Lab 2 GunExample 2425
Ralf Becker
Introduction
In this computer lab you will be practicing the following:
• Creating time series plots with ggplot
• Merging datafiles
• Performing hypothesis tests for the equality of means
• Estimating and interpreting regressions
Let’s start by loading some useful packages
library(readxl) # enable the read_excel function
library(tidyverse) # for almost all data handling tasks
library(ggplot2) # plotting toolbox
library(stargazer) # for nice regression output
Context
Here we are looking at replicating aspects of this paper:
Siegel et al (2019) The Impact of State Firearm Laws on Homicide and Suicide Deaths in the USA, 1991–2016:
a Panel Study, J Gen Intern Med 34(10):2021–8.
This is the paper which we will replicate throughout the unit in order to demonstrate some Diff-in-Diff
techniques (not in this session but later).
Data Import
Import the data from the “US_Gun_example.csv” file. Recall, make sure the file (from the Week 3 BB page)
is saved in your working directory, that you set the working directory correctly and that you set the na=
option in the read.csv function to the value in which missing values are coded in the csv file. To do this
correctly you will have to open the csv file (with your spreadsheet software, e.g. Excel) and check for instance
cell G9. The stringsAsFactors = TRUE option in read.csv automatically converts character variables into
factor (categorical) variables. This is useful when you know that these variables represent categories (like
the states here).
setwd("YOUR WORKING DORECTORY")
data <- read.csv(XXXX,na="XXXX", stringsAsFactors = TRUE)
str(data)
## $ Mechanism : Factor w/ 1 level "Firearm": 1 1 1 1 1 1 1 1 1 1 ...
## $ pol_ubc : int 0 0 0 0 1 0 1 NA 0 0 ...
## $ pol_vmisd : int 0 0 0 0 1 0 1 NA 1 0 ...
## $ pol_mayissue : int 0 1 0 0 1 1 1 NA 1 0 ...
## $ Age.Adjusted.Rate: num 14.83 16.41 15.27 15.92 9.32 ...
## $ logy : num 2.7 2.8 2.73 2.77 2.23 ...
## $ ur : num 6.28 5.18 4.76 4.72 5.47 ...
## $ law.officers : int 1821 15303 7538 18548 106244 15172 9825 4716 2964 63041 ...
## $ law.officers.pc : num 287 343 280 352 308 ...
## $ vcrime : int 3696 19203 12042 28275 210661 15334 11387 4845 4845 129839 ...
## $ vcrime.pc : num 583 430 447 536 611 ...
## $ alcc.pc : num 2.67 1.86 1.74 2.5 2.2 ...
## $ incarc : int 3033 24741 11489 27710 157142 14888 17507 NA 6841 72404 ...
## $ incarc.pc : num 479 554 427 525 456 ...
You got it right if the output from str(data) looks like the above.
state? Exactly, 1,071, which is the number of rows in the data object. You need to figure out why the
population data has so many more rows before we can merge it into the data dataframe. Have a look at the
spreadsheet. Can you see the problem/issue?
The spreadsheet includes data for every county and not only for entire states (such as Alabama). For
instance, you will see that some rows have only the name of a state in the description column while others
have the name of a county. To illustrate this, look at the following snippet from the data table:
data_pop[1979:1982,]
You will notice that this dataframe now has nrow(data_pop2) rows of data. This is still too many rows.
Let’s look at the different geographies in our dataset.
unique(data_pop$Description)
## [25] Minnesota Mississippi Missouri
## [28] Montana Nebraska Nevada
## [31] New Hampshire New Jersey New Mexico
## [34] New York North Carolina North Dakota
## [37] Ohio Oklahoma Oregon
## [40] Pennsylvania Rhode Island South Carolina
## [43] South Dakota Tennessee Texas
## [46] Utah Vermont Virginia
## [49] Washington West Virginia Wisconsin
## [52] Wyoming Puerto Rico
## 3197 Levels: Abbeville County, SC Acadia Parish, LA ... Ziebach County, SD
You will immediately see that there are also observations for the entire U.S. So, let’s extract the data that
are from states and years which are also represented in data, our original dataset. Complete the following
code for this task.
state_list <- unique(data$XXXX) # creates a list with state names in data
year_list <- XXXX(XXXX$Year) # creates a list of years in data
data_pop <- data_pop %>% filter(Description %in% XXXX) %>%
filter(XXXX XXXX year_list)
You got it right if data_pop has 969 observations and you can replicate the following table:
summary(data_pop$Total.Population)
You got it right if you can replicate the following summary statistics for the new variable.
summary(data_pop$prop18.24)
It is easiest to merge datafiles if the variables on which we want to match (state name and Year) are called
the same in both datasets (data and data_pop). This is true for the Year variable, but not for the state
name (State in data and Description in data_pop). Let’s fix that and change the state variable name in
data_pop to State.
names(data_pop)[names(data_pop)=="Describtion"] <- "State"
Then look at names(data_pop) and see whether you achieved what you wanted . . . no, you didn’t? The
name has not changed? Sometimes you make a mistake but there is no error message. Look at the previous
line again and try and figure out what the problem is, correct it and rename the variable. But the message
here is an important one. Don’t assume that just because R ran your line and didn’t spit out an error
message that everything you wanted to happen did happen. You should always check whether the result is as
expected.
names(data_pop)[names(data_pop)=="Description"] <- "State"
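The merge itself is referred to in the next paragraph but the call is not shown in this extract; with merge’s
default behaviour it would presumably have looked something like this (a sketch):
data2 <- merge(data, data_pop) # default: keeps only rows with matches in both dataframes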
As a result your datafile has gained one variable, prop18.24, but lost a few rows. By default, the merge
function deletes rows for which it did not have matching rows in both datafiles and therefore all 2020 and
2021 observations have gone. Look at the help for merge (by typing ?merge into the console) and find the
change you need in the above line to make sure that we keep all 1071 observations from the data dataframe.
Then re-run the above line.
data2 <- merge(data,data_pop, all.x = TRUE)
Your data2 dataframe should end up with 1071 rows and 20 variables.
Region information
Merging datasets is a super important skill, so let’s practice this here again. We wish to differentiate between
different regions in the U.S. In your data2 dataframe, one piece of information is the state, coded by both
the State and State_code variables. What we need is an additional variable that tells you which region each
state is in.
You will first have to find a dataset on the internet that maps states to regions. Go to your favorite search
engine and search for something like “csv us States and regions”. Alternatively you could enlist the help of an
AI. For instance you could go to Google Bard and ask something like “create a csv file that maps US states to
regions”. Then save that file into your working directory and merge the Region variable into your data2 file.
states_info <- read_xlsx("../data/states.xlsx")
# The variable names may be different in your file
states_info <- states_info %>% select(STATEAB, Region)
names(states_info)[names(states_info)=="STATEAB"] <- "State_code"
data2 <- merge(data2, states_info)
Make sure that the region variable is called Region and that the State or State_code variable name aligns
with that in data2. If you got it right the following code should give you the same result:
tab_regions <- data2 %>% select(State_code, Region) %>%
unique() %>%
group_by(Region) %>%
summarise(n = n()) %>%
print()
## # A tibble: 4 x 2
## Region n
## <chr> <int>
## 1 Midwest 12
## 2 Northeast 9
## 3 South 17
## 4 West 13
This table shows that there are 12 states in the Midwest region and 9 in the Northeast. Altogether the U.S.
is divided into four regions.
[Figure: time series plot of prop18.24 against Year, 2000 to 2020]
Now you want to compare this proportion between Florida and Texas. As you create your graph you should be
able to select the data from these two states by using subset(data2, State %in% c("Florida","Texas")).
You learned how to create separate lines for the two states in the Week 2 lecture (or, even better, in the
accompanying code).
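A minimal sketch of what such a plot could look like (assuming geom_line and mapping State to the colour
aesthetic; your own Week 2 code may differ):
ggplot(subset(data2, State %in% c("Florida","Texas")),
       aes(x = Year, y = prop18.24, colour = State)) +
  geom_line() +
  ggtitle("Proportion of 18 to 24 year olds")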
[Figure: "Proportion of 18 to 24 year olds", prop18.24 against Year (2000 to 2020), one line each for Florida and Texas]
If you wish you could experiment with the theme options in ggplot.
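The code for this experiment is not included in this extract; one possible sketch simply adds a built-in
theme, here theme_bw(), to the previous plot:
ggplot(subset(data2, State %in% c("Florida","Texas")),
       aes(x = Year, y = prop18.24, colour = State)) +
  geom_line() +
  ggtitle("Proportion of 18 to 24 year olds") +
  theme_bw() # one of several built-in ggplot themes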
[Figure: the same Florida/Texas plot rendered with a different ggplot theme]
Check out the ggplot cheat sheet for more tricks and illustrations of the ggplot2 package’s capabilities and
of the other themes available.
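The next step (whose output is shown below) was to calculate state-level averages; the code for it is not
included in this extract. Judging from the corrected version further down, it presumably looked something
like this sketch (without any na.rm option):
tab1 <- data2 %>%
  group_by(State) %>%
  summarise(avg_prop18.24 = mean(prop18.24),
            avg_fad.rate = mean(Age.Adjusted.Rate),
            avg_law.officers.pc = mean(law.officers.pc),
            avg_ur = mean(ur)) %>%
  print()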
## # A tibble: 51 x 5
## State avg_prop18.24 avg_fad.rate avg_law.officers.pc avg_ur
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama NA 18.5 310. 6.04
## 2 Alaska NA 19.9 274. 6.93
## 3 Arizona NA 15.3 338. 6.26
## 4 Arkansas NA 17.0 316. 5.55
## 5 California NA 8.40 314. 7.32
## 6 Colorado NA 12.4 336. 5.33
## 7 Connecticut NA 5.22 275. 5.89
## 8 Delaware NA 10.4 355. 5.32
## 9 District of Columbia NA 18.1 791. 7.40
## 10 Florida NA 11.9 371. 5.69
## # i 41 more rows
As you can see the code calculated the average values for the firearm death rate (for instance, on average
there were 18.5 firearm deaths per 100,000 in Alabama per year), but the code did not calculate the average
proportion of 18 to 24 year olds. We get “NA” for all states. The reason for that is that some of the
prop18.24 observations are not available, in particular the data for years 2020 and 2021. The mean function
by default refuses to calculate the mean value if any of the data are “NA”s. However, there is a way to
instruct the mean function to ignore missing values and calculate the mean value on the basis of the available
data. Check the help for the mean function (type ?mean into the console) to find the option you should add
to the mean function to achieve this.
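As a toy illustration of this behaviour (the numbers below are made up and not from the dataset):
x <- c(2, 4, NA)
mean(x)               # returns NA because one of the values is missing
mean(x, na.rm = TRUE) # ignores the NA and returns 3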
If you get it right you should be able to replicate the following table.
## # A tibble: 51 x 5
## State avg_prop18.24 avg_fad.rate avg_law.officers.pc avg_ur
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 9.86 18.5 310. 6.04
## 2 Alaska 10.4 19.9 274. 6.93
## 3 Arizona 9.93 15.3 338. 6.26
## 4 Arkansas 9.73 17.0 316. 5.55
## 5 California 10.2 8.40 314. 7.32
## 6 Colorado 9.79 12.4 336. 5.33
## 7 Connecticut 9.17 5.22 275. 5.89
## 8 Delaware 9.67 10.4 355. 5.32
## 9 District of Columbia 12.5 18.1 791. 7.40
## 10 Florida 8.89 11.9 371. 5.69
## # i 41 more rows
It is possibly not good practice to calculate averages over different sample sizes (over 2001 to 2019 for
prop18.24 and over 2001 to 2021 for Age.Adjusted.Rate). We therefore repeat the calculation but only for
the years up to and including 2019.
There is one mistake in the code and you should get an error message.
tab1 <- data2 %>% filter(year <= 2019) %>%
group_by(State) %>%
summarise(avg_prop18.24 = mean(prop18.24, na.rm = TRUE),
avg_fad.rate = mean(Age.Adjusted.Rate),
avg_law.officers.pc = mean(law.officers.pc),
avg_ur = mean(ur)) %>%
print()
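The corrected code that produced the output below is not shown in this extract; presumably the fix was to
use Year (with a capital Y) and, given the Region column in the output, Region was also added to group_by,
roughly as follows (a sketch):
tab1 <- data2 %>% filter(Year <= 2019) %>%
  group_by(State, Region) %>%
  summarise(avg_prop18.24 = mean(prop18.24, na.rm = TRUE),
            avg_fad.rate = mean(Age.Adjusted.Rate),
            avg_law.officers.pc = mean(law.officers.pc),
            avg_ur = mean(ur)) %>%
  print()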
## 2 Alaska West 10.4 19.5 275. 6.88
## 3 Arizona West 9.93 15.1 341. 6.24
## 4 Arkansas South 9.73 16.4 313. 5.59
## 5 California West 10.2 8.36 315. 7.17
## 6 Colorado West 9.79 12.0 336. 5.25
## 7 Connecticut North~ 9.17 5.11 278. 5.76
## 8 Delaware South 9.67 9.87 354. 5.19
## 9 District of Col~ South 12.5 17.6 797. 7.40
## 10 Florida South 8.89 11.7 374. 5.61
## # i 41 more rows
Let’s create a few plots which show the average death numbers against some of our state specific information.
ggplot(tab1,aes(avg_prop18.24,avg_fad.rate)) +
geom_point() +
ggtitle("Proportion of 18 to 24 year olds v Deaths by Firearms")
[Figure: "Proportion of 18 to 24 year olds v Deaths by Firearms", scatterplot of avg_fad.rate against avg_prop18.24]
In order to demonstrate another trick from ggplot’s box of tricks we will also use the Region information.
ggplot(tab1,aes(avg_prop18.24,avg_fad.rate)) +
geom_point() +
facet_wrap(vars(Region)) +
ggtitle("Proportion of 18 to 24 year olds v Deaths by Firearms")
[Figure: "Proportion of 18 to 24 year olds v Deaths by Firearms", scatterplots of avg_fad.rate against avg_prop18.24, one panel per Region (Midwest, Northeast, South, West)]
Very neat indeed. What we learn from these scatterplots is that there is no obvious correlation between the
average proportions of 18 to 24 year olds and the rate of firearm deaths.
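The table of regional averages below was produced by code that is not part of this extract; it was presumably
something along these lines (a sketch; the column name RAvg_cases is taken from the output):
tab1 %>% group_by(Region) %>%
  summarise(RAvg_cases = mean(avg_fad.rate),
            n = n())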
## # A tibble: 4 x 3
## Region RAvg_cases n
## <chr> <dbl> <int>
## 1 Midwest 10.1 12
## 2 Northeast 6.55 9
## 3 South 14.4 17
## 4 West 13.0 13
Let’s see whether we find the regional averages to be statistically significantly different. Say we compare the
avg_fad.rate in the Northeast to that in the Midwest. So test the null hypothesis that $H_0 : \mu_{NE} = \mu_{MW}$
(or $H_0 : \mu_{NE} - \mu_{MW} = 0$) against the alternative hypothesis that $H_A : \mu_{NE} \neq \mu_{MW}$, where $\mu$ represents the
average firearm death rate (per 100,000 population) in states in the respective region over the sample period.
test_data_NE <- tab1 %>%
filter(Region == "Northeast") # pick Northeast states
##
## Welch Two Sample t-test
##
## data: test_data_NE$avg_fad.rate and test_data_MW$avg_fad.rate
## t = -3.2399, df = 15.565, p-value = 0.005282
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.895230 -1.225443
## sample estimates:
## mean of x mean of y
## 6.548129 10.108465
The difference in the averages is 6.548 - 10.108 = -3.56 (more than 3 in 100,000 population). We get a t-test
statistic of about -3.2. If in truth the two means were the same (H0 was correct) then we should expect
the test statistic to be around 0. Is -3.2 far enough away from 0 for us to conclude that we should stop
supporting the null hypothesis? Is -3.2 large (in absolute terms) enough?
The answer is yes, and the p-value tells us so. The p-value is 0.005282, i.e. about 0.53%. This means that, if
the H0 was correct, the probability of getting a difference of -3.56 (per 100,000 population) or a more extreme
difference is about 0.53%. We judge this probability to be too small for us to continue to support the H0 and we
reject the H0. We do so as the p-value is smaller than any of the usual significance levels (10%, 5% or 1%).
We are not restricted to testing whether two population means are the same. You could also test whether the
difference in the population means is anything other than 0. Say a politician claims that evidently the firearm
death rate in the Northeast is smaller by more than 3 per 100,000 population than the firearm death
rate in the Midwest.
Here our null hypothesis is $H_0 : \mu_{NE} = \mu_{MW} - 3$ (or $H_0 : \mu_{NE} - \mu_{MW} = -3$) and we would test this against the
alternative hypothesis $H_A : \mu_{NE} < \mu_{MW} - 3$ (or $H_A : \mu_{NE} - \mu_{MW} < -3$). Here the statement of the politician is
represented in the $H_A$.
# testing that mu = -3
t.test(test_data_NE$avg_fad.rate,test_data_MW$avg_fad.rate, mu=-3, alternative = "less")
##
## Welch Two Sample t-test
##
## data: test_data_NE$avg_fad.rate and test_data_MW$avg_fad.rate
## t = -0.5099, df = 15.565, p-value = 0.3086
## alternative hypothesis: true difference in means is less than -3
## 95 percent confidence interval:
## -Inf -1.638469
## sample estimates:
## mean of x mean of y
## 6.548129 10.108465
Note the following. The parameter mu now takes the value -3 as we are hypothesising that the difference
in the means is -3 (or smaller than that in the HA ). Also, in contrast to the previous test we now care
whether the deviation is less than -3. In this case we wonder whether it is really smaller. Hence we use the
additional input into the test function, alternative = "less". (The default for this input is alternative
= "two.sided" and that is what is used, as in the previous case, if you don’t add it to the t.test function).
Also check ?t.test for an explanation of these optional input parameters.
Again we find ourselves asking whether the sample difference we obtained (-3.56) is consistent with the null
hypothesis (of the population difference being -3). The p-value is 0.3086, so the probability of obtaining a
sample difference of -3.56 (or smaller), if the null hypothesis were true, is just a little over 30%. Say we set
out to perform a test at a 10% significance level; then a p-value of just above 30% is clearly larger than the
significance level and we fail to reject the null hypothesis.
So let’s perform another test. A Republican governor of a Southern state in the U.S. claims that the average
firearm death rate in the South is just as big as the one in the West. Perform the appropriate hypothesis test.
test_data_SO XXXX tab1 %>%
filter(XXXX == "South") # pick Southern states
##
## Welch Two Sample t-test
##
## data: test_data_SO$avg_fad.rate and test_data_WE$avg_fad.rate
## t = 1.0149, df = 19.808, p-value = 0.3224
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.515748 4.384902
## sample estimates:
## mean of x mean of y
## 14.42065 12.98607
The p-value is certainly larger than any of the usual significance levels and we fail to reject H0. This means
that the governor’s statement is supported by the data, or at least that the data do not contradict it.
Regression
To perform inference in the context of regressions it pays to use an additional package, the car package. So
please load this package.
library(car)
If you get an error message it is likely that you first have to install that package, e.g. by running
install.packages("car") once.
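The code that produced the regression output below is not part of this extract. Judging from the output (50
observations, vcrime.pc regressed on law.officers.pc) it presumably used a single cross-section of states;
the following is a sketch only, with 2019 as a purely hypothetical choice of year:
mod1 <- lm(vcrime.pc ~ law.officers.pc, data = subset(data2, Year == 2019)) # year is a guess
stargazer(mod1, type = "text")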
##
## ===============================================
## Dependent variable:
## ---------------------------
## vcrime.pc
## -----------------------------------------------
## law.officers.pc 0.204
## (0.193)
##
## Constant 311.421***
## (61.296)
##
## -----------------------------------------------
## Observations 50
## R2 0.023
## Adjusted R2 0.002
## Residual Std. Error 156.747 (df = 48)
## F Statistic 1.114 (df = 1; 48)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Let’s change the dependent variable to the rate of firearm deaths (Age.Adjusted.Rate, or AAR for short) and
use several explanatory variables: the number of law officers per 100,000 population, the unemployment rate
and the amount of per capita alcohol consumption. Also, let’s use all the years of data in our dataset.
$$AAR_i = \alpha + \beta_1 \, law.officers.pc_i + \beta_2 \, ur_i + \beta_3 \, alcc.pc_i + u_i$$
We will estimate two models, one with only law.officers.pc as the explanatory variable and one with all
three explanatory variables.
mod2 <- lm(Age.Adjusted.Rate~law.officers.pc,data=data2)
mod3 <- lm(Age.Adjusted.Rate~law.officers.pc+ur+alcc.pc,data=data2)
stargazer(mod2,mod3,type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## Age.Adjusted.Rate
## (1) (2)
## --------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007***
## (0.002) (0.002)
##
## ur 0.039
## (0.076)
##
## alcc.pc -0.528*
## (0.286)
##
## Constant 10.301*** 11.151***
## (0.501) (0.865)
##
## --------------------------------------------------------------------
## Observations 1,048 1,048
## R2 0.014 0.017
## Adjusted R2 0.013 0.014
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044)
## F Statistic 14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
How would we interpret the value of $\hat{\beta}_1 = 0.007$? For a one unit increase in the explanatory variable (law
officers per 100,000) we would expect the firearm death rate to increase by 0.007 (deaths per 100,000). Is
that a lot, and is it economically significant? There are two aspects to this. Is an increase of 1 officer per
100,000 a lot? To answer that we need to know how many officers there typically are. And to know whether
0.007 is a large increase we need to know how many firearm deaths there typically are.
Let’s look at the summary stats to help with this judgement.
summary(data2[c("Age.Adjusted.Rate", "law.officers.pc", "ur", "alcc.pc")])
[Figure: Age.Adjusted.Rate observations plotted by state]
What you can see from here is that observations for the same state cluster together and that different states
seem to differ substantially from each other. This between-state variation is not reflected in the above
estimation. In next week’s computer lab you will see how this issue can be tackled.
Summary
In today’s computer lab you achieved a lot. You produced various plots and started to explore the amazing
power of the ggplot package. You merged data, performed hypothesis tests and learned how to run and
display multiple regressions.
TAKE A BREAK!