Recitation 4
Agabek Kabdullin
2022-09-18
Contents
Counting Stars
  Fixing working directories
  A note on TinyTex
  Stargazer
  Plots in Base R
    Anscombe
    Simpson
  Try on your own
Counting Stars
Fixing working directories
Let’s set up a global working directory to resolve some of the issues from last week:
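This chunk runs once at the top of the document; a minimal sketch, assuming a hypothetical folder name (substitute your own path):

```{r setup, include=FALSE}
# set one working directory for every chunk in this R Markdown;
# the path below is a placeholder, point it at your own folder
knitr::opts_knit$set(root.dir = '~/pol345/recitation4')
```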
A note on TinyTex
Not everyone is a fan of HTML. Some people prefer pdf outputs. Luckily, R Markdown allows us to produce nice-looking pdf output. (We will use stargazer a bit later.)
You will need the tinytex package to knit pdfs in R Markdown. Please install it using the code below. You only need to run it once! Don’t include it in all of your R Markdowns!
Also, please install the stargazer package. Again, you only need to install it once! But you do need to load it with library() whenever you use it in an R Markdown.
```{r, include=FALSE}
# knitr::opts_chunk$set(echo = TRUE)

# run these once in the console, not in every R Markdown:
# install.packages('tinytex')
# tinytex::install_tinytex()
# install.packages('stargazer')

library(stargazer)
```
There are, however, some limitations that come with that. For one, we lose a lot of interactivity as we move from HTML to pdf. After all, a pdf is only a static document. Some default options are also quite limited in R Markdown when opting for pdf. For instance, the only available font sizes by default are 10pt, 11pt and 12pt.
```
---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output: pdf_document
geometry: margin=1in
fontsize: 12pt
---
```
If you want more fonts and font sizes, you’ll need to jump through a couple of hoops. See more on that here and here.
Still, some basic interactivity is there. You can add an interactive table of contents, for instance:
NB: the “depth” of the table of contents goes up to 6 (that’s the number of header levels in R Markdown; don’t worry, you won’t ever produce a sub-sub-sub-sub-sub-section, and if you do, you need to reconsider some of your writing choices).
```
---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output:
  pdf_document:
    toc: true
    toc_depth: 2
geometry: margin=1in
fontsize: 12pt
---
```
You can also set the colors of your URLs and internal links:

```
---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output:
  pdf_document:
    toc: true
    toc_depth: 2
geometry: margin=1in
fontsize: 12pt
urlcolor: blue
linkcolor: red
---
```
Stargazer
```{r}
# write data:
write.csv(combined, 'combined.csv', row.names = F)
# "row.names = F" makes sure that we do not write the row
# names into the file
```
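If you come back to this in a fresh session, you can read the saved file back in:

```{r, eval=FALSE}
# read the data back in from the file we just wrote
combined = read.csv('combined.csv')
```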
The data come from an experiment by Choi, Poertner, and Sambanis (2019):
[Figure: pie charts of “help to natives” and “help to non-natives” at different temperatures, with slices for “helped” and “did not help”]
The pie charts above suggest that as temperature increases, people tend to provide less
help to perceived out-groups. Is that statistically significant?
The equation that we want to estimate is:
\begin{align*}
Help_i = \alpha &+ \beta_1 \cdot temperature_{i} + \beta_2 \cdot hijab_{i} \\
&+ \beta_3 \cdot temperature_{i} \cdot hijab_{i} + \gamma X_i + \varepsilon_i
\end{align*}
```{r}
lm(anyhelp ~ temp*treat, data = combined)
```
##
## Call:
## lm(formula = anyhelp ~ temp * treat, data = combined)
##
## Coefficients:
## (Intercept) temp treat temp:treat
## 0.672474 0.003890 0.140017 -0.008733
##
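The fuller output below comes from wrapping the same model in summary(), which adds standard errors, t values, and p-values:

```{r, eval=FALSE}
summary(lm(anyhelp ~ temp*treat, data = combined))
```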
## Call:
## lm(formula = anyhelp ~ temp * treat, data = combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8281 -0.6563 0.2303 0.3091 0.3880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.672474 0.097012 6.932 5.78e-12 ***
## temp 0.003890 0.003564 1.091 0.2752
## treat 0.140017 0.132792 1.054 0.2918
## temp:treat -0.008733 0.004840 -1.804 0.0714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4451 on 1782 degrees of freedom
## (2024 observations deleted due to missingness)
## Multiple R-squared: 0.01343, Adjusted R-squared: 0.01177
## F-statistic: 8.084 on 3 and 1782 DF, p-value: 2.387e-05
Now we’re getting somewhere! You see that the coefficient for the intercept is statistically significant at the 0.001 level (check out the significance codes at the bottom). The interaction term (temp:treat) has a p-value of 0.0714. That means that there is about a 7% chance of observing this estimate (or an estimate even farther from zero) if the null hypothesis were true.
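You can recover that p-value from the t statistic by hand. With 1782 degrees of freedom the t distribution is nearly normal, and the two-sided tail probability works out to roughly 0.0714:

```{r, eval=FALSE}
# two-sided p-value for the interaction term, from its t statistic
2 * pt(-1.804, df = 1782)
```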
Suppose you wanted to present this table to your peers. You could, of course, copy and paste each value into a text editor, but working on that table (or tables) would be as pleasant as stabbing yourself with a fork. Fortunately, there is stargazer.
Originally, it was created to turn R output into text that LaTeX can read as a ready-made table:
```{r}
stargazer(lm(anyhelp ~ temp*treat, data = combined))
```
##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Sun, Sep 25, 2022 - 7:47:08 PM
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & anyhelp \\
## \hline \\[-1.8ex]
## temp & 0.004 \\
## & (0.004) \\
## & \\
## treat & 0.140 \\
## & (0.133) \\
## & \\
## temp:treat & $-$0.009$^{*}$ \\
## & (0.005) \\
## & \\
## Constant & 0.672$^{***}$ \\
## & (0.097) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 1,786 \\
## R$^{2}$ & 0.013 \\
## Adjusted R$^{2}$ & 0.012 \\
## Residual Std. Error & 0.445 (df = 1782) \\
## F Statistic & 8.084$^{***}$ (df = 3; 1782) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
In R Markdown, however, if you set results='asis' in the options of the chunk with your stargazer code, you’ll get a neat table:
```{r, results='asis'}
stargazer(lm(anyhelp ~ temp*treat, data = combined))
```
% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
% Date and time: Sun, Sep 25, 2022 - 7:47:08 PM
Table 1:
Dependent variable:
anyhelp
temp 0.004
(0.004)
treat 0.140
(0.133)
temp:treat −0.009∗
(0.005)
Constant 0.672∗∗∗
(0.097)
Observations 1,786
R2 0.013
Adjusted R2 0.012
Residual Std. Error 0.445 (df = 1782)
F Statistic 8.084∗∗∗ (df = 3; 1782)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
How does stargazer know what to put in the table? Everything it needs lives inside the fitted model object: the coefficients, the original call, and the variance-covariance matrix of the estimates.

```{r}
model1 = lm(anyhelp ~ temp + treat + temp*treat, data = combined)
model1$coefficients
model1$call
vcov(model1)
```
The standard errors sit on the diagonal of that matrix as variances. To get them out, we take the diagonal and then the square root of each value:

```{r}
sqrt(diag(vcov(model1)))
```
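As a sanity check, these match the standard errors that summary() reports:

```{r, eval=FALSE}
# the same numbers as the "Std. Error" column of the summary output
summary(model1)$coefficients[, "Std. Error"]
```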
stargazer pulls all of these details out of our linear model objects and plugs them into pre-made tables that LaTeX can read.
stargazer also gives us quite a bit of control over how our tables look. For instance, you could use the argument dep.var.labels to change the label of your dependent variable. You could also use the argument covariate.labels to change the names of your variables. Let’s also set header to FALSE so that we don’t get info on the author of stargazer every time we use it (as much as we appreciate what Marek Hlavac has done for R users).
```{r, results='asis'}
stargazer(lm(anyhelp ~ temp + treat + temp*treat, data = combined),
          type = 'latex',
          style = 'apsr',
          dep.var.labels = 'Outcome: Did any bystanders offer help?',
          covariate.labels = c('Temperature',
                               'Hijab vs native',
                               'Temperature x hijab versus native'),
          header=FALSE)
```
Table 2:
Outcome: Did any bystanders offer help?
Temperature 0.004
(0.004)
Hijab vs native 0.140
(0.133)
Temperature x hijab versus native −0.009∗
(0.005)
Constant 0.672∗∗∗
(0.097)
N 1,786
R2 0.013
Adjusted R2 0.012
Residual Std. Error 0.445 (df = 1782)
F Statistic 8.084∗∗∗ (df = 3; 1782)
∗p < .1; ∗∗p < .05; ∗∗∗p < .01
We are using APSR style from here on out, but if you’re interested, here is a full style list.
We can combine several models into one table with stargazer. Note that we are omitting a bunch of variables from the second model using the omit argument.
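The chunk defining model2 is not shown; a minimal sketch, assuming the station and rush-hour variables are named station1 and rush (the names the omit call below relies on):

```{r, eval=FALSE}
# baseline specification plus station fixed effects and a rush-hour indicator
model2 = lm(anyhelp ~ temp + treat + temp*treat + station1 + rush,
            data = combined)
```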
```{r, results='asis'}
stargazer(model1, model2,
          omit = c(paste("station1", unique(combined$station1), sep=""),
                   'rush'),
          type = 'latex',
          style = 'apsr',
          dep.var.labels = 'Outcome: Did any bystanders offer help?',
          covariate.labels = c('Temperature',
                               'Hijab vs native',
                               'Temperature x hijab versus native'),
          header=FALSE,
          title="There is some relation between
                 the attitudes towards the out-group and
                 the environment")
```
Table 3: There is some relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2)
Temperature 0.004 0.007∗
(0.004) (0.004)
Hijab vs native 0.140 0.137
(0.133) (0.133)
Temperature x hijab versus native −0.009∗ −0.008∗
(0.005) (0.005)
Constant 0.672∗∗∗ 0.570∗∗∗
(0.097) (0.108)
N 1,786 1,786
R2 0.013 0.043
Adjusted R2 0.012 0.025
Residual Std. Error 0.445 (df = 1782) 0.442 (df = 1752)
F Statistic 8.084∗∗∗ (df = 3; 1782) 2.400∗∗∗ (df = 33; 1752)
∗p < .1; ∗∗p < .05; ∗∗∗p < .01
Let’s add another model, this time leaving the intercept estimates out of the table (a sketch of the call follows).
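A sketch of a call that would produce Table 4, assuming a third model named model3; adding 'Constant' to omit is what suppresses the intercept row (any extra covariates in model3, such as a bystander count, would need to go into omit as well):

```{r, eval=FALSE}
stargazer(model1, model2, model3,
          omit = c(paste("station1", unique(combined$station1), sep=""),
                   'rush', 'Constant'),
          type = 'latex', style = 'apsr',
          dep.var.labels = 'Outcome: Did any bystanders offer help?',
          covariate.labels = c('Temperature', 'Hijab vs native',
                               'Temperature x hijab versus native'),
          header = FALSE,
          title = "There is some relation between the attitudes
                   towards the out-group and the environment")
```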
Table 4: There is some relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2) (3
Temperature 0.004 0.007∗ 0.00
(0.004) (0.004) (0.0
Hijab vs native 0.140 0.137 0.1
(0.133) (0.133) (0.1
Temperature x hijab versus native −0.009∗ −0.008∗ −0.0
(0.005) (0.005) (0.0
N 1,786 1,786 1,7
R2 0.013 0.043 0.0
Adjusted R2 0.012 0.025 0.0
Residual Std. Error 0.445 (df = 1782) 0.442 (df = 1752) 0.442 (df
F Statistic 8.084∗∗∗ (df = 3; 1782) 2.400∗∗∗ (df = 33; 1752) 2.341∗∗∗ (df
∗p < .1; ∗∗p < .05; ∗∗∗p < .01
We don’t always need all of the information on our results. R-squared, for instance, wouldn’t tell us much in this case because our dependent variable is binary. So, let’s drop it and some other stats via omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'). That will also keep the table from running off the margins, as Table 4’s third column does above.
NB: See the list of statistic codes
Table 5: Relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2) (3)
Temperature 0.004 0.007∗ 0.006∗
(0.004) (0.004) (0.004)
Hijab vs native 0.140 0.137 0.138
(0.133) (0.133) (0.133)
Temperature x hijab versus native −0.009∗ −0.008∗ −0.009∗
(0.005) (0.005) (0.005)
N 1,786 1,786 1,786
∗p < .1; ∗∗p < .05; ∗∗∗p < .01
Our three models are different: the first is the baseline model, the second adds station fixed effects and an indicator for whether it was rush hour, and the last contains all of these plus the number of bystanders. Let’s make sure that we communicate that in our table. I’m going to use the add.lines argument to add some lines to our table, as sketched below. Compare Table 6 to the table in the original paper.
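add.lines takes a list of character vectors, one per extra row: the first element is the row label, followed by one entry per model column. A minimal sketch, assuming the same three models as above (the Yes/No pattern is illustrative):

```{r, eval=FALSE}
stargazer(model1, model2, model3,
          omit = c(paste("station1", unique(combined$station1), sep=""),
                   'rush', 'Constant'),
          omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'),
          type = 'latex', style = 'apsr', header = FALSE,
          add.lines = list(c('Station fixed effects', 'No', 'Yes', 'Yes'),
                           c('Rush hour control', 'No', 'Yes', 'Yes'),
                           c('Number of bystanders', 'No', 'No', 'Yes')))
```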
Plots in Base R
Anscombe
When exploring data, one of your very first steps should be plotting it. Let’s load the datasauRus package to demonstrate this point.
```{r}
# install.packages('datasauRus')
library(datasauRus)
```
The datasauRus package has a datasaurus_dozen data frame, which is nothing but thirteen data sets combined into one: the original dino plus a dozen others (hence the name).
```{r}
unique(datasaurus_dozen$dataset)
```
Each set has x and y variables. Let’s examine the star and dino data sets:
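The subsetting step is not shown; a minimal sketch of what the stargazer call below assumes, reusing the names model1 and model2 for the new regressions:

```{r, eval=FALSE}
# pull out the two data sets and regress y on x in each
subset_star = subset(datasaurus_dozen, dataset == 'star')
subset_dino = subset(datasaurus_dozen, dataset == 'dino')
model1 = lm(y ~ x, data = subset_star)
model2 = lm(y ~ x, data = subset_dino)
```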
```{r, results='asis'}
stargazer(model1, model2,
          #omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'),
          type = 'latex',
          style = 'apsr',
          #dep.var.labels = 'Outcome: Did any bystanders offer help?',
          column.labels = c('star', 'dino'),
          header=FALSE,
          title="Seemingly the relationship between x and y
                 is the same in both data sets")
```
If we only looked at the linear regression results, we would conclude that the relationship between x and y is quite similar in both data sets.
Table 7: Seemingly the relationship between x and y is the same in both data sets
y
star dino
(1) (2)
x −0.101 −0.104
(0.135) (0.136)
Constant 53.327∗∗∗ 53.453∗∗∗
(7.692) (7.693)
N 142 142
R2 0.004 0.004
Adjusted R2 −0.003 −0.003
Residual Std. Error (df = 140) 26.973 26.975
F Statistic (df = 1; 140) 0.557 0.584
∗p < .1; ∗∗p < .05; ∗∗∗p < .01
Were you to plot the data, however, you’d see that the relationships are far from similar:
```{r}
par(mfrow = c(1, 2), # panel with one row and two columns
    mai = c(0.5, 0.5, 0.25, 0.25)) # bottom, left, top and right margins
plot(subset_dino$x, subset_dino$y)
plot(subset_star$x, subset_star$y)
```
[Figure: two scatterplots side by side, subset_dino$y against subset_dino$x (left) and subset_star$y against subset_star$x (right)]
Let’s make our panel of plots a tad prettier. We’ll add axis labels and a title; we will also change the shape, color, and size of our points. Let’s remove the ticks from one of our plots, just because we can. Let’s also add our regression lines to further demonstrate why Anscombe disagreed that “numerical calculations are exact, but graphs are rough.”
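The code for the prettified panel is not shown; a minimal sketch of what it could look like, with the colors, shapes, and sizes here purely illustrative:

```{r, eval=FALSE}
par(mfrow = c(1, 2), mai = c(0.5, 0.5, 0.25, 0.25))
plot(subset_dino$x, subset_dino$y,
     xlab = 'x', ylab = 'y', main = 'dino',
     pch = 19, col = 'maroon', cex = 0.8)
abline(lm(y ~ x, data = subset_dino)) # add the fitted regression line
plot(subset_star$x, subset_star$y,
     xlab = 'x', ylab = 'y', main = 'star',
     pch = 19, col = 'cornflowerblue', cex = 0.8,
     xaxt = 'n', yaxt = 'n') # remove the ticks, just because we can
abline(lm(y ~ x, data = subset_star))
```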
[Figure: the same two scatterplots, restyled, with fitted regression lines; both x-axes labeled “x”]
Simpson
```{r}
scores = read.csv('simpsons.csv')
```
These are simulated data on the relation between the amount of time students prepare for
a test and their final score.
[Figure: score against preparation time]
Figure 1: seemingly, the longer you prepare for a test, the worse you do
It looks like the more you study, the lower your grade gets. What happens if we break it
down by subject?
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     col = ifelse(scores$subject == 'Physical Education', 'maroon', 'black'))
```
[Figure: score against preparation time, with Physical Education points in maroon]
Well, you’ll notice two things. First, physical education is an easy subject. You don’t need
to prepare for 20 hours to get a good grade in it. Second, the more you prepare for a
physical education test, the better you get at it.
Let’s highlight scores for English by using a slightly more complicated ifelse statement
with our col argument. Such ifelse statements are sometimes called “nested.”
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     pch = 19,
     col = ifelse(scores$subject == 'Physical Education', 'maroon',
                  ifelse(scores$subject == 'English', 'cornflowerblue',
                         'black')))
```
[Figure: score against preparation time, with Physical Education in maroon and English in cornflower blue]
You see that the trend within a group (in our case, within a subject) is positive, but the trend across the groups is negative. This phenomenon is known as Simpson’s paradox.
Unfortunately, you still need to study to do better in class.
Let’s add a legend to our plot so that our point comes across more effectively.
NB: Some guidance on legends here
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     pch = 19,
     col = ifelse(scores$subject == 'Physical Education', 'maroon',
                  ifelse(scores$subject == 'English', 'cornflowerblue',
                         'black')))
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
       col=c("maroon", "cornflowerblue"), pch=19, cex=1)
```
[Figure: the same scatterplot with a legend in the top right: PhysEd (maroon), English (cornflower blue)]
Let’s not forget about our colorblind folks and folks with bad printers:
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     pch = ifelse(scores$subject == 'Physical Education', 19,
                  ifelse(scores$subject == 'English', 17, 21)),
     col = ifelse(scores$subject == 'Physical Education', 'maroon',
                  ifelse(scores$subject == 'English', 'cornflowerblue',
                         'black')))
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
       col=c("maroon", "cornflowerblue"), pch=c(19, 17), cex=1)
```
[Figure: point shapes now differ as well as colors: PhysEd as filled circles, English as triangles]
Let’s not forget about our colorblind folks and folks with bad printers (2):
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     cex = ifelse(scores$subject == 'Physical Education', 1.2,
                  ifelse(scores$subject == 'English', 2, 1)),
     pch = ifelse(scores$subject == 'Physical Education', 19,
                  ifelse(scores$subject == 'English', 17, 21)),
     col = ifelse(scores$subject == 'Physical Education', 'maroon',
                  ifelse(scores$subject == 'English', 'cornflowerblue',
                         'black')))
# pt.cex (not cex, which sets the legend text size) scales the legend
# symbols to match the point sizes used in the plot
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
       col=c("maroon", "cornflowerblue"), pch=c(19, 17), pt.cex=c(1.2, 2))
```
[Figure: point shape, color, and size all vary by subject, with legend symbols sized to match]
Finishing touches:
```{r}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score',
     cex = ifelse(scores$subject == 'Physical Education', 1.2,
                  ifelse(scores$subject == 'English', 2, 1)),
     pch = ifelse(scores$subject == 'Physical Education', 19,
                  ifelse(scores$subject == 'English', 17, 21)),
     col = ifelse(scores$subject == 'Physical Education', 'maroon',
                  ifelse(scores$subject == 'English', 'cornflowerblue', 'black')))
# pt.cex again keeps the legend symbols at the same sizes as the plotted points
legend('topright', inset=0.025, legend=c("PhysEd", "English"),
       col=c("maroon", "cornflowerblue"), pch=c(19, 17), pt.cex=c(1.2, 2),
       title = 'subject', text.font = 2, box.lty = 0, bg = 'cadetblue1')
```
[Figure: final plot, with the legend titled “subject”, bold labels, no box, and a cadetblue1 background]