
Recitation Four

Agabek Kabdullin

2022-09-18

Contents

Counting Stars
  Fixing working directories
  A note on TinyTex
  Stargazer
  Plots in Base R
    Anscombe
    Simpson
  Try on your own

Counting Stars

Fixing working directories

Let’s set up a global working directory to resolve some of the issues from last week:

```{r setup, include=FALSE}
# root.dir is a knitr *package* option, so it is set via opts_knit (not
# opts_chunk$set); it makes all subsequent chunks run from this directory
knitr::opts_knit$set(root.dir =
  'D:/.../Fall/data_analysis/recitations/recitation (4)/r_code')
library(stargazer)
```

A note on TinyTex

Not everyone is a fan of HTML. Some people prefer PDF output. Luckily, R Markdown
allows us to produce nice-looking PDF output. (We will use stargazer a bit later.)
You will need the tinytex package to knit PDFs in R Markdown. Please install it using the code
below. You only need to run it once! Don't include it in all of your R Markdowns!
Also, please install the stargazer package. Again, you only need to install it once! But
you do need to call library(stargazer) every time you use it in R Markdown.

```{r, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE)

install.packages('tinytex') # installing the 'tinytex' package,
#which allows us to knit PDFs in R Markdown
tinytex::install_tinytex() # install the TinyTeX distribution

install.packages('stargazer')
library(stargazer)
```

There are, however, some limitations that come with that. For one, we lose a lot of
interactivity as we move from HTML to PDF. After all, a PDF is just a document. Some default
options are also quite limited in R Markdown when opting for PDF. For instance, the only
available font sizes by default are 10pt, 11pt, and 12pt:
---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output: pdf_document
geometry: margin=1in
fontsize: 12pt
---

If you want more fonts and font sizes, you'll need to jump through a couple of hoops.
See more on that here and here.
Still, some basic interactivity is there. You can add an interactive table of contents, for
instance:
NB: the "depth" of the table of contents goes up to 6 (that's the number of header levels in R
Markdown; don't worry, you won't ever produce a sub-sub-sub-sub-sub-section, and
if you do, you need to reconsider some of your writing choices)

---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output:
  pdf_document:
    toc: true
    toc_depth: 2
geometry: margin=1in
fontsize: 12pt
---

Don’t forget to color your links!

---
title: "Recitation Three"
author: "Agabek Kabdullin"
date: "2022-09-11"
output:
  pdf_document:
    toc: true
    toc_depth: 2
geometry: margin=1in
fontsize: 12pt
urlcolor: blue
linkcolor: red
---

Stargazer

“we’ll be countin’ stars”

– OneRepublic
Let’s take a look at some real-world variables now.
Please open the first_experiment.csv and second_experiment.csv files (they should
be in the proper working directory!)
Remove the first column from both data frames (it’s a row name that made its way
into the file by accident)

## Read the data:

first_experiment <- read.csv("first_experiment.csv")
#data from the first experiment (2018)
#colnames(first_experiment)[-1]
first_experiment <- first_experiment[colnames(first_experiment)[-1]]
first_experiment$exp <- 1

second_experiment <- read.csv("second_experiment.csv")
#data from the second experiment (2019)
#colnames(second_experiment)[-1]
second_experiment <- second_experiment[colnames(second_experiment)[-1]]
second_experiment$exp <- 2

## Merge the data:


combined <- rbind(first_experiment, second_experiment)
# "rbind" stands for "row-binding"

## Some data cleaning:


combined$station1 <-
gsub(" ", "", combined$station1, fixed = TRUE)
#remove empty spaces from station names

# write data:
write.csv(combined, 'combined.csv', row.names = F)
# "row.names = F" makes sure that we do not write the row
# names into the file

Data come from the experiment by Choi, Poertner and Sambanis (2019):

“The experimental intervention itself proceeded as follows: a female confederate
approached a bench at a train station where other individuals were waiting
for their train and conducted a brief call addressing a friend regarding an
innocuous personal matter (step 1). During this call, the confederate dropped
fruit (oranges or lemons) from a paper bag that had seemingly torn at the bottom
(step 2). The fruit dispersed and the confederate appeared to be in need
of assistance to pick them up (step 3). We observed whether bystanders (German
natives) helped the confederate pick up the fruit (step 4)…

The key dimension of the intervention—the confederate’s perceived membership
in the ingroup (German natives) or outgroup (Muslim immigrants)—was
manipulated experimentally by randomly assigning a confederate with specific
ethno-religious attributes: a Middle-Eastern immigrant wearing a hijab or
a white German female. We used several different actors (15 immigrants and
17 natives across 11 teams) and chose similarly aged confederates of comparable
attractiveness and controlled for social class by having confederates
wear similar attire across iterations.”

[Pie charts: shares of bystanders who helped vs. did not help the confederate, shown separately for natives and non-natives, overall and at 25+°C, 30+°C, and 35+°C]

The pie charts above suggest that as temperature increases, people tend to provide less
help to perceived out-groups. Is that statistically significant?
The equation that we want to estimate is:

\begin{align*}
Help_i = \alpha + \beta_1 \cdot temperature_{i} +
\beta_2 \cdot hijab_{i} +
\beta_3 \cdot temperature_{i} \cdot hijab_{i} +
\gamma \cdot X + \varepsilon
\end{align*}
We could estimate this equation using R’s lm function (which stands for “linear model”).
That, however, would only give us the coefficient estimates without telling us anything
about their statistical significance:

lm(anyhelp ~ temp*treat, data = combined)

##
## Call:
## lm(formula = anyhelp ~ temp * treat, data = combined)
##
## Coefficients:
## (Intercept) temp treat temp:treat
## 0.672474 0.003890 0.140017 -0.008733

Let’s wrap our lm thing with the summary function:

summary(lm(anyhelp ~ temp*treat, data = combined))

##
## Call:
## lm(formula = anyhelp ~ temp * treat, data = combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8281 -0.6563 0.2303 0.3091 0.3880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.672474 0.097012 6.932 5.78e-12 ***
## temp 0.003890 0.003564 1.091 0.2752
## treat 0.140017 0.132792 1.054 0.2918
## temp:treat -0.008733 0.004840 -1.804 0.0714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4451 on 1782 degrees of freedom
## (2024 observations deleted due to missingness)
## Multiple R-squared: 0.01343, Adjusted R-squared: 0.01177
## F-statistic: 8.084 on 3 and 1782 DF, p-value: 2.387e-05

Now we’re getting somewhere! You see that the coefficient for the intercept is statistically
significant at the 0.001 level (check out the significance codes at the bottom). The interaction
term (temp:treat) has a p-value of 0.0714. That means that there is roughly a 7% chance of
observing this estimate (or one even farther away from zero) if the null hypothesis
were true.
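As a quick sanity check (a sketch, not part of the original output), we can reproduce that p-value by hand from the estimate and standard error reported above:

```{r}
# t statistic = estimate / standard error for temp:treat, then a two-sided
# p-value from the t distribution with the residual degrees of freedom
t_stat <- -0.008733 / 0.004840
2 * pt(-abs(t_stat), df = 1782) # ~0.0714, matching the summary() output
```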
Suppose you wanted to present this table to your peers. You could, of course, copy and
paste each value into a text editor, but working on that table (or tables) would be as
pleasant as stabbing yourself with a fork. Fortunately, there is stargazer.

Originally, it was created to turn R output into text that LaTeX can read as a
ready-made table:

```{r}
stargazer(lm(anyhelp ~ temp*treat, data = combined))
```

##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Sun, Sep 25, 2022 - 7:47:08 PM
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & anyhelp \\
## \hline \\[-1.8ex]
## temp & 0.004 \\
## & (0.004) \\
## & \\
## treat & 0.140 \\
## & (0.133) \\
## & \\
## temp:treat & $-$0.009$^{*}$ \\
## & (0.005) \\
## & \\
## Constant & 0.672$^{***}$ \\
## & (0.097) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 1,786 \\
## R$^{2}$ & 0.013 \\
## Adjusted R$^{2}$ & 0.012 \\
## Residual Std. Error & 0.445 (df = 1782) \\
## F Statistic & 8.084$^{***}$ (df = 3; 1782) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}

## \end{table}

In R Markdown, however, if you use the results='asis' option in the
chunk with your stargazer code, you’ll get a neat table:

```{r, results='asis'}
stargazer(lm(anyhelp ~ temp*treat, data = combined))
```

% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail:
marek.hlavac at gmail.com % Date and time: Sun, Sep 25, 2022 - 7:47:08 PM

Table 1:
Dependent variable:
anyhelp
temp 0.004
(0.004)

treat 0.140
(0.133)

temp:treat −0.009∗
(0.005)

Constant 0.672∗∗∗
(0.097)

Observations 1,786
R2 0.013
Adjusted R2 0.012
Residual Std. Error 0.445 (df = 1782)
F Statistic 8.084∗∗∗ (df = 3; 1782)

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

How does stargazer do that?

Let’s assign our linear model to an object, model1. Like many other objects in R, model1
has certain named components (“values” of an lm).
For instance, model1 has coefficients:

model1 = lm(anyhelp ~ temp + treat + temp*treat, data = combined)
model1$coefficients

## (Intercept)        temp       treat  temp:treat
## 0.672473976 0.003889554 0.140017333 -0.008732651

It also stores the formula for our regression:

model1$call

## lm(formula = anyhelp ~ temp + treat + temp * treat, data = combined)

How about some standard errors? The variances and covariances of the coefficient estimates are stored in a matrix:

vcov(model1)

##              (Intercept)          temp         treat    temp:treat
## (Intercept)  0.0094112442 -3.410816e-04 -0.0094112442  3.410816e-04
## temp        -0.0003410816  1.270026e-05  0.0003410816 -1.270026e-05
## treat       -0.0094112442  3.410816e-04  0.0176337439 -6.344734e-04
## temp:treat   0.0003410816 -1.270026e-05 -0.0006344734  2.342818e-05

To get the standard errors out, we take the diagonal of that matrix (the variances) and take the square root of each value:

sqrt(diag(vcov(model1)))

## (Intercept)        temp       treat  temp:treat
## 0.097011567 0.003563743 0.132792108 0.004840266
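Putting these pieces together, here is a small sketch (not in the original handout) that rebuilds the familiar coefficient table from the model components; this is essentially the raw material stargazer works with:

```{r}
# estimates, standard errors, t values, and two-sided p-values,
# reconstructed by hand from the lm object
est <- model1$coefficients
se  <- sqrt(diag(vcov(model1)))
cbind(estimate = est, std.error = se,
      t.value  = est / se,
      p.value  = 2 * pt(-abs(est / se), df = model1$df.residual))
```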

stargazer pulls all of these details out of our linear model objects and plugs them into
pre-made tables that LaTeX can read.
stargazer also gives us quite a bit of control over how our tables look. For instance,
you can use the dep.var.labels argument to change the label of your dependent
variable, and the covariate.labels argument to change the names of your variables. Let's
also set header to FALSE so that we don’t get info on the author of
stargazer every time we use it (as much as we appreciate what Marek Hlavac has done
for R users).

stargazer((lm(anyhelp ~ temp + treat + temp*treat, data = combined)),
type = 'latex',
style = 'apsr',
dep.var.labels = 'Outcome: Did any bystanders offer help?',
covariate.labels = c('Temperature',
'Hijab vs native',
'Temperature x hijab versus native'),
header=FALSE)

Table 2:
Outcome: Did any bystanders offer help?
Temperature 0.004
(0.004)
Hijab vs native 0.140
(0.133)
Temperature x hijab versus native −0.009∗
(0.005)
Constant 0.672∗∗∗
(0.097)
N 1,786
R2 0.013
Adjusted R2 0.012
Residual Std. Error 0.445 (df = 1782)
F Statistic 8.084∗∗∗ (df = 3; 1782)

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

We are using the APSR style from here on out, but if you’re interested, here is a full style list.
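One drafting tip (an aside, not in the tables above): stargazer can also print a plain-text version of the same table with type = 'text', which is handy for checking a table in the console before knitting to PDF. A minimal example:

```{r}
# type = 'text' renders the table as plain text instead of LaTeX
stargazer(model1, type = 'text', style = 'apsr', header = FALSE)
```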

We can combine several models into one table with stargazer.
Note that we are omitting a bunch of variables from the second model using the omit
argument.

model1 = lm(anyhelp ~ temp + treat + temp*treat, data = combined)
model2 = lm(anyhelp ~ temp*treat + station1 + rush, data = combined)

stargazer(model1, model2,
omit = c(paste("station1", unique(combined$station1), sep=""),
'rush'),
type = 'latex',
style = 'apsr',
dep.var.labels = 'Outcome: Did any bystanders offer help?',
covariate.labels = c('Temperature',
'Hijab vs native',
'Temperature x hijab versus native'),
header=FALSE,
title="There is some relation between
the attitudes towards the out-group and
the environment")

Table 3: There is some relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2)
Temperature 0.004 0.007∗
(0.004) (0.004)
Hijab vs native 0.140 0.137
(0.133) (0.133)
Temperature x hijab versus native −0.009∗ −0.008∗
(0.005) (0.005)
Constant 0.672∗∗∗ 0.570∗∗∗
(0.097) (0.108)
N 1,786 1,786
R2 0.013 0.043
Adjusted R2 0.012 0.025
Residual Std. Error 0.445 (df = 1782) 0.442 (df = 1752)
F Statistic 8.084∗∗∗ (df = 3; 1782) 2.400∗∗∗ (df = 33; 1752)

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

Let’s add another model, this time also omitting the intercept estimates from the table.

model1 = lm(anyhelp ~ temp + treat + temp*treat,
            data = combined)
model2 = lm(anyhelp ~ temp*treat + station1 + rush,
            data = combined)
model3 = lm(anyhelp ~ temp*treat + station1 + rush + bystander,
            data = combined)

stargazer(model1, model2, model3,
omit = c(paste("station1", unique(combined$station1), sep=""),
'rush',
'bystander',
'Constant'),
type = 'latex',
style = 'apsr',
dep.var.labels = 'Outcome: Did any bystanders offer help?',
covariate.labels = c('Temperature',
'Hijab vs native',
'Temperature x hijab versus native'),
header=FALSE,
title="There is some relation between
the attitudes towards the out-group and
the environment")

Table 4: There is some relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2) (3

Temperature 0.004 0.007 0.00
(0.004) (0.004) (0.0
Hijab vs native 0.140 0.137 0.1
(0.133) (0.133) (0.1
Temperature x hijab versus native −0.009∗ −0.008∗ −0.0
(0.005) (0.005) (0.0
N 1,786 1,786 1,7
R2 0.013 0.043 0.0
Adjusted R2 0.012 0.025 0.0
Residual Std. Error 0.445 (df = 1782) 0.442 (df = 1752) 0.442 (df
F Statistic 8.084∗∗∗ (df = 3; 1782) 2.400∗∗∗ (df = 33; 1752) 2.341∗∗∗ (df

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

We don’t always need all of the information on our results. R-squared, for instance,
wouldn’t tell us much in this case because our dependent variable is binary. So let’s
drop it and some other stats via omit.stat = c('adj.rsq', 'rsq', 'ser', 'f').
That also keeps the table from running off the page (notice how the third column of
Table 4 got clipped).
NB: See the list of statistic codes

model1 = lm(anyhelp ~ temp + treat + temp*treat,
            data = combined)
model2 = lm(anyhelp ~ temp*treat + station1 + rush,
            data = combined)
model3 = lm(anyhelp ~ temp*treat + station1 + rush + bystander,
            data = combined)

stargazer(model1, model2, model3,
omit = c(paste("station1", unique(combined$station1), sep=""),
'rush',
'bystander',
'Constant'),
omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'),
type = 'latex',
style = 'apsr',
dep.var.labels = 'Outcome: Did any bystanders offer help?',
covariate.labels = c('Temperature',
'Hijab vs native',
'Temperature x hijab versus native'),
header=FALSE,
title="Relation between
the attitudes towards the out-group and
the environment")

Table 5: Relation between the attitudes towards the out-group and the environment
Outcome: Did any bystanders offer help?
(1) (2) (3)
Temperature 0.004 0.007∗ 0.006∗
(0.004) (0.004) (0.004)
Hijab vs native 0.140 0.137 0.138
(0.133) (0.133) (0.133)
Temperature x hijab versus native −0.009∗ −0.008∗ −0.009∗
(0.005) (0.005) (0.005)
N 1,786 1,786 1,786

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

Our three models are different: the first is the baseline model, the second adds station
fixed effects and a variable on whether it was rush hour. The last model contains all
of these plus the number of bystanders. Let's make sure that we communicate that in our
table. I'm going to use the add.lines argument to add some lines to our table. Compare
Table 6 to the table in the original paper.

stargazer(model1, model2, model3,
omit = c(paste("station1", unique(combined$station1), sep=""),
'rush', 'bystander', 'Constant'),
omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'),
type = 'latex',
style = 'apsr',
dep.var.labels = 'Outcome: Did any bystanders offer help?',
covariate.labels = c('Temperature',
'Hijab vs native',
'Temperature x hijab versus native'),
header=FALSE, title="Help behavior by temperature",
add.lines = list(
c("Constant", round(summary(model1)$coefficients[1,1], 3), '', ''),
c("", round(summary(model1)$coefficients[1,2], 3), '', ''),
c("Rush hour FE", "No", "Yes", "Yes"),
c("Station FE", "No", "Yes", "Yes"),
c("Number of bystanders FE", "No", "No", "Yes")
)
)

Table 6: Help behavior by temperature


Outcome: Did any bystanders offer help?
(1) (2) (3)
Temperature 0.004 0.007∗ 0.006∗
(0.004) (0.004) (0.004)
Hijab vs native 0.140 0.137 0.138
(0.133) (0.133) (0.133)
Temperature x hijab versus native −0.009∗ −0.008∗ −0.009∗
(0.005) (0.005) (0.005)
Constant 0.672
0.097
Rush hour FE No Yes Yes
Station FE No Yes Yes
Number of bystanders FE No No Yes
N 1,786 1,786 1,786

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

Plots in Base R

Anscombe

When exploring data, one of your very first steps should be to plot it. Let’s load
the datasauRus package to demonstrate this point.

#install.packages('datasauRus')
library(datasauRus)

The datasauRus package has a datasaurus_dozen data frame, which is nothing but thirteen data
sets combined into one (a baker's dozen, hence the name).

unique(datasaurus_dozen$dataset)

## [1] "dino"       "away"       "h_lines"    "v_lines"    "x_shape"
## [6] "star"       "high_lines" "dots"       "circle"     "bullseye"
## [11] "slant_up"   "slant_down" "wide_lines"

Each set has x and y variables. Let’s examine the star and dino data sets:

subset_star = datasaurus_dozen[datasaurus_dozen$dataset == 'star' , ]
subset_dino = datasaurus_dozen[datasaurus_dozen$dataset == 'dino' , ]

model1 = lm(y ~ x, data = subset_star)
model2 = lm(y ~ x, data = subset_dino)

stargazer(model1, model2,
#omit.stat = c('adj.rsq', 'rsq', 'ser', 'f'),
type = 'latex',
style = 'apsr',
#dep.var.labels = 'Outcome: Did any bystanders offer help?',
column.labels = c('star', 'dino'),
header=FALSE,
title="Seemingly the relationship between x and y
is the same in both data sets"
)

If we were to look only at the linear regression results, we would conclude that the
relationships between x and y are quite similar in both data sets.
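In fact the resemblance goes beyond the regressions: the summary statistics are nearly identical too. A quick check (not in the original handout):

```{r}
# means, standard deviations, and the x-y correlation for each subset
sapply(list(star = subset_star, dino = subset_dino),
       function(d) c(mean_x = mean(d$x), mean_y = mean(d$y),
                     sd_x = sd(d$x), sd_y = sd(d$y),
                     cor_xy = cor(d$x, d$y)))
```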

Table 7: Seemingly the relationship between x and y is the same in both data sets
y
star dino
(1) (2)
x −0.101 −0.104
(0.135) (0.136)
Constant 53.327∗∗∗ 53.453∗∗∗
(7.692) (7.693)
N 142 142
R2 0.004 0.004
Adjusted R2 −0.003 −0.003
Residual Std. Error (df = 140) 26.973 26.975
F Statistic (df = 1; 140) 0.557 0.584

∗p < .1; ∗∗p < .05; ∗∗∗p < .01

Were you to plot the data, however, you’d see that the relationships are far from
similar:

par(mfrow = c(1, 2), # panel with one row and two columns
mai = c(0.5, 0.5, 0.25, 0.25)) # bottom, left, top and right margins
plot(subset_dino$x, subset_dino$y)
plot(subset_star$x, subset_star$y)
[Two-panel figure: scatterplots of the dino and star subsets, showing wildly different shapes]

NB: This is actually a (rather special) extension of Anscombe’s quartet.

Let’s make our panel of plots a tad prettier. We’ll add axis labels and a title; we’ll
also change the shape, color, and size of our points. Let’s remove the ticks from one of
our plots, just because we can.
Let’s also add our regression lines to further demonstrate why Anscombe disagreed that
“numerical calculations are exact, but graphs are rough”

par(mfrow = c(1, 2), # rows, columns
    mai = c(0.75, 0.5, 0.75, 0.1)) # bottom, left, top and right
plot(subset_dino$x, subset_dino$y,
xlab = 'x', ylab = 'y', main = 'dino data',
pch = 19, cex = 1.2, col = 'maroon',
xlim = c(0, 120), ylim = c(0, 120),
xaxt = 'n', yaxt = 'n')
abline(lm(y ~ x, data = subset_dino), lty=2, lwd=3, col='seagreen')
plot(subset_star$x, subset_star$y,
xlab = 'x', ylab = 'y', main = 'star data',
pch = 21, cex = 3, col = 'cornflowerblue',
xlim = c(0, 120), ylim = c(0, 120))
abline(lm(y ~ x, data = subset_star), lty=2, lwd=3, col='seagreen')

[Two-panel figure: "dino data" and "star data" scatterplots with dashed regression lines, both axes running from 0 to 120]

Simpson

Why else is it important to plot data?
Let’s explore another hypothetical example.

scores = read.csv('simpsons.csv')

These are simulated data on the relationship between the amount of time students prepare for
a test and their final score.

```{r, include=T, fig.cap='seemingly, the longer you prepare for a test, the worse you do'}
plot(scores$prep_time, scores$score,
     xlab = 'preparation time', ylab = 'score',
     main = 'relationship between preparation time and score')
```

[Scatterplot of score against preparation time; the overall trend slopes downward]

Figure 1: seemingly, the longer you prepare for a test, the worse you do

It looks like the more you study, the lower your grade gets. What happens if we break it
down by subject?

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
col = ifelse(scores$subject == 'Physical Education', 'maroon', 'black'))

[Same scatterplot with the Physical Education points highlighted in maroon]

Figure 2: physical education is relatively easy

Well, you’ll notice two things. First, physical education is an easy subject. You don’t need
to prepare for 20 hours to get a good grade in it. Second, the more you prepare for a
physical education test, the better you get at it.

Let’s highlight the scores for English by using a slightly more complicated ifelse statement
in our col argument. Such ifelse statements are sometimes called “nested.”

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
pch = 19,
col = ifelse(scores$subject == 'Physical Education', 'maroon',
ifelse(scores$subject == 'English', 'cornflowerblue',
'black')))

[Same scatterplot with Physical Education in maroon and English in cornflower blue]

Figure 3: English is harder

You see that the trend within a group (in our case, within a subject) is positive, but the trend
across the groups is negative. This phenomenon is known as Simpson’s paradox.
Unfortunately, you still need to study to do better in class.
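To see the paradox in numbers rather than colors, here is a short sketch (not in the original handout; it assumes the same subject and prep_time columns used in the plots) comparing the pooled slope with the within-subject slopes:

```{r}
# pooled slope: negative
coef(lm(score ~ prep_time, data = scores))["prep_time"]
# within-subject slopes: these should come out positive,
# matching the pattern visible in the plots
by(scores, scores$subject,
   function(d) coef(lm(score ~ prep_time, data = d))["prep_time"])
```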

Let’s add a legend to our plot so that our point comes across more effectively.
NB: Some guidance on legends here

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
pch = 19,
col = ifelse(scores$subject == 'Physical Education', 'maroon',
ifelse(scores$subject == 'English', 'cornflowerblue',
'black')))
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
col=c("maroon", "cornflowerblue"), pch=19, cex=1)

[Same scatterplot, now with a top-right legend identifying PhysEd and English]

Figure 4: English is harder

Let’s not forget about our colorblind folks and folks with bad printers:

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
pch = ifelse(scores$subject == 'Physical Education', 19,
ifelse(scores$subject == 'English', 17, 21)),
col = ifelse(scores$subject == 'Physical Education', 'maroon',
ifelse(scores$subject == 'English', 'cornflowerblue',
'black')))
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
col=c("maroon", "cornflowerblue"), pch=c(19, 17), cex=1)

[Same scatterplot with point shapes also varying by subject: circles for PhysEd, triangles for English]

Figure 5: English is harder

Let’s not forget about our colorblind folks and folks with bad printers (2):

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
cex = ifelse(scores$subject == 'Physical Education', 1.2,
ifelse(scores$subject == 'English', 2, 1)),
pch = ifelse(scores$subject == 'Physical Education', 19,
ifelse(scores$subject == 'English', 17, 21)),
col = ifelse(scores$subject == 'Physical Education', 'maroon',
ifelse(scores$subject == 'English', 'cornflowerblue',
'black')))
legend('topright', inset=0.1, legend=c("PhysEd", "English"),
col=c("maroon", "cornflowerblue"), pch=c(19, 17), cex=c(1,1.2))

[Same scatterplot with point sizes also varying by subject]

Finishing touches:

plot(scores$prep_time, scores$score,
xlab = 'preparation time', ylab = 'score',
main = 'relationship between preparation time and score',
cex = ifelse(scores$subject == 'Physical Education', 1.2,
ifelse(scores$subject == 'English', 2, 1)),
pch = ifelse(scores$subject == 'Physical Education', 19,
ifelse(scores$subject == 'English', 17, 21)),
col = ifelse(scores$subject == 'Physical Education', 'maroon',
ifelse(scores$subject == 'English', 'cornflowerblue', 'black')))
legend('topright', inset=0.025, legend=c("PhysEd", "English"),
col=c("maroon", "cornflowerblue"), pch=c(19, 17), cex=c(1,1.2),
title = 'subject', text.font = 2, box.lty = 0, bg = 'cadetblue1')

[Final scatterplot with a styled legend: title 'subject', bold labels, no box, light blue background]

Try on your own

1. What is the hardest subject in the scores data set?

2. Could you color all five subjects on the last plot? Would it be easier with ggplot2?
