R: Devore Solutions
Savaş Dayanık
24 09 2019
Introduction to R, RStudio
You can type straight text and math, for example,
$\text{tips} = X\beta + \varepsilon$
Some tips:
• This is a bullet. You can emphasize any important information by enclosing it in single stars (asterisks).
• If it is absolutely important, then double the stars.
• To insert a new R chunk, use the Insert menu or press Ctrl-Alt-I. For all key bindings, the Tools menu is your
friend.
– To run a single R statement in a chunk, press Ctrl-Enter.
– To run all statements in the same chunk, press Ctrl-Shift-Enter.
– To comment out any part of the code or text, highlight it and press Ctrl-Shift-C.
## New names:
## * `` -> `...1`
d %>% distinct(DAY)
## # A tibble: 4 x 1
## DAY
## <fct>
## 1 SUN
## 2 SAT
## 3 THU
## 4 FRI
d %>% count(DAY)
## # A tibble: 4 x 2
## DAY n
## <fct> <int>
## 1 THU 62
## 2 FRI 19
## 3 SAT 87
## 4 SUN 76
d %>%
head() %>%
pander(caption = "(\\#tab:data) A glimpse over the data")
# pander(another(yetanother(d)))
#
# d %>%
# yetanother() %>%
# another() %>%
# pander()
Table @ref(tab:data) shows the first six rows of the tip data set, which actually has 244 rows. Let us describe the
variables in the table briefly:
TOTBILL Total bill paid by the party
TIP Tip left by the party
SEX Gender of the person who paid the bill (F, M)
SMOKER Whether the bill payer smokes or not (yes, no)
DAY Day of the week when the party had the meal (THU, FRI, SAT, SUN)
TIME Time of day when the party had the meal (lunch, dinner)
SIZE Number of people in the party
Below is a summary of each variable:
d %>%
summary() %>%
pander()
SIZE
Min. :1.00
1st Qu.:2.00
Median :2.00
Mean :2.57
3rd Qu.:3.00
Max. :6.00
Figure 1: Boxplots for TOTBILL on the left, TIP in the middle, and SIZE on the right.
Boxplots in Figure @ref(fig:bxplts) show that all numerical variables have right-skewed distributions.
d %>%
##select_if(is.numeric) %>%
gather(variable, value, TOTBILL, TIP, SIZE) %>%
ggplot(aes(variable, value)) +
# geom_boxplot(aes(fill = DAY)) +
geom_boxplot(aes(fill = TIME)) +
coord_flip()
[Figure: horizontal boxplots of value by variable (TOTBILL, TIP, SIZE), filled by TIME (dinner, lunch).]
Scatterplot
Modern version
g <- ggplot(d, aes(TOTBILL, TIP)) +
geom_point() +
geom_abline(intercept = 0, slope = .18, col = "red") +
geom_text(x=45, y=45*.18, label="18% tip\nline",
col="red", hjust = 0, vjust=1 )
print(g)
[Figure: scatterplot of TIP against TOTBILL with the red 18% tip line.]
plot(g)
[Figure: the same scatterplot of TIP against TOTBILL with the red 18% tip line.]
d
## # A tibble: 244 x 7
## TOTBILL TIP SEX SMOKER DAY TIME SIZE
## <dbl> <dbl> <fct> <fct> <fct> <fct> <dbl>
## 1 17.0 1.01 F no SUN dinner 2
## 2 10.3 1.66 M no SUN dinner 3
## 3 21.0 3.5 M no SUN dinner 3
## 4 23.7 3.31 M no SUN dinner 2
## 5 24.6 3.61 F no SUN dinner 4
## 6 25.3 4.71 M no SUN dinner 4
## 7 8.77 2 M no SUN dinner 2
## 8 26.9 3.12 M no SUN dinner 4
## 9 15.0 1.96 M no SUN dinner 2
## 10 14.8 3.23 M no SUN dinner 2
## # i 234 more rows
g + facet_grid(DAY+TIME~SMOKER+SEX, labeller = label_both) +
theme(strip.text.y = element_text(angle = 0))
[Figure: scatterplots of TIP against TOTBILL, faceted by DAY and TIME in rows and by SMOKER and SEX in columns, each panel with the 18% tip line.]
Let us also calculate the correlation between TOTBILL and TIP:
cor(d$TOTBILL, d$TIP)
## [1] 0.6757341
d %>%
group_by(SMOKER, SEX, DAY, TIME) %>%
summarize(cor = cor(TOTBILL, TIP),
count = n())
## `summarise()` has grouped output by 'SMOKER', 'SEX', 'DAY'. You can override
## using the `.groups` argument.
## # A tibble: 20 x 6
## # Groups: SMOKER, SEX, DAY [16]
## SMOKER SEX DAY TIME cor count
## <fct> <fct> <fct> <fct> <dbl> <int>
## 1 no F THU dinner NA 1
## 2 no F THU lunch 0.881 24
## 3 no F FRI dinner NA 1
## 4 no F FRI lunch NA 1
## 5 no F SAT dinner 0.623 13
## 6 no F SUN dinner 0.849 14
## 7 no M THU lunch 0.798 20
## 8 no M FRI dinner 1 2
## 9 no M SAT dinner 0.920 32
## 10 no M SUN dinner 0.706 43
## 11 yes F THU lunch 0.869 7
## 12 yes F FRI dinner 0.949 4
## 13 yes F FRI lunch 0.374 3
## 14 yes F SAT dinner 0.448 15
## 15 yes F SUN dinner -0.665 4
## 16 yes M THU lunch 0.629 10
## 17 yes M FRI dinner 0.926 5
## 18 yes M FRI lunch -0.305 3
## 19 yes M SAT dinner 0.621 27
## 20 yes M SUN dinner -0.0835 15
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
ggplot(aes(TOTBILL, TIP)) +
geom_line(aes(col=crazy)) +
geom_point(aes(col=crazy)) +
theme(legend.position = "none")
[Figure: TIP against TOTBILL, with the points of each random group joined by a line and colored by group; legend suppressed.]
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
summarize(cor = cor(TOTBILL, TIP)) %>%
xtabs(~cor, .)
## Caused by warning in `cor()`:
## ! the standard deviation is zero
## i Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
## cor
## -1 1
## 26 90
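The table contains only -1 and 1 because `floor(seq(n())/2)` puts at most two rows into each random group. A minimal base-R sketch with made-up numbers (hypothetical values, not taken from the tips data) illustrates why such small groups can only give a correlation of 1, -1, or NA:

```r
# Two points with different x and different y always lie on a line,
# so their sample correlation is exactly +1 or -1 (hypothetical values).
r_up   <- cor(c(10, 20), c(1.5, 3.0))   # y increases with x
r_down <- cor(c(10, 20), c(3.0, 1.5))   # y decreases with x

# If the two y values coincide, the standard deviation of y is zero and
# cor() returns NA together with the "standard deviation is zero" warning.
r_flat <- suppressWarnings(cor(c(10, 20), c(2, 2)))

# A group with a single observation also gives NA.
r_single <- suppressWarnings(cor(10, 2))

print(c(up = r_up, down = r_down, flat = r_flat, single = r_single))
```

So a high correlation inside tiny groups says nothing about the data; the group sizes in the earlier grouped summary deserve the same caution.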
## SMOKER
## SEX no yes
## F 22 14
## M 40 25
# dimnames(d_tbl)
d_tbl %>%
margin.table(c(1,2))%>%
mosaic(type="expected")
[Figure: mosaic plot of expected counts for SEX (F, M) by SMOKER (no, yes).]
d_tbl %>%
margin.table(c(1,2))%>%
mosaic(gp = gpar(fill = rep(c("pink", "lightblue"), each=2)))
[Figure: mosaic plot of SEX (F, M) by SMOKER (no, yes), tiles filled pink and light blue.]
## $SEX
## [1] "F" "M"
##
## $SMOKER
## [1] "no" "yes"
##
## $DAY
## [1] "THU" "FRI" "SAT" "SUN"
##
## $TIME
## [1] "dinner" "lunch"
d_tbl %>%
margin.table(1)%>%
mosaic()
[Figure: mosaic plot of SEX (F, M).]
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = rep(c("pink", "lightblue"), each=4)))
[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), tiles filled pink and light blue.]
library(RColorBrewer)
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = brewer.pal(4, "PuOr"))) # picked diverging palette
[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), tiles filled with the four-color PuOr palette.]
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(type="expected")
[Figure: mosaic plot of expected counts for SEX (F, M) by DAY (THU, FRI, SAT, SUN).]
Because the DAY tiles within the SEX blocks are significantly misaligned, we cannot expect SEX and DAY to be
independent. So they seem to be related. Can we measure the strength of the relation? We will return to this later.
• Tile areas are proportional to the cell counts of the corresponding table.
• Tiles within blocks are aligned across blocks: this strongly suggests that SEX and SMOKER are independent
(random variables).
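To make the aligned-tiles argument concrete, the following sketch compares SEX by SMOKER counts with the counts expected under independence. The counts are not printed in this form in the notes; here they are obtained by summing the `count` column of the grouped correlation summary over DAY and TIME, on the assumption that this matches `margin.table(d_tbl, c(1, 2))`.

```r
# SEX x SMOKER counts, summed from the `count` column of the grouped
# summary (assumed to equal margin.table(d_tbl, c(1, 2))).
obs <- matrix(c(54, 97, 33, 60), nrow = 2,
              dimnames = list(SEX = c("F", "M"), SMOKER = c("no", "yes")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals; values near zero correspond to aligned tiles.
pearson <- (obs - expected) / sqrt(expected)

print(round(expected, 1))
print(round(pearson, 2))
```

Observed and expected counts nearly coincide, which is what aligned tiles in the mosaic plot express graphically.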
How can we check the relation between every pair of categorical variables?
library(vcd)
d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~. , .) %>%
pairs(diag_panel = pairs_barplot(var_offset = 1.3,
rot = -30,
just_leveltext = "left",
gp_leveltext = gpar(fontsize = 8)),
shade = TRUE)
[Figure: pairs plot of mosaic displays for SEX, SMOKER, DAY, and TIME, with bar plots of the marginal counts on the diagonal.]
Independence tests
Digression: What does pipe %>% do?
5*((mean(extract2(d, "TOTBILL"), na.rm=TRUE))^2 )
## [1] 1957.418
versus
d %>%
#select(TOTBILL) %>%
extract2("TOTBILL") %>%
mean(na.rm=TRUE) %>%
`^`(2) %>%
`*`(5)
## [1] 1957.418
Which one is readable and easy to modify?
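The following sketch checks that the two styles compute the same number. It uses a small made-up data frame (`bills` is hypothetical, standing in for `d`) and the magrittr aliases `raise_to_power()` and `multiply_by()` instead of the backquoted operators:

```r
library(magrittr)

# Hypothetical stand-in for the tips data; only TOTBILL is needed.
bills <- data.frame(TOTBILL = c(10, 20, 30, NA))

# Nested call: has to be read from the inside out.
nested <- 5 * (mean(extract2(bills, "TOTBILL"), na.rm = TRUE)^2)

# Piped call: x %>% f(y) is evaluated as f(x, y), so it reads top to bottom.
piped <- bills %>%
  extract2("TOTBILL") %>%
  mean(na.rm = TRUE) %>%
  raise_to_power(2) %>%
  multiply_by(5)

stopifnot(isTRUE(all.equal(nested, piped)))
```

In the piped version each step sits on its own line, so a step can be commented out or replaced without rebalancing parentheses.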
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
ftable(HairEyeColor)
## Green 8 8
structable( Eye ~ Sex + Hair, HairEyeColor)
pairs(HairEyeColor, shade =TRUE)
[Figure: pairs plot of mosaic displays for Hair, Eye, and Sex in HairEyeColor, shaded by residuals.]
dd <- read_csv("Hair Color, Eye Color, Gender.csv")[,-1] %>%
set_names(c("Hair", "Eye", "Sex"))
## Rows: 39 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Timestamp, Hair color?, Eye color?, Gender?
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dd %>%
count(Hair, Eye, Sex)
## # A tibble: 14 x 4
## Hair Eye Sex n
## <chr> <chr> <chr> <int>
## 1 Black Brown Female 5
## 2 Black Brown Male 11
## 3 Black Green Male 3
## 4 Black Hazel Male 1
## 5 Blond Blue Female 1
## 6 Blond Blue Male 2
## 7 Blond Green Female 1
## 8 Blond Green Male 2
## 9 Blond Hazel Female 1
## 10 Brown Brown Female 6
## 11 Brown Brown Male 2
## 12 Brown Green Male 1
## 13 Brown Hazel Female 2
## 14 Brown Hazel Male 1
dd_tbl <- xtabs(~ ., dd)
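[Figure: pairs plot for dd_tbl with marginal counts on the diagonal: Hair (Black 20, Blond 7, Brown 12), Eye (Blue 3, Brown 24, Green 7, Hazel 5), Sex (Female 16, Male 23).]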
HairEyeColor %>%
margin.table(1)
## Hair
## Black Brown Red Blond
## 108 286 71 127
loglik_small <- HairEyeColor %>%
as_tibble() %>%
group_by(Hair) %>%
mutate(n_hair = sum(n)) %>%
group_by(Eye) %>%
mutate(n_eye = sum(n)) %>%
group_by(Sex) %>%
mutate(n_sex = sum(n)) %>%
ungroup() %>%
mutate(p_hair = n_hair/sum(n),
p_eye = n_eye/sum(n),
p_sex = n_sex/sum(n),
p_small = p_hair*p_eye*p_sex) %>%
select(Hair, Eye, Sex, n, p_small) %>%
summarize(loglik=sum(n*log(p_small))) %>%
extract2("loglik")
## [1] 166.3001
np_sat <- prod(dim(HairEyeColor)) -1
np_small <- sum(dim(HairEyeColor)-1)
(df <- np_sat -np_small)
## [1] 24
(pvalue <- 1-pchisq(LRT, df=df))
## [1] 0
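The chunk that assigns `LRT` does not appear in the output above. A minimal base-R sketch, assuming `LRT` is the usual likelihood-ratio statistic 2 × (log-likelihood of the saturated model − log-likelihood of the mutual independence model), reproduces the printed value 166.3001 and the 24 degrees of freedom:

```r
n <- HairEyeColor            # built-in 4 x 4 x 2 table (Hair, Eye, Sex)
N <- sum(n)

# Marginal proportions of each variable.
p_hair <- as.vector(margin.table(n, 1)) / N
p_eye  <- as.vector(margin.table(n, 2)) / N
p_sex  <- as.vector(margin.table(n, 3)) / N

# Expected counts under mutual independence: N * p_hair * p_eye * p_sex.
expected <- N * outer(outer(p_hair, p_eye), p_sex)

# LRT = 2 * sum(observed * log(observed / expected)); all cells are positive here.
LRT <- 2 * sum(n * log(n / expected))

# Degrees of freedom: free parameters of the saturated model minus the small model.
df <- (prod(dim(n)) - 1) - sum(dim(n) - 1)
pvalue <- pchisq(LRT, df = df, lower.tail = FALSE)

print(c(LRT = LRT, df = df, pvalue = pvalue))
```

Using `lower.tail = FALSE` instead of `1 - pchisq(...)` avoids rounding a tiny p-value to exactly zero.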
curve(dchisq(x, df=df), xlim = c(0,200),
main = sprintf("p-value = %.8f", pvalue))
abline(v = LRT)
mtext(side=1, at = LRT, text = sprintf("LRT = %d", round(LRT)))
[Figure: chi-square density dchisq(x, df = df) for x from 0 to 200, titled "p-value = 0.00000000", with a vertical line at LRT = 166.]
We observed that
• the LRT is large, and
• the p-value (= 0) is small (smaller than 0.05 means “significant”, smaller than 0.01 “very significant”, smaller
than 0.001 “very very significant”, and so on).
These two observations say the same thing. We reject the null hypothesis / small model / independence model.
We can also run the independence test using
HairEyeColor %>% summary()
##
## $margin[[2]]
## [1] "Eye"
##
## $margin[[3]]
## [1] "Sex"
mosaic(HairEyeColor, expected = ~ Hair + Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))
[Figure: mosaic plot of HairEyeColor (Hair, Eye, Sex) shaded by Pearson residuals under the mutual independence model; residuals range from -4.2 to 8.0, p-value < 2.22e-16.]
Are Hair and Eye colors jointly independent of Sex?
dimnames(HairEyeColor)
## $Hair
## [1] "Black" "Brown" "Red" "Blond"
##
## $Eye
## [1] "Brown" "Blue" "Hazel" "Green"
##
## $Sex
## [1] "Male" "Female"
loglin(HairEyeColor, list(c(1,2), 3))
##
## $df
## [1] 15
##
## $margin
## $margin[[1]]
## [1] "Hair" "Eye"
##
## $margin[[2]]
## [1] "Sex"
MASS::loglm(~ Hair*Eye + Sex, HairEyeColor)
## Call:
## MASS::loglm(formula = ~Hair * Eye + Sex, data = HairEyeColor)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 19.85656 15 0.1775045
## Pearson 19.56712 15 0.1891745
The p-values are large, so we cannot reject the null/small model. Let us check the residuals:
mosaic(HairEyeColor, expected = ~ Hair * Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))
[Figure: mosaic plot of HairEyeColor shaded by Pearson residuals under the model Hair*Eye + Sex; residuals range from -2.1 to 2.0, p-value = 0.18917.]
There are only very mild deviations (residuals near -2), and only in the counts of people with blond hair and blue
eyes; the model overestimates their counts (this is what the pink color means).
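The shading can be checked by hand. The sketch below assumes the shaded residuals are Pearson residuals, (observed − fitted) / sqrt(fitted), and computes the fitted counts of the joint independence model in closed form rather than through `loglm()`:

```r
n <- HairEyeColor
N <- sum(n)

# Fitted counts when (Hair, Eye) is jointly independent of Sex:
# n(hair, eye, +) * n(+, +, sex) / N.
n_hair_eye <- margin.table(n, c(1, 2))
n_sex <- margin.table(n, 3)
fitted <- outer(unclass(n_hair_eye), as.vector(n_sex)) / N
dimnames(fitted) <- dimnames(n)

# Pearson residuals: a negative value means the model overestimates the count.
res <- (unclass(n) - fitted) / sqrt(fitted)
X2 <- sum(res^2)             # Pearson chi-square statistic

print(round(range(res), 1))
print(round(res["Blond", "Blue", ], 2))
print(X2)
```

The residual for blond hair and blue eyes is negative for males and positive for females, so the overestimation concerns the male count.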
Devore 647, problem 31
A random sample of smokers was obtained, and each individual was classified both with respect to gender
and with respect to the age at which he/she first started smoking. The data in the accompanying table is
consistent with summary results reported in the article “Cigarette Tar Yields in Relation to Mortality in the
Cancer Prevention Study II Prospective Cohort” (British Med. J., 2004: 72–79).
d <- c(25, 24, 18, 19, 10, 32, 17, 34) %>%
matrix(nrow = 4)
dimnames(d) <-
list(
Age = c("<16", "16-17", "18-20", ">20"),
Gender=c("Male", "Female"))
d2 <- as.table(d)
d2
## Gender
## Age Male Female
## <16 25 10
## 16-17 24 32
## 18-20 18 17
## >20 19 34
a. Calculate the proportion of males in each age category, and then do the same for females.
Based on these proportions, does it appear that there might be an association between gender and the age at
which an individual first smokes?
# cond dist of gender given age
d2 %>%
prop.table(1) %>%
round(2)
## Gender
## Age Male Female
## <16 0.71 0.29
## 16-17 0.43 0.57
## 18-20 0.51 0.49
## >20 0.36 0.64
# cond dist of age given gender
d2 %>%
prop.table(2) %>%
round(2)
## Gender
## Age Male Female
## <16 0.29 0.11
## 16-17 0.28 0.34
## 18-20 0.21 0.18
## >20 0.22 0.37
There seems to be an association between gender and age (the conditional distributions do not look similar).
Check with mosaic plots
library(vcd)
mosaic(~ Age + Gender, d2, gp = shading_Marimekko(d2), byrow=TRUE)
[Figure: mosaic plot of Age (<16, 16-17, 18-20, >20) by Gender (Male, Female) with Marimekko shading.]
## The next does not work because, shading_Marimekko() does not realize
## that we changed order of splits in mosaic with formula option.
safely(mosaic)(formula = ~ Gender + Age, data=d2, gp = shading_Marimekko(d2))
## $result
## NULL
##
## $error
## <subscriptOutOfBoundsError in x[cbind(index, m)]: subscript out of bounds>
## The quick fix for that is to change the order of margins of table
## outside/before mosaic and substitute table with "." (lambda programming)
dimnames(d2)
## $Age
## [1] "<16" "16-17" "18-20" ">20"
##
## $Gender
## [1] "Male" "Female"
margin.table(d2, c(2,1)) %>%
mosaic(formula = ~ Gender + Age, data=., gp = shading_Marimekko(.))
[Figure: mosaic plot of Gender (Male, Female) by Age (<16, 16-17, 18-20, >20) with Marimekko shading.]
b. Carry out a test of hypotheses to decide whether there is an association between the two factors.
d2 %>% summary()
[Figure: mosaic plot of Age by Gender shaded by Pearson residuals; residuals range from -1.9 to 2.0, p-value = 0.0089312.]
mosaic(~ Gender + Age, expected = ~ Gender+Age,
d2, shade = TRUE)
[Figure: mosaic plot of Gender by Age shaded by Pearson residuals under the independence model; residuals range from -1.9 to 2.0, p-value = 0.0089312.]
Based on the p-value (here, < .01),
• we reject the null hypothesis (that age and gender are independent) and
• conclude that smoking starting age and gender are associated.
However, the residuals are between -2 and 2, which implies that
• the deviations of the observed frequencies from the expected frequencies under the independence model are
consistent with what we would expect from standard normal variates.
Therefore, we conclude that
• the association is weak.
Perhaps we should collect more data and repeat the analysis.
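The p-value and the residual range can also be obtained without a plot. A minimal sketch with base R's `chisq.test()`, re-entering the table from the problem under the hypothetical name `smoke`:

```r
smoke <- matrix(c(25, 24, 18, 19, 10, 32, 17, 34), nrow = 4,
                dimnames = list(Age = c("<16", "16-17", "18-20", ">20"),
                                Gender = c("Male", "Female")))

# Pearson chi-square test of independence; the continuity correction
# only applies to 2 x 2 tables, so none is used here.
test <- chisq.test(smoke)

print(test$p.value)
# Pearson residuals: (observed - expected) / sqrt(expected).
print(round(test$residuals, 2))
```

The test has (4 − 1) × (2 − 1) = 3 degrees of freedom, and its p-value agrees with the one printed in the mosaic legend.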
article “Examining the Prevalence of Criminal Desistance” (Criminology, 2003: 423–448); our sample size
differs slightly from what was reported because of rounding.
d <- tribble(
~x, ~f,
0, 1627,
1, 421,
2, 219,
3, 130,
4, 107,
5, 51,
6, 15,
7, 22,
8, 8,
9, 14,
10, 5,
11, 8,
12, 5,
13, 0,
14, 3,
15,2
)
d %>%
t() %>%
pander(caption = "Incidence data")
x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
f 1627 421 219 130 107 51 15 22 8 14 5 8 5 0 3 2
## # A tibble: 16 x 2
## x f
## <dbl> <dbl>
## 1 0 1627
## 2 1 421
## 3 2 219
## 4 3 130
## 5 4 107
## 6 5 51
## 7 6 15
## 8 7 22
## 9 8 8
## 10 9 14
## 11 10 5
## 12 11 8
## 13 12 5
## 14 13 0
## 15 14 3
## 16 15 2
lambdaMLE <- d %>%
summarize(mean = sum(x*f)/sum(f)) %>%
extract2("mean")
lambdaMLE
## [1] 0.9996208
d$f %>% sum()
## [1] 2637
# bar graph
d2 <- d %>%
mutate(fpois = sum(f)*dpois(x, lambdaMLE))
d2 %>%
gather(type, freq, -x) %>%
ggplot(aes(x,freq)) +
geom_col(aes(fill = type), position = "dodge")
[Figure: dodged column chart of freq against x, comparing the observed frequencies (f) with the fitted Poisson frequencies (fpois).]
The column graph suggests that Poisson is not a good choice. Let us look for statistical evidence to support our
conclusion with a chi-square test.
X2 <- d2 %>%
summarize( X2 = sum((f-fpois)^2/fpois)) %>%
extract2("X2")
## [1] 0
1 -pchisq(X2, df = df_LRT)
## [1] 0
Because the p-value (= 0) is practically zero, the Poisson model is rejected.
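The code above uses `df_LRT`, whose definition is not shown. The following self-contained sketch is one way to run the test in base R. It makes two assumptions that are not in the notes: the categories x ≥ 5 are pooled into one cell so that the cell probabilities sum to one and no expected count is tiny, and the degrees of freedom are (number of cells − 1 − 1), with one extra degree lost for estimating λ.

```r
x <- 0:15
f <- c(1627, 421, 219, 130, 107, 51, 15, 22, 8, 14, 5, 8, 5, 0, 3, 2)

# Maximum likelihood estimate of the Poisson mean: the sample mean.
lambda_hat <- sum(x * f) / sum(f)

# Observed counts for the cells 0, 1, 2, 3, 4 and ">= 5" (pooling is an assumption).
obs <- c(f[x <= 4], sum(f[x >= 5]))

# Cell probabilities under Poisson(lambda_hat); the last cell is the upper tail.
p <- c(dpois(0:4, lambda_hat), ppois(4, lambda_hat, lower.tail = FALSE))
expd <- sum(f) * p

# Pearson chi-square statistic and its p-value.
X2 <- sum((obs - expd)^2 / expd)
df <- length(obs) - 1 - 1    # cells - 1 - one estimated parameter
pvalue <- pchisq(X2, df = df, lower.tail = FALSE)

print(c(lambda = lambda_hat, X2 = X2, df = df, pvalue = pvalue))
```

A different pooling threshold changes X2 and the degrees of freedom, but the excess of zeros alone is large enough that the conclusion should not depend on it.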