
An Introduction to R, RStudio, R Markdown

Savaş Dayanık

24 09 2019

Introduction to R, RStudio
You can type straight text and math, for example,

tips = Xβ + ε

Some tips:
• This is a bullet. You can emphasize any important information by enclosing it in single asterisks.
• If it is absolutely important, then double the asterisks.
• To insert a new R chunk, use the Insert menu or press Ctrl-Alt-I. For all keybindings, the Tools menu is your
friend.
– To run a single R statement in a chunk, press Ctrl-Enter.
– To run all statements in the same chunk, press Ctrl-Shift-Enter.
– To comment out any part of the code or text, highlight it and press Ctrl-Shift-C.

Analyze tips dataset


A waiter collected the values of several variables that he thinks are important to determine tip amount, and
wants us to analyze the relation between tips he received and the factors that he just picked up.
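library(tidyverse) # read_csv, dplyr verbs, gather, ggplot2
library(magrittr) # extract2, multiply_by
library(pander) # pander tables
d <- suppressWarnings(read_csv("tips.csv",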
col_types = cols(
X1 = col_double(),
OBS = col_double(),
TOTBILL = col_double(),
TIP = col_double(),
SEX = col_character(),
SMOKER = col_character(),
DAY = col_character(),
TIME = col_character(),
SIZE = col_double()
))[,-1]) %>%
select(-OBS) %>%
mutate(
SEX = factor(SEX),
DAY = factor(DAY, levels = c("thurs","fri","sat","sun"),
labels = c("THU", "FRI", "SAT", "SUN") ),
TIME = factor(TIME),
SMOKER = factor(SMOKER))

## New names:
## * `` -> `...1`

d %>% distinct(DAY)

## # A tibble: 4 x 1
## DAY
## <fct>
## 1 SUN
## 2 SAT
## 3 THU
## 4 FRI
d %>% count(DAY)

## # A tibble: 4 x 2
## DAY n
## <fct> <int>
## 1 THU 62
## 2 FRI 19
## 3 SAT 87
## 4 SUN 76
d %>%
head() %>%
pander(caption = "(\\#tab:data) A glimpse over the data")

Table 1: A glimpse of the data

TOTBILL TIP SEX SMOKER DAY TIME SIZE


16.99 1.01 F no SUN dinner 2
10.34 1.66 M no SUN dinner 3
21.01 3.5 M no SUN dinner 3
23.68 3.31 M no SUN dinner 2
24.59 3.61 F no SUN dinner 4
25.29 4.71 M no SUN dinner 4

# pander(another(yetanother(d)))
#
# d %>%
# yetanother() %>%
# another() %>%
# pander()

Table 1 shows the first six rows of the tips data set, which actually has 244 rows. Let us describe the
variables in the table briefly:
TOTBILL Total bill paid by the party
TIP Tip left by the party
SEX Gender of the person who paid the bill (F, M)
SMOKER Whether the bill payer smokes or not (yes, no)
DAY Day of the week on which the party had the meal (thurs, fri, sat, sun)
TIME Time of day at which the party had the meal (lunch, dinner)
SIZE Number of people in the party
Below is a summary of each variable:

d %>%
summary() %>%
pander()

Table 2: Table continues below

TOTBILL TIP SEX SMOKER DAY TIME


Min. : 3.07 Min. : 1.000 F: 87 no :151 THU:62 dinner:176
1st Qu.:13.35 1st Qu.: 2.000 M:157 yes: 93 FRI:19 lunch : 68
Median :17.80 Median : 2.900 NA NA SAT:87 NA
Mean :19.79 Mean : 2.998 NA NA SUN:76 NA
3rd Qu.:24.13 3rd Qu.: 3.562 NA NA NA NA
Max. :50.81 Max. :10.000 NA NA NA NA

SIZE
Min. :1.00
1st Qu.:2.00
Median :2.00
Mean :2.57
3rd Qu.:3.00
Max. :6.00

boxplot(d$TOTBILL, main = "TOTALBILL")


boxplot(d$TIP, main = "TIP")
boxplot(d$SIZE, main = "SIZE")

Figure 1: Boxplots for TOTALBILL on the left, TIP in the middle, and SIZE on the right.

The boxplots in Figure 1 show that all numerical variables have right-skewed distributions.
d %>%
##select_if(is.numeric) %>%
gather(variable, value, TOTBILL, TIP, SIZE) %>%
ggplot(aes(variable, value)) +
# geom_boxplot(aes(fill = DAY)) +
geom_boxplot(aes(fill = TIME)) +
coord_flip()

[Figure: horizontal boxplots of TOTBILL, TIP, and SIZE (vertical axis: variable; horizontal axis: value, 0 to 50), filled by TIME (dinner, lunch).]
Scatterplot
Modern version
g <- ggplot(d, aes(TOTBILL, TIP)) +
geom_point() +
geom_abline(intercept = 0, slope = .18, col = "red") +
geom_text(x=45, y=45*.18, label="18% tip\nline",
col="red", hjust = 0, vjust=1 )
print(g)

[Figure: scatterplot of TIP against TOTBILL with the red "18% tip line".]
plot(g)

[Figure: scatterplot of TIP against TOTBILL with the red "18% tip line".]
d

## # A tibble: 244 x 7
## TOTBILL TIP SEX SMOKER DAY TIME SIZE
## <dbl> <dbl> <fct> <fct> <fct> <fct> <dbl>
## 1 17.0 1.01 F no SUN dinner 2
## 2 10.3 1.66 M no SUN dinner 3
## 3 21.0 3.5 M no SUN dinner 3
## 4 23.7 3.31 M no SUN dinner 2
## 5 24.6 3.61 F no SUN dinner 4
## 6 25.3 4.71 M no SUN dinner 4
## 7 8.77 2 M no SUN dinner 2
## 8 26.9 3.12 M no SUN dinner 4
## 9 15.0 1.96 M no SUN dinner 2
## 10 14.8 3.23 M no SUN dinner 2
## # i 234 more rows
g + facet_grid(DAY+TIME~SMOKER+SEX, labeller = label_both) +
theme(strip.text.y = element_text(angle = 0))

[Figure: scatterplots of TIP against TOTBILL with the 18% tip line, faceted by DAY and TIME (rows) and by SMOKER and SEX (columns).]
Let us also calculate the correlation between TOTBILL and TIP:
cor(d$TOTBILL, d$TIP)

## [1] 0.6757341
d %>%
group_by(SMOKER, SEX, DAY, TIME) %>%
summarize(cor = cor(TOTBILL, TIP),
count = n())

## `summarise()` has grouped output by 'SMOKER', 'SEX', 'DAY'. You can override
## using the `.groups` argument.
## # A tibble: 20 x 6
## # Groups: SMOKER, SEX, DAY [16]
## SMOKER SEX DAY TIME cor count
## <fct> <fct> <fct> <fct> <dbl> <int>
## 1 no F THU dinner NA 1
## 2 no F THU lunch 0.881 24
## 3 no F FRI dinner NA 1
## 4 no F FRI lunch NA 1
## 5 no F SAT dinner 0.623 13
## 6 no F SUN dinner 0.849 14
## 7 no M THU lunch 0.798 20
## 8 no M FRI dinner 1 2
## 9 no M SAT dinner 0.920 32
## 10 no M SUN dinner 0.706 43
## 11 yes F THU lunch 0.869 7

## 12 yes F FRI dinner 0.949 4
## 13 yes F FRI lunch 0.374 3
## 14 yes F SAT dinner 0.448 15
## 15 yes F SUN dinner -0.665 4
## 16 yes M THU lunch 0.629 10
## 17 yes M FRI dinner 0.926 5
## 18 yes M FRI lunch -0.305 3
## 19 yes M SAT dinner 0.621 27
## 20 yes M SUN dinner -0.0835 15
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
ggplot(aes(TOTBILL, TIP)) +
geom_line(aes(col=crazy)) +
geom_point(aes(col=crazy)) +
theme(legend.position = "none")

[Figure: TIP against TOTBILL, with points colored and joined by lines within each random group (crazy); legend hidden.]
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
summarize(cor = cor(TOTBILL, TIP)) %>%
xtabs(~cor, .)

## Warning: There were 5 warnings in `summarize()`.


## The first warning was:
## i In argument: `cor = cor(TOTBILL, TIP)`.
## i In group 30: `crazy = 29`.

## Caused by warning in `cor()`:
## ! the standard deviation is zero
## i Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
## cor
## -1 1
## 26 90

Tables are useful to analyze categorical (qualitative, factor) data


SMOKER, SEX, DAY, TIME
d_tbl <- d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~ . , .)

d_tbl %>% ftable

## TIME dinner lunch


## SEX SMOKER DAY
## F no THU 1 24
## FRI 1 1
## SAT 13 0
## SUN 14 0
## yes THU 0 7
## FRI 4 3
## SAT 15 0
## SUN 4 0
## M no THU 0 20
## FRI 2 0
## SAT 32 0
## SUN 43 0
## yes THU 0 10
## FRI 5 3
## SAT 27 0
## SUN 15 0
library(vcd)

## Loading required package: grid


structable( DAY + TIME ~ SEX + SMOKER, d_tbl)

## DAY THU FRI SAT SUN


## TIME dinner lunch dinner lunch dinner lunch dinner lunch
## SEX SMOKER
## F no 1 24 1 1 13 0 14 0
## yes 0 7 4 3 15 0 4 0
## M no 0 20 2 0 32 0 43 0
## yes 0 10 5 3 27 0 15 0
(margin.table(d_tbl, c(1,2)) / sum(d_tbl)) %>% round(2) %>% multiply_by(100)

## SMOKER
## SEX no yes
## F 22 14

## M 40 25
# dimnames(d_tbl)

d_tbl %>%
margin.table(c(1,2))%>%
mosaic(type="expected")

[Figure: mosaic plot of expected frequencies for SEX (F, M) by SMOKER (no, yes).]

d_tbl %>%
margin.table(c(1,2))%>%
mosaic(gp = gpar(fill = rep(c("pink", "lightblue"), each=2)))

[Figure: mosaic plot of SEX (F, M) by SMOKER (no, yes), filled in pink and light blue.]

Tiles are aligned within each block. Therefore, we tend to think that SMOKER and SEX are independent.
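To put a number on this visual impression, here is a minimal sketch in base R that compares the observed SEX by SMOKER counts with the counts expected under independence. The counts are the four-way table printed above, summed over DAY and TIME; the names `sex_smoker`, `expected`, and `test` are introduced here for illustration only.

```r
# SEX x SMOKER counts, summed over DAY and TIME from the ftable output above
sex_smoker <- matrix(c(54, 97, 33, 60), nrow = 2,
                     dimnames = list(SEX = c("F", "M"),
                                     SMOKER = c("no", "yes")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(sex_smoker), colSums(sex_smoker)) / sum(sex_smoker)
round(expected, 1)

# Pearson chi-square test of independence (no continuity correction)
test <- chisq.test(sex_smoker, correct = FALSE)
test$statistic
test$p.value
```

The observed counts differ from the expected ones by only a fraction of a unit, so the test gives no evidence against independence, in agreement with the aligned tiles.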
dimnames(d_tbl)

## $SEX
## [1] "F" "M"
##
## $SMOKER
## [1] "no" "yes"
##
## $DAY
## [1] "THU" "FRI" "SAT" "SUN"
##
## $TIME
## [1] "dinner" "lunch"
d_tbl %>%
margin.table(1)%>%
mosaic()

[Figure: mosaic plot of SEX (F, M).]

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = rep(c("pink", "lightblue"), each=4)))

[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), filled in pink and light blue.]

library(RColorBrewer)

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = brewer.pal(4, "PuOr"))) # picked diverging palette

[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), filled with the four-color "PuOr" palette.]

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(type="expected")

[Figure: mosaic plot of expected frequencies for SEX (F, M) by DAY (THU, FRI, SAT, SUN).]

Because the DAY tiles within the SEX blocks are significantly misaligned, we cannot expect SEX and DAY to be
independent. So they seem to be related. Can we measure the strength of the relation? Later.
• Tile areas are proportional to the cell counts of the corresponding table.
• When tiles within blocks are aligned across blocks, as for SEX and SMOKER, this strongly suggests that the
variables are independent (random variables).
How can we check the relation between every pair of categorical variables?
library(vcd)
d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~. , .) %>%
pairs(diag_panel = pairs_barplot(var_offset = 1.3,
rot = -30,
just_leveltext = "left",
gp_leveltext = gpar(fontsize = 8)),
shade = TRUE)

[Figure: shaded pairs plot of mosaic displays for SEX, SMOKER, DAY, and TIME, with bar plots of the marginal counts on the diagonal.]

Independence tests
Digression: What does pipe %>% do?
5*((mean(extract2(d, "TOTBILL"), na.rm=TRUE))^2 )

## [1] 1957.418
versus
d %>%
#select(TOTBILL) %>%
extract2("TOTBILL") %>%
mean(na.rm=TRUE) %>%
`^`(2) %>%
`*`(5)

## [1] 1957.418
Which one is readable and easy to modify?
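As a small illustration of why the two forms agree, the sketch below assumes the magrittr package (which provides `%>%`) and uses a made-up vector `x`. The pipe passes its left-hand side as the first argument of the call on its right, so both expressions are the same computation, written inside-out in one case and left-to-right in the other.

```r
library(magrittr)

x <- c(3, 4, NA)   # made-up data for illustration

# Nested form: read from the innermost call outward
nested <- 5 * (mean(x, na.rm = TRUE)^2)

# Piped form: x %>% f(y) is evaluated as f(x, y)
piped <- x %>%
  mean(na.rm = TRUE) %>%
  `^`(2) %>%
  `*`(5)

identical(nested, piped)
```

Adding or removing a step in the piped form means adding or deleting one line, whereas the nested form requires rebalancing parentheses.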

Hair color, Eye color, gender


Go to Google form and fill the form for
• yourself,
• your mother,
• your father,
• your siblings
library(vcd)
HairEyeColor

## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
ftable(HairEyeColor)

## Sex Male Female


## Hair Eye
## Black Brown 32 36
## Blue 11 9
## Hazel 10 5
## Green 3 2
## Brown Brown 53 66
## Blue 50 34
## Hazel 25 29
## Green 15 14
## Red Brown 10 16
## Blue 10 7
## Hazel 7 7
## Green 7 7
## Blond Brown 3 4
## Blue 30 64
## Hazel 5 5

## Green 8 8
structable( Eye ~ Sex + Hair, HairEyeColor)

## Eye Brown Blue Hazel Green


## Sex Hair
## Male Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
## Female Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
pairs(HairEyeColor)
[Figure: pairs plot of mosaic displays for Hair, Eye, and Sex in HairEyeColor, with bar plots of the marginal counts on the diagonal.]
pairs(HairEyeColor, shade =TRUE)

[Figure: shaded pairs plot of mosaic displays for Hair, Eye, and Sex in HairEyeColor, with bar plots of the marginal counts on the diagonal.]
dd <- read_csv("Hair Color, Eye Color, Gender.csv")[,-1] %>%
set_names(c("Hair", "Eye", "Sex"))

## Rows: 39 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Timestamp, Hair color?, Eye color?, Gender?
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dd %>%
count(Hair, Eye, Sex)

## # A tibble: 14 x 4
## Hair Eye Sex n
## <chr> <chr> <chr> <int>
## 1 Black Brown Female 5
## 2 Black Brown Male 11
## 3 Black Green Male 3
## 4 Black Hazel Male 1
## 5 Blond Blue Female 1
## 6 Blond Blue Male 2
## 7 Blond Green Female 1
## 8 Blond Green Male 2
## 9 Blond Hazel Female 1
## 10 Brown Brown Female 6
## 11 Brown Brown Male 2

## 12 Brown Green Male 1
## 13 Brown Hazel Female 2
## 14 Brown Hazel Male 1
dd_tbl <- xtabs(~ ., dd)

structable(Hair ~ Sex + Eye, dd_tbl)

## Hair Black Blond Brown


## Sex Eye
## Female Blue 0 1 0
## Brown 5 0 6
## Green 0 1 0
## Hazel 0 1 2
## Male Blue 0 2 0
## Brown 11 0 2
## Green 3 2 1
## Hazel 1 0 1
pairs(dd_tbl, shade = TRUE)
[Figure: shaded pairs plot of mosaic displays for Hair, Eye, and Sex in dd_tbl, with bar plots of the marginal counts on the diagonal.]

Independence test (Chi-square test) 15 October 2019


loglik_sat <- sum(HairEyeColor*log(HairEyeColor / sum(HairEyeColor)))

HairEyeColor %>%
margin.table(1)

## Hair
## Black Brown Red Blond
## 108 286 71 127
loglik_small <- HairEyeColor %>%
as_tibble() %>%
group_by(Hair) %>%
mutate(n_hair = sum(n)) %>%
group_by(Eye) %>%
mutate(n_eye = sum(n)) %>%
group_by(Sex) %>%
mutate(n_sex = sum(n)) %>%
ungroup() %>%
mutate(p_hair = n_hair/sum(n),
p_eye = n_eye/sum(n),
p_sex = n_sex/sum(n),
p_small = p_hair*p_eye*p_sex) %>%
select(Hair, Eye, Sex, n, p_small) %>%
summarize(loglik=sum(n*log(p_small))) %>%
extract2("loglik")

(LRT <- 2*(loglik_sat - loglik_small))

## [1] 166.3001
np_sat <- prod(dim(HairEyeColor)) -1
np_small <- sum(dim(HairEyeColor)-1)
(df <- np_sat -np_small)

## [1] 24
(pvalue <- 1-pchisq(LRT, df=df))

## [1] 0
curve(dchisq(x, df=df), xlim = c(0,200),
main = sprintf("p-value = %.8f", pvalue))
abline(v = LRT)
mtext(side=1, at = LRT, text = sprintf("LRT = %d", round(LRT)))

[Figure: density of the chi-square distribution with df = 24 for x from 0 to 200, titled "p-value = 0.00000000", with a vertical line marking LRT = 166.]

We observed that
• LRT is large
• p-value (printed as 0) is small (smaller than 0.05 means “significant”, smaller than 0.01 “very significant”,
smaller than 0.001 “very very significant”, and so on)
These two observations say the same thing. We reject the null hypothesis (the small model, i.e., the independence model).
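The printed p-value of 0 is a floating-point artifact: the upper tail probability is far smaller than machine precision relative to 1, so `1 - pchisq(...)` rounds to exactly zero. The following minimal sketch uses only base R and the built-in HairEyeColor table. It recomputes the statistic directly from the margins and asks pchisq() for the upper tail; names such as `expected`, `p_naive`, and `p_upper` are illustrative.

```r
obs <- HairEyeColor            # built-in 4 x 4 x 2 table (datasets package)
n   <- sum(obs)

# Marginal proportions of each factor
p_hair <- as.numeric(margin.table(obs, 1)) / n
p_eye  <- as.numeric(margin.table(obs, 2)) / n
p_sex  <- as.numeric(margin.table(obs, 3)) / n

# Expected counts under mutual independence: n * p_hair * p_eye * p_sex
expected <- n * outer(outer(p_hair, p_eye), p_sex)

# Likelihood ratio statistic and its degrees of freedom
lrt <- 2 * sum(obs * log(obs / expected))
df  <- (prod(dim(obs)) - 1) - sum(dim(obs) - 1)

p_naive <- 1 - pchisq(lrt, df = df)                   # rounds to 0
p_upper <- pchisq(lrt, df = df, lower.tail = FALSE)   # tiny but positive
c(lrt = lrt, df = df, p_naive = p_naive, p_upper = p_upper)
```

Requesting the upper tail with `lower.tail = FALSE` avoids the subtraction from 1 and keeps the very small probability representable.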
We can also run the independence test using
HairEyeColor %>% summary()

## Number of cases in table: 592


## Number of factors: 3
## Test for independence of all factors:
## Chisq = 164.92, df = 24, p-value = 5.321e-23
## Chi-squared approximation may be incorrect
loglin(HairEyeColor, list(1,2,3))

## 2 iterations: deviation 5.684342e-14


## $lrt
## [1] 166.3001
##
## $pearson
## [1] 164.9247
##
## $df
## [1] 24
##
## $margin
## $margin[[1]]
## [1] "Hair"

##
## $margin[[2]]
## [1] "Eye"
##
## $margin[[3]]
## [1] "Sex"
mosaic(HairEyeColor, expected = ~ Hair + Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))

[Figure: mosaic plot of HairEyeColor (Hair by Eye by Sex) shaded by Pearson residuals under the model ~ Hair + Eye + Sex; residual scale from -4.2 to 8.0; p-value < 2.22e-16.]
Are Hair and Eye colors jointly independent of Sex?
dimnames(HairEyeColor)

## $Hair
## [1] "Black" "Brown" "Red" "Blond"
##
## $Eye
## [1] "Brown" "Blue" "Hazel" "Green"
##
## $Sex
## [1] "Male" "Female"
loglin(HairEyeColor, list(c(1,2), 3))

## 2 iterations: deviation 5.684342e-14


## $lrt
## [1] 19.85656
##
## $pearson
## [1] 19.56712

##
## $df
## [1] 15
##
## $margin
## $margin[[1]]
## [1] "Hair" "Eye"
##
## $margin[[2]]
## [1] "Sex"
MASS::loglm(~ Hair*Eye + Sex, HairEyeColor)

## Call:
## MASS::loglm(formula = ~Hair * Eye + Sex, data = HairEyeColor)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 19.85656 15 0.1775045
## Pearson 19.56712 15 0.1891745
The p-values are large, so we cannot reject the null (small) model. Let us check the residuals:
mosaic(HairEyeColor, expected = ~ Hair * Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))

[Figure: mosaic plot of HairEyeColor (Hair by Eye by Sex) shaded by Pearson residuals under the model ~ Hair * Eye + Sex; residual scale from -2.1 to 2.0; p-value = 0.18917.]
There are only very mild deviations (residuals near -2), for the counts of blond-haired, blue-eyed people; the
pink shading means that the model overestimates their counts.
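For the joint independence model the fitted counts have a closed form, so the likelihood ratio statistic reported by loglin() and loglm() can be checked by hand. The base-R sketch below does that; the variable names are illustrative and not part of the original analysis.

```r
obs <- HairEyeColor
n   <- sum(obs)

# Margins fitted by the model [Hair Eye][Sex]
n_hair_eye <- apply(obs, c(1, 2), sum)   # 4 x 4 Hair-by-Eye counts
n_sex      <- apply(obs, 3, sum)         # counts for Male and Female

# Expected counts: n_(hair, eye) * n_(sex) / n
expected <- outer(n_hair_eye, n_sex) / n

lrt <- 2 * sum(obs * log(obs / expected))
# Saturated model: 31 free parameters; this model: 15 + 1
df  <- (prod(dim(obs)) - 1) - ((4 * 4 - 1) + (2 - 1))
p_value <- pchisq(lrt, df = df, lower.tail = FALSE)
c(lrt = lrt, df = df, p_value = p_value)
```

The degrees of freedom are the difference in free parameters between the saturated model and the fitted model, which is why they drop from 24 to 15 once the Hair by Eye interaction is included.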

Devore 647, problem 31
A random sample of smokers was obtained, and each individual was classified both with respect to gender
and with respect to the age at which he/she first started smoking. The data in the accompanying table is
consistent with summary results reported in the article “Cigarette Tar Yields in Relation to Mortality in the
Cancer Prevention Study II Prospective Cohort” (British Med. J., 2004: 72–79).
d <- c(25, 24, 18, 19, 10, 32, 17, 34) %>%
matrix(nrow = 4)

dimnames(d) <-
list(
Age = c("<16", "16-17", "18-20", ">20"),
Gender=c("Male", "Female"))

d2 <- as.table(d)
d2

## Gender
## Age Male Female
## <16 25 10
## 16-17 24 32
## 18-20 18 17
## >20 19 34
a. Calculate the proportion of males in each age category, and then do the same for females.
Based on these proportions, does it appear that there might be an association between gender and the age at
which an individual first smokes?
# cond dist of gender given age
d2 %>%
prop.table(1) %>%
round(2)

## Gender
## Age Male Female
## <16 0.71 0.29
## 16-17 0.43 0.57
## 18-20 0.51 0.49
## >20 0.36 0.64
# cond dist of age given gender
d2 %>%
prop.table(2) %>%
round(2)

## Gender
## Age Male Female
## <16 0.29 0.11
## 16-17 0.28 0.34
## 18-20 0.21 0.18
## >20 0.22 0.37
There seems to be an association between gender and age (the conditional distributions do not look similar).
Check with mosaic plots:

library(vcd)
mosaic(~ Age + Gender, d2, gp = shading_Marimekko(d2), byrow=TRUE)

[Figure: mosaic plot of Age (<16, 16-17, 18-20, >20) by Gender (Male, Female) with Marimekko shading.]

## The next does not work because, shading_Marimekko() does not realize
## that we changed order of splits in mosaic with formula option.
safely(mosaic)(formula = ~ Gender + Age, data=d2, gp = shading_Marimekko(d2))

## $result
## NULL
##
## $error
## <subscriptOutOfBoundsError in x[cbind(index, m)]: subscript out of bounds>
## The quick fix for that is to change the order of margins of table
## outside/before mosaic and substitute table with "." (lambda programming)
dimnames(d2)

## $Age
## [1] "<16" "16-17" "18-20" ">20"
##
## $Gender

## [1] "Male" "Female"
margin.table(d2, c(2,1)) %>%
mosaic(formula = ~ Gender + Age, data=., gp = shading_Marimekko(.))

[Figure: mosaic plot of Gender (Male, Female) by Age (<16, 16-17, 18-20, >20) with Marimekko shading.]

b. Carry out a test of hypotheses to decide whether there is an association between the two factors.
d2 %>% summary()

## Number of cases in table: 179


## Number of factors: 2
## Test for independence of all factors:
## Chisq = 11.589, df = 3, p-value = 0.008931
mosaic(~ Age + Gender, d2, shade = TRUE)

[Figure: mosaic plot of Age by Gender shaded by Pearson residuals; residual scale from -1.9 to 2.0; p-value = 0.0089312.]
mosaic(~ Gender + Age, expected = ~ Gender+Age,
d2, shade = TRUE)

[Figure: mosaic plot of Gender by Age shaded by Pearson residuals; residual scale from -1.9 to 2.0; p-value = 0.0089312.]
Based on the p-value (here, < .01), we
• reject the null hypothesis (that age and gender are independent) and
• conclude that the age at which smoking starts and gender are associated.
However, the residuals are between -2 and 2, which implies that
• the deviations of the observed frequencies from the expected frequencies under the independence model are
consistent with what we expect from standard normal variates.
Therefore, we conclude that
• the association is weak.
Perhaps we should collect more data and repeat the analysis.
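The statistic, p-value, and residual range quoted above can be reproduced in base R with chisq.test(), which returns the expected counts and the Pearson residuals alongside the test. This is only a sketch; `smoke` and `fit` are illustrative names.

```r
smoke <- matrix(c(25, 24, 18, 19, 10, 32, 17, 34), nrow = 4,
                dimnames = list(Age = c("<16", "16-17", "18-20", ">20"),
                                Gender = c("Male", "Female")))

fit <- chisq.test(smoke)

round(fit$expected, 1)    # expected counts under independence
round(fit$residuals, 2)   # Pearson residuals: (observed - expected) / sqrt(expected)
c(statistic = unname(fit$statistic),
  df        = unname(fit$parameter),
  p_value   = fit$p.value)
```

The largest residuals belong to the "<16" row, which is where the conditional distributions of gender differ most from the other age groups.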

Devore Exercise 16 on 638


Let X = the number of adult police contacts for a randomly selected individual who previously had at least
one such contact prior to age 18. The following frequencies were calculated from information given in the
article “Examining the Prevalence of Criminal Desistance” (Criminology, 2003: 423–448); our sample size
differs slightly from what was reported because of rounding.
d <- tribble(
~x, ~f,
0, 1627,
1, 421,
2, 219,
3, 130,
4, 107,
5, 51,
6, 15,
7, 22,
8, 8,
9, 14,
10, 5,
11, 8,
12, 5,
13, 0,
14, 3,
15,2
)

d %>%
t() %>%
pander(caption = "Incidence data")

Table 4: Incidence data

x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
f 1627 421 219 130 107 51 15 22 8 14 5 8 5 0 3 2

Table 5: Incidence data II

x 0 1 2 3 4 5
f 1627 421 219 130 107 51

Look at Table 5 for frequencies of all incidences.


1. Find MLE of Poisson mean
2. Calculate Poisson frequencies
3. Draw a bar/column graph
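Step 1 has a closed-form answer. As a brief derivation, write $f_x$ for the observed frequency of the value $x$ and $n = \sum_x f_x$ (notation introduced here for illustration); maximizing the Poisson log-likelihood gives the sample mean:

```latex
\begin{align*}
\ell(\lambda) &= \sum_{x} f_x \log\!\left(\frac{e^{-\lambda}\lambda^{x}}{x!}\right)
              = -n\lambda + \log\lambda \sum_{x} x f_x - \sum_{x} f_x \log x!, \\
\ell'(\lambda) &= -n + \frac{1}{\lambda}\sum_{x} x f_x = 0
\quad\Longrightarrow\quad
\hat\lambda = \frac{\sum_{x} x f_x}{\sum_{x} f_x}.
\end{align*}
```

This is the quantity `sum(x*f)/sum(f)` computed in the code that follows, and the Poisson frequencies of step 2 are then $n$ times the Poisson probabilities evaluated at $\hat\lambda$.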
d

## # A tibble: 16 x 2
## x f
## <dbl> <dbl>
## 1 0 1627
## 2 1 421
## 3 2 219
## 4 3 130
## 5 4 107

## 6 5 51
## 7 6 15
## 8 7 22
## 9 8 8
## 10 9 14
## 11 10 5
## 12 11 8
## 13 12 5
## 14 13 0
## 15 14 3
## 16 15 2
lambdaMLE <- d %>%
summarize(mean = sum(x*f)/sum(f)) %>%
extract2("mean")

lambdaMLE

## [1] 0.9996208
d$f %>% sum()

## [1] 2637
# bar graph
d2 <- d %>%
mutate(fpois = sum(f)*dpois(x, lambdaMLE))

d2 %>%
gather(type, freq, -x) %>%
ggplot(aes(x,freq)) +
geom_col(aes(fill = type), position = "dodge")

[Figure: dodged column chart of freq against x (0 to 15), comparing the observed frequencies (f) with the fitted Poisson frequencies (fpois).]
The column graph suggests that the Poisson distribution is not a good choice. Let us look for statistical evidence
to support this conclusion with a chi-square test.
X2 <- d2 %>%
summarize( X2 = sum((f-fpois)^2/fpois)) %>%
extract2("X2")

df_saturated <- nrow(d) - 1


df_small <- 1
df_LRT <- df_saturated - df_small

(pval <- pchisq(X2, df = df_LRT, lower.tail = FALSE))

## [1] 0
1 -pchisq(X2, df = df_LRT)

## [1] 0
Because the p-value (printed as 0) is practically zero, the Poisson model is rejected.
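One caveat with the test above is that most expected Poisson counts for large x are far below 1, where the chi-square approximation is unreliable. A common remedy, sketched here in base R, is to pool the sparse cells before computing the statistic. The cutoff `x >= 5` is an assumption, chosen so that every pooled expected count is at least 5; it is not part of the original analysis, and the names ending in `_pooled` are illustrative.

```r
x <- 0:15
f <- c(1627, 421, 219, 130, 107, 51, 15, 22, 8, 14, 5, 8, 5, 0, 3, 2)
n <- sum(f)
lambda_hat <- sum(x * f) / n           # MLE of the Poisson mean

# Cells 0, 1, 2, 3, 4 and a pooled cell "5 or more"
obs_pooled  <- c(f[x <= 4], sum(f[x >= 5]))
prob_pooled <- c(dpois(0:4, lambda_hat),
                 ppois(4, lambda_hat, lower.tail = FALSE))  # P(X >= 5)
exp_pooled  <- n * prob_pooled

X2_pooled <- sum((obs_pooled - exp_pooled)^2 / exp_pooled)
df_pooled <- length(obs_pooled) - 1 - 1  # cells - 1 - one estimated parameter
p_pooled  <- pchisq(X2_pooled, df = df_pooled, lower.tail = FALSE)

round(exp_pooled, 1)
c(X2 = X2_pooled, df = df_pooled, p_value = p_pooled)
```

Under these assumptions the conclusion does not change: the pooled statistic is still enormous relative to a chi-square distribution with 4 degrees of freedom, so the Poisson model is rejected.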
