An Introduction to R, RStudio, R Markdown

Savaş Dayanık

24 09 2019

Introduction to R, RStudio
You can type straight text and math, for example,

$$\text{tips} = X\beta + \varepsilon$$

Some tips:
• This is a bullet. You can emphasize any important information by enclosing it in single asterisks.
• If it is absolutely important, then double the asterisks.
• To insert a new R chunk, use the Insert menu or press Ctrl-Alt-I. For all keybindings, the Tools menu is your friend.
– To run a single R statement in a chunk, press Ctrl-Enter.
– To run all statements in the same chunk, press Ctrl-Shift-Enter.
– To comment out any part of the code or text, highlight it and press Ctrl-Shift-C.

Analyze tips dataset

A waiter collected the values of several variables that he thinks are important in determining the tip amount, and
wants us to analyze the relation between the tips he received and those factors.
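# Packages used below; assumed to have been loaded in a hidden setup chunk.
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(magrittr)   # extract2(), multiply_by()
library(pander)     # pander() tables

# Read the data, drop the unnamed row-index column and OBS, and code the
# categorical variables as factors (DAY with weekday order and short labels).
d <- suppressWarnings(read_csv("tips.csv",
                               col_types = cols(
                                 X1 = col_double(),
                                 OBS = col_double(),
                                 TOTBILL = col_double(),
                                 TIP = col_double(),
                                 SEX = col_character(),
                                 SMOKER = col_character(),
                                 DAY = col_character(),
                                 TIME = col_character(),
                                 SIZE = col_double()
                               ))[, -1]) %>%
  select(-OBS) %>%
  mutate(
    SEX = factor(SEX),
    DAY = factor(DAY, levels = c("thurs", "fri", "sat", "sun"),
                 labels = c("THU", "FRI", "SAT", "SUN")),
    TIME = factor(TIME),
    SMOKER = factor(SMOKER))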

## New names:
## * `` -> `...1`

d %>% distinct(DAY)

## # A tibble: 4 x 1
## DAY
## <fct>
## 1 SUN
## 2 SAT
## 3 THU
## 4 FRI
d %>% count(DAY)

## # A tibble: 4 x 2
## DAY n
## <fct> <int>
## 1 THU 62
## 2 FRI 19
## 3 SAT 87
## 4 SUN 76
d %>%
head() %>%
pander(caption = "(\\#tab:data) A glimpse over the data")

Table 1: A glimpse over the data

TOTBILL TIP SEX SMOKER DAY TIME SIZE

16.99 1.01 F no SUN dinner 2
10.34 1.66 M no SUN dinner 3
21.01 3.5 M no SUN dinner 3
23.68 3.31 M no SUN dinner 2
24.59 3.61 F no SUN dinner 4
25.29 4.71 M no SUN dinner 4

# pander(another(yetanother(d)))
#
# d %>%
# yetanother() %>%
# another() %>%
# pander()

Table 1 shows the first six rows of the tips data set, which actually has 244 rows. Let us briefly describe the
variables in the table:
TOTBILL  Total bill paid by the party
TIP      Tip left by the party
SEX      Gender of the person who paid the bill (F, M)
SMOKER   Whether the bill payer smokes or not (yes, no)
DAY      Day of the week on which the party had the meal (thurs, fri, sat, sun)
TIME     Time of day when the party had the meal (lunch, dinner)
SIZE     Number of people in the party
Below is a summary of each variable:

d %>%
summary() %>%
pander()

Table 2: Table continues below

TOTBILL TIP SEX SMOKER DAY TIME

Min. : 3.07 Min. : 1.000 F: 87 no :151 THU:62 dinner:176
1st Qu.:13.35 1st Qu.: 2.000 M:157 yes: 93 FRI:19 lunch : 68
Median :17.80 Median : 2.900 NA NA SAT:87 NA
Mean :19.79 Mean : 2.998 NA NA SUN:76 NA
3rd Qu.:24.13 3rd Qu.: 3.562 NA NA NA NA
Max. :50.81 Max. :10.000 NA NA NA NA

SIZE
Min. :1.00
1st Qu.:2.00
Median :2.00
Mean :2.57
3rd Qu.:3.00
Max. :6.00

boxplot(d$TOTBILL, main = "TOTALBILL")

boxplot(d$TIP, main = "TIP")
boxplot(d$SIZE, main = "SIZE")

Figure 1: Boxplots of TOTALBILL (left), TIP (middle), and SIZE (right).

The boxplots in Figure 1 show that all three numerical variables have right-skewed distributions.
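A quick numeric cross-check of the skewness claim, as a rough sketch: in a right-skewed distribution the mean usually exceeds the median. The values below are copied from the summary table (Table 2); the data frame name `stats` is only illustrative, and the mean-versus-median comparison is a heuristic, not a formal test.

```r
# Means and medians copied from the summary() output in Table 2.
stats <- data.frame(
  variable = c("TOTBILL", "TIP", "SIZE"),
  mean     = c(19.79, 2.998, 2.57),
  median   = c(17.80, 2.900, 2.00)
)

# A mean above the median hints at a longer right tail.
stats$mean_above_median <- stats$mean > stats$median
print(stats)
```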
d %>%
##select_if(is.numeric) %>%
gather(variable, value, TOTBILL, TIP, SIZE) %>%
ggplot(aes(variable, value)) +
# geom_boxplot(aes(fill = DAY)) +
geom_boxplot(aes(fill = TIME)) +
coord_flip()

[Figure: horizontal boxplots of value (x-axis, 0 to 50) for each variable (TOTBILL, TIP, SIZE), filled by TIME (dinner, lunch).]
Scatterplot
Modern version
g <- ggplot(d, aes(TOTBILL, TIP)) +
geom_point() +
geom_abline(intercept = 0, slope = .18, col = "red") +
geom_text(x=45, y=45*.18, label="18% tip\nline",
col="red", hjust = 0, vjust=1 )
print(g)

[Figure: scatterplot of TIP against TOTBILL with a red line marking an 18% tip.]
plot(g)

[Figure: scatterplot of TIP against TOTBILL with the red 18% tip line, as drawn by plot(g).]
d

## # A tibble: 244 x 7
## TOTBILL TIP SEX SMOKER DAY TIME SIZE
## <dbl> <dbl> <fct> <fct> <fct> <fct> <dbl>
## 1 17.0 1.01 F no SUN dinner 2
## 2 10.3 1.66 M no SUN dinner 3
## 3 21.0 3.5 M no SUN dinner 3
## 4 23.7 3.31 M no SUN dinner 2
## 5 24.6 3.61 F no SUN dinner 4
## 6 25.3 4.71 M no SUN dinner 4
## 7 8.77 2 M no SUN dinner 2
## 8 26.9 3.12 M no SUN dinner 4
## 9 15.0 1.96 M no SUN dinner 2
## 10 14.8 3.23 M no SUN dinner 2
## # i 234 more rows
g + facet_grid(DAY+TIME~SMOKER+SEX, labeller = label_both) +
theme(strip.text.y = element_text(angle = 0))

[Figure: scatterplots of TIP against TOTBILL with the 18% tip line, faceted by DAY and TIME (rows) and by SMOKER and SEX (columns).]
Let us also calculate the correlation between TOTBILL and TIP:
cor(d$TOTBILL, d$TIP)

## [1] 0.6757341
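As a reminder of what `cor()` computes, here is a minimal sketch of the Pearson formula applied only to the six rows shown in Table 1 (so its value will differ from the full-data 0.676). The vector names are illustrative.

```r
# First six rows of the tips data, copied from Table 1.
totbill <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
tip     <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)

# Pearson correlation: sum of cross-products of the centered values,
# divided by the square root of the product of their sums of squares.
dx <- totbill - mean(totbill)
dy <- tip - mean(tip)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in function up to rounding error.
all.equal(r_manual, cor(totbill, tip))
```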
d %>%
group_by(SMOKER, SEX, DAY, TIME) %>%
summarize(cor = cor(TOTBILL, TIP),
count = n())

## `summarise()` has grouped output by 'SMOKER', 'SEX', 'DAY'. You can override
## using the `.groups` argument.
## # A tibble: 20 x 6
## # Groups: SMOKER, SEX, DAY [16]
## SMOKER SEX DAY TIME cor count
## <fct> <fct> <fct> <fct> <dbl> <int>
## 1 no F THU dinner NA 1
## 2 no F THU lunch 0.881 24
## 3 no F FRI dinner NA 1
## 4 no F FRI lunch NA 1
## 5 no F SAT dinner 0.623 13
## 6 no F SUN dinner 0.849 14
## 7 no M THU lunch 0.798 20
## 8 no M FRI dinner 1 2
## 9 no M SAT dinner 0.920 32
## 10 no M SUN dinner 0.706 43
## 11 yes F THU lunch 0.869 7
## 12 yes F FRI dinner 0.949 4
## 13 yes F FRI lunch 0.374 3
## 14 yes F SAT dinner 0.448 15
## 15 yes F SUN dinner -0.665 4
## 16 yes M THU lunch 0.629 10
## 17 yes M FRI dinner 0.926 5
## 18 yes M FRI lunch -0.305 3
## 19 yes M SAT dinner 0.621 27
## 20 yes M SUN dinner -0.0835 15
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
ggplot(aes(TOTBILL, TIP)) +
geom_line(aes(col=crazy)) +
geom_point(aes(col=crazy)) +
theme(legend.position = "none")

[Figure: TIP against TOTBILL, with the points of each random group joined by a line and colored by group; legend hidden.]
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
summarize(cor = cor(TOTBILL, TIP)) %>%
xtabs(~cor, .)

## Warning: There were 5 warnings in `summarize()`.

## The first warning was:
## i In argument: `cor = cor(TOTBILL, TIP)`.
## i In group 30: `crazy = 29`.
## Caused by warning in `cor()`:
## ! the standard deviation is zero
## i Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
## cor
## -1 1
## 26 90
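The ±1 values are no surprise: each random group holds at most two bills, and the correlation of two distinct points is always exactly 1 or -1, because a line fits two points perfectly. A small sketch with made-up numbers (all values and names here are hypothetical):

```r
# Two points with increasing y: perfect positive correlation.
up <- cor(c(10, 20), c(1, 3))

# Two points with decreasing y: perfect negative correlation.
down <- cor(c(10, 20), c(3, 1))

# Equal y values give zero standard deviation, so cor() returns NA
# (with the "standard deviation is zero" warning, suppressed here).
flat <- suppressWarnings(cor(c(10, 20), c(2, 2)))

c(up = up, down = down, flat = flat)
```

This also accounts for the warnings above: a pair with tied values has zero standard deviation in one variable.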

Tables are useful to analyze categorical (qualitative, factor) data

SMOKER, SEX, DAY, TIME
d_tbl <- d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~ . , .)

d_tbl %>% ftable

## TIME dinner lunch

## SEX SMOKER DAY
## F no THU 1 24
## FRI 1 1
## SAT 13 0
## SUN 14 0
## yes THU 0 7
## FRI 4 3
## SAT 15 0
## SUN 4 0
## M no THU 0 20
## FRI 2 0
## SAT 32 0
## SUN 43 0
## yes THU 0 10
## FRI 5 3
## SAT 27 0
## SUN 15 0
library(vcd)

## Loading required package: grid

structable( DAY + TIME ~ SEX + SMOKER, d_tbl)

## DAY THU FRI SAT SUN

## TIME dinner lunch dinner lunch dinner lunch dinner lunch
## SEX SMOKER
## F no 1 24 1 1 13 0 14 0
## yes 0 7 4 3 15 0 4 0
## M no 0 20 2 0 32 0 43 0
## yes 0 10 5 3 27 0 15 0
(margin.table(d_tbl, c(1,2)) / sum(d_tbl)) %>% round(2) %>% multiply_by(100)

## SMOKER
## SEX no yes
## F 22 14
## M 40 25
# dimnames(d_tbl)

d_tbl %>%
margin.table(c(1,2))%>%
mosaic(type="expected")

[Figure: mosaic plot of SEX (F, M) by SMOKER (no, yes), type = "expected".]

d_tbl %>%
margin.table(c(1,2))%>%
mosaic(gp = gpar(fill = rep(c("pink", "lightblue"), each=2)))

[Figure: mosaic plot of SEX (F, M) by SMOKER (no, yes) with pink and light blue tiles.]

The tiles are aligned within each block. Therefore, we tend to think that SMOKER and SEX are independent.
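To put a number on this visual impression, here is a minimal sketch: under independence the expected count of a cell is (row total × column total) / grand total. The counts below are obtained by summing the flat table above over DAY and TIME; the object names are illustrative.

```r
# SEX x SMOKER counts, summed over DAY and TIME from the ftable output.
sex_smoker <- matrix(c(54, 97, 33, 60), nrow = 2,
                     dimnames = list(SEX = c("F", "M"),
                                     SMOKER = c("no", "yes")))

# Expected counts if SEX and SMOKER were independent.
expected <- outer(rowSums(sex_smoker), colSums(sex_smoker)) / sum(sex_smoker)

round(expected, 1)               # compare with the observed counts
round(sex_smoker - expected, 1)  # differences are all well below one count
```

The observed and expected counts nearly coincide, which is what aligned tiles in the mosaic plot express graphically.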
dimnames(d_tbl)

## $SEX
## [1] "F" "M"
##
## $SMOKER
## [1] "no" "yes"
##
## $DAY
## [1] "THU" "FRI" "SAT" "SUN"
##
## $TIME
## [1] "dinner" "lunch"
d_tbl %>%
margin.table(1)%>%
mosaic()

[Figure: mosaic plot of SEX (F, M).]

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = rep(c("pink", "lightblue"), each=4)))

[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN) with pink and light blue tiles.]

library(RColorBrewer)

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = brewer.pal(4, "PuOr"))) # picked diverging palette

[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), filled with the PuOr palette.]

d_tbl %>%
margin.table(c(1,3))%>%
mosaic(type="expected")

[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), type = "expected".]

Because the DAY tiles within the SEX blocks are significantly misaligned, we cannot expect SEX and DAY to be
independent. So they seem to be related. Can we measure the strength of the relation? We will come back to this later.
• Tile areas are proportional to the cell counts of the corresponding table.
• Tiles within blocks are aligned across blocks: this strongly suggests that SEX and SMOKER are independent
(random variables).
How can we check the relation between every pair of categorical variables?
library(vcd)
d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~. , .) %>%
pairs(diag_panel = pairs_barplot(var_offset = 1.3,
rot = -30,
just_leveltext = "left",
gp_leveltext = gpar(fontsize = 8)),
shade = TRUE)

[Figure: pairs plot of SEX, SMOKER, DAY, and TIME; bar plots of the level counts on the diagonal, shaded mosaic plots off the diagonal.]

Independence tests
Digression: What does pipe %>% do?
5*((mean(extract2(d, "TOTBILL"), na.rm=TRUE))^2 )

## [1] 1957.418
versus
d %>%
  # select(TOTBILL) %>%
  extract2("TOTBILL") %>%
  mean(na.rm = TRUE) %>%
  `^`(2) %>%
  `*`(5)

## [1] 1957.418
Which one is readable and easy to modify?
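The rule behind the pipe is that `x %>% f(y)` is evaluated as `f(x, y)`: the left-hand side becomes the first argument of the call on the right. A minimal sketch with a made-up vector (the names `x`, `nested`, and `piped` are illustrative):

```r
library(magrittr)

x <- c(1, 4, 9)  # hypothetical data

# Nested calls are read inside-out.
nested <- round(sqrt(sum(x)), 1)

# The same computation, read left to right.
piped <- x %>% sum() %>% sqrt() %>% round(1)

identical(nested, piped)
```

Each step of the piped version can be commented out or reordered on its own line, which is much harder in the nested form.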

Hair color, Eye color, gender

Go to Google form and fill the form for
• yourself,
• your mother,
• your father,
• your siblings
library(vcd)
HairEyeColor

## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
ftable(HairEyeColor)

## Sex Male Female

## Hair Eye
## Black Brown 32 36
## Blue 11 9
## Hazel 10 5
## Green 3 2
## Brown Brown 53 66
## Blue 50 34
## Hazel 25 29
## Green 15 14
## Red Brown 10 16
## Blue 10 7
## Hazel 7 7
## Green 7 7
## Blond Brown 3 4
## Blue 30 64
## Hazel 5 5
## Green 8 8
structable( Eye ~ Sex + Hair, HairEyeColor)

## Eye Brown Blue Hazel Green

## Sex Hair
## Male Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
## Female Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
pairs(HairEyeColor)
[Figure: pairs plot of HairEyeColor (Hair, Eye, Sex) with mosaic panels.]
pairs(HairEyeColor, shade =TRUE)

[Figure: pairs plot of HairEyeColor (Hair, Eye, Sex) with residual shading (shade = TRUE).]
dd <- read_csv("Hair Color, Eye Color, Gender.csv")[,-1] %>%
set_names(c("Hair", "Eye", "Sex"))

## Rows: 39 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Timestamp, Hair color?, Eye color?, Gender?
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dd %>%
count(Hair, Eye, Sex)

## # A tibble: 14 x 4
## Hair Eye Sex n
## <chr> <chr> <chr> <int>
## 1 Black Brown Female 5
## 2 Black Brown Male 11
## 3 Black Green Male 3
## 4 Black Hazel Male 1
## 5 Blond Blue Female 1
## 6 Blond Blue Male 2
## 7 Blond Green Female 1
## 8 Blond Green Male 2
## 9 Blond Hazel Female 1
## 10 Brown Brown Female 6
## 11 Brown Brown Male 2
## 12 Brown Green Male 1
## 13 Brown Hazel Female 2
## 14 Brown Hazel Male 1
dd_tbl <- xtabs(~ ., dd)

structable(Hair ~ Sex + Eye, dd_tbl)

## Hair Black Blond Brown

## Sex Eye
## Female Blue 0 1 0
## Brown 5 0 6
## Green 0 1 0
## Hazel 0 1 2
## Male Blue 0 2 0
## Brown 11 0 2
## Green 3 2 1
## Hazel 1 0 1
pairs(dd_tbl, shade = TRUE)
[Figure: shaded pairs plot of dd_tbl (Hair, Eye, Sex).]

Independence test (Chi-square test) 15 October 2019

loglik_sat <- sum(HairEyeColor*log(HairEyeColor / sum(HairEyeColor)))

HairEyeColor %>%
margin.table(1)

## Hair
## Black Brown Red Blond
## 108 286 71 127
loglik_small <- HairEyeColor %>%
as_tibble() %>%
group_by(Hair) %>%
mutate(n_hair = sum(n)) %>%
group_by(Eye) %>%
mutate(n_eye = sum(n)) %>%
group_by(Sex) %>%
mutate(n_sex = sum(n)) %>%
ungroup() %>%
mutate(p_hair = n_hair/sum(n),
p_eye = n_eye/sum(n),
p_sex = n_sex/sum(n),
p_small = p_hair*p_eye*p_sex) %>%
select(Hair, Eye, Sex, n, p_small) %>%
summarize(loglik=sum(n*log(p_small))) %>%
extract2("loglik")

(LRT <- 2*(loglik_sat - loglik_small))

## [1] 166.3001
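In symbols, the computation above is the following. This is a sketch of the standard likelihood-ratio statistic; the index notation is ours, with $n_{ijk}$ the count for hair color $i$, eye color $j$, and sex $k$, and $n$ the grand total.

```latex
\begin{aligned}
\ell_{\text{sat}} &= \sum_{i,j,k} n_{ijk} \log\frac{n_{ijk}}{n}, \\
\ell_{\text{small}} &= \sum_{i,j,k} n_{ijk}
  \log\bigl(\hat p^{\text{Hair}}_{i}\,\hat p^{\text{Eye}}_{j}\,\hat p^{\text{Sex}}_{k}\bigr),
  \qquad
  \hat p^{\text{Hair}}_{i} = \frac{n_{i++}}{n},\quad
  \hat p^{\text{Eye}}_{j} = \frac{n_{+j+}}{n},\quad
  \hat p^{\text{Sex}}_{k} = \frac{n_{++k}}{n}, \\
\text{LRT} &= 2\,\bigl(\ell_{\text{sat}} - \ell_{\text{small}}\bigr).
\end{aligned}
```

Here `loglik_sat` corresponds to $\ell_{\text{sat}}$, `p_small` to the product of the three marginal proportions, and `loglik_small` to $\ell_{\text{small}}$.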
np_sat <- prod(dim(HairEyeColor)) -1
np_small <- sum(dim(HairEyeColor)-1)
(df <- np_sat -np_small)

## [1] 24
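The degrees of freedom are just a parameter count: the saturated model has one free probability per cell minus one, and the independence model has (number of levels - 1) free probabilities per factor. A minimal check with the table dimensions (4 hair colors, 4 eye colors, 2 sexes) written out; the variable names are illustrative.

```r
dims <- c(Hair = 4, Eye = 4, Sex = 2)

n_par_sat   <- prod(dims) - 1   # 4 * 4 * 2 - 1 = 31
n_par_small <- sum(dims - 1)    # 3 + 3 + 1 = 7

n_par_sat - n_par_small         # 24
```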
(pvalue <- 1-pchisq(LRT, df=df))

## [1] 0
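The printed 0 is a floating-point artifact rather than an exact zero: `pchisq(LRT, df)` is so close to 1 that subtracting it from 1 loses all digits. Asking for the upper tail directly avoids the cancellation. A minimal sketch using the LRT value printed above (the names `lrt`, `dof`, `p_naive`, and `p_tail` are illustrative):

```r
lrt <- 166.3001  # likelihood-ratio statistic printed above
dof <- 24        # degrees of freedom computed above

p_naive <- 1 - pchisq(lrt, df = dof)                  # cancels to exactly 0
p_tail  <- pchisq(lrt, df = dof, lower.tail = FALSE)  # tiny but positive

c(p_naive = p_naive, p_tail = p_tail)
```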
curve(dchisq(x, df=df), xlim = c(0,200),
main = sprintf("p-value = %.8f", pvalue))
abline(v = LRT)
mtext(side=1, at = LRT, text = sprintf("LRT = %d", round(LRT)))

[Figure: chi-square density dchisq(x, df = df) for x from 0 to 200, with a vertical line at LRT = 166; title: p-value = 0.00000000.]

We observed that
• the LRT is large,
• the p-value (= 0) is small (smaller than 0.05 means "significant", smaller than 0.01 "very significant", smaller
than 0.001 "very very significant", and so on).
These two observations say the same thing. We reject the null hypothesis / small model / independence model.
We can run the independence test using
HairEyeColor %>% summary()

## Number of cases in table: 592

## Number of factors: 3
## Test for independence of all factors:
## Chisq = 164.92, df = 24, p-value = 5.321e-23
## Chi-squared approximation may be incorrect
loglin(HairEyeColor, list(1,2,3))

## 2 iterations: deviation 5.684342e-14

## $lrt
## [1] 166.3001
##
## $pearson
## [1] 164.9247
##
## $df
## [1] 24
##
## $margin
## $margin[[1]]
## [1] "Hair"
##
## $margin[[2]]
## [1] "Eye"
##
## $margin[[3]]
## [1] "Sex"
mosaic(HairEyeColor, expected = ~ Hair + Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))

[Figure: mosaic plot of HairEyeColor (Hair by Eye by Sex) shaded by Pearson residuals under the independence model ~ Hair + Eye + Sex; residual scale from -4.2 to 8.0, p-value < 2.22e-16.]
Are Hair and Eye colors jointly independent of Sex?
dimnames(HairEyeColor)

## $Hair
## [1] "Black" "Brown" "Red" "Blond"
##
## $Eye
## [1] "Brown" "Blue" "Hazel" "Green"
##
## $Sex
## [1] "Male" "Female"
loglin(HairEyeColor, list(c(1,2), 3))

## 2 iterations: deviation 5.684342e-14

## $lrt
## [1] 19.85656
##
## $pearson
## [1] 19.56712
##
## $df
## [1] 15
##
## $margin
## $margin[[1]]
## [1] "Hair" "Eye"
##
## $margin[[2]]
## [1] "Sex"
MASS::loglm(~ Hair*Eye + Sex, HairEyeColor)

## Call:
## MASS::loglm(formula = ~Hair * Eye + Sex, data = HairEyeColor)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 19.85656 15 0.1775045
## Pearson 19.56712 15 0.1891745
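The 15 degrees of freedom and the likelihood-ratio p-value can be reproduced by hand. A minimal sketch: the joint-independence model has 4 × 4 - 1 free probabilities for the Hair-Eye combinations plus 2 - 1 for Sex. The variable names are illustrative.

```r
n_par_sat   <- 4 * 4 * 2 - 1          # saturated model: 31
n_par_joint <- (4 * 4 - 1) + (2 - 1)  # Hair*Eye + Sex: 16
dof <- n_par_sat - n_par_joint        # 15

# Upper-tail probability of the likelihood-ratio statistic reported above.
p_lrt <- pchisq(19.85656, df = dof, lower.tail = FALSE)
round(p_lrt, 4)
```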
The p-values are large, so we cannot reject the null (small) model. Let us check the residuals:
mosaic(HairEyeColor, expected = ~ Hair * Eye + Sex,
shade = TRUE, abbreviate = c(Sex = 1))

[Figure: mosaic plot of HairEyeColor (Hair by Eye by Sex) shaded by Pearson residuals under the model ~ Hair * Eye + Sex; residual scale from -2.1 to 2.0, p-value = 0.18917.]
There are only very mild deviations (residuals near -2), and only in the counts of blond-haired, blue-eyed people;
the model overestimates their counts (this is what the pink shading means).

Devore 647, problem 31
A random sample of smokers was obtained, and each individual was classified both with respect to gender
and with respect to the age at which he/she first started smoking. The data in the accompanying table is
consistent with summary results reported in the article “Cigarette Tar Yields in Relation to Mortality in the
Cancer Prevention Study II Prospective Cohort” (British Med. J., 2004: 72–79).
d <- c(25, 24, 18, 19, 10, 32, 17, 34) %>%
matrix(nrow = 4)

dimnames(d) <-
list(
Age = c("<16", "16-17", "18-20", ">20"),
Gender=c("Male", "Female"))

d2 <- as.table(d)
d2

## Gender
## Age Male Female
## <16 25 10
## 16-17 24 32
## 18-20 18 17
## >20 19 34
a. Calculate the proportion of males in each age category, and then do the same for females.
Based on these proportions, does it appear that there might be an association between gender and the age at
which an individual first smokes?
# cond dist of gender given age
d2 %>%
prop.table(1) %>%
round(2)

## Gender
## Age Male Female
## <16 0.71 0.29
## 16-17 0.43 0.57
## 18-20 0.51 0.49
## >20 0.36 0.64
# cond dist of age given gender
d2 %>%
prop.table(2) %>%
round(2)

## Gender
## Age Male Female
## <16 0.29 0.11
## 16-17 0.28 0.34
## 18-20 0.21 0.18
## >20 0.22 0.37
There seems to be an association between gender and age (the conditional distributions do not look similar).
Check with mosaic plots:

library(vcd)
mosaic(~ Age + Gender, d2, gp = shading_Marimekko(d2), byrow=TRUE)

[Figure: mosaic plot of Age (<16, 16-17, 18-20, >20) by Gender (Male, Female) with Marimekko shading.]

## The next does not work because, shading_Marimekko() does not realize
## that we changed order of splits in mosaic with formula option.
safely(mosaic)(formula = ~ Gender + Age, data=d2, gp = shading_Marimekko(d2))

## $result
## NULL
##
## $error
## <subscriptOutOfBoundsError in x[cbind(index, m)]: subscript out of bounds>
## The quick fix for that is to change the order of margins of table
## outside/before mosaic and substitute table with "." (lambda programming)
dimnames(d2)

## $Age
## [1] "<16" "16-17" "18-20" ">20"
##
## $Gender
## [1] "Male" "Female"
margin.table(d2, c(2,1)) %>%
mosaic(formula = ~ Gender + Age, data=., gp = shading_Marimekko(.))

[Figure: mosaic plot of Gender (Male, Female) by Age (<16, 16-17, 18-20, >20) with Marimekko shading.]

b. Carry out a test of hypotheses to decide whether there is an association between the two factors.
d2 %>% summary()

## Number of cases in table: 179

## Number of factors: 2
## Test for independence of all factors:
## Chisq = 11.589, df = 3, p-value = 0.008931
mosaic(~ Age + Gender, d2, shade = TRUE)

[Figure: mosaic plot of Age by Gender shaded by Pearson residuals; residual scale from -1.9 to 2.0, p-value = 0.0089312.]
mosaic(~ Gender + Age, expected = ~ Gender+Age,
d2, shade = TRUE)

[Figure: mosaic plot of Gender by Age shaded by Pearson residuals; residual scale from -1.9 to 2.0, p-value = 0.0089312.]
Based on the p-value (here, < .01), we reject the null hypothesis (that age and gender are independent) and
conclude that smoking starting age and gender are associated.
However, the residuals are between -2 and 2, which implies that
• the deviations of the observed frequencies from the expected frequencies under the independence model are
consistent with what we would expect from standard normal variates.
Therefore, we conclude that
• the association is weak.
Perhaps we should collect more data and repeat the analysis.
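The test statistic and the residuals shown in the mosaic legends can be reproduced from the raw counts. A minimal sketch in base R (object names are illustrative): the expected counts come from the independence model, and the Pearson residuals are (observed - expected) / sqrt(expected).

```r
counts <- matrix(c(25, 24, 18, 19, 10, 32, 17, 34), nrow = 4,
                 dimnames = list(Age = c("<16", "16-17", "18-20", ">20"),
                                 Gender = c("Male", "Female")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residuals and the chi-square statistic they add up to.
pearson_res <- (counts - expected) / sqrt(expected)
x2  <- sum(pearson_res^2)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)
p   <- pchisq(x2, df = dof, lower.tail = FALSE)

round(pearson_res, 2)
c(x2 = x2, df = dof, p = p)
```

The largest residuals in absolute value belong to the youngest age group, which is where the male and female proportions differ most.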

Devore Exercise 16 on 638

Let X = the number of adult police contacts for a randomly selected individual who previously had at least
one such contact prior to age 18. The following frequencies were calculated from information given in the
article “Examining the Prevalence of Criminal Desistance” (Criminology, 2003: 423–448); our sample size
differs slightly from what was reported because of rounding.
d <- tribble(
~x, ~f,
0, 1627,
1, 421,
2, 219,
3, 130,
4, 107,
5, 51,
6, 15,
7, 22,
8, 8,
9, 14,
10, 5,
11, 8,
12, 5,
13, 0,
14, 3,
15,2
)

d %>%
t() %>%
pander(caption = "Incidence data")

Table 4: Incidence data

x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
f 1627 421 219 130 107 51 15 22 8 14 5 8 5 0 3 2

Table 5: Incidence data II

x 0 1 2 3 4 5
f 1627 421 219 130 107 51

Look at Table 5 for the frequencies of the incidences.

1. Find MLE of Poisson mean
2. Calculate Poisson frequencies
3. Draw a bar/column graph
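For step 1, a short derivation may help. It is the standard argument, stated here as a sketch in our own notation: with $f_x$ the observed frequency of the value $x$, the Poisson log-likelihood and its maximizer are

```latex
\begin{aligned}
\ell(\lambda) &= \sum_{x} f_x \bigl( x \log\lambda - \lambda - \log x! \bigr), \\
\ell'(\lambda) &= \frac{1}{\lambda}\sum_{x} x f_x - \sum_{x} f_x = 0
\quad\Longrightarrow\quad
\hat\lambda = \frac{\sum_x x f_x}{\sum_x f_x} = \frac{2636}{2637} \approx 0.9996 .
\end{aligned}
```

That is, the MLE is the sample mean of the number of contacts.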
d

## # A tibble: 16 x 2
## x f
## <dbl> <dbl>
## 1 0 1627
## 2 1 421
## 3 2 219
## 4 3 130
## 5 4 107
## 6 5 51
## 7 6 15
## 8 7 22
## 9 8 8
## 10 9 14
## 11 10 5
## 12 11 8
## 13 12 5
## 14 13 0
## 15 14 3
## 16 15 2
lambdaMLE <- d %>%
summarize(mean = sum(x*f)/sum(f)) %>%
extract2("mean")

lambdaMLE

## [1] 0.9996208
d$f %>% sum()

## [1] 2637
# bar graph
d2 <- d %>%
mutate(fpois = sum(f)*dpois(x, lambdaMLE))

d2 %>%
gather(type, freq, -x) %>%
ggplot(aes(x,freq)) +
geom_col(aes(fill = type), position = "dodge")

[Figure: dodged column chart of the observed frequencies (f) and the fitted Poisson frequencies (fpois) against x.]
The column graph suggests that the Poisson model is not a good fit. Let us look for statistical evidence to support
our conclusion with a chi-square test.
X2 <- d2 %>%
  summarize(X2 = sum((f - fpois)^2 / fpois)) %>%
  extract2("X2")

df_saturated <- nrow(d) - 1

df_small <- 1
df_LRT <- df_saturated - df_small

(pval <- pchisq(X2, df = df_LRT, lower.tail = FALSE))

## [1] 0
1 -pchisq(X2, df = df_LRT)

## [1] 0
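The same statistic can be recomputed in base R as a sanity check. This is a minimal sketch with the frequencies typed in from the table above (names are illustrative). One parameter is estimated, so the degrees of freedom are 16 - 1 - 1 = 14. Note, as a caveat rather than a correction, that the expected Poisson counts in the upper tail are far below 1, so those cells dominate the statistic and the chi-square approximation is rough there.

```r
x <- 0:15
f <- c(1627, 421, 219, 130, 107, 51, 15, 22, 8, 14, 5, 8, 5, 0, 3, 2)

lambda_hat <- sum(x * f) / sum(f)            # MLE of the Poisson mean
f_pois     <- sum(f) * dpois(x, lambda_hat)  # expected frequencies

x2  <- sum((f - f_pois)^2 / f_pois)          # Pearson chi-square statistic
dof <- (length(x) - 1) - 1                   # saturated minus one fitted parameter

min(f_pois)                                  # smallest expected count
pchisq(x2, df = dof, lower.tail = FALSE)     # upper-tail p-value
```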
Because the p-value is practically zero, the Poisson model is rejected.
