ADA1 Notes F14
Fall 2014
Contents
0 Introduction to R, Rstudio, and ggplot
0.1 Syllabus
0.2 Rstudio
0.3 R building blocks
0.4 Plotting with ggplot2
0.4.1 Improving plots
0.5 Course Overview
3 Two-Sample Inferences
3.1 Comparing Two Sets of Measurements
3.1.1 Plotting head breadth data
3.1.2 Salient Features to Notice
3.2 Two-Sample Methods: Paired Versus Independent Samples
3.3 Two Independent Samples: CI and Test Using Pooled Variance
3.4 Satterthwaite's Method, unequal variances
3.4.1 R Implementation
3.5 One-Sided Tests
3.6 Paired Analysis
3.6.1 R Analysis
3.7 Should You Compare Means?
Chapter 0
Introduction to R, Rstudio, and ggplot
Learning objectives
identify a function or operation and describe its use
apply functions and operations to achieve a specific result
predict answers of calculations written in R
use R’s functions to get help and numerically summarize data
apply ggplot() to organize and reveal patterns visually
explain what each plotting option does
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge
6. summarize data visually, numerically, and descriptively
8. use statistical software
0.1 Syllabus
Tools for this course
Computer: Windows/Mac/Linux
Software: R, text editor (Rstudio)
Brain: scepticism, curiosity, organization, planning, execution, clarity
To do by next class:
HW 00 due Thursday (or at end of class today).
Set up R and Rstudio.
Obtain and register an iClicker.
Read Hadley Wickham's R style guide stat405.had.co.nz/r-style.html.
Read the rubric http://statacumen.com/teach/rubrics.pdf for goal setting and definitions of quality work.
Obtain (optional) text book for an R programming reference.
If you have a disability requiring accommodation, please see me and register with the UNM Accessibility Resource Center.
6 * Create/synthesize
5 * Evaluate
4 * Analyze
3 Apply
2 Understand
1 Remember
0.2 Rstudio
Quick tour Note that I changed my background to black for stealth coding
at night. . .
¹ en.wikipedia.org/wiki/Bloom's_Taxonomy
Learning the keyboard shortcuts will make your life more wonderful. (Under the Help menu.)
In particular:
1. Work out commands in the editor, then copy/paste into console (or use
Ctrl-Enter to submit current line or selection).
2. Your editor will keep a history of what you’ve done.
3. Make comments to help your future-self recall why you did what you did.
0.3 R building blocks
# a was defined just above this excerpt (e.g., a <- seq(1, 5)); printing it:
a
## [1] 1 2 3 4 5
b <- seq(15, 3, length = 5)
b
## [1] 15 12 9 6 3
c <- a * b
c
## [1] 15 24 27 24 15
Basic functions There are lots of functions available in the base package.
Type ?base and click on Index at the bottom of the help page for a complete
list of functions. Other functions to look at are in the ?stats and ?datasets
packages.
#### Basic functions
# Lots of familiar functions work
a
## [1] 1 2 3 4 5
sum(a)
## [1] 15
prod(a)
## [1] 120
mean(a)
## [1] 3
sd(a)
## [1] 1.581
var(a)
## [1] 2.5
min(a)
## [1] 1
median(a)
## [1] 3
max(a)
## [1] 5
range(a)
## [1] 1 5
Your turn!
#### Library
# each time you start R
# load package ggplot2 for its functions and datasets
library(ggplot2)
# str() shows the structure of the data.frame
# (the str(mpg) call and its first few output lines are cut off in this excerpt)
str(mpg)
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
# summary() gives frequency tables for categorical variables
# and mean and five-number summaries for continuous variables
summary(mpg)
## manufacturer model displ
## dodge :37 caravan 2wd : 11 Min. :1.60
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.40
## volkswagen:27 civic : 9 Median :3.30
## ford :25 dakota pickup 4wd : 9 Mean :3.47
## chevrolet :19 jetta : 9 3rd Qu.:4.60
## audi :18 mustang : 9 Max. :7.00
## (Other) :74 (Other) :177
## year cyl trans drv cty
## Min. :1999 Min. :4.00 auto(l4) :83 4:103 Min. : 9.0
## 1st Qu.:1999 1st Qu.:4.00 manual(m5):58 f:106 1st Qu.:14.0
## Median :2004 Median :6.00 auto(l5) :39 r: 25 Median :17.0
## Mean :2004 Mean :5.89 manual(m6):19 Mean :16.9
## 3rd Qu.:2008 3rd Qu.:8.00 auto(s6) :16 3rd Qu.:19.0
## Max. :2008 Max. :8.00 auto(l6) : 6 Max. :35.0
## (Other) :13
## hwy fl class
## Min. :12.0 c: 1 2seater : 5
## 1st Qu.:18.0 d: 5 compact :47
## Median :24.0 e: 8 midsize :41
## Mean :23.4 p: 52 minivan :11
## 3rd Qu.:27.0 r:168 pickup :33
## Max. :44.0 subcompact:35
## suv :62
#### ggplot_mpg_displ_hwy
# specify the dataset and variables
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point() # add a plot layer with points
print(p)
[Figure: scatterplot of hwy vs displ for the mpg data]
Geoms, aesthetics, and facets are three concepts we'll see in this section (see had.co.nz/ggplot2 and had.co.nz/ggplot2/geom_point.html).
[Figure: scatterplot of hwy vs displ with points coloured by class]
[Figure: scatterplot of hwy vs displ with colour = class, size = cyl, and shape = drv]
#### ggplot_mpg_displ_hwy_colour_class_size_cyl_shape_drv_alpha
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv)
, alpha = 1/4) # alpha is the opacity
print(p)
[Figure: the same plot with alpha = 1/4, so regions where points overlap appear darker]
Faceting A small multiple³ (sometimes called faceting, trellis chart, lattice chart, grid chart, or panel chart) is a series or grid of small similar graphics or charts, allowing them to be easily compared.
## two methods
# facet_grid(rows ~ cols) for 2D grid, "." for no split.
# facet_wrap(~ var) for 1D ribbon wrapped into 2D.
³ According to Edward Tufte (Envisioning Information, p. 67): “At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.”
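The two faceted plots that follow were produced by code that is not shown in this excerpt; a sketch along these lines (panel variables inferred from the figure) would reproduce them:

#### ggplot_mpg_facet (sketch; not from the original notes)
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point()
# a 2D grid of panels: rows by drive type (drv), columns by cylinders (cyl)
print(p + facet_grid(drv ~ cyl))
# a 1D ribbon of panels wrapped into 2D, one panel per vehicle class
print(p + facet_wrap(~ class))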
[Figure: faceted scatterplots of hwy vs displ — a grid of panels by drv and cyl, and a set of panels wrapped by class]
[Figure: scatterplot of hwy vs cty for the mpg data; many points coincide]
Problem: points lie on top of each other, so it’s impossible to tell how many
observations each point represents.
A solution: Jitter the points to reveal the individual points and reduce the
opacity to 1/2 to indicate when points overlap.
#### ggplot_mpg_cty_hwy_jitter
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = "jitter", alpha = 1/2)
print(p)
[Figure: jittered scatterplot of hwy vs cty with alpha = 1/2]
[Figure: hwy plotted by class in the default factor order]
A solution: Reorder the class variable by the mean hwy for a meaningful
ordering. Get help with ?reorder to understand how this works.
#### ggplot_mpg_reorder_class_hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point()
print(p)
[Figure: hwy plotted by reorder(class, hwy)]
. . . add jitter
#### ggplot_mpg_reorder_class_hwy_jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point(position = "jitter")
print(p)
[Figure: jittered points of hwy by reorder(class, hwy); classes ordered pickup, suv, minivan, 2seater, midsize, subcompact, compact]
[Figure: hwy by reorder(class, hwy) with jittered points overlaid on boxplots (the code for this panel is not shown in this excerpt)]
. . . and can easily reorder by median() instead of mean() (mean is the default)
#### ggplot_mpg_reorder_class_hwy_boxplot_jitter_median
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_point(position = "jitter")
print(p)
[Figure: boxplots of hwy by reorder(class, hwy, FUN = median) with jittered points overlaid]
One-minute paper:
Muddy Any “muddy” points — anything that doesn’t make sense yet?
Thumbs up Anything you really enjoyed or feel excited about?
0.5 Course Overview
See ADA2 Chapter 1 notes for a brief overview of all we’ll cover in this semester
of ADA1.
Chapter 1
Summarizing and
Displaying Data
Learning objectives
After completing this topic, you should be able to:
use R’s functions to get help and numerically summarize data.
apply R’s base graphics and ggplot to visually summarize data in several
ways.
explain what each plotting option does.
describe the characteristics of a data distribution.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
The sample standard deviation is the square root of the sample variance:

s² = Σi (Yi − Ȳ)²/(n − 1) = [(Y1 − Ȳ)² + (Y2 − Ȳ)² + · · · + (Yn − Ȳ)²]/(n − 1)
   = [(5 − 20)² + (9 − 20)² + · · · + (40 − 20)²]/7 = 156.3,
s  = √(s²) = 12.5.
#### variance
# y is the example data used above (reconstructed; the original definition is not shown)
y <- c(5, 9, 12, 14, 18, 30, 32, 40)
var(y)
## [1] 156.3
sd(y)
## [1] 12.5
Remark If the divisor for s² were n instead of n − 1, then the variance would be the average squared deviation of the observations from the center of the data, as measured by the mean.
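As a quick numerical contrast of the two divisors (a sketch, assuming the y vector from the example above):

#### divisor n versus n - 1 (sketch)
n <- length(y)
sum((y - mean(y))^2) / n        # average squared deviation (divisor n)
sum((y - mean(y))^2) / (n - 1)  # sample variance, identical to var(y)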
The following graphs should help you to see some physical meaning of the
sample mean and variance. If the data values were placed on a “massless”
ruler, the balance point would be the mean (20). The variance is basically the
“average” (remember n − 1 instead of n) of the total areas of all the squares
obtained when squares are formed by joining each value to the mean. In both
cases think about the implication of unusual values (outliers). What happens
to the balance point if the 40 were a 400 instead of a 40? What happens to the
squares?
The median M is the value located at the half-way point of the ordered
string. There is an even number of observations, so M is defined to be half-way
between the two middle values, 14 and 18. That is, M = 0.5(14 + 18) = 16
lb. To get the quartiles, break the data into the lower half: 5 9 12 14, and the
upper half: 18 30 32 and 40. Then
Q1 = first quartile = median of lower half of data = 0.5(9+12)=10.5 lb,
and
Q3 = third quartile = median of upper half of data = 0.5(30+32) = 31 lb.
The interquartile range is IQR = Q3 − Q1 = 31 − 10.5 = 20.5 lb.
#### quartiles
median(y)
## [1] 16
fivenum(y)
## [1] 5.0 10.5 16.0 31.0 40.0
# The quantile() function can be useful, but doesn't calculate Q1 and Q3
# as defined above, regardless of the 9 types of calculations for them!
# summary() is a combination of mean() and quantile(y, c(0, 0.25, 0.5, 0.75, 1))
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 11.2 16.0 20.0 30.5 40.0
# IQR
fivenum(y)[c(2,4)]
## [1] 10.5 31.0
fivenum(y)[4] - fivenum(y)[2]
## [1] 20.5
diff(fivenum(y)[c(2,4)])
## [1] 20.5
The quartiles, with M being the second quartile, break the data set roughly
into fourths. The first quartile is also called the 25th percentile, whereas the
median and third quartiles are the 50th and 75th percentiles, respectively. The
IQR is the range for the middle half of the data.
If you look at the data set with all eight observations, there actually are many numbers that split the data set in half, so the median is not uniquely defined¹, although “everybody” agrees to use the average of the two middle values. With quartiles there is the same ambiguity but no universal agreement on what to do about it, so R will give slightly different values for Q1 and Q3 when using summary() (and some other commands) than we just calculated, and other packages will report yet other values. This has no practical implication (all the values are “correct”) but it can appear confusing.
Example The data given below are the head breadths in mm for a sample
of 18 modern Englishmen, with numerical summaries generated by R.
#### Englishmen
hb <- c(141, 148, 132, 138, 154, 142, 150, 146, 155
, 158, 150, 140, 147, 148, 144, 150, 149, 145)
¹ The technical definition of the median for an even set of values includes the entire range between the two center values. Thus, selecting any single value in this center range is convenient, and the center of this center range is one sensible choice for the median, M.
There are four graphical summaries of primary interest: the dotplot, the
histogram, the stem-and-leaf display, and the boxplot. There are many
more possible, but these will often be useful. The plots can be customized.
Make liberal use of the help for learning how to customize them. Plots can also
be generated along with many statistical analyses, a point that we will return
to repeatedly.
1.3.1 Dotplots
The dotplot breaks the range of data into many small equal-width intervals,
and counts the number of observations in each interval. The interval count is
superimposed on the number line at the interval midpoint as a series of dots,
usually one for each observation. In the head breadth data, the intervals are
centered at integer values, so the display gives the number of observations at
each distinct observed head breadth.
A dotplot of the head breadth data is given below. Of the examples below,
the R base graphics stripchart() with method="stack" resembles the traditional
dotplot.
#### stripchart-ggplot
# stripchart (dotplot) using R base graphics
# 3 rows, 1 column
par(mfrow=c(3,1))
stripchart(hb, main="Modern Englishman", xlab="head breadth (mm)")
stripchart(hb, method="stack", cex=2
, main="larger points (cex=2), method is stack")
stripchart(hb, method="jitter", cex=2, frame.plot=FALSE
, main="no frame, method is jitter")
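# The grid.arrange() call below refers to ggplot objects p1-p3 whose definitions
# are not shown in this excerpt; a sketch that would produce comparable dotplots
# (binwidth and titles assumed) is:
hb_df <- data.frame(hb = hb)
p1 <- ggplot(hb_df, aes(x = hb))
p1 <- p1 + geom_dotplot(binwidth = 1)
p1 <- p1 + labs(title = "Modern Englishman head breadth", x = "head breadth (mm)")
p2 <- ggplot(hb_df, aes(x = hb))
p2 <- p2 + geom_dotplot(binwidth = 1, stackdir = "center")
p2 <- p2 + labs(title = "Modern Englishman head breadth, stackdir=center"
              , x = "head breadth (mm)")
p3 <- ggplot(hb_df, aes(x = hb))
p3 <- p3 + geom_dotplot(binwidth = 1, stackdir = "centerwhole")
p3 <- p3 + labs(title = "stackdir=centerwhole", x = "head breadth (mm)")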
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: stripchart dotplots of the head breadth data (default, stacked with cex=2, and jittered without a frame) and the corresponding ggplot dotplots (default and stackdir=center)]
1.3.2 Histogram
The histogram and stem-and-leaf displays are similar, breaking the range
of data into a smaller number of equal-width intervals. This produces graphical
information about the observed distribution by highlighting where data values
cluster. The histogram can use arbitrary intervals, whereas the intervals for the
stem-and-leaf display use the base 10 number system. There is more arbitrari-
ness to histograms than to stem-and-leaf displays, so histograms can sometimes
be regarded a bit suspiciously.
#### hist
# histogram using R base graphics
# par() gives graphical options
# mfrow = "multifigure by row" with 1 row and 3 columns
par(mfrow=c(1,3))
# main is the title, xlab is x-axis label (ylab also available)
hist(hb, main="Modern Englishman", xlab="head breadth (mm)")
# breaks suggests the number of bins (R treats it as a suggestion)
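# (assumed continuation: the par(mfrow=c(1,3)) layout above implies further
#  hist() calls that were lost at a page break; a sketch, with the breaks
#  values and titles assumed, is:)
hist(hb, breaks = 15, xlab = "hb", main = "more breaks")        # more, narrower bins
hist(hb, freq = FALSE, breaks = 8, xlab = "hb", main = "density scale")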
[Figure: histograms of the head breadth data with different bin choices (frequency and density scales), plus a ggplot histogram]
R allows you to modify the graphical display. For example, with the his-
togram you might wish to use different midpoints or interval widths. I will let
you explore the possibilities.
##
## 132 | 0
## 134 |
## 136 |
## 138 | 0
## 140 | 00
## 142 | 0
## 144 | 00
## 146 | 00
## 148 | 000
## 150 | 000
## 152 |
## 154 | 00
## 156 |
## 158 | 0
The data values are always truncated so that a leaf has one digit. The leaf
unit (location of the decimal point) tells us the degree of round-off. This will
become clearer in the next example.
Of the three displays, which is the most informative? I think the middle
option is best to see the clustering and shape of distributions of numbers.
The boxplot breaks up the range of data values into regions about the center
of the data, measured by the median. The boxplot highlights outliers and
provides a visual means to assess “normality”. The following help entry
outlines the construction of the boxplot, given the placement of data values on
the axis.
The endpoints of the box are placed at the locations of the first and third
quartiles. The location of the median is identified by the line in the box. The
whiskers extend to the data points closest to but not on or outside the outlier
fences, which are 1.5IQR from the quartiles. Outliers are any values on or
outside the outlier fences.
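As a sketch of this arithmetic in R (using the head breadth data hb from above; note that R's quartile conventions differ slightly from the hand calculation):

#### boxplot fences (sketch)
q   <- quantile(hb, c(0.25, 0.75))                         # Q1 and Q3
iqr <- as.numeric(diff(q))                                 # interquartile range
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)  # outlier fences
# boxplot.stats() reports the whisker ends and any points flagged as outliers
boxplot.stats(hb)$stats
boxplot.stats(hb)$out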
The boxplot for the head breadth data is given below. There are a lot
of options that allow you to clutter the boxplot with additional information.
Just use the default settings. We want to see the relative location of data (the
median line), have an idea of the spread of data (IQR, the length of the box),
see the shape of the data (relative distances of components from each other –
to be covered later), and identify outliers (if present). The default boxplot has
all these components.
Note that the boxplots below are horizontal to better fit on the page. The
horizontal=TRUE and coord_flip() commands do this.
#### boxplot
fivenum(hb)
## [1] 132.0 142.0 147.5 150.0 158.0
# boxplot using R base graphics
par(mfrow=c(1,1))
boxplot(hb, horizontal=TRUE
, main="Modern Englishman", xlab="head breadth (mm)")
[Figure: boxplot of the head breadth data]
ClickerQ s — Boxplots
# histogram
hist(hb, freq = FALSE
, main="Histogram with kernel density plot, Modern Englishman")
# Histogram overlaid with kernel density curve
points(density(hb), type = "l")
# rug of points under histogram
rug(hb)
# violin plot
library(vioplot)
vioplot(hb, horizontal=TRUE, col="gray")
title("Violin plot, Modern Englishman")
# boxplot
boxplot(hb, horizontal=TRUE
, main="Boxplot, Modern Englishman", xlab="head breadth (mm)")
[Figure: histogram with kernel density plot, violin plot, and boxplot of the Modern Englishman head breadth data]
Example: income The data below are incomes in $1000 units for a sample
of 12 retired couples. Numerical and graphical summaries are given. There are
two stem-and-leaf displays provided. The first is the default display.
#### Income examples
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
# sort in decreasing order
income <- sort(income, decreasing = TRUE)
income
## [1] 1110 46 12 8 7 7 7 5 5 2 2 0
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 4.2 7.0 101.0 9.0 1110.0
# stem-and-leaf plot
stem(income)
##
## The decimal point is 3 digit(s) to the right of the |
##
## 0 | 00000000000
## 0 |
## 1 | 1
Because of the two large outliers, I trimmed them to get a sense of the shape of the distribution where most of the observations are.
#### remove largest
# remove two largest values (the first two)
income2 <- income[-c(1,2)]
income2
## [1] 12 8 7 7 7 5 5 2 2 0
summary(income2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.75 6.00 5.50 7.00 12.00
# stem-and-leaf plot
stem(income2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 022
## 0 | 557778
## 1 | 2
# scale=2 makes the plot roughly twice as long (more stems)
stem(income2, scale=2)
##
## The decimal point is at the |
##
## 0 | 0
## 2 | 00
## 4 | 00
## 6 | 000
## 8 | 0
## 10 |
## 12 | 0
Boxplots with full data, then incrementally removing the two largest outliers.
#### income-boxplot
# boxplot using R base graphics
# 1 row, 3 columns
par(mfrow=c(1,3))
boxplot(income, main="Income")
boxplot(income[-1], main="(remove largest)")
boxplot(income2, main="(remove 2 largest)")
[Figure: boxplots of income — full data, largest value removed, and two largest removed]
1.4 Interpretation of Graphical Displays for
Numerical Data
In many studies, the data are viewed as a subset or sample from a larger
collection of observations or individuals under study, called the population.
A primary goal of many statistical analyses is to generalize the information in
the sample to infer something about the population. For this generalization
to be possible, the sample must reflect the basic patterns of the population.
There are several ways to collect data to ensure that the sample reflects the
basic properties of the population, but the simplest approach, by far, is to
take a random or “representative” sample from the population. A random
sample has the property that every possible sample of a given size has the
same chance of being the sample (eventually) selected (though we often do this
only once). Random sampling eliminates any systematic biases associated with
the selected observations, so the information in the sample should accurately
reflect features of the population. The process of sampling introduces random
variation or random errors associated with summaries. Statistical tools are used
to calibrate the size of the errors.
Whether we are looking at a histogram (or stem-and-leaf, or dotplot) from
a sample, or are conceptualizing the histogram generated by the population
data, we can imagine approximating the “envelope” around the display with a
smooth curve. The smooth curve that approximates the population histogram
is called the population frequency curve or population probability
density function or population distribution². Statistical methods for
inference about a population usually make assumptions about the shape of the
population frequency curve. A common assumption is that the population has
a normal frequency curve. In practice, the observed data are used to assess
the reasonableness of this assumption. In particular, a sample display should
resemble a population display, provided the collected data are a random or rep-
resentative sample from the population. Several common shapes for frequency
distributions are given below, along with the statistical terms used to describe them.

² “Distribution function” often refers to the “cumulative distribution function”, which is a different (but one-to-one related) function than what I mean here.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)
# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")
# boxplot
boxplot(x1, horizontal=TRUE)
## ggplot
# Histogram overlaid with kernel density curve
x1_df <- data.frame(x1)
p1 <- ggplot(x1_df, aes(x = x1))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..)
, binwidth=5
, colour="black", fill="white")
# Overlay with transparent density plot
# (the geom_density() line appears to have been lost at a page break; e.g.,
#  p1 <- p1 + geom_density(alpha = 0.1, fill = "gray"))
# violin plot
p2 <- ggplot(x1_df, aes(x = "x1", y = x1))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x1_df, aes(x = "x1", y = x1))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x1]
stem(x1)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 9
## 6 | 4
## 6 | 5889
## 7 | 3333344
## 7 | 578888899
## 8 | 01111122222223344444
## 8 | 55555666667777888889999999
## 9 | 000111111122222233333344
## 9 | 5555555556666666677777888888899999999
## 10 | 00000111222222233333344444
## 10 | 555555555666666667777777788888999999999
## 11 | 0000011111122233444
## 11 | 566677788999
## 12 | 00001123444
## 12 | 5679
## 13 | 00022234
## 13 | 6
## 14 | 3
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)
points(density(x2), type = "l")
rug(x2)
# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")
# boxplot
boxplot(x2, horizontal=TRUE)
# violin plot
p2 <- ggplot(x2_df, aes(x = "x2", y = x2))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x2_df, aes(x = "x2", y = x2))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x2; the boxplot flags many points in both tails as outliers]
summary(x2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
# violin plot
library(vioplot)
vioplot(x3, horizontal=TRUE, col="gray")
# boxplot
boxplot(x3, horizontal=TRUE)
# violin plot
p2 <- ggplot(x3_df, aes(x = "x3", y = x3))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x3_df, aes(x = "x3", y = x3))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x3]
summary(x3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.4 75.1 97.4 99.5 125.0 150.0
sd(x3)
## [1] 28.87
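# skewness() and kurtosis() are not in base R; they come from an add-on
# package (e.g., e1071 or moments) loaded earlier in the full notes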
skewness(x3)
## [1] 0.0481
kurtosis(x3)
## [1] -1.225
stem(x3)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 011122223344
## 5 | 5666677778
## 6 | 000111123334
## 6 | 55667788999
## 7 | 0000112233344444
## 7 | 5556677788999
## 8 | 0000111222223333444
## 8 | 577888899
## 9 | 00001222333444
## 9 | 5566777777888888999
## 10 | 122333334
## 10 | 5566799
## 11 | 012223334
## 11 | 566667777889
## 12 | 01222333334444
## 12 | 5566788899
## 13 | 000000112223334444
## 13 | 556668999
## 14 | 00001111222333444
## 14 | 557888999
## 15 | 0
The mean and median are identical in a population with an (exactly) symmetric
frequency curve. The histogram and stem-and-leaf displays for a sample selected
from a symmetric population will tend to be fairly symmetric. Further, the
sample means and medians will likely be close.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)
# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")
# boxplot
boxplot(x4, horizontal=TRUE)
# violin plot
p2 <- ggplot(x4_df, aes(x = "x4", y = x4))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x4_df, aes(x = "x4", y = x4))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x4]
summary(x4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005 0.333 0.744 1.080 1.590 7.060
sd(x4)
## [1] 1.044
skewness(x4)
## [1] 1.756
kurtosis(x4)
## [1] 4.708
stem(x4)
##
## The decimal point is at the |
##
## 0 | 00000000000000111111111111111111111111112222222222222233333333334444
## 0 | 555555555555555555555666666666666666666677777778888999999999999
## 1 | 0000000000011111111222222333333344444
## 1 | 55556666666667778888899999
## 2 | 0000000111111222233344
## 2 | 66688
## 3 | 022333333444
## 3 | 569
## 4 |
## 4 | 6
## 5 | 2
## 5 |
## 6 |
## 6 |
## 7 | 1
Unimodal, skewed left The distribution below is unimodal and skewed
to the left. The two examples show that extremely skewed distributions often
contain outliers in the longer tail of the distribution.
#### Unimodal, skewed left
# sample from a left-skewed distribution: 15 minus an Exponential(rate = 0.5) sample
x5 <- 15 - rexp(250, rate = 0.5)
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)
# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")
# boxplot
boxplot(x5, horizontal=TRUE)
# violin plot
p2 <- ggplot(x5_df, aes(x = "x5", y = x5))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x5_df, aes(x = "x5", y = x5))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x5]
summary(x5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.99 12.30 13.60 13.00 14.40 15.00
sd(x5)
## [1] 2.03
skewness(x5)
## [1] -1.713
kurtosis(x5)
## [1] 2.96
stem(x5)
##
## The decimal point is at the |
##
## 4 |
## 5 | 017
## 6 | 05567
## 7 | 55
## 8 | 15
## 9 | 0112467889
## 10 | 0012234566688
## 11 | 00122346667777788
## 12 | 001111222233444555555556667778888999
## 13 | 00000000111111122223333334445555666666666777888888999999
## 14 | 00000000000111111111112222223333333333444444444445555555555555666666+19
## 15 | 0000000
Bimodal (multi-modal) Not all distributions are unimodal. The distribu-
tion below has two modes or peaks, and is said to be bimodal. Distributions
with three or more peaks are called multi-modal.
#### Bimodal (multi-modal)
# sample from a mixture of two normal distributions (means 100 and 150)
x6 <- c(rnorm(150, mean = 100, sd = 15), rnorm(150, mean = 150, sd = 15))
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x6, freq = FALSE, breaks = 20)
points(density(x6), type = "l")
rug(x6)
# violin plot
library(vioplot)
vioplot(x6, horizontal=TRUE, col="gray")
# boxplot
boxplot(x6, horizontal=TRUE)
# violin plot
p2 <- ggplot(x6_df, aes(x = "x6", y = x6))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x6_df, aes(x = "x6", y = x6))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of x6]
summary(x6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.1 102.0 124.0 126.0 153.0 204.0
sd(x6)
## [1] 28.84
skewness(x6)
## [1] -0.01518
kurtosis(x6)
## [1] -1.113
stem(x6)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 6 | 139
## 7 | 0145677
## 8 | 001112233355677778899
## 9 | 001112223333344455566667777788899999
## 10 | 0001112222222333344444555555666677777777888999
## 11 | 00011111334556667777888899999
## 12 | 112333344556778889
## 13 | 0011122235555677777
## 14 | 000001122333344445666777777788999
## 15 | 000000012233333334444555555555555566666778889
## 16 | 00000011122222222333444555666789
## 17 | 0113355677
## 18 |
## 19 |
## 20 | 4
The boxplot and histogram or stem-and-leaf display (or dotplot) are used
together to describe the distribution. The boxplot does not provide infor-
mation about modality – it only tells you about skewness and the presence of
outliers.
As noted earlier, many statistical methods assume the population frequency
curve is normal. Small deviations from normality usually do not dramatically
influence the operating characteristics of these methods. We worry most when
the deviations from normality are severe, such as extreme skewness or heavy
tails containing multiple outliers.
Chapter 2
Estimation in One-Sample Problems
Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully communicate properties of a
sample.
assess the assumptions of the one-sample t-test visually.
decide whether the mean of a population is different from a hypothesized
value.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
[Diagram: inference from a sample Y1, Y2, …, Yn back to a population — a huge set of values of which we can see very little — whose mean µ and standard deviation σ are unknown]
There are two main methods that are used for inferences on µ: confidence
intervals (CI) and hypothesis tests. The standard CI and test procedures
are based on the sample mean and the sample standard deviation, denoted by
s.
The standard error of the sample mean is SEȲ = s/√n, where s is the sample standard deviation (i.e., the sample-based estimate of the standard deviation of the population), and n is the size (number of observations) of the sample.
In probability theory, the law of large numbers (LLN) is a theorem
that describes the result of performing the same experiment a large number of
times. According to the law, the average of the results obtained from a large
number of trials (the sample mean, Ȳ ) should be close to the expected value
(the population mean, µ), and will tend to become closer as more trials are
performed.
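A minimal sketch of the LLN in R (illustrative only, not part of the original notes):

#### law of large numbers (sketch)
# the running mean of Uniform(0,1) draws settles toward the population mean 1/2
u <- runif(10000)
running.mean <- cumsum(u) / seq_along(u)
running.mean[c(10, 100, 1000, 10000)]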
In probability theory, the central limit theorem (CLT) states that,
given certain conditions, the mean of a sufficiently large number of independent
random variables, each with finite mean and variance, will be approximately
normally distributed¹.
As a joint illustration of these concepts, consider drawing random variables following a Uniform(0, 1) distribution, that is, any value in the interval [0, 1] is equally likely. By definition, the mean of this distribution is µ = 1/2 and the variance is σ² = 1/12 (so the standard deviation is σ = √(1/12) = 0.289). Therefore, if we draw a sample of size n, then the standard error of the mean will be σ/√n, and as n gets larger the distribution of the mean will increasingly follow a normal distribution. We illustrate this by drawing N = 10000 samples of size n, plotting those N means, computing the expected and observed SEM, and checking how well the histogram of the sampled means follows a normal distribution. Notice, indeed, that even with samples as small as 2 and 6, the properties of the SEM and the distribution are as predicted.

¹ The central limit theorem has a number of variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions, provided they comply with certain conditions.
#### Illustration of Central Limit Theorem, Uniform distribution
# demo.clt.unif(N, n)
# draws N samples of size n from Uniform(0,1)
# and plots the N means with a normal distribution overlay
demo.clt.unif <- function(N, n) {
# draw sample in a matrix with N columns and n rows
sam <- matrix(runif(N*n, 0, 1), ncol=N);
# calculate the mean of each column
sam.mean <- colMeans(sam)
# the sd of the mean is the SEM
sam.se <- sd(sam.mean)
# calculate the true SEM given the sample size n
true.se <- sqrt((1/12)/n)
# draw a histogram of the means
hist(sam.mean, freq = FALSE, breaks = 25
, main = paste("True SEM =", round(true.se, 4)
, ", Est SEM = ", round( sam.se, 4))
, xlab = paste("n =", n))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(0, 1, length = 1000)
points(x, dnorm(x, mean = 0.5, sd = true.se), type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
}
par(mfrow=c(2,2));
demo.clt.unif(10000, 1);
demo.clt.unif(10000, 2);
demo.clt.unif(10000, 6);
demo.clt.unif(10000, 12);
[Figure: histograms of the 10000 sample means from Uniform(0,1) for n = 1, 2, 6, and 12, with normal density overlays; the estimated SEMs closely match the true SEMs (e.g., 0.2893 vs 0.2887 for n = 1, 0.2006 vs 0.2041 for n = 2)]
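The exponential version of the demonstration, demo.clt.exp(), is defined in the full notes but not in this excerpt; a sketch by analogy with demo.clt.unif() above (an Exponential(rate = 1) population has mean 1 and standard deviation 1, so its true SEM is 1/sqrt(n)):

#### Illustration of Central Limit Theorem, Exponential distribution (sketch)
demo.clt.exp <- function(N, n) {
  # draw N samples of size n from Exponential(rate = 1)
  sam <- matrix(rexp(N*n, rate = 1), ncol=N)
  sam.mean <- colMeans(sam)
  sam.se <- sd(sam.mean)          # observed SEM
  true.se <- sqrt(1/n)            # true SEM
  hist(sam.mean, freq = FALSE, breaks = 25
     , main = paste("True SEM =", round(true.se, 4)
                  , ", Est SEM = ", round(sam.se, 4))
     , xlab = paste("n =", n))
  points(density(sam.mean), type = "l")
  x <- seq(0, max(sam.mean), length = 1000)
  points(x, dnorm(x, mean = 1, sd = true.se), type = "l", lwd = 2, col = "red")
  rug(sam.mean)
}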
par(mfrow=c(2,2));
demo.clt.exp(10000, 1);
demo.clt.exp(10000, 6);
demo.clt.exp(10000, 30);
demo.clt.exp(10000, 100);
[Figure: histograms of the 10000 sample means from an exponential distribution for n = 1, 6, 30, and 100, with normal overlays; True SEM = 1 and Est SEM = 1.006 for n = 1, True SEM = 0.4082 and Est SEM = 0.4028 for n = 6]
Note well that the further the population distribution is from being nor-
mal, the larger the sample size is required to be for the sampling distribution
of the sample mean to be normal. If the population distribution is normal,
what’s the minimum sample size for the sampling distribution of the mean to
be normal?
For more examples, try:
#### More examples for Central Limit Theorem can be illustrated with this code
# install.packages("TeachingDemos")
library(TeachingDemos)
# look at examples at bottom of the help page
?clt.examp
2.1.2 t-distribution
[Figure: density curves on (−5, 5) comparing the t-distribution with the standard normal (y-axis labelled dnorm(x))]
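A sketch that draws this comparison (degrees of freedom chosen for illustration):

#### t-distributions vs. standard normal (sketch)
x <- seq(-5, 5, length = 1000)
plot(x, dnorm(x), type = "l", lwd = 2, ylab = "density")  # standard normal
lines(x, dt(x, df = 3), lty = 2)    # t with 3 df: heavier tails
lines(x, dt(x, df = 10), lty = 3)   # t with 10 df: closer to normal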
2.2 CI for µ
Statistical inference provides methods for drawing conclusions about a pop-
ulation from sample data. In this chapter, we want to make a claim about
population mean µ given sample statistics Ȳ and s.
A CI for µ is a range of plausible values for the unknown population mean
µ, based on the observed data, of the form “Best Guess ± Reasonable Error of
the Guess”. To compute a CI for µ:
1. Define the population parameter, “Let µ = mean [characteristic] for
population of interest”.
2. Specify the confidence coefficient, which is a number between 0 and
100%, in the form 100(1 − α)%. Solve for α. (For example, 95% has
α = 0.05.)
3. Compute the t-critical value: tcrit = t0.5α such that the area under the
t-curve (df = n − 1) to the right of tcrit is 0.5α. See appendix or internet
for a t-table.
4. Report the CI in the form Ȳ ± tcrit SEȲ or as an interval (L, U). The desired CI has lower and upper endpoints given by L = Ȳ − tcrit SEȲ and U = Ȳ + tcrit SEȲ, respectively, where SEȲ = s/√n is the standard error of the sample mean.
5. Assess method assumptions (see below).
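A sketch of steps 2–4 in R, for a generic numeric sample y and 95% confidence (the data vector below is only a placeholder):

#### CI for mu, by hand and with t.test() (sketch)
y      <- c(5, 9, 12, 14, 18, 30, 32, 40)      # placeholder data
alpha  <- 0.05
n      <- length(y)
t.crit <- qt(1 - alpha/2, df = n - 1)          # t-critical value
SE     <- sd(y) / sqrt(n)                      # standard error of the mean
mean(y) + c(-1, 1) * t.crit * SE               # the interval (L, U)
t.test(y, conf.level = 1 - alpha)$conf.int     # same interval from t.test()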
In practice, the confidence coefficient is large, say 95% or 99%, which corre-
spond to α = 0.05 and 0.01, respectively. The value of α expressed as a percent
is known as the error rate of the CI.
The CI is determined once the confidence coefficient is specified and the data
are collected. Prior to collecting the data, the interval is unknown and is viewed
as random because it will depend on the actual sample selected. Different
samples give different CIs. The “confidence” in, say, the 95% CI (which has
a 5% error rate) can be interpreted as follows. If you repeatedly sample the
population and construct 95% CIs for µ, then 95% of the intervals will
contain µ, whereas 5% will not. The interval you construct from your data
will either cover µ, or it will not.
The length of the CI is U − L = 2 tcrit SEȲ.
[Figure: many simulated confidence intervals drawn as horizontal segments (Index on the vertical axis, Confidence Interval, roughly 9–12, on the horizontal)]
ClickerQ s — CI for µ, 2
# example data, skewed --- try others datasets to develop your intuition
x <- rgamma(10, shape = .5, scale = 20)
bs.one.samp.dist(x)
[Figure: the example data and the bootstrap sampling distribution of the mean]
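bs.one.samp.dist() is a helper defined elsewhere in the full notes; a minimal sketch of the idea it illustrates (bootstrapping the sampling distribution of the mean of x) is:

#### bootstrap sampling distribution of the mean (sketch)
B <- 10000
boot.means <- replicate(B, mean(sample(x, replace = TRUE)))
hist(boot.means, freq = FALSE, breaks = 25
   , main = "Bootstrap sampling distribution of the mean")
points(density(boot.means), type = "l")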
The ages (in years) at first transplant for a sample of 11 patients are: 54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49.
Summaries for the data are: n = 11, Ȳ = 51.27, and s = 8.26, so that SEȲ = 8.26/√11 = 2.4904. The degrees of freedom are df = 11 − 1 = 10.
3. Specify confidence level, find critical value, calculate limits
Let us calculate a 95% CI for µ. For a 95% CI α = 0.05, so we need to
find tcrit = t0.025, which is 2.228. Now tcritSEȲ = 2.228 × 2.4904 = 5.55.
The lower limit on the CI is L = 51.27 − 5.55 = 45.72. The upper limit is
U = 51.27 + 5.55 = 56.82.
4. Summarize in words For example, I am 95% confident that the
population mean age at first transplant is 51.3 ± 5.55, that is, between 45.7
and 56.8 years (rounding off to 1 decimal place).
5. Check assumptions We will see this in several pages; the sampling distribution is reasonably normal.
[Figure: two-sided rejection region — reject H0 if ts ≤ −tcrit or ts ≥ tcrit, with area α/2 in each tail and 1 − α in the middle]
2.3.1 P-values
The p-value, or observed significance level for the test, provides a mea-
sure of plausibility for H0. Smaller values of the p-value imply that H0 is less
plausible. To compute the p-value for a two-sided test, you find the area under the t-curve (df = n − 1) beyond ±ts in both tails, as pictured below.
[Figure: two-sided p-value — area p-value/2 shaded in each tail, beyond −ts and beyond ts]
The p-value is the total shaded area, or twice the area in either tail. A use-
ful interpretation of the p-value is that it is the chance of obtaining data
favoring HA by this much or more if H0 actually is true. Another interpre-
tation is that
the p-value is the probability of observing a sample mean at
least as extreme as the one observed assuming µ0 from H0 is
the true population mean.
If the p-value is small, then the sample we obtained would be pretty unusual if H0 were true — but we actually got the sample, and it probably is not very unusual, so we conclude that H0 is false (the sample would not be unusual if HA were true).
Most, if not all, statistical packages summarize hypothesis tests with a p-
value, rather than a decision (i.e., reject or not reject at a given α level). You
can make a decision to reject or not reject H0 for a size α test based on the
p-value as follows — reject H0 if the p-value is less than α. This decision is
identical to that obtained following the formal rejection procedure given earlier.
The reason for this is that the p-value can be interpreted as the smallest value
you can set the size of the test and still reject H0 given the observed data.
There are a lot of terms to keep straight here. α and tcrit are constants
we choose (actually, one determines the other so we really only choose one,
2.3: Hypothesis Testing for µ 77
usually α) to set how rigorous evidence against H0 needs to be. ts and the
p-value (again, one determines the other) are random variables because they
are calculated from the random sample. They are the evidence against H0.
ts = (Ȳ − µ0)/SEȲ = (51.27 − 50)/2.4904 = 0.51.
Since tcrit = 2.228, we do not reject H0 using a 5% test. Notice the placement
of ts relative to tcrit in the picture below. Equivalently, the p-value for the test
is 0.62, thus we fail to reject H0 because 0.62 > 0.05 = α. The results of
the hypothesis test should not be surprising, since the CI tells you that 50 is a
plausible value for the population mean age at transplant. Note: All you can
say is that the data could have come from a distribution with a mean of 50 —
this is not convincing evidence that µ actually is 50.
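As a quick numerical check (a sketch using the values just computed), the two-sided p-value is twice the t tail area beyond |ts|:

#### p-value from the t statistic (sketch)
ts <- 0.51
2 * pt(-abs(ts), df = 10)   # about 0.62, matching the text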
[Figure: ts = 0.51 falls between −tcrit = −2.228 and tcrit = 2.228, so do not reject H0; the total shaded area beyond ±0.51 is the p-value, 0.62]
# violin plot
library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")
[Figure: histogram with density curve and violin plot of the age data]
# stem-and-leaf plot
stem(age, scale=2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 3 | 3
## 3 |
## 4 | 2
## 4 | 99
## 5 | 1444
## 5 | 68
## 6 | 4
# t.crit
qt(1 - 0.05/2, df = length(age) - 1)
## [1] 2.228
# look at help for t.test
?t.test
# defaults include: alternative = "two.sided", conf.level = 0.95
t.summary <- t.test(age, mu = 50)
t.summary
##
## One Sample t-test
##
## data: age
## t = 0.5111, df = 10, p-value = 0.6204
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 45.72 56.82
## sample estimates:
## mean of x
## 51.27
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.0 49.0 54.0 51.3 55.0 64.0
The assumption of normality of the sampling distribution appears reasonably close, judging from the bootstrap discussed earlier. Therefore, the results for the t-test above can be trusted.
bs.one.samp.dist(age)
[Figure: the age data and the bootstrap sampling distribution of the mean]
Aside: To print the shaded region for the p-value, you can use the result of
t.test() with the function t.dist.pval() defined here.
# Function to plot the t-distribution with the p-value region shaded
t.dist.pval <- function(t.summary) {
par(mfrow=c(1,1))
lim.extreme <- max(4, abs(t.summary$statistic) + 0.5)
lim.lower <- -lim.extreme;
lim.upper <- lim.extreme;
x.curve <- seq(lim.lower, lim.upper, length=200)
y.curve <- dt(x.curve, df = t.summary$parameter)
plot(x.curve, y.curve, type = "n"
, ylab = paste("t-dist( df =", signif(t.summary$parameter, 3), ")")
, xlab = paste("t-stat =", signif(t.summary$statistic, 5)
, ", Shaded area is p-value =", signif(t.summary$p.value, 5)))
if ((t.summary$alternative == "less")
| (t.summary$alternative == "two.sided")) {
x.pval.l <- seq(lim.lower, -abs(t.summary$statistic), length=200)
y.pval.l <- dt(x.pval.l, df = t.summary$parameter)
polygon(c(lim.lower, x.pval.l, -abs(t.summary$statistic))
, c(0, y.pval.l, 0), col="gray")
}
if ((t.summary$alternative == "greater")
| (t.summary$alternative == "two.sided")) {
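    # (the remainder of the function was lost at a page break; a completion
    #  mirroring the 'less'/'two.sided' branch above would be:)
    x.pval.u <- seq(abs(t.summary$statistic), lim.upper, length=200)
    y.pval.u <- dt(x.pval.u, df = t.summary$parameter)
    polygon(c(abs(t.summary$statistic), x.pval.u, lim.upper)
          , c(0, y.pval.u, 0), col="gray")
  }
  # finally, draw the t density curve over the shaded region(s)
  points(x.curve, y.curve, type = "l", lwd = 2)
}
# usage (assumed): t.dist.pval(t.summary)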
[Figure: t-distribution with the two-sided p-value region shaded, for the age example]
Aside: Note that the t.summary object returned from t.test() includes a
number of quantities that might be useful for additional calculations.
names(t.summary)
## [1] "statistic" "parameter" "p.value" "conf.int"
## [5] "estimate" "null.value" "alternative" "method"
## [9] "data.name"
t.summary$statistic
## t
## 0.5111
t.summary$parameter
## df
## 10
t.summary$p.value
## [1] 0.6204
t.summary$conf.int
## [1] 45.72 56.82
## attr(,"conf.level")
## [1] 0.95
t.summary$estimate
## mean of x
## 51.27
t.summary$null.value
## mean
## 50
t.summary$alternative
## [1] "two.sided"
t.summary$method
## [1] "One Sample t-test"
t.summary$data.name
## [1] "age"
# violin plot
library(vioplot)
vioplot(toco, horizontal=TRUE, col="gray")
[Figure: histogram with density curve and violin plot of the toco data]
# stem-and-leaf plot
stem(toco, scale=2)
##
## The decimal point is at the |
##
## 1 | 5
## 2 | 479
## 3 | 068
## 4 | 03
## 5 | 6
## 6 | 27
# t.crit (the qt() call appears to have been lost at a page break; it would be
#  qt(1 - 0.05/2, df = length(toco) - 1), about 2.201 for df = 11)
[Figure: for the toco data — histogram with density curve, bootstrap sampling distribution of the mean, and t-distribution (df = 11) with shaded p-value; t-stat = 7.3366, p-value = 1.473e-05; n = 12, mean = 3.8917, se = 0.45684]
Let us piece together these ideas for the meteorite problem. Evolutionary
history predicts µ = 0.54. A scientist examining the validity of the theory is
trying to decide whether µ = 0.54 or µ ≠ 0.54. Good scientific practice dictates
that rejecting another's claim when it is true is more serious than not being able
to reject it when it is false. This is consistent with defining H0 : µ = 0.54 (the
status quo) and HA : µ ≠ 0.54. To convince yourself, note that the implications
of a Type-I error would be to claim the evolutionary theory is false when it is
true, whereas a Type-II error would correspond to not being able to refute the
evolutionary theory when it is false. With this setup, the scientist will refute
the theory only if the data overwhelmingly suggest that it is false.
ts = (Ȳ − µ0)/SEȲ

[Figure: rejection regions for .05 and .01 level tests — tcrit = ±2.201 and ±3.106, respectively]
The critical value is computed so that the area under the t-probability curve
(with df = n − 1) outside ±tcrit is α, with 0.5α in each tail. Reducing α
makes tcrit larger. That is, reducing the size of the test makes rejecting H0
harder because the rejection region is smaller. A pictorial representation is
given above for the Tocopilla data, where µ0 = 0.54, n = 12, and df = 11.
Note that tcrit = 2.201 and 3.106 for α = 0.05 and 0.01, respectively.
The mathematics behind the test presumes that H0 is true. Given the data,
you use ts = (Ȳ − µ0)/SEȲ
to measure how far Ȳ is from µ0, relative to the spread in the data given by
SEȲ . For ts to be in the rejection region, Ȳ must be significantly above or
below µ0, relative to the spread in the data. To see this, note that rejection
rule can be expressed as: Reject H0 if |Ȳ − µ0| ≥ tcrit SEȲ.
The rejection rule is sensible because Ȳ is our best guess for µ. You would
reject H0 : µ = µ0 only if Ȳ is so far from µ0 that you would question the
reasonableness of assuming µ = µ0. How far Ȳ must be from µ0 before you
reject H0 depends on α (i.e., how willing you are to reject H0 if it is true), and
on the value of SEȲ . For a given sample, reducing α forces Ȳ to be further
from µ0 before you reject H0. For a given value of α and s, increasing n allows
smaller differences between Ȳ and µ0 to be statistically significant (i.e.,
lead to rejecting H0). In problems where small differences between Ȳ and µ0
lead to rejecting H0, you need to consider whether the observed differences are
important.
In essence, the t-distribution provides an objective way to calibrate whether
the observed Ȳ is typical of what sample means look like when sampling from
a normal population where H0 is true. If all other assumptions are satisfied,
and Ȳ is inordinately far from µ0, then our only recourse is to conclude that
H0 must be incorrect.
[Figure: one-sided tests — if ts falls between −tcrit and tcrit then p-value > α; for an upper-tail test the p-value is the area to the right of ts, for a lower-tail test it is the area to the left of ts]
ClickerQ s — One-sided tests on µ
par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(tomato, freq = FALSE, breaks = 6)
points(density(tomato), type = "l")
rug(tomato)
# violin plot
library(vioplot)
vioplot(tomato, horizontal=TRUE, col="gray")
# t.crit
qt(1 - 0.05/2, df = length(tomato) - 1)
## [1] 2.16
t.summary <- t.test(tomato, mu = 20, alternative = "less")
t.summary
##
## One Sample t-test
##
## data: tomato
## t = -0.9287, df = 13, p-value = 0.185
## alternative hypothesis: true mean is less than 20
## 95 percent confidence interval:
## -Inf 20.29
## sample estimates:
## mean of x
## 19.68
summary(tomato)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.5 18.8 19.8 19.7 20.4 22.5
[Figure: for the tomato data — histogram with density curve and violin plot; t-distribution (df = 13) with shaded p-value and bootstrap sampling distribution of the mean; t-stat = −0.92866, p-value = 0.18499; n = 14, mean = 19.679, se = 0.34612]
ClickerQ s — P-value
Chapter 3
Two-Sample Inferences
Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent popula-
tions.
assess the assumptions of the two-sample t-test visually.
decide whether the means between two populations are different.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
## 6 142 English
## 7 150 English
## 8 146 English
## 9 155 English
## 10 158 English
## 11 150 English
## 12 140 English
## 13 147 English
## 14 148 English
## 15 144 English
## 16 150 English
## 17 149 English
## 18 145 English
## 19 133 Celts
## 20 138 Celts
## 21 130 Celts
## 22 138 Celts
## 23 134 Celts
## 24 127 Celts
## 25 128 Celts
## 26 138 Celts
## 27 136 Celts
## 28 131 Celts
## 29 126 Celts
## 30 120 Celts
## 31 124 Celts
## 32 132 Celts
## 33 132 Celts
## 34 125 Celts
[Figure: dotplots/stripcharts of head breadth by group (Celts, English) on a common axis]
2. Boxplots for comparison are most helpful when plotted in the same axes.
[Figure: side-by-side boxplots of head breadth (mm) by Group (Celts, English) on the same axis.]
3. Histograms are hard to compare unless you make the scale and actual
bins the same for both. Why is the pair on the right clearly preferable?
# common x-axis limits based on the range of the entire data set
hist(hb$HeadBreadth[(hb$Group == "Celts")], xlim = range(hb$HeadBreadth),
main = "Head breadth, Celts", xlab = "head breadth (mm)")
hist(hb$HeadBreadth[(hb$Group == "English")], xlim = range(hb$HeadBreadth),
main = "Head breadth, English", xlab = "head breadth (mm)")
[Figures: histograms of head breadth for Celts and English; the pair drawn with a common x-axis range and common bins is much easier to compare than the pair with default axes.]
##
## 120 | 0
## 122 |
## 124 | 00
## 126 | 00
## 128 | 0
## 130 | 00
## 132 | 000
## 134 | 0
## 136 | 0
## 138 | 000
## 3rd Qu.:150
## Max. :158
Example The English and Celt head breadth samples are independent.
Example Suppose you are interested in whether the CaCO3 (calcium car-
bonate) level in the Atrisco well field, which is the water source for Albuquerque,
is changing over time. To answer this question, the CaCO3 level was recorded
at each of 15 wells at two time points. These data are paired. The two samples
are the observations at Times 1 and 2.
Example Suppose you are interested in whether the husband or wife is typ-
ically the heavier smoker among couples where both adults smoke. Data are
collected on households. You measure the average number of cigarettes smoked
by each husband and wife within the sample of households. These data are
paired, i.e., you have selected husband-wife pairs as the basis for the samples.
It is reasonable to believe that the responses within a pair are related, or cor-
related.
Although the focus here will be on comparing population means, you should
recognize that in paired samples you may also be interested, as in the problems
above, in how observations compare within a pair. That is, a paired compar-
ison might be interested in the difference between the two paired samples.
These goals need not agree, depending on the questions of interest. Note that
with paired data, the sample sizes are equal, and equal to the number of pairs.
3.4.1 R Implementation
R does the pooled and Satterthwaite (Welch) analyses, either on stacked or
unstacked data. The output will contain a p-value for a two-sided test of equal
population means and a CI for the difference in population means. If you
include var.equal = TRUE you will get the pooled method, otherwise the output
is for Satterthwaite’s method.
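For instance, with the head breadth samples stored in the vectors celts and english (as in the example below), the two calls would look like this; a minimal sketch:
# pooled-variance two-sample t-test
t.test(celts, english, var.equal = TRUE)
# Satterthwaite (Welch) method is the default
t.test(celts, english)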
Example: Head Breadths The English and Celts are independent sam-
ples. We looked at boxplots and histograms, which suggested that the normality
assumption for the t-test is reasonable. The R output shows the English and
110 Ch 3: Two-Sample Inferences
Celt sample standard deviations and IQRs are fairly close, so the pooled and
Satterthwaite results should be comparable. The pooled analysis is preferable
here, but either is appropriate.
We are interested in the difference in mean head breadths between Celts and
English.
1. Define the population parameters and hypotheses in words
and notation
Let µ1 and µ2 be the mean head breadth for the Celts and English, respec-
tively.
In words: “The difference in population means between Celts and English is
different from zero mm.”
In notation: H0 : µ1 = µ2 versus HA : µ1 ≠ µ2.
Alternatively: H0 : µ1 − µ2 = 0 versus HA : µ1 − µ2 ≠ 0.
2. Calculate summary statistics from sample
Mean, standard deviation, sample size:
#### Calculate summary statistics
m1 <- mean(celts)
s1 <- sd(celts)
n1 <- length(celts)
m2 <- mean(english)
s2 <- sd(english)
n2 <- length(english)
c(m1, s1, n1)
## [1] 130.750 5.434 16.000
c(m2, s2, n2)
## [1] 146.500 6.382 18.000
The pooled standard deviation, standard error, and degrees-of-freedom are:
sdpool <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
sdpool
## [1] 5.957
SEpool <- sdpool * sqrt(1 / n1 + 1 / n2)
SEpool
## [1] 2.047
dfpool <- n1 + n2 - 2
dfpool
## [1] 32
t_pool <- (m1 - m2) / SEpool
t_pool
## [1] -7.695
The Satterthwaite SE and degrees-of-freedom are:
SE_Sat <- sqrt(s1^2 / n1 + s2^2 / n2)
SE_Sat
## [1] 2.027
df_Sat <- (SE_Sat^2)^2 / (s1^4 / (n1^2 * (n1 - 1)) + s2^4 / (n2^2 * (n2 - 1)))
df_Sat
## [1] 31.95
t_Sat <- (m1 - m2) / SE_Sat
t_Sat
## [1] -7.77
## 130.8 146.5
The form of the output will tell you which sample corresponds to population
1 and which corresponds to population 2.
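For reference, a minimal sketch of how the hand-computed pooled quantities above translate into the two-sided p-value and 95% CI; the results should agree, up to rounding, with the summary below.
# two-sided p-value and 95% CI from the pooled quantities computed above
p.pool  <- 2 * pt(-abs(t_pool), df = dfpool)
ci.pool <- (m1 - m2) + c(-1, 1) * qt(1 - 0.05/2, df = dfpool) * SEpool
p.pool
ci.pool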
4. Summarize in words (Using the pooled-variance results.)
The pooled analysis strongly suggests that H0 : µ1 − µ2 = 0 is false, given
the large t-statistic of −7.7 and two-sided p-value of 9 × 10^−9. Because
the p-value < 0.05 we reject the Null hypothesis in favor of the Alternative
hypothesis, concluding that the population mean head breadths of the Celts
and English are different.
We are 95% confident that the difference in population means, µ1 − µ2, is
between −19.9 and −11.6 mm. That is, we are 95% confident that the
population mean head breadth for Englishmen (µ2) exceeds the population
mean head breadth for Celts (µ1) by between 11.6 and 19.9 mm.
The CI interpretation is made easier by recognizing that we concluded the
population means are different, so the direction of difference must be con-
sistent with that seen in the observed data, where the sample mean head
breadth for Englishmen exceeds that for the Celts. Thus, the limits on the
CI for µ1 − µ2 tell us how much smaller the mean is for the Celts (that is,
between −19.9 and −11.6 mm).
5. Check assumptions
The assumption of equal population variances will be left to a later chapter.
We can test the assumption that the distribution of Ȳ1 − Ȳ2 is normal using
the bootstrap in the following function.
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
  n1 <- length(dat1);
  n2 <- length(dat2);
  # resample from data
  sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
  sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
  # calculate the means and take difference between populations
  sam1.mean <- colMeans(sam1);
  sam2.mean <- colMeans(sam2);
  diff.mean <- sam1.mean - sam2.mean;
  # save par() settings
  old.par <- par(no.readonly = TRUE)
  # make smaller margins
  par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
  # Histogram of sample 1 overlaid with kernel density curve
  hist(dat1, freq = FALSE, breaks = 6
     , main = paste("Sample 1", "\n"
                  , "n =", n1
                  , ", mean =", signif(mean(dat1), digits = 5)
                  , ", sd =", signif(sd(dat1), digits = 5))
     , xlim = range(c(dat1, dat2)))
  points(density(dat1), type = "l")
  rug(dat1)
  # Histogram of sample 2 overlaid with kernel density curve
  hist(dat2, freq = FALSE, breaks = 6
     , main = paste("Sample 2", "\n"
                  , "n =", n2
                  , ", mean =", signif(mean(dat2), digits = 5)
                  , ", sd =", signif(sd(dat2), digits = 5))
     , xlim = range(c(dat1, dat2)))
  points(density(dat2), type = "l")
  rug(dat2)
  # Histogram of the bootstrap differences in means,
  #   overlaid with kernel density curve
  hist(diff.mean, freq = FALSE, breaks = 25
     , main = "Bootstrap sampling distribution of the difference in means")
  points(density(diff.mean), type = "l")
  rug(diff.mean)
  # restore par() settings
  par(old.par)
}
The distribution of difference in means in the third plot looks very close to
normal.
bs.two.samp.diff.dist(celts, english)
[Figure: bootstrap comparison for the head breadth data. Panel 1: Sample 1, n = 16, mean = 130.75, sd = 5.4345. Panel 2: Sample 2, n = 18, mean = 146.5, sd = 6.3824. Panel 3: bootstrap sampling distribution of the difference in means (diff.mean).]
ClickerQ s — t-interval, STT.08.02.010
## 18 84 women
## 19 73 women
## 20 66 women
## 21 70 women
## 22 35 women
## 23 77 women
## 24 73 women
## 25 56 women
## 26 112 women
## 27 56 women
## 28 84 women
## 29 80 women
## 30 101 women
## 31 66 women
## 32 84 women
# numerical summaries
by(andro, sex, summary)
## sex: men
## level sex
## Min. : 59.0 men :14
## 1st Qu.: 72.5 women: 0
## Median :118.5
## Mean :112.5
## 3rd Qu.:132.8
## Max. :217.0
## ----------------------------------------------------
## sex: women
## level sex
## Min. : 35.0 men : 0
## 1st Qu.: 67.0 women:18
## Median : 77.0
## Mean : 75.8
## 3rd Qu.: 84.0
## Max. :112.0
c(sd(men), sd(women), IQR(men), IQR(women), length(men), length(women))
## [1] 42.75 17.24 60.25 17.00 14.00 18.00
p <- ggplot(andro, aes(x = sex, y = level, fill=sex))
p <- p + geom_boxplot()
# add a "+" at the mean
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
#p <- p + coord_flip()
p <- p + labs(title = "Androstenedione Levels in Diabetics")
print(p)
[Figures: Androstenedione Levels in Diabetics boxplots and count histograms by sex; bootstrap comparison panels (Sample 2: n = 18, mean = 75.833, sd = 17.236) with the bootstrap distribution of diff.mean; t-distribution with shaded p-value.]
3.5 One-Sided Tests
One-sided tests for two-sample problems are where the null hypothesis is H0 :
µ1 − µ2 = 0 but the alternative is directional, either HA : µ1 − µ2 < 0 (i.e.,
µ1 < µ2) or HA : µ1 − µ2 > 0 (i.e., µ1 > µ2). Once you understand the
general form of rejection regions and p-values for one-sample tests, the one-
sided two-sample tests do not pose any new problems. Use the t-statistic, with
the appropriate tail of the t-distribution to define critical values and p-values.
One-sided two-sample tests are directly implemented in R, by specifying the
type of test with alternative = "less" or alternative = "greater". One-sided
confidence bounds are given with the one-sided tests.
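A minimal sketch, reusing the head breadth vectors celts and english from the previous section:
# one-sided alternatives for the two-sample t-test
t.test(celts, english, alternative = "less", var.equal = TRUE)   # HA: mu1 - mu2 < 0, pooled
t.test(celts, english, alternative = "greater")                  # HA: mu1 - mu2 > 0, Welch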
3.6 Paired Analysis
With paired data, inferences on µ1 − µ2 are based on the sample of differences
within pairs. By taking differences within pairs, two dependent samples are
transformed into one sample, which contains the relevant information for infer-
ences on µd = µ1 − µ2. To see this, suppose the observations within a pair are
Y1 and Y2. Then within each pair, compute the difference d = Y1 − Y2:
d1 = Y11 − Y21
d2 = Y12 − Y22
...
dn = Y1n − Y2n
If the Y1 data are from a population with mean µ1 and the Y2 data are from a
population with mean µ2, then the ds are a sample from a population with mean
µd = µ1 − µ2. Furthermore, if the sample of differences comes from a normal
population, then we can use standard one-sample techniques on d1, . . . , dn to
test µd = 0 (that is, µ1 = µ2), and to get a CI for µd = µ1 − µ2.
Let d̄ = (1/n) Σi di = Ȳ1 − Ȳ2 be the sample mean of the differences (which
also equals the difference in the sample means), and let sd be the sample
standard deviation of the differences. The standard error of d̄ is SEd̄ = sd/√n,
where n is the number of pairs. The paired t-test (two-sided) CI for µd is given
by d̄ ± tcrit SEd̄. To test H0 : µd = 0 (µ1 = µ2) against HA : µd ≠ 0 (µ1 ≠ µ2),
use
ts = (d̄ − 0) / SEd̄.
The most natural way to enter paired data is as two columns, one for each
treatment group. You can then create a new column of differences, and do the
usual one-sample graphical and inferential analysis on this column of differences,
or you can do the paired analysis directly without this intermediate step.
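A minimal sketch of the two equivalent routes, using the twin IQ data frame iq (with columns genetic and foster) analyzed below:
# route 1: create the differences, then do a one-sample analysis
iq$diff <- iq$genetic - iq$foster
t.test(iq$diff, mu = 0)
# route 2: the paired analysis directly
t.test(iq$genetic, iq$foster, paired = TRUE)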
[Figure: scatterplot of genetic vs foster IQ scores for the twin pairs.]
This plot of IQ scores shows that scores are related within pairs of twins.
This is consistent with the need for a paired analysis.
Given the sample of differences, I created a boxplot and a stem and leaf
display, neither of which showed marked deviation from normality. The boxplot
is centered at zero, so one would not be too surprised if the test result is
insignificant.
p1 <- ggplot(iq, aes(x = diff))
p1 <- p1 + scale_x_continuous(limits=c(-20,+20))
# vertical line at 0
p1 <- p1 + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p1 <- p1 + geom_histogram(aes(y=..density..)
, binwidth=5
, colour="black", fill="white")
# Overlay with transparent density plot
p1 <- p1 + geom_density(alpha=0.1, fill="#FF6666")
p1 <- p1 + geom_point(aes(y = -0.005)
, position = position_jitter(height = 0.001)
, alpha = 1/5)
# violin plot
p2 <- ggplot(iq, aes(x = "diff", y = diff))
p2 <- p2 + scale_y_continuous(limits=c(-20,+20))
p2 <- p2 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p2 <- p2 + geom_violin(fill = "gray50", alpha=1/2)
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(iq, aes(x = "diff", y = diff))
p3 <- p3 + scale_y_continuous(limits=c(-20,+20))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
[Figures: histogram with density and jittered points, violin plot, and boxplot of the IQ differences (diff); t-distribution (df = 26) with shaded p-value and bootstrap sampling distribution of the mean. t-stat = 0.12438, shaded area is p-value = 0.90197. Data: n = 27, mean = 0.18519, se = 1.48884.]
Alternatively, I can generate the test and CI directly from the raw data in
two columns, specifying paired=TRUE. This gives the following output, which
leads to identical conclusions to the earlier analysis.
# two-sample paired t-test
t.summary <- t.test(iq$genetic, iq$foster, paired=TRUE)
t.summary
##
## Paired t-test
##
## data: iq$genetic and iq$foster
## t = 0.1244, df = 26, p-value = 0.902
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.875 3.246
## sample estimates:
## mean of the differences
## 0.1852
You might ask why I tortured you by doing the first analysis, which re-
quired creating and analyzing the sample of differences, when the alternative
and equivalent second analysis is so much easier. (A later topic deals with
non-parametric analyses of paired data for which the differences must be first
computed.)
d <- b - a;
sleep <- data.frame(a, b, d)
sleep
## a b d
## 1 0.7 1.9 1.2
## 2 -1.6 0.8 2.4
## 3 -0.2 1.1 1.3
## 4 -1.2 0.1 1.3
## 5 0.1 -0.1 -0.2
## 6 3.4 4.4 1.0
## 7 3.7 5.5 1.8
## 8 0.8 1.6 0.8
## 9 0.0 4.6 4.6
## 10 2.0 3.0 1.0
# scatterplot of a and b IQs, with 1:1 line
p <- ggplot(sleep, aes(x = a, y = b))
# draw a 1:1 line, dots above line indicate "b > a"
p <- p + geom_abline(intercept=0, slope=1, alpha=0.2)
p <- p + geom_point()
# make the axes square so it's a fair visual comparison
p <- p + coord_equal()
p <- p + labs(title = "Sleep hours gained on two sleep remedies: a vs b")
print(p)
[Figure: scatterplot of sleep hours gained on remedy a (x) vs remedy b (y) with a 1:1 reference line; most points lie above the line, indicating b > a.]
There is evidence here against the normality assumption of the sample mean.
We'll continue anyway (in practice we'd instead use a nonparametric method
from a later chapter).
p1 <- ggplot(sleep, aes(x = d))
p1 <- p1 + scale_x_continuous(limits=c(-5,+5))
# vertical line at 0
p1 <- p1 + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p1 <- p1 + geom_histogram(aes(y=..density..)
, binwidth=1
, colour="black", fill="white")
# Overlay with transparent density plot
p1 <- p1 + geom_density(alpha=0.1, fill="#FF6666")
p1 <- p1 + geom_point(aes(y = -0.01)
, position = position_jitter(height = 0.005)
, alpha = 1/5)
p1 <- p1 + labs(title = "Difference of sleep hours gained: d = b - a")
# violin plot
p2 <- ggplot(sleep, aes(x = "d", y = d))
p2 <- p2 + scale_y_continuous(limits=c(-5,+5))
p2 <- p2 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p2 <- p2 + geom_violin(fill = "gray50", alpha=1/2)
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(sleep, aes(x = "d", y = d))
p3 <- p3 + scale_y_continuous(limits=c(-5,+5))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
bs.one.samp.dist(sleep$d)
[Figures: histogram with density and jittered points, violin plot, and boxplot of d = b − a (Difference of sleep hours gained); plot of the data with smoothed density curve and bootstrap sampling distribution of the mean.]
##
## One Sample t-test
##
## data: sleep$d
## t = 3.78, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.6102 2.4298
## sample estimates:
## mean of x
## 1.52
# plot t-distribution with shaded p-value
t.dist.pval(t.summary)
[Figure: t-distribution (df = 9) with the two-sided p-value shaded.]
Chapter 4
Checking Assumptions
Learning objectives
After completing this topic, you should be able to:
assess the assumptions visually and via formal tests.
Achieving these goals contributes to mastery in these course learning outcomes:
10. Model assumptions.
4.1 Introduction
Almost all statistical methods make assumptions about the data collection pro-
cess and the shape of the population distribution. If you reject the null hypoth-
esis in a test, then a reasonable conclusion is that the null hypothesis is false,
provided all the distributional assumptions made by the test are satisfied. If the
assumptions are not satisfied then that alone might be the cause of rejecting
H0. Similarly, a failure to reject H0 could be due solely to a failure to satisfy
the assumptions. Hence, you should always check assumptions to the best of
your abilities.
Two assumptions that underlie the tests and CI procedures that I have
discussed are that the data are a random sample, and that the population fre-
quency curve is normal. For the pooled variance two-sample test the population
variances are also required to be equal.
The random sample assumption can often be assessed from an understand-
ing of the data collection process. Unfortunately, there are few general tests for
checking this assumption. I have described exploratory (mostly visual) methods
to assess the normality and equal variance assumptions. I will now discuss
formal methods to assess these assumptions.
4.2 Testing Normality
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)
# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")
# boxplot
boxplot(x1, horizontal=TRUE)
[Figures: histogram with density curve, violin plot, and boxplot of x1.]
There are many ways to get adequate QQ plots. Consider how outliers
show up in the QQ plot. There may be isolated points on the ends of the QQ
plot, but only on the right side is there an outlier. How could you have identified
that the right tail looks longer than the left tail from the QQ plot?
#### QQ plots
# R base graphics
par(mfrow=c(1,1))
# plots the data vs their normal scores
qqnorm(x1)
# plots the reference line
qqline(x1)
# ggplot2 graphics
library(ggplot2)
# http://had.co.nz/ggplot2/stat_qq.html
df <- data.frame(x1)
# stat_qq() below requires "sample" to be assigned a data.frame column
p <- ggplot(df, aes(sample = x1))
# plots the data vs their normal scores
p <- p + stat_qq()
print(p)
[Figures: normal QQ plots of x1 from base graphics (Sample Quantiles vs Theoretical Quantiles) and from ggplot2 (sample vs theoretical).]
If you lay a straightedge along the bulk of the plot (putting in a regression
line is not the right way to do it, even if it is easy), you see that the most
extreme point on the right is a little below the line, and the last few points
on the left a little above the line. What does this mean? The straight line is
where expected and actual coincide. In the right tail, a point above the line is
more extreme (larger) than expected from a normal distribution, and a point
below the line is less extreme. In the left tail the directions flip: a point below
the line is more extreme (smaller) than expected, and a point above the line
is less extreme. Here the leftmost points sit above the line, so the left tail is
shorter than a normal left tail, which is why the right tail looks relatively
longer than the left.
Even more useful is to add confidence intervals (point-wise, not family-wise
— you will learn the meaning of those terms in the ANOVA section). You
don’t expect a sample from a normally distributed population to have a normal
scores plot that falls exactly on the line, and the amount of deviation depends
upon the sample size.
The best QQ plot function I have found is qqPlot() in the car package.
Note that with the dist= option you can use this technique to see whether the
data appear to come from many possible distributions, not just the normal.
par(mfrow=c(1,1))
# Normality of Residuals
library(car)
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
[Figure: car::qqPlot of x1 ("QQ Plot") with pointwise confidence bands and the most extreme observations labelled (65, 31, 86, 111); x-axis: norm quantiles.]
In this case the x-axis is labelled “norm quantiles”. You only see a couple
of data values outside the limits (in the tails, where it usually happens). You
expect around 5% outside the limits, so there is no indication of non-normality
here. I did sample from a normal population.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)
points(density(x2), type = "l")
rug(x2)
# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")
# boxplot
boxplot(x2, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x2, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
[Figures: histogram with density curve, violin plot, and boxplot of x2, and car::qqPlot of x2 with confidence bands (x-axis: norm quantiles).]
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x3, freq = FALSE, breaks = 20)
points(density(x3), type = "l")
rug(x3)
# violin plot
library(vioplot)
vioplot(x3, horizontal=TRUE, col="gray")
# boxplot
boxplot(x3, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x3, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
[Figures: histogram with density curve, violin plot, and boxplot of x3, and car::qqPlot of x3 with confidence bands (x-axis: norm quantiles).]
Right-skewed (Exponential)
#### Right-skewed (Exponential)
# sample from exponential distribution
x4 <- rexp(150, rate = 1)
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)
# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")
# boxplot
boxplot(x4, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x4, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
[Figures: histogram with density curve, violin plot, and boxplot of x4, and car::qqPlot of x4 with confidence bands (x-axis: norm quantiles).]
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)
# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")
# boxplot
boxplot(x5, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x5, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
[Figures: histogram with density curve, violin plot, and boxplot of x5, and car::qqPlot of x5 with confidence bands (x-axis: norm quantiles).]
Notice how striking is the lack of linearity in the QQ plot for all the non-
normal distributions, particularly the symmetric light-tailed distribution where
the boxplot looks fairly good. The QQ plot is a sensitive measure of normality.
Let us summarize the patterns we see regarding tails in the plots:
Tail Weight   Left tail                          Right tail
Light         Left side of plot points left      Right side of plot points right
Heavy         Left side of plot points down      Right side of plot points up
##
##  Anderson-Darling normality test
##
## data:  x4
## A = 9.371, p-value < 2.2e-16
# lillie.test(x4)
cvm.test(x4)
library(nortest)
ad.test(x5)
##
##  Anderson-Darling normality test
##
## data:  x5
## A = 6.002, p-value = 7.938e-15
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(sleep$d, freq = FALSE, breaks = 20)
points(density(sleep$d), type = "l")
rug(sleep$d)
# violin plot
library(vioplot)
vioplot(sleep$d, horizontal=TRUE, col="gray")
# boxplot
boxplot(sleep$d, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
qqPlot(sleep$d, las = 1, id.n = 4, id.cex = 1, lwd = 1, main="QQ Plot")
## 9 5 2 8
## 10 1 9 2
[Figures: histogram with density curve, violin plot, and boxplot of sleep$d, and car::qqPlot of sleep$d with the most extreme observations labelled (9, 5, 2, 8); x-axis: norm quantiles.]
library(gridExtra)
grid.arrange(p1, p2, ncol=1)
# QQ plot
par(mfrow=c(2,1))
qqPlot(men, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Men")
qqPlot(women, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Women")
[Figures: Androstenedione Levels in Diabetics boxplots by sex with car::qqPlot panels for men and women; faceted count histograms and QQ-style panels of simulated samples (panels numbered 6-25).]
where s²pooled is the pooled estimator of variance and s²i is the estimated variance
based on the ith sample.
Large values of Bobs suggest that the population variances are unequal. For
a size α test, we reject H0 if Bobs ≥ χ²k−1,crit, where χ²k−1,crit is the upper-α
percentile for the χ²k−1 (chi-squared) probability distribution with k − 1 degrees
of freedom. A generic plot of the χ² distribution is given below. A p-value for
the test is given by the area under the chi-squared curve to the right of Bobs.
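A minimal sketch of the critical value and p-value calculations in R; the number of groups k and the observed statistic B.obs below are hypothetical placeholders:
# chi-squared critical value and p-value for Bartlett's statistic (sketch)
k     <- 3          # hypothetical number of groups
B.obs <- 0.31       # hypothetical observed Bartlett statistic
qchisq(1 - 0.05, df = k - 1)    # reject H0 if B.obs is at least this large
1 - pchisq(B.obs, df = k - 1)   # p-value: area to the right of B.obs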
Chapter 5
One-Way Analysis of Variance
Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent popula-
tions.
assess the assumptions of the ANOVA visually and by formal tests.
decide whether the means between populations are different, and how.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
5.1 ANOVA
The one-way analysis of variance (ANOVA) is a generalization of the two
sample t-test to k ≥ 2 groups. Assume that the populations of interest have
the following (unknown) population means and standard deviations:
population 1 population 2 · · · population k
mean µ1 µ2 ··· µk
std dev σ1 σ2 ··· σk
A usual interest in ANOVA is whether µ1 = µ2 = · · · = µk . If not, then we
wish to know which means differ, and by how much. To answer these questions
we select samples from each of the k populations, leading to the following data
summary:
sample 1 sample 2 · · · sample k
size n1 n2 ··· nk
mean Ȳ1 Ȳ2 ··· Ȳk
std dev s1 s2 ··· sk
A little more notation is needed for the discussion. Let Yij denote the j th
observation in the ith sample and define the total sample size n∗ = n1 + n2 +
· · · + nk . Finally, let Ȳ¯ be the average response over all samples (combined),
that is
Ȳ¯ = (Σij Yij) / n∗ = (Σi ni Ȳi) / n∗.
Note that Ȳ¯ is not the average of the sample means, unless the sample sizes
ni are equal.
An F -statistic is used to test H0 : µ1 = µ2 = · · · = µk against HA : not H0
(that is, at least two means are different). The assumptions needed for the
standard ANOVA F -test are analogous to the independent pooled two-sample
t-test assumptions: (1) Independent random samples from each population. (2)
The population frequency curves are normal. (3) The populations have equal
standard deviations, σ1 = σ2 = · · · = σk .
The F -test is computed from the ANOVA table, which breaks the spread in
the combined data set into two components, or Sums of Squares (SS). The
Within SS, often called the Residual SS or the Error SS, is the portion
of the total spread due to variability within samples:
SS(Within) = (n1 − 1)s1² + (n2 − 1)s2² + · · · + (nk − 1)sk² = Σij (Yij − Ȳi)².
The Between SS, often called the Model SS, measures the spread between
the sample means
SS(Between) = n1(Ȳ1 − Ȳ¯)² + n2(Ȳ2 − Ȳ¯)² + · · · + nk(Ȳk − Ȳ¯)² = Σi ni(Ȳi − Ȳ¯)²,
weighted by the sample sizes. These two SS add to give
SS(Total) = SS(Between) + SS(Within) = Σij (Yij − Ȳ¯)².
Each SS has its own degrees of freedom (df ). The df (Between) is the number
of groups minus one, k − 1. The df (Within) is the total number of observations
minus the number of groups: (n1 − 1) + (n2 − 1) + · · · + (nk − 1) = n∗ − k.
These two df add to give df (Total) = (k − 1) + (n∗ − k) = n∗ − 1.
The Sums of Squares and df are neatly arranged in a table, called the
ANOVA table:
Source                    df             SS                         MS               F
Between Groups (Model)    dfM = k − 1    SSM = Σi ni (Ȳi − Ȳ¯)²     MSM = SSM/dfM    MSM/MSE
Within Groups (Error)     dfE = n∗ − k   SSE = Σi (ni − 1) si²      MSE = SSE/dfE
Total                     dfT = n∗ − 1   SST = Σij (Yij − Ȳ¯)²      MST = SST/dfT
The Mean Square for each source of variation is the corresponding SS divided
by its df . The Mean Squares can be easily interpreted. The MS(Within) =
SS(Within)/(n∗ − k) is a pooled estimate of the common population variance,
and the MS(Between),
MS(Between) = Σi ni (Ȳi − Ȳ¯)² / (k − 1),
measures the spread among the sample means. The F -statistic is
Fs = MS(Between) / MS(Within).
Large values of Fs indicate large variability among the sample means Ȳ1, Ȳ2, . . . , Ȳk
relative to the spread of the data within samples. That is, large values of Fs
suggest that H0 is false.
Formally, for a size α test, reject H0 if Fs ≥ Fcrit, where Fcrit is the upper-α
percentile from an F distribution with numerator degrees of freedom k − 1 and
denominator degrees of freedom n∗ − k (i.e., the df for the numerators and
denominators in the F -ratio). The p-value for the test is the area under the
F -probability curve to the right of Fs:
[Figure: F probability curve with FCrit marked; H0 is rejected for FS values at or beyond FCrit, and the p-value is the area to the right of FS.]
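A minimal sketch of these ANOVA table quantities computed "by hand" in R, assuming a long-format data frame fat.long with columns amount and type as used in the doughnut example below:
# ANOVA table quantities by hand (sketch; fat.long is assumed from below)
k     <- nlevels(factor(fat.long$type))
nstar <- nrow(fat.long)
ybar  <- mean(fat.long$amount)                          # grand mean
ni    <- tapply(fat.long$amount, fat.long$type, length)
mi    <- tapply(fat.long$amount, fat.long$type, mean)
vi    <- tapply(fat.long$amount, fat.long$type, var)
SSB   <- sum(ni * (mi - ybar)^2)                        # Between (Model) SS
SSW   <- sum((ni - 1) * vi)                             # Within (Error) SS
MSB   <- SSB / (k - 1)
MSW   <- SSW / (nstar - k)
Fs    <- MSB / MSW
1 - pf(Fs, k - 1, nstar - k)                            # p-value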
If you don’t specify variable.name, it will name that column “variable”, and
if you leave out value.name, it will name that column “value”.
From long to wide: Use dcast() from the reshape2 package.
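A minimal, self-contained sketch of both directions with a small hypothetical data frame:
library(reshape2)
# hypothetical wide-format data: one row per id, one column per group
dat.wide <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(4, 5, 6))
# wide to long
dat.long <- melt(dat.wide, id.vars = "id",
                 variable.name = "group", value.name = "y")
# long back to wide
dat.wide2 <- dcast(dat.long, id ~ group, value.var = "y")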
Now that we’ve got our data in the long format, let’s return to the ANOVA.
[Figure: boxplots of amount absorbed (g) by fat type (fat1-fat4).]
The p-value for the F -test is 0.001. The scientist would reject H0 at any
of the usual test levels (such as, 0.05 or 0.01). The data suggest that the
population mean absorption rates differ across fats in some way. The F -test
does not say how they differ. The pooled standard deviation spooled = 8.18 is
the “Residual standard error”. We’ll ignore the rest of this output for now.
fit.f <- aov(amount ~ type, data = fat.long)
summary(fit.f)
## Df Sum Sq Mean Sq F value Pr(>F)
ts = (Ȳi − Ȳj) / ( spooled √(1/ni + 1/nj) ).
The minimum absolute difference between Ȳi and Ȳj needed to reject H0 is the
LSD, the quantity on the right hand side of this inequality. If all the sample sizes
are equal n1 = n2 = · · · = nk then the LSD is the same for each comparison:
LSD = tcrit spooled √(2/n1),
where n1 is the common sample size.
I will illustrate Fisher’s method on the doughnut data, using α = 0.05. At
the first step, you reject the hypothesis that the population mean absorptions
are equal because p-value= 0.001. At the second step, compare all pairs of fats
at the 5% level. Here, spooled = 8.18 and tcrit = 2.086 for a two-sided test based
on 20 df (the df E for Residual SS). Each sample has six observations, so the
LSD for each comparison is
LSD = 2.086 × 8.18 × √(2/6) = 9.85.
Any two sample means that differ by at least 9.85 in magnitude are signifi-
cantly different at the 5% level.
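A quick check of this LSD calculation in R, using the pooled SD and df quoted above:
# LSD for the doughnut comparisons (values from above)
s.pooled <- 8.18
t.crit   <- qt(1 - 0.05/2, df = 20)   # about 2.086
LSD      <- t.crit * s.pooled * sqrt(2 / 6)
LSD                                    # about 9.85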
An easy way to compare all pairs of fats is to order the samples by their
sample means. The samples can then be grouped easily, noting that two fats
are in the same group if the absolute difference between their sample means is
smaller than the LSD.
Fats Sample Mean
2 185.00
3 174.83
1 174.50
4 162.00
There are six comparisons of two fats. From this table, you can visually
assess which sample means differ by at least the LSD=9.85, and which ones do
not. For completeness, the table below summarizes each comparison:
To see why you must be careful when interpreting groupings, suppose you obtain
two groups in a three sample problem. One group has samples 1 and 3. The
other group has samples 3 and 2:
1 3 2
-----------
-----------
This occurs, for example, when |Ȳ1 − Ȳ2| ≥ LSD, but both |Ȳ1 − Ȳ3| and
|Ȳ3 − Ȳ2| are less than the LSD. There is a tendency to conclude, and please
try to avoid this line of attack, that populations 1 and 3 have the same mean,
populations 2 and 3 have the same mean, but populations 1 and 2 have different
means. This conclusion is illogical. The groupings imply that we have sufficient
evidence to conclude that population means 1 and 2 are different, but insufficient
evidence to conclude that population mean 3 differs from either of the other
population means.
There are c = k(k − 1)/2 pairs of means to compare in the second step of
the FSD method. Each comparison is done at the α level, where for a generic
comparison of the ith and j th populations
α = probability of rejecting H0 : µi = µj when H0 is true.
The individual error rate is not the only error rate that is important in mul-
tiple comparisons. The family error rate (FER), or the experimentwise
error rate, is defined to be the probability of at least one false rejection of
a true hypothesis H0 : µi = µj over all comparisons. When many compar-
isons are made, you may have a large probability of making one or more false
rejections of true null hypotheses. In particular, when all c comparisons of two
population means are performed, each at the α level, then α ≤ FER ≤ cα.
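A rough sketch of how quickly the FER can grow with the number of comparisons, each done at α = 0.05 (the first calculation assumes independent comparisons; the Bonferroni bound c·α needs no such assumption):
alpha  <- 0.05
c.comp <- choose(4, 2)          # 6 pairwise comparisons among k = 4 groups
1 - (1 - alpha)^c.comp          # FER if the comparisons were independent
c.comp * alpha                  # Bonferroni upper bound on the FER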
Assuming all comparisons are of interest, you can implement the Bonferroni
adjustment in R by specifying p.adjust.method = "bonf". A by-product of the
Bonferroni adjustment is that we have at least 100(1 − α)% confidence that all
pairwise t-test statements hold simultaneously!
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## fat1 fat2 fat3
## fat2 0.22733 - -
## fat3 1.00000 0.26241 -
## fat4 0.09286 0.00056 0.07960
##
## P value adjustment method: bonferroni
Looking at the output, can you create the groups? You should get the
groups given below, which implies you have sufficient evidence to conclude that
the population mean absorption for Fat 2 is different than that for Fat 4.
FAT 4 FAT 1 FAT 3 FAT 2
-----------------------
-----------------------
The Bonferroni method tends to produce “coarser” groups than the FSD method,
because the individual comparisons are conducted at a lower alpha (error) level.
Equivalently, the minimum significant difference is inflated for the Bonferroni
method. For example, in the doughnut problem with F ER ≤ 0.05, the critical
value for the individual comparisons at the 0.0083 level is tcrit = 2.929 with
df = 20. The minimum significant difference for the Bonferroni comparisons is
LSD = 2.929 × 8.18 × √(2/6) = 13.824
versus an LSD=9.85 for the FSD method. Referring back to our table of sam-
ple means on page 160, we see that the sole comparison where the absolute
difference between sample means exceeds 13.824 involves Fats 2 and 4.
[Figure: diagram of the skull showing anatomical landmarks, including the Glabella, Nasion, Zygomatic arch, Lambda, Pterion, Asterion, Inion, and Reid's base line.]
13 NA 5.00 4.75
14 NA NA 6.00
", header=TRUE)
[Figure: boxplots of Glabella thickness (mm) by pop (cauc, afam, naaa).]
At the 5% level, you would not reject the hypothesis that the population
mean Glabella measurements are identical. That is, you do not have sufficient
evidence to conclude that these racial groups differ with respect to their average
Glabella measurement. This is the end of the analysis!
The Bonferroni intervals reinforce this conclusion, all the p-values are greater
than 0.05. If you were to calculate CIs for the difference in population means,
each would contain zero. You can think of the Bonferroni intervals as simul-
taneous CI. We’re (at least) 95% confident that all of the following statements
hold simultaneously: −1.62 ≤ µc − µa ≤ 0.32, −0.91 ≤ µn − µc ≤ 1.00, and
−1.54 ≤ µn − µa ≤ 0.33. The individual CIs have level 100(1 − 0.0167)% =
98.33%.
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of glabella
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: glabella.long$thickness and glabella.long$pop
##
## cauc afam
## afam 0.30 -
## naaa 1.00 0.34
##
## P value adjustment method: bonferroni
5.3 Further Discussion of Multiple Compar-
isons
The FSD and Bonferroni methods comprise the ends of the spectrum of mul-
tiple comparisons methods. Among multiple comparisons procedures, the FSD
method is most likely to find differences, whether real or due to sampling vari-
ation, whereas Bonferroni is often the most conservative method. You can
be reasonably sure that differences suggested by the Bonferroni method will
be suggested by almost all other methods, whereas differences not significant
under FSD will not be picked up using other approaches.
The Bonferroni method is conservative, but tends to work well when the
number of comparisons is small, say 4 or less. A smart way to use the Bonferroni
adjustment is to focus attention only on the comparisons of interest (generated
independently of looking at the data!), and ignore the rest. I will return to this
point later.
A commonly-used alternative is Tukey’s honest significant difference (HSD)
method, which rejects the equality of a pair of means based not on the
t-distribution but on the studentized range distribution. To implement Tukey’s
method with a FER of α, reject H0 : µi = µj when
|Ȳi − Ȳj| ≥ (qcrit/√2) spooled √(1/ni + 1/nj),
where qcrit is the α level critical value of the studentized range distribution. For
the doughnut fats, the groupings based on Tukey and Bonferroni comparisons
are identical.
#### Tukey's honest significant difference method (HSD)
## Fat
# Tukey 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
TukeyHSD(fit.f)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = amount ~ type, data = fat.long)
##
## $type
## diff lwr upr p adj
## fat2-fat1 10.5000 -2.719 23.7190 0.1511
## fat3-fat1 0.3333 -12.886 13.5524 0.9999
## fat4-fat1 -12.5000 -25.719 0.7190 0.0679
## fat3-fat2 -10.1667 -23.386 3.0524 0.1710
## fat4-fat2 -23.0000 -36.219 -9.7810 0.0005
## fat4-fat3 -12.8333 -26.052 0.3857 0.0590
## Glabella
# Tukey 95% Individual p-values
# All Pairwise Comparisons among Levels of pop
TukeyHSD(fit.g)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = thickness ~ pop, data = glabella.long)
##
## $pop
## diff lwr upr p adj
## afam-cauc 0.64904 -0.2943 1.5924 0.2260
## naaa-cauc 0.04464 -0.8824 0.9717 0.9924
## naaa-afam -0.60440 -1.5120 0.3033 0.2473
Another popular method controls the false discovery rate (FDR) instead
of the FER. The FDR is the expected proportion of false discoveries amongst
the rejected hypotheses. The false discovery rate is a less stringent condition
than the family-wise error rate, so these methods are more powerful than the
others, though with a higher FER. I encourage you to learn more about the
methods by Benjamini, Hochberg, and Yekutieli.
#### false discovery rate (FDR)
## Fat
# FDR
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## fat1 fat2 fat3
## fat2 0.05248 - -
## fat3 0.94443 0.05248 -
## fat4 0.03095 0.00056 0.03095
##
## P value adjustment method: BH
## Glabella
# FDR
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: glabella.long$thickness and glabella.long$pop
##
## cauc afam
## afam 0.17 -
## naaa 0.91 0.17
##
## P value adjustment method: BH
# violin plot
library(vioplot)
vioplot(fit.g$residuals, horizontal=TRUE, col="gray")
# boxplot
boxplot(fit.g$residuals, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.g$residuals, las = 1, id.n = 8, id.cex = 1, lwd = 1, main="QQ Plot")
## 29 8 34 40 21 25 27 37
## 39 38 1 2 37 3 4 36
[Figures: histogram, violin plot, and boxplot of fit.g$residuals, and car::qqPlot of the residuals with the eight most extreme observations labelled (29, 8, 34, 40, 21, 25, 27, 37); x-axis: norm quantiles.]
shapiro.test(fit.g$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.g$residuals
## W = 0.9769, p-value = 0.5927
library(nortest)
ad.test(fit.g$residuals)
##
## Anderson-Darling normality test
##
## data: fit.g$residuals
## A = 0.3773, p-value = 0.3926
# lillie.test(fit.g$residuals)
cvm.test(fit.g$residuals)
##
## Cramer-von Mises normality test
##
## data: fit.g$residuals
## W = 0.0709, p-value = 0.2648
In Chapter 4, I illustrated the use of Bartlett’s test and Levene’s test for
equal population variances, and showed how to evaluate these tests in R.
[Figure: chi-squared distribution with the critical value χ²Crit marked for α = 0.05 (fixed); H0 is rejected for χ²S values at or beyond χ²Crit.]
R does the calculation for us, as illustrated below. Because the p-value
> 0.5, we fail to reject the null hypothesis that the population variances are
equal. This result is not surprising given how close the sample variances are to
each other.
## Test equal variance
# Barlett assumes populations are normal
bartlett.test(thickness ~ pop, data = glabella.long)
##
## Bartlett test of homogeneity of variances
##
# no cigs
chds[(chds$m_smok == 0), "smoke"] <- "0 cigs";
# less than 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok > 0) & (chds$m_smok < 20),"smoke"] <- "1-19 cigs";
# at least 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok >= 20),"smoke"] <- "20+ cigs";
chds$smoke <- factor(chds$smoke)
# histogram using ggplot
p1 <- ggplot(chds, aes(x = c_bwt))
p1 <- p1 + geom_histogram(binwidth = .4)
p1 <- p1 + facet_grid(smoke ~ .)
p1 <- p1 + labs(title = "Child birthweight vs maternal smoking") +
xlab("child birthweight (lb)")
#print(p1)
library(gridExtra)
grid.arrange(p1, p2, ncol=1)
[Figures: faceted histograms of child birthweight (lb) by smoke group (0 cigs, 1-19 cigs, 20+ cigs) and boxplots of child birthweight by smoke group.]
[Figure: normal QQ plots of child birthweight within each maternal smoking group; x-axes: norm quantiles.]
library(nortest)
# 0 cigs --------------------
shapiro.test(subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.9872, p-value = 0.00199
ad.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Anderson-Darling normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## A = 0.9283, p-value = 0.01831
cvm.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Cramer-von Mises normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.1384, p-value = 0.03374
# 1-19 cigs --------------------
shapiro.test(subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "1-19 cigs")$c_bwt
## W = 0.9785, p-value = 0.009926
ad.test( subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Anderson-Darling normality test
##
## data: subset(chds, smoke == "1-19 cigs")$c_bwt
## A = 0.8308, p-value = 0.03149
# violin plot
library(vioplot)
vioplot(fit.c$residuals, horizontal=TRUE, col="gray")
# boxplot
boxplot(fit.c$residuals, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.c$residuals, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
[Figures: histogram, violin plot, and boxplot of fit.c$residuals, and car::qqPlot of the residuals; x-axis: norm quantiles.]
shapiro.test(fit.c$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.c$residuals
## W = 0.9955, p-value = 0.04758
library(nortest)
ad.test(fit.c$residuals)
##
## Anderson-Darling normality test
##
## data: fit.c$residuals
## A = 0.6218, p-value = 0.1051
cvm.test(fit.c$residuals)
##
## Cramer-von Mises normality test
##
## data: fit.c$residuals
## W = 0.092, p-value = 0.1449
Looking at the summaries, we see that the sample standard deviations are
close. Formal tests of equal population variances are far from significant. The
p-values for Bartlett’s test and Levene’s test are greater than 0.4. Thus, the
standard ANOVA appears to be appropriate here.
# calculate summaries
chds.summary <- ddply(chds, "smoke",
function(X) { data.frame( m = mean(X$c_bwt),
s = sd(X$c_bwt),
n = length(X$c_bwt) ) } )
chds.summary
## smoke m s n
## 1 0 cigs 7.733 1.052 381
## 2 1-19 cigs 7.221 1.078 169
## 3 20+ cigs 7.266 1.091 130
## Test equal variance
# assumes populations are normal
bartlett.test(c_bwt ~ smoke, data = chds)
##
## Bartlett test of homogeneity of variances
##
## data: c_bwt by smoke
## Bartlett's K-squared = 0.3055, df = 2, p-value = 0.8583
# does not assume normality, requires car package
library(car)
leveneTest(c_bwt ~ smoke, data = chds)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.76 0.47
## 677
# nonparametric test
fligner.test(c_bwt ~ smoke, data = chds)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: c_bwt by smoke
## Fligner-Killeen:med chi-squared = 2.093, df = 2, p-value =
## 0.3512
The p-value for the F -test is less than 0.0001. We would reject H0 at any of
the usual test levels (such as 0.05 or 0.01). The data suggest that the population
mean birth weights differ across smoking status groups.
summary(fit.c)
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 2 41 20.35 17.9 2.6e-08 ***
## Residuals 677 769 1.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.c
## Call:
## aov(formula = c_bwt ~ smoke, data = chds)
##
## Terms:
## smoke Residuals
## Sum of Squares 40.7 769.5
## Deg. of Freedom 2 677
##
## Residual standard error: 1.066
## Estimated effects may be unbalanced
The Tukey multiple comparisons suggest that the mean birth weights are
different (higher) for children born to mothers that did not smoke during preg-
nancy.
## CHDS
# Tukey 95% Individual p-values
TukeyHSD(fit.c)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = c_bwt ~ smoke, data = chds)
##
## $smoke
## diff lwr upr p adj
## 1-19 cigs-0 cigs -0.51151 -0.7429 -0.2801 0.0000
## 20+ cigs-0 cigs -0.46665 -0.7210 -0.2123 0.0001
## 20+ cigs-1-19 cigs 0.04485 -0.2473 0.3370 0.9308
Chapter 6
Nonparametric Methods
Learning objectives
After completing this topic, you should be able to:
select the appropriate procedure based on assumptions.
explain reason for using one procedure over another.
decide whether the medians between multiple populations are different.
Achieving these goals contributes to mastery in these course learning outcomes:
3. select correct statistical procedure.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
10. identify and explain statistical methods, assumptions, and limitations.
12. make evidence-based decisions.
6.1 Introduction
Nonparametric methods do not require the normality assumption of classical
techniques. When the normality assumption is met, the ANOVA and t-test are
most powerful, in that if the alternative is true these methods will make the
correct decision with highest probability. However, if the normality assumption
is not met, results from the ANOVA and t-test can be misleading and too
liberal. I will describe and illustrate selected non-parametric methods,
and compare them with classical methods. Some motivation and discussion of
182 Ch 6: Nonparametric Methods
the strengths and weaknesses of non-parametric methods is given.
The income data is unimodal, skewed right, with two extreme outliers.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(income, freq = FALSE, breaks = 1000)
points(density(income), type = "l")
rug(income)
# violin plot
library(vioplot)
vioplot(income, horizontal=TRUE, col="gray")
# boxplot
boxplot(income, horizontal=TRUE)
[Figures: histogram with density curve, violin plot, and boxplot of income.]
The normal QQ-plot of the sample data indicates strong deviation from
normality, and the CLT can’t save us: even the bootstrap sampling distribution
of the mean indicates strong deviation from normality.
library(car)
qqPlot(income, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Income")
bs.one.samp.dist(income)
[Figures: car::qqPlot of income ("QQ Plot, Income") and bootstrap sampling distribution of the mean.]
The presence of the outliers has a dramatic effect on the 95% CI for the
population mean income µ, which goes from −101 to 303 (in 1000 dollar units).
This t-CI is suspect because the normality assumption is unreasonable. A CI
for the population median income η is more sensible because the median is likely
to be a more reasonable measure of typical value. Using the sign procedure,
you are 95% confident that the population median income is between 2.32 and
11.57 (times $1000).
library(BSDA)
t.test(income)
##
## One Sample t-test
##
## data: income
## t = 1.099, df = 11, p-value = 0.2951
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -101.1 303.0
## sample estimates:
## mean of x
## 100.9
SIGN.test(income)
##
## One-sample Sign-Test
##
## data: income
## s = 11, p-value = 0.0009766
## alternative hypothesis: true median is not equal to 0
## 95 percent confidence interval:
## 2.319 11.575
## sample estimates:
## median of x
## 7
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8540 5.000 8.00
## Interpolated CI 0.9500 2.319 11.57
## Upper Achieved CI 0.9614 2.000 12.00
## [1] 8.259
# violin plot
library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")
# boxplot
boxplot(age, horizontal=TRUE)
[Figures: histogram with density curve, violin plot, and boxplot of age.]
The normal QQ-plot of the sample data indicates mild deviation from nor-
mality in the left tail (2 points of 11 outside the bands), and the bootstrap
sampling distribution of the mean indicates weak deviation from normality. It
is good practice in this case to use the nonparametric test as a double-check of
the t-test, with the nonparametric test being the more conservative test.
library(car)
qqPlot(age, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Income")
bs.one.samp.dist(age)
[Figures: car::qqPlot of age and bootstrap sampling distribution of the mean. Data: n = 11, mean = 51.273, se = 2.49031.]
# violin plot
library(vioplot)
vioplot(dat, horizontal=TRUE, col="gray")
# boxplot
boxplot(dat, horizontal=TRUE)
[Figures: histogram with density curve, violin plot, and boxplot of dat.]
bs.one.samp.dist(dat)
[Figures: bootstrap sampling distribution of the mean for dat; dotplot of the differences d.]
If you are uncomfortable with the symmetry assumption, you could use the
sign CI for the population median difference between B and A. I will note that
a 95% CI for the median difference goes from 0.86 to 2.2 hours.
t.test(sleep$d, mu=0)
##
## One Sample t-test
##
## data: sleep$d
## t = 3.78, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.6102 2.4298
## sample estimates:
## mean of x
## 1.52
# with continuity correction in the normal approximation for the p-value
wilcox.test(sleep$d, mu=0, conf.int=TRUE)
## Warning: cannot compute exact p-value with ties
## Warning: cannot compute exact confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: sleep$d
## V = 54, p-value = 0.008004
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 0.7999 2.8000
## sample estimates:
## (pseudo)median
## 1.3
# can use the paired= option
#wilcox.test(sleep$b, sleep$a, paired=TRUE, mu=0, conf.int=TRUE)
# if don't assume symmetry, can use sign test
#SIGN.test(sleep$d)
6.3.2 Comments on One-Sample Nonparametric Meth-
ods
For this discussion, I will assume that the underlying population distribution
is (approximately) symmetric, which implies that population means and me-
dians are equal (approximately). For symmetric distributions the t, sign, and
Wilcoxon procedures are all appropriate.
If the underlying population distribution is extremely skewed, you can use
the sign procedure to get a CI for the population median. Alternatively, as
illustrated on HW 2, you can transform the data to a scale where the underlying
distribution is nearly normal, and then use the classical t-methods. Moderate
degrees of skewness will not likely have a big impact on the standard t-test and
CI.
The one-sample t-test and CI are optimal when the underlying population
frequency curve is normal. Essentially this means that the t-CI is, on average,
narrowest among all CI procedures with given level, or that the t-test has the
highest power among all tests with a given size. The width of a CI provides a
measure of the sensitivity of the estimation method. For a given level CI, the
narrower CI better pinpoints the unknown population mean.
With heavy-tailed symmetric distributions, the t-test and CI tend to be
conservative. Thus, for example, a nominal 95% t-CI has actual coverage rates
higher than 95%, and the nominal 5% t-test has an actual size smaller than 5%.
The t-test and CI possess a property that is commonly called robustness of
validity. However, data from heavy-tailed distributions can have a profound
effect on the sensitivity of the t-test and CI. Outliers can dramatically inflate
the standard error of the mean, causing the CI to be needlessly wide, and
tests to have diminished power (outliers typically inflate p-values for the t-
test). The sign and Wilcoxon procedures downweight the influence of outliers
by looking at sign or signed-ranks instead of the actual data values. These
two nonparametric methods are somewhat less efficient than the t-methods
when the population is normal (efficiency is about 0.64 and 0.96 for the sign
and Wilcoxon methods relative to the normal t-methods, where efficiency is the
ratio of sample sizes needed for equal power), but can be infinitely more efficient
with heavier than normal tailed distributions. In essence, the t-methods do not
have a robustness of sensitivity.
Nonparametric methods have gained widespread acceptance in many sci-
entific disciplines, but not all. Scientists in some disciplines continue to use
classical t-methods because they believe that the methods are robust to non-
normality. As noted above, this is a robustness of validity, not sensitivity. This
misconception is unfortunate, and results in the routine use of methods that
are less powerful than the non-parametric techniques. Scientists need to
be flexible and adapt their tools to the problem at hand, rather
than use the same tool indiscriminately! I have run into suspicion
that use of nonparametric methods was an attempt to “cheat” in some way —
properly applied, they are excellent tools that should be used.
A minor weakness of nonparametric methods is that they do not easily
generalize to complex modelling problems. A great deal of progress has been
made in this area, but most software packages have not included the more
advanced techniques (R is among the forerunners).
Nonparametric statistics used to refer almost exclusively to the set of meth-
ods such as we have been discussing that provided analogs like tests and CIs
to the normal theory methods without requiring the assumption of sampling
from normal distributions. There is now a large area of statistics also called
nonparametric methods not focused on these goals at all. In our department
we (used to) have a course titled “Nonparametric Curve Estimation & Image
Reconstruction”, where the focus is much more general than relaxing an as-
sumption of normality. In that sense, what we are covering in this course could
be considered “classical” nonparametrics.
The R help on ?wilcox.test gives references to how the exact WMW proce-
dure is actually calculated; here is a good approximation to the exact method
that is easier to understand. The WMW procedure is based on ranks. The
two samples are combined, ranked from smallest to largest (1=smallest) and
separated back into the original samples. If the two populations have equal me-
dians, you expect the average rank in the two samples to be roughly equal. The
WMW test computes a classical two sample t-test using the pooled variance on
the ranks to assess whether the sample mean ranks are significantly different.
1 http://www.lpi.usra.edu/meteor/metbull.php?code=24138
2 http://www.lpi.usra.edu/meteor/metbull.php?code=24204
#### Example: Comparison of Cooling Rates of Uwet and Walker Co. Meteorites
Uwet <- c(0.21, 0.25, 0.16, 0.23, 0.47, 1.20, 0.29, 1.10, 0.16)
Walker <- c(0.69, 0.23, 0.10, 0.03, 0.56, 0.10, 0.01, 0.02, 0.04, 0.22)
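To make the rank-based description above concrete, here is a minimal sketch (my own illustration, not part of the original analysis): pool the two samples, rank the pooled values, and compare the mean ranks with a pooled-variance t-test. The resulting p-value approximates the WMW p-value.
# Sketch: approximate WMW by a pooled-variance t-test on the ranks
cool.all <- c(Uwet, Walker)
site.all <- factor(rep(c("Uwet", "Walker"), c(length(Uwet), length(Walker))))
rank.all <- rank(cool.all)                     # 1 = smallest
t.test(rank.all ~ site.all, var.equal = TRUE)  # compare mean ranks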
The boxplots and normal QQ-plots show that the distributions are rather
skewed to the right. The AD tests of normality indicate that a normality as-
sumption is unreasonable for each population.
met <- data.frame(Uwet=c(Uwet,NA), Walker)
library(reshape2)
met.long <- melt(met, variable.name = "site", value.name = "cool", na.rm=TRUE)
## Using as id variables
# naming variables manually, the variable.name and value.name not working 11/2012
names(met.long) <- c("site", "cool")
library(ggplot2)
p <- ggplot(met.long, aes(x = site, y = cool, fill=site))
p <- p + geom_boxplot()
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
p <- p + coord_flip()
p <- p + labs(title = "Cooling rates for samples of meteorites at two locations")
p <- p + theme(legend.position="none")
print(p)
par(mfrow=c(1,2))
library(car)
qqPlot(Walker, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Walker")
qqPlot(Uwet, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Uwet")
[Figure: boxplots of cooling rates for samples of meteorites at the two locations, with QQ plots for Walker and Uwet]
I carried out the standard two-sample procedures to see what happens. The
pooled-variance and Satterthwaite results are comparable, which is expected
because the sample standard deviations and sample sizes are roughly equal.
Both tests indicate that the mean cooling rates for Uwet and Walker Co. me-
teorites are not significantly different at the 10% level. You are 95% confident
that the mean cooling rate for Uwet is at most 0.1 less, and no more than 0.6
greater than that for Walker Co. (in degrees per million years).
# numerical summaries
summary(Uwet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.160 0.210 0.250 0.452 0.470 1.200
c(sd(Uwet), IQR(Uwet), length(Uwet))
## [1] 0.407 0.260 9.000
summary(Walker)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0325 0.1000 0.2000 0.2280 0.6900
c(sd(Walker), IQR(Walker), length(Walker))
## [1] 0.239 0.195 10.000
t.test(Uwet, Walker, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Uwet and Walker
## t = 1.669, df = 17, p-value = 0.1134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06663 0.57107
## sample estimates:
## mean of x mean of y
## 0.4522 0.2000
t.test(Uwet, Walker)
##
## Welch Two Sample t-test
##
## data: Uwet and Walker
## t = 1.624, df = 12.65, p-value = 0.129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08421 0.58865
## sample estimates:
## mean of x mean of y
## 0.4522 0.2000
The difference between the WMW and t-test p-values and CI lengths (i.e.
the WMW CI is narrower and the p-value smaller) reflects the effect of the
outliers on the sensitivity of the standard tests and CI.
3 http://en.wikipedia.org/wiki/File:Speed_of_light_(foucault).PNG
The problem is to determine a 95% CI for the “true” passage time, which
is taken to be the typical time (mean or median) of the population of measure-
ments that were or could have been taken by this experiment.
#### Example: Newcombe's Data
time <- c(24.828, 24.833, 24.834, 24.826, 24.824, 24.756
, 24.827, 24.840, 24.829, 24.816, 24.798, 24.822
, 24.824, 24.825, 24.823, 24.821, 24.830, 24.829
, 24.831, 24.824, 24.836, 24.819, 24.820, 24.832
, 24.836, 24.825, 24.828, 24.828, 24.821, 24.829
, 24.837, 24.828, 24.830, 24.825, 24.826, 24.832
, 24.836, 24.830, 24.836, 24.826, 24.822, 24.823
, 24.827, 24.828, 24.831, 24.827, 24.827, 24.827
, 24.826, 24.826, 24.832, 24.833, 24.832, 24.824
, 24.839, 24.824, 24.832, 24.828, 24.825, 24.825
, 24.829, 24.828, 24.816, 24.827, 24.829, 24.823)
library(nortest)
ad.test(time)
##
## Anderson-Darling normality test
##
## data: time
## A = 5.884, p-value = 1.217e-14
# Histogram overlaid with kernel density curve
Passage_df <- data.frame(time)
p1 <- ggplot(Passage_df, aes(x = time))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..)
, binwidth=0.001
, colour="black", fill="white")
# violin plot
p2 <- ggplot(Passage_df, aes(x = "t", y = time))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(Passage_df, aes(x = "t", y = time))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)
## Warning: position stack requires constant width: output may be incorrect
[Figure: histogram with density curve, violin plot, and boxplot of the passage time data]
par(mfrow=c(1,1))
library(car)
qqPlot(time, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot, Time")
bs.one.samp.dist(time)
[Figure: QQ plot of time and bootstrap sampling distribution of the mean; n = 66, mean = 24.826, se = 0.00132]
The data set is skewed to the left, due to the presence of two extreme
outliers that could potentially be misrecorded observations. Without additional
information I would be hesitant to apply normal theory methods (the t-test),
even though the sample size is “large” (bootstrap sampling distribution is still
left-skewed). Furthermore, the t-test still suffers from a lack of robustness of
sensitivity, even in large samples. A formal QQ-plot and normality test reject, at
the 0.01 level, the normality assumption needed for the standard methods.
The table below gives 95% t, sign, and Wilcoxon CIs. I am more comfortable
with the sign CI for the population median than the Wilcoxon method, which
assumes symmetry.
t.sum <- t.test(time)
t.sum$conf.int
## [1] 24.82 24.83
## attr(,"conf.level")
## [1] 0.95
diff(t.test(time)$conf.int)
## [1] 0.005283
s.sum <- SIGN.test(time)
##
## One-sample Sign-Test
##
## data: time
## s = 66, p-value < 2.2e-16
## alternative hypothesis: true median is not equal to 0
## 95 percent confidence interval:
## 24.83 24.83
## sample estimates:
## median of x
## 24.83
s.sum[2,c(2,3)]
## L.E.pt U.E.pt
## 24.83 24.83
diff(s.sum[2,c(2,3)])
## U.E.pt
## 0.0025
w.sum <- wilcox.test(time, conf.int=TRUE)
w.sum$conf.int
## [1] 24.83 24.83
## attr(,"conf.level")
## [1] 0.95
diff(w.sum$conf.int)
## [1] 0.002488
## emis.long$year: y68.9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71 196 324 506 462 3000
## ----------------------------------------------------
## emis.long$year: y70.1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100 198 244 381 450 940
## ----------------------------------------------------
## emis.long$year: y72.4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20 65 160 244 222 1880
# IQR and sd of each year
by(emis.long$hc, emis.long$year, function(X) { c(IQR(X), sd(X), length(X)) })
## emis.long$year: Pre.y63
## [1] 471.5 591.6 10.0
## ----------------------------------------------------
## emis.long$year: y63.7
## [1] 517.0 454.9 13.0
## ----------------------------------------------------
## emis.long$year: y68.9
## [1] 266.2 707.8 16.0
## ----------------------------------------------------
## emis.long$year: y70.1
## [1] 252.2 287.9 20.0
## ----------------------------------------------------
## emis.long$year: y72.4
## [1] 156.5 410.8 19.0
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(emis.long, aes(x = year, y = hc))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(emis.long$hc),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
aes(colour=year), alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, aes(colour=year), alpha = 0.8)
p <- p + labs(title = "Albuquerque automobile hydrocarbon emissions data") + ylab("hc (ppm)")
# to reverse order that years print, so oldest is first on top
p <- p + scale_x_discrete(limits = rev(levels(emis.long$year)) )
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: boxplots of Albuquerque automobile hydrocarbon emissions by year, on the original ppm scale and on the log(ppm) scale]
After transformation, the samples have roughly the same spread (IQR and
s) and shape. The transformation does not completely eliminate the outliers.
However, I am more comfortable with a standard ANOVA on this scale than
with the original data. A difficulty here is that the ANOVA is comparing
population mean log HC emission (so interpretations are on the log ppm scale,
instead of the natural ppm scale). Summaries for the ANOVA on the log
hydrocarbon emissions levels are given below.
# ANOVA on the log(hc) scale
fit.le <- aov(loghc ~ year, data = emis.long)
summary(fit.le)
## Df Sum Sq Mean Sq F value Pr(>F)
5 It is unethical to choose a method based on the results it gives.
Example: Hodgkin’s Disease Study Plasma bradykininogen levels
were measured in normal subjects, in patients with active Hodgkin’s disease,
and in patients with inactive Hodgkin’s disease. The globulin bradykininogen
is the precursor substance for bradykinin, which is thought to be a chemical
mediator of inflammation. The data (in micrograms of bradykininogen per
milliliter of plasma) are displayed below. The three samples are denoted by
nc for normal controls, ahd for active Hodgkin’s disease patients, and ihd for
inactive Hodgkin’s disease patients.
The medical investigators wanted to know if the three samples differed in
their bradykininogen levels. Carry out the statistical analysis you consider to
be most appropriate, and state your conclusions to this question.
Read in the data, look at summaries on the original scale, and create a plot.
Also, look at summaries on the log scale and create a plot.
#### Example: Hodgkin's Disease Study
hd <- read.table(text="
nc ahd ihd
5.37 3.96 5.37
5.80 3.04 10.60
4.70 5.28 5.02
5.70 3.40 14.30
3.40 4.10 9.90
8.60 3.61 4.27
7.48 6.16 5.75
5.77 3.22 5.03
7.15 7.48 5.74
6.49 3.87 7.85
4.09 4.27 6.82
5.94 4.05 7.90
6.38 2.40 8.36
9.24 5.81 5.72
5.66 4.29 6.00
4.53 2.77 4.75
6.51 4.40 5.83
7.00 NA 7.30
6.20 NA 7.52
7.04 NA 5.32
4.82 NA 6.05
6.73 NA 5.68
5.26 NA 7.57
NA NA 5.68
NA NA 8.91
NA NA 5.39
NA NA 4.40
NA NA 7.13
", header=TRUE)
#hd
## log scale
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(hd.long, aes(x = patient, y = loglevel))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(hd.long$loglevel),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
aes(colour=patient), alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, aes(colour=patient), alpha = 0.8)
p <- p + labs(title = "Plasma bradykininogen levels for three patient groups (log scale)")
p <- p + ylab("log(level) (log(mg/ml))")
# to reverse order that years print, so oldest is first on top
p <- p + scale_x_discrete(limits = rev(levels(hd.long$patient)) )
p <- p + ylim(c(0,max(hd.long$loglevel)))
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: boxplots of plasma bradykininogen levels for the three patient groups, on the original scale (mg/ml) and on the log scale]
Although the spread (IQR, s) in the ihd sample is somewhat greater than
the spread in the other samples, the presence of skewness and outliers in the
boxplots is a greater concern regarding the use of the classical ANOVA. The
shapes and spreads in the three samples are roughly identical, so a Kruskal-
Wallis nonparametric ANOVA appears ideal. As a sidelight, I transformed
plasma levels to a log scale to reduce the skewness and eliminate the outliers.
The boxplots of the transformed data show reasonable symmetry across groups,
but outliers are still present. I will stick with the Kruskal-Wallis ANOVA
(although it would not be much of a problem to use the classical ANOVA on
transformed data).
Let ηnc = population median plasma level for normal controls, ηahd = pop-
ulation median plasma level for active Hodgkin’s disease patients, and ηihd =
population median plasma level for inactive Hodgkin’s disease patients. The
KW test of H0 : ηnc = ηahd = ηihd versus HA : not H0 is highly significant (p-
value= 0.00003), suggesting differences among the population median plasma
levels. The Kruskal-Wallis ANOVA summary is given below.
# KW ANOVA
fit.h <- kruskal.test(level ~ patient, data = hd.long)
fit.h
##
## Kruskal-Wallis rank sum test
##
## data: level by patient
## Kruskal-Wallis chi-squared = 20.57, df = 2, p-value =
## 3.421e-05
##
## data: hd$ahd and hd$ihd
## W = 56, p-value = 2.143e-05
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## -3.50 -1.32
## sample estimates:
## difference in location
## -2.147
The only comparison with a p-value greater than 0.0167 involved the nc
and ihd samples. The comparison leads to two groups, and is consistent with
what we see in the boxplots.
ahd nc ihd
--- --------
You have sufficient evidence to conclude that the plasma bradykininogen levels
for active Hodgkin’s disease patients (ahd) is lower than the population median
levels for normal controls (nc) and for patients with inactive Hodgkin’s disease
(ihd). You do not have sufficient evidence to conclude that the population
median levels for normal controls (nc) and for patients with inactive Hodgkin’s
disease (ihd) are different. The CIs give an indication of size of differences in
the population medians.
6 The ANOVA is the multi-sample analog to the two-sample t-test for the mean, and the KW ANOVA is the multi-sample analog to the WMW two-sample test for the median. Thus, we follow up a KW ANOVA with WMW two-sample tests at the chosen multiple comparison adjusted error rate.
##
## data: emis$y63.7 and emis$Pre.y63
## W = 61.5, p-value = 0.8524
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -530 428
## sample estimates:
## difference in location
## -15.48
wilcox.test(emis$y68.9, emis$y63.7 , conf.int=TRUE, conf.level = 0.9875)
## Warning: cannot compute exact p-value with ties
## Warning: cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y68.9 and emis$y63.7
## W = 43, p-value = 0.007968
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -709 -52
## sample estimates:
## difference in location
## -397.4
wilcox.test(emis$y70.1, emis$y68.9 , conf.int=TRUE, conf.level = 0.9875)
## Warning: cannot compute exact p-value with ties
## Warning: cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y70.1 and emis$y68.9
## W = 156, p-value = 0.9112
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -206 171
## sample estimates:
## difference in location
## -11
wilcox.test(emis$y72.4, emis$y70.1 , conf.int=TRUE, conf.level = 0.9875)
## Warning: cannot compute exact p-value with ties
## Warning: cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y72.4 and emis$y70.1
## W = 92.5, p-value = 0.006384
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -286 -6
## sample estimates:
## difference in location
## -130
There are significant differences between the 1963-67 and 1968-69 samples,
and between the 1970-71 and 1972-74 samples. You are 98.75% confident that
the population median HC emissions for 1963-67 year cars is between 52 and
708.8 ppm greater than the population median for 1968-69 cars. Similarly, you
are 98.75% confident that the population median HC emissions for 1970-71 year
cars is between 6.1 and 285.9 ppm greater than the population median for 1972-
74 cars. Overall, you are 95% confident among the four pairwise comparisons
that you have not declared a difference significant when it isn’t.
library(ggplot2)
p <- ggplot(dat, aes(x = Tperm))
#p <- p + scale_x_continuous(limits=c(-20,+20))
p <- p + geom_histogram(aes(y=..density..)
, binwidth=0.01
, colour="black", fill="white")
# Overlay with transparent density plot
p <- p + geom_density(alpha=0.2, fill="#FF6666")
#p <- p + geom_point(aes(y = -0.05)
# , position = position_jitter(height = 0.01)
# , alpha = 1/5)
# vertical line at Tobs
p <- p + geom_vline(aes(xintercept=Tobs), colour="#BB0000", linetype="dashed")
p <- p + labs(title = "Permutation distribution of difference in means, Uwet and Walker Meteorites")
p <- p + xlab("difference in means (red line = observed difference in means)")
print(p)
[Figure: permutation distribution of the difference in means for the Uwet and Walker data, with the observed difference marked]
Note that the two-sided p-value of 0.1229 is consistent, in this case, with
the two-sample t-test p-values of 0.1134 (pooled) and 0.1290 (Satterthwaite),
but different from 0.0497 (WMW). The permutation test is a comparison of means
without the normality assumption, though it requires that the observations are
exchangeable between populations under H0.
If the only purpose of the test is to reject or not reject the null hypothesis, we
can, as an alternative, sort the recorded differences and then observe whether T(obs) is
contained within the middle 95% of them. If it is not, we reject the hypothesis
of equal means at the 5% significance level.
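A minimal sketch of this shortcut, assuming dat$Tperm holds the sampled permutation differences and Tobs the observed difference in means (the objects used in the plotting code above):
# Sketch: reject H0 if the observed difference falls outside the middle 95%
# of the permutation distribution
T.lims <- quantile(dat$Tperm, c(0.025, 0.975))
T.lims
(Tobs < T.lims[1]) | (Tobs > T.lims[2])   # TRUE means reject at the 5% level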
6.6.1 Automated in R
The lmPerm package provides permutation tests for linear models (including t-
tests and ANOVA) and is particularly easy to implement. You can use it for
all manner of ANOVA/ANCOVA designs, as well as simple, polynomial, and
multiple regression. Simply use lmp() and aovp() where you would have used
lm() and aov(). Note that the t-test can be performed with lm().
Below I calculate the standard t-test for the Meteorite data using t.test()
and lm(), then compare that with lmp() and what we calculated using our
calculation of the permutation test.
# standard two-sample t-test with equal variances
t.summary <- t.test(cool ~ site, data = met.long, var.equal = TRUE)
t.summary
##
## Two Sample t-test
##
## data: cool by site
## t = 1.669, df = 17, p-value = 0.1134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06663 0.57107
## sample estimates:
## mean in group Uwet mean in group Walker
## 0.4522 0.2000
# linear model form of t-test, "siteWalker" has estimate, se, t-stat, and p-value
lm.summary <- lm(cool ~ site, data = met.long)
summary(lm.summary)
##
## Call:
## lm(formula = cool ~ site, data = met.long)
##
## Residuals:
## Min 1Q Median 3Q Max
The permutation test gives a p-value of 0.1424, which is a bit different from
our previously calculated 0.1229, and is different each time you run the routine
(since it is based on a sampling approximation, just as we coded). The function
also provides, as a reference, the pooled two-sample test at the bottom.
For the emissions data, the permutation test gives this result.
# permutation test version
library(lmPerm)
aovp.summary <- aovp(hc ~ year, data = emis.long)
## [1] "Settings: unique SS "
aovp.summary
## Call:
## aovp(formula = hc ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 4226834 17759968
## Deg. of Freedom 4 73
##
## Residual standard error: 493.2
## Estimated effects may be unbalanced
summary(aovp.summary)
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## year 4 4226834 1056709 5000 0.0084 **
## Residuals 73 17759968 243287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus the overall ANOVA rejects the null hypothesis of all equal means. A
followup set of pairwise tests can be done with lmp().
# these are the levels of the factor, the first is the baseline group
levels(emis.long$year)
## [1] "Pre.y63" "y63.7" "y68.9" "y70.1" "y72.4"
# permutation test version
library(lmPerm)
lmp.summary <- lmp(hc ~ year, data = emis.long)
## [1] "Settings: unique SS "
summary(lmp.summary)
##
## Call:
## lmp(formula = hc ~ year, data = emis.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -543.6 -224.1 -126.4 44.2 2492.7
##
## Coefficients:
## Estimate Iter Pr(Prob)
## year1 325.8 5000 <2e-16 ***
## year2 236.7 798 0.112
## year3 -58.5 221 0.312
## year4 -183.3 1793 0.053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 493 on 73 degrees of freedom
## Multiple R-Squared: 0.192,Adjusted R-squared: 0.148
## F-statistic: 4.34 on 4 and 73 DF, p-value: 0.00331
The year1, year2, year3, and year4 correspond to the 2nd through 5th
groups. The Pr(Prob) column gives the p-values for the permutation test com-
paring each of the 2nd through 5th groups to the 1st group (year0). For the
other comparisons, reorder the factor levels and rerun.
(Note, I thought lmp() would use the baseline factor level as the intercept,
which would result in the year1 . . . year4 estimates being the pairwise differ-
ences from the intercept group. Therefore, by changing which group is the
baseline intercept group, I thought you could get pairwise tests between all
pairs. However, lmp() must be performing some internal factor level ordering
because it always seems to use the third group in the emis.long$year
factor as the baseline intercept. Unfortunately, I don’t know how to use lmp()
to get the other pairwise comparisons.)
[Figure: histograms of the data using the default number of breaks, then 10, 20, and 100 breaks]
Notice that we are starting to see more and more bins that include only a
single observation (or multiple observations at the precision of measurement).
Taken to its extreme, this type of exercise gives in some sense a “perfect” fit to
the data but is useless as an estimator of shape.
On the other hand, it is obvious that a single bin would also be completely
useless. So we try in some sense to find a middle ground between these two
extremes: “Oversmoothing” by using only one bin and “undersmoothing” by
using too many. This same paradigm occurs for density estimation, in which
the amount of smoothing is determined by a quantity called the bandwidth.
By default, R uses an optimal (in some sense) choice of bandwidth.
We’ve already used the density() function to provide a smooth curve to
our histograms. So far, we’ve taken the default “bandwidth”. Let’s see what
happens when we use different bandwidths.
par(mfrow=c(3,1))
# undersmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.0004), col=3, lwd=2)
text(17.5, .35, "", col=3, cex=1.4)
title(main=paste("Undersmooth, BW = 0.0004"), col.main=3)
# oversmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.008), col=4, lwd=2)
title(main=paste("Oversmooth, BW = 0.008"), col.main=4)
[Figure: histograms of time2 with density estimates using the default bandwidth (0.0018), an undersmoothed bandwidth (0.0004), and an oversmoothed bandwidth (0.008)]
The other determining factor is the kernel, which is the shape each individual
point takes before all the shapes are added up for a final density line. While
the choice of bandwidth is very important, the choice of kernel is not. Choosing
a kernel with hard edges (such as "rect") will result in jagged artifacts, so
smoother kernels are often preferred.
par(mfrow=c(1,1))
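As a small illustration of the kernel choice (a sketch of my own, not the code that produced the figure), the kernel= argument of density() can be varied; time2 is the vector used in the histograms above.
# Sketch: compare kernels for the density estimate of time2
hist(time2, prob = TRUE, main = "Kernel comparison")
lines(density(time2, kernel = "gaussian"),     col = 2, lwd = 2)  # smooth (default)
lines(density(time2, kernel = "rectangular"),  col = 4, lwd = 2)  # hard edges, jagged
lines(density(time2, kernel = "epanechnikov"), col = 3, lwd = 2)
legend("topright", c("gaussian", "rectangular", "epanechnikov"),
       col = c(2, 4, 3), lwd = 2)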
Chapter 7
Categorical Data Analysis
Learning objectives
After completing this topic, you should be able to:
select the appropriate statistical method to compare summaries from cate-
gorical variables.
assess the assumptions of one-way and two-way tests of proportions and
independence.
decide whether the proportions between populations are different, including
in stratified and cross-sectional studies.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
Example: Titanic The sinking of the Titanic is a famous event, and new
books are still being published about it. Many well-known facts — from the
proportions of first-class passengers to the “women and children first” policy,
and the fact that that policy was not entirely successful in saving the women
and children in the third class — are reflected in the survival rates for various
classes of passenger. The source provides a data set recording class, sex, age,
and survival status for each person on board the Titanic, and is based on
data originally collected by the British Board of Trade1.
# The Titanic dataset is a 4-dimensional table: Class, Sex, Age, Survived
library(datasets)
data(Titanic)
Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
1 British Board of Trade (1990), Report on the Loss of the “Titanic” (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.
[Figure: mosaic plot of the Titanic data by Class (1st, 2nd, 3rd, Crew) and Sex (Female, Male)]
There are many questions that can be asked of this dataset. How likely were
people to survive such a ship sinking in cold water? Is the survival proportion
dependent on sex, class, or age, or a combination of these? How different are the
survival proportions for 1st class females versus 3rd class males?
7.2.1 A CI for p
A two-sided CI for p is a range of plausible values for the unknown population
proportion p, based on the observed data. To compute a two-sided CI for p:
1. Specify the confidence level as the percent 100(1 − α)% and solve for the
error rate α of the CI.
2. Compute zcrit = z0.5α (i.e., area under the standard normal curve to
the left and to the right of zcrit are 1 − 0.5α and 0.5α, respectively).
qnorm(1-0.05/2)=1.96.
3. The 100(1 − α)% CI for p has endpoints L = p̂ − zcrit SE and U = p̂ + zcrit SE, respectively, where the “CI standard error” is
SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.
The CI is often written as p̂ ± zcrit SE.
so zcritSE = 1.96 × 0.028 = 0.055. The 95% CI for p is 0.700 ± 0.055. You are
95% confident that the proportion of consumers willing to pay extra for better
packaging is between 0.645 and 0.755. (Willing to pay how much extra?)
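A minimal sketch of these three steps in R; the sample proportion is the 0.700 quoted above, but the sample size below is a hypothetical placeholder used only for illustration.
# Sketch: large-sample CI for a single proportion
phat   <- 0.70          # sample proportion from the packaging example
n      <- 250           # hypothetical sample size, for illustration only
alpha  <- 0.05
z.crit <- qnorm(1 - alpha/2)              # 1.96
SE     <- sqrt(phat * (1 - phat) / n)     # CI standard error
phat + c(-1, 1) * z.crit * SE             # CI endpoints L and U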
Appropriateness of the CI
[Figure: rejection regions at ±zCrit for the fixed-α test, and the corresponding two-sided p-value areas outside ±zs, under the standard normal curve]
Example: Emissions data Each car in the target population (L.A. county)
either has been tampered with (a success) or has not been tampered with (a
failure). Let p = the proportion of cars in L.A. county with tampered emissions
control devices. You want to test H0 : p = 0.15 against HA : p ≠ 0.15 (here
p0 = 0.15). The critical value for a two-sided test of size α = 0.05 is zcrit = 1.96.
The data are a sample of n = 200 cars. The sample proportion of cars that
have been tampered with is p̂ = 21/200 = 0.105. The test statistic is
z_s = \frac{0.105 - 0.15}{0.02525} = -1.78,
where
SE = \sqrt{\frac{0.15 \times 0.85}{200}} = 0.02525.
Given that |zs| = 1.78 < 1.96, you have insufficient evidence to reject H0 at the
5% level. That is, you have insufficient evidence to conclude that the proportion
of cars in L.A. county that have been tampered with differs from the statewide
proportion.
This decision is reinforced by the p-value calculation. The p-value is the area
under the standard normal curve outside ±1.78. This is 2 × 0.0375 = 0.075,
which exceeds the test size of 0.05.
[Figure: two-sided p-value as the area under the standard normal curve outside ±1.78, 0.0375 in each tail]
Remark The SE used in the test and CI are different. This implies that a
hypothesis test and CI could potentially lead to different decisions. That is, a
95% CI for a population proportion might cover p0 when the p-value for testing
H0 : p = p0 is smaller than 0.05. This will happen, typically, only in cases
where the decision is “borderline.”
7.2.5 R Implementation
#### Single Proportion Problems
# Approximate normal test for proportion, without Yates' continuity correction
prop.test(21, 200, p = 0.15, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 21 out of 200, null probability 0.15
## X-squared = 3.176, df = 1, p-value = 0.07471
## alternative hypothesis: true p is not equal to 0.15
## 95 percent confidence interval:
## 0.06971 0.15518
## sample estimates:
## p
## 0.105
# Approximate normal test for proportion, with Yates' continuity correction
#prop.test(21, 200, p = 0.15)
I will answer this question by computing a p-value for a one-sided test. Let
p be the population proportion of learning disabled children with brains having
larger right sides. I am interested in testing H0 : p = 0.25 against HA : p > 0.25
(here p0 = 0.25).
The proportion of children sampled with brains having larger right sides is
p̂ = 22/53 = 0.415. The test statistic is
z_s = \frac{0.415 - 0.25}{0.0595} = 2.78,
where
SE = \sqrt{\frac{0.25 \times 0.75}{53}} = 0.0595.
The p-value for an upper one-sided test is the area under the standard normal
curve to the right of 2.78, which is approximately .003; see the picture below.
I would reject H0 in favor of HA using any of the standard test levels, say 0.05
or 0.01. The newspaper’s claim is reasonable.
[Figure: one-sided p-value as the area under the standard normal curve to the right of zs = 2.78, about 0.003]
## [1] 0.95
# Exact binomial test for proportion
binom.test(1, 6, p = 0.85)$conf.int
## [1] 0.004211 0.641235
## attr(,"conf.level")
## [1] 0.95
Returning to the problem, you might check for discrimination by testing
H0 : p = 0.85 against HA : p < 0.85 using an exact test. The exact test
p-value is 0.000 to three decimal places, and an exact upper bound for p is
0.582. What does this suggest to you?
# Exact binomial test for proportion
binom.test(1, 6, alternative = "less", p = 0.85)
##
## Exact binomial test
##
## data: 1 and 6
## number of successes = 1, number of trials = 6, p-value =
## 0.0003987
## alternative hypothesis: true probability of success is less than 0.85
## 95 percent confidence interval:
## 0.0000 0.5818
## sample estimates:
## probability of success
## 0.1667
It is possible that the alphabetical ordering of failures and successes is the wrong
order, in which case we’d need to reorder the input to binom.test().
In Chapter 6 we looked at the binomial distribution to obtain an exact Sign
Test confidence interval for the median. Examine the following to see where
the exact p-value for this test comes from.
n <- 6
x <- 0:n
p0 <- 0.85
bincdf <- pbinom(x, n, p0)
cdf <- data.frame(x, bincdf)
cdf
## x bincdf
## 1 0 1.139e-05
## 2 1 3.987e-04
## 3 2 5.885e-03
## 4 3 4.734e-02
## 5 4 2.235e-01
## 6 5 6.229e-01
## 7 6 1.000e+00
where Oi is the observed number in the sample that fall into the ith category
(Oi = np̂i), and Ei = np0i is the number of individuals expected to be in the
ith category when H0 is true.
The Pearson statistic can also be computed as the sum of the squared residuals:
\chi^2_s = \sum_{i=1}^{r} Z_i^2,
where Z_i = (O_i - E_i)/\sqrt{E_i}, or in terms of the observed and hypothesized category proportions
\chi^2_s = n \sum_{i=1}^{r} \frac{(\hat{p}_i - p_{0i})^2}{p_{0i}}.
The Pearson statistic χ2s is “small” when all of the observed counts (propor-
tions) are close to the expected counts (proportions). The Pearson χ2 is “large”
when one or more observed counts (proportions) differs noticeably from what
is expected when H0 is true. Put another way, large values of χ2s suggest that
H0 is false.
The critical value χ2crit for the test is obtained from a chi-squared probability
table with r − 1 degrees of freedom. The picture below shows the form of the
rejection region. For example, if r = 5 and α = 0.05, then you reject H0 when
χ2s ≥ χ2crit = 9.49 (qchisq(0.95, 5-1)). The p-value for the test is the area
under the chi-squared curve with df = r − 1 to the right of the observed χ2s
value.
[Figure: chi-squared distribution showing the rejection region to the right of the critical value χ²Crit, and the p-value as the area to the right of the observed χ²S]
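A minimal sketch of these two calculations; the observed statistic below is a made-up placeholder.
# Sketch: critical value and p-value for a GOF test with r categories
r <- 5
qchisq(0.95, df = r - 1)                           # critical value, about 9.49
chi2.obs <- 12.3                                   # hypothetical observed statistic
pchisq(chi2.obs, df = r - 1, lower.tail = FALSE)   # p-value = area to the right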
Example: jury pool Let p18 be the proportion in the jury pool population
between ages 18 and 19. Define p20, p25, p30, p40, p50, and p65 analogously.
You are interested in testing that the true jury proportions equal the census
proportions, H0 : p18 = 0.061, p20 = 0.150, p25 = 0.135, p30 = 0.217, p40 =
0.153, p50 = 0.182, and p65 = 0.102 against HA : not H0, using the sample of
1336 from the jury pool.
The observed counts, the expected counts, and the category residuals are
given in the table below. For example, E18 = 1336 × (0.061) = 81.5 and
Z18 = (23 − 81.5)/√81.5 = −6.48 in the 18-19 year category.
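The expected count and residual for the 18-19 category can be reproduced directly (a small sketch using only the numbers quoted above).
# Sketch: expected count and residual for the 18-19 age category
n   <- 1336
E18 <- n * 0.061                # 81.5
Z18 <- (23 - E18) / sqrt(E18)   # about -6.48
c(E18, Z18)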
The Pearson statistic is
7.4.2 R Implementation
#### Example: jury pool
jury <- read.table(text="
Age Count CensusProp
18-19 23 0.061
20-24 96 0.150
25-29 134 0.135
30-39 293 0.217
40-49 297 0.153
50-64 380 0.182
Plot observed vs expected values to help identify age groups that deviate
the most. Plot contribution to chi-square values to help identify age groups
that deviate the most. The term “Contribution to Chi-Square” (chisq) refers
to the values of (O − E)²/E for each category. χ²s is the sum of those contributions.
library(reshape2)
x.table.obsexp <- melt(x.table,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars = c("age"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("obs", "exp"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "stat",
# value.name: column name for values in table
value.name = "value"
)
# naming variables manually, the variable.name and value.name not working 11/2012
names(x.table.obsexp) <- c("age", "stat", "value")
# Contribution to chi-sq
# pull out only the age and chisq columns
x.table.chisq <- x.table[, c("age","chisq")]
# reorder the age categories to be descending relative to the chisq statistic
x.table.chisq$age <- with(x.table, reorder(age, -chisq))
[Figure: observed and expected counts by age category (years), and sorted contributions to the chi-square statistic by age category]
The CIs for the 30-39 and 65-99 year categories contain the census propor-
tions. In the other five age categories, there are significant differences between
the jury pool proportions and the census proportions. In general, young adults
appear to be underrepresented in the jury pool whereas older age groups are
overrepresented.
Age p.value CI.lower CI.upper Observed CensusProp
1 18-19 0.000 0.009 0.029 0.017 0.061
2 20-24 0.000 0.054 0.093 0.072 0.150
3 25-29 0.000 0.079 0.124 0.100 0.135
4 30-39 0.842 0.190 0.251 0.219 0.217
5 40-49 0.000 0.192 0.254 0.222 0.153
6 50-64 0.000 0.252 0.319 0.284 0.182
7 65-99 0.037 0.065 0.107 0.085 0.102
The residuals also highlight significant differences because the largest resid-
uals correspond to the categories that contribute most to the value of χ2s. Some
researchers use the residuals for the multiple comparisons, treating the Zis as
standard normal variables. Following this approach, you would conclude that
the jury pool proportions differ from the proportions in the general population
in every age category where |Zi| ≥ 2.70 (using the same Bonferroni correction).
This gives the same conclusion as before.
The two multiple comparison methods are similar, but not identical. The
residuals
Z_i = \frac{O_i - E_i}{\sqrt{E_i}} = \frac{\hat{p}_i - p_{0i}}{\sqrt{p_{0i}/n}}
agree with the large-sample statistic for testing H0 : pi = p0i, except that the
divisor in Zi omits a 1 − p0i term. The Zis are not standard normal random
variables as assumed, and the value of Zi underestimates the significance of the
observed differences. Multiple comparisons using the Zis will find, on average,
fewer significant differences than the preferred method based on the large sample
tests. However, the differences between the two methods are usually minor when
all of the hypothesized proportions are small.
The New Mexico state legislature is interested in how the proportion of reg-
istered voters that support Indian gaming differs between New Mexico and
Colorado. Assuming neither population proportion is known, the state’s statis-
tician might recommend that the state conduct a survey of registered voters
sampled independently from the two states, followed by a comparison of the
sample proportions in favor of Indian gaming.
Statistical methods for comparing two proportions using independent sam-
ples can be formulated as follows. Let p1 and p2 be the proportion of populations
1 and 2, respectively, with the attribute of interest. Let p̂1 and p̂2 be the corre-
sponding sample proportions, based on independent random or representative
samples of size n1 and n2 from the two populations.
7.5.1 Large Sample CI and Tests for p1 − p2
A large-sample CI for p1 − p2 is (p̂1 − p̂2) ± zcrit SE_CI(p̂1 − p̂2), where zcrit is
the standard normal critical value for the desired confidence level, and
SE_{CI}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
is the CI standard error for p̂1 − p̂2. The pooled proportion
\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}
is the proportion of successes in the two samples combined. The test standard
error has the same functional form as the CI standard error, with p̄ replacing
the individual sample proportions.
The pooled proportion is the best guess at the common population propor-
tion when H0 : p1 = p2 is true. The test standard error estimates the standard
deviation of p̂1 − p̂2 assuming H0 is true.
Remark: As in the one-sample proportion problem, the test and CI SE’s are
different. This can (but usually does not) lead to some contradiction between
the test and CI.
Example, vitamin C Two hundred and seventy nine (279) French skiers
were studied during two one-week periods in 1961. One group of 140 skiers
received a placebo each day, and the other 139 received 1 gram of ascorbic
acid (Vitamin C) per day. The study was double blind — neither the subjects
nor the researchers knew who received which treatment. Let p1 be the prob-
ability that a member of the ascorbic acid group contracts a cold during the
study period, and p2 be the corresponding probability for the placebo group.
Linus Pauling (Chemistry and Peace Nobel prize winner) and I are interested in
testing whether H0 : p1 = p2. The data are summarized below as a two-by-two
table of counts (a contingency table)
Conditional probability
In probability theory, a conditional probability is the probability that an event
will occur, when another event is known to occur or to have occurred. If
the events are A and B respectively, this is said to be “the probability of A
given B”. It is commonly denoted by Pr(A|B). Pr(A|B) may or may not be
equal to Pr(A), the probability of A. If they are equal, A and B are said to
be independent. For example, if a coin is flipped twice, “the outcome of the
second flip” is independent of “the outcome of the first flip”.
In the Vitamin C example above, the unconditional observed probability of
contracting a cold is Pr(cold) = (17 + 31)/(139 + 140) = 0.172. The condi-
tional observed probabilities are Pr(cold|ascorbic acid) = 17/139 = 0.1223 and
Pr(cold|placebo) = 31/140 = 0.2214. The two-sample test of H0 : p1 = p2
where p1 = Pr(cold|ascorbic acid) and p2 = Pr(cold|placebo) is effectively test-
ing whether Pr(cold) = Pr(cold|ascorbic acid) = Pr(cold|placebo). This tests
whether contracting a cold is independent of the vitamin C treatment.
The standard two-sample CI and test used above are appropriate when each
sample is large. A rule of thumb suggests a minimum of at least five successes
(i.e., observations with the characteristic of interest) and failures (i.e., observa-
tions without the characteristic of interest) in each sample before using these
methods. This condition is satisfied in our two examples.
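As a sketch of the large-sample comparison for the vitamin C counts quoted above (17/139 colds on ascorbic acid, 31/140 on placebo), prop.test() reports the test in its chi-squared form together with a CI for p1 − p2.
# Sketch: two-sample test and CI for p1 - p2 (ascorbic acid vs placebo)
colds <- c(17, 31)      # colds in the ascorbic acid and placebo groups
n     <- c(139, 140)    # group sizes
prop.test(colds, n, correct = FALSE)
sum(colds) / sum(n)     # pooled proportion, 48/279 = 0.172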
ClickerQ s — Comparing two proportions STT.08.02.010
is commonly reported when the individual risks p1 and p2 are small. The odds ratio
OR = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}
is another standard measure. Here p1/(1 − p1) is the odds of being diseased
in the exposed group, whereas p2/(1 − p2) is the odds of being diseased in the
non-exposed group.
I mention these measures because you may see them or hear about them.
Note that each of these measures can be easily estimated from data, using the
sample proportions as estimates of the unknown population proportions. For
example, in the vitamin C study:
Outcome Ascorbic Acid Placebo
# with cold 17 31
# with no cold 122 109
Totals 139 140
the proportion with colds in the placebo group is p̂2 = 31/140 = 0.221. The
proportion with colds in the vitamin C group is p̂1 = 17/139 = 0.122.
The estimated absolute difference in risk is p̂1 −p̂2 = 0.122−0.221 = −0.099.
The estimated risk ratio and odds ratio are
\widehat{RR} = \frac{0.122}{0.221} = 0.55
and
\widehat{OR} = \frac{0.122/(1 - 0.122)}{0.221/(1 - 0.221)} = 0.49,
respectively.
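A small sketch reproducing these estimates from the table counts:
# Sketch: estimated risk ratio and odds ratio for the vitamin C table
p1 <- 17 / 139    # risk of a cold on ascorbic acid
p2 <- 31 / 140    # risk of a cold on placebo
p1 / p2                               # estimated RR, about 0.55
(p1 / (1 - p1)) / (p2 / (1 - p2))     # estimated OR, about 0.49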
A 95% CI for pA+ − p+A is (0.590 − 0.550) ± 0.019, or (0.021, 0.059). You
are 95% confident that the population proportion of voter-age Americans that
approved of the President’s performance the first month was between 0.021
and 0.059 larger than the proportion that approved one month later. This
gives evidence of a decrease in the President’s approval rating.
A test of H0 : pA+ = p+A can be based on the CI for pA+ − p+A, or on a
standard normal approximation to the test statistic
z_s = \frac{\hat{p}_{A+} - \hat{p}_{+A}}{SE_{test}(\hat{p}_{A+} - \hat{p}_{+A})},
where the test standard error is given by
SE_{test}(\hat{p}_{A+} - \hat{p}_{+A}) = \sqrt{\frac{\hat{p}_{A+} + \hat{p}_{+A} - 2\hat{p}_{AA}}{n}}.
The test statistic is often written in the simplified form
z_s = \frac{n_{AD} - n_{DA}}{\sqrt{n_{AD} + n_{DA}}},
where the n_{ij}'s are the observed cell counts. An equivalent form of this test,
based on comparing the square of zs to a chi-squared distribution with 1 degree
of freedom, is the well-known McNemar’s test for marginal homogeneity (or
symmetry) in the two-by-two table.
For example, in the Presidential survey
z_s = \frac{150 - 86}{\sqrt{150 + 86}} = 4.17.
The p-value for a two-sided test is, as usual, the area under the standard normal
curve outside ±4.17. The p-value is less than 0.001, suggesting that H0 is false.
R can perform this test as McNemar’s test.
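A minimal sketch of McNemar's test for the Presidential survey; only the off-diagonal counts (150 and 86) are quoted here, so the diagonal entries below are hypothetical placeholders (they do not enter the test statistic).
# Sketch: McNemar's test for marginal homogeneity in a 2x2 paired table
pres <- matrix(c(350, 150,     # Approve month 1:    (Approve, Disapprove) month 2
                  86, 570),    # Disapprove month 1: (Approve, Disapprove) month 2
               nrow = 2, byrow = TRUE,
               dimnames = list(Month1 = c("Approve", "Disapprove"),
                               Month2 = c("Approve", "Disapprove")))
mcnemar.test(pres, correct = FALSE)   # statistic = (150 - 86)^2/(150 + 86), about 4.17^2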
# association plot
library(vcd)
assoc(candeath, shade=TRUE)
[Figure: association plot of Age by Location of death, shaded by Pearson residuals; p-value < 2e-16]
4 http://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdf
For example, a sieve plot for an n-way contingency table plots rectangles
with areas proportional to the expected cell frequencies and filled with a number
of squares equal to the observed frequencies. Thus, the densities visualize the
deviations of the observed from the expected values.
# sieve plot
library(vcd)
# plot observed table, then label cells with observed values in the cells
sieve(candeath, pop = FALSE, shade = TRUE)
labeling_cells(text = candeath, gp_text = gpar(fontface = 2))(as.table(candeath))
[Figure: sieve plots of Age by Location of death, with the observed counts labeled in the cells]
The p-value for the chi-squared test is 0.00005, which leads to rejecting H0
at the 0.05 or 0.01 levels. The data strongly suggest there are differences in the
effectiveness of the various treatments for postoperative nausea.
#### Example: drugs and nausea, testing for homogeneity
nausea <-
matrix(c(96, 70, 52, 100, 52, 33, 35, 32, 37, 48),
nrow = 5, byrow = TRUE,
dimnames = list("Drug" = c("PL", "CH", "DI", "PE100", "PE150"),
"Result" = c("Nausea", "No Nausea")))
nausea
## Result
## Drug Nausea No Nausea
## PL 96 70
## CH 52 100
## DI 52 33
## PE100 35 32
## PE150 37 48
# Sorted proportions of nausea by drug
nausea.prop <- sort(nausea[,1]/rowSums(nausea))
nausea.prop
## CH PE150 PE100 PL DI
## 0.3421 0.4353 0.5224 0.5783 0.6118
# chi-sq test of association
chisq.summary <- chisq.test(nausea, correct=FALSE)
chisq.summary
##
## Pearson's Chi-squared test
##
## data: nausea
## X-squared = 24.83, df = 4, p-value = 5.451e-05
# All expected frequencies are at least 5
chisq.summary$expected
## Result
## Drug Nausea No Nausea
## PL 81.35 84.65
## CH 74.49 77.51
## DI 41.66 43.34
## PE100 32.84 34.16
## PE150 41.66 43.34
A sensible follow-up analysis is to identify which treatments were responsible
for the significant differences. For example, the placebo and chlorpromazine can
be compared using a test of pP L = pCH or with a CI for pP L − pCH .
In certain experiments, specific comparisons are of interest, for example a
comparison of the drugs with the placebo. Alternatively, all possible compar-
isons might be deemed relevant. The second case is suggested here based on
the problem description. I will use a Bonferroni adjustment to account for the
multiple comparisons. The Bonferroni adjustment accounts for data dredging,
but at a cost of less sensitive comparisons.
There are 10 possible comparisons here. The Bonferroni analysis with an
overall Family Error Rate of 0.05 (or less) tests the 10 individual hypotheses at
the 0.05/10=0.005 level.
nausea.table <- data.frame(Interval = rep(NA,10)
, CI.lower = rep(NA,10)
, CI.upper = rep(NA,10)
, Z = rep(NA,10)
, p.value = rep(NA,10)
, sig.temp = rep(NA,10)
, sig = rep(NA,10))
# row names for table
nausea.table[,1] <- c("p_PL - p_CH"
, "p_PL - p_DI"
, "p_PL - p_PE100"
, "p_PL - p_PE150"
, "p_CH - p_DI"
, "p_CH - p_PE100"
, "p_CH - p_PE150"
, "p_DI - p_PE100"
, "p_DI - p_PE150"
, "p_PE100 - p_PE150")
# test results together in a table
i.tab <- 0
for (i in 1:4) {
for (j in (i+1):5) {
i.tab <- i.tab + 1
nausea.summary <- prop.test(nausea[c(i,j),], correct = FALSE, conf.level = 1-0.05/10)
nausea.table[i.tab, 2:6] <- c(nausea.summary$conf.int[1]
, nausea.summary$conf.int[2]
, sign(-diff(nausea.summary$estimate)) * nausea.summary$statistic^0.5
, nausea.summary$p.value
, (nausea.summary$p.value < 0.05/10))
if (nausea.table$sig.temp[i.tab] == 1) { nausea.table$sig[i.tab] <- "*" }
else { nausea.table$sig[i.tab] <- " " }
}
}
The following table gives two-sample tests of proportions with nausea and
99.5% CIs for the differences between the ten pairs of proportions. The only
two p-values less than 0.005 correspond to pPL − pCH and pCH − pDI.
I am 99.5% confident that pCH is between 0.084 and 0.389 less than pPL, and
I am 99.5% confident that pCH is between 0.086 and 0.453 less than pDI. The
other differences are not significant.
Interval CI.lower CI.upper Z p.value sig
1 p PL - p CH 0.0838 0.3887 4.2182 0.0000 *
2 p PL - p DI −0.2167 0.1498 −0.5099 0.6101
3 p PL - p PE100 −0.1464 0.2582 0.7788 0.4361
4 p PL - p PE150 −0.0424 0.3284 2.1485 0.0317
5 p CH - p DI −0.4532 −0.0861 −4.0122 0.0001 *
6 p CH - p PE100 −0.3828 0.0222 −2.5124 0.0120
7 p CH - p PE150 −0.2788 0.0924 −1.4208 0.1554
8 p DI - p PE100 −0.1372 0.3160 1.1058 0.2688
9 p DI - p PE150 −0.0352 0.3881 2.3034 0.0213
10 p PE100 - p PE150 −0.1412 0.3154 1.0677 0.2857
Using ANOVA-type groupings, and arranging the treatments from most to
least effective (low proportions to high), we get:
CH (0.34) PE150 (0.44) PE100 (0.52) PL (0.58) DI (0.61)
---------------------------------------
---------------------------------------------------
Chapter 8
Correlation and Regression
Learning objectives
After completing this topic, you should be able to:
select graphical displays that reveal the relationship between two continu-
ous variables.
summarize model fit.
interpret model parameters, such as slope and R2.
assess the model assumptions visually and numerically.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
8.1 Introduction
Suppose we select n = 10 people from the population of college seniors who
plan to take the medical college admission test (MCAT) exam. Each takes the
test, is coached, and then retakes the exam. Let Xi be the pre-coaching score
and let Yi be the post-coaching score for the ith individual, i = 1, 2, . . . , n.
There are several questions of potential interest here, for example: Are Y and
X related (associated), and how? Does coaching improve your MCAT score?
Can we use the data to develop a mathematical model (formula) for predicting
post-coaching scores from the pre-coaching scores? These questions can be
addressed using correlation and regression models.
The correlation coefficient is a standard measure of association or
relationship between two features Y and X. Most scientists equate Y and X
being correlated to mean that Y and X are associated, related, or dependent
upon each other. However, correlation is only a measure of the strength of a
linear relationship. For later reference, let ρ be the correlation between Y
and X in the population and let r be the sample correlation. I define r below.
The population correlation is defined analogously from population data.
Suppose each of n sampled individuals is measured on two quantitative
characteristics called Y and X. The data are pairs of observations (X1, Y1),
(X2, Y2), . . ., (Xn, Yn), where (Xi, Yi) is the (X, Y ) pair for the ith individual in
the sample. The sample correlation between Y and X, also called the Pearson
product moment correlation coefficient, is
$$
r = \frac{S_{XY}}{S_X S_Y}
  = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}},
$$
where
$$
S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
$$
is the sample covariance between Y and X, and
$S_Y = \sqrt{\sum_i (Y_i - \bar{Y})^2/(n-1)}$ and
$S_X = \sqrt{\sum_i (X_i - \bar{X})^2/(n-1)}$
are the standard deviations for the Y and X samples.
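As a quick numerical check of this formula (with made-up x and y vectors, not data from the notes), the value computed from the definition matches R's cor():
# compute r from the definition and compare with cor()
x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 6, 7, 9)
sxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # sample covariance S_XY
sxy / (sd(x) * sd(y))  # r from the definition
cor(x, y)              # same value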
Important properties of r:
1. −1 ≤ r ≤ 1.
2. If Yi tends to increase linearly with Xi then r > 0.
3. If Yi tends to decrease linearly with Xi then r < 0.
4. If there is a perfect linear relationship between Yi and Xi with a positive
slope then r = +1.
5. If there is a perfect linear relationship between Yi and Xi with a negative
slope then r = −1.
6. The closer the points (Xi, Yi) come to forming a straight line, the closer
r is to ±1.
7. The magnitude of r is unchanged if either the X or Y sample is trans-
formed linearly (such as feet to inches, pounds to kilograms, Celsius to
Fahrenheit).
8. The correlation does not depend on which variable is called Y and which
is called X.
If r is near ±1, then there is a strong linear relationship between Y and X
in the sample. This suggests we might be able to accurately predict Y from X
with a linear equation (i.e., linear regression). If r is near 0, there is a weak
linear relationship between Y and X, which suggests that a linear equation
provides little help for predicting Y from X. The pictures below should help
you develop a sense about the size of r.
Note that r = 0 does not imply that Y and X are not related in the sample.
It only implies they are not linearly related. For example, in the last plot r = 0
yet $Y_i = X_i^2$, exactly.
[Figure: example scatterplots illustrating correlations of 1, −1, 0.3, and −0.3, ending with a panel where r = 0 even though Y = X² exactly.]
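A one-line illustration of that last point (with made-up data, not from the notes): an exact quadratic relationship can still have zero correlation.
# r is 0 here even though y is an exact function of x
x <- -5:5
y <- x^2
cor(x, y)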
Here are scatterplots for the original data and the ranks of the data using
ggpairs() from the GGally package with ggplot2.
# Plot the data using ggplot
library(ggplot2)
library(GGally)
## Loading required package: reshape
##
## Attaching package: ’reshape’
##
## The following object is masked from ’package:class’:
##
## condense
##
## The following objects are masked from ’package:plyr’:
##
## rename, round_any
##
## The following objects are masked from ’package:reshape2’:
##
## colsplit, melt, recast
p1 <- ggpairs(thyroid)
print(p1)
p2 <- ggpairs(thyroid[,4:6])
print(p2)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)
[Figure: ggpairs scatterplot matrices for the original variables (weight, time, blood_loss) and their ranks. Pearson correlations: weight–time −0.0663, weight–blood_loss −0.772, time–blood_loss −0.107; the rank (Spearman) correlations shown are 0.286, −0.874, and −0.156.]
Comments:
1. (Pearson correlations). Blood loss tends to decrease linearly as weight
increases, so r should be negative. The output gives r = −0.77. There is
not much of a linear relationship between blood loss and time, so r should
be close to 0. The output gives r = −0.11. Similarly, weight and time
have a weak negative correlation, r = −0.07.
2. The Pearson and Spearman correlations are fairly consistent here. Only
the correlation between blood loss and weight is significantly different from
zero at the α = 0.05 level (the p-values are given below the correlations).
3. (Spearman p-values) R gives the correct p-values. Calculating the p-value
using the Pearson correlation on the ranks is not correct, strictly speaking.
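The correlations and tests discussed above can be reproduced directly; a short sketch, assuming the thyroid data frame contains the weight, time, and blood_loss columns used later in this chapter:
# Pearson correlation matrix, and tests for blood loss vs weight
cor(thyroid[, c("weight", "time", "blood_loss")])
cor.test(thyroid$weight, thyroid$blood_loss)                       # Pearson
cor.test(thyroid$weight, thyroid$blood_loss, method = "spearman")  # Spearman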
ClickerQ s — Coney Island STT.02.02.050
8.3 Simple Linear Regression
[Figure: two scatterplots of Y versus X.]
In linear regression we model the mean of Y as a straight-line function of X. The least squares (LS) estimates of the intercept and slope are the values $b_0$ and $b_1$ that minimize the sum of squared deviations $\sum_i (Y_i - \beta_0 - \beta_1 X_i)^2$
over all possible choices of β0 and β1. These values can be obtained using
calculus. Rather than worry about this calculation, note that the LS line makes
the sum of squared (vertical) deviations between the responses Yi and the line
as small as possible, over all possible lines. The LS line goes through the mean
point, (X̄, Ȳ), which is typically in "the heart" of the data, and is often
closely approximated by an eye-ball fit to the data.
[Figure: scatterplot of Y versus X with the LS line passing through the mean point (X̄, Ȳ).]
The LS line is
$$ \hat{y} = b_0 + b_1 X, $$
where the estimated slope and intercept are
$$ b_1 = \frac{S_{XY}}{S_X^2} \quad \text{and} \quad b_0 = \bar{Y} - b_1 \bar{X}. $$
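These formulas can be checked against lm() directly; a short sketch using the thyroid data:
# slope and intercept from the formulas, to compare with the lm() fit below
x <- thyroid$weight
y <- thyroid$blood_loss
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # S_XY / S_X^2
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)  # approximately 552.4 and -1.30, matching the fit below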
# fit the simple linear regression model
lm.blood.wt <- lm(blood_loss ~ weight, data = thyroid)
lm.blood.wt
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Coefficients:
## (Intercept) weight
## 552.4 -1.3
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.blood.wt)
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.57 -6.19 4.71 8.19 9.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.442 21.441 25.77 2.3e-07 ***
## weight -1.300 0.436 -2.98 0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.7 on 6 degrees of freedom
## Multiple R-squared: 0.597,Adjusted R-squared: 0.529
## F-statistic: 8.88 on 1 and 6 DF, p-value: 0.0247
# Base graphics: Plot the data with linear regression fit and confidence bands
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)
[Figure: scatterplots of blood_loss versus weight with the fitted LS regression line.]
The fitted value for the $i$th case is $\hat{Y}_i = b_0 + b_1 X_i$, and the residual is $e_i = Y_i - \hat{Y}_i$, the vertical distance between the observed response and the LS line.
[Figure: illustration of a residual as the vertical distance between the observed response and the fitted value on the LS line.]
The Residual SS, or sum of squared residuals, is small if each Ŷi is close to
Yi (i.e., the line closely fits the data). It can be shown that
$$ \text{Total SS in } Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \;\ge\; \text{Res SS} \;\ge\; 0. $$
Also define
$$ \text{Regression SS} = \text{Reg SS} = \text{Total SS} - \text{Res SS} = b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) $$
and
$$ R^2 = \text{coefficient of determination} = \frac{\text{Reg SS}}{\text{Total SS}}. $$
To understand the interpretation of $R^2$, at least in two extreme cases, note that
[Figure: scatterplot of the variation in Y versus the variation in X, illustrating one extreme case for R².]
Furthermore,
[Figure: scatterplot of Y versus X illustrating the other extreme case for R².]
Yi = β0 + β1Xi + εi
(i.e., Response = Mean Response + Residual), where the $\varepsilon_i$ are, by virtue of
assumptions 2, 3, and 4, independent normal random variables with mean 0
and variance $\sigma_{Y|X}^2$. The following picture might help see this. Note that the
population regression line is unknown, and is estimated from the data using the
LS line.
[Figure: the population regression line with the normal distribution of Y at each X; ε_i is the deviation of Y_i from the mean response at X_i.]
In the plot below, data are simulated where $y_i = 4 - 2x_i + e_i$, with
$x_i \sim$ Gamma(3, 0.5) and $e_i \sim$ Normal(0, 3²). The data are plotted, a
linear regression is fit, and the mean regression line is overlaid. Selected normal
distributions, with variance estimated from the linear model fit, are also overlaid,
one of which indicates limits at two standard deviations. See the R code to create
this image.
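That code is not reproduced here; the sketch below shows one way the simulation could be set up (the sample size, seed, and the shape/rate reading of Gamma(3, 0.5) are assumptions for illustration):
# sketch: simulate y = 4 - 2 x + e with x ~ Gamma(3, 0.5) and e ~ Normal(0, 3^2), then fit
set.seed(7)
n <- 50
x <- rgamma(n, shape = 3, rate = 0.5)
e <- rnorm(n, mean = 0, sd = 3)
y <- 4 - 2 * x + e
lm.sim <- lm(y ~ x)
plot(x, y)
abline(lm.sim)  # overlay the fitted mean regression line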
[Figure: simulated data with the fitted regression line and overlaid normal error distributions.]
1. Validity. Most importantly, the data you are analyzing should map to
the research question you are trying to answer. This sounds obvious but
is often overlooked or ignored because it can be inconvenient.
5. Normality of errors.
Normality and equal variance are typically minor concerns, unless you’re using
the model to make predictions for individual data points.
8.5.1 Back to the Data
There are three unknown population parameters in the model: β0, β1 and
σY2 |X . Given the data, the LS line
$\hat{Y} = b_0 + b_1 X$ estimates the population regression line, and $\sigma_{Y|X}^2$ is estimated by the residual mean square
$$ s_{Y|X}^2 = \text{Res MS} = \frac{\text{Res SS}}{\text{Res df}} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2}. $$
The denominator df = n − 2 is the number of observations minus the number
of beta parameters in the model, i.e., β0 and β1.
A CI for $\beta_1$ is given by $b_1 \pm t_{\text{crit}} \, SE_{b_1}$, where $t_{\text{crit}}$ is the appropriate critical value for the desired CI level from a
t-distribution with df = Res df.
To test $H_0: \beta_1 = \beta_{10}$ (a given value) against $H_A: \beta_1 \ne \beta_{10}$, reject $H_0$ if
$|t_s| \ge t_{\text{crit}}$, where
$$ t_s = \frac{b_1 - \beta_{10}}{SE_{b_1}}, $$
and tcrit is the t-critical value for a two-sided test, with the desired size and
df =Res df . Alternatively, you can evaluate a p-value in the usual manner to
make a decision about H0.
# CI for beta1
sum.lm.blood.wt <- summary(lm.blood.wt)
sum.lm.blood.wt$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.4 21.4409 25.77 2.253e-07
## weight -1.3 0.4364 -2.98 2.465e-02
est.beta1 <- sum.lm.blood.wt$coefficients[2,1]
se.beta1 <- sum.lm.blood.wt$coefficients[2,2]
sum.lm.blood.wt$fstatistic
## value numdf dendf
## 8.878 1.000 6.000
df.beta1 <- sum.lm.blood.wt$fstatistic[3]
t.crit <- qt(1-0.05/2, df.beta1)
t.crit
## [1] 2.447
CI.lower <- est.beta1 - t.crit * se.beta1
CI.upper <- est.beta1 + t.crit * se.beta1
c(CI.lower, CI.upper)
## [1] -2.3682 -0.2325
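The t statistic and p-value in that table can be reproduced from the same quantities; a short sketch continuing the code above:
# t statistic and two-sided p-value for H0: beta1 = 0, by hand
t.s <- (est.beta1 - 0) / se.beta1
p.val <- 2 * pt(-abs(t.s), df = df.beta1)
c(t.s, p.val)  # about -2.98 and 0.025, matching the parameter estimates table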
The parameter estimates table gives the standard error, t-statistic, and p-
value for testing H0 : β1 = 0. Analogous summaries are given for the intercept,
β0, but these are typically of less interest.
8.6.1 Testing β1 = 0
Assuming the mean relationship is linear, consider testing H0 : β1 = 0 against
HA : β1 6= 0. This test can be conducted using a t-statistic, as outlined above,
or with an ANOVA F -test, as outlined below.
For the analysis of variance (ANOVA) F-test, compute
$$ F_s = \frac{\text{Reg MS}}{\text{Res MS}} $$
and reject H0 when Fs exceeds the critical value (for the desired size test) from
an F -table with numerator df = 1 and denominator df = n − 2 (see qf()).
The hypothesis of zero slope (or no relationship) is rejected when Fs is large,
which happens when a significant portion of the variation in Y is explained by
the linear relationship with X.
The p-values from the t-test and the F -test are always equal. Furthermore
this p-value is equal to the p-value for testing no correlation between Y and X,
using the t-test described earlier. Is this important, obvious, or disconcerting?
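You can see this equivalence numerically for the blood-loss fit by comparing the ANOVA table with the slope t-test in the summary above:
# ANOVA F-test of the regression: F_s = Reg MS / Res MS
anova(lm.blood.wt)
# the F statistic (8.88 on 1 and 6 df) and p-value (0.0247) match the slope t-test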
Comments
1. The prediction interval is wider than the CI for the mean response. This
is reasonable because you are less confident in predicting an individual
response than the mean response for all individuals.
2. The CI for the mean response and the prediction interval for an individual
response become wider as Xp moves away from X̄. That is, you get a
more sensitive CI and prediction interval for Xps near the center of the
data.
3. The plots below include confidence and prediction bands along with the
fitted LS line.
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid, aes(x = weight, y = blood_loss))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = TRUE)
print(p)
# Base graphics: Plot the data with linear regression fit and confidence bands
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)
# x values of weight for predictions of confidence bands
x.pred <- data.frame(weight = seq(min(thyroid$weight), max(thyroid$weight),
length = 20))
# draw upper and lower confidence bands
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "upr"], col = "blue")
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "lwr"], col = "blue")
[Figure: blood_loss versus weight with the fitted LS line and confidence bands (ggplot and base graphics).]
8.8.1 Introduction
Yi = β0 + β1Xi + εi
where the $\varepsilon_i$ are independent normal random variables with mean 0 and variance $\sigma^2$. The model implies (1) The average Y -value at a given X-value is
linearly related to X. (2) The variation in responses Y at a given X value
is constant. (3) The population of responses Y at a given X is normally dis-
tributed. (4) The observed data are a random sample.
A regression analysis is never complete until these assumptions have been
checked. In addition, you need to evaluate whether individual observations, or
groups of observations, are unduly influencing the analysis. A first step in any
analysis is to plot the data. The plot provides information on the linearity and
constant variance assumption.
[Figure: four scatterplots of Y versus X, panels (a)–(d), showing combinations of linear or nonlinear trend with constant or non-constant variance.]
Figure (a) is the only plot that is consistent with the assumptions. The plot
shows a linear relationship with constant variance. The other figures show one
or more deviations. Figure (b) shows a linear relationship but the variability
increases as the mean level increases. In Figure (c) we see a nonlinear relation-
ship with constant variance, whereas (d) exhibits a nonlinear relationship with
non-constant variance.
In many examples, nonlinearity or non-constant variability can be addressed
by transforming Y or X (or both), or by fitting polynomial models.
These issues will be addressed later.
The residual is the difference between the observed values and predicted or
fitted values. The residual is the part of the observation that is not explained
by the fitted model. You can analyze residuals to determine the adequacy of
the model. A large residual identifies an observation poorly fit by the model.
The residuals are usually plotted in various ways to assess potential inad-
equacies. The observed residuals have different variances, depending on Xi.
Recall that the standard error of $\hat{Y}_i$ (and therefore $e_i$) is
$$ SE(\hat{Y}_i) = SE(e_i) = s_{Y|X} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}}. $$
The studentized residual for the $i$th case is
$$ r_i = \frac{e_i}{SE(e_i)}. $$
The standardized residual is the residual, ei, divided by an estimate of its stan-
dard deviation. This form of the residual takes into account that the residuals
may have different variances, which can make it easier to detect outliers. The
studentized residuals have a constant variance of 1 (approximately). Standard-
ized residuals greater than 2 and less than −2 are usually considered large. I
will focus on diagnostic methods using the studentized residuals.
A plot of the studentized residuals ri against the fitted values Ŷi often reveals
inadequacies with the model. The real power of this plot is with multiple
predictor problems (multiple regression). The information contained in this
plot with simple linear regression is similar to the information contained in the
original data plot, except it is scaled better and eliminates the effect of the
trend on your perceptions of model adequacy. The residual plot should exhibit
no systematic dependence of the sign or the magnitude of the residuals on the
fitted values:
[Figure: scatterplot of Y versus X (left) and studentized residuals versus fitted values (right) for a model satisfying the assumptions.]
The following sequence of plots show how inadequacies in the data plot
appear in a residual plot. The first plot shows a roughly linear relationship
between Y and X with non-constant variance. The residual plot shows a
megaphone shape rather than the ideal horizontal band. A possible remedy is
a weighted least squares analysis to handle the non-constant variance (see
end of chapter for an example), or to transform Y to stabilize the variance.
Transforming the data may destroy the linearity.
[Figures: data plots and their corresponding residual plots illustrating model inadequacies, including the megaphone pattern of non-constant variance.]
library(ggplot2)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
[Figure: four panels of simulated data — constant variance with constant sample size, constant variance with different sample sizes, different variance with constant sample size, and different variance with different sample sizes.]
8.8.5 Outliers
Outliers are observations that are poorly fitted by the regression model. The
response for an outlier is far from the fitted line, so outliers have large positive
or negative values of the studentized residual ri. Usually, |ri| > 2 is considered
large. Outliers are often highlighted in residual plots.
What do you do with outliers? Outliers may be due to incorrect recordings
of the data or failure of the measuring device, or indications of a change in the
mean or variance structure for one or more cases. Incorrect recordings should
be fixed if possible, but otherwise deleted from the analysis.
Routine deletion of outliers from the analysis is not recommended. This
practice can have a dramatic effect on the fit of the model and the perceived
precision of parameter estimates and predictions. Analysts who routinely omit
outliers without cause tend to overstate the significance of their findings and get
a false sense of precision in their estimates and predictions. To assess effects of
outliers, a data analyst should repeat the analysis holding out the outliers to see
whether any substantive conclusions are changed. Very often the only real effect
of an outlier is to inflate MSE and hence make p-values a little larger and CIs
a little wider than necessary, but without substantively changing conclusions.
They can completely mask underlying patterns, however.
[Figure: two scatterplots, one with an outlying response value and one with a high-leverage value in X.]
In the second plot, the extreme value is a high leverage value, which is
basically an outlier among the X values; Y does not enter in this calculation.
This influential observation is not an outlier because its presence in the analysis
determines that the LS line will essentially pass through it! These are values
with the potential of greatly distorting the fitted model. They may or may not
actually have distorted it.
The hat variable from the influence() function on the object returned from
lm() fit will give the leverages: influence(lm.output)$hat. Leverage values
fall between 0 and 1. Experts consider a leverage value greater than 2p/n or
3p/n, where p is the number of predictors or factors plus the constant and n is
the number of observations, large and suggest you examine the corresponding
observation. A rule-of-thumb is to identify observations with leverage over 3p/n
or 0.99, whichever is smaller.
Dennis Cook developed a measure of the impact that individual cases have
on the placement of the LS line. His measure, called Cook’s distance or
Cook’s D, provides a summary of how far the LS line changes when each
individual point is held out (one at a time) from the analysis. While high
leverage values indicate observations that have the potential of causing trouble,
those with high Cook’s D values actually do disproportionately affect the overall
fit. The case with the largest D has the greatest impact on the placement of
the LS line. However, the actual influence of this case may be small. In the
plots above, the observations I focussed on have the largest Cook’s Ds.
A simple, but not unique, expression for Cook's distance for the $j$th case is
$$ D_j \propto \sum_i (\hat{Y}_i - \hat{Y}_{i[-j]})^2, $$
where $\hat{Y}_{i[-j]}$ is the fitted value for the $i$th case when the LS line is computed from
all the data except case $j$. Here $\propto$ means that $D_j$ is a multiple of $\sum_i (\hat{Y}_i - \hat{Y}_{i[-j]})^2$
where the multiplier does not depend on the case. This expression implies that
Dj is also an overall measure of how much the fitted values change when case
j is deleted.
Observations with large D values may be outliers. Because D is calculated
using leverage values and standardized residuals, it considers whether an ob-
servation is unusual with respect to both x- and y-values. To interpret D,
compare it to the F -distribution with (p, n − p) degrees-of-freedom to deter-
mine the corresponding percentile. If the percentile value is less than 10% or
20%, the observation has little influence on the fitted values. If the percentile
value is greater than 50%, the observation has a major influence on the fitted
values and should be examined.
Many statisticians make it a lot simpler than this sounds and use 1 as a
cutoff value for large Cook’s D (when D is on the appropriate scale). Using
the cutoff of 1 can simplify an analysis, since frequently one or two values will
have noticeably larger D values than other observations without actually having
much effect, but it can be important to explore any observations that stand
out. Cook’s distance values for each observation from a linear regression fit are
given with cooks.distance(lm.output).
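A short sketch pulling both diagnostics together for the blood-loss fit, using the two functions just named (the 3p/n comparison follows the rule-of-thumb above, with p = 2 coefficients and n = 8 observations):
# leverages and Cook's distances for lm.blood.wt
lev  <- influence(lm.blood.wt)$hat
cook <- cooks.distance(lm.blood.wt)
n.par <- 2              # intercept plus slope
n.obs <- nrow(thyroid)  # 8 observations
data.frame(leverage = lev, cooks.D = cook, high.leverage = (lev > 3 * n.par / n.obs))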
Given a regression problem, you should locate the points with the largest
Dj s and see whether holding these cases out has a decisive influence on the fit
of the model or the conclusions of the analysis. You can examine the relative
magnitudes of the Dj s across cases without paying much attention to the actual
value of Dj , but there are guidelines (see below) on how large Dj needs to be
before you worry about it.
It is difficult to define a good strategy for dealing with outliers and influential
observations. Experience is the best guide. I will show you a few examples that
highlight some standard phenomena. One difficulty you will find is that certain
observations may be outliers because other observations are influential, or vice-
versa. If an influential observation is held out, an outlier may remain an outlier,
may become influential, or both, or neither. Observations of moderate influence
may become more, or less influential, when the most influential observation is
held out.
In the plots below, which cases do you think are most influential, and which
are outliers? What happens in each analysis if I delete the most influential case?
Are any of the remaining cases influential or poorly fitted?
[Figure: four scatterplots of Y versus X, each containing a potentially influential or outlying observation.]
[Figure: blood_loss versus weight with observations labeled 1–8, the fitted LS line, and a confidence band.]
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt, which = c(1,4,6))
# residuals vs weight
plot(thyroid$weight, lm.blood.wt$residuals, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(lm.blood.wt$residuals, las = 1, id.n = 3, main="QQ Plot")
## 8 2 4
## 1 2 8
# residuals vs order of data
plot(lm.blood.wt$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.blood.wt — residuals versus fitted values, Cook's distance, residuals versus weight, QQ plot of residuals, and residuals versus order of the data.]
# exclude obs 3 by giving it zero weight: wt = 0 for observation 3 and 1 otherwise
# (the wt construction and the weighted fit below are reconstructed; the later
#  prediction warning indicates the model was fit with weights)
thyroid$wt <- as.numeric(!(thyroid$id == 3))
thyroid.no3 <- subset(thyroid, wt == 1)
lm.blood.wt.no3 <- lm(blood_loss ~ weight, data = thyroid, weights = wt)
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid.no3, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)
[Figure: blood_loss versus weight with observation 3 removed, points labeled by id, the fitted LS line, and a confidence band.]
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt.no3, which = c(1,4,6))
# residuals vs weight
plot(thyroid.no3$weight, lm.blood.wt.no3$residuals[(thyroid$wt == 1)]
, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(lm.blood.wt.no3$residuals, las = 1, id.n = 3, main="QQ Plot")
## 3 8 2
## 8 1 2
# residuals vs order of data
plot(lm.blood.wt.no3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.blood.wt.no3 — residuals versus fitted values, Cook's distance, residuals versus weight, QQ plot of residuals, and residuals versus order of the data.]
How much difference is there in a practical sense? Examine the 95% predic-
tion interval for a new observation at Weight = 50kg. Previously we saw that
interval based on all 8 observations was from 457.1 to 517.8 ml of Blood Loss.
Based on just the 7 observations the prediction interval is 451.6 to 512.4 ml.
There really is no practical difference here.
# CI for the mean and PI for a new observation at weight=50
predict(lm.blood.wt , data.frame(weight=50), interval = "prediction")
## fit lwr upr
## 1 487.4 457.1 517.8
predict(lm.blood.wt.no3, data.frame(weight=50), interval = "prediction")
## Warning: Assuming constant prediction variance even though model fit is weighted
## fit lwr upr
## 1 482 451.6 512.4
These data are from a UCLA study of cyanotic heart disease in children. The
predictor is the age of the child in months at first word and the response variable
is the Gesell adaptive score, for each of 21 children.
id age score
1 1 15 95
2 2 26 71
3 3 10 83
4 4 9 91
5 5 15 102
6 6 20 87
7 7 18 93
8 8 11 100
9 9 8 104
10 10 20 94
11 11 7 113
12 12 9 96
13 13 10 83
14 14 11 84
15 15 11 102
16 16 10 100
17 17 12 105
18 18 42 57
19 19 17 121
20 20 11 86
21 21 10 100
[Figure: Gesell score versus age at first word, points labeled by id.]
# fit the regression of Gesell score on age (model name assumed from the diagnostics below)
lm.score.age <- lm(score ~ age, data = gesell)
# residuals vs age
plot(gesell$age, lm.score.age$residuals, main="Residuals vs age")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.score.age$residuals, las = 1, id.n = 3, main="QQ Plot")
## 19 3 13
## 21 1 2
# residuals vs order of data
plot(lm.score.age$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.score.age — residuals versus fitted values, Cook's distance, residuals versus age, QQ plot of residuals, and residuals versus order of the data.]
Can you think of any reasons to justify doing the analysis without observation
18?
If you include observation 18 in the analysis, you are assuming that the mean
Gesell score is linearly related to age over the entire range of observed ages.
Observation 18 is far from the other observations on age (age for observation
18 is 42; the second highest age is 26; the lowest age is 7). There are no
children with ages between 27 and 41, so we have no information on whether
the relationship is roughly linear over a significant portion of the range of ages. I
am comfortable deleting observation 18 from the analysis because its inclusion
forces me to make an assumption that I cannot check using these data. I am
only willing to make predictions of Gesell score for children with ages roughly
between 7 and 26. However, once this point is omitted, age does not appear to
be an important predictor.
A more complete analysis would delete observation 18 and 19 together.
What would you expect to see if you did this?
8.10 Weighted Least Squares
In weighted least squares (WLS) we choose $b_0$ and $b_1$ to minimize the weighted sum of squared deviations $\sum_i w_i (Y_i - \beta_0 - \beta_1 X_i)^2$ over all possible choices of β0 and β1. If $\sigma_{Y|X}$ depends upon X, then the correct choice of weights is inversely proportional to the variance, $w_i \propto 1/\sigma_{Y|X}^2$.
Consider the following data and plot of y vs. x and standardized OLS residuals vs x. It is very clear that variability increases with x.
#### Weighted Least Squares
# R code to generate data
set.seed(7)
n <- 100
# 1s, Xs uniform 0 to 100
X <- matrix(c(rep(1,n),runif(n,0,100)), ncol=2)
# intercept and slope (5, 5)
beta <- matrix(c(5,5),ncol=1)
# errors are X*norm(0,1), so variance increases with X
e <- X[,2]*rnorm(n,0,1)
# response variables
y <- X %*% beta + e
# put the response and predictor in a data frame for lm() and ggplot()
# (construction of wlsdat assumed; the original lines are not shown)
wlsdat <- data.frame(y = y, x = X[,2])
# fit ordinary least squares regression
lm.y.x <- lm(y ~ x, data = wlsdat)
# standardized OLS residuals, used in the plots below (assumed)
wlsdat$res <- rstandard(lm.y.x)
[Figure: y versus x (left) and standardized OLS residuals versus x (right); the spread of the residuals clearly increases with x.]
In order to use WLS to solve this problem, we need some form for $\sigma_{Y|X}^2$.
Finding that form is a real problem with WLS. It can be useful to plot the
absolute value of the standardized residual vs. x to see if the top boundary
seems to follow a general pattern.
# ggplot: Plot the absolute value of the residuals
library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = abs(res)))
p <- p + geom_point()
print(p)
[Figure: absolute value of the standardized OLS residuals versus x.]
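The weighted fit itself is not shown above; a minimal sketch follows, assuming (from the way the errors were generated, x times a standard normal) that the variance is proportional to x², so weights $w_i \propto 1/x_i^2$ are a sensible choice. The model name lm.y.x.wls is illustrative.
# sketch: weighted least squares with weights proportional to 1/x^2 (assumed weight form)
wlsdat$wt <- 1 / wlsdat$x^2
lm.y.x.wls <- lm(y ~ x, data = wlsdat, weights = wt)
coef(lm.y.x)     # OLS estimates
coef(lm.y.x.wls) # WLS estimates, to compare with the true (5, 5)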
[Figure: weighted residuals versus x from the WLS fit.]
Clearly the weighted fit looks better, although note that everything is based
on the weighted SS. In practice it can be pretty difficult to determine the correct
set of weights, but WLS works much better than OLS if appropriate. I actually
simulated this data set using β0 = β1 = 5. Which fit actually did better?
Chapter 9
Introduction to the
Bootstrap
Learning objectives
After completing this topic, you should be able to:
explain the bootstrap principle for hypothesis tests and inference.
decide (for simple problems) how to construct a bootstrap procedure.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
8. use statistical software.
12. make evidence-based decisions.
9.1 Introduction
Statistical theory attempts to answer three basic questions:
1. How should I collect my data?
2. How should I analyze and summarize the data that I’ve collected?
3. How accurate are my data summaries?
Question 3 constitutes part of the process known as statistical inference. The
bootstrap is one approach to making certain kinds of statistical inference1. Let's look at an example.
1
Efron (1979), “Bootstrap methods: another look at the jackknife.” Ann. Statist. 7, 1–26
Example: Aspirin and heart attacks, large-sample theory Does
aspirin prevent heart attacks in healthy middle-aged men? A controlled, ran-
domized, double-blind study was conducted and gathered the following data.
(fatal plus non-fatal)
heart attacks subjects
aspirin group: 104 11037
placebo group: 189 11034
A good experimental design, such as this one, simplifies the results! The ratio
of the two rates (the risk ratio) is
$$ \hat{\theta} = \frac{104/11037}{189/11034} = 0.55. $$
Because of the solid experimental design, we can believe that the aspirin-takers
only have 55% as many heart attacks as the placebo-takers.
We are not really interested in the estimated ratio θ̂, but the true ratio, θ.
That is the ratio if we could treat all possible subjects, not just a sample of
them. Large sample theory tells us that the log risk ratio has an approximate
Normal distribution. The standard error of the log risk ratio is estimated simply
by the square root of the sum of the reciprocals of the four frequencies:
$$ SE(\log(RR)) = \sqrt{\frac{1}{104} + \frac{1}{189} + \frac{1}{11037} + \frac{1}{11034}} = 0.1228. $$
The 95% CI for log(θ) is $\log(\hat{\theta}) \pm 1.96 \times 0.1228 = (-0.84, -0.36)$; exponentiating the endpoints gives the interval $(0.43, 0.70)$ for θ itself.
The same data that allowed us to estimate the ratio θ with θ̂ = 0.55 also
allowed us to get an idea of the estimate’s accuracy.
The study also looked at strokes: there were 119 strokes among the 11037 aspirin-takers and 98 among the 11034 placebo-takers, giving
$$ \hat{\theta} = \frac{119/11037}{98/11034} = 1.21. $$
It looks like aspirin is actually harmful now; however, the 95% interval for the
true stroke ratio θ is (0.925, 1.583). This includes the neutral value θ = 1, at
which aspirin would be no better or worse than placebo for strokes.
9.2 Bootstrap
The bootstrap is a data-based simulation method for statistical inference, which
can be used to produce inferences like those in the previous slides. The term
“bootstrap” comes from literature. In “The Adventures of Baron Munchausen”,
by Rudolph Erich Raspe, the Baron had fallen to the bottom of a deep lake,
and he thought to get out by pulling himself up by his own bootstraps.
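The code that produced the bootstrap ratios plotted below is not shown; one minimal sketch that creates objects with the names used in the plotting code (dat.rat and CI.bs) is given here. For a binary outcome, resampling the subjects in each group with replacement is equivalent to drawing the group's stroke count from a binomial distribution with the observed proportion:
# sketch: bootstrap the stroke risk ratio for the aspirin study
R <- 10000
n.asp <- 11037; s.asp <- 119   # aspirin group: subjects, strokes
n.pla <- 11034; s.pla <- 98    # placebo group: subjects, strokes
rat <- rep(NA, R)
for (i in 1:R) {
  p.asp <- rbinom(1, n.asp, s.asp / n.asp) / n.asp  # resampled aspirin stroke proportion
  p.pla <- rbinom(1, n.pla, s.pla / n.pla) / n.pla  # resampled placebo stroke proportion
  rat[i] <- p.asp / p.pla
}
dat.rat <- data.frame(rat)
# equal-tail 95% bootstrap CI
CI.bs <- quantile(rat, c(0.025, 0.975))
CI.bs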
library(ggplot2)
p <- ggplot(dat.rat, aes(x = rat))
p <- p + geom_histogram(aes(y=..density..)
, binwidth=0.02
, colour="black", fill="white")
# Overlay with transparent density plot
p <- p + geom_density(alpha=0.2, fill="#FF6666")
# vertical line at 1 and CI
p <- p + geom_vline(xintercept=1, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of relative risk ratio, strokes")
p <- p + xlab("ratio (red = 1, green = bootstrap CI)")
print(p)
## Warning: position stack requires constant width: output may be incorrect
[Figure: histogram and density of the bootstrap distribution of the relative risk ratio for strokes, with the value 1 and the bootstrap CI marked.]
In this simple case, the confidence interval derived from the bootstrap
(0.94, 1.588) agrees very closely with the one derived from statistical theory
(0.925, 1.583). Bootstrap methods are intended to simplify the calculation of
inferences like those using large-sample theory, producing them in an automatic
way even in situations much more complicated than the risk ratio in the aspirin
example.
Numerical and graphical summaries of the data are below. There seems to
be a slight difference in variability between the two treatment groups.
#### Example: Mouse survival, two-sample t-test, mean
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
survive <- c(treatment, control)
group <- c(rep("Treatment", length(treatment)), rep("Control", length(control)))
mice <- data.frame(survive, group)
library(plyr)
# ddply "dd" means the input and output are both data.frames
mice.summary <- ddply(mice,
"group",
function(X) {
data.frame( m = mean(X$survive),
s = sd(X$survive),
n = length(X$survive)
)
}
)
# standard errors
mice.summary$se <- mice.summary$s/sqrt(mice.summary$n)
# individual confidence limits
mice.summary$ci.l <- mice.summary$m - qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se
mice.summary$ci.u <- mice.summary$m + qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se
mice.summary
## group m s n se ci.l ci.u
## 1 Control 56.22 42.48 9 14.16 23.57 88.87
## 2 Treatment 86.86 66.77 7 25.24 25.11 148.61
diff(mice.summary$m) # difference in means, Treatment minus Control
## [1] 30.63
# histogram using ggplot
p <- ggplot(mice, aes(x = survive))
p <- p + geom_histogram(binwidth = 20)
p <- p + facet_grid(group ~ .)
p <- p + labs(title = "Mouse survival following a test surgery") + xlab("Survival (days)")
print(p)
[Figure: histograms of mouse survival (days) following a test surgery, faceted by Control and Treatment groups.]
The standard error for the difference is $28.93 = \sqrt{25.24^2 + 14.14^2}$, so the
observed difference of 30.63 is only 30.63/28.93 = 1.06 estimated standard errors
greater than zero, an insignificant result.
The two-sample t-test of the difference in means confirms the lack of sta-
tistically significant difference between these two treatment groups with a p-
value=0.3155.
t.test(survive ~ group, data = mice)
##
## Welch Two Sample t-test
##
## data: survive by group
## t = -1.059, df = 9.654, p-value = 0.3155
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -95.42 34.15
## sample estimates:
## mean in group Control mean in group Treatment
## 56.22 86.86
But these are small samples, and the control sample does not look normal.
We could do a nonparametric two-sample test of difference of medians. Or, we
could use the bootstrap to make our inference.
To bootstrap the difference in means, draw a resample $x^*$ of size 7 with replacement from the treatment group and a resample $y^*$ of size 9 with replacement from the control group, and compute the bootstrap difference in means $\hat{\mu}^* = \bar{x}^* - \bar{y}^*$.
Repeat this process a large number of times, say 10000 times, and obtain 10000
bootstrap replicates µ̂∗. The summaries are in the code, followed by a histogram
of bootstrap replicates, µ̂∗.
#### Example: Mouse survival, two-sample bootstrap, mean
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of means
for (i in 1:R) {
bs2[i] <- mean(sample(control, replace = TRUE))
bs1[i] <- mean(sample(treatment, replace = TRUE))
}
# bootstrap replicates of difference estimates
bs.diff <- bs1 - bs2
sd(bs.diff)
## [1] 27
# sort the difference estimates to obtain bootstrap CI
diff.sorted <- sort(bs.diff)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
CI.bs
## [1] -21.97 83.10
## Plot the bootstrap distribution with CI
# First put data in data.frame for ggplot()
dat.diff <- data.frame(bs.diff)
library(ggplot2)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..)
, binwidth=5
, colour="black", fill="white")
# Overlay with transparent density plot
p <- p + geom_density(alpha=0.2, fill="#FF6666")
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, mean")
p <- p + xlab("difference (red = 0, green = bootstrap CI)")
print(p)
[Figure: histogram and density of the bootstrap distribution of the difference in mean survival time, with 0 and the bootstrap CI marked.]
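The second bootstrap example below uses the difference in medians. The code computing those replicates is not shown; a sketch paralleling the mean version (reusing R, treatment, and control from earlier) is:
# bootstrap replicates of the difference in medians
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
for (i in 1:R) {
  bs2[i] <- median(sample(control, replace = TRUE))
  bs1[i] <- median(sample(treatment, replace = TRUE))
}
bs.diff <- bs1 - bs2
# equal-tail bootstrap CI from the sorted differences
diff.sorted <- sort(bs.diff)
CI.bs <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
dat.diff <- data.frame(bs.diff)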
library(ggplot2)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..)
, binwidth=5
, colour="black", fill="white")
# Overlay with transparent density plot
p <- p + geom_density(alpha=0.2, fill="#FF6666")
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, median")
p <- p + xlab("difference (red = 0, green = bootstrap CI)")
print(p)
[Figure: histogram and density of the bootstrap distribution of the difference in median survival time, with 0 and the bootstrap CI marked.]
LSAT <- c(622, 542, 579, 653, 606, 576, 620, 615, 553, 607, 558, 596, 635,
581, 661, 547, 599, 646, 622, 611, 546, 614, 628, 575, 662, 627,
608, 632, 587, 581, 605, 704, 477, 591, 578, 572, 615, 606, 603,
535, 595, 575, 573, 644, 545, 645, 651, 562, 609, 555, 586, 580,
594, 594, 560, 641, 512, 631, 597, 621, 617, 637, 572, 610, 562,
635, 614, 546, 598, 666, 570, 570, 605, 565, 686, 608, 595, 590,
558, 611, 564, 575)
GPA <- c(3.23, 2.83, 3.24, 3.12, 3.09, 3.39, 3.10, 3.40, 2.97, 2.91, 3.11,
3.24, 3.30, 3.22, 3.43, 2.91, 3.23, 3.47, 3.15, 3.33, 2.99, 3.19,
3.03, 3.01, 3.39, 3.41, 3.04, 3.29, 3.16, 3.17, 3.13, 3.36, 2.57,
3.02, 3.03, 2.88, 3.37, 3.20, 3.23, 2.98, 3.11, 2.92, 2.85, 3.38,
2.76, 3.27, 3.36, 3.19, 3.17, 3.00, 3.11, 3.07, 2.96, 3.05, 2.93,
3.28, 3.01, 3.21, 3.32, 3.24, 3.03, 3.33, 3.08, 3.13, 3.01, 3.30,
3.15, 2.82, 3.20, 3.44, 3.01, 2.92, 3.45, 3.15, 3.50, 3.16, 3.19,
3.15, 2.81, 3.16, 3.02, 2.74)
# law = population
law <- data.frame(School, LSAT, GPA, Sampled)
law$Sampled <- factor(law$Sampled)
# law.sam = sample
law.sam <- subset(law, Sampled == 1)
library(ggplot2)
p <- ggplot(law, aes(x = LSAT, y = GPA))
p <- p + geom_point(aes(colour = Sampled, shape = Sampled), alpha = 0.5, size = 2)
p <- p + labs(title = "Law School average scores of LSAT and GPA")
print(p)
[Figure: GPA versus LSAT for the population of law schools, with the sampled schools highlighted.]
[Figure: overlaid histograms comparing the population (Pop) and sample (Sam) groups.]
Chapter 10
Power and Sample size
One-sample power figure Consider the plot below for a one-sample one-tailed greater-than t-test. If the null hypothesis, H0 : µ = µ0, is true, then the test statistic t follows the null distribution indicated by the hashed area.
Under a specific alternative hypothesis, H1 : µ = µ1, the test statistic t follows
the distribution indicated by the solid area. If α is the probability of making
a Type-I error (rejecting H0 when it is true), then “crit. val.” indicates the
location of the tcrit value associated with H0 on the scale of the data. The
rejection region is the area under H0 that is at least as far as “crit. val.” is from
µ0. The power (1 − β) of the test is the green area, the area under H1 in the
rejection region. A Type-II error is made when H1 is true, but we fail to reject
H0 in the red region. (Note, for a two-tailed test the rejection region for both
tails under the H1 curve contribute to the power.)
#### One-sample power
# Power plot with two normal distributions
# http://stats.stackexchange.com/questions/14140/how-to-best-display-graphically-type-ii-bet
#col_null = "#DDDDDD"
#polygon(c(min(x), x,max(x)), c(0,hx,0), col=col_null)
#lines(x, hx, lwd=2)
col_null = "#AAAAAA"
polygon(c(min(x), x,max(x)), c(0,hx,0), col=col_null, lwd=2, density=c(10, 40), angle=-45, bor
lines(x, hx, lwd=2, lty="dashed", col=col_null)
[Figure: null and alternative distributions for the one-sample test, with the critical value, Type-II error region, and power indicated on the scale of the data from −∞ through µ0, the critical value, and µ1.]
powsF <- sapply(nn, getFPow) # ANOVA power for for all group sizes
powsT <- sapply(nn, getTPow) # t-Test power for for all group sizes
#dev.new(width=10, fig.height=5)
par(mfrow=c(1, 2))
matplot(dVals, powsT, type="l", lty=1, lwd=2, xlab="effect size d",
ylab="Power", main="Power one-sample t-test", xaxs="i",
xlim=c(-0.05, 1.1), col=c("blue", "red", "darkgreen", "green"))
#legend(x="bottomright", legend=paste("N =", c(5,10,25,100)), lwd=2,
# col=c("blue", "red", "darkgreen", "green"))
legend(x="bottomright", legend=paste("N =", nn), lwd=2,
col=c("blue", "red", "darkgreen", "green"))
#matplot(fVals, powsF, type="l", lty=1, lwd=2, xlab="effect size f",
# ylab="Power", main=paste("Power one-way ANOVA, ", P, " groups", sep=""), xaxs="i",
# xlim=c(-0.05, 1.1), col=c("blue", "red", "darkgreen", "green"))
##legend(x="bottomright", legend=paste("Nj =", c(10, 15, 20, 25)), lwd=2,
## col=c("blue", "red", "darkgreen", "green"))
#legend(x="bottomright", legend=paste("Nj =", nn), lwd=2,
# col=c("blue", "red", "darkgreen", "green"))
library(pwr)
pwrt2 <- pwr.t.test(d=.2,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt3 <- pwr.t.test(d=.3,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt5 <- pwr.t.test(d=.5,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt8 <- pwr.t.test(d=.8,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
#plot(pwrt$n, pwrt$power, type="b", xlab="sample size", ylab="power")
matplot(matrix(c(pwrt2$n,pwrt3$n,pwrt5$n,pwrt8$n),ncol=4),
matrix(c(pwrt2$power,pwrt3$power,pwrt5$power,pwrt8$power),ncol=4),
type="l", lty=1, lwd=2, xlab="sample size",
ylab="Power", main="Power one-sample t-test", xaxs="i",
xlim=c(0, 100), ylim=c(0,1), col=c("blue", "red", "darkgreen", "green"))
legend(x="bottomright", legend=paste("d =", c(0.2, 0.3, 0.5, 0.8)), lwd=2,
col=c("blue", "red", "darkgreen", "green"))
[Figure: power of the one-sample t-test as a function of effect size d for N = 5, 10, 25, 100 (left) and as a function of sample size for d = 0.2, 0.3, 0.5, 0.8 (right).]
# Strategy:
# Do this R times:
# draw a sample of size N from the distribution specified by the alternative hypothesis
# That is, 25 subjects from a normal distribution with mean 102 and sigma 15
# Calculate the mean of our sample
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu=100
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power
# settings from the strategy comments above: n = 25, mu0 = 100, mu1 = 102, sigma = 15
R <- 10000; n <- 25; mu0 <- 100; mu1 <- 102; sigma <- 15;
reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
                      # to fill with 0 (fail to reject H0) or 1 (reject H0)
for (i in 1:R) {
  sam <- rnorm(n, mean=mu1, sd=sigma);   # sam is a vector with 25 values
  z <- (mean(sam) - mu0) / (sigma / sqrt(n));  # z-statistic under H0: mu = mu0
  reject[i] <- (z > qnorm(1 - 0.05));    # reject H0 for a one-sided 0.05-level test
}
power <- mean(reject); # the average reject (proportion of rejects) is the power
power
## [1] 0.166
# 0.1655 for mu1=102
# 0.5082 for mu1=105
Our simulation (this time) with µ1 = 102 gave a power of 0.166 (exact
answer is P (Z > 0.98) = 0.1635). Rerunning with µ1 = 105 gave a power
of 0.5082 (exact answer is P (Z > −0.02) = 0.5080). Our simulation well-
approximates the true value, and the power can be made more precise by in-
creasing the number of repetitions calculated. However, two to three decimal
precision is quite sufficient.
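The exact values quoted above come straight from the normal distribution; a two-line check:
# exact power of the one-sided z-test: P(Z > z_0.95 - (mu1 - mu0)/(sigma/sqrt(n)))
1 - pnorm(qnorm(0.95) - (102 - 100) / (15 / sqrt(25)))  # 0.1635 for mu1 = 102
1 - pnorm(qnorm(0.95) - (105 - 100) / (15 / sqrt(25)))  # 0.5080 for mu1 = 105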
Imagine that we don’t have the information above. Imagine we have been
invited to a UK university to take skull measurements for 18 modern day En-
glishmen, and 16 ancient Celts. We have some information about modern day
skulls to use as prior information for measurement mean and standard devia-
tion. What is the power to observe a difference between the populations? Let’s
make some reasonable assumptions that allow us to be a bit conservative. Let's
assume the sampled skulls from each of our populations are a random sample
with common standard deviation 7mm, and let’s assume we can’t get the full
sample but can only measure 15 skulls from each population. At a significance
level of α = 0.05, what is the power for detecting a difference of 5, 10, 15, 20,
or 25 mm?
The theoretical two-sample power result is not too hard to derive (and is
available in text books), but let’s simply compare the power calculated exactly
and by simulation.
For the exact result we use R library pwr. Below is the function call as well
as the result. Note that we specified multiple effect sizes (diff/SD) in one call
of the function.
# R code to compute exact two-sample two-sided power
library(pwr) # load the power calculation library
pwr.t.test(n = 15,
d = c(5,10,15,20,25)/7,
sig.level = 0.05,
power = NULL,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 15
## d = 0.7143, 1.4286, 2.1429, 2.8571, 3.5714
## sig.level = 0.05
## power = 0.4717, 0.9652, 0.9999, 1.0000, 1.0000
## alternative = two.sided
##
## NOTE: n is number in *each* group
To simulate the power under the same circumstances, we follow a similar
strategy as in the one-sample example.
# R code to simulate two-sample two-sided power
# Strategy:
# Do this R times:
# draw a sample of size N from the two hypothesized distributions
# That is, 15 subjects from a normal distribution with specified means and sigma=7
# Calculate the mean of the two samples
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu_diff=0
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power
# settings from the text and table below: 15 skulls per group, sigma = 7,
# English mean 147, Celt means 142, 137, 132, 127, 122
R <- 10000; n <- 15; sigma <- 7;
mu1 <- 147; mu2 <- c(142, 137, 132, 127, 122);
power <- rep(NA, length(mu2)); # power for each hypothesized difference
for (j in 1:length(mu2)) {
  reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
                        # to fill with 0 (fail to reject H0) or 1 (reject H0)
  for (i in 1:R) {
    sam1 <- rnorm(n, mean=mu1   , sd=sigma); # English sample
    sam2 <- rnorm(n, mean=mu2[j], sd=sigma); # Celt sample
    # z-statistic for the difference in means, sigma treated as known
    z <- (mean(sam1) - mean(sam2)) / (sigma * sqrt(2/n));
    reject[i] <- (abs(z) > qnorm(1 - 0.05/2)); # two-sided 0.05-level test of H0: mu_diff = 0
  }
  power[j] <- mean(reject); # proportion of rejects approximates the power
}
power
## [1] 0.4928 0.9765 1.0000 1.0000 1.0000
Note the similarity between power calculated using both the exact and sim-
ulation methods. If there is a power calculator for your specific problem, it is
best to use that because it is faster and there is no programming. However,
using the simulation method is better if we wanted to entertain different sam-
ple sizes with different standard deviations, etc. There may not be a standard
calculator for our specific problem, so knowing how to simulate the power can
be valuable.
Mean Sample size Power
µE µC diff SD nE nC exact simulated
147 142 5 7 15 15 0.4717 0.4928
147 137 10 7 15 15 0.9652 0.9765
147 132 15 7 15 15 0.9999 1
147 127 20 7 15 15 1.0000 1
147 122 25 7 15 15 1.0000 1
Appendix A
Custom R functions
A.1 Ch 2. Estimation in One-Sample Problems
# a function to compare the bootstrap sampling distribution with a
# normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 10000) {
n <- length(dat)
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol = N)
# draw a histogram of the means
sam.mean <- colMeans(sam)
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow = c(2, 1), mar = c(3, 2, 2, 1), oma = c(1, 1, 1, 1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)
  hist(sam.mean, freq = FALSE, breaks = 25, main = "Bootstrap sampling distribution of the mean",
       xlab = paste("Data: n =", n, ", mean =", signif(mean(dat), digits = 5),
                    ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam.mean), max(sam.mean), length = 1000)
points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n)), type = "l",
lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
# restore par() settings
par(old.par)
}