Advanced Data Analysis - Lecture Notes
Erik B. Erhardt
Edward J. Bedrick
Ronald M. Schrader
I ADA1: Software
0 Introduction to R, Rstudio, and ggplot
0.1 R building blocks
0.2 Plotting with ggplot2
0.2.1 Improving plots
0.3 Course Overview
2.3.1 P-values
2.3.2 Assumptions for procedures
2.3.3 The mechanics of setting up hypothesis tests
2.3.4 The effect of α on the rejection region of a two-sided test
2.4 Two-sided tests, CI and p-values
2.5 Statistical versus practical significance
2.6 Design issues and power
2.7 One-sided tests on µ
2.7.1 One-sided CIs
3 Two-Sample Inferences
3.1 Comparing Two Sets of Measurements
3.1.1 Plotting head breadth data:
3.1.2 Salient Features to Notice
3.2 Two-Sample Methods: Paired Versus Independent Samples
3.3 Two Independent Samples: CI and Test Using Pooled Variance
3.4 Satterthwaite’s Method, unequal variances
3.4.1 R Implementation
3.5 One-Sided Tests
3.6 Paired Analysis
3.6.1 R Analysis
3.7 Should You Compare Means?
17 Classification
17.1 Classification using Mahalanobis distance
17.2 Evaluating the Accuracy of a Classification Rule
17.3 Example: Carapace classification and error
17.4 Example: Fisher’s Iris Data cross-validation
17.4.1 Stepwise variable selection for classification
17.5 Example: Analysis of Admissions Data
17.5.1 Further Analysis of the Admissions Data
17.5.2 Classification Using Unequal Prior Probabilities
17.5.3 Classification With Unequal Covariance Matrices, QDA
Preface
UNM Stat 427/527: Advanced Data Analysis I (ADA1)
The course website (https://statacumen.com/teaching/ada1) includes these lecture
notes and R code, video lectures, helper videos, quizzes, in-class assignments,
homework assignments, datasets, and a fuller course description, including student
learning outcomes, rubrics, and other helpful course information.
These notes include clicker questions which I used to use when I lectured from
these notes. I found these, along with the think-pair-share strategy, very effective in
assessing comprehension during the lecture.
Description Statistical tools for scientific research, including parametric and non-
parametric methods for ANOVA and group comparisons, simple linear and multiple
linear regression and basic ideas of experimental design and analysis. Emphasis placed
on the use of statistical packages such as R. Course cannot be counted in the hours
needed for graduate degrees in Mathematics and Statistics.
Goal Learn to produce beautiful (markdown) and reproducible (knitr) reports with
informative plots (ggplot2) and tables (xtable) by writing code (R, Rstudio) to answer
questions using fundamental statistical methods (all one- and two-variable methods),
which you'll be proud to present (poster).
Course structure The course semester structure enables students to develop the
statistical, programming, visualization, and research skills to give a poster presenta-
tion. This includes conducting a literature review, developing a dataset codebook,
cleaning data, performing univariate and bivariate statistical summaries, tests, and
visualizations, and presenting results and evaluating peer presentations as part of a
poster session. This course structure follows the GAISE recommendations.
Each week, the course structure is the following. Students prepare for class by
reading these notes, watching video lectures of the notes, and taking a quiz online
before class. In class, students download an Rmarkdown file and work through a
real-data problem or two in teams to reinforce the content from the reading; some-
times, this includes taking data in class into a google spreadsheet and analyzing it
immediately afterwards. Finally, students work on a homework assignment outside
of class that is due the following week.
Goal Learn to produce beautiful (markdown) and reproducible (knitr) reports with
informative plots (ggplot2) and tables (xtable) by writing code (R, Rstudio) to an-
swer questions using fundamental statistical methods (analysis of covariance, logistic
regression, and multivariate methods), which you'll be proud to present (poster).
Course structure The course semester structure builds on ADA1 and enables stu-
dents to develop the statistical, programming, visualization, and research skills to give
a poster presentation. This includes multiple regression (with an emphasis on assess-
ing model assumptions), logistic regression, multivariate methods, classification, and
combining these methods. Visualization remains an important topic. The semester
culminates in a poster session with many students using their own data. This course
structure follows the GAISE recommendations.
Each week, the course structure is the following. Students prepare for class by
reading these notes, watching video lectures of the notes, and taking a quiz online
before class. In class, students download an Rmarkdown file and work through a
real-data problem or two in teams to reinforce the content from the reading. We
have used student-suggested or student-collected datasets when possible. Homework
is started in class since the realistic data analysis problems can be quite involved and
we want to resolve most student questions in class so that they don’t get stuck on
a small detail. In my experience, the students in the second semester have become
quite mature in their attitudes toward data analysis, coding, and visualization; they
are eager to be challenged and make well-reasoned decisions in their work.
ADA1: Software
Contents
0.1 R building blocks
0.2 Plotting with ggplot2
0.2.1 Improving plots
0.3 Course Overview
## [1] 32
9^(1/2)
## [1] 3
Basic functions There are lots of functions available in the base package.
Type ?base and click on Index at the bottom of the help page for a complete list
of functions. Other functions to look at are in the ?stats and ?datasets packages.
#### Basic functions
# Lots of familiar functions work
a
## [1] 1 2 3 4 5
sum(a)
## [1] 15
prod(a)
## [1] 120
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
var(a)
## [1] 2.5
min(a)
## [1] 1
median(a)
## [1] 3
max(a)
## [1] 5
range(a)
## [1] 1 5
Boolean: True/False Subsets can be selected based on which elements meet spe-
cific conditions.
#### Boolean
a
## [1] 333 555 20 30 40 50 60 70 80 90 100
(a > 50) # TRUE/FALSE for each element
## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
which(a > 50) # which indices are TRUE
## [1] 1 2 7 8 9 10 11
a[(a > 50)]
## [1] 333 555 60 70 80 90 100
!(a > 50) # ! negates (flips) TRUE/FALSE values
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
a[!(a > 50)]
## [1] 20 30 40 50
#### Boolean
# & and, | or, ! not
a
## [1] 333 555 20 30 40 50 60 70 80 90 100
a[(a >= 50) & (a <= 90)]
## [1] 50 60 70 80 90
a[(a < 50) | (a > 100)]
## [1] 333 555 20 30 40
a[(a < 50) | !(a > 100)]
## [1] 20 30 40 50 60 70 80 90 100
a[(a >= 50) & !(a <= 90)]
## [1] 333 555 100
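Logical vectors can also be summarized directly, because R treats TRUE as 1 and FALSE as 0 in arithmetic. The following is a minimal sketch using the same values of a printed above; the names n_big, p_big, any_neg, and all_pos are illustrative and are not part of the original notes.

```r
# Same values as the vector a printed above
a <- c(333, 555, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# TRUE is treated as 1 and FALSE as 0, so sum() counts matches
n_big <- sum(a > 50)       # how many elements exceed 50
p_big <- mean(a > 50)      # proportion of elements exceeding 50

# any() and all() collapse a logical vector to a single TRUE/FALSE
any_neg <- any(a < 0)      # is at least one element negative?
all_pos <- all(a > 0)      # are all elements positive?

n_big
p_big
any_neg
all_pos
```

Counting with sum() avoids building the subset first when only the number of matching elements is needed.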
Missing values The value NA (not available) means the value is missing. Any
calculation involving NA will return an NA by default.
#### Missing values
NA + 8
## [1] NA
3 * NA
## [1] NA
mean(c(1, 2, NA))
## [1] NA
# Many functions have an na.rm argument (NA remove)
mean(c(NA, 1, 2), na.rm = TRUE)
## [1] 1.5
sum(c(NA, 1, 2))
## [1] NA
sum(c(NA, 1, 2), na.rm = TRUE)
## [1] 3
# Or you can remove them yourself
a <- c(NA, 1:5, NA)
a
## [1] NA 1 2 3 4 5 NA
is.na(a) # which values are missing?
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
!is.na(a) # which values are NOT missing?
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
a[!is.na(a)] # return those which are NOT missing
## [1] 1 2 3 4 5
a # note, this did not change the variable a
## [1] NA 1 2 3 4 5 NA
# To save the results of removing the NAs,
# assign to another variable or reassign to the original variable
# Warning: if you write over variable a then the original version is gone forever!
a <- a[!is.na(a)]
a
## [1] 1 2 3 4 5
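A common stumbling block is testing for missing values with ==, which returns NA rather than TRUE or FALSE. The sketch below contrasts that with is.na() and keeps the original vector by storing the cleaned copy under a new name; the names n_missing, idx_missing, and a_clean are illustrative.

```r
a <- c(NA, 1:5, NA)

# Comparing with == does not find missing values: every result is NA
a == NA

# Count and locate the missing values with is.na()
n_missing   <- sum(is.na(a))     # number of NAs
idx_missing <- which(is.na(a))   # positions of the NAs

# Keep the original and store the cleaned copy under a new name
a_clean <- a[!is.na(a)]

n_missing
idx_missing
a_clean
```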
Your turn!
1
What’s your Carnegie Hall that you’re working towards?
#### Library
# each time you start R
# load package ggplot2 for its functions and datasets
library(ggplot2)
#### ggplot_mpg_displ_hwy
# specify the dataset and variables
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point() # add a plot layer with points
print(p)
[Figure: scatterplot of hwy versus displ]
Geoms, aesthetics, and facets are three concepts we’ll see in this section.
had.co.nz/ggplot2
had.co.nz/ggplot2/geom_point.html
When certain aesthetics are defined, an appropriate legend is chosen and displayed
automatically.
#### ggplot_mpg_displ_hwy_colour_class
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class))
print(p)
[Figure: scatterplot of hwy versus displ, points coloured by class]
[Figure: scatterplot of hwy versus displ, with legends for class, drv, and cyl]
#### ggplot_mpg_displ_hwy_colour_class_size_cyl_shape_drv_alpha
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv)
, alpha = 1/4) # alpha is the opacity
print(p)
[Figure: scatterplot of hwy versus displ, with colour by class, size by cyl, shape by drv, and alpha = 1/4]
Faceting A small multiple² (sometimes called faceting, trellis chart, lattice chart,
grid chart, or panel chart) is a series or grid of small similar graphics or charts,
allowing them to be easily compared.
Experiment with faceting of different types. What relationships would you like to
see?
#### ggplot_mpg_displ_hwy_facet
# start by creating a basic scatterplot
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point()
## two methods
# facet_grid(rows ~ cols) for 2D grid, "." for no split.
# facet_wrap(~ var) for 1D ribbon wrapped into 2D.
² According to Edward Tufte (Envisioning Information, p. 67): “At the heart of quantitative
reasoning is a single question: Compared to what? Small multiple designs, multivariate and data
bountiful, answer directly by visually enforcing comparisons of changes, of the differences among
objects, of the scope of alternatives. For a wide range of problems in data presentation, small
multiples are the best design solution.”
Facet examples
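The two faceting methods named in the comments above can be sketched as follows. The choice of faceting variables (drv, cyl, and class) is an assumption suggested by the panel labels in the figure, and the object names p_rows, p_cols, p_grid, and p_wrap are illustrative rather than taken from the notes.

```r
library(ggplot2)

# basic scatterplot of highway mpg against engine displacement
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point()

# facet_grid(rows ~ cols): "." means no split in that direction
p_rows <- p + facet_grid(drv ~ .)    # one row of panels per drive type
p_cols <- p + facet_grid(. ~ cyl)    # one column of panels per cylinder count
p_grid <- p + facet_grid(drv ~ cyl)  # full 2D grid: drv by cyl

# facet_wrap(~ var): a 1D ribbon of panels wrapped into 2D
p_wrap <- p + facet_wrap(~ class)

print(p_grid)
print(p_wrap)
```

facet_grid() suits two crossed categorical variables, while facet_wrap() is convenient for a single variable with many levels, such as class.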
[Figures: facet examples of hwy versus displ, with panels split by drv (4, f, r), by cyl (4, 5, 6, 8), and by class (including minivan, pickup, subcompact, suv)]
[Figure: scatterplot of hwy versus cty]
Problem: points lie on top of each other, so it’s impossible to tell how many
observations each point represents.
A solution: Jitter the points to reveal the individual points and reduce the opacity
to 1/2 to indicate when points overlap.
#### ggplot_mpg_cty_hwy_jitter
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = "jitter", alpha = 1/2)
print(p)
[Figure: jittered scatterplot of hwy versus cty, alpha = 1/2]
[Figure: scatterplot with hwy on the y-axis]
A solution: Reorder the class variable by the mean hwy for a meaningful ordering.
Get help with ?reorder to understand how this works.
#### ggplot_mpg_reorder_class_hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point()
print(p)
[Figure: hwy versus reorder(class, hwy)]
. . . add jitter
#### ggplot_mpg_reorder_class_hwy_jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point(position = "jitter")
print(p)
[Figures: hwy versus reorder(class, hwy), with jittered points and further variations; x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact]
. . . and can easily reorder by median() instead of mean() (mean is the default)
#### ggplot_mpg_reorder_class_hwy_boxplot_jitter_median
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1))
print(p)
[Figure: boxplots with jittered points of hwy versus class, reordered by median hwy]
One-minute paper:
Muddy Any “muddy” points — anything that doesn’t make sense yet?
Thumbs up Anything you really enjoyed or feel excited about?
Contents
1.1 Random variables
1.2 Numerical summaries
1.3 Graphical summaries for one quantitative sample
1.3.1 Dotplots
1.3.2 Histogram
1.3.3 Stem-and-leaf plot
1.3.4 Boxplot or box-and-whiskers plot
1.4 Interpretation of Graphical Displays for Numerical Data
1.5 Interpretations for examples
Learning objectives
After completing this topic, you should be able to:
use R’s functions to get help and numerically summarize data.
apply R’s base graphics and ggplot to visually summarize data in several ways.
explain what each plotting option does.
describe the characteristics of a data distribution.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
should be familiar to you. Let us consider a simple example to refresh your memory
on how to compute them.
Suppose we have a sample of n = 8 children with weights (in pounds): 5, 9, 12,
30, 14, 18, 32, 40. Then
$$
\bar{Y} = \frac{\sum_i Y_i}{n} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
        = \frac{5 + 9 + 12 + 30 + 14 + 18 + 32 + 40}{8} = \frac{160}{8} = 20.
$$
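The same calculation can be checked in R. This is a small sketch that enters the eight weights as a vector y (the name used by the code in the rest of this section) and compares the by-hand formula with mean(); the names n and y_bar are illustrative.

```r
# weights (in pounds) of the n = 8 children
y <- c(5, 9, 12, 30, 14, 18, 32, 40)

n <- length(y)        # sample size
y_bar <- sum(y) / n   # mean from the formula: total divided by n

y_bar
mean(y)               # the built-in function gives the same value
```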
#### variance
var(y)
## [1] 156.2857
sd(y)
## [1] 12.50143
Summary statistics have well-defined units of measurement, for example, $\bar{Y} = 20$
lb, $s^2 = 156.3$ lb$^2$, and $s = 12.5$ lb. The standard deviation is often used instead of
$s^2$ as a measure of spread because $s$ is measured in the same units as the data.
Remark If the divisor for $s^2$ were $n$ instead of $n - 1$, then the variance would be the
average squared deviation of the observations from the center of the data, as measured
by the mean.
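The effect of the divisor can be seen numerically. The sketch below computes the squared deviations for the children's weights and divides by n - 1 and by n in turn; the variable names dev2, s2_n1, and s2_n are illustrative.

```r
y <- c(5, 9, 12, 30, 14, 18, 32, 40)
n <- length(y)

dev2  <- (y - mean(y))^2        # squared deviations from the mean
s2_n1 <- sum(dev2) / (n - 1)    # divisor n - 1: the sample variance
s2_n  <- sum(dev2) / n          # divisor n: average squared deviation

s2_n1          # agrees with var(y)
s2_n           # somewhat smaller, since the divisor is larger
sqrt(s2_n1)    # agrees with sd(y)
```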
The following graphs should help you to see some physical meaning of the sample
mean and variance. If the data values were placed on a “massless” ruler, the balance
point would be the mean (20). The variance is basically the “average” (remember
n − 1 instead of n) of the total areas of all the squares obtained when squares are
formed by joining each value to the mean. In both cases think about the implication
of unusual values (outliers). What happens to the balance point if the 40 were a 400
instead of a 40? What happens to the squares?
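One way to explore these questions is to make the change and recompute the summaries. The sketch below replaces the 40 with 400 and compares the mean, median, and standard deviation before and after; the name y_out is illustrative, and the median is included only for contrast.

```r
y <- c(5, 9, 12, 30, 14, 18, 32, 40)
y_out <- replace(y, y == 40, 400)   # same data with the 40 replaced by 400

# compare center and spread before and after the change
rbind(
  original = c(mean = mean(y),     median = median(y),     sd = sd(y)),
  outlier  = c(mean = mean(y_out), median = median(y_out), sd = sd(y_out))
)
```

The balance point (mean) and the squares (variance) both respond strongly to the single changed value, while the median does not move.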
#### quartiles
median(y)
## [1] 16
fivenum(y)
## [1] 5.0 10.5 16.0 31.0 40.0
# The quantile() function can be useful, but doesn't calculate Q1 and Q3
# as defined above, regardless of the 9 types of calculations for them!
# summary() is a combination of mean() and quantile(y, c(0, 0.25, 0.5, 0.75, 1))
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 11.25 16.00 20.00 30.50 40.00
# IQR
fivenum(y)[c(2,4)]
## [1] 10.5 31.0
fivenum(y)[4] - fivenum(y)[2]
## [1] 20.5
diff(fivenum(y)[c(2,4)])
## [1] 20.5
The quartiles, with M being the second quartile, break the data set roughly into
fourths. The first quartile is also called the 25th percentile, whereas the median and
third quartiles are the 50th and 75th percentiles, respectively. The IQR is the range
for the middle half of the data.
If you look at the data set with all eight observations, there actually are many
numbers that split the data set in half, so the median is not uniquely defined¹,
although “everybody” agrees to use the average of the two middle values. With quartiles
there is the same ambiguity but no such universal agreement on what to do about it.
As a result, R gives slightly different values for Q1 and Q3 when using summary()
and some other commands than we just calculated, and other packages may report
different values still. This has no practical implication (all the values are “correct”),
but it can appear confusing.
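To see the ambiguity concretely, the quartiles from fivenum() can be placed next to those from quantile() under each of its nine type settings. This is a small sketch for the children's weights; the name q_by_type is illustrative.

```r
y <- c(5, 9, 12, 30, 14, 18, 32, 40)

# quartiles (hinges) from fivenum(), as computed above
fivenum(y)[c(2, 4)]

# quantile() offers nine definitions, selected with the type argument;
# type = 7 is the default and is what summary() reports
q_by_type <- sapply(1:9, function(k) quantile(y, c(0.25, 0.75), type = k))
colnames(q_by_type) <- paste0("type", 1:9)
q_by_type
```

Each column is a defensible answer; the differences shrink as the sample size grows.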
Example The data given below are the head breadths in mm for a sample of 18
modern Englishmen, with numerical summaries generated by R.
#### Englishmen
hb <- c(141, 148, 132, 138, 154, 142, 150, 146, 155
, 158, 150, 140, 147, 148, 144, 150, 149, 145)
¹ The technical definition of the median for an even number of values includes the entire range
between the two center values. Thus, any single value in this center range can serve as the median,
and the center of the range is one sensible choice for M.
# standard deviation
sd(hb)
## [1] 6.382421
# standard error of the mean
se <- sd(hb)/sqrt(length(hb))
Note that se is the standard error of the sample mean, $SE_{\bar{Y}} = s/\sqrt{n}$, and is a
measure of the precision of the sample mean $\bar{Y}$.
1.3.1 Dotplots
The dotplot breaks the range of data into many small, equal-width intervals and
counts the number of observations in each interval. The interval count is superimposed
on the number line at the interval midpoint as a series of dots, usually one for each
observation. In the head breadth data, the intervals are centered at integer values, so
the display gives the number of observations at each distinct observed head breadth.
A dotplot of the head breadth data is given below. Of the examples below, the R
base graphics stripchart() with method="stack" resembles the traditional dotplot.
#### stripchart-ggplot
# stripchart (dotplot) using R base graphics
# 3 rows, 1 column
par(mfrow=c(3,1))
stripchart(hb, main="Modern Englishman", xlab="head breadth (mm)")
stripchart(hb, method="stack", cex=2
, main="larger points (cex=2), method is stack")
stripchart(hb, method="jitter", cex=2, frame.plot=FALSE
, main="no frame, method is jitter")
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
[Figure: three dotplots of the head breadth data; x-axis: head breadth (mm), y-axis: count]
1.3.2 Histogram
The histogram and stem-and-leaf displays are similar, breaking the range of data
into a smaller number of equal-width intervals. This produces graphical information
about the observed distribution by highlighting where data values cluster. The his-
togram can use arbitrary intervals, whereas the intervals for the stem-and-leaf display
use the base 10 number system. There is more arbitrariness to histograms than to
stem-and-leaf displays, so histograms can sometimes be regarded a bit suspiciously.
#### hist
# histogram using R base graphics
# par() gives graphical options
# mfrow = "multifigure by row" with 1 row and 3 columns
par(mfrow=c(1,3))
# main is the title, xlab is x-axis label (ylab also available)
hist(hb, main="Modern Englishman", xlab="head breadth (mm)")
# breaks suggests the number of bins; R chooses "pretty" breakpoints near it
hist(hb, breaks = 15, main="Histogram, 15 breaks")
# freq=FALSE changes the vertical axis to density,
# so the total area of the bars is now equal to 1
hist(hb, breaks = 8, freq = FALSE, main="Histogram, density")
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1)
[Figure: histograms of the head breadth data; x-axes: head breadth (mm) and hb, y-axes: Frequency, Density, and count]
R allows you to modify the graphical display. For example, with the histogram
you might wish to use different midpoints or interval widths. I will let you explore
the possibilities.
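As a starting point for that exploration, breakpoints can be supplied explicitly through the breaks argument of hist(), and plot = FALSE returns the bin midpoints and counts without drawing. The breakpoints below are illustrative choices, not values from the notes, and the names h5, h10, and h_shift are hypothetical.

```r
hb <- c(141, 148, 132, 138, 154, 142, 150, 146, 155
      , 158, 150, 140, 147, 148, 144, 150, 149, 145)

# explicit breakpoints: intervals of width 5 versus width 10
h5  <- hist(hb, breaks = seq(130, 160, by = 5),  plot = FALSE)
h10 <- hist(hb, breaks = seq(130, 160, by = 10), plot = FALSE)

h5$mids    # interval midpoints
h5$counts  # observations per interval
h10$mids
h10$counts

# shifting the breakpoints changes the picture again
h_shift <- hist(hb, breaks = seq(127.5, 162.5, by = 5), plot = FALSE)
h_shift$counts
```

Comparing the three sets of counts shows how much the apparent shape depends on the interval choice, which is the arbitrariness mentioned above.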
#### stem-and-leaf
# stem-and-leaf plot
stem(hb)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 13 | 28
## 14 | 0124567889
## 15 | 000458
# scale=2 makes the plot roughly twice as long
stem(hb, scale=2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 13 | 2
## 13 | 8
## 14 | 0124
## 14 | 567889
## 15 | 0004
## 15 | 58
# scale=5 makes the plot roughly five times as long
stem(hb, scale=5)
##
## The decimal point is at the |
##
## 132 | 0
## 134 |
## 136 |
## 138 | 0
## 140 | 00
## 142 | 0
## 144 | 00
## 146 | 00
## 148 | 000
## 150 | 000
## 152 |
## 154 | 00
## 156 |
## 158 | 0
The data values are always truncated so that a leaf has one digit. The leaf unit
(location of the decimal point) tells us the degree of round-off. This will become
clearer in the next example.
Of the three displays, which is the most informative? I think the middle option
is best to see the clustering and shape of distributions of numbers.
The endpoints of the box are placed at the locations of the first and third quartiles.
The location of the median is identified by the line in the box. The whiskers extend to
the data points closest to but not on or outside the outlier fences, which are 1.5 × IQR
from the quartiles. Outliers are any values on or outside the outlier fences.
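The fence rule can be worked through by hand for the head breadth data. The sketch below uses the quartiles from fivenum(); the names q, iqr, fences, outliers, inside, and whiskers are illustrative. R's own boxplot() may treat a value lying exactly on a fence differently from the rule stated above, so this is a check of the idea rather than a reproduction of R's internals.

```r
hb <- c(141, 148, 132, 138, 154, 142, 150, 146, 155
      , 158, 150, 140, 147, 148, 144, 150, 149, 145)

q   <- fivenum(hb)[c(2, 4)]     # first and third quartiles (hinges)
iqr <- diff(q)                  # length of the box
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)  # outlier fences

# following the rule above: values on or outside the fences are outliers
outliers <- hb[hb <= fences[1] | hb >= fences[2]]

# whiskers end at the most extreme data values strictly inside the fences
inside   <- hb[hb > fences[1] & hb < fences[2]]
whiskers <- range(inside)

fences
outliers
whiskers
```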
The boxplot for the head breadth data is given below. There are a lot of options
that allow you to clutter the boxplot with additional information. Just use the default
settings. We want to see the relative location of data (the median line), have an idea
of the spread of data (IQR, the length of the box), see the shape of the data (relative
distances of components from each other – to be covered later), and identify outliers
(if present). The default boxplot has all these components.
Note that the boxplots below are horizontal to better fit on the page. The
ClickerQ s — Boxplots
# histogram
hist(hb, freq = FALSE
, main="Histogram with kernel density plot, Modern Englishman")
# Histogram overlaid with kernel density curve
points(density(hb), type = "l")
# rug of points under histogram
rug(hb)
# violin plot
library(vioplot)
## Loading required package: sm
## Package ’sm’, version 2.2-5.5: type help(sm) for summary information
vioplot(hb, horizontal=TRUE, col="gray")
title("Violin plot, Modern Englishman")
# boxplot
boxplot(hb, horizontal=TRUE
, main="Boxplot, Modern Englishman", xlab="head breadth (mm)")
[Figure: histogram with kernel density plot, violin plot, and boxplot of hb (Modern Englishman)]
Example: income The data below are incomes in $1000 units for a sample of
12 retired couples. Numerical and graphical summaries are given. There are two
stem-and-leaf displays provided. The first is the default display.
#### Income examples
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
# sort in decreasing order
income <- sort(income, decreasing = TRUE)
income
## [1] 1110 46 12 8 7 7 7 5 5 2 2 0
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.25 7.00 100.92 9.00 1110.00
# stem-and-leaf plot
stem(income)
##
## The decimal point is 3 digit(s) to the right of the |
##
## 0 | 00000000000
## 0 |
## 1 | 1
Because of the two large outliers, I trimmed them to get a sense of the shape of the
distribution where most of the observations are.
#### remove largest
# remove two largest values (the first two)
income2 <- income[-c(1,2)]
income2
## [1] 12 8 7 7 7 5 5 2 2 0
summary(income2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.75 6.00 5.50 7.00 12.00
# stem-and-leaf plot
stem(income2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 022
## 0 | 557778
## 1 | 2
# scale=2 makes the plot roughly twice as long
stem(income2, scale=2)
##
## The decimal point is at the |
##
## 0 | 0
## 2 | 00
## 4 | 00
## 6 | 000
## 8 | 0
## 10 |
## 12 | 0
Boxplots with full data, then incrementally removing the two largest outliers.
#### income-boxplot
# boxplot using R base graphics
# 1 row, 3 columns
par(mfrow=c(1,3))
boxplot(income, main="Income")
boxplot(income[-1], main="(remove largest)")
boxplot(income2, main="(remove 2 largest)")
[Figure: three boxplots: Income, (remove largest), (remove 2 largest)]
1.4 Interpretation of Graphical Displays for Numerical Data
In many studies, the data are viewed as a subset or sample from a larger collection
of observations or individuals under study, called the population. A primary goal
of many statistical analyses is to generalize the information in the sample to infer
something about the population. For this generalization to be possible, the sample
must reflect the basic patterns of the population. There are several ways to collect
data to ensure that the sample reflects the basic properties of the population, but
the simplest approach, by far, is to take a random or “representative” sample from
the population. A random sample has the property that every possible sample of
a given size has the same chance of being the sample (eventually) selected (though
we often do this only once). Random sampling eliminates any systematic biases
associated with the selected observations, so the information in the sample should
accurately reflect features of the population. The process of sampling introduces
random variation or random errors associated with summaries. Statistical tools are
used to calibrate the size of the errors.
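The idea of random variation in a summary can be illustrated by simulation. The sketch below is built entirely on assumptions made for illustration: a hypothetical population of 10,000 normally distributed values, a sample size of 25, and an arbitrary seed. None of these come from the notes.

```r
set.seed(427)  # arbitrary seed so the sketch is reproducible

# hypothetical population of 10,000 values (an assumption for illustration)
population <- rnorm(10000, mean = 100, sd = 15)

# simple random sample: every subset of size n is equally likely
n <- 25
one_sample <- sample(population, size = n)
mean(one_sample)

# repeating the sampling shows the random variation in the sample mean
sample_means <- replicate(1000, mean(sample(population, size = n)))
sd(sample_means)            # observed spread of the sample means
sd(population) / sqrt(n)    # roughly what the standard error predicts
```

In practice only one sample is drawn, so the spread of the sample mean has to be estimated from that single sample, which is what the standard error does.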
Whether we are looking at a histogram (or stem-and-leaf, or dotplot) from a sample,
or are conceptualizing the histogram generated by the population data, we can
imagine approximating the “envelope” around the display with a smooth curve. The
smooth curve that approximates the population histogram is called the population
frequency curve or population probability density function or population
distribution². Statistical methods for inference about a population usually make
assumptions about the shape of the population frequency curve. A common assumption
is that the population has a normal frequency curve. In practice, the observed data
are used to assess the reasonableness of this assumption. In particular, a sample
display should resemble a population display, provided the collected data are a random
or representative sample from the population. Several common shapes for frequency
distributions are given below, along with the statistical terms used to describe them.
² “Distribution function” often refers to the “cumulative distribution function”, which is a different
(but one-to-one related) function than what I mean here.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)
# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")
# boxplot
boxplot(x1, horizontal=TRUE)
## ggplot
# Histogram overlaid with kernel density curve
x1_df <- data.frame(x1)
p1 <- ggplot(x1_df, aes(x = x1))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y = after_stat(density)))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot
p2 <- ggplot(x1_df, aes(x = "x1", y = x1))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x1_df, aes(x = "x1", y = x1))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x1 displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 9
## 6 | 4
## 6 | 5889
## 7 | 3333344
## 7 | 578888899
## 8 | 01111122222223344444
## 8 | 55555666667777888889999999
## 9 | 000111111122222233333344
## 9 | 5555555556666666677777888888899999999
## 10 | 00000111222222233333344444
## 10 | 555555555666666667777777788888999999999
## 11 | 0000011111122233444
## 11 | 566677788999
## 12 | 00001123444
## 12 | 5679
## 13 | 00022234
## 13 | 6
## 14 | 3
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)
points(density(x2), type = "l")
rug(x2)
# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")
# boxplot
boxplot(x2, horizontal=TRUE)
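## ggplot
# Histogram overlaid with kernel density curve
x2_df <- data.frame(x2)
p1 <- ggplot(x2_df, aes(x = x2))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot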
p2 <- ggplot(x2_df, aes(x = "x2", y = x2))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x2_df, aes(x = "x2", y = x2))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x2 displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
summary(x2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.186 93.748 100.446 102.150 108.950 306.868
sd(x2)
## [1] 29.4546
skewness(x2)
## [1] 1.124581
kurtosis(x2)
## [1] 13.88607
stem(x2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## -0 | 30
## 0 | 34
## 2 | 938
## 4 | 891799
## 6 | 4679991123334678899
## 8 | 001233345667888890012222222333444445555666667777788888899999999
## 10 | 00000000000000000000000000000000111111111222222222222233333333444445+33
## 12 | 000001222455601122344458
## 14 | 135668998
## 16 | 33786
## 18 | 514
## 20 |
## 22 |
## 24 |
## 26 |
## 28 |
## 30 | 7
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x3, freq = FALSE, breaks = 20)
points(density(x3), type = "l")
rug(x3)
# violin plot
library(vioplot)
vioplot(x3, horizontal=TRUE, col="gray")
# boxplot
boxplot(x3, horizontal=TRUE)
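## ggplot
# Histogram overlaid with kernel density curve
x3_df <- data.frame(x3)
p1 <- ggplot(x3_df, aes(x = x3))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot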
p2 <- ggplot(x3_df, aes(x = "x3", y = x3))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x3_df, aes(x = "x3", y = x3))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x3 displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
summary(x3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.61 75.29 101.44 101.31 127.46 149.46
sd(x3)
## [1] 29.02638
skewness(x3)
## [1] -0.00953667
kurtosis(x3)
## [1] 1.778113
stem(x3)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 12234444
## 5 | 555577778889999
## 6 | 0111223334
## 6 | 556678899
## 7 | 0000011111122334444
## 7 | 5567778899
## 8 | 011111224444
## 8 | 5556666799999
## 9 | 0001112233
## 9 | 55667778999
## 10 | 00000111223334444
## 10 | 5555566777889
## 11 | 001123344444
## 11 | 55577888899999
## 12 | 001122444
## 12 | 5677778999
## 13 | 000011222333344
## 13 | 556667788889
## 14 | 01111222344444
## 14 | 55666666777788999
The mean and median are identical in a population with an exactly symmetric
frequency curve. The histogram and stem-and-leaf displays for a sample selected
from a symmetric population will tend to be fairly symmetric. Further, the sample
mean and median will likely be close.
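A small simulation can make this concrete. The sketch below is an illustration only: the seed, the sample size, and the two populations (a Normal(100, 15^2) and an Exponential with mean 1) are assumptions chosen for the example, not the data shown in the displays. It reports the gap between the sample mean and the sample median in units of the sample standard deviation.

```r
# Illustration only: seed, sample size, and distributions are assumptions
set.seed(1)

# symmetric population: Normal(100, 15^2)
y_sym <- rnorm(1000, mean = 100, sd = 15)
# right-skewed population: Exponential with mean 1
y_skew <- rexp(1000, rate = 1)

# gap between mean and median, in units of the sample standard deviation
gap_sym  <- (mean(y_sym)  - median(y_sym))  / sd(y_sym)
gap_skew <- (mean(y_skew) - median(y_skew)) / sd(y_skew)

c(symmetric = gap_sym, skewed = gap_skew)
```

For the symmetric sample the gap should be close to zero, while for the right-skewed sample the mean is pulled toward the long tail and sits well above the median.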
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)
# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")
# boxplot
boxplot(x4, horizontal=TRUE)
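## ggplot
# Histogram overlaid with kernel density curve
x4_df <- data.frame(x4)
p1 <- ggplot(x4_df, aes(x = x4))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot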
p2 <- ggplot(x4_df, aes(x = "x4", y = x4))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x4_df, aes(x = "x4", y = x4))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x4 (values from 0 to 10) displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
summary(x4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003949 0.261104 0.611573 0.908212 1.194486 9.769742
sd(x4)
## [1] 0.9963259
skewness(x4)
## [1] 3.596389
kurtosis(x4)
## [1] 27.45438
stem(x4)
##
## The decimal point is at the |
##
## 0 | 00000000001111111111111111111111122222222222222222222222222333333333+15
## 0 | 55555555555555555555566666666666666777777777777778888888889999999999
## 1 | 0000000111111111112222333333444
## 1 | 5555555566666777788999
## 2 | 0000123333444
## 2 | 556677799
## 3 | 122
## 3 | 68
## 4 | 00
## 4 |
## 5 |
## 5 |
## 6 |
## 6 |
## 7 |
## 7 |
## 8 |
## 8 |
## 9 |
## 9 | 8
Unimodal, skewed left The distribution below is unimodal and skewed to the
left. The two examples show that extremely skewed distributions often contain out-
liers in the longer tail of the distribution.
#### Unimodal, skewed left
# sample from an exponential distribution, reflected so the long tail is on the left
x5 <- 15 - rexp(250, rate = 0.5)
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)
# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")
# boxplot
boxplot(x5, horizontal=TRUE)
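## ggplot
# Histogram overlaid with kernel density curve
x5_df <- data.frame(x5)
p1 <- ggplot(x5_df, aes(x = x5))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot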
p2 <- ggplot(x5_df, aes(x = "x5", y = x5))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x5_df, aes(x = "x5", y = x5))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x5 (values from about 4 to 15) displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
summary(x5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.224 12.391 13.795 13.183 14.506 14.994
sd(x5)
## [1] 1.870509
skewness(x5)
## [1] -1.961229
kurtosis(x5)
## [1] 7.888622
stem(x5)
##
## The decimal point is at the |
##
## 4 | 2
## 5 | 5678
## 6 |
## 7 | 9
## 8 | 2569
## 9 | 44889
## 10 | 11334566678
## 11 | 001124456677778899
## 12 | 0001111223334444444444455666667788889999
## 13 | 0000001111222222333344555666666667777778888889999999
## 14 | 00000000001111111111122222223333333333444444555555555555666667777777+25
## 15 | 000000000
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x6, freq = FALSE, breaks = 20)
points(density(x6), type = "l")
rug(x6)
# violin plot
library(vioplot)
vioplot(x6, horizontal=TRUE, col="gray")
# boxplot
boxplot(x6, horizontal=TRUE)
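## ggplot
# Histogram overlaid with kernel density curve
x6_df <- data.frame(x6)
p1 <- ggplot(x6_df, aes(x = x6))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
# violin plot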
p2 <- ggplot(x6_df, aes(x = "x6", y = x6))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x6_df, aes(x = "x6", y = x6))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: x6 (values from about 60 to 180) displayed as a histogram with kernel density curve and rug, a violin plot, and a boxplot, in base graphics and in ggplot.]
summary(x6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.87 99.03 122.71 124.71 151.45 184.32
sd(x6)
## [1] 29.59037
skewness(x6)
## [1] -0.005817938
kurtosis(x6)
## [1] 1.85249
stem(x6)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 |
## 6 | 0368
## 7 | 002244688
## 8 | 0111233356677788889999
## 9 | 000001112222223333444455555666677789999999999999
## 10 | 0011111122444555666666678899
## 11 | 00011222333334555555666678899
## 12 | 00000001123345567889
## 13 | 00011122334456666777788889999
## 14 | 00122233455556677788888999
## 15 | 00000001111222223333344444455555666777888999
## 16 | 01111233344444455555666778889
## 17 | 0125678
## 18 | 00124
The boxplot and histogram or stem-and-leaf display (or dotplot) are used to-
gether to describe the distribution. The boxplot does not provide information about
modality – it only tells you about skewness and the presence of outliers.
As noted earlier, many statistical methods assume the population frequency curve
is normal. Small deviations from normality usually do not dramatically influence the
operating characteristics of these methods. We worry most when the deviations from
normality are severe, such as extreme skewness or heavy tails containing multiple
outliers.
Estimation in One-Sample Problems
Contents
2.1 Inference for a population mean
  2.1.1 Standard error, LLN, and CLT
  2.1.2 z-score
  2.1.3 t-distribution
2.2 CI for µ
  2.2.1 Assumptions for procedures
  2.2.2 The effect of α on a two-sided CI
2.3 Hypothesis Testing for µ
  2.3.1 P-values
  2.3.2 Assumptions for procedures
  2.3.3 The mechanics of setting up hypothesis tests
  2.3.4 The effect of α on the rejection region of a two-sided test
2.4 Two-sided tests, CI and p-values
2.5 Statistical versus practical significance
2.6 Design issues and power
2.7 One-sided tests on µ
  2.7.1 One-sided CIs
Learning objectives
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
Suppose that you have identified a population of interest where individuals are mea-
sured on a single quantitative characteristic, say, weight, height or IQ. You select a
random or representative sample from the population with the goal of estimating the
(unknown) population mean value, identified by µ. You cannot see much of the
population, but you would like to know what is typical in the population (µ). The
only information you can see is that in the sample.
This is a standard problem in statistical inference, and the first inferential problem
that we will tackle. For notational convenience, identify the measurements on the
sample as $Y_1, Y_2, \ldots, Y_n$, where $n$ is the sample size. Given the data, our best guess,
or estimate, of µ is the sample mean:
$$\bar{Y} = \frac{\sum_i Y_i}{n} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}.$$
[Figure: the population (a huge set of values, of which we can see very little; mean µ and standard deviation σ, both unknown) and the sample Y1, Y2, …, Yn drawn from it, which is used for inference about the population.]
There are two main methods that are used for inferences on µ: confidence in-
tervals (CI) and hypothesis tests. The standard CI and test procedures are based
on the sample mean and the sample standard deviation, denoted by s.
According to the law of large numbers (LLN), the average of the results obtained from a large number of
trials (the sample mean, Ȳ ) should be close to the expected value (the population
mean, µ), and will tend to become closer as more trials are performed.
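To see this behavior numerically, here is a minimal sketch that tracks the running mean of Uniform(0,1) draws as the number of draws grows. The seed, the number of draws, and the variable names are assumptions for the illustration.

```r
# Illustration only: seed and number of draws are assumptions
set.seed(2)

mu <- 0.5                          # population mean of Uniform(0, 1)
y  <- runif(10000, min = 0, max = 1)

# running mean after 1, 2, ..., 10000 draws
running_mean <- cumsum(y) / seq_along(y)

# look at the running mean at a few sample sizes
n_show <- c(10, 100, 1000, 10000)
data.frame(n         = n_show,
           mean      = running_mean[n_show],
           abs_error = abs(running_mean[n_show] - mu))
```

The error does not have to shrink at every single step, but it tends to shrink as n grows, and with 10000 draws the running mean should sit very close to 0.5.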
In probability theory, the central limit theorem (CLT) states that, given
certain conditions, the mean of a sufficiently large number of independent random
variables, each with finite mean and variance, will be approximately normally
distributed [1].
As a joint illustration of these concepts, consider drawing random variables following
a Uniform(0,1) distribution, that is, any value in the interval [0, 1] is equally
likely. By definition, the mean of this distribution is $\mu = 1/2$ and the variance is
$\sigma^2 = 1/12$ (so the standard deviation is $\sigma = \sqrt{1/12} = 0.289$). Therefore, if we draw
a sample of size $n$, then the standard error of the mean will be $\sigma/\sqrt{n}$, and as $n$ gets
larger the distribution of the mean will increasingly follow a normal distribution. We
illustrate this by drawing N = 10000 samples of size n, plotting those N means, and
comparing the expected and observed SEM, as well as how well the histogram of sampled
means follows a normal distribution. Notice that even with samples as small
as 2 and 6 the SEM and the shape of the distribution are as predicted.
#### Illustration of Central Limit Theorem, Uniform distribution
# demo.clt.unif(N, n)
# draws N samples of size n from Uniform(0,1)
# and plots the N means with a normal distribution overlay
demo.clt.unif <- function(N, n) {
# draw sample in a matrix with N columns and n rows
sam <- matrix(runif(N*n, 0, 1), ncol=N);
# calculate the mean of each column
sam.mean <- colMeans(sam)
# the sd of the mean is the SEM
sam.se <- sd(sam.mean)
# calculate the true SEM given the sample size n
true.se <- sqrt((1/12)/n)
# draw a histogram of the means
hist(sam.mean, freq = FALSE, breaks = 25
, main = paste("True SEM =", round(true.se, 4)
, ", Est SEM = ", round( sam.se, 4))
, xlab = paste("n =", n))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(0, 1, length = 1000)
points(x, dnorm(x, mean = 0.5, sd = true.se), type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
}
par(mfrow=c(2,2));
demo.clt.unif(10000, 1);
demo.clt.unif(10000, 2);
demo.clt.unif(10000, 6);
demo.clt.unif(10000, 12);
[1] The central limit theorem has a number of variants. In its common form, the random variables
must be identically distributed. In variants, convergence of the mean to the normal distribution also
occurs for non-identical distributions, given that they comply with certain conditions.
[Figure: histograms of N = 10000 sample means from Uniform(0,1) with density curve and normal overlay (y-axis: Density). Panels: n = 1, True SEM = 0.2887, Est SEM = 0.2893; n = 2, True SEM = 0.2041, Est SEM = 0.2006; n = 6, True SEM = 0.1179, Est SEM = 0.1194; n = 12, True SEM = 0.0833, Est SEM = 0.0843.]
par(mfrow=c(2,2));
demo.clt.exp(10000, 1);
demo.clt.exp(10000, 6);
demo.clt.exp(10000, 30);
demo.clt.exp(10000, 100);
[Figure: histograms of N = 10000 sample means from demo.clt.exp() (y-axis: Density). Panels: n = 1, True SEM = 1, Est SEM = 1.006; n = 6, True SEM = 0.4082, Est SEM = 0.4028; n = 30, True SEM = 0.1826, Est SEM = 0.1827; n = 100, True SEM = 0.1, Est SEM = 0.0995.]
Note well that the further the population distribution is from being normal, the
larger the sample size must be for the sampling distribution of the sample
mean to be approximately normal. If the population distribution is normal, what’s the minimum
sample size for the sampling distribution of the mean to be normal?
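One way to quantify "how close to normal" without a plot is to track the skewness of the N sample means, which is 0 for a normal distribution. The sketch below is an illustration under stated assumptions: it draws from an Exponential distribution with rate 1 (consistent with the True SEM of 1 reported for n = 1 above), and the helper `skew()`, the seed, and the object names are hypothetical, not part of the course code.

```r
# Illustration only: rate = 1, the seed, and skew() are assumptions
set.seed(3)
N <- 10000                         # number of samples per sample size

# sample skewness computed directly, to avoid depending on an add-on package
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

sizes <- c(1, 6, 30, 100)
res <- sapply(sizes, function(n) {
  # n rows, N columns: each column is one sample of size n
  means <- colMeans(matrix(rexp(N * n, rate = 1), nrow = n))
  c(n = n, obs_sem = sd(means), true_sem = 1 / sqrt(n), skewness = skew(means))
})

round(t(res), 3)
```

The observed SEM should track 1/sqrt(n) at every sample size, while the skewness of the sample means shrinks toward 0 only gradually, which is why a strongly skewed population needs a larger n.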
For more examples, try:
#### More examples for Central Limit Theorem can be illustrated with this code
# install.packages("TeachingDemos")
library(TeachingDemos)
# look at examples at bottom of the help page
?clt.examp
2.1.2 z-score
$$z = \frac{x - \bar{x}}{s}.$$
Below, the original variable x has a normal distribution with mean 100 and standard
deviation 15, Normal(100, $15^2$), and z has a Normal(0, 1) distribution.
# sample from normal distribution
df <- data.frame(x = rnorm(100, mean = 100, sd = 15))
df$z <- scale(df$x) # by default, this performs a z-score transformation
summary(df)
## x z.V1
## Min. : 39.64 Min. :-3.446123
## 1st Qu.: 90.99 1st Qu.:-0.485300
## Median :100.00 Median : 0.033925
## Mean : 99.41 Mean : 0.000000
## 3rd Qu.:110.72 3rd Qu.: 0.652006
## Max. :132.70 Max. : 1.919736
## ggplot
library(ggplot2)
p1 <- ggplot(df, aes(x = x))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "X ~ Normal(100, 15)")
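# z-scores: the same kind of histogram, on the standardized scale
p2 <- ggplot(df, aes(x = z))
p2 <- p2 + geom_histogram(aes(y=..density..))
p2 <- p2 + geom_density(alpha=0.1, fill="white")
p2 <- p2 + geom_rug()
p2 <- p2 + labs(title = "Z ~ Normal(0, 1)")
library(gridExtra)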
grid.arrange(grobs = list(p1, p2), ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: top panel, "X ~ Normal(100, 15)", density histogram of x (about 40 to 130); bottom panel, "Z ~ Normal(0, 1)", density histogram of z (about −3 to 2).]
2.1.3 t-distribution
The Student’s t-distribution is a family of continuous probability distributions that
arises when estimating the mean of a normally distributed population in situations
where the sample size is small and population standard deviation is unknown. The
t-distribution is symmetric and bell-shaped, like the normal distribution, but has
heavier tails, meaning that it is more prone to producing values that fall far from its
mean. Effectively, the t-distribution is wider than the normal distribution because in
addition to estimating the mean µ with Ȳ, we also have to estimate σ² with s², so
there’s some additional uncertainty. The degrees-of-freedom (df) parameter of the
t-distribution is the sample size n minus the number of mean parameters estimated.
Thus, df = n−1 when we have one sample and df = n−2 when we have two samples.
As n increases, the t-distribution becomes close to the normal distribution, and in
the limit as n → ∞ the two distributions coincide.
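The heavier tails show up directly in the critical values. As a minimal sketch (the object names are assumptions for the illustration), the code below compares the upper 2.5% critical value of the t-distribution with that of the standard normal over the same degrees of freedom used in the plot of this section.

```r
# Upper 2.5% critical values: t-distribution versus standard normal
dof    <- c(1, 2, 6, 12, 30, 100)
t_crit <- qt(0.975, df = dof)      # t critical value for each df
z_crit <- qnorm(0.975)             # normal critical value

data.frame(df     = dof,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3),
           ratio  = round(t_crit / z_crit, 3))
```

The t critical value is always larger than the normal one and decreases toward it as df increases, so intervals based on small samples are wider.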
#### Normal vs t-distributions with a range of degrees-of-freedom
x <- seq(-8, 8, length = 1000)
par(mfrow=c(1,1))
plot(x, dnorm(x), type = "l", lwd = 2, col = "red"
, main = "Normal (red) vs t-dist with df=1, 2, 6, 12, 30, 100")
points(x, dt(x, 1), type = "l")
points(x, dt(x, 2), type = "l")
points(x, dt(x, 6), type = "l")
points(x, dt(x, 12), type = "l")
points(x, dt(x, 30), type = "l")
points(x, dt(x,100), type = "l")
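[Figure: "Normal (red) vs t-dist with df=1, 2, 6, 12, 30, 100"; density curves plotted against x, with y-axis dnorm(x) from 0.0 to 0.4.]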
2.2 CI for µ
Statistical inference provides methods for drawing conclusions about a population
from sample data. In this chapter, we want to make a claim about population mean
µ given sample statistics Ȳ and s.
A CI for µ is a range of plausible values for the unknown population mean µ, based
on the observed data, of the form “Best Guess ± Reasonable Error of the Guess”. To
compute a CI for µ:
1. Define the population parameter, “Let µ = mean [characteristic] for popu-
lation of interest”.
2. Specify the confidence coefficient, which is a number between 0 and 100%,
in the form 100(1 − α)%. Solve for α. (For example, 95% has α = 0.05.)
3. Compute the t-critical value: $t_{\text{crit}} = t_{0.5\alpha}$ such that the area under the t-curve
(df = n − 1) to the right of $t_{\text{crit}}$ is $0.5\alpha$. See appendix or internet for a t-table.
4. Report the CI in the form $\bar{Y} \pm t_{\text{crit}} SE_{\bar{Y}}$ or as an interval $(L, U)$. The desired CI
has lower and upper endpoints given by $L = \bar{Y} - t_{\text{crit}} SE_{\bar{Y}}$ and $U = \bar{Y} + t_{\text{crit}} SE_{\bar{Y}}$,
respectively, where $SE_{\bar{Y}} = s/\sqrt{n}$ is the standard error of the sample mean.
5. Assess method assumptions (see below).
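The steps above can be sketched by hand in R. This is a minimal illustration, using the ages at first transplant from the heart transplant example in these notes; the object names are assumptions chosen for readability.

```r
# CI for mu computed step by step (ages from the heart transplant example)
age <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)

conf_level <- 0.95
alpha  <- 1 - conf_level                    # step 2: solve for alpha
n      <- length(age)
y_bar  <- mean(age)                         # best guess for mu
se     <- sd(age) / sqrt(n)                 # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)     # step 3: area alpha/2 to the right

# step 4: Best Guess +/- Reasonable Error of the Guess
ci <- c(lower = y_bar - t_crit * se, upper = y_bar + t_crit * se)
ci
```

The result should agree with the interval that `t.test(age)` reports for these data (45.72 to 56.82).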
In practice, the confidence coefficient is large, say 95% or 99%, which correspond
to α = 0.05 and 0.01, respectively. The value of α expressed as a percent is known
as the error rate of the CI.
The CI is determined once the confidence coefficient is specified and the data
are collected. Prior to collecting the data, the interval is unknown and is viewed as
random because it will depend on the actual sample selected. Different samples give
different CIs. The “confidence” in, say, the 95% CI (which has a 5% error rate)
can be interpreted as follows. If you repeatedly sample the population and construct
95% CIs for µ, then 95% of the intervals will contain µ, whereas 5% will not. The
interval you construct from your data will either cover µ, or it will not.
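The repeated-sampling interpretation can be checked by simulation. The sketch below is an illustration only: the population (Normal with mean 10 and standard deviation 2), the sample size, the number of repetitions, and the seed are all assumed values.

```r
# Illustration only: population, n, number of repetitions, and seed are assumptions
set.seed(4)
mu    <- 10
sigma <- 2
n     <- 25
n_rep <- 5000

# for each repetition: draw a sample, build a 95% CI, record whether it covers mu
covers <- replicate(n_rep, {
  y  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(y, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

coverage <- mean(covers)   # proportion of intervals that contain mu
coverage
```

The proportion should be close to 0.95. Any single interval either covers µ or does not; the 95% describes the long-run behavior of the procedure.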
$U - L = 2\, t_{\text{crit}}\, SE_{\bar{Y}}$
[Figure: simulated confidence intervals drawn as horizontal segments, one per sample; x-axis: Confidence Interval (8 to 12), y-axis: Index (0 to 100).]
entire population data. You can assess the reasonableness of this assumption using
a stem-and-leaf display or a boxplot of the sample data. The stem-and-leaf display
from the data should resemble a normal curve.
In fact, the assumptions are slightly looser than this: the population frequency
curve can be anything, provided the sample size is large enough that it’s reasonable
to assume that the sampling distribution of the mean is normal.
# example data, skewed --- try other datasets to develop your intuition
x <- rgamma(10, shape = .5, scale = 20)
bs.one.samp.dist(x)
[Figure: top panel, histogram of the data (dat); bottom panel, "Bootstrap sampling distribution of the mean" (y-axis: Density).]
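The definition of bs.one.samp.dist() is not shown in this section. As a rough sketch of the idea only, and not the course function itself, a bootstrap sampling distribution of the mean can be approximated by resampling the observed data with replacement many times and recording each resample mean. The seed, the number of resamples `B`, and the object names are assumptions.

```r
# Hypothetical sketch of a bootstrap for the mean (not bs.one.samp.dist itself)
set.seed(5)
x <- rgamma(10, shape = 0.5, scale = 20)   # skewed example data, as above

B <- 5000                                  # number of bootstrap resamples (assumed)
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# compare the bootstrap distribution with the usual summaries
boot_summary <- c(sample_mean = mean(x),
                  boot_mean   = mean(boot_means),
                  sample_se   = sd(x) / sqrt(length(x)),
                  boot_se     = sd(boot_means))
boot_summary
```

A histogram of `boot_means` (for example, `hist(boot_means, freq = FALSE)`) can then be compared by eye with a normal curve to judge whether the sampling distribution of the mean looks approximately normal at this sample size.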
hypothesis test, where you are trying to decide which of two contradictory claims
or hypotheses about µ is more reasonable given the observed data. The null
hypothesis, or the hypothesis under test, is H0 : µ = µ0, whereas the alternative
hypothesis is HA : µ ≠ µ0.
I will explore the ideas behind hypothesis testing later. At this point, I focus on
the mechanics behind the test. The steps in carrying out the test are:
1. Set up the null and alternative hypotheses in words and notation. In words:
“The population mean for [what is being studied] is different from [value of µ0].”
(Note that the statement in words is in terms of the alternative hypothesis.)
In notation: H0 : µ = µ0 versus HA : µ ≠ µ0 (where µ0 is specified by the
context of the problem).
2. Choose the size or significance level of the test, denoted by α. In practice,
α is set to a small value, say, 0.01 or 0.05, but theoretically can be any value
between 0 and 1.
3. Compute the test statistic
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}},$$
where $SE_{\bar{Y}} = s/\sqrt{n}$ is the standard error.
Note: I sometimes call the test statistic $t_{\text{obs}}$ to emphasize that the computed
value depends on the observed data.
4. Compute the critical value $t_{\text{crit}} = t_{0.5\alpha}$ (or p-value from the test statistic) in
the direction of the alternative hypothesis from the t-distribution table with
degrees of freedom df = n − 1.
5. State the conclusion in terms of the problem.
Reject H0 in favor of HA (i.e., decide that H0 is false, based on the data) if
$|t_s| > t_{\text{crit}}$ or p-value < α, that is, reject if $t_s < -t_{\text{crit}}$ or if $t_s > t_{\text{crit}}$. Otherwise,
fail to reject H0.
(Note: We DO NOT accept H0; more on this later.)
6. Check assumptions of the test, when possible (could do earlier to save yourself
some effort if they are not met).
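The mechanics in steps 2 through 5 can be sketched by hand in R. This is a minimal illustration using the Tocopilla cooling rates and the hypothesized value 0.54 from the meteorite example in these notes; the object names are assumptions.

```r
# Two-sided one-sample t-test computed step by step
# (Tocopilla cooling rates from the meteorite example)
toco <- c(5.60, 2.70, 6.20, 2.90, 1.50, 4.00, 4.30, 3.00, 3.60, 2.40, 6.70, 3.80)

mu0    <- 0.54                              # hypothesized mean under H0
alpha  <- 0.05                              # step 2: significance level
n      <- length(toco)
se     <- sd(toco) / sqrt(n)                # standard error of the mean
t_s    <- (mean(toco) - mu0) / se           # step 3: test statistic
t_crit <- qt(1 - alpha / 2, df = n - 1)     # step 4: critical value

# step 5: reject H0 when the statistic falls outside +/- t_crit
reject <- abs(t_s) > t_crit
c(t_s = t_s, t_crit = t_crit, reject = reject)
```

Here the test statistic is far outside ±t_crit, so H0 is rejected at the 5% level, in agreement with the `t.test()` output for these data.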
The process is represented graphically below. The area under the t-probability
curve outside ±tcrit is the size of the test, α. One-half α is the area in each tail. You
reject H0 in favor of HA only if the test statistic is outside ±tcrit .
[Figure: t-probability curve with central area 1 − α between −tcrit and tcrit and area α/2 in each tail; the two tails are labeled "Reject H0".]
2.3.1 P-values
The p-value, or observed significance level for the test, provides a measure of
plausibility for H0 . Smaller values of the p-value imply that H0 is less plausible. To
compute the p-value for a two-sided test, you
1. Compute the test statistic ts as above.
2. Evaluate the area under the t-probability curve (with df = n − 1) outside ±|ts |.
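These two steps translate directly into R. The sketch below is an illustration using the ages at first transplant and the null value µ0 = 50 from the example in these notes; the object names are assumptions.

```r
# Two-sided p-value computed from the test statistic
# (ages from the heart transplant example, H0: mu = 50)
age <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)

mu0 <- 50
n   <- length(age)
t_s <- (mean(age) - mu0) / (sd(age) / sqrt(n))   # step 1: test statistic

# step 2: area outside +/- |t_s| is twice the area in one tail
p_value <- 2 * pt(-abs(t_s), df = n - 1)
c(t_s = t_s, p_value = p_value)
```

The values should match the `t` and `p-value` entries that `t.test(age, mu = 50)` prints for these data.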
[Figure: t-probability curve with area p-value/2 shaded in each tail, below −ts and above ts.]
The p-value is the total shaded area, or twice the area in either tail. A useful in-
terpretation of the p-value is that it is the chance of obtaining data favoring HA by
Example: Age at First Transplant (Revisited) The ages (in years) at first
transplant for a sample of 11 heart transplant patients are as follows: 54, 42, 51, 54,
49, 56, 33, 58, 54, 64, 49. Summaries for these data are: n = 11, Ȳ = 51.27, s = 8.26
and SEȲ = 2.4904. Test the hypothesis that the mean age at first transplant is 50.
Use α = 0.05.
As in the earlier analysis, define
µ = mean age at time of first transplant for population of patients.
We are interested in testing H0 : µ = 50 against HA : µ ≠ 50, so µ0 = 50.
The degrees of freedom are df = 11 − 1 = 10. The critical value for a 5% test is
tcrit = t0.025 = 2.228. (Note α/2 = 0.05/2 = 0.025). The same critical value was used
with the 95% CI.
For the test,
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}} = \frac{51.27 - 50}{2.4904} = 0.51.$$
Since $|t_s| = 0.51 < t_{\text{crit}} = 2.228$, we do not reject H0 using a 5% test. Notice the placement of ts
relative to tcrit in the picture below. Equivalently, the p-value for the test is 0.62,
thus we fail to reject H0 because 0.62 > 0.05 = α. The results of the hypothesis test
should not be surprising, since the CI tells you that 50 is a plausible value for the
population mean age at transplant. Note: All you can say is that the data could have
come from a distribution with a mean of 50 — this is not convincing evidence that µ
actually is 50.
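[Figure: left panel, t-probability curve with central area .95 and rejection regions of area .025 below −2.228 and above 2.228; ts = 0.51 is in the middle of the distribution, so do not reject H0. Right panel, the areas below −.51 and above .51 are shaded; the total shaded area is the p-value, .62.]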
Example: Age at First Transplant R output for the heart transplant problem
is given below. Let us look at the output and find all of the summaries we com-
puted. Also, look at the graphical summaries to assess whether the t-test and CI are
reasonable here.
#### Example: Age at First Transplant
# enter data as a vector
age <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)
The age data are unimodal and skewed left, with no extreme outliers.
par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(age, freq = FALSE, breaks = 6)
points(density(age), type = "l")
rug(age)
# violin plot
library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")
[Figure: top panel, "Histogram of age" with kernel density curve and rug (y-axis: Density); bottom panel, violin plot of age.]
# stem-and-leaf plot
stem(age, scale=2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 3 | 3
## 3 |
## 4 | 2
## 4 | 99
## 5 | 1444
## 5 | 68
## 6 | 4
# t.crit
qt(1 - 0.05/2, df = length(age) - 1)
## [1] 2.228139
# look at help for t.test
?t.test
# defaults include: alternative = "two.sided", conf.level = 0.95
t.summary <- t.test(age, mu = 50)
t.summary
##
## One Sample t-test
##
## data: age
## t = 0.51107, df = 10, p-value = 0.6204
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 45.72397 56.82149
## sample estimates:
## mean of x
## 51.27273
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.00 49.00 54.00 51.27 55.00 64.00
The assumption of normality of the sampling distribution appears reasonably
close to being met, using the bootstrap discussed earlier. Therefore, the results for the t-test above
can be trusted.
bs.one.samp.dist(age)
[Figure: top panel, histogram of the data (dat); bottom panel, "Bootstrap sampling distribution of the mean" (y-axis: Density).]
Aside: To print the shaded region for the p-value, you can use the result of
t.test() with the function t.dist.pval() defined here.
# Function to plot t-distribution with shaded p-value
t.dist.pval <- function(t.summary) {
par(mfrow=c(1,1))
lim.extreme <- max(4, abs(t.summary$statistic) + 0.5)
lim.lower <- -lim.extreme;
lim.upper <- lim.extreme;
x.curve <- seq(lim.lower, lim.upper, length=200)
y.curve <- dt(x.curve, df = t.summary$parameter)
plot(x.curve, y.curve, type = "n"
, ylab = paste("t-dist( df =", signif(t.summary$parameter, 3), ")")
, xlab = paste("t-stat =", signif(t.summary$statistic, 5)
, ", Shaded area is p-value =", signif(t.summary$p.value, 5)))
if ((t.summary$alternative == "less")
| (t.summary$alternative == "two.sided")) {
x.pval.l <- seq(lim.lower, -abs(t.summary$statistic), length=200)
y.pval.l <- dt(x.pval.l, df = t.summary$parameter)
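polygon(c(lim.lower, x.pval.l, -abs(t.summary$statistic))
      , c(0, y.pval.l, 0), col="gray")
}
# The listing is cut off at this point; the lines below are a reconstruction
# that mirrors the lower-tail code for the upper tail and then draws the curve.
if ((t.summary$alternative == "greater")
  | (t.summary$alternative == "two.sided")) {
  x.pval.u <- seq(abs(t.summary$statistic), lim.upper, length=200)
  y.pval.u <- dt(x.pval.u, df = t.summary$parameter)
  polygon(c(abs(t.summary$statistic), x.pval.u, lim.upper)
        , c(0, y.pval.u, 0), col="gray")
}
# draw the t-density curve on top of the shaded regions
points(x.curve, y.curve, type = "l", lwd = 2)
}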
[Figure: t-distribution curve over −4 to 4 with the p-value region shaded.]
Aside: Note that the t.summary object returned from t.test() includes a number
of quantities that might be useful for additional calculations.
names(t.summary)
## [1] "statistic" "parameter" "p.value" "conf.int"
## [5] "estimate" "null.value" "alternative" "method"
## [9] "data.name"
t.summary$statistic
## t
## 0.5110715
t.summary$parameter
## df
## 10
t.summary$p.value
## [1] 0.6203942
t.summary$conf.int
## [1] 45.72397 56.82149
## attr(,"conf.level")
## [1] 0.95
t.summary$estimate
## mean of x
## 51.27273
t.summary$null.value
## mean
## 50
t.summary$alternative
## [1] "two.sided"
t.summary$method
## [1] "One Sample t-test"
t.summary$data.name
## [1] "age"
Example: Meteorites One theory of the formation of the solar system states
that all solar system meteorites have the same evolutionary history and thus have the
same cooling rates. By a delicate analysis based on measurements of phosphide crystal
widths and phosphide-nickel content, the cooling rates, in degrees Celsius per million
years, were determined for samples taken from meteorites named in the accompanying
table after the places they were found. The Walker [2] County (Alabama, US), Uwet [3]
(Cross River, Nigeria), and Tocopilla [4] (Antofagasta, Chile) meteorite cooling rate
data are below.
Suppose that a hypothesis of solar evolution predicted a mean cooling rate of
µ = 0.54 degrees per million years for the Tocopilla meteorite. Do the observed
cooling rates support this hypothesis? Test at the 5% level. The boxplot and stem-
and-leaf display (given below) show good symmetry. The assumption of a normal
distribution of observations basic to the t-test appears to be realistic.
Meteorite       Cooling rates
Walker County   0.69 0.23 0.10 0.03 0.56 0.10 0.01 0.02 0.04 0.22
Uwet            0.21 0.25 0.16 0.23 0.47 1.20 0.29 1.10 0.16
Tocopilla       5.60 2.70 6.20 2.90 1.50 4.00 4.30 3.00 3.60 2.40 6.70 3.80
Let
µ = mean cooling rate over all pieces of the Tocopilla meteorite.
To answer the question of interest, we consider the test of H0 : µ = 0.54 against
HA : µ ≠ 0.54. Let us carry out the test, compute the p-value, and calculate
a 95% CI for µ. The sample summaries are n = 12, Ȳ = 3.892, s = 1.583. The
standard error is $SE_{\bar{Y}} = s/\sqrt{n} = 0.457$.
R output for this problem is given below. For a 5% test (i.e., α = 0.05), you
would reject H0 in favor of HA because the p-value ≤ 0.05. The data strongly suggest
that µ ≠ 0.54. The 95% CI says that you are 95% confident that the population
mean cooling rate for the Tocopilla meteorite is between 2.89 and 4.90 degrees per
million years. Note that the CI gives us a means to assess how different µ is from the
hypothesized value of 0.54.
[2] http://www.lpi.usra.edu/meteor/metbull.php?code=24204
[3] http://www.lpi.usra.edu/meteor/metbull.php?code=24138
[4] http://www.lpi.usra.edu/meteor/metbull.php?code=17001
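# enter data as a vector (Tocopilla row of the table above)
toco <- c(5.60, 2.70, 6.20, 2.90, 1.50, 4.00, 4.30, 3.00, 3.60, 2.40, 6.70, 3.80)
# violin plot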
library(vioplot)
vioplot(toco, horizontal=TRUE, col="gray")
[Figure: top panel, "Histogram of toco" (y-axis: Density); bottom panel, violin plot of toco.]
# stem-and-leaf plot
stem(toco, scale=2)
##
## The decimal point is at the |
##
## 1 | 5
## 2 | 479
## 3 | 068
## 4 | 03
## 5 | 6
## 6 | 27
# t.crit
qt(1 - 0.05/2, df = length(toco) - 1)
## [1] 2.200985
t.summary <- t.test(toco, mu = 0.54)
t.summary
##
## One Sample t-test
##
## data: toco
## t = 7.3366, df = 11, p-value = 1.473e-05
## alternative hypothesis: true mean is not equal to 0.54
## 95 percent confidence interval:
## 2.886161 4.897172
## sample estimates:
## mean of x
## 3.891667
summary(toco)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 2.850 3.700 3.892 4.625 6.700
The assumption of normality of the sampling distribution appears reasonable.
Therefore, the results for the t-test above can be trusted.
t.dist.pval(t.summary)
bs.one.samp.dist(toco)
[Figure: left, t-dist( df = 11 ) curve with t-stat = 7.3366, shaded area is p-value = 1.473e-05; right, histogram of the data (dat) above the "Bootstrap sampling distribution of the mean" (Data: n = 12, mean = 3.8917, se = 0.456843).]
the researcher wishes to make. In some studies you define the hypotheses so that
HA is the take action hypothesis — rejecting H0 in favor of HA leads one to take a
radical action.
Some perspective on testing is gained by understanding the mechanics behind the
tests. A hypothesis test is a decision process in the face of uncertainty. You are given
data and asked which of two contradictory claims about a population parameter, say
µ, is more reasonable. Two decisions are possible, but whether you make the correct
decision depends on the true state of nature which is unknown to you.
                                    State of nature
Decision                      H0 true             HA true
Fail to reject [accept] H0    correct decision    Type-II error
Reject H0 in favor of HA      Type-I error        correct decision
For a given problem, only one of these errors is possible. For example, if H0 is
true you can make a Type-I error but not a Type-II error. Any reasonable decision
rule based on the data that tells us when to reject H0 and when to not reject H0 will
have a certain probability of making a Type-I error if H0 is true, and a corresponding
probability of making a Type-II error if H0 is false and HA is true. For a given
decision rule, define
α = Prob( Reject H0 given H0 is true ) = Prob( Type-I error )
and
β = Prob( Fail to reject H0 when HA true ) = Prob( Type-II error ).
The mathematics behind hypothesis tests allows you to prespecify or control α.
For a given α, the tests we use (typically) have the smallest possible value of β. Given
the researcher can control α, you set up the hypotheses so that committing a Type-I
error is more serious than committing a Type-II error. The magnitude of α, also
called the size or level of the test, should depend on the seriousness of a Type-I
error in the given problem. The more serious the consequences of a Type-I error, the
smaller α should be. In practice α is often set to 0.10, 0.05, or 0.01, with α = 0.05
being the scientific standard. By setting α to be a small value, you reject H0 in favor
of HA only if the data convincingly indicate that H0 is false.
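The meaning of α as a long-run error rate can be checked by simulation. The sketch below is illustrative only: the population (normal with mean µ0 = 0.54 and standard deviation 1), the sample size, the seed, and the number of replications are all assumptions chosen for the demonstration, not values from the notes.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
alpha <- 0.05
n     <- 12            # illustrative sample size
mu0   <- 0.54          # H0 is true in this simulation
reps  <- 10000

# Draw many samples from a normal population with mean mu0 and record
# whether the two-sided t-test rejects H0 at level alpha.
reject <- replicate(reps, {
  y <- rnorm(n, mean = mu0, sd = 1)
  t.test(y, mu = mu0)$p.value <= alpha
})

type1.rate <- mean(reject)   # proportion of false rejections
type1.rate                   # should be close to alpha
```

Every rejection here is a Type-I error, because H0 is true by construction, so the proportion of rejections estimates α.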
Let us piece together these ideas for the meteorite problem. Evolutionary history
predicts µ = 0.54. A scientist examining the validity of the theory is trying to
decide whether µ = 0.54 or µ ≠ 0.54. Good scientific practice dictates that rejecting
another’s claim when it is true is more serious than not being able to reject it when
it is false. This is consistent with defining H0 : µ = 0.54 (the status quo) and
HA : µ ≠ 0.54. To convince yourself, note that the implications of a Type-I error
would be to claim the evolutionary theory is false when it is true, whereas a Type-
II error would correspond to not being able to refute the evolutionary theory when
it is false. With this setup, the scientist will refute the theory only if the data
overwhelmingly suggest that it is false.
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}}$$
[Figure: rejection regions for 0.05 and 0.01 level tests on the t-distribution with df = 11, with critical values ±2.201 and ±3.106.]
The critical value is computed so that the area under the t-probability curve (with
df = n − 1) outside ±tcrit is α, with 0.5α in each tail. Reducing α makes tcrit larger.
That is, reducing the size of the test makes rejecting H0 harder because the rejection
region is smaller. A pictorial representation is given above for the Tocopilla data,
where µ0 = 0.54, n = 12, and df = 11. Note that tcrit = 2.201 and 3.106 for α = 0.05
and 0.01, respectively.
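The two critical values quoted for the Tocopilla data can be obtained with qt(). This is a small sketch; the only inputs are n = 12 and the two test sizes from the text.

```r
n      <- 12
alpha  <- c(0.05, 0.01)

# Two-sided test: put alpha/2 in each tail
t.crit <- qt(1 - alpha / 2, df = n - 1)
round(t.crit, 3)

# The area outside +/- t.crit recovers alpha
2 * pt(-t.crit, df = n - 1)
```

The smaller α gives the larger critical value, which is why the 0.01 rejection region sits further out in the tails.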
The mathematics behind the test presumes that H0 is true. Given the data, you
use
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}}$$
to measure how far Ȳ is from µ0 , relative to the spread in the data given by SEȲ . For
ts to be in the rejection region, Ȳ must be significantly above or below µ0 , relative
to the spread in the data. To see this, note that the rejection rule can be expressed as:
Reject H0 if
$$\bar{Y} \le \mu_0 - t_{crit}\, SE_{\bar{Y}} \quad \text{or} \quad \bar{Y} \ge \mu_0 + t_{crit}\, SE_{\bar{Y}}.$$
The rejection rule is sensible because Ȳ is our best guess for µ. You would reject
H0 : µ = µ0 only if Ȳ is so far from µ0 that you would question the reasonableness
of assuming µ = µ0 . How far Ȳ must be from µ0 before you reject H0 depends on
α (i.e., how willing you are to reject H0 if it is true), and on the value of SEȲ . For
a given sample, reducing α forces Ȳ to be further from µ0 before you reject H0 . For
a given value of α and s, increasing n allows smaller differences between Ȳ and µ0
to be statistically significant (i.e., lead to rejecting H0 ). In problems where small
differences between Ȳ and µ0 lead to rejecting H0 , you need to consider whether the
observed differences are important.
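To see how sample size drives statistical significance, one can compute the smallest distance between Ȳ and µ0 that leads to rejection, tcrit · s/√n, for several values of n. In this sketch s is held fixed at the meteorite value 1.583, and the sample sizes are illustrative assumptions.

```r
s     <- 1.583                 # sample SD from the meteorite example, held fixed
alpha <- 0.05
n     <- c(12, 50, 200, 1000)  # illustrative sample sizes

t.crit   <- qt(1 - alpha / 2, df = n - 1)
min.diff <- t.crit * s / sqrt(n)   # smallest |Ybar - mu0| that rejects H0

data.frame(n, t.crit = round(t.crit, 3), min.diff = round(min.diff, 3))
```

With a large enough sample, even a very small difference is declared significant, which is why practical importance has to be judged separately.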
In essence, the t-distribution provides an objective way to calibrate whether the
observed Ȳ is typical of what sample means look like when sampling from a normal
population where H0 is true. If all other assumptions are satisfied, and Ȳ is inordi-
nately far from µ0 , then our only recourse is to conclude that H0 must be incorrect.
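size α test rejects H0 ⇔ 100(1 − α)% CI does not contain µ0 ⇔ p-value ≤ α, and
size α test fails to reject H0 ⇔ 100(1 − α)% CI contains µ0 ⇔ p-value > α.
[Figure: t-distribution with −tcrit and tcrit marked; if ts falls between them, then p-value > α.]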
Either a CI or a test can be used to decide the plausibility of the claim that
µ = µ0 . Typically, you use the test to answer the question is there a difference?
If so, you use the CI to assess how much of a difference exists. I believe that
scientists place too much emphasis on hypothesis testing.
Regardless of the alternative hypothesis, the tests are based on the t-statistic:
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}}.$$
1. Compute the critical value tcrit such that the area under the t-curve to the right
of tcrit is the desired size α, that is tcrit = tα .
2. Reject H0 if and only if ts ≥ tcrit .
3. The p-value for the test is the area under the t-curve to the right of the test
statistic ts .
The upper one-sided test uses the upper tail of the t-distribution for a re-
jection region. The p-value calculation reflects the form of the rejection region. You
will reject H0 only for large positive values of ts which require Ȳ to be significantly
greater than µ0 . Does this make sense?
1. Compute the critical value tcrit such that the area under the t-curve to the right
of tcrit is the desired size α, that is tcrit = tα .
2. Reject H0 if and only if ts ≤ −tcrit .
3. The p-value for the test is the area under the t-curve to the left of the test
statistic ts .
The lower one-sided test uses the lower tail of the t-distribution for a rejection
region. The calculation of the rejection region in terms of −tcrit is awkward but is
necessary for hand calculations because many statistical tables only give upper tail
percentiles. Note that here you will reject H0 only for large negative values of ts
which require Ȳ to be significantly less than µ0 .
As with two-sided tests, the p-value can be used to decide between rejecting or
not rejecting H0 for a test with a given size α. A picture of the rejection region and
the p-value evaluation for one-sided tests is given below.
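[Figure: rejection region (area α beyond tcrit) and p-value (area beyond ts) for the upper one-sided test, and the corresponding lower-tail areas (beyond −tcrit and ts) for the lower one-sided test.]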
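The p-value rules for the upper one-sided, lower one-sided, and two-sided tests differ only in which tail area of the t-distribution is used. A minimal sketch follows; t.pval() is a hypothetical helper written for this illustration (it is not part of R or of the course functions), and the values of ts and df are arbitrary.

```r
# Hypothetical helper: p-value from a t-statistic for each form of HA
t.pval <- function(ts, df, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pt(ts, df, lower.tail = FALSE),  # area to the right of ts
    less      = pt(ts, df),                      # area to the left of ts
    two.sided = 2 * pt(-abs(ts), df)             # both tails
  )
}

ts <- 1.5; df <- 13   # illustrative values
p.greater <- t.pval(ts, df, "greater")
p.less    <- t.pval(ts, df, "less")
p.two     <- t.pval(ts, df, "two.sided")
c(greater = p.greater, less = p.less, two.sided = p.two)
```

The two one-sided p-values sum to one, and the two-sided p-value is twice the smaller of them.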
Let µ = the population mean weight for advertised 20 ounce cans of tomatoes
produced by the cannery. The company claims that µ = 20, but the consumer group
believes that µ < 20. Hence the consumer group wishes to test H0 : µ = 20 (or
µ ≥ 20) against HA : µ < 20. The consumer group will reject H0 only if the data
overwhelmingly suggest that H0 is false.
You should assess the normality assumption prior to performing the t-test. The
stem-and-leaf display and the boxplot suggest that the distribution might be slightly
skewed to the left. However, the skewness is not severe and no outliers are present,
so the normality assumption is not unreasonable.
R output for the problem is given below. Let us do a hand calculation using the
summarized data. The sample size, mean, and standard deviation are 14, 19.679, and
1.295, respectively. The standard error is SEȲ = s/√n = 0.346. We see that the
sample mean is less than 20. But is it sufficiently less than 20 for us to be willing to
publicly refute the canner’s claim? Let us carry out the test, first using the rejection
region approach, and then by evaluating a p-value.
The test statistic is
$$t_s = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}} = \frac{19.679 - 20}{0.346} = -0.93.$$
The critical value for a 5% one-sided test is t0.05 = 1.771, so we reject H0 if ts < −1.771
(you can get that value from R or from the table). The test statistic is not in the
rejection region. Using the t-table, the p-value is between 0.15 and 0.20. I will draw
a picture to illustrate the critical region and p-value calculation. The exact p-value
from R is 0.185, which exceeds 0.05.
Both approaches lead to the conclusion that we do not have sufficient evidence
to reject H0 . That is, we do not have sufficient evidence to question the accuracy of
the canner’s claim. If you did reject H0 , is there something about how the data were
recorded that might make you uncomfortable about your conclusions?
#### Example: Weights of canned tomatoes
tomato <- c(20.5, 18.5, 20.0, 19.5, 19.5, 21.0, 17.5
, 22.5, 20.0, 19.5, 18.5, 20.0, 18.0, 20.5)
par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(tomato, freq = FALSE, breaks = 6)
points(density(tomato), type = "l")
rug(tomato)
# violin plot
library(vioplot)
vioplot(tomato, horizontal=TRUE, col="gray")
# t.crit for a one-sided 5% test (all of alpha in one tail)
qt(1 - 0.05, df = length(tomato) - 1)
## [1] 1.770933
t.summary <- t.test(tomato, mu = 20, alternative = "less")
t.summary
##
## One Sample t-test
##
## data: tomato
## t = -0.92866, df = 13, p-value = 0.185
## alternative hypothesis: true mean is less than 20
## 95 percent confidence interval:
## -Inf 20.29153
## sample estimates:
## mean of x
## 19.67857
summary(tomato)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.50 18.75 19.75 19.68 20.38 22.50
[Figure: histogram of tomato with kernel density curve, and violin plot of tomato.]
[Figure: t-distribution (df = 13) with the p-value shaded (t-stat = −0.92866, p-value = 0.18499), and bootstrap sampling distribution of the mean (n = 14, mean = 19.679, se = 0.346121).]
Thus, you are 95% confident that the population mean weight of the canner’s 20 oz
cans of tomatoes is less than or equal to 20.29. As expected, this interval covers 20.
If you are doing a one-sided test in R, it will generate the correct one-sided bound.
That is, a lower one-sided test will generate an upper bound, whereas an upper one-
sided test generates a lower bound. If you only wish to compute a one-sided bound
without doing a test, you need to specify the direction of the alternative which gives
the type of bound you need. An upper bound was generated by R as part of the test
we performed earlier. The result agrees with the hand calculation.
Quite a few packages do not directly compute one-sided bounds, so you have to
fudge a bit. In the cannery problem, to get an upper 95% bound on µ, you take
the upper limit from a 90% two-sided confidence interval on µ. The rationale for this
is that with the 90% two-sided CI, µ will fall above the upper limit 5% of the time
and fall below the lower limit 5% of the time. Thus, you are 95% confident that µ
falls below the upper limit of this interval, which gives us our one-sided bound. Here,
you are 95% confident that the population mean weight of the canner’s 20 oz cans of
tomatoes is less than or equal to 20.29, which agrees with our hand calculation.
One-Sample T: Cans
Variable N Mean StDev SE Mean 90% CI
Cans 14 19.6786 1.2951 0.3461 (19.0656, 20.2915)
The same logic applies if you want to generalize the one-sided confidence bounds
to arbitrary confidence levels and to lower one-sided bounds — always double the
error rate of the desired one-sided bound to get the error rate of the required two-
sided interval! For example, if you want a lower 99% bound on µ (with a 1% error
rate), use the lower limit on the 98% two-sided CI for µ (which has a 2% error rate).
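The doubling rule can be checked directly in R with the tomato data listed earlier. This is a sketch using only t.test(); the object names are illustrative.

```r
tomato <- c(20.5, 18.5, 20.0, 19.5, 19.5, 21.0, 17.5,
            22.5, 20.0, 19.5, 18.5, 20.0, 18.0, 20.5)

# Upper 95% bound, as produced by the lower one-sided test
upper.one.sided <- t.test(tomato, mu = 20, alternative = "less")$conf.int[2]
# Upper limit of the 90% two-sided CI
upper.two.sided <- t.test(tomato, conf.level = 0.90)$conf.int[2]
c(upper.one.sided, upper.two.sided)

# Lower 99% bound versus the lower limit of the 98% two-sided CI
lower.one.sided <- t.test(tomato, alternative = "greater",
                          conf.level = 0.99)$conf.int[1]
lower.two.sided <- t.test(tomato, conf.level = 0.98)$conf.int[1]
c(lower.one.sided, lower.two.sided)
```

Each pair of values agrees, since a one-sided bound with error rate α uses the same t percentile as a two-sided interval with error rate 2α.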
Two-Sample Inferences
Contents
3.1 Comparing Two Sets of Measurements
3.1.1 Plotting head breadth data
3.1.2 Salient Features to Notice
3.2 Two-Sample Methods: Paired Versus Independent Samples
3.3 Two Independent Samples: CI and Test Using Pooled Variance
3.4 Satterthwaite’s Method, unequal variances
3.4.1 R Implementation
3.5 One-Sided Tests
3.6 Paired Analysis
3.6.1 R Analysis
3.7 Should You Compare Means?
Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent populations.
assess the assumptions of the two-sample t-test visually.
decide whether the means between two populations are different.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
[Figure: dot plots of head breadth (mm) for Celts and English, stacked on a common axis from 120 to 160 mm.]
2. Boxplots for comparison are most helpful when plotted in the same axes.
# boxplot using R base graphics
boxplot(HeadBreadth ~ Group, data = hb,
horizontal = TRUE,
main = "Head breadth comparison", xlab = "head breadth (mm)")
[Figure: horizontal boxplots of head breadth (mm) for the English and Celts groups on common axes.]
3. Histograms are hard to compare unless you make the scale and actual bins the
same for both. Why is the pair on the right clearly preferable?
# common x-axis limits based on the range of the entire data set
hist(hb$HeadBreadth[(hb$Group == "Celts")], xlim = range(hb$HeadBreadth),
main = "Head breadth, Celts", xlab = "head breadth (mm)")
hist(hb$HeadBreadth[(hb$Group == "English")], xlim = range(hb$HeadBreadth),
main = "Head breadth, English", xlab = "head breadth (mm)")
[Figure: histograms of head breadth for Celts and English, with default x-axis limits (left pair) and common x-axis limits (right pair); followed by ggplot2 histograms of count versus head breadth (mm) by Group.]
stem(celts, scale = 2)
##
## The decimal point is at the |
##
## 120 | 0
## 122 |
## 124 | 00
## 126 | 00
## 128 | 0
## 130 | 00
## 132 | 000
## 134 | 0
## 136 | 0
## 138 | 000
## Group: English
## HeadBreadth Group
## Min. :132.0 Celts : 0
## 1st Qu.:142.5 English:18
## Median :147.5
## Mean :146.5
## 3rd Qu.:150.0
## Max. :158.0
Example The English and Celt head breadth samples are independent.
Example Suppose you are interested in whether the CaCO3 (calcium carbonate)
level in the Atrisco well field, which is the water source for Albuquerque, is changing
over time. To answer this question, the CaCO3 level was recorded at each of 15 wells
at two time points. These data are paired. The two samples are the observations at
Times 1 and 2.
Example Suppose you are interested in whether the husband or wife is typically
the heavier smoker among couples where both adults smoke. Data are collected on
households. You measure the average number of cigarettes smoked by each husband
and wife within the sample of households. These data are paired, i.e., you have
selected husband-wife pairs as the basis for the samples. It is reasonable to believe
that the responses within a pair are related, or correlated.
Although the focus here will be on comparing population means, you should rec-
ognize that in paired samples you may also be interested, as in the problems above,
in how observations compare within a pair. That is, a paired comparison might
be interested in the difference between the two paired samples. These goals need
not agree, depending on the questions of interest. Note that with paired data, the
sample sizes are equal, and equal to the number of pairs.
3.4.1 R Implementation
R does the pooled and Satterthwaite (Welch) analyses, either on stacked or unstacked
data. The output will contain a p-value for a two-sided test of equal population means
and a CI for the difference in population means. If you include var.equal = TRUE you
will get the pooled method, otherwise the output is for Satterthwaite’s method.
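The calls below sketch both analyses on unstacked and stacked data. The measurements are made up for illustration (they are not the head breadth data), and the object names are assumptions.

```r
# Made-up measurements for two independent groups (illustration only)
g1 <- c(5.1, 4.9, 6.2, 5.8, 6.0)
g2 <- c(7.1, 6.8, 8.3, 7.9, 7.4, 8.0)

# Unstacked data: one vector per group
t.pooled <- t.test(g1, g2, var.equal = TRUE)  # pooled-variance method
t.welch  <- t.test(g1, g2)                    # Satterthwaite (Welch), the default

# Stacked data: one response column plus a grouping factor
dat <- data.frame(y   = c(g1, g2),
                  grp = factor(rep(c("g1", "g2"),
                                   c(length(g1), length(g2)))))
t.pooled.stacked <- t.test(y ~ grp, data = dat, var.equal = TRUE)

c(pooled.df = unname(t.pooled$parameter),
  welch.df  = unname(t.welch$parameter))
```

The stacked and unstacked calls give the same pooled test; the Satterthwaite degrees of freedom are never larger than the pooled value n1 + n2 − 2.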
Example: Head Breadths The English and Celts are independent samples. We
looked at boxplots and histograms, which suggested that the normality assumption
for the t-test is reasonable. The R output shows the English and Celt sample standard
deviations and IQRs are fairly close, so the pooled and Satterthwaite results should
be comparable. The pooled analysis is preferable here, but either is appropriate.
We are interested in the difference in mean head breadths between Celts and English.
1. Define the population parameters and hypotheses in words and
notation
Let µ1 and µ2 be the mean head breadth for the Celts and English, respectively.
In words: “The difference in population means between Celts and English is different
from zero mm.”
In notation: H0 : µ1 = µ2 versus HA : µ1 ≠ µ2 .
Alternatively: H0 : µ1 − µ2 = 0 versus HA : µ1 − µ2 ≠ 0.
2. Calculate summary statistics from sample
Mean, standard deviation, sample size:
#### Calculate summary statistics
m1 <- mean(celts)
s1 <- sd(celts)
n1 <- length(celts)
m2 <- mean(english)
s2 <- sd(english)
n2 <- length(english)
c(m1, s1, n1)
## [1] 130.750000 5.434458 16.000000
c(m2, s2, n2)
## [1] 146.500000 6.382421 18.000000
The pooled standard deviation, standard error, and degrees of freedom are:
sdpool <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
sdpool
## [1] 5.956876
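For reference, the remaining hand calculations for both methods can be sketched from the summary statistics printed above. The formulas are the standard pooled and Satterthwaite expressions; the object names (se.pool, df.satt, and so on) are illustrative and not necessarily those used elsewhere in the notes.

```r
# Summary statistics reported above (Celts = 1, English = 2)
m1 <- 130.75; s1 <- 5.434458; n1 <- 16
m2 <- 146.50; s2 <- 6.382421; n2 <- 18

# Pooled method
sdpool  <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
se.pool <- sdpool * sqrt(1 / n1 + 1 / n2)
df.pool <- n1 + n2 - 2
t.pool  <- (m1 - m2) / se.pool

# Satterthwaite (Welch) method
se.satt <- sqrt(s1^2 / n1 + s2^2 / n2)
df.satt <- (s1^2 / n1 + s2^2 / n2)^2 /
  ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))

round(c(sdpool = sdpool, se.pool = se.pool,
        df.pool = df.pool, t.pool = t.pool), 3)
round(c(se.satt = se.satt, df.satt = df.satt), 3)
```

Because the two sample standard deviations are similar, the Satterthwaite degrees of freedom come out just below the pooled value of 32, so the two analyses should agree closely.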
The distribution of difference in means in the third plot looks very close to normal.
bs.two.samp.diff.dist(celts, english)
[Figure: bootstrap assessment in three panels: Sample 1 (n = 16, mean = 130.75, sd = 5.4345), Sample 2 (n = 18, mean = 146.5, sd = 6.3824), and the bootstrap distribution of diff.mean.]
## 3rd Qu.:132.8
## Max. :217.0
## ----------------------------------------------------
## sex: women
## level sex
## Min. : 35.00 men : 0
## 1st Qu.: 67.00 women:18
## Median : 77.00
## Mean : 75.83
## 3rd Qu.: 84.00
## Max. :112.00
c(sd(men), sd(women), IQR(men), IQR(women), length(men), length(women))
## [1] 42.75467 17.23625 60.25000 17.00000 14.00000 18.00000
p <- ggplot(andro, aes(x = sex, y = level, fill=sex))
p <- p + geom_boxplot()
# add a "+" at the mean
p <- p + stat_summary(fun = mean, geom = "point", shape = 3, size = 2)
#p <- p + coord_flip()
p <- p + labs(title = "Androstenedione Levels in Diabetics")
print(p)
[Figure: histograms (count) and boxplots of androstenedione level by sex.]
Because of the large difference in variances, I will be more comfortable with the
Satterthwaite analysis here than the pooled variance analysis. The normality assump-
tion of the difference in means appears to be met using the bootstrap assessment. The
distribution of difference in means in the third plot looks very close to normal.
bs.two.samp.diff.dist(men, women)
[Figure: bootstrap assessment in three panels: Sample 1 (n = 14, mean = 112.5, sd = 42.755), Sample 2 (n = 18, mean = 75.833, sd = 17.236), and the bootstrap distribution of diff.mean.]
3.6.1 R Analysis
The most natural way to enter paired data is as two columns, one for each treatment
group. You can then create a new column of differences, and do the usual one-sample
graphical and inferential analysis on this column of differences, or you can do the
paired analysis directly without this intermediate step.
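The two routes give identical results, as this sketch shows. The data are made up for illustration, and the names trt.a, trt.b, and d are assumptions.

```r
# Made-up paired measurements (illustration only)
trt.a <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
trt.b <- c(10.8, 11.9, 9.5, 12.9, 11.6, 11.2)

# Route 1: create the column of differences, then a one-sample t-test
d      <- trt.b - trt.a
t.diff <- t.test(d, mu = 0)

# Route 2: paired analysis directly on the two columns
t.paired <- t.test(trt.b, trt.a, paired = TRUE)

c(t.diff$statistic, t.paired$statistic)
rbind(t.diff$conf.int, t.paired$conf.int)
```

Creating the differences explicitly has the advantage that the same column can be plotted to check the normality assumption.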
p <- p + scale_y_continuous(limits=axis.lim)
p <- p + labs(title = "IQ of identical twins raised by genetic vs foster parents")
print(p)
[Figure: scatterplot titled "IQ of identical twins raised by genetic vs foster parents", genetic (vertical axis) versus foster (horizontal axis), both from 60 to 120.]
This plot of IQ scores shows that scores are related within pairs of twins. This is
consistent with the need for a paired analysis.
Given the sample of differences, I created a boxplot and a stem-and-leaf display,
neither of which showed marked deviation from normality. The boxplot is centered at
zero, so one would not be too surprised if the test result is not significant.
p1 <- ggplot(iq, aes(x = diff))
p1 <- p1 + scale_x_continuous(limits=c(-20,+20))
# vertical line at 0
p1 <- p1 + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p1 <- p1 + geom_histogram(aes(y=after_stat(density)), binwidth=5)
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
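# violin plot
p2 <- ggplot(iq, aes(x = "diff", y = diff))
p2 <- p2 + scale_y_continuous(limits=c(-20,+20))
p2 <- p2 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + coord_flip()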
# boxplot
p3 <- ggplot(iq, aes(x = "diff", y = diff))
p3 <- p3 + scale_y_continuous(limits=c(-20,+20))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of diff, each on an axis from −20 to 20.]
The normality assumption of the sample mean for a one-sample test is satisfied
(below, left).
Given the sample of differences, I generated a one-sample CI and test. The hy-
pothesis under test is µd = µg − µf = 0. The p-value for this test is large. We do not
have sufficient evidence to claim that the population mean IQs for twins raised apart
are different. This is consistent with the CI for µd given below, which covers zero.
bs.one.samp.dist(iq$diff)
[Figure: bootstrap sampling distribution of the mean (n = 27, mean = 0.18519, se = 1.48884), and t-distribution (df = 26) with the p-value shaded (t-stat = 0.12438, p-value = 0.90197).]
Alternatively, I can generate the test and CI directly from the raw data in two
columns, specifying paired=TRUE. This gives the following output, which leads to
identical conclusions to the earlier analysis.
# two-sample paired t-test
t.summary <- t.test(iq$genetic, iq$foster, paired=TRUE)
t.summary
##
## Paired t-test
##
Remark: I could have defined the difference to be the foster IQ score minus the
genetic IQ score. How would this change the conclusions?
library(ggplot2)
# scatterplot of A and B sleep times, with 1:1 line
p <- ggplot(sleep, aes(x = A, y = B))
# draw a 1:1 line, dots above line indicate "B > A"
p <- p + geom_abline(intercept=0, slope=1, alpha=0.2)
p <- p + geom_point()
p <- p + geom_rug()
# make the axes square so it's a fair visual comparison
p <- p + coord_equal()
p <- p + scale_x_continuous(limits=axis.lim)
p <- p + scale_y_continuous(limits=axis.lim)
p <- p + labs(title = "Sleep hours gained on two sleep remedies: A vs B")
print(p)
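[Figure: scatterplot titled "Sleep hours gained on two sleep remedies: A vs B", B (vertical axis) versus A (horizontal axis), with a 1:1 reference line.]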
There is evidence here against the normality assumption of the sample mean.
We’ll continue anyway; in practice we’d instead use a nonparametric method,
discussed in a later chapter.
library(ggplot2)
p1 <- ggplot(sleep, aes(x = D))
p1 <- p1 + scale_x_continuous(limits=c(-5,+5))
p1 <- p1 + geom_histogram(aes(y=after_stat(density)), binwidth = 1)
p1 <- p1 + geom_density(alpha=0.1, fill="white", adjust = 2)
# vertical reference line at 0
p1 <- p1 + geom_vline(xintercept = 0, colour="red", linetype="dashed")
p1 <- p1 + geom_vline(xintercept = mean(sleep$D), colour="blue", alpha = 0.5)
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "Difference of sleep hours gained: D = B - A")
print(p1)
bs.one.samp.dist(sleep$D)
[Figure: histogram with density curve of D, and bootstrap sampling distribution of the mean (n = 10, mean = 1.52, se = 0.402161).]
The p-value for testing H0 is 0.004. We’d reject H0 at the 5% or 1% level, and con-
clude that the population mean sleep gains on the remedies are different. We are 95%
confident that µB exceeds µA by between 0.61 and 2.43 hours. Again, these results
must be reported with caution, because the normality assumption is unreasonable.
However, the presence of outliers tends to make the t-test and CI conservative, so we’d
expect to find similar conclusions if we used the nonparametric methods discussed
later in the semester.
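The reported test statistic, p-value, and CI follow from the summary values n = 10, mean 1.52, and se = 0.402161 shown with the bootstrap plot. A short sketch of the hand calculation, with illustrative object names:

```r
n    <- 10
dbar <- 1.52       # mean of D = B - A
se   <- 0.402161   # standard error of the mean difference

ts    <- dbar / se                           # test of H0: mu_D = 0
p.val <- 2 * pt(-abs(ts), df = n - 1)        # two-sided p-value
ci    <- dbar + c(-1, 1) * qt(0.975, df = n - 1) * se

round(c(ts = ts, p.val = p.val), 4)
round(ci, 2)
```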
# one-sample t-test of differences (paired t-test)
t.summary <- t.test(sleep$D)
t.summary
##
## One Sample t-test
##
## data: sleep$D
## t = 3.7796, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.610249 2.429751
## sample estimates:
## mean of x
## 1.52
# plot t-distribution with shaded p-value
t.dist.pval(t.summary)
[Figure: t-distribution (df = 9) with the p-value shaded.]
Checking Assumptions
Contents
4.1 Introduction
4.2 Testing Normality
4.2.1 Normality tests on non-normal data
4.3 Formal Tests of Normality
4.4 Testing Equal Population Variances
4.5 Small sample sizes, a comment
Learning objectives
After completing this topic, you should be able to:
assess the assumptions visually and via formal tests.
Achieving these goals contributes to mastery in these course learning outcomes:
10. Model assumptions.
4.1 Introduction
Almost all statistical methods make assumptions about the data collection process
and the shape of the population distribution. If you reject the null hypothesis in a
test, then a reasonable conclusion is that the null hypothesis is false, provided all
the distributional assumptions made by the test are satisfied. If the assumptions are
not satisfied, then that alone might be the cause of rejecting H0 . Likewise, a
failure to reject H0 could be caused solely by a failure to satisfy the assumptions.
Hence, you should always check assumptions to the best of your abilities.
Two assumptions that underlie the tests and CI procedures that I have discussed
are that the data are a random sample, and that the population frequency curve is
normal. For the pooled variance two-sample test the population variances are also
required to be equal.
The random sample assumption can often be assessed from an understanding of
the data collection process. Unfortunately, there are few general tests for checking
this assumption. I have described exploratory (mostly visual) methods to assess the
normality and equal variance assumption. I will now discuss formal methods to assess
these assumptions.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)
# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")
# boxplot
boxplot(x1, horizontal=TRUE)
[Figure: histogram with kernel density curve, violin plot, and boxplot of x1.]
There are many ways to get adequate QQ plots. Consider how outliers show up
in the QQ plot. There may be isolated points at both ends of the QQ plot, but only on
the right side is there an outlier. How could you have identified from the QQ plot that
the right tail looks longer than the left tail?
#### QQ plots
# R base graphics
par(mfrow=c(1,1))
# plots the data vs their normal scores
qqnorm(x1)
# plots the reference line
qqline(x1)
# ggplot2 graphics
library(ggplot2)
# http://had.co.nz/ggplot2/stat_qq.html
df <- data.frame(x1)
# stat_qq() below requires "sample" to be assigned a data.frame column
p <- ggplot(df, aes(sample = x1))
# plots the data vs their normal scores
p <- p + stat_qq()
print(p)
[Figure: normal QQ plots of x1: base graphics "Normal Q-Q Plot" (Sample Quantiles versus Theoretical Quantiles) and the ggplot2 version (sample versus theoretical).]
If you lay a straightedge along the bulk of the plot (putting in a regression line is
not the right way to do it, even if it is easy), you see that the most extreme point on
the right is a little above the line, and the last few points on the left are also a little
above the line. What does this mean? The straight line is where expected and actual
values coincide. On the right, a point above the line is a data value larger, that is, more
extreme, than expected from a normal distribution. What about the left? There, a point
more extreme than expected would fall below the line. Since the deviations on the left
are above the line, those points are less extreme than expected, so the left tail is
shorter than the right tail.
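How tail behavior shows up relative to the reference line can be checked numerically. The sketch below is an illustration under stated assumptions: instead of random data it uses exact exponential quantiles as an idealized right-skewed "sample", and it rebuilds the line through the first and third quartiles, which is how qqline() draws its reference line by default. With sample quantiles on the vertical axis, as in qqnorm(), points above the line are larger than expected under normality.

```r
# Idealized right-skewed "sample": exact exponential quantiles (no randomness)
y  <- qexp(ppoints(100))
qq <- qqnorm(y, plot.it = FALSE)   # qq$x = theoretical, qq$y = sample quantiles

# Reference line through the first and third quartiles
slope     <- unname(diff(quantile(y, c(0.25, 0.75))) /
                    diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(y, 0.25)) - slope * qnorm(0.25)

# Vertical distance of each point from the line
resid <- qq$y - (intercept + slope * qq$x)
i.min <- which.min(qq$x)   # leftmost point
i.max <- which.max(qq$x)   # rightmost point
c(left = resid[i.min], right = resid[i.max])
```

Both extremes lie above the line here: the largest value is larger than a normal sample would give (long right tail), while the smallest value is not as small as expected (short left tail).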
Even more useful is to add confidence intervals (point-wise, not family-wise —
you will learn the meaning of those terms in the ANOVA section). You don’t expect
a sample from a normally distributed population to have a normal scores plot that
falls exactly on the line, and the amount of deviation depends upon the sample size.
The best QQ plot I could find is the qqPlot() function in the car package. Note
that with the dist= option you can use this technique to check whether the data appear
to come from many other distributions, not just the normal.
par(mfrow=c(1,1))
# QQ plot of x1 with a pointwise confidence envelope
library(car)
## Loading required package: carData
# las = 1 : turns labels on y-axis to read horizontally
# id = list(n = 6, cex = 1) : labels the 6 most extreme observations
#   (and outputs their indices to the console) at label size cex = 1
# lwd = 1 : line width
qqPlot(x1, las = 1, id = list(n = 6, cex = 1), lwd = 1, main="QQ Plot")
## [1] 65 110 86 125 31 111
[Figure: "QQ Plot" of x1 versus norm quantiles with confidence envelope; observations 65, 110, 86, 125, 31, and 111 are labelled.]
In this case the x-axis is labelled “norm quantiles”. You only see a couple of data
values outside the limits (in the tails, where it usually happens). You expect around
5% outside the limits, so there is no indication of non-normality here. I did sample
from a normal population.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)
# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")
# boxplot
boxplot(x2, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x2, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram, violin plot, and boxplot of x2, and "QQ Plot" of x2 versus norm quantiles.]
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x3, freq = FALSE, breaks = 20)
points(density(x3), type = "l")
rug(x3)
# violin plot
library(vioplot)
vioplot(x3, horizontal=TRUE, col="gray")
# boxplot
boxplot(x3, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x3, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram with kernel density curve, violin plot, and boxplot of x3, and "QQ Plot" of x3 versus norm quantiles.]
Right-skewed (Exponential)
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)
# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")
# boxplot
boxplot(x4, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x4, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram with kernel density curve, violin plot, and boxplot of x4; normal QQ plot of x4 (x-axis: norm quantiles).]
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)
# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")
# boxplot
boxplot(x5, horizontal=TRUE)
par(mfrow=c(1,1))
qqPlot(x5, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram with kernel density curve, violin plot, and boxplot of x5; normal QQ plot of x5 (x-axis: norm quantiles).]
Notice how striking the lack of linearity is in the QQ plots for all the non-normal
distributions, particularly the symmetric light-tailed distribution, where the boxplot
looks fairly good. The QQ plot is a sensitive measure of normality. Let us summarize
the patterns we see regarding tails in the plots:
Tail Weight   Left Tail                        Right Tail
Light         Left side of plot points left    Right side of plot points right
Heavy         Left side of plot points down    Right side of plot points up
Extreme outliers and skewness have the biggest effects on standard methods based
on normality. The Shapiro-Wilk (SW) test is better at picking up these problems
than the Kolmogorov-Smirnov (KS) test. The KS test tends to highlight deviations
from normality in the center of the distribution. These types of deviations are rarely
important because they do not have a noticeable effect on the operating characteristics
of the standard methods. The Anderson-Darling (AD) and Ryan-Joiner (RJ) tests are
modifications designed to handle some of these objections.
Tests for normality may have low power in small to moderate sized samples. Visual
assessment of normality is often more valuable than a formal test. The tests for the
distributions of data above are below and in Figure 4.1.
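To get a feel for the low-power claim, here is a minimal simulation sketch. It assumes exponential (right-skewed) data and a few illustrative sample sizes of my choosing, and estimates how often the Shapiro-Wilk test rejects normality at the 0.05 level.

```r
# Sketch: estimated power of the Shapiro-Wilk test against a right-skewed
# (exponential) population. Sample sizes and n_sims are illustrative choices.
set.seed(1)
n_sims       <- 2000
sample_sizes <- c(10, 30, 100)

sw_power <- sapply(sample_sizes, function(n) {
  p_values <- replicate(n_sims, shapiro.test(rexp(n))$p.value)
  mean(p_values < 0.05)   # proportion of samples in which normality is rejected
})
names(sw_power) <- paste0("n=", sample_sizes)
round(sw_power, 2)
```

With the smallest sample size, a clearly non-normal population is missed in a sizeable share of the simulated samples, which is why a plot is usually worth more than the test.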
Normal distribution
#### Formal Tests of Normality
shapiro.test(x1)
##
## Shapiro-Wilk normality test
##
## data: x1
## W = 0.98584, p-value = 0.1289
library(nortest)
ad.test(x1)
##
## Anderson-Darling normality test
##
## data: x1
## A = 0.40732, p-value = 0.3446
# lillie.test(x1)
cvm.test(x1)
##
## Cramer-von Mises normality test
##
## data: x1
## W = 0.05669, p-value = 0.4159
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(sleep$d, freq = FALSE, breaks = 20)
points(density(sleep$d), type = "l")
rug(sleep$d)
# violin plot
library(vioplot)
vioplot(sleep$d, horizontal=TRUE, col="gray")
# boxplot
boxplot(sleep$d, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
[Figure: histogram with kernel density curve, violin plot, and boxplot of sleep$d; normal QQ plot of sleep$d with several points labeled.]
Men
shapiro.test(men)
##
## Shapiro-Wilk normality test
##
## data: men
## W = 0.90595, p-value = 0.1376
library(nortest)
ad.test(men)
##
## Anderson-Darling normality test
##
## data: men
## A = 0.4718, p-value = 0.2058
# lillie.test(men)
cvm.test(men)
##
## Cramer-von Mises normality test
##
## data: men
## W = 0.063063, p-value = 0.3221
Women
shapiro.test(women)
##
## Shapiro-Wilk normality test
##
## data: women
## W = 0.95975, p-value = 0.5969
library(nortest)
ad.test(women)
##
## Anderson-Darling normality test
##
## data: women
## A = 0.39468, p-value = 0.3364
# lillie.test(women)
cvm.test(women)
##
## Cramer-von Mises normality test
##
## data: women
## W = 0.065242, p-value = 0.3057
library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1)
# QQ plot
par(mfrow=c(2,1))
qqPlot(men, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Men")
qqPlot(women, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Women")
[Figure: level by sex (men, women) and histograms of counts by sex; normal QQ plots for men and for women.]
Most statisticians use graphical methods (boxplot, normal scores plot) to assess
normality, and do not carry out formal tests.
print(p)
[Figure: Twenty-five samples of size n=30 from a Normal(0,1) distribution; 25 histogram panels numbered 1 to 25 (x-axis: x; y-axis: count).]
By viewing many versions of this plot, with varying sample sizes, you’ll develop your
intuition about what a normal sample looks like.
methods otherwise.
There are a number of well-known tests for equal population variances, of which
Bartlett’s test and Levene’s test are probably the best known. Bartlett’s test assumes
the population distributions are normal, and is the best test when this is true. In
practice, unequal variances and non-normality often go hand-in-hand, so you should
check normality prior to using Bartlett’s test. It is sensitive to data that are not
normally distributed, thus it is more likely to return a “false positive” (reject
H0 of equal variances) when the data are non-normal. Levene’s test is more robust
to departures from normality than Bartlett’s test; it is in the car package. The
Fligner-Killeen test is a non-parametric test which is very robust against departures
from normality.
I will now define Bartlett’s test, which assumes normally distributed data. As
above, let $n^* = n_1 + n_2 + \cdots + n_k$, where the $n_i$s are the sample sizes from the $k$
groups, and define
$$v = 1 + \frac{1}{3(k-1)} \left( \sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{n^* - k} \right).$$
where $s^2_{\mathrm{pooled}}$ is the pooled estimator of variance and $s^2_i$ is the estimated variance
based on the $i$th sample.
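For reference, a standard textbook way to write Bartlett’s statistic with these quantities is given below. I am stating the usual form with natural logarithms, which is an assumption on my part about the version intended here; a base-10 version with a 2.303 multiplier is equivalent.

```latex
B_{\mathrm{obs}} = \frac{1}{v} \left\{ (n^* - k) \log s^2_{\mathrm{pooled}}
                   - \sum_{i=1}^{k} (n_i - 1) \log s^2_i \right\},
\qquad
s^2_{\mathrm{pooled}} = \frac{\sum_{i=1}^{k} (n_i - 1)\, s^2_i}{n^* - k}.
```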
Large values of $B_{\mathrm{obs}}$ suggest that the population variances are unequal. For a size
α test, we reject H0 if $B_{\mathrm{obs}} \ge \chi^2_{k-1,\mathrm{crit}}$, where $\chi^2_{k-1,\mathrm{crit}}$ is the upper-α percentile for the
$\chi^2_{k-1}$ (chi-squared) probability distribution with k − 1 degrees of freedom. A generic
plot of the $\chi^2$ distribution is given below. A p-value for the test is given by the area
under the chi-squared curve to the right of $B_{\mathrm{obs}}$.
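The calculation can be sketched in R as follows. The three groups are simulated, hypothetical data, and the statistic is the standard natural-log form of Bartlett’s test, checked against the built-in bartlett.test().

```r
# Sketch: Bartlett's statistic computed by hand and compared with
# bartlett.test(). The data below are simulated, hypothetical groups.
set.seed(2)
y <- c(rnorm(12, sd = 1.0), rnorm(13, sd = 1.2), rnorm(14, sd = 0.9))
g <- factor(rep(c("a", "b", "c"), times = c(12, 13, 14)))

n_i   <- tapply(y, g, length)   # group sample sizes
s2_i  <- tapply(y, g, var)      # group sample variances
k     <- length(n_i)            # number of groups
n_tot <- sum(n_i)               # total sample size n*

s2_pooled <- sum((n_i - 1) * s2_i) / (n_tot - k)
v     <- 1 + (sum(1 / (n_i - 1)) - 1 / (n_tot - k)) / (3 * (k - 1))
B_obs <- ((n_tot - k) * log(s2_pooled) - sum((n_i - 1) * log(s2_i))) / v
p_val <- pchisq(B_obs, df = k - 1, lower.tail = FALSE)  # area to the right of B_obs

bt <- bartlett.test(y ~ g)
c(by_hand = B_obs, built_in = unname(bt$statistic))
```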
These ailing counties tend to be mostly rural, sparsely populated, and located in
traditionally Republican states in the Midwest, the South, and the West.” Tongue-
in-cheek, Wainer and Zwerling comment: “It is easy to infer that their high cancer
rates might be directly due to the poverty of the rural lifestyle — no access to good
medical care, a high-fat diet, and too much alcohol, too much tobacco.” Something
is wrong, of course. The rural lifestyle cannot explain both very high and very low
incidence of kidney cancer.
The key factor is not that the counties were rural or predominantly Republican. It
is that rural counties have small populations. The law of large numbers says that as
the sample size increases, the sample statistic converges to the population parameter;
that is, large samples are more precise than small samples. What Kahneman is calling
the law of small numbers warns that small samples yield extreme results more often
than large samples do.
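A small simulation makes the point concrete. This is only a sketch: the true rate and the two county population sizes are hypothetical values, and every county is given exactly the same underlying rate.

```r
# Sketch: all counties share one true rate; only the population size differs.
# true_rate, pop_small, and pop_large are hypothetical illustrative values.
set.seed(3)
true_rate  <- 0.001
n_counties <- 1000
pop_small  <- 1000
pop_large  <- 100000

rate_small <- rbinom(n_counties, pop_small, true_rate) / pop_small
rate_large <- rbinom(n_counties, pop_large, true_rate) / pop_large

# Small counties produce both the highest and the lowest observed rates
range(rate_small)
range(rate_large)
# Ratio of spreads; theory suggests roughly sqrt(pop_large / pop_small) = 10
sd(rate_small) / sd(rate_large)
```

The small counties show up at both ends of the ranking even though nothing about their true rate differs.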
Contents
5.1 ANOVA
5.2 Multiple Comparison Methods: Fisher’s Method
5.2.1 FSD Multiple Comparisons in R
5.2.2 Bonferroni Comparisons
5.3 Further Discussion of Multiple Comparisons
5.4 Checking Assumptions in ANOVA Problems
5.5 Example from the Child Health and Development Study (CHDS)
Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent populations.
assess the assumptions of the ANOVA visually and by formal tests.
decide whether the means between populations are different, and how.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
5.1 ANOVA
The one-way analysis of variance (ANOVA) is a generalization of the two sample
t-test to k ≥ 2 groups. Assume that the populations of interest have the following
(unknown) population means and standard deviations:
           population 1   population 2   · · ·   population k
mean       µ1             µ2             · · ·   µk
std dev    σ1             σ2             · · ·   σk
A usual interest in ANOVA is whether µ1 = µ2 = · · · = µk . If not, then we wish
to know which means differ, and by how much. To answer these questions we select
samples from each of the k populations, leading to the following data summary:
           sample 1   sample 2   · · ·   sample k
size       n1         n2         · · ·   nk
mean       Ȳ1         Ȳ2         · · ·   Ȳk
std dev    s1         s2         · · ·   sk
A little more notation is needed for the discussion. Let $Y_{ij}$ denote the $j$th observation
in the $i$th sample and define the total sample size $n^* = n_1 + n_2 + \cdots + n_k$. Finally, let
$\bar{\bar{Y}}$ be the average response over all samples (combined), that is
$$\bar{\bar{Y}} = \frac{\sum_{ij} Y_{ij}}{n^*} = \frac{\sum_{i} n_i \bar{Y}_i}{n^*}.$$
Note that $\bar{\bar{Y}}$ is not the average of the sample means, unless the sample sizes $n_i$ are
equal.
An F-statistic is used to test H0 : µ1 = µ2 = · · · = µk against HA : not H0 (that is,
at least two means are different). The assumptions needed for the standard ANOVA
F-test are analogous to the independent pooled two-sample t-test assumptions: (1)
Independent random samples from each population. (2) The population frequency
curves are normal. (3) The populations have equal standard deviations, σ1 = σ2 =
· · · = σk .
The F-test is computed from the ANOVA table, which breaks the spread in the
combined data set into two components, or Sums of Squares (SS). The Within
SS, often called the Residual SS or the Error SS, is the portion of the total spread
due to variability within samples:
$$\mathrm{SS(Within)} = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2 = \sum_{ij} (Y_{ij} - \bar{Y}_i)^2.$$
The Between SS, often called the Model SS, measures the spread between the sample
means,
$$\mathrm{SS(Between)} = n_1(\bar{Y}_1 - \bar{\bar{Y}})^2 + n_2(\bar{Y}_2 - \bar{\bar{Y}})^2 + \cdots + n_k(\bar{Y}_k - \bar{\bar{Y}})^2 = \sum_{i} n_i(\bar{Y}_i - \bar{\bar{Y}})^2,$$
weighted by the sample sizes. These two SS add to give
$$\mathrm{SS(Total)} = \mathrm{SS(Between)} + \mathrm{SS(Within)} = \sum_{ij} (Y_{ij} - \bar{\bar{Y}})^2.$$
Each SS has its own degrees of freedom (df). The df(Between) is the number of
groups minus one, k − 1. The df(Within) is the total number of observations minus
the number of groups, n∗ − k. Dividing each SS by its df gives the corresponding
Mean Square (MS), and the F-statistic is the ratio
$$F_s = \frac{\mathrm{MS(Between)}}{\mathrm{MS(Within)}}.$$
Large values of $F_s$ indicate large variability among the sample means $\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k$
relative to the spread of the data within samples. That is, large values of $F_s$ suggest
that H0 is false.
Formally, for a size α test, reject H0 if $F_s \ge F_{\mathrm{crit}}$, where $F_{\mathrm{crit}}$ is the upper-α
percentile from an F distribution with numerator degrees of freedom k − 1 and
denominator degrees of freedom n∗ − k (i.e., the df for the numerators and denominators
in the F-ratio). The p-value for the test is the area under the F probability curve to
the right of $F_s$:
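[Figure: two F distribution curves. Left: the rejection region, reject H0 for F_S at or above F_Crit. Right: the p-value as the area to the right of F_S, compared with F_Crit.]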
## 6 6 fat1 176
## 7 1 fat2 178
## 8 2 fat2 191
## 9 3 fat2 197
## 10 4 fat2 182
## 11 5 fat2 185
## 12 6 fat2 177
## 13 1 fat3 175
## 14 2 fat3 186
## 15 3 fat3 178
## 16 4 fat3 171
## 17 5 fat3 163
## 18 6 fat3 176
## 19 1 fat4 155
## 20 2 fat4 166
## 21 3 fat4 149
## 22 4 fat4 164
## 23 5 fat4 170
## 24 6 fat4 168
# or as simple as:
# melt(fat, "Row")
If you don’t specify variable.name, it will name that column “variable”, and if you
leave out value.name, it will name that column “value”.
From long to wide: Use dcast() from the reshape2 package.
#### From long to wide format
fat.wide <- dcast(fat.long, Row ~ type, value.var = "amount")
fat.wide
## Row fat1 fat2 fat3 fat4
## 1 1 164 178 175 155
## 2 2 172 191 186 166
## 3 3 168 197 178 149
## 4 4 177 182 171 164
## 5 5 190 185 163 170
## 6 6 176 177 176 168
Now that we’ve got our data in the long format, let’s return to the ANOVA.
Back to ANOVA: Let’s look at the numerical summaries. We’ve seen other ways
of computing these so I’ll show you another way.
#### Back to ANOVA
# Calculate the mean, sd, n, and se for the four fats
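One way to get these summaries is sketched below with base R’s tapply(). The notes may have used a different function for this step, so treat this as an alternative; fat.long is rebuilt from the values printed above so the block stands alone.

```r
# Sketch: per-group mean, sd, n, and se for the doughnut data (base R only).
# fat.long is reconstructed here from the data listing shown earlier.
fat.long <- data.frame(
  type   = factor(rep(c("fat1", "fat2", "fat3", "fat4"), each = 6)),
  amount = c(164, 172, 168, 177, 190, 176,
             178, 191, 197, 182, 185, 177,
             175, 186, 178, 171, 163, 176,
             155, 166, 149, 164, 170, 168)
)

fat.summary <- data.frame(
  type = levels(fat.long$type),
  m    = as.vector(tapply(fat.long$amount, fat.long$type, mean)),
  s    = as.vector(tapply(fat.long$amount, fat.long$type, sd)),
  n    = as.vector(tapply(fat.long$amount, fat.long$type, length))
)
fat.summary$se <- fat.summary$s / sqrt(fat.summary$n)  # standard error of each mean
fat.summary
```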
Let’s plot the data with boxplots, individual points, mean, and CI by fat type.
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(fat.long, aes(x = type, y = amount))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(fat.long$amount),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
# (ggplot2 >= 3.3.0 names this argument `fun`; older versions used `fun.y`)
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution (mean_cl_normal requires the Hmisc package)
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Doughnut fat absorption") + ylab("amount absorbed (g)")
print(p)
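[Figure: Doughnut fat absorption; boxplots with observed points, group means, and confidence intervals by fat type (y-axis: amount absorbed (g)).]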
The p-value for the F-test is 0.001. The scientist would reject H0 at any of the
usual test levels (such as 0.05 or 0.01). The data suggest that the population mean
absorption rates differ across fats in some way. The F-test does not say how they
differ. The pooled standard deviation $s_{\mathrm{pooled}} = 8.18$ is the “Residual standard error”.
We’ll ignore the rest of this output for now.
fit.f <- aov(amount ~ type, data = fat.long)
summary(fit.f)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 3 1596 531.8 7.948 0.0011 **
## Residuals 20 1338 66.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.f
## Call:
## aov(formula = amount ~ type, data = fat.long)
##
## Terms:
## type Residuals
## Sum of Squares 1595.500 1338.333
## Deg. of Freedom 3 20
##
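To connect this output back to the SS formulas, here is a sketch that rebuilds the ANOVA quantities by hand from the doughnut data printed above. The variable names are my own.

```r
# Sketch: one-way ANOVA quantities computed directly from the SS formulas.
amount <- c(164, 172, 168, 177, 190, 176,   # fat1
            178, 191, 197, 182, 185, 177,   # fat2
            175, 186, 178, 171, 163, 176,   # fat3
            155, 166, 149, 164, 170, 168)   # fat4
type <- factor(rep(c("fat1", "fat2", "fat3", "fat4"), each = 6))

n_i    <- tapply(amount, type, length)  # group sizes
ybar_i <- tapply(amount, type, mean)    # group means
s2_i   <- tapply(amount, type, var)     # group variances
k      <- nlevels(type)
n_tot  <- length(amount)
ybar   <- mean(amount)                  # grand mean over all observations

ss_between <- sum(n_i * (ybar_i - ybar)^2)
ss_within  <- sum((n_i - 1) * s2_i)
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n_tot - k)
F_s        <- ms_between / ms_within
p_value    <- pf(F_s, df1 = k - 1, df2 = n_tot - k, lower.tail = FALSE)
s_pooled   <- sqrt(ms_within)           # the "Residual standard error"

round(c(SS_between = ss_between, SS_within = ss_within,
        F = F_s, p = p_value, s_pooled = s_pooled), 4)
```

The sums of squares and the F value should agree with the summary(fit.f) table above.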
$$t_s = \frac{\bar{Y}_i - \bar{Y}_j}{s_{\mathrm{pooled}} \sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_j}}}.$$
The minimum absolute difference between $\bar{Y}_i$ and $\bar{Y}_j$ needed to reject H0 is the LSD,
the quantity on the right hand side of this inequality. If all the sample sizes are equal,
$n_1 = n_2 = \cdots = n_k$, then the LSD is the same for each comparison:
$$\mathrm{LSD} = t_{\mathrm{crit}} \, s_{\mathrm{pooled}} \sqrt{\frac{2}{n_1}},$$
where $n_1$ is the common sample size.
I will illustrate Fisher’s method on the doughnut data, using α = 0.05. At the
first step, you reject the hypothesis that the population mean absorptions are equal
because the p-value is 0.001. At the second step, compare all pairs of fats at the 5% level.
Here, $s_{\mathrm{pooled}} = 8.18$ and $t_{\mathrm{crit}} = 2.086$ for a two-sided test based on 20 df (the error df,
from the Residual SS). Each sample has six observations, so the LSD for each comparison is
$$\mathrm{LSD} = 2.086 \times 8.18 \times \sqrt{\frac{2}{6}} = 9.85.$$
Any two sample means that differ by at least 9.85 in magnitude are significantly
different at the 5% level.
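The same arithmetic can be sketched in R. The inputs are the rounded values quoted in the text, and the sample means are taken from the table of sample means for this example.

```r
# Sketch: Fisher's LSD for the doughnut data from the quoted summary values.
s_pooled <- 8.18    # pooled sd (residual standard error)
df_error <- 20      # df for the Residual SS
n_each   <- 6       # common sample size
alpha    <- 0.05

t_crit <- qt(1 - alpha / 2, df = df_error)       # two-sided critical value
lsd    <- t_crit * s_pooled * sqrt(2 / n_each)

means    <- c(fat2 = 185.00, fat3 = 174.83, fat1 = 174.50, fat4 = 162.00)
abs_diff <- abs(outer(means, means, "-"))        # all pairwise |differences|
exceeds  <- abs_diff >= lsd                      # TRUE where the pair differs significantly

round(c(t_crit = t_crit, LSD = lsd), 3)
exceeds
```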
An easy way to compare all pairs of fats is to order the samples by their sample
means. The samples can then be grouped easily, noting that two fats are in the same
group if the absolute difference between their sample means is smaller than the LSD.
Fats Sample Mean
2 185.00
3 174.83
1 174.50
4 162.00
There are six comparisons of two fats. From this table, you can visually assess
which sample means differ by at least the LSD=9.85, and which ones do not. For
completeness, the table below summarizes each comparison:
Comparison      Absolute difference in means   Exceeds LSD?
Fats 2 and 3    10.17                          Yes
Fats 2 and 1    10.50                          Yes
Fats 2 and 4    23.00                          Yes
Fats 3 and 1     0.33                          No
Fats 3 and 4    12.83                          Yes
Fats 1 and 4    12.50                          Yes
The end product of the multiple comparisons is usually presented as a collection of
groups, where a group is defined to be a set of populations with sample means that
are not significantly different from each other. Overlap among groups is common, and
occurs when one or more populations appears in two or more groups. Any overlap
requires a more careful interpretation of the analysis.
There are three groups for the doughnut data, with no overlap. Fat 2 is in a group
by itself, and so is Fat 4. Fats 3 and 1 are in a group together. This information
can be summarized by ordering the samples from lowest to highest average, and then
connecting the fats in the same group using an underscore:
FAT 4   FAT 1   FAT 3   FAT 2
-----   -------------   -----
Assuming all comparisons are of interest, you can implement the Bonferroni
adjustment in R by specifying p.adjust.method = "bonf". A by-product of the
Bonferroni adjustment is that we have at least 100(1 − α)% confidence that all pairwise
t-test statements hold simultaneously!
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## fat1 fat2 fat3
## fat2 0.22733 - -
## fat3 1.00000 0.26241 -
## fat4 0.09286 0.00056 0.07960
##
## P value adjustment method: bonferroni
Looking at the output, can you create the groups? You should get the groups given
below, which implies you have sufficient evidence to conclude that the population
mean absorption for Fat 2 is different than that for Fat 4.
FAT 4   FAT 1   FAT 3   FAT 2
---------------------
        ---------------------
The Bonferroni method tends to produce “coarser” groups than the FSD method,
because the individual comparisons are conducted at a lower (alpha/error) level.
Equivalently, the minimum significant difference is inflated for the Bonferroni method.
For example, in the doughnut problem with FER ≤ 0.05, the critical value for the
individual comparisons at the 0.0083 level is $t_{\mathrm{crit}} = 2.929$ with df = 20. The minimum
significant difference for the Bonferroni comparisons is
$$\mathrm{LSD} = 2.929 \times 8.18 \times \sqrt{\frac{2}{6}} = 13.824$$
versus an LSD = 9.85 for the FSD method. Referring back to our table of sample
means for Fisher’s method, we see that the sole comparison where the absolute difference
between sample means exceeds 13.824 involves Fats 2 and 4.
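As a check on the arithmetic, the per-comparison level and the minimum significant difference can be sketched in R. The inputs are the rounded values quoted in the text, so the result may differ from the hand calculation in the last digit or two.

```r
# Sketch: Bonferroni per-comparison level and minimum significant difference.
k        <- 4                 # number of fats
n_comp   <- choose(k, 2)      # number of pairwise comparisons (6)
fer      <- 0.05              # family error rate
alpha_pc <- fer / n_comp      # per-comparison level, about 0.0083

s_pooled <- 8.18              # pooled sd quoted in the text
df_error <- 20                # df for the Residual SS
n_each   <- 6                 # common sample size

t_crit <- qt(1 - alpha_pc / 2, df = df_error)
msd    <- t_crit * s_pooled * sqrt(2 / n_each)
round(c(alpha_pc = alpha_pc, t_crit = t_crit, msd = msd), 4)
```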
groups (cauc = Caucasian, afam = African American, and naaa = Native American
and Asian) follow. The data values are in mm.
[Figure: skull diagram labeling anatomical landmarks, including the Glabella.]
[Figure: Glabella thickness by population (y-axis: thickness (mm)).]
There are 3 groups, so there are 3 possible pairwise comparisons. If you want a
Bonferroni analysis with FER of no greater than 0.05, you should do the individual
comparisons at the 0.05/3 = 0.0167 level. Except for the mild outlier in the Caucasian
sample, the observed distributions are fairly symmetric, with similar spreads. I would
expect the standard ANOVA to perform well here.
Let µc = population mean Glabella measurement for Caucasians, µa = population
mean Glabella measurement for African Americans, and µn = population mean
Glabella measurement for Native Americans and Asians.
glabella.summary <- ddply(glabella.long, "pop",
function(X) { data.frame( m = mean(X$thickness),
s = sd(X$thickness),
n = length(X$thickness) ) } )
glabella.summary
## pop m s n
## 1 cauc 5.812500 0.8334280 12
## 2 afam 6.461538 0.8946959 13
## 3 naaa 5.857143 1.1168047 14
fit.g <- aov(thickness ~ pop, data = glabella.long)
summary(fit.g)
## Df Sum Sq Mean Sq F value Pr(>F)
## pop 2 3.40 1.6991 1.828 0.175
## Residuals 36 33.46 0.9295
fit.g
## Call:
## aov(formula = thickness ~ pop, data = glabella.long)
##
## Terms:
## pop Residuals
## Sum of Squares 3.39829 33.46068
## Deg. of Freedom 2 36
##
## Residual standard error: 0.9640868
## Estimated effects may be unbalanced
At the 5% level, you would not reject the hypothesis that the population mean
Glabella measurements are identical. That is, you do not have sufficient evidence
to conclude that these racial groups differ with respect to their average Glabella
measurement. This is the end of the analysis!
The Bonferroni analysis reinforces this conclusion: all the p-values are greater than
0.05. If you were to calculate CIs for the differences in population means, each would
contain zero. You can think of the Bonferroni intervals as simultaneous CIs. We’re
(at least) 95% confident that all of the following statements hold simultaneously:
−1.62 ≤ µc − µa ≤ 0.32, −0.91 ≤ µn − µc ≤ 1.00, and −1.54 ≤ µn − µa ≤ 0.33. The
individual CIs have level 100(1 − 0.0167)% = 98.33%.
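These intervals can be reproduced, up to rounding, from the summary statistics and the residual standard error printed above. The helper function bonf_ci() below is a hypothetical name of mine, not something from a package.

```r
# Sketch: Bonferroni simultaneous CIs from the printed summary statistics.
m <- c(cauc = 5.812500, afam = 6.461538, naaa = 5.857143)  # sample means
n <- c(cauc = 12, afam = 13, naaa = 14)                    # sample sizes
s_pooled <- 0.9640868     # residual standard error from the aov output
df_error <- 36            # residual df
alpha_pc <- 0.05 / 3      # per-comparison level for 3 comparisons

# bonf_ci() is a hypothetical helper: CI for mean(a) - mean(b)
bonf_ci <- function(a, b) {
  est    <- m[[a]] - m[[b]]
  se     <- s_pooled * sqrt(1 / n[[a]] + 1 / n[[b]])
  margin <- qt(1 - alpha_pc / 2, df = df_error) * se
  c(lower = est - margin, upper = est + margin)
}

ci <- rbind("cauc - afam" = bonf_ci("cauc", "afam"),
            "naaa - cauc" = bonf_ci("naaa", "cauc"),
            "naaa - afam" = bonf_ci("naaa", "afam"))
round(ci, 2)
```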
Another popular method controls the false discovery rate (FDR) instead of the
FER. The FDR is the expected proportion of false discoveries amongst the rejected
hypotheses. The false discovery rate is a less stringent condition than the family-wise
error rate, so these methods are more powerful than the others, though with a higher
FER. I encourage you to learn more about the methods by Benjamini, Hochberg, and
Yekutieli.
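To see how the two adjustments treat the same set of raw p-values, here is a sketch using base R’s p.adjust() on the doughnut comparisons. The data are re-entered from the listing earlier in the chapter so the block stands alone.

```r
# Sketch: Bonferroni versus Benjamini-Hochberg (BH) adjustment of the
# six pairwise p-values for the doughnut data.
amount <- c(164, 172, 168, 177, 190, 176,   # fat1
            178, 191, 197, 182, 185, 177,   # fat2
            175, 186, 178, 171, 163, 176,   # fat3
            155, 166, 149, 164, 170, 168)   # fat4
type <- factor(rep(c("fat1", "fat2", "fat3", "fat4"), each = 6))

# Unadjusted pooled-SD pairwise p-values (lower triangle of a 3 x 3 matrix)
p_mat <- pairwise.t.test(amount, type, pool.sd = TRUE,
                         p.adjust.method = "none")$p.value
p_raw <- p_mat[lower.tri(p_mat, diag = TRUE)]

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the FER
p_bh   <- p.adjust(p_raw, method = "BH")          # controls the FDR
round(cbind(raw = p_raw, bonferroni = p_bonf, BH = p_bh), 5)
```

The BH-adjusted p-values are never larger than the Bonferroni ones, which is where the extra power comes from.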
#### false discovery rate (FDR)
## Fat
# FDR
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## Glabella
# FDR
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: glabella.long$thickness and glabella.long$pop
##
## cauc afam
## afam 0.17 -
## naaa 0.91 0.17
##
## P value adjustment method: BH
# violin plot
library(vioplot)
vioplot(fit.g$residuals, horizontal=TRUE, col="gray")
# boxplot
boxplot(fit.g$residuals, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.g$residuals, las = 1, id = list(n = 8, cex = 1), lwd = 1, main="QQ Plot")
## 29 8 34 40 21 25 27 37
## 26 8 31 37 19 23 25 34
[Figure: histogram, violin plot, and boxplot of fit.g$residuals; normal QQ plot of fit.g$residuals (x-axis: norm quantiles) with points 29, 8, 21, 37, 25, 27, 40, and 34 labeled.]
shapiro.test(fit.g$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.g$residuals
## W = 0.97693, p-value = 0.5927
library(nortest)
ad.test(fit.g$residuals)
##
## Anderson-Darling normality test
##
## data: fit.g$residuals
## A = 0.37731, p-value = 0.3926
# lillie.test(fit.g$residuals)
cvm.test(fit.g$residuals)
##
## Cramer-von Mises normality test
##
## data: fit.g$residuals
## W = 0.070918, p-value = 0.2648
In Chapter 4, I illustrated the use of Bartlett’s test and Levene’s test for equal
population variances, and showed how to evaluate these tests in R.
[Figure: χ2 distribution with α = .05 (fixed); reject H0 for χ2S at or above χ2Crit.]
R does the calculation for us, as illustrated below. Because the p-value > 0.5, we
fail to reject the null hypothesis that the population variances are equal. This result
is not surprising given how close the sample variances are to each other.
## Test equal variance
# Bartlett assumes populations are normal
bartlett.test(thickness ~ pop, data = glabella.long)
##
## Bartlett test of homogeneity of variances
##
## data: thickness by pop
## Bartlett's K-squared = 1.1314, df = 2, p-value = 0.568
Levene’s and Fligner’s tests are consistent with Bartlett’s.
# Levene does not assume normality, requires car package
library(car)
leveneTest(thickness ~ pop, data = glabella.long)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5286 0.5939
## 36
# Fligner is a nonparametric test
fligner.test(thickness ~ pop, data = glabella.long)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: thickness by pop
## Fligner-Killeen:med chi-squared = 1.0311, df = 2, p-value =
## 0.5972
p2 <- ggplot(chds, aes(x = c_bwt, fill=smoke))
p2 <- p2 + geom_histogram(binwidth = .4, alpha = 1/3, position="identity")
p2 <- p2 + geom_rug(aes(colour = smoke), alpha = 1/3)
p2 <- p2 + labs(title = "Child birthweight vs maternal smoking") +
  xlab("child birthweight (lb)")
#print(p2)
library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1)
[Figure: histograms of child birthweight (lb) by smoking group (0 cigs, 1-19 cigs, 20+ cigs), shown in separate panels and overlaid (y-axis: count).]
[Figure: normal QQ plots of child birthweight for each smoking group.]
library(nortest)
# 0 cigs --------------------
shapiro.test(subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.98724, p-value = 0.00199
ad.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Anderson-Darling normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## A = 0.92825, p-value = 0.01831
cvm.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Cramer-von Mises normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.13844, p-value = 0.03374
# 1-19 cigs --------------------
shapiro.test(subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "1-19 cigs")$c_bwt
## W = 0.97847, p-value = 0.009926
ad.test( subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Anderson-Darling normality test
##
# violin plot
library(vioplot)
vioplot(fit.c$residuals, horizontal=TRUE, col="gray")
# boxplot
boxplot(fit.c$residuals, horizontal=TRUE)
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.c$residuals, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram, violin plot, and boxplot of fit.c$residuals; normal QQ plot of fit.c$residuals (x-axis: norm quantiles).]
shapiro.test(fit.c$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.c$residuals
## W = 0.99553, p-value = 0.04758
library(nortest)
ad.test(fit.c$residuals)
##
## Anderson-Darling normality test
##
## data: fit.c$residuals
## A = 0.62184, p-value = 0.1051
cvm.test(fit.c$residuals)
##
## Cramer-von Mises normality test
##
## data: fit.c$residuals
## W = 0.091963, p-value = 0.1449
Looking at the summaries, we see that the sample standard deviations are close.
Formal tests of equal population variances are far from significant. The p-values for
Bartlett’s test and Levene’s test are greater than 0.4. Thus, the standard ANOVA
appears to be appropriate here.
# calculate summaries
chds.summary <- ddply(chds, "smoke",
function(X) { data.frame( m = mean(X$c_bwt),
s = sd(X$c_bwt),
n = length(X$c_bwt) ) } )
chds.summary
## smoke m s n
## 1 0 cigs 7.732808 1.052341 381
## 2 1-19 cigs 7.221302 1.077760 169
## 3 20+ cigs 7.266154 1.090946 130
## Test equal variance
# assumes populations are normal
bartlett.test(c_bwt ~ smoke, data = chds)
##
## Bartlett test of homogeneity of variances
##
## data: c_bwt by smoke
## Bartlett's K-squared = 0.3055, df = 2, p-value = 0.8583
# does not assume normality, requires car package
library(car)
leveneTest(c_bwt ~ smoke, data = chds)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.7591 0.4685
## 677
# nonparametric test
fligner.test(c_bwt ~ smoke, data = chds)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: c_bwt by smoke
## Fligner-Killeen:med chi-squared = 2.0927, df = 2, p-value =
## 0.3512
The p-value for the F -test is less than 0.0001. We would reject H0 at any of the
usual test levels (such as 0.05 or 0.01). The data suggest that the population mean
birth weights differ across smoking status groups.
summary(fit.c)
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 2 40.7 20.351 17.9 2.65e-08 ***
ADA1: Nonparametric, categorical, and regression methods
Nonparametric Methods
Contents
6.1 Introduction
6.2 The Sign Test and CI for a Population Median
6.3 Wilcoxon Signed-Rank Procedures
6.3.1 Nonparametric Analyses of Paired Data
6.3.2 Comments on One-Sample Nonparametric Methods
6.4 (Wilcoxon-)Mann-Whitney Two-Sample Procedure
6.5 Alternatives for ANOVA and Planned Comparisons
6.5.1 Kruskal-Wallis ANOVA
6.5.2 Transforming Data
6.5.3 Planned Comparisons
6.5.4 Two final ANOVA comments
6.6 Permutation tests
6.6.1 Linear model permutation tests in R
6.7 Density estimation
Learning objectives
After completing this topic, you should be able to:
select the appropriate procedure based on assumptions.
explain reason for using one procedure over another.
decide whether the medians between multiple populations are different.
Achieving these goals contributes to mastery in these course learning outcomes:
6.1 Introduction
The sign test assumes that you have a random sample from a population, but
makes no assumption about the population shape. The standard t-test provides
inferences on a population mean. The sign test, in contrast, provides inferences about
a population median.
If the population frequency curve is symmetric (see below), then the population
median, identified by η, and the population mean µ are identical. In this case the
sign procedures provide inferences for the population mean, though less powerfully
than the t-test.
The idea behind the sign test is straightforward. Suppose you have a sample of
size m from the population, and you wish to test H0 : η = η0 (a given value). Let S
be the number of sampled observations above η0 . If H0 is true, you expect S to be
approximately one-half the sample size, 0.5m. If S is much greater than 0.5m, the
data suggests that η > η0 . If S is much less than 0.5m, the data suggests that η < η0 .
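The idea can be sketched with the binomial distribution, since under H0 each observation falls above η0 with probability one-half. The sample and the value of eta0 below are hypothetical, chosen only for illustration, and binom.test() is used here in place of a dedicated sign test function.

```r
# Sketch: sign test of H0: eta = eta0 via the binomial distribution.
# The sample x and eta0 are hypothetical illustrative values.
x    <- c(3.1, 4.7, 5.2, 6.0, 6.3, 7.8, 8.1, 9.4, 10.2, 12.5)
eta0 <- 5

S <- sum(x > eta0)    # number of observations above eta0
m <- sum(x != eta0)   # observations equal to eta0 carry no sign and are dropped

# Under H0, S ~ Binomial(m, 0.5)
sign_test <- binom.test(S, m, p = 0.5, alternative = "two.sided")
c(S = S, m = m, p_value = sign_test$p.value)
```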
[Figure: two frequency curves, each with 50% of the area on either side of the median. Left: mean and median differ with skewed distributions. Right: mean and median are the same with symmetric distributions.]
Example: Income Data Recall that the income distribution is extremely skewed,
with two extreme outliers at 46 and 1110.
#### Example: Income Data
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
# sort in decreasing order
income <- sort(income, decreasing = TRUE)
income
## [1] 1110 46 12 8 7 7 7 5 5 2 2 0
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.25 7.00 100.92 9.00 1110.00
sd(income)
## [1] 318.0078
The income data is unimodal, skewed right, with two extreme outliers.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(income, freq = FALSE, breaks = 1000)
points(density(income), type = "l")
rug(income)
# violin plot
library(vioplot)
vioplot(income, horizontal=TRUE, col="gray")
# boxplot
boxplot(income, horizontal=TRUE)
[Figure: histogram of income with kernel density curve and rug, violin plot, and boxplot.]
The normal QQ-plot of the sample data indicates strong deviation from normality,
and the CLT can’t save us: even the bootstrap sampling distribution of the mean
indicates strong deviation from normality.
library(car)
qqPlot(income, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Income")
bs.one.samp.dist(income)
[Figure: normal QQ plot of income; histogram of the data with the bootstrap sampling distribution of the mean. Data: n = 12, mean = 100.92, se = 91.801.]
The presence of the outliers has a dramatic effect on the 95% CI for the population
mean income µ, which goes from −101 to 303 (in 1000 dollar units). This t-CI is
suspect because the normality assumption is unreasonable. A CI for the population
median income η is more sensible because the median is likely to be a more reasonable
measure of typical value. Using the sign procedure, you are 95% confident that the
population median income is between 2.32 and 11.57 (times $1000).
library(BSDA)
## Loading required package: lattice
##
## Attaching package: ’BSDA’
## The following objects are masked from ’package:carData’:
##
## Vocab, Wool
## The following object is masked from ’package:TeachingDemos’:
##
## z.test
## The following object is masked from ’package:datasets’:
##
## Orange
t.test(income)
##
## One Sample t-test
##
## data: income
## t = 1.0993, df = 11, p-value = 0.2951
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -101.1359 302.9692
## sample estimates:
## mean of x
## 100.9167
SIGN.test(income)
##
## One-sample Sign-Test
##
## data: income
## s = 11, p-value = 0.0009766
## alternative hypothesis: true median is not equal to 0
## 95 percent confidence interval:
## 2.319091 11.574545
## sample estimates:
## median of x
## 7
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8540 5.0000 8.0000
## Interpolated CI 0.9500 2.3191 11.5745
## Upper Achieved CI 0.9614 2.0000 12.0000
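The sign test output above can be checked by hand, which makes the logic of the procedure concrete. The following is a minimal sketch in base R, not the way SIGN.test() is implemented internally; the names eta0, n.above, n.used, and ci.level are illustrative choices. It assumes that observations equal to the hypothesized median are dropped, that S is referred to a Binomial(n, 0.5) distribution, and that the achieved confidence levels come from pairs of order statistics.

```r
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
eta0 <- 0  # hypothesized median (the default null value used above)

# S = number of observations above eta0; values equal to eta0 are dropped
n.above <- sum(income > eta0)
n.used  <- sum(income != eta0)

# Under H0, S follows a Binomial(n.used, 0.5) distribution
sign.fit <- binom.test(n.above, n.used, p = 0.5)
sign.fit$p.value

# Achieved confidence level of the interval from the k-th smallest to the
# k-th largest observation: 1 - 2 * P(Binomial(n, 0.5) <= k - 1)
n <- length(income)
x.sorted <- sort(income)
ci.level <- function(k) 1 - 2 * pbinom(k - 1, n, 0.5)
k <- c(4, 3)
data.frame(k = k, conf.level = ci.level(k),
           lower = x.sorted[k], upper = x.sorted[n + 1 - k])
```

With these data the p-value agrees with the value reported by SIGN.test(), and the two rows correspond to the Lower and Upper Achieved CI lines of the output. The 95% interval is then interpolated between those two achieved intervals.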
Example: Age at First Heart Transplant Recall that the distribution of ages is
skewed to the left with a lower outlier. A question of interest is whether the “typical
age” at first transplant is 50. This can be formulated as a test about the population
median η or as a test about the population mean µ, depending on the interpretation.
# violin plot
library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")
# boxplot
boxplot(age, horizontal=TRUE)
[Figure: histogram of age with kernel density curve, violin plot, and boxplot.]
The normal QQ-plot of the sample data indicates mild deviation from normality in
the left tail (2 points of 11 outside the bands), and the bootstrap sampling distribution
of the mean indicates weak deviation from normality. It is good practice in this case
to use the nonparametric test as a double-check of the t-test, with the nonparametric
test being the more conservative test.
library(car)
qqPlot(age, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Age")
bs.one.samp.dist(age)
[Figure: normal QQ plot of age; histogram of the data with the bootstrap sampling distribution of the mean. Data: n = 11, mean = 51.273, se = 2.49031.]
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9346 49.0000 56.0000
## Interpolated CI 0.9500 46.9891 56.5745
## Upper Achieved CI 0.9883 42.0000 58.0000
from the expected value of 18? To formally answer this question, we need to use the
Wilcoxon procedures, which are implemented in R with wilcox.test().
Example: Made-up Data The boxplot indicates that the distribution is fairly
symmetric, so the Wilcoxon method is reasonable (so is a t-CI and test).
#### Example: Made-up Data
dat <- c(20, 18, 23, 5, 14, 8, 18, 22)
# sort in decreasing order
dat <- sort(dat, decreasing = TRUE)
dat
## [1] 23 22 20 18 18 14 8 5
summary(dat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 12.5 18.0 16.0 20.5 23.0
sd(dat)
## [1] 6.524678
# violin plot
library(vioplot)
vioplot(dat, horizontal=TRUE, col="gray")
# boxplot
boxplot(dat, horizontal=TRUE)
[Figure: histogram of dat with kernel density curve, violin plot, and boxplot.]
The normal QQ-plot of the sample data indicates insufficient evidence of deviation
from normality, though both the QQ-plot and the bootstrap sampling distribution of
the mean indicate weak left-skewness. Either the Wilcoxon or the t-test is appropriate.
par(mfrow=c(1,1))
library(car)
qqPlot(dat, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, dat")
bs.one.samp.dist(dat)
[Figure: normal QQ plot of dat; histogram of the data with the bootstrap sampling distribution of the mean.]
t.test(dat, mu=10)
##
## One Sample t-test
##
## data: dat
## t = 2.601, df = 7, p-value = 0.03537
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 10.54523 21.45477
## sample estimates:
## mean of x
## 16
# with continuity correction in the normal approximation for the p-value
wilcox.test(dat, mu=10, conf.int=TRUE)
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE): cannot compute exact
p-value with ties
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE): cannot compute exact
confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: dat
## V = 32, p-value = 0.0584
## alternative hypothesis: true location is not equal to 10
## 95 percent confidence interval:
## 9.500002 21.499942
## sample estimates:
## (pseudo)median
## 16.0056
# without continuity correction
wilcox.test(dat, mu=10, conf.int=TRUE, correct=FALSE)
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE, correct = FALSE): cannot
compute exact p-value with ties
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE, correct = FALSE): cannot
compute exact confidence interval with ties
##
## Wilcoxon signed rank test
##
## data: dat
## V = 32, p-value = 0.04967
## alternative hypothesis: true location is not equal to 10
## 95 percent confidence interval:
## 10.99996 21.00005
## sample estimates:
## (pseudo)median
## 16.0056
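The statistic V and the two p-values reported above can be reproduced by hand. The sketch below assumes the usual normal approximation to the signed rank statistic with a variance adjustment for ties; the names mu0, EV, VV, p.corr, and p.nocorr are illustrative, not part of wilcox.test().

```r
dat <- c(20, 18, 23, 5, 14, 8, 18, 22)
mu0 <- 10

d <- dat - mu0
d <- d[d != 0]            # differences of exactly zero are dropped
n <- length(d)
r <- rank(abs(d))         # tied absolute differences share an average rank
V <- sum(r[d > 0])        # signed rank statistic: sum of ranks of positive differences

# Null mean and variance of V, with the adjustment for tied ranks
n.ties <- table(r)
EV <- n * (n + 1) / 4
VV <- n * (n + 1) * (2 * n + 1) / 24 - sum(n.ties^3 - n.ties) / 48

# Two-sided normal-approximation p-values, without and with continuity correction
p.nocorr <- 2 * pnorm(-abs(V - EV) / sqrt(VV))
p.corr   <- 2 * pnorm(-(abs(V - EV) - 0.5) / sqrt(VV))
c(V = V, p.corr = p.corr, p.nocorr = p.nocorr)
```

The continuity correction moves V half a unit toward its null mean before standardizing, which is why the corrected p-value is the larger of the two. Here the two p-values fall on opposite sides of 0.05, so the conclusion at the 5% level depends on which is used.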
Example: Sleep Remedies I will illustrate Wilcoxon methods on the paired
comparison of two remedies A and B for insomnia. The number of hours of sleep gained
on each method was recorded.
#### Example: Sleep Remedies
# Data and numerical summaries
a <- c( 0.7, -1.6, -0.2, -1.2, 0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
b <- c( 1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.0)
d <- b - a;
sleep <- data.frame(a, b, d)
summary(sleep$d)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.200 1.000 1.250 1.520 1.675 4.600
shapiro.test(sleep$d)
##
## Shapiro-Wilk normality test
##
## data: sleep$d
## W = 0.83798, p-value = 0.04173
# boxplot
library(ggplot2)
p3 <- ggplot(sleep, aes(x = "d", y = d))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + geom_point()
# fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
p3 <- p3 + stat_summary(fun = mean, geom = "point", shape = 18,
size = 4, alpha = 0.3)
p3 <- p3 + coord_flip()
print(p3)
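[Figure: horizontal boxplot of the differences d, with the observed points and the mean overlaid and a dashed reference line at 0.]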
Let µB be the population mean sleep gain on remedy B, and µA be the population
mean sleep gain on remedy A. Using the Wilcoxon procedure, which assumes that the
distribution of differences is symmetric, you are 95% confident that µB − µA is
between 0.8 and 2.8 hours. Putting this another way, you are 95% confident that µB
exceeds µA by between 0.8 and 2.8 hours. The p-value for testing H0 : µB − µA = 0
against a two-sided alternative is 0.008, which strongly suggests that µB ≠ µA. This
agrees with the CI. Note that the t-CI and test give qualitatively similar conclusions
to the Wilcoxon methods, but the t-test p-value is about half as large.
If you are uncomfortable with the symmetry assumption, you could use the sign
CI for the population median difference between B and A. I will note that a 95% CI
for the median difference goes from 0.86 to 2.2 hours.
t.test(sleep$d, mu=0)
##
## One Sample t-test
##
## data: sleep$d
## t = 3.7796, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.610249 2.429751
## sample estimates:
## mean of x
## 1.52
# with continuity correction in the normal approximation for the p-value
wilcox.test(sleep$d, mu=0, conf.int=TRUE)
## Warning in wilcox.test.default(sleep$d, mu = 0, conf.int = TRUE): cannot compute
exact p-value with ties
## Warning in wilcox.test.default(sleep$d, mu = 0, conf.int = TRUE): cannot compute
exact confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: sleep$d
## V = 54, p-value = 0.008004
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 0.7999339 2.7999620
## sample estimates:
## (pseudo)median
## 1.299983
# can use the paired= option
#wilcox.test(sleep$b, sleep$a, paired=TRUE, mu=0, conf.int=TRUE)
# if don't assume symmetry, can use sign test
#SIGN.test(sleep$d)
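The commented-out paired= call is another way to request the same analysis. As a quick check, assuming only base R, the sketch below runs the signed rank test on the differences and on the paired samples and compares the results; the object names w.diff and w.pair are illustrative.

```r
a <- c( 0.7, -1.6, -0.2, -1.2,  0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
b <- c( 1.9,  0.8,  1.1,  0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.0)

# suppressWarnings() hides the tie warnings shown in the output above
w.diff <- suppressWarnings(wilcox.test(b - a, mu = 0))
w.pair <- suppressWarnings(wilcox.test(b, a, paired = TRUE, mu = 0))

# Same statistic and p-value either way
c(V.diff = unname(w.diff$statistic), V.pair = unname(w.pair$statistic))
c(p.diff = w.diff$p.value, p.pair = w.pair$p.value)
```

The paired form works on the differences b − a internally, so the order of the two arguments determines the sign of the estimated shift.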
The WMW procedure assumes you have independent random samples from the two
populations, and assumes that the populations have the same shapes and spreads
(the frequency curves for the two populations are “shifted” versions of each other
— see below). The frequency curves are not required to be symmetric. The WMW
procedures give a CI and tests on the difference η1 − η2 between the two population
medians. If the populations are symmetric, then the methods apply to µ1 − µ2 .
[Figure: pairs of population frequency curves that are shifted versions of each other.]
The R help on ?wilcox.test gives references to how the exact WMW procedure
is actually calculated; here is a good approximation to the exact method that is
easier to understand. The WMW procedure is based on ranks. The two samples are
combined, ranked from smallest to largest (1=smallest) and separated back into the
original samples. If the two populations have equal medians, you expect the average
rank in the two samples to be roughly equal. The WMW test computes a classical
two sample t-test using the pooled variance on the ranks to assess whether the sample
mean ranks are significantly different.
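The rank-based description above can be sketched in a few lines. The data here are hypothetical values chosen only for illustration (they are not from the notes), and the pooled-variance t-test on ranks is the approximation described in the text, not the exact calculation that wilcox.test() performs.

```r
# Hypothetical measurements for two independent samples (not from the notes)
x <- c(1.1, 2.3, 0.7, 3.5, 2.8)
y <- c(4.2, 5.1, 3.9, 6.3, 2.9, 5.6)

# Combine, rank from smallest to largest, and split back into the two samples
r  <- rank(c(x, y))
rx <- r[seq_along(x)]
ry <- r[-seq_along(x)]
c(mean.rank.x = mean(rx), mean.rank.y = mean(ry))

# Pooled-variance two-sample t-test on the ranks (the approximation)
rank.t <- t.test(rx, ry, var.equal = TRUE)

# Exact WMW test for comparison (these data have no ties)
wmw <- wilcox.test(x, y)
c(rank.t = rank.t$p.value, wmw = wmw$p.value)
```

The two p-values are not identical, but for these data they lead to the same conclusion at the 5% level, which is the sense in which the t-test on ranks approximates the exact procedure.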
1
http://www.lpi.usra.edu/meteor/metbull.php?code=24138
2
http://www.lpi.usra.edu/meteor/metbull.php?code=24204
par(mfrow=c(1,2))
library(car)
qqPlot(Walker, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Walker")
qqPlot(Uwet, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Uwet")
[Figure: Cooling rates for samples of meteorites at two locations (cool by site, Walker and Uwet), followed by normal QQ plots for Walker and Uwet.]
I carried out the standard two-sample procedures to see what happens. The
pooled-variance and Satterthwaite results are comparable, which is expected because
the sample standard deviations and sample sizes are roughly equal. Both tests
indicate that the mean cooling rates for Uwet and Walker Co. meteorites are not
significantly different at the 10% level. You are 95% confident that the mean cooling
rate for Uwet is at most 0.1 less, and no more than 0.6 greater than that for Walker
Co. (in degrees per million years).
# numerical summaries
summary(Uwet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.2100 0.2500 0.4522 0.4700 1.2000
c(sd(Uwet), IQR(Uwet), length(Uwet))
## [1] 0.4069944 0.2600000 9.0000000
summary(Walker)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0325 0.1000 0.2000 0.2275 0.6900
c(sd(Walker), IQR(Walker), length(Walker))
## [1] 0.2389793 0.1950000 10.0000000
t.test(Uwet, Walker, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Uwet and Walker
## t = 1.6689, df = 17, p-value = 0.1134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0666266 0.5710710
## sample estimates:
## mean of x mean of y
## 0.4522222 0.2000000
t.test(Uwet, Walker)
##
## Welch Two Sample t-test
##
## data: Uwet and Walker
## t = 1.6242, df = 12.652, p-value = 0.129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08420858 0.58865302
## sample estimates:
## mean of x mean of y
## 0.4522222 0.2000000
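Both sets of results above can be reproduced from the summary statistics alone, which shows where the two procedures differ: in the standard error and in the degrees of freedom. This is a sketch using the standard pooled-variance and Satterthwaite formulas; the variable names are illustrative, and the inputs are the rounded values printed above.

```r
# Summary statistics reported above (Uwet first, Walker second)
m1 <- 0.4522222; s1 <- 0.4069944; n1 <- 9
m2 <- 0.2000000; s2 <- 0.2389793; n2 <- 10

# Pooled-variance procedure
sp2       <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se.pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t.pooled  <- (m1 - m2) / se.pooled
df.pooled <- n1 + n2 - 2

# Satterthwaite (Welch) procedure
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se.welch <- sqrt(v1 + v2)
t.welch  <- (m1 - m2) / se.welch
df.welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 95% CI for the difference in means under each procedure
ci <- function(se, df) (m1 - m2) + c(-1, 1) * qt(0.975, df) * se
rbind(pooled = c(t = t.pooled, df = df.pooled, ci = ci(se.pooled, df.pooled)),
      welch  = c(t = t.welch,  df = df.welch,  ci = ci(se.welch,  df.welch)))
```

The Satterthwaite degrees of freedom are smaller than n1 + n2 − 2, so its interval is slightly wider than the pooled-variance interval.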
Given the marked skewness, a nonparametric procedure is more appropriate. The
Wilcoxon-Mann-Whitney comparison of population medians is reasonable. Why?
The WMW test of equal population medians is significant (barely) at the 5% level.
You are 95% confident that the median cooling rate for Uwet exceeds that for Walker by
between 0+ and 0.45 degrees per million years.
wilcox.test(Uwet, Walker, conf.int = TRUE)
## Warning in wilcox.test.default(Uwet, Walker, conf.int = TRUE): cannot compute exact
p-value with ties
## Warning in wilcox.test.default(Uwet, Walker, conf.int = TRUE): cannot compute exact
as the mean density of the earth, the distance from the earth to the sun, and the
velocity of light. An interesting series of experiments to determine the velocity of
light was begun in 1875. The first method used, and reused with refinements several
times thereafter, was the rotating mirror method (see footnote 3). In this method a beam of light
is reflected off a rapidly rotating mirror to a fixed mirror at a carefully measured
distance from the source. The returning light is re-reflected from the rotating mirror
at a different angle, because the mirror has turned slightly during the passage of the
corresponding light pulses. From the speed of rotation of the mirror and from careful
measurements of the angular difference between the outward-bound and returning
light beams, the passage time of light can be calculated for the given distance. After
averaging several calculations and applying various corrections, the experimenter can
combine mean passage time and distance for a determination of the velocity of light.
Simon Newcomb, a distinguished American scientist, used this method during the
year 1882 to generate the passage time measurements given below, in microseconds.
The travel path for this experiment was 3721 meters in length, extending from Ft.
Myer, on the west bank of the Potomac River in Washington, D.C., to a fixed mirror
at the base of the Washington Monument.
The problem is to determine a 95% CI for the “true” passage time, which is taken
to be the typical time (mean or median) of the population of measurements that were
or could have been taken by this experiment.
#### Example: Newcomb's Data
time <- c(24.828, 24.833, 24.834, 24.826, 24.824, 24.756
, 24.827, 24.840, 24.829, 24.816, 24.798, 24.822
, 24.824, 24.825, 24.823, 24.821, 24.830, 24.829
3
http://en.wikipedia.org/wiki/File:Speed_of_light_(foucault).PNG
# violin plot
p2 <- ggplot(Passage_df, aes(x = "t", y = time))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(Passage_df, aes(x = "t", y = time))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
[Figure: density histogram, violin plot, and boxplot of time.]
par(mfrow=c(1,1))
library(car)
qqPlot(time, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Time")
bs.one.samp.dist(time)
[Figure: normal QQ plot of time; histogram of the data with the bootstrap sampling distribution of the mean. Data: n = 66, mean = 24.826, se = 0.00132266.]
The data set is skewed to the left, due to the presence of two extreme outliers
that could potentially be misrecorded observations. Without additional information I
would be hesitant to apply normal theory methods (the t-test), even though the sample
size is “large” (the bootstrap sampling distribution is still left-skewed). Furthermore,
the t-test still suffers from a lack of robustness of sensitivity, even in large samples. A
formal QQ-plot and normality test reject, at the 0.01 level, the normality assumption
needed for the standard methods.
The table below gives 95% t, sign, and Wilcoxon CIs. I am more comfortable with
the sign CI for the population median than the Wilcoxon method, which assumes
symmetry.
t.sum <- t.test(time)
t.sum$conf.int
## [1] 24.82357 24.82885
## attr(,"conf.level")
## [1] 0.95
diff(t.test(time)$conf.int)
## [1] 0.005283061
s.sum <- SIGN.test(time)
s.sum$conf.int
## [1] 24.82600 24.82849
## attr(,"conf.level")
## [1] 0.95
diff(s.sum$conf.int)
## [1] 0.00249297
w.sum <- wilcox.test(time, conf.int=TRUE)
w.sum$conf.int
## [1] 24.82604 24.82853
## attr(,"conf.level")
## [1] 0.95
diff(w.sum$conf.int)
## [1] 0.002487969
Parameter  Method    CI Limits            Width
mean       t         (24.8236, 24.8289)   0.0053
median     sign      (24.8260, 24.8285)   0.0025
median     Wilcoxon  (24.8260, 24.8285)   0.0025
Note the big difference between the nonparametric and the t-CI. The nonparametric
CIs are about 1/2 as wide as the t-CI. This reflects the impact that outliers have on
the standard deviation, which directly influences the CI width.
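The link between the standard deviation and the width of the t-CI can be made explicit: the width is twice the t critical value times the standard error s/sqrt(n). The short sketch below takes n and the standard error from the summary reported with the bootstrap plot above (n = 66, se = 0.00132266); the variable names are illustrative.

```r
n  <- 66
se <- 0.00132266          # standard error of the mean, s / sqrt(n), as reported above

# Width of the 95% t-CI is 2 * t critical value * se
t.width <- 2 * qt(0.975, df = n - 1) * se

# Width of the sign CI, copied from the output above
sign.width <- 0.00249297

c(t.width = t.width, ratio = sign.width / t.width)
```

Because the two outliers inflate s, they inflate the t-CI width in direct proportion, while the sign CI depends only on the order statistics near the middle of the data.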
note that deviations from normality in one or more samples might be expected in
a comparison involving many samples. You should downplay small deviations from
normality in problems involving many samples.
Example: Hydrocarbon (HC) Emissions Data These data are the HC emissions
at idling speed, in ppm, for automobiles of different years of manufacture. The
data are a random sample of all automobiles tested at an Albuquerque shopping
center. (It looks like we need to find some newer cars!)
#### Example: Hydrocarbon (HC) Emissions Data
emis <- read.table(text="
Pre-y63 y63-7 y68-9 y70-1 y72-4
2351 620 1088 141 140
1293 940 388 359 160
541 350 111 247 20
1058 700 558 940 20
411 1150 294 882 223
570 2000 211 494 60
800 823 460 306 20
630 1058 470 200 95
905 423 353 100 360
347 900 71 300 70
NA 405 241 223 220
NA 780 2999 190 400
NA 270 199 140 217
NA NA 188 880 58
NA NA 353 200 235
NA NA 117 223 1880
NA NA NA 188 200
NA NA NA 435 175
NA NA NA 940 85
NA NA NA 241 NA
", header=TRUE)
#emis
[Figure: boxplots of HC emissions by year of manufacture (Pre.y63, y63.7, y68.9, y70.1, y72.4).]
The standard ANOVA shows significant differences among the mean HC emissions.
However, the standard ANOVA is inappropriate because the distributions are
extremely skewed to the right due to the presence of outliers in each sample.
fit.e <- aov(hc ~ year, data = emis.long)
summary(fit.e)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 4226834 1056709 4.343 0.00331 **
## Residuals 73 17759968 243287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.e
## Call:
## aov(formula = hc ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 4226834 17759968
## Deg. of Freedom 4 73
##
## Residual standard error: 493.2416
## Estimated effects may be unbalanced
The boxplots show that the typical HC emissions appear to decrease as the year
of manufacture increases, that is, newer cars tend to emit less (the simplest
description). Although the spread in the samples, as measured by the IQR, also
decreases for newer cars, I am more comfortable with the
KW ANOVA, in part because the KW analysis is not too sensitive to differences in
spreads among samples. This point is elaborated upon later. As described earlier, the
KW ANOVA is essentially an ANOVA based on the ranks. I give below the ANOVA
based on ranks and the output from the KW procedure. They give similar p-values,
and lead to the conclusion that there are significant differences among the population
median HC emissions. A simple description is that the population median emission
tends to decrease as the year of manufacture increases. You should follow up this
analysis with Mann-Whitney multiple comparisons.
# ANOVA of rank, for illustration that this is similar to what KW is doing
fit.er <- aov(rank(hc) ~ year, data = emis.long)
summary(fit.er)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 16329 4082 12.85 5.74e-08 ***
## Residuals 73 23200 318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.er
## Call:
## aov(formula = rank(hc) ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 16329.32 23199.68
## Deg. of Freedom 4 73
##
## Residual standard error: 17.82705
## Estimated effects may be unbalanced
# KW ANOVA
fit.ek <- kruskal.test(hc ~ year, data = emis.long)
fit.ek
##
## Kruskal-Wallis rank sum test
##
## data: hc by year
## Kruskal-Wallis chi-squared = 31.808, df = 4, p-value =
## 2.093e-06
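The connection between the rank ANOVA and the KW statistic can be made exact: the KW statistic (with its tie correction) equals (N − 1) times the between-group sum of squares of the ranks divided by their total sum of squares. The sketch below rebuilds the long-format data from the emissions table above; the names emis.list, ss.between, ss.total, and H are illustrative, and the construction of emis.long via stack() is an assumption about how the long data were formed.

```r
# HC emissions by year of manufacture, copied from the table above (NAs omitted)
emis.list <- list(
  Pre.y63 = c(2351, 1293, 541, 1058, 411, 570, 800, 630, 905, 347),
  y63.7   = c(620, 940, 350, 700, 1150, 2000, 823, 1058, 423, 900, 405, 780, 270),
  y68.9   = c(1088, 388, 111, 558, 294, 211, 460, 470, 353, 71, 241, 2999,
              199, 188, 353, 117),
  y70.1   = c(141, 359, 247, 940, 882, 494, 306, 200, 100, 300, 223, 190,
              140, 880, 200, 223, 188, 435, 940, 241),
  y72.4   = c(140, 160, 20, 20, 223, 60, 20, 95, 360, 70, 220, 400, 217,
              58, 235, 1880, 200, 175, 85)
)
emis.long <- stack(emis.list)
names(emis.long) <- c("hc", "year")

# Sums of squares of the ranks, as in the rank ANOVA table
r <- rank(emis.long$hc)
N <- length(r)
ss.between <- sum(tapply(r, emis.long$year,
                         function(g) length(g) * (mean(g) - mean(r))^2))
ss.total <- sum((r - mean(r))^2)

# KW statistic from the rank sums of squares, and from kruskal.test()
H  <- (N - 1) * ss.between / ss.total
kw <- kruskal.test(hc ~ year, data = emis.long)
c(H.by.hand = H, H.kruskal = unname(kw$statistic),
  p.value = pchisq(H, df = nlevels(emis.long$year) - 1, lower.tail = FALSE))
```

The rank ANOVA refers the same information to an F distribution, while the KW procedure refers H to a chi-squared distribution with (number of groups − 1) degrees of freedom, which is why the two p-values are similar but not identical.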
It is common to transform the data to a log scale when the spread increases as
the median or mean increases.
# log scale
emis.long$loghc <- log(emis.long$hc)
# summary of each year
[Figure: boxplots of log(hc) (log(ppm)) by year of manufacture (Pre.y63, y63.7, y68.9, y70.1, y72.4).]
After transformation, the samples have roughly the same spread (IQR and s) and
shape. The transformation does not completely eliminate the outliers. However, I
am more comfortable with a standard ANOVA on this scale than with the original
data. A difficulty here is that the ANOVA is comparing population mean log HC
emission (so interpretations are on the log ppm scale, instead of the natural ppm
scale). Summaries for the ANOVA on the log hydrocarbon emissions levels are given
below.
The boxplot of the log-transformed data reinforces the reasonableness of the original
KW analysis. Why? The log-transformed distributions have fairly similar shapes
and spreads, so a KW analysis on these data is sensible. The ranks for the original
and log-transformed data are identical, so the KW analyses on the log-transformed
data and the original data must lead to the same conclusions. This suggests that the
KW ANOVA is not overly sensitive to differences in spreads among the samples.
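The claim that the ranks, and therefore the KW results, are unchanged by the log transformation is easy to verify. The sketch below uses hypothetical positive measurements in three groups (not data from the notes) whose spread grows with the level.

```r
# Hypothetical positive measurements in three groups (not from the notes)
y <- c(12, 15, 40, 7, 90, 150, 33, 210, 45, 600, 75, 1200)
g <- factor(rep(c("g1", "g2", "g3"), each = 4))

# A strictly increasing transformation leaves the ranks unchanged ...
identical(rank(y), rank(log(y)))

# ... so the KW statistic and p-value are the same on both scales
kw.raw <- kruskal.test(y, g)
kw.log <- kruskal.test(log(y), g)
c(raw = unname(kw.raw$statistic), log = unname(kw.log$statistic))
```

The same holds for any strictly increasing transformation, such as the square root, whereas a standard ANOVA would generally give different results on each scale.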
There are two reasonable analyses here: the standard ANOVA using log HC emissions,
and the KW analysis of the original data. The first analysis gives a comparison
of mean log HC emissions. The second involves a comparison of median HC emissions.
A statistician would present both analyses to the scientist who collected the
data to make a decision on which was more meaningful (independently of the results;
see footnote 5). Multiple comparisons would be performed relative to the selected
analysis (t-tests for ANOVA or WMW-tests for KW ANOVA).
5
It is unethical to choose a method based on the results it gives.
## log scale
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(hd.long, aes(x = patient, y = loglevel))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(hd.long$loglevel),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
# fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Plasma bradykininogen levels for three patient groups (log scale)")
p <- p + ylab("log(level) (log(mg/ml))")
# to reverse the order in which the patient groups print, so the first level is on top
p <- p + scale_x_discrete(limits = rev(levels(hd.long$patient)) )
p <- p + ylim(c(0,max(hd.long$loglevel)))
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: Plasma bradykininogen levels for three patient groups (nc, ahd, ihd): level (mg/ml) on the original scale and log(level) (log(mg/ml)) on the log scale.]
Although the spread (IQR, s) in the ihd sample is somewhat greater than the
spread in the other samples, the presence of skewness and outliers in the boxplots is a
greater concern regarding the use of the classical ANOVA. The shapes and spreads in
the three samples are roughly identical, so a Kruskal-Wallis nonparametric ANOVA
appears ideal. As a sidelight, I transformed plasma levels to a log scale to reduce
the skewness and eliminate the outliers. The boxplots of the transformed data show
reasonable symmetry across groups, but outliers are still present. I will stick with
the Kruskal-Wallis ANOVA (although it would not be much of a problem to use the
classical ANOVA on transformed data).
Let ηnc = population median plasma level for normal controls, ηahd = population
median plasma level for active Hodgkin’s disease patients, and ηihd = population
median plasma level for inactive Hodgkin’s disease patients. The KW test of
H0 : ηnc = ηahd = ηihd versus HA : not H0 is highly significant (p-value = 0.00003),
suggesting differences among the population median plasma levels. The Kruskal-Wallis
ANOVA summary is given below.
# KW ANOVA
fit.h <- kruskal.test(level ~ patient, data = hd.long)
fit.h
##
## Kruskal-Wallis rank sum test
##
## data: level by patient
## Kruskal-Wallis chi-squared = 20.566, df = 2, p-value =
## 3.421e-05
I followed up the KW ANOVA with Bonferroni comparisons of the samples, using
the Mann-Whitney two-sample procedure. There are three comparisons, so an overall
FER of 0.05 is achieved by doing the individual tests at the 0.05/3=0.0167 level.
Alternatively, you can use 98.33% CIs for differences in population medians.
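The Bonferroni arithmetic above generalizes to any number of groups. A minimal sketch, with illustrative variable names:

```r
alpha.family <- 0.05
n.groups <- 3

n.comp     <- choose(n.groups, 2)      # number of pairwise comparisons
alpha.each <- alpha.family / n.comp    # level for each individual test
conf.each  <- 1 - alpha.each           # matching confidence level for each CI

round(c(n.comp = n.comp, alpha.each = alpha.each, conf.each = conf.each), 4)
```

The value of conf.each is what is passed as conf.level to wilcox.test() in the comparisons that follow.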
# with continuity correction in the normal approximation for the p-value
wilcox.test(hd$nc , hd$ahd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$nc, hd$ahd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties
## Warning in wilcox.test.default(hd$nc, hd$ahd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: hd$nc and hd$ahd
## W = 329, p-value = 0.0002735
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## 0.8599458 2.9000789
## sample estimates:
## difference in location
## 1.910067
wilcox.test(hd$nc , hd$ihd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$nc, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties
## Warning in wilcox.test.default(hd$nc, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: hd$nc and hd$ihd
## W = 276.5, p-value = 0.3943
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## -1.5600478 0.6800262
## sample estimates:
## difference in location
## -0.3413932
wilcox.test(hd$ahd, hd$ihd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$ahd, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties
a 0.01 level for the comparisons, instead of the more conservative 0.0033 level needed
when doing all possible comparisons.
To illustrate this idea, consider the KW analysis of HC emissions. We saw that
there are significant differences among the population median HC emissions. Given
that the samples have a natural ordering
Sample Year of manufacture
1 Pre-1963
2 63 – 67
3 68 – 69
4 70 – 71
5 72 – 74
you may primarily be interested in whether the population medians for cars manufactured
in consecutive samples are identical. That is, you may be primarily interested
in the following 4 comparisons:
Pre-1963 vs 63 – 67
63 – 67 vs 68 – 69
68 – 69 vs 70 – 71
70 – 71 vs 72 – 74
A Bonferroni analysis would carry out each comparison at the 0.05/4 = 0.0125 level
versus the 0.05/10 = 0.005 level when all comparisons are done.
The following output was obtained for doing these four comparisons, based on
Wilcoxon-Mann-Whitney two-sample tests (why? see footnote 6). Two-year groups are
claimed to be different if the p-value is 0.0125 or below, or equivalently, if a 98.75% CI
for the difference in population medians does not contain zero.
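The decision rule can be applied to the four p-values in one step. The sketch below copies the p-values from the wilcox.test() output reported for these four comparisons in this section; the labels and variable names are illustrative. It also shows what would have happened at the stricter level needed for all 10 pairwise comparisons.

```r
# p-values copied from the four wilcox.test() outputs in this section
p.planned <- c("63-67 vs Pre-1963" = 0.8524,
               "68-69 vs 63-67"    = 0.007968,
               "70-71 vs 68-69"    = 0.9112,
               "72-74 vs 70-71"    = 0.006384)

alpha.family <- 0.05
alpha.each <- alpha.family / length(p.planned)   # 0.05 / 4  = 0.0125
alpha.all  <- alpha.family / choose(5, 2)        # 0.05 / 10 = 0.005

data.frame(p = p.planned,
           planned.reject   = p.planned <= alpha.each,
           all.pairs.reject = p.planned <= alpha.all,
           p.bonferroni     = p.adjust(p.planned, method = "bonferroni"))
```

With these p-values, the 68-69 vs 63-67 and 72-74 vs 70-71 comparisons are significant at the planned 0.0125 level, but neither would be significant at the 0.005 level required when all 10 comparisons are made. Comparing p.bonferroni with 0.05 is equivalent to comparing the raw p-values with 0.0125.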
#### Planned Comparisons
# with continuity correction in the normal approximation for the p-value
wilcox.test(emis$y63.7, emis$Pre.y63, conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y63.7, emis$Pre.y63, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y63.7, emis$Pre.y63, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y63.7 and emis$Pre.y63
## W = 61.5, p-value = 0.8524
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -530.0001 428.0000
## sample estimates:
## difference in location
## -15.4763
6
The ANOVA is the multi-sample analog to the two-sample t-test for the mean, and the KW
ANOVA is the multi-sample analog to the WMW two-sample test for the median. Thus, we follow
up a KW ANOVA with WMW two-sample tests at the chosen multiple comparison adjusted error
rate.
wilcox.test(emis$y68.9, emis$y63.7 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y68.9, emis$y63.7, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y68.9, emis$y63.7, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y68.9 and emis$y63.7
## W = 43, p-value = 0.007968
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -708.99999 -51.99998
## sample estimates:
## difference in location
## -397.4227
wilcox.test(emis$y70.1, emis$y68.9 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y70.1, emis$y68.9, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y70.1, emis$y68.9, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y70.1 and emis$y68.9
## W = 156, p-value = 0.9112
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -206.0001 171.0000
## sample estimates:
## difference in location
## -10.99997
wilcox.test(emis$y72.4, emis$y70.1 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y72.4, emis$y70.1, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y72.4, emis$y70.1, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y72.4 and emis$y70.1
## W = 92.5, p-value = 0.006384
## alternative hypothesis: true location shift is not equal to 0
an experimental design is mirrored in the analysis of that design. If the labels are
exchangeable under the null hypothesis, then the resulting tests yield exact significance
levels. Confidence intervals can then be derived from the tests. The theory has
evolved from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.
Let’s illustrate the basic idea of a permutation test using the Meteorites example.
Suppose we have two groups Uwet and Walker whose sample means are ȲU and
ȲW , and that we want to test, at the 5% significance level, whether they come from the
same distribution. Let nU = 9 and nW = 10 be the sample sizes of the two
groups. The permutation test is designed to determine whether the observed
difference between the sample means is large enough to reject the null hypothesis
H0 : µU = µW , that the two groups have identical means.
The test proceeds as follows. First, the difference in means between the two
samples is calculated: this is the observed value of the test statistic, T(obs) . Then the
observations of groups Uwet and Walker are pooled.
#### Permutation tests
# Calculate the observed difference in means
# met.long includes both Uwet and Walker groups
Tobs <- mean(met.long[(met.long$site == "Uwet" ), 2]) -
mean(met.long[(met.long$site == "Walker"), 2])
Tobs
## [1] 0.2522222
Next, the difference in sample means is calculated and recorded for every possible
way of dividing these pooled values into two groups of size nU = 9 and nW = 10
(i.e., for every permutation of the group labels Uwet and Walker). The set of these
calculated differences is the exact distribution of possible differences under the null
hypothesis that group label does not matter. This exact distribution can be approx-
imated by drawing a large number of random permutations.
# Plan:
# Initialize a vector in which to store the R differences in means.
# Calculate R differences in means for R permutations, storing the results.
# Note that there are prod(1:19), about 1.2 x 10^17, total permutations,
# but the R repetitions will serve as a good approximation.
# Plot the permutation null distribution with an indication of the Tobs.
library(ggplot2)
p <- ggplot(dat, aes(x = Tperm))
#p <- p + scale_x_continuous(limits=c(-20,+20))
p <- p + geom_histogram(aes(y=..density..), binwidth=0.01)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at Tobs
p <- p + geom_vline(aes(xintercept=Tobs), colour="#BB0000", linetype="dashed")
p <- p + labs(title = "Permutation distribution of difference in means, Uwet and Walker Meteorite
p <- p + xlab("difference in means (red line = observed difference in means)")
print(p)
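[Figure: histogram and density of the permutation distribution of the difference in means, with a dashed red vertical line at the observed difference Tobs; y-axis: density.]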
Note that the two-sided p-value of 0.1177 is consistent, in this case, with the
two-sample t-test p-values of 0.1134 (pooled) and 0.1290 (Satterthwaite), but different
from 0.0497 (WMW). The permutation test is a comparison of means without the
normality assumption, though it requires that the observations are exchangeable between
populations under H0 .
If the only purpose of the test is to reject or not reject the null hypothesis, we can,
as an alternative, sort the recorded differences and then check whether T(obs) is contained
within the middle 95% of them. If it is not, we reject the hypothesis of equal means
at the 5% significance level.
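The permutation procedure described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the two data vectors are made up (the meteorite data are not reproduced here), and the names y_a, y_b, T_perm, and the choice of R = 10000 repetitions are hypothetical.

```r
# Hypothetical data: two small groups (values invented for illustration)
set.seed(1)
y_a <- c(2.1, 2.9, 3.4, 3.8, 4.0, 4.4, 4.7, 5.1, 5.6)        # group A, n = 9
y_b <- c(1.8, 2.2, 2.6, 3.0, 3.1, 3.5, 3.9, 4.1, 4.3, 4.8)   # group B, n = 10
y   <- c(y_a, y_b)                                            # pooled observations
n_a <- length(y_a)

# Observed difference in means
T_obs <- mean(y_a) - mean(y_b)

# There are choose(19, 9) = 92378 distinct ways to split 19 values into
# groups of 9 and 10; here we approximate with R random relabelings.
R <- 10000
T_perm <- numeric(R)
for (r in seq_len(R)) {
  idx <- sample(length(y), n_a)                  # indices relabeled as group A
  T_perm[r] <- mean(y[idx]) - mean(y[-idx])      # difference under this relabeling
}

# Two-sided p-value: proportion of permuted differences at least as extreme
p_two_sided <- mean(abs(T_perm) >= abs(T_obs))

# Alternative decision rule: is T_obs outside the middle 95% of T_perm?
mid95  <- quantile(T_perm, c(0.025, 0.975))
reject <- unname(T_obs < mid95[1] | T_obs > mid95[2])

p_two_sided
reject
```

The two decision rules usually agree, because both compare T_obs with the tails of the same permutation distribution.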
library(coin)
# Fisher-Pitman permutation test
oneway.summary <- oneway_test(hc ~ year, data = subset(emis.long, (year == fac.lev[i1] | year == fac
# p-values
mc.pval
## y63.7 Pre.y63 y68.9 y70.1 y72.4
## y63.7 1.000000000 0.676572596 0.1993877 0.004273746 0.002185513
## Pre.y63 0.676572596 1.000000000 0.1611790 0.005319987 0.003379156
## y68.9 0.199387725 0.161179041 1.0000000 0.468455149 0.177187250
## y70.1 0.004273746 0.005319987 0.4684551 1.000000000 0.227517382
## y72.4 0.002185513 0.003379156 0.1771873 0.227517382 1.000000000
Summarize the results of the pairwise comparisons. Groups with a common letter
are not statistically different.
# summary of pairwise comparisons
# threshold is Bonferroni-corrected alpha=0.05 / 10
library(multcompView)
multcompLetters( mc.pval
, compare = "<"
, threshold = 0.05 / choose(length(fac.lev), 2)
, Letters = letters
, reversed = FALSE)
## y63.7 Pre.y63 y68.9 y70.1 y72.4
## "a" "ab" "abc" "bc" "c"
[Figure: histograms of the same data using 1 break, the default breaks, 10 breaks, 20 breaks, and 100 breaks; y-axis: Frequency.]
Notice that we are starting to see more and more bins that include only a single
observation (or multiple observations at the precision of measurement). Taken to its
extreme, this type of exercise gives in some sense a “perfect” fit to the data but is
useless as an estimator of shape.
On the other hand, it is obvious that a single bin would also be completely useless.
So we try in some sense to find a middle ground between these two extremes:
“Oversmoothing” by using only one bin and “undersmoothing” by using too many.
This same paradigm occurs for density estimation, in which the amount of smoothing
We’ve already used the density() function to provide a smooth curve to our
histograms. So far, we’ve taken the default “bandwidth”. Let’s see what happens
when we use different bandwidths.
par(mfrow=c(3,1))
# undersmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.0004), col=3, lwd=2)
text(17.5, .35, "", col=3, cex=1.4)
title(main=paste("Undersmooth, BW = 0.0004"), col.main=3)
# oversmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.008), col=4, lwd=2)
title(main=paste("Oversmooth, BW = 0.008"), col.main=4)
[Figure: three histograms of time2 with density curves overlaid, titled "Default = 0.0018", "Undersmooth, BW = 0.0004", and "Oversmooth, BW = 0.008"; x-axis: time2, y-axis: Density.]
The other determining factor is the kernel, which is the shape each individual
point takes before all the shapes are added up for a final density line. While the
choice of bandwidth is very important, the choice of kernel is not. Choosing a kernel
with hard edges (such as "rect") will result in jagged artifacts, so smoother kernels
are often preferred.
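To see the effect of the kernel on its own, the following sketch fits two density estimates to the same simulated sample, changing only the kernel argument of density(). The sample x and the roughness() helper are assumptions for illustration; they are not the time2 data or anything defined in these notes.

```r
set.seed(42)
x <- rnorm(200)                                  # hypothetical sample

d_gauss <- density(x, kernel = "gaussian")       # smooth kernel (the default)
d_rect  <- density(x, kernel = "rectangular")    # hard-edged kernel

# With no bw argument, both fits use the same default bandwidth
c(gaussian = d_gauss$bw, rectangular = d_rect$bw)

# Crude roughness measure (an ad hoc helper): total absolute second
# difference of the estimated curve; larger means more jagged
roughness <- function(d) sum(abs(diff(d$y, differences = 2)))
c(gaussian = roughness(d_gauss), rectangular = roughness(d_rect))
```

Overlaying the two curves with plot(d_gauss) followed by lines(d_rect, col = 2) gives a visual comparison of the same thing.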
par(mfrow=c(1,1))
[Figure: density estimate of time2; x-axis: time2, y-axis: Density.]
Contents
7.1 Categorical data
7.2 Single Proportion Problems
7.2.1 A CI for p
7.2.2 Hypothesis Tests on Proportions
7.2.3 The p-value for a two-sided test
7.2.4 Appropriateness of Test
7.2.5 R Implementation
7.2.6 One-Sided Tests and One-Sided Confidence Bounds
7.2.7 Small Sample Procedures
7.3 Analyzing Raw Data
7.4 Goodness-of-Fit Tests (Multinomial)
7.4.1 Adequacy of the Goodness-of-Fit Test
7.4.2 R Implementation
7.4.3 Multiple Comparisons in a Goodness-of-Fit Problem
7.5 Comparing Two Proportions: Independent Samples
7.5.1 Large Sample CI and Tests for p1 − p2
7.6 Effect Measures in Two-by-Two Tables
7.7 Analysis of Paired Samples: Dependent Proportions
7.8 Testing for Homogeneity of Proportions
7.8.1 Adequacy of the Chi-Square Approximation
Learning objectives
After completing this topic, you should be able to:
select the appropriate statistical method to compare summaries from categorical
variables.
assess the assumptions of one-way and two-way tests of proportions and indepen-
dence.
decide whether the proportions between populations are different, including in
stratified and cross-sectional studies.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
Example: Titanic The sinking of the Titanic is a famous event, and new books
are still being published about it. Many well-known facts — from the proportions of
first-class passengers to the “women and children first” policy, and the fact that policy
was not entirely successful in saving the women and children in the third class — are
reflected in the survival rates for various classes of passenger. The source provides a
data set recording class, sex, age, and survival status for each person on board the
Titanic, and is based on data originally collected by the British Board of Trade¹.
¹ British Board of Trade (1990), Report on the Loss of the “Titanic” (S.S.). British Board of
Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing. Note that there is not
complete agreement among primary sources as to the exact numbers on board, rescued, or lost.
Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
# reshape into long data.frame
library(reshape2)
df.titanic <- melt(Titanic, value.name = "Freq")
df.titanic
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
[Figure: plot of the Titanic counts by Class (1st, 2nd, 3rd, Crew), Sex (Male, Female), and Survived (No, Yes).]
There are many questions that can be asked of this dataset. How likely were people
to survive such a ship sinking in cold water? Is the survival proportion dependent on
sex, class, or age, or a combination of these? How different are the survival proportions
for 1st class females versus 3rd class males?
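A first look at these questions needs only the observed proportions. The sketch below collapses the built-in Titanic table over the variables not of interest; the object names (surv_overall, surv_by_class_sex, and so on) are chosen here for illustration.

```r
# Titanic is a built-in 4-way table with dimensions
# 1 = Class, 2 = Sex, 3 = Age, 4 = Survived

# Overall survival proportion: collapse everything except Survived
surv_overall <- prop.table(margin.table(Titanic, 4))
round(surv_overall, 3)

# Survival proportion within each Class-by-Sex group (collapsing over Age)
surv_by_class_sex <- prop.table(margin.table(Titanic, c(1, 2, 4)),
                                margin = c(1, 2))
round(surv_by_class_sex[, , "Yes"], 3)

# The two groups mentioned in the text
p_1st_female <- surv_by_class_sex["1st", "Female", "Yes"]
p_3rd_male   <- surv_by_class_sex["3rd", "Male",   "Yes"]
c(first_class_females = p_1st_female, third_class_males = p_3rd_male)
```

These are descriptive summaries only; whether such differences are statistically significant is the subject of the methods in the rest of the chapter.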
7.2.1 A CI for p
A two-sided CI for p is a range of plausible values for the unknown population pro-
portion p, based on the observed data. To compute a two-sided CI for p:
1. Specify the confidence level as the percent 100(1 − α)% and solve for the error
rate α of the CI.
2. Compute zcrit = z0.5α (i.e., area under the standard normal curve to the left and
to the right of zcrit are 1 − 0.5α and 0.5α, respectively). qnorm(1-0.05/2)=1.96.
3. The 100(1 − α)% CI for p has endpoints L = p̂ − zcrit SE and U = p̂ + zcrit SE,
respectively, where the “CI standard error” is
$$\mathrm{SE} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
The CI is often written as p̂ ± zcrit SE.
for tamper resistant packaging?” The number of yes respondents was 189. Construct
a 95% CI for the proportion p of all consumers who were willing in 1983 to pay extra
for such packaging.
Here n = 270 and p̂ = 189/270 = 0.700. The critical value for a 95% CI for p is
z0.025 = 1.96. The CI standard error is given by
$$\mathrm{SE} = \sqrt{\frac{0.7 \times 0.3}{270}} = 0.028,$$
so zcrit SE = 1.96 × 0.028 = 0.055. The 95% CI for p is 0.700 ± 0.055. You are 95%
confident that the proportion of consumers willing to pay extra for better packaging
is between 0.645 and 0.755. (Willing to pay how much extra?)
Appropriateness of the CI
The standard CI is based on a large-sample standard normal approximation to
$$z = \frac{\hat{p} - p}{\mathrm{SE}}.$$
A simple rule of thumb requires np̂ ≥ 5 and n(1− p̂) ≥ 5 for the method to be suitable.
Given that np̂ and n(1 − p̂) are the observed numbers of successes and failures, you
should have at least 5 of each to apply the large-sample CI.
In the packaging example, np̂ = 270 × (0.700) = 189 (the number who support
the new packaging) and n(1 − p̂) = 270 × (0.300) = 81 (the number who oppose) both
exceed 5. The normal approximation is appropriate here.
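The packaging calculation and the rule-of-thumb check can be reproduced directly. This is a minimal sketch of the three steps listed earlier; the variable names are chosen here for illustration.

```r
x <- 189                                   # number of "yes" respondents
n <- 270                                   # sample size
p_hat <- x / n                             # 0.700

z_crit <- qnorm(1 - 0.05 / 2)              # critical value for a 95% CI
se     <- sqrt(p_hat * (1 - p_hat) / n)    # CI standard error
ci     <- p_hat + c(-1, 1) * z_crit * se   # p_hat -/+ z_crit * SE
round(ci, 3)

# Rule of thumb: at least 5 observed successes and 5 observed failures
c(successes = n * p_hat, failures = n * (1 - p_hat)) >= 5
```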
$$z_s = \frac{\hat{p} - p_0}{\mathrm{SE}},$$
where the “test standard error” (based on the hypothesized value) is
$$\mathrm{SE} = \sqrt{\frac{p_0(1 - p_0)}{n}}.$$
5. Reject H0 in favor of HA if |zobs | ≥ zcrit . Otherwise, do not reject H0 .
The rejection rule is easily understood visually. The area under the normal curve
outside ±zcrit is the size α of the test. One-half of α is the area in each tail. You
reject H0 in favor of HA if the test statistic falls outside ±zcrit , that is, if
|zobs | ≥ zcrit . This occurs when p̂ is
significantly different from p0 , as measured by the standardized distance zobs between
p̂ and p0 .
[Figure: two standard normal curves. Left: Z-distribution with the two-sided size α = .05 critical region (fixed), with area α/2 in each tail beyond ±zCrit, where H0 is rejected. Right: Z-distribution with the two-sided p-value, with area p-value/2 in each tail beyond ±zs.]
To compute the p-value (not to be confused with the value of the proportion p) for a
two-sided test:
1. Compute the test statistic zs = zobs .
2. Evaluate the area under the normal probability curve outside ±|zs |.
Recall that the null hypothesis for a size α test is rejected if and only if the p-value
is less than or equal to α.
Example: Emissions data Each car in the target population (L.A. county) either
has been tampered with (a success) or has not been tampered with (a failure). Let
p = the proportion of cars in L.A. county with tampered emissions control devices.
You want to test H0 : p = 0.15 against HA : p ≠ 0.15 (here p0 = 0.15). The critical
value for a two-sided test of size α = 0.05 is zcrit = 1.96.
The data are a sample of n = 200 cars. The sample proportion of cars that have
been tampered with is p̂ = 21/200 = 0.105. The test statistic is
$$z_s = \frac{0.105 - 0.15}{0.02525} = -1.78,$$
where the test standard error is
$$\mathrm{SE} = \sqrt{\frac{0.15 \times 0.85}{200}} = 0.02525.$$
Given that |zs | = 1.78 < 1.96, you have insufficient evidence to reject H0 at the 5%
level. That is, you have insufficient evidence to conclude that the proportion of cars
in L.A. county that have been tampered with differs from the statewide proportion.
This decision is reinforced by the p-value calculation. The p-value is the area
under the standard normal curve outside ±1.78. This is 2 × 0.0375 = 0.075, which
exceeds the test size of 0.05.
[Figure: standard normal curve with area .0375 in each tail outside ±1.78.]
Remark The standard errors used in the test and in the CI are different. This implies that a
hypothesis test and CI could potentially lead to different decisions. For example, a 95% CI for a
population proportion might cover p0 when the p-value for testing H0 : p = p0 is
smaller than 0.05. This will happen, typically, only in cases where the decision is
“borderline.”
7.2.5 R Implementation
#### Single Proportion Problems
# Approximate normal test for proportion, without Yates' continuity correction
prop.test(21, 200, p = 0.15, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 21 out of 200, null probability 0.15
## X-squared = 3.1765, df = 1, p-value = 0.07471
## alternative hypothesis: true p is not equal to 0.15
## 95 percent confidence interval:
## 0.06970749 0.15518032
## sample estimates:
## p
## 0.105
# Approximate normal test for proportion, with Yates' continuity correction
#prop.test(21, 200, p = 0.15)
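It can help to connect the prop.test() output to the hand calculation for the emissions data. The sketch below recomputes zs and the two-sided p-value, then compares the p̂ ± zcrit SE interval with the interval that prop.test() reports. prop.test() is documented as obtaining its interval by inverting the score test, so the two intervals are not expected to match; the variable names are chosen here for illustration.

```r
x <- 21; n <- 200; p0 <- 0.15
p_hat <- x / n

# Test by hand, using the test standard error (based on p0)
se_test <- sqrt(p0 * (1 - p0) / n)
z_s     <- (p_hat - p0) / se_test
p_value <- 2 * pnorm(-abs(z_s))            # area outside +/- |z_s|

fit <- prop.test(x, n, p = p0, correct = FALSE)

# X-squared reported by prop.test() is the square of z_s
c(z_squared = z_s^2, X_squared = unname(fit$statistic))
c(by_hand = p_value, prop_test = fit$p.value)

# CI by hand, using the CI standard error (based on p_hat)
se_ci   <- sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci
rbind(by_hand = wald_ci, prop_test = as.numeric(fit$conf.int))
```

In this borderline example the hand-computed interval lies just below 0.15 even though the test does not reject at the 5% level, which is the kind of disagreement the Remark above describes.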
I will answer this question by computing a p-value for a one-sided test. Let p
be the population proportion of learning disabled children with brains having larger
right sides. I am interested in testing H0 : p = 0.25 against HA : p > 0.25 (here
p0 = 0.25).
The proportion of children sampled with brains having larger right sides is p̂ =
22/53 = 0.415. The test statistic is
$$z_s = \frac{0.415 - 0.25}{0.0595} = 2.78,$$
where the test standard error satisfies
$$\mathrm{SE} = \sqrt{\frac{0.25 \times 0.75}{53}} = 0.0595.$$
The p-value for an upper one-sided test is the area under the standard normal curve
to the right of 2.78, which is approximately .003; see the picture below. I would
reject H0 in favor of HA using any of the standard test levels, say 0.05 or 0.01. The
newspaper’s claim is reasonable.
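The one-sided calculation above can be checked in R. The sketch computes zs and the upper-tail area by hand, then compares with prop.test() using alternative = "greater"; the variable names are chosen here for illustration.

```r
x <- 22; n <- 53; p0 <- 0.25
p_hat <- x / n

se_test <- sqrt(p0 * (1 - p0) / n)           # test standard error
z_s     <- (p_hat - p0) / se_test
p_upper <- pnorm(z_s, lower.tail = FALSE)    # area to the right of z_s

fit <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)
c(z = z_s, p_by_hand = p_upper, p_prop_test = fit$p.value)
```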
[Figure: standard normal curve; the p-value is the area in the right tail only, .003, to the right of zs = 2.78.]
standard CI.
This adjustment has little effect when n is large and p̂ is not near either 0 or 1,
as in the Tylenol example.
# Agresti's method
prop.test(1+2, 6+4, p = 0.85, correct = FALSE)$conf.int
## Warning in prop.test(1 + 2, 6 + 4, p = 0.85, correct = FALSE): Chi-squared approximation
may be incorrect
## [1] 0.1077913 0.6032219
## attr(,"conf.level")
## [1] 0.95
# Exact binomial test for proportion
binom.test(1, 6, p = 0.85)$conf.int
## [1] 0.004210745 0.641234579
## attr(,"conf.level")
## [1] 0.95
Returning to the problem, you might check for discrimination by testing H0 : p =
0.85 against HA : p < 0.85 using an exact test. The exact test p-value is 0.000 to
three decimal places, and an exact upper bound for p is 0.582. What does this suggest
to you?
# Exact binomial test for proportion
binom.test(1, 6, alternative = "less", p = 0.85)
##
## Exact binomial test
##
## data: 1 and 6
## number of successes = 1, number of trials = 6, p-value =
## 0.0003987
## alternative hypothesis: true probability of success is less than 0.85
## 95 percent confidence interval:
## 0.0000000 0.5818034
## sample estimates:
## probability of success
## 0.1666667
not approved
not approved
not approved
approved
not approved
not approved
", sep = ",", header=FALSE, stringsAsFactors=FALSE)
## 0.0000000 0.5818034
## sample estimates:
## probability of success
## 0.1666667
The categories are taken in alphabetical order, which may be the wrong order
(failures before successes), in which case we’d need to reorder the input to binom.test().
In Chapter 6 we looked at the binomial distribution to obtain an exact Sign Test
confidence interval for the median. Examine the following to see where the exact
p-value for this test comes from.
n <- 6
x <- 0:n
p0 <- 0.85
bincdf <- pbinom(x, n, p0)
cdf <- data.frame(x, bincdf)
cdf
## x bincdf
## 1 0 1.139063e-05
## 2 1 3.986719e-04
## 3 2 5.885156e-03
## 4 3 4.733859e-02
## 5 4 2.235157e-01
## 6 5 6.228505e-01
## 7 6 1.000000e+00
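The row for x = 1 in the table above is the exact lower one-sided p-value reported earlier. A short check, assuming the same n = 6, p0 = 0.85, and one observed success (the names x_obs, p_exact, and p_sum are chosen here):

```r
n <- 6; p0 <- 0.85; x_obs <- 1

# P(X <= 1) when X ~ Binomial(6, 0.85): the lower one-sided exact p-value
p_exact <- pbinom(x_obs, n, p0)

# The same probability as a sum of individual binomial probabilities
p_sum <- sum(dbinom(0:x_obs, n, p0))

fit <- binom.test(x_obs, n, p = p0, alternative = "less")
c(pbinom = p_exact, dbinom_sum = p_sum, binom_test = fit$p.value)
```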
where Oi is the observed number in the sample that fall into the ith category (Oi =
np̂i ), and Ei = np0i is the number of individuals expected to be in the ith category
when H0 is true.
The Pearson statistic can also be computed as the sum of the squared residuals:
$$\chi^2_s = \sum_{i=1}^{r} Z_i^2,$$
where $Z_i = (O_i - E_i)/\sqrt{E_i}$, or in terms of the observed and hypothesized category
proportions
$$\chi^2_s = n \sum_{i=1}^{r} \frac{(\hat{p}_i - p_{0i})^2}{p_{0i}}.$$
The Pearson statistic χ2s is “small” when all of the observed counts (proportions)
are close to the expected counts (proportions). The Pearson χ2s is “large” when one
or more observed counts (proportions) differ noticeably from what is expected when
H0 is true. Put another way, large values of χ2s suggest that H0 is false.
The critical value χ2crit for the test is obtained from a chi-squared probability
table with r −1 degrees of freedom. The picture below shows the form of the rejection
region. For example, if r = 5 and α = 0.05, then you reject H0 when χ2s ≥ χ2crit = 9.49
(qchisq(0.95, 5-1)). The p-value for the test is the area under the chi-squared curve
with df = r − 1 to the right of the observed χ2s value.
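[Figure: two chi-squared density curves. Left: the rejection region, reject H0 for χ2S to the right of χ2Crit. Right: the p-value as the area to the right of the observed χ2S, shown relative to χ2Crit.]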
Example: jury pool Let p18 be the proportion in the jury pool population between
ages 18 and 19. Define p20 , p25 , p30 , p40 , p50 , and p65 analogously. You are interested in
testing that the true jury proportions equal the census proportions, H0 : p18 = 0.061,
p20 = 0.150, p25 = 0.135, p30 = 0.217, p40 = 0.153, p50 = 0.182, and p65 = 0.102
against HA : not H0 , using the sample of 1336 from the jury pool.
The observed counts, the expected counts, and the category residuals are given
in the table below. For example, E18 = 1336 × (0.061) = 81.5 and Z18 = (23 −
81.5)/√81.5 = −6.48 in the 18-19 year category.
The Pearson statistic is
7.4.2 R Implementation
#### Example: jury pool
jury <- read.table(text="
Age Count CensusProp
18-19 23 0.061
20-24 96 0.150
25-29 134 0.135
30-39 293 0.217
40-49 297 0.153
Plot observed vs expected values to help identify age groups that deviate the
most. Plot contribution to chi-square values to help identify age groups that deviate
the most. The term “Contribution to Chi-Square” (chisq) refers to the values of
$(O - E)^2 / E$ for each category. χ2s is the sum of those contributions.
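The expected counts, residuals, and contributions can be computed directly from the counts and census proportions. One assumption to flag: the counts for the 50-64 and 65-99 categories (380 and 113) are not shown in the truncated data listing above; they are inferred here from the observed proportions reported later in this section (0.2844 × 1336 and 0.0846 × 1336). The variable names are chosen for illustration.

```r
age <- c("18-19", "20-24", "25-29", "30-39", "40-49", "50-64", "65-99")
obs <- c(23, 96, 134, 293, 297, 380, 113)   # last two counts inferred (see above)
p0  <- c(0.061, 0.150, 0.135, 0.217, 0.153, 0.182, 0.102)

n       <- sum(obs)                   # 1336
expd    <- n * p0                     # expected counts E_i = n * p0_i
Z       <- (obs - expd) / sqrt(expd)  # category residuals Z_i
contrib <- Z^2                        # contributions (O - E)^2 / E
chisq_s <- sum(contrib)               # Pearson statistic
p_value <- pchisq(chisq_s, df = length(obs) - 1, lower.tail = FALSE)

data.frame(age, obs, exp = round(expd, 1), Z = round(Z, 2),
           chisq = round(contrib, 2))

# The built-in goodness-of-fit test gives the same statistic
fit <- chisq.test(obs, p = p0)
c(by_hand = chisq_s, chisq_test = unname(fit$statistic))
```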
library(reshape2)
x.table.obsexp <- melt(x.table,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars = c("age"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("obs", "exp"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "stat",
# value.name: column name for values in table
value.name = "value"
)
# naming variables manually, the variable.name and value.name not working 11/2012
names(x.table.obsexp) <- c("age", "stat", "value")
# Contribution to chi-sq
# pull out only the age and chisq columns
x.table.chisq <- x.table[, c("age","chisq")]
# reorder the age categories to be descending relative to the chisq statistic
x.table.chisq$age <- with(x.table, reorder(age, -chisq))
[Figure: two bar plots. Left: observed (obs) and expected (exp) counts by age category (years). Right: contribution to chi-square by age category, sorted in decreasing order: 50−64, 20−24, 18−19, 40−49, 25−29, 65−99, 30−39.]
, c(b.sum3$p.value, b.sum3$conf.int)
, c(b.sum4$p.value, b.sum4$conf.int)
, c(b.sum5$p.value, b.sum5$conf.int)
, c(b.sum6$p.value, b.sum6$conf.int)
, c(b.sum7$p.value, b.sum7$conf.int)
)
)
names(b.sum) <- c("p.value", "CI.lower", "CI.upper")
b.sum$Age <- jury$Age
b.sum$Observed <- x.table$obs/sum(x.table$obs)
b.sum$CensusProp <- jury$CensusProp
b.sum
## p.value CI.lower CI.upper Age Observed CensusProp
## 1 8.814860e-15 0.00913726 0.02920184 18-19 0.01721557 0.061
## 2 2.694633e-18 0.05415977 0.09294037 20-24 0.07185629 0.150
## 3 1.394274e-04 0.07939758 0.12435272 25-29 0.10029940 0.135
## 4 8.421685e-01 0.18962122 0.25120144 30-39 0.21931138 0.217
## 5 2.383058e-11 0.19245560 0.25433144 40-49 0.22230539 0.153
## 6 5.915839e-20 0.25174398 0.31880556 50-64 0.28443114 0.182
## 7 3.742335e-02 0.06536589 0.10707682 65-99 0.08458084 0.102
The CIs for the 30-39 and 65-99 year categories contain the census proportions.
In the other five age categories, there are significant differences between the jury
pool proportions and the census proportions. In general, young adults appear to be
underrepresented in the jury pool whereas older age groups are overrepresented.
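The seven separate comparisons can be written more compactly as a loop. This sketch assumes that each category is tested with binom.test() at a Bonferroni-adjusted level of 0.05/7, which appears consistent with the output above but is not shown explicitly in these notes; the counts for the last two categories are inferred from the observed proportions in that output.

```r
age <- c("18-19", "20-24", "25-29", "30-39", "40-49", "50-64", "65-99")
obs <- c(23, 96, 134, 293, 297, 380, 113)   # last two counts inferred
p0  <- c(0.061, 0.150, 0.135, 0.217, 0.153, 0.182, 0.102)

k <- length(obs)
n <- sum(obs)
alpha_each <- 0.05 / k                       # Bonferroni-adjusted level

# One exact binomial test per category, with a matching adjusted CI
res <- do.call(rbind, lapply(seq_len(k), function(i) {
  bt <- binom.test(obs[i], n, p = p0[i], conf.level = 1 - alpha_each)
  data.frame(Age = age[i], p.value = bt$p.value,
             CI.lower = bt$conf.int[1], CI.upper = bt$conf.int[2],
             Observed = obs[i] / n, CensusProp = p0[i])
}))
res$significant <- res$p.value < alpha_each
res
```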
##
## Attaching package: ’xtable’
## The following object is masked from ’package:TeachingDemos’:
##
## digits
Age p.value CI.lower CI.upper Observed CensusProp
1 18-19 0.000 0.009 0.029 0.017 0.061
2 20-24 0.000 0.054 0.093 0.072 0.150
3 25-29 0.000 0.079 0.124 0.100 0.135
4 30-39 0.842 0.190 0.251 0.219 0.217
5 40-49 0.000 0.192 0.254 0.222 0.153
6 50-64 0.000 0.252 0.319 0.284 0.182
7 65-99 0.037 0.065 0.107 0.085 0.102
The residuals also highlight significant differences because the largest residuals
correspond to the categories that contribute most to the value of χ2s . Some researchers
use the residuals for the multiple comparisons, treating the Zi s as standard normal
variables. Following this approach, you would conclude that the jury pool proportions
differ from the proportions in the general population in every age category where
|Zi | ≥ 2.70 (using the same Bonferroni correction). This gives the same conclusion
as before.
The two multiple comparison methods are similar, but not identical. The residuals
$$Z_i = \frac{O_i - E_i}{\sqrt{E_i}} = \frac{\hat{p}_i - p_{0i}}{\sqrt{p_{0i}/n}}$$
agree with the large-sample statistic for testing H0 : pi = p0i , except that the divisor
in Zi omits a 1 − p0i term. The Zi s are not standard normal random variables as as-
sumed, and the value of Zi underestimates the significance of the observed differences.
Multiple comparisons using the Zi s will find, on average, fewer significant differences
than the preferred method based on the large sample tests. However, the differences
between the two methods are usually minor when all of the hypothesized proportions
are small.
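The relationship between the residuals and the large-sample test statistics is easy to verify numerically. In this sketch (names chosen for illustration, last two jury counts inferred from the observed proportions reported earlier), the two differ exactly by the factor 1/√(1 − p0i), so the large-sample statistic is always the larger in absolute value.

```r
obs <- c(23, 96, 134, 293, 297, 380, 113)   # last two counts inferred
p0  <- c(0.061, 0.150, 0.135, 0.217, 0.153, 0.182, 0.102)
n   <- sum(obs)

# Pearson residuals: divisor omits the (1 - p0) term
Z_resid <- (obs - n * p0) / sqrt(n * p0)

# Large-sample one-proportion test statistics
z_large <- (obs / n - p0) / sqrt(p0 * (1 - p0) / n)

# Bonferroni two-sided critical value for 7 comparisons
z_crit <- qnorm(1 - 0.05 / (2 * length(obs)))

round(cbind(Z_resid, z_large, ratio = z_large / Z_resid), 3)
z_crit
```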
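where
$$\mathrm{SE}_{\mathrm{test}}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}} = \sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
is the test standard error for p̂1 − p̂2 . The pooled proportion
$$\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$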
is the proportion of successes in the two samples combined. The test standard error
has the same functional form as the CI standard error, with p̄ replacing the individual
sample proportions.
The pooled proportion is the best guess at the common population proportion
when H0 : p1 = p2 is true. The test standard error estimates the standard deviation
of p̂1 − p̂2 assuming H0 is true.
Remark: As in the one-sample proportion problem, the test and CI SE’s are dif-
ferent. This can (but usually does not) lead to some contradiction between the test
and CI.
Example, vitamin C Two hundred and seventy nine (279) French skiers were
studied during two one-week periods in 1961. One group of 140 skiers received a
placebo each day, and the other 139 received 1 gram of ascorbic acid (Vitamin C) per
day. The study was double blind: neither the subjects nor the researchers knew who
received which treatment. Let p1 be the probability that a member of the ascorbic
acid group contracts a cold during the study period, and p2 be the corresponding
probability for the placebo group. Linus Pauling (Chemistry and Peace Nobel prize
winner) and I are interested in testing whether H0 : p1 = p2 . The data are summarized
below as a two-by-two table of counts (a contingency table)
Outcome Ascorbic Acid Placebo
# with cold 17 31
# with no cold 122 109
Totals 139 140
The sample sizes are n1 = 139 and n2 = 140. The sample proportion of skiers
developing colds in the placebo and treatment groups are p̂2 = 31/140 = 0.221 and
p̂1 = 17/139 = 0.122, respectively. The difference is p̂1 − p̂2 = 0.122−0.221 = −0.099.
The pooled proportion is the number of skiers that developed colds divided by the
number of skiers in the study: p̄ = 48/279 = 0.172.
The test standard error is
$$\mathrm{SE}_{\mathrm{test}}(\hat{p}_1 - \hat{p}_2) = \sqrt{0.172 \times (1 - 0.172) \left(\frac{1}{139} + \frac{1}{140}\right)} = 0.0452.$$
The 95% CI for p1 − p2 is −0.099 ± 0.088, or (−0.187, −0.011). We are 95% confident
that p2 exceeds p1 by at least 0.011 but not by more than 0.187.
On the surface, we would conclude that a daily dose of Vitamin C decreases
a French skier’s chance of developing a cold by between 0.011 and 0.187 (with 95%
confidence). This conclusion was somewhat controversial. Several reviews of the study
felt that the experimenter’s evaluations of cold symptoms were unreliable. Many other
studies refute the benefit of Vitamin C as a treatment for the common cold.
#### Example, vitamin C
# Approximate normal test for two-proportions, without Yates' continuity correction
prop.test(c(17, 31), c(139, 140), correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(17, 31) out of c(139, 140)
## X-squared = 4.8114, df = 1, p-value = 0.02827
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.18685917 -0.01139366
## sample estimates:
## prop 1 prop 2
## 0.1223022 0.2214286
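The prop.test() output can be reproduced by hand, which makes the roles of the two standard errors explicit: the pooled proportion enters the test, while the separate sample proportions enter the CI. A minimal sketch, with variable names chosen for illustration:

```r
x <- c(17, 31)        # colds: ascorbic acid, placebo
n <- c(139, 140)      # group sizes
p_hat    <- x / n
diff_hat <- p_hat[1] - p_hat[2]

# Test: pooled proportion and test standard error
p_bar   <- sum(x) / sum(n)
se_test <- sqrt(p_bar * (1 - p_bar) * sum(1 / n))
z_s     <- diff_hat / se_test
p_value <- 2 * pnorm(-abs(z_s))

# CI: standard error based on the individual sample proportions
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci    <- diff_hat + c(-1, 1) * qnorm(0.975) * se_ci

fit <- prop.test(x, n, correct = FALSE)
c(z_squared = z_s^2, X_squared = unname(fit$statistic))
c(by_hand = p_value, prop_test = fit$p.value)
rbind(by_hand = ci, prop_test = as.numeric(fit$conf.int))
```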
Conditional probability
In probability theory, a conditional probability is the probability that an event will
occur, when another event is known to occur or to have occurred. If the events are A
and B respectively, this is said to be “the probability of A given B”. It is commonly
denoted by Pr(A|B). Pr(A|B) may or may not be equal to Pr(A), the probability of
A. If they are equal, A and B are said to be independent. For example, if a coin is
flipped twice, “the outcome of the second flip” is independent of “the outcome of the
first flip”.
In the Vitamin C example above, the unconditional observed probability of con-
tracting a cold is Pr(cold) = (17 + 31)/(139 + 140) = 0.172. The conditional observed
probabilities are Pr(cold|ascorbic acid) = 17/139 = 0.1223 and Pr(cold|placebo) =
31/140 = 0.2214. The two-sample test of H0 : p1 = p2 where p1 = Pr(cold|ascorbic acid)
and p2 = Pr(cold|placebo) is effectively testing whether Pr(cold) = Pr(cold|ascorbic acid) =
Pr(cold|placebo). This tests whether contracting a cold is independent of the vitamin
C treatment.
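These observed probabilities can be read off the two-by-two table with prop.table(). The matrix name colds and its dimension labels are chosen here for illustration.

```r
# Vitamin C two-by-two table of counts (rows: outcome, columns: treatment)
colds <- matrix(c(17, 122, 31, 109), nrow = 2,
                dimnames = list(Outcome   = c("cold", "no cold"),
                                Treatment = c("ascorbic acid", "placebo")))

# Unconditional observed probability of a cold
p_cold <- sum(colds["cold", ]) / sum(colds)

# Conditional observed probabilities: proportions within each column
p_cold_given_trt <- prop.table(colds, margin = 2)["cold", ]

round(c(overall = p_cold, p_cold_given_trt), 4)
```

Under independence the two conditional proportions would both be close to the overall proportion.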
² http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0002461/
Finally, plot the frequencies, and the proportions in three ways (the frequencies
can obviously be plotted in many ways, too).
# plots are easier now that data are in long format.
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Frequency, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity", position = "dodge")
p <- p + theme_bw()
p <- p + labs(title = "Frequency of HPV by Case/Control group")
print(p)
# bars, stacked
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
print(p)
# bars, dodged
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity", position = "dodge")
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
p <- p + scale_y_continuous(limits = c(0, 1))
print(p)
# lines are sometimes easier, especially when many categories along the x-axis
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, colour = HPV.Outcome))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_point(aes(shape = HPV.Outcome))
p <- p + geom_line(aes(linetype = HPV.Outcome, group = HPV.Outcome))
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
p <- p + scale_y_continuous(limits = c(0, 1))
print(p)
[Figure: four panels by Case/Control group, each with legend HPV.Outcome (Positive, Negative): Frequency as dodged bars; Proportion as stacked bars; Proportion as dodged bars; Proportion as points and lines.]
Returning to the hypothesis test, let p1 be the probability that a case is HPV
positive and let p2 be the probability that a control is HPV positive. The sample
sizes are n1 = 175 and n2 = 308. The sample proportions of positive cases and
controls are p̂1 = 164/175 = 0.937 and p̂2 = 130/308 = 0.422.
For a 95% CI
$$z_{crit}\,\mathrm{SE}_{CI}(\hat{p}_1 - \hat{p}_2) = 1.96 \sqrt{\frac{0.937 \times (1 - 0.937)}{175} + \frac{0.422 \times (1 - 0.422)}{308}} = 1.96 \times (0.0336) = 0.0659.$$
## 0.4492212 0.5809087
## sample estimates:
## prop 1 prop 2
## 0.9371429 0.4220779
Not surprisingly, a two-sided test at the 5% level would reject H0 : p1 = p2 . In
this problem one might wish to do a one-sided test, instead of a two-sided test. Let
us carry out this test, as a refresher on how to conduct one-sided tests.
# one-sided test, are cases more likely to be HPV positive?
prop.test(c(164, 130), c(175, 308), correct = FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(164, 130) out of c(175, 308)
## X-squared = 124.29, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.4598071 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.9371429 0.4220779
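The lower confidence bound in the one-sided output follows the same recipe as the two-sided CI, except that all of α is placed in one tail, so the critical value is z0.05 = 1.645 rather than z0.025 = 1.96. A sketch of the hand calculation, with names chosen for illustration:

```r
x <- c(164, 130)      # HPV positive: cases, controls
n <- c(175, 308)
p_hat    <- x / n
diff_hat <- p_hat[1] - p_hat[2]

# CI standard error based on the individual sample proportions
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))

# One-sided 95% lower bound: all of alpha in the lower tail
lower_bound <- diff_hat - qnorm(0.95) * se_ci

fit <- prop.test(x, n, correct = FALSE, alternative = "greater")
c(by_hand = lower_bound, prop_test = fit$conf.int[1])
```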
A standard measure of the difference between the exposed and non-exposed pop-
ulations is the absolute difference: p1 − p2 . We have discussed statistical methods
for assessing this difference.
In many epidemiological and biostatistical settings, other measures of the
difference between populations are considered. For example, the relative risk
$$RR = \frac{p_1}{p_2}$$
is commonly reported when the individual risks p1 and p2 are small. The odds ratio
$$OR = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}$$
is another standard measure. Here p1 /(1 − p1 ) is the odds of being diseased in the
exposed group, whereas p2 /(1 − p2 ) is the odds of being diseased in the non-exposed
group.
I mention these measures because you may see them or hear about them. Note
that each of these measures can be easily estimated from data, using the sample
proportions as estimates of the unknown population proportions. For example, in the
vitamin C study:
Outcome Ascorbic Acid Placebo
# with cold 17 31
# with no cold 122 109
Totals 139 140
the proportion with colds in the placebo group is p̂2 = 31/140 = 0.221. The propor-
tion with colds in the vitamin C group is p̂1 = 17/139 = 0.122.
The estimated absolute difference in risk is p̂1 − p̂2 = 0.122 − 0.221 = −0.099. The
estimated risk ratio and odds ratio are
$$\widehat{RR} = \frac{0.122}{0.221} = 0.55$$
and
$$\widehat{OR} = \frac{0.122/(1 - 0.122)}{0.221/(1 - 0.221)} = 0.49,$$
respectively.
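These estimates are easy to reproduce in R. The following is a minimal sketch that uses only the counts in the vitamin C table above; the object names (colds, n, p.hat, and so on) are my own choices for illustration.

```r
# Vitamin C study counts from the table above
colds <- c(ascorbic = 17, placebo = 31)    # number with a cold
n     <- c(ascorbic = 139, placebo = 140)  # group sizes

p.hat <- colds / n                         # sample proportions
odds  <- p.hat / (1 - p.hat)               # sample odds of a cold

risk.diff  <- unname(p.hat["ascorbic"] - p.hat["placebo"])
risk.ratio <- unname(p.hat["ascorbic"] / p.hat["placebo"])
odds.ratio <- unname(odds["ascorbic"] / odds["placebo"])

round(c(diff = risk.diff, RR = risk.ratio, OR = odds.ratio), 3)
```

Working from the unrounded proportions gives values that agree with the hand calculations to two decimal places.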
Interpreting odds ratios, two examples  Let’s begin with probability³. Let’s
say that the probability of success is 0.8, thus p = 0.8. Then the probability of
³ Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/stat/sas/faq/oratio.htm
A 95% CI for pA+ − p+A is (0.590 − 0.550) ± 0.019, or (0.021, 0.059). You are 95%
confident that the population proportion of voter-age Americans that approved of the
President’s performance the first month was between 0.021 and 0.059 larger than the
proportion that approved one month later. This gives evidence of a decrease in the
President’s approval rating.
A test of H0 : pA+ = p+A can be based on the CI for pA+ − p+A , or on a standard
normal approximation to the test statistic
$$z_s = \frac{\hat{p}_{A+} - \hat{p}_{+A}}{SE_{\text{test}}(\hat{p}_{A+} - \hat{p}_{+A})},$$
where the test standard error is given by
$$SE_{\text{test}}(\hat{p}_{A+} - \hat{p}_{+A}) = \sqrt{\frac{\hat{p}_{A+} + \hat{p}_{+A} - 2\hat{p}_{AA}}{n}}.$$
The test statistic is often written in the simplified form
$$z_s = \frac{n_{AD} - n_{DA}}{\sqrt{n_{AD} + n_{DA}}},$$
where the $n_{ij}$s are the observed cell counts. An equivalent form of this test, based
on comparing the square of zs to a chi-squared distribution with 1 degree of freedom,
is the well-known McNemar’s test for marginal homogeneity (or symmetry) in the
two-by-two table.
For example, in the Presidential survey
$$z_s = \frac{150 - 86}{\sqrt{150 + 86}} = 4.17.$$
The p-value for a two-sided test is, as usual, the area under the standard normal
curve outside ±4.17. The p-value is less than 0.001, suggesting that H0 is false.
R can perform this test as McNemar’s test.
#### Example, President performance
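# McNemar's test needs data as a matrix
# (counts taken from the printed table; rows = 1st survey, columns = 2nd survey)
pres <- matrix(c(794, 86, 150, 570), nrow = 2,
               dimnames = list("1st Survey" = c("Approve", "Disapprove"),
                               "2nd Survey" = c("Approve", "Disapprove")))
pres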
## 2nd Survey
## 1st Survey Approve Disapprove
## Approve 794 150
## Disapprove 86 570
mcnemar.test(pres, correct=FALSE)
##
## McNemar's Chi-squared test
##
## data: pres
## McNemar's chi-squared = 17.356, df = 1, p-value = 3.099e-05
# => significant change (in fact, drop) in approval ratings
## data: candeath
## X-squared = 197.62, df = 6, p-value < 2.2e-16
# The Pearson residuals
chisq.summary$residuals
## Location of death
## Age Home Acute Care Chronic care
## 15-54 0.3989527 2.3587229 -5.798909
## 55-64 0.2205584 2.5273526 -5.982375
## 65-74 1.1176594 -0.3297027 -0.500057
## 75+ -1.5530094 -3.6183388 9.946704
# The sum of the squared residuals is the chi-squared statistic:
chisq.summary$residuals^2
## Location of death
## Age Home Acute Care Chronic care
## 15-54 0.1591633 5.5635737 33.627351
## 55-64 0.0486460 6.3875111 35.788805
## 65-74 1.2491626 0.1087039 0.250057
## 75+ 2.4118382 13.0923756 98.936922
sum(chisq.summary$residuals^2)
## [1] 197.6241
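To see where these residuals come from, the sketch below rebuilds them by hand. It assumes the observed counts are the cell labels printed with the sieve plot later in this section, with columns ordered Home, Acute Care, Chronic care; the names obs, expected, pearson.resid, and chisq.stat are illustrative only.

```r
# Observed counts (assumed to be the cell labels shown with the sieve plot)
obs <- matrix(c( 94, 418,  23,
                116, 524,  34,
                156, 581, 109,
                138, 558, 238),
              nrow = 4, byrow = TRUE,
              dimnames = list(Age = c("15-54", "55-64", "65-74", "75+"),
                              Location = c("Home", "Acute Care", "Chronic care")))

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals, and the chi-squared statistic as their sum of squares
pearson.resid <- (obs - expected) / sqrt(expected)
chisq.stat    <- sum(pearson.resid^2)

round(pearson.resid, 2)
chisq.stat
```

The largest residuals in magnitude sit in the Chronic care column, which is the pattern the shaded plots highlight.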
A visualization of the Pearson residuals is available with a mosaic() plot in the vcd
package. Extended mosaic and association plots are each helpful methods of visualizing
complex data and evaluating deviations from a specified independence model. For
extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula,
condvar= is an optional conditioning variable, and data= specifies a data frame or a
table. Include shade=TRUE to color the figure, and legend=TRUE to display a legend
for the Pearson residuals.
# mosaic plot
library(vcd)
## Loading required package: grid
##
## Attaching package: ’vcd’
## The following object is masked from ’package:BSDA’:
##
## Trucks
mosaic(candeath, shade=TRUE, legend=TRUE)
# association plot
library(vcd)
assoc(candeath, shade=TRUE)
[Figure: mosaic plot (left) and association plot (right) of Age by Location of death (Home, Acute Care, Chronic care), shaded by Pearson residuals; the legends report p-value < 2.22e-16.]
The vcd package provides a variety of methods for visualizing multivariate categor-
ical data, inspired by Michael Friendly’s wonderful “Visualizing Categorical Data”.
For more details, see The Strucplot Framework⁴.
For example, a sieve plot for an n-way contingency table plots rectangles with
areas proportional to the expected cell frequencies and filled with a number of squares
equal to the observed frequencies. Thus, the densities visualize the deviations of the
observed from the expected values.
# sieve plot
library(vcd)
# plot observed table, then label cells with observed values in the cells
sieve(candeath, pop = FALSE, shade = TRUE)
labeling_cells(text = candeath, gp_text = gpar(fontface = 2))(as.table(candeath))
⁴ http://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdf
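[Figure: sieve plot of Age by Location of death, with each cell labeled by its observed count.]
Age      Home  Acute Care  Chronic care
15-54      94         418            23
55-64     116         524            34
65-74     156         581           109
75+       138         558           238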
design (where the two strata were NM and CO voters). Stratified designs provide
estimates for the strata (population) proportion in each of the categories. A test for
homogeneity of proportions is used to compare the strata.
In a cross-sectional design, individuals are randomly selected from a population
and classified by the levels of two categorical variables. With cross-sectional samples
you can test homogeneity of proportions by comparing either the row proportions or
by comparing the column proportions.
## Reaction
## Status Str. Dislike Dislike Neutral Like Str. Like
## Smoker 10.00794 14.37037 28.99735 21.04233 22.58201
## Non-smoker 28.99206 41.62963 84.00265 60.95767 65.41799
# Contribution to chi-squared statistic
chisq.summary$residuals^2
## Reaction
## Status Str. Dislike Dislike Neutral Like
## Smoker 0.4028612 0.009545628 1.2425876 8.514567e-05
## Non-smoker 0.1390660 0.003295110 0.4289359 2.939192e-05
## Reaction
## Status Str. Like
## Smoker 0.5681868
## Non-smoker 0.1961356
There are 10 possible comparisons here. The Bonferroni analysis with an over-
all Family Error Rate of 0.05 (or less) tests the 10 individual hypotheses at the
0.05/10=0.005 level.
nausea.table <- data.frame(Interval = rep(NA,10)
, CI.lower = rep(NA,10)
, CI.upper = rep(NA,10)
, Z = rep(NA,10)
, p.value = rep(NA,10)
, sig.temp = rep(NA,10)
, sig = rep(NA,10))
# row names for table
nausea.table[,1] <- c("p_PL - p_CH"
, "p_PL - p_DI"
, "p_PL - p_PE100"
, "p_PL - p_PE150"
, "p_CH - p_DI"
, "p_CH - p_PE100"
, "p_CH - p_PE150"
, "p_DI - p_PE100"
, "p_DI - p_PE150"
, "p_PE100 - p_PE150")
# test results together in a table
i.tab <- 0
for (i in 1:4) {
for (j in (i+1):5) {
i.tab <- i.tab + 1
nausea.summary <- prop.test(nausea[c(i,j),], correct = FALSE, conf.level = 1-0.05/10)
nausea.table[i.tab, 2:6] <- c(nausea.summary$conf.int[1]
, nausea.summary$conf.int[2]
, sign(-diff(nausea.summary$estimate)) * nausea.summary$statistic^0.5
, nausea.summary$p.value
, (nausea.summary$p.value < 0.05/10))
if (nausea.table$sig.temp[i.tab] == 1) { nausea.table$sig[i.tab] <- "*" }
else { nausea.table$sig[i.tab] <- " " }
}
}
The following table gives two-sample tests of proportions with nausea and 99.5%
CIs for the differences between the ten pairs of proportions. Only two p-values
are less than 0.005, corresponding to pPL − pCH and pCH − pDI . I am 99.5% confident
that pCH is between 0.084 and 0.389 less than pPL , and I am 99.5% confident that pCH
is between 0.086 and 0.453 less than pDI . The other differences are not significant.
Using ANOVA-type groupings, and arranging the treatments from most to least
effective (low proportions to high), we get:
CH (0.34) PE150 (0.44) PE100 (0.52) PL (0.58) DI (0.61)
---------------------------------------
---------------------------------------------------
Contents
8.1 Introduction
8.2 Logarithmic transformations
8.2.1 Log-linear and log-log relationships: amoebas, squares, and cubes
8.3 Testing that ρ = 0
8.3.1 The Spearman Correlation Coefficient
8.4 Simple Linear Regression
8.4.1 Linear Equation
8.4.2 Least Squares
8.5 ANOVA Table for Regression
8.5.1 Brief discussion of the output for blood loss problem
8.6 The regression model
8.6.1 Back to the Data
8.7 CI and tests for β1
8.7.1 Testing β1 = 0
8.8 A CI for the population regression line
8.8.1 CI for predictions
8.8.2 A further look at the blood loss data
8.9 Model Checking and Regression Diagnostics
8.9.1 Introduction
8.9.2 Residual Analysis
Learning objectives
After completing this topic, you should be able to:
select graphical displays that reveal the relationship between two continuous variables.
summarize model fit.
interpret model parameters, such as slope and $R^2$.
assess the model assumptions visually and numerically.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.
8.1 Introduction
Suppose we select n = 10 people from the population of college seniors who plan
to take the medical college admission test (MCAT) exam. Each takes the test, is
coached, and then retakes the exam. Let Xi be the pre-coaching score and let Yi
be the post-coaching score for the ith individual, i = 1, 2, . . . , n. There are several
questions of potential interest here, for example: Are Y and X related (associated),
and how? Does coaching improve your MCAT score? Can we use the data to develop
a mathematical model (formula) for predicting post-coaching scores from the pre-
coaching scores? These questions can be addressed using correlation and regression
models.
where
$$S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
is the sample covariance between Y and X, and $S_Y = \sqrt{\sum_i (Y_i - \bar{Y})^2/(n - 1)}$ and
$S_X = \sqrt{\sum_i (X_i - \bar{X})^2/(n - 1)}$ are the standard deviations for the Y and X samples.
Important properties of r:
1. −1 ≤ r ≤ 1.
2. If Yi tends to increase linearly with Xi then r > 0.
3. If Yi tends to decrease linearly with Xi then r < 0.
4. If there is a perfect linear relationship between Yi and Xi with a positive slope
then r = +1.
5. If there is a perfect linear relationship between Yi and Xi with a negative slope
then r = −1.
6. The closer the points (Xi , Yi ) come to forming a straight line, the closer r is to
±1.
7. The magnitude of r is unchanged if either the X or Y sample is transformed
linearly (such as feet to inches, pounds to kilograms, Celsius to Fahrenheit).
8. The correlation does not depend on which variable is called Y and which is
called X.
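Several of these properties can be checked numerically. The sketch below uses a small hypothetical sample (the x and y values are made up for illustration, not course data) and compares r computed from its definition with cor(), then checks properties 7 and 8.

```r
# Hypothetical paired measurements (illustration only)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9)
y <- c(2.0, 2.9, 4.2, 4.1, 6.3, 6.8, 8.1, 8.4)

# r from its definition: sample covariance over the product of the SDs
r.by.hand <- cov(x, y) / (sd(x) * sd(y))
r.builtin <- cor(x, y)

# Property 7: linear changes of units leave the magnitude of r unchanged
r.rescaled <- cor(32 + 1.8 * x, 2.2 * y)  # positive scale factors: same r
r.flipped  <- cor(-x, y)                  # negative scale factor: sign flips only

# Property 8: swapping the roles of X and Y leaves r unchanged
r.swapped <- cor(y, x)

c(by.hand = r.by.hand, builtin = r.builtin, rescaled = r.rescaled,
  flipped = r.flipped, swapped = r.swapped)
```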
If r is near ±1, then there is a strong linear relationship between Y and X in the
sample. This suggests we might be able to accurately predict Y from X with a linear
equation (i.e., linear regression). If r is near 0, there is a weak linear relationship
between Y and X, which suggests that a linear equation provides little help for
predicting Y from X. The pictures below should help you develop a sense about the
size of r.
Note that r = 0 does not imply that Y and X are not related in the sample. It
only implies they are not linearly related. For example, in the last plot r = 0 yet
$Y_i = X_i^2$, exactly.
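A quick check of this point, as a sketch: the grid of X values below (integers from −10 to 10) is an assumption chosen to be symmetric about zero, which is what makes the linear association vanish.

```r
# Y is an exact function of X, yet the linear correlation is essentially zero
x <- -10:10   # assumed symmetric grid of X values
y <- x^2      # perfect quadratic relationship

cor(x, y)
```

The relationship is perfectly predictable, but a straight line is no help in describing it.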
[Figure: scatterplots illustrating different sample correlations. The first four panels are titled Correlation=1, Correlation=-1, Correlation=.3, and Correlation=-.3; the last panel shows the exact relationship $Y_i = X_i^2$ with r = 0.]
Suppose you have the same example, but the amoeba takes three hours to divide
at each step. Then the number of amoebas y after time x has the equation $y = 2^{x/3} =
(2^{1/3})^x = 1.26^x$ or, on the logarithmic scale, $\log_{10}(y) = (\log_{10}(1.26))x = 0.10x$. The slope
of 0.10 is one-third the earlier slope of 0.30 because the population is growing at
one-third the rate.
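The two slopes are just base-10 logarithms of the growth factors. The lines below are a small check, assuming the earlier example doubles once per time unit; the object names are illustrative.

```r
# growth factor per time unit when a division takes three time units
factor.3h <- 2^(1/3)

# slopes on the log10 scale
slope.1h <- log10(2)          # one division per time unit (earlier example, assumed)
slope.3h <- log10(factor.3h)  # one division every three time units

round(c(factor.3h = factor.3h, slope.1h = slope.1h, slope.3h = slope.3h,
        ratio = slope.3h / slope.1h), 3)
```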
In the example of exponential growth of amoebas, y is logged while x remains the
same. For power-law relations, it makes sense to log both x and y. How does the
area of a square relate to its circumference (perimeter)? If the side of the square has
length L, then the area is $L^2$ and the circumference is 4L; thus
$$\text{area} = (\text{circumference}/4)^2.$$
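Taking logs of both sides gives log(area) = 2 log(circumference) − 2 log(4), a straight line with slope 2 on the log-log scale. The sketch below confirms this with a regression on a few hypothetical side lengths (the values of L are arbitrary).

```r
# Hypothetical squares with assorted side lengths
L <- c(1, 2, 5, 10, 20, 50)
perimeter <- 4 * L
area <- L^2

# On the log-log scale the power law becomes a straight line
fit <- lm(log10(area) ~ log10(perimeter))
coef(fit)       # slope is the exponent; intercept is -2 * log10(4)
-2 * log10(4)
```

The slope recovers the exponent of the power law, which is the reason for logging both variables.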
800 220
1200 360
1600 545
1800 900
1850 1200
1900 1625
1950 2500
1975 3900
2000 6080
2012 7000
", header=TRUE)
pop$Pop <- 1e6 * pop$Pop_M # convert from millions to individuals
pop$PopL10 <- log10(pop$Pop)
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1
, top = "Log-linear transformation: world population")
[Figure: "Log-linear transformation: world population". World population versus Year on the original scale (Pop, left) and on the log10 scale (PopL10, right).]
When using data of this nature, consider the source of the population numbers.
How would you estimate the population of the world in the year 1?
library(gdata)
## gdata: read.xls support for ’XLS’ (Excel 97-2004) files
## gdata: ENABLED.
##
## gdata: read.xls support for ’XLSX’ (Excel 2007+) files
## gdata: ENABLED.
##
## Attaching package: ’gdata’
## The following object is masked from ’package:gridExtra’:
##
## combine
² One of the world experts in allometric scaling is Prof. Jim Brown, UNM Biology, http://biology.unm.edu/jhbrown
³ White and Seymour (2003) PNAS, 10.1073/pnas.0436428100
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1
, top = "Log-log transformation: metabolic rates")
## Warning: Removed 56 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing non-finite values (stat smooth).
## Warning: Removed 56 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing non-finite values (stat smooth).
## Warning: Removed 1 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
[Figure: "Log-log transformation: metabolic rates". Three panels of basal metabolic rate versus body mass, with points colored by taxonomic Group: BaseMetRate versus BodyMass on the original scale; Log10BaseMetRate versus Log10BodyMass; and BaseMetRate versus BodyMass with log-scaled axes. The fitted line shown in the plot titles is log10(BaseMetRate) = 0.678 + 0.658 log10(BodyMass).]
The table below provides predictions over a range of scales. Note that the smallest
mammal in this dataset is about 2.4 grams. A 5-gram mammal uses about 13.7 Watts,
so 1000 5-gram mammals use about 13714 Watts, whereas one 5000-gram mammal
uses 1287 Watts. Thus, larger mammals give off less heat than the equivalent weight
of many smaller mammals.
pred.bm.bmr <- data.frame(BodyMass = 5 * c(1, 10, 100, 1000))
pred.bm.bmr$Log10BodyMass <- log10(pred.bm.bmr$BodyMass)
pred.bm.bmr$Log10BaseMetRate <- predict(lm.fit, pred.bm.bmr)
pred.bm.bmr$BaseMetRate <- 10^pred.bm.bmr$Log10BaseMetRate
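We want to focus on the slope in the log-log plot, which is the exponent in the
original-scale plot. On the log scale we have
$$\log_{10}(\text{BaseMetRate}) = 0.678 + 0.658 \log_{10}(\text{BodyMass}).$$
To interpret the slope, for each unit increase in (predictor, x-variable) log10(BodyMass),
the expected increase in (response, y-variable) log10(BaseMetRate) is 0.658. By
exponentiating on both sides, the expression on the original scale is
$$\text{BaseMetRate} = 10^{0.678} \times \text{BodyMass}^{0.658} = 4.76 \times \text{BodyMass}^{0.658}.$$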
For example, if you multiply body mass by 10, then you multiply metabolic rate by
$10^{0.658} = 4.55$. If you multiply body mass by 100, then you multiply metabolic rate
by $100^{0.658} = 20.7$, and so forth. The relation between metabolic rate and body mass
is less than linear (that is, the exponent 0.658 is less than 1.0, and the line in the
original-scale plot curves downward, not upward), which implies that the equivalent
mass of small mammals gives off more heat, and the equivalent mass of large mammals
gives off less heat.
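The arithmetic in this discussion can be reproduced from the fitted equation alone. The sketch below assumes the rounded coefficients 0.678 and 0.658 from the plot title, so results differ slightly from those based on the unrounded fit; pred.bmr() is a helper name I made up for illustration.

```r
# Rounded coefficients from the fitted equation in the plot title (assumed)
b0 <- 0.678
b1 <- 0.658

# predicted metabolic rate on the original scale for a given body mass (grams)
pred.bmr <- function(body.mass) 10^(b0 + b1 * log10(body.mass))

mass <- 5 * c(1, 10, 100, 1000)
data.frame(BodyMass = mass, BaseMetRate = round(pred.bmr(mass), 1))

# multiplying body mass by k multiplies metabolic rate by k^b1
c(times.10 = 10^b1, times.100 = 100^b1)

# 1000 five-gram mammals versus one 5000-gram mammal
c(many.small = 1000 * pred.bmr(5), one.large = pred.bmr(5000))
```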
This seems related to the general geometrical relation that surface area and volume
are proportional to linear dimension to the second and third power, respectively, and
thus surface area should be proportional to volume to the 2/3 power. Heat produced
by a mammal is emitted from its surface, and it would thus be reasonable to suspect
metabolic rate to be proportional to the 2/3 power of body mass. Biologists have
considered whether the empirical slope is closer to 3/4 or 2/3; the important thing
here is to think about log transformations and power laws (and have a chat with Jim
Brown or someone from his lab at UNM for the contextual details). As an aside,
something not seen from this plot is that males tend to be above the line and females
below the line.
$$t_s = r \sqrt{\frac{n - 2}{1 - r^2}},$$
then the test rejects H0 in favor of HA if |ts | ≥ tcrit , where tcrit is the two-sided test
critical value from a t-distribution with df = n − 2. The p-value for the test is the
area under the t-curve outside ±ts (i.e., two-tailed test p-value).
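As a check on the mechanics, the sketch below computes the test statistic and two-sided p-value by hand and compares them with cor.test(), which uses the Pearson correlation by default. The x and y values are hypothetical and serve only as an illustration.

```r
# Hypothetical sample (illustration only)
x <- c(35, 41, 44, 50, 52, 58, 63, 70)
y <- c(512, 495, 503, 490, 478, 485, 471, 476)

n <- length(x)
r <- cor(x, y)

# test statistic and two-sided p-value with df = n - 2
t.s     <- r * sqrt((n - 2) / (1 - r^2))
p.value <- 2 * pt(-abs(t.s), df = n - 2)

# the same test from cor.test()
ct <- cor.test(x, y)
c(t.by.hand = t.s, t.cor.test = unname(ct$statistic))
c(p.by.hand = p.value, p.cor.test = ct$p.value)
```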
This test assumes that the data are a random sample from a bivariate normal
population for (X, Y ). This assumption implies that all linear combinations of X
and Y , say aX + bY , are normal. In particular, the (marginal) population frequency
curves for X and Y are normal. At a minimum, you should make boxplots of the
X and Y samples to check marginal normality. For large-sized samples, a plot of
Y against X should be roughly an elliptical cloud, with the density of the points
decreasing as the points move away from the center of the cloud.
The Pearson correlation r can be highly influenced by outliers in one or both samples.
For example, r ≈ −1 in the plot below. If you delete the one extreme case with the
largest X and smallest Y value then r ≈ 0. The two analyses are contradictory. The
first analysis (ignoring the plot) suggests a strong linear relationship, whereas the
second suggests the lack of a linear relationship. I will not strongly argue that you
should (must?) delete the extreme case, but I am concerned about any conclusion
that depends heavily on the presence of a single observation in the data set.
[Figure: scatterplot of Y versus X in which most points form a cloud near Y = 0 and a single extreme case has the largest X and smallest Y.]
Example: Blood loss Eight patients underwent a thyroid operation. Three vari-
ables were measured on each patient: weight in kg, time of operation in minutes, and
blood loss in ml. The scientists were interested in the factors that influence blood
loss.
Below, we calculate the Pearson correlations between all pairs of variables (left),
as well as the p-values (right) for testing whether the correlation is equal to zero.
p.corr <- cor(thyroid);
#p.corr
Similarly, we calculate the Spearman (rank) correlation table (left), as well as the
p-values (right) for testing whether the correlation is equal to zero.
Here are scatterplots for the original data and the ranks of the data using ggpairs()
from the GGally package with ggplot2.
# Plot the data using ggplot
library(ggplot2)
library(GGally)
p1 <- ggpairs(thyroid[,1:3], progress=FALSE)
print(p1)
[Figure: scatterplot matrices from ggpairs() for weight, time, and blood_loss (left) and for their ranks rank_weight, rank_time, and rank_blood_loss (right). Displayed correlations: weight and time −0.0663, weight and blood_loss −0.772, time and blood_loss −0.107; for the ranks, 0.286, −0.874, and −0.156, respectively.]
Comments:
1. (Pearson correlations). Blood loss tends to decrease linearly as weight increases,
so r should be negative. The output gives r = −0.77. There is not much of
a linear relationship between blood loss and time, so r should be close to 0.
The output gives r = −0.11. Similarly, weight and time have a weak negative
correlation, r = −0.07.
2. The Pearson and Spearman correlations are fairly consistent here. Only the
correlation between blood loss and weight is significantly different from zero at
the α = 0.05 level (the p-values are given below the correlations).
3. (Spearman p-values) R gives the correct p-values. Calculating the p-value using
the Pearson correlation on the ranks is not correct, strictly speaking.
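The distinction in comment 3 can be seen in a small sketch. The data below are hypothetical (no ties), and the comparison assumes the default behavior of cor.test(), which for a small sample without ties computes the Spearman p-value from the exact distribution of the rank statistic rather than from a t approximation.

```r
# Hypothetical sample without ties (illustration only)
x <- c(35, 41, 44, 50, 52, 58, 63, 70)
y <- c(512, 495, 503, 490, 478, 485, 471, 476)

# Spearman correlation and test
sp <- cor.test(x, y, method = "spearman")

# Pearson correlation and t-test applied to the ranks
pe.ranks <- cor.test(rank(x), rank(y), method = "pearson")

# the estimates agree ...
c(spearman = unname(sp$estimate), pearson.on.ranks = unname(pe.ranks$estimate))
# ... but the p-values need not, because they use different reference distributions
c(spearman = sp$p.value, pearson.on.ranks = pe.ranks$p.value)
```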
[Figure: two scatterplots of Y versus X.]
over all possible choices of β0 and β1 . These values can be obtained using calculus.
Rather than worry about this calculation, note that the LS line makes the sum of
squared (vertical) deviations between the responses Yi and the line as small as possible,
over all possible lines. The LS line goes through the mean point, (X̄, Ȳ ), which is
typically in “the heart” of the data, and is often closely approximated by an
eyeball fit to the data.
[Figure: scatterplot of Y versus X with the least squares line.]
ŷ = b0 + b1 X
# Base graphics: Plot the data with the linear regression fit
# least squares fit of blood loss on weight
lm.blood.wt <- lm(blood_loss ~ weight, data = thyroid)
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)
[Figure: blood_loss versus weight with the fitted regression line, drawn with ggplot (left) and with base graphics (right).]
For the thyroid operation data with Y = Blood loss in ml and X = Weight
in kg, the LS line is Ŷ = 552.44 − 1.30X, or Predicted Blood Loss = 552.44 −
1.30 Weight. For an 86kg individual, the Predicted Blood Loss = 552.44 − 1.30 × 86 =
440.64ml.
The LS regression coefficients for this model are interpreted as follows. The inter-
cept b0 is the predicted blood loss for a 0 kg individual. The intercept has no meaning
here. The slope b1 is the predicted increase in blood loss for each additional kg of
weight. The slope is −1.30, so the predicted decrease in blood loss is 1.30 ml for each
increase of 1 kg in weight.
Any fitted linear relationship holds only approximately and does not necessarily
extend outside the range of the data. In particular, nonsensical predicted blood losses
of less than zero are obtained at very large weights outside the range of data.
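A short sketch of these calculations, using the rounded LS coefficients quoted above; pred.blood.loss() is a helper name introduced only for this illustration.

```r
# Rounded LS coefficients from the fitted line above
b0 <- 552.44   # intercept (ml)
b1 <- -1.30    # slope (ml per kg)

pred.blood.loss <- function(weight) b0 + b1 * weight

pred.blood.loss(86)                 # the 86 kg prediction from the text
pred.blood.loss(c(40, 50, 60, 70))  # predictions for a few other weights

# weight at which the fitted line reaches zero blood loss
-b0 / b1
```

The zero crossing is at a weight far larger than any realistic patient, which is a reminder not to trust the line outside the range of the data.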
be the predicted or fitted Y -value for an X-value of Xi and let ei = Yi − Ŷi . The
fitted value Ŷi is the value of the LS line at Xi whereas the residual ei is the distance
that the observed response Yi is from the LS line. Given this notation,
$$\text{Residual Sums of Squares} = \text{Res SS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2.$$
[Figure: scatterplot with the LS line, marking the Response, the Fitted value, and the residual at a given X-val.]
The Residual SS, or sum of squared residuals, is small if each Ŷi is close to Yi (i.e.,
the line closely fits the data). It can be shown that
$$\text{Total SS in } Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \geq \text{Res SS} \geq 0.$$
Also define
$$\text{Regression SS} = \text{Reg SS} = \text{Total SS} - \text{Res SS} = b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}).$$
$$R^2 = \text{coefficient of determination} = \frac{\text{Reg SS}}{\text{Total SS}}.$$
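The sums of squares can be computed directly from a fitted model. The following sketch uses a hypothetical sample (the x and y values are made up) and checks the two expressions for Reg SS along with the usual identities for $R^2$ in simple linear regression.

```r
# Hypothetical sample (illustration only)
x <- c(35, 41, 44, 50, 52, 58, 63, 70)
y <- c(512, 495, 503, 490, 478, 485, 471, 476)

fit <- lm(y ~ x)
b1  <- coef(fit)[["x"]]

total.ss <- sum((y - mean(y))^2)   # Total SS in Y
res.ss   <- sum(resid(fit)^2)      # Res SS: sum of squared residuals
reg.ss   <- total.ss - res.ss      # Reg SS

# alternative expression for Reg SS
reg.ss.alt <- b1 * sum((y - mean(y)) * (x - mean(x)))

# coefficient of determination, three ways
c(by.hand = reg.ss / total.ss,
  summary = summary(fit)$r.squared,
  r.squared = cor(x, y)^2)
```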
[Figure: scatterplot with axes labeled "Variation in X" and "Variation in Y".]
Furthermore,
[Figure: scatterplot of Y versus X.]
we can think that the Xi s were fixed by the experimenter, and that the Yi are
random responses at the selected predictor values.
Yi = β0 + β1 Xi + εi
(i.e., Response = Mean Response + Residual), where the $\varepsilon_i$s are, by virtue of
assumptions 2, 3, and 4, independent normal random variables with mean 0 and variance
$\sigma^2_{Y|X}$. The following picture might help see this. Note that the population regression
line is unknown, and is estimated from the data using the LS line.
[Figure: two plots of Y versus X illustrating the population regression line; the right panel marks a response Y_i at X_i and its error epsilon_i.]
1. Validity. Most importantly, the data you are analyzing should map to the
research question you are trying to answer. This sounds obvious but is often
overlooked or ignored because it can be inconvenient.
5. Normality of errors.
Normality and equal variance are typically minor concerns, unless you’re using the
model to make predictions for individual data points.
$$s^2_{Y|X} = \text{Res MS} = \frac{\text{Res SS}}{\text{Res df}} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2}.$$
and where tcrit is the appropriate critical value for the desired CI level from a t-
distribution with df =Res df .
To test $H_0: \beta_1 = \beta_{10}$ (a given value) against $H_A: \beta_1 \neq \beta_{10}$, reject $H_0$ if $|t_s| \geq t_{\text{crit}}$,
where
$$t_s = \frac{b_1 - \beta_{10}}{SE_{b_1}},$$
and tcrit is the t-critical value for a two-sided test, with the desired size and df =Res
df . Alternatively, you can evaluate a p-value in the usual manner to make a decision
about H0 .
# CI for beta1
sum.lm.blood.wt <- summary(lm.blood.wt)
sum.lm.blood.wt$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.442023 21.4408832 25.765824 2.253105e-07
## weight -1.300327 0.4364156 -2.979562 2.465060e-02
est.beta1 <- sum.lm.blood.wt$coefficients[2,1]
se.beta1 <- sum.lm.blood.wt$coefficients[2,2]
sum.lm.blood.wt$fstatistic
## value numdf dendf
## 8.877788 1.000000 6.000000
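The interval itself is the estimate plus or minus the t critical value times the standard error. Here is a minimal sketch that hard-codes the estimate, standard error, and residual df from the output above so that it runs on its own; with the fitted model available, confint(lm.blood.wt) should give the same interval.

```r
# values copied from the coefficient table and F-statistic output above
est.beta1 <- -1.300327
se.beta1  <- 0.4364156
res.df    <- 6          # Res df = n - 2 with n = 8 patients

# 95% CI: estimate +/- t critical value * standard error
t.crit   <- qt(1 - 0.05 / 2, df = res.df)
ci.beta1 <- est.beta1 + c(-1, 1) * t.crit * se.beta1
ci.beta1
```

The interval lies entirely below zero, in agreement with the two-sided test of zero slope at the 5% level.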
8.7.1 Testing β1 = 0
Assuming the mean relationship is linear, consider testing $H_0: \beta_1 = 0$ against $H_A: \beta_1 \neq 0$.
This test can be conducted using a t-statistic, as outlined above, or with an
ANOVA F -test, as outlined below.
For the analysis of variance (ANOVA) F -test, compute
$$F_s = \frac{\text{Reg MS}}{\text{Res MS}}$$
and reject H0 when Fs exceeds the critical value (for the desired size test) from an F -
table with numerator df = 1 and denominator df = n − 2 (see qf()). The hypothesis
of zero slope (or no relationship) is rejected when Fs is large, which happens when a
significant portion of the variation in Y is explained by the linear relationship with
X.
The p-values from the t-test and the F -test are always equal. Furthermore this
p-value is equal to the p-value for testing no correlation between Y and X, using the
t-test described earlier. Is this important, obvious, or disconcerting?
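One way to see why the p-values agree is that, in simple linear regression, the F-statistic is the square of the slope t-statistic. The sketch below checks this with the blood loss values reported above; the correlation r = −0.772 is the rounded value from the scatterplot matrix, so the last comparison is only approximate.

```r
# values copied from the blood loss output
t.s    <- -2.979562   # t value for the weight coefficient
F.s    <- 8.877788    # F-statistic
res.df <- 6           # denominator df = n - 2

# F equals t squared (up to rounding of the printed values)
c(t.squared = t.s^2, F = F.s)

# identical p-values from the t and F reference distributions
p.t <- 2 * pt(-abs(t.s), df = res.df)
p.F <- pf(F.s, df1 = 1, df2 = res.df, lower.tail = FALSE)
c(p.t = p.t, p.F = p.F)

# t-statistic for testing zero correlation, using the rounded r
r <- -0.772
n <- 8
t.r <- r * sqrt((n - 2) / (1 - r^2))
c(t.slope = t.s, t.correlation = t.r)
```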
Xp is not necessarily one of the observed Xi s in the data. To get a CI for µp , use
Ŷp ± tcrit SE(Ŷp ), where the standard error of Ŷp is
$$SE(\hat{Y}_p) = s_{Y|X} \sqrt{\frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}}.$$
and tcrit is identical to the critical value used for a CI on β1 . The prediction variance
has two parts: (1) the 1 indicates the variability associated with the data around the
mean (regression line), and (2) the rest is the variability associated with estimating
the mean.
For example, in the blood loss problem you may want to estimate the blood loss
for a 50kg individual, and to get a CI for this prediction. This problem is different
from computing a CI for the mean blood loss of all 50kg individuals!
# CI for the mean and PI for a new observation at weight=50
predict(lm.blood.wt, data.frame(weight=50), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 487.4257 477.1575 497.6938
predict(lm.blood.wt, data.frame(weight=50), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 487.4257 457.098 517.7533
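The two intervals share the same center and critical value, so their widths reveal the two standard errors. The sketch below backs these out from the printed limits, assuming the standard relationship that the squared prediction standard error equals $s^2_{Y|X}$ plus the squared standard error of the estimated mean; the object names are illustrative.

```r
# interval limits copied from the predict() output above (weight = 50)
ci.mean <- c(lwr = 477.1575, upr = 497.6938)  # CI for the mean response
pi.new  <- c(lwr = 457.098,  upr = 517.7533)  # prediction interval
res.df  <- 6                                  # Res df = n - 2 with n = 8

t.crit <- qt(0.975, df = res.df)

# half-width = t.crit * standard error, so solve for each standard error
se.mean <- unname(diff(ci.mean)) / (2 * t.crit)
se.pred <- unname(diff(pi.new)) / (2 * t.crit)

# assumed relationship: se.pred^2 = s.yx^2 + se.mean^2
s.yx <- sqrt(se.pred^2 - se.mean^2)
c(se.mean = se.mean, se.pred = se.pred, s.yx = s.yx)
```

Most of the prediction uncertainty here comes from the scatter of individuals about the line rather than from estimating the line.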
Comments
1. The prediction interval is wider than the CI for the mean response. This is
reasonable because you are less confident in predicting an individual response
than the mean response for all individuals.
2. The CI for the mean response and the prediction interval for an individual
response become wider as Xp moves away from X̄. That is, you get a more
sensitive CI and prediction interval for Xp s near the center of the data.
3. The plots below show confidence bands along with the fitted LS
line.
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid, aes(x = weight, y = blood_loss))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = TRUE)
print(p)
# Base graphics: Plot the data with linear regression fit and confidence bands
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)
# x values of weight for predictions of confidence bands
x.pred <- data.frame(weight = seq(min(thyroid$weight), max(thyroid$weight),
length.out = 20))
# draw upper and lower confidence bands
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "upr"], col = "blue")
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "lwr"], col = "blue")
[Figure: blood_loss versus weight with the fitted LS line and confidence bands, drawn with ggplot (left) and with base graphics (right).]
8.9.1 Introduction
The simple linear regression model is usually written as
Yi = β0 + β1 Xi + εi
where the $\varepsilon_i$s are independent normal random variables with mean 0 and variance
$\sigma^2$. The model implies (1) The average Y -value at a given X-value is linearly related
to X. (2) The variation in responses Y at a given X value is constant. (3) The
population of responses Y at a given X is normally distributed. (4) The observed
data are a random sample.
A regression analysis is never complete until these assumptions have been checked.
In addition, you need to evaluate whether individual observations, or groups of ob-
servations, are unduly influencing the analysis. A first step in any analysis is to
plot the data. The plot provides information on the linearity and constant variance
assumption.
[Figure: four scatterplots of Y versus X, panels (a) to (d).]
Figure (a) is the only plot that is consistent with the assumptions. The plot
shows a linear relationship with constant variance. The other figures show one or
more deviations. Figure (b) shows a linear relationship but the variability increases
as the mean level increases. In Figure (c) we see a nonlinear relationship with constant
variance, whereas (d) exhibits a nonlinear relationship with non-constant variance.
In many examples, nonlinearity or non-constant variability can be addressed by
transforming Y or X (or both), or by fitting polynomial models. These issues
will be addressed later.
[Figure: scatterplot of Y versus X (left) and residuals versus fitted values (right).]
The following sequence of plots shows how inadequacies in the data plot appear
in a residual plot. The first plot shows a roughly linear relationship between Y and
X with non-constant variance. The residual plot shows a megaphone shape rather
than the ideal horizontal band. A possible remedy is a weighted least squares
analysis to handle the non-constant variance (see end of chapter for an example),
or to transform Y to stabilize the variance. Transforming the data may destroy the
linearity.
[Figure: two pairs of plots, each showing Y versus X and residuals versus fitted values.]
The plot above shows a nonlinear relationship between Y and X. The residual
plot shows a systematic dependence of the sign of the residual on the fitted value.
Possible remedies were mentioned earlier.
The plot below shows an outlier. This case has a large residual and large stu-
dentized residual. A sensible approach here is to refit the model after holding out the
case to see if any conclusions change.
[Figure: scatterplot of Y versus X with one outlying case (left) and residuals versus fitted values (right).]
without using this observation to construct the fit. It is quite possible for the deleted
residual to be huge when the raw residual is tiny.
The studentized deleted residual for the ith observation is calculated by fitting the
regression based on all of the cases except the ith one. The deleted residual is then divided
by its estimated standard deviation. Since the studentized deleted residual for the
ith observation estimates all quantities with this observation deleted from the data
set, the ith observation cannot influence these estimates. Therefore, unusual Y values
clearly stand out, and studentized deleted residuals with large absolute values flag
such observations. If the regression model is appropriate, with no outlying observations,
each studentized deleted residual follows the t-distribution with n − 1 − p degrees of
freedom.
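The definition above can be checked numerically. The sketch below uses R's built-in cars data purely as a stand-in (it is not a data set from these notes): it refits the model without each case in turn, divides the deleted residual by its estimated standard deviation, and compares the result with rstudent(), which should return the same quantity without any refitting.

```r
# Stand-in data: built-in 'cars' (speed, dist); not part of these notes
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)

# leave-one-out computation, following the definition in the text
t.del <- rep(NA, n)
for (i in 1:n) {
  fit.i  <- lm(dist ~ speed, data = cars[-i, ])   # fit without case i
  x.i    <- c(1, cars$speed[i])                    # design row for case i
  pred.i <- sum(coef(fit.i) * x.i)                 # prediction for the held-out case
  d.i    <- cars$dist[i] - pred.i                  # deleted residual
  XtXinv <- summary(fit.i)$cov.unscaled            # (X'X)^{-1} without case i
  se.i   <- summary(fit.i)$sigma *
            sqrt(1 + as.numeric(t(x.i) %*% XtXinv %*% x.i))
  t.del[i] <- d.i / se.i                           # studentized deleted residual
}

# built-in shortcut
t.builtin <- unname(rstudent(fit))
max(abs(t.del - t.builtin))   # should be essentially zero
```

In practice only the rstudent() call is needed; the loop is there to show what it computes.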
library(ggplot2)
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, ncol=2
, top = "Nonconstant variance vs sample size")
[Figure: "Nonconstant variance vs sample size", four panels of y versus x; recoverable panel titles: "Different variance, constant sample size" and "Different variance, different sample sizes".]
8.9.5 Outliers
Outliers are observations that are poorly fitted by the regression model. The response
for an outlier is far from the fitted line, so outliers have large positive or negative values
of the studentized residual ri . Usually, |ri | > 2 is considered large. Outliers are often
highlighted in residual plots.
What do you do with outliers? Outliers may be due to incorrect recordings of
the data or failure of the measuring device, or indications of a change in the mean
or variance structure for one or more cases. Incorrect recordings should be fixed if
possible, but otherwise deleted from the analysis.
Routine deletion of outliers from the analysis is not recommended. This practice
can have a dramatic effect on the fit of the model and the perceived precision of
parameter estimates and predictions. Analysts who routinely omit outliers without
cause tend to overstate the significance of their findings and get a false sense of
precision in their estimates and predictions. To assess effects of outliers, a data analyst
should repeat the analysis holding out the outliers to see whether any substantive
conclusions are changed. Very often the only real effect of an outlier is to inflate MSE
and hence make p-values a little larger and CIs a little wider than necessary, but
without substantively changing conclusions. They can completely mask underlying
patterns, however.
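A sketch of the hold-out comparison described above, again using the built-in cars data as a stand-in. It flags cases with |r_i| > 2 using rstandard() (R's internally studentized residuals, assumed here to play the role of r_i) and then refits without them so the two fits can be compared side by side.

```r
# Stand-in data: built-in 'cars'; the |r_i| > 2 rule is the one quoted in the text
fit.all <- lm(dist ~ speed, data = cars)
r       <- rstandard(fit.all)          # internally studentized residuals
out.id  <- which(abs(r) > 2)           # cases flagged as outliers
out.id

# refit holding out the flagged cases (if any) and compare the two fits
fit.hold <- if (length(out.id) > 0) {
  lm(dist ~ speed, data = cars[-out.id, ])
} else {
  fit.all
}
comp <- rbind(all      = c(coef(fit.all),  sigma = summary(fit.all)$sigma),
              held.out = c(coef(fit.hold), sigma = summary(fit.hold)$sigma))
comp
```

If the coefficients barely move and only sigma shrinks, the outliers were mostly inflating MSE rather than changing the conclusions.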
[Figure: two scatterplots of Y versus X, each with one extreme observation.]
In the second plot, the extreme value is a high leverage value, which is basically
an outlier among the X values; Y does not enter in this calculation. This influential
observation is not an outlier because its presence in the analysis determines that the
LS line will essentially pass through it! These are values with the potential of greatly
distorting the fitted model. They may or may not actually have distorted it.
The hat variable from the influence() function on the object returned from lm()
fit will give the leverages: influence(lm.output)$hat. Leverage values fall between
0 and 1. Experts consider a leverage value greater than 2p/n or 3p/n, where p is the
number of predictors or factors plus the constant and n is the number of observations,
large and suggest you examine the corresponding observation. A rule-of-thumb is to
identify observations with leverage over 3p/n or 0.99, whichever is smaller.
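The leverage rule of thumb can be sketched as follows, with the built-in cars data as a stand-in for lm.output. The object names (h, cutoff, high.lev) are illustrative only.

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in for lm.output
h   <- influence(fit)$hat              # leverages, as described in the text
p   <- length(coef(fit))               # predictors plus the constant
n   <- nrow(cars)

# rule of thumb from the text: leverage over 3p/n or 0.99, whichever is smaller
cutoff   <- min(3 * p / n, 0.99)
high.lev <- which(h > cutoff)

c(sum.h = sum(h), cutoff = cutoff)     # the leverages always sum to p
high.lev
```

hatvalues(fit) returns the same leverages and is a common shorthand for influence(fit)$hat.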
Dennis Cook developed a measure of the impact that individual cases have on the
placement of the LS line. His measure, called Cook’s distance or Cook’s D, provides
a summary of how far the LS line changes when each individual point is held out (one
at a time) from the analysis. While high leverage values indicate observations that
have the potential of causing trouble, those with high Cook’s D values actually do
disproportionately affect the overall fit. The case with the largest D has the greatest
impact on the placement of the LS line. However, the actual influence of this case
may be small. In the plots above, the observations I focussed on have the largest
Cook’s Ds.
A simple, but not unique, expression for Cook's distance for the jth case is
$$D_j \propto \sum_i (\hat{Y}_i - \hat{Y}_{i[-j]})^2,$$
where $\hat{Y}_{i[-j]}$ is the fitted value for the ith case when the LS line is computed from all
the data except case j. Here $\propto$ means that $D_j$ is a multiple of $\sum_i (\hat{Y}_i - \hat{Y}_{i[-j]})^2$ where
the multiplier does not depend on the case. This expression implies that $D_j$ is also
an overall measure of how much the fitted values change when case j is deleted.
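To make the proportionality concrete, the sketch below computes the sum of squared changes in the fitted values by brute force and scales it by 1/(p × MSE), which is the multiplier R's cooks.distance() is understood to use. The cars data are a stand-in, not data from these notes.

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in data
n   <- nrow(cars)
p   <- length(coef(fit))
s2  <- summary(fit)$sigma^2            # MSE from the full fit

# sum of squared changes in all n fitted values when case j is held out
ss.change <- sapply(1:n, function(j) {
  fit.j <- lm(dist ~ speed, data = cars[-j, ])
  sum((fitted(fit) - predict(fit.j, newdata = cars))^2)
})

# scale by 1 / (p * MSE); this multiplier does not depend on j
D.manual <- ss.change / (p * s2)
max(abs(D.manual - cooks.distance(fit)))   # should be essentially zero
```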
Observations with large D values may be outliers. Because D is calculated us-
ing leverage values and standardized residuals, it considers whether an observation
is unusual with respect to both x- and y-values. To interpret D, compare it to the
F -distribution with (p, n − p) degrees-of-freedom to determine the corresponding per-
centile. If the percentile value is less than 10% or 20%, the observation has little
influence on the fitted values. If the percentile value is greater than 50%, the obser-
vation has a major influence on the fitted values and should be examined.
Many statisticians make it a lot simpler than this sounds and use 1 as a cutoff
value for large Cook’s D (when D is on the appropriate scale). Using the cutoff of 1
can simplify an analysis, since frequently one or two values will have noticeably larger
D values than other observations without actually having much effect, but it can be
important to explore any observations that stand out. Cook’s distance values for each
observation from a linear regression fit are given with cooks.distance(lm.output).
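Both ways of judging Cook's D (the F-distribution percentile and the simple cutoff of 1) can be tabulated together. This is a sketch on the built-in cars data; the column names in diag.tab are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in for lm.output
D   <- cooks.distance(fit)
n   <- nrow(cars)
p   <- length(coef(fit))

# percentile of each D in the F(p, n - p) distribution
D.pct <- pf(D, df1 = p, df2 = n - p)

diag.tab <- data.frame(D          = D,
                       pct        = round(100 * D.pct, 1),
                       over.1     = D > 1,        # simple cutoff
                       over.50pct = D.pct > 0.5)  # "major influence" guideline
# the five cases with the largest Cook's D
head(diag.tab[order(-diag.tab$D), ], 5)
```

Sorting by D puts the cases worth holding out at the top, whichever guideline is used.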
Given a regression problem, you should locate the points with the largest Dj s and
see whether holding these cases out has a decisive influence on the fit of the model or
the conclusions of the analysis. You can examine the relative magnitudes of the Dj s
across cases without paying much attention to the actual value of Dj , but there are
guidelines (see below) on how large Dj needs to be before you worry about it.
It is difficult to define a good strategy for dealing with outliers and influential
observations. Experience is the best guide. I will show you a few examples that
highlight some standard phenomena. One difficulty you will find is that certain
observations may be outliers because other observations are influential, or vice-versa.
If an influential observation is held out, an outlier may remain an outlier, may become
[Figure: scatterplots of Y versus X.]
Many researchers are hesitant to delete points from an analysis. I think this view
is myopic, and in certain instances, such as the Gesell example to be discussed, can
not be empirically supported. Being rigid about this can lead to some silly analyses
of data, but one needs a very good reason and full disclosure if any points are deleted.
megaphone pattern in residuals vs. fits is the classic (not the only) pattern
to look for. Weighted least squares or transformations may be called for.
(d) Do you see obvious outliers? Make sure you do not have a misrecorded
data value. It might be worth refitting the equation without the outlier to
see if it affects conclusions substantially.
(e) Is the normality assumption reasonable? This can be very closely related
to the preceding points.
(f) Is there a striking pattern in residuals vs. order of the data? This can be
an indication that the independence assumption is not valid.
5. Check the Cook’s D values. The Cook’s distance plot and Residuals vs.
Leverage (with Cook’s D) plot are both helpful.
6. If you found problem observations, omit them from the analysis and see if any
conclusions change substantially. There are two good ways to do this.
(a) Subset the data.frame using subset().
(b) Use lm() with the weights= option with weights of 0 for the excluded
observations, weights of 1 for those included.
You may need to repeat all these steps many times for a complete analysis.
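The two ways of omitting observations in step 6 should give the same fit. A minimal sketch, using the built-in cars data and a hypothetical problem observation (case 49, chosen only for illustration):

```r
dat     <- cars                 # stand-in data
drop.id <- 49                   # hypothetical problem observation
dat$wt  <- 1
dat$wt[drop.id] <- 0            # weight 0 excludes, weight 1 includes

# (a) subset the data.frame
fit.sub <- lm(dist ~ speed, data = subset(dat, wt == 1))
# (b) keep all rows but give the excluded case zero weight
fit.wt  <- lm(dist ~ speed, data = dat, weights = wt)

rbind(subset = coef(fit.sub), weights = coef(fit.wt))
c(subset = df.residual(fit.sub), weights = df.residual(fit.wt))
```

One practical difference: with the weights approach, residuals(fit.wt) still has one entry per row of the full data, including the excluded case, so it may need to be indexed by dat$wt == 1 before plotting.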
[Figure: blood_loss versus weight for the thyroid data, points labeled with observation id.]
Clearly the heaviest individual is an unusual value that warrants a closer look
(maybe data recording error). I might be inclined to try a transformation here
(such as log(weight)) to make that point a little less influential.
2. Do any obvious transformations of the data. We will look at transformations
later.
3. Fit the least squares equation. Blood Loss appears significantly negatively as-
sociated with weight.
lm.blood.wt <- lm(blood_loss ~ weight, data = thyroid)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.blood.wt)
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.565 -6.189 4.712 8.192 9.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.4420 21.4409 25.77 2.25e-07 ***
## weight -1.3003 0.4364 -2.98 0.0247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.66 on 6 degrees of freedom
## Multiple R-squared: 0.5967,Adjusted R-squared: 0.5295
## F-statistic: 8.878 on 1 and 6 DF, p-value: 0.02465
(a) Graphs: Check Standardized Residuals (or the Deleted Residuals). The
residual plots:
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt, which = c(1,4,6))
# residuals vs weight
plot(thyroid$weight, lm.blood.wt$residuals, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
# qq plot of the residuals
# las = 1 : turns labels on y-axis to read horizontally
# id = list(n = 3) : labels the 3 most extreme observations, and outputs to console
qqPlot(lm.blood.wt$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 8 2 4
# residuals vs order of data
plot(lm.blood.wt$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.blood.wt: residuals vs fitted, Cook's distance, Cook's distance vs leverage, residuals vs weight, QQ plot, and residuals vs order of data.]
What changes by deleting case 3? The fitted line gets steeper (slope changes
from −1.30 to −2.19), adjusted $R^2$ gets larger (up to 58% from 53%), and S
changes from 11.7 to 10.6. Because the Weight values are much less spread
out, $SE(\hat{\beta}_1)$ becomes quite a bit larger (to 0.714, up from 0.436) and we lose a
degree of freedom for MS Error (which will penalize us on tests and CIs). Just
about any quantitative statement we would want to make using CIs would be
about the same either way since CIs will overlap a great deal, and our quali-
tative interpretations are unchanged (Blood Loss drops with Weight). Unless
something shows up in the plots, I don’t see any very important changes here.
# exclude obs 3
thyroid.no3 <- subset(thyroid, wt == 1)
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid.no3, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)
[Figure: blood_loss versus weight with observation 3 excluded, points labeled with observation id, with regression line and confidence band.]
Nothing very striking shows up in the residual plots, and no Cook’s D values
are very large among the remaining observations.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt.no3, which = c(1,4,6))
# residuals vs weight
plot(thyroid.no3$weight, lm.blood.wt.no3$residuals[(thyroid$wt == 1)]
, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
# qq plot of the residuals
# las = 1 : turns labels on y-axis to read horizontally
# id = list(n = 3) : labels the 3 most extreme observations, and outputs to console
qqPlot(lm.blood.wt.no3$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 3 8 2
# residuals vs order of data
plot(lm.blood.wt.no3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.blood.wt.no3: residuals vs fitted, Cook's distance, Cook's distance vs leverage, residuals vs weight, QQ plot, and residuals vs order of data.]
How much difference is there in a practical sense? Examine the 95% prediction
interval for a new observation at Weight = 50kg. Previously we saw that interval
based on all 8 observations was from 457.1 to 517.8 ml of Blood Loss. Based on just
the 7 observations the prediction interval is 451.6 to 512.4 ml. There really is no
practical difference here.
# PI for a new observation at weight=50, from each fit
predict(lm.blood.wt , data.frame(weight=50), interval = "prediction")
## fit lwr upr
## 1 487.4257 457.098 517.7533
predict(lm.blood.wt.no3, data.frame(weight=50), interval = "prediction")
## Warning in predict.lm(lm.blood.wt.no3, data.frame(weight = 50), interval = "prediction"):
Assuming constant prediction variance even though model fit is weighted
## fit lwr upr
## 1 481.9939 451.5782 512.4096
Therefore, while obs. 3 was potentially influential, whether the value is included
or not makes very little difference in the model fit or relationship between Weight
and BloodLoss.
id age score
1 1 15 95
2 2 26 71
3 3 10 83
4 4 9 91
5 5 15 102
6 6 20 87
7 7 18 93
8 8 11 100
9 9 8 104
10 10 20 94
11 11 7 113
12 12 9 96
13 13 10 83
14 14 11 84
15 15 11 102
16 16 10 100
17 17 12 105
18 18 42 57
19 19 17 121
20 20 11 86
21 21 10 100
1. Plot Score versus Age. Comment on the relationship between Score and Age.
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(gesell, aes(x = age, y = score, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)
[Figure: score versus age for the Gesell data, points labeled with observation id, with regression line and confidence band.]
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.score.age, which = c(1,4,6))
# residuals vs age
plot(gesell$age, lm.score.age$residuals, main="Residuals vs age")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.score.age$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 19 3 13
# residuals vs order of data
plot(lm.score.age$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.score.age: residuals vs fitted, Cook's distance, Cook's distance vs leverage, residuals vs age, QQ plot, and residuals vs order of data.]
5. Observations 18 and 19 stand out with relatively high Cook’s D. The cutoff
line is only a rough guideline. Those two were flagged with high influence and
standardized residual, respectively, also. Be sure to examine the scatter plot
carefully to see why 18 and 19 stand out.
6. Consider doing two additional analyses: Analyze the data after omitting case
18 only and analyze the data after omitting case 19 only. Refit the regression
model for each of these two scenarios. Provide a summary table such as the
following, giving the relevant summary statistics for the three analyses. Discuss
the impact that observations 18 and 19 have individually on the fit of the model.
When observation 18 is omitted, the estimated slope is not significantly different
from zero (p-value = 0.1489), indicating that age is not an important predictor
of Gesell score. This suggests that the significance of age as a predictor in the
original analysis was due solely to the presence of observation 18. Note the
dramatic decrease in R2 after deleting observation 18.
The fit of the model appears to improve when observation 19 is omitted. For
example, R2 increases noticeably and the p-value for testing the significance of
the slope decreases dramatically (in a relative sense). These tendencies would be
expected based on the original plot. However, this improvement is misleading.
Once observation 19 is omitted, observation 18 is much more influential. Again
the significance of the slope is due to the presence of observation 18.
Feature Full data Omit 18 Omit 19
b0 109.87 105.63 109.30
b1 -1.13 -0.78 -1.19
SE(b0 ) 5.07 7.16 3.97
SE(b1 ) 0.31 0.52 0.24
R2 0.41 0.11 0.57
p-val for H0 : β1 = 0 0.002 0.149 0.000
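A summary table like the one above can be produced with a small helper. This is a sketch: the data are typed in from the listing earlier in this section, and fit.features() is a hypothetical helper name, not a function from these notes or from any package. A fourth column holds out observations 18 and 19 together.

```r
# Gesell data as listed earlier in this section
gesell <- data.frame(
  id    = 1:21,
  age   = c(15, 26, 10,  9,  15, 20, 18,  11,   8, 20,   7,
             9, 10, 11, 11,  10, 12, 42,  17,  11, 10),
  score = c(95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
            96, 83, 84, 102, 100, 105, 57, 121, 86, 100)
)

# hypothetical helper: summary features for a fit that omits the given ids
fit.features <- function(omit = integer(0)) {
  dat <- gesell[!(gesell$id %in% omit), ]
  cf  <- summary(lm(score ~ age, data = dat))
  c(b0    = cf$coefficients["(Intercept)", "Estimate"],
    b1    = cf$coefficients["age", "Estimate"],
    se.b0 = cf$coefficients["(Intercept)", "Std. Error"],
    se.b1 = cf$coefficients["age", "Std. Error"],
    R2    = cf$r.squared,
    p.b1  = cf$coefficients["age", "Pr(>|t|)"])
}

tab <- cbind(full      = fit.features(),
             omit18    = fit.features(18),
             omit19    = fit.features(19),
             omit18.19 = fit.features(c(18, 19)))
round(tab, 3)
```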
Can you think of any reasons to justify doing the analysis without observation 18?
If you include observation 18 in the analysis, you are assuming that the mean
Gesell score is linearly related to age over the entire range of observed ages. Obser-
vation 18 is far from the other observations on age (age for observation 18 is 42; the
second highest age is 26; the lowest age is 7). There are no children with ages between
27 and 41, so we have no information on whether the relationship is roughly linear
over a significant portion of the range of ages. I am comfortable deleting observation
18 from the analysis because its inclusion forces me to make an assumption that I
cannot check using these data. I am only willing to make predictions of Gesell score
for children with ages roughly between 7 and 26. However, once this point is omitted,
age does not appear to be an important predictor.
A more complete analysis would delete observation 18 and 19 together. What
would you expect to see if you did this?
over all possible choices of β0 and β1 . The weighted LS (WLS) line chooses the values
of β0 and β1 that minimize
$$\sum_{i=1}^{n} w_i \{Y_i - (\beta_0 + \beta_1 X_i)\}^2$$
over all possible choices of $\beta_0$ and $\beta_1$. If $\sigma_{Y|X}$ depends upon X, then the correct choice
of weights is inversely proportional to the variance, $w_i \propto 1/\sigma^2_{Y|X}$.
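To see that lm() with weights minimizes this weighted sum of squares, the sketch below compares it with the closed-form solution of (X'WX) b = X'Wy on a small simulated data set. The data, the seed, and the names wss and b.closed are all hypothetical and chosen only for illustration.

```r
set.seed(1)                                  # hypothetical toy data
x <- runif(30, 1, 10)
y <- 2 + 3 * x + x * rnorm(30)               # error SD proportional to x
w <- 1 / x^2                                 # weights proportional to 1 / variance

# weighted sum of squares for a candidate (b0, b1)
wss <- function(b) sum(w * (y - (b[1] + b[2] * x))^2)

fit.wls <- lm(y ~ x, weights = w)

# closed form: solve (X'WX) b = X'Wy
X        <- cbind(1, x)
b.closed <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

cbind(lm = coef(fit.wls), closed.form = as.numeric(b.closed))
# the WLS solution has a smaller weighted SS than, for example, the OLS solution
c(wls = wss(coef(fit.wls)), ols = wss(coef(lm(y ~ x))))
```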
Consider the following data and plot of y vs. x and standardized OLS residuals
vs x. It is very clear that variability increases with x.
#### Weighted Least Squares
# R code to generate data
set.seed(7)
n <- 100
# 1s, Xs uniform 0 to 100
X <- matrix(c(rep(1,n),runif(n,0,100)), ncol=2)
# intercept and slope (5, 5)
beta <- matrix(c(5,5),ncol=1)
# errors are X*norm(0,1), so variance increases with X
e <- X[,2]*rnorm(n,0,1)
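# response variables
y <- X %*% beta + e
# assumed reconstruction: the data.frame step is missing from the extracted code
wlsdat <- data.frame(y = as.numeric(y), x = X[,2])
# fit regression
lm.y.x <- lm(y ~ x, data = wlsdat)
# residuals, stored as the 'res' column used in the plots
wlsdat$res <- lm.y.x$residuals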
[Figure: y versus x (left) and OLS residuals (res) versus x (right).]
In order to use WLS to solve this problem, we need some form for $\sigma^2_{Y|X}$. Finding
that form is a real problem with WLS. It can be useful to plot the absolute value of
the standardized residual vs. x to see if the top boundary seems to follow a general
pattern.
# ggplot: Plot the absolute value of the residuals
library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = abs(res)))
p <- p + geom_point()
print(p)
[Figure: abs(res) versus x.]
# fit regression
lm.y.x.wt <- lm(y ~ x, data = wlsdat, weights = x^(-2))
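One way to check whether the weights did their job is to look at the residuals on the weighted scale (residual times the square root of the weight), which should have roughly constant spread. The sketch below regenerates data with the same recipe as the simulation above, under the assumption that the error standard deviation is proportional to x; the object names dat, res.wt, and spread are illustrative only.

```r
set.seed(7)                                   # hypothetical data, error SD proportional to x
n <- 100
x <- runif(n, 0, 100)
y <- 5 + 5 * x + x * rnorm(n)
dat <- data.frame(x, y)

fit.ols <- lm(y ~ x, data = dat)
fit.wls <- lm(y ~ x, data = dat, weights = x^(-2))

# residuals on the weighted scale: residual * sqrt(weight) = residual / x
dat$res    <- resid(fit.ols)
dat$res.wt <- resid(fit.wls) * sqrt(weights(fit.wls))

# spread of residuals for small versus large x
upper  <- dat$x > median(dat$x)
spread <- rbind(ols = c(sd(dat$res[!upper]),    sd(dat$res[upper])),
                wls = c(sd(dat$res.wt[!upper]), sd(dat$res.wt[upper])))
colnames(spread) <- c("low.x", "high.x")
spread
```

A large gap between the two columns in the ols row, and a much smaller one in the wls row, is the numerical counterpart of the megaphone shape disappearing from the residual plot.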
[Figure: weighted residuals (reswt * wt) versus x.]
Contents
9.1 Introduction
9.2 Bootstrap
9.2.1 Ideal versus Bootstrap world, sampling distributions
9.2.2 The accuracy of the sample mean
9.2.3 Comparing bootstrap sampling distribution from population and sample
Learning objectives
After completing this topic, you should be able to:
explain the bootstrap principle for hypothesis tests and inference.
decide (for simple problems) how to construct a bootstrap procedure.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
8. use statistical software.
12. make evidence-based decisions.
9.1 Introduction
Statistical theory attempts to answer three basic questions:
1. How should I collect my data?
2. How should I analyze and summarize the data that I’ve collected?
Example: Aspirin and heart attacks, large-sample theory Does aspirin pre-
vent heart attacks in healthy middle-aged men? A controlled, randomized, double-
blind study was conducted and gathered the following data.
                 heart attacks
                 (fatal plus non-fatal)    subjects
aspirin group:   104                       11037
placebo group:   189                       11034
A good experimental design, such as this one, simplifies the results! The ratio of the
two rates (the risk ratio) is
$$\hat{\theta} = \frac{104/11037}{189/11034} = 0.55.$$
Because of the solid experimental design, we can believe that the aspirin-takers only
have 55% as many heart attacks as the placebo-takers.
We are not really interested in the estimated ratio θ̂, but the true ratio, θ. That
is the ratio if we could treat all possible subjects, not just a sample of them. Large
sample theory tells us that the log risk ratio has an approximate Normal distribution.
The standard error of the log risk ratio is estimated simply by the square root of the
sum of the reciprocals of the four frequencies:
$$SE(\log(RR)) = \sqrt{\frac{1}{104} + \frac{1}{189} + \frac{1}{11037} + \frac{1}{11034}} = 0.1228$$
The same data that allowed us to estimate the ratio θ with θ̂ = 0.55 also allowed us
to get an idea of the estimate’s accuracy.
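The arithmetic can be sketched in a few lines of R. The standard error uses the sum-of-reciprocals formula quoted above; the interval is the usual normal-theory interval on the log scale, exponentiated back, and is offered here as an assumption about how such an interval would be formed rather than as a reproduction of any output in these notes.

```r
# counts from the heart attack table
events <- c(aspirin = 104,   placebo = 189)
n.subj <- c(aspirin = 11037, placebo = 11034)

rate      <- events / n.subj
theta.hat <- unname(rate["aspirin"] / rate["placebo"])   # estimated risk ratio

# SE of the log risk ratio: square root of the sum of the four reciprocals
se.log.rr <- sqrt(sum(1 / events) + sum(1 / n.subj))

# approximate 95% interval: normal interval for log(theta), then exponentiate
ci <- exp(log(theta.hat) + c(-1, 1) * qnorm(0.975) * se.log.rr)

round(c(theta.hat = theta.hat, se.log.rr = se.log.rr), 4)
ci
```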
1
Efron (1979), “Bootstrap methods: another look at the jackknife.” Ann. Statist. 7, 1–26
Example: Aspirin and strokes, large-sample theory The aspirin study tracked
strokes as well as heart attacks.
                 strokes    subjects
aspirin group:   119        11037
placebo group:   98         11034
The ratio of the two rates (the risk ratio) is
$$\hat{\theta} = \frac{119/11037}{98/11034} = 1.21.$$
It now looks as if aspirin is actually harmful. However, the 95% interval for the true
stroke ratio θ is (0.925, 1.583). This includes the neutral value θ = 1, at which aspirin
would be no better or worse than placebo for strokes.
9.2 Bootstrap
The bootstrap is a data-based simulation method for statistical inference, which can
be used to produce inferences like those in the previous slides. The term “bootstrap”
comes from literature. In “The Adventures of Baron Munchausen”, by Rudolph Erich
Raspe, the Baron had fallen to the bottom of a deep lake, and he thought to get out
by pulling himself up by his own bootstraps.
[Figure: "How can the sampling distribution of the proportion of Republican votes be estimated? The ideal case: draw repeated (infinite) samples from the population." Diagram of population, samples, and statistic (proportion of Republican votes); repeated samples provide the true sampling distribution of the proportion of Republican votes.]
Example: Aspirin and strokes, bootstrap Here’s how the bootstrap works in
the stroke example. We create two populations:
the first consisting of 119 ones and 11037 − 119 = 10918 zeros,
the second consisting of 98 ones and 11034 − 98 = 10936 zeros.
We draw with replacement a sample of 11037 items from the first population, and a
sample of 11034 items from the second population. Each is called a bootstrap sample.
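From these we derive the bootstrap replicate of $\hat{\theta}$:
$$\hat{\theta}^* = \frac{\text{proportion of ones in bootstrap sample 1}}{\text{proportion of ones in bootstrap sample 2}}.$$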
Repeat this process a large number of times, say 10000 times, and obtain 10000
bootstrap replicates θ̂∗ . The summaries are in the code, followed by a histogram of
bootstrap replicates, θ̂∗ .
#### Example: Aspirin and strokes, bootstrap
# sample size (n) and successes (s) for sample 1 (aspirin) and 2 (placebo)
n <- c(11037, 11034)
s <- c( 119, 98)
# data for samples 1 and 2, where 1 = success (stroke), 0 = failure (no stroke)
dat1 <- c(rep(1, s[1]), rep(0, n[1] - s[1]))
dat2 <- c(rep(1, s[2]), rep(0, n[2] - s[2]))
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of proportions
for (i in 1:R) {
# proportion of successes in bootstrap samples 1 and 2
# (as individual steps for group 1:)
resam1 <- sample(dat1, n[1], replace = TRUE)
success1 <- sum(resam1)
bs1[i] <- success1 / n[1]
# (as one line for group 2:)
bs2[i] <- sum(sample(dat2, n[2], replace = TRUE)) / n[2]
}
# bootstrap replicates of ratio estimates
rat <- bs1 / bs2
# sort the ratio estimates to obtain bootstrap CI
rat.sorted <- sort(rat)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(rat.sorted[round(0.025*R)], rat.sorted[round(0.975*R+1)])
CI.bs
## [1] 0.9399154 1.5878036
library(ggplot2)
# put the bootstrap ratios in a data.frame for ggplot
dat.rat <- data.frame(rat)
p <- ggplot(dat.rat, aes(x = rat))
p <- p + geom_histogram(aes(y=..density..), binwidth=0.02)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 1 and CI
p <- p + geom_vline(xintercept=1, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of relative risk ratio, strokes")
p <- p + xlab("ratio (red = 1, green = bootstrap CI)")
print(p)
[Figure: histogram and density of the bootstrap replicates, titled "Bootstrap distribution of relative risk ratio, strokes".]
In this simple case, the confidence interval derived from the bootstrap (0.94, 1.588)
agrees very closely with the one derived from statistical theory (0.925, 1.583). Boot-
strap methods are intended to simplify the calculation of inferences like those using
large-sample theory, producing them in an automatic way even in situations much
more complicated than the risk ratio in the aspirin example.
Bootstrap Principle The plug-in principle is used when the underlying distribution
is unknown and you substitute your best guess for what that distribution is.
What to substitute?
  Empirical distribution: ordinary bootstrap
  Smoothed distribution (kernel): smoothed bootstrap
  Parametric distribution: parametric bootstrap
  Satisfy assumptions such as the null hypothesis
This substitution works in many cases, but not always. Keep in mind that the
bootstrap distribution is centered at the statistic, not the parameter. Implementation
is done by Monte Carlo sampling.
The bootstrap is commonly implemented in one of two ways, nonparametrically
or parametrically. An exact nonparametric bootstrap requires $n^n$ samples! That’s
one for every possible combination of each of n observation positions taking the value
of each of n observations. This is sensibly approximated by using the Monte Carlo
strategy of drawing a large number (1000 or 10000) of random resamples. On the
other hand, a parametric bootstrap first assumes a distribution for the population
(such as a normal distribution) and estimates the distributional parameters (such as
the mean and variance) from the observed sample. Then, the Monte Carlo strategy is
used to draw a large number (1000 or 10000) of samples from the estimated parametric
distribution.
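The contrast between the two implementations can be sketched for the simplest statistic, a sample mean. The data are the seven treatment survival times from the mouse example in this chapter; the normal model in the parametric version, the seed, and the number of resamples are assumptions made only for illustration.

```r
set.seed(42)
x <- c(94, 197, 16, 38, 99, 141, 23)   # treatment survival times, mouse example
n <- length(x)
R <- 2000                               # number of Monte Carlo resamples

# an exact nonparametric bootstrap would need n^n resamples
n^n

# nonparametric: resample the observed values with replacement
bs.np  <- replicate(R, mean(sample(x, n, replace = TRUE)))
# parametric: assume a normal population, plug in the sample mean and SD
bs.par <- replicate(R, mean(rnorm(n, mean = mean(x), sd = sd(x))))

# bootstrap standard errors of the mean, next to the formula-based SE
c(nonparametric = sd(bs.np), parametric = sd(bs.par), formula = sd(x) / sqrt(n))
```

The nonparametric standard error tends to come out a little smaller than the formula value because resampling uses the divisor n rather than n − 1 for the variance.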
Example: Mouse survival, two-sample t-test, mean Sixteen mice were ran-
domly assigned to a treatment group or a control group. Shown are their survival
times, in days, following a test surgery. Did the treatment prolong survival?
Group        Data                                    n    Mean    SE
Control:     52, 104, 146, 10, 51, 30, 40, 27, 46    9    56.22   14.14
Treatment:   94, 197, 16, 38, 99, 141, 23            7    86.86   25.24
Difference:                                               30.63   28.93
Numerical and graphical summaries of the data are below. There seems to be a
slight difference in variability between the two treatment groups.
#### Example: Mouse survival, two-sample t-test, mean
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
survive <- c(treatment, control)
group <- c(rep("Treatment", length(treatment)), rep("Control", length(control)))
mice <- data.frame(survive, group)
library(plyr)
# ddply "dd" means the input and output are both data.frames
mice.summary <- ddply(mice,
"group",
function(X) {
data.frame( m = mean(X$survive),
s = sd(X$survive),
n = length(X$survive)
)
}
)
# standard errors
mice.summary$se <- mice.summary$s/sqrt(mice.summary$n)
# individual confidence limits
mice.summary$ci.l <- mice.summary$m - qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se
mice.summary$ci.u <- mice.summary$m + qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se
mice.summary
## group m s n se ci.l ci.u
## 1 Control 56.22222 42.47581 9 14.15860 23.57242 88.87202
## 2 Treatment 86.85714 66.76683 7 25.23549 25.10812 148.60616
diff(mice.summary$m)
## [1] 30.63492
# histogram using ggplot
library(ggplot2)
p <- ggplot(mice, aes(x = survive))
p <- p + geom_histogram(binwidth = 20)
p <- p + geom_rug()
p <- p + facet_grid(group ~ .)
p <- p + labs(title = "Mouse survival following a test surgery") + xlab("Survival (days)")
print(p)
[Figure: histograms of survival by group (Control, Treatment), titled "Mouse survival following a test surgery"; x-axis: Survival (days), y-axis: count.]
The standard error for the difference is $28.93 = \sqrt{25.24^2 + 14.14^2}$, so the observed
difference of 30.63 is only 30.63/28.93 = 1.06 estimated standard errors greater than
zero, a statistically insignificant result.
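The arithmetic can be checked directly. This small sketch uses only the rounded standard errors and the difference from the summary table; the variable names are illustrative.

```r
# rounded values taken from the summary table
se.control   <- 14.14
se.treatment <- 25.24
diff.means   <- 30.63

# standard error of a difference of two independent means
se.diff <- sqrt(se.treatment^2 + se.control^2)
se.diff

# observed difference in units of its standard error
ratio <- diff.means / se.diff
ratio
```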
The two-sample t-test of the difference in means confirms the lack of statistically
significant difference between these two treatment groups with a p-value=0.3155.
t.test(survive ~ group, data = mice)
##
## Welch Two Sample t-test
##
Repeat this process a large number of times, say 10000 times, and obtain 10000
bootstrap replicates µ̂∗ . The summaries are in the code, followed by a histogram of
bootstrap replicates, µ̂∗ .
#### Example: Mouse survival, two-sample bootstrap, mean
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of means
for (i in 1:R) {
bs2[i] <- mean(sample(control, replace = TRUE))
bs1[i] <- mean(sample(treatment, replace = TRUE))
}
# bootstrap replicates of difference estimates
bs.diff <- bs1 - bs2
sd(bs.diff)
## [1] 27.00087
# sort the difference estimates to obtain bootstrap CI
diff.sorted <- sort(bs.diff)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
CI.bs
## [1] -21.96825 83.09524
library(ggplot2)
# put the bootstrap differences in a data.frame for ggplot
dat.diff <- data.frame(bs.diff)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..), binwidth=5)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, mean")
p <- p + xlab("difference (red = 0, green = bootstrap CI)")
print(p)
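[Figure: histogram with density overlay of the bootstrap differences in mean survival time; dashed red vertical line at 0, long-dashed green lines at the bootstrap CI limits; y-axis: density.]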
library(ggplot2)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..), binwidth=5)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, median")
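[Figure: histogram with density overlay of the bootstrap replicates, titled "Bootstrap distribution of difference in survival time, median"; x-axis: ratio (red = 0, green = bootstrap CI), y-axis: density.]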
LSAT <- c(622, 542, 579, 653, 606, 576, 620, 615, 553, 607, 558, 596, 635,
581, 661, 547, 599, 646, 622, 611, 546, 614, 628, 575, 662, 627,
608, 632, 587, 581, 605, 704, 477, 591, 578, 572, 615, 606, 603,
535, 595, 575, 573, 644, 545, 645, 651, 562, 609, 555, 586, 580,
594, 594, 560, 641, 512, 631, 597, 621, 617, 637, 572, 610, 562,
635, 614, 546, 598, 666, 570, 570, 605, 565, 686, 608, 595, 590,
558, 611, 564, 575)
GPA <- c(3.23, 2.83, 3.24, 3.12, 3.09, 3.39, 3.10, 3.40, 2.97, 2.91, 3.11,
3.24, 3.30, 3.22, 3.43, 2.91, 3.23, 3.47, 3.15, 3.33, 2.99, 3.19,
3.03, 3.01, 3.39, 3.41, 3.04, 3.29, 3.16, 3.17, 3.13, 3.36, 2.57,
3.02, 3.03, 2.88, 3.37, 3.20, 3.23, 2.98, 3.11, 2.92, 2.85, 3.38,
2.76, 3.27, 3.36, 3.19, 3.17, 3.00, 3.11, 3.07, 2.96, 3.05, 2.93,
3.28, 3.01, 3.21, 3.32, 3.24, 3.03, 3.33, 3.08, 3.13, 3.01, 3.30,
3.15, 2.82, 3.20, 3.44, 3.01, 2.92, 3.45, 3.15, 3.50, 3.16, 3.19,
3.15, 2.81, 3.16, 3.02, 2.74)
# law = population
law <- data.frame(School, LSAT, GPA, Sampled)
law$Sampled <- factor(law$Sampled)
# law.sam = sample
law.sam <- subset(law, Sampled == 1)
library(ggplot2)
p <- ggplot(law, aes(x = LSAT, y = GPA))
p <- p + geom_point(aes(colour = Sampled, shape = Sampled), alpha = 0.8, size = 2)
p <- p + labs(title = "Law School average scores of LSAT and GPA")
print(p)
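[Figure: scatterplot of GPA versus LSAT, titled "Law School average scores of LSAT and GPA"; colour and shape indicate Sampled (0 or 1).]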
Let’s bootstrap the sample of 15 observations to get the bootstrap sampling dis-
tribution of correlation (for sampling 15 from the population). From the bootstrap
sampling distribution we’ll calculate a bootstrap confidence interval for the true pop-
ulation correlation, as well as a bootstrap standard deviation for the correlation. But
how well does this work? Let’s compare it against the true sampling distribution
by drawing 15 random schools from the population of 82 schools and calculating the
correlation. If the bootstrap works well (from our hopefully representative sample of
15), then the bootstrap sampling distribution from the 15 schools will be close to the
true sampling distribution.
The code below does that, followed by two histograms. In this case, the histograms
are noticeably non-normal, having a long tail toward the left. Inferences based on
the normal curve are suspect when the bootstrap histogram is markedly non-normal.
The histogram on the left is the nonparametric bootstrap sampling distribution using
only the n = 15 sampled schools with 10000 bootstrap replicates of $\widehat{\mathrm{corr}}(x^*)$. The
histogram on the right is the true sampling distribution using 10000 replicates of
$\widehat{\mathrm{corr}}(x^*)$ from the population of law school data, repeatedly drawing n = 15 without
replacement from the N = 82 points. Impressively, the bootstrap histogram on the
left strongly resembles the population histogram on the right. Remember, in a real
problem we would only have the information on the left, from which we would be
trying to infer the situation on the right.
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs.pop <- rep(NA, R)
bs.sam <- rep(NA, R)
# draw R replicates of the correlation
for (i in 1:R) {
# sample() draws indices, then cor() gives the correlation of LSAT and GPA (columns 2:3)
# population
# n = 15 schools drawn without replacement from the N = 82 schools
bs.pop[i] = cor(law[sample(seq(1,nrow(law)), nrow(law.sam)
, replace = FALSE), 2:3])[1, 2]
# sample
bs.sam[i] = cor(law.sam[sample(seq(1,nrow(law.sam)), nrow(law.sam)
, replace = TRUE), 2:3])[1, 2]
}
[Figure: histograms of the correlation replicates by group, Pop (population) and Sam (sample); y-axis: count.]
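The bootstrap standard deviation and an equal-tail percentile confidence interval for a correlation can be computed from the replicates as sketched below. The (x, y) pairs here are simulated stand-ins, not the law school sample, and the object names are illustrative.

```r
set.seed(42)
# hypothetical stand-in for a sample of n = 15 (x, y) pairs
n <- 15
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6)
dat <- data.frame(x, y)

R <- 10000
bs.cor <- rep(NA, R)
for (i in 1:R) {
  # resample row indices with replacement, keeping (x, y) pairs together
  ind <- sample(seq_len(n), n, replace = TRUE)
  bs.cor[i] <- cor(dat$x[ind], dat$y[ind])
}

# bootstrap standard deviation of the correlation
bs.sd <- sd(bs.cor, na.rm = TRUE)
# equal-tail 95% percentile interval
bs.ci <- quantile(bs.cor, c(0.025, 0.975), na.rm = TRUE)
bs.sd
bs.ci
```

Resampling row indices, rather than x and y separately, is what preserves the dependence between the two variables in each bootstrap sample.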
Contents
10.1 Power Analysis
10.2 Effect size
10.3 Sample size
10.4 Power calculation via simulation
Learning objectives
After completing this topic, you should be able to:
assess the power of a test or
determine the required sample size for a study.
Achieving these goals contributes to mastery in these course learning outcomes:
7. Distinguish between statistical significance and scientific relevance.
10. Identify and explain the statistical methods, assumptions, and limitations.
12. Make evidence-based decisions by constructing and deciding between testable
hypotheses using appropriate data and methods.
conducted many times. Having power of 0.8 means that 80% of the time, we would
get a statistically significant difference between the drug A and placebo groups. This
also means that 20% of the times that we run this experiment, we will not obtain a
statistically significant effect between the two groups, even though there really is an
effect in reality. That is, the probability of a Type-II error is β = 0.2.
One-sample power figure Consider the plot below for a one-sample one-tailed
greater-than t-test. If the null hypothesis, H0 : µ = µ0 , is true, then the test statistic
t follows the null distribution indicated by the hatched area. Under a specific alternative
hypothesis, H1 : µ = µ1 , the test statistic t follows the distribution indicated
by the solid area. If α is the probability of making a Type-I error (rejecting H0 when
it is true), then “crit. val.” indicates the location of the tcrit value associated with H0
on the scale of the data. The rejection region is the area under H0 that is at least
as far as “crit. val.” is from µ0 . The power (1 − β) of the test is the green area, the
area under H1 in the rejection region. A Type-II error is made when H1 is true, but
we fail to reject H0 in the red region. (Note, for a two-tailed test the rejection region
for both tails under the H1 curve contribute to the power.)
#### One-sample power
# Power plot with two normal distributions
# http://stats.stackexchange.com/questions/14140/how-to-best-display-graphically-type-ii-beta-error-powe
[Figure: null and alternative distributions on an axis from −∞ to ∞ with µ0, crit. val., and µ1 marked; colour legend: Null hypothesis, Type-II error, Power.]
Example: IQ drug Imagine that we are evaluating the effect of a putative memory
enhancing drug. We have randomly sampled 25 people from a population known to
be normally distributed with a µ of 100 and a σ of 15. We administer the drug, wait
a reasonable time for it to take effect, and then test our subjects’ IQ. Assume that
we were so confident in our belief that the drug would either increase IQ or have no
effect that we entertained one-sided (directional) hypotheses. Our null hypothesis is
that after administering the drug µ ≤ 100 and our alternative hypothesis is µ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the null
is easy: it becomes µ = 100. The alternative is more troublesome. If we knew that
the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis
would be µ = 115, and we could compute power, the probability of correctly rejecting
the false null hypothesis given that µ is really equal to 115 after drug treatment, not
100 (normal IQ). But if we already knew how large the effect of the drug was, we
would not need to do inferential statistics. . .
One solution is to decide on a minimum nontrivial effect size. What is the
smallest effect that you would consider to be nontrivial? Suppose that you decide
that if the drug increases µIQ by 2 or more points, then that is a nontrivial effect, but
if the mean increase is less than 2 then the effect is trivial.
Now we can test the null of µ = 100 versus the alternative of µ = 102. Consider
the previous plot. Let the left curve represent the distribution of sample means if the
null hypothesis were true, µ = 100. This sampling distribution has a µ = 100 and
a $\sigma_{\bar{Y}} = 15/\sqrt{25} = 3$. Let the right curve represent the sampling distribution if the
exact alternative hypothesis is true, µ = 102. Its µ is 102 and, assuming the drug has
no effect on the variance in IQ scores, it also has $\sigma_{\bar{Y}} = 3$.
The area in the upper tail of the null distribution (gray hatched curve) beyond the
critical value is α. Assume we are using a one-tailed α of 0.05. How large would a
sample mean need to be for us to reject the null? Since the upper 5% of a normal
distribution extends from 1.645σ above the µ up to positive infinity, the sample mean
IQ would need to be
100 + 1.645(3) = 104.935 or more to reject the null. What are the chances of getting
a sample mean of 104.935 or more if the alternative hypothesis is correct, if the drug
increases IQ by 2 points? The area under the alternative curve from 104.935 up to
positive infinity represents that probability, which is power. Assuming the alternative
hypothesis is true, that µ = 102, the probability of rejecting the null hypothesis is
the probability of getting a sample mean of 104.935 or more in a normal distribution
with µ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P (Z > 0.98) = 0.1635.
That is, power is about 16%. If the drug really does increase IQ by an average of 2
points, we have a 16% chance of rejecting the null. If its effect is even larger, we have
a greater than 16% chance.
Suppose we consider 5 (rather than 2) the minimum nontrivial effect size. This
will separate the null and alternative distributions more, decreasing their overlap and
increasing power. Now, Z = (104.935 − 105)/3 = −0.02, P (Z > −0.02) = 0.5080 or
about 51%. It is easier to detect large effects than small effects.
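The one-tailed calculations above can be reproduced with qnorm() and pnorm(). This is a sketch under the same assumptions (known σ = 15, n = 25, α = 0.05); the function name power_one_tailed is illustrative. Because it does not round Z to two decimals, its results differ slightly in the third decimal from the hand calculations.

```r
# power of a one-tailed (upper) z-test for a mean with known sigma
power_one_tailed <- function(mu1, mu0 = 100, sigma = 15, n = 25, alpha = 0.05) {
  se   <- sigma / sqrt(n)               # standard error of the sample mean
  crit <- mu0 + qnorm(1 - alpha) * se   # smallest sample mean that rejects H0
  # probability that the sample mean exceeds crit when the true mean is mu1
  pnorm(crit, mean = mu1, sd = se, lower.tail = FALSE)
}

p102 <- power_one_tailed(102)  # minimum nontrivial effect of 2 points
p105 <- power_one_tailed(105)  # minimum nontrivial effect of 5 points
c(mu1.102 = p102, mu1.105 = p105)
```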
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ;
α is now split into both tails of the null distribution, 0.025 in each tail. We shall
reject the null if the sample mean is 1.96 or more standard errors away from the
µ of the null distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more
(or if it is 100 − 1.96(3) = 94.12 or less) we reject the null. The probability of that
happening if the alternative is correct (µ = 105) is: Z = (105.88 − 105)/3 = 0.29,
P (Z > 0.29) = 0.3859, and P (Z < (94.12 − 105)/3) = P (Z < −3.63) = 0.00014, for
a total power = (1 − β) = 0.3859 + 0.00014, or about 39%. Note that our power is less
than it was with a one-tailed test. If you can correctly predict the direction of
effect, a one-tailed test is more powerful than a two-tailed test.
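For the two-tailed case, power is the sum of the areas under the alternative curve in both rejection regions. A minimal sketch under the same assumptions (known σ, normal sampling distribution), with an illustrative function name:

```r
# power of a two-tailed z-test for a mean with known sigma
power_two_tailed <- function(mu1, mu0 = 100, sigma = 15, n = 25, alpha = 0.05) {
  se    <- sigma / sqrt(n)
  z     <- qnorm(1 - alpha / 2)
  upper <- mu0 + z * se   # reject if the sample mean is at or above this
  lower <- mu0 - z * se   # or at or below this
  pnorm(upper, mean = mu1, sd = se, lower.tail = FALSE) +
    pnorm(lower, mean = mu1, sd = se)
}

p25  <- power_two_tailed(105)           # n = 25
p100 <- power_two_tailed(105, n = 100)  # same effect, larger sample
c(n.25 = p25, n.100 = p100)
```

Changing the n argument shows how strongly the sample size drives power for a fixed effect.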
Consider what would happen if you increased sample size to 100. Now the
$\sigma_{\bar{Y}} = 15/\sqrt{100} = 1.5$. Now the null and alternative distributions are narrower, and should
overlap less, increasing power. With $\sigma_{\bar{Y}} = 1.5$ the sample mean will need to be 100 +
(1.96)(1.5) = 102.94 (rather than 105.88 from before) or more to reject the null.
If the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37,
P (Z > −1.37) = 0.9147, or between 91 and 92%. Anything that decreases the
standard error will increase power. This may be achieved by increasing
the sample size N or by reducing the σ of the dependent variable. The σ
research study on the effects of taking a small, daily dose of aspirin. Each participant
was instructed to take one pill a day. For about half of the participants the pill
was aspirin, for the others it was a placebo. The dependent variable was whether or
not the participant had a heart attack during the study. In terms of a correlation
coefficient, the size of the observed effect was r = 0.034. In terms of percentage of
variance explained, that is 0.12%. In other contexts this might be considered a trivial
effect, but in this context it was so large an effect that the researchers decided it was
unethical to continue the study, and they contacted all of the participants who were
taking the placebo and told them to start taking aspirin every day.
The plots below indicate the amount of power for a given effect size and sample
size for a one-sample t-test and ANOVA test. This graph makes clear the diminishing
returns you get for adding more and more subjects if you already have moderate to
high power. For example, let’s say we’re doing a one-sample test and we have an effect
size of 0.2 and have only 10 subjects. We can see that we have a power of about 0.15,
which is really, really low. Going to 25 subjects increases our power to about 0.25,
and to 100 subjects increases our power to about 0.6. But if we had a large effect size
of 0.8, 10 subjects would already give us a power of about 0.8, and using 25 or 100
subjects would both give a power at least 0.98. So each additional subject gives you
less additional power. This curve also illustrates the “cost” of increasing your desired
power from 0.8 to 0.98.
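The diminishing returns can also be seen numerically with power.t.test() from base R's stats package. This sketch assumes the values quoted above correspond to a one-sided one-sample t-test at α = 0.05; that choice is an assumption, not something stated in the text.

```r
# power of a one-sample t-test for effect size d = 0.2 at several sample sizes
n <- c(10, 25, 100)
pow <- sapply(n, function(k) {
  power.t.test(n = k, delta = 0.2, sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "one.sided")$power
})
round(data.frame(n, power = pow), 2)
```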
# Power curve plot for one-sample t-test with range of sample sizes
# http://stackoverflow.com/questions/4680163/power-vs-effect-size-plot/4680786#4680786
powsF <- sapply(nn, getFPow) # ANOVA power for all group sizes
powsT <- sapply(nn, getTPow) # t-Test power for all group sizes
#dev.new(width=10, fig.height=5)
par(mfrow=c(1, 2))
matplot(dVals, powsT, type="l", lty=1, lwd=2, xlab="effect size d",
ylab="Power", main="Power one-sample t-test", xaxs="i",
library(pwr)
pwrt2 <- pwr.t.test(d=.2,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt3 <- pwr.t.test(d=.3,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt5 <- pwr.t.test(d=.5,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt8 <- pwr.t.test(d=.8,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
#plot(pwrt$n, pwrt$power, type="b", xlab="sample size", ylab="power")
matplot(matrix(c(pwrt2$n,pwrt3$n,pwrt5$n,pwrt8$n),ncol=4),
matrix(c(pwrt2$power,pwrt3$power,pwrt5$power,pwrt8$power),ncol=4),
type="l", lty=1, lwd=2, xlab="sample size",
ylab="Power", main="Power one-sample t-test", xaxs="i",
xlim=c(0, 100), ylim=c(0,1), col=c("blue", "red", "darkgreen", "green"))
legend(x="bottomright", legend=paste("d =", c(0.2, 0.3, 0.5, 0.8)), lwd=2,
col=c("blue", "red", "darkgreen", "green"))
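[Figure: two panels, each titled "Power one-sample t-test", with Power (0 to 1) on the y-axis. Left: power versus effect size d, one curve for each of N = 5, 10, 25, 100. Right: power versus sample size, one curve for each of d = 0.2, 0.3, 0.5, 0.8.]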
Reasons to do a power analysis There are several reasons why one might do
a power analysis. (1) Perhaps the most common use is to determine the necessary
number of subjects needed to detect an effect of a given size. Note that trying to find
the absolute, bare minimum number of subjects needed in the study is often not a
good idea. (2) Additionally, power analysis can be used to determine power, given an
effect size and the number of subjects available. You might do this when you know,
for example, that only 75 subjects are available (or that you only have the budget for
75 subjects), and you want to know if you will have enough power to justify actually
doing the study. In most cases, there is really no point to conducting a study that is
seriously underpowered. Besides the issue of the number of necessary subjects, there
are other good reasons for doing a power analysis. (3) For example, a power analysis
is often required as part of a grant proposal. (4) And finally, doing a power analysis is
often just part of doing good research. A power analysis is a good way of making sure
that you have thought through every aspect of the study and the statistical analysis
before you start collecting data.
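Uses (1) and (2) map onto the same function call with a different unknown. The sketch below uses power.t.test() from base R; the effect sizes 0.5 and 0.3 are hypothetical choices for illustration only.

```r
# (1) subjects per group needed for 80% power,
#     two-sample two-sided t-test, hypothetical effect size d = 0.5
res.n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
n.per.group <- ceiling(res.n$n)
n.per.group

# (2) power available with a fixed budget of 75 subjects,
#     one-sample two-sided t-test, hypothetical effect size d = 0.3
res.p <- power.t.test(n = 75, delta = 0.3, sd = 1, sig.level = 0.05,
                      type = "one.sample")
res.p$power
```

Exactly one of n, delta, sd, sig.level, and power is left unspecified, and power.t.test() solves for it.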
Limitations Despite these advantages of power analyses, there are some limita-
tions. (1) One limitation is that power analyses do not typically generalize very well.
If you change the methodology used to collect the data or change the statistical pro-
cedure used to analyze the data, you will most likely have to redo the power analysis.
(2) In some cases, a power analysis might suggest a number of subjects that is inad-
equate for the statistical procedure. For example (beyond the scope of this class), a
power analysis might suggest that you need 30 subjects for your logistic regression,
but logistic regression, like all maximum likelihood procedures, requires much larger
sample sizes. (3) Perhaps the most important limitation is that a standard power
analysis gives you a “best case scenario” estimate of the necessary number of sub-
jects needed to detect the effect. In most cases, this “best case scenario” is based
on assumptions and educated guesses. If any of these assumptions or guesses are
incorrect, you may have less power than you need to detect the effect. (4) Finally,
because power analyses are based on assumptions and educated guesses, you often
get a range of the number of subjects needed, not a precise number. For example, if
you do not know what the standard deviation of your outcome measure will be, you
guess at this value, run the power analysis and get X number of subjects. Then you
guess a slightly larger value, rerun the power analysis and get a slightly larger number
of necessary subjects. You repeat this process over the plausible range of values of
the standard deviation, which gives you a range of the number of subjects that you
will need.
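The process in limitation (4) amounts to a short loop. In this sketch the difference to detect (5 units) and the candidate standard deviations are hypothetical values chosen only to show the mechanics.

```r
# plausible range of guesses for the outcome's standard deviation (hypothetical)
sd.guess <- c(6, 7, 8)

# subjects per group for 80% power to detect a difference of 5 units,
# two-sample two-sided t-test at alpha = 0.05
n.needed <- sapply(sd.guess, function(s) {
  ceiling(power.t.test(delta = 5, sd = s, sig.level = 0.05, power = 0.80)$n)
})
data.frame(sd.guess, n.needed)
```

The resulting column gives the range of sample sizes to plan for, rather than a single number.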
Other considerations After all of this discussion of power analyses and the nec-
essary number of subjects, we need to stress that power is not the only consideration
when determining the necessary sample size. For example, different researchers might
have different reasons for conducting a regression analysis. (1) One might want to see
if the regression coefficient is different from zero, (2) while the other wants to get a
very precise estimate of the regression coefficient with a very small confidence interval
around it. This second purpose requires a larger sample size than does merely seeing
if the regression coefficient is different from zero. (3) Another consideration when
determining the necessary sample size is the assumptions of the statistical procedure
that is going to be used (e.g., parametric vs nonparametric procedure). (4) The num-
ber of statistical tests that you intend to conduct will also influence your necessary
sample size: the more tests that you want to run, the more subjects that you will need
(multiple comparisons). (5) You will also want to consider the representativeness of
the sample, which, of course, influences the generalizability of the results. Unless you
have a really sophisticated sampling plan, the greater the desired generalizability, the
larger the necessary sample size.
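Consideration (4) can be illustrated with a Bonferroni adjustment, one common way (assumed here, not prescribed by the text) of controlling the error rate across several tests. The effect size d = 0.5 is hypothetical.

```r
# number of tests planned
k <- c(1, 3, 10)

# per-group n for 80% power when each test uses significance level 0.05 / k
n.needed <- sapply(k, function(m) {
  ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = 0.05 / m,
                       power = 0.80)$n)
})
data.frame(tests = k, sig.level = 0.05 / k, n.needed)
```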
the first one-sided alternative H0 : µ = 100 and H1 : µ > 100. Assume the minimum
nontrivial effect size was that the drug increases µIQ by 2 or more points, so that the
specific alternative to consider is H1 : µ = 102. What is the power of this test?
We already saw how to calculate this analytically. To solve this computationally,
we need to simulate samples of N = 25 from the alternative distribution (µ = 102
and σ = 15) and see what proportion of the time we correctly reject H0 .
#### Example: IQ drug, revisited
# R code to simulate one-sample one-sided power
# Strategy:
# Do this R times:
# draw a sample of size N from the distribution specified by the alternative hypothesis
# That is, 25 subjects from a normal distribution with mean 102 and sigma 15
# Calculate the mean of our sample
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu=100
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power
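# Note: the settings and loop body below are reconstructed from the strategy above
R <- 10000   # number of repetitions (assumed; the value used is not shown)
n <- 25      # sample size
mu0 <- 100   # mean under H0
mu1 <- 102   # mean under the specific alternative
sigma <- 15  # known standard deviation
reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
                      # to fill with 0 (fail to reject H0) or 1 (reject H0)
for (i in 1:R) {
  sam <- rnorm(n, mean=mu1, sd=sigma);   # sam is a vector with 25 values
  ybar <- mean(sam);                     # sample mean
  z <- (ybar - mu0) / (sigma / sqrt(n)); # z-statistic under H0
  pval <- 1 - pnorm(z);                  # one-sided (upper tail) p-value
  reject[i] <- as.numeric(pval < 0.05);  # 1 if we reject H0, else 0
}
power <- mean(reject); # the average reject (proportion of rejects) is the power
power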
## [1] 0.166
# 0.1655 for mu1=102
# 0.5082 for mu1=105
Our simulation (this time) with µ1 = 102 gave a power of 0.166 (exact answer
is P (Z > 0.98) = 0.1635). Rerunning with µ1 = 105 gave a power of 0.5082 (exact
answer is P (Z > −0.02) = 0.5080). Our simulation well-approximates the true value,
and the power can be made more precise by increasing the number of repetitions
calculated. However, two to three decimal precision is quite sufficient.
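How precise a simulated power is can be judged from its Monte Carlo standard error, sqrt(p(1 − p)/R), since each repetition is a Bernoulli trial. In this sketch the repetition counts are illustrative, and p.hat is the simulated power reported above.

```r
p.hat <- 0.166                # simulated power
R <- c(1000, 10000, 100000)   # illustrative numbers of repetitions

# standard error of a simulated proportion
mc.se <- sqrt(p.hat * (1 - p.hat) / R)
data.frame(R, mc.se = round(mc.se, 4))
```

Each additional decimal place of precision costs roughly 100 times as many repetitions.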
Example: Head breadth Recall the head breadth example in Chapter 3 compar-
ing maximum head breadths (in millimeters) of modern day Englishmen with ancient
Celts. The data are summarized below.
Descriptive Statistics: ENGLISH, CELTS
Variable N Mean SE Mean StDev Minimum Q1 Median Q3 Maximum
ENGLISH 18 146.50 1.50 6.38 132.00 141.75 147.50 150.00 158.00
CELTS 16 130.75 1.36 5.43 120.00 126.25 131.50 135.50 138.00
Imagine that we don’t have the information above. Imagine we have been invited
to a UK university to take skull measurements for 18 modern day Englishmen, and
16 ancient Celts. We have some information about modern day skulls to use as prior
information for measurement mean and standard deviation. What is the power to
observe a difference between the populations? Let’s make some reasonable assumptions
that allow us to be a bit conservative. Let’s assume the sampled skulls from
each of our populations are a random sample with common standard deviation 7 mm,
and let’s assume we can’t get the full sample but can only measure 15 skulls from
each population. At a significance level of α = 0.05, what is the power for detecting
a difference of 5, 10, 15, 20, or 25 mm?
The theoretical two-sample power result is not too hard to derive (and is avail-
able in text books), but let’s simply compare the power calculated exactly and by
simulation.
For the exact result we use R library pwr. Below is the function call as well as
the result. Note that we specified multiple effect sizes (diff/SD) in one call of the
function.
# R code to compute exact two-sample two-sided power
library(pwr) # load the power calculation library
pwr.t.test(n = 15,
d = c(5,10,15,20,25)/7,
sig.level = 0.05,
power = NULL,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 15
## d = 0.7142857, 1.4285714, 2.1428571, 2.8571429, 3.5714286
## sig.level = 0.05
## power = 0.4717438, 0.9652339, 0.9998914, 1.0000000, 1.0000000
## alternative = two.sided
##
## NOTE: n is number in *each* group
To simulate the power under the same circumstances, we follow a similar strategy
as in the one-sample example.
# R code to simulate two-sample two-sided power
# Strategy:
# Do this R times:
# draw a sample of size N from the two hypothesized distributions
# That is, 15 subjects from a normal distribution with specified means and sigma=7
# Calculate the mean of the two samples
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu_diff=0
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power
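# Note: the settings and loop body below are reconstructed from the strategy above
R <- 20000   # number of repetitions (assumed; the value used is not shown)
n <- 15      # sample size in each group
mu1 <- 147   # English mean
mu2 <- c(142, 137, 132, 127, 122)  # Celt means (differences of 5, 10, 15, 20, 25)
sigma <- 7   # common standard deviation
power <- rep(NA, length(mu2))
for (j in 1:length(mu2)) {
  reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
                        # to fill with 0 (fail to reject H0) or 1 (reject H0)
  for (i in 1:R) {
    sam1 <- rnorm(n, mean=mu1   , sd=sigma); # English sample
    sam2 <- rnorm(n, mean=mu2[j], sd=sigma); # Celt sample
    z <- (mean(sam1) - mean(sam2)) / (sigma * sqrt(2 / n)); # z-statistic under H0
    pval <- 2 * (1 - pnorm(abs(z)));         # two-sided p-value
    reject[i] <- as.numeric(pval < 0.05);    # 1 if we reject H0, else 0
  }
  power[j] <- mean(reject); # proportion of rejects for this difference
}
power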
## [1] 0.49275 0.97650 1.00000 1.00000 1.00000
Note the similarity between power calculated using both the exact and simulation
methods. If there is a power calculator for your specific problem, it is best to use
that because it is faster and there is no programming. However, using the simula-
tion method is better if we wanted to entertain different sample sizes with different
standard deviations, etc. There may not be a standard calculator for our specific
problem, so knowing how to simulate the power can be valuable.
      Mean                Sample size    Power
µE    µC    diff   SD     nE    nC       exact    simulated
147   142    5     7      15    15       0.4717   0.4928
147   137   10     7      15    15       0.9652   0.9765
147   132   15     7      15    15       0.9999   1
147   127   20     7      15    15       1.0000   1
147   122   25     7      15    15       1.0000   1
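When group sizes and standard deviations differ, a simulation needs only small changes. The sketch below wraps the strategy in a function and uses the Welch t-test from t.test() rather than a z-statistic; the function name, the number of repetitions, and the standard deviations 7 and 5 are illustrative assumptions.

```r
# simulated power of a two-sided Welch t-test with unequal n and unequal sd
sim.power <- function(n1, n2, mu1, mu2, sd1, sd2, R = 2000, alpha = 0.05) {
  reject <- replicate(R, {
    sam1 <- rnorm(n1, mean = mu1, sd = sd1)
    sam2 <- rnorm(n2, mean = mu2, sd = sd2)
    t.test(sam1, sam2)$p.value < alpha   # TRUE if H0 is rejected
  })
  mean(reject)   # proportion of rejections
}

set.seed(2024)
pow.null <- sim.power(18, 16, 147, 147, 7, 5)  # no true difference
pow.alt  <- sim.power(18, 16, 147, 142, 7, 5)  # true difference of 5 mm
c(null = pow.null, alt = pow.alt)
```

With no true difference, the rejection proportion estimates the Type-I error rate and should be near α, which is a useful sanity check on any power simulation.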
Data Cleaning
Contents
11.1 The five steps of statistical analysis
11.2 R background review
11.2.1 Variable types
11.2.2 Special values and value-checking functions
11.3 From raw to technically correct data
11.3.1 Technically correct data
11.3.2 Reading text data into an R data.frame
11.4 Type conversion
11.4.1 Introduction to R’s typing system
11.4.2 Recoding factors
11.4.3 Converting dates
11.5 Character-type manipulation
11.5.1 String normalization
11.5.2 Approximate string matching
11.6 From technically correct data to consistent data
11.6.1 Detection and localization of errors
11.6.2 Edit rules for detecting obvious inconsistencies
11.6.3 Correction
11.6.4 Imputation
1. Raw data
Data cleaning
2. Technically correct data The data can be read into an R data.frame, with
correct names, types and labels, without further trouble. However, that does not
mean that the values are error-free or complete.
For example, an age variable may be reported negative, an under-aged person may
be registered to possess a driver’s license, or data may simply be missing. Such
inconsistencies obviously depend on the subject matter that the data pertains to,
and they should be ironed out before valid statistical inference from such data can
be produced.
3. Consistent data The data is ready for statistical inference. It is the data that
most statistical theories use as a starting point. Ideally, such theories can still
be applied without taking previous data cleaning steps into account. In practice
however, data cleaning methods like imputation of missing values will influence
statistical results and so must be accounted for in the following analyses or inter-
pretation thereof.
4. Statistical results The results of the analysis have been produced and can be
stored for reuse.
5. Formatted output The results in tables and figures ready to include in sta-
tistical reports or publications.
Best practice Store the input data for each stage (raw, technically correct, consis-
tent, results, and formatted) separately for reuse. Each step between the stages may
be performed by a separate R script for reproducibility.
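One way to follow this practice is to save each stage to its own file. The sketch below is only an illustration: the directory, file names, and the tiny data set are all hypothetical, and a temporary directory is used so that nothing permanent is written.

```r
# hypothetical location for the stored stages
stage.dir <- file.path(tempdir(), "stages")
dir.create(stage.dir, showWarnings = FALSE)

# 1. raw data: everything still stored as text (hypothetical values)
raw <- data.frame(age = c("21", "42", "-18"), stringsAsFactors = FALSE)
saveRDS(raw, file.path(stage.dir, "1_raw.rds"))

# 2. technically correct: the column has the right type
tech <- transform(raw, age = as.numeric(age))
saveRDS(tech, file.path(stage.dir, "2_technically_correct.rds"))

# 3. consistent: impossible values (negative ages) are set to missing
consistent <- tech
consistent$age[consistent$age < 0] <- NA
saveRDS(consistent, file.path(stage.dir, "3_consistent.rds"))

list.files(stage.dir)
```

In a real project each numbered step would live in its own script that reads the previous stage's file and writes the next.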
pi/0
2 * Inf
Inf - 1e+10
Inf + Inf
3 < -Inf
Inf == Inf
# use is.infinite() to detect Inf variables
is.infinite(-Inf)
NaN Stands for “not a number”. This is generally the result of a calculation
of which the result is unknown, but it is surely not a number. In particular
operations like 0/0, Inf − Inf and Inf/Inf result in NaN. Technically, NaN is of
class numeric, which may seem odd since it is used to indicate that something
is not numeric. Computations involving numbers and NaN always result in NaN.
NaN + 1
exp(NaN)
# use is.nan() to detect NaN values
is.nan(0/0)
Note that is.finite() checks a numeric vector for the occurrence of any non-
numerical or special values.
is.finite(c(1, NA, 2, Inf, 3, -Inf, 4, NULL, 5, NaN, 6))
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
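These checking functions combine naturally into a quick audit of a numeric vector. In this sketch the vector x is hypothetical; note that is.na() is also TRUE for NaN, so NaN must be excluded to count true NA values.

```r
x <- c(1, NA, 2, Inf, 3, -Inf, 4, NaN)  # hypothetical vector with special values

counts <- c(n.NA     = sum(is.na(x) & !is.nan(x)),  # missing, but not NaN
            n.NaN    = sum(is.nan(x)),
            n.Inf    = sum(is.infinite(x)),
            n.finite = sum(is.finite(x)))
counts

# summarize using only the ordinary (finite) values
m <- mean(x[is.finite(x)])
m
```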
Best practice Whenever you need to read data from a foreign file format, like
a spreadsheet or proprietary statistical software that uses undisclosed file formats,
make that software responsible for exporting the data to an open format that can be
read by R.
Reading text
read.table() and similar functions below will read a text file and return a data.frame.
Best practice. A freshly read data.frame should always be inspected with func-
tions like head(), str(), and summary().
The read.table() function is the most flexible function to read tabular data
that is stored in a textual format. The other read-functions below all eventually
use read.table() with some fixed parameters and possibly after some preprocessing.
Specifically
read.csv() for comma separated values with period as decimal separator.
read.csv2() for semicolon separated values with comma as decimal separator.
read.delim() tab-delimited files with period as decimal separator.
read.delim2() tab-delimited files with comma as decimal separator.
read.fwf() data with a predetermined number of bytes per column.
Additional optional arguments include:
Argument Description
header Does the first line contain column names?
col.names character vector with column names.
na.strings Which strings should be considered NA?
colClasses character vector with the types of columns. Will coerce the columns
to the specified types.
stringsAsFactors If TRUE, converts all character vectors into factor vectors.
sep Field separator.
Except for read.table() and read.fwf(), each of the above functions assumes by
default that the first line in the text file contains column headers. We demonstrate
this on the following text file.
21,6.0
42,5.9
18,5.7*
21,NA
Read the file with defaults, then specifying necessary options.
#### Example: unnamed person text
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_unnamed.txt"
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
Now, everything is read in and the height column is translated to numeric, with
the exception of the row containing 5.7*. Moreover, since we now get a warning
instead of an error, a script containing this statement will continue to run, albeit with
less data to analyse than it was supposed to. It is of course up to the programmer to
check for these extra NA’s and handle them appropriately.
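As a minimal sketch of that check, the four lines of the file shown earlier can be read from an inline string (an assumption made here so the example does not depend on the remote file), keeping height as text first so that the rows that become NA during coercion can be identified. The names person.txt, height.raw, and ind.new.na are illustrative only.

```r
#### Sketch: find values that became NA during coercion
# the four lines of the example file, supplied inline (assumption)
person.txt <- "21,6.0\n42,5.9\n18,5.7*\n21,NA"
person <- read.csv(text = person.txt
                 , header = FALSE
                 , col.names = c("age", "height")
                 , stringsAsFactors = FALSE
                 )
str(person)   # height is read as character because of "5.7*"
# keep the raw text, then coerce; the warning is suppressed on purpose here
height.raw    <- person$height
person$height <- suppressWarnings(as.numeric(height.raw))
# rows that were not missing in the file but are NA after coercion
ind.new.na <- is.na(person$height) & !is.na(height.raw)
person[ind.new.na, ]
height.raw[ind.new.na]
```

Inspecting height.raw[ind.new.na] shows the offending text, so the analyst can decide whether to repair the value or leave it missing.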
detected regardless of whether the file was created under DOS, UNIX, or MAC
(each OS has traditionally had different ways of marking an end-of-line). Reading
in the Daltons file yields the following.
#### Example: Dalton data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_dalton.txt"
dalton.txt <- readLines(fn.data)
dalton.txt
## [1] "%% Data on the Dalton Brothers" "Gratt ,1861,1892"
## [3] "Bob,1892" "1871,Emmet ,1937"
## [5] "% Names, birth and death dates"
str(dalton.txt)
## chr [1:5] "%% Data on the Dalton Brothers" "Gratt ,1861,1892" ...
The variable dalton.txt has 5 character elements, equal to the number of lines in
the text file.
Step 2. Selecting lines containing data. This is generally done by throwing
out lines containing comments or otherwise lines that do not contain any data
fields. You can use grep() or grepl() to detect such lines. Regular expressions
(see footnote 2), though challenging to learn, can be used to specify what you’re
searching for. I usually search for an example and modify it to meet my needs.
# detect lines starting (^) with a percentage sign (%)
ind.nodata <- grepl("^%", dalton.txt)
ind.nodata
## [1] TRUE FALSE FALSE FALSE TRUE
# and throw them out
!ind.nodata
## [1] FALSE TRUE TRUE TRUE FALSE
dalton.dat <- dalton.txt[!ind.nodata]
dalton.dat
## [1] "Gratt ,1861,1892" "Bob,1892" "1871,Emmet ,1937"
Here, the first argument of grepl() is a search pattern, where the caret (^) indicates
a start-of-line. The result of grepl() is a logical vector that indicates which ele-
ments of dalton.txt contain the pattern ’start-of-line’ followed by a percent-sign.
The functionality of grep() and grepl() will be discussed in more detail later.
Step 3. Split lines into separate fields. This can be done with strsplit().
This function accepts a character vector and a split argument which tells strsplit()
how to split a string into substrings. The result is a list of character vectors.
# remove whitespace by substituting nothing where spaces appear
dalton.dat2 <- gsub(" ", "", dalton.dat)
# split strings by comma
2
http://en.wikipedia.org/wiki/Regular_expression
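The splitting step, and one possible field assigner, can be sketched as follows. This is an illustration under stated assumptions, not the definition used in these notes: the function name f.assignFields.sketch is hypothetical, the 1890 cutoff is taken from the warnings and discussion in this section, and coercion warnings are suppressed here for readability.

```r
#### Sketch: split lines into fields and put the fields in a standard order
dalton.dat  <- c("Gratt ,1861,1892", "Bob,1892", "1871,Emmet ,1937")
dalton.dat2 <- gsub(" ", "", dalton.dat)              # remove spaces
dalton.fieldList <- strsplit(dalton.dat2, split = ",")
dalton.fieldList

# hypothetical field assigner: returns c(name, birth, death) as character
f.assignFields.sketch <- function(x) {
  out <- character(3)
  # name: the field that contains alphabetic characters
  out[1] <- x[grepl("[[:alpha:]]", x)]
  # non-numeric fields become NA here (warning suppressed on purpose)
  x.num <- suppressWarnings(as.numeric(x))
  # birth year: numeric field before 1890, if any
  i <- which(x.num < 1890)
  out[2] <- if (length(i) > 0) x[i] else NA
  # death year: numeric field after 1890, if any
  i <- which(x.num > 1890)
  out[3] <- if (length(i) > 0) x[i] else NA
  out
}
dalton.sketch <- lapply(dalton.fieldList, f.assignFields.sketch)
dalton.sketch
```

Like any rule of this kind, the sketch depends on knowing the data: it assumes exactly one text field per line and at most one year on each side of 1890.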
The function lapply() will apply the function f.assignFields() to each list ele-
ment in dalton.fieldList.
dalton.standardFields <- lapply(dalton.fieldList, f.assignFields)
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
dalton.standardFields
## [[1]]
## [1] "Gratt" "1861" "1892"
##
## [[2]]
## [1] "Bob" NA "1892"
##
## [[3]]
## [1] "Emmet" "1871" "1937"
The advantage of this approach is greater flexibility than read.table() offers.
However, since we are interpreting the values of fields here, we must know about
the contents of the dataset, which makes it hard to generalize the field assigner
function. Furthermore, the f.assignFields() function we wrote is still relatively
fragile: it crashes, for example, when the input vector contains two or more text
fields or more than one numeric value larger than 1890. Again, the data analyst is
probably in the best position to choose how safe and general the field assigner
should be.
Step 5. Transform to data.frame. There are several ways to transform a list
to a data.frame object. Here, first all elements are copied into a matrix which is
then coerced into a data.frame.
# unlist() returns each value in a list in a single object
unlist(dalton.standardFields)
## [1] "Gratt" "1861" "1892" "Bob" NA "1892" "Emmet" "1871"
## [9] "1937"
# there are three list elements in dalton.standardFields
length(dalton.standardFields)
## [1] 3
# fill a matrix with the character values
dalton.mat <- matrix(unlist(dalton.standardFields)
, nrow = length(dalton.standardFields)
, byrow = TRUE
)
dalton.mat
The function unlist() concatenates all vectors in a list into one large character
vector. We then use that vector to fill a matrix of class character. However,
the matrix() function fills a matrix column by column by default. Here, our data
is stored with rows concatenated, so we need to add the argument byrow=TRUE.
Finally, we add column names and coerce the matrix to a data.frame. We use
stringsAsFactors=FALSE since we have not started interpreting the values yet.
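A minimal sketch of this last step follows. The column names name, birth, and death match the str() output shown in Step 6; the list is rebuilt inline so the example is self-contained.

```r
#### Sketch: name the columns and coerce the matrix to a data.frame
dalton.standardFields <- list(c("Gratt", "1861", "1892")
                            , c("Bob"  , NA    , "1892")
                            , c("Emmet", "1871", "1937")
                            )
dalton.mat <- matrix(unlist(dalton.standardFields)
                   , nrow = length(dalton.standardFields)
                   , byrow = TRUE
                   )
colnames(dalton.mat) <- c("name", "birth", "death")
# keep everything as character until values are interpreted in Step 6
dalton.df <- as.data.frame(dalton.mat, stringsAsFactors = FALSE)
str(dalton.df)
```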
Step 6. Normalize and coerce to correct types. This step consists of preparing
the character columns of our data.frame for coercion, translating numbers into
numeric vectors, and possibly translating character vectors into factor variables.
String normalization and type conversion are discussed later. In this example the
following statements suffice.
dalton.df$birth <- as.numeric(dalton.df$birth)
dalton.df$death <- as.numeric(dalton.df$death)
str(dalton.df)
## 'data.frame': 3 obs. of 3 variables:
## $ name : chr "Gratt" "Bob" "Emmet"
## $ birth: num 1861 NA 1871
## $ death: num 1892 1892 1937
dalton.df
For the user of R these class labels are usually enough to handle R objects in R
scripts. Under the hood, the basic R objects are stored as C structures as C is the
language in which R itself has been written. The type of C structure that is used to
store a basic type can be found with the typeof function. Compare the results below
with those in the previous code snippet.
typeof(c("abc", "def"))
## [1] "character"
typeof(1:10)
## [1] "integer"
typeof(c(pi, exp(1)))
## [1] "double"
typeof(factor(c("abc", "def")))
## [1] "integer"
Note that the type of an R object of class numeric is double. The term double
refers to double precision, which is a standard way for lower-level computer languages
such as C to store approximations of real numbers. Also, the type of an object of
class factor is integer. The reason is that R saves memory (and computational
time!) by storing factor values as integers, while a translation table between factor
levels and integers is kept in memory. Normally, a user should not have to worry about
these subtleties, but there are exceptions (the homework includes an example of the
subtleties).
In short, one may regard the class of an object as the object’s type from the
user’s point of view while the type of an object is the way R looks at the object. It
is important to realize that R’s coercion functions are fundamentally functions that
change the underlying type of an object and that class changes are a consequence of
the type changes.
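One place where the distinction between class and type matters is the coercion of a factor whose labels look like numbers. The sketch below uses a small made-up factor: coercing it directly returns the underlying integer codes, not the labels.

```r
#### Sketch: coercing a factor acts on its underlying integer codes
f <- factor(c("10", "20", "10"))
class(f)    # "factor"
typeof(f)   # "integer"
# direct coercion returns the internal codes 1, 2, 1
as.integer(f)
# to recover the numbers written in the labels, go through character
as.numeric(as.character(f))
# equivalent: convert the levels once, then index by the codes
as.numeric(levels(f))[f]
```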
read in a vector where 1 stands for male, 2 stands for female and 0 stands for unknown.
Conversion to a factor variable can be done as in the example below.
# example:
gender <- c(2, 1, 1, 2, 0, 1, 1)
gender
## [1] 2 1 1 2 0 1 1
# recoding table, stored in a simple vector
recode <- c(male = 1, female = 2)
recode
## male female
## 1 2
gender <- factor(gender, levels = recode, labels = names(recode))
gender
## [1] female male male female <NA> male male
## Levels: male female
Note that we do not explicitly need to set NA as a label. Every integer value that
is encountered in the first argument, but not in the levels argument, will be
regarded as missing.
Levels in a factor variable have no natural ordering. However in multivariate
(regression) analyses it can be beneficial to fix one of the levels as the reference level.
R’s standard multivariate routines (lm, glm) use the first level as reference level. The
relevel function allows you to determine which level comes first.
gender <- relevel(gender, ref = "female")
gender
## [1] female male male female <NA> male male
## Levels: female male
Levels can also be reordered, depending on the mean value of another variable,
for example:
age <- c(27, 52, 65, 34, 89, 45, 68)
gender <- reorder(gender, age)
gender
## [1] female male male female <NA> male male
## attr(,"scores")
## female male
## 30.5 57.5
## Levels: female male
Here, the means are added as a named vector attribute to gender. It can be
removed by setting that attribute to NULL.
attr(gender, "scores") <- NULL
gender
## [1] female male male female <NA> male male
## Levels: female male
year and tries to extract valid dates. Note that the code above will only work properly
in locale settings where the name of the second month is abbreviated to Feb. This
holds for English or Dutch locales, but fails for example in a French locale (Fevrier).
There are similar functions for all permutations of d, m, and y. Explicitly, all of
the following functions exist.
dmy()
dym()
mdy()
myd()
ydm()
ymd()
So once it is known in what order days, months and years are denoted, extraction
is very easy.
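When lubridate is not available, the same idea can be sketched with base R's as.Date(), where the order of day, month, and year is given explicitly in a format string. The dates below are made up for illustration, and numeric months are used so the result does not depend on the locale.

```r
#### Sketch: base R date conversion with an explicit field order
d.dmy <- as.Date("17-02-2013", format = "%d-%m-%Y")   # day, month, year
d.mdy <- as.Date("02/17/2013", format = "%m/%d/%Y")   # month, day, year
d.ymd <- as.Date("2013.02.17", format = "%Y.%m.%d")   # year, month, day
c(d.dmy, d.mdy, d.ymd)
# a string that does not match the format gives NA rather than an error
as.Date("17-02-2013", format = "%Y.%m.%d")
```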
Note. It is not uncommon to indicate years with two numbers, leaving out the
indication of century. In R, 00-68 has been interpreted as 2000-2068 and 69-99 as
1969-1999, as in the output below; this behaviour follows the 2008 POSIX standard,
but one should expect this interpretation to change over time and between software
versions (all two-digit years may instead be read as 2000-2099), so check the
result on your own system.
dmy("01 01 68")
## [1] "2068-01-01"
dmy("01 01 69")
## [1] "1969-01-01"
dmy("01 01 90")
## [1] "1990-01-01"
dmy("01 01 00")
## [1] "2000-01-01"
It should be noted that lubridate (as well as R’s base functionality) is only capable
of converting certain standard notations. For example, the following notation does
not convert.
dmy("15 Febr. 2013")
## Warning: All formats failed to parse. No formats found.
## [1] NA
male
Female
fem.
If this would be treated as a factor variable without any preprocessing, obviously
four, not two classes would be stored. The job at hand is therefore to automatically
recognize from the above data whether each element pertains to male or female.
In statistical contexts, classifying such “messy” text strings into a number of fixed
categories is often referred to as coding.
Below we discuss two complementary approaches to string coding: string nor-
malization and approximate text matching. In particular, the following topics are
discussed.
Remove leading or trailing white spaces.
Pad strings to a certain width.
Transform to upper/lower case.
Search for strings containing simple patterns (substrings).
Approximate matching procedures based on string distances.
Both str_trim() and str_pad() accept a side argument to indicate whether trim-
ming or padding should occur at the beginning (left), end (right), or both sides of
the string.
Converting strings to complete upper or lower case can be done with R’s built-in
toupper() and tolower() functions.
toupper("Hello world")
## [1] "HELLO WORLD"
tolower("Hello World")
## [1] "hello world"
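Trimming, case conversion, and simple pattern matching can be combined to code messy strings into fixed categories. The sketch below uses base R only (trimws() instead of str_trim()); the input vector gender.raw is hypothetical, and the rule "starts with m or f" is an assumption that would need checking against real data.

```r
#### Sketch: coding messy gender strings by normalization (hypothetical data)
gender.raw  <- c("M", "male ", "Female", "fem.")
# normalize: remove surrounding white space, convert to lower case
gender.norm <- tolower(trimws(gender.raw))
gender.norm
# code by the first letter; anything else stays missing
gender.code <- rep(NA_character_, length(gender.norm))
gender.code[grepl("^m", gender.norm)] <- "male"
gender.code[grepl("^f", gender.norm)] <- "female"
gender <- factor(gender.code, levels = c("male", "female"))
table(gender, useNA = "ifany")
```

Values that match neither pattern remain NA, so they show up in the table and can be reviewed by hand.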
3
http://en.wikipedia.org/wiki/Regular_expression
Correction of the fields that are deemed erroneous by the selection method.
This may be done through deterministic (model-based) or stochastic methods.
For many data correction methods these steps are not necessarily neatly separated.
First, we introduce a number of techniques dedicated to the detection of errors
and the selection of erroneous fields. If the field selection procedure is performed
separately from the error detection procedure, it is generally referred to as error
localization. Next, we describe techniques that implement correction methods based
on “direct rules” or “deductive correction”. In these techniques, erroneous values are
replaced by better ones by directly deriving them from other values in the same
record. Finally, we give an overview of some commonly used imputation techniques
that are available in R.
Missing values
A missing value, represented by NA in R, is a placeholder for a datum of which the
type is known but its value isn’t. Therefore, it is impossible to perform statistical
analysis on data where one or more values in the data are missing. One may choose
to either omit elements from a dataset that contain missing values or to impute a
value, but missingness is something to be dealt with prior to any analysis.
In practice, analysts, but also commonly used numerical software may confuse
a missing value with a default value or category. For instance, in Excel 2010, the
result of adding the contents of a field containing the number 1 with an empty field
results in 1. This behaviour is most definitely unwanted since Excel silently imputes
“0” where it should have said something along the lines of “unable to compute”. It
should be up to the analyst to decide how empty values are handled, since a default
imputation may yield unexpected or erroneous results for reasons that are hard to
trace.
Another commonly encountered mistake is to confuse an NA in categorical data
with the category unknown. If unknown is indeed a category, it should be added as a
factor level so it can be appropriately analyzed. Consider as an example a categorical
variable representing place of birth. Here, the category unknown means that we have
no knowledge about where a person is born. In contrast, NA indicates that we have
no information to determine whether the birth place is known or not.
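The difference can be made concrete with a small sketch. The birthplace values are made up for illustration; "unknown" is an ordinary factor level, while NA is not a level at all.

```r
#### Sketch: "unknown" as a category versus NA (hypothetical data)
birthplace <- factor(c("Albuquerque", "unknown", NA, "Santa Fe", "unknown"))
levels(birthplace)                # "unknown" is a level, NA is not
is.na(birthplace)                 # only the third element is missing
table(birthplace)                 # NA is silently left out
table(birthplace, useNA = "ifany")  # NA is reported separately
```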
The behaviour of R’s core functionality is completely consistent with the idea that
the analyst must decide what to do with missing data. A common choice, namely
“leave out records with missing data” is supported by many base functions through
the na.rm option.
Functions such as sum(), prod(), quantile(), sd(), and so on all have this option.
Functions implementing bivariate statistics such as cor() and cov() offer options to
include complete or pairwise complete values.
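A short sketch of both options, using made-up vectors: na.rm for univariate summaries and the use argument of cor() for bivariate ones.

```r
#### Sketch: the analyst decides how missing values are handled
age <- c(23, 16, NA)
mean(age)                 # NA, since one value is unknown
mean(age, na.rm = TRUE)   # mean of the observed values only
sum(age, na.rm = TRUE)

x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)
cor(x, y)                         # NA by default
cor(x, y, use = "complete.obs")   # uses only pairs with both values observed
```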
Besides the is.na() function, that was already mentioned previously, R comes
with a few other functions facilitating NA handling. The complete.cases() function
detects rows in a data.frame that do not contain any missing value. Recall the person
data set example from earlier.
print(person)
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
complete.cases(person)
## [1] TRUE TRUE FALSE FALSE
The resulting logical vector can be used to remove incomplete records from the
data.frame. Alternatively, the na.omit() function does the same.
persons_complete <- na.omit(person)
persons_complete
## age height
## 1 21 6.0
## 2 42 5.9
na.action(persons_complete)
## 3 4
## 3 4
## attr(,"class")
## [1] "omit"
The result of the na.omit() function is a data.frame where incomplete rows have
been deleted. The row.names of the removed records are stored in an attribute called
na.action.
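For comparison, the logical vector from complete.cases() can be used directly for row indexing. The sketch rebuilds the person data inline so that it is self-contained.

```r
#### Sketch: removing incomplete rows with complete.cases()
person <- data.frame(age = c(21, 42, 18, 21), height = c(6.0, 5.9, NA, NA))
ind.complete <- complete.cases(person)
person.cc <- person[ind.complete, ]
person.cc
# rows that were dropped, kept for inspection
person[!ind.complete, ]
```

Unlike na.omit(), this keeps the removed rows easy to inspect, but it does not record them in an attribute.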
Note. It may happen that a missing value in a data set means 0 or Not applicable.
If that is the case, it should be explicitly imputed with that value, because it is not
unknown, but was coded as empty.
Special values
As explained previously, numeric variables are endowed with several formalized special
values including ±Inf, NA, and NaN. Calculations involving special values often result
in special values, and since a statistical statement about a real-world phenomenon
should never include a special value, it is desirable to handle special values prior to
analysis. For numeric variables, special values indicate values that are not an element
of the mathematical set of real numbers. The function is.finite() determines which
values are “regular” values.
is.finite(c(1, Inf, NaN, NA))
## [1] TRUE FALSE FALSE FALSE
This function accepts vector input. With little effort we can write a function
that may be used to check every numerical column in a data.frame.
f.is.special <- function(x) {
if (is.numeric(x)) {
return(!is.finite(x))
} else {
return(is.na(x))
}
}
person
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
sapply(person, f.is.special)
## age height
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE TRUE
## [4,] FALSE TRUE
Here, the f.is.special() function is applied to each column of person using
sapply(). f.is.special() checks its input vector for numerical special values if the
type is numeric, otherwise it only checks for NA.
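The logical matrix can then be summarized per column or per row. The sketch below repeats the function so that it runs on its own and applies it to a small made-up data.frame with one numeric and one character column.

```r
#### Sketch: counting special values per column and locating affected rows
f.is.special <- function(x) {
  if (is.numeric(x)) {
    return(!is.finite(x))
  } else {
    return(is.na(x))
  }
}
dat <- data.frame(x = c(1, Inf, NaN, NA)
                , g = c("a", NA, "b", "c")
                , stringsAsFactors = FALSE
                )
special <- sapply(dat, f.is.special)
colSums(special)              # number of special values in each column
which(rowSums(special) > 0)   # rows with at least one special value
```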
Outliers
There is a vast body of literature on outlier detection, and several definitions of
outlier exist. A general definition by Barnett and Lewis defines an outlier in a data
set as an observation (or set of observations) which appears to be inconsistent with that
set of data. Although more precise definitions exist (see e.g., the book by Hawkins),
this definition is sufficient for the current chapter. Below we mention a few fairly
Note. Outliers do not equal errors. They should be detected, but not necessarily
removed. Their inclusion in the analysis is a statistical decision.
For more or less unimodal and symmetrically distributed data, Tukey’s box-and-
whisker method for outlier detection is often appropriate. In this method, an
observation is an outlier when it lies beyond the so-called “whiskers” of the set of
observations, that is, above the upper whisker or below the lower whisker. The
upper whisker is computed by adding 1.5 times the interquartile range to the third
quartile and rounding down to the nearest observation. The lower whisker is
computed likewise, by subtracting 1.5 times the interquartile range from the first
quartile and rounding up to the nearest observation. The base R installation comes
with the function boxplot.stats(), which, amongst other things, lists the outliers.
x <- c(1:10, 20, 30)
boxplot.stats(x)
## $stats
## [1] 1.0 3.5 6.5 9.5 10.0
##
## $n
## [1] 12
##
## $conf
## [1] 3.76336 9.23664
##
## $out
## [1] 20 30
Here, 20 and 30 are detected as outliers since they are above the upper whisker
of the observations in x. The factor 1.5 used to compute the whisker is to an extent
arbitrary and it can be altered by setting the coef option of boxplot.stats(). A
higher coefficient means a higher outlier detection limit (so for the same dataset,
generally fewer upper or lower outliers will be detected).
boxplot.stats(x, coef = 2)$out
## [1] 30
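To see where these results come from, the limits can be computed by hand. The sketch below uses fivenum(), whose second and fourth values are the hinges that boxplot.stats() reports in $stats; the helper name f.tukey.fences is hypothetical.

```r
#### Sketch: computing the outlier limits by hand
x <- c(1:10, 20, 30)
f.tukey.fences <- function(x, coef = 1.5) {
  h   <- fivenum(x)     # min, lower hinge, median, upper hinge, max
  iqr <- h[4] - h[2]    # hinge-based interquartile range
  c(lower = h[2] - coef * iqr, upper = h[4] + coef * iqr)
}
fences <- f.tukey.fences(x)
fences
# observations outside the limits
x[x < fences["lower"] | x > fences["upper"]]
# with a larger coefficient the limits widen and fewer points are flagged
fences2 <- f.tukey.fences(x, coef = 2)
x[x < fences2["lower"] | x > fences2["upper"]]
```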
The box-and-whisker method can be visualized with the box-and-whisker plot,
where the box indicates the interquartile range and the median, the whiskers are
represented at the ends of the box-and-whisker plots and outliers are indicated as
separate points above or below the whiskers.
op <- par(no.readonly = TRUE) # save plot settings
par(mfrow=c(1,3))
boxplot(x, main="default")
boxplot(x, range = 1.5, main="range = 1.5")
boxplot(x, range = 2, main="range = 2")
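[Figure: three side-by-side box-and-whisker plots of x, titled "default", "range = 1.5", and "range = 2"; vertical axis from 0 to 30. Points are drawn at 20 and 30 in the first two panels and at 30 only in the third.]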
The box-and-whisker method fails when the data distribution is skewed, as in an
exponential or log-normal distribution. In that case one can attempt to transform
the data, for example with a logarithm or square root transformation. Another option
is to use a method that takes the skewness into account.
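The effect of a transformation can be sketched with a small made-up, right-skewed vector of powers of two. On the original scale the largest value is flagged; after a logarithm the values are evenly spaced and nothing is flagged.

```r
#### Sketch: a log transformation before box-and-whisker outlier detection
y <- c(1, 2, 4, 8, 16, 32, 64, 128, 256)   # hypothetical skewed data
boxplot.stats(y)$out        # flagged on the original scale
boxplot.stats(log2(y))$out  # nothing flagged on the log scale
```

Whether the flagged value is an error or simply the tail of a skewed distribution remains a judgment for the analyst.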
A particularly easy-to-implement method for outlier detection with positive
observations is by Hidiroglou and Berthelot. In this method, an observation is an
outlier when
$$h(x) = \max\left(\frac{x}{x^*}, \frac{x^*}{x}\right) \ge r, \quad \text{and } x > 0.$$
Here, r is a user-defined reference value and x∗ is usually the median observation,
although other measures of centrality may be chosen. The score function h(x)
grows as 1/x as x approaches zero and grows linearly with x when x is larger than
x∗. It is therefore appropriate for finding outliers on both sides of the distribution.
Moreover, because of the different behaviour for small and large x-values, it is
appropriate for skewed (long-tailed) distributions. An implementation of this method
in R does not seem to be available, but it is simple enough to implement as follows.
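#### Example: Hidiroglou and Berthelot outlier detection function
f.hb.outlier <- function(x, r) {
  # drop NA, NaN, and +/-Inf; the result refers to the remaining values only
  x <- x[is.finite(x)]
  # quit if the vector is empty or contains non-positive values
  stopifnot(length(x) > 0, all(x > 0))
  xref <- median(x)  # reference value x*
  if (xref <= sqrt(.Machine$double.eps)) {
    warning("Reference value close to zero: results may be inaccurate")
  }
  # score h(x) = max(x/x*, x*/x); flag observations with h(x) >= r
  pmax(x / xref, xref / x) >= r
}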
f.hb.outlier(x, r = 4)
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] TRUE
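The above function returns a logical vector indicating which elements of x are
outliers. Because non-finite values are dropped first, the result is aligned with
x[is.finite(x)] rather than with the original x.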
However, as the number of variables increases, the number of rules may increase
rapidly and it may be beneficial to manage the rules separately from the data. Moreover,
since multivariate rules may be interconnected by common variables, deciding which
variable or variables in a record cause an inconsistency may not be straightforward.
The editrules package allows one to define rules on categorical, numerical or
mixed-type data sets which each record must obey. Furthermore, editrules can check
which rules are obeyed or not and allows one to find the minimal set of variables to
adapt so that all rules can be obeyed. The package also implements a number of
basic rule operations allowing users to test rule sets for contradictions and certain
redundancies.
As an example, we will work with a small file containing the following data.
age,agegroup,height,status,yearsmarried
21,adult,6.0,single,-1
2,child,3,married, 0
18,adult,5.7,married, 20
221,elderly, 5,widowed, 2
34,child, -7,married, 3
We read this data into a variable called people and define some restrictions on age
using editset().
#### Example: people data for cleaning
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_people.txt"
people <- read.csv(fn.data)
people
## age agegroup height status yearsmarried
## 1 21 adult 6.0 single -1
The editset() function parses the textual rules and stores them in an editset
object. Each rule is assigned a name according to its type (numeric, categorical,
or mixed) and a number. The data can be checked against these rules with the
violatedEdits() function. Record 4 contains an error according to one of the rules:
an age of 221 is not allowed.
violatedEdits(E, people)
## edit
## record num1 num2
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE TRUE
## 5 FALSE FALSE
violatedEdits() returns a logical array indicating for each row of the data,
which rules are violated. The number and type of rules applying to a data set usually
quickly grow with the number of variables. With editrules, users may read rules,
specified in a limited R-syntax, directly from a text file using the editfile() function.
As an example consider the contents of the following text file (note, you can’t include
braces in your if() statement).
# numerical rules
age >= 0
height > 0
age <= 150
age > yearsmarried
# categorical rules
status %in% c("married", "single", "widowed")
agegroup %in% c("child", "adult", "elderly")
if ( status == "married" ) agegroup %in% c("adult","elderly")
# mixed rules
if ( status %in% c("married","widowed")) age - yearsmarried >= 17
if ( age < 18 ) agegroup == "child"
if ( age >= 18 && age <65 ) agegroup == "adult"
if ( age >= 65 ) agegroup == "elderly"
There are rules pertaining to purely numerical, purely categorical and rules per-
taining to both data types. Moreover, there are univariate as well as multivariate
rules. Comments are written after the usual # character. The rule set can be read
as follows.
#### Edit rules for people data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_edits.txt"
E <- editfile(fn.data)
E
##
## Data model:
## dat6 : agegroup %in% c('adult', 'child', 'elderly')
## dat7 : status %in% c('married', 'single', 'widowed')
##
## Edit set:
## num1 : 0 <= age
## num2 : 0 < height
## num3 : age <= 150
## num4 : yearsmarried < age
## cat5 : if( agegroup == 'child' ) status != 'married'
## mix6 : if( age < yearsmarried + 17 ) !( status %in% c('married', 'widowed') )
## mix7 : if( age < 18 ) !( agegroup %in% c('adult', 'elderly') )
## mix8 : if( 18 <= age & age < 65 ) !( agegroup %in% c('child', 'elderly') )
## mix9 : if( 65 <= age ) !( agegroup %in% c('adult', 'child') )
Since rules may pertain to multiple variables, and variables may occur in several
rules (e.g., the age variable in the current example), there is a dependency between
rules and variables. It can be informative to show these dependencies in a graph using
the plot function. In the graph below, blue circles represent variables and yellow
boxes represent restrictions. The lines indicate which restrictions pertain to which
variables.
op <- par(no.readonly = TRUE) # save plot settings
par(mfrow=c(1,1), mar = c(0,0,0,0))
plot(E)
par(op) # restore plot settings
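[Figure: dependency graph of the edit set E, connecting the variables age, heght (height), yrsmr (yearsmarried), aggrp (agegroup), and stats (status) to the edits num1 to num4, cat5, and mix6 to mix9.]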
As the number of rules grows, looking at the full array produced by violatedEdits()
becomes cumbersome. For this reason, editrules offers methods to summarize or vi-
sualize the result.
ve <- violatedEdits(E, people)
summary(ve)
## Edit violations, 5 observations, 0 completely missing (0%):
##
## editname freq rel
## cat5 2 40%
## mix6 2 40%
## num2 1 20%
## num3 1 20%
## num4 1 20%
## mix8 1 20%
##
## Edit violations per record:
##
## errors freq rel
## 0 1 20%
## 1 1 20%
## 2 2 40%
## 3 1 20%
plot(ve)
[Figure: output of plot(ve); axis labels include Edit, Frequency, and Number of violations.]
Here, the edit labeled cat5 is violated by two records (40% of all records). Violated
edits are sorted from most to least often violated. The plot visualizes the same
information.
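For a handful of purely numerical rules, the same kind of check can be sketched in base R without editrules. This is an illustration only, not how editrules works internally: the people data are typed in from the listing above (numeric columns only), and each rule is stored as text and evaluated within the data.frame.

```r
#### Sketch: checking the four numerical rules with base R only
people <- data.frame(age          = c(21, 2, 18, 221, 34)
                   , height       = c(6.0, 3, 5.7, 5, -7)
                   , yearsmarried = c(-1, 0, 20, 2, 3)
                   )
rules <- c(num1 = "age >= 0"
         , num2 = "height > 0"
         , num3 = "age <= 150"
         , num4 = "age > yearsmarried"
         )
# TRUE where a record violates a rule (rows = records, columns = rules)
violated <- sapply(rules, function(r) !eval(parse(text = r), envir = people))
violated
colSums(violated)   # number of records violating each rule
rowSums(violated)   # number of violated rules per record
```

The counts for num2, num3, and num4 agree with the summary above; the categorical and mixed rules are not covered by this sketch.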
Error localization
The interconnectivity of edits is what makes error localization difficult. For example,
the graph above shows that a record violating edit num4 may contain an error in age
and/or yrsmr (years married). Suppose that we alter age so that num4 is not violated
anymore. We then run the risk of violating up to six other edits containing age.
If we have no other information available but the edit violations, it makes sense
to minimize the number of fields being altered. This principle, commonly referred to
as the principle of Fellegi and Holt, is based on the idea that errors occur relatively
rarely and, when they do, they occur randomly across variables. Over the years
several algorithms have been developed to solve this minimization problem of which
two have been implemented in editrules. The localizeErrors() function provides
11.6.3 Correction
Correction methods aim to fix inconsistent observations by altering invalid values in
a record based on information from valid values. Depending on the method this is
either a single-step procedure or a two-step procedure where first, an error localization
method is used to empty certain fields, followed by an imputation step.
In some cases, the cause of errors in data can be determined with enough certainty
so that the solution is almost automatically known. In recent years, several such
methods have been developed and implemented in the deducorrect package.
For the purposes of ADA1, we will manually correct errors, either by replacing
values or by excluding observations.
The task here is to standardize the lengths and express all of them in meters. The
obvious way would be to use indexing techniques, which would look something like
this.
marx_m <- marx
ind <- (marx$unit == "cm") # indexes for cm
## ## 1-------
## if (unit == "cm") {
## height <- height/100
## unit <- "m"
## }
## ## 2-------
## if (unit == "inch") {
## height <- height/39.37
## unit <- "m"
## }
## ## 3-------
## if (unit == "ft") {
## height <- height/3.28
## unit <- "m"
## }
correctionRules() has parsed the rules and stored them in a correctionRules
object. We may now apply them to the data.
cor <- correctWithRules(R, marx)
The returned value, cor, is a list containing the corrected data
cor$corrected
## name height unit
## 1 Groucho 1.700000 m
## 2 Zeppo 1.740000 m
## 3 Chico 1.778004 m
## 4 Gummo 1.680000 m
## 5 Harpo 1.801829 m
as well as a log of applied corrections.
cor$corrections[1:4]
## row variable old new
## 1 1 height 170 1.7
## 2 1 unit cm m
## 3 3 height 70 1.77800355600711
## 4 3 unit inch m
## 5 4 height 168 1.68
## 6 4 unit cm m
## 7 5 height 5.91 1.80182926829268
## 8 5 unit ft m
The log lists for each row, what variable was changed, what the old value was and
what the new value is. Furthermore, the fifth column of cor$corrections shows the
corrections that were applied (not shown above for formatting reasons).
cor$corrections[5]
## how
## 1 if (unit == "cm") { height <- height/100 unit <- "m" }
## 2 if (unit == "cm") { height <- height/100 unit <- "m" }
So here, with just two commands, the data is processed and all actions logged in
a data.frame which may be stored or analyzed. The rules that may be applied with
deducorrect are rules that can be executed record-by-record.
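For comparison, the same conversion can be sketched in base R with a lookup vector of conversion factors. The marx data below are reconstructed from the correction log above (an assumption, since the original file is not shown here), and the conversion factors are the ones used in the rules.

```r
#### Sketch: unit conversion with a lookup vector (data reconstructed from the log)
marx <- data.frame(name   = c("Groucho", "Zeppo", "Chico", "Gummo", "Harpo")
                 , height = c(170, 1.74, 70, 168, 5.91)
                 , unit   = c("cm", "m", "inch", "cm", "ft")
                 , stringsAsFactors = FALSE
                 )
# multiplication factors to meters, matching the rules above
to.m <- c(m = 1, cm = 1 / 100, inch = 1 / 39.37, ft = 1 / 3.28)
marx_m <- marx
marx_m$height <- unname(marx$height * to.m[marx$unit])
marx_m$unit   <- "m"
marx_m
# a simple log of which rows were changed
data.frame(row = which(marx$unit != "m")
         , old = marx$height[marx$unit != "m"]
         , new = marx_m$height[marx$unit != "m"]
         )
```

This is compact, but the log has to be built by hand, which is the bookkeeping that correctWithRules() does automatically.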
By design, there are some limitations to which rules can be applied with correctWithRules().
The processing rules should be executable record-by-record. That is, it is not permit-
ted to use functions like mean() or sd(). The symbols that may be used can be listed
as follows.
getOption("allowedSymbols")
## [1] "if" "else" "is.na" "is.finite" "=="
## [6] "<" "<=" "=" ">=" ">"
## [11] "!=" "!" "%in%" "identical" "sign"
## [16] "abs" "||" "|" "&&" "&"
## [21] "(" "{" "<-" "=" "+"
## [26] "-" "*" "^" "/" "%%"
## [31] "%/%"
When the rules are read by correctionRules(), it checks whether any symbol
occurs that is not in the list of allowed symbols and returns an error message when
such a symbol is found as in the following example.
correctionRules(expression(x <- mean(x)))
##
## Forbidden symbols found:
## ## ERR 1 ------
## Forbidden symbols: mean
## x <- mean(x)
## Error in correctionRules.expression(expression(x <- mean(x))): Forbidden symbols found
Finally, it is currently not possible to add new variables using correctionRules()
although such a feature will likely be added in the future.
Deductive correction
When the data you are analyzing is generated by people rather than machines or mea-
surement devices, certain typical human-generated errors are likely to occur. Given
that data has to obey certain edit rules, the occurrence of such errors can sometimes
be detected from raw data with (almost) certainty. Examples of errors that can be
detected are typing errors in numbers (under linear restrictions), rounding errors in
numbers, sign errors, and variable swaps. The deducorrect package has a number
of functions available that can correct such errors. Below we give some examples,
every time with just a single edit rule. The functions can handle larger sets of edits
however.
[I will complete this section if we need it for our Spring semester.]
Deterministic imputation
In some cases a missing value can be determined because the observed values combined
with their constraints force a unique solution.
[I will complete this section if we need it for our Spring semester.]
11.6.4 Imputation
Imputation is the process of estimating or deriving values for fields where data is
missing. There is a vast body of literature on imputation methods and it goes beyond
the scope of this chapter to discuss all of them.
There is no one single best imputation method that works in all cases. The
imputation model of choice depends on what auxiliary information is available and
whether there are (multivariate) edit restrictions on the data to be imputed. The
availability of R software for imputation under edit restrictions is limited. However,
a viable strategy for imputing numerical data is to first impute missing values without
restrictions, and then minimally adjust the imputed values so that the restrictions
are obeyed. Separately, these methods are available in R.
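As a minimal sketch of the first half of that strategy, the missing heights in the small person data can be replaced by the median of the observed values, followed by a check of a restriction. Median imputation is chosen here only because it is the simplest possible model, not because it is recommended; the restriction height > 0 and the flag column height.imputed are illustrative assumptions.

```r
#### Sketch: median imputation followed by a restriction check
person <- data.frame(age = c(21, 42, 18, 21), height = c(6.0, 5.9, NA, NA))
person.imp <- person
ind.miss <- is.na(person$height)
# keep a flag so imputed values can always be told apart from observed ones
person.imp$height.imputed   <- ind.miss
person.imp$height[ind.miss] <- median(person$height, na.rm = TRUE)
person.imp
# verify that the imputed data obey the restriction height > 0
all(person.imp$height > 0)
```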
The purpose of this chapter is to discuss R in the context of a quick review of the
topics we covered last semester in ADA1 (see footnote 1).
1.1 R
R is a language for programming, data management, and statistical
analysis. So many people have written “An Introduction to R” that I refer you to
the course website (see footnote 2) for links to tutorials. I encourage you to learn R by (1) running
the commands in the tutorials, (2) looking at the help for the commands (e.g., ?mean),
and (3) trying things on your own as you become curious. Make mistakes, figure out
why some things don’t work the way you expect, and keep trying. Persistence wins
the day with programming (as does asking and searching for help).
R is more difficult to master (though, more rewarding) than some statistical pack-
ages (such as Minitab) for the following reasons: (1) R does not, in general, provide
a point-and-click environment for statistical analysis. Rather, R uses syntax-based
programs (i.e., code) to define, transform, and read data, and to define the procedures
for analyzing data. (2) R does not really have a spreadsheet environment for data
management. Rather, data are entered directly within an R program, read from a
file, or imported from a spreadsheet. All manipulation, transformation, and selection
of data is coded in the R program. Done well, this means that every step of the
analysis is recorded, repeatable, and understandable.
Take a minute to install the packages we’ll need this semester by executing the
following commands in R.
#### Install packages needed this semester
ADA2.package.list <- c("BSDA", "Hmisc", "MASS", "NbClust",
1 http://statacumen.com/teaching/ada1/
2 http://statacumen.com/teaching/ada2/
# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch01_turkey.csv"
# read file and assign data to turkey variable
turkey <- read.csv(fn.data)
# examine the structure of the dataset, is it what you expected?
# a data.frame containing integers, numbers, and factors
str(turkey)
## 'data.frame': 15 obs. of 3 variables:
## $ age : int 28 20 32 25 23 22 29 27 28 26 ...
## $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
## $ orig : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
# print dataset to screen
turkey
## age weight orig
## 1 28 13.3 va
## 2 20 8.9 va
## 3 32 15.1 va
## 4 25 13.8 wi
## 5 23 13.1 wi
## 6 22 10.4 va
## 7 29 13.1 va
## 8 27 12.4 va
## 9 28 13.2 va
## 10 26 11.8 va
## 11 21 11.5 wi
## 12 31 16.6 wi
## 13 27 14.2 wi
## 14 29 15.4 wi
## 15 30 15.9 wi
# Note: to view the age variable (column), there's a few ways to do that
turkey$age # name the variable
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
turkey[, 1] # give the column number
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
turkey[, "age"] # give the column name
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
# and the structure is a vector
str(turkey$age)
## int [1:15] 28 20 32 25 23 22 29 27 28 26 ...
# let's create an additional variable for later
# gt25mo will be a variable indicating whether the age is greater than 25 months
turkey$gt25mo <- (turkey$age > 25)
# now we also have a Boolean (logical) column
str(turkey)
## 'data.frame': 15 obs. of 4 variables:
## $ age : int 28 20 32 25 23 22 29 27 28 26 ...
## $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
## $ orig : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
## $ gt25mo: logi TRUE FALSE TRUE FALSE FALSE FALSE ...
# there are a couple ways of subsetting the rows
turkey[(turkey$gt25mo == TRUE),] # specify the rows
## age weight orig gt25mo
## 1 28 13.3 va TRUE
## 3 32 15.1 va TRUE
## 7 29 13.1 va TRUE
## 8 27 12.4 va TRUE
## 9 28 13.2 va TRUE
## 10 26 11.8 va TRUE
## 12 31 16.6 wi TRUE
## 13 27 14.2 wi TRUE
## 14 29 15.4 wi TRUE
## 15 30 15.9 wi TRUE
subset(turkey, gt25mo == FALSE) # use subset() to select the data.frame records
## age weight orig gt25mo
## 2 20 8.9 va FALSE
## 4 25 13.8 wi FALSE
## 5 23 13.1 wi FALSE
## 6 22 10.4 va FALSE
## 11 21 11.5 wi FALSE
Analyses can then be done on the entire dataset, or repeated for each subset defined
by a variable in the dataset.
# summaries of each variable in the entire dataset,
summary(turkey)
## age weight orig gt25mo
## Min. :20.00 Min. : 8.90 va:8 Mode :logical
## 1st Qu.:24.00 1st Qu.:12.10 wi:7 FALSE:5
## Median :27.00 Median :13.20 TRUE :10
## Mean :26.53 Mean :13.25
## 3rd Qu.:29.00 3rd Qu.:14.65
## Max. :32.00 Max. :16.60
# or summarize by a variable in the dataset.
by(turkey, turkey$orig, summary)
## turkey$orig: va
## age weight orig gt25mo
## Min. :20.00 Min. : 8.90 va:8 Mode :logical
## 1st Qu.:25.00 1st Qu.:11.45 wi:0 FALSE:2
## Median :27.50 Median :12.75 TRUE :6
## Mean :26.50 Mean :12.28
## 3rd Qu.:28.25 3rd Qu.:13.22
## Max. :32.00 Max. :15.10
## ----------------------------------------------------
## turkey$orig: wi
## age weight orig gt25mo
## Min. :21.00 Min. :11.50 va:0 Mode :logical
## 1st Qu.:24.00 1st Qu.:13.45 wi:7 FALSE:3
## Median :27.00 Median :14.20 TRUE :4
## Mean :26.57 Mean :14.36
## 3rd Qu.:29.50 3rd Qu.:15.65
## Max. :31.00 Max. :16.60
library(ggplot2)
# Histogram overlaid with kernel density curve
p11 <- ggplot(turkeyva, aes(x = weight))
# Histogram with density instead of count on y-axis
p11 <- p11 + geom_histogram(aes(y=..density..)
, binwidth=2
, colour="black", fill="white")
# Overlay with transparent density plot
p11 <- p11 + geom_density(alpha=0.1, fill="#FF6666")
p11 <- p11 + geom_rug()
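# violin plot
p12 <- ggplot(turkeyva, aes(x = "weight", y = weight))
p12 <- p12 + geom_violin(fill = "gray50")
p12 <- p12 + geom_boxplot(width = 0.2, alpha = 3/4)
p12 <- p12 + coord_flip()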
# boxplot
p13 <- ggplot(turkeyva, aes(x = "weight", y = weight))
p13 <- p13 + geom_boxplot()
p13 <- p13 + coord_flip()
library(gridExtra)
#grid.arrange(p11, p12, p13, ncol=1, main="Turkey weights for origin va")
## add grobs = list(), and main= becomes top=
grid.arrange(grobs = list(p11, p12, p13), ncol=1, top="Turkey weights for origin va")
# violin plot
p22 <- ggplot(turkeywi, aes(x = "weight", y = weight))
p22 <- p22 + geom_violin(fill = "gray50")
p22 <- p22 + geom_boxplot(width = 0.2, alpha = 3/4)
p22 <- p22 + coord_flip()
# boxplot
p23 <- ggplot(turkeywi, aes(x = "weight", y = weight))
p23 <- p23 + geom_boxplot()
p23 <- p23 + coord_flip()
library(gridExtra)
grid.arrange(grobs = list(p21, p22, p23), ncol=1, top="Turkey weights for origin wi")
[Figure: for origin va (left) and origin wi (right), three stacked panels of weight: histogram with density curve and rug, violin plot with boxplot, and boxplot.]
Check normality of each sample graphically with the bootstrap sampling distribution
and a normal quantile plot, and formally with normality tests.
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam.mean <- colMeans(sam);
# save par() settings
[Figure: for turkeyva$weight (left) and turkeywi$weight (right): "Plot of data with smoothed density curve", "Bootstrap sampling distribution of the mean", and a normal quantile plot.]
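To make the bootstrap idea concrete, here is a minimal sketch that only returns the bootstrap means and compares their spread with the usual standard error of the mean. The helper name bs.mean.sketch is hypothetical and is not the bs.one.samp.dist function used in these notes; the VA weights are copied from the data listing above.

```r
# VA turkey weights, copied from the printed turkey data above
va.weight <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)

# hypothetical helper: draw N bootstrap samples and return their means
bs.mean.sketch <- function(dat, N = 1e4) {
  n <- length(dat)
  # each column is one resample (with replacement) of size n
  sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol = N)
  colMeans(sam)
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
bs.means <- bs.mean.sketch(va.weight)

# normal reference: centered at the sample mean with SD equal to the SEM
c(mean    = mean(va.weight), sem   = sd(va.weight) / sqrt(length(va.weight)))
# bootstrap sampling distribution of the mean
c(bs.mean = mean(bs.means) , bs.sd = sd(bs.means))
```

If a histogram of bs.means looks close to the normal reference curve, the sampling distribution of the mean is plausibly normal even when the sample size is small.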
# Normality tests
# VA
shapiro.test(turkeyva$weight)
##
## Shapiro-Wilk normality test
##
## data: turkeyva$weight
## W = 0.95414, p-value = 0.7528
library(nortest)
ad.test(turkeyva$weight)
##
## Anderson-Darling normality test
##
## data: turkeyva$weight
## A = 0.283, p-value = 0.5339
# lillie.test(turkeyva$weight)
cvm.test(turkeyva$weight)
##
## Cramer-von Mises normality test
##
## data: turkeyva$weight
## W = 0.050135, p-value = 0.4642
# WI
shapiro.test(turkeywi$weight)
##
## Shapiro-Wilk normality test
##
## data: turkeywi$weight
# VA
t.summary <- t.test(turkeyva$weight, mu = 12)
t.summary
##
## One Sample t-test
##
## data: turkeyva$weight
## t = 0.40582, df = 7, p-value = 0.697
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 10.67264 13.87736
## sample estimates:
## mean of x
## 12.275
# WI
t.summary <- t.test(turkeywi$weight, mu = 12)
t.summary
##
## One Sample t-test
##
## data: turkeywi$weight
## t = 3.5442, df = 6, p-value = 0.01216
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 12.72978 15.98450
## sample estimates:
## mean of x
## 14.35714
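As a check on what t.test() reports, the one-sample t statistic, p-value, and confidence interval for the VA turkeys can be computed by hand. This is a sketch using the VA weights copied from the data listing above; the variable names are my own.

```r
# VA turkey weights, copied from the printed turkey data above
va.weight <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)
mu0 <- 12                             # hypothesized mean

n      <- length(va.weight)
se     <- sd(va.weight) / sqrt(n)     # standard error of the mean
t.stat <- (mean(va.weight) - mu0) / se
# two-sided p-value from the t distribution with n - 1 df
p.val  <- 2 * pt(-abs(t.stat), df = n - 1)
# 95% confidence interval for the mean
ci     <- mean(va.weight) + c(-1, 1) * qt(1 - 0.05 / 2, df = n - 1) * se

c(t = t.stat, df = n - 1, p.value = p.val)
ci
```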
# Sign test for the median
# VA
library(BSDA)
SIGN.test(turkeyva$weight, md=12)
##
## One-sample Sign-Test
##
## data: turkeyva$weight
## s = 5, p-value = 0.7266
## alternative hypothesis: true median is not equal to 12
## 95 percent confidence interval:
## 9.9125 13.8850
## sample estimates:
## median of x
## 12.75
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9297 10.4000 13.300
## Interpolated CI 0.9500 9.9125 13.885
## Upper Achieved CI 0.9922 8.9000 15.100
# WI
SIGN.test(turkeywi$weight, md=12)
##
## One-sample Sign-Test
##
## data: turkeywi$weight
## s = 6, p-value = 0.125
## alternative hypothesis: true median is not equal to 12
## 95 percent confidence interval:
## 12.00286 16.38000
## sample estimates:
## median of x
## 14.2
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8750 13.1000 15.90
## Interpolated CI 0.9500 12.0029 16.38
## Upper Achieved CI 0.9844 11.5000 16.60
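The sign test is a binomial test on the number of observations above the hypothesized median, so its p-value can be reproduced with base R. A sketch for the VA turkeys, using the weights copied from the data listing above:

```r
# VA turkey weights, copied from the printed turkey data above
va.weight <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)
md0 <- 12                          # hypothesized median

s     <- sum(va.weight > md0)      # number of observations above md0
n.eff <- sum(va.weight != md0)     # observations equal to md0 are dropped

# under H0 each observation is above md0 with probability 1/2
p.val <- binom.test(s, n.eff, p = 0.5)$p.value
c(s = s, n = n.eff, p.value = p.val)
```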
# Wilcoxon signed-rank test for the median (or mean, since symmetric assumption)
# VA
# with continuity correction in the normal approximation for the p-value
wilcox.test(turkeyva$weight, mu=12, conf.int=TRUE)
## Warning in wilcox.test.default(turkeyva$weight, mu = 12, conf.int = TRUE): cannot
compute exact p-value with ties
##
## Wilcoxon signed rank test
##
## data: turkeywi$weight
## V = 27, p-value = 0.03125
## alternative hypothesis: true location is not equal to 12
## 95 percent confidence interval:
## 12.65 16.00
## sample estimates:
## (pseudo)median
## 14.375
# boxplot
p2 <- ggplot(turkey, aes(x = orig, y = weight))
p2 <- p2 + geom_boxplot()
# add a "+" at the mean
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
p2 <- p2 + geom_point()
p2 <- p2 + coord_flip()
p2 <- p2 + labs(title = "Boxplot with mean (+) and points")
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), ncol=2, nrow=3
, top="Turkey weights compared by origin")
[Figure: "Turkey weights compared by origin". Panels: "Dotplot with position jitter", "Boxplot with mean (+) and points", and three histograms of weight (count) by orig.]
Before using the two-sample t-test, first check that the sampling distribution of
the difference in means between the two samples is approximately normal.
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist(turkeyva$weight, turkeywi$weight)
[Figure: density plots of Sample 1 (n = 8, mean = 12.275, sd = 1.9167) and Sample 2 (n = 7, mean = 14.357, sd = 1.7596), and the bootstrap sampling distribution of diff.mean.]
##
## data: turkeyva$weight and turkeywi$weight
## W = 11.5, p-value = 0.06384
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.19994493 0.09993686
## sample estimates:
## difference in location
## -2.152771
# without continuity correction
wilcox.test(turkeyva$weight, turkeywi$weight, conf.int=TRUE, correct=FALSE)
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE,
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE,
: cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test
##
## data: turkeyva$weight and turkeywi$weight
## W = 11.5, p-value = 0.05598
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.100049e+00 1.445586e-05
## sample estimates:
## difference in location
## -2.152771
# id.vars: ID variables
# all variables to keep but not split apart on
# id.vars=NULL,
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
# measure.vars = c("PT1","PT2","PT3","PT4","PT5"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "plant",
# value.name: column name for values in table
value.name = "runup",
# remove the NA values
na.rm = TRUE
)
## No id variables; using all as measure variables
str(waste.long)
## 'data.frame': 95 obs. of 2 variables:
## $ plant: Factor w/ 5 levels "PT1","PT2","PT3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ runup: num 1.2 10.1 -2 1.5 -3 -0.7 3.2 2.7 -3.2 -1.7 ...
head(waste.long)
## plant runup
## 1 PT1 1.2
## 2 PT1 10.1
## 3 PT1 -2.0
## 4 PT1 1.5
## 5 PT1 -3.0
## 6 PT1 -0.7
tail(waste.long)
## plant runup
## 96 PT5 22.3
## 97 PT5 3.1
## 98 PT5 16.8
## 99 PT5 11.3
## 100 PT5 12.3
## 101 PT5 16.9
# Calculate the mean, sd, n, and se for the plants
)
}
)
# standard errors
waste.summary$se <- waste.summary$s/sqrt(waste.summary$n)
waste.summary$moe <- qt(1 - 0.05 / 2, df = waste.summary$n - 1) * waste.summary$se
# individual confidence limits
waste.summary$ci.l <- waste.summary$m - waste.summary$moe
[Figure: plot of run-up waste by plant.]
The outliers here suggest the ANOVA is not an appropriate model. The normality
tests below suggest the distributions for the first two plants are not normal.
by(waste.long$runup, waste.long$plant, ad.test)
## waste.long$plant: PT1
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 2.8685, p-value = 1.761e-07
##
## ----------------------------------------------------
## waste.long$plant: PT2
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 2.5207, p-value = 1.334e-06
##
## ----------------------------------------------------
## waste.long$plant: PT3
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.23385, p-value = 0.7624
##
## ----------------------------------------------------
## waste.long$plant: PT4
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.12363, p-value = 0.9834
##
## ----------------------------------------------------
## waste.long$plant: PT5
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.27445, p-value = 0.6004
For review purposes, I’ll fit the ANOVA, but we would count on the following
nonparametric method for inference.
fit.w <- aov(runup ~ plant, data = waste.long)
summary(fit.w)
## Df Sum Sq Mean Sq F value Pr(>F)
## plant 4 451 112.73 1.16 0.334
## Residuals 90 8749 97.21
fit.w
## Call:
## aov(formula = runup ~ plant, data = waste.long)
##
## Terms:
## plant Residuals
## Sum of Squares 450.921 8749.088
## Deg. of Freedom 4 90
##
## Residual standard error: 9.859619
## Estimated effects may be unbalanced
[Figure: "QQ Plot of residuals", fit.w$residuals versus norm quantiles, with the most extreme observations labeled.]
##
## Kruskal-Wallis rank sum test
##
## data: runup by plant
## Kruskal-Wallis chi-squared = 15.319, df = 4, p-value =
## 0.004084
# Bonferroni 95% pairwise comparisons with continuity correction
# in the normal approximation for the p-value
for (i1.pt in 1:4) {
for (i2.pt in (i1.pt+1):5) {
wt <- wilcox.test(waste[,names(waste)[i1.pt]], waste[,names(waste)[i2.pt]]
, conf.int=TRUE, conf.level = 1 - 0.05/choose(5,2))
cat(names(waste)[i1.pt], names(waste)[i2.pt])
print(wt)
}
}
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT1 PT2
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 131.5, p-value = 0.009813
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -8.299958 1.599947
## sample estimates:
## difference in location
## -4.399951
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT1 PT3
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 141.5, p-value = 0.07978
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -6.900028 2.700029
## sample estimates:
## difference in location
## -2.500047
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## -13.4 1.7
## sample estimates:
## difference in location
## -6.6
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT4 PT5
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 82, p-value = 0.1157
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -11.099965 4.799945
## sample estimates:
## difference in location
## -4.000035
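The "99.5 percent confidence interval" in the output above comes from the Bonferroni adjustment in the loop. A short sketch of that arithmetic (variable names are my own):

```r
n.groups      <- 5                          # plants PT1 to PT5
n.comparisons <- choose(n.groups, 2)        # number of distinct pairs
alpha.family  <- 0.05                       # family-wise error rate

# Bonferroni: split the family-wise error rate equally over the comparisons
alpha.each      <- alpha.family / n.comparisons
conf.level.each <- 1 - alpha.each
c(comparisons = n.comparisons, alpha.each = alpha.each, conf.level = conf.level.each)

# the pairs compared, one per column
combn(paste0("PT", 1:n.groups), 2)
```

A pairwise difference is declared significant at the family-wise 5% level only when its individual p-value is below alpha.each.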
##
## Pearson's Chi-squared test
##
## data: xt
## X-squared = 0.53571, df = 1, p-value = 0.4642
# the default is to perform Yates' continuity correction
chisq.test(xt)
## Warning in chisq.test(xt): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xt
## X-squared = 0.033482, df = 1, p-value = 0.8548
# Fisher's exact test
fisher.test(xt)
##
## Fisher's Exact Test for Count Data
##
## data: xt
## p-value = 0.6084
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.02687938 6.23767632
## sample estimates:
## odds ratio
## 0.4698172
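The uncorrected Pearson statistic can be reproduced by hand from the observed and expected counts. In this sketch the table is rebuilt from the FALSE/TRUE counts of gt25mo by orig shown in the by() summary earlier; the object names other than xt are my own.

```r
# orig by gt25mo counts, taken from the by() summary above
xt <- matrix(c(2, 3, 6, 4), nrow = 2
           , dimnames = list(orig = c("va", "wi"), gt25mo = c("FALSE", "TRUE")))

# expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(xt), colSums(xt)) / sum(xt)

# Pearson residuals and the chi-squared statistic
pearson.resid <- (xt - expected) / sqrt(expected)
chisq <- sum(pearson.resid^2)
df    <- (nrow(xt) - 1) * (ncol(xt) - 1)
p.val <- pchisq(chisq, df = df, lower.tail = FALSE)

expected
round(pearson.resid, 2)
c(X.squared = chisq, df = df, p.value = p.val)
```

The expected counts below 5 are why chisq.test() warns that the chi-squared approximation may be incorrect, and why Fisher's exact test is worth reporting here.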
A mosaic plot is for categorical data. Area represents frequency. The default
shading is a good start, since colors only appear when there’s evidence of associ-
ation related to those cell values. In our example, there’s insufficient evidence for
association, so the default shading is all gray.
library(vcd) # for mosaic()
# shading based on significance relative to appropriate chi-square distribution
mosaic(xt, shade=TRUE)
# you can define your own interpolated shading
mosaic(xt, shade=TRUE, gp_args = list(interpolate = seq(.1,.2,.05)))
[Figure: two mosaic plots of orig (va, wi) by gt25mo (FALSE, TRUE), shaded by Pearson residuals (range −0.41 to 0.44), p-value = 0.46421; the second uses the user-defined interpolated shading.]
# Contribution to chi-sq
# pull out only the cellname and chisq columns
x.table.chisq <- x.table[, c("cellname","chisq")]
# reorder the cellname categories to be descending relative to the chisq statistic
x.table.chisq$cellname <- with(x.table, reorder(cellname, -chisq))
[Figure: bar plot of observed (obs) and expected (exp) counts, and bar plot of each cell's contribution to the chi-squared statistic.]
[Figure: scatterplot of shearpsi versus agewks with observations labeled by number.]
Plot diagnostics.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))
# residuals vs agewks
plot(rocket$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")
# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 5 6 1
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: six diagnostic panels for lm.shearpsi.agewks: residuals vs fitted values, Cook's distance, Cook's distance vs leverage, "Residuals vs agewks", "QQ Plot", and "Residuals vs Order of data". Labeled observations include 5 and 6.]
The relationship between shear strength and age is fairly linear with predicted
shear strength decreasing as the age of the propellant increases. The fitted LS line is
The test for $H_0: \beta_1 = 0$ (zero slope for the population regression line) is highly
significant: p-value $< 0.0001$. Also note that $R^2 = 0.9018$, so the linear relationship
between shear strength and age explains about 90% of the variation in shear strength.
The data plot and residual information identify observations 5 and 6 as potential
outliers ($r_5 = -2.38$, $r_6 = -2.32$). The predicted values for these observations are
much greater than the observed shear strengths. These same observations appear as
potential outliers in the normal scores plot and the plot of $r_i$ against $\hat{Y}_i$. Observations
5 and 6 also have the largest influence on the analysis; see the Cook’s distance values.
A sensible next step would be to repeat the analysis holding out the most influ-
ential case, observation 5. It should be somewhat clear that the influence of case
6 would increase dramatically once case 5 is omitted from the analysis. Since both
cases have essentially the same effect on the positioning of the LS line, I will assess
the impact of omitting both simultaneously.
Before we hold out these cases, how do you think the LS line will change? My
guess is these cases are pulling the LS line down, so the intercept of the LS line should
increase once these cases are omitted. Holding out either case 5 or 6 would probably
also affect the slope, but my guess is that when they are both omitted the slope will
change little. (Is this my experience speaking, or have I already seen the output?
Both.) What will happen to $R^2$ when we delete these points?
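The hold-out comparison described above follows a general pattern: fit, find the most influential case, refit without it, and compare. The sketch below illustrates that pattern on R's built-in cars data, which is only a stand-in (it is NOT the rocket propellant data); all object names are my own.

```r
# Stand-in data: built-in 'cars' dataset (an assumption for illustration,
# not the rocket data analyzed in these notes)
lm.full <- lm(dist ~ speed, data = cars)

# standardized residuals and Cook's distances for each case
r.std <- rstandard(lm.full)
cook  <- cooks.distance(lm.full)

# row position of the most influential case
worst <- unname(which.max(cook))
c(case = worst, r.std = r.std[[worst]], cook = cook[[worst]])

# refit holding out that case
lm.hold <- lm(dist ~ speed, data = cars[-worst, ])

# compare coefficients, residual SD, and R^2 with and without the case
rbind(
    full     = c(coef(lm.full), sigma = summary(lm.full)$sigma
               , R2 = summary(lm.full)$r.squared)
  , held.out = c(coef(lm.hold), sigma = summary(lm.hold)$sigma
               , R2 = summary(lm.hold)$r.squared)
  )
```

Comparing the two rows answers the same questions posed above: how do the intercept, slope, residual standard deviation, and $R^2$ change when influential cases are omitted?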
[Figure: scatterplot of shearpsi versus agewks with observations 5 and 6 held out; observations labeled by number.]
Plot diagnostics.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))
# residuals vs agewks
plot(rocket56$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")
# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 12 20 2
## 10 18 2
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: six diagnostic panels for lm.shearpsi.agewks with cases 5 and 6 held out: residuals vs fitted values, Cook's distance, Cook's distance vs leverage, "Residuals vs agewks", "QQ Plot", and "Residuals vs Order of data". Labeled observations include 2, 12, and 20.]
Some summaries for the complete analysis, and when cases 5 and 6 are held out,
are given below. The summaries lead to the following conclusions:
1. Holding out cases 5 and 6 has little effect on the estimated LS line. Predictions
of shear strength are slightly larger after holding out these two cases (recall that
the intercept increased, but the slope was roughly the same!).
2. Holding out these two cases decreases $\hat{\sigma}$ considerably, and leads to a modest
increase in $R^2$. The complete data set will give wider CIs and prediction intervals
than the analysis which deletes cases 5 and 6 because $\hat{\sigma}$ decreases when these
points are omitted.
3. Once these cases are held out, the normal scores plot and plot of the studentized
residuals against fitted values show no significant problems. One observation
Review complete
Now that we’re warmed up, let’s dive into new material!
1 This problem is from the Minitab handbook.
## $ ht : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
## $ chin : num 8 3.3 3.3 3.7 9 3 7.3 3.7 10.3 5.7 ...
## $ fore : num 7 5 1.3 3 12.7 3.3 4.7 4.3 9 4 ...
## $ calf : num 12.7 8 4.3 4.3 20.7 5.7 8 0 10 6 ...
## $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
## $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
## $ diabp: int 76 60 75 120 78 72 76 62 70 64 ...
# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp
[Figure: "Indian sysbp by yrage with continuous wt" (left, points labeled by id and colored by wt) and "Indian sysbp by yrage with categorical wt" (right, plotting symbols L, M, H for wtcat).]
Fit the simple linear regression model reporting the ANOVA table (“Terms”) and
parameter estimate table (“Coefficients”).
# fit the simple linear regression model
lm.sysbp.yrage <- lm(sysbp ~ yrage, data = indian)
# use Anova() from library(car) to get ANOVA table (Type 3 SS, df)
library(car)
Anova(lm.sysbp.yrage, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 178221 1 1092.9484 < 2e-16 ***
## yrage 498 1 3.0544 0.08881 .
## Residuals 6033 37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.sysbp.yrage)
##
## Call:
## lm(formula = sysbp ~ yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.161 -10.987 -1.014 6.851 37.254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 133.496 4.038 33.060 <2e-16 ***
## yrage -15.752 9.013 -1.748 0.0888 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.77 on 37 degrees of freedom
sysbp = β0 + β1 yrage + ε.
and suggests that average systolic blood pressure decreases as the fraction of life spent
in modern society increases. However, the t-test of H0 : β1 = 0 is not significant at
the 5% level (p-value=0.08881). That is, the weak linear relationship observed in
the data is not atypical of a population where there is no linear relationship between
systolic blood pressure and the fraction of life spent in a modern society.
Even if this test were significant, the small value of $R^2 = 0.07626$ suggests that
yrage fraction does not explain a substantial amount of the variation in the systolic
blood pressures. If we omitted the individual with the highest blood pressure, the
relationship would be weaker.
so the model implies that average systolic blood pressure is a linear combination
of yrage fraction and weight. As in simple linear regression, the standard multiple
regression analysis assumes that the responses are normally distributed with a constant
variance $\sigma^2$. The parameters of the regression model $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma^2$ are estimated
by least squares (LS).
Here is the multiple regression model with yrage and wt (weight) as predictors.
Add wt to the right hand side of the previous formula statement.
2. Looking at the ANOVA tables for the simple linear and the multiple regression
models, we see that the Regression (model) df has increased from 1 to 2
(2 = number of predictor variables) and the Residual (error) df has decreased
from 37 to 36 ($= n - 1 -$ number of predictors). Adding a predictor increases
the Regression df by 1 and decreases the Residual df by 1.
3. The Residual SS decreases by $6033.37 - 3441.36 = 2592.01$ upon adding the
weight term. The Regression SS increases by 2592.01 upon adding the weight
term to the model. The Total SS does not depend on the number of
predictors so it stays the same. The Residual SS, or the part of the variation
in the response unexplained by the regression model, never increases when new
predictors are added. (You can’t add a predictor and explain less variation.)
4. The proportion of variation in the response explained by the regression model:
$R^2 = \text{Regression SS} / \text{Total SS}$
never decreases when new predictors are added to a model. The $R^2$ for the
simple linear regression was 0.076, whereas
$R^2 = 3090.08 / 6531.44 = 0.473$
for the multiple regression model. Adding the weight variable to the model
increases $R^2$ by about 0.40. That is, weight explains an additional 40% of the
total variation in systolic blood pressure beyond what is explained by yrage
fraction alone.
5. The estimated variability about the regression line
$\text{Residual MS} = \hat{\sigma}^2$
decreased dramatically after adding the weight effect. For the simple linear
regression model $\hat{\sigma}^2 = 163.06$, whereas $\hat{\sigma}^2 = 95.59$ for the multiple regression
model. This suggests that an important predictor has been added to the model.
6. The F -statistic for the multiple regression model
as a plotting symbol. The relationship between systolic blood pressure and fraction
is fairly linear within each weight category, and stronger than when we ignore weight.
The slopes in the three groups are negative and roughly constant.
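The sums-of-squares bookkeeping in the numbered points above can be verified with a few lines of arithmetic. This sketch uses only the SS and df values quoted in the text; the variable names are my own.

```r
# values quoted in the text
ss.total     <- 6531.44
ss.resid.slr <- 6033.37 ; df.resid.slr <- 37   # sysbp ~ yrage
ss.resid.mlr <- 3441.36 ; df.resid.mlr <- 36   # sysbp ~ yrage + wt

# Regression SS = Total SS - Residual SS
ss.reg.slr <- ss.total - ss.resid.slr
ss.reg.mlr <- ss.total - ss.resid.mlr

# drop in Residual SS equals the gain in Regression SS
c(resid.drop = ss.resid.slr - ss.resid.mlr, reg.gain = ss.reg.mlr - ss.reg.slr)

# R^2 = Regression SS / Total SS
r2.slr <- ss.reg.slr / ss.total
r2.mlr <- ss.reg.mlr / ss.total
round(c(slr = r2.slr, mlr = r2.mlr, increase = r2.mlr - r2.slr), 3)

# Residual MS = sigma-hat^2 = Residual SS / Residual df
sigma2.slr <- ss.resid.slr / df.resid.slr
sigma2.mlr <- ss.resid.mlr / df.resid.mlr
round(c(slr = sigma2.slr, mlr = sigma2.mlr), 2)
```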
To see why yrage fraction is an important predictor after taking weight into con-
sideration, let us return to the multiple regression model. The model implies that the
average systolic blood pressure is a linear combination of yrage fraction and weight:
$\widehat{\text{sysbp}} = \beta_0 + \beta_1\,\text{yrage} + \beta_2\,\text{wt}.$
For each fixed weight, the average systolic blood pressure is linearly related to yrage
fraction with a constant slope β1 , independent of weight. A similar interpretation
holds if we switch the roles of yrage fraction and weight. That is, if we fix the value
of fraction, then the average systolic blood pressure is linearly related to weight with
a constant slope β2 , independent of yrage fraction.
To see this point, suppose that the LS estimates of the regression parameters are
the true values
$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21\,\text{wt}.$
If we restrict our attention to 50kg Indians, the average systolic blood pressure as a
function of fraction is
$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21(50) = 121.39 - 26.76\,\text{yrage}.$
For 60kg Indians,
$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21(60) = 133.49 - 26.76\,\text{yrage}.$
Hopefully the pattern is clear: the average systolic blood pressure decreases by
26.76 for each increase of 1 on fraction, regardless of one’s weight. If we vary weight
over its range of values, we get a set of parallel lines (i.e., equal slopes) when we
plot average systolic blood pressure as a function of yrage fraction. The intercept
increases by 1.21 for each increase of 1kg in weight.
Similarly, if we plot the average systolic blood pressure as a function of weight,
for several fixed values of fraction, we see a set of parallel lines with slope 1.21, and
intercepts decreasing by 26.76 for each increase of 1 in fraction.
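The parallel-lines interpretation can be checked numerically from the fitted coefficients quoted above. The function name pred.sysbp is hypothetical; only the three coefficient values come from the text.

```r
# LS estimates quoted in the text
b <- c(intercept = 60.89, yrage = -26.76, wt = 1.21)

# hypothetical helper: predicted sysbp for given yrage fraction and weight (kg)
pred.sysbp <- function(yrage, wt) {
  b[["intercept"]] + b[["yrage"]] * yrage + b[["wt"]] * wt
}

# intercepts of the yrage lines at 50kg and 60kg
c(wt50 = pred.sysbp(0, 50), wt60 = pred.sysbp(0, 60))

# slope in yrage is the same at every weight (parallel lines)
sapply(c(50, 60, 70), function(w) pred.sysbp(1, w) - pred.sysbp(0, w))

# slope in weight is the same at every yrage fraction
sapply(c(0, 0.5, 1), function(f) pred.sysbp(f, 61) - pred.sysbp(f, 60))
```

Without an interaction term the model cannot produce anything other than parallel lines, which is why the plots by weight category are a useful check.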
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(indian, aes(x = wt, y = sysbp, label = id))
p <- p + geom_point(aes(colour=yrage), size=2)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + labs(title="Indian sysbp by wt with continuous yrage")
print(p)
[Figure: "Indian sysbp by wt with continuous yrage". Scatterplot of sysbp (y-axis) against wt (x-axis), points labelled by id and coloured by yrage, with the fitted regression line and confidence band.]
If we had more data we could check the model by plotting systolic blood pressure
against fraction, broken down by individual weights. The plot should show a fairly
linear relationship between systolic blood pressure and fraction, with a constant slope
across weights. I grouped the weights into categories because of the limited number
of observations. The same phenomenon should approximately hold, and it does. If
the slopes for the different weight groups changed drastically with weight, but the
relationships were linear, we would need to include an interaction or product variable
wt × yrage in the model, in addition to weight and yrage fraction. This is probably
not warranted here.
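One way to check formally whether a product term is needed is to compare the additive model with the model that adds wt × yrage. The sketch below uses simulated stand-in data (the data frame sim and its generating values are hypothetical, not the course data set) and a partial F-test from anova().

```r
set.seed(1)
n <- 39

# Simulated stand-in data (hypothetical values, generated with no interaction)
sim <- data.frame(wt = runif(n, 50, 85), yrage = runif(n, 0, 0.9))
sim$sysbp <- 60 + 1.2 * sim$wt - 27 * sim$yrage + rnorm(n, sd = 10)

# Additive model: parallel lines in yrage across weights
fit_add <- lm(sysbp ~ wt + yrage, data = sim)

# Interaction model: the yrage slope is allowed to change with weight
fit_int <- lm(sysbp ~ wt + yrage + wt:yrage, data = sim)

# Partial F-test for the product term; a large p-value supports the additive model
anova(fit_add, fit_int)
```

With the real data, the same two lm() calls would be fitted to the indian data frame in place of sim.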
A final issue that I wish to address concerns the interpretation of the estimates of
the regression coefficients in a multiple regression model. For the fitted model
\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21\,\text{wt}
our interpretation is consistent with the explanation of the regression model given
above. For example, focus on the yrage fraction coefficient. The negative coeffi-
cient indicates that the predicted systolic blood pressure decreases as yrage fraction
increases holding weight constant. In particular, the predicted systolic blood pres-
sure decreases by 26.76 for each unit increase in fraction, holding weight constant at
any value. Similarly, the predicted systolic blood pressure increases by 1.21 for each
unit increase in weight, holding yrage fraction constant at any level.
This example was meant to illustrate multiple regression. A more complete anal-
ysis of the data, including diagnostics, will be given later.
The data below are selected from a larger collection of data referring to candidates for
the General Certificate of Education (GCE) who were being considered for a special
award. Here, Y denotes the candidate’s total mark, out of 1000, in the GCE exam,
while X1 is the candidate’s score in the compulsory part of the exam, which has a
maximum score of 200 of the 1000 points on the exam. X2 denotes the candidate’s
score, out of 100, in a School Certificate English Language paper taken on a previous
occasion.
#### Example: GCE
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_gce.dat"
gce <- read.table(fn.data, header=TRUE)
str(gce)
## 'data.frame': 15 obs. of 3 variables:
## $ y : int 476 457 540 551 575 698 545 574 645 690 ...
## $ x1: int 111 92 90 107 98 150 118 110 117 114 ...
## $ x2: int 68 46 50 59 50 66 54 51 59 80 ...
## print dataset to screen
#gce
y x1 x2
1 476 111 68
2 457 92 46
3 540 90 50
4 551 107 59
5 575 98 50
6 698 150 66
7 545 118 54
8 574 110 51
9 645 117 59
10 690 114 80
11 634 130 57
12 637 118 51
13 390 91 44
14 562 118 61
15 560 109 66
I will lead you through a number of steps to help you answer this question. Let
us answer the following straightforward questions.
1. Plot Y against X1 and X2 individually, and comment on the form (i.e., linear,
non-linear, logarithmic, etc.), strength, and direction of the relationships.
2. Plot X1 against X2 and comment on the form, strength, and direction of the
relationship.
3. Compute the correlation between all pairs of variables. Do the correlations
appear sensible, given the plots?
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
#p <- ggpairs(gce, progress=FALSE)
# put scatterplots on top so y axis is vertical
p <- ggpairs(gce, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of y, x1, and x2 from ggpairs(), with density plots on the diagonal. Correlations: y and x1, 0.731; y and x2, 0.548; x1 and x2, 0.509.]
# Normality of Residuals
library(car)
qqPlot(lm.y.x1$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 10 13 1
# residuals vs order of data
plot(lm.y.x1$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.y.x1. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), and plots of lm.y.x1$residuals including the QQ Plot and Residuals vs Order of data. Observations 1, 10, and 13 are labelled.]
Model Y = β0 + β1 X2 + ε:
# y ~ x2
lm.y.x2 <- lm(y ~ x2, data = gce)
library(car)
Anova(lm.y.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x2, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.y.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 13 12
[Figure: diagnostic plots for lm.y.x2. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, and plots of lm.y.x2$residuals including the QQ Plot. Observations 1, 12, and 13 are labelled.]
Answer: R2 is 0.53 for the model with X1 and 0.30 with X2 . Equivalently, the
Model SS is larger for X1 (53970) than for X2 (30321). Thus, X1 appears to be
a better predictor of Y than X2 .
5. Consider 2 simple linear regression models for predicting Y , one with X1 as a
predictor, and the other with X2 as the predictor. Do X1 and X2 individually
appear to be important for explaining the variation in Y ? (i.e., test that the
slopes of the regression lines are zero). Which, if any, of the output supports
or contradicts your answer to the previous question?
Answer: The model with X1 has a t-statistic of 3.86 with an associated p-
value of 0.0020, while X2 has a t-statistic of 2.36 with an associated p-value of
0.0346. Both predictors explain a significant amount of variability in Y . This
is consistent with part (4).
6. Fit the multiple regression model
Y = β0 + β1 X1 + β2 X2 + ε.
Diagnostic plots suggest the residuals are roughly normal with no substantial
outliers, though the Cook’s distance is substantially larger for observation 10.
We may wish to fit the model without observation 10 to see whether conclusions
change.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x1.x2, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.y.x1.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 13 5
[Figure: diagnostic plots for lm.y.x1.x2. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, and plots of lm.y.x1.x2$residuals including the QQ Plot. Observations 1, 5, and 13 are labelled.]
both X1 and X2 is 0.5757. There is only a very small increase in R2 from the
model with only X1 when X2 is added, which is consistent with X2 not being
important given that X1 is already in the model.
9. Do your best to answer the question posed above, in the paragraph after the
data “A goal . . . ”. Provide an equation (LS) for predicting Y .
Answer: Yes, we’ve seen that X1 may be used to predict Y , and that X2 does not
explain significantly more variability in the model with X1 . Thus, the preferred
model has only X1 :
ŷ = 128.55 + 3.95X1 .
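The comparison behind this answer can be reproduced directly from the 15 observations printed earlier. This is a sketch, with the data typed in by hand and the object names fit_x1 and fit_both chosen for illustration; the value x1 = 120 used for prediction is an arbitrary example.

```r
# GCE data as printed in the notes
gce <- data.frame(
  y  = c(476, 457, 540, 551, 575, 698, 545, 574, 645, 690, 634, 637, 390, 562, 560),
  x1 = c(111, 92, 90, 107, 98, 150, 118, 110, 117, 114, 130, 118, 91, 118, 109),
  x2 = c(68, 46, 50, 59, 50, 66, 54, 51, 59, 80, 57, 51, 44, 61, 66)
)

fit_x1   <- lm(y ~ x1, data = gce)        # preferred model
fit_both <- lm(y ~ x1 + x2, data = gce)   # model with both predictors

# LS equation for predicting Y from X1 (approximately 128.55 and 3.95)
round(coef(fit_x1), 2)

# R-squared for the two models: adding x2 changes it only slightly
c(x1_only = summary(fit_x1)$r.squared, both = summary(fit_both)$r.squared)

# Partial F-test: does x2 add anything once x1 is in the model?
anova(fit_x1, fit_both)

# Predicted total mark for a hypothetical candidate with x1 = 120
predict(fit_x1, newdata = data.frame(x1 = 120))
```

The partial F-test here is equivalent to the t-test on the x2 coefficient in the two-predictor model.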
I will give you my thoughts on these data, and how I would attack this problem,
keeping the ultimate goal in mind. I will examine whether transformations of the data
are appropriate, and whether any important conclusions are dramatically influenced
by individual observations. I will use some new tools to attack this problem, and will
outline how they are used.
The plot of GCE (Y ) against COMP (X1 ) is fairly linear, but the trend in the
plot of GCE (Y ) against SCEL (X2 ) is less clear. You might see a non-linear trend
here, but the relationship is not very strong. When I assess plots I try to not allow
a few observations to affect my perception of trend, and with this in mind, I do not see
any strong evidence at this point to transform any of the variables.
One difficulty that we must face when building a multiple regression model is that
these two-dimensional (2D) plots of a response against individual predictors may have
little information about the appropriate scales for a multiple regression analysis. In
particular, the 2D plots only tell us whether we need to transform the data in a simple
linear regression analysis. If a 2D plot shows a strong non-linear trend, I would do
an analysis using the suggested transformations, including any other effects that are
important. However, it might be that no variables need to be transformed in the
multiple regression model.
The partial regression residual plot, or added variable plot, is a graphical tool
that provides information about the need for transformations in a multiple regression
model. The following code uses avPlots() from the car package to generate the partial
regression residual plots for each predictor in the multiple regression model that has
COMP and SCEL as predictors of GCE.
library(car)
avPlots(lm.y.x1.x2, id = list(n = 3))
[Figure: Added-Variable Plots for lm.y.x1.x2. Left panel: y | others against x1 | others. Right panel: y | others against x2 | others. Extreme observations are labelled by row number.]
The partial regression residual plot compares the residuals from two model fits.
First, we “adjust” Y for all the other predictors in the model except the selected
one. Then, we “adjust” the selected variable Xsel for all the other predictors in the
model. Lastly, plot the residuals from these two models against each other to see what
relationship still exists between Y and Xsel after accounting for their relationships with
the other predictors.
# function to create partial regression plot
partial.regression.plot <- function (y, x, sel, ...) {
m <- as.matrix(x[, -sel])
# residuals of y regressed on all x's except "sel"
y1 <- residuals(lm(y ~ m))
# residuals of x regressed on all other x's
x1 <- residuals(lm(x[, sel] ~ m))
# plot residuals of y vs residuals of x
plot( y1 ~ x1, main="Partial regression plot", ylab="y | others", ...)
# add grid
grid(lty = "solid")
# add red regression line
abline(lm(y1 ~ x1), col = "red", lwd = 2)
}
par(mfrow=c(1, 2))
partial.regression.plot(gce$y, cbind(gce$x1, gce$x2), 1, xlab="x1 | others")
partial.regression.plot(gce$y, cbind(gce$x1, gce$x2), 2, xlab="x2 | others")
[Figure: partial regression plots from partial.regression.plot(). Left panel: y | others against x1 | others. Right panel: y | others against x2 | others. Each panel shows the fitted regression line in red.]
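A useful fact about these plots, known as the Frisch-Waugh-Lovell result, is that the slope of the line in the partial regression plot for a predictor equals that predictor's coefficient in the multiple regression, and the residuals about that line are the multiple regression residuals. The sketch below checks this for x1 with the GCE data typed in by hand; the object names are my own.

```r
# GCE data as printed in the notes
gce <- data.frame(
  y  = c(476, 457, 540, 551, 575, 698, 545, 574, 645, 690, 634, 637, 390, 562, 560),
  x1 = c(111, 92, 90, 107, 98, 150, 118, 110, 117, 114, 130, 118, 91, 118, 109),
  x2 = c(68, 46, 50, 59, 50, 66, 54, 51, 59, 80, 57, 51, 44, 61, 66)
)

fit_full <- lm(y ~ x1 + x2, data = gce)

# "Adjust" y and x1 for the other predictor (x2)
ry  <- residuals(lm(y  ~ x2, data = gce))
rx1 <- residuals(lm(x1 ~ x2, data = gce))

# Regression of adjusted y on adjusted x1: the line drawn in the plot
fit_av <- lm(ry ~ rx1)

# The two slopes agree
c(av_slope  = unname(coef(fit_av)["rx1"]),
  full_coef = unname(coef(fit_full)["x1"]))

# The residuals agree as well
all.equal(unname(residuals(fit_av)), unname(residuals(fit_full)))
```

This is why a point that pulls the line in a partial regression plot is also influencing the corresponding coefficient, and its significance, in the multiple regression model.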
The first partial regression residual plot for COMP, given below, “adjusts” GCE
(Y ) and COMP (X1 ) for their common dependence on all the other predictors in
the model (only SCEL (X2 ) here). This plot tells us whether we need to transform
COMP in the multiple regression model, and whether any observations are influencing
the significance of COMP in the fitted model. A roughly linear trend suggests that
no transformation of COMP is warranted. The positive relationship seen here is
consistent with the coefficient of COMP being positive in the multiple regression
model. The partial residual plot for COMP shows little evidence of curvilinearity,
and much less so than the original 2D plot of GCE against COMP. This indicates
that there is no strong evidence for transforming COMP in a multiple regression
model that includes SCEL.
Although SCEL appears to be somewhat useful as a predictor of GCE on its own, the
multiple regression output indicates that SCEL does not explain a significant amount
of the variation in GCE, once the effect of COMP has been taken into account. Put
another way, previous performance in the School Certificate English Language (X2 )
has little predictive value independently of what has already emerged from the current
performance in the compulsory papers (X1 or COMP). This conclusion is consistent
with the fairly weak linear relationship between GCE and SCEL seen in the second
partial residual plot.
Do diagnostics suggest any deficiencies associated with this conclusion? The par-
tial residual plot of SCEL highlights observation 10, which has the largest value of
Cook’s distance in the multiple regression model. If we visually hold observation 10
out from this partial residual plot, it would appear that the relationship observed in
this plot would weaken. This suggests that observation 10 is actually enhancing the
significance of SCEL in the multiple regression model. That is, the p-value for testing
the importance of SCEL in the multiple regression model would be inflated by holding
out observation 10. The following output confirms this conjecture. The studentized
residuals, Cook’s distances and partial residual plots show no serious deficiencies.
Model Y = β0 + β1 X1 + β2 X2 + ε, excluding observation 10:
gce10 <- gce[-10,]
# y ~ x1 + x2
lm.y10.x1.x2 <- lm(y ~ x1 + x2, data = gce10)
library(car)
Anova(lm.y10.x1.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## (Intercept) 5280 1 1.7572 0.211849
## x1 37421 1 12.4540 0.004723 **
## x2 747 1 0.2486 0.627870
## Residuals 33052 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y10.x1.x2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = gce10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.117 -30.319 4.661 37.416 64.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.461 120.295 1.326 0.21185
## x1 4.241 1.202 3.529 0.00472 **
## x2 -1.280 2.566 -0.499 0.62787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.82 on 11 degrees of freedom
## Multiple R-squared: 0.6128,Adjusted R-squared: 0.5424
## F-statistic: 8.706 on 2 and 11 DF, p-value: 0.005413
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y10.x1.x2, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.y10.x1.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 13 1 9
## 12 1 9
## residuals vs order of data
#plot(lm.y10.x1.x2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.y10.x1.x2 (observation 10 excluded). Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, and plots of lm.y10.x1.x2$residuals including the QQ Plot. Observations 1, 9, and 13 are labelled.]
library(car)
avPlots(lm.y10.x1.x2, id = list(n = 3))
[Figure: Added-Variable Plots for lm.y10.x1.x2 (observation 10 excluded). Left panel: y | others against x1 | others. Right panel: y | others against x2 | others. Extreme observations are labelled by row number.]
What are my conclusions? It would appear that SCEL (X2 ) is not a useful pre-
dictor in the multiple regression model. For simplicity, I would likely use a simple
linear regression model to predict GCE (Y ) from COMP (X1 ) only. The diagnostic
analysis of the model showed no serious deficiencies.
3.1 Model
Given data on a response variable Y and k predictor variables X1 , X2 , . . . , Xk , we
wish to develop a regression model to predict Y . Assuming that the collection of
variables is measured on the correct scale, and that the candidate list of predictors
includes all the important predictors, the most general model is
Y = β0 + β1 X1 + · · · + βk Xk + ε.
In most problems one or more of the predictors can be eliminated from this general
or full model without (much) loss of information. We want to identify the impor-
tant predictors, or equivalently, eliminate the predictors that are not very useful for
explaining the variation in Y (conditional on the other predictors in the model).
We will study several automated methods for model selection, which, given a
specific criterion for selecting a model, give the best predictors. Before applying
any of the methods, you should plot Y against each predictor X1 , X2 , . . . , Xk to see
whether transformations are needed. If a transformation of Xi is suggested, include
the transformation along with the original Xi in the candidate list. Note that you
can transform the predictors differently, for example, log(X1 ) and √X2 . However,
if several transformations are suggested for the response, then you should consider
doing one analysis for each suggested response scale before deciding on the final scale.
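As a concrete illustration of a candidate list that includes transformations, the sketch below fits a model containing each original predictor together with its suggested transformation. The data are simulated (hypothetical values with positive predictors), and the particular transformations, log(x1) and sqrt(x2), simply mirror the example in the text.

```r
set.seed(7)
n <- 50

# Simulated positive-valued predictors (hypothetical data for illustration only)
sim <- data.frame(x1 = runif(n, 1, 20), x2 = runif(n, 1, 50))
sim$y <- 3 + 2 * log(sim$x1) + 0.5 * sqrt(sim$x2) + rnorm(n)

# Candidate list: each original predictor plus its suggested transformation
fit_candidates <- lm(y ~ x1 + log(x1) + x2 + sqrt(x2), data = sim)
summary(fit_candidates)
```

A selection method then decides which of the four candidate terms to keep.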
At this point, I will only consider the backward elimination method. Other
approaches will be addressed later this semester.
1. Fit the full model
Y = β0 + β1 X1 + · · · + βk Xk + ε. (1)
2. Find the variable which when omitted from the full model (1) reduces R2 the
least, or equivalently, increases the Residual SS the least. This is the variable
that gives the largest p-value for testing an individual regression coefficient
H0 : βi = 0 for i > 0. Suppose this variable is Xk . If you reject H0 , stop and
conclude that the full model is best. If you do not reject H0 , delete Xk from
the full model, giving the new full model
Y = β0 + β1 X1 + · · · + βk−1 Xk−1 + ε.
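The steps described so far can be sketched as a short loop. This is a minimal illustration, not a definitive implementation: the function name backward_eliminate is hypothetical, the cut-off alpha = 0.10 is assumed as the default, the data are simulated, and the single-term tests come from base R's drop1(), whose F-tests match the individual t-tests when each term has one degree of freedom.

```r
# Backward elimination by p-value (a sketch; the function name is hypothetical)
backward_eliminate <- function(fit, alpha = 0.10) {
  repeat {
    tab   <- drop1(fit, test = "F")        # one row per droppable term, plus "<none>"
    keep  <- rownames(tab) != "<none>"
    pvals <- tab[["Pr(>F)"]][keep]
    terms <- rownames(tab)[keep]
    # stop when nothing is left to drop, or when the least important term is significant
    if (length(terms) == 0 || max(pvals) <= alpha) break
    worst <- terms[which.max(pvals)]       # term with the largest p-value
    fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Simulated example: only x1 is truly related to y
set.seed(42)
n   <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

full  <- lm(y ~ x1 + x2 + x3, data = sim)
final <- backward_eliminate(full)
formula(final)
```

Dropping one term at a time and refitting matters because the p-values of the remaining predictors change after each deletion.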
str(indian2)
## 'data.frame': 39 obs. of 8 variables:
## $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
## $ wt : num 71 56.5 56 61 65 62 53 53 65 57 ...
## $ ht : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
## $ chin : num 8 3.3 3.3 3.7 9 3 7.3 3.7 10.3 5.7 ...
## $ fore : num 7 5 1.3 3 12.7 3.3 4.7 4.3 9 4 ...
## $ calf : num 12.7 8 4.3 4.3 20.7 5.7 8 0 10 6 ...
## $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
## $ yrage: num 0.0476 0.2727 0.2083 0.0417 0.04 ...
# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of the indian2 variables (sysbp, wt, ht, chin, fore, calf, pulse, yrage) from ggpairs(), with density plots on the diagonal and pairwise correlations in the lower triangle.]
Correlations shown in the figure:
         sysbp     wt       ht        chin    fore    calf    pulse
wt       0.521
ht       0.219    0.45
chin     0.17     0.562   −0.0079
fore     0.272    0.544   −0.0689    0.638
calf     0.251    0.392   −0.00285   0.516   0.736
pulse    0.133    0.31     0.00294   0.224   0.422   0.21
yrage   −0.276    0.293    0.0512    0.12    0.028  −0.113   0.213
##
## n= 39
##
##
## P
## sysbp wt ht chin fore calf pulse yrage
## sysbp 0.0007 0.1802 0.3003 0.0936 0.1236 0.4211 0.0888
## wt 0.0007 0.0040 0.0002 0.0003 0.0136 0.0548 0.0702
## ht 0.1802 0.0040 0.9619 0.6767 0.9863 0.9858 0.7570
## chin 0.3003 0.0002 0.9619 0.0000 0.0008 0.1708 0.4665
## fore 0.0936 0.0003 0.6767 0.0000 0.0000 0.0075 0.8656
## calf 0.1236 0.0136 0.9863 0.0008 0.0000 0.1995 0.4933
## pulse 0.4211 0.0548 0.9858 0.1708 0.0075 0.1995 0.1928
## yrage 0.0888 0.0702 0.7570 0.4665 0.8656 0.4933 0.1928
Below I fit the linear model with all the selected main effects.
# fit full model
lm.indian2.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, data = indian2)
library(car)
Anova(lm.indian2.full, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 389.46 1 3.8991 0.0572767 .
## wt 1956.49 1 19.5874 0.0001105 ***
## ht 131.88 1 1.3203 0.2593289
## chin 186.85 1 1.8706 0.1812390
## fore 27.00 1 0.2703 0.6068061
## calf 2.86 1 0.0287 0.8666427
## pulse 14.61 1 0.1463 0.7046990
## yrage 1386.76 1 13.8835 0.0007773 ***
## Residuals 3096.45 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian2.full)
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage,
## data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3993 -5.7916 -0.6907 6.9453 23.5771
##
## Coefficients:
Remarks on Step 0: The full model has 7 predictors so REG df = 7. The F -test
in the full model ANOVA table (F = 4.91 with p-value = 0.0008) tests the hypothesis
that the regression coefficient for each predictor variable is zero. This test is highly
significant, indicating that one or more of the predictors is important in the model.
In the ANOVA table, the F -value column gives the square of the t-statistic (from
the parameter [Coefficients] estimate table) for testing the significance of the indi-
vidual predictors in the full model (conditional on all other predictors being in the
model). The p-value is the same whether the t-statistic or F -value is shown.
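The equality between the F-values and the squared t-statistics is easy to verify. The sketch below uses R's built-in mtcars data as a stand-in (an assumption, since it is not one of the course data sets) and base R's drop1() in place of car's Anova(); for a model with main effects only, both give the same single-term tests.

```r
# Stand-in example with a built-in data set (not the Indian data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# t-statistics and p-values from the coefficient table, intercept dropped
coefs   <- coef(summary(fit))
t_stats <- coefs[-1, "t value"]
t_pvals <- coefs[-1, "Pr(>|t|)"]

# F-tests for dropping each predictor from the full model, "<none>" row dropped
f_tab   <- drop1(fit, test = "F")
f_stats <- f_tab[-1, "F value"]
f_pvals <- f_tab[-1, "Pr(>F)"]

# F equals t squared, and the p-values coincide
all.equal(unname(t_stats)^2, f_stats)
all.equal(unname(t_pvals), f_pvals)
```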
The least important variable in the full model, as judged by the p-value, is calf skin
fold. This variable, upon omission, reduces R2 the least, or equivalently, increases the
Residual SS the least. The p-value of 0.87 exceeds the default 0.10 cut-off, so calf
will be the first to be omitted from the model.
Below, we will continue in this way. After deleting calf, the six predictor model
can be fitted. Manually, you can find that at least one of the predictors left is
important, as judged by the overall F -test p-value. The least important predictor left
is pulse. This variable is omitted from the model because the p-value for including
it exceeds the 0.10 threshold.
This is repeated until all predictors remain significant at a 0.10 significance level.
# model reduction using update() and subtracting (removing) model terms
lm.indian2.red <- lm.indian2.full;
# remove calf
lm.indian2.red <- update(lm.indian2.red, ~ . - calf ); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + pulse + yrage, data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6993 -5.3152 -0.7725 7.2966 23.7240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.13739 53.05581 2.000 0.053993 .
## wt 1.70900 0.38051 4.491 8.65e-05 ***
## ht -0.04478 0.03871 -1.157 0.256008
## chin -1.14165 0.82823 -1.378 0.177635
## fore -0.56731 1.07462 -0.528 0.601197
## pulse 0.07103 0.19142 0.371 0.713018
## yrage -29.54000 7.63983 -3.867 0.000509 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.841 on 32 degrees of freedom
## Multiple R-squared: 0.5255,Adjusted R-squared: 0.4365
## F-statistic: 5.906 on 6 and 32 DF, p-value: 0.0003103
# remove pulse
lm.indian2.red <- update(lm.indian2.red, ~ . - pulse); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + yrage, data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6147 -5.9803 -0.2065 6.6755 24.9269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.27872 51.18665 2.154 0.038601 *
## wt 1.71825 0.37470 4.586 6.22e-05 ***
## ht -0.04504 0.03820 -1.179 0.246810
## chin -1.17716 0.81187 -1.450 0.156514
## fore -0.43385 0.99933 -0.434 0.667013
## yrage -28.98171 7.39172 -3.921 0.000421 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.712 on 33 degrees of freedom
## Multiple R-squared: 0.5234,Adjusted R-squared: 0.4512
## F-statistic: 7.249 on 5 and 33 DF, p-value: 0.0001124
# remove fore
lm.indian2.red <- update(lm.indian2.red, ~ . - fore ); summary(lm.indian2.red);
##
## Call:
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06
# all are significant, stop.
# final model: sysbp ~ wt + yrage
lm.indian2.final <- lm.indian2.red
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=184.64
## sysbp ~ wt + ht + chin + fore + pulse + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - pulse 1 13.34 3112.6 182.81 0.1377 0.7130185
## - fore 1 26.99 3126.3 182.98 0.2787 0.6011969
## - ht 1 129.56 3228.9 184.24 1.3377 0.2560083
## <none> 3099.3 184.64
## - chin 1 184.03 3283.3 184.89 1.9000 0.1776352
## - yrage 1 1448.00 4547.3 197.59 14.9504 0.0005087 ***
## - wt 1 1953.77 5053.1 201.70 20.1724 8.655e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=182.81
## sysbp ~ wt + ht + chin + fore + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - fore 1 17.78 3130.4 181.03 0.1885 0.667013
## - ht 1 131.12 3243.8 182.42 1.3902 0.246810
## <none> 3112.6 182.81
## - chin 1 198.30 3310.9 183.22 2.1023 0.156514
## - yrage 1 1450.02 4562.7 195.72 15.3730 0.000421 ***
## - wt 1 1983.51 5096.2 200.03 21.0290 6.219e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=181.03
## sysbp ~ wt + ht + chin + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - ht 1 113.57 3244.0 180.42 1.2334 0.2745301
## <none> 3130.4 181.03
## - chin 1 287.20 3417.6 182.45 3.1193 0.0863479 .
## - yrage 1 1445.52 4575.9 193.84 15.7000 0.0003607 ***
## - wt 1 2263.64 5394.1 200.25 24.5857 1.945e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=180.42
## sysbp ~ wt + chin + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3244.0 180.42
## - chin 1 197.37 3441.4 180.72 2.1295 0.1534065
## - yrage 1 1368.44 4612.4 192.15 14.7643 0.0004912 ***
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06
1. The individual with the highest systolic blood pressure (case 1) has a large
studentized residual ri and the largest Cook’s Di .
2. Except for case 1, the rankit plot and the plot of the studentized residuals
against the fitted values show no gross abnormalities.
3. The plots of studentized residuals against the individual predictors show no
patterns. The partial residual plots show roughly linear trends. These plots
collectively do not suggest the need to transform either of the predictors. Al-
though case 1 is prominent in the partial residual plots, it does not appear to
be influencing the significance of these predictors.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.indian2.final, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.indian2.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 34 11
## residuals vs order of data
#plot(lm.indian2.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.indian2.final. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, and plots of lm.indian2.final$residuals including the QQ Plot. Observations 1, 4, 8, 11, and 34 are labelled.]
Recall that the partial regression residual plot for weight, given below, adjusts
systolic blood pressure and weight for their common dependence on all the other
predictors in the model (only years by age fraction here). This plot tells us whether
we need to transform weight in the multiple regression model, and whether any ob-
servations are influencing the significance of weight in the fitted model. A roughly
linear trend, as seen here, suggests that no transformation of weight is warranted.
The positive relationship seen here is consistent with the coefficient of weight being
positive in the multiple regression model.
The partial residual plot for fraction exhibits a stronger relationship than is seen
in the earlier 2D plot of systolic blood pressure against year by age fraction. This
means that fraction is more useful as a predictor after taking an individual’s weight
into consideration.
library(car)
avPlots(lm.indian2.final, id = list(n = 3))
[Figure: Added-Variable Plots for lm.indian2.final, with sysbp | others on the vertical axis of each panel, one panel per predictor (wt and yrage). Extreme observations, including case 1, are labelled by row number.]
Model selection methods can be highly influenced by outliers and influential cases.
We should hold out case 1, and rerun the backward procedure to see whether case 1
unduly influenced the selection of the two predictor model. If we hold out case 1, we
find that the model with weight and fraction as predictors is suggested again. After
holding out case 1, there are no large residuals, no extremely influential points, and
no gross abnormalities in the plots. The R^2 for the selected model is now R^2 = 0.408.
This decrease in R^2 should have been anticipated. Why?1
The two analyses suggest that the “best model” for predicting systolic blood
pressure is
sysbp = β0 + β1 wt + β2 yrage + ε.
Should case 1 be deleted? I have not fully explored this issue, but I will note that
eliminating this case does have a significant impact on the estimates of the regression
coefficients, and on predicted values. What do you think?
actual dose an animal received was approximately determined as 40mg of the drug
per kilogram of body weight. (Liver weight is known to be strongly related to body
weight.) After a fixed length of time, each rat was sacrificed, the liver weighed, and
the percent of the dose in the liver determined.
The experimental hypothesis was that, for the method of determining the dose,
there is no relationship between the percentage of dose in the liver (Y ) and the body
weight, liver weight, and relative dose.
#### Example: Rat liver
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch03_ratliver.csv"
ratliver <- read.csv(fn.data)
str(ratliver)
## 'data.frame': 19 obs. of 4 variables:
## $ y : num 0.42 0.25 0.56 0.23 0.23 0.32 0.37 0.41 0.33 0.38 ...
## $ bodywt : int 176 176 190 176 200 167 188 195 176 165 ...
## $ liverwt: num 6.5 9.5 9 8.9 7.2 8.9 8 10 8 7.9 ...
## $ dose : num 0.88 0.88 1 0.88 1 0.83 0.94 0.98 0.88 0.84 ...
y bodywt liverwt dose
1 0.42 176 6.50 0.88
2 0.25 176 9.50 0.88
3 0.56 190 9.00 1.00
4 0.23 176 8.90 0.88
5 0.23 200 7.20 1.00
6 0.32 167 8.90 0.83
7 0.37 188 8.00 0.94
8 0.41 195 10.00 0.98
9 0.33 176 8.00 0.88
10 0.38 165 7.90 0.84
11 0.27 158 6.90 0.80
12 0.36 148 7.30 0.74
13 0.21 149 5.20 0.75
14 0.28 163 8.40 0.81
15 0.34 170 7.20 0.85
16 0.28 186 6.80 0.94
17 0.30 146 7.30 0.73
18 0.37 181 9.00 0.90
19 0.46 149 6.40 0.75
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
#p <- ggpairs(ratliver, progress=FALSE)
# put scatterplots on top so y axis is vertical
p <- ggpairs(ratliver, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of y, bodywt, liverwt, and dose. Correlations shown: y and bodywt 0.151; y and liverwt 0.203; y and dose 0.228; bodywt and liverwt 0.5; bodywt and dose 0.99; liverwt and dose 0.49.]
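# fit the full model (this call matches the Call line in the summary() output)
lm.ratliver.full <- lm(y ~ bodywt + liverwt + dose, data = ratliver)
library(car)
Anova(lm.ratliver.full, type=3)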
## Anova Table (Type III tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## (Intercept) 0.011157 1 1.8676 0.19188
## bodywt 0.042408 1 7.0988 0.01768 *
## liverwt 0.004120 1 0.6897 0.41930
## dose 0.044982 1 7.5296 0.01507 *
## Residuals 0.089609 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.ratliver.full)
##
## Call:
## lm(formula = y ~ bodywt + liverwt + dose, data = ratliver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.100557 -0.063233 0.007131 0.045971 0.134691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.265922 0.194585 1.367 0.1919
## bodywt -0.021246 0.007974 -2.664 0.0177 *
## liverwt 0.014298 0.017217 0.830 0.4193
## dose 4.178111 1.522625 2.744 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07729 on 15 degrees of freedom
## Multiple R-squared: 0.3639,Adjusted R-squared: 0.2367
## F-statistic: 2.86 on 3 and 15 DF, p-value: 0.07197
The backward elimination procedure selects body weight and dose as predictors. The
p-values for testing the importance of these variables, when added last to this
two-predictor model, are small: 0.019 for body weight and 0.015 for dose.
lm.ratliver.red.AIC <- step(lm.ratliver.full, direction="backward", test="F")
## Start: AIC=-93.78
## y ~ bodywt + liverwt + dose
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - liverwt 1 0.004120 0.093729 -94.924 0.6897 0.41930
## <none> 0.089609 -93.778
## - bodywt 1 0.042408 0.132017 -88.416 7.0988 0.01768 *
## - dose 1 0.044982 0.134591 -88.049 7.5296 0.01507 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-94.92
## y ~ bodywt + dose
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 0.093729 -94.924
## - bodywt 1 0.039851 0.133580 -90.192 6.8027 0.01902 *
## - dose 1 0.043929 0.137658 -89.621 7.4989 0.01458 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.ratliver.final <- lm.ratliver.red.AIC
This cursory analysis leads to a conclusion that a combination of dose and body
weight is associated with Y, but that neither of these predictors is important on its own
(low correlations with Y ). Although this commonly happens in regression problems,
it is somewhat paradoxical here because dose was approximately a multiple of body
weight, so to a first approximation, these predictors are linearly related and so only
one of them should be needed in a linear regression model. Note that the correlation
between dose and body weight is 0.99.
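The correlations quoted above can be checked directly. The sketch below re-enters the 19 rows from the data listing by hand (so it runs without the CSV file) and computes the correlation matrix. The variance inflation factor line is an added illustration, not part of the original analysis; for a model with only two predictors it reduces to 1/(1 - r^2).

```r
# Rat liver data re-entered by hand from the listing in the notes.
ratliver <- data.frame(
  y       = c(0.42, 0.25, 0.56, 0.23, 0.23, 0.32, 0.37, 0.41, 0.33, 0.38,
              0.27, 0.36, 0.21, 0.28, 0.34, 0.28, 0.30, 0.37, 0.46),
  bodywt  = c(176, 176, 190, 176, 200, 167, 188, 195, 176, 165,
              158, 148, 149, 163, 170, 186, 146, 181, 149),
  liverwt = c(6.5, 9.5, 9.0, 8.9, 7.2, 8.9, 8.0, 10.0, 8.0, 7.9,
              6.9, 7.3, 5.2, 8.4, 7.2, 6.8, 7.3, 9.0, 6.4),
  dose    = c(0.88, 0.88, 1.00, 0.88, 1.00, 0.83, 0.94, 0.98, 0.88, 0.84,
              0.80, 0.74, 0.75, 0.81, 0.85, 0.94, 0.73, 0.90, 0.75)
)

# Pairwise correlations among the response and the three predictors
cor.ratliver <- cor(ratliver)
print(round(cor.ratliver, 3))

# Correlation between the two selected predictors
r.bd <- cor.ratliver["bodywt", "dose"]

# Illustration only: variance inflation factor in the two-predictor model
vif.bd <- 1 / (1 - r.bd^2)
print(vif.bd)
```

A large inflation factor is one way to see why the individual coefficient estimates for bodywt and dose are unstable and can be driven by a single case.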
The apparent paradox can be resolved only with a careful diagnostic analysis! For
the model with dose and body weight as predictors, there are no cases with large |ri |
values, but case 3 has a relatively large Cook’s D value.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.ratliver.final, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.ratliver.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 19 13 1
## residuals vs order of data
#plot(lm.ratliver.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.ratliver.final: residuals versus fitted values, Cook's distance, Cook's distance versus leverage, lm.ratliver.final$residuals versus bodywt and versus dose, and the QQ plot; cases 1, 3, 5, 13, and 19 are labeled.]
Further, the partial residual plot for bodywt clearly highlights case 3. Without this
case we would see roughly a random scatter of points, suggesting that body weight is
unimportant after taking dose into consideration. The importance of body weight as
a predictor in the multiple regression model is due solely to the placement of case 3.
The partial residual plot for dose gives the same message.
library(car)
avPlots(lm.ratliver.final, id = list(n = 3))
[Figure: Added-Variable Plots for lm.ratliver.final (y | others against bodywt | others and dose | others); cases 1, 3, 10, 13, 18, and 19 are labeled.]
Removing case 3 If we delete this case and redo the analysis we find, as expected,
no important predictors of Y . The output below shows that the backward elimination
removes each predictor from the model. Thus, the apparent relationship between Y
and body weight and dose in the initial analysis can be ascribed to Case 3 alone. Can
you see this case in the plots?
# remove case 3
ratliver3 <- ratliver[-3,]
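As a minimal sketch of the comparison described above, the code below fits the two-predictor model with and without case 3 and places the coefficient tables side by side. The data are re-entered by hand from the listing (only the columns needed here), and the object names lm.with3 and lm.without3 are chosen for this illustration.

```r
# Columns y, bodywt, and dose re-entered from the data listing in the notes.
ratliver <- data.frame(
  y      = c(0.42, 0.25, 0.56, 0.23, 0.23, 0.32, 0.37, 0.41, 0.33, 0.38,
             0.27, 0.36, 0.21, 0.28, 0.34, 0.28, 0.30, 0.37, 0.46),
  bodywt = c(176, 176, 190, 176, 200, 167, 188, 195, 176, 165,
             158, 148, 149, 163, 170, 186, 146, 181, 149),
  dose   = c(0.88, 0.88, 1.00, 0.88, 1.00, 0.83, 0.94, 0.98, 0.88, 0.84,
             0.80, 0.74, 0.75, 0.81, 0.85, 0.94, 0.73, 0.90, 0.75)
)

# Model selected by backward elimination, all 19 cases
lm.with3 <- lm(y ~ bodywt + dose, data = ratliver)

# Same model with case 3 held out
ratliver3 <- ratliver[-3, ]
lm.without3 <- lm(y ~ bodywt + dose, data = ratliver3)

# Coefficient tables: estimates, standard errors, t-values, p-values
coef.with3    <- summary(lm.with3)$coefficients
coef.without3 <- summary(lm.without3)$coefficients
print(round(coef.with3, 4))
print(round(coef.without3, 4))
```

Comparing the two tables shows how much the estimates and their p-values depend on the single held-out case.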
scatterplot below (e.g., rat 8 with a weight of 195g got a lower dose of 0.98).
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(ratliver, aes(x = bodywt, y = dose, label = 1:nrow(ratliver)))
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + geom_point(alpha=1/3)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
p <- p + labs(title="Rat liver dose by bodywt: rat 3 overdosed")
print(p)
[Figure: "Rat liver dose by bodywt: rat 3 overdosed". Scatterplot of dose against bodywt with the fitted regression line and confidence band; points are labeled by case number.]
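A quick numerical check of the same point: since dose was meant to be proportional to body weight, the ratio dose/bodywt should be nearly constant across rats. The sketch below uses the bodywt and dose columns from the data listing. The rescaling by 200 is only an observation about the listed values (most recorded doses are close to bodywt/200), not something stated in the notes.

```r
# bodywt and dose re-entered from the data listing in the notes.
bodywt <- c(176, 176, 190, 176, 200, 167, 188, 195, 176, 165,
            158, 148, 149, 163, 170, 186, 146, 181, 149)
dose   <- c(0.88, 0.88, 1.00, 0.88, 1.00, 0.83, 0.94, 0.98, 0.88, 0.84,
            0.80, 0.74, 0.75, 0.81, 0.85, 0.94, 0.73, 0.90, 0.75)

# Dose per gram of body weight; roughly constant if dosing followed the rule
ratio <- dose / bodywt

# Rescaled so that a value near 1 means "dose as expected for this weight"
rel.dose <- round(200 * ratio, 3)
names(rel.dose) <- seq_along(rel.dose)
print(rel.dose)

# Case with the largest dose relative to body weight
case.max <- which.max(ratio)
print(case.max)
```

The case with the largest ratio is the one that sits above the line in the scatterplot.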
A number of causes for the result found in the first analysis are possible: (1)
the dose or weight recorded for case 3 was in error, so the case should probably
be deleted from the analysis, or (2) the regression fit in the second analysis is not
appropriate except in the region defined by the 18 points excluding case 3. It is
possible that the combination of dose and rat weight chosen was fortuitous, and that
the lack of relationship found would not persist for any other combinations of them,
since inclusion of a data point apparently taken under different conditions leads to
a different conclusion. This suggests the need for collection of additional data, with
dose determined by some rule other than a constant proportion of weight.
I hope the point of this analysis is clear! What have we learned from this analysis?
Insecticide 1
Insecticide 2
Insecticide 3
Insecticide 4
Let
$$\mu = \frac{1}{I} \sum_{i} \mu_i$$
be the grand mean, or average of the population means. Let
αi = µi − µ
be the ith group treatment effect. The treatment effects are constrained to add
to zero, α1 + α2 + · · · + αI = 0, and measure the difference between the treatment
population means and the grand mean. Given this notation, the one-way ANOVA
model is
yij = µ + αi + eij .
The model specifies that the
Response = Grand Mean + Treatment Effect + Residual.
An hypothesis of interest is whether the population means are equal: H0 : µ1 =
· · · = µI , which is equivalent to the hypothesis of no treatment effects: H0 : α1 =
· · · = αI = 0. If H0 is true, then the one-way model is
yij = µ + eij ,
where µ is the common population mean. You know how to test H0 and do multiple
comparisons of the treatments, so I will not review this material.
Most texts use treatment effects to specify ANOVA models, a convention that I
will also follow. A difficulty with this approach is that the treatment effects must
be constrained to be uniquely estimable from the data (because the I population
means µi are modeled in terms of I + 1 parameters: µi = µ + αi ). An infinite
number of constraints can be considered each of which gives the same structure on
the population means. The standard constraint where the treatment effects sum to
zero was used above, but many statistical packages impose the constraint αI = 0
(or sometimes α1 = 0). Although estimates of treatment effects depend on which
constraint is chosen, the null and alternative models used with the ANOVA F -test,
and pairwise comparisons of treatment effects, do not. I will downplay the discussion
of estimating treatment effects to minimize problems.
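The claim that the choice of constraint changes the effect estimates but not the fitted means, the F-test, or pairwise differences can be checked with a small simulated example. The data below are hypothetical (generated with set.seed), and the two fits use R's built-in contrast codings: contr.treatment (first effect set to zero, R's default) and contr.sum (effects sum to zero).

```r
set.seed(1)
# Hypothetical balanced one-way layout: I = 4 groups, 5 observations each
dat <- data.frame(
  group = factor(rep(paste0("g", 1:4), each = 5)),
  y     = rnorm(20, mean = rep(c(10, 12, 11, 15), each = 5))
)

# Constraint alpha_1 = 0 (treatment contrasts)
fit.trt <- lm(y ~ group, data = dat,
              contrasts = list(group = "contr.treatment"))
# Constraint sum of alpha_i = 0 (sum contrasts)
fit.sum <- lm(y ~ group, data = dat,
              contrasts = list(group = "contr.sum"))

# The coefficient estimates differ between the two parameterizations ...
print(coef(fit.trt))
print(coef(fit.sum))

# ... but the fitted group means and the ANOVA F-statistic do not
same.fit <- isTRUE(all.equal(fitted(fit.trt), fitted(fit.sum)))
F.trt <- anova(fit.trt)[1, "F value"]
F.sum <- anova(fit.sum)[1, "F value"]

# Pairwise difference (group 2 minus group 1) under each constraint
diff21.trt <- unname(coef(fit.trt)["groupg2"])
diff21.sum <- unname(coef(fit.sum)["group2"] - coef(fit.sum)["group1"])

# Under the sum constraint the intercept is the average of the group means
grand.mean <- mean(tapply(dat$y, dat$group, mean))
```

Here the intercept of fit.sum estimates the grand mean µ, while the intercept of fit.trt estimates the mean of the first group.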
The discussion will be limited to randomized block experiments with one factor.
Two or more factors can be used with a randomized block design. For example,
the agricultural experiment could be modified to compare four combinations of two
corn varieties and two levels of fertilizer in each block instead of the original four
varieties. In certain experiments, each experimental unit receives each treatment.
The experimental units are “natural” blocks for the analysis.
The volunteers in the study were treated as blocks in the analysis. At best, the
volunteers might be considered a representative sample of males between the ages of
20 and 30. This limits the extent of inferences from the experiment. The scientists
can not, without sound medical justification, extrapolate the results to children or to
senior citizens.
#### Example: Itching
itch <- read.csv("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_itch.csv")
1
Beecher, 1959
where µij is the population mean response for the j th treatment in the ith block and
eij is the deviation of the response from the mean. The population means are assumed
to satisfy the additive model
µij = µ + αi + βj
where µ is a grand mean, αi is the effect for the ith block, and βj is the effect
for the j th treatment. The responses are assumed to be independent across blocks,
normally distributed and with constant variance. The randomized block model does
not require the observations within a block to be independent, but does assume
that the correlation between responses within a block is identical for each pair of
treatments.
The model is sometimes written as
Response = Grand Mean + Treatment Effect + Block Effect + Residual.
Given the data, let ȳi· be the ith block sample mean (the average of the responses
in the ith block), ȳ·j be the j th treatment sample mean (the average of the responses
on the j th treatment), and ȳ·· be the average response of all IJ observations in the
experiment.
An ANOVA table for the randomized block experiment partitions the Model SS
into SS for Blocks and Treatments.
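Source   df               SS                                                                                      MS
Blocks   I − 1            $J \sum_{i} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$
Treats   J − 1            $I \sum_{j} (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$
Error    (I − 1)(J − 1)   $\sum_{ij} (y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2$
Total    IJ − 1           $\sum_{ij} (y_{ij} - \bar{y}_{\cdot\cdot})^2$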
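The sums of squares in this table can be computed directly from the block means, treatment means, and grand mean. The sketch below does so for a small hypothetical randomized block data set (simulated; the names rb, block, and treat are made up for the illustration) and compares the hand-computed values with the anova() table from lm().

```r
set.seed(2)
I <- 5   # number of blocks
J <- 3   # number of treatments
# Hypothetical data: one observation per block-by-treatment cell
rb <- expand.grid(block = factor(1:I), treat = factor(1:J))
rb$y <- 50 + 3 * as.integer(rb$block) + 2 * as.integer(rb$treat) +
  rnorm(I * J)

grand      <- mean(rb$y)                      # ybar..
ybar.block <- tapply(rb$y, rb$block, mean)    # ybar_i.
ybar.treat <- tapply(rb$y, rb$treat, mean)    # ybar_.j

# Sums of squares from the formulas in the table
SS.blocks <- J * sum((ybar.block - grand)^2)
SS.treats <- I * sum((ybar.treat - grand)^2)
# ave() returns the group mean aligned with each observation
SS.error  <- sum((rb$y - ave(rb$y, rb$block) - ave(rb$y, rb$treat) + grand)^2)
SS.total  <- sum((rb$y - grand)^2)

# The same quantities from the fitted randomized block model
tab <- anova(lm(y ~ block + treat, data = rb))
print(tab)
print(c(Blocks = SS.blocks, Treats = SS.treats,
        Error = SS.error, Total = SS.total))
```

The Blocks, Treats, and Error sums of squares add to the Total, which is the partition the table describes.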
A primary interest is testing whether the treatment effects are zero: H0 : β1 =
· · · = βJ = 0. The treatment effects are zero if in each block the population mean
responses are identical for each treatment. A formal test of no treatment effects is
based on the p-value from the F-statistic Fobs = MS Treat/MS Error. The p-value
is evaluated in the usual way (i.e., as an upper tail area from an F-distribution with
J − 1 and (I − 1)(J − 1) df.) This H0 is rejected when the treatment averages ȳ·j
vary significantly relative to the error variation.
A test for no block effects (H0 : α1 = · · · = αI = 0) is often a secondary inter-
est, because, if the experiment is designed well, the blocks will be, by construction,
noticeably different. There are no block effects if the population mean response for
an arbitrary treatment is identical across blocks. A formal test of no block effects is
based on the p-value from the F -statistic Fobs = MS Blocks/MS Error. This H0 is
rejected when the block averages ȳi· vary significantly relative to the error variation.
The randomized block model is easily fitted in the lm() function. Before illus-
trating the analysis on the itching data, let me mention five important points about
randomized block analyses:
1. The F -test p-value for comparing J = 2 treatments is identical to the p-value
for comparing the two treatments using a paired t-test.
2. The Block SS plus the Error SS is the Error SS from a one-way ANOVA com-
paring the J treatments. If the Block SS is large relative to the Error SS from
the two-factor model, then the experimenter has eliminated a substantial por-
tion of the variation that is used to assess the differences among the treatments.
This leads to a more sensitive comparison of treatments than would have been
obtained using a one-way ANOVA.
3. The RB model is equivalent to an additive or no interaction model for a two-
factor experiment, where the blocks are levels of one of the factors. The analysis
of a randomized block experiment under this model is the same analysis used
for a two-factor experiment with no replication (one observation per cell). We
will discuss the two-factor design soon.
4. Under the sum constraint on the parameters (i.e., $\sum_i \alpha_i = \sum_j \beta_j = 0$), the
estimates of the grand mean, block effects, and treatment effects are µ̂ = ȳ·· ,
α̂i = ȳi· − ȳ·· , and β̂j = ȳ·j − ȳ·· , respectively. The estimated mean response for
the (i, j)th cell is µ̂ij = µ̂ + α̂i + β̂j = ȳi· + ȳ·j − ȳ·· .
5. The F -test for comparing treatments is appropriate when the responses within
a block have the same correlation. This is a reasonable working assumption
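Points 1 and 2 in the list above can be verified numerically. The following sketch uses simulated paired data (hypothetical "before" and "after" measurements on 8 subjects, with the subjects acting as blocks) to show that the randomized block F-test with J = 2 treatments gives the same p-value as the paired t-test, and that the Block SS plus the Error SS equals the Error SS from the one-way ANOVA.

```r
set.seed(3)
I <- 8   # number of blocks (subjects)
# Hypothetical paired measurements
before <- rnorm(I, mean = 100, sd = 10)
after  <- before + rnorm(I, mean = 5, sd = 4)

# Long format: one row per observation, indexed by block and treatment
long <- data.frame(
  block = factor(rep(1:I, times = 2)),
  treat = factor(rep(c("before", "after"), each = I)),
  y     = c(before, after)
)

# Point 1: RB F-test for treatments versus the paired t-test
tab.rb   <- anova(lm(y ~ block + treat, data = long))
p.rb     <- tab.rb["treat", "Pr(>F)"]
p.paired <- t.test(after, before, paired = TRUE)$p.value
print(c(RB = p.rb, paired.t = p.paired))

# Point 2: Block SS + Error SS equals the one-way ANOVA Error SS
tab.1way     <- anova(lm(y ~ treat, data = long))
SS.rb.sum    <- tab.rb["block", "Sum Sq"] + tab.rb["Residuals", "Sum Sq"]
SS.1way.err  <- tab.1way["Residuals", "Sum Sq"]
print(c(RB.block.plus.error = SS.rb.sum, oneway.error = SS.1way.err))
```

The agreement in point 1 follows because the RB F-statistic is the square of the paired t-statistic when there are only two treatments.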
RB Analysis of the Itching Data First we reshape the data to long format so
each observation is its own row in the data.frame and indexed by the Patient and
Treatment variables.
library(reshape2)
itch.long <- melt(itch
, id.vars = "Patient"
, variable.name = "Treatment"
, value.name = "Seconds"
)
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds : int 174 224 260 255 165 237 191 100 115 189 ...
head(itch.long, 3)
## Patient Treatment Seconds
## 1 1 Nodrug 174
## 2 2 Nodrug 224
## 3 3 Nodrug 260
tail(itch.long, 3)
## Patient Treatment Seconds
## 68 8 Tripel 129
## 69 9 Tripel 79
## 70 10 Tripel 317
# make Patient a factor variable
itch.long$Patient <- factor(itch.long$Patient)
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds : int 174 224 260 255 165 237 191 100 115 189 ...
As a first step, I made side-by-side boxplots of the itching durations across treat-
ments. The boxplots are helpful for informally comparing treatments and visualizing
the data. The differences in the level of the boxplots will usually be magnified by the
F -test for comparing treatments because the variability within the boxplots includes
block differences which are moved from the Error SS to the Block SS. The plot also
includes the 10 Patients with lines connecting their measurements to see how common
the treatment differences were over patients. I admit, this plot is a little too busy.
Each of the five drugs appears to have an effect, compared to the placebo and
to no drug. Papaverine appears to be the most effective drug. The placebo and no
drug have similar medians. The relatively large spread in the placebo group suggests
that some patients responded adversely to the placebo compared to no drug, whereas
others responded positively.
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(itch.long, aes(x = Treatment, y = Seconds))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = 0),
colour = "black", linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_hline(aes(yintercept = mean(Seconds)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# colored line for each patient
p <- p + geom_line(aes(group = Patient, colour = Patient), alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(aes(colour = Patient))
# diamond at mean for each group
# (fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later)
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
alpha = 0.5)
# confidence limits based on normal distribution
# (mean_cl_normal requires the Hmisc package to be installed)
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, aes(colour=Treatment), alpha = 0.8)
p <- p + labs(title = "Comparison of Treatments for Itching, Treatment means")
p <- p + ylab("Duration of itching (seconds)")
# removes legend
p <- p + theme(legend.position="none")
print(p)
[Figure: "Comparison of Treatments for Itching, Treatment means". Duration of itching (seconds) by Treatment, with boxplots, a line for each patient, treatment means, and confidence limits.]
To fit the RB model in lm(), you need to specify blocks (Patient) and treatments
(Treatment) as factor variables, and include each to the right of the tilde symbol in
the formula statement. The response variable Seconds appears to the left of the
tilde.
lm.s.t.p <- lm(Seconds ~ Treatment + Patient, data = itch.long)
library(car)
Anova(lm.s.t.p, type=3)
## Anova Table (Type III tests)
##
## Response: Seconds
## Sum Sq Df F value Pr(>F)
## (Intercept) 155100 1 50.1133 3.065e-09 ***
## Treatment 53013 6 2.8548 0.017303 *
## Patient 103280 9 3.7078 0.001124 **
## Residuals 167130 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.s.t.p)
##
## Call:
The order to look at output follows the hierarchy of multi-parameter tests down
to single-parameter tests.
1. The F-test at the bottom of the summary() tests for both no block effects and
no treatment effects. If there are no block effects and no treatment effects, then
the mean itching time is independent of treatment and patients. The p-value
of 0.0005 strongly suggests that the population mean itching times are not all
equal.
2. The ANOVA table at top from Anova() partitions the Model SS into the SS
for Blocks (Patients) and Treatments. The Mean Squares, F-statistics, and
p-values for testing these effects are given. For a RB design with the same
number of responses per block (i.e., no missing data), the Type I and Type III
SS are identical, and correspond to the formulas given earlier. The distinction
between Type I and Type III SS is important for unbalanced problems, an issue
we discuss later. The F -tests show significant differences among the treatments
(p-value=0.017) and among patients (p-value=0.001).
Multiple comparisons Multiple comparisons and contrasts are not typically straightforward
in R, though some newer packages are making them easier. Below I
show one way that I think is relatively easy.
The package multcomp is used to specify which factor to perform multiple com-
parisons over and which p-value adjustment method to use. Below I use Tukey ad-
justments, first.
# multcomp has functions for multiple comparisons
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: ’MASS’
## The following object is masked from ’package:sm’:
##
## muscle
##
## Attaching package: ’TH.data’
## The following object is masked from ’package:MASS’:
##
## geyser
## The following object is masked from ’package:sm’:
##
## geyser
##
## Attaching package: ’multcomp’
## The following object is masked by ’.GlobalEnv’:
##
## waste
# Use the ANOVA object and run a "General Linear Hypothesis Test"
# specifying a linfct (linear function) to be tested.
# The mcp (multiple comparison) specifies the factor and method.
# Here: correcting over Treatment using Tukey contrast corrections.
glht.itch.t <- glht(aov(lm.s.t.p), linfct = mcp(Treatment = "Tukey"))
summary(glht.itch.t)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = lm.s.t.p)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Placebo - Nodrug == 0 13.80 24.88 0.555 0.9978
## Papv - Nodrug == 0 -72.80 24.88 -2.926 0.0699 .
## Morp - Nodrug == 0 -43.00 24.88 -1.728 0.6003
## Amino - Nodrug == 0 -46.70 24.88 -1.877 0.5038
## Pento - Nodrug == 0 -14.50 24.88 -0.583 0.9971
## Tripel - Nodrug == 0 -23.80 24.88 -0.957 0.9610
## Papv - Placebo == 0 -86.60 24.88 -3.481 0.0162 *
## Morp - Placebo == 0 -56.80 24.88 -2.283 0.2710
## Amino - Placebo == 0 -60.50 24.88 -2.432 0.2054
## Pento - Placebo == 0 -28.30 24.88 -1.137 0.9135
## Tripel - Placebo == 0 -37.60 24.88 -1.511 0.7370
## Morp - Papv == 0 29.80 24.88 1.198 0.8920
## Amino - Papv == 0 26.10 24.88 1.049 0.9398
## Pento - Papv == 0 58.30 24.88 2.343 0.2434
## Tripel - Papv == 0 49.00 24.88 1.969 0.4456
## Amino - Morp == 0 -3.70 24.88 -0.149 1.0000
## Pento - Morp == 0 28.50 24.88 1.146 0.9108
## Tripel - Morp == 0 19.20 24.88 0.772 0.9867
## Pento - Amino == 0 32.20 24.88 1.294 0.8516
## Tripel - Amino == 0 22.90 24.88 0.920 0.9676
## Tripel - Pento == 0 -9.30 24.88 -0.374 0.9998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
With summary(), the p-value adjustment can be coerced into one of several popular
methods, such as Bonferroni. Notice that the significance is lower (larger p-value)
for Bonferroni below than Tukey above. Note comment at bottom of output that
“(Adjusted p values reported -- bonferroni method)”. Passing the summary
to plot() will create a plot of the pairwise intervals for difference between factor
levels.
Recall how the Bonferroni correction works. A comparison of c pairs of levels
from one factor having a family error rate of 0.05 or less is attained by comparing
pairs of treatments at the 0.05/c level. Using this criterion, the population mean
responses for two factor levels (averaged over the other factor) are significantly different if
the p-value for the test is 0.05/c or less. The output adjusts the p-values by
reporting p-value×c (capped at 1), so that the reported adjusted p-value can be compared to the
0.05 significance level.
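The arithmetic of the Bonferroni adjustment can be sketched with base R. The t-statistics below are copied from three rows of the Tukey output above, the error df of 54 comes from the Anova() table, and c = 21 is the number of pairwise comparisons among 7 treatments. The unadjusted p-values are recomputed here from the t-statistics, so they are an illustration rather than multcomp output.

```r
# t-statistics for three of the 21 pairwise Treatment comparisons
t.obs <- c("Papv - Nodrug"   = -2.926,
           "Papv - Placebo"  = -3.481,
           "Morp - Placebo"  = -2.283)
df.error <- 54              # Residuals df in the Anova() table
n.comp   <- choose(7, 2)    # c = 21 pairwise comparisons

# Unadjusted two-sided p-values
p.raw <- 2 * pt(-abs(t.obs), df = df.error)

# Bonferroni by hand: multiply by c and cap at 1
p.bonf.hand <- pmin(p.raw * n.comp, 1)

# Same adjustment with p.adjust(); n tells it the size of the full family
p.bonf <- p.adjust(p.raw, method = "bonferroni", n = n.comp)

print(round(cbind(p.raw, p.bonf.hand, p.bonf), 4))

# Equivalent decision rule: compare the raw p-value with 0.05 / c
reject <- p.raw <= 0.05 / n.comp
print(reject)
```

Comparing the adjusted p-value with 0.05 and comparing the raw p-value with 0.05/c always give the same decision.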
summary(glht.itch.t, test = adjusted("bonferroni"))
##
[Figure: "Bonferroni-adjusted Treatment contrasts". Interval for each of the 21 pairwise Treatment differences, from Placebo − Nodrug through Tripel − Pento; horizontal axis: Linear Function.]
The Bonferroni comparisons for Treatment suggest that papaverine induces a lower
mean itching time than placebo. None of the other treatment comparisons is
significant. The comparison of Patient blocks is of less interest.
### Code for the less interesting contrasts.
### Testing multiple factors may be of interest in other problems.
### Note that the first block of code below corrects the p-values
### for all the tests done for both factors together,
### that is, the Bonferroni-corrected significance level is (alpha / (t + p))
### where t = number of treatment comparisons
### and p = number of patient comparisons.
in the boxplots). Except for these cases, which are also the most influential cases
(Cook’s distance), the plot of the studentized residuals against fitted values shows no
gross abnormalities.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.s.t.p, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.s.t.p$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 20 52 48
## residuals vs order of data
#plot(lm.s.t.p$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.s.t.p: residuals versus fitted values, Cook's distance, Cook's distance versus leverage, and QQ plot of lm.s.t.p$residuals against norm quantiles; cases 20, 48, and 52 are labeled.]
Although the F -test for comparing treatments is not overly sensitive to modest
deviations from normality, I will present a non-parametric analysis as a backup, to
see whether similar conclusions are reached about the treatments.
For simplicity, assume that the experiment is balanced, that is, the same number of
beetles (4) is assigned to each group (12 × 4 = 48). This is a CRD with two factors.
dose insecticide t1 t2 t3 t4
1 low A 0.3100 0.4500 0.4600 0.4300
2 low B 0.8200 1.1000 0.8800 0.7200
3 low C 0.4300 0.4500 0.6300 0.7600
4 low D 0.4500 0.7100 0.6600 0.6200
5 medium A 0.3600 0.2900 0.4000 0.2300
6 medium B 0.9200 0.6100 0.4900 1.2400
7 medium C 0.4400 0.3500 0.3100 0.4000
8 medium D 0.5600 1.0200 0.7100 0.3800
9 high A 0.2200 0.2100 0.1800 0.2300
10 high B 0.3000 0.3700 0.3800 0.2900
11 high C 0.2300 0.2500 0.2400 0.2200
12 high D 0.3000 0.3600 0.3100 0.3300
First we reshape the data to long format so each observation is its own row in the
data.frame and indexed by the dose and insecticide variables.
library(reshape2)
beetles.long <- melt(beetles
, id.vars = c("dose", "insecticide")
, variable.name = "number"
, value.name = "hours10"
)
str(beetles.long)
## 'data.frame': 48 obs. of 4 variables:
## $ dose : Factor w/ 3 levels "low","medium",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ insecticide: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
## $ number : Factor w/ 4 levels "t1","t2","t3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hours10 : num 0.31 0.82 0.43 0.45 0.36 0.92 0.44 0.56 0.22 0.3 ...
head(beetles.long)
## dose insecticide number hours10
## 1 low A t1 0.31
## 2 low B t1 0.82
## 3 low C t1 0.43
## 4 low D t1 0.45
## 5 medium A t1 0.36
## 6 medium B t1 0.92
The basic unit of analysis is the cell means, which are the averages of the 4 ob-
servations in each of the 12 treatment combinations. For example, in the table below,
the sample mean survival for the 4 beetles given a low dose (dose=1) of insecticide A
is 0.413. From the cell means we obtain the dose and insecticide marginal means
by averaging over the levels of the other factor. For example, the marginal mean
for insecticide A is the average of the cell means for the 3 treatment combinations
involving insecticide A: 0.314 = (0.413 + 0.320 + 0.210)/3.
Cell Means              Dose
Insecticide     1       2       3       Insect marg
A               0.413   0.320   0.210   0.314
B               0.880   0.815   0.335   0.677
C               0.568   0.375   0.235   0.393
D               0.610   0.668   0.325   0.534
Dose marg       0.618   0.544   0.277   0.480
Because the experiment is balanced, a marginal mean is the average of all observa-
tions that receive a given treatment. For example, the marginal mean for insecticide
A is the average survival time for the 12 beetles (3 doses × 4 beetles) given insecticide A.
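The cell and marginal means can be computed as in the following sketch. The beetles data are re-entered by hand from the wide-format listing above; the helper column cellmean and the object names cell, insect.marg, and dose.marg are made up for this illustration.

```r
# Beetles data re-entered from the wide-format listing in the notes.
beetles <- data.frame(
  dose = factor(rep(c("low", "medium", "high"), each = 4),
                levels = c("low", "medium", "high")),
  insecticide = factor(rep(c("A", "B", "C", "D"), times = 3)),
  t1 = c(0.31, 0.82, 0.43, 0.45, 0.36, 0.92, 0.44, 0.56, 0.22, 0.30, 0.23, 0.30),
  t2 = c(0.45, 1.10, 0.45, 0.71, 0.29, 0.61, 0.35, 1.02, 0.21, 0.37, 0.25, 0.36),
  t3 = c(0.46, 0.88, 0.63, 0.66, 0.40, 0.49, 0.31, 0.71, 0.18, 0.38, 0.24, 0.31),
  t4 = c(0.43, 0.72, 0.76, 0.62, 0.23, 1.24, 0.40, 0.38, 0.23, 0.29, 0.22, 0.33)
)

# Cell mean: average of the 4 beetles in each dose-by-insecticide combination
beetles$cellmean <- rowMeans(beetles[, c("t1", "t2", "t3", "t4")])

# Arrange the 12 cell means as an insecticide-by-dose table
cell <- tapply(beetles$cellmean,
               list(beetles$insecticide, beetles$dose), mean)

# Marginal means: average the cell means over the other factor
insect.marg <- rowMeans(cell)
dose.marg   <- colMeans(cell)
print(round(cell, 3))
print(round(insect.marg, 3))
print(round(dose.marg, 3))

# Balanced design: the marginal mean for insecticide A equals the plain
# average of every observation on insecticide A
obs.A  <- unlist(beetles[beetles$insecticide == "A", c("t1", "t2", "t3", "t4")])
mean.A <- mean(obs.A)
```

The last two lines illustrate the balance property: averaging cell means and averaging raw observations give the same marginal mean only because every cell has the same number of beetles.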
Looking at the table of means, the insecticides have noticeably different mean
survival times averaged over doses, with insecticide A having the lowest mean survival
time averaged over doses. Similarly, higher doses tend to produce lower survival times.
A more formal approach to analyzing the table of means is given in the next section.
µij = µ + αi + βj + (αβ)ij ,
where µ is a grand mean, αi is the effect for the ith level of F1, βj is the effect for the
j th level of F2, and (αβ)ij is the interaction between the ith level of F1 and the j th
level of F2. (Note that (αβ) is an individual term distinct from α and β, (αβ) is not
their product.) The model is often written
yijk = µ + αi + βj + (αβ)ij + eijk ,
meaning
Response = Grand Mean + F1 effect + F2 effect + F1-by-F2 interaction +
Residual.
The additive model having only main effects, no interaction terms, is yijk =
µ + αi + βj + eijk , meaning
Response = Grand Mean + F1 effect + F2 effect + Residual.
The effects of F1 and F2 on the mean are additive.
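The F2 marginal population means are averages within columns (over rows):
$$\bar{\mu}_{\cdot j} = \frac{1}{I} \sum_{r} \mu_{rj}.$$
The overall or grand population mean is the average of the cell means
$$\bar{\mu}_{\cdot\cdot} = \frac{1}{IJ} \sum_{rc} \mu_{rc} = \frac{1}{I} \sum_{i} \bar{\mu}_{i\cdot} = \frac{1}{J} \sum_{j} \bar{\mu}_{\cdot j}.$$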
Using this notation, the effects in the interaction model are µ = µ̄·· , αi = µ̄i· −
µ̄·· , βj = µ̄·j − µ̄·· , and (αβ)ij = µij − µ̄i· − µ̄·j + µ̄·· . The effects sum to zero:
$$\sum_{i} \alpha_i = \sum_{j} \beta_j = \sum_{ij} (\alpha\beta)_{ij} = 0,$$
and satisfy µij = µ + αi + βj + (αβ)ij (i.e., cell mean is sum of effects) required under
the model.
The F1 and F2 effects are analogous to treatment effects in a one-factor experi-
ment, except that here the treatment means are averaged over the levels of the other
factor. The interaction effect will be interpreted later.
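The decomposition of a table of cell means into a grand mean, main effects, and interaction effects can be carried out mechanically. As a stand-in for the population means µij, the sketch below uses the rounded beetle sample cell means from the earlier table (insecticide as F1, dose as F2); that substitution is an assumption made only to have concrete numbers.

```r
# Table of cell means (rows: F1 = insecticide, columns: F2 = dose),
# copied from the rounded cell means table in the notes.
cell <- matrix(c(0.413, 0.320, 0.210,
                 0.880, 0.815, 0.335,
                 0.568, 0.375, 0.235,
                 0.610, 0.668, 0.325),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "B", "C", "D"), c("1", "2", "3")))

mu    <- mean(cell)             # grand mean
alpha <- rowMeans(cell) - mu    # F1 effects
beta  <- colMeans(cell) - mu    # F2 effects
# Interaction effects: cell mean - row mean - column mean + grand mean
ab <- cell - outer(rowMeans(cell), colMeans(cell), "+") + mu

print(round(alpha, 3))
print(round(beta, 3))
print(round(ab, 3))

# The effects sum to zero (up to floating point error)
sum.alpha <- sum(alpha)
sum.beta  <- sum(beta)
sum.ab    <- sum(ab)

# Each cell mean is recovered as the sum of its effects
recon <- mu + outer(alpha, beta, "+") + ab
max.err <- max(abs(recon - cell))
```

In this construction the interaction effects also sum to zero within every row and every column, which is stronger than the single sum over all cells.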
be the sample mean and variance, respectively, for the K responses at the ith level of
F1 and the j th level of F2. Inferences about the population means are based on the
table of sample means:
                       Level of F2
Level of F1    1      2      ···    J      F1 marg
1              ȳ11    ȳ12    ···    ȳ1J    ȳ1·
2              ȳ21    ȳ22    ···    ȳ2J    ȳ2·
⋮              ⋮      ⋮             ⋮      ⋮
I              ȳI1    ȳI2    ···    ȳIJ    ȳI·
F2 marg        ȳ·1    ȳ·2    ···    ȳ·J    ȳ··
The F1 marginal sample means are averages within rows of the table:
$$\bar{y}_{i\cdot} = \frac{1}{J} \sum_{c} \bar{y}_{ic}.$$
The sample sizes in each of the IJ treatment groups are equal (K), so ȳi· is the
sample average of all responses at the ith level of F1, ȳ·j is the sample average of all
responses at the j th level of F2, and ȳ·· is the average response in the experiment.
Under the interaction model, the estimated population mean for the (i, j)th cell is
the observed cell mean: µ̂ij = ȳij . This can be partitioned into estimated effects
that satisfy
$$\hat{\mu}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}.$$
of F2. The test for no F1 effect is based on MS F1/MS Error, which is compared
to the upper tail of an F-distribution with numerator and denominator df of I − 1
and IJ(K − 1), respectively. H0 is rejected when the F1 marginal means ȳi· vary
significantly relative to the within sample variation. Equivalently, H0 is rejected
when the sum of squared F1 effects (between sample variation) is large relative to
the within sample variation.
The test of no F2 effect: H0 : β1 = · · · = βJ = 0 is equivalent to testing H0 :
µ̄·1 = µ̄·2 = · · · = µ̄·J . The absence of an F2 effect implies that each level of F2
has the same population mean response when the means are averaged over
levels of F1. The test for no F2 effect is based on MS F2/MS Error, which is
compared to an F-distribution with numerator and denominator df of J − 1 and
IJ(K − 1), respectively. H0 is rejected when the F2 marginal means ȳ·j vary
significantly relative to the within sample variation. Equivalently, H0 is rejected
when the sum of squared F2 effects (between sample variation) is large relative to
the within sample variation.
The test of no interaction: H0 : (αβ)ij = 0 for all i and j is based on MS Interact/MS Error,
which is compared to an F-distribution with numerator and denominator df of
(I − 1)(J − 1) and IJ(K − 1), respectively.
The interaction model places no restrictions on the population means µij . Since
the population means can be arbitrary, the interaction model can be viewed as a one
factor model with IJ treatments. One connection between the two ways of viewing the
two-factor analysis is that the F1, F2, and Interaction SS for the two-way interaction
model sum to the Treatment or Model SS for comparing the IJ treatments. The
Error SS for the two-way interaction model is identical to the Error SS for a one-way
ANOVA of the IJ treatments. An overall test of no differences in the IJ population
means is part of the two-way analysis.
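The connection between the two views can be checked on any balanced data set. The sketch below uses simulated data, so the design sizes `I`, `J`, `K`, the seed, and the generating means are all assumptions of mine rather than anything from the examples in these notes. It confirms that the F1, F2, and Interaction SS add up to the one-way Treatment SS, that the Error SS agree, and that the interaction p-value is the upper tail of an F-distribution with (I − 1)(J − 1) and IJ(K − 1) df.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
I <- 3; J <- 4; K <- 5           # hypothetical balanced design
d <- expand.grid(rep = 1:K, f1 = factor(1:I), f2 = factor(1:J))
d$y <- rnorm(nrow(d), mean = 10 + as.integer(d$f1) + 0.5 * as.integer(d$f2))

# Two-way interaction model (balanced, so the SS decomposition is unique)
two_way <- anova(lm(y ~ f1 * f2, data = d))

# One-way model treating the IJ cells as treatments
d$cell  <- interaction(d$f1, d$f2)
one_way <- anova(lm(y ~ cell, data = d))

ss_two <- two_way[["Sum Sq"]]    # order: f1, f2, f1:f2, Residuals
ss_one <- one_way[["Sum Sq"]]    # order: cell, Residuals
model_ss_two <- sum(ss_two[1:3]) # F1 + F2 + Interaction SS

# Interaction F-test by hand: MS Interact / MS Error, upper tail
ms    <- ss_two / two_way[["Df"]]
f_int <- ms[3] / ms[4]
p_int <- pf(f_int, df1 = (I - 1) * (J - 1), df2 = I * J * (K - 1),
            lower.tail = FALSE)
```

With these sizes the df column of `two_way` is 2, 3, 6, and 48, matching I − 1, J − 1, (I − 1)(J − 1), and IJ(K − 1).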
I always summarize the data using the cell and marginal means instead of the
estimated effects, primarily because means are the basic building blocks for the anal-
ysis. My discussion of the model and tests emphasizes both approaches to help you
make the connection with the two ways this material is often presented in texts.
Understanding interaction
To understand interaction, suppose you (conceptually) plot the means in each row
of the population table, giving what is known as the population mean profile
plot. The F1 marginal population means average the population means within the
F1 profiles. At each F2 level, the F2 marginal mean averages the population cell
means across F1 profiles.
is,
[Figure: two population mean profile plots, each showing Population Mean against Level of Factor 2 (levels 1 to 5), with one profile for each of F1=1, F1=2, and F1=3.]
The roles of F1 and F2 can be reversed in these plots without changing the assess-
ment of a presence or absence of interaction. It is often helpful to view the interaction
plot from both perspectives.
A qualitative check for interaction can be based on the sample means profile
plot, but keep in mind that profiles of sample means are never perfectly parallel even
when the factors do not interact in the population. The Interaction SS measures the
extent of non-parallelism in the sample mean profiles. In particular, the Interaction
SS is zero when the sample mean profiles are perfectly parallel because $\widehat{(\alpha\beta)}_{ij} = 0$ for
all i and j.
mean(beetles.long[, "hours10"])
## [1] 0.479375
beetles.mean <- ddply(beetles.long, .(), summarise, m = mean(hours10))
beetles.mean
## .id m
## 1 <NA> 0.479375
beetles.mean.d <- ddply(beetles.long, .(dose), summarise, m = mean(hours10))
beetles.mean.d
## dose m
## 1 low 0.617500
## 2 medium 0.544375
## 3 high 0.276250
beetles.mean.i <- ddply(beetles.long, .(insecticide), summarise, m = mean(hours10))
beetles.mean.i
## insecticide m
## 1 A 0.3141667
## 2 B 0.6766667
## 3 C 0.3925000
## 4 D 0.5341667
beetles.mean.di <- ddply(beetles.long, .(dose,insecticide), summarise, m = mean(hours10))
beetles.mean.di
## dose insecticide m
## 1 low A 0.4125
## 2 low B 0.8800
## 3 low C 0.5675
## 4 low D 0.6100
## 5 medium A 0.3200
## 6 medium B 0.8150
## 7 medium C 0.3750
## 8 medium D 0.6675
## 9 high A 0.2100
## 10 high B 0.3350
## 11 high C 0.2350
## 12 high D 0.3250
# Interaction plots, ggplot
[Figure: Beetles interaction plots of mean hours10, titled "insecticide by dose" and "dose by insecticide", shown first as ggplot versions and then as base R interaction plots.]
In the lm() function below we specify a first-order model with interactions, in-
cluding the main effects and two-way interactions. The interaction between dose and
insecticide is indicated with dose:insecticide. The shorthand dose*insecticide
expands to “dose + insecticide + dose:insecticide” for this first-order model.
The F -test at the bottom of the summary() tests for no differences among the
population mean survival times for the 12 dose and insecticide combinations. The
p-value of < 0.0001 strongly suggests that the population mean survival times are
not all equal.
The next summary at the top gives two partitionings of the one-way ANOVA
Treatment SS into the SS for Dose, Insecticide, and the Dose by Insecticide interac-
tion. The Mean Squares, F-statistics and p-values for testing these effects are given.
The p-values for the F-statistics indicate that the dose and insecticide effects are
significant at the 0.01 level. The F-test for no dose by insecticide interaction is not
significant at the 0.10 level (p-value=0.112). Thus, the interaction seen in the profile
plot of the sample means might be due solely to chance or sampling variability.
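The formula shorthand can be inspected without fitting anything. This small base R sketch uses only the variable names from the beetles example; no data are needed because `terms()` works on the formula alone.

```r
# Term labels implied by the shorthand and by the explicit form
full     <- attr(terms(hours10 ~ dose * insecticide), "term.labels")
explicit <- attr(terms(hours10 ~ dose + insecticide + dose:insecticide),
                 "term.labels")
same_terms <- identical(full, explicit)

# Removing the interaction term from the formula leaves the additive model
additive <- update(hours10 ~ dose * insecticide, ~ . - dose:insecticide)
additive_terms <- attr(terms(additive), "term.labels")
```

Both formulas list the terms dose, insecticide, and dose:insecticide, and the updated formula keeps only the two main effects.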
lm.h.d.i.di <- lm(hours10 ~ dose + insecticide + dose:insecticide
, data = beetles.long)
# lm.h.d.i.di <- lm(hours10 ~ dose*insecticide, data = beetles.long) # equivalent
library(car)
Anova(lm.h.d.i.di, type=3)
## Anova Table (Type III tests)
##
## Response: hours10
## Sum Sq Df F value Pr(>F)
## (Intercept) 0.68063 1 30.6004 2.937e-06 ***
## dose 0.08222 2 1.8482 0.1721570
## insecticide 0.45395 3 6.8031 0.0009469 ***
Since the interaction is not significant, I’ll drop the interaction term and fit the
additive model with main effects only. I update the model by removing the interaction
term.
lm.h.d.i <- update(lm.h.d.i.di, ~ . - dose:insecticide )
library(car)
Anova(lm.h.d.i, type=3)
## Anova Table (Type III tests)
##
## Response: hours10
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.63654 1 65.408 4.224e-10 ***
## dose 1.03301 2 20.643 5.704e-07 ***
The Bonferroni multiple comparisons indicate which treatment effects are differ-
ent.
# Testing multiple factors is of interest here.
# Note that the code below corrects the p-values
# for all the tests done for both factors together,
# that is, the Bonferroni-corrected significance level is (alpha / (d + i))
# where d = number of dose comparisons
# and i = number of insecticide comparisons.
[Figure: Bonferroni-adjusted treatment contrasts (linear function estimates with intervals) for the insecticide comparisons B − A, C − A, D − A, C − B, D − B, and D − C.]
If dose and insecticide interact, you can conclude that beetles given a high dose
of the insecticide typically survive for shorter periods of time averaged over insecticides.
You cannot, in general, conclude that the highest dose yields the lowest
survival time regardless of insecticide. For example, the difference in the medium
and high dose marginal means (0.544 - 0.276 = 0.268) estimates the typical decrease
in survival time achieved by using the high dose instead of the medium dose, averaged
over insecticides. If the two factors interact, then the difference in mean times between
the medium and high doses on a given insecticide may be significantly greater than
0.268, significantly less than 0.268, or even negative. In the latter case the medium
dose would be better than the high dose for the given insecticide, even though the
high dose gives better performance averaged over insecticides. An interaction forces
you to use the cell means to decide which combination of dose and insecticide gives
the best results (and the multiple comparisons as they were done above do not give
multiple comparisons of cell means; a single factor variable combining both factors
would need to be created). Of course, our profile plot tells us that this hypothetical
situation is probably not tenable here, but it could be so when a significant interaction
is present.
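To see how the marginal difference relates to the insecticide-specific differences, the medium-versus-high comparison can be computed from the table of beetle cell means printed earlier. This is a sketch of the arithmetic only; the object names are mine.

```r
# Beetle cell means of hours10 (rows = dose, columns = insecticide),
# copied from the ddply() output above
cell <- matrix(c(0.4125, 0.8800, 0.5675, 0.6100,
                 0.3200, 0.8150, 0.3750, 0.6675,
                 0.2100, 0.3350, 0.2350, 0.3250),
               nrow = 3, byrow = TRUE,
               dimnames = list(dose = c("low", "medium", "high"),
                               insecticide = c("A", "B", "C", "D")))

# Medium minus high, separately for each insecticide
diff_by_insecticide <- cell["medium", ] - cell["high", ]

# Medium minus high for the marginal (averaged over insecticides) means
marginal_diff <- mean(cell["medium", ]) - mean(cell["high", ])
```

Because the design is balanced, `marginal_diff` (0.268) is exactly the average of the four insecticide-specific differences, which here range from 0.11 for A to 0.48 for B. Whether that spread reflects a real interaction or just sampling variability is what the interaction F-test addresses.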
If dose and insecticide do not interact, then the difference in marginal dose
means averaged over insecticides also estimates the difference in population mean
survival times between two doses, regardless of the insecticide. This follows from
the parallel profiles definition of no interaction. Thus, the difference in the medium
and high dose marginal means (0.544 - 0.276 = 0.268) estimates the expected decrease
in survival time anticipated from using the high dose instead of the medium dose,
regardless of the insecticide (and hence also when averaged over insecticides). A
practical implication of no interaction is that you can conclude that the high dose is
best, regardless of the insecticide used. The difference in marginal means for two doses
estimates the difference in average survival expected, regardless of the insecticide.
An ordering of the mean survival times on the four insecticides (averaged over
the three doses) is given below. Three groups are obtained from the Bonferroni
comparisons, with any two insecticides separated by one or more other insecticides
in the ordered string having significantly different mean survival times averaged over
doses.
If interaction is present, you can conclude that insecticide A is no better than C,
but significantly better than B or D, when performance is averaged over doses. If
the interaction is absent, then A is not significantly better than C, but is significantly
better than B or D, regardless of the dose. Furthermore, for example, the difference
in marginal means for insecticides B and A of 0.677 - 0.314 = 0.363 is the expected
decrease in survival time from using A instead of B, regardless of dose. This is also
the expected decrease in survival times when averaged over doses.
Insect:      B       D       C       A
Marg Mean:   0.677   0.534   0.393   0.314
Groups:      -------------
                     -------------
                             -------------
material temp v1 v2 v3 v4
1 1 50 130 155 74 180
2 1 65 34 40 80 75
3 1 80 20 70 82 58
4 2 50 150 188 159 126
5 2 65 136 122 106 115
6 2 80 25 70 58 45
7 3 50 138 110 168 160
8 3 65 174 120 150 139
9 3 80 96 104 82 60
library(reshape2)
battery.long <- melt(battery
, id.vars = c("material", "temp")
, variable.name = "battery"
, value.name = "maxvolt"
)
str(battery.long)
## 'data.frame': 36 obs. of 4 variables:
## $ material: Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ temp : Factor w/ 3 levels "50","65","80": 1 2 3 1 2 3 1 2 3 1 ...
## $ battery : Factor w/ 4 levels "v1","v2","v3",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ maxvolt : int 130 34 20 150 136 25 138 174 96 155 ...
The overall F-test at the bottom indicates at least one parameter in the model is
significant. The two-way ANOVA table indicates that the main effect of temperature
and the interaction are significant at the 0.05 level, but the main effect of material is not.
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
library(car)
Anova(lm.m.m.t.mt, type=3)
## Anova Table (Type III tests)
##
## Response: maxvolt
## Sum Sq Df F value Pr(>F)
## (Intercept) 72630 1 107.5664 6.456e-11 ***
## material 886 2 0.6562 0.5268904
## temp 15965 2 11.8223 0.0002052 ***
## material:temp 9614 4 3.5595 0.0186112 *
## Residuals 18231 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.m.m.t.mt)
##
## Call:
## lm(formula = maxvolt ~ material * temp, data = battery.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.750 -14.625 1.375 17.938 45.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.75 12.99 10.371 6.46e-11 ***
## material2 21.00 18.37 1.143 0.263107
## material3 9.25 18.37 0.503 0.618747
## temp65 -77.50 18.37 -4.218 0.000248 ***
## temp80 -77.25 18.37 -4.204 0.000257 ***
## material2:temp65 41.50 25.98 1.597 0.121886
The cell means plots of the material profiles have different slopes, which is consis-
tent with the presence of a temperature-by-material interaction.
library(plyr)
# Calculate the cell means for each (material, temp) combination
battery.mean <- ddply(battery.long, .(), summarise, m = mean(maxvolt))
battery.mean
## .id m
## 1 <NA> 105.5278
battery.mean.m <- ddply(battery.long, .(material), summarise, m = mean(maxvolt))
battery.mean.m
## material m
## 1 1 83.16667
## 2 2 108.33333
## 3 3 125.08333
battery.mean.t <- ddply(battery.long, .(temp), summarise, m = mean(maxvolt))
battery.mean.t
## temp m
## 1 50 144.83333
## 2 65 107.58333
## 3 80 64.16667
battery.mean.mt <- ddply(battery.long, .(material,temp), summarise, m = mean(maxvolt))
battery.mean.mt
## material temp m
## 1 1 50 134.75
## 2 1 65 57.25
## 3 1 80 57.50
## 4 2 50 155.75
## 5 2 65 119.75
## 6 2 80 49.50
## 7 3 50 144.00
## 8 3 65 145.75
## 9 3 80 85.50
# Interaction plots, ggplot
p <- ggplot(battery.long, aes(x = material, y = maxvolt, colour = temp, shape = temp))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
[Figure: Battery interaction plots of maxvolt, one against material with a profile for each temp (50, 65, 80) and one against temp with a profile for each material (1, 2, 3).]
[Figure: Bonferroni-adjusted treatment contrasts (linear function estimates with intervals) for the temperature comparisons 65 − 50, 80 − 50, and 80 − 65.]
The Bonferroni comparisons indicate that the population mean max voltage for
the three temperatures averaged over material types decreases as the temperature
increases:
Temp: 80 65 50
Marg mean: 64.17 107.58 144.83
Group: ------------- ------
However, you can compare materials at each temperature, and you can compare
temperatures for each material. At individual temperatures, material 2 and 3 (or 1
and 2) might be significantly different even though they are not significantly different
when averaged over temperatures. For example, material 2 might produce a signifi-
cantly higher average output than the other two material types at 50 degrees. This
comparison of cell means is relevant if you are interested in using the batteries at 50
degrees! Comparing cell means is possible using “lsmeans”, a point I will return to
later.
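Until then, one base R alternative for comparing cell means is to recode the two factors as a single factor with nine levels, as mentioned for the beetles. The sketch below is not the lsmeans approach used later in these notes. It rebuilds the battery data from the table above, stacks it by hand rather than with melt(), and uses `pairwise.t.test()` with a Bonferroni adjustment over all 36 pairs of cells. The column name `cell` and the object names are my own.

```r
# Battery data in wide form, copied from the table above
battery <- data.frame(
  material = factor(rep(1:3, each = 3)),
  temp     = factor(rep(c(50, 65, 80), times = 3)),
  v1 = c(130,  34,  20, 150, 136,  25, 138, 174,  96),
  v2 = c(155,  40,  70, 188, 122,  70, 110, 120, 104),
  v3 = c( 74,  80,  82, 159, 106,  58, 168, 150,  82),
  v4 = c(180,  75,  58, 126, 115,  45, 160, 139,  60)
)

# Stack the four replicate columns into long form
battery.long <- data.frame(
  material = rep(battery$material, times = 4),
  temp     = rep(battery$temp, times = 4),
  maxvolt  = c(battery$v1, battery$v2, battery$v3, battery$v4)
)

# Single factor identifying the 9 (material, temp) cells, e.g. "2:50"
battery.long$cell <- interaction(battery.long$material, battery.long$temp,
                                 sep = ":")

# The one-way fit on cells has the same Error SS as the two-way interaction fit
fit.two <- lm(maxvolt ~ material * temp, data = battery.long)
fit.one <- lm(maxvolt ~ cell, data = battery.long)

# Bonferroni-adjusted pairwise comparisons of all cell means (pooled SD)
pw <- pairwise.t.test(battery.long$maxvolt, battery.long$cell,
                      p.adjust.method = "bonferroni")
```

The adjusted p-value for, say, material 2 versus material 1 at 50 degrees is `pw$p.value["2:50", "1:50"]`. Adjusting over all 36 pairs is conservative if only the within-temperature comparisons are of interest.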
# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/insecticide cell
beetles.meansd.di <- ddply(beetles.long, .(dose,insecticide), summarise
, m = mean(hours10), s = sd(hours10))
beetles.meansd.di
## dose insecticide m s
## 1 low A 0.4125 0.06946222
## 2 low B 0.8800 0.16083117
## 3 low C 0.5675 0.15671099
## 4 low D 0.6100 0.11284207
## 5 medium A 0.3200 0.07527727
[Figure: boxplots of hours10 by dose and insecticide (left); cell standard deviation s against cell mean m, marked by dose and insecticide (right).]
Diagnostic plots show the following features. The normal quantile plot shows
an “S” shape rather than a straight line, suggesting the residuals are not normal,
but have higher kurtosis (more peaky) than a normal distribution. The residuals
vs the fitted (predicted) values show that the higher the predicted value the more
variability (horn shaped). The plot of the Cook’s distances indicates a few influential
observations.
# interaction model
lm.h.d.i.di <- lm(hours10 ~ dose*insecticide, data = beetles.long)
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.h.d.i.di, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.h.d.i.di$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 42 20 30
[Figure: diagnostic plots for lm.h.d.i.di: Residuals vs Fitted, Cook's distance, Cook's distance vs leverage, and a QQ plot of the residuals against normal quantiles; observations 42, 20, and 30 are labeled.]
Survival times are usually right skewed, with the spread or variability in the
distribution increasing as the mean or median increases. Ideally, the distributions
should be symmetric, normal, and the standard deviation should be fairly constant
across groups.
The boxplots (note the ordering) and the plot of the sij against ȳij show the
tendency for the spread to increase with the mean. This is reinforced by the residual
plot, where the variability increases as the predicted values (the cell means under the
two-factor interaction model) increase.
As noted earlier, the QQ-plot of the studentized residuals is better suited to
examine normality here than the boxplots which are constructed from 4 observations.
Not surprisingly, the boxplots do not suggest non-normality. Looking at the QQ-plot
we clearly see evidence of non-normality.
to plot the IQR against the median to get a more robust view of the dependence of
spread on typical level because sij and ȳij are sensitive to outliers.
1. If sij increases linearly with ȳij , use a log transformation of the response.
2. If sij increases as a quadratic function of ȳij , use a reciprocal (inverse) trans-
formation of the response.
3. If sij increases as a square root function of ȳij , use a square root transformation
of the response.
4. If sij is roughly independent of ȳij , do not transform the response. This idea
does not require the response to be non-negative!
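One way to apply these rules is to estimate how the spread grows with the mean on a log-log scale. The sketch below assumes that "increases as a ... function" can be read as a power law, s roughly proportional to ȳ^k, so that the slope of log(s) on log(ȳ) estimates k (about 1 for linear, 2 for quadratic, 0.5 for square root, 0 for no relationship). That reading, the helper names, the cutoff `tol`, and the example values are all my assumptions, not part of the notes.

```r
# Hypothetical helper: slope k in log(s) = a + k * log(m)
spread_power <- function(m, s) {
  fit <- lm(log(s) ~ log(m))
  unname(coef(fit)[2])
}

# Hypothetical helper: map an estimated k to the rule of thumb listed above
suggest_transform <- function(k, tol = 0.25) {
  if (abs(k - 1) < tol) {
    "log"
  } else if (abs(k - 2) < tol) {
    "reciprocal"
  } else if (abs(k - 0.5) < tol) {
    "square root"
  } else if (abs(k) < tol) {
    "none"
  } else {
    "no simple rule"
  }
}

# Made-up cell means and SDs where the SD is exactly quadratic in the mean
m <- c(0.2, 0.4, 0.6, 0.8, 1.0)
s_quadratic <- 0.3 * m^2
k_hat <- spread_power(m, s_quadratic)
choice <- suggest_transform(k_hat)
```

With real cell summaries the slope will be noisy, especially with only a handful of cells, so this is a rough guide to be read alongside the plot rather than a decision rule.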
A logarithmic transformation or a reciprocal (inverse) transformation of the sur-
vival times might help to stabilize the variance. The survival time distributions are
fairly symmetric, so these nonlinear transformations may destroy the symmetry. As
a first pass, I will consider the reciprocal transformation because the inverse survival
time has a natural interpretation as the dying rate. For example, if you survive 2
hours, then 1/2 is the proportion of your remaining lifetime expired in the next hour.
The unit of time is actually 10 hours, so 0.1/time is the actual rate. The 0.1 scaling
factor has no effect on the analysis provided you appropriately rescale the results on
the mean responses.
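The claim that the 0.1 scaling factor does not affect the analysis can be illustrated with a quick check. The data below are simulated stand-ins with the same layout as the beetles example (the seed and the uniform survival times are arbitrary assumptions); the point is only that the F-statistics are identical and the coefficients rescale by 0.1.

```r
set.seed(2)   # arbitrary seed
# Simulated stand-in for the beetles layout: 3 doses x 4 insecticides x 4 reps
d <- expand.grid(rep = 1:4,
                 dose = factor(c("low", "medium", "high")),
                 insecticide = factor(c("A", "B", "C", "D")))
d$hours10 <- runif(nrow(d), min = 0.2, max = 1.2)   # made-up survival times

fit_rate   <- lm(I(1 / hours10)   ~ dose + insecticide, data = d)
fit_scaled <- lm(I(0.1 / hours10) ~ dose + insecticide, data = d)

# F-statistics are unchanged by the constant multiplier
f_rate   <- anova(fit_rate)[["F value"]]
f_scaled <- anova(fit_scaled)[["F value"]]

# Estimated coefficients (and hence fitted means) are simply multiplied by 0.1
coef_ratio <- coef(fit_scaled) / coef(fit_rate)
```

So results reported for 1/hours10 can be converted to rates per hour by multiplying estimated means and differences by 0.1.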
Create the rate variable.
#### Example: Beetles, non-constant variance
# create the rate variable (1/hours10)
beetles.long$rate <- 1/beetles.long$hours10
# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/insecticide cell
beetles.meansd.di.rate <- ddply(beetles.long, .(dose,insecticide), summarise
, m = mean(rate), s = sd(rate))
beetles.meansd.di.rate
## dose insecticide m s
## 1 low A 2.486881 0.4966627
## 2 low B 1.163464 0.1994976
## 3 low C 1.862724 0.4893774
## 4 low D 1.689682 0.3647127
## 5 medium A 3.268470 0.8223269
## 6 medium B 1.393392 0.5531885
## 7 medium C 2.713919 0.4175138
[Figure: boxplots of rate by dose and insecticide (left); cell standard deviation s against cell mean m for the rates, marked by dose and insecticide (right).]
The profile plots and ANOVA table indicate that the main effects are significant
but the interaction is not.
library(plyr)
# Calculate the cell means for each (dose, insecticide) combination
beetles.mean <- ddply(beetles.long, .(), summarise, m = mean(rate))
beetles.mean
## .id m
## 1 <NA> 2.622376
beetles.mean.d <- ddply(beetles.long, .(dose), summarise, m = mean(rate))
beetles.mean.d
## dose m
## 1 low 1.800688
## 2 medium 2.269329
## 3 high 3.797112
beetles.mean.i <- ddply(beetles.long, .(insecticide), summarise, m = mean(rate))
beetles.mean.i
## insecticide m
## 1 A 3.519345
## 2 B 1.861943
## 3 C 2.947210
## 4 D 2.161007
beetles.mean.di <- ddply(beetles.long, .(dose,insecticide), summarise, m = mean(rate))
beetles.mean.di
## dose insecticide m
## 1 low A 2.486881
## 2 low B 1.163464
## 3 low C 1.862724
## 4 low D 1.689682
## 5 medium A 3.268470
## 6 medium B 1.393392
## 7 medium C 2.713919
## 8 medium D 1.701534
## 9 high A 4.802685
## 10 high B 3.028973
## 11 high C 4.264987
## 12 high D 3.091805
# Interaction plots, ggplot
[Figure: Beetles interaction plots of rate, titled "insecticide by dose" and "dose by insecticide".]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.r.d.i.di)
##
## Call:
## lm(formula = rate ~ dose * insecticide, data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76847 -0.29642 -0.06914 0.25458 1.07936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.48688 0.24499 10.151 4.16e-12 ***
## dosemedium 0.78159 0.34647 2.256 0.030252 *
## dosehigh 2.31580 0.34647 6.684 8.56e-08 ***
## insecticideB -1.32342 0.34647 -3.820 0.000508 ***
## insecticideC -0.62416 0.34647 -1.801 0.080010 .
## insecticideD -0.79720 0.34647 -2.301 0.027297 *
## dosemedium:insecticideB -0.55166 0.48999 -1.126 0.267669
## dosehigh:insecticideB -0.45030 0.48999 -0.919 0.364213
## dosemedium:insecticideC 0.06961 0.48999 0.142 0.887826
## dosehigh:insecticideC 0.08646 0.48999 0.176 0.860928
## dosemedium:insecticideD -0.76974 0.48999 -1.571 0.124946
## dosehigh:insecticideD -0.91368 0.48999 -1.865 0.070391 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 36 degrees of freedom
## Multiple R-squared: 0.8681,Adjusted R-squared: 0.8277
## F-statistic: 21.53 on 11 and 36 DF, p-value: 1.289e-12
##
## Call:
## lm(formula = rate ~ dose + insecticide, data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82757 -0.37619 0.02116 0.27568 1.18153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6977 0.1744 15.473 < 2e-16 ***
## dosemedium 0.4686 0.1744 2.688 0.01026 *
## dosehigh 1.9964 0.1744 11.451 1.69e-14 ***
## insecticideB -1.6574 0.2013 -8.233 2.66e-10 ***
## insecticideC -0.5721 0.2013 -2.842 0.00689 **
## insecticideD -1.3583 0.2013 -6.747 3.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4931 on 42 degrees of freedom
## Multiple R-squared: 0.8441,Adjusted R-squared: 0.8255
## F-statistic: 45.47 on 5 and 42 DF, p-value: 6.974e-16
Unlike the original analysis, the residual plots do not show any gross deviations
from assumptions. Also, no case seems relatively influential.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.r.d.i, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.r.d.i$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 41 4 33
## residuals vs order of data
#plot(lm.r.d.i$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.r.d.i: Residuals vs Fitted, Cook's distance, Cook's distance vs leverage, and a QQ plot of the residuals against normal quantiles; observations 41, 4, and 33 are labeled.]
[Figure: Bonferroni-adjusted treatment contrasts (linear function estimates with intervals) for the insecticide comparisons B − A, C − A, D − A, C − B, D − B, and D − C on the rate scale.]
of the observed interaction between the main effects. Although the interaction in the
original analysis was not significant at the 10% level (p-value=0.112), the small sam-
ple sizes suggest that power for detecting interaction might be low. To be on the safe
side, one might interpret the main effects in the original analysis as if an interaction
were present. This need appears to be less pressing with the rates.
The statistical assumptions are reasonable for an analysis of the rates. I think
that the simplicity of the main effects interpretation is a strong motivating factor for
preferring the analysis of the transformed data to the original analysis. You might
disagree, especially if you believe that the original time scale is most relevant for
analysis.
Given the suitability of the inverse transformation, I did not consider the loga-
rithmic transformation.
When there are model interactions, the comparisons of the main effects are inap-
propriate, and give different results depending on the method of comparison.
# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
When there are model interactions and you want to compare cell means, levels of
one factor at each level of another factor separately, then you must use lsmeans().
# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
Finally, an important point demonstrated in the next section is that the cell and
marginal averages given by the means and lsmeans methods agree here for the
main effects model because the design is balanced. For unbalanced designs with two
or more factors, lsmeans and means compute different averages. I will argue that
lsmeans are the appropriate averages for unbalanced analyses. You should use the
means statement with caution — it is OK for balanced or unbalanced one-factor
designs, and for the balanced two-factor designs (including the RB) that we have
discussed.
## 48 p 60 83
# mean vs sd plot
library(plyr)
# means and standard deviations for each time/vein cell
rat.meansd.tv <- ddply(rat, .(time,vein), summarise
, m = mean(insulin), s = sd(insulin), n = length(insulin))
rat.meansd.tv
## time vein m s n
## 1 0 j 26.60000 12.75931 5
## 2 0 p 81.91667 27.74710 12
## 3 30 j 79.50000 36.44585 6
## 4 30 p 172.90000 76.11753 10
## 5 60 j 61.33333 62.51666 3
## 6 60 p 128.50000 49.71830 12
p <- ggplot(rat.meansd.tv, aes(x = m, y = s, shape = time, colour = vein, label=n))
p <- p + geom_point(size=4)
# labels are sample sizes
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
p <- p + labs(title = "Rats standard deviation vs mean")
print(p)
[Figure: boxplots of insulin by time and vein (left); "Rats standard deviation vs mean", s against m with points labeled by sample size (right).]
We take the log of insulin to correct the problem. The variances are more constant
now, except for one sample with only 3 observations which has a larger standard
deviation than the others, but because this is based on such a small sample size, it’s
not of much concern.
rat$loginsulin <- log(rat$insulin)
# boxplots, ggplot
p <- ggplot(rat, aes(x = time, y = loginsulin, colour = vein))
p <- p + geom_boxplot()
print(p)
# mean vs sd plot
library(plyr)
# means and standard deviations for each time/vein cell
rat.meansd.tv <- ddply(rat, .(time,vein), summarise
, m = mean(loginsulin)
, s = sd(loginsulin)
, n = length(loginsulin))
rat.meansd.tv
## time vein m s n
## 1 0 j 3.179610 0.5166390 5
## 2 0 p 4.338230 0.4096427 12
## 3 30 j 4.286804 0.4660571 6
## 4 30 p 5.072433 0.4185221 10
## 5 60 j 3.759076 1.0255165 3
## 6 60 p 4.785463 0.3953252 12
p <- ggplot(rat.meansd.tv, aes(x = m, y = s, shape = time, colour = vein, label=n))
p <- p + geom_point(size=4)
# labels are sample sizes
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
p <- p + labs(title = "Rats standard deviation vs mean")
print(p)
[Figure: boxplots of loginsulin by time and vein (left); "Rats standard deviation vs mean" for loginsulin, s against m with points labeled by sample size (right).]
For the ugly details, see http://goanna.cs.rmit.edu.au/~fscholer/anova.php.
Because the profile plot lines all seem parallel, and because of the interaction
Type III SS p-value above, it appears there is not sufficient evidence for a vein-by-
time interaction. For now we’ll keep the interaction in the model for the purpose of
discussing differences between means and lsmeans and Type I and Type III SS.
# calculate means for plot
library(plyr)
rat.mean.tv <- ddply(rat, .(time,vein), summarise, m = mean(loginsulin))
[Figure: Rats interaction plots of mean loginsulin, one against time with a profile for each vein (j, p) and one against vein with a profile for each time (0, 30, 60).]
## 1 0 3.997460
## 2 30 4.777822
## 3 60 4.580186
library(lsmeans)
lsmeans(lm.i.t.v.tv, list(pairwise ~ time), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of time`
## time lsmean SE df lower.CL upper.CL
## 0 3.758920 0.1258994 42 3.504845 4.012996
## 30 4.679619 0.1221403 42 4.433130 4.926108
## 60 4.272270 0.1526754 42 3.964158 4.580381
##
## Results are averaged over the levels of: vein
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 0 - 30 -0.9206985 0.1754107 42 -5.249 <.0001
## 0 - 60 -0.5133494 0.1978900 42 -2.594 0.0390
## 30 - 60 0.4073491 0.1955199 42 2.083 0.1300
##
## Results are averaged over the levels of: vein
## P value adjustment: bonferroni method for 3 tests
For completeness, these diagnostic plots are mostly fine, though the plot of the
Cook’s distances indicates a couple of influential observations.
# interaction model
lm.i.t.v.tv <- lm(loginsulin ~ time*vein, data = rat
, contrasts = list(time = contr.sum, vein = contr.sum))
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.i.t.v.tv, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.i.t.v.tv$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 13 12 17
## residuals vs order of data
#plot(lm.i.t.v.tv$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.i.t.v.tv: Residuals vs Fitted, Cook's distance, Cook's distance vs leverage, residuals plotted against time and vein, and a QQ plot of the residuals against normal quantiles; observations 13, 12, and 17 are labeled.]
Should I use means or lsmeans, Type I or Type III SS? Use lsmeans and
Type III SS.
Regardless of whether the design is balanced, the basic building blocks for a two-
factor analysis are cell means, and the marginal means, defined as the average of the
cell means over the levels of the other factor.
The F -statistics based on Type III SSs are appropriate for unbalanced two-factor
designs because they test the same hypotheses that were considered in balanced de-
signs. That is, the Type III F -tests on the main effects check for equality in population
means averaged over levels of the other factor. The Type III F -test for no interaction
checks for parallel profiles. Given that the Type III F -tests for the main effects check
for equal population cell means averaged over the levels of the other factor, multiple
comparisons for main effects should be based on lsmeans.
The Type I SS and F -tests and the multiple comparisons based on means should
be ignored because they do not, in general, test meaningful hypotheses. The problem
with using the means output is that the experimenter has fixed the sample sizes for
a two-factor experiment, so comparisons of means, which ignore the second factor,
introduce a potential bias due to choice of sample sizes. Put another way, any
differences seen in the means in the jugular and portal could be solely due to the
sample sizes used in the experiment and not due to differences in the veins.
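The difference between the two kinds of averages can be reproduced from the table of rat cell summaries (log insulin) printed earlier. In this sketch, which uses my own object names, the "means" for each time are the sample-size-weighted averages of the two vein cells, while the lsmeans under the interaction model are the unweighted averages of the cell means.

```r
# Cell means (m) and sample sizes (n) of loginsulin, copied from rat.meansd.tv
rat.cells <- data.frame(
  time = factor(c(0, 0, 30, 30, 60, 60)),
  vein = factor(c("j", "p", "j", "p", "j", "p")),
  m    = c(3.179610, 4.338230, 4.286804, 5.072433, 3.759076, 4.785463),
  n    = c(5, 12, 6, 10, 3, 12)
)

by_time <- split(rat.cells, rat.cells$time)

# "means": average of all observations at each time = weighted by cell size
weighted   <- sapply(by_time, function(d) sum(d$n * d$m) / sum(d$n))

# "lsmeans": average of the cell means, each vein counted equally
unweighted <- sapply(by_time, function(d) mean(d$m))

time_averages <- round(cbind(weighted, unweighted), 4)
```

The weighted column matches the time means shown before the lsmeans output (3.9975, 4.7778, 4.5802), and the unweighted column matches the lsmeans (3.7589, 4.6796, 4.2723). The weighted averages sit closer to the portal vein cell means simply because the portal cells have more rats.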
Focusing on the Type III SS, the F -tests indicate that the vein and time effects
are significant, but that the interaction is not significant. The jugular and portal
profiles are reasonably parallel, which is consistent with a lack of interaction. What
can you conclude from the lsmeans comparisons of veins and times?
Answer: significant differences between veins, and between times 0 and 30.
4 Please attempt by hand before looking at the solutions at
http://statacumen.com/teach/ADA2/ADA2_05_PairedAndBlockDesigns_CoefScan.pdf.
A Short Discussion of
Observational Studies
In most scientific studies, the groups being compared do not consist of identical
experimental units that have been randomly assigned to receive a treatment. Instead,
the groups might be extremely heterogeneous on factors that might be related to a
specific response on which you wish to compare the groups. Inferences about the
nature of differences among groups in such observational studies can be flawed if
this heterogeneity is ignored in the statistical analysis.
The following problem emphasizes the care that is needed when analyzing obser-
vational studies, and highlights the distinction between the means and lsmeans
output for a two-way table. The data are artificial, but the conclusions are consis-
tent with an interesting analysis conducted by researchers at Sandia National Labo-
ratories.
A representative sample of 550 high school seniors was selected in 1970. A similar
sample of 550 was selected in 1990. The final SAT scores (on a 1600 point scale) were
obtained for each student1 .
The boxplots for the two samples show heavy-tailed distributions with similar
spreads. Given the large sample sizes, the F -test comparing populations is approxi-
mately valid even though the population distributions are non-normal.
#### Example: SAT
sat <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch06_sat.dat", header = TRUE)
sat$year <- factor(sat$year)
1 The fake-data example in this chapter is similar to a real-world SAT example illustrated in this
paper: “Minority Contributions to the SAT Score Turnaround: An Example of Simpson’s Paradox”
by Howard Wainer, Journal of Educational Statistics, Vol. 11, No. 4 (Winter, 1986), pp. 239–244,
http://www.jstor.org/stable/1164696.
[Figure: boxplots of grade by year (1970, 1990).]
A simple analysis might compare the average SAT scores for the two years, to
see whether students are scoring higher, lower, or about the same, over time. The
one-way lsmeans and means breakdowns of the SAT scores are identical; the av-
erage SAT scores for 1970 and 1990 are 892.8 and 882.2, respectively. The one-way
ANOVA, combined with the observed averages, indicates that the typical SAT score
has decreased significantly (10.7 points) over the 20 year period.
lm.g.y <- lm(grade ~ year, data = sat
, contrasts = list(year = contr.sum))
library(car)
# type III SS
Anova(lm.g.y, type=3)
## Anova Table (Type III tests)
##
## Response: grade
[Figure: plot of grade by year (1970, 1990), grouped by eth (1, 2).]
I fit a two-factor model with year and ethnicity effects plus an interaction. The
two-factor model gives a method to compare the SAT scores over time, after adjust-
ing for the effect of ethnicity on performance. The F -test for comparing years adjusts
for ethnicity because it is based on comparing the average SAT scores across years
after averaging the cell means over ethnicities, thereby eliminating from the compar-
ison of years any effects due to changes in the ethnic composition of the populations.
The two-way analysis is preferable to the unadjusted one-way analysis which ignores
ethnicity.
lm.g.y.e.ye <- lm(grade ~ year * eth, data = sat
, contrasts = list(year = contr.sum, eth = contr.sum))
The year and ethnicity main effects are significant in the two factor model, but
the interaction is not. The marginal lsmeans indicate that the average SAT score
increased significantly over time when averaged over ethnicities. This is consistent
with the cell mean SAT scores increasing over time within each ethnic group. Given
the lack of a significant interaction, the expected increase in SAT scores from 1970 to
1990 within each ethnic group is the difference in marginal averages: 912.0 - 861.9
= 50.1.
library(plyr)
# unbalanced, don't match (lsmeans is correct)
sat.mean.y <- ddply(sat, .(year), summarise, m = mean(grade))
sat.mean.y
## year m
## 1 1970 892.8418
## 2 1990 882.1545
library(lsmeans)
lsmeans(lm.g.y.e.ye, list(pairwise ~ year), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of year`
## year lsmean SE df lower.CL upper.CL
## 1970 861.926 0.5253021 1096 860.8953 862.9567
## 1990 912.037 0.5253021 1096 911.0063 913.0677
##
## Results are averaged over the levels of: eth
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
##
## $`pairwise differences of contrast, eth | eth`
## eth = 1:
## contrast estimate SE df t.ratio p.value
## 1970 - 1990 -48.848 1.050604 1096 -46.495 <.0001
##
## eth = 2:
## contrast estimate SE df t.ratio p.value
## 1970 - 1990 -51.374 1.050604 1096 -48.899 <.0001
lsmeans(lm.g.y.e.ye, list(pairwise ~ eth | year), adjust = "bonferroni")
## $`lsmeans of eth | year`
## year = 1970:
## eth lsmean SE df lower.CL upper.CL
## 1 899.712 0.3167691 1096 899.0905 900.3335
## 2 824.140 1.0017118 1096 822.1745 826.1055
##
## year = 1990:
## eth lsmean SE df lower.CL upper.CL
## 1 948.560 1.0017118 1096 946.5945 950.5255
## 2 875.514 0.3167691 1096 874.8925 876.1355
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, year | year`
## year = 1970:
## contrast estimate SE df t.ratio p.value
## 1 - 2 75.572 1.050604 1096 71.932 <.0001
##
## year = 1990:
## contrast estimate SE df t.ratio p.value
## 1 - 2 73.046 1.050604 1096 69.528 <.0001
As noted in the insulin analysis, the marginal lsmeans and means are different
for unbalanced two-factor analyses. The marginal means ignore the levels of the
other factors when averaging responses. The marginal lsmeans are averages of cell
means over the levels of the other factor. Thus, for example, the 1970 mean SAT
score of 892.8 is the average of the 550 scores selected that year. The 1970 lsmeans
SAT score of 861.9 is midway between the average 1970 SAT scores for the two ethnic
groups: 861.9 = (899.7 + 824.1)/2. Hopefully, this discussion also clarifies why the
year marginal means are identical in the one and two-factor analyses, but the year
lsmeans are not.
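The arithmetic behind the two kinds of marginal averages can be checked directly
from the cell means. The sketch below copies the four cell lsmeans from the output
above and assumes cell sample sizes of 500 and 50, which is what ten out of every
eleven of 550 students implies. The object names are illustrative only.

```r
# Cell means copied from the lsmeans output above (rows = year, columns = eth)
cell.means <- matrix(c(899.712, 824.140,
                       948.560, 875.514),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(year = c("1970", "1990"), eth = c("1", "2")))
# Assumed cell sample sizes: 10/11 and 1/11 of the 550 students sampled each year
cell.n <- matrix(c(500,  50,
                    50, 500),
                 nrow = 2, byrow = TRUE, dimnames = dimnames(cell.means))

# Marginal means: each cell mean is weighted by its sample size
year.means <- rowSums(cell.means * cell.n) / rowSums(cell.n)
# Marginal lsmeans: unweighted average of the cell means
year.lsmeans <- rowMeans(cell.means)

round(year.means, 1)    # 892.8 and 882.2: a decrease from 1970 to 1990
round(year.lsmeans, 1)  # 861.9 and 912.0: an increase from 1970 to 1990
```

The weighted averages reproduce the marginal means, and the unweighted averages
reproduce the lsmeans, so the two summaries differ only in how the cells are weighted.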
The 1970 and 1990 marginal means estimate the typical SAT score ignoring all
factors that may influence performance. These marginal averages are not relevant for
understanding any trends in performance over time because they do not account
for changes in the composition of the population that may be related to performance.
The average SAT scores (ignoring ethnicity) decreased from 1970 to 1990 because
the ethnic composition of the student population changed. Ten out of every eleven
students sampled in 1970 were from the first ethnic group. Only one out of eleven
students sampled in 1990 was from this group. Students in the second ethnic group
are underachievers, but they are becoming a larger portion of the population over
time. The decrease in average (means) performance inferred from comparing 1970
to 1990 is confounded with the increased representation of the underachievers over
time. Once ethnicity was taken into consideration, the typical SAT scores were shown
to have increased, rather than decreased.
In summary, the one-way analysis ignoring ethnicity is valid, and allows you to
conclude that the typical SAT score has decreased over time, but it does not provide
any insight into the nature of the changes that have occurred. A two-factor anal-
ysis backed up with a comparison of the marginal lsmeans is needed to compare
performances over time, adjusting for the changes in ethnic composition.
The Sandia study reached the same conclusion. The Sandia team showed that
the widely reported decreases in SAT scores over time are due to changes in the
ethnic distribution of the student population over time, with individuals in historically
underachieving ethnic groups becoming a larger portion of the student population over
time.
A more complete analysis of the SAT study would adjust the SAT scores to account
for other potential confounding factors, such as sex, and differences due to the number
of times the exam was taken. These confounding effects are taken into consideration
by including them as effects in the model.
The interpretation of the results from an observational study with several effects
of interest, and several confounding variables, is greatly simplified by eliminating the
insignificant effects from the model. For example, the year by ethnicity interaction
in the SAT study might be omitted from the model to simplify interpretation. The
year effects would then be estimated after fitting a two-way additive model with
year and ethnicity effects only. The same approach is sometimes used with designed
experiments, say the insulin study that we analyzed earlier.
An important caveat The ideas that we discussed on the design and analysis of
experiments and observational studies are universal. They apply regardless of whether
you are analyzing categorical data, counts, or measurements.
Suppose that you are interested in comparing the typical lifetime (hours) of two tool
types (A and B). A simple analysis of the data given below would consist of making
side-by-side boxplots followed by a two-sample test of equal means (or medians). The
standard two-sample test using the pooled variance estimator is a special case of the
one-way ANOVA with two groups. The summaries suggest that the distribution of
lifetimes for the tool types are different. In the output below, µi is population mean
lifetime for tool type i (i = A, B).
#### Example: Tool lifetime
tools <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_tools.dat"
, header = TRUE)
str(tools)
## 'data.frame': 20 obs. of 3 variables:
## $ lifetime: num 18.7 14.5 17.4 14.5 13.4 ...
## $ rpm : int 610 950 720 840 980 530 680 540 890 730 ...
## $ type : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
library(ggplot2)
p <- ggplot(tools, aes(x = type, y = lifetime))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = mean(lifetime)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour="red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour="red", alpha = 0.8)
p <- p + labs(title = "Tool type lifetime") + ylab("lifetime (hours)")
p <- p + coord_flip()
print(p)
[Figure: "Tool type lifetime": horizontal boxplots of lifetime (hours) by type, with group means and confidence limits in red.]
A two sample t-test comparing mean lifetimes of tool types indicates a difference
between means.
t.summary <- t.test(lifetime ~ type, data = tools)
t.summary
##
## Welch Two Sample t-test
##
## data: lifetime by type
## t = -6.435, df = 15.93, p-value = 8.422e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.70128 -9.93472
## sample estimates:
## mean in group A mean in group B
## 17.110 31.928
This comparison is potentially misleading because the samples are not compara-
ble. A one-way ANOVA is most appropriate for designed experiments where all the
factors influencing the response, other than the treatment (tool type), are fixed by
the experimenter. The tools were operated at different speeds. If speed influences
lifetime, then the observed differences in lifetimes could be due to differences in speeds
at which the two tool types were operated.
Fake example For example, suppose speed is inversely related to lifetime of the
tool. Then, the differences seen in the boxplots above could be due to tool type
B being operated at lower speeds than tool type A. To see how this is possible,
consider the data plot given below, where the relationship between lifetime and speed
is identical in each sample. A simple linear regression model relating hours to speed,
ignoring tool type, fits the data exactly, yet the lifetime distributions for the tool
types, ignoring speed, differ dramatically. (The data were generated to fall exactly
on a straight line). The regression model indicates that you would expect identical
mean lifetimes for tool types A and B, if they were, or could be, operated at identical
speeds. This is not exactly what happens in the actual data. However, I hope the
point is clear.
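A minimal sketch of this situation, using made-up numbers rather than the course
data file: both tool types fall exactly on one hypothetical line (hours = 40 − 0.02
× speed), but type A is run at higher speeds. The data frame and its values are
assumptions for illustration.

```r
# Hypothetical data: both tool types follow the same line, hours = 40 - 0.02 * speed,
# but type A was operated at higher speeds than type B
speed <- c(seq(750, 1000, by = 50), seq(500, 750, by = 50))
type  <- factor(rep(c("A", "B"), each = 6))
hours <- 40 - 0.02 * speed
fake  <- data.frame(hours, speed, type)

# Ignoring speed, the group means differ by 5 hours
group.means <- tapply(fake$hours, fake$type, mean)
group.means

# Adjusting for speed, the estimated type effect is zero (up to rounding error)
fit <- lm(hours ~ speed + type, data = fake)
round(coef(fit), 6)
```

The whole difference between the group means comes from the speeds at which the
tools were run, not from the tool type.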
#### Example: Tools, fake
toolsfake <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_toolsfake.dat"
, header = TRUE)
library(ggplot2)
p <- ggplot(toolsfake, aes(x = speed, y = hours, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(toolsfake$type))))
p <- p + labs(title="Fake tools data, hours by speed with categorical type")
print(p)
[Figure: "Fake tools data, hours by speed with categorical type": hours against speed, with points labeled A or B by type.]
As noted in the Chapter 6 SAT example, you should be wary of group comparisons
where important factors that influence the response have not been accounted for or
controlled. In the SAT example, the differences in scores were affected by a change
in the ethnic composition over time. A two-way ANOVA with two factors, time and
ethnicity, gave the most sensible analysis.
For the tool lifetime problem, you should compare groups (tools) after adjusting
the lifetimes to account for the influence of a measurement variable, speed. The
appropriate statistical technique for handling this problem is called analysis of co-
variance (ANCOVA).
7.1 ANCOVA
A natural way to account for the effect of speed is through a multiple regression
model with lifetime as the response and two predictors, speed and tool type. A
binary categorical variable, here tool type, is included in the model as a dummy
variable or indicator variable (a {0, 1} variable).
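Consider the model
Tool lifetime = β0 + β1 typeB + β2 rpm + e,
where typeB is 0 for type A tools, and 1 for type B tools. For type A tools, the model
simplifies to:
Tool lifetime = β0 + β2 rpm + e.
For type B tools, the model simplifies to:
Tool lifetime = (β0 + β1) + β2 rpm + e.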
This ANCOVA model fits two regression lines, one for each tool type, but restricts
the slopes of the regression lines to be identical. To see this, let us focus on the
interpretation of the regression coefficients. For the ANCOVA model,
β2 = slope of population regression lines for tool types A and B.
and
β0 = intercept of population regression line for tool A (called the reference group).
Given that
β0 + β1 = intercept of population regression line for tool B,
it follows that
β1 = difference between tool B and tool A intercepts.
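The interpretation of the coefficients can be checked on simulated data. The sketch
below does not use the tools data; it generates hypothetical lifetimes with a common
slope and two intercepts, fits the equal-slopes model with lm(), and confirms that the
coefficient of the indicator is the vertical distance between the two fitted lines.

```r
set.seed(1)
# Simulated data (not the tools data): common slope -0.03, intercepts 35 (A) and 50 (B)
n    <- 10
rpm  <- round(runif(2 * n, 500, 1000))
type <- factor(rep(c("A", "B"), each = n))
lifetime <- 35 + 15 * (type == "B") - 0.03 * rpm + rnorm(2 * n, sd = 3)
sim <- data.frame(lifetime, rpm, type)

# R builds the 0/1 indicator typeB automatically; A is the reference group
fit <- lm(lifetime ~ type + rpm, data = sim)
b <- coef(fit)   # (Intercept) estimates beta0, typeB estimates beta1, rpm estimates beta2

# Fitted intercepts and common slope for the two parallel lines
c(intercept.A = unname(b["(Intercept)"]),
  intercept.B = unname(b["(Intercept)"] + b["typeB"]),
  slope       = unname(b["rpm"]))

# At any fixed speed, predicted B minus predicted A equals the typeB coefficient
new <- data.frame(rpm = 800, type = factor(c("A", "B"), levels = c("A", "B")))
diff(predict(fit, newdata = new))
```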
A picture of the population regression lines for one version of the model is given
below.
[Figure: population regression lines for Tool A and Tool B; vertical axis: Population Mean Life.]
The ANCOVA model is plausible. The relationship between lifetime and speed
is roughly linear within tool types, with similar slopes but unequal intercepts across
groups. The plot of the studentized residuals against the fitted values shows no gross
abnormalities, but suggests that the variability about the regression line for tool type
A is somewhat smaller than the variability for tool type B. The model assumes that
the variability of the responses is the same for each group. The QQ-plot does not
show any gross deviations from a straight line.
#### Example: Tool lifetime
library(ggplot2)
p <- ggplot(tools, aes(x = rpm, y = lifetime, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(tools$type))))
p <- p + geom_smooth(method = lm, se = FALSE)
p <- p + labs(title="Tools data, lifetime by rpm with categorical type")
print(p)
[Figure: "Tools data, lifetime by rpm with categorical type": lifetime against rpm, with points labeled A or B by type and a fitted line for each type.]
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared: 0.9003,Adjusted R-squared: 0.8886
## F-statistic: 76.75 on 2 and 17 DF, p-value: 3.086e-09
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.l.r.t, which = c(1,4,6), pch=as.character(tools$type))
# Normality of Residuals
library(car)
qqPlot(lm.l.r.t$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(tools$type))
## [1] 7 20 19
## residuals vs order of data
#plot(lm.l.r.t$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.l.r.t: residuals vs. fitted values, Cook's distance, Cook's distance vs. leverage, and a QQ plot of lm.l.r.t$residuals, with points labeled by tool type (A, B). Cases 7, 19, and 20 are flagged.]
The t-test of H0 : β1 = 0 checks whether the intercepts for the population regres-
sion lines are equal, assuming equal slopes. The t-test p-value < 0.0001 suggests that
the population regression lines for tools A and B have unequal intercepts. The LS
lines indicate that the average lifetime of either tool type decreases by 0.0266 hours
for each 1 RPM increase in speed. Regardless of the lathe speed, the model predicts that
type B tools will last 15 hours longer (i.e., the regression coefficient for the typeB
predictor) than type A tools. Summarizing this result another way, the t-test suggests
that there is a significant difference between the lifetimes of the two tool types, after
adjusting for the effect of the speeds at which the tools were operated. The estimated
difference in average lifetime is 15 hours, regardless of the lathe speed.
Status I1 I2
L 0 0
M 0 1
H 1 0
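Given the indicators I1 and I2 and the predictor IQN, define two interaction or
product effects: I1 × IQN and I2 × IQN. The most general model allows a separate
intercept and slope for each status group:
IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e.   (7.1)
This model is best understood by considering the three status classes separately.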
If status = L, then I1 = I2 = 0. For these families
IQF = β0 + β3 IQN + e.
The regression coefficients β0 and β3 are the intercept and slope for the L status
population regression line. The other parameters measure differences in intercepts and
slopes across the three groups, using L status families as a baseline or reference
group. In particular:
β1 = difference between the intercepts of the H and L population regression lines.
β2 = difference between the intercepts of the M and L population regression lines.
β4 = difference between the slopes of the H and L population regression lines.
β5 = difference between the slopes of the M and L population regression lines.
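To see how R sets up these indicators and product effects, the sketch below builds
the design matrix for three hypothetical families, one per status class, with L as
the reference level. The data frame toy is invented for illustration. Note that R
orders the columns differently from the equation: IQN comes second, statusH plays
the role of I1, and statusM plays the role of I2.

```r
# Three hypothetical families, one per status class, with L as the reference level
toy <- data.frame(IQN = c(100, 90, 110),
                  status = relevel(factor(c("L", "M", "H")), ref = "L"))

# Columns: intercept, IQN, indicators for H and M, and the two product effects
X <- model.matrix(~ IQN * status, data = toy)
X
```

The row for the L family has zeros in all four status columns, which is why the L
line involves only the intercept and the IQN coefficient.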
The plot gives a possible picture of the population regression lines corresponding
to the general model (7.1).
[Figure: population regression lines for L, M, and H status; vertical axis: Population Mean IQ Foster Twin (IQF).]
[Figure: IQF against IQN, with points labeled by status (L, M, H).]
# Normality of Residuals
library(car)
qqPlot(lm.f.n.s.ns$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(twins$status))
## [1] 27 24 23
## residuals vs order of data
#plot(lm.f.n.s.ns$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.f.n.s.ns: residuals vs. fitted values, Cook's distance, Cook's distance vs. leverage, and a QQ plot of lm.f.n.s.ns$residuals, with points labeled by status (L, M, H). Cases 23, 24, and 27 are flagged in the QQ plot.]
The natural way to express the fitted model is to give separate prediction equations
for the three status groups. Here is an easy way to get the separate fits. For the
general model (7.1), the predicted IQF satisfies
Predicted IQF = (Intercept + Coeff for Status Indicator)
+ (Coeff for Status Product Effect + Coeff for IQN) × IQN.
For the baseline group, use 0 as the coefficients for the status indicator and product
effect.
Thus, for the baseline group with status = L,
Predicted IQF = 7.20 + 0 + (0.948 + 0) IQN
= 7.20 + 0.948 IQN.
For the M status group with indicator I2 and product effect I2 × IQN:
Predicted IQF = 7.20 − 6.39 + (0.948 + 0.024) IQN
= 0.81 + 0.972 IQN.
For the H status group with indicator I1 and product effect I1 × IQN:
Predicted IQF = 7.20 − 9.08 + (0.948 + 0.029) IQN
= −1.88 + 0.977 IQN.
The LS lines are identical to separately fitting simple linear regressions to the three
groups.
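The equivalence with separate simple linear regressions can be verified numerically.
The sketch below uses simulated data with three groups (not the twins data), fits the
general model, and compares the implied line for group M with the line from a
regression on group M alone. All names and parameter values are assumptions made
for the illustration.

```r
set.seed(2)
# Simulated data, three groups with different lines (not the twins data)
g   <- factor(rep(c("L", "H", "M"), each = 8), levels = c("L", "H", "M"))
x   <- runif(24, 60, 130)
int <- c(L = 7, H = -2, M = 1)
slp <- c(L = 0.95, H = 0.98, M = 0.97)
y   <- unname(int[as.character(g)] + slp[as.character(g)] * x) + rnorm(24, sd = 5)
d   <- data.frame(y, x, g)

# General model: separate intercepts and slopes through indicators and products
full <- lm(y ~ x * g, data = d)
b <- coef(full)

# Line for group M from the general model: add the M terms to the baseline terms
line.M.full <- c(b["(Intercept)"] + b["gM"], b["x"] + b["x:gM"])

# Line for group M from a simple linear regression on that group alone
line.M.sep <- coef(lm(y ~ x, data = d, subset = g == "M"))

rbind(general = unname(line.M.full), separate = unname(line.M.sep))
```

The fitted lines agree, but the two approaches are not interchangeable for inference:
the general model pools all groups to estimate a single residual variance.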
IQF = β0 + β1 I1 + β2 I2 + β3 IQN + e
[Figure: population regression lines for L, M, and H status.]
##
## Call:
## lm(formula = IQF ~ IQN + status, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8235 -5.2366 -0.1111 4.4755 13.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6188 9.9628 0.564 0.578
## IQN 0.9658 0.1069 9.031 5.05e-09 ***
## statusH -6.2264 3.9171 -1.590 0.126
## statusM -4.1911 3.6951 -1.134 0.268
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.571 on 23 degrees of freedom
## Multiple R-squared: 0.8039,Adjusted R-squared: 0.7784
## F-statistic: 31.44 on 3 and 23 DF, p-value: 2.604e-08
For the ANCOVA model, the predicted IQF for the three groups satisfies
Predicted IQF = (Intercept + Coeff for Status Indicator)
+(Coeff for IQN) × IQN.
As with the general model, use 0 as the coefficients for the status indicator and
product effect for the baseline group.
For L status families:
Predicted IQF = 5.62 + 0.966 IQN,
for M status:
Predicted IQF = 5.62 − 4.19 + 0.966 IQN
= 1.43 + 0.966 IQN,
and for H status:
Predicted IQF = 5.62 − 6.23 + 0.966 IQN
= −0.61 + 0.966 IQN.
The model with equal slopes and equal intercepts, IQF = β0 + β3 IQN + e,
is a special case of the ANCOVA model with β1 = β2 = 0. This model does not
distinguish among social classes. The common intercept and slope for the social
classes are β0 and β3 , respectively.
The predicted IQF for this model is
IQF = β0 + β1 I1 + β2 I2 + e
is a special case of the ANCOVA model with β3 = 0. In this model, social status
has an effect on IQF but IQN does not. This model of parallel regression lines
with zero slopes is identical to a one-way ANOVA model for the three social classes,
where the intercepts play the role of the population means, see the plot below.
[Figure: population regression lines with zero slopes for H, M, and L status; vertical axis: Population Mean IQ Foster Twin (IQF).]
For the ANOVA model, the predicted IQF for the three groups satisfies
for M status:
The predicted IQFs are the mean IQFs for the three groups.
lm.f.s <- lm(IQF ~ status, data = twins)
library(car)
Anova(aov(lm.f.s), type=3)
effect. The general model has the same structure as a two-factor interaction ANOVA
model because the plot of the population means allows non-parallel profiles. However,
the general model is a special case of the two-factor interaction ANOVA model because
it restricts the means to change linearly with IQN.
The ANCOVA model has main effects for status and IQN but no interaction:
IQF = β0 + β1 I1 + β2 I2 + β3 IQN + e.
The ANCOVA model is a special case of the additive two-factor ANOVA model
because the plot of the population means has parallel profiles, but is not equivalent
to the additive two-factor ANOVA model.
The model with equal slopes and intercepts has no main effect for status nor an
interaction between status and IQN:
IQF = β0 + β3 IQN + e.
The one-way ANOVA model has no main effect for IQN nor an interaction between
status and IQN:
IQF = β0 + β1 I1 + β2 I2 + e.
I will expand on these ideas later, as they are useful for understanding the con-
nections between regression and ANOVA models.
Let’s go about testing another hypothesis, first, using the Wald test, then we’ll
test our two simultaneous hypotheses above.
H0 : equal slopes for all status groups
H0 : β4 = β5 = 0
lm.f.n.s.ns <- lm(IQF ~ IQN*status, data = twins)
library(car)
Anova(aov(lm.f.n.s.ns), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
## Sum Sq Df F value Pr(>F)
, Sigma = vcov(lm.f.n.s.ns)
, Terms = c(4,6))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55
# Another way to do this is to define the matrix r and vector r, manually.
mR <- as.matrix(rbind(c(0, 0, 0, 1, 0, 0), c(0, 0, 0, 0, 0, 1)))
mR
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 1 0 0
## [2,] 0 0 0 0 0 1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns)
, Sigma = vcov(lm.f.n.s.ns)
, L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55
In hypothesis 2 we are testing β1 − β2 = 0 and β4 − β5 = 0 which are the difference
of the 2nd and 3rd coefficients and the difference of the 5th and 6th coefficients.
However, we need to choose the correct positions based on the coef() order, and these
are positions 3 and 4, and 5 and 6. The large p-value=0.91 suggests that M and H
can be described by the same regression line, same slope and intercept.
mR <- as.matrix(rbind(c(0, 0, 1, -1, 0, 0), c(0, 0, 0, 0, 1, -1)))
mR
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 1 -1 0 0
## [2,] 0 0 0 0 1 -1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns)
, Sigma = vcov(lm.f.n.s.ns)
, L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.19, df = 2, P(> X2) = 0.91
The results of these tests are not surprising, given our previous analysis where we
found that the status effect is not significant for all three groups.
Any simultaneous linear combination of parameters can be tested in this way.
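For readers who want to see what such a test computes, here is a minimal base-R
sketch of the standard Wald statistic for H0: Lβ = h, written without any add-on
package. It uses simulated data (not the twins data), and the matrix L and vector h
below are illustrative choices that test whether both interaction coefficients are zero.

```r
set.seed(3)
# Simulated data with a three-level factor and a covariate (not the twins data)
g <- factor(rep(c("L", "H", "M"), each = 9), levels = c("L", "H", "M"))
x <- runif(27, 60, 130)
y <- 5 + x + rnorm(27, sd = 7)
fit <- lm(y ~ x * g)

# H0: L %*% beta = h, here "both interaction coefficients are zero" (positions 5 and 6)
L <- rbind(c(0, 0, 0, 0, 1, 0),
           c(0, 0, 0, 0, 0, 1))
h <- c(0, 0)

b <- coef(fit)
V <- vcov(fit)
d <- L %*% b - h
W <- as.numeric(t(d) %*% solve(L %*% V %*% t(L)) %*% d)   # Wald chi-squared statistic
p.chisq <- pchisq(W, df = nrow(L), lower.tail = FALSE)

# For a linear model, W divided by the number of constraints is the usual F statistic
F.stat <- anova(lm(y ~ x + g), fit)$F[2]
c(W = W, p = p.chisq, W.over.q = W / nrow(L), F = F.stat)
```

The chi-squared reference distribution is a large-sample approximation; in small
samples the F-test from the model comparison is the more conservative choice.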
In the twins example, I defined two indicator variables (plus two interaction variables)
from an ordinal categorical variable: status (H, M, L). Many researchers would assign
numerical codes to the status groups and use the coding as a predictor in a regression
model. For status, a “natural” coding might be to define NSTAT=0 for L, 1 for M,
and 2 for H status families. This suggests building a multiple regression model with
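a single status variable (i.e., single df):
IQF = β0 + β1 IQN + β2 NSTAT + e.
If you consider the status classes separately, the model implies that
IQF = β0 + β1 IQN + e for L status families,
IQF = (β0 + β2) + β1 IQN + e for M status families,
IQF = (β0 + 2β2) + β1 IQN + e for H status families.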
The model assumes that the IQF by IQN regression lines are parallel for the three
groups, and are separated by a constant β2 . This model is more restrictive (and less
reasonable) than the ANCOVA model with equal slopes but arbitrary intercepts. Of
course, this model is easier to work with because it requires keeping track of only
one status variable instead of two status indicators.
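Because the numeric coding is a restricted version of the two-indicator coding, the
restriction can be checked with a nested-model F-test. The sketch below uses
simulated data (not the twins data) in which the intercept shifts are deliberately
unequally spaced; the variable names mirror the text but the values are invented.

```r
set.seed(4)
# Simulated data (not the twins data); status intercept shifts are deliberately unequal
status <- factor(rep(c("L", "M", "H"), each = 9), levels = c("L", "M", "H"))
IQN <- runif(27, 60, 130)
IQF <- c(0, -8, -9)[as.integer(status)] + IQN + rnorm(27, sd = 5)

# Numeric coding: 0 for L, 1 for M, 2 for H
NSTAT <- as.integer(status) - 1

fit.num <- lm(IQF ~ IQN + NSTAT)    # one df for status: equally spaced parallel lines
fit.fac <- lm(IQF ~ IQN + status)   # two df for status: arbitrary intercepts (ANCOVA)

# The numeric-code model is nested in the ANCOVA model, so an F-test
# checks whether equal spacing of the intercepts is tenable
anova(fit.num, fit.fac)
```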
A plot of the population regression lines under this model is given above, assuming
β2 < 0.
[Figure: population regression lines by status (L Status, H Status); vertical axis: Population Mean IQ Foster Twin (IQF).]
library(reshape2)
YX <- data.frame(cbind(melt(Y), X[,"a"], X[,"b"], X[,"c"]))
colnames(YX) <- c("obs", "Model", "Y", "a", "b", "c")
YX$a <- factor(YX$a)
YX$b <- factor(YX$b)
[Figure: "Three-way Interaction": Y against c, by a (0, 1), in columns b: 0 and b: 1 and rows Model: 1 through Model: 5.]
Polynomial Regression
Y = β0 + β1 X + β2 X^2 + · · · + βp X^p + ε.
[Figure: example polynomial curves in two panels, "Quadratics" and "Cubics": y against x for x from −3 to 3.]
xx <- matrix(c(rep(1,length(x)),x1,x2,x3,x4,x5,x6,x7,x8,x9,x10),ncol=11)
[Figure: two scatterplots of Y against X.]
The significance level for the estimate of the Temp coefficient depends on whether we
measure temperature in degrees Celsius or Fahrenheit.
To avoid these problems, I recommend the following:
1. Center the X data at X̄ and fit the model
This is usually important only for cubic and higher order models.
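Why centering helps can be seen in a small sketch with hypothetical predictor
values far from zero. The linear and quadratic terms are almost perfectly correlated
before centering and uncorrelated after it (here exactly so, because the x values are
symmetric about their mean), yet the fitted quadratic curve is unchanged.

```r
# Hypothetical predictor values far from zero
x  <- 21:40
xc <- x - mean(x)   # centered predictor

# Correlation between the linear and quadratic terms, before and after centering
c(raw = cor(x, x^2), centered = cor(xc, xc^2))

# The fitted values of a quadratic model are unchanged by centering
set.seed(5)
y <- 3 + 0.5 * x - 0.01 * x^2 + rnorm(length(x), sd = 0.2)
fit.raw <- lm(y ~ x + I(x^2))
fit.cen <- lm(y ~ xc + I(xc^2))
all.equal(unname(fitted(fit.raw)), unname(fitted(fit.cen)))
```

Centering changes the meaning and the standard errors of the lower-order
coefficients, not the overall fit.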
1
Draper and Smith 1966, p. 162
[Figure: scatterplot of the cloud point data; vertical axis: cloud.]
Fit the simple linear regression model and plot the residuals.
lm.c.i <- lm(cloud ~ i8, data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i), type=3)
#summary(lm.c.i)
The data plot is clearly nonlinear, suggesting that a simple linear regression model
is inadequate. This is confirmed by a plot of the studentized residuals against the
fitted values from a simple linear regression of cloud point on i8, and by a plot of the
residuals against the i8 values. We do not see any local maxima or minima, so a second
order model is likely to be adequate. To be sure, we will first fit a cubic model and see
whether the third order term is important.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.c.i, which = c(1,4,6), pch=as.character(cloudpoint$type))
# Normality of Residuals
library(car)
qqPlot(lm.c.i$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(cloudpoint$type))
## [1] 1 11 17
# residuals vs order of data
plot(lm.c.i$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.c.i: residuals vs. fitted values, Cook's distance, Cook's distance vs. leverage, QQ plot of lm.c.i$residuals, and residuals vs. order of data. Cases 1, 11, 14, and 17 are flagged.]
The output below shows that the cubic term improves the fit of the quadratic
model (i.e., the cubic term is important when added last to the model). The plot
of the studentized residuals against the fitted values does not show any extreme
abnormalities. Furthermore, no individual point is poorly fitted by the model. Case
1 has the largest studentized residual: r1 = −1.997.
# I() is used to create an interpreted object treated "as is"
# so we can include quadratic and cubic terms in the formula
# without creating separate columns in the dataset of these terms
lm.c.i3 <- lm(cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i3), type=3)
summary(lm.c.i3)
##
## Call:
## lm(formula = cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42890 -0.18658 0.07355 0.13536 0.39328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.870451 0.088364 326.723 < 2e-16 ***
## i8 0.847889 0.048536 17.469 6.67e-11 ***
## I(i8^2) -0.065998 0.007323 -9.012 3.33e-07 ***
# Normality of Residuals
library(car)
qqPlot(lm.c.i3$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(cloudpoint$type))
## [1] 4 12 1
# residuals vs order of data
plot(lm.c.i3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.c.i3: residuals vs. fitted values, Cook's distance, Cook's distance vs. leverage, QQ plot of lm.c.i3$residuals, and residuals vs. order of data. Cases 1, 4, 12, 14, and 18 are flagged.]
The first and last observations have the lowest and highest values of I8, given by
0 and 10, respectively. These cases are also the most influential points in the data
set (largest Cook’s D). If we delete these cases and redo the analysis we find that
the cubic term is no longer important (p-value=0.55) when added after the quadratic
term. One may reasonably conclude that the significance of the cubic term in the
original analysis is solely due to the two extreme I8 values, and that the quadratic
[Figure: two panels, "Mooney data, mooney by oil with filler labels" and "Mooney data, mooney by filler with oil labels": mooney against oil (0 to 40) and mooney against filler (0 to 60).]
At each of the 4 oil levels, the relationship between the Mooney viscosity and filler
level (with 6 levels) appears to be quadratic. Similarly, the relationship between the
Mooney viscosity and oil level appears quadratic for each filler level (with 4 levels).
This supports fitting the general quadratic model as a first step in the analysis.
The output below shows that each term is needed in the model. Although there
are potentially influential points (cases 6 and 20), deleting either or both cases does
not change the significance of the effects in the model (not shown).
# I create each term separately
lm.m.o2.f2 <- lm(mooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler),
data = mooney)
summary(lm.m.o2.f2)
##
## Call:
## lm(formula = mooney ~ oil + filler + I(oil^2) + I(filler^2) +
## I(oil * filler), data = mooney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3497 -2.2231 -0.1615 2.5424 5.2749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.144582 2.616779 10.373 9.02e-09 ***
## oil -1.271442 0.213533 -5.954 1.57e-05 ***
## filler 0.436984 0.152658 2.862 0.0108 *
## I(oil^2) 0.033611 0.004663 7.208 1.46e-06 ***
## I(filler^2) 0.027323 0.002410 11.339 2.38e-09 ***
## I(oil * filler) -0.038659 0.003187 -12.131 8.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.937 on 17 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9917,Adjusted R-squared: 0.9892
## F-statistic: 405.2 on 5 and 17 DF, p-value: < 2.2e-16
## poly() will evaluate variables and give joint polynomial values
## which is helpful when you have many predictors
#head(mooney, 10)
#head(poly(mooney$oil, mooney$filler, degree = 2, raw = TRUE), 10)
## This model is equivalent to the one above
#lm.m.o2.f2 <- lm(mooney ~ poly(oil, filler, degree = 2, raw = TRUE), data = mooney)
#summary(lm.m.o2.f2)
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.m.o2.f2, which = c(1,4,6), pch=as.character(mooney$oil))
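# Normality of Residuals
library(car)
# one observation has a missing response, so index only the rows used in the fit
# (assumed definition of ind: residual names are the original row numbers)
ind <- as.numeric(names(lm.m.o2.f2$residuals))
qqPlot(lm.m.o2.f2$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))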
## 18 20 6
## 18 19 6
## residuals vs order of data
#plot(lm.m.o2.f2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.m.o2.f2. Panels: Residuals, Cook's distance (two panels), "Residuals vs oil with filler labels", "Residuals vs filler with oil labels", and "QQ Plot".]
library(ggplot2)
p <- ggplot(mooney, aes(x = oil, y = logmooney, label = filler))
p <- p + geom_text()
#p <- p + scale_y_continuous(limits = c(0, max(mooney$logmooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, log(mooney) by oil with filler labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
library(ggplot2)
p <- ggplot(mooney, aes(x = filler, y = logmooney, label = oil))
p <- p + geom_text()
#p <- p + scale_y_continuous(limits = c(0, max(mooney$logmooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, log(mooney) by filler with oil labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
[Figure: two panels. Left: "Mooney data, log(mooney) by oil with filler labels" (x: oil, y: logmooney). Right: "Mooney data, log(mooney) by filler with oil labels" (x: filler, y: logmooney).]
To see that a simpler model is appropriate, we fit the full quadratic model. The
interaction term can be omitted here, without much loss of predictive ability (R-
squared is similar). The p-value for the interaction term in the quadratic model is
0.34.
# I create each term separately
lm.lm.o2.f2 <- lm(logmooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler),
data = mooney)
summary(lm.lm.o2.f2)
##
## Call:
## lm(formula = logmooney ~ oil + filler + I(oil^2) + I(filler^2) +
## I(oil * filler), data = mooney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.077261 -0.035795 0.009193 0.030641 0.075640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.236e+00 3.557e-02 90.970 < 2e-16 ***
## oil -3.921e-02 2.903e-03 -13.507 1.61e-10 ***
## filler 2.860e-02 2.075e-03 13.781 1.18e-10 ***
## I(oil^2) 4.227e-04 6.339e-05 6.668 3.96e-06 ***
## I(filler^2) 4.657e-05 3.276e-05 1.421 0.173
## I(oil * filler) -4.231e-05 4.332e-05 -0.977 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05352 on 17 degrees of freedom
# Normality of Residuals
library(car)
qqPlot(lm.lm.o2.f2$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))
## 22 12 21
## 21 12 20
## residuals vs order of data
#plot(lm.lm.o2.f2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.lm.o2.f2. Panels: Residuals, Cook's distance (two panels), "Residuals vs oil with filler labels", "Residuals vs filler with oil labels", and "QQ Plot".]
After omitting the interaction term, the quadratic effect in filler is not needed
in the model (output not given). Once these two effects are removed, each of the
remaining effects is significant.
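The two dropped terms can also be tested jointly with a partial F-test. The sketch below is only an illustration: it uses simulated data on the same 4 oil by 6 filler layout (the data frame `sim` and its values are hypothetical, not the course data), and compares the reduced and full quadratic fits with anova().

```r
# Simulated stand-in for the mooney data (hypothetical values, not the course data):
# 4 oil levels crossed with 6 filler levels, as described in the text.
set.seed(1)
sim <- expand.grid(oil = c(0, 10, 20, 40), filler = c(0, 12, 24, 36, 48, 60))
sim$logmooney <- 3.23 - 0.0402 * sim$oil + 0.0004 * sim$oil^2 +
  0.0309 * sim$filler + rnorm(nrow(sim), sd = 0.05)

# full quadratic model, and the reduced model without the interaction
# and without the quadratic filler term
fit.full <- lm(logmooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler),
               data = sim)
fit.red  <- lm(logmooney ~ oil + filler + I(oil^2), data = sim)

# partial F-test: H0 says both dropped coefficients are zero
cmp <- anova(fit.red, fit.full)
print(cmp)
```

A large p-value in the second row of the table would support dropping both terms at once, which complements the one-at-a-time deletion described above.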
# I create each term separately
lm.lm.o2.f <- lm(logmooney ~ oil + filler + I(oil^2),
data = mooney)
summary(lm.lm.o2.f)
##
## Call:
## lm(formula = logmooney ~ oil + filler + I(oil^2), data = mooney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.090796 -0.031113 -0.008831 0.032533 0.100587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.230e+00 2.734e-02 118.139 < 2e-16 ***
## oil -4.024e-02 2.702e-03 -14.890 6.26e-12 ***
## filler 3.086e-02 5.716e-04 53.986 < 2e-16 ***
## I(oil^2) 4.097e-04 6.356e-05 6.446 3.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05423 on 19 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9947,Adjusted R-squared: 0.9939
## F-statistic: 1195 on 3 and 19 DF, p-value: < 2.2e-16
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.lm.o2.f, which = c(1,4,6), pch=as.character(mooney$oil))
# Normality of Residuals
library(car)
qqPlot(lm.lm.o2.f$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))
## 12 22 16
## 12 21 16
## residuals vs order of data
#plot(lm.lm.o2.f$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.lm.o2.f. Panels: Residuals, Cook's distance (two panels), "Residuals vs oil with filler labels", "Residuals vs filler with oil labels", and "QQ Plot".]
$$\widehat{\log(\text{Mooney viscosity})} = 3.2297 - 0.0402\,\text{Oil} + 0.0004\,\text{Oil}^2 + 0.0309\,\text{Filler}.$$
Quadratic models with two or more predictors are often used in industrial ex-
periments to estimate the optimal combination of predictor values to maximize or
minimize the response, over the range of predictor variable values where the model is
reasonable. (This strategy is called “response surface methodology”.) For example,
we might wish to know what combination of oil level between 0 and 40 and filler level
between 0 and 60 provides the lowest predicted Mooney viscosity (on the original or
log scale). We can visually approximate the minimizer using the data plots, but one
can do a more careful job of analysis using standard tools from calculus.
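As a rough sketch of that calculation, the code below plugs the coefficients printed in the summary(lm.lm.o2.f) output into a prediction function, solves for the stationary point in oil, and then searches a grid over the studied region. The function name `pred.log` and the grid spacing are illustrative choices, and the coefficients are rounded as printed, so the numbers are approximate.

```r
# Coefficients copied (rounded) from the summary(lm.lm.o2.f) output
b <- c(int = 3.230, oil = -0.04024, filler = 0.03086, oil2 = 0.0004097)

# predicted log(Mooney viscosity) from the fitted equation
pred.log <- function(oil, filler) {
  b[["int"]] + b[["oil"]] * oil + b[["oil2"]] * oil^2 + b[["filler"]] * filler
}

# calculus: the stationary point in oil solves b_oil + 2 * b_oil2 * oil = 0
oil.star <- -b[["oil"]] / (2 * b[["oil2"]])
oil.star   # roughly 49, which is outside the studied range of 0 to 40

# grid search over the studied region (spacing of 1 is an arbitrary choice)
grid <- expand.grid(oil = 0:40, filler = 0:60)
grid$pred <- pred.log(grid$oil, grid$filler)
best <- grid[which.min(grid$pred), ]
best
```

Because the filler coefficient is positive and the stationary point in oil lies beyond 40, the minimum over this region falls on the boundary (largest oil level, smallest filler level) rather than at an interior point.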
We have considered simple models for designed experiments and observational studies
where a response variable is modeled as a linear combination of effects due to factors
or predictors, or both. With designed experiments, where only qualitative factors
are considered, we get a “pure ANOVA” model. For example, in the experiment
comparing survival times of beetles, the potential effects of insecticide (with levels
A, B, C, and D) and dose (with levels 1=low, 2=medium, and 3=high) are included
in the model as factors because these variables are qualitative. The natural model
to consider is a two-way ANOVA with effects for dose and insecticide and a dose-by-
insecticide interaction. If, however, the dose given to each beetle was recorded on a
measurement scale, then the dosages can be used to define a predictor variable which
can be used as a “regression effect” in a model. That is, the dose or some function of
dose can be used as a (quantitative) predictor instead of as a qualitative effect.
For simplicity, assume that the doses are 10, 20, and 30, but the actual levels are
irrelevant to the discussion. The simple additive model, or ANCOVA model, assumes
that there is a linear relationship between mean survival time and dose, with different
intercepts for the four insecticides. If the data set includes the survival time (times) for
each beetle, the insecticide (insect: an alphanumeric variable, with values A, B, C,
and D), and dose, you would fit the ANCOVA model this way:
beetles$insect <- factor(beetles$insect)
lm.t.i.d <- lm(times ~ insect + dose, data = beetles)
A more complex model that allows separate regression lines for each insecticide is
specified as follows:
beetles$insect <- factor(beetles$insect)
lm.t.i.d.id <- lm(times ~ insect + dose + insect:dose, data = beetles)
It is important to recognize that the factor() statement defines which variables
in the model are treated as factors. Each effect of Factor data type is treated as
a factor. Effects in the model statement that are numeric data types are treated as
predictors. To treat a measurement variable as a factor (with one level for each
distinct observed value of the variable) instead of a predictor, convert that variable
type to a factor using factor(). Thus, in the survival time experiment, these models
give the analysis for a two-way ANOVA model without interaction and with interac-
tion, respectively, where both dose and insecticide are treated as factors (since dose
and insect are both converted to factors), even though we just defined dose on a
measurement scale!
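A minimal sketch of the two factor-based fits described in this paragraph follows. The data are simulated (the course beetles data are not used), the column name `dose.f` is hypothetical, and the labels (A) and (I) are assumed to be the additive and interaction models that the later text refers to by those letters.

```r
# Hypothetical stand-in for the beetles data: 4 insecticides x 3 doses x 4 beetles
set.seed(2)
beetles <- expand.grid(insect = c("A", "B", "C", "D"), dose = c(10, 20, 30), rep = 1:4)
beetles$times <- 5 - 0.1 * beetles$dose + rnorm(nrow(beetles), sd = 0.5)

beetles$insect <- factor(beetles$insect)
beetles$dose.f <- factor(beetles$dose)   # dose treated as a factor

# model (A): additive two-way ANOVA
lm.A <- lm(times ~ insect + dose.f, data = beetles)
# model (I): two-way ANOVA with interaction
lm.I <- lm(times ~ insect + dose.f + insect:dose.f, data = beetles)

length(coef(lm.A))  # 1 intercept + 3 insecticide + 2 dose terms
length(coef(lm.I))  # the above plus 3 * 2 interaction terms
```

The only change from the ANCOVA code is the type of the dose variable: the formulas have the same form, but lm() now estimates one effect per dose level instead of a single slope.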
Is there a basic connection between the ANCOVA and separate regression line
models for dose and two-way ANOVA models where dose and insecticide are treated
as factors? Yes — I mentioned a connection when discussing ANCOVA and I will try
now to make the connection more explicit.
For the moment, let us simplify the discussion and assume that only one insecticide
was used at three dose levels. The LS estimates of the mean responses from the
quadratic model
$$\text{Times} = \beta_0 + \beta_1\,\text{Dose} + \beta_2\,\text{Dose}^2 + \varepsilon$$
are the observed average survival times at the three dose levels. The LS curve goes
through the mean survival time at each dose, as illustrated in the picture below.
If we instead treat dose as a factor and fit the one-way ANOVA model
Times = Grand Mean + Dose Effect + Residual,
then the LS estimates of the population mean survival times are the observed mean
survival times. The two models are mathematically equivalent, but the parameters
have different interpretations. In essence, the one-way ANOVA model places no
restrictions on the values of the population means (no a priori relation between them)
at the three doses, and neither does the quadratic model! (WHY?)
[Figure: times versus dose (10 to 30). Legend: Insecticide A, B, C, D, and Mean.]
In a one-way ANOVA, the standard hypothesis of interest is that the dose effects
are zero. This can be tested using the one-way ANOVA F-test, or by testing H0 :
β1 = β2 = 0 in the quadratic model. With three dosages, the absence of a linear or
quadratic effect implies that all the population mean survival times must be equal.
An advantage of the polynomial model over the one-way ANOVA is that it provides
an easy way to quantify how dose impacts the mean survival, and a convenient way
to check whether a simple description such as a simple linear regression model is
adequate to describe the effect.
More generally, if dose has p levels, then the one-way ANOVA model
Times = Grand Mean + Dose Effect + Residual,
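is equivalent to the (p − 1)st degree polynomial
$$\text{Times} = \beta_0 + \beta_1\,\text{Dose} + \beta_2\,\text{Dose}^2 + \cdots + \beta_{p-1}\,\text{Dose}^{p-1} + \varepsilon,$$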
and the one-way ANOVA F-test for no treatment effects is equivalent to testing
H0 : β1 = β2 = · · · = βp−1 = 0 in this polynomial.
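The equivalence for three dose levels can be checked numerically. This is a sketch on simulated data (the data frame `d` and its values are hypothetical): the one-way ANOVA fit and the quadratic fit should return the same fitted values and the same overall F statistic.

```r
# Simulated survival times at three doses (hypothetical values)
set.seed(3)
d <- data.frame(dose = rep(c(10, 20, 30), each = 5))
d$times <- 6 - 0.12 * d$dose + rnorm(nrow(d), sd = 0.4)

fit.aov  <- lm(times ~ factor(dose), data = d)     # one-way ANOVA
fit.quad <- lm(times ~ dose + I(dose^2), data = d) # quadratic in dose

# same fitted values: the quadratic passes through the three dose means
same.fit <- all.equal(unname(fitted(fit.aov)), unname(fitted(fit.quad)))
same.fit
tapply(d$times, d$dose, mean)

# same overall F statistic and degrees of freedom
f.aov  <- summary(fit.aov)$fstatistic
f.quad <- summary(fit.quad)$fstatistic
same.F <- all.equal(unname(f.aov), unname(f.quad))
same.F
```

Both models use three parameters to describe three means, which is why neither places any restriction on them.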
Returning to the original experiment with 4 insecticides and 3 doses, I can show
the following two equivalences. First, the two-way additive ANOVA model, with
insecticide and dose as factors, i.e., model (A), is mathematically equivalent to an
additive model with insecticide as a factor, and a quadratic effect in dose:
beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2 <- lm(times ~ insect + dose + I(dose^2), data = beetles)
Thinking of dose^2 as a quadratic term in dose, rather than as an interaction,
this model has an additive insecticide effect, but the dose effect is not differentiated
across insecticides. That is, the model assumes that the quadratic curves for the
four insecticides differ only in level (i.e., different intercepts) and that the coefficients
for the dose and dose^2 effects are identical across insecticides. This is an additive
model, because the population means plot has parallel profiles. A possible pictorial
representation of this model is given below.
[Figure: times versus dose (10 to 30), one curve per insecticide. Legend: Insecticide A, B, C, D.]
Second, the two-way ANOVA interaction model, with insecticide and dose as
factors, i.e., model (I), is mathematically equivalent to an interaction model with
insecticide as a factor, and a quadratic effect in dose.
beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2.id.id2 <- lm(times ~ insect + dose + I(dose^2)
+ insect:dose + insect:I(dose^2), data = beetles)
This model fits separate quadratic relationships for each of the four insecticides,
by including interactions between insecticides and the linear and quadratic terms in
dose. Because dose has three levels, this model places no restrictions on the mean
responses.
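Both equivalences can be verified by comparing fitted values. The sketch below uses simulated data with 4 insecticides and 3 doses (the data frame `b2` is hypothetical) and fits each model twice, once with dose as a factor and once with linear and quadratic dose terms.

```r
# Hypothetical stand-in for the beetles data
set.seed(4)
b2 <- expand.grid(insect = c("A", "B", "C", "D"), dose = c(10, 20, 30), rep = 1:4)
b2$times <- 5 - 0.1 * b2$dose + rnorm(nrow(b2), sd = 0.5)
b2$insect <- factor(b2$insect)

# additive models: dose as a factor versus a common quadratic in dose
fit.A.f <- lm(times ~ insect + factor(dose), data = b2)
fit.A.q <- lm(times ~ insect + dose + I(dose^2), data = b2)

# interaction models: dose as a factor versus separate quadratics
fit.I.f <- lm(times ~ insect * factor(dose), data = b2)
fit.I.q <- lm(times ~ insect + dose + I(dose^2)
              + insect:dose + insect:I(dose^2), data = b2)

same.A <- all.equal(unname(fitted(fit.A.f)), unname(fitted(fit.A.q)))
same.I <- all.equal(unname(fitted(fit.I.f)), unname(fitted(fit.I.q)))
same.A
same.I
```

The coefficients differ between the two parameterizations, but the fitted means, residuals, and overall tests agree.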
To summarize, we have established that:
• The additive two-way ANOVA model with insecticide and dose as factors is
mathematically identical to an additive model with an insecticide factor and a
quadratic effect in dose. The ANCOVA model with a linear effect in dose is a
special case of these models, where the quadratic effect is omitted.
• The two-way ANOVA interaction model with insecticide and dose as factors is
mathematically identical to a model with an insecticide factor, a quadratic effect
in dose, and interactions between the insecticide and the linear and quadratic
dose effects. The separate regression lines model with a linear effect in dose is a
special case of these models, where the quadratic dose effect and the interaction
of the quadratic term with insecticide are omitted.
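Recall that response models with factors and predictors as effects can be fit
using the lm() procedure, but each factor or interaction involving a factor must be
represented in the model using indicator variables or product terms (lm() constructs
these automatically for variables stored as factors). The number of
required indicators or product effects is one less than the number of distinct levels of
the factor. For example, to fit the model with “parallel” quadratic curves in dose, you
can define (in the data.frame()) three indicator variables for the insecticide effect,
say $I_1$, $I_2$, and $I_3$, and fit the model
$$\text{Times} = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \beta_3 I_3 + \beta_4\,\text{Dose} + \beta_5\,\text{Dose}^2 + \varepsilon.$$
For the “quadratic interaction model”, you must define 6 interaction or product terms
between the 3 indicators and the 2 dose terms:
$$\text{Times} = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \beta_3 I_3 + \beta_4\,\text{Dose} + \beta_5\,\text{Dose}^2 + \beta_6 I_1\,\text{Dose} + \beta_7 I_2\,\text{Dose} + \beta_8 I_3\,\text{Dose} + \beta_9 I_1\,\text{Dose}^2 + \beta_{10} I_2\,\text{Dose}^2 + \beta_{11} I_3\,\text{Dose}^2 + \varepsilon.$$
The $(\beta_6 I_1\,\text{Dose} + \beta_7 I_2\,\text{Dose} + \beta_8 I_3\,\text{Dose})$ component in the model formally corresponds
to the insect ∗ dose interaction, whereas the $(\beta_9 I_1\,\text{Dose}^2 + \beta_{10} I_2\,\text{Dose}^2 + \beta_{11} I_3\,\text{Dose}^2)$
component is equivalent to the insect ∗ dose ∗ dose interaction (i.e., testing $H_0 : \beta_9 = \beta_{10} = \beta_{11} = 0$
tests for that interaction).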
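To make the indicator coding concrete, here is a sketch on simulated data (the data frame `b3` is hypothetical, and using insecticide D as the reference level is an arbitrary choice). It builds I1, I2, and I3 by hand and compares the result with the fit where lm() builds the indicators from the factor.

```r
# Hypothetical stand-in for the beetles data
set.seed(5)
b3 <- expand.grid(insect = c("A", "B", "C", "D"), dose = c(10, 20, 30), rep = 1:4)
b3$times <- 5 - 0.1 * b3$dose + rnorm(nrow(b3), sd = 0.5)
b3$insect <- factor(b3$insect)

# indicators for three of the four insecticides (D is the reference level here)
b3$I1 <- as.numeric(b3$insect == "A")
b3$I2 <- as.numeric(b3$insect == "B")
b3$I3 <- as.numeric(b3$insect == "C")

# parallel quadratic curves, indicators written out by hand
fit.ind <- lm(times ~ I1 + I2 + I3 + dose + I(dose^2), data = b3)
# same model, with lm() building the indicators from the factor
fit.fac <- lm(times ~ insect + dose + I(dose^2), data = b3)
same.par <- all.equal(unname(fitted(fit.ind)), unname(fitted(fit.fac)))
same.par

# quadratic interaction model: 6 product terms (3 indicators x 2 dose terms)
fit.ind.int <- lm(times ~ I1 + I2 + I3 + dose + I(dose^2)
                  + I1:dose + I2:dose + I3:dose
                  + I1:I(dose^2) + I2:I(dose^2) + I3:I(dose^2), data = b3)
length(coef(fit.ind.int))   # 12 coefficients, beta_0 through beta_11
```

lm() uses the first factor level as its reference by default, so the individual coefficients differ from the hand-coded version even though the fitted values match.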
This discussion is not intended to confuse, but rather to impress upon you the inti-
mate connection between regression and ANOVA, and to convince you of the care that
is needed when modelling variation even in simple studies. Researchers are usually
faced with more complex modelling problems than we have examined, where many
variables might influence the response. If experimentation is possible, a scientist will
often control the levels of variables that influence the response but that are not of
primary interest. This can result in a manageable experiment with, say, four or fewer
qualitative or quantitative variables that are systematically varied in a scientifically
meaningful way. In observational studies, where experimentation is not possible, the
scientist builds models to assess the effects of interest on the response, adjusting the
response for all the uncontrolled variables that might be important. The uncontrolled
variables are usually a mixture of factors and predictors. Ideally, the scientist knows
what variables to control in an experiment and which to vary, and what variables are
important to collect in an observational study.
The level of complexity that I am describing here might be intimidating, but
certain basic principles can be applied to many of the studies you will see. Graduate
students in statistics often take several courses (5+) in experimental design, regression
analysis, and linear model theory to master the breadth of models, and the subtleties
of modelling, that are needed to be a good data analyst. I can only scratch the
surface here. I will discuss a reasonably complex study having multiple factors and
multiple predictors. The example focuses on strategies for building models, with little
attempt to do careful diagnostic analyses. Hopefully, the example will give you an
appreciation for statistical modelling, but please be careful — these tools are
dangerous!
(coded 1 for Doctorate, 0 else), yd (number of years since highest degree was earned),
and salary (academic year salary in dollars).
#### Example: Faculty salary
faculty <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch09_faculty.dat"
, header = TRUE)
head(faculty)
## id sex rank year degree yd salary
## 1 1 0 3 25 1 35 36350
## 2 2 0 3 13 1 22 35350
## 3 3 0 3 10 1 23 28200
## 4 4 1 3 7 1 27 26775
## 5 5 0 3 19 0 30 33696
## 6 6 0 3 16 1 21 28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : int 0 0 0 1 0 0 1 0 0 0 ...
## $ rank : int 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 25 13 10 7 19 16 0 16 13 13 ...
## $ degree: int 1 1 1 1 0 1 0 1 0 0 ...
## $ yd : int 35 22 23 27 30 21 32 18 30 31 ...
## $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
faculty$sex <- factor(faculty$sex , labels=c("Male", "Female"))
# ordering the rank variable so Full is the baseline, then descending.
faculty$rank <- factor(faculty$rank , levels=c(3,2,1)
, labels=c("Full", "Assoc", "Asst"))
faculty$degree <- factor(faculty$degree, labels=c("Other", "Doctorate"))
head(faculty)
## id sex rank year degree yd salary
## 1 1 Male Full 25 Doctorate 35 36350
## 2 2 Male Full 13 Doctorate 22 35350
## 3 3 Male Full 10 Doctorate 23 28200
## 4 4 Female Full 7 Doctorate 27 26775
## 5 5 Male Full 19 Other 30 33696
## 6 6 Male Full 16 Doctorate 21 28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
## $ rank : Factor w/ 3 levels "Full","Assoc",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 25 13 10 7 19 16 0 16 13 13 ...
## $ degree: Factor w/ 2 levels "Other","Doctorate": 2 2 2 2 1 2 1 2 1 1 ...
## $ yd : int 35 22 23 27 30 21 32 18 30 31 ...
## $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
The data includes two potential predictors of salary (year and yd), and three
factors (sex, rank, and degree). A primary statistical interest is whether males and
females are compensated equally, on average, after adjusting salary for rank, years in
rank, and the other given effects. Furthermore, we wish to know whether an effect
due to sex is the same for each rank, or not.
Before answering these questions, let us look at the data. I will initially focus
on the effect of the individual factors (sex, rank, and degree) on salary. A series of
box-plots is given below. Looking at the boxplots, notice that women tend to earn
less than men, that faculty with Doctorates tend to earn more than those without
Doctorates (median), and that salary tends to increase with rank.
# plot marginal boxplots
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), nrow = 1)
[Figure: marginal boxplots of salary for the three factors, arranged in one row.]
library(ggplot2)
# create position dodge offset for plotting points
pd <- position_dodge(0.75) # 0.75 puts dots up center of boxplots
[Figure: salary plot with points offset by sex. Legend: sex (Male, Female).]
I will consider two simple analyses of these data. The first analysis considers the
effect of the three factors on salary. The second analysis considers the effect of the
predictors. A complete analysis using both factors and predictors is then considered.
I am doing the three factor analysis because the most complex pure ANOVA problem
we considered this semester has two factors — the analysis is for illustration only!!
The full model for a three-factor study includes the three main effects, the three
possible two-factor interactions, plus the three-factor interaction. Identifying the
factors by S (sex), D (degree) and R (rank), we write the full model as
Salary = Grand mean + S effect + D effect + R effect
+S*D interaction + S*R interaction + R*D interaction
+S*D*R interaction + Residual.
You should understand what main effects and two-factor interactions measure, but
what about the three-factor term? If you look at the two levels of degree separately,
then a three-factor interaction is needed if the interaction between sex and rank is
different for the two degrees (i.e., the profile plots are different for the two degrees).
Not surprisingly, three-factor interactions are hard to interpret.
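The terms in the full three-factor model can be listed directly from the R formula, without any data. This short sketch uses the variable names from the faculty example and relies only on base R's terms() function.

```r
# expand the three-factor formula without fitting anything
tt <- terms(salary ~ sex * rank * degree)
full.terms <- attr(tt, "term.labels")
full.terms

# number of factors in each term:
# 1 = main effect, 2 = two-factor interaction, 3 = three-factor interaction
term.order <- attr(tt, "order")
table(term.order)
```

The listing shows three main effects, three two-factor interactions, and one three-factor interaction, matching the model written out above.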
I considered a hierarchical backward elimination of effects (see Chapter 3 for
details). Individual regression variables are not considered for deletion, unless they
correspond to an effect in the model statement. All tests were performed at the 0.10
level, but this hardly matters here.
The first step in the elimination is to fit the full model and check whether the
three-factor term is significant. The three-factor term was not significant (in fact, it
couldn’t be fit because one category had zero observations). After eliminating this
effect, I fit the model with all three two-factor terms, and then sequentially deleted the
least important effects, one at a time, while still adhering to the hierarchy principle
using the AIC criterion from the step() function. The final model includes only an
effect due to rank. Finally, I compute the lsmeans() to compare salary for all pairs
of rank.
# fit full model
lm.faculty.factor.full <- lm(salary ~ sex*rank*degree, data = faculty)
## Note that there are not enough degrees-of-freedom to estimate all these effects
## because we have 0 observations for Female/Assoc/Doctorate
library(car)
Anova(lm.faculty.factor.full, type=3)
## Error in Anova.III.lm(mod, error, singular.ok = singular.ok, ...): there are aliased
coefficients in the model
summary(lm.faculty.factor.full)
##
## Call:
## lm(formula = salary ~ sex * rank * degree, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
Remove the three-way interaction, then use step() to perform backward selection
based on AIC.
# model reduction using update() and subtracting (removing) model terms
lm.faculty.factor.red <- lm.faculty.factor.full;
# remove variable
lm.faculty.factor.red <- update(lm.faculty.factor.red, ~ . - sex:rank:degree );
Anova(lm.faculty.factor.red, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 3932650421 1 435.9113 < 2.2e-16 ***
## sex 11227674 1 1.2445 0.2709438
## rank 196652264 2 10.8989 0.0001539 ***
## degree 421614 1 0.0467 0.8298945
## sex:rank 2701493 2 0.1497 0.8614045
## sex:degree 7661926 1 0.8493 0.3620198
## rank:degree 33433415 2 1.8529 0.1693627
## Residuals 378910404 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AIC backward selection
lm.faculty.factor.red.AIC <- step(lm.faculty.factor.red, direction="backward", test="F")
## Start: AIC=841.68
## salary ~ sex + rank + degree + sex:rank + sex:degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:rank 2 2701493 381611896 838.05 0.1497 0.8614
## - sex:degree 1 7661926 386572329 840.72 0.8493 0.3620
## <none> 378910404 841.68
## - rank:degree 2 33433415 412343819 842.08 1.8529 0.1694
##
## Step: AIC=838.05
## salary ~ sex + rank + degree + sex:degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:degree 1 12335789 393947686 837.71 1.4223 0.2394
## <none> 381611896 838.05
## - rank:degree 2 32435968 414047864 838.29 1.8699 0.1662
##
## Step: AIC=837.71
## salary ~ sex + rank + degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
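# final model keeps only rank, as stated in the text (assumed definition)
lm.faculty.factor.final <- lm(salary ~ rank, data = faculty)
library(car)
Anova(lm.faculty.factor.final, type=3)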
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.7593e+10 1 1963.932 < 2.2e-16 ***
## rank 1.3468e+09 2 75.171 1.174e-15 ***
## Residuals 4.3895e+08 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.factor.final)
##
## Call:
This analysis suggests that sex is not predictive of salary, once other factors are
taken into account. In particular, faculty rank appears to be the sole important
effect, in the sense that once salaries are adjusted for rank no other factors explain a
significant amount of the unexplained variation in salaries.
As noted earlier, the analysis was meant to illustrate a three-factor ANOVA and
backward selection. The analysis is likely flawed, because it ignores the effects of year
and year since degree on salary.
p1 <- ggplot(faculty, aes(x = year, y = salary, colour = rank, shape = sex, size = degree))
p1 <- p1 + scale_size_discrete(range=c(3,5))
## Warning: Using size for a discrete variable is not advised.
p1 <- p1 + geom_point(alpha = 0.5)
p1 <- p1 + labs(title = "Salary by year")
p1 <- p1 + theme(legend.position = "bottom")
#print(p1)
p2 <- ggplot(faculty, aes(x = yd, y = salary, colour = rank, shape = sex, size = degree))
p2 <- p2 + scale_size_discrete(range=c(3,5))
## Warning: Using size for a discrete variable is not advised.
p2 <- p2 + geom_point(alpha = 0.5)
p2 <- p2 + labs(title = "Salary by yd")
p2 <- p2 + theme(legend.position = "bottom")
#print(p2)
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow = 1)
[Figure: two scatterplots. Left: "Salary by year" (x: year, y: salary). Right: "Salary by yd" (x: yd, y: salary).]
# interaction model
lm.s.y.yd.yyd <- lm(salary ~ year*yd, data = faculty)
summary(lm.s.y.yd.yyd)
##
## Call:
## lm(formula = salary ~ year * yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10368.5 -2361.5 -505.7 2363.1 12211.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16287.391 1395.049 11.675 1.25e-15 ***
## year 561.155 275.243 2.039 0.04700 *
## yd 235.415 83.266 2.827 0.00683 **
## year:yd -3.089 10.412 -0.297 0.76796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3958 on 48 degrees of freedom
## Multiple R-squared: 0.579,Adjusted R-squared: 0.5527
## F-statistic: 22 on 3 and 48 DF, p-value: 4.17e-09
# interaction is not significant
lm.s.y.yd <- lm(salary ~ year + yd, data = faculty)
summary(lm.s.y.yd)
##
## Call:
## lm(formula = salary ~ year + yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10321.2 -2347.2 -332.7 2298.8 12240.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16555.7 1052.4 15.732 < 2e-16 ***
## year 489.3 129.6 3.777 0.000431 ***
## yd 222.2 69.8 3.184 0.002525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3921 on 49 degrees of freedom
## Multiple R-squared: 0.5782,Adjusted R-squared: 0.561
## F-statistic: 33.58 on 2 and 49 DF, p-value: 6.532e-10
where the year and year since degree effects (YD) are linear terms (as in the multiple
regression model we considered). To check whether any important effects might
have been omitted, I added individual three-factor terms to this model. All of the
three factor terms were insignificant (not shown), so I believe that my choice for the
“maximal” model is sensible.
The output below gives the fit to the maximal model, and subsequent fits, using
the hierarchy principle. Only selected summaries are provided.
# fit full model with two-way interactions
lm.faculty.full <- lm(salary ~ (sex + rank + degree + year + yd)^2, data = faculty)
library(car)
Anova(lm.faculty.full, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 22605087 1 3.6916 0.06392 .
## sex 4092995 1 0.6684 0.41984
## rank 5731837 2 0.4680 0.63059
## degree 4137628 1 0.6757 0.41735
## year 2022246 1 0.3302 0.56966
## yd 3190911 1 0.5211 0.47578
## BIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.faculty.red.BIC <- step(lm.faculty.full, direction="backward", test="F"
, k=log(nrow(faculty)))
## Start: AIC=868.72
## salary ~ (sex + rank + degree + year + yd)^2
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:rank 2 932237 190757690 861.07 0.0761 0.9269
## - rank:year 2 1571933 191397386 861.24 0.1284 0.8800
## - rank:yd 2 9822382 199647836 863.44 0.8020 0.4575
## - rank:degree 2 13021265 202846719 864.26 1.0632 0.3576
## - year:yd 1 50921 189876375 864.78 0.0083 0.9279
## - sex:yd 1 2024210 191849663 865.32 0.3306 0.5695
## - degree:year 1 4510249 194335703 865.99 0.7366 0.3974
## - degree:yd 1 6407880 196233334 866.49 1.0465 0.3142
## - sex:degree 1 7164815 196990268 866.69 1.1701 0.2877
## - sex:year 1 7194388 197019841 866.70 1.1749 0.2868
## <none> 189825454 868.72
##
## Step: AIC=861.07
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## sex:yd + rank:degree + rank:year + rank:yd + degree:year +
## degree:yd + year:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rank:year 2 4480611 195238301 854.37 0.3876 0.6818
## - rank:yd 2 14587933 205345624 857.00 1.2618 0.2964
## - year:yd 1 25889 190783580 857.12 0.0045 0.9470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=821.19
## salary ~ rank + year
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 276992734 821.19
## - year 1 161953324 438946058 841.18 28.065 2.905e-06 ***
## - rank 2 632056217 909048951 875.09 54.764 4.103e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The add1() function will indicate whether a variable from the “full” model should be
added to the current model. In our case, our BIC-backward selected model appears
adequate.
add1(lm.faculty.red.BIC, . ~ (sex + rank + degree + year + yd)^2, test="F")
## Single term additions
##
## Model:
## salary ~ rank + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 276992734 813.39
## sex 1 2304648 274688086 814.95 0.3943 0.5331
## degree 1 1127718 275865016 815.18 0.1921 0.6632
## yd 1 2314414 274678320 814.95 0.3960 0.5322
## rank:year 2 15215454 261777280 814.45 1.3368 0.2727
library(car)
Anova(lm.faculty.final, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 4422688839 1 766.407 < 2.2e-16 ***
## rank 632056217 2 54.764 4.103e-13 ***
## year 161953324 1 28.065 2.905e-06 ***
## Residuals 276992734 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.final)
##
## Call:
## lm(formula = salary ~ rank + year, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3462.0 -1302.8 -299.2 783.5 9381.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25657.79 926.81 27.684 < 2e-16 ***
## rankAssoc -5192.24 871.83 -5.956 2.93e-07 ***
## rankAsst -9454.52 905.83 -10.437 6.12e-14 ***
## year 375.70 70.92 5.298 2.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2402 on 48 degrees of freedom
## Multiple R-squared: 0.8449,Adjusted R-squared: 0.8352
## F-statistic: 87.15 on 3 and 48 DF, p-value: < 2.2e-16
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
lsmeans(lm.faculty.final, list(pairwise ~ rank), adjust = "bonferroni")
## $`lsmeans of rank`
## rank lsmean SE df lower.CL upper.CL
## Full 28468.28 582.2789 48 27297.53 29639.03
## Assoc 23276.05 642.2996 48 21984.62 24567.48
## Asst 19013.76 613.0513 48 17781.14 20246.38
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Full - Assoc 5192.239 871.8328 48 5.956 <.0001
## Full - Asst 9454.523 905.8301 48 10.437 <.0001
## Assoc - Asst 4262.285 882.8914 48 4.828 <.0001
##
## P value adjustment: bonferroni method for 3 tests
model are significant. The baseline group is Full Professors, with rank=3. Predicted
salaries for the different ranks are given by:
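As a rough sketch, the coefficients printed in summary(lm.faculty.final) above give one prediction line per rank, all sharing the same slope in year. The helper name pred.salary is hypothetical, and the coefficients are copied (rounded) from that output rather than extracted from a fitted object.

```r
# Coefficients copied from the summary(lm.faculty.final) output above
b <- c(intercept = 25657.79, Assoc = -5192.24, Asst = -9454.52, year = 375.70)

# Hypothetical helper: predicted salary for a given rank and years in rank.
# Full is the baseline rank, so it receives no rank adjustment.
pred.salary <- function(rank, year) {
  shift <- c(Full = 0, Assoc = b[["Assoc"]], Asst = b[["Asst"]])
  unname(b[["intercept"]] + shift[rank] + b[["year"]] * year)
}

# Predicted salaries for the three ranks at 5 years in rank
pred.salary(c("Full", "Assoc", "Asst"), year = 5)
```

At any fixed value of year, the difference between two ranks equals the corresponding rank coefficient, which is why the Full - Assoc contrast in the lsmeans output is 5192.24.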
Do you remember how to interpret the lsmeans, and the p-values for comparing
lsmeans?
You might be tempted to conclude that rank and years in rank are the only effects
that are predictive of salaries, and that differences in salaries by sex are insignificant,
once these effects have been taken into account. However, you must be careful because
you have not done a diagnostic analysis. The following two issues are also important
to consider.
A sex effect may exist even though there is insufficient evidence to support it based
on these data. (Lack of power corrupts; and absolute lack of power corrupts
absolutely!) If we are interested in the possibility of a sex effect, I think that we
would do better by focusing on how large the effect might be, and whether it is
important. A simple way to check is by constructing a confidence interval for the sex
effect, based on a simple additive model that includes sex plus the effects that were
selected as statistically significant, rank and year in rank. I am choosing this model
because the omitted effects are hopefully small, and because the regression coefficient
for a sex indicator is easy to interpret in an additive model. Other models might be
considered for comparison. Summary output from this model is given below.
# add sex to the model
lm.faculty.final.sex <- update(lm.faculty.final, . ~ . + sex)
summary(lm.faculty.final.sex)
##
## Call:
## lm(formula = salary ~ rank + year + sex, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3286.3 -1311.8 -178.4 939.1 9002.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25390.65 1025.14 24.768 < 2e-16 ***
## rankAssoc -5109.93 887.12 -5.760 6.20e-07 ***
## rankAsst -9483.84 912.79 -10.390 9.19e-14 ***
## year 390.94 75.38 5.186 4.47e-06 ***
## sexFemale 524.15 834.69 0.628 0.533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2418 on 47 degrees of freedom
## Multiple R-squared: 0.8462,Adjusted R-squared: 0.8331
## F-statistic: 64.64 on 4 and 47 DF, p-value: < 2.2e-16
Men are the baseline group for the sex effect, so the predicted salaries for men
are 524 dollars less than that for women, adjusting for rank and year. A rough 95%
CI for the sex differential is the estimated sex coefficient plus or minus two standard
errors, or 524 ± 2 ∗ (835), or −1146 to 2194 dollars. The range of plausible values
for the sex effect would appear to contain values of practical importance, so further
analysis is warranted here.
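The arithmetic behind the rough interval can be sketched as follows, using only the estimate, standard error, and residual degrees of freedom printed in the summary above. The t-based interval is shown for comparison; with the fitted object available, confint(lm.faculty.final.sex, "sexFemale") should give the same t-based limits up to rounding.

```r
# Estimate, standard error, and residual df copied from
# summary(lm.faculty.final.sex) above
est <- 524.15
se  <- 834.69
df  <- 47

# Rough interval: estimate plus or minus two standard errors
ci.rough <- est + c(-1, 1) * 2 * se

# t-based interval: replace 2 by the 0.975 quantile of t with 47 df
ci.t <- est + c(-1, 1) * qt(0.975, df) * se

round(rbind(rough = ci.rough, t.based = ci.t))
```

The t multiplier is slightly larger than 2, so the t-based interval is slightly wider than the rough one; the practical conclusion is unchanged.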
Another concern, and potentially a more important issue, was raised by M. O.
Finkelstein in a 1980 discussion in the Columbia Law Review on the use of regres-
sion in discrimination cases: “. . . [a] variable may reflect a position or status
bestowed by the employer, in which case if there is discrimination in the
award of the position or status, the variable may be ‘tainted’.” Thus, if
women are unfairly held back from promotion through the faculty ranks, then using
faculty rank to adjust salary before comparing sexes may not be acceptable to the
courts. This suggests that an analysis comparing sexes but ignoring rank effects
might be justifiable. What happens if this is done?
lm.faculty.sex.yd <- lm(salary ~ sex + yd, data = faculty)
library(car)
Anova(lm.faculty.sex.yd, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 4275963832 1 231.4448 < 2.2e-16 ***
## sex 67178787 1 3.6362 0.06241 .
## yd 766344185 1 41.4799 4.883e-08 ***
## Residuals 905279453 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.sex.yd)
##
## Call:
## lm(formula = salary ~ sex + yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9631.7 -2529.4 3.5 2298.0 13125.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
You should recognize that automated model selection methods should not replace
scientific theory when building models! Automated methods are best suited for ex-
ploratory analyses, in situations where the researcher has little scientific information
as a guide.
AIC/BIC were discussed in Section 3.2.1 for stepwise procedures and were used
in examples in Chapter 9. In those examples, I included the corresponding F-tests in
the ANOVA table as a criterion for dropping variables from a model. The next few
sections cover these methods in more detail, then discuss other criteria and selection
strategies, finishing with a few examples.
Y = β0 + β1 X1 + ε (10.1)
Y = β0 + β1 X1 + β2 X2 + ε (10.2)
# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6, p7), ncol=3
, top = "Scatterplots of response sysbp with each predictor variable")
[Figure: Scatterplots of response sysbp with each predictor variable.]
The step() function provides the forward, backward, and stepwise procedures
based on AIC or BIC, and provides corresponding F -tests.
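The column that step() labels "AIC" is, for a linear model, n log(RSS/n) + k (edf), where edf counts the regression coefficients including the intercept; k = 2 gives AIC and k = log(n) gives BIC. The sketch below recomputes that quantity from RSS values printed in the forward selection output that follows. The function name step.criterion is hypothetical, and n = 39 is inferred from the residual degrees of freedom in the model summaries.

```r
# Criterion reported by step() for a linear model:
#   n * log(RSS / n) + k * edf
# edf = number of regression coefficients (including the intercept);
# k = 2 gives AIC and k = log(n) gives BIC.
step.criterion <- function(rss, n, edf, k = log(n)) {
  n * log(rss / n) + k * edf
}

n <- 39  # 36 residual df + 3 coefficients in the wt + yrage fit

step.criterion(rss = 6531.4, n = n, edf = 1)  # sysbp ~ 1
step.criterion(rss = 4756.1, n = n, edf = 2)  # sysbp ~ wt
step.criterion(rss = 3441.4, n = n, edf = 3)  # sysbp ~ wt + yrage
```

These values should agree, up to rounding of the RSS, with the "Start: AIC=" and "Step: AIC=" lines in the output. Note that the label stays "AIC" even when k = log(n) makes it BIC.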
Forward selection output The output for the forward selection method is below.
BIC is our selection criterion, though similar decisions are made as if using F-tests.
Step 1 Variable wt = weight is entered first because it has the highest correlation
with sysbp = systolic bp. The corresponding F-value is the square of the t-statistic for
testing the significance of the weight predictor in this simple linear regression model.
Step 2 Adding yrage = fraction to the simple linear regression model with
weight as a predictor increases R2 the most, or equivalently, decreases Residual SS
(RSS) the most.
Step 3 The last table has “<none>” as the first row, indicating that the current
model (no change to current model) is the best under the current selection criterion.
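The F = t² relationship mentioned in Step 1 holds for any single-degree-of-freedom term. A minimal check, using R's built-in cars data as a stand-in (not the blood pressure data):

```r
# Built-in data set: stopping distance versus speed
fit <- lm(dist ~ speed, data = cars)

# t-statistic for the slope and F-statistic for the same term
t.slope <- summary(fit)$coefficients["speed", "t value"]
F.slope <- anova(fit)["speed", "F value"]

c(t.squared = t.slope^2, F = F.slope)
```

The same check can be made by hand on the blood pressure output: yrage has t = −3.708 in the wt + yrage summary, and (−3.708)² is about 13.75, matching the F value of 13.7530 for adding yrage to the model with wt.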
# start with an empty model (just the intercept 1)
lm.indian.empty <- lm(sysbp ~ 1, data = indian)
# Forward selection, BIC with F-tests
lm.indian.forward.red.BIC <- step(lm.indian.empty
, sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, direction = "forward", test = "F", k = log(nrow(indian)))
## Start: AIC=203.38
## sysbp ~ 1
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + wt 1 1775.38 4756.1 194.67 13.8117 0.0006654 ***
## <none> 6531.4 203.38
## + yrage 1 498.06 6033.4 203.95 3.0544 0.0888139 .
## + fore 1 484.22 6047.2 204.03 2.9627 0.0935587 .
## + calf 1 410.80 6120.6 204.51 2.4833 0.1235725
## + ht 1 313.58 6217.9 205.12 1.8660 0.1801796
## + chin 1 189.19 6342.2 205.89 1.1037 0.3002710
## + pulse 1 114.77 6416.7 206.35 0.6618 0.4211339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=194.67
## sysbp ~ wt
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + yrage 1 1314.69 3441.4 185.71 13.7530 0.0006991 ***
## <none> 4756.1 194.67
## + chin 1 143.63 4612.4 197.14 1.1210 0.2967490
## + calf 1 16.67 4739.4 198.19 0.1267 0.7240063
## + pulse 1 6.11 4749.9 198.28 0.0463 0.8308792
## + ht 1 2.01 4754.0 198.31 0.0152 0.9024460
## + fore 1 1.16 4754.9 198.32 0.0088 0.9257371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=185.71
## sysbp ~ wt + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3441.4 185.71
## + chin 1 197.372 3244.0 187.07 2.1295 0.1534
## + fore 1 50.548 3390.8 188.80 0.5218 0.4749
## + calf 1 30.218 3411.1 189.03 0.3101 0.5812
## + ht 1 23.738 3417.6 189.11 0.2431 0.6251
## + pulse 1 5.882 3435.5 189.31 0.0599 0.8081
summary(lm.indian.forward.red.BIC)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06
Backward selection output The output for the backward elimination method is
below. BIC is our selection criterion, though similar decisions are made as if using
F -tests.
Step 0 The full model has 7 predictors so REG df = 7. The F -test in the full
model ANOVA table (F = 4.91 with p-value=0.0008) tests the hypothesis that the
regression coefficient for each predictor variable is zero. This test is highly significant,
indicating that one or more of the predictors is important in the model.
The t-value column gives the t-statistic for testing the significance of the individual
predictors in the full model conditional on the other variables being in the model.
# start with a full model
lm.indian.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian)
summary(lm.indian.full)
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage,
## data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3993 -5.7916 -0.6907 6.9453 23.5771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.45766 53.91303 1.975 0.057277 .
## wt 1.71095 0.38659 4.426 0.000111 ***
## ht -0.04533 0.03945 -1.149 0.259329
## chin -1.15725 0.84612 -1.368 0.181239
## fore -0.70183 1.34986 -0.520 0.606806
## calf 0.10357 0.61170 0.169 0.866643
## pulse 0.07485 0.19570 0.383 0.704699
## yrage -29.31810 7.86839 -3.726 0.000777 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.994 on 31 degrees of freedom
## Multiple R-squared: 0.5259,Adjusted R-squared: 0.4189
## F-statistic: 4.913 on 7 and 31 DF, p-value: 0.0008079
The least important variable in the full model, as judged by the p-value, is
calf = calf skin fold. This variable, upon omission, reduces R2 the least, or equivalently,
increases the Residual SS the least. So calf is the first to be omitted from the
model.
Step 1 After deleting calf, the six predictor model is fitted. At least one of
the predictors left is important, as judged by the overall F-test p-value. The least
important predictor left is pulse = pulse rate.
Stepwise selection output The output for the stepwise selection is given below.
Variables are listed in the output tables in order that best improves the AIC/BIC
criterion. In the stepwise case, BIC will decrease (improve) by considering variables
to drop or add (indicated in the first column by − and +). Rather than printing
a small table at each step of the step() procedure, we use lm.XXX$anova to print a
summary of the drop/add choices made.
# Stepwise (both) selection, BIC with F-tests, starting with intermediate model
# (this is a purposefully chosen "opposite" model,
# from the forward and backward methods this model
# includes all the variables dropped and none kept)
lm.indian.intermediate <- lm(sysbp ~ ht + fore + calf + pulse, data = indian)
# option: trace = 0 does not print each step of the selection
lm.indian.both.red.BIC <- step(lm.indian.intermediate
, sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, direction = "both", test = "F", k = log(nrow(indian)), trace = 0)
# the anova object provides a summary of the selection steps in order
lm.indian.both.red.BIC$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 34 5651.131 212.3837
## 2 - pulse 1 2.874432 35 5654.005 208.7400
## 3 - calf 1 21.843631 36 5675.849 205.2268
## 4 + wt -1 925.198114 35 4750.651 201.9508
## 5 + yrage -1 1439.707117 34 3310.944 191.5335
## 6 - ht 1 79.870793 35 3390.815 188.7995
## 7 - fore 1 50.548149 36 3441.363 185.7131
summary(lm.indian.both.red.BIC)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
All three methods using BIC choose the same final model, sysbp = β0 + β1 wt +
β2 yrage. Using the AIC criterion, you will find different results.
4. R̄2 can be less than zero for models that explain little of the variation in Y .
The adjusted R2 is easier to calibrate than R2 because it tends to decrease when
unimportant variables are added to a model. The model with the maximum R̄2 is
judged best by this criterion. As I noted before, I do not take any of the criteria
literally, and would also choose other models with R̄2 near the maximum value for
further consideration.
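For reference, the usual formula is R̄2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where p is the number of predictors. The sketch below applies it to R2 values printed in the blood pressure summaries (n = 39); the function name adj.r2 and the third, weak model are hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj.r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj.r2(r2 = 0.4731, n = 39, p = 2)  # wt + yrage model
adj.r2(r2 = 0.5259, n = 39, p = 7)  # full model
adj.r2(r2 = 0.02,   n = 39, p = 7)  # hypothetical weak model: negative value
```

The first two values should reproduce the "Adjusted R-squared" entries in the corresponding summaries. The full model has the larger R2 but the smaller R̄2, and the last line illustrates that R̄2 can fall below zero.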
$$C_p = \frac{\text{Residual SS}}{\hat{\sigma}^2_{\text{FULL}}} - \text{Residual df} + (p + 1)$$
where $\hat{\sigma}^2_{\text{FULL}}$ is the Residual MS from the full model with k variables X1 , X2 , . . ., Xk .
If all the important effects from the candidate list are included in the model,
then the difference between the first two terms of Cp should be approximately zero.
Thus, if the model under consideration includes all the important variables from the
candidate list, then Cp should be approximately p + 1 (the number of variables in
model plus one), or less. If important variables from the candidate list are excluded,
Cp will tend to be much greater than p + 1.
Two important properties of Cp are
1. the full model has Cp = p + 1, where p = k, and
2. if two models have the same number of variables, then the model with the larger
R2 has the smaller Cp .
Models with Cp ≈ p + 1, or less, merit further consideration. As with R2 and R̄2 , I
prefer simpler models that satisfy this condition. The “best” model by this criterion
has the minimum Cp .
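A small sketch of the Cp calculation, using quantities printed for the blood pressure data elsewhere in this chapter: the full model's residual standard error (9.994 on 31 df), n = 39, and RSS = 3441.363 for the model with wt and yrage. The function name mallows.cp is hypothetical.

```r
# Mallows' Cp = RSS / sigma2.full - residual df + (p + 1)
mallows.cp <- function(rss, sigma2.full, n, p) {
  resid.df <- n - p - 1
  rss / sigma2.full - resid.df + (p + 1)
}

n           <- 39
sigma2.full <- 9.994^2  # full-model residual standard error, squared

# Candidate model sysbp ~ wt + yrage: RSS = 3441.363, p = 2
cp.wt.yrage <- mallows.cp(rss = 3441.363, sigma2.full = sigma2.full, n = n, p = 2)
cp.wt.yrage

# Full model (p = k = 7): RSS = sigma2.full * 31, so Cp = p + 1 = 8
cp.full <- mallows.cp(rss = sigma2.full * 31, sigma2.full = sigma2.full, n = n, p = 7)
cp.full
```

The first value should come out near 1.45, below the target p + 1 = 3, and the second illustrates property 1: the full model always has Cp = p + 1.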
[Figure: three panels plotting R2 (leaps.r2$r2), Adj-R2 (leaps.adjr2$adjr2), and Cp (leaps.Cp$Cp) against model size (1 to 8).]
All together The function below takes regsubsets() output and formats it into a
table.
# best subset, returns results sorted by BIC
f.bestsubset <- function(form, dat, nbest = 5){
library(leaps)
bs <- regsubsets(form, data=dat, nvmax=30, nbest=nbest, method="exhaustive");
bs2 <- cbind(summary(bs)$which, (rowSums(summary(bs)$which)-1)
, summary(bs)$rss, summary(bs)$rsq
, summary(bs)$adjr2, summary(bs)$cp, summary(bs)$bic);
cn <- colnames(bs2);
cn[(dim(bs2)[2]-5):dim(bs2)[2]] <- c("SIZE", "rss", "r2", "adjr2", "cp", "bic");
colnames(bs2) <- cn;
ind <- sort.int(summary(bs)$bic, index.return=TRUE); bs2 <- bs2[ind$ix,];
return(bs2);
}
# perform on our model
i.best <- f.bestsubset(formula(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage)
, indian)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
i.best
## (Intercept) wt ht chin fore calf pulse yrage SIZE rss r2 adjr2
## 2 1 1 0 0 0 0 0 1 2 3441.363 0.47310778 0.44383599
## 3 1 1 0 1 0 0 0 1 3 3243.990 0.50332663 0.46075463
Discussion of Cp results:
1. None of the single predictor models is adequate. Each has Cp ≫ 1 + 1 = 2, the
target value.
2. The only adequate two predictor model has wt = weight and yrage = fraction
as predictors: Cp = 1.45 < 2 + 1 = 3. This is the minimum Cp model.
3. Every model with weight and fraction is adequate. Every model that excludes
either weight or fraction is inadequate: Cp ≫ p + 1.
According to Cp , any reasonable model must include both weight and fraction as
predictors. Based on simplicity, I would select the model with these two predictors
as a starting point. I can always add predictors if subsequent analysis suggests this
is necessary!
I will give three reasons why I feel that the simpler model is preferable at this
point:
predictors, and see whether the relationship between o2up and the individual pre-
dictors is roughly linear. If not, we will consider appropriate transformations of the
response and/or predictors.
#### Example: Oxygen uptake
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch10_oxygen.dat"
oxygen <- read.table(fn.data, header=TRUE)
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), nrow=2
, top = "Scatterplots of response o2up with each predictor variable")
[Figure: Scatterplots of response o2up with each predictor variable (bod, tkn, ts, tvs, cod).]
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), nrow=2
, top = "Scatterplots of response logup with each predictor variable")
[Figure: Scatterplots of response logup with each predictor variable.]
I used several of the model selection procedures to select out predictors. The
model selection criteria below point to a more careful analysis of the model with
ts and cod as predictors. This model has the minimum Cp and is selected by the
backward and stepwise procedures. Furthermore, no other model has a substantially
higher R2 or R̄2 . The fit of the model will not likely be improved substantially by
adding any of the remaining three effects to this model.
# perform on our model
o.best <- f.bestsubset(formula(logup ~ bod + tkn + ts + tvs + cod)
, oxygen, nbest = 3)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
o.best
## (Intercept) bod tkn ts tvs cod SIZE rss r2 adjr2 cp bic
## 2 1 0 0 1 0 1 2 1.0850469 0.7857080 0.7604972 1.738781 -21.82112
## 3 1 0 1 1 0 1 3 0.9871461 0.8050430 0.7684886 2.318714 -20.71660
## 3 1 0 0 1 1 1 3 1.0633521 0.7899926 0.7506163 3.424094 -19.22933
These comments must be taken with a grain of salt because we have not critically
assessed the underlying assumptions (linearity, normality, independence), nor have
we considered whether the data contain influential points or outliers.
lm.oxygen.final <- lm(logup ~ ts + cod, data = oxygen)
summary(lm.oxygen.final)
##
## Call:
## lm(formula = logup ~ ts + cod, data = oxygen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37640 -0.09238 -0.04229 0.06256 0.59827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.370e+00 1.969e-01 -6.960 2.3e-06 ***
## ts 1.492e-04 5.489e-05 2.717 0.0146 *
## cod 1.415e-04 5.318e-05 2.661 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2526 on 17 degrees of freedom
## Multiple R-squared: 0.7857,Adjusted R-squared: 0.7605
## F-statistic: 31.17 on 2 and 17 DF, p-value: 2.058e-06
The p-values for testing the importance of the individual predictors are small,
indicating that both predictors are important. However, two observations (1 and 20)
are poorly fitted by the model (both have ri > 2) and are individually most influential
(largest Di s). Recall that this experiment was conducted over 220 days, so these
observations were the first and last data points collected. We have little information
about the experiment, but it is reasonable to conjecture that the experiment may not
have reached a steady state until the second time point, and that the experiment was
ended when the experimental material dissipated. The end points of the experiment
may not be typical of conditions under which we are interested in modelling oxygen
uptake. A sensible strategy here is to delete these points and redo the entire analysis
to see whether our model changes noticeably.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.oxygen.final, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.oxygen.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 20 7
## residuals vs order of data
#plot(lm.oxygen.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.oxygen.final (residuals, Cook's distance, and QQ plot of the residuals); observations 1, 20, 3, and 7 are labeled.]
Further, the added-variable plots for both ts and cod clearly highlight outlying
cases 1 and 20.
library(car)
avPlots(lm.oxygen.final, id = list(n = 3))
[Figure: Added-Variable Plots for lm.oxygen.final, with logup | others on the vertical axis.]
Below is the model with ts and cod as predictors, after omitting the end obser-
vations. Both predictors are significant at the 0.05 level. Furthermore, there do not
appear to be any extreme outliers. The QQ-plot, and the plot of studentized residuals
against predicted values do not show any extreme abnormalities.
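The data frame oxygen2 used below is not defined in this part of the notes. A minimal sketch of dropping the first and last rows of a time-ordered data frame is given here on a hypothetical stand-in (toy); applied to the real data, one plausible construction would be oxygen[-c(1, 20), ], but that line is an assumption rather than something taken from the notes.

```r
# Hypothetical stand-in for the oxygen data: 20 rows in time order
set.seed(1)
toy <- data.frame(day = seq(0, 220, length.out = 20), logup = rnorm(20))

# Drop the first and last rows (the end observations)
toy2 <- toy[-c(1, nrow(toy)), ]

nrow(toy2)            # number of remaining rows
head(rownames(toy2))  # original row labels are kept
```

Because subsetting keeps the original row labels, later output can show both a label and a position for the same case, which may be why the qqPlot() output for the reduced data prints two rows of numbers.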
lm.oxygen2.final <- lm(logup ~ ts + cod, data = oxygen2)
summary(lm.oxygen2.final)
##
## Call:
## lm(formula = logup ~ ts + cod, data = oxygen2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24157 -0.08517 0.01004 0.10102 0.25094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.335e+00 1.338e-01 -9.976 5.16e-08 ***
## ts 1.852e-04 3.182e-05 5.820 3.38e-05 ***
## cod 8.638e-05 3.517e-05 2.456 0.0267 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1438 on 15 degrees of freedom
## Multiple R-squared: 0.8923,Adjusted R-squared: 0.878
## F-statistic: 62.15 on 2 and 15 DF, p-value: 5.507e-08
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.oxygen2.final, which = c(1,4,6))
# Normality of Residuals
library(car)
qqPlot(lm.oxygen2.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 6 7 15
## 5 6 14
## residuals vs order of data
#plot(lm.oxygen2.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for lm.oxygen2.final (residuals, Cook's distance, and QQ plot of the residuals); observations 3, 4, 6, 7, and 15 are labeled.]
library(car)
avPlots(lm.oxygen2.final, id = list(n = 3))
[Figure: Added-Variable Plots for lm.oxygen2.final, with logup | others on the vertical axis.]
Let us recall that the researcher’s primary goal was to identify important predic-
tors of o2up. Regardless of whether we are inclined to include the end observations in
the analysis or not, it is reasonable to conclude that ts and cod are useful for explain-
ing the variation in log10 (o2up). If these data were the final experiment, I might be
inclined to eliminate the end observations and use the following equation to predict
oxygen uptake:
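Reading the coefficients from summary(lm.oxygen2.final) above, the fitted equation is log10(o2up) = −1.335 + 0.0001852 ts + 0.00008638 cod. A minimal sketch of using it for prediction follows; the helper name pred.o2up and the predictor values are hypothetical, chosen only to lie within the observed range of ts and cod.

```r
# Coefficients copied from summary(lm.oxygen2.final) above
b0    <- -1.335
b.ts  <- 1.852e-04
b.cod <- 8.638e-05

# Hypothetical helper: predicted log10(o2up) and back-transformed o2up
pred.o2up <- function(ts, cod) {
  log10.pred <- b0 + b.ts * ts + b.cod * cod
  c(log10.o2up = log10.pred, o2up = 10^log10.pred)
}

pred.o2up(ts = 5000, cod = 5000)  # illustrative predictor values only
```

The back-transformed value 10^(prediction) is on the original o2up scale; it is better read as a typical (median-like) value than as a predicted mean, since the model was fitted on the log scale.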
Logistic Regression
Logistic regression analysis is used for predicting the outcome of a categorical depen-
dent variable based on one or more predictor variables. The probabilities describing
the possible outcomes of a single trial are modeled, as a function of the explanatory
(predictor) variables, using a logistic function. Logistic regression is frequently used
to refer to the problem in which the dependent variable is binary — that is, the num-
ber of available categories is two — and problems with more than two categories are
referred to as multinomial logistic regression or, if the multiple categories are ordered,
as ordered logistic regression.
Logistic regression measures the relationship between a categorical dependent vari-
able and usually (but not necessarily) one or more continuous independent variables,
by converting the dependent variable to probability scores. As such it treats the same
set of problems as does probit regression using similar techniques.
where “...” stands for additional options. The key parameter here is family, which
is a simple way of specifying a choice of variance and link functions. Some choices
of family are listed in the table. As can be seen, each of the first five choices has an
associated variance function (for binomial the binomial variance µ(1 − µ)), and one
or more choices of link functions (for binomial the logit, probit, or complementary
log-log).
Family            Variance          Link
gaussian          gaussian          identity
binomial          binomial          logit, probit, or cloglog
poisson           poisson           log, identity, or sqrt
Gamma             Gamma             inverse, identity, or log
inverse.gaussian  inverse.gaussian  1/µ^2
quasi             user-defined      user-defined
As long as you want the default link, all you have to specify is the family name.
If you want an alternative link, you must add a link argument. For example to do
probits you use:
glm(formula, family = binomial(link = probit))
The last family on the list, quasi, is there to allow fitting user-defined models by
maximum quasi-likelihood.
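To make the family and link arguments concrete, here is a small sketch that fits the same binary response with the default logit link and with a probit link. It uses R's built-in mtcars data purely as a stand-in (am is a 0/1 variable); it is not one of the data sets from these notes.

```r
# Built-in mtcars data as a stand-in: am (0/1) is the binary response
fit.logit  <- glm(am ~ wt, family = binomial, data = mtcars)
fit.probit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

# Each fit records the link that was used
family(fit.logit)$link
family(fit.probit)$link

# Compare fitted probabilities from the two links for the first few cars
round(cbind(logit = fitted(fit.logit), probit = fitted(fit.probit))[1:5, ], 3)
```

The coefficients from the two fits are on different scales and are not directly comparable, but the fitted probabilities are often similar.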
The rest of this chapter concerns logistic regression with a binary response vari-
able.
1 Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.
The researchers were curious about how the proportion of girls that reached menar-
che (p̂ = Menarche/Total) varied with age. One could perform a test of homogeneity
(Multinomial goodness-of-fit test) by arranging the data as a 2-by-25 contingency
table with columns indexed by age and two rows: ROW1 = Menarche, and ROW2
= number that have not reached menarche = (Total − Menarche). A more power-
ful approach treats these as regression data, using the proportion of girls reaching
menarche as the response and age as a predictor.
A plot of the observed proportion p̂ of girls that have reached menarche shows
that the proportion increases as age increases, but that the relationship is nonlinear.
The observed proportions, which are bounded between zero and one, have a lazy S-
shape (a sigmoidal function) when plotted against age. The change in the observed
proportions for a given change in age is much smaller when the proportion is near 0 or
1 than when the proportion is near 1/2. This phenomenon is common with regression
data where the response is a proportion.
The trend is nonlinear so linear regression is inappropriate. A sensible alternative
might be to transform the response or the predictor to achieve near linearity. A
common transformation of response proportions following a sigmoidal curve is to the
logit scale µ̂ = loge {p̂/(1 − p̂)}. This transformation is the basis for the logistic
regression model. The natural logarithm (base e) is traditionally used in logistic
regression.
The logit transformation is undefined when p̂ = 0 or p̂ = 1. To overcome this
problem, researchers use the empirical logits, defined by log{(p̂ + 0.5/n)/(1 − p̂ +
0.5/n)}, where n is the sample size or the number of observations on which p̂ is based.
A plot of the empirical logits against age is roughly linear, which supports a logistic
transformation for the response.
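A tiny sketch of why the adjustment helps, using hypothetical counts (not the menarche data) that include groups with no successes and with all successes:

```r
# Hypothetical counts: y successes out of n, including y = 0 and y = n
y <- c(0, 3, 10)
n <- c(10, 10, 10)
p.hat <- y / n

# Plain logit is infinite when p.hat is 0 or 1
logit <- log(p.hat / (1 - p.hat))

# Empirical logit adds 0.5/n to numerator and denominator, so it stays finite
emp.logit <- log((p.hat + 0.5 / n) / (1 - p.hat + 0.5 / n))

cbind(p.hat, logit, emp.logit)
```

Multiplying numerator and denominator by n shows that the empirical logit is the same as log{(y + 0.5)/(n − y + 0.5)}.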
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = p.hat))
p <- p + geom_point()
p <- p + labs(title = paste("Observed probability of girls reaching menarche,\n",
"Warsaw, Poland in 1965", sep=""))
print(p)
# empirical logits
menarche$emp.logit <- log(( menarche$p.hat + 0.5/menarche$Total) /
(1 - menarche$p.hat + 0.5/menarche$Total))
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = emp.logit))
p <- p + geom_point()
p <- p + labs(title = "Empirical logits")
print(p)
[Figure: two panels. Left: "Observed probability of girls reaching menarche, Warsaw, Poland in 1965" (p.hat against Age). Right: "Empirical logits" (emp.logit against Age).]
or, equivalently, as
$$p = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}.$$
The logistic regression model is a binary response model, where the response for
each case falls into one of two exclusive and exhaustive categories, success (cases with
the attribute of interest) and failure (cases without the attribute of interest).
The odds of success are p/(1 − p). For example, when p = 1/2 the odds of success
are 1 (or 1 to 1). When p = 0.9 the odds of success are 9 (or 9 to 1). The logistic
model assumes that the log-odds of success is linearly related to X. Graphs of the
logistic model relating p to X are given below. The sign of the slope refers to the
sign of β1 .
I should write p = p(X) to emphasize that p is the proportion of all individuals
with score X that have the attribute of interest. In the menarche data, p = p(X) is
the population proportion of girls at age X that have reached menarche.
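The odds and log-odds arithmetic above can be checked directly in R. The following is a minimal sketch using the base functions `qlogis()` (logit) and `plogis()` (inverse logit); the helper name `odds` and the coefficient values `b0` and `b1` are made up for illustration and do not come from any data set in these notes.

```r
# odds of success for a probability p
odds <- function(p) p / (1 - p)

odds(0.5)   # 1 (or 1 to 1)
odds(0.9)   # 9 (or 9 to 1)

# log-odds (logit) and its inverse
p <- c(0.1, 0.5, 0.9)
logit.p <- qlogis(p)               # same as log(p / (1 - p))
all.equal(logit.p, log(odds(p)))
all.equal(plogis(logit.p), p)      # the inverse logit recovers p

# hypothetical coefficients: log-odds linear in X, p sigmoidal in X
b0 <- -1
b1 <- 0.8
X <- seq(-5, 5, by = 2.5)
p.X <- plogis(b0 + b1 * X)         # exp(b0 + b1 X) / (1 + exp(b0 + b1 X))
round(p.X, 3)
```

With a positive `b1` the probabilities increase with X; changing the sign of `b1` reverses the direction of the curve.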
[Figure: the logistic model on the logit scale (Log-Odds versus X) and on the probability scale (Probability versus X), with curves for positive, zero, and negative slope.]
The data in a logistic regression problem are often given in summarized or aggregate form:

X     n     y
X1    n1    y1
X2    n2    y2
...   ...   ...
Xm    nm    ym

where yi is the number of individuals with the attribute of interest among ni randomly
selected or representative individuals with predictor variable value Xi . For raw data
on individual cases, yi = 1 or 0, depending on whether the case at Xi is a success or
failure, and the sample size column n is omitted with raw data.
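To make the two data layouts concrete, here is a small sketch that collapses raw 0/1 data into the aggregate form (X, n, y) and fits the same logistic model both ways. The data frame `raw` is a made-up toy example, and the object names are hypothetical.

```r
# toy raw data (made up): one row per case, y = 1 for success, 0 for failure
raw <- data.frame(
  X = rep(c(10, 12, 14), times = c(8, 10, 6)),
  y = c(rep(c(1, 0), times = c(1, 7)),
        rep(c(1, 0), times = c(5, 5)),
        rep(c(1, 0), times = c(5, 1)))
)

# aggregate form: one row per distinct X with group size n and successes y
agg <- data.frame(
  X = sort(unique(raw$X)),
  n = as.vector(table(raw$X)),
  y = as.vector(tapply(raw$y, raw$X, sum))
)
agg$p.hat <- agg$y / agg$n   # sample proportions
agg

# the two layouts give the same coefficient estimates
fit.raw <- glm(y ~ X, family = binomial, data = raw)
fit.agg <- glm(cbind(y, n - y) ~ X, family = binomial, data = agg)
rbind(raw = coef(fit.raw), aggregated = coef(fit.agg))
```

The coefficient estimates agree, but the residual deviances differ because the two layouts use different saturated models.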
For logistic regression, a plot of the sample proportions p̂i = yi /ni against Xi
should be roughly sigmoidal, and a plot of the empirical logits against Xi should be
roughly linear. If not, then some other model is probably appropriate. I find the
second plot easier to calibrate, but neither plot is very informative when the sample
sizes are small, say 1 or 2. (Why?).
There are a variety of other binary response models that are used in practice. The
probit regression model or the complementary log-log regression model might be
appropriate when the logistic model does not fit the data.
The following section describes the standard MLE strategy for estimating the
logistic regression parameters.
A simple way to estimate β0 and β1 is by least squares (LS), using the empirical logits
as responses and the Xi s as the predictor values.
Below we use standard regression to calculate the LS fit between the empirical
logits and age.
lm.menarche.e.a <- lm(emp.logit ~ Age, data = menarche)
# LS coefficients
coef(lm.menarche.e.a)
## (Intercept) Age
## -22.027933 1.676395
The LS estimates for the menarche data are b0 = −22.03 and b1 = 1.68, which
gives the fitted relationship
$$\log\left(\frac{\tilde{p}}{1 - \tilde{p}}\right) = -22.03 + 1.68\,\mathrm{Age}$$
or
$$\tilde{p} = \frac{\exp(-22.03 + 1.68\,\mathrm{Age})}{1 + \exp(-22.03 + 1.68\,\mathrm{Age})},$$
where p̃ is the predicted proportion (under the model) of girls having reached menar-
che at the given age. I used p̃ to identify a predicted probability, in contrast to p̂
which is the observed proportion at a given age.
The power of the logistic model versus the contingency table analysis discussed
earlier is that the model gives estimates for the population proportion reaching menar-
che at all ages within the observed age range. The observed proportions allow you to
estimate only the population proportions at the observed ages.
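As a rough illustration of predicting at ages that were not observed, the sketch below plugs the LS coefficients reported above into the inverse logit. The ages in `age.new` are hypothetical values chosen inside the observed age range, and the object names are mine.

```r
# LS coefficients reported above (empirical logits regressed on Age)
b0.ls <- -22.027933
b1.ls <-   1.676395

# hypothetical ages at which to predict, inside the observed age range
age.new <- c(11, 12, 13, 14, 15)

# predicted proportion: inverse logit of the fitted line
p.tilde <- plogis(b0.ls + b1.ls * age.new)
data.frame(Age = age.new, p.tilde = round(p.tilde, 3))

# age at which the predicted proportion is 0.5 (fitted log-odds equal to 0)
age.50 <- -b0.ls / b1.ls
age.50
```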
over all possible values of β0 and β1 , where the pi s satisfy the logistic model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_i.$$
The ML method also gives standard errors and significance tests for the regression
estimates.
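To show what maximizing the likelihood amounts to, here is a minimal sketch that minimizes the negative binomial log-likelihood numerically with `optim()` and compares the result with `glm()`. The data frame `toy` is made up, and the function name `negloglik` is hypothetical; this is not the algorithm `glm()` itself uses (it uses Fisher scoring), only a check that both reach essentially the same estimates.

```r
# toy aggregated data (made up for illustration)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  n = rep(20, 6),
                  y = c(2, 5, 9, 13, 17, 19))

# negative binomial log-likelihood, dropping the constant choose(n, y) terms
negloglik <- function(beta, x, n, y) {
  eta <- beta[1] + beta[2] * x
  # log(p) = plogis(eta, log.p = TRUE); log(1 - p) = plogis(-eta, log.p = TRUE)
  -sum(y * plogis(eta, log.p = TRUE) + (n - y) * plogis(-eta, log.p = TRUE))
}

fit.optim <- optim(c(0, 0), negloglik, x = toy$x, n = toy$n, y = toy$y,
                   method = "BFGS", hessian = TRUE)
fit.glm <- glm(cbind(y, n - y) ~ x, family = binomial, data = toy)

# point estimates agree up to optimizer tolerance
rbind(optim = fit.optim$par, glm = coef(fit.glm))

# standard errors: inverse of the numerically estimated Hessian versus glm
rbind(optim = sqrt(diag(solve(fit.optim$hessian))),
      glm   = sqrt(diag(vcov(fit.glm))))
```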
is used to test the adequacy of the model. The deviance is small when the data fits
the model, that is, when the observed and fitted proportions are close together. Large
values of D occur when one or more of the observed and fitted proportions are far
apart, which suggests that the model is inappropriate.
If the logistic model holds, then D has a chi-squared distribution with m − r
degrees of freedom, where m is the number of groups and r (here 2) is the
number of estimated regression parameters. A p-value for the deviance is given by
the area under the chi-squared curve to the right of D. A small p-value indicates that
the data does not fit the model.
Alternatively, the fit of the model can be evaluated using the chi-squared approximation to the Pearson X² statistic:
$$X^2 = \sum_{i=1}^{m} \left[ \frac{(y_i - n_i\tilde{p}_i)^2}{n_i\tilde{p}_i} + \frac{\big((n_i - y_i) - n_i(1 - \tilde{p}_i)\big)^2}{n_i(1 - \tilde{p}_i)} \right] = \sum_{i=1}^{m} \frac{(y_i - n_i\tilde{p}_i)^2}{n_i\tilde{p}_i(1 - \tilde{p}_i)}.$$
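The two forms of the Pearson statistic can be verified numerically. The sketch below computes both by hand on a made-up data frame `toy`, compares them with the sum of squared Pearson residuals from `glm()`, and computes upper-tail chi-squared p-values on m − r degrees of freedom. All object names are hypothetical.

```r
# toy aggregated data (made up for illustration)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  n = rep(20, 6),
                  y = c(2, 5, 9, 13, 17, 19))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = toy)
p.tilde <- fitted(fit)   # fitted proportions

# Pearson X^2, long form: success and failure cells summed separately
X2.long <- sum((toy$y - toy$n * p.tilde)^2 / (toy$n * p.tilde) +
               ((toy$n - toy$y) - toy$n * (1 - p.tilde))^2 /
                 (toy$n * (1 - p.tilde)))
# Pearson X^2, short form
X2 <- sum((toy$y - toy$n * p.tilde)^2 / (toy$n * p.tilde * (1 - p.tilde)))

all.equal(X2, X2.long)
all.equal(X2, sum(residuals(fit, type = "pearson")^2))

# degrees of freedom m - r and upper-tail p-values
df <- nrow(toy) - length(coef(fit))
pchisq(X2, df, lower.tail = FALSE)              # Pearson
pchisq(deviance(fit), df, lower.tail = FALSE)   # deviance
```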
Age Total Menarche p.hat emp.logit fitted.values fit se.fit fit.upper fit.lower
1 9.21 376.00 0.00 0.00 −6.62 0.00 −6.20 0.23 0.00 0.00
2 10.21 200.00 0.00 0.00 −5.99 0.01 −4.56 0.18 0.01 0.01
3 10.58 93.00 0.00 0.00 −5.23 0.02 −3.96 0.16 0.02 0.01
4 10.83 120.00 2.00 0.02 −3.86 0.03 −3.55 0.14 0.04 0.02
5 11.08 90.00 2.00 0.02 −3.57 0.04 −3.14 0.13 0.05 0.03
6 11.33 88.00 5.00 0.06 −2.72 0.06 −2.74 0.12 0.08 0.05
7 11.58 105.00 10.00 0.10 −2.21 0.09 −2.33 0.11 0.11 0.07
8 11.83 111.00 17.00 0.15 −1.69 0.13 −1.92 0.10 0.15 0.11
9 12.08 100.00 16.00 0.16 −1.63 0.18 −1.51 0.08 0.21 0.16
10 12.33 93.00 29.00 0.31 −0.78 0.25 −1.10 0.07 0.28 0.22
11 12.58 100.00 39.00 0.39 −0.44 0.33 −0.70 0.07 0.36 0.30
12 12.83 108.00 51.00 0.47 −0.11 0.43 −0.29 0.06 0.46 0.40
13 13.08 99.00 47.00 0.47 −0.10 0.53 0.12 0.06 0.56 0.50
14 13.33 106.00 67.00 0.63 0.54 0.63 0.53 0.07 0.66 0.60
15 13.58 105.00 81.00 0.77 1.20 0.72 0.94 0.07 0.75 0.69
16 13.83 117.00 88.00 0.75 1.10 0.79 1.34 0.08 0.82 0.77
17 14.08 98.00 79.00 0.81 1.40 0.85 1.75 0.09 0.87 0.83
18 14.33 97.00 90.00 0.93 2.49 0.90 2.16 0.10 0.91 0.88
19 14.58 120.00 113.00 0.94 2.72 0.93 2.57 0.11 0.94 0.91
20 14.83 102.00 95.00 0.93 2.54 0.95 2.98 0.12 0.96 0.94
21 15.08 122.00 117.00 0.96 3.06 0.97 3.38 0.14 0.97 0.96
22 15.33 111.00 107.00 0.96 3.17 0.98 3.79 0.15 0.98 0.97
23 15.58 94.00 92.00 0.98 3.61 0.98 4.20 0.16 0.99 0.98
24 15.83 114.00 112.00 0.98 3.81 0.99 4.61 0.18 0.99 0.99
25 17.58 1049.00 1049.00 1.00 7.65 1.00 7.46 0.28 1.00 1.00
The summary table gives MLEs and standard errors for the regression parameters.
The z-value column is the parameter estimate divided by its standard error. The p-
values are used to test whether the corresponding parameters of the logistic model
are zero.
summary(glm.m.a)
##
## Call:
## glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
## data = menarche)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0363 -0.9953 -0.4900 0.7780 1.3675
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
## Age 1.63197 0.05895 27.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3693.884 on 24 degrees of freedom
## Residual deviance: 26.703 on 23 degrees of freedom
## AIC: 114.76
##
## Number of Fisher Scoring iterations: 4
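As a quick check of how the z value and p-value columns are formed, the following sketch recomputes them from the estimates and standard errors printed in the summary above. The vector names `est`, `se`, `z`, and `p` are mine.

```r
# estimates and standard errors copied from the summary output above
est <- c("(Intercept)" = -21.22639, Age = 1.63197)
se  <- c("(Intercept)" =   0.77068, Age = 0.05895)

z <- est / se               # z value = estimate / standard error
p <- 2 * pnorm(-abs(z))     # two-sided p-value from the standard normal

round(z, 2)
p < 2e-16                   # both are below the printing threshold
```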
If the model is correct and when sample sizes are large, the residual deviance D
has an approximate chi-squared distribution,
$$D \sim \chi^2_{\text{residual df}}.$$
If D is too large, or the p-value is too small, then the model does not capture all the
features in the data.
The deviance statistic is D = 26.70 on 25 − 2 = 23 df. The large p-value for
D suggests no gross deficiencies with the logistic model. The observed and fitted
proportions (p.hat and fitted.values in the output table above) are reasonably close
at each observed age. Also, emp.logit and fit are close. This is consistent with D
being fairly small. The data fits the logistic regression model reasonably well.
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm.m.a$deviance
## [1] 26.70345
glm.m.a$df.residual
## [1] 23
dev.p.val <- 1 - pchisq(glm.m.a$deviance, glm.m.a$df.residual)
dev.p.val
## [1] 0.2687953
The MLEs b0 = −21.23 and b1 = 1.63 for the intercept and slope are close to
the LS estimates of bLS0 = −22.03 and bLS1 = 1.68, respectively, obtained earlier. The
two estimation methods give similar predicted probabilities here. The ML
predicted probabilities satisfy
$$\log\left(\frac{\tilde{p}}{1 - \tilde{p}}\right) = -21.23 + 1.63\,\mathrm{Age}$$
or
$$\tilde{p} = \frac{\exp(-21.23 + 1.63\,\mathrm{Age})}{1 + \exp(-21.23 + 1.63\,\mathrm{Age})}.$$
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = p.hat))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = Age, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = Age, y = fitted.values), color = "red")
# fitted values
p <- p + geom_point(aes(y = fitted.values), color = "red", size=2)
# observed values
p <- p + geom_point(size=2)
p <- p + labs(title = paste("Observed and predicted probability of girls reaching menarche,\n",
"Warsaw, Poland in 1965", sep=""))
print(p)
[Figure: "Observed and predicted probability of girls reaching menarche, Warsaw, Poland in 1965" (p.hat versus Age, with the fitted curve and point-wise 95% CI band).]
If the model holds, then a slope of β1 = 0 implies that p does not depend on AGE,
i.e., the proportion of girls that have reached menarche is identical across age groups.
The Wald p-value for the slope is < 0.0001, which leads to rejecting H0 : β1 = 0 at
any of the usual test levels. The proportion of girls that have reached menarche is
not constant across age groups. Again, the power of the model is that it gives you a
simple way to quantify the effect of age on the proportion reaching menarche. This
is more appealing than testing homogeneity across age groups followed by multiple
comparisons.
Wald tests can be performed to test the global null hypothesis, that all non-
intercept βs are equal to zero. This is the logistic regression analog of the overall
model F-test in ANOVA and regression. The only predictor is AGE, so the implied
test is that the slope of the regression line is zero. The Wald test statistic and p-value
reported here are identical to the Wald test and p-value for the AGE effect given
in the parameter estimates table. The Wald test can also be used to test specific
contrasts between parameters.
# Testing Global Null Hypothesis
library(aod)
coef(glm.m.a)
## (Intercept) Age
## -21.226395 1.631968
# specify which coefficients to test = 0 (Terms = 2:4 would be terms 2, 3, and 4)
wald.test(b = coef(glm.m.a), Sigma = vcov(glm.m.a), Terms = 2:2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 766.3, df = 1, P(> X2) = 0.0
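The Wald chi-squared statistic can also be computed without the aod package, as b'V⁻¹b for the tested coefficients b and their estimated covariance matrix V. The sketch below assumes a made-up data frame `toy` with two predictors; the helper `wald.chisq` is a hypothetical name, not a function from any package.

```r
# toy aggregated data with two predictors (made up for illustration)
toy <- data.frame(x = rep(1:5, times = 2),
                  g = rep(c(0, 1), each = 5),
                  n = 20,
                  y = c(2, 4, 8, 12, 15, 5, 9, 13, 16, 19))
fit <- glm(cbind(y, n - y) ~ x + g, family = binomial, data = toy)

# Wald test that the coefficients indexed by 'terms' are all zero
wald.chisq <- function(fit, terms) {
  b <- coef(fit)[terms]
  V <- vcov(fit)[terms, terms, drop = FALSE]
  W <- as.numeric(t(b) %*% solve(V) %*% b)
  c(X2 = W, df = length(terms),
    p = pchisq(W, df = length(terms), lower.tail = FALSE))
}

wald.chisq(fit, 2:3)   # global null: both non-intercept coefficients are zero
wald.chisq(fit, 2)     # single coefficient

# for a single coefficient the Wald chi-squared is the squared z value
coef(summary(fit))[2, "z value"]^2
```

For a single coefficient the chi-squared statistic is the square of the z value, which is why the global test and the z-test agree when there is only one predictor.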
and NRES (the number of NTOTAL that survived at least one year from the time of
diagnosis).
where LWBC = log(WBC). The model is best understood by separating the AG+
and AG− cases. For AG− individuals, AG=0 so the model reduces to
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1\,\mathrm{LWBC} + \beta_2 \cdot 0 = \beta_0 + \beta_1\,\mathrm{LWBC}.$$
The model without AG (i.e., β2 = 0) is a simple logistic model where the log-odds
of surviving one year is linearly related to LWBC, and is independent of AG. The
reduced model with β2 = 0 implies that there is no effect of the AG level on the
survival probability once LWBC has been taken into account.
Including the binary predictor AG in the model implies that there is a linear
relationship between the log-odds of surviving one year and LWBC, with a constant
slope for the two AG levels. This model includes an effect for the AG morphological
factor, but more general models are possible. A natural extension would be to include
a product or interaction effect, a point that I will return to momentarily.
The parameters are easily interpreted: β0 and β0 + β2 are intercepts for the pop-
ulation logistic regression lines for AG− and AG+, respectively. The lines have a
common slope, β1 . The β2 coefficient for the AG indicator is the difference between
intercepts for the AG+ and AG− regression lines. A picture of the assumed rela-
tionship is given below for β1 < 0. The population regression lines are parallel on
the logit scale only, but the order between AG groups is preserved on the probability
scale.
[Figure: the equal-slopes model on the logit scale (Log-Odds versus LWBC) and on the probability scale (Probability versus LWBC), with separate curves for IAG=0 and IAG=1.]
Before looking at output for the equal slopes model, note that the data set has 30
distinct AG and LWBC combinations, or 30 “groups” or samples. Only two samples
have more than 1 observation. The majority of the observed proportions surviving
at least one year (number surviving ≥ 1 year/group sample size) are 0 (i.e., 0/1)
or 1 (i.e., 1/1). This sparseness of the data makes it difficult to graphically assess
the suitability of the logistic model (because the estimated proportions are almost
all 0 or 1). Although significance tests on the regression coefficients do not require
large group sizes, the chi-squared approximation to the deviance statistic is suspect
in sparse data settings. With small group sizes as we have here, most researchers
would not interpret the p-value for D literally. Instead, they would use the p-values
to informally check the fit of the model. Diagnostics would be used to highlight
problems with the model.
glm.i.l <- glm(cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, leuk)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.i.l$deviance, glm.i.l$df.residual)
dev.p.val
## [1] 0.6842804
The large p-value for D indicates that there are no gross deficiencies with the
model. Recall that the test of the global null hypothesis gives a p-value for testing the
hypothesis that the regression coefficients are zero for every predictor in the model.
The two predictors are LWBC and AG, so the small p-value indicates that LWBC or
AG, or both, are important predictors of survival. The p-values in the estimates table
suggest that LWBC and AG are both important. If either predictor were insignificant, I
would consider refitting the model omitting the least significant effect, as in regression.
# Testing Global Null Hypothesis
library(aod)
coef(glm.i.l)
## (Intercept) ag1 lwbc
## 5.543349 2.519562 -1.108759
# specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
wald.test(b = coef(glm.i.l), Sigma = vcov(glm.i.l), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 8.2, df = 2, P(> X2) = 0.017
Given that the model fits reasonably well, a test of H0 : β2 = 0 might be a primary
interest here. This checks whether the regression lines are identical for the two AG
levels, which is a test for whether AG affects the survival probability, after taking
LWBC into account. The hypothesis is rejected at the 5% level (p = 0.021), though not
at the 1% level, suggesting that the AG level affects the survival probability
(assuming a very specific model).
summary(glm.i.l)
##
## Call:
## glm(formula = cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial,
## data = leuk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6599 -0.6595 -0.2776 0.6438 1.7131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5433 3.0224 1.834 0.0666 .
## ag1 2.5196 1.0907 2.310 0.0209 *
## lwbc -1.1088 0.4609 -2.405 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.191 on 29 degrees of freedom
## Residual deviance: 23.014 on 27 degrees of freedom
## AIC: 30.635
##
## Number of Fisher Scoring iterations: 5
library(ggplot2)
p <- ggplot(leuk, aes(x = lwbc, y = p.hat, colour = ag, fill = ag))
[Figure: p.hat versus lwbc, with points coloured by ag (0 or 1).]
ntotal nres ag wbc lwbc p.hat fitted.values fit se.fit fit.upper fit.lower
1 1 1 1 75 4.32 1.00 0.96 3.28 1.44 1.00 0.61
2 1 1 1 230 5.44 1.00 0.88 2.03 0.99 0.98 0.52
3 1 1 1 260 5.56 1.00 0.87 1.90 0.94 0.98 0.51
4 1 1 1 430 6.06 1.00 0.79 1.34 0.78 0.95 0.45
5 1 1 1 700 6.55 1.00 0.69 0.80 0.66 0.89 0.38
6 1 1 1 940 6.85 1.00 0.62 0.47 0.61 0.84 0.33
7 1 1 1 1000 6.91 1.00 0.60 0.40 0.61 0.83 0.31
8 1 1 1 1050 6.96 1.00 0.59 0.35 0.60 0.82 0.30
9 3 1 1 10000 9.21 0.33 0.10 −2.15 1.12 0.51 0.01
10 1 1 0 300 5.70 1.00 0.31 −0.78 0.87 0.72 0.08
11 1 1 0 440 6.09 1.00 0.23 −1.21 0.83 0.61 0.06
12 1 0 1 540 6.29 0.00 0.75 1.09 0.72 0.92 0.42
13 1 0 1 600 6.40 0.00 0.73 0.97 0.69 0.91 0.41
14 1 0 1 1700 7.44 0.00 0.45 −0.18 0.61 0.73 0.20
15 1 0 1 3200 8.07 0.00 0.29 −0.89 0.73 0.63 0.09
16 1 0 1 3500 8.16 0.00 0.27 −0.99 0.75 0.62 0.08
17 1 0 1 5200 8.56 0.00 0.19 −1.42 0.88 0.57 0.04
18 1 0 0 150 5.01 0.00 0.50 −0.01 1.02 0.88 0.12
19 1 0 0 400 5.99 0.00 0.25 −1.10 0.84 0.63 0.06
20 1 0 0 530 6.27 0.00 0.20 −1.41 0.83 0.55 0.05
21 1 0 0 900 6.80 0.00 0.12 −2.00 0.86 0.42 0.02
22 1 0 0 1000 6.91 0.00 0.11 −2.12 0.87 0.40 0.02
23 1 0 0 1900 7.55 0.00 0.06 −2.83 1.01 0.30 0.01
24 1 0 0 2100 7.65 0.00 0.05 −2.94 1.03 0.29 0.01
25 1 0 0 2600 7.86 0.00 0.04 −3.18 1.09 0.26 0.00
26 1 0 0 2700 7.90 0.00 0.04 −3.22 1.11 0.26 0.00
27 1 0 0 2800 7.94 0.00 0.04 −3.26 1.12 0.26 0.00
28 1 0 0 3100 8.04 0.00 0.03 −3.37 1.15 0.25 0.00
29 1 0 0 7900 8.97 0.00 0.01 −4.41 1.48 0.18 0.00
30 2 0 0 10000 9.21 0.00 0.01 −4.67 1.57 0.17 0.00
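or equivalently,
$$\tilde{p} = \frac{\exp(5.54 - 1.11\,\mathrm{LWBC})}{1 + \exp(5.54 - 1.11\,\mathrm{LWBC})}.$$
For AG+ individuals with AG=1,
$$\log\left(\frac{\tilde{p}}{1 - \tilde{p}}\right) = 5.54 - 1.11\,\mathrm{LWBC} + 2.52(1) = 8.06 - 1.11\,\mathrm{LWBC},$$
or
$$\tilde{p} = \frac{\exp(8.06 - 1.11\,\mathrm{LWBC})}{1 + \exp(8.06 - 1.11\,\mathrm{LWBC})}.$$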
Using the logit scale, the difference between AG+ and AG− individuals in the
estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is
the estimated AG regression coefficient,
$$(8.06 - 1.11\,\mathrm{LWBC}) - (5.54 - 1.11\,\mathrm{LWBC}) = 2.52.$$
Using properties of exponential functions, the odds that an AG+ patient lives at least
one year are exp(2.5196) = 12.42 times the odds that an AG− patient lives
at least one year, regardless of LWBC.
This summary, and a CI for the AG odds ratio, is given in the Odds Ratio
table. Similarly, the estimated odds ratio of 0.33 for LWBC implies that the odds of
surviving at least one year is reduced by a factor of 3 for each unit increase of LWBC.
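The claim that the AG odds ratio does not depend on LWBC can be checked numerically. The sketch below uses the coefficients from the fitted equal-slopes model above; the LWBC values in `lwbc` are hypothetical points within the observed range, and the object names are mine.

```r
# coefficients copied from the fitted equal-slopes model above
b0     <-  5.5433
b.ag   <-  2.5196
b.lwbc <- -1.1088

# hypothetical LWBC values within the observed range
lwbc <- c(5, 6, 7, 8, 9)

p.neg <- plogis(b0 + b.lwbc * lwbc)          # AG- predicted probabilities
p.pos <- plogis(b0 + b.ag + b.lwbc * lwbc)   # AG+ predicted probabilities

odds <- function(p) p / (1 - p)
or.ag <- odds(p.pos) / odds(p.neg)
or.ag            # the same value at every LWBC
exp(b.ag)        # odds ratio for AG+ versus AG-
exp(b.lwbc)      # odds ratio per unit increase in LWBC

# on the probability scale the AG difference is not constant
round(p.pos - p.neg, 3)
```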
We can use the confint() function to obtain confidence intervals for the coefficient
estimates. Note that for logistic models, confidence intervals are based on the profiled
log-likelihood function.
## CIs using profiled log-likelihood
confint(glm.i.l)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.1596372 12.4524409
## ag1 0.5993391 5.0149271
## lwbc -2.2072275 -0.3319512
We can also get CIs based on just the standard errors by using the default method.
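A minimal sketch of the standard-error-based intervals follows, assuming a made-up data frame `toy` rather than the leukemia data. It uses `confint.default()` from the stats package, checks it against estimate ± z × SE computed by hand, and exponentiates to the odds-ratio scale.

```r
# toy aggregated data (made up for illustration)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  n = rep(20, 6),
                  y = c(2, 5, 9, 13, 17, 19))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = toy)

## CIs using standard errors (Wald intervals)
ci.wald <- confint.default(fit)
ci.wald

# the same intervals by hand: estimate +/- z * SE
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
manual <- cbind(est - qnorm(0.975) * se, est + qnorm(0.975) * se)
all.equal(unname(ci.wald), unname(manual))

## odds ratios and their 95% CIs
exp(cbind(OR = est, ci.wald))
```

The Wald intervals are symmetric about the estimate on the logit scale, whereas profile-likelihood intervals generally are not.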
Leukemia example. In the example above, the odds of surviving at least one year
are multiplied by 12.42 for AG+ versus AG−, and by 0.33 (a decrease)
for every unit increase in lwbc.
The aim of an experiment originally reported by Strand (1930) and quoted by Bliss
(1935) was to assess the response of the confused flour beetle, Tribolium confusum,
to gaseous carbon disulphide (CS2 ). In the experiment, prescribed volumes of liquid
carbon disulphide were added to flasks in which a tubular cloth cage containing a
batch of about thirty beetles was suspended. Duplicate batches of beetles were used
for each concentration of CS2 . At the end of a five-hour period, the proportion killed
was recorded and the actual concentration of gaseous CS2 in the flask, measured in
mg/l, was determined by a volumetric analysis. The mortality data are given in the
table below.
#### Example: Beetles
## Beetles data set
# conc = CS2 concentration
# y = number of beetles killed
# n = number of beetles exposed
# rep = Replicate number (1 or 2)
beetles <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_beetles.dat", header = TRUE)
beetles$rep <- factor(beetles$rep)
     conc   y   n  rep          conc   y   n  rep
1   49.06   2  29    1     9   49.06   4  30    2
2   52.99   7  30    1    10   52.99   6  30    2
3   56.91   9  28    1    11   56.91   9  34    2
4   60.84  14  27    1    12   60.84  14  29    2
5   64.76  23  30    1    13   64.76  29  33    2
6   68.69  29  31    1    14   68.69  24  28    2
7   72.61  29  30    1    15   72.61  32  32    2
8   76.54  29  29    1    16   76.54  31  31    2
beetles$conc2 <- beetles$conc^2 # for quadratic term (making coding a little easier)
beetles$p.hat <- beetles$y / beetles$n # observed proportion of successes
# empirical logits
beetles$emp.logit <- log(( beetles$p.hat + 0.5/beetles$n) /
(1 - beetles$p.hat + 0.5/beetles$n))
#str(beetles)
Plot the observed probability of mortality and the empirical logits with linear and
quadratic LS fits (which are not the same as the logistic MLE fits).
library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = p.hat, shape = rep))
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed mortality, probability scale")
print(p)
library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = emp.logit))
p <- p + geom_smooth(method = "lm", colour = "red", se = FALSE)
[Figure: left panel, p.hat versus conc; right panel, emp.logit versus conc; plotting symbols by rep (1 or 2).]
In a number of articles that refer to these data, the responses from the first two
concentrations are omitted because of apparent non-linearity. Bliss himself remarks
that
However, there does not appear to be any biological motivation for this and so here
they are retained in the data set.
Combining the data from the two replicates and plotting the empirical logit of the
observed proportions against concentration gives a relationship that is better fit by a
quadratic than a linear relationship,
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X + \beta_2 X^2.$$
The right plot below shows the linear and quadratic model fits to the observed values
with point-wise 95% confidence bands on the logit scale, and on the left is the same
on the proportion scale.
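One way the linear and quadratic logistic fits might be obtained and compared is sketched below. The data are typed in from the beetles table above so the example is self-contained; the object names `glm.lin` and `glm.quad` are hypothetical and need not match the objects used to draw the plots.

```r
# beetles data typed in from the table above (both replicates)
beetles <- data.frame(
  conc = rep(c(49.06, 52.99, 56.91, 60.84, 64.76, 68.69, 72.61, 76.54), 2),
  y    = c( 2,  7,  9, 14, 23, 29, 29, 29,   4,  6,  9, 14, 29, 24, 32, 31),
  n    = c(29, 30, 28, 27, 30, 31, 30, 29,  30, 30, 34, 29, 33, 28, 32, 31)
)

# linear and quadratic models on the logit scale
glm.lin  <- glm(cbind(y, n - y) ~ conc, family = binomial, data = beetles)
glm.quad <- glm(cbind(y, n - y) ~ conc + I(conc^2), family = binomial,
                data = beetles)

# residual deviances: the quadratic model cannot fit worse than the linear one
c(linear = deviance(glm.lin), quadratic = deviance(glm.quad))

# likelihood ratio (change in deviance) test for the quadratic term
anova(glm.lin, glm.quad, test = "Chisq")
```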
library(ggplot2)
p <- ggplot(beetles.all, aes(x = conc, y = p.hat, shape = rep, colour = modelorder, fill = modelorder))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit.lower, ymax = fit.upper), linetype = 0, alpha = 0.1)
p <- p + geom_line(aes(x = conc, y = fitted.values, linetype = modelorder), size = 1)
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2)
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed and predicted mortality, probability scale")
print(p)
library(ggplot2)
p <- ggplot(beetles.all, aes(x = conc, y = emp.logit, shape = rep, colour = modelorder, fill = mo
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit - 1.96 * se.fit, ymax = fit + 1.96 * se.fit), linet
p <- p + geom_line(aes(x = conc, y = fit, linetype = modelorder), size = 1)
# fitted values
p <- p + geom_point(aes(y = fit), size=2)
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed and predicted mortality, logit scale")
print(p)
[Figure: left panel, "Observed and predicted mortality, probability scale" (p.hat versus conc); right panel, "Observed and predicted mortality, logit scale" (emp.logit versus conc); plotting symbols by rep (1 or 2), and lines and bands by modelorder (linear or quadratic).]
from 0 for a patient with no injuries to 75 for a patient with 3 or more life threatening
injuries. The ISS is the standard injury index used by trauma centers throughout
the U.S. The RTS is an index of physiologic injury, and is constructed as a weighted
average of an incoming patient’s systolic blood pressure, respiratory rate, and Glasgow
Coma Scale. The RTS can take on values from 0 for a patient with no vital signs to
7.84 for a patient with normal vital signs.
Champion et al. (1981) proposed a logistic regression model to estimate the prob-
ability of a patient’s survival as a function of RTS, the injury severity score ISS, and
the patient’s age, which is used as a surrogate for physiologic reserve. Subsequent
survival models included the binary effect BP as a means to differentiate between
blunt and penetrating injuries.
We will develop a logistic model for predicting survival from ISS, AGE, BP, and
RTS, and nine body regions. Data on the number of severe injuries in each of the nine
body regions is also included in the database, so we will also assess whether these
features have any predictive power. The following labels were used to identify the
number of severe injuries in the nine regions: AS = head, BS = face, CS = neck,
DS = thorax, ES = abdomen, FS = spine, GS = upper extremities, HS = lower
extremities, and JS = skin.
#### Example: UNM Trauma Data
trauma <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_trauma.dat"
, header = TRUE)
## Variables
# surv = survival (1 if survived, 0 if died)
# rts = revised trauma score (range: 0 no vital signs to 7.84 normal vital signs)
# iss = injury severity score (0 no injuries to 75 for 3 or more life threatening injuries)
# bp = blunt or penetrating injuries (e.g., car crash BP=0 vs gunshot/knife wounds BP=1)
# Severe injuries: add the severe injuries 3--6 to make summary variables
trauma <- within(trauma, {
as = a3 + a4 + a5 + a6 # as = head
bs = b3 + b4 + b5 + b6 # bs = face
cs = c3 + c4 + c5 + c6 # cs = neck
ds = d3 + d4 + d5 + d6 # ds = thorax
es = e3 + e4 + e5 + e6 # es = abdomen
fs = f3 + f4 + f5 + f6 # fs = spine
gs = g3 + g4 + g5 + g6 # gs = upper extremities
hs = h3 + h4 + h5 + h6 # hs = lower extremities
js = j3 + j4 + j5 + j6 # js = skin
})
# keep only columns of interest
names(trauma)
## [1] "id" "surv" "a1" "a2" "a3" "a4" "a5" "a6" "b1" "b2"
## [11] "b3" "b4" "b5" "b6" "c1" "c2" "c3" "c4" "c5" "c6"
## [21] "d1" "d2" "d3" "d4" "d5" "d6" "e1" "e2" "e3" "e4"
## [31] "e5" "e6" "f1" "f2" "f3" "f4" "f5" "f6" "g1" "g2"
## [41] "g3" "g4" "g5" "g6" "h1" "h2" "h3" "h4" "h5" "h6"
## [51] "j1" "j2" "j3" "j4" "j5" "j6" "iss" "iciss" "bp" "rts"
## [61] "age" "prob" "js" "hs" "gs" "fs" "es" "ds" "cs" "bs"
## [71] "as"
trauma <- subset(trauma, select = c(id, surv, as:js, iss:prob))
head(trauma)
## id surv as bs cs ds es fs gs hs js iss iciss bp rts age prob
## 1 1238385 1 0 0 0 1 0 0 0 0 0 13 0.8612883 0 7.8408 13 0.9909890
## 2 1238393 1 0 0 0 0 0 0 0 0 0 5 0.9421876 0 7.8408 23 0.9947165
## 3 1238898 1 0 0 0 0 0 0 2 0 0 13 0.7251130 0 7.8408 43 0.9947165
## 4 1239516 1 1 0 0 0 0 0 0 0 0 16 1.0000000 0 5.9672 17 0.9615540
## 5 1239961 1 1 0 0 0 0 0 0 0 1 9 0.9346634 0 4.8040 20 0.9338096
## 6 1240266 1 0 0 0 0 0 0 0 1 0 13 0.9004691 0 7.8408 32 0.9947165
#str(trauma)
[Figure: panels of value against factor(surv) (0 or 1) for the predictors, including es, fs, gs, hs, js, iss, iciss, and bp.]
ables. The numbers of cases in the success category and the group sample sizes were
specified in the model statement, along with the names of the predictors. The trauma
data set, which is not reproduced here, is raw data consisting of one record per pa-
tient (i.e., 3132 lines). The logistic model is fitted to data on individual cases by
specifying the binary response variable (SURV) with successes and 1 − SURV failures
with the predictors on the right-hand side of the formula. Keep in mind that we are
defining the logistic model to model the success category, so we are modeling the
probability of surviving.
As an aside, there are two easy ways to model the probability of dying (which
we don’t do below). The first is to swap the order the response is specified in the
formula: cbind(1 - surv, surv). The second is to convert a model for the log-odds
of surviving to a model for the log-odds of dying by simply changing the sign of each
regression coefficient in the model.
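The sign-change relationship between the two parameterizations can be illustrated on simulated data. In the sketch below the data frame `toy`, its variables, and the simulation coefficients are all made up; only the `cbind()` response forms follow the text.

```r
# toy raw data (simulated, for illustration only): one row per case
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$surv <- rbinom(200, 1, plogis(0.5 + 1.2 * toy$x))

# model the probability of surviving
fit.surv <- glm(cbind(surv, 1 - surv) ~ x, family = binomial, data = toy)
# model the probability of dying: swap the order of the response columns
fit.die  <- glm(cbind(1 - surv, surv) ~ x, family = binomial, data = toy)

# the coefficients differ only in sign
rbind(surviving = coef(fit.surv), dying = coef(fit.die))
all.equal(coef(fit.die), -coef(fit.surv))
```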
I only included the summary table from the backward elimination, and information
on the fit of the selected model.
glm.tr <- glm(cbind(surv, 1 - surv) ~ as + bs + cs + ds + es + fs + gs + hs + js
+ iss + rts + age + bp
, family = binomial, trauma)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.tr.red.AIC$deviance, glm.tr.red.AIC$df.residual)
dev.p.val
## [1] 1
Letting p be the probability of survival, the estimated survival probability is given
by
$$\log\left(\frac{\tilde{p}}{1 - \tilde{p}}\right) = 0.3558 - 0.4613\,\mathrm{ES} - 0.0569\,\mathrm{ISS} + 0.8431\,\mathrm{RTS} - 0.0497\,\mathrm{AGE}.$$
Let us interpret the sign of the coefficients, and the odds ratios, in terms of the impact
that individual predictors have on the survival probability.
## coefficients and 95% CI (the column labelled OR holds logit-scale coefficients, not odds ratios)
cbind(OR = coef(glm.tr.red.AIC), confint(glm.tr.red.AIC))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.35584499 -0.51977300 1.21869015
## es -0.46131679 -0.67603693 -0.24307991
## iss -0.05691973 -0.07159539 -0.04249502
## rts 0.84314317 0.73817886 0.95531089
## age -0.04970641 -0.06020882 -0.03943822
# predicted probabilities
Yhat <- fitted(glm.tr.red.AIC)
# Classification Table
classify.table$Thresh [i.thresh] <- thresh[i.thresh] # Prob.Level
classify.table$Cor.Event[i.thresh] <- cTab[2,2] # Correct.Event
classify.table$Cor.NonEv[i.thresh] <- cTab[1,1] # Correct.NonEvent
classify.table$Inc.Event[i.thresh] <- cTab[2,1] # Incorrect.Event
classify.table$Inc.NonEv[i.thresh] <- cTab[1,2] # Incorrect.NonEvent
classify.table$Cor.All [i.thresh] <- 100 * sum(diag(cTab)) / sum(cTab) # Correct.Overall
classify.table$Sens [i.thresh] <- 100 * cTab[2,2] / sum(cTab[,2]) # Sensitivity
classify.table$Spec [i.thresh] <- 100 * cTab[1,1] / sum(cTab[,1]) # Specificity
classify.table$Fal.P [i.thresh] <- 100 * cTab[2,1] / sum(cTab[2,]) # False.Pos
classify.table$Fal.N [i.thresh] <- 100 * cTab[1,2] / sum(cTab[1,]) # False.Neg
}
round(classify.table, 1)
## Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All Sens Spec Fal.P Fal.N
## 1 0.0 2865 0 267 0 91.5 100.0 0.0 8.5 NaN
## 2 0.1 2861 79 188 4 93.9 99.9 29.6 6.2 4.8
## 3 0.2 2856 105 162 9 94.5 99.7 39.3 5.4 7.9
## 4 0.3 2848 118 149 17 94.7 99.4 44.2 5.0 12.6
## 5 0.4 2837 125 142 28 94.6 99.0 46.8 4.8 18.3
## 6 0.5 2825 139 128 40 94.6 98.6 52.1 4.3 22.3
## 7 0.6 2805 157 110 60 94.6 97.9 58.8 3.8 27.6
## 8 0.7 2774 174 93 91 94.1 96.8 65.2 3.2 34.3
## 9 0.8 2727 196 71 138 93.3 95.2 73.4 2.5 41.3
## 10 0.9 2627 229 38 238 91.2 91.7 85.8 1.4 51.0
## 11 1.0 0 267 0 2865 8.5 0.0 100.0 NaN 91.5
The data set has 2865 survivors and 267 people that died. Using a 0.50 cutoff,
2825 of the survivors would be correctly identified, and 40 misclassified. Similarly, 139
of the patients that died would be correctly classified and 128 would not. The overall
percentage of cases correctly classified is (2825 + 139)/3132 = 94.6%. The sensitivity
is the percentage of survivors that are correctly classified, 2825/(2825 + 40) = 98.6%.
The specificity is the percentage of patients that died that are correctly classified,
139/(139 + 128) = 52.1%. The false positive rate, which is the % of those predicted
to survive that did not, is 128/(128 + 2825) = 4.3%. The false negative rate, which
is the % of those predicted to die that did not, is 40/(40 + 139) = 22.3%.
# Thresh = 0.5 classification table
YhatPred <- cut(Yhat, breaks=c(-Inf, 0.5, Inf), labels=c("NonEvent", "Event"))
# contingency table and marginal sums
cTab <- table(YhatPred, YObs)
addmargins(cTab)
## YObs
## YhatPred NonEvent Event Sum
## NonEvent 139 40 179
## Event 128 2825 2953
## Sum 267 2865 3132
round(subset(classify.table, Thresh == 0.5), 1)
## Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All Sens Spec Fal.P Fal.N
## 6 0.5 2825 139 128 40 94.6 98.6 52.1 4.3 22.3
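The summary rates for the 0.5 threshold can be reproduced directly from the four counts in the table above. This is a self-contained sketch: the matrix is typed in by hand, and the vector name `rates` is mine.

```r
# counts copied from the 0.5-threshold table above
# rows = predicted class, columns = observed class
cTab <- matrix(c(139, 128, 40, 2825), nrow = 2,
               dimnames = list(YhatPred = c("NonEvent", "Event"),
                               YObs     = c("NonEvent", "Event")))
addmargins(cTab)

rates <- c(
  Cor.All = 100 * sum(diag(cTab)) / sum(cTab),                          # overall correct
  Sens    = 100 * cTab["Event", "Event"]       / sum(cTab[, "Event"]),    # sensitivity
  Spec    = 100 * cTab["NonEvent", "NonEvent"] / sum(cTab[, "NonEvent"]), # specificity
  Fal.P   = 100 * cTab["Event", "NonEvent"]    / sum(cTab["Event", ]),    # false positive
  Fal.N   = 100 * cTab["NonEvent", "Event"]    / sum(cTab["NonEvent", ])  # false negative
)
round(rates, 1)

# misclassification rate of the model at this threshold
100 - rates["Cor.All"]
# misclassification rate in this data set if every patient is classified as a survivor
100 * sum(cTab[, "NonEvent"]) / sum(cTab)
```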
The misclassification rate seems small, but you should remember that approximately
10% of patients admitted to UNM eventually die from their injuries. Given
this historical information only, you could achieve a 10% misclassification rate by
completely ignoring the data and classifying each admitted patient as a survivor.
Using the data reduces the misclassification rate by about 50% (from 10% down to
5.4%), which is an important reduction in this problem.
Logistic histogram plots of the data show a clear marginal relationship of failure
with temp but not with pressure. We still need to assess the model with both variables
together.
# plot logistic plots of response to each predictor individually
library(popbio)
##
## Attaching package: ’popbio’
## The following object is masked from ’package:gdata’:
##
## resample
logi.hist.plot(shuttle$temp, shuttle$y, boxp=FALSE, type="hist"
, rug=TRUE, col="gray", ylabel = "Probability", xlabel = "Temp")
logi.hist.plot(shuttle$pressure, shuttle$y, boxp=FALSE, type="hist"
, rug=TRUE, col="gray", ylabel = "Probability", xlabel = "Pressure")
[Figure: logistic histogram plots of Probability (with Frequency histograms) against Temp (left) and Pressure (right).]
We fit the logistic model below using Y = 1 if at least one O-ring failed, and 0
otherwise. We are modelling the chance of one or more O-ring failures as a function
of temperature and pressure.
The D goodness-of-fit statistic suggests no gross deviations from the model.
Furthermore, the test of H0 : β1 = β2 = 0 (no regression effects) based on the Wald test
has a p-value of 0.1, which gives only weak evidence that temperature, pressure, or
both are useful predictors of the probability of O-ring failure. The z-test p-values for
testing H0 : β1 = 0 and H0 : β2 = 0 individually are 0.037 and 0.576, respectively,
which indicates that pressure is not important (when added last to the model), but that
temperature is important. This conclusion might be anticipated by looking at the data
plots above.
glm.sh <- glm(cbind(y, 1 - y) ~ temp + pressure, family = binomial, shuttle)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.sh$deviance, glm.sh$df.residual)
dev.p.val
## [1] 0.4589415
# Testing Global Null Hypothesis
library(aod)
coef(glm.sh)
## (Intercept) temp pressure
## 16.385319489 -0.263404073 0.005177602
# specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
wald.test(b = coef(glm.sh), Sigma = vcov(glm.sh), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 4.6, df = 2, P(> X2) = 0.1
# Model summary
summary(glm.sh)
##
## Call:
## glm(formula = cbind(y, 1 - y) ~ temp + pressure, family = binomial,
## data = shuttle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1928 -0.7879 -0.3789 0.4172 2.2031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.385319 8.027474 2.041 0.0412 *
## temp -0.263404 0.126371 -2.084 0.0371 *
## pressure 0.005178 0.009257 0.559 0.5760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.984 on 20 degrees of freedom
## AIC: 25.984
##
## Number of Fisher Scoring iterations: 5
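As a rough check on the summary above, the z statistics and p-values can be recomputed from the printed estimates and standard errors. This sketch re-enters the rounded numbers from the output, so the results agree only to rounding; the object names are illustrative.

```r
# Estimates and standard errors copied from the summary(glm.sh) output above
est <- c(temp = -0.263404, pressure = 0.005178)
se  <- c(temp =  0.126371, pressure = 0.009257)

z <- est / se              # Wald z statistics
p <- 2 * pnorm(-abs(z))    # two-sided p-values from the standard normal

# Residual deviance lack-of-fit check: deviance 19.984 on 20 df
dev.p <- 1 - pchisq(19.984, df = 20)

print(round(cbind(z = z, p = p), 4))
print(round(dev.p, 3))
```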
A reasonable next step would be to refit the model, after omitting pressure as a
predictor. The target model is now
\[ \log\left( \frac{p_i}{1 - p_i} \right) = \beta_0 + \beta_1 \mathrm{Temp}_i . \]
case flight y six temp pressure fitted.values fit se.fit fit.upper fit.lower fitted
1 NA NA NA NA 31.00 NA NA 7.85 4.04 1.00 0.48 1.00
2 NA NA NA NA 35.00 NA NA 6.92 3.61 1.00 0.46 1.00
3 NA NA NA NA 40.00 NA NA 5.76 3.08 1.00 0.43 1.00
4 NA NA NA NA 45.00 NA NA 4.60 2.55 1.00 0.40 0.99
5 NA NA NA NA 50.00 NA NA 3.43 2.02 1.00 0.37 0.97
6 1.00 14.00 1.00 2.00 53.00 50.00 0.94 2.74 1.71 1.00 0.35 0.94
7 2.00 9.00 1.00 1.00 57.00 50.00 0.86 1.81 1.31 0.99 0.32 0.86
8 3.00 23.00 1.00 1.00 58.00 200.00 0.83 1.58 1.21 0.98 0.31 0.83
9 4.00 10.00 1.00 1.00 63.00 50.00 0.60 0.42 0.77 0.87 0.25 0.60
10 5.00 1.00 0.00 0.00 66.00 200.00 0.43 −0.28 0.59 0.71 0.19 0.43
11 6.00 5.00 0.00 0.00 67.00 50.00 0.38 −0.51 0.56 0.64 0.17 0.38
12 7.00 13.00 0.00 0.00 67.00 200.00 0.38 −0.51 0.56 0.64 0.17 0.38
13 8.00 15.00 0.00 0.00 67.00 50.00 0.38 −0.51 0.56 0.64 0.17 0.38
14 9.00 4.00 0.00 0.00 68.00 200.00 0.32 −0.74 0.55 0.58 0.14 0.32
15 10.00 3.00 0.00 0.00 69.00 200.00 0.27 −0.98 0.56 0.53 0.11 0.27
16 11.00 8.00 0.00 0.00 70.00 50.00 0.23 −1.21 0.59 0.49 0.09 0.23
17 12.00 17.00 0.00 0.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
18 13.00 2.00 1.00 1.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
19 14.00 11.00 1.00 1.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
20 15.00 6.00 0.00 0.00 72.00 200.00 0.16 −1.67 0.70 0.43 0.05 0.16
21 16.00 7.00 0.00 0.00 73.00 200.00 0.13 −1.90 0.78 0.40 0.03 0.13
22 17.00 16.00 0.00 0.00 75.00 100.00 0.09 −2.37 0.94 0.37 0.01 0.09
23 18.00 21.00 1.00 2.00 75.00 200.00 0.09 −2.37 0.94 0.37 0.01 0.09
24 19.00 19.00 0.00 0.00 76.00 200.00 0.07 −2.60 1.03 0.36 0.01 0.07
25 20.00 22.00 0.00 0.00 76.00 200.00 0.07 −2.60 1.03 0.36 0.01 0.07
26 21.00 12.00 0.00 0.00 78.00 200.00 0.04 −3.07 1.22 0.34 0.00 0.04
27 22.00 20.00 0.00 0.00 79.00 200.00 0.04 −3.30 1.32 0.33 0.00 0.04
28 23.00 18.00 0.00 0.00 81.00 200.00 0.02 −3.76 1.51 0.31 0.00 0.02
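The code that produced the interval columns is not shown here. The values in the table are consistent with a Wald interval computed on the logit scale and then back-transformed, so the sketch below assumes that construction (with a 1.96 multiplier) and checks it against a few rows copied from the table.

```r
# A few (fit, se.fit) pairs copied from the table above (logit scale)
fit    <- c(2.74, 0.42, -0.28, -2.37)
se.fit <- c(1.71, 0.77,  0.59,  0.94)

# Assumed construction: Wald interval on the logit scale, then inverse logit
z.crit    <- qnorm(0.975)                   # about 1.96
fitted    <- plogis(fit)                    # predicted probability
fit.lower <- plogis(fit - z.crit * se.fit)  # lower 95% limit on the probability scale
fit.upper <- plogis(fit + z.crit * se.fit)  # upper 95% limit on the probability scale

print(round(cbind(fit, se.fit, fit.upper, fit.lower, fitted), 2))
```

Building the interval on the logit scale keeps both limits inside (0, 1), which a symmetric interval on the probability scale would not guarantee.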
library(ggplot2)
p <- ggplot(shuttle, aes(x = temp, y = y))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = temp, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = temp, y = fitted), colour="red")
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2, colour="red")
# observed values
p <- p + geom_point(size = 2)
p <- p + ylab("Probability")
p <- p + labs(title = "Observed events and predicted probability of 1+ O-ring failures")
print(p)
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
[Figure: Observed events and predicted probability of 1+ O-ring failures. Probability versus temp (30 to 80), showing the fitted curve with a point-wise 95% confidence band, the fitted values, and the observed outcomes.]
An Introduction to Multivariate
Methods
Multivariate statistical methods are used to display, analyze, and describe data on
two or more features or variables simultaneously. I will discuss multivariate methods
for measurement data. Methods for multi-dimensional count data, or mixtures of
counts and measurements are available, but are beyond the scope of what I can do
here. I will give a brief overview of the types of problems where multivariate methods
are appropriate.
Example: Turtle shells Jolicoeur and Mosimann provided data on the height,
length, and width of the carapace (shell) for a sample of female painted turtles.
Cluster analysis is used to identify which shells are similar on the three features.
Principal component analysis is used to identify the linear combinations of the
measurements that account for most of the variation in size and shape of the shells.
Cluster analysis and principal component analysis are primarily descriptive tech-
niques.
Example: Fisher’s Iris data Random samples of 50 flowers were selected from
three iris species: Setosa, Virginica, and Versicolor. Four measurements were made
on each flower: sepal length, sepal width, petal length, and petal width. Suppose the
sample means on each feature are computed within the three species. Are the means
on the four traits significantly different across species? This question can be answered
using four separate one-way ANOVAs. A more powerful MANOVA (multivariate
analysis of variance) method compares species on the four features simultaneously.
Discriminant analysis is a technique for comparing groups on multi-dimensional
data. Discriminant analysis can be used with Fisher’s Iris data to find the linear com-
binations of the flower features that best distinguish species. The linear combinations
are optimally selected, so insignificant differences on one or all features may be sig-
nificant (or better yet, important) when the features are considered simultaneously!
Furthermore, the discriminant analysis could be used to classify flowers into one of
these three species when their species is unknown.
MANOVA, discriminant analysis, and classification are primarily inferential
techniques.
Example: −45° rotation A plot of data on two features X1 and X2 is given below.
Also included is a plot for the two linear combinations
\[ Y_1 = \frac{1}{\sqrt{2}} (X_1 + X_2) \quad \text{and} \quad Y_2 = \frac{1}{\sqrt{2}} (X_2 - X_1) . \]
[Figure: left, data plotted on the X1 and X2 axes with the Y1 and Y2 axes overlaid at 45°; right, the same data plotted on the Y1 and Y2 axes.]
The √2 divisor in Y1 and Y2 does not alter the interpretation of these linear
combinations: Y1 is essentially the sum of X1 and X2 , whereas Y2 is essentially the
difference between X2 and X1 .
Example: Two groups The plot below shows data on two features X1 and X2
from two distinct groups.
[Figure: left, group 1 and group 2 plotted on the X1 and X2 axes with the Y1 and Y2 axes overlaid at angle θ; right, the same data plotted on the Y1 and Y2 axes.]
If you compare the groups on X1 and X2 separately, you may find no significant
differences because the groups overlap substantially on each feature. The plot on the
right was obtained by rotating the coordinate axes −θ degrees, and then plotting the
data relative to the new coordinate axes. The rotation corresponds to creating two
linear combinations:
Y1 = cos(θ)X1 + sin(θ)X2
Y2 = − sin(θ)X1 + cos(θ)X2 .
The two groups differ substantially on Y2 . This linear combination is used with
discriminant analysis and MANOVA to distinguish between the groups.
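The rotation formulas can be tried out numerically. The sketch below uses simulated data (not the data in the plots) and a helper, `rotate.axes()`, that is hypothetical and written only for this illustration. With θ = 45° it reproduces the sum and difference combinations from the earlier rotation example.

```r
# Rotate the coordinate axes by theta (radians); the rows of A hold the
# coefficients of Y1 and Y2 from the two equations above.
rotate.axes <- function(X, theta) {
  A <- rbind(c( cos(theta), sin(theta)),
             c(-sin(theta), cos(theta)))
  Y <- X %*% t(A)
  colnames(Y) <- c("Y1", "Y2")
  Y
}

set.seed(1)                                  # simulated data, for illustration only
X <- cbind(X1 = rnorm(50), X2 = rnorm(50))
Y <- rotate.axes(X, theta = pi / 4)          # 45 degrees

# With theta = 45 degrees the rotation gives the scaled sum and difference
all.equal(unname(Y[, "Y1"]), unname((X[, "X1"] + X[, "X2"]) / sqrt(2)))
all.equal(unname(Y[, "Y2"]), unname((X[, "X2"] - X[, "X1"]) / sqrt(2)))

# A rotation leaves the total variance unchanged
all.equal(sum(diag(cov(X))), sum(diag(cov(Y))))
```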
The linear combinations used in certain multivariate methods do not correspond
to a rotation of the original coordinate axes. However, the pictures given above
should provide some insight into the motivation for creating linear combinations
of two features. The ideas extend to three or more features, but are more difficult to
represent visually.
or as the row-vector $X_i' = (X_{i1}, X_{i2}, \ldots, X_{ip})$. Here $X_{ij}$ is the value on the $j$th variable.
Two subscripts are needed for the data values. One subscript identifies the individual
and the other subscript identifies the feature.
A matrix is a rectangular array of numbers or variables. A data set can be viewed
as a matrix with n rows and p columns, where n is the sample size. Each row contains
data for a given individual:
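\[
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix} .
\]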
Vector and matrix notation are used for summarizing multivariate data. For example,
the sample mean vector is
\[ \bar{X} = \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_p \end{bmatrix} , \]
where $\bar{X}_j$ is the sample average on the $j$th feature. Using matrix algebra, $\bar{X}$ is defined
using a familiar formula:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i . \]
where
\[ s_{ii} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)^2 \]
is the sample variance for the $i$th feature, and
\[ s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j) \]
is the sample covariance between the $i$th and $j$th features. The subscripts on the
elements in S identify where the element is found in the matrix: $s_{ij}$ is stored in the
$i$th row and the $j$th column. The variances are found on the main diagonal of the
matrix. The covariances are off-diagonal elements. S is symmetric, meaning that
the elements above the main diagonal are a reflection of the entries below the main
diagonal. More formally, $s_{ij} = s_{ji}$.
Matrix algebra allows you to express S using a formula analogous to the sample
variance for a single feature:
\[ S = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})(X_k - \bar{X})' . \]
Here $(X_k - \bar{X})(X_k - \bar{X})'$ is the matrix product of a column vector with p entries
times a row vector with p entries. This matrix product is a p × p matrix with
$(X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j)$ in the $i$th row and $j$th column. The matrix products are added
up over all n observations and then divided by n − 1.
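The sum-of-outer-products formula can be verified directly. This sketch uses R's built-in `iris` measurements purely as convenient numeric data (an assumption for illustration) and compares the hand-built matrix with `cov()`.

```r
# Numeric features from R's built-in iris data (n = 150, p = 4)
X    <- as.matrix(iris[, 1:4])
n    <- nrow(X)
xbar <- colMeans(X)                        # sample mean vector

# Sum the p x p outer products (X_k - xbar)(X_k - xbar)' over all observations
S <- matrix(0, ncol(X), ncol(X), dimnames = list(colnames(X), colnames(X)))
for (k in seq_len(n)) {
  d <- X[k, ] - xbar                       # deviation of observation k from the mean vector
  S <- S + d %o% d                         # outer product is a p x p matrix
}
S <- S / (n - 1)

all.equal(S, cov(X))                       # agrees with the built-in estimator
isSymmetric(S)                             # s_ij = s_ji
```

In practice `cov(X)` is the idiomatic call; the loop is written out only to mirror the formula term by term.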
The interpretation of covariances is enhanced by standardizing them to give
correlations. The sample correlation matrix is denoted by the p × p symmetric matrix
\[
R = \begin{bmatrix}
r_{11} & r_{12} & \cdots & r_{1p} \\
r_{21} & r_{22} & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & r_{pp}
\end{bmatrix} .
\]
The $i$th row and $j$th column element of R is the correlation between the $i$th and $j$th
features. The diagonal elements are one: $r_{ii} = 1$. The off-diagonal elements satisfy
\[ r_{ij} = r_{ji} = \frac{s_{ij}}{\sqrt{s_{ii} s_{jj}}} . \]
In many applications the data are standardized to have mean 0 and variance 1 on
each feature. The data are standardized through the so-called Z-score transformation:
$(X_{ki} - \bar{X}_i)/\sqrt{s_{ii}}$ which, on each feature, subtracts the mean from each observation
and divides by the corresponding standard deviation. The sample variance-covariance
matrix for the standardized data is the correlation matrix R for the raw data.
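The last claim is easy to confirm numerically. A minimal sketch, again using the built-in `iris` measurements only as example data:

```r
X <- as.matrix(iris[, 1:4])                # built-in data, used only as an illustration

# Z-score each column: subtract the mean, divide by the standard deviation sqrt(s_ii)
Z <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Z, 2, sqrt(diag(cov(X))), "/")

round(colMeans(Z), 10)                     # each feature now has mean 0
diag(cov(Z))                               # and variance 1
all.equal(cov(Z), cor(X))                  # covariance of Z-scores = correlation of raw data
```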
Example: Let X1 , X2 , and X3 be the reaction times for three visual stimuli named
A, B and C, respectively. Suppose you are given the following summaries based on a
sample of 30 individuals:
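\[
\bar{X} = \begin{bmatrix} 4 \\ 5 \\ 4.7 \end{bmatrix} , \quad
S = \begin{bmatrix} 2.26 & 2.18 & 1.63 \\ 2.18 & 2.66 & 1.82 \\ 1.63 & 1.82 & 2.47 \end{bmatrix} , \quad
R = \begin{bmatrix} 1.00 & 0.89 & 0.69 \\ 0.89 & 1.00 & 0.71 \\ 0.69 & 0.71 & 1.00 \end{bmatrix} .
\]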
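The correlation matrix in this example follows from S alone. The sketch below enters S as given and applies the formula for $r_{ij}$ both by hand and with the base R helper `cov2cor()`; the row and column labels A, B, C are the stimulus names.

```r
# Sample covariance matrix from the stimuli example above
S <- matrix(c(2.26, 2.18, 1.63,
              2.18, 2.66, 1.82,
              1.63, 1.82, 2.47),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# r_ij = s_ij / sqrt(s_ii * s_jj), written out and via the base R helper
sdev   <- sqrt(diag(S))                    # standard deviations
R.hand <- S / (sdev %o% sdev)
R      <- cov2cor(S)

all.equal(R.hand, R)
round(R, 2)
```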
\[ Y_1 = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p . \]
and
\[ s_Y^2 = \sum_{ij} a_i a_j s_{ij} = a' S a , \]
where $\bar{X}$ and S are the sample mean vector and sample variance-covariance matrix
for $X' = (X_1, X_2, \ldots, X_p)$.
Similarly, the sample covariance between $Y_1$ and
\[ Y_2 = b' X = b_1 X_1 + b_2 X_2 + \cdots + b_p X_p \]
is
\[ s_{Y_1, Y_2} = \sum_{ij} a_i b_j s_{ij} = a' S b = b' S a . \]
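Example: In the stimuli example, the total reaction time per individual is
\[ Y = [1 \; 1 \; 1] \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = X_1 + X_2 + X_3 . \]
\[ \bar{Y} = [1 \; 1 \; 1] \, \bar{X} = [1 \; 1 \; 1] \begin{bmatrix} 4 \\ 5 \\ 4.7 \end{bmatrix} = 4 + 5 + 4.7 = 13.7 . \]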
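The same matrix formulas give the variance of the total and its covariance with any other linear combination. In this sketch the summaries are those of the stimuli example, while the second coefficient vector `b` (the difference A minus C) is a hypothetical choice made only to illustrate the covariance formula.

```r
# Summaries from the stimuli example
xbar <- c(4, 5, 4.7)
S <- matrix(c(2.26, 2.18, 1.63,
              2.18, 2.66, 1.82,
              1.63, 1.82, 2.47), nrow = 3, byrow = TRUE)

a <- c(1, 1, 1)        # total reaction time, Y = X1 + X2 + X3
b <- c(1, 0, -1)       # hypothetical second combination, X1 - X3

ybar   <- drop(t(a) %*% xbar)      # mean of Y:      a' xbar
s2.Y   <- drop(t(a) %*% S %*% a)   # variance of Y:  a' S a  (sum of all entries of S)
s.Y1Y2 <- drop(t(a) %*% S %*% b)   # covariance:     a' S b

c(mean = ybar, variance = s2.Y, covariance = s.Y1Y2)
```

Because every coefficient in `a` is 1, the variance of the total is just the sum of the three variances plus twice the three covariances.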
covariance matrix. The coefficients in the PCs are eigenvectors of the sample co-
variance matrix. The sum of the variances of the principal components is equal to
the sum of the variances in the original features. An alternative method for PCA
uses standardized data, which is often called PCA on the correlation matrix.
The ordered principal components are uncorrelated variables with progressively
less variation. Principal components are often viewed as separate dimensions corre-
sponding to the collection of features. The variability of each component divided by
the total variability of the components is the proportion of the total variation in the
data captured by each component. If data reduction is your goal, then you might
need only the first few principal components to capture most of the variability in the
data. This issue will be returned to later.
The unit-length constraint on the coefficients in PCA is needed to make the
maximization well-defined. Without this constraint, there does not exist a linear
combination with maximum variation. For example, the variability of an arbitrary linear
combination a1 X1 + a2 X2 + · · · + ap Xp is increased by a factor of 100 when each
coefficient is multiplied by 10!
The principal components are unique only up to a change of the sign for each
coefficient. For example,
PRIN1 = 0.2X1 − 0.4X2 + 0.4X3 + 0.8X4
and
PRIN1 = −0.2X1 + 0.4X2 − 0.4X3 − 0.8X4
have the same variability, so either could play the role of the first principal component.
This non-uniqueness does not have an important impact on the analysis.
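These properties can be illustrated with `eigen()`. The sketch below reuses the covariance matrix S from the stimuli example only as a convenient input (no PCA of those data is reported in these notes); `pc.var()` is a hypothetical helper for the variance of a linear combination.

```r
# Covariance matrix from the stimuli example, reused as a convenient input
S <- matrix(c(2.26, 2.18, 1.63,
              2.18, 2.66, 1.82,
              1.63, 1.82, 2.47), nrow = 3, byrow = TRUE)

e  <- eigen(S, symmetric = TRUE)
a1 <- e$vectors[, 1]                         # coefficients of the first PC

pc.var <- function(a) drop(t(a) %*% S %*% a) # variance of the combination a'X

sum(a1^2)                       # unit length
pc.var(a1)                      # equals the largest eigenvalue, e$values[1]
pc.var(-a1)                     # flipping every sign gives the same variance
pc.var(10 * a1) / pc.var(a1)    # scaling coefficients by 10 scales variance by 100
c(sum(e$values), sum(diag(S)))  # PC variances sum to the total feature variance
```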
[Figure: scatterplot of mean July temperature (july) against mean January temperature (january) for selected U.S. cities, with each point labelled by city name.]
The princomp() procedure is used for PCA. By default the principal components
are computed based on the covariance matrix. The correlation matrix may also be
used (it effectively z-scores the data first) with the cor = TRUE option. The principal
component scores are the values of the principal components across cases. The
principal component scores PRIN1, PRIN2, . . . , PRINp are centered to have mean
zero.
Output from a PCA on the covariance matrix is given. Two principal components
are created because p = 2.
# perform PCA on covariance matrix
temp.pca <- princomp( ~ january + july, data = temp)
# standard deviation and proportion of variation for each component
summary(temp.pca)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 12.3217642 3.0004557
## Proportion of Variance 0.9440228 0.0559772
## Cumulative Proportion 0.9440228 1.0000000
# coefficients for PCs
loadings(temp.pca)
##
## Loadings:
## Comp.1 Comp.2
## january 0.939 0.343
## july 0.343 -0.939
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
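The "Proportion of Variance" row in the summary is each component's variance divided by the total. This short check re-enters the printed standard deviations and Comp.1 loadings (rounded values from the output above), so agreement is only to rounding.

```r
# Standard deviations copied from summary(temp.pca) above
sdev <- c(Comp.1 = 12.3217642, Comp.2 = 3.0004557)

pc.var <- sdev^2                   # variance of each component
prop   <- pc.var / sum(pc.var)     # proportion of total variation

print(rbind(Variance = pc.var, Proportion = prop, Cumulative = cumsum(prop)))

# Loadings for Comp.1 copied from loadings(temp.pca); the coefficient vector has unit length
load1 <- c(january = 0.939, july = 0.343)
sum(load1^2)
```

Note that the "Proportion Var" line printed under the loadings is a different quantity: it summarizes the squared loadings, not the variation explained.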
PCA is effectively doing a location shift (to the origin, zero) and a rotation of
the data. When the correlation is used for PCA (instead of the covariance), it also
first rescales each feature to unit variance, so that every feature contributes equally
before the rotation.
# create small data.frame with endpoints of PC lines through data
line.scale <- c(35, 15) # length of PCA lines to draw
# endpoints of lines to draw
temp.pca.line.endpoints <-
data.frame(PC = c(rep("PC1", 2), rep("PC2", 2))
, x = c(temp.pca$center[1] - line.scale[1] * temp.pca$loadings[1, 1]
, temp.pca$center[1] + line.scale[1] * temp.pca$loadings[1, 1]
, temp.pca$center[1] - line.scale[2] * temp.pca$loadings[1, 2]
, temp.pca$center[1] + line.scale[2] * temp.pca$loadings[1, 2])
, y = c(temp.pca$center[2] - line.scale[1] * temp.pca$loadings[2, 1]
, temp.pca$center[2] + line.scale[1] * temp.pca$loadings[2, 1]
, temp.pca$center[2] - line.scale[2] * temp.pca$loadings[2, 2]
, temp.pca$center[2] + line.scale[2] * temp.pca$loadings[2, 2])
)
temp.pca.line.endpoints
## PC x y
## 1 PC1 -0.7833519 63.61121
## 2 PC1 64.9739769 87.61066
## 3 PC2 26.9525727 89.70179
## 4 PC2 37.2380523 61.52008
# plot original data with PCA vectors overlayed
library(ggplot2)
p1 <- ggplot(temp, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
# good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = id), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC1"), aes(x=x, y=y)
, alpha=0.5)
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC2"), aes(x=x, y=y)
, alpha=0.5)
# label lines
p1 <- p1 + annotate("text"
, x = temp.pca.line.endpoints$x[1]
, y = temp.pca.line.endpoints$y[1]
, label = as.character(temp.pca.line.endpoints$PC[1])
, vjust = 0) #, size = 10)
p1 <- p1 + annotate("text"
, x = temp.pca.line.endpoints$x[3]
, y = temp.pca.line.endpoints$y[3]
, label = as.character(temp.pca.line.endpoints$PC[3])
, hjust = 1) #, size = 10)
p1 <- p1 + labs(title = "Mean temperature in Jan and July for selected cities")
print(p1)
[Figure: Mean temperature in Jan and July for selected cities. Scatterplot of july versus january with points labelled by city id and the PC1 and PC2 axes overlaid.]
library(gridExtra)
grid.arrange(grobs = list(p2, p3), ncol=1, top="Temperature data and PC scores")
[Figure: Temperature data and PC scores. Two panels plotting Comp.2 against Comp.1, with points labelled by city id; the panel title "Same, PC scores" is recoverable.]
5. PRIN1 weights the January temperature about three times as heavily as the July
temperature (loadings 0.939 and 0.343). This is sensible because PRIN1 maximizes
variation among linear combinations of the January and July temperatures. January
temperatures are more variable, so they are weighted more heavily in this linear combination.
6. The PCs PRIN1 and PRIN2 are centered to have mean zero. This explains
why some PRIN1 scores are negative, even though PRIN1 is a weighted average
of the January and July temperatures, each of which is non-negative.
Two built-in plots are useful: the biplot shows the scores together with the directions of
the original features, and the screeplot shows the variance of each component in decreasing order.
[Figure: biplot of Comp.2 versus Comp.1 with january and july directions (left) and screeplot of component variances (right) for temp.pca.]
scales, or when the features have wildly different variances. The features are stan-
dardized to have mean zero and variance one by using the Z-score transformation:
(Obs − Mean)/Std Dev. The PCA is then performed on the standardized data.
temp.z <- temp
# manual z-score
temp.z$january <- (temp.z$january - mean(temp.z$january)) / sd(temp.z$january)
# z-score using R function scale()
temp.z$july <- scale(temp.z$july)
[Figure: standardized july versus standardized january temperatures, with points labelled by city id and the PC1 and PC2 axes overlaid.]
The covariance matrix computed from the standardized data is the correlation
matrix. Thus, principal components based on the standardized data are computed
from the correlation matrix. This is implemented by adding the cor = TRUE option
on the princomp() procedure statement.
# perform PCA on correlation matrix
temp.pca2 <- princomp( ~ january + july, data = temp, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(temp.pca2)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.3339592 0.4696305
## Proportion of Variance 0.8897236 0.1102764
## Cumulative Proportion 0.8897236 1.0000000
# coefficients for PCs
loadings(temp.pca2)
##
## Loadings:
## Comp.1 Comp.2
## january 0.707 0.707
## july 0.707 -0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
# scores are coordinates of each observation on PC scale
head(temp.pca2$scores)
## Comp.1 Comp.2
## 1 1.9964045 0.3286199
## 2 3.3330689 -1.0080444
## 3 1.2566173 -0.3554730
## 4 0.7341125 0.8485470
## 5 -0.4971200 0.2299523
## 6 -0.8492236 -0.0386098
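With only two standardized features, the PCA on the correlation matrix has a closed form: the eigenvalues are 1 + r and 1 − r, and every loading is ±1/√2. The sketch below starts from the printed standard deviations; the correlation `r` is inferred from them (about 0.78) rather than reported in the output, so treat it as an assumption.

```r
# Standard deviations copied from summary(temp.pca2) above
sdev <- c(1.3339592, 0.4696305)
sum(sdev^2)                 # total variance = p = 2 for a correlation-matrix PCA

# For p = 2 the eigenvalues of the correlation matrix are 1 + r and 1 - r,
# so the January-July correlation can be backed out of the first component.
r <- sdev[1]^2 - 1          # inferred value, not printed in the output above
R <- matrix(c(1, r, r, 1), nrow = 2)
e <- eigen(R, symmetric = TRUE)

sqrt(e$values)              # reproduces the printed standard deviations
abs(e$vectors)              # every loading is 1/sqrt(2) = 0.707 in absolute value
```

This is why the correlation-based loadings are ±0.707 regardless of how strongly the two features are correlated.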
This plot is the same except for the top/right scale around the biplot and the
variance scale on the screeplot.
# a couple built-in plots
par(mfrow=c(1,2))
biplot(temp.z.pca)
screeplot(temp.z.pca)
[Figure: biplot of Comp.2 versus Comp.1 with january and july directions (left) and screeplot of component variances (right) for temp.z.pca.]
The standardized features are dimensionless, so the PCs are not influenced by the
original units of measure, nor are they affected by the variability in the features. The
only important factor is the correlation between the features, which is not changed
by standardization.
The PCs from the correlation matrix are
PRIN1 = +0.707 JAN + 0.707 JULY
and
PRIN2 = −0.707 JAN + 0.707 JULY.
PCA is an exploratory tool, so neither a PCA on the covariance matrix nor a
PCA on the correlation matrix is always the “right” method. I often do both and see
which analysis is more informative.
PRIN2 is a comparison of January and July temperatures (signs of the loadings: JAN
is − and JULY is +):
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
## 3D scatterplot
library(scatterplot3d)
par(mfrow=c(1,1))
with(shells, {
scatterplot3d(x=length
, y=width
, z=height
, main="Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color="blue", pch=19 # filled blue circles
#, highlight.3d = TRUE # makes color change with z-axis value
)
})
#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height) })
[Figure: Shells 3D Scatterplot of length, width, and height, and a scatterplot matrix of the same three features with pairwise correlations (one displayed value is 0.973).]
Question: Can PRIN2 and PRIN3, which have relatively little variation, be used
in any meaningful way? To think about this, suppose the variability in PRIN2 and
PRIN3 was zero.
The first principal component accounts for 98% of the total variability in the
standardized data. For a PCA on the correlation matrix, the total variability is always
the number p of features because it is the sum of the variances of the standardized
features, each of which is 1. Here, p = 3. Little information is lost by summarizing
the standardized data using PRIN1, which is essentially an average of length, width,
and height. PRIN2 and PRIN3 are measures of shape.
The loadings in the first principal component are approximately equal because the
correlations between pairs of features are almost identical. The standardized features
are essentially interchangeable with regard to the construction of the first principal
component, so they must be weighted similarly. True, but not obvious.
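To see why equal correlations force equal loadings, consider an idealized correlation matrix in which every pair of features has the same correlation. The value r = 0.97 below is a hypothetical round number chosen to resemble the shells data; it is not the actual correlation matrix.

```r
p <- 3
r <- 0.97                        # hypothetical common correlation (assumption)
R <- matrix(r, p, p)
diag(R) <- 1                     # equicorrelation matrix

e <- eigen(R, symmetric = TRUE)

sum(e$values)                    # total variability equals p
e$values[1] / p                  # first PC share: (1 + (p - 1) r) / p
abs(e$vectors[, 1])              # equal loadings, each 1/sqrt(p)
```

With r = 0.97 the first component carries (1 + 2 × 0.97)/3 = 98% of the total, in line with the figure quoted above.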
the feature scores onto the axes of maximal and minimal variation, and then rotating
the axes appropriately.
One can show mathematically that PRIN1 is the best (in some sense) linear com-
bination of the two features to predict the original two features simultaneously. In-
tuitively, this is plausible. In a PCA, you know the direction for the axis of maximal
variation. Given the value of PRIN1, you get a good prediction of the original feature
scores by moving PRIN1 units along the axis of maximal variation in the feature
space.
[Figure: left, Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid; right, the same data plotted as PRIN2 versus PRIN1. Labelled points: (0.99, 0.33) and (1, −0.3).]
The LS line from regressing feature 2 on feature 1 gives the best prediction for
feature 2 scores when the feature 1 score is known. Similarly, the LS line from
regressing feature 1 on feature 2 gives the best prediction for feature 1 scores when
the feature 2 score is known. PRIN1 is the best linear combination of features 1 and
2 to predict both features simultaneously. Note that feature 1 and feature 2 are linear
combinations as well!
This idea generalizes. The first k principal components give the best simultaneous
prediction of the original p features, among all possible choices of k uncorrelated
unit-length linear combinations of the features. Prediction of the original features
improves as additional components are added, but the improvement is slight when
the added principal components have little variability. Thus, summarizing the data
using the principal components with maximum variation is a sensible strategy for
data reduction.
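The prediction idea can be made concrete by reconstructing the centered data from the first k principal components and measuring what is left over. This sketch uses the built-in `iris` measurements as example data (an assumption for illustration) and `princomp()` as elsewhere in these notes.

```r
# Centered example data (built-in iris measurements, illustration only)
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
pc <- princomp(X)
V.all <- unclass(loadings(pc))               # p x p matrix of PC coefficients

# Predict the original features from the first k PC scores; record squared error
recon.error <- sapply(1:ncol(X), function(k) {
  V    <- V.all[, 1:k, drop = FALSE]
  Xhat <- X %*% V %*% t(V)                   # projection onto the first k PCs
  sum((X - Xhat)^2)
})
names(recon.error) <- paste0("k=", 1:ncol(X))
recon.error

# The error left after k components is the variation in the discarded components
# (princomp() uses divisor n, so multiply the variances by n)
discarded <- rev(cumsum(rev(pc$sdev^2))) * nrow(X)
all.equal(unname(recon.error[1:3]), unname(discarded[2:4]))
```

The error drops sharply after the first component and only slightly afterwards, which is the pattern described above when later components have little variability.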
[Figure: left, group 1 and group 2 plotted on Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid at angle θ; right, the same groups plotted as PRIN2 versus PRIN1.]
Although PRIN1 explains most of the variability in the two features (ignoring
the groups), little of the total variation is due to group differences. If the researcher
reduced the two features to the first principal component, he would be throwing away
most of the information for distinguishing between the groups. PRIN2 accounts for
little of the total variation in the features, but most of the variation in PRIN2 is due
to group differences.
If a comparison of the two groups was the primary interest, then the researcher
should use discriminant analysis instead. Although there is little gained by reduc-
ing two variables to one, this principle always applies in multivariate problems. In
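[Figure: left, group 1 and group 2 plotted on Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid at angle θ; right, the same groups plotted as PRIN2 versus PRIN1.]
[Figure: left, Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid and two highlighted points; right, the same data plotted as PRIN2 versus PRIN1.]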
After a severe storm in 1898, a number of sparrows were taken to the biological
laboratory at the University of Rhode Island. H. Bumpus [1] measured several morphological
characteristics on each bird. The data here correspond to five measurements on a
sample of 49 females. The measurements are the total length, alar extent, beak-head
length, humerus length, and length of keel of sternum.
[1] Bumpus, Hermon C. 1898. Eleventh lecture. The elimination of the unfit as illustrated by the
introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol.
Lectures: Woods Hole Marine Biological Laboratory, 209–225.
http://media-2.web.britannica.com/eb-media/46/51946-004-D003BC49.gif
Let us look at the output, paying careful attention to the interpretations of the
principal components (zeroing out small loadings). How many components seem
sufficient to capture the total variation in the morphological measurements?
#### Example: Sparrows
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_sparrows.dat"
sparrows <- read.table(fn.data, header = TRUE)
str(sparrows)
## 'data.frame': 49 obs. of 5 variables:
## $ Total : int 156 153 155 157 164 158 161 157 158 155 ...
## $ Alar : int 245 240 243 238 248 240 246 235 244 236 ...
## $ BeakHead: num 31.6 31 31.5 30.9 32.7 31.3 32.3 31.5 31.4 30.3 ...
## $ Humerus : num 18.5 18.4 18.6 18.4 19.1 18.6 19.3 18.1 18.5 18.5 ...
## $ Keel : num 20.5 20.6 20.3 20.2 21.2 22 21.8 19.8 21.6 20.1 ...
head(sparrows)
## Total Alar BeakHead Humerus Keel
## 1 156 245 31.6 18.5 20.5
## 2 153 240 31.0 18.4 20.6
## 3 155 243 31.5 18.6 20.3
## 4 157 238 30.9 18.4 20.2
## 5 164 248 32.7 19.1 21.2
## 6 158 240 31.3 18.6 22.0
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of Total, Alar, BeakHead, Humerus, and Keel. Pairwise correlations shown: Total-Alar 0.735, Total-BeakHead 0.662, Alar-BeakHead 0.674, Total-Humerus 0.645, Alar-Humerus 0.769, BeakHead-Humerus 0.763, Total-Keel 0.608, Alar-Keel 0.531, BeakHead-Keel 0.529, Humerus-Keel 0.609.]
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Total 0.452 0.058 0.689 0.422 0.375
## Alar 0.461 -0.301 0.345 -0.545 -0.530
## BeakHead 0.450 -0.326 -0.453 0.607 -0.342
## Humerus 0.470 -0.189 -0.409 -0.390 0.651
## Keel 0.399 0.874 -0.184 -0.073 -0.194
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.2 0.2 0.2 0.2 0.2
## Cumulative Var 0.2 0.4 0.6 0.8 1.0
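As an aid to reading the loadings, the sketch below re-enters the printed matrix, confirms that the columns are (to rounding) unit-length and mutually orthogonal, and zeroes out small loadings. The cutoff of 0.3 is a hypothetical choice for illustration, not a rule from these notes.

```r
# Loadings copied (rounded to 3 decimals) from the output above
L <- matrix(c(0.452,  0.058,  0.689,  0.422,  0.375,
              0.461, -0.301,  0.345, -0.545, -0.530,
              0.450, -0.326, -0.453,  0.607, -0.342,
              0.470, -0.189, -0.409, -0.390,  0.651,
              0.399,  0.874, -0.184, -0.073, -0.194),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("Total", "Alar", "BeakHead", "Humerus", "Keel"),
                            paste0("Comp.", 1:5)))

# Columns are orthonormal: t(L) %*% L is the identity up to rounding error
round(t(L) %*% L, 2)

# Zero out small loadings to highlight each component's main contributors
cutoff   <- 0.3                              # hypothetical threshold
L.zeroed <- ifelse(abs(L) < cutoff, 0, L)
print(L.zeroed)
```

After zeroing, Comp.1 keeps all five features with similar positive weights, while Comp.2 is dominated by Keel.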
# a couple built-in plots
par(mfrow=c(1,2))
biplot(sparrows.pca)
screeplot(sparrows.pca)
[Figure: biplot of Comp.2 versus Comp.1 with directions for Total, Alar, BeakHead, Humerus, and Keel (left) and screeplot of component variances (right) for sparrows.pca.]
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of the BGS variables ID, WT2, HT2, WT9, HT9, LG9, ST9, WT18, HT18, LG18, ST18, and SOMA, with pairwise correlations in the upper panels.]
As an aside, there are other ways to visualize the linear relationships more quickly.
The ellipse library has a function plotcorr(), though its output is less than ideal.
An improvement has been made with an updated version [2] of the plotcorr() function.
## my.plotcorr example, see associated R file for my.plotcorr() function code
[2] http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/
[Figure: "Correlations" plot matrix of the BGS variables from my.plotcorr(), showing the pairwise correlations rounded to two decimals.]
It is reasonable to expect that the characteristics measured over time, for example
HT2, HT9, and HT18, are strongly correlated. Evidence supporting this hypothesis
is given in the following output, which summarizes correlations within subsets of
the predictors. Two of the three subsets include measures over time on the same
characteristic.
cor(bgs[,c("WT2", "WT9", "WT18")])
## WT2 WT9 WT18
## WT2 1.0000000 0.5792217 0.2158735
## WT9 0.5792217 1.0000000 0.7089029
## WT18 0.2158735 0.7089029 1.0000000
cor(bgs[,c("HT2", "HT9", "HT18")])
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6, p7, p8), ncol=3, top="Selected BGS variables")
[Figure: "Selected BGS variables", eight scatterplots: WT9 vs WT2, WT18 vs WT2, WT18 vs WT9; HT9 vs HT2, HT18 vs HT2, HT18 vs HT9; LG18 vs LG9; ST18 vs ST9.]
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.2882526 0.5834426
## Proportion of Variance 0.8297974 0.1702026
## Cumulative Proportion 0.8297974 1.0000000
print(loadings(bgsST.pca), cutoff = 0)
##
## Loadings:
## Comp.1 Comp.2
## ST9 0.707 0.707
## ST18 0.707 -0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
Cluster Analysis
14.1 Introduction
Cluster analysis is an exploratory tool for locating and grouping observations that
are similar to each other across features. Cluster analysis can also be used to group
variables that are similar across observations.
Clustering or grouping is distinct from discriminant analysis and classification. In
discrimination problems there are a given number of known groups to compare or
distinguish. The aim in cluster analysis is to define groups based on similarities. The
clusters are then examined for underlying characteristics that might help explain the
grouping.
There are a variety of clustering algorithms [1]. I will discuss a simple (agglomerative)
hierarchical clustering method for grouping observations. The method begins with
each observation as an individual cluster or group. The two most similar observations
are then grouped, giving one cluster with two observations. The remaining clusters
have one observation each. The clusters are then joined sequentially until one
cluster is left.
14.1.1 Illustration
To illustrate the steps, suppose eight observations are collected on two features X1
and X2 . A plot of the data is given below.
Step 1. Each observation is a cluster.
Step 2. Form a new cluster by grouping the two clusters that are most similar, or
closest to each other. This leaves seven clusters.
Step 3. Form a new cluster by grouping the two clusters that are most similar, or
closest to each other. This leaves six clusters.
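The merging steps can be traced directly in R. The following is a minimal sketch, assuming eight made-up points (the `toy` data frame is hypothetical and is not the data plotted in these notes) and average linkage. hclust() records one join per step in `$merge` and the distance at which the join happened in `$height`.

```r
# Hypothetical data: eight observations on two features (not the notes' data)
toy <- data.frame(x1 = c(4, 5, 6, 12, 13, 14, 19, 8)
                , x2 = c(8, 9, 2,  6,  5,  1,  9, 3))

# Step 1: every observation starts as its own cluster
toy.dist <- dist(toy)                          # Euclidean distances between points
toy.hc   <- hclust(toy.dist, method = "average")

# Steps 2, 3, ...: each row of $merge is one join.
# Negative entries are single observations; positive entries refer to a
# cluster formed at an earlier step (its row number in $merge).
n.steps <- nrow(toy.hc$merge)
merge.steps <- data.frame(step       = seq_len(n.steps)
                        , left       = toy.hc$merge[, 1]
                        , right      = toy.hc$merge[, 2]
                        , height     = toy.hc$height
                        , n.clusters = nrow(toy) - seq_len(n.steps))
print(merge.steps)
```

The first row always joins two single observations (both entries negative), and the last row leaves one cluster containing everything.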
[1] http://cran.r-project.org/web/views/Cluster.html
[Figure: the eight observations plotted as x2 versus x1 (left) and on the first two principal components, Comp.2 versus Comp.1 (right).]
Here are the results of one distance measure, which will be discussed in more detail
after the plots. The clustering algorithm order for average linkage is plotted here.
# create distance matrix between points
intro.dist <- dist(intro)
intro.hc.average <- hclust(intro.dist, method = "average")
library(cluster)
for (i.clus in 7:2) {
clusplot(intro, cutree(intro.hc.average, k = i.clus)
, color = TRUE, labels = 2, lines = 0
, cex = 2, cex.txt = 1, col.txt = "gray20"
, main = paste(i.clus, "clusters"), sub = NULL)
}
[Figure: clusplot panels for 7, 6, 5, 4, 3, and 2 clusters (average linkage), each plotting Component 2 versus Component 1.]
[2] There are many ways to create dendrograms in R; see http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html for several examples.
[Figure: three dendrograms (Height axis) of the eight observations, each shown above a clusplot of Component 2 versus Component 1.]
mammal v1 v2 v3 v4 v5 v6 v7 v8
1 Brown Bat 2 3 1 1 3 3 3 3
2 Mole 3 2 1 0 3 3 3 3
3 Silver Hair Bat 2 3 1 1 2 3 3 3
4 Pigmy Bat 2 3 1 1 2 2 3 3
5 House Bat 2 3 1 1 1 2 3 3
6 Red Bat 1 3 1 1 2 2 3 3
7 Pika 2 1 0 0 2 2 3 3
8 Rabbit 2 1 0 0 3 2 3 3
9 Beaver 1 1 0 0 2 1 3 3
10 Groundhog 1 1 0 0 2 1 3 3
11 Gray Squirrel 1 1 0 0 1 1 3 3
12 House Mouse 1 1 0 0 0 0 3 3
13 Porcupine 1 1 0 0 1 1 3 3
14 Wolf 3 3 1 1 4 4 2 3
15 Bear 3 3 1 1 4 4 2 3
16 Raccoon 3 3 1 1 4 4 3 2
17 Marten 3 3 1 1 4 4 1 2
18 Weasel 3 3 1 1 3 3 1 2
19 Wolverine 3 3 1 1 4 4 1 2
20 Badger 3 3 1 1 3 3 1 2
21 River Otter 3 3 1 1 4 3 1 2
22 Sea Otter 3 2 1 1 3 3 1 2
23 Jaguar 3 3 1 1 3 2 1 1
24 Cougar 3 3 1 1 3 2 1 1
25 Fur Seal 3 2 1 1 4 4 1 1
26 Sea Lion 3 2 1 1 4 4 1 1
27 Grey Seal 3 2 1 1 3 3 2 2
28 Elephant Seal 2 1 1 1 4 4 1 1
29 Reindeer 0 4 1 0 3 3 3 3
30 Elk 0 4 1 0 3 3 3 3
31 Deer 0 4 0 0 3 3 3 3
32 Moose 0 4 0 0 3 3 3 3
The program below produces cluster analysis summaries for the mammal teeth
data.
# create distance matrix between points
teeth.dist <- dist(teeth[,-1])
# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
, main = paste("Teeth with average linkage") # and", i.clus, "clusters")
, labels = teeth[,1])
# rect.hclust(teeth.hc.average, k = i.clus)
[Figure: dendrogram of the mammal teeth data, hclust (*, "average") on teeth.dist, with leaves labeled by mammal.]
[3] There are thirty in this package: http://cran.r-project.org/web/packages/NbClust/NbClust.pdf
## $ v1 : int 2 3 2 2 2 1 2 2 1 1 ...
## $ v2 : int 3 2 3 3 3 3 1 1 1 1 ...
## $ v3 : int 1 1 1 1 1 1 0 0 0 0 ...
## $ v4 : int 1 0 1 1 1 1 0 0 0 0 ...
## $ v5 : int 3 3 2 2 1 2 2 3 2 2 ...
## $ v6 : int 3 3 3 2 2 2 2 2 1 1 ...
## $ v7 : int 3 3 3 3 3 3 3 3 3 3 ...
## $ v8 : int 3 3 3 3 3 3 3 3 3 3 ...
# Because the data type is "int" for integer, the routine fails (error expected)
NbClust(teeth[,-1], method = "average", index = "all")
## Error in solve.default(W): system is computationally singular: reciprocal condition
##   number = 1.51394e-16
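# However, change the data type from integer to numeric and the call runs.
# Note: as.numeric() drops the matrix dimensions, so teeth.num is a single vector
# of all 8 x 32 values rather than a 32-row matrix.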
teeth.num <- as.numeric(as.matrix(teeth[,-1]))
NC.out <- NbClust(teeth.num, method = "average", index = "all")
## Warning in max(DiffLev[, 5], na.rm = TRUE): no non-missing arguments to max; returning
##   -Inf
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## Warning in matrix(c(results), nrow = 2, ncol = 26): data length [51] is not a sub-multiple
##   or multiple of the number of rows [2]
## Warning in matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number clusters",
##   : data length [51] is not a sub-multiple or multiple of the number of rows [2]
## *******************************************************************
## * Among all indices:
## * 1 proposed 4 as the best number of clusters
## * 5 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 5
##
##
## *******************************************************************
# most of the methods suggest 4 or 5 clusters, as do the plots
NC.out$Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5 5 4 5.0000 5.000 5 -Inf
## Value_Index Inf Inf Inf 369.1341 7787.404 414 5
[Figure: NbClust diagnostic plots versus number of clusters (2 to 14): Hubert statistic values with second differences, and D index values with second differences.]
There are several statistical methods for selecting the number of clusters. No
method is best. One common suggestion is to use the cubic clustering criterion (ccc),
a pseudo-F statistic, and a pseudo-t statistic. At a given step, the pseudo-t statistic
is the distance between the centers of the two clusters to be merged, relative to the
variability within these clusters. A large pseudo-t statistic implies that the clusters
to be joined are relatively dissimilar (i.e., much more variability between the clusters
to be merged than within these clusters). The pseudo-F statistic at a given step
measures the variability among the centers of the current clusters relative to the
variability within the clusters. A large pseudo-F value implies that the clusters
merged consist of fairly similar observations. As clusters are joined, the pseudo-t
statistic tends to increase, and the pseudo-F statistic tends to decrease. The ccc is
more difficult to describe.
The RSQ summary is also useful for determining the number of clusters. RSQ is
a pseudo-R² statistic that measures the proportion of the total variation explained
by the differences among the existing clusters at a given step. RSQ will typically
decrease as the pseudo-F statistic decreases.
A common recommendation on cluster selection is to choose a number of clusters
where the values of ccc and the pseudo-F statistic are relatively high (compared to
what you observe with other numbers of clusters), and where the pseudo-t statistic
is relatively low and increases substantially at the next proposed merger. For the
mammal teeth data this corresponds to four clusters. Six clusters is a sensible second
choice.
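To make the RSQ and pseudo-F summaries concrete, here is a minimal sketch that computes them for several cuts of one tree. It assumes the pseudo-F statistic is the usual ratio of between-cluster to within-cluster mean squares; the helper `cluster.summaries()` and the simulated data `sim` are hypothetical names introduced only for illustration, and the ccc and pseudo-t statistics are not computed.

```r
# Hypothetical helper: RSQ and pseudo-F for a k-cluster cut of an hclust tree
cluster.summaries <- function(dat, hc, k) {
  dat <- as.matrix(dat)
  cl  <- cutree(hc, k = k)
  n   <- nrow(dat)
  # total SS: squared deviations from the overall mean vector
  ss.total  <- sum(scale(dat, center = TRUE, scale = FALSE)^2)
  # within SS: squared deviations from each cluster's own mean vector
  ss.within <- sum(sapply(split(as.data.frame(dat), cl), function(d) {
    sum(scale(as.matrix(d), center = TRUE, scale = FALSE)^2)
  }))
  ss.between <- ss.total - ss.within
  c(k        = k
  , RSQ      = ss.between / ss.total
  , pseudo.F = (ss.between / (k - 1)) / (ss.within / (n - k)))
}

# Hypothetical data: three simulated groups of 10 points each
set.seed(1)
sim <- rbind(matrix(rnorm(20, mean =  0), ncol = 2)
           , matrix(rnorm(20, mean =  6), ncol = 2)
           , matrix(rnorm(20, mean = 12), ncol = 2))
sim.hc <- hclust(dist(sim), method = "average")

# One row per candidate number of clusters
sim.sum <- t(sapply(2:6, function(k) cluster.summaries(sim, sim.hc, k)))
print(round(sim.sum, 3))
```

Reading down the table mirrors the recommendation above: look for a number of clusters where pseudo-F is relatively high and where RSQ drops noticeably if one more merge is made.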
# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
, main = paste("Teeth with average linkage and", i.clus, "clusters")
, labels = teeth[,1])
rect.hclust(teeth.hc.average, k = i.clus)
[Figure: dendrogram of the mammal teeth data with average linkage, hclust (*, "average") on teeth.dist, with rectangles marking the i.clus clusters.]
[Figure: clusplot of the mammal teeth clusters, Component 2 versus Component 1, with points labeled by observation number.]
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 14 Wolf 3 3 1 1 4 4 2 3
## 15 Bear 3 3 1 1 4 4 2 3
## 16 Raccoon 3 3 1 1 4 4 3 2
## 17 Marten 3 3 1 1 4 4 1 2
## 18 Weasel 3 3 1 1 3 3 1 2
## 19 Wolverine 3 3 1 1 4 4 1 2
## 20 Badger 3 3 1 1 3 3 1 2
## 21 River_Otter 3 3 1 1 4 3 1 2
## 22 Sea_Otter 3 2 1 1 3 3 1 2
## 23 Jaguar 3 3 1 1 3 2 1 1
## 24 Cougar 3 3 1 1 3 2 1 1
## 25 Fur_Seal 3 2 1 1 4 4 1 1
## 26 Sea_Lion 3 2 1 1 4 4 1 1
## 27 Grey_Seal 3 2 1 1 3 3 2 2
## 28 Elephant_Seal 2 1 1 1 4 4 1 1
## [1] "Cluster 5 ----------------------------- "
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 29 Reindeer 0 4 1 0 3 3 3 3
## 30 Elk 0 4 1 0 3 3 3 3
## 31 Deer 0 4 0 0 3 3 3 3
## 32 Moose 0 4 0 0 3 3 3 3
Below are the 1976 crude birth and death rates in 74 countries. A data plot and
output from complete and single linkage cluster analyses are given.
#### Example: Birth and death rates
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch14_birthdeath.dat"
bd <- read.table(fn.data, header = TRUE)
str(bd)
## 'data.frame': 74 obs. of 3 variables:
## $ country: Factor w/ 74 levels "afghan","algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ birth : int 52 50 47 22 16 12 47 12 36 17 ...
## $ death : int 30 16 23 10 8 13 19 12 10 10 ...
[Figure: scatterplot of death rate versus birth rate for the 74 countries, with points labeled by country.]
[Figure: NbClust diagnostic plots versus number of clusters (2 to 14): Hubert statistic values and Dindex values, each with second differences.]
Let’s try 3 clusters based on the dendrogram plots below. First we’ll use complete
linkage.
# create distance matrix between points
bd.dist <- dist(bd[,-1])
# create dendrogram
bd.hc.complete <- hclust(bd.dist, method = "complete")
plot(bd.hc.complete, hang = -1
[Figure: dendrogram of the birth and death rate data, hclust (*, "complete") on bd.dist, with leaves labeled by country.]
[Figure: clusplot of the complete linkage clusters, Component 2 versus Component 1, with points labeled by observation number.]
## 59 sudan 49 17 1
## 62 syria 47 14 1
## 63 tanzania 47 17 1
## 67 uganda 48 17 1
## 70 upp_volta 50 28 1
## 72 vietnam 42 17 1
## 74 zaire 45 18 1
## [1] "Cluster 2 ----------------------------- "
## country birth death cut.comp
## 4 argentina 22 10 2
## 5 australia 16 8 2
## 6 austria 12 13 2
## 8 belguim 12 12 2
## 10 bulgaria 17 10 2
## 13 canada 17 7 2
## 14 chile 22 7 2
## 18 cuba 20 6 2
## 19 czechosla 19 11 2
## 23 france 14 11 2
## 24 german_dr 12 14 2
## 25 german_fr 10 12 2
## 27 greece 16 9 2
## 29 hungary 18 12 2
## 34 italy 14 10 2
## 36 japan 16 6 2
## 46 netherlan 13 8 2
## 51 poland 20 9 2
## 52 portugal 19 10 2
## 54 romania 19 10 2
## 57 spain 18 8 2
## 60 sweden 12 11 2
## 61 switzer 12 9 2
## 66 ussr 18 9 2
## 68 uk 12 12 2
## 69 usa 15 9 2
## 73 yugoslav 18 8 2
## [1] "Cluster 3 ----------------------------- "
## country birth death cut.comp
## 9 brazil 36 10 3
## 11 burma 38 15 3
## 15 china 31 11 3
## 16 taiwan 26 5 3
## 17 columbia 34 10 3
## 20 ecuador 42 11 3
## 21 egypt 39 13 3
## 28 guatamala 40 14 3
## 30 india 36 15 3
## 31 indonesia 38 16 3
## 32 iran 42 12 3
## 38 nkorea 43 12 3
## 39 skorea 26 6 3
## 41 malaysia 30 6 3
## 42 mexico 40 7 3
## 48 pakistan 44 14 3
## 49 peru 40 13 3
## 50 phillip 34 10 3
## 56 sth_africa 36 12 3
## 58 sri_lanka 26 9 3
## 64 thailand 34 10 3
## 65 turkey 34 12 3
## 71 venez 36 6 3
# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death, colour = cut.comp, shape = cut.comp))
p1 <- p1 + geom_point(size = 2) # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates, complete linkage")
print(p1)
[Figure: "1976 crude birth and death rates, complete linkage": death versus birth, colored and shaped by cut.comp (1, 2, 3), with country labels.]
In very general/loose terms [4], it appears that at least some members of the “Four
Asian Tigers [5]” are toward the bottom of the swoop, while the countries with more
Euro-centric wealth are mostly clustered on the left side of the swoop, and many
developing countries make up the steeper right side of the swoop. Perhaps the birth
and death rates of a given country are influenced in part by the primary means by
which the country has obtained wealth [6] (if it is considered a wealthy country). For
example, the Four Asian Tigers have supposedly developed wealth in more recent
years through export-driven economies, and the Tiger Cub Economies [7] are currently
developing in a similar fashion [8].
[4] Thanks to Drew Enigk from Spring 2013 who provided this interpretation.
[5] http://en.wikipedia.org/wiki/Four_Asian_Tigers
[6] http://www.povertyeducation.org/the-rise-of-asia.html
[7] http://en.wikipedia.org/wiki/Tiger_Cub_Economies
[8] http://www.investopedia.com/terms/t/tiger-cub-economies.asp
[Figure: NbClust diagnostic plots versus number of clusters (2 to 14): Hubert statistic values and Dindex values, each with second differences.]
# create dendrogram
bd.hc.single <- hclust(bd.dist, method = "single")
plot(bd.hc.single, hang = -1
, main = paste("Birth/death with single linkage and", i.clus, "clusters")
, labels = bd[,1])
rect.hclust(bd.hc.single, k = i.clus)
bd$cut.sing <- factor(cutree(bd.hc.single, k = i.clus))
[Figure: dendrogram of the birth and death rate data with single linkage on bd.dist, leaves labeled by country, with rectangles marking the i.clus clusters.]
[Figure: clusplot of the single linkage clusters, Component 2 versus Component 1, with points labeled by observation number.]
## 30 india 36 15 3 2
## 31 indonesia 38 16 3 2
## 32 iran 42 12 3 2
## 33 iraq 48 14 1 2
## 35 ivory_cst 48 23 1 2
## 37 kenya 50 14 1 2
## 38 nkorea 43 12 3 2
## 40 madagasca 47 22 1 2
## 42 mexico 40 7 3 2
## 43 morocco 47 16 1 2
## 44 mozambique 45 18 1 2
## 45 nepal 46 20 1 2
## 47 nigeria 49 22 1 2
## 48 pakistan 44 14 3 2
## 49 peru 40 13 3 2
## 50 phillip 34 10 3 2
## 53 rhodesia 48 14 1 2
## 55 saudi_ar 49 19 1 2
## 56 sth_africa 36 12 3 2
## 59 sudan 49 17 1 2
## 62 syria 47 14 1 2
## 63 tanzania 47 17 1 2
## 64 thailand 34 10 3 2
## 65 turkey 34 12 3 2
## 67 uganda 48 17 1 2
## 71 venez 36 6 3 2
## 72 vietnam 42 17 1 2
## 74 zaire 45 18 1 2
## [1] "Cluster 3 ----------------------------- "
## country birth death cut.comp cut.sing
## 4 argentina 22 10 2 3
## 5 australia 16 8 2 3
## 6 austria 12 13 2 3
## 8 belguim 12 12 2 3
## 10 bulgaria 17 10 2 3
## 13 canada 17 7 2 3
## 14 chile 22 7 2 3
## 16 taiwan 26 5 3 3
## 18 cuba 20 6 2 3
## 19 czechosla 19 11 2 3
## 23 france 14 11 2 3
## 24 german_dr 12 14 2 3
## 25 german_fr 10 12 2 3
## 27 greece 16 9 2 3
## 29 hungary 18 12 2 3
## 34 italy 14 10 2 3
## 36 japan 16 6 2 3
## 39 skorea 26 6 3 3
## 41 malaysia 30 6 3 3
## 46 netherlan 13 8 2 3
## 51 poland 20 9 2 3
## 52 portugal 19 10 2 3
## 54 romania 19 10 2 3
## 57 spain 18 8 2 3
## 58 sri_lanka 26 9 3 3
## 60 sweden 12 11 2 3
## 61 switzer 12 9 2 3
## 66 ussr 18 9 2 3
## 68 uk 12 12 2 3
## 69 usa 15 9 2 3
## 73 yugoslav 18 8 2 3
# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death, colour = cut.sing, shape = cut.sing))
p1 <- p1 + geom_point(size = 2) # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates, single linkage")
print(p1)
[Figure: "1976 crude birth and death rates, single linkage": death versus birth, colored and shaped by cut.sing (1, 2, 3), with country labels.]
Both methods suggest three clusters. Complete linkage also suggests 14 clusters,
but those clusters were unappealing, so that analysis is not presented here.
The three clusters generated by the two methods are very different. The same
tendency was observed using average linkage and Ward’s method.
An important point to recognize is that different clustering algorithms may agree
on the number of clusters, but they may not agree on the composition of the clusters.
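One simple way to check whether two methods agree on composition is to cross-tabulate their cluster labels. The sketch below uses made-up data (the `band` data frame is hypothetical, not the birth and death rate data) and cuts both trees at the same number of clusters; whether the two partitions agree depends on the data.

```r
# Hypothetical data: two elongated, parallel bands of points
set.seed(2)
band <- data.frame(x = rep(seq(0, 20, length.out = 30), times = 2)
                 , y = c(rnorm(30, mean = 0, sd = 0.3)
                       , rnorm(30, mean = 3, sd = 0.3)))
band.dist <- dist(band)

# Same distance matrix, same number of clusters, two linkage rules
cut.comp <- cutree(hclust(band.dist, method = "complete"), k = 2)
cut.sing <- cutree(hclust(band.dist, method = "single"),   k = 2)

# Rows: complete linkage clusters; columns: single linkage clusters.
# Agreement (up to relabelling) shows exactly one non-zero cell in each row.
agree.tab <- table(complete = cut.comp, single = cut.sing)
print(agree.tab)
```

With the columns created in these notes, the analogous call would be table(bd$cut.comp, bd$cut.sing).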
Jolicoeur and Mosimann studied the relationship between the size and shape of
painted turtles. The table below gives the length, width, and height (all in mm)
for 24 males and 24 females.
#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch15_shells_mf.dat"
shells <- read.table(fn.data, header = TRUE)
str(shells)
## 'data.frame': 48 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ length: int 98 103 103 105 109 123 123 133 133 133 ...
## $ width : int 81 84 86 86 88 92 95 99 102 102 ...
## $ height: int 38 38 42 42 44 50 46 51 51 51 ...
#head(shells)
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# color by sex
p <- ggpairs(shells
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, title = "Painted turtle shells"
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
## 3D scatterplot
library(scatterplot3d)
with(shells, {
scatterplot3d(x = length
, y = width
, z = height
, main = "Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color = as.integer(sex) # color by group
, pch = as.integer(sex)+19 # plotting character by group
[Figure: "Painted turtle shells" scatterplot matrix of sex, length, width, and height, colored by sex; the overall correlations shown are 0.978, 0.963, and 0.96.]
[Figure: "Shells 3D Scatterplot" of height against length and width, with plotting symbol by sex.]
Let
$$\mu_i' = \begin{bmatrix} \mu_{i1} & \mu_{i2} & \cdots & \mu_{ip} \end{bmatrix}$$
be the vector of population means for the $i$th population, where $\mu_{ij}$ is the $i$th population
mean on the $j$th feature. For the turtles, $p = 3$ features and $k = 2$ strata (sexes).
A one-way MANOVA tests the hypothesis that the population mean vectors are
identical: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ against $H_A$: not $H_0$. For the carapace data, you
are simultaneously testing that the sexes have equal population mean lengths, equal
population mean widths, and equal population mean heights.
Assume that the sample sizes from the different groups are $n_1, n_2, \ldots, n_k$. The
total sample size is $n = n_1 + n_2 + \cdots + n_k$. Let
$$X_{ij}' = \begin{bmatrix} X_{ij1} & X_{ij2} & \cdots & X_{ijp} \end{bmatrix}$$
be the vector of responses for the $j$th individual from the $i$th sample. Let
$$\bar{X}_i' = \begin{bmatrix} \bar{X}_{i1} & \bar{X}_{i2} & \cdots & \bar{X}_{ip} \end{bmatrix}$$
and $S_i$ be the mean vector and variance-covariance matrix for the $i$th sample. Finally,
let
$$\bar{X}' = \begin{bmatrix} \bar{X}_1 & \bar{X}_2 & \cdots & \bar{X}_p \end{bmatrix}$$
be the vector of means ignoring samples (combine all the data across samples and
compute the average on each feature), and let
$$S = \frac{\sum_i (n_i - 1) S_i}{n - k}$$
be the pooled variance-covariance matrix. The pooled variance-covariance matrix is
a weighted average of the variance-covariance matrices from each group.
To test $H_0$, construct the following MANOVA table, which is the multivariate
analog of the ANOVA table:
Source    df       SS                                                  MS
Between   $k - 1$  $\sum_i n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'$
Within    $n - k$  $\sum_i (n_i - 1) S_i$
Total     $n - 1$  $\sum_{ij} (X_{ij} - \bar{X})(X_{ij} - \bar{X})'$
where all the MSs are SS/df.
The expressions for the SS have the same form as SS in univariate analysis of
variance, except that each SS is a p × p symmetric matrix. The diagonal elements of
the SS matrices are the SS for one-way ANOVAs on the individual features. The
off-diagonal elements are SS between features. The Error MS matrix is the pooled
variance-covariance matrix S.
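The SS matrices in the MANOVA table can be computed directly from their definitions. The following is a minimal sketch on simulated data (the matrix `X` and factor `grp` are hypothetical, not the carapace data), with k = 2 groups and p = 3 features. It also checks the statement about the diagonal elements against a one-way ANOVA on the first feature.

```r
# Hypothetical data: k = 2 groups, p = 3 features (not the turtle data)
set.seed(3)
grp <- factor(rep(c("A", "B"), times = c(12, 15)))
X   <- cbind(f1 = rnorm(27, mean = ifelse(grp == "A", 10, 12))
           , f2 = rnorm(27, mean = 5)
           , f3 = rnorm(27, mean = ifelse(grp == "A",  2,  4)))
n <- nrow(X)
k <- nlevels(grp)

xbar <- colMeans(X)                                # overall mean vector
SS.between <- matrix(0, ncol(X), ncol(X)
                   , dimnames = list(colnames(X), colnames(X)))
SS.within  <- SS.between
for (g in levels(grp)) {
  Xg  <- X[grp == g, , drop = FALSE]
  n.g <- nrow(Xg)
  d   <- colMeans(Xg) - xbar
  SS.between <- SS.between + n.g * tcrossprod(d)   # n_i (Xbar_i - Xbar)(Xbar_i - Xbar)'
  SS.within  <- SS.within  + (n.g - 1) * cov(Xg)   # (n_i - 1) S_i
}
SS.total <- crossprod(sweep(X, 2, xbar))           # sum_ij (X_ij - Xbar)(X_ij - Xbar)'
S.pooled <- SS.within / (n - k)                    # Error MS = pooled covariance matrix

# Diagonal elements are the one-way ANOVA SS for each feature;
# anova.f1$"Sum Sq" holds c(between, within) for the first feature
anova.f1 <- anova(lm(X[, "f1"] ~ grp))
print(round(S.pooled, 3))
```

The Between and Within matrices add up to the Total matrix, just as the sums of squares do in a univariate ANOVA table.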
The standard MANOVA assumes that you have independent samples from multi-
variate normal populations with identical variance-covariance matrices. This implies
that each feature is normally distributed in each population, that a feature has the
same variability across populations, and that the correlation (or covariance) between
two features is identical across populations. The Error MS matrix estimates the com-
mon population variance-covariance matrix when the population variance-covariance
matrices are identical.
The H0 of equal population mean vectors should be rejected when the difference
among mean vectors, as measured by the Between MS matrix, is large relative to the
The features are positively correlated within each sex. The correlations between
pairs of features are similar for males and females. Females tend to be larger on
each feature. The distributions for length, width, and height are fairly symmetric
within sexes. No outliers are present. Although females are more variable on each
feature than males, the MANOVA assumptions do not appear to be grossly violated
here. (Additionally, you could consider transforming each dimension of the data
in the hope of making the covariances between sexes more similar, though it may not be
easy to find a good transformation to use.)
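An informal numerical check of the equal variance-covariance assumption is to compute the covariance and correlation matrices separately by group and compare them. This is a minimal sketch on simulated data (the `dat` data frame is hypothetical and only stands in for two groups measured on three features); it is an eyeball comparison, not a formal test.

```r
# Hypothetical data standing in for two groups measured on three features
set.seed(4)
dat <- data.frame(group  = factor(rep(c("F", "M"), each = 20))
                , length = rnorm(40, mean = 120, sd = 15)
                , width  = rnorm(40, mean =  95, sd = 10)
                , height = rnorm(40, mean =  45, sd =  6))

# One covariance matrix and one correlation matrix per group
cov.by.group <- lapply(split(dat[, -1], dat$group), cov)
cor.by.group <- lapply(split(dat[, -1], dat$group), cor)

# Ratio of feature standard deviations between groups;
# values near 1 are consistent with similar variability
sd.ratio <- sqrt(diag(cov.by.group$F) / diag(cov.by.group$M))
print(round(sd.ratio, 2))
print(lapply(cor.by.group, round, 2))
```

With the data in these notes, the analogous call would split shells[, 2:4] by shells$sex.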
pca.sh <- princomp(shells[, 2:4])
df.pca.sh <- data.frame(sex = shells$sex, pca.sh$scores)
str(df.pca.sh)
## 'data.frame': 48 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Comp.1: num -31.4 -25.9 -23.6 -22 -17.1 ...
## $ Comp.2: num -2.27 -1.43 -4.44 -3.26 -3.11 ...
## $ Comp.3: num -0.943 0.73 -1.671 -1.618 -2.2 ...
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# put scatterplots on top so y axis is vertical
p <- ggpairs(df.pca.sh
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, title = "Principal components of Shells"
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: "Principal components of Shells" scatterplot matrix of sex, Comp.1, Comp.2, and Comp.3, colored by sex, with correlations reported overall and within sex.]
For comparison with the MANOVA below, here are the univariate ANOVAs for
each feature. For the carapace data, the univariate ANOVAs indicate significant
differences between sexes on length, width, and height. Females are larger on average
than males on each feature.
# Univariate ANOVA tests, by each response variable
lm.sh <- lm(cbind(length, width, height) ~ sex, data = shells)
summary(lm.sh)
## Response length :
##
## Call:
## lm(formula = length ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.042 -10.667 1.271 11.927 40.958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.042 3.509 38.77 < 2e-16 ***
## sexM -22.625 4.962 -4.56 3.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.19 on 46 degrees of freedom
## Multiple R-squared: 0.3113,Adjusted R-squared: 0.2963
## F-statistic: 20.79 on 1 and 46 DF, p-value: 3.788e-05
##
##
## Response width :
##
## Call:
## lm(formula = width ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5833 -5.5417 -0.4375 4.8854 29.4167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.583 2.149 47.725 < 2e-16 ***
## sexM -14.292 3.040 -4.701 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.53 on 46 degrees of freedom
## Multiple R-squared: 0.3246,Adjusted R-squared: 0.3099
## F-statistic: 22.1 on 1 and 46 DF, p-value: 2.376e-05
##
##
## Response height :
##
## Call:
## lm(formula = height ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0417 -2.7917 -0.7083 4.0417 14.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.042 1.258 41.360 < 2e-16 ***
## sexM -11.333 1.779 -6.369 8.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.164 on 46 degrees of freedom
## Multiple R-squared: 0.4686,Adjusted R-squared: 0.457
## F-statistic: 40.56 on 1 and 46 DF, p-value: 8.087e-08
A few procedures can be used for one-way MANOVA; two are manova() and the
car package’s Manova(). First we check the assumption of multivariate normality.
# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
# with observations as columns.
[Figure: two QQ-plots for assessing multivariate normality in the shells data, one for each sex; vertical axes: Mahalanobis D2 distance.]
The curvature in the female sample causes us to reject normality, while the male
sample does not deviate from normality. We’ll proceed anyway since this deviation from
normality in the female sample will largely increase the variability of the sample and not
displace the mean greatly, and the sample sizes are somewhat large.
Multivariate test statistics These four multivariate test statistics are among the
most common to assess differences across the levels of the categorical variables for
a linear combination of responses. In general Wilks’ lambda is recommended unless
there are problems with small total sample size, unequal sample sizes between groups,
violations of assumptions, etc., in which case Pillai’s trace is more robust.
Wilks’ lambda (λ)
  Most commonly used statistic for overall significance.
  Considers differences over all the characteristic roots.
  The smaller the value of Wilks’ lambda, the larger the between-groups dispersion.
Pillai’s trace
  Considers differences over all the characteristic roots.
  More robust than Wilks’; should be used when the sample size is small, cell sizes
  are unequal, or homogeneity of covariances is violated.
Hotelling’s trace
  Considers differences over all the characteristic roots.
Roy’s greatest characteristic root
  Tests for differences on only the first discriminant function (Chapter 16).
  Most appropriate when responses are strongly interrelated on a single dimension.
  Highly sensitive to violation of assumptions, but most powerful when all
  assumptions are met.
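To make the role of the characteristic roots concrete, here is a minimal sketch that computes the four statistics by hand from the eigenvalues of E⁻¹H and compares them with what summary() reports. It uses the built-in iris data rather than the shells data, and the object names (man.iris, stats.by.hand, stats.by.R) are mine, chosen only for illustration.

```r
# Sketch: the four MANOVA statistics as functions of the eigenvalues of E^{-1} H.
# Uses the built-in iris data (k = 3 species, p = 4 features).
man.iris <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species
                 , data = iris)
SS <- summary(man.iris)$SS          # list of SSP matrices (hypothesis and residuals)
H  <- SS$Species                    # hypothesis (between-groups) SSP matrix
E  <- SS$Residuals                  # error (within-groups) SSP matrix
r  <- min(ncol(E), nlevels(iris$Species) - 1)   # number of nonzero roots
lam <- Re(eigen(solve(E) %*% H)$values)[1:r]    # characteristic roots

stats.by.hand <- c(Wilks     = prod(1 / (1 + lam))
                 , Pillai    = sum(lam / (1 + lam))
                 , Hotelling = sum(lam)
                 , Roy       = max(lam))
stats.by.hand

# the same statistics as reported by summary() (column 2 holds the test statistic)
stats.by.R <- sapply(c("Wilks", "Pillai", "Hotelling-Lawley", "Roy")
                   , function(tt) summary(man.iris, test = tt)$stats["Species", 2])
stats.by.R
```

With only two groups there is a single nonzero root, which is why the four tests give the same F statistic and p-value in a two-group comparison.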
# Multivariate MANOVA test
# the specific test is specified in summary()
# test = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
man.sh <- manova(cbind(length, width, height) ~ sex, data = shells)
summary(man.sh, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## sex 1 0.38695 23.237 3 44 3.622e-09 ***
## Residuals 46
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# I prefer the output from the car package
library(car)
lm.man <- lm(cbind(length, width, height) ~ sex, data = shells)
man.sh <- Manova(lm.man)
summary(man.sh)
##
## Type II MANOVA Tests:
##
## Sum of squares and products for error:
## length width height
## length 13590.792 8057.500 4679.875
## width 8057.500 5100.792 2840.458
## height 4679.875 2840.458 1747.917
##
## ------------------------------------------
##
## Term: sex
##
## Sum of squares and products for the hypothesis:
## length width height
## length 6142.688 3880.188 3077.000
## width 3880.188 2451.021 1943.667
## height 3077.000 1943.667 1541.333
##
## Multivariate Tests: sex
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.6130506 23.23665 3 44 3.622e-09 ***
## Wilks 1 0.3869494 23.23665 3 44 3.622e-09 ***
## Hotelling-Lawley 1 1.5843173 23.23665 3 44 3.622e-09 ***
## Roy 1 1.5843173 23.23665 3 44 3.622e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The four MANOVA tests of no differences between sexes are all highly significant.
These tests reinforce the univariate analyses. Of the four tests, I prefer Roy’s test
because it has an intuitive interpretation. I will mostly ignore the other three tests
for discussion.
Roy’s test locates the linear combination of the features that produces the most
significant one-way ANOVA test for no differences among groups. If the groups are
not significantly different on the linear combination that best separates the groups in
a one-way ANOVA sense, then there is no evidence that the population mean vectors
are different. The critical value for Roy’s test accounts for the linear combination
being suggested by the data. That is, the critical value for Roy’s test is not the same
critical value that is used in a one-way ANOVA. The idea is similar to a Bonferroni-
type correction with multiple comparisons.
Roy’s method has the ability to locate linear combinations of the features on
which the groups differ, even when the differences across groups are not significant
on any feature. This is a reason for treating multivariate problems using multivariate
methods rather than through individual univariate analyses on each feature.
## For Roy's characteristic Root and vector
#str(man.sh)
H <- man.sh$SSP$sex # H = hypothesis matrix
# man.sh$df # hypothesis df
E <- man.sh$SSPE # E = error matrix
# man.sh$error.df # error df
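The claim that Roy's test finds the linear combination with the most significant one-way ANOVA can be checked numerically. The sketch below is an illustration on the built-in iris data (not the shells data), with variable names of my own choosing: it takes the first eigenvector of E⁻¹H, forms the corresponding scores, and compares the resulting F statistic with the F statistics for the individual features.

```r
# Sketch: Roy's linear combination as the first eigenvector of E^{-1} H (iris data).
Y   <- as.matrix(iris[, 1:4])
g   <- iris$Species
fit <- lm(Y ~ g)
E   <- crossprod(residuals(fit))                 # error SSP matrix
H   <- crossprod(scale(Y, scale = FALSE)) - E    # hypothesis SSP = total SSP - error SSP
ev  <- eigen(solve(E) %*% H)
lam1 <- Re(ev$values[1])                         # Roy's greatest root
a    <- Re(ev$vectors[, 1])                      # coefficients of the best-separating combination

# one-way ANOVA F statistic for a single response vector
F.stat <- function(y) anova(lm(y ~ g))[1, "F value"]
F.roy  <- F.stat(drop(Y %*% a))                  # F on Roy's linear combination
F.each <- apply(Y, 2, F.stat)                    # F on each feature alone
c(Roy = F.roy, F.each)

# the F statistic on Roy's combination equals lam1 * dfE / dfH
dfH <- nlevels(g) - 1
dfE <- nrow(Y) - nlevels(g)
all.equal(F.roy, lam1 * dfE / dfH)
```

Because the combination is chosen to maximize F, this F statistic must not be referred to the usual F table, which is the point made above about the critical value for Roy's test.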
The three linear combinations for the carapace data are (reading down the columns
1 http://en.wikipedia.org/wiki/Rotation_matrix
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of sex, D1, D2, and D3 for the shells data, colored by sex, with overall and within-sex correlations.]
Discriminant Analysis
A researcher collected data on two external features for two (known) sub-species of
an insect. She can use discriminant analysis to find linear combinations of the
features that best distinguish the sub-species. The analysis can then be used to
classify insects with unknown sub-species origin into one of the two sub-species based
on their external features.
To see how this might be done, consider the following data plot. Can1 is the
linear combination of the two features that best distinguishes or discriminates the
two sub-species. The value of Can1 could be used to classify insects into one of the
two groups, as illustrated.
gives the most significant F-test for a null hypothesis of no group differences in a
one-way ANOVA, among all linear combinations of the features. The second linear
combination or the second linear discriminant function:
$$\mathrm{Can2} = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p$$
gives the most significant F-test for no group differences in a one-way ANOVA, among
all linear combinations of the features that are uncorrelated (adjusting for groups)
with Can1. In general, the jth linear combination Canj (j = 1, 2, ..., r) gives the
most significant F-test for no group differences in a one-way ANOVA, among all linear
combinations of the features that are uncorrelated with Can1, Can2, ..., Can(j − 1).
The coefficients in the canonical discriminant functions can be multiplied by a
constant, or all the signs can be changed (that is, multiplied by the constant −1),
without changing their properties or interpretations.
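These properties can be checked on real scores. The following sketch uses lda() from the MASS package on the built-in iris data; lda() labels its discriminant functions LD1 and LD2, which I am assuming play the role of Can1 and Can2 up to sign and scale. The helper F.stat is a hypothetical name introduced only for this illustration.

```r
library(MASS)   # lda(); MASS ships with standard R installations

fit <- lda(Species ~ ., data = iris)
sc  <- predict(fit)$x          # discriminant scores (LD1, LD2), one row per flower
g   <- iris$Species

# F statistic of a one-way ANOVA on a score vector
F.stat <- function(y) anova(lm(y ~ g))[1, "F value"]

# rescaling or flipping the sign of a discriminant function leaves the F-test unchanged
c(F.stat(sc[, 1]), F.stat(-sc[, 1]), F.stat(10 * sc[, 1]))

# the first function separates the groups more strongly than the second
c(LD1 = F.stat(sc[, 1]), LD2 = F.stat(sc[, 2]))

# adjusting for groups (removing group means), the two score vectors are uncorrelated
res <- residuals(lm(sc ~ g))
cor(res[, 1], res[, 2])
```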
library(ggplot2)
[Figure: scatterplot of lot size (in 1000 sq ft) against income (in $1000), with owners and nonowners marked.]
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(rev(mower)
, mapping = ggplot2::aes(colour = owner, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of owner, lotsize, and income for the mower data, colored by owner, with overall and within-group correlations.]
Although the two groups overlap, the owners tend to have higher incomes and
larger lots than the non-owners. Income seems to distinguish owners and non-owners
better than lot size, but both variables seem to be useful for discriminating between
groups.
Qualitatively, one might classify prospects based on their location relative to a
roughly vertical line on the scatter plot. A discriminant analysis gives similar results
to this heuristic approach because the Can1 scores will roughly correspond to the
projection of the two features onto a line perpendicular to the hypothetical vertical
line. candisc() computes one discriminant function here because p = 2 and k = 2
gives r = min(p, k − 1) = min(2, 1) = 1.
Below we first fit a lm() and use that object to compare populations. First we
compare using univariate ANOVAs. The p-values are for one-way ANOVA comparing
owners to non-owners and both income and lotsize features are important individually
for distinguishing between the groups.
# first fit lm() with formula = continuous variables ~ factor variables
lm.mower <- lm(cbind(income, lotsize) ~ owner, data = mower)
summary(lm.mower)
## Response income :
##
## Call:
## lm(formula = income ~ owner, data = mower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4917 -3.8021 0.5875 2.5979 10.2083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.133 1.601 11.954 4.28e-11 ***
## ownerowner 7.358 2.264 3.251 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.545 on 22 degrees of freedom
## Multiple R-squared: 0.3245,Adjusted R-squared: 0.2938
## F-statistic: 10.57 on 1 and 22 DF, p-value: 0.003665
##
##
## Response lotsize :
##
## Call:
## lm(formula = lotsize ~ owner, data = mower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81667 -0.66667 -0.01667 0.71667 1.66667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8167 0.2984 29.55 < 2e-16 ***
## ownerowner 1.3167 0.4220 3.12 0.00498 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 22 degrees of freedom
## Multiple R-squared: 0.3068,Adjusted R-squared: 0.2753
## F-statistic: 9.736 on 1 and 22 DF, p-value: 0.004983
Second, the MANOVA indicates that the multivariate means are different, so the
income and lotsize features taken together are important for distinguishing between
the groups.
# test whether the multivariate means of the two populations are different
library(car)
### can also plot 2D plots when have more than two groups (will use later)
## library(heplots)
#heplot(can.mower, scale=6, fill=TRUE)
#heplot3d(can.mower, scale=6, fill=TRUE)
The raw canonical coefficients define the canonical discriminant variables and are
identical to the feature loadings in a one-way MANOVA, except for an unimportant
multiplicative factor. Only Can1 is generated here.
can.mower$coeffs.raw
## Can1
## income -0.1453404
## lotsize -0.7590457
The means output gives the mean score on the canonical discriminant variables
by group, after centering the scores to have mean zero over all groups. These are in
order of the owner factor levels (nonowner, owner).
can.mower$means
## [1] 1.034437 -1.034437
The linear combination of income and lotsize that best distinguishes owners from
non-owners
$$\mathrm{Can1} = -0.1453\,\mathrm{INCOME} - 0.7590\,\mathrm{LOTSIZE}$$
is a weighted sum of income and lotsize, with both weights of the same sign.
In the scatterplot below, Can1 is the direction indicated by the dashed line.
library(ggplot2)
[Figure: scatterplot of lot size (in 1000 sq ft) against income (in $1000) for owners and nonowners, showing the Can1 direction and the line perpendicular to Can1 used for discrimination.]
# Plots of Can1
p1 <- ggplot(can.mower$scores, aes(x = Can1, fill = owner))
p1 <- p1 + geom_histogram(binwidth = 2/3, alpha = 0.5, position="identity")
p1 <- p1 + scale_x_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
p1 <- p1 + geom_rug(aes(colour = owner))
#p1 <- p1 + labs(title = "Can1 for mower data")
#print(p1)
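# NOTE: the line creating p2 is missing from these notes; the boxplot setup below is an assumed reconstruction
p2 <- ggplot(can.mower$scores, aes(x = owner, y = Can1, fill = owner))
p2 <- p2 + geom_boxplot(alpha = 0.5)
p2 <- p2 + geom_point()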
p2 <- p2 + coord_flip()
p2 <- p2 + scale_y_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
#p2 <- p2 + labs(title = "Can1 for mower data")
#print(p2)
library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=2, top = "Can1 for mower data")
## Warning: Removed 2 rows containing missing values (geom_bar).
[Figure: Can1 for mower data. Left: histogram (count) of Can1 scores by owner. Right: Can1 scores plotted by owner group.]
The standardized coefficients (use the pooled within-class coefficients) indicate the
relative contributions of the features to the discrimination. The standardized coeffi-
cients are roughly equal, which suggests that income and lotsize contribute similarly
to distinguishing the owners from non-owners.
can.mower$coeffs.std
## Can1
## income -0.8058419
## lotsize -0.7845512
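As a quick arithmetic check of what "standardized" means here, the sketch below multiplies each raw coefficient by the pooled within-group standard deviation of its feature, taking those standard deviations to be the residual standard errors printed in the univariate fits (5.545 for income, 1.034 for lotsize). That this is how the standardized coefficients are formed is my assumption, but the printed numbers agree with it to rounding.

```r
# Sketch: standardized coefficient = raw coefficient * pooled within-group SD.
# All numbers are copied from the output shown in these notes.
raw     <- c(income = -0.1453404, lotsize = -0.7590457)  # raw canonical coefficients
sd.pool <- c(income = 5.545,      lotsize = 1.034)       # residual standard errors from summary(lm.mower)
std     <- raw * sd.pool
round(std, 3)

# printed standardized coefficients, for comparison
std.printed <- c(income = -0.8058419, lotsize = -0.7845512)
round(std - std.printed, 4)   # differences are due to rounding of the standard errors
```

The raw coefficient for lotsize is about five times that for income only because lotsize is measured on a much smaller scale; after standardizing, the two contributions are comparable.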
The p-value of 0.0004 on the likelihood ratio test indicates that Can1 strongly
distinguishes between owners and non-owners. This is consistent with the separation
between owners and non-owners in the boxplot of Can1 scores.
I noted above that Can1 is essentially the same linear combination given in a
MANOVA comparison of owners to non-owners. Here is some Manova() output to
support this claim. The MANOVA test p-values agree with the candisc output (as
we saw earlier). The first characteristic vector from the MANOVA is given here.
## For Roy's characteristic Root and vector
H <- man.mo$SSP$owner # H = hypothesis matrix
E <- man.mo$SSPE # E = error matrix
# characteristic roots of (E inverse * H)
EinvH <- solve(E) %*% H # solve() computes the matrix inverse
ev <- eigen(EinvH) # eigenvalue/eigenvectors
ev
## eigen() decomposition
## $values
## [1] 1.167337 0.000000
##
## $vectors
## [,1] [,2]
The plots show big differences between Setosa and the other two species. The
differences between Versicolor and Virginica are smaller, and appear to be mostly
due to differences in the petal widths and lengths.
#### Example: Fisher's iris data
# The "iris" dataset is included with R in the library(datasets)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(iris[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of the iris data (Species, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), colored by Species, with overall and within-species correlations.]
library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=2, top = "Parallel Coordinate Plots of Iris data")
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: Parallel Coordinate Plots of Iris data. Two panels colored by Species (setosa, versicolor, virginica); vertical axes: value.]
candisc was used to discriminate among species. There are k = 3 species and
p = 4 features, so the number of discriminant functions is 2 (the minimum of 4 and
3 − 1).
# first fit lm() with formula = continuous variables ~ factor variables
lm.iris <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species
, data = iris)
library(candisc)
can.iris <- candisc(lm.iris)
can.iris$coeffs.raw
## Can1 Can2
## Sepal.Length -0.8293776 0.02410215
## Sepal.Width -1.5344731 2.16452123
## Petal.Length 2.2012117 -0.93192121
## Petal.Width 2.8104603 2.83918785
Coefficients):
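Can2 is not easily interpreted, though perhaps it is a comparison of lengths and widths,
ignoring sepalL:
$$\mathrm{Can2} = 0.0241\,\mathrm{sepalL} + 2.1645\,\mathrm{sepalW} - 0.9319\,\mathrm{petalL} + 2.8392\,\mathrm{petalW}.$$
The canonical directions provide a maximal separation of the species. Two lines
across Can1 will provide a classification rule.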
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(can.iris$scores
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of Species, Can1, and Can2 for the iris data, colored by Species, with overall and within-species correlations.]
There are significant differences among species on both discriminant functions; see
the p-values under the likelihood ratio tests. Of course, Can1 produces the largest
differences — the overlap among species on Can1 is small. Setosa has the lowest Can1
scores because this species has the smallest petal measurements relative to its sepal
measurements. Virginica has the highest Can1 scores.
can.iris
##
## Canonical Discriminant Analysis for Species:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.96987 32.19193 31.907 99.12126 99.121
## 2 0.22203 0.28539 31.907 0.87874 100.000
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.02344 199.145 8 288 < 2.2e-16 ***
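The columns of this table are simple functions of the two eigenvalues, which the following arithmetic sketch reproduces from the printed values. The relationships used (CanRsq = λ/(1 + λ), Percent = 100 λ/Σλ, and the first-row LR statistic as Wilks' lambda) are standard, but the variable names are mine.

```r
# Sketch: reproduce the candisc summary columns from the printed eigenvalues.
lam <- c(32.19193, 0.28539)          # eigenvalues printed by can.iris

can.rsq <- lam / (1 + lam)           # squared canonical correlations
percent <- 100 * lam / sum(lam)      # share of the total of the eigenvalues
wilks   <- prod(1 / (1 + lam))       # LR test statistic for row 1 (Wilks' lambda)

round(cbind(CanRsq = can.rsq, Eigenvalue = lam
          , Percent = percent, Cumulative = cumsum(percent)), 5)
round(wilks, 5)
```

The first discriminant function carries over 99% of the total of the eigenvalues, which matches the visual impression that nearly all of the separation among species is along Can1.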
Questions:
1. What is the most striking feature of the plot of the Can1 scores?
2. Does the assumption of equal population covariance matrices across species
seem plausible?
3. How about multivariate normality?
# Covariance matrices by species
by(iris[,1:4], iris$Species, cov)
## iris$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.12424898 0.099216327 0.016355102 0.010330612
## Sepal.Width 0.09921633 0.143689796 0.011697959 0.009297959
## Petal.Length 0.01635510 0.011697959 0.030159184 0.006069388
## Petal.Width 0.01033061 0.009297959 0.006069388 0.011106122
## ----------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.26643265 0.08518367 0.18289796 0.05577959
## Sepal.Width 0.08518367 0.09846939 0.08265306 0.04120408
## Petal.Length 0.18289796 0.08265306 0.22081633 0.07310204
## Petal.Width 0.05577959 0.04120408 0.07310204 0.03910612
## ----------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.40434286 0.09376327 0.30328980 0.04909388
## Sepal.Width 0.09376327 0.10400408 0.07137959 0.04762857
## Petal.Length 0.30328980 0.07137959 0.30458776 0.04882449
## Petal.Width 0.04909388 0.04762857 0.04882449 0.07543265
# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
# with observations as columns.
##
## data: Z
## W = 0.93043, p-value = 0.005739
mshapiro.test(t(iris[iris$Species == "virginica" , 1:4]))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.93414, p-value = 0.007955
# Graphical Assessment of Multivariate Normality
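f.mnv.norm.qqplot <- function(x, name = "") {
  # creates a QQ-plot for assessing multivariate normality
  # NOTE: body lost in extraction; reconstructed as a chi-squared QQ-plot of Mahalanobis D2
  x <- as.matrix(x)
  d <- mahalanobis(x, colMeans(x), cov(x))   # squared distances to the centroid
  qqplot(qchisq(ppoints(nrow(x)), df = ncol(x)), d, main = paste("QQ Plot MV Normality:", name)
       , xlab = "Chi-squared quantiles", ylab = "Mahalanobis D2 distance")
  abline(a = 0, b = 1, col = "red")          # reference line
}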
par(mfrow=c(1,3))
f.mnv.norm.qqplot(iris[iris$Species == "setosa" , 1:4], "setosa" )
f.mnv.norm.qqplot(iris[iris$Species == "versicolor", 1:4], "versicolor")
f.mnv.norm.qqplot(iris[iris$Species == "virginica" , 1:4], "virginica" )
par(mfrow=c(1,1))
[Figure: three QQ-plots for assessing multivariate normality in the iris data (setosa, versicolor, virginica); vertical axes: Mahalanobis D2 distance.]
Classification
Given the score on CAN1 for each insect to be classified, assign insects to the sub-
species that they most resemble. Similarity is measured by the distance on CAN1 to
the average CAN1 scores for the two subspecies, identified by X’s on the plot.
The plot below illustrates the idea with the r = 2 discriminant functions in Fisher’s
iris data: Obs 1 is classified as Versicolor and Obs 2 is classified as Setosa.
[Figure: scatterplot of Can2 against Can1 for the iris data, colored by Species (setosa, versicolor, virginica), with obs 1 and obs 2 marked.]
where $(X - \bar{X}_i)'$ is the transpose of the column vector $(X - \bar{X}_i)$, and $S^{-1}$ is the
matrix inverse of $S$. Note that if $S$ is the identity matrix (a matrix with 1s on the
diagonal and 0s on the off-diagonals), then this is the Euclidean distance. Given the
M-distance from X to each sample, classify X into the group which has the minimum
M-distance.
The M-distance is an elliptical distance measure that accounts for correlation
between features, and adjusts for different scales by standardizing the features to
have unit variance. The picture below (left) highlights the idea when p = 2. All of
the points on a given ellipse are the same M-distance to the center $(\bar{X}_1, \bar{X}_2)'$. As the
ellipse expands, the M-distance to the center increases.
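[Figure: two panels plotting X2 against X1. Left: ellipses of constant M-distance about a group center. Right: three groups (Group 1, Group 2, Group 3) with the observations to be classified marked.]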
To see how classification works, suppose you have three groups and two features,
as in the plot above (right). Observation 1 is closest in M-distance to the center
of group 3. Observation 2 is closest to group 1. Thus, classify observations 1 and
2 into groups 3 and 1, respectively. Observation 3 is closest to the center of group
2 in terms of the standard Euclidean (walking) distance. However, observation 3 is
more similar to data in group 1 than it is to either of the other groups. The M -
distance from observation 3 to group 1 is substantially smaller than the M -distances
to either group 2 or 3. The M -distance accounts for the elliptical cloud of data within
each group, which reflects the correlation between the two features. Thus, you would
classify observation 3 into group 1.
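The minimum M-distance rule can be sketched in a few lines of base R. This illustration uses the built-in iris data and takes S to be the pooled within-group covariance matrix; the choice of observation 100 as the case to classify is arbitrary. Note that R's mahalanobis() returns the squared distance.

```r
# Sketch: squared M-distance from one observation to each group center (iris data).
X <- as.matrix(iris[, 1:4])
g <- iris$Species
centers <- apply(X, 2, function(col) tapply(col, g, mean))     # k x p matrix of group means
S <- crossprod(residuals(lm(X ~ g))) / (nrow(X) - nlevels(g))  # pooled within-group covariance

x.new <- X[100, ]              # an observation to classify (arbitrary choice)
D2 <- apply(centers, 1, function(m) mahalanobis(x.new, center = m, cov = S))
D2
names(which.min(D2))           # classify into the group with the minimum M-distance

# with S replaced by the identity matrix, the result is the squared Euclidean distance
D2.I <- apply(centers, 1, function(m) mahalanobis(x.new, m, diag(ncol(X))))
all.equal(D2.I, rowSums(sweep(centers, 2, x.new)^2))
```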
The M-distance from the ith group to the jth group is the M-distance between
the centers of the groups:
$$D^2(i, j) = D^2(j, i) = (\bar{X}_i - \bar{X}_j)' S^{-1} (\bar{X}_i - \bar{X}_j).$$
Larger values suggest relatively better potential for discrimination between groups.
In the plot above, $D^2(1, 2) < D^2(1, 3)$, which implies that it should be easier to
distinguish between groups 1 and 3 than groups 1 and 2.
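For a concrete instance of between-group M-distances, this sketch computes the full matrix of squared distances between group centers for the built-in iris data, again taking S to be the pooled within-group covariance matrix. The object names are mine.

```r
# Sketch: squared M-distances between group centers (iris data).
X <- as.matrix(iris[, 1:4])
g <- iris$Species
k <- nlevels(g)
centers <- apply(X, 2, function(col) tapply(col, g, mean))     # rows in the order of levels(g)
S <- crossprod(residuals(lm(X ~ g))) / (nrow(X) - k)           # pooled within-group covariance

D2 <- matrix(0, k, k, dimnames = list(levels(g), levels(g)))
for (i in 1:k) {
  for (j in 1:k) {
    D2[i, j] <- mahalanobis(centers[i, ], centers[j, ], S)
  }
}
round(D2, 1)
```

The smallest off-diagonal entry identifies the pair of groups that should be hardest to tell apart, which for these data is versicolor and virginica.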
M -distance classification is equivalent to classification based on a probability
model that assumes the samples are independently selected from multivariate nor-
mal populations with identical covariance matrices. This assumption is consistent
with the plot above where the data points form elliptical clouds with similar orienta-
tions and spreads across samples. Suppose you can assume a priori (without looking
at the data for the individual that you wish to classify) that a randomly selected in-
dividual from the combined population (i.e., merge all sub-populations) is equally
likely to be from any group:
$$\mathrm{PRIOR}_j \equiv \Pr(\text{observation is from group } j) = \frac{1}{k},$$
where k is the number of groups. Then, given the observed features X for an indi-
vidual
To be precise, I will note that $\Pr(j|X)$ is unknown, and the expression for $\Pr(j|X)$ is
an estimate based on the data.
The group with the largest posterior probability $\Pr(j|X)$ is the group into which
X is classified. Maximizing $\Pr(j|X)$ across groups is equivalent to minimizing the
M-distance $D_j^2(X)$ across groups, so the two classification rules are equivalent.
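The equivalence of the two rules can be checked numerically. The sketch below assumes the usual normal-theory estimate under equal priors, Pr(j|X) = exp{−0.5 D_j²(X)} / Σ_k exp{−0.5 D_k²(X)}, which is my statement of the formula rather than a quotation from these notes, and compares it with the posterior probabilities from lda() in the MASS package fitted with equal priors. The built-in iris data are used.

```r
library(MASS)   # lda(), used only for comparison

X <- as.matrix(iris[, 1:4])
g <- iris$Species
centers <- apply(X, 2, function(col) tapply(col, g, mean))
S <- crossprod(residuals(lm(X ~ g))) / (nrow(X) - nlevels(g))  # pooled within-group covariance

# n x k matrix of squared M-distances from every observation to every group center
D2 <- apply(centers, 1, function(m) mahalanobis(X, m, S))

# estimated posterior probabilities under equal priors (assumed formula, see above)
post <- exp(-0.5 * D2)
post <- post / rowSums(post)

class.post <- colnames(post)[max.col(post, ties.method = "first")]  # largest posterior
class.dist <- colnames(D2)[max.col(-D2, ties.method = "first")]     # smallest M-distance
identical(class.post, class.dist)
table(true = g, classified = class.post)

# comparison with lda() using equal priors
fit <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))
all.equal(unname(post), unname(predict(fit)$posterior))
```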
data into a training or calibration set from which the classification rule is con-
structed. The remaining data, called the test data set, is used with the classification
rule to estimate the error rate. In particular, the proportion of test cases misclassified
estimates the misclassification rate. This process is often repeated, say 10 times, and
the error rate estimated to be the average of the error rates from the individual splits.
With repeated random splitting, it is common to use 10% of each split as the test
data set (a 10-fold cross-validation).
Repeated random splitting can be coded. As an alternative, you might consider
using one random 50-50 split (a 2-fold) to estimate the misclassification rate, provided
you have a reasonably large data base.
Another form of cross-validation uses a jackknife method where single cases are
held out of the data (an n-fold), then classified after constructing the classification rule
from the remaining data. The process is repeated for each case, giving an estimated
misclassification rate as the proportion of cases misclassified.
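Here is one way repeated random splitting might be coded, as a sketch only. It holds out a random 10% of the built-in iris data ten times, fits lda() from the MASS package to the remaining 90%, and averages the ten test error rates. The seed, the number of repetitions, and the variable names are arbitrary choices of mine.

```r
library(MASS)   # lda()

set.seed(1)                        # arbitrary seed, only for reproducibility
n.rep <- 10                        # number of random splits
err   <- numeric(n.rep)
for (i in seq_len(n.rep)) {
  test.ind <- sample(nrow(iris), size = round(0.10 * nrow(iris)))  # hold out 10% as test data
  fit  <- lda(Species ~ ., data = iris[-test.ind, ])               # rule from the training data
  pred <- predict(fit, newdata = iris[test.ind, ])$class           # classify the test cases
  err[i] <- mean(pred != iris$Species[test.ind])                   # proportion misclassified
}
err
mean(err)                          # estimated misclassification rate
```

Because the test sets are drawn independently here, a case can be held out more than once or not at all; a partition of the data into ten disjoint folds would instead hold out each case exactly once.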
The lda() function allows for jackknife cross-validation (CV) and cross-validation
using a single test data set (predict()). The jackknife method is necessary with small
sized data sets so single observations don’t greatly bias the classification. You can also
classify observations with unknown group membership, by treating the observations
to be classified as a test data set.
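Both uses can be sketched on the built-in iris data: CV = TRUE for the jackknife, and predict() with newdata for a case of unknown group. The measurements in new.obs are hypothetical values made up for illustration.

```r
library(MASS)   # lda()

# jackknife (leave-one-out) classification of every case
lda.cv <- lda(Species ~ ., data = iris, CV = TRUE)
tab <- table(true = iris$Species, classified = lda.cv$class)
tab
1 - sum(diag(prop.table(tab)))     # jackknife estimate of the misclassification rate

# classify a case with unknown group membership by treating it as test data
fit <- lda(Species ~ ., data = iris)
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0     # hypothetical flower
                    , Petal.Length = 4.5, Petal.Width = 1.5)
pred.new <- predict(fit, newdata = new.obs)
pred.new$class
round(pred.new$posterior, 3)
```

Note that with CV = TRUE, lda() returns the classifications and posterior probabilities directly rather than a fitted object, so a separate fit without CV is needed before calling predict().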
## 3D scatterplot
library(scatterplot3d)
with(shells, {
scatterplot3d(x = length
, y = width
, z = height
, main = "Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color = as.integer(sex) # color by group
, pch = as.integer(sex)+19 # plotting character by group
#, highlight.3d = TRUE # makes color change with z-axis value
, angle = 100 # viewing angle (seems hard to control)
)
})
[Figure: scatterplot matrix of the shells data (length, width, height) with overall and within-sex correlations, and the Shells 3D Scatterplot of length, width, and height by sex.]
## triplot
library(klaR)   # provides partimat()
partimat(sex ~ length + width + height, data = shells
, plot.matrix = TRUE)
[Figure: partimat() matrix of partition plots for pairs of length, width, and height, with observations labeled F or M; legible panel error rates: 0.188 and 0.083.]
By default, lda() takes the prior probabilities from the group proportions in the data;
with 24 females and 24 males, this gives equal priors of 0.5 for each sex.
library(MASS)
lda.sh <- lda(sex ~ length + height, data = shells)
lda.sh
## Call:
## lda(sex ~ length + height, data = shells)
##
## Prior probabilities of groups:
## F M
## 0.5 0.5
##
## Group means:
## length height
## F 136.0417 52.04167
## M 113.4167 40.70833
##
## Coefficients of linear discriminants:
## LD1
## length 0.1370519
## height -0.4890769
The linear discriminant function is in the direction that best separates the sexes,
$$\mathrm{LD1} = 0.1371\,\mathrm{length} - 0.4891\,\mathrm{height}.$$
The plot of the lda object shows the groups across the linear discriminant func-
tion. From the klaR package we can get color-coded classification areas based on a
perpendicular line across the LD function.
plot(lda.sh, dimen = 1, type = "both", col = as.numeric(shells$sex))
[Figure: histograms with density estimates of the LD1 scores, one panel for group F and one for group M.]
The constructed table gives the jackknife-based classification and posterior prob-
abilities of being male or female for each observation in the data set. The misclassi-
fication rate follows.
# CV = TRUE does jackknife (leave-one-out) crossvalidation
lda.sh.cv <- lda(sex ~ length + height, data = shells, CV = TRUE)
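# classification table (reconstructed; the original construction of classify.sh is missing here)
classify.sh <- data.frame(sex = shells$sex, class = lda.sh.cv$class, error = ""
                        , round(lda.sh.cv$posterior, 3))
colnames(classify.sh)[4:5] <- paste0("post", colnames(lda.sh.cv$posterior))
# error column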
classify.sh$error <- as.character(classify.sh$error)
classify.agree <- as.character(as.numeric(shells$sex) - as.numeric(lda.sh.cv$class))
classify.sh$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
Twenty of the 24 females are classified correctly, with the other four classified as males.
The Total Error of 0.0833 is the estimated misclassification rate, computed as the sum of
Rates×Prior over sexes: 0.0833 = 0.1667 × 0.5 + 0 × 0.5. Are the misclassification
results sensible, given the data plots that you saw earlier?
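The Total Error arithmetic can be written out as a short sketch. The counts are those of the jackknife classification table for the shells data (20 and 4 for females, 0 and 24 for males), entered by hand so that the snippet runs on its own.

```r
# Sketch: total error as the prior-weighted sum of the per-sex error rates.
pred.freq <- matrix(c(20, 0, 4, 24), nrow = 2      # filled column by column
                  , dimnames = list(true = c("F", "M"), classified = c("F", "M")))
pred.freq

rate  <- 1 - diag(prop.table(pred.freq, 1))        # misclassification rate for each sex
prior <- c(F = 0.5, M = 0.5)                       # prior probabilities
rate
total.error <- sum(rate * prior)                   # 0.1667 * 0.5 + 0 * 0.5
total.error
```

With equal priors and equal group sizes, this agrees with the overall proportion of misclassified cases, 4/48.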
The listing of the posterior probabilities for each sex, by case, gives you an idea
of the clarity of classification, with larger differences between the male and female
posteriors corresponding to more definitive (but not necessarily correct!) classifica-
tions.
# print table
classify.sh
## sex class error postF postM
## 1 F M -1 0.166 0.834
## 2 F M -1 0.031 0.969
## 3 F F 0.847 0.153
## 4 F F 0.748 0.252
## 5 F F 0.900 0.100
## 6 F F 0.992 0.008
## 7 F F 0.517 0.483
## 8 F F 0.937 0.063
## 9 F F 0.937 0.063
## 10 F F 0.937 0.063
## 11 F M -1 0.184 0.816
## 12 F M -1 0.294 0.706
## 13 F F 0.733 0.267
## 14 F F 0.733 0.267
## 15 F F 0.917 0.083
## 16 F F 0.994 0.006
## 17 F F 0.886 0.114
## 18 F F 0.864 0.136
## 19 F F 1.000 0.000
## 20 F F 0.998 0.002
## 21 F F 1.000 0.000
## 22 F F 1.000 0.000
## 23 F F 0.993 0.007
## 24 F F 0.999 0.001
## 25 M M 0.481 0.519
## 26 M M 0.040 0.960
## 27 M M 0.020 0.980
## 28 M M 0.346 0.654
## 29 M M 0.092 0.908
## 30 M M 0.021 0.979
## 31 M M 0.146 0.854
## 32 M M 0.078 0.922
## 33 M M 0.018 0.982
## 34 M M 0.036 0.964
## 35 M M 0.026 0.974
## 36 M M 0.019 0.981
## 37 M M 0.256 0.744
## 38 M M 0.023 0.977
## 39 M M 0.023 0.977
## 40 M M 0.012 0.988
## 41 M M 0.002 0.998
## 42 M M 0.175 0.825
## 43 M M 0.020 0.980
## 44 M M 0.157 0.843
## 45 M M 0.090 0.910
## 46 M M 0.067 0.933
## 47 M M 0.081 0.919
## 48 M M 0.074 0.926
# Assess the accuracy of the prediction
pred.freq <- table(shells$sex, lda.sh.cv$class) # row = true sex, col = classified sex
pred.freq
##
## F M
## F 20 4
## M 0 24
prop.table(pred.freq, 1) # proportions by row
##
## F M
## F 0.8333333 0.1666667
## M 0.0000000 1.0000000
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## F M
## 0.8333333 1.0000000
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.9166667
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.08333333
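The prior-weighted Total Error quoted earlier can be reproduced by hand. The sketch below is only an illustration: it re-enters the counts from the cross-tabulation above rather than using the shells data, and the object names (conf.mat, prior, rate, total.error) are hypothetical.

```r
# Counts copied from the jackknife cross-tabulation (rows = true sex, cols = classified sex)
conf.mat <- matrix(c(20,  4,
                      0, 24), nrow = 2, byrow = TRUE,
                   dimnames = list(true = c("F", "M"), classified = c("F", "M")))
prior <- c(F = 0.5, M = 0.5)                  # equal priors, as in the text
rate  <- 1 - diag(prop.table(conf.mat, 1))    # misclassification rate within each sex
total.error <- sum(rate * prior)              # sum of Rate x Prior over sexes
round(rate, 4)
round(total.error, 4)
```

With equal priors and equal group sizes this agrees with the unweighted rate 1 - sum(diag(prop.table(pred.freq))); with unequal priors the two would generally differ.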
of the test data and the training data. Many researchers use a 50-50 split. Regardless
of the split, you should combine the two data sets at the end of the cross-validation
to create the actual rule for classifying future data.
Below, half of the indices of the iris data set are randomly selected (within each
Species) and assigned the label “test”, whereas the rest are labeled “train”. A plot
indicates that the two subsamples are similar.
#### Example: Fisher's iris data
# The "iris" dataset is included with R in the library(datasets)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Randomly assign equal train/test by Species strata
library(plyr)
iris <- ddply(iris, .(Species), function(X) {
ind <- sample.int(nrow(X), size = round(nrow(X)/2))
sort(ind)
X$test <- "train"
X$test[ind] <- "test"
X$test <- factor(X$test)
X$test
return(X)
})
summary(iris$test)
## test train
## 75 75
table(iris$Species, iris$test)
##
## test train
## setosa 25 25
## versicolor 25 25
## virginica 25 25
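The same kind of stratified split can be sketched in base R, without plyr. This is an alternative illustration, not the code that produced the output above; iris2, test.idx, and the seed are assumptions, and a different seed gives a different (equally valid) split.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
iris2 <- datasets::iris                       # fresh copy of the built-in data

# sample half of the row indices within each Species
test.idx <- unlist(lapply(split(seq_len(nrow(iris2)), iris2$Species),
                          function(i) sample(i, size = round(length(i) / 2))))

# label the sampled rows "test" and the remaining rows "train"
iris2$test <- factor(ifelse(seq_len(nrow(iris2)) %in% test.idx, "test", "train"))
tab <- table(iris2$Species, iris2$test)
tab
```

Each species contributes 25 observations to each subsample, so the class proportions are the same in the training and test sets.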
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(subset(iris, test == "train")[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, title = "train"
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p <- ggpairs(subset(iris, test == "test")[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, title = "test"
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrices of Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width for the "train" (left) and "test" (right) subsamples, colored by Species, with overall and within-species correlations in the upper panels.]
[Figure: partition plots for each pair of the variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, with observations plotted by the first letter of their species; each panel reports its error rate, which ranges from 0.04 to 0.227 across panels.]
library(MASS)
lda.iris <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = subset(iris, test == "train"))
lda.iris
## Call:
## lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
## data = subset(iris, test == "train"))
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.960 3.400 1.448 0.240
## versicolor 5.968 2.792 4.304 1.344
## virginica 6.512 2.944 5.508 2.020
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.598122 -0.3320011
Plots of the lda object show the training data on the LD scale.
# colour by the species of the training observations used to fit lda.iris
col.train <- as.numeric(subset(iris, test == "train")$Species)
plot(lda.iris, dimen = 1, col = col.train)
plot(lda.iris, dimen = 2, col = col.train)
#pairs(lda.iris, col = col.train)
[Figure: left, histograms of LD1 by group (setosa, versicolor, virginica); right, scatterplot of LD2 versus LD1 with points labeled by species.]
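# CV = TRUE does jackknife (leave-one-out) crossvalidation on the training data
iris.train <- subset(iris, test == "train")
lda.iris.cv <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = iris.train, CV = TRUE)
classify.iris <- data.frame(Species = iris.train$Species, class = lda.iris.cv$class
, error = "", round(lda.iris.cv$posterior, 3))
# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(iris.train$Species) - as.numeric(lda.iris.cv$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]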
How well do the LD functions constructed on the training data predict the
Species in the independent test data?
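# predict the test data from the training data LDFs
pred.iris <- predict(lda.iris, newdata = subset(iris, test == "test"))
# table of classification and posterior probabilities for each test observation
classify.iris <- data.frame(Species = subset(iris, test == "test")$Species
, class = pred.iris$class
, error = ""
, round(pred.iris$posterior,3))
colnames(classify.iris) <- c("Species", "class", "error"
, paste("P", colnames(pred.iris$posterior), sep=""))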
# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(subset(iris, test == "test")$Species)
- as.numeric(pred.iris$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
classify.iris
## Species class error Psetosa Pversicolor Pvirginica
## 2 setosa setosa 1 0.000 0.000
## 3 setosa setosa 1 0.000 0.000
## 6 setosa setosa 1 0.000 0.000
## 8 setosa setosa 1 0.000 0.000
## 9 setosa setosa 1 0.000 0.000
## 10 setosa setosa 1 0.000 0.000
## 11 setosa setosa 1 0.000 0.000
## 12 setosa setosa 1 0.000 0.000
## 15 setosa setosa 1 0.000 0.000
## 16 setosa setosa 1 0.000 0.000
## 23 setosa setosa 1 0.000 0.000
## 24 setosa setosa 1 0.000 0.000
## 26 setosa setosa 1 0.000 0.000
## 28 setosa setosa 1 0.000 0.000
## 29 setosa setosa 1 0.000 0.000
## 31 setosa setosa 1 0.000 0.000
## 35 setosa setosa 1 0.000 0.000
## 37 setosa setosa 1 0.000 0.000
## 40 setosa setosa 1 0.000 0.000
## 41 setosa setosa 1 0.000 0.000
## 43 setosa setosa 1 0.000 0.000
## 44 setosa setosa 1 0.000 0.000
## 45 setosa setosa 1 0.000 0.000
## 47 setosa setosa 1 0.000 0.000
## 49 setosa setosa 1 0.000 0.000
## 52 versicolor versicolor 0 0.999 0.001
## 53 versicolor versicolor 0 0.992 0.008
## 54 versicolor versicolor 0 0.999 0.001
## 55 versicolor versicolor 0 0.992 0.008
## 57 versicolor versicolor 0 0.988 0.012
## 59 versicolor versicolor 0 1.000 0.000
## 61 versicolor versicolor 0 1.000 0.000
## 62 versicolor versicolor 0 0.999 0.001
## 63 versicolor versicolor 0 1.000 0.000
## 69 versicolor versicolor 0 0.918 0.082
## virginica 0 0 25
prop.table(pred.freq, 1) # proportions by row
##
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## setosa versicolor virginica
## 1 1 1
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 1
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0
The classification rule based on the training set works well with the test data. Do
not expect such nice results on all classification problems! Usually the error rate is
slightly higher on the test data than on the training data.
It is important to recognize that statistically significant differences (MANOVA)
among groups on linear discriminant function scores do not necessarily translate into
accurate classification rules! (WHY?)
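One way to explore the question is by simulation. The sketch below is a hypothetical illustration, not part of the original analysis: two univariate normal populations whose means differ by only 0.2 standard deviations, with 5000 observations each (all of these values are assumptions). In one dimension, with equal priors and a common variance, the linear discriminant rule reduces to classifying at the midpoint of the two group means, so base R suffices.

```r
set.seed(17)                        # arbitrary seed, for reproducibility
n  <- 5000                          # assumed (large) sample size per group
x1 <- rnorm(n, mean = 0.0, sd = 1)  # group 1
x2 <- rnorm(n, mean = 0.2, sd = 1)  # group 2: small shift, heavy overlap

# Test of equal means: with this much data a small shift is easily detected
p.value <- t.test(x1, x2, var.equal = TRUE)$p.value

# Midpoint rule: classify into group 2 when x exceeds the midpoint of the sample means
cutoff     <- (mean(x1) + mean(x2)) / 2
error.rate <- (mean(x1 > cutoff) + mean(x2 <= cutoff)) / 2

p.value
error.rate
```

The p-value is very small while the error rate stays not far below one half. Significance reflects how precisely the mean difference is estimated, whereas classification accuracy depends on how far apart the groups are relative to their spread.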
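# start with full model and do stepwise (direction = "backward")
library(klaR)
step.iris.b <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = iris
, method = "lda"
, improvement = 0.01 # stop criterion: improvement less than 1%
# default of 5% is too coarse
, direction = "backward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method lda’.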
## 150 observations of 4 variables in 3 classes; direction: backward
## stop criterion: improvement less than 1%.
## correctness rate: 0.98; starting variables (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## hr.elapsed min.elapsed sec.elapsed
## 0.00 0.00 0.19
plot(step.iris.b, main = "Start = full model, backward selection")
step.iris.b$formula
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## <environment: 0x000000002045a3f0>
# start with empty model and do stepwise (direction = "forward")
step.iris.f <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = iris
, method = "lda"
, improvement = 0.01 # stop criterion: improvement less than 1%
# default of 5% is too coarse
, direction = "forward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method lda’.
## 150 observations of 4 variables in 3 classes; direction: forward
## stop criterion: improvement less than 1%.
## correctness rate: 0.96; in: "Petal.Width"; variables (1): Petal.Width
##
## hr.elapsed min.elapsed sec.elapsed
## 0.0 0.0 0.2
plot(step.iris.f, main = "Start = empty model, forward selection")
step.iris.f$formula
## Species ~ Petal.Width
## <environment: 0x000000003222b898>
[Figure: estimated correctness rate at each step of variable selection. Left panel: "Start = full model, backward selection" (START only). Right panel: "Start = empty model, forward selection" (START, then + Petal.Width).]
Given your selected model, you can then go on to fit your classification model by
using the formula from the stepclass() object.
library(MASS)
lda.iris.step <- lda(step.iris.b$formula
, data = iris)
lda.iris.step
## Call:
## lda(step.iris.b$formula, data = iris)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.006 3.428 1.462 0.246
## versicolor 5.936 2.770 4.260 1.326
## virginica 6.588 2.974 5.552 2.026
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
##
## Proportion of trace:
## LD1 LD2
## 0.9912 0.0088
Note that if you have many variables, you may wish to use the alternate syntax
below to specify your formula (see the help ?stepclass for this example).
iris.d <- iris[,1:4] # the data
iris.c <- iris[,5] # the classes
sc_obj <- stepclass(iris.d, iris.c, "lda", start.vars = "Sepal.Width")
The admissions officer of a business school has used an index of undergraduate GPA
and management aptitude test scores (GMAT) to help decide which applicants should
be admitted to graduate school. The data below gives the GPA and GMAT scores for
recent applicants who are classified as admit (A), borderline (B), or not admit (N).
An equal number of A, B, and N’s (roughly) were selected from their corresponding
populations (Johnson and Wichern, 1988).
#### Example: Business school admissions data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch17_business.dat"
business <- read.table(fn.data, header = TRUE)
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(business
, mapping = ggplot2::aes(colour = admit, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
library(ggplot2)
p <- ggplot(business, aes(x = gpa, y = gmat, shape = admit, colour = admit))
p <- p + geom_point(size = 6)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(business$admit))))
p <- p + theme(legend.position="none") # remove legend with fill colours
print(p)
[Figure: left, scatterplot matrix of admit, gpa, and gmat, colored by admission group (correlation between gpa and gmat: 0.442 overall; A: 0.147, B: −0.331, N: 0.507); right, plot of gmat versus gpa with points labeled A, B, or N.]
The officer wishes to use these data to develop a more quantitative (i.e., less
subjective) approach to classify prospective students. Historically, about 20% of all
applicants have been admitted initially, 10% are classified as borderline, and the
remaining 70% are not admitted. The officer would like to keep these percentages
roughly the same in the future.
This is a natural place to use discriminant analysis. Let us do a more careful
analysis here, paying attention to underlying assumptions of normality and equal
covariance matrices.
The GPA and GMAT distributions are reasonably symmetric. Although a few
outliers are present, it does not appear that any transformation will eliminate the
outliers and preserve the symmetry. Given that the outliers are not very extreme,
I would analyze the data on this scale. Except for the outliers, the spreads (IQRs)
are roughly equal across groups within GPA and GMAT. I will look carefully at the
variance-covariance matrices later.
There is a fair amount of overlap between the borderline and other groups, but
this should not be too surprising. Otherwise these applicants would not be borderline!
Assuming equal variance-covariance matrices, both GPA and GMAT are impor-
tant for discriminating among entrance groups. This is consistent with the original
data plots.
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE)
[Figure: Partition Plot of gmat versus gpa, with observations labeled A, B, or N.]
library(MASS)
lda.business <- lda(admit ~ gpa + gmat
, data = business)
lda.business
## Call:
## lda(admit ~ gpa + gmat, data = business)
##
## Prior probabilities of groups:
## A B N
## 0.3595506 0.3483146 0.2921348
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
##
## Coefficients of linear discriminants:
## LD1 LD2
## gpa -3.977912929 -1.48346456
## gmat -0.003057846 0.01292319
##
## Proportion of trace:
## LD1 LD2
## 0.9473 0.0527
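The linear discriminant functions that best classify the admission groups are
\[
\begin{aligned}
\mathrm{LD1} &= -3.978\,\mathrm{gpa} - 0.00306\,\mathrm{gmat}, \\
\mathrm{LD2} &= -1.483\,\mathrm{gpa} + 0.01292\,\mathrm{gmat},
\end{aligned}
\]
with coefficients taken from the output above.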
[Figure: left, histograms of LD1 by group (A, B, N); right, scatterplot of LD2 versus LD1 with points labeled by admission group.]
# error column
classify.business$error <- as.character(classify.business$error)
as outlined below.
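When the prior probabilities are unequal, classification is based on the generalized
distance to group j:
\[
D_j^2(X) = (X - \bar{X}_j)' S^{-1} (X - \bar{X}_j) - 2 \log(\mathrm{PRIOR}_j),
\]
or equivalently on the estimated posterior probability of membership in group j:
\[
\Pr(j \mid X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_k \exp\{-0.5\, D_k^2(X)\}}.
\]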
Here S is the pooled covariance matrix, and log(PRIOR_j) is the (natural) log of the
prior probability of being in group j. As before, you classify observation X into the
group that it is closest to in terms of generalized distance, or equivalently, into the
group with the maximum posterior probability.
Note that −2 log(PRIOR_j) exceeds zero, and is extremely large when PRIOR_j is
near zero. The generalized distance is the M-distance plus a penalty term that is
large when the prior probability for a given group is small. If the prior probabilities
are equal, the penalty terms are equal so the classification rule depends only on the
M-distance.
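To get a feel for the size of the penalty, the terms can be computed for the priors used in the admissions example (0.2, 0.1, 0.7). The object names below are illustrative.

```r
prior   <- c(A = 0.2, B = 0.1, N = 0.7)   # priors from the admissions example
penalty <- -2 * log(prior)                # penalty added to the M-distance
round(penalty, 3)
##     A     B     N
## 3.219 4.605 0.713
```

The borderline group carries the largest penalty: its squared M-distance must be smaller than that of the N group by about 3.9 before an observation is closer to B in generalized distance.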
The penalty makes it harder (relative to equal probabilities) to classify into a
low probability group, and easier to classify into high probability groups. In the
admissions data, an observation has to be very close to the B or A groups to not be
classified as N.
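The rule can be sketched directly from its definition. The code below is a minimal illustration under stated assumptions: it uses R's built-in iris data (datasets::iris) rather than the admissions data so that it is self-contained, it borrows the priors 0.2, 0.1, 0.7 purely for illustration, and all object and function names (X, grp, S.pooled, gen.dist, posterior) are hypothetical.

```r
X     <- as.matrix(datasets::iris[, 1:4])
grp   <- datasets::iris$Species
prior <- c(setosa = 0.2, versicolor = 0.1, virginica = 0.7)  # illustrative priors

# group mean vectors and pooled covariance matrix S
by.grp   <- split(as.data.frame(X), grp)
means    <- lapply(by.grp, colMeans)
S.pooled <- Reduce(`+`, lapply(by.grp, function(d) (nrow(d) - 1) * cov(d))) /
  (nrow(X) - nlevels(grp))

# generalized squared distance: M-distance minus 2 log(prior)
gen.dist <- function(x, prior) {
  sapply(names(prior), function(j) {
    mahalanobis(x, means[[j]], S.pooled) - 2 * log(prior[[j]])
  })
}
# posterior probabilities computed from the generalized distances
posterior <- function(x, prior) {
  d <- gen.dist(x, prior)
  w <- exp(-0.5 * (d - min(d)))   # subtracting min(d) avoids underflow; ratios unchanged
  w / sum(w)
}

x.new <- X[71, ]                  # one observation, chosen arbitrarily
equal <- c(setosa = 1, versicolor = 1, virginica = 1) / 3
round(posterior(x.new, prior), 3) # unequal priors
round(posterior(x.new, equal), 3) # equal priors, for comparison
```

Classifying into the group with the smallest generalized distance gives the same answer as classifying into the group with the largest posterior probability. The results should agree with predict() applied to an lda() fit with the same priors, although that comparison is not shown here.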
Note that in the analysis below, we make the tenuous assumption that the popu-
lation covariance matrices are equal. We also have 6 new observations that we wish
to classify. These observations are entered as a test data set with missing class levels.
# new observations to classify
business.test <- read.table(text = "
admit gpa gmat
NA 2.7 630
NA 3.3 450
NA 3.4 540
NA 2.8 420
NA 3.5 340
NA 3.0 500
", header = TRUE)
With priors, the LDs are different.
library(MASS)
lda.business <- lda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business)
lda.business
## Call:
## lda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1,
## 0.7))
##
## Prior probabilities of groups:
## A B N
## 0.2 0.1 0.7
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
##
## Coefficients of linear discriminants:
## LD1 LD2
## gpa -4.014778092 -1.38058511
## gmat -0.002724201 0.01299761
##
## Proportion of trace:
## LD1 LD2
## 0.9808 0.0192
About 1/2 of the borderlines in the calibration set are misclassified. This is due
to the overlap of the B group with the other 2 groups, but also reflects the low prior
probability for the borderline group. The classification rule requires strong evidence
that an observation is borderline before it can be classified as such.
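# CV = TRUE does jackknife (leave-one-out) crossvalidation
lda.business.cv <- lda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business, CV = TRUE)
# table of classification and posterior probabilities for each observation
classify.business <- data.frame(admit = business$admit
, class = lda.business.cv$class
, error = ""
, round(lda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
, paste("post", colnames(lda.business.cv$posterior), sep=""))
# error column (factor() guards against admit having been read as character)
classify.business$error <- as.character(classify.business$error)
classify.agree <- as.character(as.numeric(factor(business$admit))
- as.numeric(lda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]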
The test data cases were entered with missing group IDs. The classification table
compares the group IDs, which are unknown, to the ID for the group into which an
observation is classified. These two labels differ, so all the test data cases are identified
as misclassified. Do not be confused by this! Just focus on the classification for each
case, and ignore the other summaries.
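# predict the test data from the training data LDFs
pred.business <- predict(lda.business, newdata = business.test)
# table of classification and posterior probabilities for each test observation
classify.business.test <- data.frame(admit = business.test$admit
, class = pred.business$class
, round(pred.business$posterior,3))
colnames(classify.business.test) <- c("admit", "class"
, paste("post", colnames(pred.business$posterior), sep=""))
## error column (not computed: the true groups of the test cases are unknown)
#classify.business.test$error <- as.character(classify.business.test$error)
#classify.agree <- as.character(as.numeric(business.test$admit)
# - as.numeric(pred.business$class))
#classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]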
# print table
classify.business.test
## admit class postA postB postN
## 1 NA N 0.102 0.074 0.824
## 2 NA A 0.629 0.367 0.004
## 3 NA A 0.919 0.081 0.000
## 4 NA N 0.026 0.297 0.676
## 5 NA B 0.461 0.538 0.001
## 6 NA B 0.385 0.467 0.148
Except for observations 5 and 6, whose two largest posterior probabilities are close
(0.461 versus 0.538, and 0.385 versus 0.467), the posterior probabilities for the test
cases favor classification into a specific group.
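How clear-cut each classification is can be summarized by the gap between the two largest posterior probabilities. The sketch below re-enters the posterior probabilities from the table above; the 0.1 cutoff and the object names are arbitrary assumptions for illustration.

```r
# posterior probabilities copied from the test-case table (rows = cases 1 to 6)
post <- rbind(c(0.102, 0.074, 0.824),
              c(0.629, 0.367, 0.004),
              c(0.919, 0.081, 0.000),
              c(0.026, 0.297, 0.676),
              c(0.461, 0.538, 0.001),
              c(0.385, 0.467, 0.148))
colnames(post) <- c("postA", "postB", "postN")

# gap between the largest and second-largest posterior, by case
margin <- apply(post, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  unname(s[1] - s[2])
})
round(margin, 3)
which(margin < 0.1)   # cases with a narrow gap (0.1 is an arbitrary cutoff)
```

Cases with a small gap deserve a second look before the classification is acted on.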
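to group j:
\[
D_j^2(X) = (X - \bar{X}_j)' S_j^{-1} (X - \bar{X}_j) + \log |S_j| - 2 \log(\mathrm{PRIOR}_j),
\]
or equivalently on the estimated posterior probability of membership in group j:
\[
\Pr(j \mid X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_k \exp\{-0.5\, D_k^2(X)\}}.
\]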
Here S_j is the sample covariance matrix from group j and log |S_j| is the log of the
determinant of this covariance matrix. The determinant penalty term is large for groups
having large variability. The rule is not directly tied to linear discriminant function
variables, so interpretation of and insight into this method are less straightforward.
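The quadratic rule differs from the linear one only in using a separate covariance matrix for each group and adding the log-determinant term. A minimal sketch follows, again on R's built-in iris data with illustrative priors; the object and function names (covs, quad.dist, and so on) are hypothetical.

```r
X     <- as.matrix(datasets::iris[, 1:4])
grp   <- datasets::iris$Species
prior <- c(setosa = 0.2, versicolor = 0.1, virginica = 0.7)  # illustrative priors

by.grp <- split(as.data.frame(X), grp)
means  <- lapply(by.grp, colMeans)
covs   <- lapply(by.grp, cov)          # separate covariance matrix S_j for each group

# quadratic generalized squared distance for one observation x
quad.dist <- function(x) {
  sapply(names(prior), function(j) {
    log.det <- as.numeric(determinant(covs[[j]], logarithm = TRUE)$modulus)
    mahalanobis(x, means[[j]], covs[[j]]) + log.det - 2 * log(prior[[j]])
  })
}

x.new <- X[71, ]                       # one observation, chosen arbitrarily
d     <- quad.dist(x.new)
w     <- exp(-0.5 * (d - min(d)))      # subtracting min(d) avoids underflow
post  <- w / sum(w)
round(d, 2)
round(post, 3)
```

Because each S_j is estimated from one group only, the distances are noisier than their pooled counterparts when group sizes are small, which is the concern raised in the next paragraph.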
There is evidence that quadratic discrimination does not improve misclassification
rates in many problems with small to modest sample sizes, in part, because the
quadratic rule requires an estimate of the covariance matrix for each population. A
modest to large number of observations is needed to accurately estimate variances
and correlations. I often compute the linear and quadratic rules, but use the linear
discriminant analysis unless the quadratic rule noticeably reduces the misclassification
rate.
Recall that the GPA and GMAT sample variances are roughly constant across
admission groups, but the correlation between GPA and GMAT varies widely across
groups.
The quadratic rule does not classify the training data noticeably better than
the linear discriminant analysis. The individuals in the test data have the same
classifications under both approaches. Assuming that the optimistic error rates for
the two rules were “equally optimistic”, I would be satisfied with the standard linear
discriminant analysis, and would summarize my analysis based on this approach.
Additional data is needed to decide whether the quadratic rule might help reduce the
misclassification rates.
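A comparison of this kind can be sketched with the CV = TRUE option used earlier. The example below is an illustration on R's built-in iris data, not a re-analysis of the admissions data; the object names are hypothetical.

```r
library(MASS)   # lda() and qda()

dat <- datasets::iris

# jackknife (leave-one-out) classifications under each rule
cv.lda <- lda(Species ~ ., data = dat, CV = TRUE)
cv.qda <- qda(Species ~ ., data = dat, CV = TRUE)

# jackknife error rates
err.lda <- mean(cv.lda$class != dat$Species)
err.qda <- mean(cv.qda$class != dat$Species)
round(c(lda = err.lda, qda = err.qda), 4)
```

Following the text, one would keep the linear rule unless the quadratic rule's error rate is noticeably smaller.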
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE
, method = "lda", main = "LDA partition")
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE
, method = "qda", main = "QDA partition")
[Figure: "LDA partition" (left) and "QDA partition" (right) plots of gmat versus gpa, with observations labeled A, B, or N.]
library(MASS)
qda.business <- qda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business)
qda.business
## Call:
## qda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1,
## 0.7))
##
## Prior probabilities of groups:
## A B N
## 0.2 0.1 0.7
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
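# CV = TRUE does jackknife (leave-one-out) crossvalidation
qda.business.cv <- qda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business, CV = TRUE)
# table of classification and posterior probabilities for each observation
classify.business <- data.frame(admit = business$admit
, class = qda.business.cv$class
, error = ""
, round(qda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
, paste("post", colnames(qda.business.cv$posterior), sep=""))
# error column (factor() guards against admit having been read as character)
classify.business$error <- as.character(classify.business$error)
classify.agree <- as.character(as.numeric(factor(business$admit))
- as.numeric(qda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
#classify.business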
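# predict the test data from the training data QDA rule
pred.business <- predict(qda.business, newdata = business.test)
# table of classification and posterior probabilities for each test observation
classify.business.test <- data.frame(admit = business.test$admit, class = pred.business$class
, round(pred.business$posterior,3))
colnames(classify.business.test) <- c("admit", "class"
, paste("post", colnames(pred.business$posterior), sep=""))
## error column (not computed: the true groups of the test cases are unknown)
#classify.business.test$error <- as.character(classify.business.test$error)
#classify.agree <- as.character(as.numeric(business.test$admit)
# - as.numeric(pred.business$class))
#classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
classify.business.test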
## admit class postA postB postN
## 1 NA N 0.043 0.038 0.919
## 2 NA A 0.597 0.402 0.000
## 3 NA A 0.978 0.022 0.000
## 4 NA N 0.051 0.423 0.526
## 5 NA B 0.292 0.708 0.000
## 6 NA B 0.363 0.513 0.123