Statistical Methods II
Derek L. Sonderegger
Contents

Preface

Statistical Theory
1 Matrix Manipulation
  1.1 Introduction
  1.2 Types of Matrices
  1.3 Operations on Matrices
  1.4 Exercises
2 Parameter Estimation
  2.1 Introduction
  2.2 Model Specifications
  2.3 Parameter Estimation
  2.4 Standard Errors
  2.5 R example
  2.6 Exercises
3 Inference
  3.1 Introduction
  3.2 Confidence Intervals and Hypothesis Tests
  3.3 F-tests
  3.4 Example
  3.5 Exercises
4 Contrasts
  4.1 Introduction
  4.2 Estimate and variance
  4.3 Estimating contrasts using glht()
  4.4 Using emmeans Package
  4.5 Exercises

Statistical Models
5 Analysis of Covariance (ANCOVA)
6 Two-way ANOVA
  6.1 Review of 1-way ANOVA
  6.2 Two-Way ANOVA
  6.3 Orthogonality
  6.4 Main Effects Model
  6.5 Interaction Model
  6.6 Exercises
7 Diagnostics
  7.1 Detecting Assumption Violations
  7.2 Exercises

Appendix
Statistical Theory
Chapter 1
Matrix Manipulation
Learning Outcomes
1.1 Introduction
Almost all of the calculations done in classical statistics require formulas with a large number of subscripts and many different sums. In this chapter we will develop the mathematical machinery to write these formulas in a simple, compact form using matrices.
We will first introduce the idea behind a matrix and describe several special types of matrices that we will encounter.

1.2 Types of Matrices

1.2.1 Scalars
To begin, we first define a scalar. A scalar is just a single number, either real or complex. For example, 6 is a scalar, as is $-3$. By convention, variable names for scalars will be lower case and not in bold typeface. Examples could be $a = 5$, $b = \sqrt{3}$, or $\sigma = 2$.
1.2.2 Vectors
A vector is a collection of scalars, arranged as a row or column. Our convention will be that a vector will be a lower case letter written in bold type. In other branches of mathematics it is common to put a bar over the variable name to denote that it is a vector, but in statistics we have already used a bar to denote a mean.
Examples of column vectors could be
$$a = \begin{bmatrix} 2 \\ -3 \\ 4 \end{bmatrix} \qquad b = \begin{bmatrix} 2 \\ 8 \\ 3 \\ 4 \\ 1 \end{bmatrix}$$
while examples of row vectors could be
$$c = \begin{bmatrix} 8 & 10 & 43 & -22 \end{bmatrix} \qquad d = \begin{bmatrix} -1 & 5 & 2 \end{bmatrix}$$
To denote a specific entry in the vector, we will use a subscript. For example, the second element of $d$ is $d_2 = 5$. Notice that we do not bold this symbol because the second element of the vector is the scalar value 5.
1.2.3 Matrix
Just as a vector is a collection of scalars, a matrix can be viewed as a collection
of vectors (all of the same length). We will denote matrices with bold capitalized
letters. In general, I try to use letters at the end of the alphabet for matrices.
Likewise, I try to use symmetric letters to denote symmetric matrices.
For example, the following is a matrix with two rows and three columns
$$W = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
and there is no requirement that the number of rows be equal, less than, or
greater than the number of columns. In denoting the size of the matrix, we first
refer to the number of rows and then the number of columns. Thus 𝑊 is a 2 × 3
matrix and it sometimes is helpful to remind ourselves of this by writing 𝑊 2×3 .
To pick out a particular element of a matrix, I will again use a subscripting notation, always with the row number first and then the column. Notice the notational shift to lowercase, non-bold font.
𝑤1,2 = 2 and 𝑤2,3 = 6
There are times I will wish to refer to a particular row or column of a matrix
and we will use the following notation
𝑤1,⋅ = [ 1 2 3 ]
A square matrix is a matrix with the same number of rows as columns. The
following are square
$$Z = \begin{bmatrix} 3 & 6 \\ 8 & 10 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 2 \\ 3 & 2 & 1 \end{bmatrix}$$
A square matrix that has zero entries in every location except the main diagonal
is called a diagonal matrix. Here are two examples:
$$Q = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 6 \end{bmatrix} \qquad R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$
Sometimes, to make a matrix more clear, I will replace each 0 with a dot to emphasize the non-zero components.
$$R = \begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 2 & \cdot & \cdot \\ \cdot & \cdot & 2 & \cdot \\ \cdot & \cdot & \cdot & 3 \end{bmatrix}$$
A diagonal matrix with main diagonal values exactly 1 is called the identity
matrix. The 3 × 3 identity matrix is denoted 𝐼3 .
$$I_3 = \begin{bmatrix} 1 & \cdot & \cdot \\ \cdot & 1 & \cdot \\ \cdot & \cdot & 1 \end{bmatrix}$$
1.3 Operations on Matrices

1.3.1 Transpose

The transpose of a matrix, denoted $M^T$, swaps the rows and columns. For example,
$$Z = \begin{bmatrix} 1 & 6 \\ 8 & 3 \end{bmatrix} \qquad Z^T = \begin{bmatrix} 1 & 8 \\ 6 & 3 \end{bmatrix}$$
$$M = \begin{bmatrix} 3 & 1 & 2 \\ 9 & 4 & 5 \\ 8 & 7 & 6 \end{bmatrix} \qquad M^T = \begin{bmatrix} 3 & 9 & 8 \\ 1 & 4 & 7 \\ 2 & 5 & 6 \end{bmatrix}$$
We can think of this as swapping all elements about the main diagonal. Alter-
natively we could think about the transpose as making the first row become the
first column, the second row become the second column, etc. In this fashion we
could define the transpose of a non-square matrix.
$$W = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \qquad W^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$
1.3.2 Addition and Subtraction

Addition and subtraction are performed element-wise. This means that two matrices or vectors can only be added or subtracted if their dimensions match.
$$\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 6 \\ 8 \\ 10 \\ 12 \end{bmatrix}$$
$$\begin{bmatrix} 5 & 8 \\ 2 & 4 \\ 11 & 15 \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & -6 \end{bmatrix} = \begin{bmatrix} 4 & 6 \\ -1 & 0 \\ 6 & 21 \end{bmatrix}$$
1.3.3 Multiplication
Multiplication is the operation that is vastly different for matrices and vectors than it is for scalars. There is a great deal of mathematical theory that suggests a useful way to define multiplication; what is presented below is referred to as the dot product of vectors in calculus and as the standard inner product in linear algebra.

We first define multiplication of a row vector and a column vector. For this multiplication to be defined, both vectors must be the same length. The product is the sum of the element-wise multiplications.
$$\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = (1 \cdot 5) + (2 \cdot 6) + (3 \cdot 7) + (4 \cdot 8) = 5 + 12 + 21 + 32 = 70$$
Matrix multiplication is built from these dot products: the $(i,j)$ element of a product is the dot product of the $i$th row of the first matrix with the $j$th column of the second. For example, let
$$X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix} \qquad W = \begin{bmatrix} 13 & 14 \\ 15 & 16 \\ 17 & 18 \\ 19 & 20 \end{bmatrix}$$
The $(1,1)$ element of the product $Z = XW$ is the first row of $X$ times the first column of $W$,
$$(1 \cdot 13) + (2 \cdot 15) + (3 \cdot 17) + (4 \cdot 19) = 170$$
and similarly for the remaining elements, so that
$$Z = XW = \begin{bmatrix} 170 & 180 \\ 426 & 452 \\ 682 & 724 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 1+4+3 & 2+4+6 \\ 2+6+4 & 4+6+8 \end{bmatrix} = \begin{bmatrix} 8 & 12 \\ 12 & 18 \end{bmatrix}$$
Notice that this definition of multiplication means that the order matters.
Above, we calculated 𝑋 3×4 𝑊 4×2 but we cannot reverse the order because the
inner dimensions do not match up.
Multiplying a matrix by a scalar multiplies every element of the matrix by that scalar:
$$5 \begin{bmatrix} 4 & 5 \\ 7 & 6 \\ 9 & 10 \end{bmatrix} = \begin{bmatrix} 20 & 25 \\ 35 & 30 \\ 45 & 50 \end{bmatrix}$$
Because of this definition, it is clear that 𝑎𝑋 = 𝑋𝑎 and the order does not
matter. Thus when mixing scalar multiplication with matrices, it is acceptable
to reorder scalars, but not matrices.
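These operations are easy to try in R. The following is a minimal sketch (not from the original text) that reproduces the examples above using the base functions matrix(), t(), and the %*% operator.

W  <- matrix( c(1,2,3, 4,5,6), nrow=2, byrow=TRUE )   # the 2x3 matrix W
t(W)                                                  # its transpose, a 3x2 matrix
X  <- matrix( 1:12,  nrow=3, byrow=TRUE )             # the 3x4 matrix X
W2 <- matrix( 13:20, nrow=4, byrow=TRUE )             # the 4x2 matrix multiplied above (renamed W2 here)
X %*% W2                                              # matrix multiplication: 170, 180, 426, ...
5 * X                                                 # scalar multiplication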
1.3.7 Determinant
The determinant is defined only for square matrices and can be thought of as the matrix equivalent of the absolute value or magnitude (i.e. $|-6| = 6$). The determinant gives a measure of the multi-dimensional size of a matrix (say the matrix $A$) and as such is denoted $\det(A)$ or $|A|$. Generally this is a very tedious thing to calculate by hand; for completeness' sake, we give a definition and small examples.
For a $2 \times 2$ matrix,
$$\begin{vmatrix} a & c \\ b & d \end{vmatrix} = ad - cb$$
For example,
$$\begin{vmatrix} 5 & 2 \\ 3 & 10 \end{vmatrix} = 50 - 6 = 44$$
1.3.8 Inverse
In the scalar case, suppose we wish to solve the equation $5x = 15$ for $x$. To do so, we multiply each side of the equation by the inverse of 5, which is $1/5$:
$$\begin{aligned} 5x &= 15 \\ \tfrac{1}{5} \cdot 5 \cdot x &= \tfrac{1}{5} \cdot 15 \\ 1 \cdot x &= 3 \\ x &= 3 \end{aligned}$$
For scalars, we know that the inverse of a scalar $a$ is the value that, when multiplied by $a$, gives 1. That is, we seek $a^{-1}$ such that $a\,a^{-1} = 1$.

In the matrix case, I am interested in finding $A^{-1}$ such that $A^{-1}A = I$ and $AA^{-1} = I$. For both of these multiplications to be defined, $A$ must be a square matrix, and so the inverse is only defined for square matrices.
For a $2 \times 2$ matrix
$$W = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$
the inverse is
$$W^{-1} = \frac{1}{\det W}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
For example, if
$$W = \begin{bmatrix} 1 & 2 \\ 5 & 3 \end{bmatrix}$$
then $\det W = (1)(3) - (2)(5) = -7$ and
$$W^{-1} = \frac{1}{-7}\begin{bmatrix} 3 & -2 \\ -5 & 1 \end{bmatrix} = \begin{bmatrix} -\tfrac{3}{7} & \tfrac{2}{7} \\ \tfrac{5}{7} & -\tfrac{1}{7} \end{bmatrix}$$
and thus
$$W W^{-1} = \begin{bmatrix} 1 & 2 \\ 5 & 3 \end{bmatrix}\begin{bmatrix} -\tfrac{3}{7} & \tfrac{2}{7} \\ \tfrac{5}{7} & -\tfrac{1}{7} \end{bmatrix} = \begin{bmatrix} -\tfrac{3}{7}+\tfrac{10}{7} & \tfrac{2}{7}-\tfrac{2}{7} \\ -\tfrac{15}{7}+\tfrac{15}{7} & \tfrac{10}{7}-\tfrac{3}{7} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I_2$$
Not every square matrix has an inverse. If the determinant of the matrix (which we think of as some measure of the magnitude or size of the matrix) is zero, then the formula would require us to divide by zero. Just as we cannot find the inverse of zero (i.e. solve $0x = 1$ for $x$), a matrix with zero determinant is said to have no inverse.
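In R, the determinant and inverse are available via det() and solve(). A minimal sketch (not from the original text) checking the 2 × 2 example above:

W <- matrix( c(1,2, 5,3), nrow=2, byrow=TRUE )
det(W)            # -7
solve(W)          # the inverse calculated above
W %*% solve(W)    # returns the 2x2 identity matrix (up to rounding error)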
1.4 Exercises
Consider the following matrices:
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 6 & 5 & 4 \end{bmatrix} \quad B = \begin{bmatrix} 6 & 4 & 3 \\ 8 & 7 & 6 \end{bmatrix} \quad c = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad d = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} \quad E = \begin{bmatrix} 1 & 2 \\ 2 & 6 \end{bmatrix}$$
1. Find $Bc$.
2. Find $AB^T$.
3. Find $c^T d$.
4. Find $cd^T$.
5. Confirm that
$$E^{-1} = \begin{bmatrix} 3 & -1 \\ -1 & 1/2 \end{bmatrix}$$
is the inverse of $E$ by calculating $EE^{-1} = I$.
Chapter 2
Parameter Estimation
Learning Outcomes
$$\hat{\beta} = (X^TX)^{-1}X^Ty \qquad \hat{y} = X\hat{\beta}$$
and
$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n-p}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad StdErr(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{jj}}$$
2.1 Introduction
We have previously looked at ANOVA and regression models and, in many ways,
they felt very similar. In this chapter we will introduce the theory that allows
us to understand both models as a particular flavor of a larger class of models
known as linear models.
2.2 Model Specifications
First we clarify what a linear model is. A linear model is a model where the
data (which we will denote using roman letters as 𝑥 and 𝑦) and parameters of
interest (which we denote using Greek letters such as 𝛼 and 𝛽) interact only via
addition and multiplication. The following are linear models:
Model                 Formula
ANOVA                 $y_{ij} = \mu + \tau_i + \epsilon_{ij}$
Simple Regression     $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
Quadratic Term        $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$
General Regression    $y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \epsilon_i$
Notice that in the Quadratic model the square is not a parameter, and we can consider $x_i^2$ as just another column of data. This leads to the General Regression example, where we simply add more slopes for other covariates; the $p$th covariate is denoted $x_{\cdot,p}$ and might be some transformation (such as $x^2$ or $\log x$) of another column of data. The critical point is that the transformation of the data $x$ does not depend on a parameter. Thus the following is not a linear model:
$$y_i = \beta_0 + \beta_1 x_i^{\alpha} + \epsilon_i$$
We would like to represent all linear models in a similar compact matrix repre-
sentation. This will allow us to make the transition between simple and multiple
regression (and ANCOVA) painlessly.
Typically we’ll write the model as if we are specifying the 𝑖𝑡ℎ element of the
data set
$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{signal}} + \underbrace{\epsilon_i}_{\text{noise}} \qquad \text{where } \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$
Notice we have a data generation model in which there is some relationship between the explanatory variable and the response, which we refer to as the “signal” part of the model, and a noise term which represents unknown or unmeasured effects that move the response of each data point. We don't know what those unknown or unmeasured effects are, but we do know that the sum of those effects results in a vertical shift away from the signal part of the model.
[Figure: response variable versus explanatory variable, with the signal line $\beta_0 + \beta_1 x$ and an error $\epsilon_i$ shown as the vertical deviation of a point from the line.]
This representation of the model implicitly assumes that our data set has $n$ observations, and we could write the model using all the observations using matrices and vectors that correspond to the data and the parameters.
𝑦1 = 𝛽 0 + 𝛽 1 𝑥1 + 𝜖 1
𝑦2 = 𝛽 0 + 𝛽 1 𝑥2 + 𝜖 2
𝑦3 = 𝛽 0 + 𝛽 1 𝑥3 + 𝜖 3
⋮
𝑦𝑛−1 = 𝛽0 + 𝛽1 𝑥𝑛−1 + 𝜖𝑛−1
𝑦𝑛 = 𝛽 0 + 𝛽 1 𝑥𝑛 + 𝜖 𝑛
where, as usual, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. These equations can be written using matrices as
$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}}_{\boldsymbol{y}} = \underbrace{\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{n-1} \\ \epsilon_n \end{bmatrix}}_{\boldsymbol{\epsilon}}$$
𝑦 = 𝑋𝛽 + 𝜖 where 𝜖 ∼ 𝑁 (0, 𝜎2 𝐼 𝑛 )
where $X$ is referred to as the design matrix and $\beta$ is the vector of location parameters we are interested in estimating. To be very general, a vector of random variables needs a description of how its elements vary in relation to each other, so we need to specify the variance (covariance) matrix. However, because $\epsilon_i$ is independent of $\epsilon_j$ for all $(i,j)$ pairs, the variance matrix can be written as $\sigma^2 I$ because all of the covariances are zero.
The ANOVA model is also a linear model; all we must do is create an appropriate design matrix. Given the design matrix $X$, all the calculations are identical to those in the simple regression case. Consider first the cell means representation of the ANOVA model,
𝑦𝑖,𝑗 = 𝜇𝑖 + 𝜖𝑖,𝑗
where 𝑦𝑖,𝑗 is the 𝑗th observation within the 𝑖th group. To clearly show the
creation of the 𝑋 matrix, let the number of groups be 𝑝 = 3 and the number of
observations per group be 𝑛𝑖 = 4. We now expand the formula to show all the
data.
𝑦1,1 = 𝜇1 + 𝜖1,1
𝑦1,2 = 𝜇1 + 𝜖1,2
𝑦1,3 = 𝜇1 + 𝜖1,3
𝑦1,4 = 𝜇1 + 𝜖1,4
𝑦2,1 = 𝜇2 + 𝜖2,1
𝑦2,2 = 𝜇2 + 𝜖2,2
𝑦2,3 = 𝜇2 + 𝜖2,3
𝑦2,4 = 𝜇2 + 𝜖2,4
𝑦3,1 = 𝜇3 + 𝜖3,1
𝑦3,2 = 𝜇3 + 𝜖3,2
𝑦3,3 = 𝜇3 + 𝜖3,3
𝑦3,4 = 𝜇3 + 𝜖3,4
$$\underbrace{\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{1,4} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{2,4} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \\ y_{3,4} \end{bmatrix}}_{\boldsymbol{y}} = \underbrace{\begin{bmatrix} 1&0&0 \\ 1&0&0 \\ 1&0&0 \\ 1&0&0 \\ 0&1&0 \\ 0&1&0 \\ 0&1&0 \\ 0&1&0 \\ 0&0&1 \\ 0&0&1 \\ 0&0&1 \\ 0&0&1 \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{2,4} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \\ \epsilon_{3,4} \end{bmatrix}}_{\boldsymbol{\epsilon}}$$
Notice that each column of the 𝑋 matrix is acting as an indicator if the obser-
vation is an element of the appropriate group. As such, these are often called
indicator variables. Another term for these, which I find less helpful, is dummy
variables.
Alternatively, we could use the offset representation of the ANOVA model,
$$y_{i,j} = \mu + \tau_i + \epsilon_{i,j}$$
where $\mu$ is the mean of the reference group (here group 1, so $\tau_1 = 0$) and $\tau_2$ and $\tau_3$ are the offsets of groups 2 and 3 from the reference group.
$$\underbrace{\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{1,4} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{2,4} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \\ y_{3,4} \end{bmatrix}}_{\boldsymbol{y}} = \underbrace{\begin{bmatrix} 1&0&0 \\ 1&0&0 \\ 1&0&0 \\ 1&0&0 \\ 1&1&0 \\ 1&1&0 \\ 1&1&0 \\ 1&1&0 \\ 1&0&1 \\ 1&0&1 \\ 1&0&1 \\ 1&0&1 \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\begin{bmatrix} \mu \\ \tau_2 \\ \tau_3 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{2,4} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \\ \epsilon_{3,4} \end{bmatrix}}_{\boldsymbol{\epsilon}}$$
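In R, these design matrices do not have to be typed in by hand; model.matrix() builds them from a factor. A small sketch (not from the original text; the factor grp is hypothetical, with $p = 3$ levels and $n_i = 4$ observations each):

grp <- factor( rep( c('Grp1','Grp2','Grp3'), each=4 ) )
model.matrix( ~ grp - 1 )   # cell means coding: one indicator column per group
model.matrix( ~ grp )       # offset coding: an intercept plus indicators for groups 2 and 3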
For both simple regression and ANOVA, we can write the model in matrix form
as
𝑦 = 𝑋𝛽 + 𝜖 where 𝜖 ∼ 𝑁 (0, 𝜎2 𝐼 𝑛 )
which could also be written as
𝑦 ∼ 𝑁 (𝑋𝛽, 𝜎2 𝐼 𝑛 )
and we could use the maximum-likelihood principle to find estimators for 𝛽 and
𝜎2 . In this section, we will introduce the estimators 𝛽̂ and 𝜎̂ 2 .
2.3 Parameter Estimation

Our goal is to find the best estimate of $\beta$ given the data. To justify the formula, consider the case where there are no error terms (i.e. $\epsilon_i = 0$ for all $i$). Thus we have
𝑦 = 𝑋𝛽
and our goal is to solve for $\beta$. To do this, we must use a matrix inverse, but since inverses only exist for square matrices, we pre-multiply by $X^T$ (notice that $X^TX$ is a symmetric $2 \times 2$ matrix).
$$X^Ty = X^TX\beta$$
and then pre-multiply by $(X^TX)^{-1}$:
$$(X^TX)^{-1}X^Ty = (X^TX)^{-1}X^TX\beta$$
$$(X^TX)^{-1}X^Ty = \beta$$
This exercise suggests that $(X^TX)^{-1}X^Ty$ is a good place to start when looking for the maximum-likelihood estimator for $\beta$. Happily it turns out that this quantity is in fact the maximum-likelihood estimator for the data generation model
$$y \sim N\left(X\beta,\; \sigma^2 I_n\right)$$
namely
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
Consider again the simple regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
Using our estimates 𝛽̂ we can obtain predicted values for the regression line at
any x-value. In particular we can find the predicted value for each 𝑥𝑖 value in
our dataset.
𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖
$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy$$
where $H = X(X^TX)^{-1}X^T$ is often called the hat-matrix because it takes $y$ to $\hat{y}$, and it has many interesting theoretical properties. In particular, $H$ is a projection matrix: it takes the observation vector $y$ and projects it onto the $p$-dimensional subspace spanned by the columns of $X$. Projection matrices have many useful properties, and much of the theory of linear models utilizes $H$.
We can now estimate the error terms via
𝜖 ̂ = 𝑦 − 𝑦̂
= 𝑦 − 𝐻𝑦
= (𝐼 𝑛 − 𝐻) 𝑦
As usual we estimate 𝜎2 using the mean-squared error, but the general formula
is
$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n-p}\sum_{i=1}^{n}\hat{\epsilon}_i^{\,2} = \frac{1}{n-p}\,\hat{\epsilon}^T\hat{\epsilon}$$
where 𝛽 has 𝑝 elements, and thus we have 𝑛 − 𝑝 degrees of freedom.
2.4 Standard Errors

Recall that when the variance operator is applied to a linear combination of a random vector, additive constants are ignored but multiplicative constants are pulled out as follows:
$$Var\left(A^T\epsilon + b\right) = Var\left(A^T\epsilon\right) = A^T\,Var\left(\epsilon\right)A$$
We next derive the sampling variance of our estimator 𝛽̂ by first noting that 𝑋
and 𝛽 are constants and therefore
𝑉 𝑎𝑟 (𝑦) = 𝑉 𝑎𝑟 (𝑋𝛽 + 𝜖)
= 𝑉 𝑎𝑟 (𝜖)
= 𝜎2 𝐼 𝑛
because the error terms are independent, and therefore $Cov(\epsilon_i, \epsilon_j) = 0$ when $i \ne j$ and $Var(\epsilon_i) = \sigma^2$. Recalling that constants come out of the variance operator as the constant squared (for matrices, $Var(Ay) = A\,Var(y)\,A^T$), we have
$$\begin{aligned} Var\left(\hat{\beta}\right) &= Var\left((X^TX)^{-1}X^Ty\right) \\ &= (X^TX)^{-1}X^T\,Var(y)\,X(X^TX)^{-1} \\ &= (X^TX)^{-1}X^T\,\sigma^2 I_n\,X(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1} \end{aligned}$$
Using this, the standard error (i.e. the estimated standard deviation) of 𝛽𝑗̂ (for
any 𝑗 in 1, … , 𝑝) is
$$StdErr\left(\hat{\beta}_j\right) = \sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{jj}}$$
The statistic $\hat{\beta} = (X^TX)^{-1}X^Ty$ is the unbiased maximum-likelihood estimator of $\beta$.
𝑦 ̂ = 𝑋 𝛽̂
𝜖 ̂ = 𝑦 − 𝑦̂
The estimate of 𝜎2 is
$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n-p}\sum_{i=1}^{n}\hat{\epsilon}_i^{\,2} = \frac{1}{n-p}\,\hat{\epsilon}^T\hat{\epsilon}$$
and is the typical squared distance between an observation and the model pre-
diction.
The standard error (i.e. the estimated standard deviation) of 𝛽𝑗̂ (for any 𝑗 in
1, … , 𝑝) is
$$StdErr\left(\hat{\beta}_j\right) = \sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{jj}}$$
2.5 R example
Here we will work an example in R and do both the “hand” calculation as well
as using the lm() function to obtain the same information.
library(ggplot2)   # assumed to have been loaded earlier in the book
n <- 20
x <- seq(0, 10, length=n)
y <- -3 + 2*x + rnorm(n, sd=2)
my.data <- data.frame(x=x, y=y)
ggplot(my.data) + geom_point(aes(x=x, y=y))
[Figure: scatterplot of the simulated (x, y) data.]
The design matrix for this simple regression model is
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix}$$
and the estimate of $\beta$ is
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
Our next step is to calculate the predicted values 𝑦 ̂ and the residuals 𝜖 ̂
𝑦 ̂ = 𝑋 𝛽̂
𝜖 ̂ = 𝑦 − 𝑦̂
Now that we have the residuals, we can calculate 𝜎̂ 2 and the standard errors of
𝛽𝑗̂
$$\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n}\hat{\epsilon}_i^{\,2} \qquad\qquad StdErr\left(\hat{\beta}_j\right) = \sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{jj}}$$
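A sketch of the corresponding R calculation is below; the object names (beta.hat, sigma.hat, std.errs) are assumptions chosen to match the printout that follows.

X         <- cbind( 1, x )                              # n x 2 design matrix
beta.hat  <- solve( t(X) %*% X ) %*% t(X) %*% y         # (X^T X)^{-1} X^T y
y.hat     <- X %*% beta.hat                             # predicted values
resids    <- y - y.hat                                  # residuals
p         <- ncol(X)
sigma.hat <- sqrt( sum( resids^2 ) / (n - p) )          # estimate of sigma
std.errs  <- sqrt( sigma.hat^2 * diag( solve( t(X) %*% X ) ) )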
We now print out the important values and compare them to the summary
output given by the lm() function in R.
cbind(Est=beta.hat, StdErr=std.errs)
## Est StdErr
## -1.448250 0.7438740
## x 1.681485 0.1271802
sigma.hat
## [1] 1.726143
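For comparison, the same quantities from lm() (a sketch that matches the Call shown below):

summary( lm( y ~ x ) )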
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5127 -1.2134 0.1469 1.2610 3.2358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4482 0.7439 -1.947 0.0673 .
## x 1.6815 0.1272 13.221 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.726 on 18 degrees of freedom
## Multiple R-squared: 0.9066, Adjusted R-squared: 0.9015
## F-statistic: 174.8 on 1 and 18 DF, p-value: 1.044e-10
2.6 Exercises
1. We will do a simple ANOVA analysis on example 8.2 from Ott & Long-
necker using the matrix representation of the model. A clinical psychol-
ogist wished to compare three methods for reducing hostility levels in
university students, and used a certain test (HLT) to measure the degree
of hostility. A high score on the test indicated great hostility. The psy-
chologist used 24 students who obtained high and nearly equal scores in
the experiment. Eight were selected at random from among the 24 problem
cases and were treated with method 1. Seven of the remaining 16 students
were selected at random and treated with method 2. The remaining nine
students were treated with method 3. All treatments were continued for
a one-semester period. Each student was given the HLT test at the end of
the semester, with the results shown in the following table. (This analysis
was done in section 8.3 of my STA 570 notes)
Method Values
1 96, 79, 91, 85, 83, 91, 82, 87
2 77, 76, 74, 73, 78, 71, 80
3 66, 73, 69, 66, 77, 73, 71, 70, 74
$$y_{ij} = \beta_i + \epsilon_{ij}$$
where $\beta_i$ is the mean of group $i$ and $\epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$.
a. Create one vector of all 24 hostility test scores y. (Use the c()
function.)
b. Create a design matrix X with dummy variables for columns that code
for what group an observation belongs to. Notice that X will be a 24
rows by 3 column matrix. Hint: An R function that might be handy
is cbind(a,b) which will bind two vectors or matrices together along
the columns. There is also a corresponding rbind() function that
binds vectors/matrices along rows. Furthermore, the repeat command
rep() could be handy.
c) Find $\hat{\beta}$ using the matrix formula given in class. Hint: The R function t(A) computes the matrix transpose $A^T$, solve(A) computes $A^{-1}$, and the operator %*% does matrix multiplication (used as A %*% B).
d) Examine the matrix $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$. What do you notice about it? In particular, think about the result when you right-multiply by y. How does this matrix calculate the appropriate group means using the appropriate group sizes $n_i$?
2. Consider the following data on $n = 10$ trees, with the goal of fitting the multiple regression model below using the matrix formulas.

DBH     30.5  31.5  31.7  32.3  33.3  35    35.4  35.6  36.3  37.8
CC      0.74  0.69  0.65  0.72  0.58  0.5   0.6   0.7   0.52  0.6
Height  58    64    65    70    68    63    78    80    74    76

$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i$$
Chapter 3

Inference
Learning Outcomes
3.1 Introduction
The goal of statistics is to take information calculated from sample data and use
that information to estimate population parameters. The problem is that the
sample statistic is only a rough guess and if we were to collect another sample
of data, we’d get a different sample statistic and thus a different parameter
estimate. Therefore, we need to utilize the sample statistics to create confidence
intervals and make hypothesis tests about those parameters.
In this chapter, we'll consider a dataset about the Galápagos Islands relating the number of tortoise species on an island to various island characteristics such as size, maximum elevation, etc. The dataset contains $n = 30$ islands and the following variables:
Variable Description
Species Number of tortoise species found on the island
Endemics Number of tortoise species endemic to the island
Elevation Elevation of the highest point on the island
Area Area of the island (km2 )
Nearest Distance to the nearest neighboring island (km)
Scruz Distance to the Santa Cruz islands (km)
Adjacent Area of the nearest adjacent island (km2 )
3.2 Confidence Intervals and Hypothesis Tests

Confidence intervals take the usual form
$$\text{Est} \pm Q^*_{1-\alpha/2}\; StdErr\left(\text{Est}\right)$$
where $Q^*_{1-\alpha/2}$ is the $1-\alpha/2$ quantile from some appropriate distribution. The mathematical details about which distribution the quantile should come from are often obscure, but usually involve the degrees of freedom $n-p$, where $p$ is the number of parameters in the “signal” part of the model.
The confidence interval formula for the $\beta_j$ parameters in a linear model is
$$\hat{\beta}_j \pm t^*_{1-\alpha/2,\,n-p}\; StdErr\left(\hat{\beta}_j\right)$$
where $t^*_{1-\alpha/2,\,n-p}$ is the $1-\alpha/2$ quantile from the t-distribution with $n-p$ degrees of freedom. A test statistic for testing $H_0: \beta_j = 0$ versus $H_a: \beta_j \ne 0$ is
$$t_{n-p} = \frac{\hat{\beta}_j - 0}{StdErr\left(\hat{\beta}_j\right)}$$
3.3 F-tests
We wish to develop a rigorous way to compare nested models and decide if
a complicated model explains enough more variability than a simple model
to justify the additional intellectual effort of thinking about the data in the
complicated fashion.
It is important to specify that we are developing a way of testing nested models. By nested, we mean that the simple model can be created from the full model just by setting one or more model parameters to zero. This method is not limited to testing whether a single parameter could be zero; we can also test whether an entire set of parameters could all be equal to zero.
3.3.1 Theory
Recall that in the simple regression and ANOVA cases we were interested in
comparing a simple model versus a more complex model. For each model we
computed the sum of squares error (SSE) and said that if the complicated model
performed much better than the simple then 𝑆𝑆𝐸𝑠𝑖𝑚𝑝𝑙𝑒 ≫ 𝑆𝑆𝐸𝑐𝑜𝑚𝑝𝑙𝑒𝑥 .
Recall from the estimation chapter that the model parameter estimates are found by using the $\hat{\beta}$ values that minimize the SSE. If it were to turn out that a $\hat{\beta}_j$ of zero minimized the SSE, then zero would be the estimate. Next consider that we are requiring the simple model to be a simplification of the complex model obtained by setting certain parameters to zero. Because the simple model sets $\beta_j = 0$ while the complex model allows $\hat{\beta}_j$ to be any real value (including zero), and because $\hat{\beta}_j$ is chosen to minimize the SSE, we must have $SSE_{simple} \ge SSE_{complex}$.
We’ll define 𝑆𝑆𝐸𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 = 𝑆𝑆𝐸𝑠𝑖𝑚𝑝𝑙𝑒 − 𝑆𝑆𝐸𝑐𝑜𝑚𝑝𝑙𝑒𝑥 ≥ 0 and observe that if
the complex model is a much better fit to the data, then 𝑆𝑆𝐸𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 is large.
But how large is large enough to be statistically significant? In part, it depends on how many more parameters were added to the model and on the amount of unexplained variability left in the complex model. Let $df_{diff}$ be the difference in the number of parameters between the simple and complex models.
As with most test statistics, the F statistic can be considered as a “signal-to-noise” ratio, where the signal part is the increased amount of variability explained per additional parameter by the complex model and the noise part is just the MSE of the complex model:
$$F = \frac{SSE_{difference}\,/\,df_{diff}}{SSE_{complex}\,/\,df_{complex}} = \frac{SSE_{difference}\,/\,df_{diff}}{MSE_{complex}}$$
and we claimed that if the null hypothesis was true (i.e. the complex model is an
unnecessary obfuscation of the simple), then this ratio follows an F-distribution
with degrees of freedom 𝑑𝑓𝑑𝑖𝑓𝑓 and 𝑑𝑓𝑐𝑜𝑚𝑝𝑙𝑒𝑥 .
The F-distribution is centered near one, and we should reject the simple model (in favor of the complex model) if this F statistic is much larger than one. Therefore the p-value for the test is the area under the $F_{df_{diff},\,df_{complex}}$ distribution to the right of the observed F statistic.

[Figure: density of the $F_{df_{diff},\,df_{complex}}$ distribution, with the p-value corresponding to the upper tail beyond the observed statistic.]
3.4 Example
We will consider a data set from Johnson and Raven (1973) which also appears
in Weisberg (1985). This data set is concerned with the number of tortoise
species on 𝑛 = 30 different islands in the Galapagos. The variables of interest
in the data set are:
Variable Description
Species Number of tortoise species found on the island
Endemics Number of tortoise species endemic to the island
Elevation Elevation of the highest point on the island
Area Area of the island (km2 )
Nearest Distance to the nearest neighboring island (km)
Scruz Distance to the Santa Cruz islands (km)
Adjacent Area of the nearest adjacent island (km2 )
We will first read in the data set from the package faraway.
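A sketch of the import:

data('gala', package='faraway')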
First we will create the full model that predicts the number of species as a function of elevation, area, nearest, scruz and adjacent. Notice that this model has $p = 6$ $\beta_i$ values (one for each covariate plus the intercept). We can happily fit this model just by adding terms on the right hand side of the model formula. Notice that R creates the design matrix $X$ for us.
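A sketch of the model fit; the object name M.c (for the complex model) matches its use later in the chapter.

M.c <- lm( Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala )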
All the usual quantities from chapter two can be calculated, and we can see the summary table for this regression as follows:
summary(M.c)
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
The first test we might want to do is to test if any of the covariates are signifi-
cant. That is to say that we want to test the full model versus the simple null
hypothesis model
𝑦𝑖 = 𝛽0 + 𝜖𝑖
that has no covariates and only a y-intercept. So we will create a simple model
and calculate the appropriate Residual Sums of Squares (RSS) for each model,
along with the difference in degrees of freedom between the two models.
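A sketch of that calculation (the object names are assumptions; only base R functions are used):

M.s    <- lm( Species ~ 1, data=gala )                     # the intercept-only model
RSS.c  <- sum( resid(M.c)^2 );  df.c <- df.residual(M.c)   # 30 - 6 = 24
RSS.s  <- sum( resid(M.s)^2 );  df.s <- df.residual(M.s)   # 30 - 1 = 29
F.stat <- ( (RSS.s - RSS.c) / (df.s - df.c) ) / ( RSS.c / df.c )
F.stat
1 - pf( F.stat, df.s - df.c, df.c )                        # p-value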
## [1] 15.69941
## [1] 6.839486e-07
Both the F.stat and its p-value are given at the bottom of the summary table. However, I might be interested in creating an ANOVA table for this situation. This type of table is often shown in textbooks, but base functions in R don't produce exactly this table. Instead, a slightly abbreviated version can be obtained by using the anova() function on the two models of interest; this representation skips showing the Mean Squared calculations.
anova(M.s, M.c)
In the case where the simple and complex models differ by only a single parameter $\beta_j$, the F statistic simplifies to
$$F = \frac{\left[RSS_s - RSS_c\right]/1}{RSS_c\,/\,(n-p)} = \cdots = \left[\frac{\hat{\beta}_j}{SE\left(\hat{\beta}_j\right)}\right]^2 = t^2$$
where 𝑡 has a t-distribution with 𝑛 − 𝑝 degrees of freedom under the null hy-
pothesis that the simple model is sufficient.
We consider the case of removing the covariate Area from the model and will
calculate our test statistic using both methods.
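A sketch of the F statistic computed by comparing the model without Area to the full model:

M.s <- lm( Species ~ Elevation + Nearest + Scruz + Adjacent, data=gala )   # drop Area
F.stat <- ( sum(resid(M.s)^2) - sum(resid(M.c)^2) ) / ( sum(resid(M.c)^2) / df.residual(M.c) )
F.stat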
## [1] 1.139792
1 - pf(F.stat, 1, 24)
## [1] 0.296318
sqrt(F.stat)
## [1] 1.067611
To calculate it using the estimated coefficient and its standard error, we must
grab those values from the summary table
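One way to do that (a sketch) is with the broom package; the object name temp is ours.

temp <- broom::tidy(M.c)
temp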
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.07 19.2 0.369 0.715
## 2 Area -0.0239 0.0224 -1.07 0.296
## 3 Elevation 0.319 0.0537 5.95 0.00000382
## 4 Nearest 0.00914 1.05 0.00867 0.993
## 5 Scruz -0.241 0.215 -1.12 0.275
## 6 Adjacent -0.0748 0.0177 -4.23 0.000297
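The t statistic for Area is then its estimate divided by its standard error (a sketch):

t <- temp$estimate[2] / temp$std.error[2]
t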
## [1] -1.067611
2 * pt(t, 24)
## [1] 0.296318
All that hand calculation is tedious, so we can again use the anova() command to compare the two models.
anova(M.s, M.c)
We find a large p-value associated with this test and can safely stay with the null
hypothesis, that the simple model is sufficient to explain the observed variability
in the number of species of tortoise.
3.5 Exercises
1. The dataset prostate in package faraway has information about a study of 97 men with prostate cancer. We import the data and examine the first few observations using the following commands.
data(prostate, package='faraway')
head(prostate)
It is possible to get information about the data set using the command
help(prostate). Fit a model with lpsa as the response and all the other
variables as predictors.
a) Compute 90% and 95% confidence intervals for the parameter asso-
ciated with age. Using just these intervals, what could we deduce about the p-value for age in the regression summary? Hint: look at
the help for the function confint(). You’ll find the level option
to be helpful. Alternatively use the broom::tidy() function with the
conf.int=TRUE option and also use the level= option as well.
b) Remove all the predictors that are not significant at the 5% level.
Test this model against the original model. Which is preferred?
2. Thirty samples of cheddar cheese were analyzed for their content of acetic
acid, hydrogen sulfide and lactic acid. Each sample was tasted and scored
by a panel of judges and the average taste score produced. Use the
cheddar dataset from the faraway package (import it the same way you
did in problem one, but now use cheddar) to answer the following:
a) Fit a regression model with taste as the response and the three chem-
ical contents as predictors. Identify the predictors that are statisti-
cally significant at the 5% level.
b) Acetic and H2S are measured on a log10 scale. Create two new
columns in the cheddar data frame that contain the values on their
original scale. Fit a linear model that uses the three covariates on
their non-log scale. Identify the predictors that are statistically sig-
nificant at the 5% level for this model.
c) Can we use an 𝐹 -test to compare these two models? Explain why
or why not. Which model provides a better fit to the data? Explain
your reasoning.
d) For the model in part (a), if a sample of cheese were to have H2S in-
creased by 2 (where H2S is on the log scale and we increase this value
by 2 using some method), what change in taste would be expected?
What caveats must be made in this interpretation? Hint: I don't
want to get into interpreting parameters on the log scale just yet. So
just interpret this as adding 2 to the covariate value and predicting
the change in taste.
3. The sat data set in the faraway package gives data collected to study the
relationship between expenditures on public education and test results.
a) Fit a model that with total SAT score as the response and only the
intercept as a covariate.
b) Fit a model with total SAT score as the response and expend, ratio,
and salary as predictors (along with the intercept).
c) Compare the models in parts (a) and (b) using an F-test. Is the
larger model superior?
d) Examine the summary table of the larger model. Does this contradict
your results in part (c)? What might be causing this issue? Create
a graph or summary diagnostics to support your guess.
e) Fit the model with salary and ratio (along with the intercept) as
predictor variables and examine the summary table. Which covari-
ates are significant?
f) Now add takers to the model (so the model now includes three
predictor variables along with the intercept). Test the hypothesis
that 𝛽𝑡𝑎𝑘𝑒𝑟𝑠 = 0 using the summary table.
g) Discuss why ratio was not significant in the model in part (e) but
was significant in part (f). Hint: Look at the Residual Standard Error
𝜎̂ in each model and argue that each t-statistic is some variant of a
“signal-to-noise” ratio and that the “noise” part is reduced in the
second model.
4. In this exercise, we will show that adding a covariate to the model that is
just random noise will decrease the model Sum of Squared Error (SSE).
d) Fit a linear model that includes this new Noise variable in addition
to the Height. Calculate the SSE error in the same manner as before.
Does it decrease or increase. Quantify how much it has changed.
e) Repeat parts (c) and (d) several times. Comment on the trend in
change in SSE. Hint: This isn’t strictly necessary but is how I would
go about answering this question. Wrap parts (c) and (d) in a for
loop and generate a data.frame of a couple hundred runs. Then make
a density plot of the SSE values for the complex models and add a
vertical line on the graph of the simple model SSE.
results <- NULL
for( i in 1:2000 ){
  # Do stuff
  results <- results %>% rbind( glance(model) )
}
ggplot(results, aes(x=deviance)) +
  geom_density() +
  geom_vline( xintercept = simple.SSE )
Chapter 4
Contrasts
4.1 Introduction
We have written our model as 𝑦 = 𝑋𝛽 + 𝜖 and often we are interested in some
linear function of the 𝛽.̂ Some examples include the model predictions 𝑦𝑖̂ = 𝑋 𝑖⋅ 𝛽
where 𝑋 𝑖⋅ is the 𝑖𝑡ℎ row of the 𝑋 matrix. Other examples include differences in
group means in a one-way ANOVA or differences in predicted values 𝑦𝑖̂ − 𝑦𝑗̂ .
All of these can be written as $c^T\beta$ for some vector $c$.
We often are interested in estimating a function of the parameters 𝛽. For ex-
ample in the offset representation of the ANOVA model with 3 groups we have
𝑦𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖𝑗
where
$$\beta = \begin{bmatrix} \mu & \tau_2 & \tau_3 \end{bmatrix}^T$$
and 𝜇 is the mean of the control group, group one is the control group and thus
𝜏1 = 0, and 𝜏2 and 𝜏3 are the offsets of group two and three from the control
group. In this representation, the mean of group two is 𝜇 + 𝜏2 and is estimated
with 𝜇̂ + 𝜏2̂ .
A contrast is a linear combination of the elements of $\hat{\beta}$, which is a fancy way of saying that it is a function of the elements of $\hat{\beta}$ in which the elements are multiplied by known constants and then added together. For example, the mean of group two, $\mu + \tau_2$, uses the constants $(1, 1, 0)$.
Similarly in the simple regression case, I will be interested in the height of the
regression line at 𝑥0 . This height can be written as
$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 = \underbrace{\begin{bmatrix} 1 & x_0 \end{bmatrix}}_{c^T} \cdot \underbrace{\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}}_{\hat{\beta}}$$
In this manner, we could think of calculating all of the predicted values 𝑦𝑖̂ as
just the result of the contrast 𝑋 𝛽̂ where our design matrix 𝑋 takes the role of
the contrasts.
The variance of a contrast is
$$Var\left(c^T\hat{\beta}\right) = c^T\,Var\left(\hat{\beta}\right)c = \sigma^2\, c^T\left(X^TX\right)^{-1}c$$
and the standard error is found by plugging in our estimate of $\sigma^2$ and taking the square root:
$$StdErr\left(c^T\hat{\beta}\right) = \sqrt{\hat{\sigma}^2\, c^T\left(X^TX\right)^{-1}c} = \hat{\sigma}\sqrt{c^T\left(X^TX\right)^{-1}c}$$
As usual, we can now calculate confidence intervals for $c^T\beta$ using the usual formula
$$\text{Est} \pm t^{1-\alpha/2}_{n-p}\; StdErr\left(\text{Est}\right)$$
which here is
$$c^T\hat{\beta} \pm t^{1-\alpha/2}_{n-p}\;\hat{\sigma}\sqrt{c^T\left(X^TX\right)^{-1}c}$$
Recall the hostility example, which was an ANOVA with three groups. We have analyzed this data using both the cell means model and the offset representation, and we will demonstrate how to calculate the group means from the offset representation. Thus we are interested in estimating $\mu + \tau_2$ and $\mu + \tau_3$. I am also interested in estimating the difference between treatments 2 and 3, and will therefore be interested in estimating $\tau_2 - \tau_3$.
[Figure: hostility test scores (y) by group.]
We can fit the offset model and obtain the design matrix and estimate of 𝜎̂ via
the following code.
m <- lm(y ~ group, data=data) # Fit the ANOVA model (offset representation)
tidy(m) # Show me beta.hat
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 86.8 1.52 57.1 1.57e-24
## 2 groupGroup2 -11.2 2.22 -5.03 5.58e- 5
## 3 groupGroup3 -15.8 2.09 -7.55 2.06e- 7
Now we calculate the estimate and standard error of the contrast $c^T = [1\;\; 1\;\; 0]$, which corresponds to the mean of group two, $\mu + \tau_2$.
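A sketch of that calculation; the names X, beta.hat, sigma.hat, and contr are assumptions, while ctb and std.err match the code that follows.

X         <- model.matrix(m)          # design matrix used by lm()
beta.hat  <- coef(m)
sigma.hat <- summary(m)$sigma
contr     <- c(1, 1, 0)               # picks out mu + tau_2
ctb       <- as.numeric( t(contr) %*% beta.hat )
std.err   <- sigma.hat * sqrt( as.numeric( t(contr) %*% solve( t(X) %*% X ) %*% contr ) )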
data.frame(Estimate=ctb, StdErr=std.err)
## Estimate StdErr
## 1 75.57143 1.622994
and notice this is the exact same estimate and standard error we got for group
two when we fit the cell means model.
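For reference, a sketch of the cell means fit that produces the table below (the object name m.cell is ours):

m.cell <- lm( y ~ group - 1, data=data )
tidy(m.cell)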
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 groupGroup1 86.7 1.52 57.1 1.57e-24
## 2 groupGroup2 75.6 1.62 46.6 1.12e-22
## 3 groupGroup3 71 1.43 49.6 2.99e-23
4.3 Estimating contrasts using glht()

A convenient tool for estimating contrasts is the generalized linear hypothesis test function glht(), which can be found in the multiple comparisons package multcomp. The p-values will be adjusted to correct for testing multiple hypotheses, so there may be slight differences compared to the p-values seen in the regular summary table.
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 86.8 1.52 57.1 1.57e-24
## 2 groupGroup2 -11.2 2.22 -5.03 5.58e- 5
## 3 groupGroup3 -15.8 2.09 -7.55 2.06e- 7
We will now define a row vector (it needs to be a matrix or else glht() will throw an error). First we note that the simple contrast $c^T = [1\;\; 0\;\; 0]$ just grabs the first coefficient and gives us the same estimate and standard error as the summary did.
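A sketch of the call; the contrast is supplied as a one-row matrix whose row name labels the hypothesis.

library(multcomp)
contr <- rbind( 'Intercept' = c(1, 0, 0) )
test  <- glht( m, linfct=contr )
summary(test)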
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = y ~ group, data = data)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Intercept == 0 86.750 1.518 57.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
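Several contrasts can be tested at once by giving glht() a matrix with one row per contrast (a sketch, with row names chosen to match the output below):

contr <- rbind( 'Mean of Group 1' = c(1, 0,  0),
                'Mean of Group 2' = c(1, 1,  0),
                'Mean of Group 3' = c(1, 0,  1),
                'Diff G2-G3'      = c(0, 1, -1) )
test <- glht( m, linfct=contr )
summary(test)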
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = y ~ group, data = data)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Mean of Group 1 == 0 86.750 1.518 57.141 <0.001 ***
## Mean of Group 2 == 0 75.571 1.623 46.563 <0.001 ***
## Mean of Group 3 == 0 71.000 1.431 49.604 <0.001 ***
## Diff G2-G3 == 0 4.571 2.164 2.112 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(test, level=0.95)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = y ~ group, data = data)
##
## Quantile = 2.6461
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Mean of Group 1 == 0 86.7500 82.7328 90.7672
## Mean of Group 2 == 0 75.5714 71.2769 79.8660
## Mean of Group 3 == 0 71.0000 67.2126 74.7874
## Diff G2-G3 == 0 4.5714 -1.1546 10.2975
4.4 Using the emmeans Package
data(trees)
model <- lm( Volume ~ Girth, data=trees )
trees <- trees %>%
  dplyr::select( -matches('fit'), -matches('lwr'), -matches('upr') ) %>%
  cbind( predict(model, interval='conf') )
ggplot(trees, aes(x=Girth, y=Volume)) +
  geom_point() +
  geom_ribbon( aes(ymin=lwr, ymax=upr), alpha=.3 ) +
  geom_line( aes(y=fit) )
[Figure: tree Volume versus Girth with the fitted regression line and confidence band.]
Using the summary() function, we can test hypotheses about whether the y-intercept or slope could be equal to zero, but we might be interested in confidence intervals for the regression line at girth values of 10 and 12.
In this case, we want to look at all possible pairwise differences between the
predicted values at 10 and 12. (The rev part just reverses the order in which
we do the subtraction.)
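A sketch of the emmeans() call (revpairwise is the built-in contrast type that reverses the direction of subtraction):

library(emmeans)
emmeans( model, revpairwise ~ Girth, at=list(Girth=c(10,12)) )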
## $emmeans
## Girth emmean SE df lower.CL upper.CL
## 10 13.7 1.109 29 11.4 16.0
## 12 23.8 0.824 29 22.2 25.5
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 12 - 10 10.1 0.495 29 20.478 <.0001
Notice that if I was interested in 3 points, we would get all of the differences.
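A sketch of the call with three reference points:

emmeans( model, pairwise ~ Girth, at=list(Girth=c(10,11,12)) )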
## $emmeans
## Girth emmean SE df lower.CL upper.CL
## 10 13.7 1.109 29 11.4 16.0
## 11 18.8 0.945 29 16.8 20.7
## 12 23.8 0.824 29 22.2 25.5
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 10 - 11 -5.07 0.247 29 -20.478 <.0001
## 10 - 12 -10.13 0.495 29 -20.478 <.0001
## 11 - 12 -5.07 0.247 29 -20.478 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
In this very simple case, the slope parameter is easily available as a parameter
value, but we could use the emtrends() function to obtain the slope.
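A sketch of the emtrends() call:

emtrends( model, ~ Girth, var='Girth' )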
This output is a bit mysterious because of the 13.248 component. What has happened is that emtrends() is telling us the slope of the line at a particular point on the x-axis (the mean of all the girth values). While this doesn't matter in this example, because the slope is the same for all values of girth, if we had fit a quadratic model it would not be.
[Figure: tree Volume versus Girth.]
To consider the pairwise contrasts between different levels we will consider the
college student hostility data again. A clinical psychologist wished to compare
three methods for reducing hostility levels in university students, and used a
certain test (HLT) to measure the degree of hostility. A high score on the
test indicated great hostility. The psychologist used 24 students who obtained
high and nearly equal scores in the experiment. Eight subjects were selected
at random from among the 24 problem cases and were treated with method 1,
seven of the remaining 16 students were selected at random and treated with
method 2 while the remaining nine students were treated with method 3. All
treatments were continued for a one-semester period. Each student was given the HLT test at the end of the semester, with the results shown in the following table.

Method   Values
1        96, 79, 91, 85, 83, 91, 82, 87
2        77, 76, 74, 73, 78, 71, 80
3        66, 73, 69, 66, 77, 73, 71, 70, 74
[Figure: HLT scores by Method.]
To use emmeans(), we again use the pairwise command, where we specify that we want all the pairwise contrasts between Method levels.
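A sketch of the calls; the data frame name (Hostility) and its column names (HLT and Method) are assumptions based on the figure and output.

model.h <- lm( HLT ~ Method, data=Hostility )
emmeans( model.h, pairwise ~ Method )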
## $emmeans
## Method emmean SE df lower.CL upper.CL t.ratio p.value
## M1 86.8 1.52 21 83.6 89.9 57.141 <.0001
## M2 75.6 1.62 21 72.2 78.9 46.563 <.0001
## M3 71.0 1.43 21 68.0 74.0 49.604 <.0001
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## M1 - M2 11.18 2.22 21 5.577 16.8 5.030 0.0002
## M1 - M3 15.75 2.09 21 10.491 21.0 7.548 <.0001
## M2 - M3 4.57 2.16 21 -0.883 10.0 2.112 0.1114
##
## Confidence level used: 0.95
4.5 Exercises
1. The American Community Survey is an ongoing survey conducted monthly by the US Census Bureau, and the package Lock5Data has a dataset called EmployedACS that has 431 randomly selected anonymous US residents from this survey.
data('EmployedACS', package='Lock5Data')
?Lock5Data::EmployedACS
2. We will examine a data set from Ashton et al. (2007) that relates the
length of a tortoise’s carapace to the number of eggs laid in a clutch. The
data are
a) Plot the data with carapace as the explanatory variable and clutch
size as the response.
b) We want to fit the model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \qquad \text{where } \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$
To fit this model, we need a design matrix with a column of ones, a column of $x_i$ values, and a column of $x_i^2$ values. We could create a new column of $x_i^2$ values in the data frame, or do it in the model formula.
Chapter 5

Analysis of Covariance (ANCOVA)
5.1 Introduction
One way that we could extend the ANOVA and regression models is to have
both categorical and continuous predictor variables. For historical reasons going
back to pre “computer in your pocket” days, statisticians call this the Analysis
of Covariance (ANCOVA) model. Because it is just another example of a 𝑦 =
𝑋𝛽 +𝜖 linear model, I prefer to think of it as simply having both continuous and
categorical variables in my model. None of the cookbook calculations change,
but the interpretation of the parameters gets much more interesting.
The dataset teengamb in the package faraway has data regarding the rates
of gambling among teenagers in Britain and their gender and socioeconomic
status. One question we might be interested in is how gender and income relate
to how much a person gambles. But what should the effect of gender be?
There are two possible ways that gender could enter the model. Either:
1. We could fit two lines to the data one for males and one for females but
require that the lines be parallel (i.e. having the same slopes for income).
This is accomplished by having a separate y-intercept for each gender. In
effect, the line for the females would be offset by a constant amount from
the male line.
2. We could fit two lines but allow the slopes to differ as well as the
y-intercept. This is referred to as an “interaction” between income and
gender. One way to remember that this is an interaction is because the
effect of income on gambling rate is dependent on the gender of the indi-
vidual.
[Figure: gamble versus income under the additive (parallel lines) model and the interaction (separate slopes) model, with separate lines for males and females.]
It should be noted here that the constant variance assumption is being violated and we really ought to do a transformation. I would recommend performing a square root ($\sqrt{\cdot}$) transformation on both the gamble and income variables, but we'll leave them as is for now.
We will now see how to go about fitting these two models. As might be imagined,
these can be fit in the same fashion we have been solving the linear models, but
require a little finesse in defining the appropriate design matrix 𝑋.
$$y_i = \begin{cases} \beta_0 + \beta_1 + \beta_2 x_i + \epsilon_i & \text{if female} \\ \beta_0 + \beta_2 x_i + \epsilon_i & \text{if male} \end{cases}$$
where $\beta_1$ is the vertical offset of the female group regression line from the reference group, the male regression line. Because the first 19 observations are females, the first 19 rows of the design matrix have a 1 in the indicator column and the remaining rows have a 0.
I like this representation, where $\beta_1$ is the offset from the male regression line, because it makes it very convenient to test if the offset is equal to zero. The second column of the design matrix is referred to as a “dummy variable” or “indicator variable” that codes for the female gender. Notice that even though I have two genders, I only had to add one additional variable to my model, because we already had a y-intercept $\beta_0$ and we only added one indicator variable for females.
What if we had a third group? Then we would add another column of indicator variables for the third group. The new beta coefficient in the model would be the offset of the new group from the reference group. For example, consider $n = 9$ observations with $n_i = 3$ observations per group, where $y_{i,j}$ is the $j$th replication of the $i$th group.
$$\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & x_{1,1} \\ 1 & 0 & 0 & x_{1,2} \\ 1 & 0 & 0 & x_{1,3} \\ 1 & 1 & 0 & x_{2,1} \\ 1 & 1 & 0 & x_{2,2} \\ 1 & 1 & 0 & x_{2,3} \\ 1 & 0 & 1 & x_{3,1} \\ 1 & 0 & 1 & x_{3,2} \\ 1 & 0 & 1 & x_{3,3} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \end{bmatrix}$$
In this model, $\beta_0$ is the y-intercept for group 1. The parameter $\beta_1$ is the vertical offset of the second group from the reference group (group 1). Similarly $\beta_2$ is the vertical offset for group 3. All groups share the same slope, $\beta_3$.
5.3 Lines with Different Slopes (aka Interaction Model)

To allow the slopes to differ as well, we let the female group have its own slope offset:
$$y_i = \begin{cases} \beta_0 + \beta_1 + (\beta_2 + \beta_3)\,x_i + \epsilon_i & \text{if female} \\ \beta_0 + \beta_2 x_i + \epsilon_i & \text{if male} \end{cases}$$
where $\beta_1$ is the offset in y-intercept of the female group from the male group, and $\beta_3$ is the offset in slope. Now our matrix formula looks like
$$\begin{bmatrix} y_1 \\ \vdots \\ y_{19} \\ y_{20} \\ \vdots \\ y_{47} \end{bmatrix} = \begin{bmatrix} 1 & 1 & x_1 & x_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{19} & x_{19} \\ 1 & 0 & x_{20} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{47} & 0 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{19} \\ \epsilon_{20} \\ \vdots \\ \epsilon_{47} \end{bmatrix}$$
where the new fourth column is what I would get if I multiplied the $x$ column element-wise with the dummy-variable column. To fit this model in R we have
data('teengamb', package='faraway')
[Figure: gamble versus income with separate fitted lines for females and males.]
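A sketch of the model fit. The summary below shows a sexMale coefficient, so sex has evidently been recoded from its 0/1 coding into a factor with levels Female and Male; the recoding line shown here is an assumption.

teengamb <- teengamb %>%
  mutate( sex = factor(sex, levels=c(1,0), labels=c('Female','Male')) )
m <- lm( gamble ~ sex * income, data=teengamb )
summary(m)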
##
## Call:
## lm(formula = gamble ~ sex * income, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.522 -4.860 -1.790 6.273 93.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1400 9.2492 0.339 0.73590
## sexMale -5.7996 11.2003 -0.518 0.60724
## income 0.1749 1.9034 0.092 0.92721
## sexMale:income 6.3432 2.1446 2.958 0.00502 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.98 on 43 degrees of freedom
## Multiple R-squared: 0.5857, Adjusted R-squared: 0.5568
## F-statistic: 20.26 on 3 and 43 DF, p-value: 2.451e-08
Coefficients Interpretation
(Intercept) y-intercept for the females
sexMale The difference in y-intercept for Males
income Slope of the female regression line
sexMale:income The offset in slopes for the Males
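We can use emmeans() to estimate the height of each regression line at income values of 5 and 10, along with the Male - Female difference at each income (a sketch):

emmeans( m, pairwise ~ sex, at=list(income=c(5,10)) )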
## $emmeans
## income = 5:
## sex emmean SE df lower.CL upper.CL
## Male 29.93 3.97 43 21.93 37.9
## Female 4.01 5.08 43 -6.23 14.3
##
## income = 10:
## sex emmean SE df lower.CL upper.CL
## Male 62.52 6.35 43 49.71 75.3
## Female 4.89 12.13 43 -19.58 29.4
##
## Confidence level used: 0.95
##
## $contrasts
## income = 5:
## contrast estimate SE df t.ratio p.value
## Male - Female 25.9 6.44 43 4.022 0.0002
##
## income = 10:
## contrast estimate SE df t.ratio p.value
## Male - Female 57.6 13.69 43 4.208 0.0001
If we want the slopes as well as the difference in slopes, we would use the
emtrends() function.
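A sketch of the emtrends() call:

emtrends( m, pairwise ~ sex, var='income', at=list(income=10) )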
## $emtrends
## income sex income.trend SE df lower.CL upper.CL
## 10 Male 6.518 0.988 43 4.53 8.51
## 10 Female 0.175 1.903 43 -3.66 4.01
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 10 Male - 10 Female 6.34 2.14 43 2.958 0.0050
While I specified that the slope should be calculated at the x-value income = 10, that doesn't matter here because the slopes are the same at all x-values. Somewhat less interestingly, we could also calculate the average of the Male and Female slopes.
As another example, consider the iris data, where we model sepal width as a function of sepal length and species.

[Figure: Sepal.Width versus Sepal.Length for the iris data, colored by Species (setosa, versicolor, virginica).]
Looking at this graph, it seems that I will likely have a model with different
y-intercepts for each species, but it isn’t clear to me if we need different slopes.
We consider the sequence of building successively more complex models:
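A sketch of the three fits. The coefficient tables that follow use virginica as the reference level, so Species is releveled first; that releveling step is an assumption.

iris2 <- iris %>% mutate( Species = relevel(Species, ref='virginica') )
m1 <- lm( Sepal.Width ~ Sepal.Length,           data=iris2 )   # single line
m2 <- lm( Sepal.Width ~ Sepal.Length + Species, data=iris2 )   # parallel lines
m3 <- lm( Sepal.Width ~ Sepal.Length * Species, data=iris2 )   # separate slopes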
[Figure: the data with the fitted values from each of the three models: m1 (a single line), m2 (parallel lines for each species), and m3 (separate slopes for each species).]
Looking at these, it seems obvious that the simplest model where we ignore
Species is horrible. The other two models seem decent, and I am not sure about
the parallel lines model vs the differing slopes model.
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.42 0.254 13.5 0
## 2 Sepal.Length -0.062 0.043 -1.44 0.152
For the simplest model, there is so much unexplained noise that the slope vari-
able isn’t significant.
Moving onto the next most complicated model, where each species has their
own y-intercept, but they share a slope, we have
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.669 0.308 2.17 0.031
## 2 Sepal.Length 0.35 0.046 7.56 0
## 3 Speciessetosa 1.01 0.093 10.8 0
## 4 Speciesversicolor 0.024 0.065 0.37 0.712
The first two lines are the y-intercept and slope associated with the reference
group and the last two lines are the y-intercept offsets from the reference group
to Setosa and Versicolor, respectively. We have that the slope associated with
increasing Sepal Length is significant and that Setosa has a statistically different
y-intercept than the reference group Virginica and that Versicolor does not have
a statistically different y-intercept than the reference group.
Finally we consider the most complicated model that includes two more slope
parameters
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.45 0.405 3.57 0
## 2 Sepal.Length 0.232 0.061 3.79 0
## 3 Speciessetosa -2.02 0.686 -2.94 0.004
## 4 Speciesversicolor -0.574 0.605 -0.95 0.344
## 5 Sepal.Length:Speciessetosa 0.567 0.126 4.49 0
## 6 Sepal.Length:Speciesversicolor 0.088 0.097 0.905 0.367
Meaning R-label
Reference group y-intercept (Intercept)
Reference group slope Sepal.Length
offset to y-intercept for Setosa Speciessetosa
offset to y-intercept for Versicolor Speciesversicolor
offset to slope for Setosa Sepal.Length:Speciessetosa
offset to slope for Versicolor Sepal.Length:Speciesversicolor
It appears that the slope for Setosa is different from that of the reference group Virginica.
However because we’ve added 2 parameters to the model, testing Model2 vs
Model3 is not equivalent to just looking at the p-value for that one slope. Instead
we need to look at the F-test comparing the two models which will evaluate if
the decrease in SSE is sufficient to justify the addition of two parameters.
anova(m2, m3)
The F-test concludes that there is sufficient decrease in the SSE to justify adding
two additional parameters to the model.
5.5 Exercises
1. In the faraway package, there is a dataset named phbirths that gives babies' birth weights along with their gestational time in utero and the mother's smoking status.
a. Load and inspect the dataset using
data('phbirths', package='faraway') # load the data within the package
?faraway::phbirths # Look at the help file
b. Create a plot of the birth weight vs the gestational age. Color code
the points based on the mother’s smoking status. Does it appear
that smoking matters?
c. Fit the simple model (one regression line) along with both the main
effects (parallel lines) and interaction (non-parallel lines) ANCOVA
model to these data. Which model is preferred?
d. Using whichever model you selected in the previous section, create a
graph of the data along with the confidence region for the regression
line(s).
e. Now consider only the “full term babies” which are babies with ges-
tational age at birth ≥ 36 weeks. With this reduced dataset, repeat
parts c,d.
Chapter 6

Two-way ANOVA

6.1 Review of 1-way ANOVA
Given a categorical covariate (which I will call a factor) with 𝐼 levels, we are
interested in fitting the model
𝑦𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖𝑗
𝑖𝑖𝑑
where 𝜖𝑖𝑗 ∼ 𝑁 (0, 𝜎2 ), 𝜇 is the overall mean, and 𝜏𝑖 are the offset of factor
level 𝑖 from 𝜇. Unfortunately this model is not identifiable because I could
add a constant (say 5) to 𝜇 and subtract that same constant from each of the
𝜏𝑖 values and the group mean 𝜇 + 𝜏𝑖 would not change. There are two easy
restrictions we could make to make the model identifiable:
1. Require that the offsets sum to zero, $\sum_i \tau_i = 0$, so that $\mu$ is the overall mean; or
2. Set $\tau_1 = 0$, so that $\mu$ is the mean of the first (reference) group and the remaining $\tau_i$ are offsets from it.
The first question we usually want to address is whether the factor matters at all; that is, we test
$$H_0: \; y_{ij} = \mu + \epsilon_{ij}$$
$$H_a: \; y_{ij} = \mu + \tau_i + \epsilon_{ij}$$
6.1.1 An Example
We look at a dataset that comes from the study of blood coagulation times:
24 animals were randomly assigned to four different diets and the samples were
taken in a random order. The diets are denoted as 𝐴,𝐵,𝐶,and 𝐷 and the
response of interest is the amount of time it takes for the blood to coagulate.
data('coagulation', package='faraway')
ggplot(coagulation, aes(x=diet, y=coag)) +
  geom_boxplot() +
  labs( x='Diet', y='Coagulation Time' )
(Boxplot of Coagulation Time by Diet for the four diets A-D.)
Just by looking at the graph, we expect to see that diets 𝐴 and 𝐷 are similar
while 𝐵 and 𝐶 are different from 𝐴 and 𝐷 and possibly from each other, too.
We first fit the offset model.
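A sketch of the model fit; the object name m is assumed to match the emmeans() call used below.
m <- lm( coag ~ diet, data=coagulation )
summary(m)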
##
## Call:
## lm(formula = coag ~ diet, data = coagulation)
##
## Residuals:
## Min 1Q Median 3Q Max
Notice that diet A is the reference level and it has a mean of 61. Diet B has an
offset from A of 5, etc. From the very small p-value of the F-test, we conclude that the simple
model
$$y_{ij} = \mu + \epsilon_{ij}$$
is not sufficient to describe the data.
To compare pairs of diets we can use confidence intervals for contrasts $c^T\beta$ of the form
$$c^T\hat{\beta} \pm t^{1-\alpha/2}_{n-p}\, \hat{\sigma}\sqrt{c^T(X^TX)^{-1}c}$$
There are several ways to make R calculate this interval, but the easiest is to
use the emmeans package. This package computes the above intervals which are
commonly known as Tukey’s Honestly Significant Differences.
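A sketch of the emmeans calls; the 90% confidence level is an assumption chosen to match the output below.
library(emmeans)
emm <- emmeans(m, specs = pairwise ~ diet)
confint(emm$emmeans, level = 0.90)   # 90% confidence intervals for each diet mean
emm$contrasts                        # Tukey-adjusted pairwise comparisons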
## $emmeans
## diet emmean SE df lower.CL upper.CL
## A 61 1.183 20 59.0 63.0
## B 66 0.966 20 64.3 67.7
## C 68 0.966 20 66.3 69.7
## D 61 0.837 20 59.6 62.4
##
## Confidence level used: 0.9
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -5 1.53 20 -3.273 0.0183
## A - C -7 1.53 20 -4.583 0.0010
## A - D 0 1.45 20 0.000 1.0000
## B - C -2 1.37 20 -1.464 0.4766
## B - D 5 1.28 20 3.912 0.0044
## C - D 7 1.28 20 5.477 0.0001
##
## P value adjustment: tukey method for comparing a family of 4 estimates
Here we see that diets 𝐴 and 𝐷 are similar to each other, but different than 𝐵
and 𝐶 and that 𝐵 and 𝐶 are not statistically different from each other at the
0.10 level.
Often I want to produce the “Compact Letter Display” which identifies which
groups are significantly different. Unfortunately this leads to a somewhat binary
decision of “is statistically significant” or “is NOT statistically significant”, but
we should at least know how to do this calculation.
LetterData <- emmeans(m, specs= ~ diet) %>% # cld() will freak out if you have pairwise here...
multcomp::cld(Letters=letters, level=0.95) %>%
mutate(.group = str_remove_all(.group, '\\s') ) %>% # remove the spaces
mutate( y = 73 ) # height to place the letters at.
LetterData
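A sketch of overlaying the letters on the boxplot, using the LetterData object computed above (the y = 73 height was set in that code).
ggplot(coagulation, aes(x=diet, y=coag)) +
  geom_boxplot() +
  geom_text( data=LetterData, aes(y=y, label=.group) ) +
  labs( x='Diet', y='Coagulation Time' )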
(Boxplot of Coagulation Time by Diet, annotated with the compact letters: diets A and D share the letter a, diets B and C share the letter b.)
6.2 Two-Way ANOVA
With two factors, we might fit either the model
$$y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$$
which has the main effects of each covariate, or the model with the interaction
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$
To consider what an interaction term might mean consider the role of temper-
ature and humidity on the amount of fungal growth. You might expect to see
data similar to this (where the numbers represent some sort of measure of fungal
growth):
In this case we see that increased humidity increases the amount of fungal
growth, but the amount of increase depends on the temperature. At 2 C the
increases due to humidity are noticeable, at 10 C the increases are larger, and
at 30 C the increases are larger yet. The effect of changing from one humidity
level to the next depends on which temperature level we are at. This change in
the effect of humidity is an interaction effect. A memorable example is that
chocolate by itself is good. Strawberries by themselves are also good. But
the combination of chocolate and strawberries is a delight greater than the sum
of the individual treats.
We can look at a graph of Humidity and Temperature vs the response and see
that the effect of increasing humidity changes with the temperature level.
Just as the interaction manifested itself in non-parallel slopes in the ANCOVA
model, here the interaction manifests itself in non-parallel lines when we connect
the dots across the factor levels.
(Interaction plot: fungal Growth vs Humidity (5%, 30%, 60%, 90%), with separate lines for Temperature levels 2C, 10C, and 30C.)
6.3 Orthogonality
When designing an experiment, I want to make sure that none of my covariates
are confounded with each other and I'd also like for them to not be correlated.
Consider the following three experimental designs, where the number in each
bin is the number of subjects of that type. I am interested in testing 2 different
drugs and studying their effect on heart disease within the gender groups.
1. This design is very bad. Because we have no males taking drug 1, and no
females taking drug 2, we can't say if any observed differences are due to
the drug or to gender; the two are completely confounded.
6.4 Main Effects Model
The main effects model is
$$y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$$
• The intercept term, 𝜇 is the reference point for all the other parameters.
This is the expected value for an observation in the first level of factor 1
and the first level of factor two.
• 𝛼𝑖 is the amount you expect the response to increase when changing from
factor 1 level 1, to factor 1 level i (while the second factor is held constant).
• 𝛽𝑗 is the amount you expect the response to increase when changing from
factor 2 level 1 to factor 2 level j (while the first factor is held constant).
Referring back to the fungus example, let the $\alpha_i$ values be associated with
changes in humidity and the $\beta_j$ values be associated with changes in temperature
levels. Then the expected value of each treatment combination is $\mu + \alpha_i + \beta_j$.
To explore the main effects model we use an example recording the Yield of three
fruit tree varieties (Var) under four pesticide treatments (Pest).
(Plot of Yield vs Pest (A-D), colored by Var (1, 2, 3).)
The first thing we notice is that pesticides B and D seem to be better than
the others and that variety 3 seems to be the best producer. The effect of
pesticide treatment seems consistent between varieties, so we don’t expect that
the interaction effect will be significant.
Notice that the effects for Variety and Pesticide are the same whether or not the
other is in the model. This is due to the orthogonal design of the experiment
and makes it much easier to interpret the main effects of Variety and Pesticide,
as the sketch below illustrates.
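A minimal sketch demonstrating this, assuming the fruit data frame used above has factor columns Var and Pest and a numeric Yield column; the object names m1, m2, and m3 are illustrative (m3 is reused further below).
m1 <- lm( Yield ~ Var,        data=fruit )
m2 <- lm( Yield ~ Pest,       data=fruit )
m3 <- lm( Yield ~ Var + Pest, data=fruit )
coef(m1); coef(m2); coef(m3)   # in this balanced design the shared coefficients agree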
Most statistical software will produce an analysis of variance table when fitting
a two-way ANOVA. This table is very similar to the analysis of variance table
we have seen in the one-way ANOVA, but has several rows which correspond to
the additional factors added to the model.
Consider the two-way ANOVA with factors $A$ and $B$ which have $I$ and $J$
discrete levels, respectively. For convenience let $RSS_1$ be the residual sum of
squares of the intercept-only model, $RSS_A$ be the residual sum of squares
for the model with just the main effect of factor $A$, and $RSS_{A+B}$ be the residual
sum of squares of the model with both main effects. Finally assume that we
have a total of $n$ observations. The ANOVA table for this model is as follows:
Source   df                       Sum of Sq (SS)               Mean Sq                F                     p-value
A        $df_A = I - 1$           $SS_A = RSS_1 - RSS_A$       $MS_A = SS_A/df_A$     $F_A = MS_A/MSE$      $P(F_{df_A,\,df_e} > F_A)$
B        $df_B = J - 1$           $SS_B = RSS_A - RSS_{A+B}$   $MS_B = SS_B/df_B$     $F_B = MS_B/MSE$      $P(F_{df_B,\,df_e} > F_B)$
Error    $df_e = n - I - J + 1$   $SS_E = RSS_{A+B}$           $MSE = SS_E/df_e$
Note: if the table is cut off, you can decrease your font size and it will all show up…
This arrangement of the ANOVA table is referred to as "Type I" sums of squares.
We can examine this table in the fruit trees example using the anova() command,
passing just a single model.
We might think that this is the same as fitting three nested models and running
an F-test on each successive pair of models, but it isn't. While both approaches give
the same sums of squares, the F statistics are different because the MSE of the
complex model is different. In particular, the F-statistics are larger and thus
the p-values are smaller for detecting significant effects.
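A sketch contrasting the two approaches, using the models from the earlier sketch plus an assumed intercept-only model m0.
m0 <- lm( Yield ~ 1, data=fruit )   # intercept-only model
anova(m3)       # Type I table: every row is tested against the MSE of the two-factor model
anova(m0, m1)   # tests Var, but the denominator MSE comes from the Var-only model
anova(m1, m3)   # tests Pest after Var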
As in the one-way ANOVA, we are interested in which factor levels differ. For
example, we might suspect that it makes sense to group pesticides B and D
together and claim that they are better than the group of A and C.
Just as we did in the one-way ANOVA model, this is such a common thing to
do that there is an easy way to do this, using emmeans.
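The output below was presumably produced by a call along these lines (a sketch):
emmeans(m3, spec = pairwise ~ Var)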
## $emmeans
## Var emmean SE df lower.CL upper.CL
## 1 46.9 2.59 18 41.4 52.3
## 2 59.2 2.59 18 53.8 64.7
emmeans(m3, spec=pairwise~Pest)
## $emmeans
## Pest emmean SE df lower.CL upper.CL
## A 53.0 2.99 18 46.7 59.3
## B 67.8 2.99 18 61.6 74.1
## C 51.2 2.99 18 44.9 57.4
## D 73.8 2.99 18 67.6 80.1
##
## Results are averaged over the levels of: Var
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -14.83 4.23 18 -3.510 0.0122
## A - C 1.83 4.23 18 0.434 0.9719
## A - D -20.83 4.23 18 -4.930 0.0006
## B - C 16.67 4.23 18 3.944 0.0048
## B - D -6.00 4.23 18 -1.420 0.5038
## C - D -22.67 4.23 18 -5.364 0.0002
##
## Results are averaged over the levels of: Var
## P value adjustment: tukey method for comparing a family of 4 estimates
These outputs are nice and they show the main effects of variety and pesticide.
Similar to the 1-way ANOVA, we also want to be able to calculate the compact
letter display.
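A sketch of the compact letter display calculations for each main effect, assuming the main effects model m3 from above.
emmeans(m3, specs = ~ Var)  %>% multcomp::cld(Letters=letters)
emmeans(m3, specs = ~ Pest) %>% multcomp::cld(Letters=letters)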
So we see that each variety is significantly different from all the others and
among the pesticides, 𝐴 and 𝐶 are indistinguishable as are 𝐵 and 𝐷, but there
is a difference between the 𝐴, 𝐶 and 𝐵, 𝐷 groups.
6.5 Interaction Model
When the model contains the interaction of the two factors, our model is written
as
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$
Interpreting effects can be very tricky. Under the interaction model, the effect
of changing from factor 1 level 1 to factor 1 level $i$ depends on the level of
factor 2. In essence, we are fitting a model that allows each of the $I \times J$ cells
in my model to vary independently. As such, the model has a total of 𝐼 × 𝐽
parameters but because the model without interactions had 1 + (𝐼 − 1) + (𝐽 − 1)
terms in it, the interaction adds $df_{AB}$ additional parameters. We can solve for
$df_{AB}$ via:
$$\begin{aligned}
I \times J &= 1 + (I-1) + (J-1) + df_{AB} \\
I \times J &= I + J - 1 + df_{AB} \\
IJ - I - J &= -1 + df_{AB} \\
I(J-1) - J &= -1 + df_{AB} \\
I(J-1) - J + 1 &= df_{AB} \\
I(J-1) - (J-1) &= df_{AB} \\
(I-1)(J-1) &= df_{AB}
\end{aligned}$$
This makes sense because the first factor added (𝐼 − 1) columns to the de-
sign matrix and an interaction with a continuous covariate just multiplied the
columns of the factor by the single column of the continuous covariate. Creating
an interaction of two factors multiplies each column of the first factor by all the
columns defined by the second factor.
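A quick sketch of inspecting the columns R builds (assuming the fruit data frame from above):
head( model.matrix(~ Var * Pest, data=fruit) )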
The expected value of the 𝑖𝑗 combination is 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 . Returning to
our fungus example, each temperature-humidity combination now has its own
expected mean under the model with main effects and the interaction.
Most statistical software will produce an analysis of variance table when fitting
a two-way ANOVA. This table is very similar to the analysis of variance table
we have seen in the one-way ANOVA, but has several rows which correspond to
the additional factors added to the model.
Consider the two-way ANOVA with factors $A$ and $B$ which have $I$ and $J$
discrete levels, respectively. For convenience let $RSS_1$ be the residual sum of
squares of the intercept-only model and $RSS_A$ be the residual sum of squares
for the model with just the main effect of factor $A$. Likewise $RSS_{A+B}$ and
$RSS_{A*B}$ shall be the residual sums of squares of the model with just the main
effects and the model with main effects and the interaction. Finally assume
that we have a total of $n$ observations. The ANOVA table for this model is as
follows:

Source   df                       Sum of Sq (SS)                       Mean Sq                       F                        p-value
A        $df_A = I - 1$           $SS_A = RSS_1 - RSS_A$               $MS_A = SS_A/df_A$            $F_A = MS_A/MSE$         $P(F_{df_A,\,df_e} > F_A)$
B        $df_B = J - 1$           $SS_B = RSS_A - RSS_{A+B}$           $MS_B = SS_B/df_B$            $F_B = MS_B/MSE$         $P(F_{df_B,\,df_e} > F_B)$
AB       $df_{AB} = (I-1)(J-1)$   $SS_{AB} = RSS_{A+B} - RSS_{A*B}$    $MS_{AB} = SS_{AB}/df_{AB}$   $F_{AB} = MS_{AB}/MSE$   $P(F_{df_{AB},\,df_e} > F_{AB})$
Error    $df_e = n - IJ$          $SS_E = RSS_{A*B}$                   $MSE = SS_E/df_e$
We next consider whether or not to include the interaction term in the fruit
tree model. We fit the model with the interaction and then graph the results.
# Create the Interaction Plot using emmeans package. IP stands for interaction plot
m4 <- lm(Yield ~ Var * Pest, data=fruit)
emmip( m4, Var ~ Pest ) # color is LHS of the formula
(Interaction plot from emmip(): linear prediction of Yield at each level of Pest (A-D), one line per Var.)
(Plot of the raw Yield data vs Pest, colored by Var.)
All of the line segments are close to parallel, so we don't expect the interaction
to be significant.
anova( m4 )
Examining the ANOVA table, we see that the interaction effect is not significant
and we will stay with the simpler model Yield ~ Var + Pest.
This data set looks at the number of breaks that occur in two different types of
wool under three different levels of tension (low, medium, and high). The fewer
breaks, the better.
As always, the first thing we do is look at the data. In this case, it looks like
the number of breaks decreases with increasing tension and perhaps wool B has
fewer breaks than wool A.
data(warpbreaks)
ggplot(warpbreaks, aes(x=tension, y=breaks, color=wool, shape=wool), size=2) +
geom_boxplot() +
geom_point(position=position_dodge(width=.35)) # offset the wool groups
(Boxplots of breaks vs tension (L, M, H), colored by wool type (A, B), with the raw data points overlaid.)
We next fit our linear model and examine the diagnostic plots.
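A sketch of the fit behind the following diagnostics; the main effects formula and the object name model are assumptions.
library(ggfortify)
model <- lm( breaks ~ tension + wool, data=warpbreaks )
autoplot(model, which=1:2)   # residuals vs fitted and normal Q-Q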
(Diagnostic plots for the model: residuals vs fitted values and a normal Q-Q plot of the standardized residuals, colored by tension:wool combination; observation 29 stands out.)
MASS::boxcox(model)
(Box-Cox profile log-likelihood; the 95% confidence interval for λ is wide and includes 0.)
This suggests we should make a log transformation, though because the confidence
interval is quite wide we might consider whether the increased difficulty in
interpretation makes sufficient progress towards making the data meet the model
assumptions. The diagnostic plots of the resulting model look better for the
constant variance assumption, but the normality is now worse. Because
the Central Limit Theorem helps deal with the normality question, I'd rather
stabilize the variance at the cost of the normality.
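A sketch of the refit with the log-transformed response; the object name model.1 is assumed.
model.1 <- lm( log(breaks) ~ tension + wool, data=warpbreaks )
autoplot(model.1, which=1:2)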
(Diagnostic plots for the log-transformed model: residuals vs fitted values and normal Q-Q plot, colored by tension:wool; observations 14, 23, and 29 are flagged.)
Next we’ll fit the interaction model and check the diagnostic plots. The diag-
nostic plots look good and this appears to be a legitimate model.
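A sketch of the interaction model; the formula and the name model.2 match the summary shown further below.
model.2 <- lm( log(breaks) ~ tension * wool, data=warpbreaks )
autoplot(model.2, which=1:2)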
(Diagnostic plots for the interaction model: residuals vs fitted values and normal Q-Q plot, colored by tension:wool; observations 23 and 29 are flagged.)
Then we’ll do an F-test to see if it is a better model than the main effects model.
The p-value is marginally significant, so we’ll keep the interaction in the model,
but recognize that it is a weak interaction.
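A sketch of the comparison; model.1 is the assumed name of the log-scale main effects model.
anova(model.1, model.2)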
## Analysis of Variance Table
##
## Model 1: log(breaks) ~ tension + wool
## Model 2: log(breaks) ~ tension * wool
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
## 1     50 7.6270
## 2     48 6.7138  2   0.91315 3.2642 0.04686 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next we look at the effect of the interaction and the easiest way to do this is
to look at the interaction plot. The emmeans::emmip() just shows the mean of
each treatment combination, while the plot I made by hand shows the mean of
each treatment combination along with the raw data.
A <- emmip(model.2, wool ~ tension) # LHS is color, RHS is the x-axis variable
(Interaction plots of log(breaks) vs tension for each wool type: the emmip() version showing only the treatment means, and a hand-made version showing the means along with the raw data.)
We can see that it appears that wool A has a decrease in breaks between low
and medium tension, while wool B has a decrease in breaks between medium
and high. It is actually quite difficult to see this interaction when we examine
the model coefficients.
summary(model.2)
##
## Call:
## lm(formula = log(breaks) ~ tension * wool, data = warpbreaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81504 -0.27885 0.04042 0.27319 0.64358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7179 0.1247 29.824 < 2e-16 ***
## tensionM -0.6012 0.1763 -3.410 0.00133 **
## tensionH -0.6003 0.1763 -3.405 0.00134 **
## woolB -0.4356 0.1763 -2.471 0.01709 *
## tensionM:woolB 0.6281 0.2493 2.519 0.01514 *
## tensionH:woolB 0.2221 0.2493 0.891 0.37749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.374 on 48 degrees of freedom
## Multiple R-squared: 0.3363, Adjusted R-squared: 0.2672
## F-statistic: 4.864 on 5 and 48 DF, p-value: 0.001116
For example, to compare wool B at the medium and high tension levels, we are testing
$$H_0: \; (\mu + \alpha_2 + \beta_2 + (\alpha\beta)_{22}) - (\mu + \alpha_3 + \beta_2 + (\alpha\beta)_{32}) = 0$$
$$H_a: \; (\mu + \alpha_2 + \beta_2 + (\alpha\beta)_{22}) - (\mu + \alpha_3 + \beta_2 + (\alpha\beta)_{32}) \ne 0$$
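The output below was presumably produced by a call along these lines (a sketch):
emmeans(model.2, pairwise ~ tension * wool)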
## $emmeans
## tension wool emmean SE df lower.CL upper.CL
## L A 3.72 0.125 48 3.47 3.97
## M A 3.12 0.125 48 2.87 3.37
## H A 3.12 0.125 48 2.87 3.37
## L B 3.28 0.125 48 3.03 3.53
## M B 3.31 0.125 48 3.06 3.56
## H B 2.90 0.125 48 2.65 3.15
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## L A - M A 0.601196 0.176 48 3.410 0.0158
## L A - H A 0.600323 0.176 48 3.405 0.0160
## L A - L B 0.435567 0.176 48 2.471 0.1535
## L A - M B 0.408619 0.176 48 2.318 0.2071
## L A - H B 0.813794 0.176 48 4.616 0.0004
## M A - H A -0.000873 0.176 48 -0.005 1.0000
## M A - L B -0.165629 0.176 48 -0.939 0.9341
## M A - M B -0.192577 0.176 48 -1.092 0.8821
## M A - H B 0.212598 0.176 48 1.206 0.8319
## H A - L B -0.164756 0.176 48 -0.935 0.9355
## H A - M B -0.191704 0.176 48 -1.087 0.8840
## H A - H B 0.213471 0.176 48 1.211 0.8295
## L B - M B -0.026948 0.176 48 -0.153 1.0000
## L B - H B 0.378227 0.176 48 2.145 0.2823
## M B - H B 0.405175 0.176 48 2.298 0.2149
##
## Results are given on the log (not the response) scale.
## P value adjustment: tukey method for comparing a family of 6 estimates
The last call to emmeans gives us all the pairwise tests comparing the cell means.
If we don’t want to wade through all the other pairwise contrasts we could do
the following:
# If I want to not wade through all those contrasts and just grab the
# contrasts for wool type 'B' and tensions 'M' and 'H'
emmeans(model.2, pairwise~tension*wool, at=list(wool='B', tension=c('M','H')))
What would happen if we just looked at the main effects? In the case where our
experiment is balanced with equal numbers of observations in each treatment
cell, we can interpret these differences as follows. Knowing that each cell in
our table has a different estimated mean, we could consider the average of all
the type A cells as the typical wool A. Likewise we could average all the cell
means for the wool B cells. Then we could look at the difference between those
two averages. In the balanced design, this is equivalent to removing the tension
term from the model and just looking at the difference between the average log
number of breaks.
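A sketch of the marginal comparisons whose output follows.
emmeans(model.2, pairwise ~ tension)   # averaged over wool
emmeans(model.2, pairwise ~ wool)      # averaged over tension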
## $emmeans
## tension emmean SE df lower.CL upper.CL
## L 3.50 0.0882 48 3.32 3.68
## M 3.21 0.0882 48 3.04 3.39
## H 3.01 0.0882 48 2.83 3.19
##
## Results are averaged over the levels of: wool
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## L - M 0.287 0.125 48 2.303 0.0649
## L - H 0.489 0.125 48 3.925 0.0008
## M - H 0.202 0.125 48 1.622 0.2465
##
## Results are averaged over the levels of: wool
## Results are given on the log (not the response) scale.
## P value adjustment: tukey method for comparing a family of 3 estimates
## $emmeans
## wool emmean SE df lower.CL upper.CL
## A 3.32 0.072 48 3.17 3.46
## B 3.17 0.072 48 3.02 3.31
##
## Results are averaged over the levels of: tension
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B 0.152 0.102 48 1.495 0.1415
##
## Results are averaged over the levels of: tension
## Results are given on the log (not the response) scale.
Using emmeans, we can see the wool effect difference between types B and A is
−0.1522. We can calculate the mean number of log breaks for each wool type
and take the difference by the following:
warpbreaks %>%
group_by(wool) %>%
summarise( wool.means = mean(log(breaks)) ) %>%
summarise( diff(wool.means) )
## # A tibble: 1 x 1
## `diff(wool.means)`
## <dbl>
## 1 -0.152
In the unbalanced case taking the average of the cell means produces a different
answer than taking the average of the data. The emmeans package chooses to
take the average of the cell means.
6.6 Exercises
As we have developed much of the necessary theory, we are moving into exercises
that emphasize modeling decisions and interpretation. As such, grading of
exercises will emphasize interpretation and justification of why modeling
decisions were made. For any R output produced, be certain to discuss
what is important. Feel free to not show exploratory work, but please comment
on what you considered. Furthermore, as students progress, necessary analysis steps
will not be listed and students will be expected to appropriately perform the
necessary work and comment as appropriate. This is leading to, at the end
of the course, students being given a dataset and, with no prompts as to what
is an appropriate analysis, being asked to model the data appropriately.
1. In the faraway package, the data set rats has data on a gruesome ex-
periment that examined the time till death of 48 rats when they were
subjected to three different types of poison administered in four different
manners (which they called treatments). We are interested in assessing
which poison works the fastest as well as which administration method is
most effective.
a. The response variable time needs to be transformed. To see this, we'll
examine the diagnostic plots from the interaction model; there
is clearly a problem with non-constant variance. We'll look at the
Box-Cox family of transformations and see that $y^{-1}$ is a reasonable
transformation.
data('rats', package='faraway')
model <- lm( time ~ poison * treat, data=rats)
# lindia::gg_diagnose(model) # All the plots...
lindia::gg_diagnose(model, plot.all=FALSE)[[4]] # just resid vs fitted
lindia::gg_boxcox(model)
rats <- rats %>% mutate( speed = time^(-1) )
Chapter 7
Diagnostics
7.1 Detecting Assumption Violations
Often problems with the model assumptions can be corrected by transforming either the
explanatory or the response variables. We will use Anscombe's famous quartet of datasets
to show how our diagnostic measures will detect various departures from the model
assumptions.
The data are available in R as a data frame anscombe and are loaded by default.
The data consist of four datasets, each having the same fitted regression line
$\hat{y} = 3 + 0.5\,x$, but the data are drastically different.
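A quick check of this claim:
data(anscombe)
coef( lm(y1 ~ x1, data=anscombe) )   # approximately 3 and 0.5
coef( lm(y3 ~ x3, data=anscombe) )   # approximately 3 and 0.5 as well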
(Scatterplots of y vs x for the four Anscombe datasets, Set 1 through Set 4.)
$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^T y = Hy$$
where the "Hat Matrix" is $H = X(X^TX)^{-1}X^T$ because we have $\hat{y} = Hy$. The
elements of $H$ can be quite useful in diagnostics. It can be shown that the
variance of the $i$th residual is
$$Var(\hat{\epsilon}_i) = \sigma^2(1 - H_{ii})$$
where $H_{ii}$ is the $i$th element of the main diagonal of $H$. This suggests that I
could rescale my residuals to
$$\hat{\epsilon}^*_i = \frac{\hat{\epsilon}_i}{\hat{\sigma}\sqrt{1 - H_{ii}}}$$
which, if the normality and homoscedasticity assumptions hold, should behave
as a 𝑁 (0, 1) sample.
These rescaled residuals are called “studentized residuals”, though R typically
refers to them as “standardized”. Since we have a good intuition about the scale
of a standard normal distribution, the scale of standardized residuals will give
a good indicator if normality is violated.
There are actually two types of studentized residuals, typically called internal
and external among statisticians. The version presented above is the internal
version which can be obtained using the R function rstandard() while the
external version is available using rstudent(). Whenever you see R present
standardized residuals, they are talking about internally studentized residuals.
For sake of clarity, I will use the term standardized as well.
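A sketch for a generic fitted lm object (here called m, a hypothetical name):
rstandard(m)   # internally studentized ("standardized") residuals
rstudent(m)    # externally studentized residuals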
(Plot of the standardized residuals vs observation index for Anscombe set 3.)
We notice that the outlier residual is really big. If the model assumptions
were true, then the standardized residuals should follow a standard normal dis-
tribution, and I would need to have hundreds of observations before I wouldn't
be surprised to see a residual more than 3 standard deviations from 0.
7.1.1.2 Leverage
The extremely large standardized residual suggests that this data point is im-
portant, but we would like to quantify how important this observation actually
is.
One way to quantify this is to look at the elements of $H$. Because
$$\hat{y}_i = \sum_{j=1}^n H_{ij}\, y_j$$
then the 𝑖th row of 𝐻 is a vector of weights that tell us how influential a point
𝑦𝑗 is for calculating the predicted value 𝑦𝑖̂ . If I look at just the main diagonal
of 𝐻, these are how much weight a point has on its predicted value. As such, I
can think of the 𝐻 𝑖𝑖 as the amount of leverage a particular data point has on
the regression line. It can be shown that the leverages must be 0 ≤ 𝐻 𝑖𝑖 ≤ 1 and
that ∑ 𝐻 𝑖𝑖 = 𝑝.
X <- model.matrix(model3)
H <- X %*% solve( t(X) %*% X) %*% t(X)
round(H, digits=2)
## 1 2 3 4 5 6 7 8 9 10 11
## 1 0.32 0.27 0.23 0.18 0.14 0.09 0.05 0.00 -0.05 -0.09 -0.14
## 2 0.27 0.24 0.20 0.16 0.13 0.09 0.05 0.02 -0.02 -0.05 -0.09
## 3 0.23 0.20 0.17 0.15 0.12 0.09 0.06 0.04 0.01 -0.02 -0.05
## 4 0.18 0.16 0.15 0.13 0.11 0.09 0.07 0.05 0.04 0.02 0.00
## 5 0.14 0.13 0.12 0.11 0.10 0.09 0.08 0.07 0.06 0.05 0.05
## 6 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
## 7 0.05 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## 8 0.00 0.02 0.04 0.05 0.07 0.09 0.11 0.13 0.15 0.16 0.18
## 9 -0.05 -0.02 0.01 0.04 0.06 0.09 0.12 0.15 0.17 0.20 0.23
## 10 -0.09 -0.05 -0.02 0.02 0.05 0.09 0.13 0.16 0.20 0.24 0.27
## 11 -0.14 -0.09 -0.05 0.00 0.05 0.09 0.14 0.18 0.23 0.27 0.32
(Leverage values $H_{ii}$ plotted against observation index for Anscombe sets 3 and 4.)
This leverage idea only picks out the potential for a specific value of 𝑥 to be
influential, but does not actually measure influence. It has picked out the issue
with the fourth data set, but does not adequately address the outlier in set 3.
To attempt to measure the actual influence of an observation $\{y_i, x_i^T\}$ on the linear
model, we consider the effect on the regression if we removed the observation.
Cook's distance combines the size of the standardized residual with the leverage,
$D_i = \frac{(\hat{\epsilon}^*_i)^2}{p}\left(\frac{H_{ii}}{1-H_{ii}}\right)$,
so a point must have both a large residual and substantial leverage to be influential.
# Note: The high leverage point in set 4 has a Cook's distance of Infinity.
ggplot(rbind(Set3,Set4), aes(x=index, y=cooksd)) +
geom_point() +
facet_grid(. ~ set) +
labs(y="Cook's Distance")
(Cook's distance vs observation index for Anscombe sets 3 and 4.)
Some texts will give a rule of thumb that points with Cook's distances greater
than 1 should be considered influential, while other books claim a reasonable
rule of thumb is $4/(n - p - 1)$ where $n$ is the sample size and $p$ is the number
of parameters in $\beta$. My take on this is that you should look for values that are
highly different from the rest of your data.
After fitting a linear model in R, you have the option of looking at diagnostic
plots that help to decide if any assumptions are being violated. We will step
through each of the plots that are generated by the function plot(model) or,
using ggplot2, by the ggfortify package. The ggfortify package also provides a
function that will calculate the diagnostic measures and add them to your
dataset, which simplifies our graphing process.
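A sketch of the two routes, with model standing in for any fitted lm object:
plot(model)                 # base R diagnostic plots
library(ggfortify)
autoplot(model)             # ggplot2 versions
fortify(model) %>% head()   # adds .fitted, .resid, .stdresid, .hat, .cooksd columns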
In the simple linear regression the most useful plot to look at was the residuals
versus the 𝑥-covariate, but we also saw that this was similar to looking at the
residuals versus the fitted values. In the general linear model, we will look at
the residuals versus the fitted values or possibly the studentized residuals versus
the fitted values.
(Scatterplot of clutch.size vs carapace length for the tortoise data.)
Looking at the data, it seems that we are violating the assumption that a linear
model is appropriate, but we will fit the model anyway and look at the residual
graph.
(Residuals vs Fitted plot for lm(clutch.size ~ carapace); observations 2, 12, and 17 are flagged.)
(Additional diagnostic plots for the clutch size model: a histogram of the residuals, residuals vs carapace, a normal Q-Q plot, a scale-location plot, a residuals vs leverage plot with Cook's distance, and the residuals vs fitted values plot.)
The blue curve going through the plot is a smoother of the residuals. Ideally
this should be a flat line and I should see no trend in this plot. Clearly there
is a quadratic trend: larger tortoises have larger clutch sizes up until some point
where the extremely large tortoises start laying fewer eggs (perhaps the extremely
large tortoises are extremely old as well). To correct for this, we should fit a
model that is quadratic in carapace length. We will create a new covariate,
carapace.2, which is the square of the carapace length, and add it to the model.
In general I could write the quadratic model as
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
and note that my model is still a linear model with respect to the covariates $x$ and
$x^2$ because I can still write the model as
$$y = X\beta + \epsilon = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
# Fit an arbitrary degree polynomial, I recommend this method for fitting the model!
model <- lm( clutch.size ~ poly(carapace, 2), data=Eggs )
# If you use poly() in the formula, you must use 'data=' here,
# otherwise you can skip it and R will do the right thing.
autoplot(model, which=1, data=Eggs)
(Residuals vs Fitted plot for the quadratic model; no trend remains, and observations 2, 15, and 16 are flagged.)
Now our residual plot versus fitted values does not show any trend, suggesting
that the quadratic model is fitting the data well. Graphing the original data
along with the predicted values confirms this.
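Presumably the fit, lwr, and upr columns used below were created first along these lines (a sketch):
Eggs <- Eggs %>% cbind( predict(model, interval='confidence') )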
ggplot(Eggs, aes(x=carapace)) +
geom_ribbon( aes(ymin=lwr, ymax=upr), fill='red', alpha=.3) +
geom_line(aes(y=fit), color='red') +
geom_point(aes(y=clutch.size))
(Clutch size data with the fitted quadratic curve and its confidence band.)
7.1.2.1.2 Heteroskedasticity
The plot of residuals versus fitted values can detect heteroskedasticity (non-
constant variance) in the error terms.
To illustrate this, we turn to another dataset in the Faraway book. The dataset
airquality uses data taken from an environmental study that measured four
variables, ozone, solar radiation, temperature and wind speed for 153 consec-
utive days in New York. The goal is to predict the level of ozone using the
weather variables.
We first graph all pairs of variables in the dataset.
data(airquality)
# pairs(~ Ozone + Solar.R + Wind + Temp, data=airquality)
airquality %>% select( Solar.R, Wind, Temp, Ozone) %>%
GGally::ggpairs()
(Pairs plot of Solar.R, Wind, Temp, and Ozone from the airquality data.)
We notice that ozone levels are positively correlated with solar radiation and
temperature, and negatively correlated with wind speed. A linear relationship
with wind might be suspect, as is the increasing variability in the response at
high temperatures. However, we don't know if those trends will remain after
fitting the model, because there is some covariance among the predictors.
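A sketch of the regression behind the following plots; the exact formula is an assumption.
model <- lm( Ozone ~ Solar.R + Wind + Temp, data=airquality )
autoplot(model, which=1)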
(Residuals vs Fitted plot for the ozone model; the variability increases with the fitted values and observations 30, 62, and 117 are flagged.)
7.1.2.2 QQplots
The idea of a QQ plot is to compare the sorted observed values to the quantiles we would expect from a standard normal distribution.
(Figure: the standard normal density with the quantiles $z_{0.05}, z_{0.15}, \dots, z_{0.95}$ marked.)
I can then graph the theoretical quantiles vs my observed values and if they lie
on the 1-to-1 line, then my data comes from a standard normal distribution.
set.seed(93516) # make the random sample in the next code chunk consistent run-to-run
n <- 10
data <- data.frame( observed = rnorm(n, mean=0, sd=1) ) %>%
arrange(observed) %>%
mutate( theoretical = qnorm( (1:n -.5)/n ) )
(Scatterplot of the observed values vs the theoretical normal quantiles; the points fall near the 1-to-1 line.)
In the context of a regression model, we wish to look at the residuals and see
if there are obvious departures from normality. Returning to the air quality
example, R will calculate the qqplot for us.
(Normal Q-Q plot of the standardized residuals for the ozone model; observations 30, 62, and 117 are flagged.)
In this case, we have a large number of residuals that are bigger than I would
expect them to be if they came from a normal distribution. We can further
test this using the Shapiro-Wilk test, comparing the standardized residuals
against a $N(0,1)$ distribution.
shapiro.test( rstandard(model) )
##
## Shapiro-Wilk normality test
##
## data: rstandard(model)
## W = 0.9151, p-value = 2.819e-06
The tail of the distribution of observed residuals is far from what we expect to
see.
This plot is a variation on the fitted vs residuals plot, but the y-axis uses the
square root of the absolute value of the standardized residuals. Supposedly this
makes increasing variance easier to detect, but I'm not convinced.
This plot lets the user examine which observations have a high potential for
being influential (i.e. high leverage) versus how large the residual is. Because
Cook's distance is a function of those two traits, we can also divide the graph
up into regions by the value of Cook's distance.
(Residuals vs Leverage plot (ggfortify version) for the model fit to Anscombe set 3; observations 9, 10, and 11 are labeled.)
We see that one data point (observation 10) has an extremely large standardized
residual. This is one plot where I prefer what the base graphics in R do compared
to the ggfortify version. The base version of R adds contour lines that
mark where Cook's distance is 1/2 and 1.
plot(model3, which=5)
(Base R Residuals vs Leverage plot for lm(y ~ x), with Cook's distance contours at 0.5 and 1; observations 9, 10, and 11 are labeled.)
7.2 Exercises
1. In the ANCOVA chapter, we examined the relationship between the dose of
vitamin C and guinea pig tooth growth, based on how the vitamin was delivered
(orange juice vs a pill supplement).
a. Load the ToothGrowth data which is pre-loaded in base R.
b. Plot the data with dose level on the x-axis and tooth length growth
(len) on the y-axis. Color the points by supplement type (supp).
c. Is the data evenly distributed along the x-axis? Comment on the
wisdom of using this design.
d. Fit a linear model to these data and examine the diagnostic plots.
What stands out to you?
e. Log-transform the dose variable and repeat parts (c) and (d). Com-
ment on the effect of the log transformation.
2. The dataset infmort in the faraway package has information about infant
mortality from countries around the world. Be aware that this is an old data
set and does not necessarily reflect current conditions. More information
about the dataset can be found using ?faraway::infmort. We will be
interested in understanding how infant mortality is predicted by per capita
income, world region, and oil export status.
a. Plot the relationship between income and mortality. This can be
done using the command
data('infmort', package='faraway')
pairs(mortality ~., data=infmort)
Chapter 8
Data Transformations
Recall the simple linear regression model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$. The necessary assumptions are that the errors are
independent, have constant variance, and are normally distributed, and that the mean of
$y$ is a linear function of the covariates.
The natural log function looks like this:
(Plot of log(x) for x between 0 and 8.)
1. As 𝑥 → 0, log(𝑥) → −∞.
2. At 𝑥 = 1 we have log(1) = 0.
3. As 𝑥 → ∞, log(𝑥) → ∞ as well, but at a much slower rate.
4. Even though 𝑙𝑜𝑔(𝑥) is only defined for 𝑥 > 0, the result can take on any
real value, positive or negative.
The exponential function $e^x$ looks like this:
(Plot of exp(x) for x between −3 and 2.)
1. as 𝑥 → −∞, 𝑒𝑥 → 0.
2. At 𝑥 = 0 we have 𝑒0 = 1.
3. as 𝑥 → ∞, 𝑒𝑥 → ∞ as well, but at a much faster rate.
4. The function 𝑒𝑥 can be evaluated for any real number, but the result is
always > 0.
Finally we have that $e^x$ and $\log(x)$ are inverse functions of each other via the
identities
$$x = \log(e^x) \quad\text{and}\quad x = e^{\log(x)} \;\text{ if } x > 0$$
Also it is important to note that the log function has some interesting properties
in that it makes operations "one operation easier": $\log(ab) = \log(a) + \log(b)$ and
$\log(a^b) = b\log(a)$, while on the exponential side
$$e^{a+b} = e^a e^b$$
The reason we like using a log transformation is that it acts differently on large
values than on small ones. In particular, for $x > 1$, $\log(x)$ makes all of the
values smaller, but the shrinkage is more extreme for big values of $x$. Consider
the following, where most of the x-values are small but we have a few that are
quite large. Those large values will have extremely high leverage and we'd like
to reduce that.
(Density plots of x and log(x): the raw values are strongly right-skewed, while the logged values are much more symmetric.)
(Scatterplots of y vs x and log(y) vs x, each with a fitted regression line and confidence band.)
Clearly the model fit to the log-transformed y-variable is a much better regression
model. However, I would like to take the regression line and confidence
interval back to the original y-scale. This is allowed by applying the inverse
function, $e^{\log(\hat{y})}$.
For example, suppose we fit a linear model for income ($y$) based on the amount of
schooling the individual has received ($x$). In this case, I don't really want to
make predictions on the $\log(y)$ scale, because (almost) nobody will understand
the magnitude of a difference between predicting 5 vs 6 on that scale.
Suppose the model is
$$\log y = \beta_0 + \beta_1 x + \epsilon$$
then we might want to give a prediction interval for an $x_0$ value. The predicted
$\log(income)$ value is
$$\log(\hat{y}_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$$
and we could calculate the appropriate predicted income as
$$\hat{y}_0 = e^{\hat{\beta}_0 + \hat{\beta}_1 x_0} = e^{\log(\hat{y}_0)}$$
Likewise if we had a confidence interval or prediction interval for $\log(\hat{y}_0)$ of the
form $(l, u)$ then the appropriate interval for $\hat{y}_0$ is $(e^l, e^u)$. Notice that while
$(l, u)$ might be symmetric about $\log(\hat{y}_0)$, the back-transformed interval is not
symmetric about $\hat{y}_0$.
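Presumably the plotted fit, lwr, and upr columns were created by back-transforming predictions from the log-scale fit; a sketch, with m.log as a hypothetical name for that model:
data <- data %>% cbind( exp( predict(m.log, newdata=data, interval='prediction') ) )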
ggplot(data, aes(x=x)) +
geom_ribbon( aes(ymin=lwr, ymax=upr), alpha=.6 ) +
geom_line( aes(y=fit), color='blue' ) +
geom_point( aes(y=y) ) +
labs(y='y')
(Plot of the data with the back-transformed fitted curve and interval on the original y-scale.)
This back transformation of the $\hat{y}$ values will be acceptable for any 1-to-1 trans-
formation we use, not just $\log(y)$.
Unfortunately the interpretation of the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ on the
un-transformed scale becomes more complicated. This is a very serious difficulty
and might sway a researcher from transforming their data.
data(gala, package='faraway')
g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
# I don't like loading the MASS package because it includes a select() function
# that fights with dplyr::select(), so whenever I use a function in the MASS
# package, I just call it using the package::function() naming.
#
# #MASS::boxcox(g, lambda=seq(-2,2, by=.1)) # Set lambda range manually...
MASS::boxcox( g ) # With default lambda range.
(Box-Cox profile log-likelihood for the gala model, with λ between −2 and 2 and a 95% interval marked.)
The optimal transformation for these data would be $y^{1/4} = \sqrt[4]{y}$ but that is
an extremely uncommon transformation. Instead we should pick the nearest
"standard" transformation, which suggests that we should use either the
$\log y$ or $\sqrt{y}$ transformation.
Thoughts on the Box-Cox transformation:
8.3 Transforming the Predictors
Often the effect of a covariate is not linearly related to the response, but rather
to some function of the covariate. For example the area of a circle is not linearly
related to its radius, but it is linearly related to the radius squared:
$$Area = \pi r^2$$
Similar situations might arise in biological settings, such as the volume of
conducting tissue being related to the square of the diameter, or an
animal's metabolic requirements being related to some power of body length. In
sociology, it is often seen that the utility of, say, $1000 drops off in a logarith-
mic fashion according to the person's income. To a graduate student, $1K is
a big deal, but to a corporate CEO, $1K is just another weekend at the track.
Making a log transformation of any monetary covariate might account for the
non-linear nature of "utility".
Picking a good transformation for a covariate is quite difficult, but most fields
of study have spent plenty of time thinking about these issues. When in doubt,
look at scatter plots of the covariate vs the response and ask what transformation
would make the data fall onto a line?
data('gala', package='faraway')
# look at all the scatterplots
gala %>%
mutate(LogSpecies = log(Species)) %>%
dplyr::select(LogSpecies, Area, Elevation, Nearest, Scruz, Adjacent) %>%
GGally::ggpairs(upper=list(continuous='points'), lower=list(continuous='cor'))
(Pairs plot of LogSpecies against the untransformed covariates Area, Elevation, Nearest, Scruz, and Adjacent; the covariates are strongly right-skewed.)
Given the high leverage values in skewed covariates such as Area, Adjacent, and Scruz, a log transformation should be a good idea.
One problem is that log(0) = −∞. A quick look at the data set summary:
gala %>%
dplyr::select(Species, Area, Elevation, Nearest,Scruz, Adjacent) %>%
summary()
reveals that Scruz has a zero value, and so a log transformation would result in a
−∞. So let's take the square root of Scruz instead.
gala %>%
  mutate(LogSpecies = log(Species), LogElevation = log(Elevation), LogArea = log(Area),
         LogNearest = log(Nearest), SqrtScruz = sqrt(Scruz), LogAdjacent = log(Adjacent)) %>%
  dplyr::select(LogSpecies, LogElevation, LogArea, LogNearest, SqrtScruz, LogAdjacent) %>%
  GGally::ggpairs(upper=list(continuous='points'), lower=list(continuous='cor'))
(Pairs plot of LogSpecies against LogElevation, LogArea, LogNearest, SqrtScruz, and LogAdjacent; the log-transformed island size variables are highly correlated with LogSpecies and with each other.)
We will remove all the parameters that appear to be superfluous, and perform
an F-test to confirm that the simple model is sufficient.
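A sketch of the comparison; the exact form of the full model is an assumption.
m.c <- lm( log(Species) ~ log(Area) + log(Elevation) + log(Nearest) + sqrt(Scruz) + log(Adjacent), data=gala )
m.s <- lm( log(Species) ~ log(Area), data=gala )
anova(m.s, m.c)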
summary(m.s)
##
## Call:
## lm(formula = log(Species) ~ log(Area), data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5442 -0.4001 0.0941 0.5449 1.3752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.9037     0.1571  18.484  < 2e-16 ***
## log(Area)     0.3886     0.0416   9.342 4.23e-10 ***
The slope coefficient (0.3886) is the increase in log(Species) for every 1-unit
increase in log(Area). Unfortunately that is not particularly convenient to
interpret, and we will address this in the next section of this chapter.
Finally, we might be interested in creating a confidence interval for the expected
number of tortoise species for an island with Area=50.
x0 <- data.frame(Area=50)
log.Species.CI <- predict(m.s, newdata=x0, interval='confidence')
log.Species.CI # Log(Species) scale
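Back-transforming to the Species scale (sketch):
exp(log.Species.CI)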
Notice that on the species-scale, we see that the fitted value is not in the center
of the confidence interval.
To help us understand what the log transformations are doing, we can produce
a plot with the island Area on the x-axis and the expected number of Species on
the y-axis and hopefully that will help us understand the relationship between
the two.
library(ggplot2)
pred.data <- data.frame(Area=1:50)
pred.data <- pred.data %>%
cbind( predict(m.s, newdata=pred.data, interval='conf'))
ggplot(pred.data, aes(x=Area)) +
geom_line(aes(y=exp(fit))) +
geom_ribbon(aes(ymin=exp(lwr), ymax=exp(upr)), alpha=.2) +
ylab('Number of Species')
(Plot of the expected Number of Species vs island Area (1 to 50), back-transformed from the log-log model, with a confidence band.)
8.4 Interpretation of Log Transformed Variable Coefficients
One of the most difficult issues surrounding transformed variables is that the
interpretation is difficult. Compared to taking the square root, log transforma-
tions are surprisingly interpretable on the original scale. Here we look at the
interpretation of log-transformed variables.
scores %>%
dplyr::select(write, read, math, gender) %>%
GGally::ggpairs( aes(color=gender),
upper=list(continuous='points'), lower=list(continuous='cor'))
(Pairs plot of write, read, math, and gender from the scores data, colored by gender.)
These data look pretty decent, and I’m not certain that I would do any transfor-
mation, but for the sake of having a concrete example that has both continuous
and categorical covariates, we will interpret effects on a students’ writing score.
We consider the model where we have transformed the response variable and
just an intercept term.
$$\log y = \beta_0 + \epsilon$$
## # A tibble: 1 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.95 0.0137 288. 7.01e-263
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
$$\log y = \begin{cases} \beta_0 + \epsilon & \text{if female} \\ \beta_0 + \beta_1 + \epsilon & \text{if male} \end{cases}$$
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.00 0.0179 223. 1.02e-239
## 2 gendermale -0.103 0.0266 -3.89 1.39e- 4
The intercept is now the mean of the log-transformed write responses for the
females, and thus $e^{\hat{\beta}_0} = \hat{y}_f$; the offset for males is the change in log(write)
from the female group. Notice that for the males we have
$$\log \hat{y}_m = \hat{\beta}_0 + \hat{\beta}_1$$
$$\hat{y}_m = e^{\hat{\beta}_0 + \hat{\beta}_1} = \underbrace{e^{\hat{\beta}_0}}_{\hat{y}_f} \cdot \underbrace{e^{\hat{\beta}_1}}_{\text{multiplier for males}}$$
and therefore we see that males tend to have writing scores $e^{-0.103} = 0.90 = 90\%$
of the females'. Typically this sort of result would be reported as the males having
a 10% lower writing score than the females.
Hand calculating these correctly is challenging, but as usual we can have
emmeans calculate it for us, as sketched below.
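A sketch, with model.g as a hypothetical name for the log(write) ~ gender fit; type='response' back-transforms the group means and reports the contrast as a ratio.
emmeans(model.g, pairwise ~ gender, type='response')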
𝛽0 + 𝛽 2 𝑥 + 𝜖 if female
log 𝑦 = {
𝛽0 + 𝛽 1 + 𝛽 2 𝑥 + 𝜖 if male
We will use the reading score read to predict the writing score. Then $\hat{\beta}_2$ is the
predicted increase in log(write) for every 1-unit increase in read score. The
interpretation of $\hat{\beta}_0$ is now $\log\hat{y}$ when $x = 0$, and therefore $\hat{y} = e^{\hat{\beta}_0}$ when $x = 0$.
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.41 0.0546 62.5 8.68e-132
## 2 gendermale -0.116 0.0210 -5.52 1.08e- 7
## 3 read 0.0113 0.00102 11.1 2.02e- 22
For females, we consider the difference in $\log\hat{y}$ for a 1-unit increase in $x$ and
will interpret this on the original write scale.
If we are interested in, say, a 20-unit increase in $x$, then that results in a
multiplicative change of
$$\frac{e^{\hat{\beta}_0 + \hat{\beta}_2(x+20)}}{e^{\hat{\beta}_0 + \hat{\beta}_2 x}} = \frac{e^{\hat{\beta}_0}\, e^{\hat{\beta}_2 x}\, e^{20\hat{\beta}_2}}{e^{\hat{\beta}_0}\, e^{\hat{\beta}_2 x}} = e^{20\hat{\beta}_2} = \left(e^{\hat{\beta}_2}\right)^{20}$$
In short, we can interpret $e^{\hat{\beta}_i}$ as the multiplicative increase/decrease in the non-
transformed response variable. Some students get confused by what is meant
by a % increase or decrease in $y$.
This means that so long as the ratio between the two x-values is constant, then
the change in 𝑦 ̂ is the same. So doubling the value of 𝑥 from 1 to 2 has the same
effect on 𝑦 ̂ as changing x from 50 to 100.
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -59.1 9.95 -5.94 1.28e- 8
## 2 gendermale -5.43 1.01 -5.36 2.29e- 7
## 3 log(read) 29.0 2.53 11.5 9.98e-24
## 1 2 3
## 48.06622 59.84279 71.61936
We should see a $29.045 \log(1.5) = 11.78$ difference in $\hat{y}$ values between the first
and second students, and between the second and third.
## $emmeans
## read emmean SE df lower.CL upper.CL
## 2 -41.66 8.21 197 -57.9 -25.46
## 4 -21.53 6.47 197 -34.3 -8.78
## 8 -1.39 4.72 197 -10.7 7.92
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 4 - 2 20.1 1.75 197 11.493 <.0001
## 8 - 2 40.3 3.50 197 11.493 <.0001
## 8 - 4 20.1 1.75 197 11.493 <.0001
##
## Results are averaged over the levels of: gender
## P value adjustment: tukey method for comparing a family of 3 estimates
$$\log y = \beta_0 + \beta_2 \log x + \epsilon$$
and we again consider two $x$ values ($x_1$ and $x_2$). We then examine the
difference in the $\log\hat{y}$ values:
$$\log \hat{y}_2 - \log \hat{y}_1 = \left[\hat{\beta}_0 + \hat{\beta}_2 \log x_2\right] - \left[\hat{\beta}_0 + \hat{\beta}_2 \log x_1\right]$$
$$\log\left[\frac{\hat{y}_2}{\hat{y}_1}\right] = \hat{\beta}_2 \log\left[\frac{x_2}{x_1}\right] = \log\left[\left(\frac{x_2}{x_1}\right)^{\hat{\beta}_2}\right]$$
$$\frac{\hat{y}_2}{\hat{y}_1} = \left(\frac{x_2}{x_1}\right)^{\hat{\beta}_2}$$
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.71 0.205 8.36 1.14e-14
## 2 gendermale -0.114 0.0209 -5.48 1.27e- 7
## 3 log(read) 0.581 0.0521 11.1 1.08e-22
which implies that for a 10% increase in read score, we should see a $1.10^{0.581} = 1.056$
multiplier in write score. That is to say, a 10% increase in reading score results
in roughly a 5% increase in writing score.
## $emmeans
## read response SE df lower.CL upper.CL
## 55 53.8 0.594 197 52.6 54.9
## 50 50.9 0.535 197 49.8 51.9
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df t.ratio p.value
## 55 / 50 1.06 0.00525 197 11.148 <.0001
##
## Results are averaged over the levels of: gender
## Tests are performed on the log scale
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.90 0.157 18.5 3.17e-17
## 2 log(Area) 0.389 0.0416 9.34 4.23e-10
## $emmeans
## Area response SE df lower.CL upper.CL
## 400 187 43.7 28 116.0 302
## 200 143 30.2 28 92.7 221
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df t.ratio p.value
## 400 / 200 1.31 0.0377 28 9.342 <.0001
##
## Tests are performed on the log scale
and therefore a doubling of Area (i.e. a ratio $Area_2/Area_1 = 2$) results
in a $2^{0.389} = 1.31$ multiplier of the Species value. That is to say, doubling the
island area increases the number of species by about 31%.
In the table below, $\beta$ represents the group offset value, or the slope value asso-
ciated with $x$. If we are in a model with multiple slopes, such as an ANCOVA
model, then $\beta$ represents the slope of whatever group you are interested in.

Model                                  Interpretation
$y = \beta_0 + \beta x$                A 1-unit increase in $x$ is associated with a $\beta$ change in $y$.
$\log y = \beta_0 + \beta x$           A 1-unit increase in $x$ multiplies $y$ by $e^{\beta}$.
$y = \beta_0 + \beta \log x$           Multiplying $x$ by a factor $c$ changes $y$ by $\beta \log(c)$.
$\log y = \beta_0 + \beta \log x$      Multiplying $x$ by a factor $c$ multiplies $y$ by $c^{\beta}$.
8.5 Exercises
1. In the ANCOVA chapter, we examined the relationship between the dose of
vitamin C and guinea pig tooth growth, based on how the vitamin was delivered
(orange juice vs a pill supplement).
a. Load the ToothGrowth data which is available in base R.
b. Plot the data with log dose level on the x-axis and tooth length
growth on the y-axis. Color the points by supplement type.
c. Fit a linear model using the log transformed dose.
d. Interpret the effect of doubling the dose on tooth growth for the OJ
and VC supplement groups.
2. We will consider the relationship between income and race using a subset
of employed individuals from the American Community Survey.
a. Load the EmployedACS dataset from the Lock5Data package.
b. Create a box plot showing the relationship between Race and Income.
c. Consider the Box-Cox family of transformations of Income. What
transformation seems appropriate? Consider both the square-root and
log transformations. While the race differences are not statistically
significant in either case, there is an interesting shift in how the black,
white, and other groups are related. Because there are people with
zero income, we have to do something: we could use a transformation
like $\sqrt{y}$, remove all the zero observations, or add a
small value to the zero observations. We'll add 0.05 to the zero val-
ues, which represents the zero-income people receiving $50. Graph
both the log and square-root transformations. Does either completely
address the issue? What about the cube root ($\lambda = 1/3$)?
d. Using your cube-root transformed Income variable, fit an ANOVA
model and evaluate the relationship between race and income uti-
lizing these data. Provide (and interpret) the point estimates, even
though they aren’t statistically significant. Importantly, we haven’t
accounted for many sources of variability such as education level and
job type. There is much more to consider than just this simple anal-
ysis.
3. The dataset Lock5Data::HomesForSale has a random sample of home
prices in 4 different states. Consider a regression model predicting the
library(tidyverse)
Variable Interpretation
Weight Weight (g)
Length.1 Length from nose to beginning of Tail (cm)
Length.2 Length from nose to notch of Tail (cm)
Length.3 Length from nose to tip of tail (cm)
Height Maximal height as a percentage of Length.3
Width Maximal width as a percentage of Length.3
Sex 0=Female, 1=Male
Species Which species of perch (1-7)
We first look at the data and observe the expected relationship between length
and weight.
fish %>%
dplyr::select(Weight, Length.1, Length.2, Length.3, Height, Width) %>%
GGally::ggpairs(upper=list(continuous='points'),
lower=list(continuous='cor'))
(Pairs plot of Weight, Length.1, Length.2, Length.3, Height, and Width for the fish data; the three length measurements are almost perfectly correlated with each other.)
Naively, we might consider the linear model with all the length effects present.
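Presumably fit along these lines (a sketch):
m.big <- lm( Weight ~ Length.1 + Length.2 + Length.3 + Height + Width, data=fish )
summary(m.big)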
##
## Call:
## lm(formula = Weight ~ Length.1 + Length.2 + Length.3 + Height +
## Width, data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Call:
## lm(formula = Weight ~ Length.2 + Height + Width, data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -306.14 -75.11 -36.45 89.54 337.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -701.0750 71.0438 -9.868 < 2e-16 ***
## Length.2 30.4360 0.9841 30.926 < 2e-16 ***
## Height 5.5141 1.4311 3.853 0.000171 ***
## Width 5.6513 5.2016 1.086 0.278974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
When you have two variables in a model that are highly positively correlated,
you often find that one will have a positive coefficient and the other will be
negative. Likewise, if two variables are highly negatively correlated, the two
regression coefficients will often be the same sign.
In this case the length coefficients sum to approximately 31 in both models: with
all three length variables in the model, one can be strongly negative and another
strongly positive with approximately the same magnitude, and we get approx-
imately the same fit as when two of the length variables are removed from the
model.
Mathematically, if the model is
$$y_i = \beta_0 + \beta_1 L_1 + \beta_2 L_2 + \beta_3 L_3 + \epsilon_i$$
but the covariates are highly correlated, then approximately $L_1 = L_2 = L_3 = L$
and this equation could be written
$$y_i = \beta_0 + (\beta_1 + \beta_2 + \beta_3)L + \epsilon_i$$
In general, you should be very careful with the interpretation of the regression
coefficients when the covariates are highly correlated.
Solutions
1. Select one of the correlated covariates to include in the model. You could
select the covariate with the highest correlation with the response, or use
your scientific judgment as to which covariate is most appropriate.
2. You could create an index variable that combines the correlated covariates
into a single amalgamation variable. For example, we could create 𝐿 =
(𝐿1 + 𝐿2 + 𝐿3 )/3 to be the average of the three length measurements.
4. Just not worry about it. Remember, the only problem is with the in-
terpretation of the parameters. If we only care about making accurate
predictions within the scope of our data, then including highly correlated
variables doesn’t matter. However, if you are making predictions away
from the data, the flipped signs might be a big deal.
Chapter 9
Variable Selection
Given a set of data, we are interested in selecting the best subset of predictors
for the following reasons:
The problems that arise in the diagnostics of a model will often lead a researcher
to consider other models, for example to include a quadratic term to account for
curvature. The model building process is often an iterative procedure where we
build a model, examine the diagnostic plots and consider what could be added
or modified to correct issues observed.
We should be careful to note that we typically do not want to remove the main
covariate from the model if the model uses the covariate in a more complicated
fashion. For example, if my model is
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$
where 𝜖 ∼ 𝑁 (0, 𝜎2 ), then considering the simplification 𝛽1 = 0 and removing
the effect of 𝑥 is not desirable because that forces the parabola to be symmetric
about 𝑥 = 0. Similarly, if the model contains an interaction effect, then the
removal of the main effect drastically alters the interpretation of the interaction
coefficients and should be avoided. Often times removing a lower complexity
term while keeping a higher complexity term results in unintended consequences
and is typically not recommended.
Using data from the Census Bureau we can look at the life expectancy as a
response to a number of predictors. One R function that is often convenient
to use is the update() function that takes a lm() object and adds or removes
things from the formula. The notation . ~ . means to leave the response and
all the predictors alone, while . ~ . + vnew will add the main effect of vnew
to the model.
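A sketch of the update() syntax; the variable names here are illustrative.
m.first <- lm( Life.Exp ~ ., data=state.data )     # all main effects
m.next  <- update(m.first, . ~ . - Illiteracy)     # drop a term
m.next2 <- update(m.first, . ~ . + I(Income^2))    # add a term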
state.data %>%
dplyr::select( Life.Exp, Population:Area ) %>%
GGally::ggpairs( upper=list(continuous='points'), lower=list(continuous='cor') )
[Figure: pairs plot of Life.Exp and the candidate predictors (Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area), showing pairwise scatterplots and correlations.]
I want to add a quadratic effect for HS.Grad rate and for Income. Also, we see
that Population and Area seem to have some high skew to their distributions,
so a log transformation might help. We’ll modify the data and then perform
the backward elimination method starting with the model with all predictors as
main effects.
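A hedged sketch of that setup; the transformed column names match those that appear in the output below, but the author's exact code is not shown:

state.data <- state.data %>%
  mutate( Log.Population = log(Population),
          Log.Area       = log(Area),
          HS.Grad.2      = HS.Grad^2,
          Income.2       = Income^2 )
m1 <- lm( Life.Exp ~ Log.Population + Income + Illiteracy + Murder +
            HS.Grad + Frost + Log.Area + HS.Grad.2 + Income.2,
          data = state.data )
summary(m1)$coefficients %>% round(digits=3)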
The signs make reasonable sense (higher murder rates decrease life expectancy)
but covariates like Income are not significant, which is surprising. The largest
p-value belongs to HS.Grad. However, I don't want to remove the lower-order graduation term and keep the squared term. So instead I will remove both of them since
they are the highest p-values. Notice that HS.Grad is correlated with Income
and Illiteracy.
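A hedged sketch of those first removals (the intermediate summaries are omitted here; the exact order the author used is not shown):

m1 <- update(m1, .~. - HS.Grad - HS.Grad.2)   # drop both graduation terms
m1 <- update(m1, .~. - Illiteracy)            # then the next-largest p-value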
# And Log.Population...
m1 <- update(m1, .~. - Log.Population)
summary(m1)$coefficients %>% round(digits=3)
The removal of Income.2 is a tough decision because the p-value is very close
to 𝛼 = 0.05 and might be left in if it makes model interpretation easier or if the
researcher feels a quadratic effect in income is appropriate (perhaps rich people
are too stressed?).
summary(m1)
##
## Call:
## lm(formula = Life.Exp ~ Income + Murder + Frost + Income.2 +
## Log.Area, data = state.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28858 -0.50631 -0.07242 0.49738 1.75839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.141e+01 4.523e+00 13.577 < 2e-16 ***
## Income 4.212e-03 1.867e-03 2.257 0.0290 *
## Murder -3.092e-01 3.950e-02 -7.828 7.14e-10 ***
## Frost -6.487e-03 2.483e-03 -2.612 0.0123 *
## Income.2 -4.188e-07 2.049e-07 -2.044 0.0470 *
## Log.Area 2.002e-01 9.576e-02 2.091 0.0424 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7349 on 44 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.7003
## F-statistic: 23.9 on 5 and 44 DF, p-value: 1.549e-11
We are left with a model that adequately explains Life.Exp but we should
be careful to note that just because a covariate was removed from the model
does not imply that it isn't related to the response. For example, being a high school graduate is highly correlated with not being illiterate, as is Income, and refitting the model with Illiteracy in place of the income terms shows that illiteracy is associated with lower life expectancy, though it is not as predictive as Income.
##
## Call:
## lm(formula = Life.Exp ~ Illiteracy + Murder + Frost, data = state.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59010 -0.46961 0.00394 0.57060 1.92292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.556717 0.584251 127.611 < 2e-16 ***
## Illiteracy -0.601761 0.298927 -2.013 0.04998 *
## Murder -0.280047 0.043394 -6.454 6.03e-08 ***
## Frost -0.008691 0.002959 -2.937 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7911 on 46 degrees of freedom
## Multiple R-squared: 0.6739, Adjusted R-squared: 0.6527
## F-statistic: 31.69 on 3 and 46 DF, p-value: 2.915e-11
Notice that the $R^2$ values for both models are quite similar (0.7309 vs 0.6739), but the first model with the higher $R^2$ has more predictor variables. Which model should I prefer? I can't do an F-test because these models are not nested.
It is often necessary to compare models that are not nested. For example, I
might want to compare
$$y = \beta_0 + \beta_1 x + \epsilon$$
vs
$$y = \beta_0 + \beta_2 w + \epsilon$$
This comparison comes about naturally when doing forward model selection
and we are looking for the “best” covariate to add to the model first.
Akaike introduced his criterion (which he called "An Information Criterion") as
$$AIC = -2 \log L\left(\hat{\beta} \,|\, \text{data}\right) + 2p$$
where $L(\hat{\beta} \,|\, \text{data})$ is the likelihood function, $p$ is the number of elements in the $\hat{\beta}$ vector, and we regard a lower AIC value as better. Notice that the $2p$ term is essentially a penalty on adding additional covariates, so to lower the AIC value a new predictor must lower the negative log-likelihood more than it increases the penalty.
To convince ourselves that the first summand decreases with decreasing RSS in
the standard linear model, we examine the likelihood function
$$f\left(y \,|\, \beta, \sigma, X\right) = \frac{1}{\left(2\pi\sigma^2\right)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\left(y - X\beta\right)^T \left(y - X\beta\right)\right] = L\left(\beta, \sigma \,|\, y, X\right)$$
It isn’t clear what we should do with the 𝑛 log (2𝜋) term in the log 𝐿() function.
There are some compelling reasons to ignore it and just use the second, and there
are reasons to use both terms. Unfortunately, statisticians have not settled on
one convention or the other and different software packages might therefore
report different values for AIC.
As a general rule of thumb, if the difference in AIC values is less than two then
the models are not significantly different, differences between 2 and 4 AIC units
are marginally significant and any difference greater than 4 AIC units is highly
significant.
Notice that while this allows us to compare models that are not nested, it does require that the same data are used to fit both models. Because I could start out with my data frame including both $x$ and $x^2$ (or more generally $x$ and $f(x)$ for some function $f()$), you can regard a transformation of a covariate as "the same data."
A closely related criterion is the BIC (Bayesian Information Criterion), which replaces the $2p$ penalty with $p \log n$, and this criterion punishes large models more than AIC does (because $\log n > 2$ for $n \geq 8$).
The AIC value of a linear model can be found by applying the AIC() function to an lm() object.
AIC(m1)
## [1] 118.6942
AIC(m2)
## [1] 124.2947
Because the AIC value for the first model is lower, we would prefer the first
model that includes both Income and Income.2 compared to model 2, which
was Life.Exp ~ Illiteracy+Murder+Frost.
One of the problems with $R^2$ is that it makes no adjustment for how many parameters are in the model. Recall that $R^2$ was defined as the proportion of variability in the response explained by the model,
$$R^2 = 1 - \frac{RSS}{TSS}$$
where $TSS$ is the total sum of squares about the mean of the response.
9.3.3 Example
Returning to the life expectancy data, we could start with a simple model and add the covariate that gives the lowest AIC value, repeating until no addition improves the AIC. R makes this easy with the function add1(), which will take a linear model (which includes the data frame that originally defined it), sequentially add each of the possible terms that are not currently in the model, and report the AIC values for each resulting model.
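A hedged sketch of the first step; the formula object biggest (also used later in this chapter) lists all of the candidate main effects, and the output of this first call is what the next paragraph interprets:

m <- lm( Life.Exp ~ 1, data = state.data )     # intercept-only starting model
biggest <- Life.Exp ~ Log.Population + Income + Illiteracy + Murder +
           HS.Grad + Frost + Log.Area + HS.Grad.2 + Income.2
add1(m, scope = biggest)                       # AIC for each single-term addition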
Clearly the addition of Murder to the model results in the lowest AIC value,
so we will add Murder to the model. Notice the <none> row corresponds to
the model m which we started with, and it has RSS = 88.299. For each model considered, R will calculate the RSS for the new model, compute the difference in RSS between the starting model and the more complicated model, and display this difference in the Sum of Sq column.
##
## Model:
## Life.Exp ~ Murder
## Df Sum of Sq RSS AIC
## <none> 34.461 -14.609
## Log.Population 1 2.9854 31.476 -17.140
## Income 1 2.4047 32.057 -16.226
## Illiteracy 1 0.2732 34.188 -13.007
## HS.Grad 1 4.6910 29.770 -19.925
## Frost 1 3.1346 31.327 -17.378
## Log.Area 1 1.4583 33.003 -14.771
## HS.Grad.2 1 4.4396 30.022 -19.505
## Income.2 1 1.8972 32.564 -15.441
There is a companion function to add1() that finds the best term to drop. It is
conveniently named drop1() but here the scope parameter defines the smallest
model to be considered.
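A hedged sketch of its use on the current model:

drop1(m1)              # AIC if each term currently in m1 were dropped
# drop1(m1, test='F')  # or also report F-tests rather than comparing by AIC alone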
It would be nice if all of this work was automated. Again, R makes our life
easy and the function step() does exactly this. The set of models searched
is determined by the scope argument which can be a list of two formulas with
components upper and lower or it can be a single formula, or it can be blank.
The right-hand-side of its lower component defines the smallest model to be
considered and the right-hand-side of the upper component defines the largest
model to be considered. If scope is a single formula, it specifies the upper
component, and the lower model is taken to be the intercept-only model. If scope is missing, the initial model is used as the upper model.
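A hedged sketch of a call that produces a trace like the one below (the author's exact starting model is not shown, but the trace begins at Life.Exp ~ Income):

m.start <- lm( Life.Exp ~ Income, data = state.data )
m3 <- step( m.start, scope = list(lower = Life.Exp ~ 1, upper = biggest) )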
## Start: AIC=26.28
## Life.Exp ~ Income
##
## Df Sum of Sq RSS AIC
## + Murder 1 46.020 32.057 -16.226
## + Illiteracy 1 21.109 56.968 12.523
## + HS.Grad 1 19.770 58.306 13.684
## + Income.2 1 19.062 59.015 14.288
## + HS.Grad.2 1 17.193 60.884 15.847
## + Frost 1 3.188 74.889 26.199
## <none> 78.076 26.283
## + Log.Population 1 1.298 76.779 27.445
##
## Call:
## lm(formula = Life.Exp ~ Murder + Frost + HS.Grad + Log.Population,
## data = state.data)
##
## Coefficients:
## (Intercept) Murder Frost HS.Grad Log.Population
## 68.720810 -0.290016 -0.005174 0.054550 0.246836
Notice that our model selected by step() is not the same model we obtained
when we started with the biggest model and removed things based on p-values.
The log-likelihood is only defined up to an additive constant, and there are
different conventional constants used. This is more annoying than anything
because all we care about for model selection is the difference between AIC
values of two models and the additive constant cancels. The only time it matters
is when you have two different ways of extracting the AIC values. Recall the
model we fit using the top-down approach was
# m1 was
m1 <- lm(Life.Exp ~ Income + Murder + Frost + Income.2, data = state.data)
AIC(m1)
## [1] 121.4293
AIC(m3)
## [1] 114.8959
Because step() and AIC() are following different conventions the absolute value
of the AICs are different, but the difference between the two is constant no
matter which function we use.
First we calculate the difference using the AIC() function:
AIC(m1) - AIC(m3)
## [1] 6.533434
and next we use add1() on both models to see what the AIC values are for each.
add1(m1, scope=biggest)
add1(m3, scope=biggest)
Using these results, we can calculate the difference in AIC values to be the same
as we calculated before
$$-22.465 - (-28.998) = -22.465 + 28.998 = 6.533$$
9.4 Exercises
1. Consider the prostate data from the faraway package. The variable lpsa
is a measurement of a prostate specific antigen, for which higher levels are
indicative of prostate cancer. Use lpsa as the response and all the other
variables as predictors (no interactions). Determine the “best” model
using:
a. Backward elimination using the analysis of variance F-statistic as the
criteria.
b. Forward selection using AIC as the criteria.
2. Again from the faraway package, use the divusa dataset, which has divorce rates
for each year from 1920-1996 along with other population information for
each year. Use divorce as the response variable and all other variables
as the predictors.
a. Determine the best model using stepwise selection starting from the
intercept only model and the most complex model being all main
effects (no interactions). Use the F-statistic to determine significance.
Note: add1(), drop1(), and step() allow an option of test='F' to use
an F-test instead of AIC.
b. Following the stepwise selection, comment on the relationship be-
tween p-values used and the AIC difference observed. Do the AIC
rules of thumb match the p-value interpretation?
Chapter 10
Mixed Effects Models
# library(devtools)
# install_github('dereksonderegger/dsData') # datasets I've made; only install once...
library(dsData)
Often there are covariates in the experimental units that are known to affect
the response variable and must be taken into account. Ideally an experimenter
can group the experimental units into blocks where the within block variance
is small, but the block to block variability is large. For example, in testing a
drug to prevent heart disease, we know that gender, age, and exercise levels
play a large role. We should partition our study participants into gender, age,
and exercise groups and then randomly assign the treatment (placebo vs drug)
within the group. This will ensure that we do not have a gender, age, and
exercise group that has all placebo observations.
Often blocking variables are not the variables that we are primarily interested
in, but must nevertheless be considered. We call these nuisance variables. We
already know how to deal with these variables by adding them to the model,
but there are experimental designs where we must be careful because the ex-
perimental treatments are nested.
Example 1. An agricultural field study has three fields in which the researchers
will evaluate the quality of three different varieties of barley. Due to how they
harvest the barley, we can only create a maximum of three plots in each field. In
this example we will block on field since there might be differences in soil type,
drainage, etc from field to field. In each field, we will plant all three varieties so
that we can tell the difference between varieties without the block effect of field
confounding our inference. In this example, the varieties are nested within the
fields.
The dataset oatvar in the faraway library contains information about an exper-
iment on eight different varieties of oats. The area in which the experiment was
done had some systematic variability and the researchers divided the area up
into five different blocks in which they felt the area inside a block was uniform
while acknowledging that some blocks are likely superior to others for growing
crops. Within each block, the researchers created eight plots and randomly
assigned a variety to a plot. This type of design is called a Randomized Com-
plete Block Design (RCBD) because each block contains all possible levels of
the factor of primary interest.
data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x=block, color=variety)) +
geom_point(size=5) +
geom_line(aes(x=as.integer(block))) # connect the dots
[Figure: yield plotted against block (I-V) for the oatvar data, with points colored by variety (1-8) and connected across blocks.]
While there is one unusual observation in block IV, there doesn’t appear to be
a blatant interaction. We will consider the interaction shortly. For the main
effects model of yield ~ block + variety we have 𝑝 = 12 parameters and 28
residual degrees of freedom because
$$\begin{aligned} df_\epsilon &= n - p \\ &= n - \left(1 + \left[(I-1) + (J-1)\right]\right) \\ &= 40 - \left(1 + \left[(5-1) + (8-1)\right]\right) \\ &= 40 - 12 \\ &= 28 \end{aligned}$$
Because this is an orthogonal design, the sums of squares don't change regardless of which order we add the factors, but if we removed one or two observations, they would.
But the F-value and p-value for testing if block is significant is nonsense! Imagine that variety didn't matter; then we would just have 8 replicate samples per block, but these aren't true replicates, they are what is called pseudoreplicates. Imagine taking a sample of $n = 3$ people and observing their height at 1000 different points in time during the day. You don't have 3000 data points for estimating the mean height in the population, you have 3. Unless we account for this, the inference for the block variable is wrong. In this case, we effectively have only one observation for each block, so we can't do any statistical inference at the block scale!
Fortunately in this case, we don't care about the blocking variable; including it in the model was simply guarding us in case there was a difference, and we weren't interested in estimating it. If the only covariate we care about is the most deeply nested effect, then we can do the usual analysis, recognize that the p-value for the blocking variable is nonsense, and ignore it.
# Ignore any p-values regarding block, but I'm happy with the analysis for variety
letter_df <- emmeans(m1, ~variety) %>%
multcomp::cld(Letters=letters) %>%
dplyr::select(variety, .group) %>%
mutate(yield = 500)
[Figure: yield by variety (1-8) with the compact letter display groups from the pairwise comparisons plotted above each variety.]
However it would be pretty sloppy to not do the analysis correctly because our
blocking variable might be something we care about.
One correct way to model these data is with hierarchical models, which are created by allowing multiple error terms to be introduced. In many respects, the random effects structure provides an extremely flexible framework to consider many of the traditional experimental designs as well as many non-traditional designs, with the benefit of more easily assessing variability at each hierarchical level.
Mixed effects models combine what we call “fixed” and “random” effects.
For example, consider a rabbit study that examined the effect of diet on the growth of domestic rabbits, where we had 10 litters of rabbits and used the 3 most similar rabbits from each litter to test 6 different diets. Here, the 6 different diets are fixed effects because they are not randomly selected from a population, these exact same diets can be further studied, and these are the diets we are interested in. The litters of rabbits and the individual rabbits are randomly selected from populations, cannot be exactly replicated in future studies, and we are not interested in the individual litters but rather in the variability between individuals and between litters.
Often random effects are not of primary interest to the researcher, but must
be considered. Blocking variables are random effects because they arise from a random sample of possible blocks that are potentially available to the researcher.
Mixed effects models are models that have both fixed and random effects. We will first concentrate on understanding how to address a model with two sources of error and then complicate the matter with fixed effects.
Recall that the likelihood function is the function that links the model parameters to the data; it is found by taking the probability density function and interpreting it as a function of the parameters instead of as a function of the data.
Loosely, the probability function tells us what outcomes are most probable, with
the height of the function telling us which values (or regions of values) are most
probable given a set of parameter values. The higher the probability function,
the higher the probability of seeing that value (or data in that region). The
likelihood function turns that relationship around and tells us what parame-
ter values are most likely to have generated the data we have, again with the
parameter values with a higher likelihood value being more “likely”.
The likelihood function for a sample $y_i \overset{iid}{\sim} N\left(X_{i,\cdot}\beta, \sigma\right)$ can be written as a function of our parameters $\beta$ and $\sigma^2$:
$$L\left(\beta, \sigma^2 \,|\, y_1, \dots, y_n\right) = \frac{1}{\left(2\pi\right)^{n/2} \left[\det\left(\Omega\right)\right]^{1/2}} \exp\left[-\frac{1}{2}\left(y - X\beta\right)^T \Omega^{-1} \left(y - X\beta\right)\right]$$
and
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $\hat{y}_i = X_{i,\cdot}\hat{\beta}$. We notice that this is not our usual estimator $\hat{\sigma}^2 = s^2$, where $s^2$ is the sample variance. It turns out that the MLE estimate of $\sigma^2$ is biased (the correction is to divide by $n-1$ instead of $n$). This is normally not an issue if our sample size is large, but with a small sample, the bias is not insignificant.
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} y_i^2$$
Under the null hypothesis that $m_0$ is the true model, $D \overset{\cdot}{\sim} \chi^2_{p_1 - p_0}$, where $p_1 - p_0$ is the difference in the number of parameters in the null and alternative models. That is to say that, asymptotically, $D$ has a Chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom of the two models.
We could think of $L\left(\hat{\theta}_0\right)$ as the maximization of the likelihood when some parameters are held constant (at zero) and all the other parameters are allowed to vary. But we are not required to hold them constant at zero. We could choose any value of interest and perform an LRT.
Because we often regard a confidence interval as the set of values that would
not be rejected by a hypothesis test, we could consider a sequence of possible
values for a parameter and figure out which would not be rejected by the LRT.
In this fashion we can construct confidence intervals for parameter values.
Unfortunately all of this hinges on the asymptotic distribution of $D$, and often this turns out to be a poor approximation. In simple cases more exact tests can be derived (for example, the F-tests we have used previously), but sometimes nothing better is currently known. Another alternative is to use resampling methods for the creation of confidence intervals or p-values.
The 1-way ANOVA model with a random effect is
$$y_{ij} = \mu + \gamma_i + \epsilon_{ij}$$
where $\gamma_i \overset{iid}{\sim} N\left(0, \sigma^2_\gamma\right)$ and $\epsilon_{ij} \overset{iid}{\sim} N\left(0, \sigma^2_\epsilon\right)$. This model could occur, for example, when looking at the adult weight of domestic rabbits where the random effect is the effect of litter and we are interested in understanding how much variability there is between litters ($\sigma^2_\gamma$) and how much variability there is within a litter ($\sigma^2_\epsilon$). Another example is the creation of computer chips. Here a single wafer of silicon is used to create several chips, and we might have wafer-to-wafer variability and then, within a wafer, chip-to-chip variability.
First we should think about what the variances and covariances are for any two
observations.
$$\begin{aligned} Var\left(y_{ij}\right) &= Var\left(\mu + \gamma_i + \epsilon_{ij}\right) \\ &= Var\left(\mu\right) + Var\left(\gamma_i\right) + Var\left(\epsilon_{ij}\right) \\ &= 0 + \sigma^2_\gamma + \sigma^2_\epsilon \end{aligned}$$
and 𝐶𝑜𝑣 (𝑦𝑖𝑗 , 𝑦𝑖𝑘 ) = 𝜎𝛾2 because the two observations share the same litter 𝛾𝑖 .
For two observations in different litters, the covariance is 0. These relationships
induce a correlation on observations within the same litter of
$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\gamma + \sigma^2_\epsilon}$$
For example, suppose that we have $I = 3$ litters with $J = 3$ rabbits per litter. Then the random effects design matrix $Z$ and the matrix $ZZ^T$, which gives the pattern of the variance-covariance matrix, look like
$$Z = \begin{bmatrix} 1 & \cdot & \cdot \\ 1 & \cdot & \cdot \\ 1 & \cdot & \cdot \\ \cdot & 1 & \cdot \\ \cdot & 1 & \cdot \\ \cdot & 1 & \cdot \\ \cdot & \cdot & 1 \\ \cdot & \cdot & 1 \\ \cdot & \cdot & 1 \end{bmatrix} \qquad ZZ^T = \begin{bmatrix} 1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1 \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1 \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1 \end{bmatrix}$$
and perform our maximization on this transformed set of data. Once we have
our unbiased estimates of 𝜎𝛾2 and 𝜎𝜖2 , we can substitute these back into the
untransformed likelihood function and find the MLEs for 𝛽. This process is
called Restricted Maximum Likelihood (REML) and is generally preferred over the variance component estimates found by simply maximizing the regular likelihood function. As usual, if our experiment is balanced these complications
aren’t necessary as the REML estimates of 𝛽 are usually the same as the ML
estimates.
Our first example comes from an experiment to test the paper brightness as
affected by the shift operator. The data has 20 observations with 4 different
operators. Each operator had 5 different observations made. The data set is
pulp in the package faraway. We will first analyze this using a fixed-effects one-
way ANOVA, but we will use a different model representation. Instead of using
the first operator as the reference level, we will use the sum-to-zero constraint
(to make it easier to compare with the output of the random effects model).
data('pulp', package='faraway')
ggplot(pulp, aes(x=operator, y=bright)) + geom_point()
[Figure: brightness (bright) plotted against operator (a-d) for the pulp data.]
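The fixed-effects fit summarized below uses the sum-to-zero constraint; a hedged sketch of one way to request that (the author's exact call is not shown):

m <- lm( bright ~ operator, data = pulp,
         contrasts = list(operator = contr.sum) )
summary(m)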
##
## Call:
## lm(formula = bright ~ operator, data = pulp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.440 -0.195 -0.070 0.175 0.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.40000 0.07289 828.681 <2e-16 ***
## operator1 -0.16000 0.12624 -1.267 0.223
## operator2 -0.34000 0.12624 -2.693 0.016 *
## operator3 0.22000 0.12624 1.743 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.326 on 16 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.3359
## F-statistic: 4.204 on 3 and 16 DF, p-value: 0.02261
coef(m)
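For comparison, a hedged sketch of the corresponding random-effects (mixed) model discussed next; lmerTest loads lme4 and supplies the ranova() and p-value machinery used throughout this chapter:

library(lmerTest)
m2 <- lmer( bright ~ 1 + (1|operator), data = pulp )
summary(m2)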
Notice that the estimate of the fixed effect (the overall mean) is the same in
the fixed-effects ANOVA and in the mixed model. However the fixed effects
ANOVA estimates the effect of each operator while the mixed model is inter-
ested in estimating the variance between operators. In the model statement the
(1|operator) denotes the random effect and this notation tells us to fit a model
with a random intercept term for each operator. Here the variance associated
with the operators is $\sigma^2_\gamma = 0.068$ while the "pure error" is $\sigma^2_\epsilon = 0.106$. The column for standard deviation is not the uncertainty in our estimate, but is simply the square roots of the variance terms, i.e. $\sigma_\gamma$ and $\sigma_\epsilon$. This was fit using
the REML method.
We might be interested in the estimated effect of each operator
ranef(m2)
## $operator
## (Intercept)
## a -0.1219403
## b -0.2591231
## c 0.1676679
## d 0.2133955
##
## with conditional variances for "operator"
These effects are smaller than the values we estimated in the fixed effects model
due to the distributional assumption that penalizes large deviations from the mean. In general, the estimated random effects are of smaller magnitude than the effect sizes estimated using a fixed effects model.
data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x= variety)) +
geom_point() +
facet_wrap(~block, labeller=label_both)
[Figure: yield plotted against variety (1-8) for the oatvar data, faceted by block (I-V).]
In this case, we don’t really care about these particular fields (blocks) and would
prefer to think about these as a random sample of fields that we might have
used in our experiment. The analysis to compare these simple model (without
variety) to the complex model it should use ML because the fixed effects are
different between the two models and thus the 𝐾 matrix used in REML fits are
different.
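A hedged sketch of the two models compared below:

model.0 <- lmer( yield ~ (1|block),           data = oatvar )
model.1 <- lmer( yield ~ variety + (1|block), data = oatvar )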
# By default anova() will always refit the models using ML assuming you want to
# compare models with different fixed effects. Use refit=FALSE to suppress this.
anova(model.0, model.1)
## Data: oatvar
## Models:
## model.0: yield ~ (1 | block)
## model.1: yield ~ variety + (1 | block)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model.0 3 446.94 452.01 -220.47 440.94
## model.1 10 421.67 438.56 -200.84 401.67 39.27 7 1.736e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall that we don’t really like to trust Likelihood Ratio Tests because they
depend on the asymptotic distribution of the statistic (as sample size increases)
and that convergence is pretty slow. Another, and usually better option, is to
perform an F-test with the numerator degrees of freedom estimated using either
the Satterthwaite or Kenward-Roger methods. To do this, we’ll use the anova()
command with just a single model.
# Do an F test for the fixed effects using the similar degree of freedom
# approximations done by SAS
#
# anova(model.1, ddf='lme4') # don't return the p-value because we don't trust it!
# anova(model.1, ddf='Satterthwaite') # Use Satterthwaite
# anova(model.1, ddf='Kenward-Roger') # Use Kenward-Roger
anova(model.1) # default is Satterthwaite's degrees of freedom
Unsurprisingly given our initial thoughts about the data, it looks like variety
is a statistically significant covariate.
There is quite a bit of debate among statisticians about how to calculate the de-
nominator degrees of freedom and which method is preferred in different scenar-
ios. By default, lmerTest uses Satterthwaite’s method, but ‘Kenward-Roger’
is also allowed. In this case, the two methods produce the same estimated
denominator degrees of freedom and the same p-value.
To consider if the random effect should be included in the model, we will turn
to the Likelihood Ratio test. The following examines all single term deletions
to the random effects structure. In our case, this is just considering removing
(1|block)
# Each line tests if a random effect can be removed or reduced to a simpler random effect.
# Something like (1|Group) will be tested if it can be removed.
# Something like (1+Trt | Group) will be tested if it can be reduced to (1 | Group)
ranova(model.1)
Here, both the AIC and LRT suggest that the random effect of block is appro-
priate to include in the model.
Now that we have chosen our model, we can examine it.
summary(model.1)
We start with the Random effects. This section shows us the block-to-block
variability (and the square root of that, the Standard Deviation) as well as the
“pure-error”, labeled residuals, which is an estimate of the variability associ-
ated with two different observations (after the difference in variety is accounted
for) planted within the same block. For this we see that block-to-block variabil-
ity is only slightly smaller than the within block variability.
Why do we care about this? This actually tells us quite a lot about the spatial
variability. Because yield is affected by soil nutrients, micro-climate, soil water
availability, etc, I expect that two identical seedlings planted in slightly different
conditions will have slightly different yields. By examining how the yield changes
over small distances (the residual within block variability) vs how it changes over
long distances (block to block variability) we can get a sense as to the scale at
which these background lurking processes operate.
Next we turn to the fixed effects. These will be the offsets from the reference
group, as we’ve typically worked with. Here we see that varieties 2,5, and 8 are
the best performers (relative to variety 1),
We are certain that there are differences among the varieties, and we should
look at all of the pairwise contrasts among the variety levels. As usual we could
use the package emmeans, which automates much of this (and uses lmerTest
produced p-values for the tests).
As usual we’ll join this information into the original data table and then make
a nice summary graph.
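The objects LetterResults and LetterHeight used in the plotting code below are assumed to come from something like the following hedged sketch (mirroring the cld() call used earlier in this chapter):

LetterResults <- emmeans(model.1, ~ variety) %>%
  multcomp::cld(Letters = letters) %>%
  mutate(LetterHeight = 500)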
oatvar %>%
mutate(variety = fct_reorder(variety, yield)) %>%
ggplot( aes(x=variety, y=yield)) +
geom_point(aes(color=block)) +
geom_text(data=LetterResults, aes(label=.group, y=LetterHeight))
[Figure: yield by variety (reordered by mean yield), with points colored by block and the compact letter display groups shown above each variety.]
We’ll consider a second example using data from the pharmaceutical industry.
We are interested in 4 different processes (our treatment variable) used in the
biosynthesis and purification of the drug penicillin. The biosynthesis requires a
nutrient source (corn steep liquor) as a nutrient source for the fungus and the
nutrient source is quite variable. Each batch of the nutrient is is referred to as
a ‘blend’ and each blend is sufficient to create 4 runs of penicillin. We avoid
confounding our biosynthesis methods with the blend by using a Randomized
Complete Block Design and observing the yield of penicillin from each of the
four methods (A,B,D, and D) in each blend.
data(penicillin, package='faraway')
[Figure: penicillin yield plotted against treatment (A-D), grouped by blend.]
It looks like there is definitely a Blend effect (e.g. Blend1 is much better than
Blend5) but it isn’t clear that there is a treatment effect.
It looks like we don’t have a significant effect of the treatments. Next we’ll
examine the simple model to understand the variability.
# Test if we should remove the (1|blend) term using either AIC or Likelihood Ratio Test
ranova(model.1)
We see that more of the noise is within blends (labeled Residual here) than between blends. If my job were to understand the variability and figure out how to improve production, this suggests that variability is introduced at both the blend level and the run level, but the run level is more important.
[Figure: layout of the split-plot experiment, with Irrigation applied to whole plots and Fertilizer applied to subplots within each plot.]
So all together we have 8 plots, 32 subplots, and 5 replicates per subplot. When
I analyze the fertilizer, I have 32 experimental units (the thing I have applied
my treatment to), but when analyzing the effect of irrigation, I only have 8
experimental units. In other words, I should have 8 random effects for plot, and
32 random effects for subplot.
As we saw before, the effect of irrigation is not significant and the fertilizer effect
is highly significant. We’ll remove the irrigation covariate and refit the model.
We designed our experiment assuming that both plot and subplot were pos-
sibly important. Many statisticians would argue that because that was how we
designed the experiment, we should necessarily keep that structure when ana-
lyzing the data. This is particularly compelling considering that the Irrigation
and Fertilizer were applied on different scales. However it is still fun to look at
the statistical significance of the random effects.
Each row (except the <none> row) is testing the exclusion of a random effect, the first being the 32 subplots and the second being the 8 plots. Both are highly statistically significant. To assess the practical significance of subplots and plots, we need to look at the variance terms:
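A hedged sketch of the fit summarized below; AgData is the split-plot data set used in this section:

model <- lmer( yield ~ Fertilizer + (1|plot/subplot), data = AgData )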
summary(model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
## Formula: yield ~ Fertilizer + (1 | plot/subplot)
## Data: AgData
##
## REML criterion at convergence: 572.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.78714 -0.62878 -0.08602 0.64094 2.36353
##
## Random effects:
## Groups Name Variance Std.Dev.
## subplot:plot (Intercept) 5.345 2.312
## plot (Intercept) 8.854 2.975
## Residual 1.014 1.007
## Number of obs: 160, groups: subplot:plot, 32; plot, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 21.0211 1.2056 8.9744 17.436 3.14e-08 ***
## FertilizerHigh 4.6323 0.8328 23.0000 5.563 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## FertilzrHgh -0.345
Notice that the standard deviation of the plant-to-plant noise is about 1/3 of that associated with subplot-to-subplot or even plot-to-plot differences. Finally, the effect of increasing the fertilizer level is to increase yield by about 4.6.
[Figure: Response plotted against Replicate for the nine rings, colored by Trt (A, B, C).]
We can easily fit this model using a random effect for each ring. To think about what is actually going on, it is helpful to consider the predicted values from this model. As usual we will use the predict() function, but now we have the option of including the random effects or not. First let's consider the predicted values if we completely ignore the Ring random effect while making predictions.
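A hedged sketch of that fit and of predictions that ignore the Ring effect; the data frame HierarchicalData and the column names match the output shown below, but the author's exact code is not shown:

model <- lmer( y ~ Trt + (1|Ring), data = HierarchicalData )
HierarchicalData.Predictions <- HierarchicalData %>%
  mutate( y.hat = predict(model, re.form = ~0) )   # re.form = ~0 drops the random effects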
HierarchicalData.Predictions %>%
head()
## # A tibble: 6 x 6
## # Groups: Trt, Ring [6]
## Trt Ring Rep y y.hat my.text
## <fct> <fct> <int> <dbl> <dbl> <chr>
## 1 A 1 1 364. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 2 A 2 1 269. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 3 A 3 1 321. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 4 B 4 1 189. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
## 5 B 5 1 265. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
## 6 B 6 1 251. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
[Figure: the treatment-level predictions (ignoring the Ring random effect) overlaid on the observed Response versus Replicate.]
Now we consider the predicted values, but created using the Ring random effect.
These random effects provide for a slight perturbation up or down depending
on the quality of the Ring, but the sum of all 9 Ring effects is required to be 0.
ranef(model)
## $Ring
## (Intercept)
## 1 9.458738
## 2 -29.798425
## 3 20.339687
## 4 -40.532972
## 5 21.067503
## 6 19.465469
## 7 -21.814405
## 8 2.548287
## 9 19.266118
##
## with conditional variances for "Ring"
sum(ranef(model)$Ring)
## [1] -3.519407e-12
Also notice that the sum of the random effects within a treatment is zero! (Recall
Ring 1:3 was treatment A, 4:6 was treatment B, and 7:9 was treatment C).
[Figure: the predictions that include the Ring random effects overlaid on the observed Response versus Replicate.]
Notice in the data, the technicians are always labeled as ‘one’ and ‘two’ regard-
less of the lab. Likewise the two samples given to each technician are always
labeled ‘G’ and ‘H’ even though the actual physical samples are different for
each technician.
In terms of notation, we will refer to the 6 labs as 𝐿𝑖 and the lab technicians
as 𝑇𝑖𝑗 and we note that 𝑗 is either 1 or 2 which doesn’t uniquely identify the
technician unless we include the lab subscript as well. The sub-samples are nested within the technicians, and we denote them as $S_{ijk}$. Finally, our "pure" error is the two observations from the same sample. So the model we wish to fit is:
$$y_{ijkl} = \mu + L_i + T_{ij} + S_{ijk} + \epsilon_{ijkl}$$
where $L_i \overset{iid}{\sim} N\left(0, \sigma^2_L\right)$, $T_{ij} \overset{iid}{\sim} N\left(0, \sigma^2_T\right)$, $S_{ijk} \overset{iid}{\sim} N\left(0, \sigma^2_S\right)$, and $\epsilon_{ijkl} \overset{iid}{\sim} N\left(0, \sigma^2_\epsilon\right)$.
We need a convenient way to tell lmer which factors are nested in which. We can
do this by creating data columns that make the interaction terms. For example
there are 12 technicians (2 from each lab), but in our data frame we only see
two levels, so to create all 12 random effects, we need to create an interaction
column (or tell lmer to create it and use it). Likewise there are 24 sub-samples
and 48 “pure” random effects.
data('eggs', package='faraway')
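# The two calls below specify the same model: the first writes out the nested
# interaction terms explicitly, the second uses the / nesting shorthand.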
model <- lmer( Fat ~ 1 + (1|Lab) + (1|Lab:Technician) +
(1|Lab:Technician:Sample), data=eggs)
model <- lmer( Fat ~ 1 + (1|Lab/Technician/Sample), data=eggs)
[Figure: Fat measurements plotted by Sample (G, H) for each technician within each lab.]
Now that we have an idea of how things vary, we can look at the 𝜎 terms.
summary(model)[['varcor']]
It looks like there is still plenty of unexplained variability, but the next largest
source of variability is in the technician and also the lab. Is the variability lab-to-
lab large enough for us to convincingly argue that it is statistically significant?
ranova(model)
It looks like the technician effect is at the edge of statistical significance, but the lab-to-lab effect is smaller than the pure error and not statistically significant.
I’m not thrilled with the repeatability, but the technicians are a bigger concern
than the individual labs. These data aren’t strong evidence of big differences
between labs, but I would need to know if the size of the error is practically
important before we just pick the most convenient lab to send our samples to.
$$y_{ijk} = \mu + M_i + P_j + R_k + \epsilon_{ijk}$$
and we notice that the position and run effects are not nested within anything else, and thus their subscripts are just a single index. Certainly the run
effect should be considered random as these four are a sample from all possible
runs, but what about the position variable? Here we consider that the machine
being used is a random selection from all possible abrasion machines and any
position differences have likely developed over time and could be considered as
a random sample of possible position effects. We’ll regard both position and
run as crossed random effects.
data('abrasion', package='faraway')
ggplot(abrasion, aes(x=material, y=wear, color=position, shape=run)) +
geom_point(size=3)
[Figure: wear plotted against material (A-D) for the abrasion data, colored by position (1-4) and shaped by run (1-4).]
It certainly looks like the materials are different. I don’t think the run matters,
but position 2 seems to develop excessive wear compared to the other positions.
The material effect is statistically significant and we can figure out the pairwise
differences in the usual fashion.
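A hedged sketch of the crossed random effects fit and the pairwise comparisons shown below:

m <- lmer( wear ~ material + (1|run) + (1|position), data = abrasion )
emmeans(m, pairwise ~ material)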
## $emmeans
## material emmean SE df lower.CL upper.CL
## A 266 7.67 7.48 248 284
## B 220 7.67 7.48 202 238
## C 242 7.67 7.48 224 260
## D 230 7.67 7.48 213 248
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B 45.8 5.53 6 8.267 0.0007
## A - C 24.0 5.53 6 4.337 0.0190
## A - D 35.2 5.53 6 6.370 0.0029
## B - C -21.8 5.53 6 -3.930 0.0295
## B - D -10.5 5.53 6 -1.897 0.3206
## C - D 11.2 5.53 6 2.033 0.2743
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 4 estimates
summary(m)[['varcor']]
Notice that run and the pure error standard deviation have about the same
magnitude, but position is more substantial. Let's see what happens if we remove
the run random effect.
## Data: abrasion
## Models:
## m2: wear ~ material + (1 | position)
## m: wear ~ material + (1 | run) + (1 | position)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m2 6 115.30 119.94 -51.651 103.30
## m 7 114.26 119.66 -50.128 100.26 3.0459 1 0.08094 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: abrasion
## Models:
## m3: wear ~ material + (1 | run)
## m: wear ~ material + (1 | run) + (1 | position)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m3 6 116.85 121.48 -52.425 104.85
## m 7 114.26 119.66 -50.128 100.26 4.5931 1 0.0321 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data('sleepstudy', package='lme4')
ggplot(sleepstudy, aes(y=Reaction, x=Days)) +
facet_wrap(~ Subject, ncol=6) +
geom_point() +
geom_line()
[Figure: Reaction time plotted against Days for the sleepstudy data, faceted by Subject, with the points for each subject connected by lines.]
We want to fit a line to these data, but how should we do this? First we notice
that each subject has their own baseline for reaction time and the subsequent
measurements are relative to this, so it is clear that we should fit a model with
a random intercept.
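A hedged sketch of that random-intercept model:

m1 <- lmer( Reaction ~ Days + (1|Subject), data = sleepstudy )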
ranova(m1)
To visualize how well this model fits our data, we will plot the predicted values
which are lines with y-intercepts that are equal to the sum of the fixed effect of
intercept and the random intercept per subject. The slope for each patient is
assumed to be the same and is approximately 10.4.
[Figure: the fitted lines from the random-intercept model (common slope, subject-specific intercepts) overlaid on the observed Reaction versus Days, faceted by Subject.]
This isn’t too bad, but I would really like to have each patient have their own
slope as well as their own y-intercept. The random slope will be calculated as
a fixed effect of slope plus a random offset from that.
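A hedged sketch of the random-intercept-and-slope model compared below:

m2 <- lmer( Reaction ~ Days + (1 + Days|Subject), data = sleepstudy )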
[Figure: the fitted lines from the random-intercept-and-slope model overlaid on the observed Reaction versus Days, faceted by Subject.]
This appears to fit the observed data quite a bit better, but it is useful to test
this.
# This is the first time the ranova() table has considered a reduction. Here we
# consider reducing the random term from (1+Days|Subject) to (1|Subject)
ranova(m2)
# We get the same analysis directly specifying which Simple/Complex model to compare
anova(m2, m1, refit=FALSE)
## Data: sleepstudy
## Models:
## m1: Reaction ~ Days + (1 | Subject)
## m2: Reaction ~ Days + (1 + Days | Subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m1 4 1794.5 1807.2 -893.23 1786.5
## m2 6 1755.6 1774.8 -871.81 1743.6 42.837 2 4.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that indeed the model with a random effect for each subject in both y-intercept and slope is a better model than the one with just a random offset in y-intercept.
It is instructive to look at this example from the top down. First we plot the
population regression line.
[Figures: the estimated population regression line of Reaction versus Days, followed by the subject-specific lines illustrating the person-to-person variation.]
ggplot(sleepstudy, aes(x=Days)) +
geom_line(aes(y=yhat)) +
geom_line(aes(y=yhat.ind, group=Subject), color='red') +
scale_x_continuous(breaks = seq(0,9, by=2)) +
ylab('Reaction') + ggtitle('Within Person Variation') +
facet_wrap(~ Subject, ncol=6) +
geom_point(aes(y=Reaction))
[Figure: within-person variation, showing the population line, each subject's fitted line, and the observed points, faceted by Subject.]
Finally we want to go back and look at the coefficients for the complex model.
summary(m2)
Typically the bootstrap is used when we don’t want to make any distributional
assumptions on the data. In that case, we sample with replacement from the
observed data to create the bootstrap data. But, if we don’t mind making dis-
tributional assumptions, then instead of re-sampling the data, we could sample
from the distribution with the observed parameter. In our sleep study example,
we have estimated a population intercept and slope of 251.4 and 10.5. But we
also have subject-level intercept and slope random effects, which we assumed to be normally distributed, centered at zero, with estimated standard deviations of 24.7 and 5.9. Then, given a subject's regression line, observations are just normal (mean zero, standard deviation 25.6) perturbations from the line.
All of these numbers came from the summary(m2) output.
To create a bootstrap data set simulating a new subject, we could do something like the following:
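A minimal sketch, using the numbers quoted above from summary(m2) and treating the random intercept and slope as independent (a simplification):

days <- 0:9
beta <- c(251.4, 10.5)                            # population intercept and slope
b    <- c( rnorm(1, 0, 24.7), rnorm(1, 0, 5.9) )  # this subject's random intercept and slope
Sim.Subject <- data.frame(
  Days     = days,
  Reaction = (beta[1]+b[1]) + (beta[2]+b[2])*days + rnorm(length(days), 0, 25.6) )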
[Figure: one simulated subject's Reaction versus Days trajectory.]
1) Create a bootstrap data set by simulating new observations from the fitted model, as above.
2) Create a bootstrap model by analyzing the bootstrap data using the same model formula used by the initial model.
3) Apply some function you write to each bootstrap model. This function takes in a bootstrap model and returns a statistic or vector of statistics.
4) Repeat steps 1-3 repeatedly to create the bootstrap distribution of the statistics returned by your function.
Now that we have a bootstrap data set, we need to take the data and then fit a
model to the data and then grab the predictions from the model. At this point
we are creating a confidence interval for the response line of a randomly selected
person from the population. The lme4::bootMer function will create bootstrap
data sets and then send those into the lmer function.
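The objects ConfData and myStats used below are not shown being created; a hedged sketch of what they might look like:

ConfData <- data.frame( Days = 0:9 )          # grid of days to predict at
myStats  <- function(model){                  # population-level predictions from a fitted model
  predict(model, newdata = ConfData, re.form = ~0)
}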
# Get our best guess as to the relationship between day and reaction time
ConfData <- ConfData %>%
mutate( Estimate = predict( m2, newdata = ConfData, re.form=~0) )
# bootMer generates new data sets, calls lmer on the data to produce a model,
# and then calls whatever function I pass in. It repeats this `nsim`
# number of times.
bootObj <- lme4::bootMer(m2, FUN=myStats, nsim = 1000 )
[Figure: bootstrap distributions of the predicted Reaction at each day (panels 1-9), each panel showing the normal and kernel density estimates, the percentile 95% CI, and the observed value.]
[Figure: the estimated population line of Reaction versus Days with its pointwise 95% bootstrap confidence band.]
For a prediction interval, we just want to find the range of simulated response values.
In this case, we want to use the bootstrap data, but don’t need to fit a model
at each bootstrap step. The lme4::simulate function creates the bootstrap
dataset and doesn’t send it for more processing. It returns a vector of response
values that are appropriately organized to be appended to the original dataset.
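The objects PredData and Simulated used below are assumed to come from something like this hedged sketch:

PredData  <- sleepstudy %>% dplyr::select(Subject, Days)
Simulated <- simulate(m2, nsim = 1000)     # 1000 simulated response vectors, columns sim_1 ... sim_1000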
# squish the Subject/Day info together with the simulated and then grab the quantiles
# for each day
PredIntervals <- cbind(PredData, Simulated) %>%
gather('sim','Reaction', sim_1:sim_1000 ) %>% # go from wide to long structure
group_by(Subject, Days) %>%
summarize(lwr = quantile(Reaction, probs = 0.025),
upr = quantile(Reaction, probs = 0.975))
[Figure: the observed Reaction versus Days with the bootstrap prediction intervals overlaid.]
10.10 Exercises
and somehow three rats in the thyroxine group had some issue unrelated
to the treatment. The following R code might be helpful for the initial
visualization.
# we need to force ggplot to only draw lines between points for the same
# rat. If I haven't already defined some aesthetic that is different
# for each rat, then it will connect points at the same week but for different
# rats. The solution is to add an aesthetic that does the equivalent of the
# dplyr function group_by(). In ggplot2, this aesthetic is "group".
ggplot(ratdrink, aes(y=wt, x=weeks, color=treat)) +
geom_point(aes(shape=treat)) +
geom_line(aes(group=subject)) # play with removing the group=subject aesthetic...
d. Build a mixed model using the fixed effects along with the random
effect of mix. Consider two-way interactions.
e. Using the model you selected, discuss the impact of the different
variance components.
Chapter 11
Binomial Regression
The standard linear model assumes that the observed data are distributed as
$$y = X\beta + \epsilon \quad \textrm{where} \quad \epsilon_i \overset{iid}{\sim} N\left(0, \sigma^2\right)$$
or equivalently
$$y \sim N\left(\mu = X\beta, \sigma^2 I\right)$$
and notably this assumes that the data are independent. This model has $E[y] = X\beta$. This model is quite flexible and includes the ANOVA and regression models we have considered so far.
The general linear model expands on the linear model by allowing the data points to be correlated:
$$y \sim N\left(X\beta, \sigma^2 \Omega\right)$$
where we assume that Ω has some known form but may include some unknown
correlation parameters. This type of model includes our work with mixed models
and time series data.
The study of generalized linear models removes the assumption that the error
terms are normally distributed and allows the data to be distributed according
to some other distribution such as Binomial, Poisson, or Exponential. These
distributions are parameterized differently than the normal (instead of 𝜇 and
𝜎, we might be interested in 𝜆 or 𝑝). However, I am still interested in how my
covariates can be used to estimate my parameter of interest.
Critically, I still want to parameterize my covariates as $X\beta$ because we understand how continuous and discrete covariates are added and interpreted, and what interactions between them mean. By keeping the $X\beta$ part, we continue to build on the earlier foundations.
$$P\left(W_i = 1\right) = p_i \qquad P\left(W_i = 0\right) = \left(1 - p_i\right)$$
which I can rewrite more formally, letting $w_i$ be the observed value, as
$$P\left(W_i = w_i\right) = p_i^{w_i}\left(1 - p_i\right)^{1 - w_i}$$
and the parameter that I wish to estimate and understand is the probability
of a success 𝑝𝑖 and usually I wish to know how my covariate data 𝑋𝛽 informs
these probabilities.
In the normal distribution case, we estimated the expected value of the response vector ($\mu$) simply using $\hat{\mu} = X\hat{\beta}$, but this will not work as an estimate of $\hat{p}$ because there is no constraint on $X\hat{\beta}$; nothing prevents it from being negative or greater than 1. Because we require the probability of success to be a number between 0 and 1, I have a problem.
Example: Suppose we are interested in the abundance of mayflies in a stream.
Because mayflies are sensitive to metal pollution, I might be interested in looking
at the presence/absence of mayflies in a stream relative to a pollution gradient.
data('Mayflies', package='dsData')
head(Mayflies)
## CCU Occupancy
## 1 0.05261076 1
## 2 0.25935617 1
## 3 0.64322010 1
## 4 0.90168941 1
## 5 0.97002630 1
## 6 1.08037011 1
[Figure: Occupancy (0/1) plotted against CCU for the Mayflies data.]
If I just fit a regular linear model to these data, we get the following:
[Figure: the ordinary linear regression fit of Occupancy on CCU, which eventually predicts values below 0.]
which is horrible. First, we want the regression line to be related to the probability of occurrence, but it gives negative values. Instead, we want it to slowly tail off and give more of a sigmoid-shaped curve, perhaps something more like the following:
[Figure: a sigmoid-shaped curve for the probability of Occupancy as a function of CCU.]
We need a way to convert our covariate data 𝑦 = 𝑋𝛽 from something that can
take values from −∞ to +∞ to something that is constrained between 0 and 1
so that we can fit the model
$$w_i \sim \textrm{Bernoulli}\Bigg( \underbrace{g^{-1}\Big( \underbrace{y_i}_{\textrm{in } (-\infty,\infty)} \Big)}_{\textrm{in } [0,1]} \Bigg)$$
We use the notation 𝑦𝑖 = 𝑋 𝑖,⋅ 𝛽 to denote a single set of covariate values and
notice that this is unconstrained and can be in (−∞, +∞) while the parameter
of interest 𝑝𝑖 = 𝑔−1 (𝑦𝑖 ) is constrained to [0, 1]. When convenient, we will drop
the 𝑖 subscript while keeping the domain restrictions.
There are several options for the link function 𝑔−1 (⋅) that are commonly used.
1. Logit transformation. The link function is
$$g\left(p\right) = \textrm{logit}\left(p\right) = \log\Bigg[ \underbrace{\frac{p}{1-p}}_{\textrm{odds}} \Bigg] = y$$
with inverse
$$g^{-1}\left(y\right) = \textrm{ilogit}\left(y\right) = \frac{1}{1 + e^{-y}} = p$$
and we think of $g(p)$ as the log odds function.
2. Probit transformation. The link function is 𝑔 (𝑝) = Φ−1 (𝑝) where Φ
is the standard normal cumulative distribution function and therefore
𝑔−1 (𝑋𝛽) = Φ (𝑋𝛽).
3. Complementary log-log transformation: 𝑔 (𝑝) = log [− log(1 − 𝑝)].
All of these functions will give a sigmoid shape with higher probability as $y$ increases and lower probability as it decreases. The logit and probit transformations have the nice property that if $y = 0$ then $g^{-1}(0) = \frac{1}{2}$.
[Figure: the inverse logit function, p = ilogit(y), plotted against y = Xβ.]
Usually the difference in inferences made using these different curves is relatively
small and we will usually use the logit transformation because its form lends
itself to a nice interpretation of my 𝛽 values. In these cases, a slope parameter
in our model will be interpreted as “the change in log odds for every one unit
change in the predictor.”
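A quick numerical sketch of the logit / inverse-logit pair, writing the two functions by hand (the faraway package also provides logit() and ilogit()):

logit  <- function(p){ log( p/(1-p) ) }
ilogit <- function(y){ 1/(1+exp(-y)) }
ilogit(0)              # 0.5
ilogit( logit(0.25) )  # recovers 0.25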
As in the mixed model case, there is no closed-form solution for $\hat{\beta}$, and instead we must rely on numerical methods to find the maximum likelihood estimator. To do this, we must derive the likelihood function.
If
$$w_i \overset{iid}{\sim} \textrm{Bernoulli}\left(p_i\right)$$
then
$$\mathcal{L}\left(p \,|\, w\right) = \prod_{i=1}^{n} p_i^{w_i}\left(1 - p_i\right)^{1 - w_i}$$
so that
$$\mathcal{L}\left(\beta \,|\, w\right) = \prod_{i=1}^{n} \left(\textrm{ilogit}\left(X_i\beta\right)\right)^{w_i}\left(1 - \textrm{ilogit}\left(X_i\beta\right)\right)^{1 - w_i}$$
and
$$\log\mathcal{L}\left(\beta \,|\, w\right) = \sum_{i=1}^{n} \log\left\{ \left(\textrm{ilogit}\left(X_i\beta\right)\right)^{w_i}\left(1 - \textrm{ilogit}\left(X_i\beta\right)\right)^{1 - w_i} \right\}$$
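The output below comes from a direct numerical maximization of this log-likelihood; a hedged sketch of a call that produces this kind of result (the author's exact code is not shown):

neg.log.lik <- function(beta){
  p <- faraway::ilogit( beta[1] + beta[2]*Mayflies$CCU )
  -sum( Mayflies$Occupancy*log(p) + (1-Mayflies$Occupancy)*log(1-p) )
}
optim( c(0,0), neg.log.lik )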
## $par
## [1] 5.100892 -3.050336
##
## $value
## [1] 6.324365
##
## $counts
## function gradient
## 71 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
head(Mayflies)
## CCU Occupancy
## 1 0.05261076 1
## 2 0.25935617 1
## 3 0.64322010 1
## 4 0.90168941 1
## 5 0.97002630 1
## 6 1.08037011 1
For binomial response data, we need to know the number of successes and the number of failures at each level of our covariate. In this case it is quite simple because there is only one observation at each CCU level, so the number of successes is Occupancy and the number of failures is just 1-Occupancy. For binomial data, glm() expects the response to be a two-column matrix where the first column is the number of successes and the second column is the number of failures. The default choice of link function for binomial data is the logit link, but the probit can be chosen as well using family=binomial(link=probit) in the call to glm(). If you only give a single response vector, it is assumed that the second column is to be calculated as 1 minus the first column.
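The fit summarized below; the call can be read off the Call: line of the summary output:

m1 <- glm( cbind(Occupancy, 1-Occupancy) ~ CCU, family = binomial, data = Mayflies )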
summary(m1)
##
## Call:
## glm(formula = cbind(Occupancy, 1 - Occupancy) ~ CCU, family = binomial,
## data = Mayflies)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.55741 -0.31594 -0.06553 0.08653 2.13362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.102 2.369 2.154 0.0313 *
## CCU -3.051 1.211 -2.520 0.0117 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.795 on 29 degrees of freedom
## Residual deviance: 12.649 on 28 degrees of freedom
## AIC: 16.649
##
## Number of Fisher Scoring iterations: 7
Notice that the summary table includes an estimate of the standard error of each $\hat{\beta}_j$ and a standardized value and z-test that are calculated in the usual manner,
$$z_j = \frac{\hat{\beta}_j - 0}{StdErr\left(\hat{\beta}_j\right)}$$
but these only approximately follow a standard normal distribution (due to the CLT results for Maximum Likelihood Estimators). We should regard the p-values given as approximate.
The sigmoid curve shown previously was the result of the logit model, and we can estimate the probability of occupancy for any value of CCU. Surprisingly, R does not have built-in logit and ilogit functions, but the faraway package does include them.
# Here are three ways to calculate the phat value for CCU = 1. The predict()
# function won't give you confidence intervals, however. So I prefer emmeans()
# new.df <- data.frame(CCU=1)
# predict(m1, newdata=new.df) %>% faraway::ilogit() # back transform to p myself
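The data frame yhat.df used in the plot below is assumed to come from something like this hedged emmeans() sketch:

yhat.df <- emmeans( m1, ~ CCU, at = list(CCU = seq(0, 5, by = 0.1)), type = 'response' ) %>%
  as.data.frame()     # columns include prob, asymp.LCL, asymp.UCL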
ggplot(Mayflies, aes(x=CCU)) +
geom_ribbon(data=yhat.df, aes(ymin=asymp.LCL, ymax=asymp.UCL), fill='salmon', alpha=.4) +
geom_line( data=yhat.df, aes(y=prob), color='red') +
geom_point(aes(y=Occupancy)) +
labs(y='Probability of Occupancy', x='Heavy Metal Pollution (CCU)', title='Mayfly Example')
[Figure: Mayfly Example - the estimated probability of occupancy versus heavy metal pollution (CCU), with a confidence ribbon and the observed presence/absence points.]
## Prediction
## Truth 0 1
## 0 21 1
## 1 2 6
This scheme has mis-classified 3 observations, two cases where mayflies were
present but we predicted they would be absent, and one case where no mayflies
were detected but we predicted we would see them.
11.2.1 Deviance
In the normal linear models case, we were very interested in the Sum of Squared
Error (SSE)
$$SSE = \sum_{i=1}^{n}\left(w_i - \hat{w}_i\right)^2$$
because it provided a mechanism for comparing the fit of two different models.
If a model had a very small SSE, then it fit the observed data well. We used
this as a basis for forming our F-test to compare nested models (some re-scaling
by the appropriate degrees of freedom was necessary, though).
We want an equivalent measure of goodness-of-fit for models that are non-normal, but in the normal case we would like it to be related to the SSE statistic. The deviance of a model with respect to some data $w$ is defined by
$$D\left(w, \hat{\theta}_0\right) = 2\left[\log\mathcal{L}\left(\hat{\theta}_S|w\right) - \log\mathcal{L}\left(\hat{\theta}_0|w\right)\right]$$
where $\hat{\theta}_0$ are the fitted parameters of the model of interest, and $\hat{\theta}_S$ are the fitted parameters under a "saturated" model that has as many parameters as it has observations and can therefore fit the data perfectly. Thus the deviance is a measure of deviation from a perfect model and is flexible enough to handle non-normal distributions appropriately.
Notice that this definition is very similar to what is calculated during the Likelihood Ratio Test. For any two nested models under consideration, the LRT can be formed by looking at the difference of their deviances
$$LRT = D\left(w, \hat{\theta}_{simple}\right) - D\left(w, \hat{\theta}_{complex}\right) \overset{\cdot}{\sim} \chi^2_{df_{complex}-df_{simple}}$$
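A sketch of this calculation done by hand, assuming m0 is the intercept-only model glm(cbind(Occupancy, 1-Occupancy) ~ 1, family=binomial, data=Mayflies):
# difference in deviances compared to a chi-squared with 1 degree of freedom
pchisq( deviance(m0) - deviance(m1), df=1, lower.tail=FALSE )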
## [1] 2.526819e-06
A convenient way to get R to calculate the LRT $\chi^2$ p-value for you is to specify test='LRT' inside the anova() function.
anova(m1, test='LRT')
This inference can be confirmed by looking at the AIC values of the two models as well.
AIC(m0, m1)
## df AIC
## m0 1 36.79491
## m1 2 16.64873
The deviance is a good way to measure if a model fits the data, but it is not the only method. Pearson's $X^2$ statistic is also applicable. This statistic takes the general form
$$X^2 = \sum_{i=1}^{n}\frac{\left(O_i - E_i\right)^2}{E_i}$$
where $O_i$ is the number of observations observed in category $i$ and $E_i$ is the number expected in category $i$. In our case we need to figure out what the categories are. Since we have both the number of successes and failures, we have two categories per observation $i$.
$$X^2 = \sum_{i=1}^{n}\left[\frac{\left(w_i - n_i\hat{p}_i\right)^2}{n_i\hat{p}_i} + \frac{\left(\left(n_i-w_i\right) - n_i\left(1-\hat{p}_i\right)\right)^2}{n_i\left(1-\hat{p}_i\right)}\right] = \sum_{i=1}^{n}\frac{\left(w_i-n_i\hat{p}_i\right)^2}{n_i\hat{p}_i\left(1-\hat{p}_i\right)}$$
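One way to compute this statistic in R is to sum the squared Pearson residuals; a sketch (the value printed below appears to be this quantity for m1):
sum( residuals(m1, type='pearson')^2 )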
## [1] 14.92367
deviance(m1)
## [1] 12.64873
We know, however, that this is not a good approximation: the normal approximation will not be good for small sample sizes, and it isn't clear what counts as "big enough". Instead we will use an inverted LRT to develop confidence intervals for the $\beta_i$ parameters.
We first consider the simplest case, where we have only an intercept and slope parameter. Below is a contour plot of the likelihood surface; the shaded region is the region of the parameter space where the parameters $(\beta_0, \beta_1)$ would not be rejected by the LRT. This region is found by finding the maximum likelihood estimators $\hat{\beta}_0$ and $\hat{\beta}_1$, and then finding the set of $(\beta_0, \beta_1)$ pairs whose log-likelihood is close enough to the maximum that the LRT fails to reject them.
(Figure: contour plot of the log-likelihood surface over $(\beta_0, \beta_1)$, with the non-rejection region shaded.)
## 2.5 % 97.5 %
## (Intercept) 1.629512 11.781167
## CCU -6.446863 -1.304244
## 2.5 % 97.5 %
## -6.446863 -1.304244
• I think the probability that she will not spit up on me today is 𝑝1 = 0.10.
My wife disagrees and believes the probability is 𝑝2 = 0.01. We can look
at those probabilities and recognize that we differ in our assessment by
a factor of 10 because 10 = 𝑝1 /𝑝2 . If we had assessed the chance of her
spitting up using odds, I would have calculated 𝑜1 = 0.1/0.9 = 1/9. My
wife, on the other hand, would have calculated 𝑜2 = .01/.99 = 1/99. The
odds ratio of these is [1/9] / [1/99] = 99/9 = 11. This shows that she is
much more certain that the event will not happen and the multiplying
factor of the pair of odds is 11.
• But what if we were to consider the probability that my daughter will spit up? The probabilities assigned by me versus my wife are $p_1 = 0.9$ and $p_2 = 0.99$. It no longer looks like our assessments differ by a factor of 10, because $p_1/p_2 = 0.91 \ne 10$. The odds ratio remains the same calculation, however. The odds I would give are $o_1 = .9/.1 = 9$ versus my wife's odds $o_2 = .99/.01 = 99$. The odds ratio is now $9/99 = 1/11$ and gives the same information as the calculation where we defined a success as my daughter not spitting up.
Given a logistic regression model with two continuous covariates, then using the logit() link function we have
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
$$\frac{p}{1-p} = e^{\beta_0} e^{\beta_1 x_1} e^{\beta_2 x_2}$$
and we can interpret $\beta_1$ and $\beta_2$ as the increase in the log odds for every unit increase in $x_1$ and $x_2$. Alternatively, a one unit change in $x_1$ multiplies the odds by $e^{\beta_1}$; that is to say, $e^{\beta_1}$ is the odds ratio associated with that change.
To investigate how to interpret these effects, we will consider an example of the rates of respiratory disease of babies in the first year based on covariates of gender and feeding method (breast milk, formula from a bottle, or a combination of the two). The numbers of babies suffering respiratory disease, out of the total number in each feeding group, are

              Formula (f)   Breast Milk (b)   Breast Milk + Supplement (s)
Males (M)       77 / 458       47 / 494          19 / 147
Females (F)     48 / 384       31 / 464          16 / 127
data('babyfood', package='faraway')
head(babyfood)
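The head() output and the model-fitting code are not shown here; a fit consistent with the Call below (and with the later anova(m2, ...) call) would be:
m2 <- glm( cbind(disease, nondisease) ~ sex * food, family=binomial, data=babyfood )
summary(m2)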
##
## Call:
## glm(formula = cbind(disease, nondisease) ~ sex * food, family = binomial,
## data = babyfood)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.59899 0.12495 -12.797 < 2e-16 ***
## sexGirl -0.34692 0.19855 -1.747 0.080591 .
## foodBreast -0.65342 0.19780 -3.303 0.000955 ***
## foodSuppl -0.30860 0.27578 -1.119 0.263145
## sexGirl:foodBreast -0.03742 0.31225 -0.120 0.904603
## sexGirl:foodSuppl 0.31757 0.41397 0.767 0.443012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6375e+01 on 5 degrees of freedom
## Residual deviance: 2.6401e-13 on 0 degrees of freedom
## AIC: 43.518
##
## Number of Fisher Scoring iterations: 3
Notice that the residual deviance is effectively zero with zero degrees of freedom, indicating that we have just fit the saturated model. It is useful to look at the single term deletions to see if the interaction term could be dropped from the model.
anova(m2, test='LRT')
Given this, we will use the reduced model without the interaction and check whether we could reduce the model any further. From this we see that we cannot reduce the model any more, so we will interpret the coefficients of this model.
We interpret the intercept term as the log odds that a male child fed only formula will develop a respiratory disease in their first year. With that, we can then calculate the probability of a male formula-fed baby developing respiratory disease using the following:
$$-1.6127 = \log\left(\frac{p_{M,f}}{1-p_{M,f}}\right) = \operatorname{logit}\left(p_{M,f}\right)$$
thus
$$p_{M,f} = \operatorname{ilogit}\left(-1.6127\right) = \frac{1}{1+e^{1.6127}} = 0.1662$$
We notice that the odds of respiratory disease are
$$\frac{p_{M,f}}{1-p_{M,f}} = \frac{0.1662}{1-0.1662} = 0.1993 = e^{-1.613}$$
For a female child bottle-fed only formula, the probability of developing respiratory disease is
$$p_{F,f} = \frac{1}{1+e^{-(-1.6127-0.3126)}} = \frac{1}{1+e^{1.9253}} = 0.1273$$
so we can interpret $e^{-0.3126} = 0.7315$ as the multiplicative change in odds from male to female infants. That is to say, the odds ratio of female infants to male infants is
$$e^{-0.3126} = \frac{\left(\frac{p_{F,f}}{1-p_{F,f}}\right)}{\left(\frac{p_{M,f}}{1-p_{M,f}}\right)} = \frac{0.1458}{0.1993} = 0.7315$$
The interpretation here is that the odds of respiratory infection for females are 73.1% of those for a similarly fed male child, and I might say that being female reduces the odds of respiratory illness by about 27% compared to male babies. Similarly we can calculate the odds ratios for the feeding types:
exp( coef(m1) )
First we notice that the intercept term can be interpreted as the odds of infection for the reference group. Each of the offset terms is the odds ratio comparing that group to the reference group. We see that breast milk along with formula gives only 84% of the odds of respiratory disease of a formula-only baby, and a child fed only breast milk has only 51% of the odds of respiratory disease of the formula-fed baby. We can look at confidence intervals for the odds ratios as follows:
exp( confint(m1) )
## 2.5 % 97.5 %
## (Intercept) 0.1591988 0.2474333
## sexGirl 0.5536209 0.9629225
## foodBreast 0.3781905 0.6895181
## foodSuppl 0.5555372 1.2464312
We should be careful in drawing conclusions here because this was a retrospective study: the decision to breast feed a baby versus feeding with formula is inextricably tied to socio-economic status, and we should investigate whether the effect measured is due to the feeding method or to some other lurking variable tied to socio-economic status.
As usual, we don’t want to calculate all these quantities by hand and would
prefer if emmeans would do all the back-transformations for us.
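A sketch of the emmeans() calls that produce output like the blocks below (the exact ordering or reversing options used in the source are not shown):
emmeans(m1, pairwise ~ sex,  type='response')    # averaged over food
emmeans(m1, pairwise ~ food, type='response')    # averaged over sex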
## $emmeans
## sex prob SE df asymp.LCL asymp.UCL
## Boy 0.1309 0.0111 Inf 0.1105 0.154
## Girl 0.0992 0.0103 Inf 0.0808 0.121
##
## Results are averaged over the levels of: food
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Boy / Girl 1.37 0.193 Inf 2.216 0.0267
##
## Results are averaged over the levels of: food
## Tests are performed on the log odds ratio scale
## $emmeans
## sex prob SE df asymp.LCL asymp.UCL
## Boy 0.1309 0.0111 Inf 0.1105 0.154
## Girl 0.0992 0.0103 Inf 0.0808 0.121
##
## Results are averaged over the levels of: food
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Girl / Boy 0.732 0.103 Inf -2.216 0.0267
##
## Results are averaged over the levels of: food
## Tests are performed on the log odds ratio scale
## $emmeans
## food prob SE df asymp.LCL asymp.UCL
## Bottle 0.1457 0.01220 Inf 0.1233 0.1712
## Breast 0.0803 0.00877 Inf 0.0647 0.0993
## Suppl 0.1255 0.01994 Inf 0.0913 0.1700
##
## Results are averaged over the levels of: sex
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Bottle / Breast 1.953 0.299 Inf 4.374 <.0001
## Bottle / Suppl 1.188 0.244 Inf 0.839 0.6786
## Breast / Suppl 0.609 0.132 Inf -2.296 0.0564
##
## Results are averaged over the levels of: sex
## $emmeans
## food prob SE df asymp.LCL asymp.UCL
## Bottle 0.1457 0.01220 Inf 0.1233 0.1712
## Breast 0.0803 0.00877 Inf 0.0647 0.0993
## Suppl 0.1255 0.01994 Inf 0.0913 0.1700
##
## Results are averaged over the levels of: sex
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Breast / Bottle 0.512 0.0783 Inf -4.374 <.0001
## Suppl / Bottle 0.842 0.1730 Inf -0.839 0.6786
## Suppl / Breast 1.643 0.3556 Inf 2.296 0.0564
##
## Results are averaged over the levels of: sex
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log odds ratio scale
11.5 Prediction and Effective Dose Levels
data('bliss', package='faraway')
We first fit the logistic regression model and plot the results.
Given this, we want to develop a confidence interval for the probabilities by first calculating a confidence interval on the link scale using the following formula. As usual, we recall that the $\hat{y}$ values live in $(-\infty, \infty)$.
$$CI_y:\quad \hat{y} \pm z_{1-\alpha/2}\, StdErr\left(\hat{y}\right)$$
We must then convert this to the $[0,1]$ space using the ilogit() function.
$$CI_p = \operatorname{ilogit}\left(CI_y\right)$$
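A sketch of this calculation, assuming the bliss model was fit as m.bliss <- glm(cbind(dead, alive) ~ conc, family=binomial, data=bliss):
pred <- predict(m.bliss, newdata=data.frame(conc=seq(0,4,by=0.1)), se.fit=TRUE)
CI.y <- cbind(fit = pred$fit,
              lwr = pred$fit - qnorm(0.975) * pred$se.fit,
              upr = pred$fit + qnorm(0.975) * pred$se.fit)   # interval on the (-Inf, Inf) scale
CI.p <- faraway::ilogit( CI.y )                              # back-transform to the [0,1] scale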
(Figure: fitted proportion alive versus concentration for the bliss data, with a confidence band around the fitted curve.)
The next thing we want to do is come up with a confidence interval for the concentration level that results in the death of $100(p)\%$ of the insects. Often we are interested in the case of $p = 0.5$; this is often called LD50, which is the lethal dose for 50% of the population. Using the link function, you can set the $p$ value and solve for the concentration value to find
$$\hat{x}_p = \frac{\operatorname{logit}(p) - \hat{\beta}_0}{\hat{\beta}_1}$$
Fortunately we don’t have to do these calculations by hand and can use the
dose.p() function in the MASS package.
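A sketch of this, again assuming the fitted bliss model is stored as m.bliss:
MASS::dose.p(m.bliss, p=c(0.25, 0.50, 0.75))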
## Dose SE
## p = 0.25: 2.945535 0.2315932
## p = 0.50: 2.000000 0.1784367
## p = 0.75: 1.054465 0.2315932
and we can use these to create approximate confidence intervals for the $\hat{x}_p$ values via
$$\hat{x}_p \pm z_{1-\alpha/2}\, StdErr\left(\hat{x}_p\right)$$
11.6 Overdispersion
In the binomial distribution, the variance is a function of the probability of success,
$$Var(W) = np(1-p)$$
but there are many cases where we might be interested in adding an additional variance parameter $\phi$ to the model. A common reason for overdispersion to appear is that we might not have captured all the covariates that influence $p$. We can do a quick simulation to demonstrate that additional variability in $p$ leads to additional variability overall.
N <- 1000      # number of replicates; also the number of binomial draws per sample variance
n <- 10        # binomial group size
p <- .6        # nominal probability of success
overdispersed_p <- p + rnorm(n, mean=0, sd=.05)   # p with a little extra random noise
sim.data <- NULL
for( i in 1:N ){
  sim.data <- sim.data %>% rbind(data.frame(
    var = var( rbinom(N, size=n, prob=p) ),
    type = 'Standard'))
  sim.data <- sim.data %>% rbind(data.frame(
    var = var( rbinom(N, size=n, prob=overdispersed_p ) ),
    type = 'OverDispersed'))
}
true.var <- p*(1-p)*n
ggplot(sim.data, aes(x=var, y=..density..)) +
geom_histogram(bins=30) +
geom_vline(xintercept = true.var, color='red') +
facet_grid(type~.) +
ggtitle('Histogram of Sample Variances')
(Figure: 'Histogram of Sample Variances': histograms of the simulated sample variances for the Standard and OverDispersed cases, with the true variance marked by a vertical red line.)
We see that the sample variances fall neatly about the true variance of 2.4 when the data are generated with a constant value of $p$. However, adding a small amount of random noise to the parameter $p$ produces noticeably more variance in the samples.
The extra uncertainty of the probability of success results in extra variability in
the responses.
We can recognize when overdispersion is present by examining the deviance of our model, because the deviance is approximately distributed
$$D\left(w, \hat{\theta}\right) \overset{\cdot}{\sim} \chi^2_{df}$$
where $df$ is the residual degrees of freedom in the model. Because the $\chi^2_k$ distribution is the sum of $k$ independent, squared standard normal random variables, it has expectation $k$ and variance $2k$. For binomial data with group sizes larger than, say, 5, this approximation isn't too bad and we can detect overdispersion. For binary responses, the approximation is quite poor and we cannot detect overdispersion.
The simplest approach for modeling overdispersion is to introduce an additional dispersion parameter $\sigma^2$. This dispersion parameter may be estimated using
$$\hat{\sigma}^2 = \frac{X^2}{n-p}.$$
With the addition of the overdispersion parameter to the model, the difference in deviance between a simple and a complex model is no longer distributed $\chi^2$ and we must use the following approximate F-statistic
$$F = \frac{D\left(w,\hat{\theta}_{simple}\right) - D\left(w,\hat{\theta}_{complex}\right)}{\left(df_{simple}-df_{complex}\right)\hat{\sigma}^2}$$
Using the F-test when the overdispersion parameter is 1 is a less powerful test than the $\chi^2$ test, so we'll only use the F-test when the overdispersion parameter must be estimated.
Example: We consider an experiment where at five different stream locations,
four boxes of trout eggs were buried and retrieved at four different times after
the original placement. The number of surviving eggs was recorded and the
eggs disposed of.
data(troutegg, package='faraway')
troutegg <- troutegg %>%
mutate( perish = total - survive) %>%
dplyr::select(location, period, survive, perish, total) %>%
arrange(location, period)
(Figure: observed survival proportion (survive/total) versus period for each location, with point size indicating the total number of eggs in each box.)
We can fit the logistic regression model (noting that the model with the inter-
action of location and period would be saturated):
##
## Call:
## glm(formula = cbind(survive, perish) ~ location * period, family = binomial,
## data = troutegg)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8792 0.4596 6.265 3.74e-10 ***
## location2 1.0911 0.8489 1.285 0.19870
## location3 0.5136 0.6853 0.749 0.45356
## location4 24.4707 51676.2179 0.000 0.99962
## location5 -2.7716 0.5044 -5.495 3.90e-08 ***
## period7 0.2778 0.6869 0.404 0.68591
## period8 -0.7326 0.5791 -1.265 0.20582
## period11 -0.5695 0.5383 -1.058 0.29007
## location2:period7 -2.4453 1.0291 -2.376 0.01749 *
The residual deviance seems a little large. With 12 residual degrees of freedom,
the deviance should be near 12. We can confirm that the deviance is quite large
via:
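A sketch of one such check, assuming the additive binomial fit is stored as m1 (exactly which calculations produced the two values printed below is not shown in the source):
# goodness-of-fit p-value based on the residual deviance
pchisq( deviance(m1), df.residual(m1), lower.tail=FALSE )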
## [1] 3.372415e-09
## [1] 2.29949e-11
The estimated dispersion $\hat{\sigma}^2$ is quite a bit larger than 1, the value it should be near in the non-overdispersed setting. Using this estimate we can now test the significance of the effects of location and period.
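A sketch of these steps, again assuming the additive binomial fit is stored as m1:
sigma2.hat <- sum( residuals(m1, type='pearson')^2 ) / df.residual(m1)   # estimated dispersion
drop1(m1, scale=sigma2.hat, test='F')                                    # approximate F-tests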
We conclude that both location and period are significant predictors of trout egg survivorship.
We could have avoided having to calculate 𝜎̂ 2 by hand by simply using the
quasibinomial family instead of the binomial.
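A refit along these lines matches the Call in the output below (the object name m2 matches the later drop1(m2, ...) call):
m2 <- glm( cbind(survive, perish) ~ location + period, family=quasibinomial, data=troutegg )
summary(m2)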
##
## Call:
## glm(formula = cbind(survive, perish) ~ location + period, family = quasibinomial,
## data = troutegg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8305 -0.3650 -0.0303 0.6191 3.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6358 0.6495 7.138 1.18e-05 ***
## location2 -0.4168 0.5682 -0.734 0.477315
## location3 -1.2421 0.5066 -2.452 0.030501 *
## location4 -0.9509 0.5281 -1.800 0.096970 .
## location5 -4.6138 0.5777 -7.987 3.82e-06 ***
## period7 -2.1702 0.5504 -3.943 0.001953 **
## period8 -2.3256 0.5609 -4.146 0.001356 **
## period11 -2.4500 0.5405 -4.533 0.000686 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 5.330358)
##
## Null deviance: 1021.469 on 19 degrees of freedom
## Residual deviance: 64.495 on 12 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
drop1(m2, test='F')
# anova(m2, test='F')
While each of the time periods is different from the first, it looks like periods 7, 8, and 11 aren't different from each other. As usual, we turn to the emmeans package to look at the pairwise differences between the periods.
## $emmeans
## period prob SE df asymp.LCL asymp.UCL
## 4 0.960 0.0177 Inf 0.907 0.984
## 7 0.735 0.0567 Inf 0.611 0.831
## 8 0.704 0.0618 Inf 0.571 0.809
## 11 0.677 0.0549 Inf 0.562 0.774
##
## Results are averaged over the levels of: location
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
##
## Call:
## glm(formula = cbind(survive, perish) ~ period * location, family = binomial,
## data = troutegg)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
Notice that we’ve fit the saturated model and we’ve resolved the overdispersion
problem because of that.
As usual, we can now look at the effect of period at each of the locations.
## location = 1:
## period prob SE df asymp.LCL asymp.UCL .group
## 8 0.8953488 0.03300798 Inf 0.8109406 0.9446443 1
## 11 0.9096774 0.02302375 Inf 0.8532710 0.9457777 1
## 4 0.9468085 0.02314665 Inf 0.8785095 0.9776867 1
## 7 0.9591837 0.01998733 Inf 0.8962639 0.9845962 1
##
## location = 2:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.8524590 0.03210799 Inf 0.7779341 0.9050269 1
## 7 0.8584906 0.03385381 Inf 0.7784456 0.9128537 1
## 8 0.9062500 0.02974911 Inf 0.8295443 0.9504977 12
## 4 0.9814815 0.01297276 Inf 0.9289964 0.9953638 2
##
## location = 3:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.7280000 0.03980111 Inf 0.6434907 0.7987421 1
## 8 0.7394958 0.04023479 Inf 0.6533948 0.8104142 1
## 7 0.7692308 0.03695265 Inf 0.6891126 0.8336850 1
## 4 0.9674797 0.01599358 Inf 0.9165610 0.9877408 2
##
## location = 4:
## period prob SE df asymp.LCL asymp.UCL .group
## 8 0.6767677 0.04700674 Inf 0.5787856 0.7613551 1
## 7 0.8247423 0.03860215 Inf 0.7360186 0.8881766 12
## 11 0.8409091 0.03183558 Inf 0.7682755 0.8939194 2
## 4 1.0000000 0.00000007 Inf 0.0000000 1.0000000 12
##
## location = 5:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.0000000 0.00000005 Inf 0.0000000 1.0000000 12
## 7 0.0973451 0.02788552 Inf 0.0547290 0.1672733 1
## 8 0.2045455 0.04299929 Inf 0.1328383 0.3015023 1
## 4 0.5268817 0.05177260 Inf 0.4256954 0.6259069 2
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log odds ratio scale
## significance level used: alpha = 0.05
Notice that there isn’t a huge difference in period for locations 1-4, but in
location 5 things are very different.
We might consider that location really ought to be a random effect. Fortunately
lme4 supports the family option, although it will not accept quasi families, you
either fit a random effect or fit a quasibinomial. In either case, fitting the
full interaction model with period*location doesn’t work because we have a
saturated model
# Additive model
m3 <- glmer(cbind(survive,perish) ~ period + (1|location),
            family=binomial, data=troutegg)
summary(m3)

# Interaction model (random period effects within location)
m4 <- glmer(cbind(survive,perish) ~ period + (1 + period|location),
            family=binomial, data=troutegg)
#summary(m4)
#emmeans(m4, ~period, type='response') %>% multcomp::cld(Letters=letters)
Unfortunately, we aren’t able to fit the saturated model using random effects
because the numerical optimization function for finding MLE estimates failed
to converge.
data('wbca', package='faraway')
model <- wbca %>% mutate( Class = (Class == 'malignant') ) %>% # Clear what is success
glm( Class ~., data=., family='binomial' ) # and emmeans still happy
## Predicted
## Truth benign malignant
## benign 432 11
## malignant 15 223
summary(model)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8890 -0.1409 -0.1409 0.0287 2.2284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.35433 0.54076 -11.751 < 2e-16 ***
## BNucl 0.55297 0.08041 6.877 6.13e-12 ***
## UShap 0.62583 0.17506 3.575 0.000350 ***
## USize 0.56793 0.15910 3.570 0.000358 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.39 on 680 degrees of freedom
## Residual deviance: 148.43 on 677 degrees of freedom
## AIC: 156.43
##
## Number of Fisher Scoring iterations: 7
From this table, we see that larger values of BNucl, UShap, and USize imply a greater probability of a breast tumor being malignant. So, for example, for a tumor with BNucl = 2, UShap = 1, and USize = 2 (values consistent with the calculation shown below), we would calculate
$$X\hat{\beta} = -6.3543 + 2(0.5530) + 1(0.6258) + 2(0.5679) = -3.4867$$
and therefore
$$\hat{p} = \frac{1}{1+e^{-X\hat{\beta}}} = \frac{1}{1+e^{3.4867}} = 0.0297$$
## 1
## -3.486719
## 1
## 0.0296925
We can think of the True Positive Rate as the probability that a Positive case
will be correctly classified as a positive. Similarly a False Positive Rate is the
probability that a Negative case will be incorrectly classified as a positive.
I wish to examine the relationship between the False Positive Rate and the True Positive Rate for any decision rule. What we can do is select a sequence of decision rules, calculate the (FPR, TPR) pair for each, and then make a plot that connects the dots between the (FPR, TPR) pairs.
Of course we don’t want to have to do this by hand, so we’ll use the package
pROC to do it for us.
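A sketch of the pROC workflow (the object name myROC matches the later auc() and ci() calls; passing the observed classes and the fitted probabilities in this way is my assumption):
library(pROC)
myROC <- roc(response = model$y, predictor = fitted(model))   # observed 0/1 classes and fitted probabilities
ggroc(myROC)                                                  # the ROC curve shown below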
(Figure: ROC curve for the model, plotting sensitivity against specificity with specificity decreasing along the x-axis.)
This looks pretty good; an ideal classifier that makes perfect predictions would show a perfect right angle at the bend. Let's zoom in a little on the high specificity values (i.e. low false positive rates).
(Figure: the same ROC curve zoomed in to specificities between 1.000 and 0.900.)
One measure of how far we are from the perfect predictor is the area under the
curve. The perfect model would have an area under the curve of 1. For this
model the area under the curve is:
auc(myROC)
ci(myROC, of='auc')
which seems pretty good; the Area Under the Curve (AUC) is often used as a way of comparing the quality of binary classifiers.
11.8 Exercises
a. Fit a binomial regression with Class as the response variable and the
other nine variables as predictors (for consistency among students,
define a success as the tumor being benign and remember that glm
wants the response to be a matrix where the first column is the
number of successes). Report the residual deviance and associated
degrees of freedom. Can this information be used to determine if this
model fits the data?
b. Use AIC as the criterion to determine the best subset of variables
using the step function.
c. Use the reduced model to give the estimated probability that a tumor
with associated predictor variables
e. Give the probability of testing positive for diabetes for a Pima woman
who had had no pregnancies, had bmi=28 and a glucose level of 110.
f. Give the odds that the same woman would test positive for diabetes.
g. How do her odds change if she were to have a child? That is to say, what is the odds ratio for that change?
Appendix
Chapter 12
Block Designs
Often there are covariates in the experimental units that are known to affect
the response variable and must be taken into account. Ideally an experimenter
can group the experimental units into blocks where the within block variance
is small, but the block to block variability is large. For example, in testing a
drug to prevent heart disease, we know that gender, age, and exercise levels
play a large role. We should partition our study participants into gender, age,
and exercise groups and then randomly assign the treatment (placebo vs drug)
within the group. This will ensure that we do not have a gender, age, and
exercise group that has all placebo observations.
Often blocking variables are not the variables that we are primarily interested
in, but must nevertheless be considered. We call these nuisance variables. We
already know how to deal with these variables by adding them to the model,
but there are experimental designs where we must be careful because the ex-
perimental treatments are nested.
Example 1. An agricultural field study has three fields in which the researchers
will evaluate the quality of three different varieties of barley. Due to how they
harvest the barley, we can only create a maximum of three plots in each field. In
this example we will block on field since there might be differences in soil type,
drainage, etc from field to field. In each field, we will plant all three varieties so
that we can tell the difference between varieties without the block effect of field
confounding our inference. In this example, the varieties are nested within the
fields.
12.1 Randomized Complete Block Design (RCBD)
The dataset oatvar in the faraway library contains information about an exper-
iment on eight different varieties of oats. The area in which the experiment was
done had some systematic variability and the researchers divided the area up
into five different blocks in which they felt the area inside a block was uniform
while acknowledging that some blocks are likely superior to others for growing
crops. Within each block, the researchers created eight plots and randomly
assigned a variety to a plot. This type of design is called a Randomized Com-
plete Block Design (RCBD) because each block contains all possible levels of
the factor of primary interest.
data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x=block, color=variety)) +
geom_point(size=5) +
geom_line(aes(x=as.integer(block))) # connect the dots
(Figure: oat yield versus block, with points colored and connected by variety (1 through 8).)
While there is one unusual observation in block IV, there doesn’t appear to be
a blatant interaction. We will consider the interaction shortly. For the main
effects model of yield ~ block + variety we have 𝑝 = 12 parameters and 28
residual degrees of freedom because
$$df_\epsilon = n - p = n - \left(1 + \left[(I-1) + (J-1)\right]\right) = 40 - \left(1 + \left[(5-1)+(8-1)\right]\right) = 40 - 12 = 28$$
Because this is an orthogonal design, the sums of squares don't change regardless of the order in which we add the factors, but if we removed one or two observations, they would.
In determining the significance of variety, the above F-value and p-value are correct. We have 40 observations (5 per variety), and after accounting for the intercept, block, and variety effects we are left with 28 residual degrees of freedom for the test.
# Ignore any p-values regarding block, but I'm happy with the analysis for variety
letter_df <- emmeans(m1, ~variety) %>%
multcomp::cld(Letters=letters) %>%
dplyr::select(variety, .group) %>%
mutate(yield = 500)
(Figure: oat yield by variety with the compact letter display added; varieties 1 through 8 are labeled ab, bc, b, a, c, ab, ab, and bc respectively.)
To account for the blocks without testing them, we can use the Error() function within our formula. By default, Error() just creates independent error terms, but when we add a covariate, it adds the appropriate nesting.
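A sketch of this analysis for the oat data (the object name is my assumption; the error strata match the output below):
m2 <- aov( yield ~ variety + Error(block), data=oatvar )
summary(m2)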
##
## Error: block
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 4 33396 8349
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## variety 7 77524 11075 8.284 1.8e-05 ***
## Residuals 28 37433 1337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that in our block level, there is no p-value to assess if the blocks are
different. This is because we don’t have any replication of the blocks. So our
analysis respects that blocks are present, but does not attempt any statistical
analyses on them.
## # A tibble: 6 x 5
## # Groups: plot, subplot, Fertilizer [6]
## plot subplot Fertilizer Irrigation yield
## <fct> <fct> <fct> <fct> <dbl>
## 1 1 1 Low Low 20.2
## 2 1 2 High Low 24.4
(Figure: spatial layout of the plots and subplots, with subplots colored by Fertilizer level; the axes index the row and column positions in the field.)
So all together we have 8 plots, and 32 subplots. When I analyze the fertilizer,
I have 32 experimental units (the thing I have applied my treatment to), but
when analyzing the effect of irrigation, I only have 8 experimental units.
I like to think of this set up as having some lurking variables that act at the plot
level (changes in aspect, maybe something related to what was planted prior)
and some lurking variables that act on a local subplot scale (maybe variation
in clay/silt/sand ratios). So even after I account for Irrigation and Fertilizer
treatments, observations within a plot will be more similar to each other than
observations in two different plots.
We can think about doing two separate analyses, one for the effect of irrigation,
and another for the effect of the fertilizer.
# AgData came from my data package, dsData, (however I did some summarization
# first.)
In this case we see that we have insufficient evidence to conclude that the observed difference between the Irrigation levels is due to anything more than random chance.
Next we can do the appropriate analysis for the fertilizer, recognizing that all
the p-values for the plot effects are nonsense and should be ignored.
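Sketches of these two analyses (the data-frame and model names are my assumptions):
# (1) Irrigation: average the subplots so that each plot contributes one observation
Plot.df <- AgData %>%
  group_by(plot, Irrigation) %>%
  summarise(yield = mean(yield))
anova( lm( yield ~ Irrigation, data=Plot.df ) )

# (2) Fertilizer: use all the subplots, with plot included as a nuisance factor
anova( lm( yield ~ plot + Fertilizer, data=AgData ) )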
Ideally I wouldn’t have to do the averaging over the nested observations and
we would like to not have the misleading p-values for the plots. To do this, we
only have to specify the nesting of the error terms and R will figure out the
appropriate degrees of freedom for the covariates.
# To do this right, we have to abandon the general lm() command and use the more
# specialized aov() command. The Error() part of the formula allows me to nest
# the error terms and allow us to do the correct analysis. The order of these is
# to start with the largest/highest level and then work down the nesting.
m2 <- aov( yield ~ Irrigation + Fertilizer + Error(plot/subplot), data=AgData )
summary(m2)
##
## Error: plot
## Df Sum Sq Mean Sq F value Pr(>F)
## Irrigation 1 104.3 104.26 3.428 0.114
## Residuals 6 182.5 30.41
##
## Error: plot:subplot
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 1 0.43 0.43 0.033 0.857
## Residuals 23 298.83 12.99
In the output, we see that the ANOVA table row for Fertilizer is the same in both analyses. The sums-of-squares for Irrigation differ between the two analyses (because of the averaging), but the F and p values are the same.
What would have happened if we had performed the analysis incorrectly and
had too many degrees of freedom for the Irrigation test?
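A sketch of that incorrect analysis, which treats all 32 subplots as independent observations:
summary( aov( yield ~ Irrigation + Fertilizer, data=AgData ) )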
In this case we would have (incorrectly) concluded that we have statistically significant evidence that the Irrigation levels are different. Notice that the sums-of-squares in this wrong analysis match the sums-of-squares in the correct analysis; the only difference is how the residual sum-of-squares is split into the different error pools.
A second example of a slightly more complex split plot is given in the package
MASS under the dataset oats. From the help file the data describes the following
experiment:
The yield of oats from a split-plot field trial using three varieties and
four levels of manurial treatment. The experiment was laid out in
6 blocks of 3 main plots, each split into 4 sub-plots. The varieties
were applied to the main plots and the manurial treatments to the
sub-plots.
This is a lot to digest so lets unpack it. First we have 6 blocks and we’ll replicate
the exact same experiment in each block. Within a block, we’ll split it into three
sections, which we’ll call plots (within the block). Finally within each plot, we’ll
have 4 subplots.
One issue that makes this confusing for students is that most texts get lazy and don't define the blocks, plots, and sub-plots when there are no replicates at a particular level. I prefer to define these explicitly.
data('oats', package='MASS')
oats <- oats %>% mutate(
Nf = ordered(N, levels = sort(levels(N))), # make manure an ordered factor
plot = as.integer(V), # plot
subplot = as.integer(Nf)) # sub-plot
(Figure: oat yield (Y) versus fertilizer level (Nf), faceted by block (B: I through VI) and colored by variety (V: Golden.rain, Marvellous, Victory).)
This graph also makes me think that variety doesn't matter and that it is unlikely there is an interaction between oat variety and fertilizer level, but we should check. Unfortunately, a model that nests plot and subplot inside the Error() term isn't correct, because R isn't smart enough to understand that the levels of plot and subplot are exact matches to the Variety and Fertilizer levels. If we defined the model that way, the degrees of freedom would be all wrong because there is too much nesting. So we have to be smart enough to recognize that plot and subplot are actually Variety and Fertilizer.
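A sketch of the resulting model (the object name is my assumption; the error strata match the output below):
m1 <- aov( Y ~ V * Nf + Error(B/V/Nf), data=oats )
summary(m1)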
##
## Error: B
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 5 15875 3175
##
## Error: B:V
## Df Sum Sq Mean Sq F value Pr(>F)
## V 2 1786 893.2 1.485 0.272
## Residuals 10 6013 601.3
##
## Error: B:V:Nf
## Df Sum Sq Mean Sq F value Pr(>F)
## Nf 3 20020 6673 37.686 2.46e-12 ***
## V:Nf 6 322 54 0.303 0.932
## Residuals 45 7969 177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sure enough the interaction term is not significant. We next consider the Variety
term.
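A sketch of the additive refit that matches the output below:
m2 <- aov( Y ~ V + Nf + Error(B/V/Nf), data=oats )
summary(m2)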
##
## Error: B
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 5 15875 3175
##
## Error: B:V
## Df Sum Sq Mean Sq F value Pr(>F)
## V 2 1786 893.2 1.485 0.272
## Residuals 10 6013 601.3
##
## Error: B:V:Nf
## Df Sum Sq Mean Sq F value Pr(>F)
## Nf 3 20020 6673 41.05 1.23e-13 ***
## Residuals 51 8291 163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude by noticing that the Variety does not matter, but that the fertilizer
level is quite significant.
There are many other types of designs out there. For example, you might have 5 levels of a factor, but when you split your block into plots you can only create 3 plots, so not every block will have every level of the factor. This is called a Randomized Incomplete Block Design (RIBD).
You might have a design where you apply even more levels of nesting. Suppose you have a greenhouse study where you have rooms to which you can apply a temperature treatment; within each room you have four tables and can apply a light treatment to each table; finally, within each table you have four trays and can apply a soil treatment to each tray. This is a continuation of the split-plot design, and by extending the nesting we can develop split-split-plot and split-split-split-plot designs.
You might have 7 covariates, each with two levels (High, Low), and you want to investigate how these influence your response while also allowing for second and third order interactions. If you looked at every treatment combination you'd have $2^7 = 128$ different treatment combinations, and perhaps you only have the budget for a sample of $n = 32$. How should you design your experiment? This question is addressed by fractional factorial designs.
If your research interests involve designing experiments such as these, you should
consider taking an Experimental design course.
12.3 Exercises
1. ???
2. ???
3. ???
Chapter 13
Maximum Likelihood
Estimation
Learning Outcomes
13.1 Introduction
The goal of statistical modeling is to take data that has some general trend
along with some un-explainable variability, and say something intelligent about
the trend. For example, the simple regression model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \textrm{where} \quad \epsilon_i \overset{iid}{\sim} N\left(0, \sigma^2\right)$$
(Figure: 'Simple Regression': a scatterplot of the response versus the predictor, annotated with the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.)
There is a general increasing trend in the response (i.e. the $\beta_0 + \beta_1 x_i$ term) and then some un-explainable noise (the $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ part).
While it has been convenient to write the model in this form, it is also possible to write the simple regression model as
$$y_i \overset{ind}{\sim} N\left(\beta_0 + \beta_1 x_i,\; \sigma^2\right)$$
This model contains three parameters 𝛽0 , 𝛽1 , and 𝜎 but it certainly isn’t clear
how to estimate these three values. In this chapter, we’ll develop a mechanism
for taking observed data sampled from some distribution parameterized by some
𝛽, 𝜆, 𝜎, or 𝜃 and then estimating those parameters.
13.2 Distributions
Depending on what values the data can take on (integers, positive values) and the shape of the distribution of values, we might choose to model the data using one of several different distributions. Next we'll quickly introduce the mathematical relationship between the parameter and the probable data values for several distributions.
13.2.1 Poisson
The Poisson distribution is used to model the number of events that happen in some unit of time or space, so it is often used for counts, which can only be non-negative integers. This distribution is parameterized by $\lambda$, which represents the expected number of events that happen (defined as the average over an infinitely large number of draws). Because $\lambda$ represents the average number of events, the $\lambda$ parameter must be greater than or equal to 0.
The function that defines the relationship between the parameter $\lambda$ and what values are most probable is called the probability mass function when talking about discrete random variables and the probability density function in the continuous case. Either way, these functions are traditionally notated using $f(x)$.
$$f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} \quad \textrm{for } x \in \{0, 1, 2, \dots\}$$
(Figure: the Poisson($\lambda = 3.5$) probability mass function $f(x|\lambda)$ for $x = 0$ to 10, with $\lambda$ marked.)
The notation $f(x|\lambda)$ is read as "f given $\lambda$" and is used to denote that this is a function that describes which values of the data $X$ are most probable, and that the function depends on the parameter value. This emphasizes that if we were to change the parameter value (to say $\lambda = 10$), then a different set of data values would be more probable. In the above example with $\lambda = 3.5$, the most probable outcome is 3, but we aren't surprised to observe a value of $x = 1$, 2, or 4. However, from this graph we see that $x = 10$ or $x = 15$ would be highly improbable.
13.2.2 Exponential
The Exponential distribution can be used to model events that take on a positive real value where the distribution of values has some skewness. We will parameterize this distribution using $\beta$ as the mean of the distribution, so that
$$f(x|\beta) = \frac{1}{\beta} e^{-x/\beta} \quad \textrm{for } x > 0$$
(Figure: the Exponential($\beta = 3.5$) density $f(x|\beta)$ for $x$ from 0 to 20.)
In this distribution, the region near zero is the most probable outcome, and
larger observations are less probable.
13.2.3 Normal
$$f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
where exp[𝑤] = 𝑒𝑤 is just a notational convenience.
(Figure: the Normal($\mu = 6$, $\sigma = 1$) density $f(x|\mu,\sigma)$, with $\mu$ and $\sigma$ annotated.)
All of these distributions (and there are many, many more commonly used) have some mathematical function that defines how probable a region of response values is, and that function depends on the parameters. Importantly, the regions of $X$ with the highest $f(x|\theta)$ are the most probable data values.
There are many additional mathematical details that go into these density func-
tions but the important aspect is that they tell us what data values are most
probable given some parameter values 𝜃.
13.3 Likelihood Function
The Likelihood function is just the probability density (or mass) function $f(x|\theta)$ re-interpreted to be a function where the data is the known quantity and we are looking to see what parameter values are consistent with the data.
13.3.1 Poisson
Suppose that we have observed a single data point drawn from a Poisson(𝜆) and
we don’t know what 𝜆 is. We first write down the likelihood function
$$\mathcal{L}(\lambda|x) = f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$$
(Figure: the likelihood $\mathcal{L}(\lambda|x=4)$ and the corresponding log-likelihood, plotted as functions of $\lambda$.)
Our best estimate for $\lambda$ is the value that maximizes $\mathcal{L}(\lambda|x)$. We could do this in two different ways. First, we could solve mathematically by taking the derivative, setting it equal to zero, and then solving for $\lambda$. This process is often made mathematically simpler (and computationally more stable) by instead maximizing the log of the likelihood function. This is equivalent because the log function is monotonically increasing: if $a < b$ then $\log(a) < \log(b)$. It is simpler because taking logs turns products into sums and reduces the need for the chain rule while taking derivatives. We could also find the value of $\lambda$ that maximizes the likelihood using numerical methods. Again, because the log function makes everything nicer, in practice we'll always maximize the log-likelihood. Many optimization functions are designed around finding function minimums, so to use those we'll actually seek to minimize the negative log-likelihood, which is simply $-1 \cdot \log\mathcal{L}()$.
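For a single Poisson observation this calculus step is short; a sketch:
$$\log\mathcal{L}(\lambda|x) = -\lambda + x\log\lambda - \log(x!)$$
$$\frac{d}{d\lambda}\log\mathcal{L}(\lambda|x) = -1 + \frac{x}{\lambda} = 0 \quad\Longrightarrow\quad \hat{\lambda} = x$$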
Numerical solvers are convenient, but they are only accurate to the tolerance you specify. In this case, where $x = 4$, the actual maximum likelihood value is $\hat{\lambda} = 4$.
x <- 4
neglogL <- function(param){
  dpois(x, lambda=param) %>%
    log() %>%      # take the log
    prod(-1) %>%   # multiply by -1
    return()
}
optimize(neglogL, interval=c(0,20) )
## $minimum
## [1] 3.999993
##
## $objective
## [1] 1.632876
But what if we have multiple observations from this Poisson distribution? If the
observations are independent, then the probability mass or probability density
functions 𝑓(𝑥𝑖 |𝜃) can just be multiplied together.
$$\mathcal{L}(\lambda|\mathbf{x}) = \prod_{i=1}^{n} f(x_i|\lambda) = \prod_{i=1}^{n}\frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$
(Figure: the likelihood $\mathcal{L}(\lambda|\mathbf{x})$ and the corresponding log-likelihood for the full sample, plotted as functions of $\lambda$.)
x <- c(4,6,3,3,2,4,3,2)
neglogL <- function(param){
dpois(x, lambda=param) %>%
log() %>% sum() %>% prod(-1) %>%
return()
}
optimize(neglogL, interval=c(0,20) )
## $minimum
## [1] 3.37499
##
## $objective
## [1] 13.85426
We next consider data sampled from the exponential distribution. Recall that the exponential distribution is parametrized by a single parameter, $\beta$, which here is the mean of the distribution. The mean of the observed data is
## [1] 6.396
(Figure: the likelihood and log-likelihood for $\beta$, plotted over $\beta$ from 0 to 30.)
## $minimum
## [1] 6.396004
##
## $objective
## [1] 14.27836
13.3.3 Normal
We finally consider the case where we have observations coming from a distribution that has multiple parameters. The normal distribution is parameterized by a mean $\mu$ and spread $\sigma$. Suppose that we had observed $x_i \overset{iid}{\sim} N(\mu, \sigma^2)$ and saw $\mathbf{x} = \{5, 8, 9, 7, 11, 9\}$.
$$\mathcal{L}(\mu,\sigma|\mathbf{x}) = \prod_{i=1}^{n} f(x_i|\mu,\sigma) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}\right]$$
Again using calculus, it can be shown that the maximum likelihood estimators in this model are
$$\hat{\mu} = \bar{x} = 8.1667$$
$$\hat{\sigma}_{mle} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} = 1.8634$$
which is somewhat unexpected because the typical estimator we use has a $\frac{1}{n-1}$ multiplier.
(Figure: contour plot of the log-likelihood over $(\mu, \sigma)$, with the maximum likelihood estimates $\hat{\mu}$ and $\hat{\sigma}_{mle}$ marked.)
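The two-parameter negative log-likelihood being minimized below is not shown in the source; a sketch consistent with the earlier one-parameter versions (the data vector and function name are assumptions):
x <- c(5, 8, 9, 7, 11, 9)
neglogL <- function(param){                      # param = c(mu, sigma)
  dnorm(x, mean=param[1], sd=param[2]) %>%
    log() %>% sum() %>% prod(-1) %>%             # negative log-likelihood
    return()
}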
# Bivariate optimization uses the optim function that only can search
# for a minimum. The first argument is an initial guess to start the algorithm.
# So long as the start point isn't totally insane, the numerical algorithm should
# be fine.
optim(c(5,2), neglogL )
## $par
## [1] 8.166924 1.863391
##
## $value
## [1] 12.24802
##
## $counts
## function gradient
## 71 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
13.4 Discussion
1. How could the numerical maximization happen? Assume we have a 1-dimensional parameter space, a reasonable starting estimate, and a function to be maximized that is continuous, smooth, and $\ge 0$ for all $x$. While knowing the derivative function $f'(x)$ would allow us to be much more clever, let's think about how to do the maximization by just evaluating $f(x)$ for different values of $x$.
2. Convince yourself that if 𝑥0 is the value of 𝑥 that maximizes 𝑓(𝑥), then it
is also the value that maximizes 𝑙𝑜𝑔(𝑓(𝑥)). This will rest on the idea that
𝑙𝑜𝑔() is a strictly increasing function.
13.5 Exercises
1. The $\chi^2$ distribution is parameterized by its degrees of freedom parameter $\nu$, which corresponds to the mean of the distribution ($\nu$ must be $> 0$). The density function $f(x|\nu)$ can be accessed in R using dchisq(x, df=nu).
a) For different values of $\nu$, plot the density function $f(x|\nu)$. You might consider $\nu = 5$, 10, and 20. The valid range of $x$ values is $[0, \infty)$, so select a reasonable range of $x$ values to plot over.
b) Suppose that we've observed $x = \{9, 7, 7, 6, 10, 7, 9\}$. Calculate the sample mean and standard deviation.
c) Graphically show that the maximum likelihood estimator of 𝜈 is 𝑥̄ =
7.857.
d) Show that the maximum likelihood estimator of 𝜈 is 𝑥̄ = 7.857 using
a numerical maximization function.
2. The Beta distribution is often used when dealing with data that are proportions. It is parameterized by two parameters, usually called $\alpha$ and $\beta$ (both of which must be greater than zero). The mean of this distribution is given by
$$E[X] = \frac{\alpha}{\alpha+\beta}$$
while the spread of the distribution is inversely related to the magnitude of $\alpha$ and $\beta$. The density function $f(x|\alpha,\beta)$ can be accessed in R using the dbeta(x, alpha, beta) function.
STA 571 Course Project Timeline
This document presents a rough outline of the projects conducted in STA 571: Statistical Methods II. The outline gives when certain topics should be discussed, when decisions should be made on moving forward with projects, and what students need to be focused on. Proposal and final project rubrics will be provided in additional PDFs. As always, there is flexibility based on the way you run your course.
13.6.1 WIBGIs
The first four weeks should include students trying to create as many WIBGIs as possible. At the end of each homework assignment, I added an optional section entitled "Project Development" where I reminded the students that they need to produce at minimum three WIBGI statements.
The main concept here is to encourage exploration and broad thinking. It is okay if students write statements that are not feasible. It is also okay if they write statements that don't conform to the example statement. The goal here is creativity. The idea is to begin generating ideas and determining sources of data; we are trying to encourage data exploration and thinking beyond their core studies.