
Statistical Methods II

Derek L. Sonderegger

December 08, 2020


Contents

Preface 7

Statistical Theory 11

1 Matrix Manipulation 11
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Types of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Operations on Matrices . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Parameter Estimation 21
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Model Specifications . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Standard Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 R example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3 Inference 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Confidence Intervals and Hypothesis Tests . . . . . . . . . . . . . 36
3.3 F-tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


4 Contrasts 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Estimate and variance . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Estimating contrasts using glht() . . . . . . . . . . . . . . . . . 52
4.4 Using emmeans Package . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

Statistical Models 65

5 Analysis of Covariance (ANCOVA) 65


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Offset parallel Lines (aka additive models) . . . . . . . . . . . . . 66
5.3 Lines with different slopes (aka Interaction model) . . . . . . . . 67
5.4 Iris Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6 Two-way ANOVA 77
6.1 Review of 1-way ANOVA . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Two-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Main Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 Interaction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Diagnostics 105
7.1 Detecting Assumption Violations . . . . . . . . . . . . . . . . . . 105
7.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

8 Data Transformations 123


8.1 A review of log(𝑥) and 𝑒𝑥 . . . . . . . . . . . . . . . . . . . . . . 124
8.2 Transforming the Response . . . . . . . . . . . . . . . . . . . . . 126
8.3 Transforming the predictors . . . . . . . . . . . . . . . . . . . . . 129
8.4 Interpretation of log transformed variable coefficients . . . . . . . 134
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Correlated Covariates 145


Interpretation with Correlated Covariates . . . . . . . . . . . . . . . . 145
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

9 Variable Selection 151


9.1 Nested Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.2 Testing-Based Model Selection . . . . . . . . . . . . . . . . . . . 152
9.3 Criterion Based Procedures . . . . . . . . . . . . . . . . . . . . . 157
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

10 Mixed Effects Models 167


10.1 Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10.2 Randomized Complete Block Design (RCBD) . . . . . . . . . . . 168
10.3 Review of Maximum Likelihood Methods . . . . . . . . . . . . . 172
10.4 1-way ANOVA with a random effect . . . . . . . . . . . . . . . . 173
10.5 Blocks as Random Variables . . . . . . . . . . . . . . . . . . . . . 178
10.6 Nested Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.7 Crossed Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.8 Repeated Measures / Longitudinal Studies . . . . . . . . . . . . . 199
10.9 Confidence and Prediction Intervals . . . . . . . . . . . . . . . . 205
10.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

11 Binomial Regression 213


11.1 Binomial Regression Model . . . . . . . . . . . . . . . . . . . . . 214
11.2 Measures of Fit Quality . . . . . . . . . . . . . . . . . . . . . . . 222
11.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.4 Interpreting model coefficients . . . . . . . . . . . . . . . . . . . . 226
11.5 Prediction and Effective Dose Levels . . . . . . . . . . . . . . . . 233
11.6 Overdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.7 ROC Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
11.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

Appendix 257

12 Block Designs 257


12.1 Randomized Complete Block Design (RCBD) . . . . . . . . . . . 258
12.2 Split-plot designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
12.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

13 Maximum Likelihood Estimation 269


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.2 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.3 Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

Project Appendix 283


13.6 Weeks 1 – 4 (Project Feasibility) . . . . . . . . . . . . . . . . . . 283
Preface

These notes are intended to be used in the second semester of a two-semester


sequence of Statistical Methodology. We assume that students have seen t-tests, Simple Regression, and ANOVA. The second semester emphasizes the
uniform matrix notation 𝑦 = 𝑋𝛽 + 𝜖 and the interpretation of the coefficients.
We cover model diagnostics, transformation, model selection, interactions of
continuous and categorical predictors as well as introduce random effects in the
experimental design context. Finally we introduce logistic regression.

Statistical Theory

Chapter 1

Matrix Manipulation

Learning Outcomes

• Perform basic vector and matrix operations of addition and multiplication.


• Perform matrix operations of transpose and inverse.

1.1 Introduction

Almost all of the calculations done in classical statistics require formulas with a large number of subscripts and many different sums. In this chapter we will develop the mathematical machinery to write these formulas in a simple, compact form using matrices.

1.2 Types of Matrices

We will first introduce the idea behind a matrix and give several special types
of matrices that we will encounter.

1.2.1 Scalars

To begin, we first define a scalar. A scalar is just a single number, either real or complex. For example, 6 is a scalar, as is −3. By convention, variable names for scalars will be lower case and not in bold typeface.

Examples could be 𝑎 = 5, 𝑏 = 3, or 𝜎 = 2.


1.2.2 Vectors
A vector is a collection of scalars, arranged as a row or column. Our convention will be that a vector is a lower-case letter written in bold type. In other branches of mathematics it is common to put a bar over the variable name to denote a vector, but in statistics we have already used a bar to denote a mean.
Examples of column vectors could be

$$\mathbf{a} = \begin{bmatrix} 2 \\ -3 \\ 4 \end{bmatrix} \qquad \mathbf{b} = \begin{bmatrix} 2 \\ 8 \\ 3 \\ 4 \\ 1 \end{bmatrix}$$

and examples of row vectors are

𝑐 = [ 8 10 43 −22 ]

𝑑 = [ −1 5 2 ]

To denote a specific entry in the vector, we will use a subscript. For example,
the second element of 𝑑 is 𝑑2 = 5. Notice, that we do not bold this symbol
because the second element of the vector is the scalar value 5.
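These conventions translate directly to R, where `c()` builds a vector and square brackets extract elements; a small illustration:

```r
# c() concatenates scalars into a vector
d <- c(-1, 5, 2)
d[2]        # extract the second element, the scalar 5
length(d)   # the number of elements in the vector
```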

1.2.3 Matrix
Just as a vector is a collection of scalars, a matrix can be viewed as a collection
of vectors (all of the same length). We will denote matrices with bold capitalized
letters. In general, I try to use letters at the end of the alphabet for matrices.
Likewise, I try to use symmetric letters to denote symmetric matrices.
For example, the following is a matrix with two rows and three columns

$$W = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
and there is no requirement that the number of rows be equal, less than, or
greater than the number of columns. In denoting the size of the matrix, we first
refer to the number of rows and then the number of columns. Thus 𝑊 is a 2 × 3
matrix and it sometimes is helpful to remind ourselves of this by writing 𝑊 2×3 .
To pick out a particular element of a matrix, I will again use a subscripting notation, always with the row number first and then the column. Notice the notational shift to lowercase, non-bold font.
𝑤1,2 = 2 and 𝑤2,3 = 6

There are times I will wish to refer to a particular row or column of a matrix
and we will use the following notation

$$w_{1,\cdot} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$$

is the first row of the matrix $W$. The second column of matrix $W$ is

$$w_{\cdot,2} = \begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
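The same subscripting conventions appear in R, where a matrix built with `matrix()` is indexed as `W[i, j]` and leaving an index blank selects a whole row or column; for example:

```r
# byrow = TRUE fills the matrix across rows
W <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)
W[1, 2]   # the element w_{1,2}
W[2, 3]   # the element w_{2,3}
W[1, ]    # the first row,     w_{1,.}
W[ , 2]   # the second column, w_{.,2}
```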

1.2.4 Square Matrices

A square matrix is a matrix with the same number of rows as columns. The
following are square

$$Z = \begin{bmatrix} 3 & 6 \\ 8 & 10 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 2 \\ 3 & 2 & 1 \end{bmatrix}$$

1.2.5 Symmetric Matrices

In statistics we are often interested in square matrices where the 𝑖, 𝑗 element is


the same as the 𝑗, 𝑖 element. For example, 𝑥1,2 = 𝑥2,1 in the above matrix 𝑋.
Consider a matrix $D$ that contains the distance from each of four towns to each of the other towns, and let $d_{i,j}$ be the distance from town $i$ to town $j$. Because the distance is the same in either direction of travel, we should require that $d_{i,j} = d_{j,i}$.
In this example, the values $d_{i,i}$ represent the distance from a town to itself, which should be zero. It turns out that we are often interested in the terms $d_{i,i}$, and I will refer to those terms as the main diagonal of matrix $D$.
Symmetric matrices play a large role in statistics because matrices that rep-
resent the covariances between random variables must be symmetric because
𝐶𝑜𝑣 (𝑌 , 𝑍) = 𝐶𝑜𝑣 (𝑍, 𝑌 ).

1.2.6 Diagonal Matrices

A square matrix that has zero entries in every location except the main diagonal
is called a diagonal matrix. Here are two examples:

$$Q = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 6 \end{bmatrix} \qquad R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$

Sometimes, to make a matrix more clear, I will replace each 0 with a dot to emphasize the non-zero components.

$$R = \begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 2 & \cdot & \cdot \\ \cdot & \cdot & 2 & \cdot \\ \cdot & \cdot & \cdot & 3 \end{bmatrix}$$

1.2.7 Identity Matrices

A diagonal matrix with main diagonal values exactly 1 is called the identity
matrix. The 3 × 3 identity matrix is denoted 𝐼3 .

$$I_3 = \begin{bmatrix} 1 & \cdot & \cdot \\ \cdot & 1 & \cdot \\ \cdot & \cdot & 1 \end{bmatrix}$$
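In R, `diag()` serves double duty: given a vector it builds a diagonal matrix with that main diagonal, and given a single integer it builds an identity matrix of that size.

```r
Q  <- diag(c(4, 5, 6))  # diagonal matrix with main diagonal 4, 5, 6
I3 <- diag(3)           # the 3 x 3 identity matrix
```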

1.3 Operations on Matrices

1.3.1 Transpose

The simplest operation on a matrix is the transpose. For a square matrix $W$ it is defined as $M = W^T$ if and only if $m_{i,j} = w_{j,i}$.

$$Z = \begin{bmatrix} 1 & 6 \\ 8 & 3 \end{bmatrix} \qquad Z^T = \begin{bmatrix} 1 & 8 \\ 6 & 3 \end{bmatrix}$$

$$M = \begin{bmatrix} 3 & 1 & 2 \\ 9 & 4 & 5 \\ 8 & 7 & 6 \end{bmatrix} \qquad M^T = \begin{bmatrix} 3 & 9 & 8 \\ 1 & 4 & 7 \\ 2 & 5 & 6 \end{bmatrix}$$
We can think of this as swapping all elements about the main diagonal. Alternatively, we could think of the transpose as making the first row become the first column, the second row become the second column, and so on. In this fashion we can define the transpose of a non-square matrix.

$$W = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \qquad W^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$
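In R the transpose is computed with `t()`; a quick check of the example above:

```r
W <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)
t(W)   # the 3 x 2 transpose of the 2 x 3 matrix W
```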

1.3.2 Addition and Subtraction

Addition and subtraction are performed element-wise. This means that two
matrices or vectors can only be added or subtracted if their dimensions match.

$$\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 6 \\ 8 \\ 10 \\ 12 \end{bmatrix}$$

$$\begin{bmatrix} 5 & 8 \\ 2 & 4 \\ 11 & 15 \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & -6 \end{bmatrix} = \begin{bmatrix} 4 & 6 \\ -1 & 0 \\ 6 & 21 \end{bmatrix}$$
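In R the `+` and `-` operators are already element-wise (and complain if the dimensions do not match), so the examples above can be checked directly:

```r
c(1, 2, 3, 4) + c(5, 6, 7, 8)   # element-wise vector addition

A <- matrix(c( 5,  8,
               2,  4,
              11, 15), nrow = 3, byrow = TRUE)
B <- matrix(c( 1,  2,
               3,  4,
               5, -6), nrow = 3, byrow = TRUE)
A - B                           # element-wise matrix subtraction
```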

1.3.3 Multiplication

Multiplication is the operation that is vastly different for matrices and vectors
than it is for scalars. There is a great deal of mathematical theory that suggests
a useful way to define multiplication. What is presented below is referred to
as the dot-product of vectors in calculus, and is referred to as the standard
inner-product in linear algebra.

1.3.4 Vector Multiplication

We first define multiplication for a row vector and a column vector. For this multiplication to be defined, both vectors must be the same length. The product is the sum of the element-wise multiplications.

$$\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = (1 \cdot 5) + (2 \cdot 6) + (3 \cdot 7) + (4 \cdot 8) = 5 + 12 + 21 + 32 = 70$$

1.3.5 Matrix Multiplication

Matrix multiplication is just a sequence of vector multiplications. If $X$ is an $m \times n$ matrix and $W$ is an $n \times p$ matrix, then $Z = XW$ is an $m \times p$ matrix where $z_{i,j} = x_{i,\cdot} \, w_{\cdot,j}$, where $x_{i,\cdot}$ is the $i$th row of $X$ and $w_{\cdot,j}$ is the $j$th column of $W$.
For example, let

$$X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix} \qquad W = \begin{bmatrix} 13 & 14 \\ 15 & 16 \\ 17 & 18 \\ 19 & 20 \end{bmatrix}$$

so 𝑋 is 3×4 (which we remind ourselves by adding a 3×4 subscript to 𝑋 as 𝑋 3×4 )


and 𝑊 is 𝑊 4×2 . Because the inner dimensions match for this multiplication,
then 𝑍 3×2 = 𝑋 3×4 𝑊 4×2 is defined where

$$z_{1,1} = x_{1,\cdot} \, w_{\cdot,1} = (1 \cdot 13) + (2 \cdot 15) + (3 \cdot 17) + (4 \cdot 19) = 170$$

and similarly

$$z_{2,1} = x_{2,\cdot} \, w_{\cdot,1} = (5 \cdot 13) + (6 \cdot 15) + (7 \cdot 17) + (8 \cdot 19) = 426$$

so that

$$Z = \begin{bmatrix} 170 & 180 \\ 426 & 452 \\ 682 & 724 \end{bmatrix}$$

For another example, we note that

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 1+4+3 & 2+4+6 \\ 2+6+4 & 4+6+8 \end{bmatrix} = \begin{bmatrix} 8 & 12 \\ 12 & 18 \end{bmatrix}$$

Notice that this definition of multiplication means that the order matters.
Above, we calculated 𝑋 3×4 𝑊 4×2 but we cannot reverse the order because the
inner dimensions do not match up.
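In R, matrix multiplication uses the `%*%` operator (plain `*` is element-wise), so the example above can be reproduced with:

```r
X <- matrix(1:12,  nrow = 3, byrow = TRUE)  # the 3 x 4 matrix above
W <- matrix(13:20, nrow = 4, byrow = TRUE)  # the 4 x 2 matrix above
Z <- X %*% W                                # a 3 x 2 result
Z
```

Reversing the order, `W %*% X`, raises a "non-conformable arguments" error because the inner dimensions do not match.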

1.3.6 Scalar times a Matrix

Strictly speaking, we are not allowed to multiply a matrix by a scalar because


the dimensions do not match. However, it is often notationally convenient. So
we define 𝑎𝑋 to be the element-wise multiplication of each element of 𝑋 by the
scalar 𝑎. Because this is just a notational convenience, the mathematical theory
about inner-products does not apply to this operation.

$$5 \begin{bmatrix} 4 & 5 \\ 7 & 6 \\ 9 & 10 \end{bmatrix} = \begin{bmatrix} 20 & 25 \\ 35 & 30 \\ 45 & 50 \end{bmatrix}$$

Because of this definition, it is clear that 𝑎𝑋 = 𝑋𝑎 and the order does not
matter. Thus when mixing scalar multiplication with matrices, it is acceptable
to reorder scalars, but not matrices.

1.3.7 Determinant

The determinant is defined only for square matrices and can be thought of as
the matrix equivalent of the absolute value or magnitude (i.e. | − 6| = 6). The
determinant gives a measure of the multi-dimensional size of a matrix (say the
matrix 𝐴) and as such is denoted det (𝐴) or |𝐴|. Generally this is a very tedious
thing to calculate by hand and for completeness sake, we will give a definition
and small examples.

For a 2 × 2 matrix

$$\begin{vmatrix} a & c \\ b & d \end{vmatrix} = ad - cb$$

So a simple example of a determinant is

$$\begin{vmatrix} 5 & 2 \\ 3 & 10 \end{vmatrix} = 50 - 6 = 44$$

The determinant can be thought of as the area of the parallelogram created by


the row or column vectors of the matrix.
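In R the determinant is computed with `det()`; checking the example above:

```r
A <- matrix(c(5,  2,
              3, 10), nrow = 2, byrow = TRUE)
det(A)   # ad - cb = 50 - 6 = 44
```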

1.3.8 Inverse

In regular algebra, we are often interested in solving equations such as

5𝑥 = 15

for 𝑥. To do so, we multiply each side of the equation by the inverse of 5, which
is 1/5.

$$\begin{aligned} 5x &= 15 \\ \tfrac{1}{5} \cdot 5 \cdot x &= \tfrac{1}{5} \cdot 15 \\ 1 \cdot x &= 3 \\ x &= 3 \end{aligned}$$

For scalars, we know that the inverse of a scalar $a$ is the value that, when multiplied by $a$, gives 1. That is, we seek to find $a^{-1}$ such that $a a^{-1} = 1$.
In the matrix case, we are interested in finding $A^{-1}$ such that $A^{-1} A = I$ and $A A^{-1} = I$. For both of these multiplications to be defined, $A$ must be a square matrix, so the inverse is only defined for square matrices.

For a 2 × 2 matrix

$$W = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

the inverse is given by:

$$W^{-1} = \frac{1}{\det W} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

For example, suppose

$$W = \begin{bmatrix} 1 & 2 \\ 5 & 3 \end{bmatrix}$$

then $\det W = 3 - 10 = -7$ and

$$W^{-1} = \frac{1}{-7} \begin{bmatrix} 3 & -2 \\ -5 & 1 \end{bmatrix} = \begin{bmatrix} -\frac{3}{7} & \frac{2}{7} \\ \frac{5}{7} & -\frac{1}{7} \end{bmatrix}$$

and thus

$$W W^{-1} = \begin{bmatrix} 1 & 2 \\ 5 & 3 \end{bmatrix} \begin{bmatrix} -\frac{3}{7} & \frac{2}{7} \\ \frac{5}{7} & -\frac{1}{7} \end{bmatrix} = \begin{bmatrix} -\frac{3}{7} + \frac{10}{7} & \frac{2}{7} - \frac{2}{7} \\ -\frac{15}{7} + \frac{15}{7} & \frac{10}{7} - \frac{3}{7} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I_2$$

Not every square matrix has an inverse. If the determinant of the matrix (which we think of as some measure of its magnitude or size) is zero, then the formula would require us to divide by zero. Just as we cannot find the inverse of zero (i.e. solve $0x = 1$ for $x$), a matrix with zero determinant is said to have no inverse.
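In R, `solve(W)` returns the inverse of a square matrix (and throws an error if the matrix is singular); checking the example above:

```r
W <- matrix(c(1, 2,
              5, 3), nrow = 2, byrow = TRUE)
solve(W)         # the inverse calculated above
W %*% solve(W)   # recovers the 2 x 2 identity, up to rounding error
```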

1.4 Exercises
Consider the following matrices:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 6 & 5 & 4 \end{bmatrix} \quad B = \begin{bmatrix} 6 & 4 & 3 \\ 8 & 7 & 6 \end{bmatrix} \quad c = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad d = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} \quad E = \begin{bmatrix} 1 & 2 \\ 2 & 6 \end{bmatrix}$$

1. Find $Bc$
2. Find $AB^T$
3. Find $c^T d$
4. Find $cd^T$
5. Confirm that
$$E^{-1} = \begin{bmatrix} 3 & -1 \\ -1 & 1/2 \end{bmatrix}$$
is the inverse of $E$ by calculating $EE^{-1} = I$.
Chapter 2

Parameter Estimation

library(tidyverse) # dplyr, tidyr, ggplot2

Learning Outcomes

• Write simple regression or one-way ANOVA models as $y \sim N(X\beta, \sigma^2 I_n)$

• Utilizing R and a sample of data, calculate the estimators

$$\hat{\beta} = (X^T X)^{-1} X^T y \qquad \hat{y} = X\hat{\beta}$$

and

$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n - p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

• Calculate the uncertainty of $\hat{\beta}_j$ as

$$\text{StdErr}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}$$

2.1 Introduction
We have previously looked at ANOVA and regression models and, in many ways,
they felt very similar. In this chapter we will introduce the theory that allows
us to understand both models as a particular flavor of a larger class of models
known as linear models.


First we clarify what a linear model is. A linear model is a model where the
data (which we will denote using roman letters as 𝑥 and 𝑦) and parameters of
interest (which we denote using Greek letters such as 𝛼 and 𝛽) interact only via
addition and multiplication. The following are linear models:

Model Formula
ANOVA 𝑦𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖𝑗
Simple Regression 𝑦𝑖 = 𝛽 0 + 𝛽 1 𝑥𝑖 + 𝜖 𝑖
Quadratic Term 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝛽2 𝑥2𝑖 + 𝜖𝑖
General Regression 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖,1 + 𝛽2 𝑥𝑖,2 + ⋯ + 𝛽𝑝 𝑥𝑖,𝑝 + 𝜖𝑖

Notice in the Quadratic model, the square is not a parameter and we can consider $x_i^2$ as just another column of data. This leads to the general regression model, where we add more slopes for other covariates; the $p$th covariate $x_{\cdot,p}$ might be some transformation (such as $x^2$ or $\log x$) of another column of data. The critical point is that the transformation of the data $x$ does not depend on a parameter. Thus the following is not a linear model:

$$y_i = \beta_0 + \beta_1 x_i^{\alpha} + \epsilon_i$$

2.2 Model Specifications

2.2.1 Simple Regression

We would like to represent all linear models in a similar compact matrix representation. This will allow us to make the transition between simple and multiple regression (and ANCOVA) painlessly.

To begin, lets consider the simple regression model.

Typically we’ll write the model as if we are specifying the $i$th element of the data set

$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{signal}} + \underbrace{\epsilon_i}_{\text{noise}} \qquad \text{where } \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

Notice we have a data generation model where there is some relationship between
the explanatory variable and the response which we can refer to as the “signal”
part of the defined model and the noise term which represents unknown actions
effecting each data point that move the response variable. We don’t know what
those unknown or unmeasured effects are, but we do know the sum of those
effects results in a vertical shift from the signal part of the model.

[Figure: Simple Regression Model. A scatterplot of the response variable versus the explanatory variable, showing the signal line $\beta_0 + \beta_1 x$ and a vertical deviation $\epsilon_i$ from it.]

This representation of the model implicitly assumes that our data set has $n$ observations, and we could write the model using all the observations with matrices and vectors that correspond to the data and the parameters.

$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \epsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \epsilon_2 \\
y_3 &= \beta_0 + \beta_1 x_3 + \epsilon_3 \\
&\;\;\vdots \\
y_{n-1} &= \beta_0 + \beta_1 x_{n-1} + \epsilon_{n-1} \\
y_n &= \beta_0 + \beta_1 x_n + \epsilon_n
\end{aligned}$$

where, as usual, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. These equations can be written using matrices as

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{n-1} \\ \epsilon_n \end{bmatrix}}_{\epsilon}$$

and we compactly write the model as

𝑦 = 𝑋𝛽 + 𝜖 where 𝜖 ∼ 𝑁 (0, 𝜎2 𝐼 𝑛 )
where $X$ is referred to as the design matrix and $\beta$ is the vector of location parameters we are interested in estimating. In general, for a vector of random variables we must describe how each element varies in relation to the others, so we need to specify a variance matrix. However, because $\epsilon_i$ is independent of $\epsilon_j$ for all $(i, j)$ pairs, all of the covariances are zero and the variance matrix can be written as $\sigma^2 I$.

2.2.2 ANOVA model

The ANOVA model is also a linear model; all we must do is create an appropriate design matrix. Given the design matrix $X$, all the calculations are identical to the simple regression case.

2.2.2.1 Cell means representation

Recall the cell means representation is

𝑦𝑖,𝑗 = 𝜇𝑖 + 𝜖𝑖,𝑗

where 𝑦𝑖,𝑗 is the 𝑗th observation within the 𝑖th group. To clearly show the
creation of the 𝑋 matrix, let the number of groups be 𝑝 = 3 and the number of
observations per group be 𝑛𝑖 = 4. We now expand the formula to show all the
data.
𝑦1,1 = 𝜇1 + 𝜖1,1
𝑦1,2 = 𝜇1 + 𝜖1,2
𝑦1,3 = 𝜇1 + 𝜖1,3
𝑦1,4 = 𝜇1 + 𝜖1,4
𝑦2,1 = 𝜇2 + 𝜖2,1
𝑦2,2 = 𝜇2 + 𝜖2,2
𝑦2,3 = 𝜇2 + 𝜖2,3
𝑦2,4 = 𝜇2 + 𝜖2,4
𝑦3,1 = 𝜇3 + 𝜖3,1
𝑦3,2 = 𝜇3 + 𝜖3,2
𝑦3,3 = 𝜇3 + 𝜖3,3
𝑦3,4 = 𝜇3 + 𝜖3,4

In an effort to write the model as 𝑦 = 𝑋𝛽 + 𝜖 we will write the above as



$$\begin{aligned}
y_{1,1} &= 1\mu_1 + 0\mu_2 + 0\mu_3 + \epsilon_{1,1} \\
y_{1,2} &= 1\mu_1 + 0\mu_2 + 0\mu_3 + \epsilon_{1,2} \\
y_{1,3} &= 1\mu_1 + 0\mu_2 + 0\mu_3 + \epsilon_{1,3} \\
y_{1,4} &= 1\mu_1 + 0\mu_2 + 0\mu_3 + \epsilon_{1,4} \\
y_{2,1} &= 0\mu_1 + 1\mu_2 + 0\mu_3 + \epsilon_{2,1} \\
y_{2,2} &= 0\mu_1 + 1\mu_2 + 0\mu_3 + \epsilon_{2,2} \\
y_{2,3} &= 0\mu_1 + 1\mu_2 + 0\mu_3 + \epsilon_{2,3} \\
y_{2,4} &= 0\mu_1 + 1\mu_2 + 0\mu_3 + \epsilon_{2,4} \\
y_{3,1} &= 0\mu_1 + 0\mu_2 + 1\mu_3 + \epsilon_{3,1} \\
y_{3,2} &= 0\mu_1 + 0\mu_2 + 1\mu_3 + \epsilon_{3,2} \\
y_{3,3} &= 0\mu_1 + 0\mu_2 + 1\mu_3 + \epsilon_{3,3} \\
y_{3,4} &= 0\mu_1 + 0\mu_2 + 1\mu_3 + \epsilon_{3,4}
\end{aligned}$$

and we will finally be able to write the matrix version

$$\underbrace{\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{1,4} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{2,4} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \\ y_{3,4} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{2,4} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \\ \epsilon_{3,4} \end{bmatrix}}_{\epsilon}$$

Notice that each column of the $X$ matrix acts as an indicator of whether the observation is an element of the corresponding group. As such, these are often called indicator variables. Another term for these, which I find less helpful, is dummy variables.

2.2.2.2 Offset from reference group

In this model representation of ANOVA, we have an overall mean and then


offsets from the control group (which will be group one). The model is thus

𝑦𝑖,𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖,𝑗

where 𝜏1 = 0. We can write this in matrix form as

$$\underbrace{\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{1,4} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{2,4} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \\ y_{3,4} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} \mu \\ \tau_2 \\ \tau_3 \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{2,4} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \\ \epsilon_{3,4} \end{bmatrix}}_{\epsilon}$$

2.3 Parameter Estimation

For both simple regression and ANOVA, we can write the model in matrix form
as
𝑦 = 𝑋𝛽 + 𝜖 where 𝜖 ∼ 𝑁 (0, 𝜎2 𝐼 𝑛 )
which could also be written as

𝑦 ∼ 𝑁 (𝑋𝛽, 𝜎2 𝐼 𝑛 )

and we could use the maximum-likelihood principle to find estimators for 𝛽 and
𝜎2 . In this section, we will introduce the estimators 𝛽̂ and 𝜎̂ 2 .

2.3.1 Estimation of Location Parameters

Our goal is to find the best estimate of $\beta$ given the data. To justify the formula, consider the case where there are no error terms (i.e. $\epsilon_i = 0$ for all $i$). Then we have

$$y = X\beta$$

and our goal is to solve for $\beta$. To do this we must use a matrix inverse, but since inverses only exist for square matrices, we first pre-multiply by $X^T$ (notice that $X^T X$ is a symmetric $p \times p$ matrix):

$$X^T y = X^T X \beta$$

and then pre-multiply by $(X^T X)^{-1}$.

$$(X^T X)^{-1} X^T y = (X^T X)^{-1} X^T X \beta = \beta$$

This exercise suggests that $(X^T X)^{-1} X^T y$ is a good place to start when looking for the maximum-likelihood estimator for $\beta$.
Happily it turns out that this quantity is in fact the maximum-likelihood esti-
mator for the data generation model

𝑦 ∼ 𝑁 (𝑋𝛽, 𝜎2 𝐼 𝑛 )

(and equivalently minimizes the sum-of-squared error). In this course we won’t


prove these two facts, but we will use this as our estimate of 𝛽.

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

2.3.2 Estimation of Variance Parameter

Recall our simple regression model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \qquad \text{where } \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$

Using our estimates 𝛽̂ we can obtain predicted values for the regression line at
any x-value. In particular we can find the predicted value for each 𝑥𝑖 value in
our dataset.
𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖

Using matrix notation, I would write 𝑦 ̂ = 𝑋 𝛽.̂


As usual we will find estimates of the noise terms (which we will call residuals
or errors) via
$$\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$

Writing 𝑦 ̂ in matrix terms we have

$$\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = H y$$

where $H = X (X^T X)^{-1} X^T$ is often called the hat-matrix because it takes $y$ to $\hat{y}$ and has many interesting theoretical properties.1
We can now estimate the error terms via

$$\hat{\epsilon} = y - \hat{y} = y - Hy = (I_n - H)\, y$$
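The projection properties of $H$ mentioned in the footnote can be verified numerically; here is a small sketch using an arbitrary made-up design matrix (not one from the text):

```r
# a small illustrative design matrix: an intercept column plus one predictor
X <- cbind(1, c(0, 1, 2, 3))
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% H - H))   # essentially zero: H is idempotent (HH = H)
max(abs(t(H) - H))      # essentially zero: H is symmetric
```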

As usual we estimate $\sigma^2$ using the mean-squared error, where the general formula is

$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n - p} \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \frac{1}{n - p} \hat{\epsilon}^T \hat{\epsilon}$$
where 𝛽 has 𝑝 elements, and thus we have 𝑛 − 𝑝 degrees of freedom.

2.4 Standard Errors


Because our $\hat{\beta}$ estimates vary from sample to sample, we need to estimate how much they vary. This will eventually allow us to create confidence intervals for, and perform hypothesis tests relating to, the $\beta$ parameter vector.

2.4.1 Expectation and variance of a random vector


Just as we needed to derive the expected value and variance of 𝑥̄ in the previous
semester, we must now do the same for 𝛽.̂ But to do this, we need some
properties of expectations and variances.
In the following, let 𝐴𝑛×𝑝 and 𝑏𝑛×1 be constants and 𝜖𝑛×1 be a random vector.
Expectations are very similar to the scalar case, where

$$E[\epsilon] = \begin{bmatrix} E[\epsilon_1] \\ E[\epsilon_2] \\ \vdots \\ E[\epsilon_n] \end{bmatrix}$$
1 Mathematically, 𝐻 is the projection matrix that takes a vector in 𝑛-dimensional space

and projects it onto a 𝑝-dimension subspace spanned by the vectors in 𝑋. Projection matrices
have many useful properties and much of the theory of linear models utilizes 𝐻.

and any constants are pulled through the expectation

$$E[A^T \epsilon + b] = A^T E[\epsilon] + b$$

Variances are a little different. The variance of the vector $\epsilon$ is

$$Var(\epsilon) = \begin{bmatrix} Var(\epsilon_1) & Cov(\epsilon_1, \epsilon_2) & \dots & Cov(\epsilon_1, \epsilon_n) \\ Cov(\epsilon_2, \epsilon_1) & Var(\epsilon_2) & \dots & Cov(\epsilon_2, \epsilon_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(\epsilon_n, \epsilon_1) & Cov(\epsilon_n, \epsilon_2) & \dots & Var(\epsilon_n) \end{bmatrix}$$

and additive constants are ignored, but multiplicative constants are pulled out as follows:

$$Var(A^T \epsilon + b) = Var(A^T \epsilon) = A^T \, Var(\epsilon) \, A$$

2.4.2 Variance of Location Parameters

We next derive the sampling variance of our estimator 𝛽̂ by first noting that 𝑋
and 𝛽 are constants and therefore

𝑉 𝑎𝑟 (𝑦) = 𝑉 𝑎𝑟 (𝑋𝛽 + 𝜖)
= 𝑉 𝑎𝑟 (𝜖)
= 𝜎2 𝐼 𝑛

because the error terms are independent (so $Cov(\epsilon_i, \epsilon_j) = 0$ when $i \ne j$) and $Var(\epsilon_i) = \sigma^2$. Recalling that multiplicative constants come out of the variance operator squared, we have

$$\begin{aligned}
Var(\hat{\beta}) &= Var\left( (X^T X)^{-1} X^T y \right) \\
&= (X^T X)^{-1} X^T \, Var(y) \, X (X^T X)^{-1} \\
&= (X^T X)^{-1} X^T \, \sigma^2 I_n \, X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1}
\end{aligned}$$

Using this, the standard error (i.e. the estimated standard deviation) of $\hat{\beta}_j$ (for any $j$ in $1, \dots, p$) is

$$\text{StdErr}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}$$

2.4.3 Summary of pertinent results

The statistic $\hat{\beta} = (X^T X)^{-1} X^T y$ is the unbiased maximum-likelihood estimator of $\beta$.

The Central Limit Theorem applies to each element of $\hat{\beta}$. That is, as $n \to \infty$, the distribution of $\hat{\beta}_j \to N\left( \beta_j, \left[ \sigma^2 (X^T X)^{-1} \right]_{jj} \right)$.

The error terms can be calculated via

$$\hat{y} = X\hat{\beta} \qquad \hat{\epsilon} = y - \hat{y}$$

The estimate of $\sigma^2$ is

$$\hat{\sigma}^2 = \text{MSE} = \frac{1}{n - p} \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \frac{1}{n - p} \hat{\epsilon}^T \hat{\epsilon}$$

and is the typical squared distance between an observation and the model prediction.

The standard error (i.e. the estimated standard deviation) of $\hat{\beta}_j$ (for any $j$ in $1, \dots, p$) is

$$\text{StdErr}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}$$

2.5 R example

Here we will work an example in R and do both the “hand” calculation as well
as using the lm() function to obtain the same information.

Consider the following data in a simple regression problem:

n <- 20
x <- seq(0,10, length=n)
y <- -3 + 2*x + rnorm(n, sd=2)
my.data <- data.frame(x=x, y=y)
ggplot(my.data) + geom_point(aes(x=x,y=y))

[Figure: scatterplot of the simulated data, y versus x.]

First we must create the design matrix $X$. Recall

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix}$$

and can be created in R via the following:

X <- cbind( rep(1,n), x)

Given 𝑋 and 𝑦 we can calculate

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

in R using the following code:

XtXinv <- solve( t(X) %*% X ) # solve() is the inverse function


beta.hat <- XtXinv %*% t(X) %*% y %>% # Do the calculations
as.vector() # make sure the result is a vector

Our next step is to calculate the predicted values 𝑦 ̂ and the residuals 𝜖 ̂

$$\hat{y} = X\hat{\beta} \qquad \hat{\epsilon} = y - \hat{y}$$

y.hat <- X %*% beta.hat


residuals <- y - y.hat

Now that we have the residuals, we can calculate $\hat{\sigma}^2$ and the standard errors of $\hat{\beta}_j$:

$$\hat{\sigma}^2 = \frac{1}{n - p} \sum_{i=1}^{n} \hat{\epsilon}_i^2 \qquad \text{StdErr}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}$$

sigma2.hat <- 1/(n-2) * sum( residuals^2 ) # p = 2


sigma.hat <- sqrt( sigma2.hat )
std.errs <- sqrt( sigma2.hat * diag(XtXinv) )

We now print out the important values and compare them to the summary
output given by the lm() function in R.

cbind(Est=beta.hat, StdErr=std.errs)

## Est StdErr
## -1.448250 0.7438740
## x 1.681485 0.1271802

sigma.hat

## [1] 1.726143

model <- lm(y~x)


summary(model)

##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5127 -1.2134 0.1469 1.2610 3.2358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4482 0.7439 -1.947 0.0673 .
## x 1.6815 0.1272 13.221 1.04e-10 ***
## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.726 on 18 degrees of freedom
## Multiple R-squared: 0.9066, Adjusted R-squared: 0.9015
## F-statistic: 174.8 on 1 and 18 DF, p-value: 1.044e-10

2.6 Exercises
1. We will do a simple ANOVA analysis on example 8.2 from Ott & Long-
necker using the matrix representation of the model. A clinical psychol-
ogist wished to compare three methods for reducing hostility levels in
university students, and used a certain test (HLT) to measure the degree
of hostility. A high score on the test indicated great hostility. The
psychologist used 24 students who obtained high and nearly equal scores in
the experiment. Eight were selected at random from among the 24 problem
cases and were treated with method 1. Seven of the remaining 16 students
were selected at random and treated with method 2. The remaining nine
students were treated with method 3. All treatments were continued for
a one-semester period. Each student was given the HLT test at the end of
the semester, with the results shown in the following table. (This analysis
was done in section 8.3 of my STA 570 notes.)

Method Values
1 96, 79, 91, 85, 83, 91, 82, 87
2 77, 76, 74, 73, 78, 71, 80
3 66, 73, 69, 66, 77, 73, 71, 70, 74

We will be using the cell means model of ANOVA

𝑦𝑖𝑗 = 𝛽𝑖 + 𝜖𝑖𝑗

where $\beta_i$ is the mean of group $i$ and $\epsilon_{ij} \stackrel{iid}{\sim} N(0, \sigma^2)$.

a. Create one vector of all 24 hostility test scores y. (Use the c()
function.)
b. Create a design matrix X with dummy variables for columns that code
for what group an observation belongs to. Notice that X will be a 24
rows by 3 column matrix. Hint: An R function that might be handy
is cbind(a,b) which will bind two vectors or matrices together along
the columns. There is also a corresponding rbind() function that
binds vectors/matrices along rows. Furthermore, the repeat command
rep() could be handy.

c) Find 𝛽̂ using the matrix formula given in class. Hint: The R function
t(A) computes the matrix transpose A𝑇 , solve(A) computes A−1 ,
and the operator %*% does matrix multiplication (used as A %*% B).
d) Examine the matrix $(X^T X)^{-1} X^T$. What do you notice about it? In
particular, think about the result when you right multiply by y. How
does this matrix calculate the appropriate group means and using the
appropriate group sizes 𝑛𝑖 ?

2. We will calculate the y-intercept and slope estimates in a simple linear


model using matrix notation. We will use a data set that gives the diame-
ter at breast height (DBH) versus tree height for a randomly selected set of
trees. In addition, for each tree, a ground measurement of crown closure
(CC) was taken. Larger values of crown closure indicate more shading
and is often associated with taller tree morphology (possibly). We will
be interested in creating a regression model that predicts height based
on DBH and CC. In the interest of reduced copying, we will only use 10
observations. (Note: I made this data up and the DBH values might be
unrealistic. Don’t make fun of me.)

DBH 30.5 31.5 31.7 32.3 33.3 35 35.4 35.6 36.3 37.8
CC 0.74 0.69 0.65 0.72 0.58 0.5 0.6 0.7 0.52 0.6
Height 58 64 65 70 68 63 78 80 74 76

We are interested in fitting the regression model

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖,1 + 𝛽2 𝑥𝑖,2 + 𝜖𝑖

where 𝛽0 is the y-intercept and 𝛽1 is the slope parameter associated with


DBH and 𝛽2 is the slope parameter associated with Crown Closure.

a) Create a vector of all 10 heights y.


b) Create the design matrix X.
c) Find 𝛽̂ using the matrix formula given in class.
d) Compare your results to the estimated coefficients you get using
the lm() function. To add the second predictor to the model,
your call to lm() should look something like lm(Height ~ DBH +
CrownClosure).
Chapter 3

Inference

library(tidymodels) # Grab model results as data frames


library(tidyverse) # ggplot2, dplyr, tidyr

Learning Outcomes

• Utilize 𝛽̂ and its standard errors to produce confidence intervals and
hypothesis tests for 𝛽𝑗 values.
• Convert hypothesis tests for 𝛽𝑗 = 0 into a model comparison F-test.
• Utilize F-tests to perform a hypothesis test of multiple 𝛽𝑗 values all being
equal to zero. This is the simple vs complex model comparison.
• Create confidence intervals for 𝛽𝑗 values leveraging the normality assump-
tion of residuals.

3.1 Introduction
The goal of statistics is to take information calculated from sample data and use
that information to estimate population parameters. The problem is that the
sample statistic is only a rough guess and if we were to collect another sample
of data, we’d get a different sample statistic and thus a different parameter
estimate. Therefore, we need to utilize the sample statistics to create confidence
intervals and make hypothesis tests about those parameters.
In this chapter, we’ll consider a dataset about the Galápagos Islands relating
the number of tortoise species on an island to various island characteristics such
as size, maximum elevation, etc. The data set contains 𝑛 = 30 islands and the
following variables:

Variable Description
Species Number of tortoise species found on the island
Endemics    Number of tortoise species endemic to the island
Elevation Elevation of the highest point on the island
Area Area of the island (km2 )
Nearest Distance to the nearest neighboring island (km)
Scruz Distance to the Santa Cruz islands (km)
Adjacent Area of the nearest adjacent island (km2 )

data('gala', package='faraway') # import the data set


head(gala) # show the first couple of rows

## Species Endemics Area Elevation Nearest Scruz Adjacent


## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84

3.2 Confidence Intervals and Hypothesis Tests


We can now state the general method of creating confidence intervals and per-
form hypothesis tests for any element of 𝛽.
The general recipe for a (1 − 𝛼) ∗ 100% confidence interval is

$$\text{Estimate} \pm Q^{*}_{1-\alpha/2}\; StdErr\left(\text{Estimate}\right)$$

where $Q^{*}_{1-\alpha/2}$ is the $1-\alpha/2$ quantile from some appropriate distribution. The
mathematical details about which distribution the quantile should come from
are often obscure, but usually involve the degrees of freedom 𝑛 − 𝑝 where 𝑝 is
the number of parameters in the “signal” part of the model.
The confidence interval formula for the 𝛽 parameters in a linear model is

$$\hat{\beta}_j \pm t^{*}_{1-\alpha/2,\,n-p}\; StdErr\left(\hat{\beta}_j\right)$$

where 𝑡∗1−𝛼/2,𝑛−𝑝 is the 1−𝛼/2 quantile from the t-distribution with 𝑛−𝑝 degrees
of freedom. A test statistic for testing 𝐻0 ∶ 𝛽𝑗 = 0 versus 𝐻𝑎 ∶ 𝛽𝑗 ≠ 0 is

$$t_{n-p} = \frac{\hat{\beta}_j - 0}{StdErr\left(\hat{\beta}_j\right)}$$
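These two formulas are easy to compute by hand in R and to check against the built-in confint() function. The following is a minimal sketch using the built-in cars data (not the gala data used later in this chapter), so the data set and variable names are purely illustrative:

```r
# Hand-compute a 95% CI and the t statistic for the slope of a simple
# regression, then compare against confint().
m <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
p <- length(coef(m))                           # so df = n - p
est <- coef(summary(m))['speed', 'Estimate']
se  <- coef(summary(m))['speed', 'Std. Error']
t.crit <- qt(1 - 0.05/2, df = n - p)           # t* quantile with n - p df
ci <- est + c(-1, 1) * t.crit * se             # Estimate +/- t* StdErr(Estimate)
t.stat <- (est - 0) / se                       # test statistic for H0: beta_j = 0
ci                                             # matches confint(m, 'speed')
```

The hand-computed interval agrees with `confint(m, 'speed', level=0.95)`, and `t.stat` matches the t value column of `summary(m)`.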

3.3 F-tests
We wish to develop a rigorous way to compare nested models and decide if
a complicated model explains enough more variability than a simple model
to justify the additional intellectual effort of thinking about the data in the
complicated fashion.
It is important to specify that we are developing a way of testing nested models.
By nested, we mean that the simple model can be created from the full model
just by setting one or more model parameters to zero. This method is not
limited to testing whether a single parameter is zero; instead we can test
whether an entire set of parameters is simultaneously equal to zero.

3.3.1 Theory
Recall that in the simple regression and ANOVA cases we were interested in
comparing a simple model versus a more complex model. For each model we
computed the sum of squares error (SSE) and said that if the complicated model
performed much better than the simple then 𝑆𝑆𝐸𝑠𝑖𝑚𝑝𝑙𝑒 ≫ 𝑆𝑆𝐸𝑐𝑜𝑚𝑝𝑙𝑒𝑥 .
Recall from the estimation chapter that the model parameter estimates are the
𝛽̂ values that minimize the SSE. If it were to turn out that a 𝛽𝑗̂ of zero
minimized the SSE, then zero would be the estimate. Next, recall that the
simple model is a simplification of the complex model obtained by setting
certain parameters to zero. So we are comparing a simple model that sets
𝛽𝑗̂ = 0 against a complex model that allows 𝛽𝑗̂ to be any real value (including
zero). Because we select 𝛽𝑗̂ to be the value that minimizes the SSE, it follows
that 𝑆𝑆𝐸𝑠𝑖𝑚𝑝𝑙𝑒 ≥ 𝑆𝑆𝐸𝑐𝑜𝑚𝑝𝑙𝑒𝑥 .
We’ll define 𝑆𝑆𝐸𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 = 𝑆𝑆𝐸𝑠𝑖𝑚𝑝𝑙𝑒 − 𝑆𝑆𝐸𝑐𝑜𝑚𝑝𝑙𝑒𝑥 ≥ 0 and observe that if
the complex model is a much better fit to the data, then 𝑆𝑆𝐸𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 is large.
But how large is large enough to be statistically significant? In part, it depends
on how many more parameters were added to the model and on the amount
of unexplained variability left in the complex model. Let 𝑑𝑓𝑑𝑖𝑓𝑓 be the
difference in the number of parameters between the simple and complex models.
As with most test statistics, the F statistic can be considered as a “Signal-
to-Noise” ratio where the signal part is the increased amount of variability
explained per additional parameter by the complex model and the noise part is
just the MSE of the complex model.

$$F = \frac{\text{Signal}}{\text{Noise}} = \frac{RSS_{difference}/df_{diff}}{RSS_{complex}/df_{complex}}$$

and we claimed that if the null hypothesis was true (i.e. the complex model is an
unnecessary obfuscation of the simple), then this ratio follows an F-distribution
with degrees of freedom 𝑑𝑓𝑑𝑖𝑓𝑓 and 𝑑𝑓𝑐𝑜𝑚𝑝𝑙𝑒𝑥 .

The F-distribution is centered near one and we should reject the simple model
(in favor of the complex model) if this F statistic is much larger than one.
Therefore the p-value for the test is
[Figure: density of the $F_{df_{diff},\,df_{complex}}$ distribution; for an observed statistic of, say, $F = 3$, the p-value is the shaded upper-tail area $P(F_{df_{diff},\,df_{complex}} > F)$.]
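In R this upper-tail probability is computed with pf(). For instance, for a hypothetical observed value of F = 3 with illustrative degrees of freedom 2 and 24:

```r
# p-value = P(F_{df.diff, df.complex} > F.obs), the upper-tail F area
F.obs <- 3
p.value <- 1 - pf(F.obs, df1 = 2, df2 = 24)
p.value                      # roughly 0.07; not significant at the 0.05 level
```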

This hypothesis test doesn't require any particular difference in the number of
parameters between the two models, while the t-test can only assess whether a
single parameter is possibly zero. In the single-parameter case, the
corresponding t-test and F-test give the same p-value and therefore the same
inference about whether 𝛽𝑗 is possibly zero.
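This equivalence between the one-parameter F-test and the t-test is easy to verify numerically. A quick sketch using the built-in cars data (a single-predictor regression, so the names here are illustrative):

```r
# For a one-parameter model comparison, the F statistic equals t^2.
m.c <- lm(dist ~ speed, data = cars)   # complex model
m.s <- lm(dist ~ 1,     data = cars)   # simple (intercept-only) model
t.stat <- coef(summary(m.c))['speed', 't value']
F.stat <- anova(m.s, m.c)$F[2]         # F from the nested model comparison
c(t.squared = t.stat^2, F = F.stat)    # the two values agree
```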

3.4 Example
We will consider a data set from Johnson and Raven (1973) which also appears
in Weisberg (1985). This data set is concerned with the number of tortoise
species on 𝑛 = 30 different islands in the Galapagos. The variables of interest
in the data set are:

Variable Description
Species Number of tortoise species found on the island
Endemics    Number of tortoise species endemic to the island
Elevation Elevation of the highest point on the island
Area Area of the island (km2 )
Nearest Distance to the nearest neighboring island (km)
Scruz Distance to the Santa Cruz islands (km)
Adjacent Area of the nearest adjacent island (km2 )

We will first read in the data set from the package faraway.

data('gala', package='faraway') # import the data set


head(gala) # show the first couple of rows

## Species Endemics Area Elevation Nearest Scruz Adjacent


## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84

First we will create the full model that predicts the number of species as a
function of elevation, area, nearest, scruz and adjacent. Notice that this model
has 𝑝 = 6 𝛽𝑖 values (one for each covariate plus one for the intercept).

𝑦𝑖 = 𝛽0 + 𝛽1 𝐴𝑟𝑒𝑎𝑖 + 𝛽2 𝐸𝑙𝑒𝑣𝑎𝑡𝑖𝑜𝑛𝑖 + 𝛽3 𝑁 𝑒𝑎𝑟𝑒𝑠𝑡𝑖 + 𝛽4 𝑆𝑐𝑟𝑢𝑧𝑖 + 𝛽5 𝐴𝑑𝑗𝑎𝑐𝑒𝑛𝑡𝑖 + 𝜖𝑖

We can happily fit this model just by adding terms on the right hand side of the
model formula. Notice that R creates the design matrix 𝑋 for us.

M.c <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)


model.matrix(M.c) # this is the design matrix X.

## (Intercept) Area Elevation Nearest Scruz Adjacent


## Baltra 1 25.09 346 0.6 0.6 1.84
## Bartolome 1 1.24 109 0.6 26.3 572.33
## Caldwell 1 0.21 114 2.8 58.7 0.78
## Champion 1 0.10 46 1.9 47.4 0.18
## Coamano 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 1 0.34 119 8.0 8.0 1.84
## Daphne.Minor 1 0.08 93 6.0 12.0 0.34
## Darwin 1 2.33 168 34.1 290.2 2.85
## Eden 1 0.03 71 0.4 0.4 17.95
## Enderby 1 0.18 112 2.6 50.2 0.10
## Espanola 1 58.27 198 1.1 88.3 0.57
## Fernandina 1 634.49 1494 4.3 95.3 4669.32
## Gardner1 1 0.57 49 1.1 93.1 58.27
## Gardner2 1 0.78 227 4.6 62.2 0.21
## Genovesa 1 17.35 76 47.4 92.2 129.49
## Isabela 1 4669.32 1707 0.7 28.1 634.49
## Marchena 1 129.49 343 29.1 85.9 59.56
## Onslow 1 0.01 25 3.3 45.9 0.10

## Pinta 1 59.56 777 29.1 119.6 129.49


## Pinzon 1 17.95 458 10.7 10.7 0.03
## Las.Plazas 1 0.23 94 0.5 0.6 25.09
## Rabida 1 4.89 367 4.4 24.4 572.33
## SanCristobal 1 551.62 716 45.2 66.6 0.57
## SanSalvador 1 572.33 906 0.2 19.8 4.89
## SantaCruz 1 903.82 864 0.6 0.0 0.52
## SantaFe 1 24.08 259 16.5 16.5 0.52
## SantaMaria 1 170.92 640 2.6 49.2 0.10
## Seymour 1 1.84 147 0.6 9.6 25.09
## Tortuga 1 1.24 186 6.8 50.9 17.95
## Wolf 1 2.85 253 34.1 254.7 2.33
## attr(,"assign")
## [1] 0 1 2 3 4 5

All the usual quantities from chapter two can be calculated, and we can see
the summary table for this regression as follows:

summary(M.c)

##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07

3.4.1 Testing All Covariates

The first test we might want to do is to test if any of the covariates are signifi-
cant. That is to say that we want to test the full model versus the simple null
hypothesis model
𝑦𝑖 = 𝛽0 + 𝜖𝑖
that has no covariates and only a y-intercept. So we will create a simple model

M.s <- lm(Species ~ 1, data=gala)

and calculate the appropriate Residual Sums of Squares (RSS) for each model,
along with the difference in degrees of freedom between the two models.

RSS.c <- sum(resid(M.c)^2)


RSS.s <- sum(resid(M.s)^2)
df.diff <- 5 # complex model has 5 additional parameters
df.c <- 30 - 6 # complex model has 24 degrees of freedom left

The F-statistic for this test is therefore

F.stat <- ( (RSS.s - RSS.c) / df.diff ) / ( RSS.c / df.c )


F.stat

## [1] 15.69941

and should be compared against the F-distribution with 5 and 24 degrees of


freedom. Because a large difference between RSS.s and RSS.c would be evidence
for the alternative, larger model, the p-value for this test is

𝑝 − 𝑣𝑎𝑙𝑢𝑒 = 𝑃 (𝐹5,24 ≥ F.stat)

p.value <- 1 - pf(15.699, 5, 24)


p.value

## [1] 6.839486e-07

Both the F.stat and its p-value are given at the bottom of the summary table.
However, I might be interested in creating an ANOVA table for this situation.

Source       df      Sum Sq   Mean Sq                   F             p-value
Difference   𝑝 − 1   𝑅𝑆𝑆𝑑    𝑀𝑆𝐸𝑑 = 𝑅𝑆𝑆𝑑 /(𝑝 − 1)    𝑀𝑆𝐸𝑑 /𝑀𝑆𝐸𝑐   𝑃 (𝐹 > 𝐹𝑝−1,𝑛−𝑝 )
Complex      𝑛 − 𝑝   𝑅𝑆𝑆𝑐    𝑀𝑆𝐸𝑐 = 𝑅𝑆𝑆𝑐 /(𝑛 − 𝑝)
Simple       𝑛 − 1   𝑅𝑆𝑆𝑠

This type of table is often shown in textbooks, but base functions in R don’t
produce exactly this table. Instead the anova(simple, complex) command
produces the following:

Models    df      RSS     Diff in RSS   F             p-value
Simple    𝑛 − 1   𝑅𝑆𝑆𝑠
Complex   𝑛 − 𝑝   𝑅𝑆𝑆𝑐   𝑅𝑆𝑆𝑑         𝑀𝑆𝐸𝑑 /𝑀𝑆𝐸𝑐   𝑃 (𝐹 > 𝐹𝑝−1,𝑛−𝑝 )

This table can be obtained in R by using the anova() function on the two models
of interest. This representation skips showing the Mean Squared calculations.

anova(M.s, M.c)

## Analysis of Variance Table


##
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 381081
## 2 24 89231 5 291850 15.699 6.838e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4.2 Testing a Single Covariate

For a particular covariate, 𝛽𝑗 , we might wish to perform a test to see if it can be


removed from the model. It can be shown that the F-statistic can be re-written
as

$$F = \frac{\left[RSS_s - RSS_c\right]/1}{RSS_c/(n-p)} = \cdots = \left[\frac{\hat{\beta}_j}{SE\left(\hat{\beta}_j\right)}\right]^2 = t^2$$
where 𝑡 has a t-distribution with 𝑛 − 𝑝 degrees of freedom under the null hy-
pothesis that the simple model is sufficient.
We consider the case of removing the covariate Area from the model and will
calculate our test statistic using both methods.

M.c <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)


M.s <- lm(Species ~ Elevation + Nearest + Scruz + Adjacent, data=gala)
RSS.c <- sum( resid(M.c)^2 )
RSS.s <- sum( resid(M.s)^2 )
df.d <- 1
df.c <- 30-6
F.stat <- ((RSS.s - RSS.c)/1) / (RSS.c / df.c)
F.stat

## [1] 1.139792

1 - pf(F.stat, 1, 24)

## [1] 0.296318

sqrt(F.stat)

## [1] 1.067611

To calculate it using the estimated coefficient and its standard error, we must
grab those values from the summary table

broom::tidy(M.c) # get the coefficient table

## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.07 19.2 0.369 0.715
## 2 Area -0.0239 0.0224 -1.07 0.296
## 3 Elevation 0.319 0.0537 5.95 0.00000382
## 4 Nearest 0.00914 1.05 0.00867 0.993
## 5 Scruz -0.241 0.215 -1.12 0.275
## 6 Adjacent -0.0748 0.0177 -4.23 0.000297

beta.area <- broom::tidy(M.c)[2,2] %>% pull() # pull turns it into a scalar


SE.beta.area <- broom::tidy(M.c)[2,3] %>% pull()
t <- beta.area / SE.beta.area
t

## [1] -1.067611

2 * pt(t, 24)

## [1] 0.296318

All that hand calculation is tedious, so we can again use the anova() command
to compare the two models.

anova(M.s, M.c)

## Analysis of Variance Table


##
## Model 1: Species ~ Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 93469
## 2 24 89231 1 4237.7 1.1398 0.2963

3.4.3 Testing a Subset of Covariates


Often a researcher will want to remove a subset of covariates from the model.
In the Galapagos example, Area, Nearest, and Scruz all have non-significant
p-values and would be removed when comparing the full model to the model
without that one covariate alone. While each of them might be non-significant
individually, is the set of all three jointly significant?
Because the individual 𝛽𝑗̂ values are not independent, we cannot claim
that the subset is not statistically significant just because each variable in turn
was insignificant. Instead we again create simple and complex models in the
same fashion as we have previously done.

M.c <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)


M.s <- lm(Species ~ Elevation + Adjacent, data=gala)
anova(M.s, M.c)

## Analysis of Variance Table


##
## Model 1: Species ~ Elevation + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 100003
## 2 24 89231 3 10772 0.9657 0.425

We find a large p-value associated with this test and can safely stay with the null
hypothesis, that the simple model is sufficient to explain the observed variability
in the number of species of tortoise.

3.5 Exercises
1. The dataset prostate in package faraway has information about a study
of 97 men with prostate cancer. We import the data and examine the first
four observations using the following commands.

data(prostate, package='faraway')
head(prostate)

It is possible to get information about the data set using the command
help(prostate). Fit a model with lpsa as the response and all the other
variables as predictors.
a) Compute 90% and 95% confidence intervals for the parameter
associated with age. Using just these intervals, what can we deduce
about the p-value for age in the regression summary? Hint: look at
the help for the function confint(). You’ll find the level option
to be helpful. Alternatively use the broom::tidy() function with the
conf.int=TRUE option and also use the level= option as well.
b) Remove all the predictors that are not significant at the 5% level.
Test this model against the original model. Which is preferred?
2. Thirty samples of cheddar cheese were analyzed for their content of acetic
acid, hydrogen sulfide and lactic acid. Each sample was tasted and scored
by a panel of judges and the average taste score was recorded. Use the
cheddar dataset from the faraway package (import it the same way you
did in problem one, but now use cheddar) to answer the following:
a) Fit a regression model with taste as the response and the three chem-
ical contents as predictors. Identify the predictors that are statisti-
cally significant at the 5% level.
b) Acetic and H2S are measured on a log10 scale. Create two new
columns in the cheddar data frame that contain the values on their
original scale. Fit a linear model that uses the three covariates on
their non-log scale. Identify the predictors that are statistically sig-
nificant at the 5% level for this model.
c) Can we use an 𝐹 -test to compare these two models? Explain why
or why not. Which model provides a better fit to the data? Explain
your reasoning.
d) For the model in part (a), if a sample of cheese were to have H2S in-
creased by 2 (where H2S is on the log scale and we increase this value
by 2 using some method), what change in taste would be expected?
What caveats must be made in this interpretation? Hint: I don't
want to get into interpreting parameters on the log scale just yet. So
just interpret this as adding 2 to the covariate value and predicting
the change in taste.

3. The sat data set in the faraway package gives data collected to study the
relationship between expenditures on public education and test results.

a) Fit a model that with total SAT score as the response and only the
intercept as a covariate.
b) Fit a model with total SAT score as the response and expend, ratio,
and salary as predictors (along with the intercept).
c) Compare the models in parts (a) and (b) using an F-test. Is the
larger model superior?
d) Examine the summary table of the larger model. Does this contradict
your results in part (c)? What might be causing this issue? Create
a graph or summary diagnostics to support your guess.
e) Fit the model with salary and ratio (along with the intercept) as
predictor variables and examine the summary table. Which covari-
ates are significant?
f) Now add takers to the model (so the model now includes three
predictor variables along with the intercept). Test the hypothesis
that 𝛽𝑡𝑎𝑘𝑒𝑟𝑠 = 0 using the summary table.
g) Discuss why ratio was not significant in the model in part (e) but
was significant in part (f). Hint: Look at the Residual Standard Error
𝜎̂ in each model and argue that each t-statistic is some variant of a
“signal-to-noise” ratio and that the “noise” part is reduced in the
second model.

4. In this exercise, we will show that adding a covariate to the model that is
just random noise will decrease the model Sum of Squared Error (SSE).

a) Fit a linear model to the trees dataset that is always preloaded


in R. Recall that this dataset has observations from 31 cherry trees
with variables tree height, girth and volume of lumber produced. Fit
Volume ~ Height.
b) From this simple regression model, obtain the SSE. Hint: you can
calculate this yourself, pull it from the broom::glance() output where
it is entitled deviance or extract it from the output of the anova()
command.
c) Add a new covariate to the model named Noise that is generated at
random from a uniform distribution using the following code:

trees <- trees %>%


mutate( Noise = runif( n() ) )

d) Fit a linear model that includes this new Noise variable in addition
to the Height. Calculate the SSE in the same manner as before.
Does it decrease or increase? Quantify how much it has changed.
3.5. EXERCISES 47

e) Repeat parts (c) and (d) several times. Comment on the trend in
change in SSE. Hint: This isn’t strictly necessary but is how I would
go about answering this question. Wrap parts (c) and (d) in a for
loop and generate a data.frame of a couple hundred runs. Then make
a density plot of the SSE values for the complex models and add a
vertical line on the graph of the simple model SSE.
results <- NULL
for( i in 1:2000 ){
# Do stuff
results <- results %>% rbind( glance(model) )
}
ggplot(results, aes(x=deviance)) +
geom_density() +
geom_vline( xintercept = simple.SSE )
Chapter 4

Contrasts

library(tidyverse) # ggplot2, dplyr, tidyr


library(tidymodels) # for broom functions
library(emmeans) # for emmeans()
# library(multcomp) # for glht() - multcomp fights with dplyr, :(

4.1 Introduction
We have written our model as 𝑦 = 𝑋𝛽 + 𝜖 and often we are interested in some
linear function of the 𝛽.̂ Some examples include the model predictions 𝑦𝑖̂ = 𝑋 𝑖⋅ 𝛽
where 𝑋 𝑖⋅ is the 𝑖𝑡ℎ row of the 𝑋 matrix. Other examples include differences in
group means in a one-way ANOVA or differences in predicted values 𝑦𝑖̂ − 𝑦𝑗̂ .
All of these can be written as $c^T \hat{\beta}$ for some vector $c$.
We often are interested in estimating a function of the parameters 𝛽. For ex-
ample in the offset representation of the ANOVA model with 3 groups we have

𝑦𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖𝑗

where
$$\beta = \left[\begin{array}{ccc} \mu & \tau_2 & \tau_3 \end{array}\right]^T$$
and 𝜇 is the mean of the control group, group one is the control group and thus
𝜏1 = 0, and 𝜏2 and 𝜏3 are the offsets of group two and three from the control
group. In this representation, the mean of group two is 𝜇 + 𝜏2 and is estimated
with 𝜇̂ + 𝜏2̂ .
A contrast is a linear combination of the elements of 𝛽,̂ which is a fancy way of
saying that it is a function of the elements of 𝛽 ̂ where the elements can be


added, subtracted, or multiplied by constants. In particular, the contrast can


be represented by the vector 𝑐 such that the function we are interested in is
𝑐𝑇 𝛽.̂
In the ANOVA case with 𝑘 = 3 where we have the offset representation, I might
be interested in the mean of group 2, which could be written as
$$\hat{\mu}_2 = \hat{\mu} + \hat{\tau}_2 = \underbrace{\left[\begin{array}{ccc} 1 & 1 & 0 \end{array}\right]}_{c^T} \cdot \underbrace{\left[\begin{array}{c} \hat{\mu} \\ \hat{\tau}_2 \\ \hat{\tau}_3 \end{array}\right]}_{\hat{\beta}}$$

Similarly in the simple regression case, I will be interested in the height of the
regression line at 𝑥0 . This height can be written as

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 = \underbrace{\left[\begin{array}{cc} 1 & x_0 \end{array}\right]}_{c^T} \cdot \underbrace{\left[\begin{array}{c} \hat{\beta}_0 \\ \hat{\beta}_1 \end{array}\right]}_{\hat{\beta}}$$

In this manner, we could think of calculating all of the predicted values 𝑦𝑖̂ as
just the result of the contrast 𝑋 𝛽̂ where our design matrix 𝑋 takes the role of
the contrasts.
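This observation is easy to check in R: multiplying the design matrix by the estimated coefficients reproduces the fitted values. A small sketch with the built-in cars data (the model here is illustrative):

```r
# Each row of X acts as a contrast vector, so X %*% beta.hat gives all y.hat
m    <- lm(dist ~ speed, data = cars)
X    <- model.matrix(m)                  # n x p design matrix
yhat <- as.vector(X %*% coef(m))         # c^T beta.hat, one row at a time
all.equal(yhat, as.vector(fitted(m)))    # TRUE
```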

4.2 Estimate and variance


One of the properties of maximum likelihood estimators (MLEs) is that they are
invariant under transformations. Meaning that since 𝛽̂ is the MLE of 𝛽, then
𝑐𝑇 𝛽̂ is the MLE of 𝑐𝑇 𝛽. The only thing we need to perform hypothesis tests
and create confidence intervals is an estimate of the variance of 𝑐𝑇 𝛽.̂
Because we know the variance of 𝛽̂ is

$$Var\left(\hat{\beta}\right) = \sigma^2 \left(X^T X\right)^{-1}$$

and because 𝑐 is a constant, then

$$Var\left(c^T \hat{\beta}\right) = \sigma^2\, c^T \left(X^T X\right)^{-1} c$$

and the standard error is found by plugging in our estimate of 𝜎2 and taking
the square root.

$$StdErr\left(c^T \hat{\beta}\right) = \sqrt{\hat{\sigma}^2\, c^T \left(X^T X\right)^{-1} c} = \hat{\sigma} \sqrt{c^T \left(X^T X\right)^{-1} c}$$

As usual, we can now calculate confidence intervals for 𝑐𝑇 𝛽̂ using the usual
formula

$$\text{Est} \pm t_{n-p}^{1-\alpha/2}\; StdErr\left(\text{Est}\right)$$

$$c^T \hat{\beta} \pm t_{n-p}^{1-\alpha/2}\; \hat{\sigma} \sqrt{c^T \left(X^T X\right)^{-1} c}$$

Recall the hostility example which was an ANOVA with three groups with the
data

Method Test Scores


1 96 79 91 85 83 91 82 87
2 77 76 74 73 78 71 80
3 66 73 69 66 77 73 71 70 74

We have analyzed this data using both the cell means model and the offset
representation, and we will demonstrate how to calculate the group means from
the offset representation. Thus we are interested in estimating 𝜇 + 𝜏2 and
𝜇 + 𝜏3 . I am also interested in estimating the difference between treatments 2
and 3 and will therefore estimate 𝜏2 − 𝜏3 .

data <- data.frame(


y = c(96,79,91,85,83,91,82,87,
77,76,74,73,78,71,80,
66,73,69,66,77,73,71,70,74),
group = factor(c( rep('Group1',8), rep('Group2',7),rep('Group3',9) ))
)

ggplot(data, aes(x=group, y=y)) + geom_boxplot()

[Figure: side-by-side boxplots of the test scores y by group; Group1 scores are highest and Group3 scores are lowest.]

We can fit the offset model and obtain the design matrix and estimate of 𝜎̂ via
the following code.

m <- lm(y ~ group, data=data) # Fit the ANOVA model (offset representation)
tidy(m) # Show me beta.hat

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 86.8 1.52 57.1 1.57e-24
## 2 groupGroup2 -11.2 2.22 -5.03 5.58e- 5
## 3 groupGroup3 -15.8 2.09 -7.55 2.06e- 7

X <- model.matrix(m) # obtains the design matrix


sigma.hat <- glance(m) %>% pull(sigma) # grab sigma.hat
beta.hat <- tidy(m) %>% pull(estimate)
XtX.inv <- solve( t(X) %*% X )

Now we calculate

contr <- c(1,1,0) # define my contrast


ctb <- t(contr) %*% beta.hat
std.err <- sigma.hat * sqrt( t(contr) %*% XtX.inv %*% contr )

data.frame(Estimate=ctb, StdErr=std.err)

## Estimate StdErr
## 1 75.57143 1.622994

and notice this is the exact same estimate and standard error we got for group
two when we fit the cell means model.

CellMeansModel <- lm(y ~ group - 1, data=data)


CellMeansModel %>% tidy()

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 groupGroup1 86.7 1.52 57.1 1.57e-24
## 2 groupGroup2 75.6 1.62 46.6 1.12e-22
## 3 groupGroup3 71 1.43 49.6 2.99e-23

4.3 Estimating contrasts using glht()


Instead of us doing all the matrix calculations ourselves, all we really need to do
is specify the row vector 𝑐𝑇 . The function that will do the rest of the calculations

is the generalized linear hypothesis test function glht() that can be found in
the multiple comparisons package multcomp. The p-values will be adjusted
to correct for testing multiple hypothesis, so there may be slight differences
compared to the p-value seen in just the regular summary table.

4.3.1 1-way ANOVA


We will again use the Hostility data set and demonstrate how to calculate the
point estimates, standard errors and confidence intervals for the group means
given a model fit using the offset representation.

m <- lm(y ~ group, data=data)


m %>% tidy()

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 86.8 1.52 57.1 1.57e-24
## 2 groupGroup2 -11.2 2.22 -5.03 5.58e- 5
## 3 groupGroup3 -15.8 2.09 -7.55 2.06e- 7

We will now define a row vector (it needs to be a matrix or else glht() will
throw an error). First we note that the simple contrast 𝑐𝑇 = [1 0 0] just grabs
the first coefficient and gives us the same estimate and standard error as the
summary did.

contr <- rbind("Intercept"=c(1,0,0)) # 1x3 matrix with row named "Intercept"


test <- multcomp::glht(m, linfct=contr) # the linear function to be tested is contr
summary(test)

##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = y ~ group, data = data)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Intercept == 0 86.750 1.518 57.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Next we calculate the estimate of all the group means 𝜇, 𝜇 + 𝜏2 and 𝜇 + 𝜏3


and the difference between group 2 and 3. Notice I can specify more than one
contrast at a time.

contr <- rbind("Mean of Group 1"=c(1,0,0),


"Mean of Group 2"=c(1,1,0),
"Mean of Group 3"=c(1,0,1),
"Diff G2-G3" =c(0,1,-1))
test <- multcomp::glht(m, linfct=contr)
summary(test)

##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = y ~ group, data = data)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Mean of Group 1 == 0 86.750 1.518 57.141 <0.001 ***
## Mean of Group 2 == 0 75.571 1.623 46.563 <0.001 ***
## Mean of Group 3 == 0 71.000 1.431 49.604 <0.001 ***
## Diff G2-G3 == 0 4.571 2.164 2.112 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Finally we calculate confidence intervals in the usual manner using the


confint() function.

confint(test, level=0.95)

##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = y ~ group, data = data)
##
## Quantile = 2.6461
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Mean of Group 1 == 0 86.7500 82.7328 90.7672
## Mean of Group 2 == 0 75.5714 71.2769 79.8660
## Mean of Group 3 == 0 71.0000 67.2126 74.7874
## Diff G2-G3 == 0 4.5714 -1.1546 10.2975

4.4 Using emmeans Package

Specifying the contrasts by hand is difficult to do correctly, and instead we would
prefer to specify the contrasts using language like "create all possible pairwise
contrasts," where each pair is just a subtraction. The R package emmeans tries
to simplify the creation of common contrasts.

To show how to use the emmeans package, we'll consider a number of common
models and show how to address common statistical questions for each.

4.4.1 Simple Regression

There is a dataset built into R named trees which describes a set of 𝑛 = 31
cherry trees, and the goal is to predict the volume of timber produced by each
tree using just the tree girth measured 4.5 feet above the ground.

data(trees)
model <- lm( Volume ~ Girth, data=trees )
trees <- trees %>%
dplyr::select( -matches('fit'), -matches('lwr'), -matches('upr') ) %>%
cbind( predict(model, interval='conf'))
ggplot(trees, aes(x=Girth, y=Volume)) +
geom_point() +
geom_ribbon( aes(ymin=lwr, ymax=upr), alpha=.3 ) +
geom_line( aes(y=fit) )

(Figure: scatterplot of Volume vs. Girth with the fitted regression line and its confidence band.)

Using the summary() function, we can test hypotheses about whether the y-intercept
or slope could be equal to zero, but we might also be interested in confidence
intervals for the regression line at girth values of 10 and 12.

# We could find the regression line heights and CI using


# either predict() or emmeans()
predict(model, newdata=data.frame(Girth=c(10,12)), interval='conf' )

## fit lwr upr


## 1 13.71511 11.44781 15.98240
## 2 23.84682 22.16204 25.53159

emmeans(model, specs = ~Girth, at=list(Girth=c(10,12)) )

## Girth emmean SE df lower.CL upper.CL


## 10 13.7 1.109 29 11.4 16.0
## 12 23.8 0.824 29 22.2 25.5
##
## Confidence level used: 0.95

The emmeans() function requires us to specify the grid of reference points we
are interested in, as well as which variable or variables we wish to separate out.
In the simple regression case, the specs argument is just the single covariate.

We might next ask whether the predicted volume of a tree with a 10 inch girth
is statistically different from that of a tree with a 12 inch girth. In other words,
we want to test
𝐻0 ∶ (𝛽0 + 𝛽1 ⋅ 10) − (𝛽0 + 𝛽1 ⋅ 12) = 0
𝐻𝑎 ∶ (𝛽0 + 𝛽1 ⋅ 10) − (𝛽0 + 𝛽1 ⋅ 12) ≠ 0

In this case, we want to look at all possible pairwise differences between the
predicted values at 10 and 12. (The revpairwise specification just reverses the
order in which we do the subtraction.)

emmeans(model, specs = revpairwise~Girth,


at=list(Girth=c(10,12)) )

## $emmeans
## Girth emmean SE df lower.CL upper.CL
## 10 13.7 1.109 29 11.4 16.0
## 12 23.8 0.824 29 22.2 25.5
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 12 - 10 10.1 0.495 29 20.478 <.0001

Notice that if I was interested in 3 points, we would get all of the differences.

emmeans(model, specs = pairwise~Girth,


at=list(Girth=c(10,11,12)) )

## $emmeans
## Girth emmean SE df lower.CL upper.CL
## 10 13.7 1.109 29 11.4 16.0
## 11 18.8 0.945 29 16.8 20.7
## 12 23.8 0.824 29 22.2 25.5
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 10 - 11 -5.07 0.247 29 -20.478 <.0001
## 10 - 12 -10.13 0.495 29 -20.478 <.0001
## 11 - 12 -5.07 0.247 29 -20.478 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates

In this very simple case, the slope parameter is easily available as a parameter
value, but we could use the emtrends() function to obtain the slope.

emtrends( model, ~Girth, 'Girth' )

## Girth Girth.trend SE df lower.CL upper.CL


## 13.2 5.07 0.247 29 4.56 5.57
##
## Confidence level used: 0.95

This output is a bit mysterious because of the 13.2 component. What has
happened is that emtrends() is reporting the slope of the line at a particular
point on the x-axis (the mean of all the girth values). While this doesn't matter
in this example, because the slope is the same at all values of girth, it would
matter if we had fit a quadratic model.

model <- lm( Volume ~ poly(Girth, 2), data=trees ) # Girth + Girth^2


trees <- trees %>%
dplyr::select(Volume, Girth) %>%
cbind(predict(model, interval='conf'))

ggplot(trees, aes(x=Girth, y=Volume)) +


geom_point() +

geom_ribbon( aes(ymin=lwr, ymax=upr), alpha=.3 ) +


geom_line( aes(y=fit) )

(Figure: scatterplot of Volume vs. Girth with the fitted quadratic regression curve and its confidence band.)

emtrends( model, ~ poly(Girth,2), 'Girth',


at=list(Girth=c(10,11,12)) )

## Girth Girth.trend SE df lower.CL upper.CL


## 10 3.00 0.510 28 1.96 4.05
## 11 3.51 0.405 28 2.68 4.34
## 12 4.02 0.308 28 3.39 4.65
##
## Confidence level used: 0.95

4.4.2 1-way ANOVA

To consider the pairwise contrasts between different levels we will consider the
college student hostility data again. A clinical psychologist wished to compare
three methods for reducing hostility levels in university students, and used a
certain test (HLT) to measure the degree of hostility. A high score on the
test indicated great hostility. The psychologist used 24 students who obtained
high and nearly equal scores in the experiment. Eight subjects were selected
at random from among the 24 problem cases and were treated with method 1,
seven of the remaining 16 students were selected at random and treated with
method 2 while the remaining nine students were treated with method 3. All
treatments were continued for a one-semester period. Each student was given
the HLT test at the end of the semester, with the results shown in the following
table.

Hostility <- data.frame(


HLT = c(96,79,91,85,83,91,82,87,
77,76,74,73,78,71,80,
66,73,69,66,77,73,71,70,74),
Method = c( rep('M1',8), rep('M2',7), rep('M3',9) ) )

ggplot(Hostility, aes(x=Method, y=HLT)) +


geom_boxplot()

(Figure: boxplots of HLT score by Method, levels M1, M2, M3.)

To use emmeans(), we will again use the pairwise specification, where we state
that we want all the pairwise contrasts between Method levels.

model <- lm( HLT ~ Method, data=Hostility )


emmeans(model, specs = pairwise~Method,
infer=c(TRUE,TRUE) ) # Print out Confidence Intervals and Pvalues

## $emmeans
## Method emmean SE df lower.CL upper.CL t.ratio p.value
## M1 86.8 1.52 21 83.6 89.9 57.141 <.0001
## M2 75.6 1.62 21 72.2 78.9 46.563 <.0001
## M3 71.0 1.43 21 68.0 74.0 49.604 <.0001
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## M1 - M2 11.18 2.22 21 5.577 16.8 5.030 0.0002
## M1 - M3 15.75 2.09 21 10.491 21.0 7.548 <.0001
## M2 - M3 4.57 2.16 21 -0.883 10.0 2.112 0.1114
##
## Confidence level used: 0.95

## Conf-level adjustment: tukey method for comparing a family of 3 estimates


## P value adjustment: tukey method for comparing a family of 3 estimates

By default, emmeans() and emtrends() don't provide p-values for testing whether
the effect could be equal to zero. To get those, you'll need to set
the inference option to return both confidence intervals and p-values using
infer=c(TRUE,TRUE), where the two logical values control the inclusion of the
confidence interval and the p-value, respectively.
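As a small sketch of what these two flags do, using the hostility model fit above (the infer argument is part of the emmeans summary interface):

```r
# Both flags TRUE: confidence intervals AND p-values
emmeans(model, specs = ~Method, infer=c(TRUE,TRUE))

# Only the first flag TRUE: confidence intervals, no p-values
emmeans(model, specs = ~Method, infer=c(TRUE,FALSE))
```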

4.5 Exercises
1. The American Community Survey is an ongoing survey conducted
monthly by the US Census Bureau, and the package Lock5Data has a
dataset called EmployedACS that contains 431 randomly selected anonymous
US residents from this survey.

data('EmployedACS', package='Lock5Data')
?Lock5Data::EmployedACS

a) Create a boxplot of the respondents’ Race and Income.


b) Fit a 1-way ANOVA on these data.
c) Use multcomp::glht() to calculate all the pairwise differences be-
tween the race categories.
d) Use emmeans() to calculate all the pairwise differences between the
race categories.

2. We will examine a data set from Ashton et al. (2007) that relates the
length of a tortoise’s carapace to the number of eggs laid in a clutch. The
data are

Eggs <- data.frame(


carapace = c(284,290,290,290,298,299,302,306,306,
309,310,311,317,317,320,323,334,334),
clutch.size = c(3,2,7,7,11,12,10,8,8,
9,10,13,7,9,6,13,2,8))

a) Plot the data with carapace as the explanatory variable and clutch
size as the response.
b) We want to fit the model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \quad \text{where } \epsilon_i \overset{iid}{\sim} N\left(0, \sigma^2\right)$$
To fit this model, we need the design matrix with a column of ones,
a column of $x_i$ values and a column of $x_i^2$ values. We could
create a new column of squared values, or do the squaring inside the formula.

Eggs <- Eggs %>% mutate( carapace.2 = carapace^2 )

model <- lm(clutch.size ~ 1 + carapace + carapace.2, data=Eggs) # one version
model <- lm(clutch.size ~ 1 + carapace + I(carapace^2), data=Eggs) # do the squaring inside the formula
model <- lm(clutch.size ~ 1 + poly(carapace, degree=2, raw=TRUE), data=Eggs) # using poly()

c) Use the predict() function to calculate the regression predictions


and add those predictions to the Eggs dataset.
d) Graph the data with the regression curve.
e) Use the emmeans() function to estimate the clutch sizes for tortoises
with carapace sizes of 300 and 320. Provide a confidence interval for
the estimates and a p-value for the hypothesis test that the value
could be zero.
f) Use the emmeans() function to estimate the difference between the clutch
size predictions for tortoises with carapace sizes of 300 and 320. Provide
a confidence interval for the estimate and a p-value for the
hypothesis test that the difference could be zero.
g) Use the emtrends() function to find the estimated slope at carapace
sizes of 300 and 320. Provide a confidence interval for the estimates
and a p-value for the hypothesis test that the value could be zero.
h) Use the emtrends() function to estimate the differences between
slopes at carapace sizes 300 and 320. Provide a confidence interval for
the estimates and a p-value for the hypothesis test that the difference
could be zero.
Statistical Models

Chapter 5

Analysis of Covariance
(ANCOVA)

library(tidyverse) # ggplot2, tidyr, dplyr


library(emmeans)

5.1 Introduction
One way that we could extend the ANOVA and regression models is to have
both categorical and continuous predictor variables. For historical reasons going
back to pre “computer in your pocket” days, statisticians call this the Analysis
of Covariance (ANCOVA) model. Because it is just another example of a 𝑦 =
𝑋𝛽 +𝜖 linear model, I prefer to think of it as simply having both continuous and
categorical variables in my model. None of the cookbook calculations change,
but the interpretation of the parameters gets much more interesting.
The dataset teengamb in the package faraway has data regarding the rates
of gambling among teenagers in Britain and their gender and socioeconomic
status. One question we might be interested in is how gender and income relate
to how much a person gambles. But what should the effect of gender be?
There are two possible ways that gender could enter the model. Either:

1. We could fit two lines to the data, one for males and one for females, but
require that the lines be parallel (i.e. having the same slope for income).
This is accomplished by having a separate y-intercept for each gender. In
effect, the line for the females would be offset by a constant amount from
the male line.


2. We could fit two lines but allow the slopes to differ as well as the
y-intercepts. This is referred to as an "interaction" between income and
gender. One way to remember that this is an interaction is that the
effect of income on gambling rate depends on the gender of the individual.
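As a preview of the fitting commands developed below, the two options correspond to an additive and an interaction formula in R (a sketch using the teengamb data):

```r
data('teengamb', package='faraway')
teengamb <- teengamb %>% mutate( sex = ifelse(sex==1, 'Female', 'Male') )

m.additive    <- lm( gamble ~ sex + income, data=teengamb ) # parallel lines
m.interaction <- lm( gamble ~ sex * income, data=teengamb ) # differing slopes
anova(m.additive, m.interaction) # F-test: do we need the extra slope term?
```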

(Figure: two panels, "Additive" and "Interaction," showing gamble vs. income by sex; the additive panel has parallel regression lines for Male and Female, while the interaction panel allows the slopes to differ.)

*It should be noted here that the constant variance assumption is being violated
and we really ought to do a transformation. I would recommend performing a
transformation on both the gamble and income covariates, but we'll leave
them as is for now.*

We will now see how to go about fitting these two models. As might be imagined,
these can be fit in the same fashion we have been solving the linear models, but
require a little finesse in defining the appropriate design matrix 𝑋.

5.2 Offset parallel Lines (aka additive models)

In order to get offset parallel lines, we want to write a model

$$y_i = \begin{cases} \beta_0 + \beta_1 + \beta_2\, x_i + \epsilon_i & \text{if female} \\ \beta_0 + \beta_2\, x_i + \epsilon_i & \text{if male} \end{cases}$$

where 𝛽1 is the vertical offset of the female group regression line from the reference
group, the male regression line. Because the first 19 observations are
female, we can write this in matrix form as

$$\begin{bmatrix} y_1 \\ \vdots \\ y_{19} \\ y_{20} \\ \vdots \\ y_{47} \end{bmatrix} = \begin{bmatrix} 1 & 1 & x_1 \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{19} \\ 1 & 0 & x_{20} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{47} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{19} \\ \epsilon_{20} \\ \vdots \\ \epsilon_{47} \end{bmatrix}$$

I like this representation, where 𝛽1 is the offset from the male regression line,
because it makes it very convenient to test whether the offset is equal to zero.
The second column of the design matrix is referred to as a "dummy variable" or
"indicator variable" that codes for the female gender. Notice that even though we
have two genders, we only had to add one additional variable to the model, because
we already had a y-intercept 𝛽0 and we only needed one indicator variable for
females.
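We don't have to build this design matrix by hand; R constructs it from the model formula, and we can inspect it with model.matrix(). A small sketch (note that with the alphabetical factor ordering used here, Female rather than Male ends up as the reference level, so the dummy column is sexMale):

```r
data('teengamb', package='faraway')
teengamb <- teengamb %>% mutate( sex = ifelse(sex==1, 'Female', 'Male') )

m <- lm( gamble ~ sex + income, data=teengamb )
head( model.matrix(m) ) # columns: (Intercept), sexMale indicator, income
```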
What if we had a third group? Then we would add another column of indicator
variables for the third group. The new beta coefficient in the model would be the
offset of the new group from the reference group. For example, consider 𝑛 = 9
observations with 𝑛𝑖 = 3 observations per group, where 𝑦𝑖,𝑗 is the 𝑗 th replication
of the 𝑖th group.

$$\begin{bmatrix} y_{1,1} \\ y_{1,2} \\ y_{1,3} \\ y_{2,1} \\ y_{2,2} \\ y_{2,3} \\ y_{3,1} \\ y_{3,2} \\ y_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & x_{1,1} \\ 1 & 0 & 0 & x_{1,2} \\ 1 & 0 & 0 & x_{1,3} \\ 1 & 1 & 0 & x_{2,1} \\ 1 & 1 & 0 & x_{2,2} \\ 1 & 1 & 0 & x_{2,3} \\ 1 & 0 & 1 & x_{3,1} \\ 1 & 0 & 1 & x_{3,2} \\ 1 & 0 & 1 & x_{3,3} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \epsilon_{2,3} \\ \epsilon_{3,1} \\ \epsilon_{3,2} \\ \epsilon_{3,3} \end{bmatrix}$$

In this model, 𝛽0 is the y-intercept for group 1. The parameter 𝛽1 is the vertical
offset of the second group from the reference group (group 1). Similarly, 𝛽2 is
the vertical offset for group 3. All groups share the same slope, 𝛽3.

5.3 Lines with different slopes (aka Interaction model)

We can now include a categorical covariate and create regression lines that
are parallel, but often that is inappropriate, as in the teenage gambling
dataset. We want to be able to fit a model that has different slopes.
$$y_i = \begin{cases} (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\, x_i + \epsilon_i & \text{if female} \\ \beta_0 + \beta_2\, x_i + \epsilon_i & \text{if male} \end{cases}$$

where 𝛽1 is the offset in y-intercept of the female group from the male group,
and 𝛽3 is the offset in slope. Now our matrix formula looks like

$$\begin{bmatrix} y_1 \\ \vdots \\ y_{19} \\ y_{20} \\ \vdots \\ y_{47} \end{bmatrix} = \begin{bmatrix} 1 & 1 & x_1 & x_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{19} & x_{19} \\ 1 & 0 & x_{20} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{47} & 0 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{19} \\ \epsilon_{20} \\ \vdots \\ \epsilon_{47} \end{bmatrix}$$

where the new fourth column is what we would get if we multiplied the 𝑥 column
element-wise with the dummy-variable column. To fit this model in R we have

data('teengamb', package='faraway')

# Forces R to recognize that 0, 1 are categorical, also


# relabels the levels to something I understand.
teengamb <- teengamb %>% mutate( sex = ifelse( sex==1, 'Female', 'Male') )

# Fit a linear model with the interaction of sex and income


# Interactions can be specified using a colon :
m1 <- lm( gamble ~ 1 + sex + income + sex:income, data=teengamb )
m1 <- lm( gamble ~ sex + income + sex:income, data=teengamb )

# R allows a shortcut for the prior definition


m1 <- lm( gamble ~ sex * income, data=teengamb )

# save the fit, lwr, upr values for each observation


# these are the yhat and CI
# If columns for fit, upr, lwr are already present, remove them
teengamb <- teengamb %>%
dplyr::select( -matches('fit'), -matches('lwr'), -matches('upr') ) %>%
cbind( predict(m1, interval='conf') )

# Make a nice plot that includes the regression line.


ggplot(teengamb, aes(x=income, col=sex, fill=sex)) +
geom_ribbon(aes(ymin=lwr, ymax=upr),
alpha=.3) + # how solid the layer is
geom_point(aes(y=gamble)) +
geom_line(aes(y=fit))

(Figure: gamble vs. income colored by sex, with the fitted interaction-model regression lines and confidence bands.)

# print the model summary


summary(m1)

##
## Call:
## lm(formula = gamble ~ sex * income, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.522 -4.860 -1.790 6.273 93.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1400 9.2492 0.339 0.73590
## sexMale -5.7996 11.2003 -0.518 0.60724
## income 0.1749 1.9034 0.092 0.92721
## sexMale:income 6.3432 2.1446 2.958 0.00502 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.98 on 43 degrees of freedom
## Multiple R-squared: 0.5857, Adjusted R-squared: 0.5568
## F-statistic: 20.26 on 3 and 43 DF, p-value: 2.451e-08

To interpret the terms, we have

Coefficient       Interpretation
(Intercept)       y-intercept for the females
sexMale           the difference in y-intercept for the males
income            slope of the female regression line
sexMale:income    the offset in slope for the males

Looking at the summary, we see the interaction term sexMale:income is
statistically significant, indicating that we prefer the more complicated model
with different slopes for each gender.

To calculate the predicted values at income levels of 5 and 10, we could use
multcomp::glht() and figure out the appropriate contrast vectors, but we'll
use the easy version with emmeans()

emmeans(m1, specs = ~ income * sex,


at=list(income=c(5,10), sex=c('Male','Female')))

## income sex emmean SE df lower.CL upper.CL


## 5 Male 29.93 3.97 43 21.93 37.9
## 10 Male 62.52 6.35 43 49.71 75.3
## 5 Female 4.01 5.08 43 -6.23 14.3
## 10 Female 4.89 12.13 43 -19.58 29.4
##
## Confidence level used: 0.95

If we are interested in the differences, we can just use pairwise in the specs
argument; here we also calculate the differences separately at each income level.

# The pipe in the formula is essentially a group_by


emmeans(m1, specs = pairwise ~ sex | income,
at=list(income=c(5,10),
sex=c('Male','Female')))

## $emmeans
## income = 5:
## sex emmean SE df lower.CL upper.CL
## Male 29.93 3.97 43 21.93 37.9
## Female 4.01 5.08 43 -6.23 14.3
##
## income = 10:
## sex emmean SE df lower.CL upper.CL
## Male 62.52 6.35 43 49.71 75.3
## Female 4.89 12.13 43 -19.58 29.4
##
## Confidence level used: 0.95
##
## $contrasts
## income = 5:
## contrast estimate SE df t.ratio p.value
## Male - Female 25.9 6.44 43 4.022 0.0002
##

## income = 10:
## contrast estimate SE df t.ratio p.value
## Male - Female 57.6 13.69 43 4.208 0.0001

If we want the slopes as well as the difference in slopes, we would use the
emtrends() function.

emtrends(m1, pairwise ~ income * sex, var = "income",


at=list(income=10, sex=c('Male','Female')))

## $emtrends
## income sex income.trend SE df lower.CL upper.CL
## 10 Male 6.518 0.988 43 4.53 8.51
## 10 Female 0.175 1.903 43 -3.66 4.01
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 10 Male - 10 Female 6.34 2.14 43 2.958 0.0050

While we specified that the slope be calculated at income=10, that choice doesn't
matter here because each line's slope is the same at all x-values.

Somewhat less interestingly, we could calculate the average of the Male and
Female slopes.

# when specs doesn't include a variable that was used


# in the model, this will either
# a) average over the missing levels (categorical)
# b) use the average value of the variable (quantitative)
emtrends(m1, specs = ~ income, var = 'income',
at=list(income=10))

## NOTE: Results may be misleading due to involvement in interactions

## income income.trend SE df lower.CL upper.CL


## 10 3.35 1.07 43 1.18 5.51
##
## Results are averaged over the levels of: sex
## Confidence level used: 0.95

5.4 Iris Example


For a second example, we will explore the relationship between sepal length and
sepal width for three species of irises. This data set is available in R as iris.

data(iris) # read in the iris dataset


levels(iris$Species) # notice the order of levels of Species

## [1] "setosa" "versicolor" "virginica"

The very first thing we should do when encountering a dataset is to do some


sort of graphical summary to get an idea of what model seems appropriate.

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +


geom_point()

(Figure: scatterplot of Sepal.Width vs. Sepal.Length colored by Species: setosa, versicolor, virginica.)

Looking at this graph, it seems that I will likely have a model with different
y-intercepts for each species, but it isn’t clear to me if we need different slopes.
We consider the sequence of building successively more complex models:

# make virginica the reference group


iris <- iris %>%
mutate( Species = forcats::fct_relevel(Species, 'virginica') )

m1 <- lm( Sepal.Width ~ Sepal.Length, data=iris ) # One line


m2 <- lm( Sepal.Width ~ Sepal.Length + Species, data=iris ) # Parallel Lines
m3 <- lm( Sepal.Width ~ Sepal.Length * Species, data=iris ) # Non-parallel Lines

The three models we consider are the following:



(Figure: three stacked panels, m1, m2, and m3, showing Sepal.Width vs. Sepal.Length with a single common line, parallel lines per species, and non-parallel lines per species, respectively.)

Looking at these, it seems obvious that the simplest model where we ignore
Species is horrible. The other two models seem decent, and I am not sure about
the parallel lines model vs the differing slopes model.

m1 %>% broom::tidy() %>% mutate_if( is.numeric, round, digits=3 )

## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.42 0.254 13.5 0
## 2 Sepal.Length -0.062 0.043 -1.44 0.152

For the simplest model, there is so much unexplained noise that the slope variable
isn't significant.

Moving on to the next most complicated model, where each species has its
own y-intercept but all share a slope, we have

m2 %>% broom::tidy() %>% mutate_if( is.numeric, round, digits=3 )

## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.669 0.308 2.17 0.031
## 2 Sepal.Length 0.35 0.046 7.56 0
## 3 Speciessetosa 1.01 0.093 10.8 0
## 4 Speciesversicolor 0.024 0.065 0.37 0.712

The first two lines are the y-intercept and slope associated with the reference
group, and the last two lines are the y-intercept offsets from the reference group
to Setosa and Versicolor, respectively. We see that the slope associated with
increasing Sepal Length is significant, that Setosa has a statistically different
y-intercept than the reference group Virginica, and that Versicolor does not have
a statistically different y-intercept than the reference group.
Finally we consider the most complicated model that includes two more slope
parameters

m3 %>% broom::tidy() %>% mutate_if( is.numeric, round, digits=3 )

## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.45 0.405 3.57 0
## 2 Sepal.Length 0.232 0.061 3.79 0
## 3 Speciessetosa -2.02 0.686 -2.94 0.004
## 4 Speciesversicolor -0.574 0.605 -0.95 0.344
## 5 Sepal.Length:Speciessetosa 0.567 0.126 4.49 0
## 6 Sepal.Length:Speciesversicolor 0.088 0.097 0.905 0.367

These parameters are:

Meaning                                R label
Reference group y-intercept            (Intercept)
Reference group slope                  Sepal.Length
Offset to y-intercept for Setosa       Speciessetosa
Offset to y-intercept for Versicolor   Speciesversicolor
Offset to slope for Setosa             Sepal.Length:Speciessetosa
Offset to slope for Versicolor         Sepal.Length:Speciesversicolor

It appears that the slope for Setosa is different from that of the reference group
Virginica. However, because we've added two parameters to the model, testing
model 2 vs model 3 is not equivalent to just looking at the p-value for that one
slope. Instead we need to look at the F-test comparing the two models, which
evaluates whether the decrease in SSE is sufficient to justify the addition of two
parameters.

anova(m2, m3)

## Analysis of Variance Table


##
## Model 1: Sepal.Width ~ Sepal.Length + Species
## Model 2: Sepal.Width ~ Sepal.Length * Species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 146 12.193
## 2 144 10.680 2 1.5132 10.201 7.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-test concludes that there is sufficient decrease in the SSE to justify adding
two additional parameters to the model.

5.5 Exercises
1. In the faraway package, there is a dataset named phbirths that gives
babies' birth weights along with their gestational time in utero and
the mother's smoking status.
a. Load and inspect the dataset using
data('phbirths', package='faraway') # load the data within the package
?faraway::phbirths # Look at the help file

b. Create a plot of the birth weight vs the gestational age. Color code
the points based on the mother’s smoking status. Does it appear
that smoking matters?
c. Fit the simple model (one regression line) along with both the main
effects (parallel lines) and interaction (non-parallel lines) ANCOVA
model to these data. Which model is preferred?
d. Using whichever model you selected in the previous section, create a
graph of the data along with the confidence region for the regression
line(s).
e. Now consider only the "full term babies", which are babies with
gestational age at birth ≥ 36 weeks. With this reduced dataset, repeat
parts c and d.

f. Interpret the effects of gestational length and mother's smoking status
on birth weight.
2. In the faraway package, there is a dataset named clot that gives
information about the time for blood to clot versus the blood dilution
concentration when the blood was diluted with prothrombin-free plasma.
Unfortunately the researchers had to order the plasma in two different lots
(think of these as two different sources) and need to ascertain whether the
lot number makes any difference in clotting time.
a. Log transform the time and conc variable and plot the log-
transformed data with color of the data point indicating the lot
number. (We will discuss why we performed this transformation
later in the course.)
b. Ignoring the slight remaining curvature in the data, perform the ap-
propriate analysis using transformed variables. Does lot matter?
3. In base R, there is a data set ToothGrowth, which contains data from an
experiment giving Vitamin C to guinea pigs. The guinea pigs were given vitamin
C doses either via orange juice or an ascorbic acid tablet. The response of
interest was a measure of tooth growth, where higher growth is better.
a. Log transform the dose and use that throughout this problem. Use 𝑒
as the base, which R does by default when you use the log() function.
(We will discuss why we performed this transformation later in the
course.)
b. Graph the data, fit appropriate ANCOVA models, and describe the
relationship between the delivery method, log(dose) level, and tooth
growth. Produce a graph with the data and the regression line(s)
along with the confidence region for the line(s).
c. Is there a statistically significant difference in slopes between the two
delivery methods?
d. Just using your graphs and visual inspection, at low dose levels, say
log(𝑑𝑜𝑠𝑒) = −0.7, is there a difference in delivery method? What
about at high dose levels, say log(𝑑𝑜𝑠𝑒) = 0.7?
e. Use emmeans() to test if there is a statistically significant difference
at low dose levels log(𝑑𝑜𝑠𝑒) = −0.7. Furthermore, test if there is
a statistically significant difference at high dose levels. Summarize
your findings.
Chapter 6

Two-way ANOVA

# Load my usual packages


library(tidyverse) # ggplot2, dplyr, tidyr
library(ggfortify) # autoplot() for lm objects
library(emmeans) # pairwise contrasts stuff

6.1 Review of 1-way ANOVA

Given a categorical covariate (which I will call a factor) with 𝐼 levels, we are
interested in fitting the model

$$y_{ij} = \mu + \tau_i + \epsilon_{ij}$$

where $\epsilon_{ij} \overset{iid}{\sim} N\left(0, \sigma^2\right)$, 𝜇 is the overall mean, and the 𝜏𝑖 are the offsets of factor
level 𝑖 from 𝜇. Unfortunately this model is not identifiable, because we could
add a constant (say 5) to 𝜇 and subtract that same constant from each of the
𝜏𝑖 values, and the group means 𝜇 + 𝜏𝑖 would not change. There are two easy
restrictions we could impose to make the model identifiable:

1. Set 𝜇 = 0. In this case, 𝜏𝑖 represents the expected value of an observation


in group level 𝑖. We call this the “cell means” representation.

2. Set 𝜏1 = 0. Then 𝜇 represents the expected value of treatment 1, and the


𝜏𝑖 values will represent the offsets from group 1. The group or level that
we set to be zero is then referred to as the reference group. We can call
this the “offset from reference” model.


We will be interested in testing the null and alternative hypotheses

$$H_0: y_{ij} = \mu + \epsilon_{ij} \qquad\qquad H_a: y_{ij} = \mu + \tau_i + \epsilon_{ij}$$

6.1.1 An Example

We look at a dataset that comes from the study of blood coagulation times:
24 animals were randomly assigned to four different diets and the samples were
taken in a random order. The diets are denoted 𝐴, 𝐵, 𝐶, and 𝐷, and the
response of interest is the amount of time it takes for the blood to coagulate.

data('coagulation', package='faraway')
ggplot(coagulation, aes(x=diet, y=coag)) +
geom_boxplot() +
labs( x='Diet', y='Coagulation Time' )

(Figure: boxplots of Coagulation Time by Diet for diets A, B, C, and D.)

Just by looking at the graph, we expect to see that diets 𝐴 and 𝐷 are similar
while 𝐵 and 𝐶 are different from 𝐴 and 𝐷 and possibly from each other, too.
We first fit the offset model.

m <- lm(coag ~ diet, data=coagulation)


summary(m)

##
## Call:
## lm(formula = coag ~ diet, data = coagulation)
##
## Residuals:
## Min 1Q Median 3Q Max

## -5.00 -1.25 0.00 1.25 5.00


##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.100e+01 1.183e+00 51.554 < 2e-16 ***
## dietB 5.000e+00 1.528e+00 3.273 0.003803 **
## dietC 7.000e+00 1.528e+00 4.583 0.000181 ***
## dietD 2.991e-15 1.449e+00 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.366 on 20 degrees of freedom
## Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212
## F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05

Notice that diet 𝐴 is the reference level and it has a mean of 61. Diet 𝐵 has an
offset from 𝐴 of 5, etc. From the very small p-value of the F-statistic, we conclude
that the simple model
$$y_{ij} = \mu + \epsilon_{ij}$$
is not sufficient to describe the data.

6.1.2 Degrees of Freedom


Throughout the previous example, the degrees of freedom that are reported
keep changing depending on which models we are comparing. The simple model
we are considering is
𝑦𝑖𝑗 = 𝜇 + 𝜖𝑖𝑗
which has 1 parameter that defines the expected value, versus
𝑦𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝜖𝑖𝑗
where there really are only 4 parameters that define the expected value because
𝜏1 = 0. In general, the larger model adds only 𝐼 − 1 terms to the model,
where 𝐼 is the number of levels of the factor of interest.

6.1.3 Pairwise Comparisons


After detecting differences in the factor levels, we are often interested in which
factor levels are different from which. Often we are interested in comparing the
mean of level 𝑖 with the mean of level 𝑗.
As usual we let the vector of parameter estimates be $\hat{\beta}$; then the
contrast of interest has the confidence interval

$$c^T \hat{\beta} \,\pm\, t^{1-\alpha/2}_{n-p}\; \hat{\sigma}\, \sqrt{c^T \left(X^T X\right)^{-1} c}$$
80 CHAPTER 6. TWO-WAY ANOVA

for some vector 𝑐.


Unfortunately this interval does not take into account the multiple comparisons
issue (i.e. we are making 𝐼(𝐼 − 1)/2 contrasts if our factor has 𝐼 levels). To
account for this, we will not use a quantile from a t-distribution, but instead a
quantile from Tukey's studentized range distribution, divided by $\sqrt{2}$. The
intervals we will use are:

$$c^T \hat{\beta} \,\pm\, \frac{q^{1-\alpha}_{I,\,n-I}}{\sqrt{2}}\; \hat{\sigma}\, \sqrt{c^T \left(X^T X\right)^{-1} c}$$
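As a quick sketch of how much wider the Tukey intervals are, we can compare the two quantiles in base R for the coagulation setting (n = 24 observations, I = 4 diets, so n − I = 20 error degrees of freedom; α = 0.05 is an assumed level).

```r
n <- 24; I <- 4; df.error <- n - I

t.quant     <- qt(1 - 0.05/2, df = df.error)                  # ordinary t quantile
tukey.quant <- qtukey(1 - 0.05, nmeans = I, df = df.error) / sqrt(2)  # Tukey HSD multiplier

c(t = t.quant, tukey = tukey.quant)
```

The Tukey multiplier is larger than the t quantile, so the honest simultaneous intervals are wider than the unadjusted ones.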

There are several ways to make R calculate this interval, but the easiest is to
use the emmeans package. This package computes the above intervals which are
commonly known as Tukey’s Honestly Significant Differences.

m <- lm(coag ~ diet, data=coagulation) # use the lm() function as usual


emmeans(m, specs= pairwise~diet) %>%
summary(level=0.90)

## $emmeans
## diet emmean SE df lower.CL upper.CL
## A 61 1.183 20 59.0 63.0
## B 66 0.966 20 64.3 67.7
## C 68 0.966 20 66.3 69.7
## D 61 0.837 20 59.6 62.4
##
## Confidence level used: 0.9
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -5 1.53 20 -3.273 0.0183
## A - C -7 1.53 20 -4.583 0.0010
## A - D 0 1.45 20 0.000 1.0000
## B - C -2 1.37 20 -1.464 0.4766
## B - D 5 1.28 20 3.912 0.0044
## C - D 7 1.28 20 5.477 0.0001
##
## P value adjustment: tukey method for comparing a family of 4 estimates

Here we see that diets 𝐴 and 𝐷 are similar to each other but different from 𝐵
and 𝐶, and that 𝐵 and 𝐶 are not statistically different from each other at the
0.10 level.
Often I want to produce the “Compact Letter Display” which identifies which
groups are significantly different. Unfortunately this leads to a somewhat binary
decision of “is statistically significant” or “is NOT statistically significant”, but
we should at least know how to do this calculation.

LetterData <- emmeans(m, specs= ~ diet) %>% # cld() will freak out if you have pairwise here...
multcomp::cld(Letters=letters, level=0.95) %>%
mutate(.group = str_remove_all(.group, '\\s') ) %>% # remove the spaces
mutate( y = 73 ) # height to place the letters at.
LetterData

## diet emmean SE df lower.CL upper.CL .group y


## A 61 1.183 20 58.5 63.5 a 73
## D 61 0.837 20 59.3 62.7 a 73
## B 66 0.966 20 64.0 68.0 b 73
## C 68 0.966 20 66.0 70.0 b 73
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05

I can easily add these to my boxplot with the following:

ggplot(coagulation, aes(x=diet, y=coag)) +


geom_boxplot() +
labs( x='Diet', y='Coagulation Time' ) +
geom_text( data=LetterData, aes(x=diet, y=y, label=.group), size=8 )

(Figure: boxplots of Coagulation Time by Diet with the compact letter display above
each box: A "a", B "b", C "b", D "a".)

6.2 Two-Way ANOVA


Suppose we have a response that is predicted by two different categorical variables.
We denote the effects of the first factor as 𝛼𝑖, where that factor has 𝐼 levels, and
the effects of the second factor as 𝛽𝑗, with 𝐽 levels. As usual we let
$\epsilon_{ijk} \stackrel{iid}{\sim} N(0, \sigma^2)$, and we wish to fit the model

𝑦𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + 𝜖𝑖𝑗𝑘

which has the main effects of each covariate or possibly the model with the
interaction
𝑦𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 + 𝜖𝑖𝑗𝑘

To consider what an interaction term might mean consider the role of temper-
ature and humidity on the amount of fungal growth. You might expect to see
data similar to this (where the numbers represent some sort of measure of fungal
growth):

| Temperature | 5% | 30% | 60% | 90% |
|-------------|----|-----|-----|-----|
| 2C          | 2  | 4   | 8   | 16  |
| 10C         | 3  | 9   | 27  | 81  |
| 30C         | 4  | 16  | 64  | 256 |

(Columns are humidity levels.)

In this case we see that increased humidity increases the amount of fungal
growth, but the amount of increase depends on the temperature. At 2 C, the
increases with humidity are modest, but at 10 C the increases are larger, and at
30 C the increases are larger yet. The effect of changing from one
humidity level to the next depends on which temperature level we are at. This
change in the effect of humidity is an interaction effect. A memorable example is
that chocolate by itself is good. Strawberries by themselves are also good. But
the combination of chocolate and strawberries is a delight greater than the sum
of the individual treats.

We can graph Humidity and Temperature versus the response and see that the
effect of increasing humidity changes with the temperature level. Just as the
interaction in the ANCOVA model manifested itself as non-parallel slopes, here
the interaction manifests itself as non-parallel line segments when we connect
the dots across the factor levels.

(Figure: interaction plot of Growth versus Humidity, one line per Temperature
level: 2C, 10C, and 30C.)

Unfortunately the presence of a significant interaction term in the model makes
interpretation difficult, but examining the interaction plots can be quite helpful
in understanding the effects. Notice in this example, we have 3 levels of temperature
and 4 levels of humidity, for a total of 12 different possible treatment combinations.
In general we will refer to these combinations as cells.

6.3 Orthogonality
When designing an experiment, I want to make sure that none of my covariates
are confounded with each other, and I'd also like for them to not be correlated.
Consider the following four experimental designs, where the number in each
bin is the number of subjects of that type. I am interested in testing two different
drugs, A and B, and studying their effect on heart disease within the gender groups.

| Design 1    | Males | Females |
|-------------|-------|---------|
| Treatment A | 0     | 10      |
| Treatment B | 6     | 0       |

| Design 2    | Males | Females |
|-------------|-------|---------|
| Treatment A | 1     | 9       |
| Treatment B | 5     | 1       |

| Design 3    | Males | Females |
|-------------|-------|---------|
| Treatment A | 3     | 5       |
| Treatment B | 3     | 5       |

| Design 4    | Males | Females |
|-------------|-------|---------|
| Treatment A | 4     | 4       |
| Treatment B | 4     | 4       |

1. This design is very bad. Because we have no males taking drug A and no
   females taking drug B, we can't say whether any observed differences are due
   to the effect of drug A versus drug B, or to gender. When this situation
   happens, we say that the gender effect is confounded with the drug effect.
2. This design is not much better. Because we only have one observation in
   the Male-Drug A group, any inference we make about the effect of drug A
   on males is based on one observation. In general that is a bad idea.
3. Design 3 is better than the previous two because it evenly distributes the
   males and females among the two drug categories. However, it seems
   wasteful to have more females than males: when estimating the average of
   the male groups we only have 6 observations, while we have 10 for the females.
4. This is the ideal design, with equal numbers of observations in each gender-
   drug group.

Designs 3 and 4 are good because the correlation among the predictors is 0. In
design 1, the drug covariate is perfectly correlated with the gender covariate. The
correlation is smaller in design 2, and is zero in designs 3 and 4. We could show
this by calculating the design matrix for each design and computing the correlation
coefficients between each pair of columns.

Having an orthogonal design with equal numbers of observations in each group
has many nice ramifications. Most importantly, with an orthogonal design, the
interpretation of a parameter does not depend on what other factors are in the
model. Balanced designs are also usually optimal in the sense that the variances
of 𝛽̂ are as small as possible given the number of observations we have (barring
any other a priori information).
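We can verify the correlation claim numerically. This sketch encodes designs 1 and 4 from the tables above with 0/1 indicator columns standing in for the design-matrix columns (the subject ordering is an arbitrary choice for illustration).

```r
# Design 1: 10 females on treatment A, 6 males on treatment B
design1 <- data.frame(
  male = c(rep(0, 10), rep(1, 6)),
  trtB = c(rep(0, 10), rep(1, 6))
)
# Design 4: 4 subjects in every gender-by-treatment cell
design4 <- data.frame(
  male = rep(c(0, 1), each = 8),
  trtB = rep(c(0, 1, 0, 1), each = 4)
)

cor(design1$male, design1$trtB)  # 1: treatment is perfectly confounded with gender
cor(design4$male, design4$trtB)  # 0: the balanced design is orthogonal
```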

6.4 Main Effects Model


In the one-factor ANOVA case, the additional degrees of freedom used by adding
a factor with 𝐼 levels was 𝐼 − 1. When we consider two factors, with the first
factor having 𝐼 levels and the second factor having 𝐽 levels, the model

𝑦𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + 𝜖𝑖𝑗𝑘

adds (𝐼 − 1) + (𝐽 − 1) parameters to the model because 𝛼1 = 𝛽1 = 0.

• The intercept term, 𝜇 is the reference point for all the other parameters.
This is the expected value for an observation in the first level of factor 1
and the first level of factor two.
• 𝛼𝑖 is the amount you expect the response to increase when changing from
factor 1 level 1, to factor 1 level i (while the second factor is held constant).
• 𝛽𝑗 is the amount you expect the response to increase when changing from
factor 2 level 1 to factor 2 level j (while the first factor is held constant).

Referring back to the fungus example, let the 𝛼𝑖 values be associated with
changes in humidity and 𝛽𝑗 values be associated with changes in temperature
levels. Then the expected value of each treatment combination is

| Temperature | 5%          | 30%           | 60%           | 90%           |
|-------------|-------------|---------------|---------------|---------------|
| 2C          | 𝜇 + 0 + 0  | 𝜇 + 𝛼2 + 0   | 𝜇 + 𝛼3 + 0   | 𝜇 + 𝛼4 + 0   |
| 10C         | 𝜇 + 0 + 𝛽2 | 𝜇 + 𝛼2 + 𝛽2 | 𝜇 + 𝛼3 + 𝛽2 | 𝜇 + 𝛼4 + 𝛽2 |
| 30C         | 𝜇 + 0 + 𝛽3 | 𝜇 + 𝛼2 + 𝛽3 | 𝜇 + 𝛼3 + 𝛽3 | 𝜇 + 𝛼4 + 𝛽3 |

6.4.1 Example - Fruit Trees


An experiment was conducted to determine the effects of four different pesticides
on the yield of fruit from three different varieties of a citrus tree. Eight trees
of each variety were randomly selected from an orchard. The four pesticides
were randomly assigned to two trees of each variety and applications were made
according to recommended levels. Yields of fruit (in bushels) were obtained
after the test period.
Critically, notice that we have an equal number of observations for each treatment
combination.
# Typing the data in by hand because I got this example from a really old text book...
Pesticide <- factor(c('A','B','C','D'))
Variety <- factor(c('1','2','3'))
fruit <- data.frame( expand.grid(rep=1:2, Pest=Pesticide, Var=Variety) )
fruit$Yield <- c(49,39,50,55,43,38,53,48,55,41,67,58,53,42,85,73,66,68,85,92,69,62,85,99)

The first thing to do (as always) is to look at our data

ggplot(fruit, aes(x=Pest, color=Var, y=Yield, shape=Var)) +


geom_point(size=5)

(Figure: scatterplot of Yield by Pesticide (A through D), with color and shape
indicating Variety 1, 2, or 3.)

The first thing we notice is that pesticides B and D seem to be better than
the others and that variety 3 seems to be the best producer. The effect of
pesticide treatment seems consistent between varieties, so we don’t expect that
the interaction effect will be significant.

m1 <- lm(Yield ~ Var, data=fruit)


m2 <- lm(Yield ~ Pest, data=fruit)
m3 <- lm(Yield ~ Var + Pest, data=fruit)
summary(m1)$coef %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 46.875 4.359 10.754 0.000
## Var2 12.375 6.164 2.008 0.058
## Var3 31.375 6.164 5.090 0.000

summary(m2)$coef %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 53.000 6.429 8.243 0.000
## PestB 14.833 9.093 1.631 0.118
## PestC -1.833 9.093 -0.202 0.842
## PestD 20.833 9.093 2.291 0.033

summary(m3)$coef %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 38.417 3.660 10.497 0.000
## Var2 12.375 3.660 3.381 0.003
## Var3 31.375 3.660 8.573 0.000
## PestB 14.833 4.226 3.510 0.003
## PestC -1.833 4.226 -0.434 0.670
## PestD 20.833 4.226 4.930 0.000

Notice that the effects for Variety and Pesticide are the same whether or not the
other is in the model. This is due to the orthogonal design of the experiment
and makes it much easier to interpret the main effects of Variety and Pesticide.

6.4.2 ANOVA Table

Most statistical software will produce an analysis of variance table when fitting
a two-way ANOVA. This table is very similar to the analysis of variance table
we have seen in the one-way ANOVA, but has several rows which correspond to
the additional factors added to the model.

Consider the two-way ANOVA with factors 𝐴 and 𝐵 which have 𝐼 and 𝐽 discrete
levels respectively. For convenience let 𝑅𝑆𝑆1 be the residual sum of squares of
the intercept-only model, 𝑅𝑆𝑆𝐴 be the residual sum of squares for the model
with just the main effect of factor 𝐴, and 𝑅𝑆𝑆𝐴+𝐵 be the residual sum of
squares of the model with both main effects. Finally assume that we have a
total of 𝑛 observations. The ANOVA table for this model is as follows:

| Source | df                 | Sum of Sq (SS)               | Mean Sq                  | F                 | p-value                     |
|--------|--------------------|------------------------------|--------------------------|-------------------|-----------------------------|
| A      | $df_A = I - 1$     | $SS_A = RSS_1 - RSS_A$       | $MS_A = SS_A/df_A$       | $F_A = MS_A/MSE$  | $P(F_{df_A,\,df_e} > F_A)$  |
| B      | $df_B = J - 1$     | $SS_B = RSS_A - RSS_{A+B}$   | $MS_B = SS_B/df_B$       | $F_B = MS_B/MSE$  | $P(F_{df_B,\,df_e} > F_B)$  |
| Error  | $df_e = n - I - J + 1$ | $RSS_{A+B}$              | $MSE = RSS_{A+B}/df_e$   |                   |                             |

This arrangement of the ANOVA table is referred to as "Type I" sums of squares.
We can examine this table in the fruit trees example using the anova() command,
passing just a single model.

m4 <- lm(Yield ~ Var + Pest, data=fruit)


anova( m4 )

## Analysis of Variance Table


##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Var 2 3996.1 1998.04 37.292 3.969e-07 ***
## Pest 3 2227.5 742.49 13.858 6.310e-05 ***
## Residuals 18 964.4 53.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We might think that this is the same as fitting three nested models and running
an F-test on each successive pair of models, but it isn't. While both give the
same sums of squares, the F-statistics differ because every test in the sequential
ANOVA table uses the MSE from the most complex model. Here that makes the
F-statistics larger, and thus the p-values smaller, for detecting the effects.

m1 <- lm(Yield ~ 1, data=fruit)


m2 <- lm(Yield ~ Var, data=fruit)
m3 <- lm(Yield ~ Var + Pest, data=fruit)
anova( m1, m2 ) # Notice the F-statistic here is different than previous

## Analysis of Variance Table


##
## Model 1: Yield ~ 1
## Model 2: Yield ~ Var
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 7188.0
## 2 21 3191.9 2 3996.1 13.146 0.0001987 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova( m2, m3 ) # This F-statistic matches what we saw previously

## Analysis of Variance Table


##
## Model 1: Yield ~ Var
## Model 2: Yield ~ Var + Pest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 3191.9
## 2 18 964.4 3 2227.5 13.858 6.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
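The "Type I" sums of squares are literally differences of residual sums of squares between successive nested models. This sketch recomputes the Var and Pest rows of the table by hand, re-creating the fruit data from above so that the chunk stands alone.

```r
# Re-create the fruit data (same values as typed in earlier)
Pesticide <- factor(c('A','B','C','D'))
Variety   <- factor(c('1','2','3'))
fruit <- data.frame(expand.grid(rep = 1:2, Pest = Pesticide, Var = Variety))
fruit$Yield <- c(49,39,50,55,43,38,53,48,55,41,67,58,53,42,85,73,66,68,85,92,69,62,85,99)

m1 <- lm(Yield ~ 1,          data = fruit)  # intercept only
m2 <- lm(Yield ~ Var,        data = fruit)  # add Var
m3 <- lm(Yield ~ Var + Pest, data = fruit)  # add Pest

RSS <- function(m) sum(resid(m)^2)
c(SS.Var  = RSS(m1) - RSS(m2),   # 3996.1, the Var row of anova(m3)
  SS.Pest = RSS(m2) - RSS(m3))   # 2227.5, the Pest row of anova(m3)
```

These hand-computed differences match the Sum Sq column produced by `anova(m3)` above.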

6.4.3 Estimating Contrasts

As in the one-way ANOVA, we are interested in which factor levels differ. For
example, we might suspect that it makes sense to group pesticides B and D
together and claim that they are better than the group of A and C.
Just as we did in the one-way ANOVA model, this is such a common thing to
do that there is an easy way to do this, using emmeans.

m3 <- lm(Yield ~ Var + Pest, data=fruit)


emmeans(m3, spec=pairwise~Var)

## $emmeans
## Var emmean SE df lower.CL upper.CL
## 1 46.9 2.59 18 41.4 52.3
## 2 59.2 2.59 18 53.8 64.7

## 3 78.2 2.59 18 72.8 83.7


##
## Results are averaged over the levels of: Pest
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 1 - 2 -12.4 3.66 18 -3.381 0.0089
## 1 - 3 -31.4 3.66 18 -8.573 <.0001
## 2 - 3 -19.0 3.66 18 -5.191 0.0002
##
## Results are averaged over the levels of: Pest
## P value adjustment: tukey method for comparing a family of 3 estimates

emmeans(m3, spec=pairwise~Pest)

## $emmeans
## Pest emmean SE df lower.CL upper.CL
## A 53.0 2.99 18 46.7 59.3
## B 67.8 2.99 18 61.6 74.1
## C 51.2 2.99 18 44.9 57.4
## D 73.8 2.99 18 67.6 80.1
##
## Results are averaged over the levels of: Var
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B -14.83 4.23 18 -3.510 0.0122
## A - C 1.83 4.23 18 0.434 0.9719
## A - D -20.83 4.23 18 -4.930 0.0006
## B - C 16.67 4.23 18 3.944 0.0048
## B - D -6.00 4.23 18 -1.420 0.5038
## C - D -22.67 4.23 18 -5.364 0.0002
##
## Results are averaged over the levels of: Var
## P value adjustment: tukey method for comparing a family of 4 estimates

These outputs are nice and they show the main effects of variety and pesticide.
Similar to the 1-way ANOVA, we also want to be able to calculate the compact
letter display.

m3 <- lm(Yield ~ Var + Pest, data=fruit)


emmeans(m3, spec= ~Var) %>% multcomp::cld(Letters=letters)

## Var emmean SE df lower.CL upper.CL .group


## 1 46.9 2.59 18 41.4 52.3 a
## 2 59.2 2.59 18 53.8 64.7 b
## 3 78.2 2.59 18 72.8 83.7 c
##
## Results are averaged over the levels of: Pest
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05

emmeans(m3, spec= ~Pest) %>% multcomp::cld(Letters=letters)

## Pest emmean SE df lower.CL upper.CL .group


## C 51.2 2.99 18 44.9 57.4 a
## A 53.0 2.99 18 46.7 59.3 a
## B 67.8 2.99 18 61.6 74.1 b
## D 73.8 2.99 18 67.6 80.1 b
##
## Results are averaged over the levels of: Var
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05

So we see that each variety is significantly different from all the others and
among the pesticides, 𝐴 and 𝐶 are indistinguishable as are 𝐵 and 𝐷, but there
is a difference between the 𝐴, 𝐶 and 𝐵, 𝐷 groups.

6.5 Interaction Model

When the model contains the interaction of the two factors, our model is written
as
𝑦𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 + 𝜖𝑖𝑗𝑘

Interpreting the effects can be very tricky. Under the interaction, the effect
of changing from factor 1 level 1 to factor 1 level 𝑖 depends on the level of
factor 2. In essence, we are fitting a model that allows each of the 𝐼 × 𝐽 cells
in the model to vary independently. As such, the model has a total of 𝐼 × 𝐽
parameters, but because the model without interactions had 1 + (𝐼 − 1) + (𝐽 − 1)
terms in it, the interaction adds 𝑑𝑓𝐴𝐵 parameters. We can solve for this

via:

$$\begin{aligned}
I \times J &= 1 + (I-1) + (J-1) + df_{AB} \\
I \times J &= I + J - 1 + df_{AB} \\
IJ - I - J + 1 &= df_{AB} \\
(I-1)(J-1) &= df_{AB}
\end{aligned}$$
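The column-counting argument can be checked directly with `model.matrix()`. Here I use two hypothetical factors with 𝐼 = 4 and 𝐽 = 3 levels, matching the fungus example.

```r
# One row per cell is enough to build the design matrix
d <- expand.grid(A = factor(1:4), B = factor(1:3))

ncol(model.matrix(~ A + B, data = d))  # 6  = 1 + (I-1) + (J-1)
ncol(model.matrix(~ A * B, data = d))  # 12 = I * J, i.e. (I-1)(J-1) = 6 more columns
```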

This makes sense because the first factor added (𝐼 − 1) columns to the design
matrix, and an interaction with a continuous covariate just multiplies the
columns of the factor by the single column of the continuous covariate. Creating
an interaction of two factors multiplies each column of the first factor by each of
the columns defined by the second factor.

The expected value of the 𝑖𝑗 combination is 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗. Returning to
our fungus example, the expected means for each treatment under the model
with main effects and the interaction are

| Temperature | 5%              | 30%                     | 60%                     | 90%                     |
|-------------|-----------------|-------------------------|-------------------------|-------------------------|
| 2C          | 𝜇 + 0 + 0 + 0   | 𝜇 + 𝛼2 + 0 + 0         | 𝜇 + 𝛼3 + 0 + 0         | 𝜇 + 𝛼4 + 0 + 0         |
| 10C         | 𝜇 + 0 + 𝛽2 + 0 | 𝜇 + 𝛼2 + 𝛽2 + (𝛼𝛽)22 | 𝜇 + 𝛼3 + 𝛽2 + (𝛼𝛽)32 | 𝜇 + 𝛼4 + 𝛽2 + (𝛼𝛽)42 |
| 30C         | 𝜇 + 0 + 𝛽3 + 0 | 𝜇 + 𝛼2 + 𝛽3 + (𝛼𝛽)23 | 𝜇 + 𝛼3 + 𝛽3 + (𝛼𝛽)33 | 𝜇 + 𝛼4 + 𝛽3 + (𝛼𝛽)43 |

Notice that we have added 6 = 3 ⋅ 2 = (4 − 1)(3 − 1) = (𝐼 − 1)(𝐽 − 1) interaction
parameters (𝛼𝛽)𝑖𝑗 to the main-effects-only model. The interaction model has
𝑝 = 12 parameters, one for each cell in the treatment array.

In general it is hard to interpret the meaning of 𝛼𝑖, 𝛽𝑗, and (𝛼𝛽)𝑖𝑗, and the best
way to make sense of them is to look at the interaction plots.

6.5.1 ANOVA Table

Most statistical software will produce an analysis of variance table when fitting
a two-way ANOVA. This table is very similar to the analysis of variance table
we have seen in the one-way ANOVA, but has several rows which correspond to
the additional factors added to the model.
Consider the two-way ANOVA with factors 𝐴 and 𝐵 which have 𝐼 and 𝐽 discrete
levels respectively. For convenience let 𝑅𝑆𝑆1 be the residual sum of squares of
the intercept-only model, and 𝑅𝑆𝑆𝐴 be the residual sum of squares for the model
with just the main effect of factor 𝐴. Likewise 𝑅𝑆𝑆𝐴+𝐵 and 𝑅𝑆𝑆𝐴∗𝐵 shall be
the residual sums of squares of the model with just the main effects and the
model with main effects and the interaction. Finally assume that we have a
total of 𝑛 observations. The ANOVA table for this model is as follows:

| Source | df                      | Sum Sq (SS)                        | MS                          | F                       | $Pr(\ge F)$                                |
|--------|-------------------------|------------------------------------|-----------------------------|-------------------------|--------------------------------------------|
| A      | $df_A = I - 1$          | $SS_A = RSS_1 - RSS_A$             | $MS_A = SS_A/df_A$          | $F_A = MS_A/MSE$        | $Pr(F_{df_A,\,df_\epsilon} \ge F_A)$       |
| B      | $df_B = J - 1$          | $SS_B = RSS_A - RSS_{A+B}$         | $MS_B = SS_B/df_B$          | $F_B = MS_B/MSE$        | $Pr(F_{df_B,\,df_\epsilon} \ge F_B)$       |
| AB     | $df_{AB} = (I-1)(J-1)$  | $SS_{AB} = RSS_{A+B} - RSS_{A*B}$  | $MS_{AB} = SS_{AB}/df_{AB}$ | $F_{AB} = MS_{AB}/MSE$  | $Pr(F_{df_{AB},\,df_\epsilon} \ge F_{AB})$ |
| Error  | $df_\epsilon = n - IJ$  | $RSS_{A*B}$                        | $MSE = RSS_{A*B}/df_\epsilon$ |                       |                                            |

This arrangement of the ANOVA table is referred to as “Type I” sum of squares.


Type III sums of squares are the difference between the full interaction model
and the model removing each parameter group, even when it doesn’t make sense.
For example in the Type III table, 𝑆𝑆𝐴 = 𝑅𝑆𝑆𝐵+𝐴∶𝐵 − 𝑅𝑆𝑆𝐴∗𝐵 . There is an
intermediate form of the sums of squares called Type II, that when removing a
main effect also removes the higher order interaction. In the case of balanced
(orthogonal) designs, there is no difference between the different types, but for
non-balanced designs, the numbers will change. To access these other types of
sums of squares, use the Anova() function in the package car.

6.5.2 Example - Fruit Trees (continued)

We next consider whether or not to include the interaction term to the fruit
tree model. We fit the model with the interaction and then graph the results.

# Create the Interaction Plot using emmeans package. IP stands for interaction plot
m4 <- lm(Yield ~ Var * Pest, data=fruit)
emmip( m4, Var ~ Pest ) # color is LHS of the formula

(Figure: emmip interaction plot of the linear predictions of Yield across levels of
Pest, one line per Var.)

# Create the interaction plot by hand


m4 <- lm(Yield ~ Var * Pest, data=fruit)
fruit$y.hat <- predict(m4)
ggplot(fruit, aes(x=Pest, color=Var, shape=Var, y=Yield)) +
geom_point(size=5) +
geom_line(aes(y=y.hat, x=as.integer(Pest)))

(Figure: hand-drawn interaction plot showing the raw Yield points and lines
connecting the fitted values across Pest for each Var.)

All of the line segments are close to parallel, so we don't expect the interaction
to be significant.

anova( m4 )

## Analysis of Variance Table


##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Var 2 3996.1 1998.04 47.2443 2.048e-06 ***

## Pest 3 2227.5 742.49 17.5563 0.0001098 ***


## Var:Pest 6 456.9 76.15 1.8007 0.1816844
## Residuals 12 507.5 42.29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Examining the ANOVA table, we see that the interaction effect is not significant
and we will stay with simpler model Yield~Var+Pest.

6.5.3 Example - Warpbreaks

This data set looks at the number of breaks that occur in two different types of
wool under three different levels of tension (low, medium, and high). The fewer
breaks, the better.

As always, the first thing we do is look at the data. In this case, it looks like
the number of breaks decreases with increasing tension and perhaps wool B has
fewer breaks than wool A.

data(warpbreaks)
ggplot(warpbreaks, aes(x=tension, y=breaks, color=wool, shape=wool), size=2) +
geom_boxplot() +
geom_point(position=position_dodge(width=.35)) # offset the wool groups

(Figure: boxplots of breaks by tension (L, M, H), colored by wool type, with the
individual points overlaid.)

We next fit our linear model and examine the diagnostic plots.

model <- lm(breaks ~ tension + wool, data=warpbreaks)


autoplot(model, which=c(1,2)) + geom_point( aes(color=tension:wool))

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.



## Please use `arrange()` instead.


## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

(Figure: diagnostic plots for the untransformed model, Residuals vs Fitted and
Normal Q-Q, with points colored by tension:wool combination.)

The residuals vs fitted values plot is a little worrisome and appears to be an


issue with non-constant variance, but the normality assumption looks good.
We’ll check for a Box-Cox transformation next.

MASS::boxcox(model)

(Figure: Box-Cox profile log-likelihood over λ, with the 95% confidence interval
marked.)
This suggests we should make a log transformation, though because the confidence
interval is quite wide we might consider whether the increased difficulty of
interpretation makes sufficient progress towards making the data meet the model
assumptions. The diagnostic plots of the resulting model look better for the
constant variance assumption, but the normality is now worse off. Because
the Central Limit Theorem helps deal with the normality question, I'd rather
stabilize the variance at the cost of the normality.

model.1 <- lm(log(breaks) ~ tension + wool, data=warpbreaks)


autoplot(model.1, which=c(1,2)) + geom_point( aes(color=tension:wool))

(Figure: diagnostic plots for the log-transformed main effects model, Residuals vs
Fitted and Normal Q-Q, colored by tension:wool combination.)

Next we’ll fit the interaction model and check the diagnostic plots. The diag-
nostic plots look good and this appears to be a legitimate model.

model.2 <- lm(log(breaks) ~ tension * wool, data=warpbreaks)


autoplot(model.2, which=c(1,2)) + geom_point( aes(color=tension:wool))

(Figure: diagnostic plots for the log-transformed interaction model, Residuals vs
Fitted and Normal Q-Q, colored by tension:wool combination.)

Then we'll do an F-test to see if it is a better model than the main effects model.
The p-value (0.047) is only marginally below 0.05, so we'll keep the interaction
in the model but recognize that it is a weak interaction.

anova(model.1, model.2) # explicitly look model1 vs model2

## Analysis of Variance Table


##
## Model 1: log(breaks) ~ tension + wool
## Model 2: log(breaks) ~ tension * wool
## Res.Df RSS Df Sum of Sq F Pr(>F)

## 1 50 7.6270
## 2 48 6.7138 2 0.91315 3.2642 0.04686 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model.2) # table of sequentially added terms in model 2

## Analysis of Variance Table


##
## Response: log(breaks)
## Df Sum Sq Mean Sq F value Pr(>F)
## tension 2 2.1762 1.08808 7.7792 0.001185 **
## wool 1 0.3125 0.31253 2.2344 0.141511
## tension:wool 2 0.9131 0.45657 3.2642 0.046863 *
## Residuals 48 6.7138 0.13987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next we look at the effect of the interaction and the easiest way to do this is
to look at the interaction plot. The emmeans::emmip() just shows the mean of
each treatment combination, while the plot I made by hand shows the mean of
each treatment combination along with the raw data.

A <- emmip(model.2, wool ~ tension) # LHS is color, RHS is the x-axis variable

B <- warpbreaks %>%


mutate( logy.hat = predict(model.2) ) %>%
ggplot(aes(x=tension, y=log(breaks), color=wool, shape=wool)) +
geom_point() +
geom_line(aes(y=logy.hat, x=as.integer(tension))) # make tension continuous to draw the lines

cowplot::plot_grid(A,B) # Plot these version of the interaction plot side-by-side.

(Figure: side-by-side interaction plots of log(breaks) across tension levels (L, M, H),
colored by wool: the emmip version of the means only, and the hand-drawn version
with the raw data overlaid.)

We can see that it appears that wool A has a decrease in breaks between low
and medium tension, while wool B has a decrease in breaks between medium
and high. It is actually quite difficult to see this interaction when we examine
the model coefficients.

summary(model.2)

##
## Call:
## lm(formula = log(breaks) ~ tension * wool, data = warpbreaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81504 -0.27885 0.04042 0.27319 0.64358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7179 0.1247 29.824 < 2e-16 ***
## tensionM -0.6012 0.1763 -3.410 0.00133 **
## tensionH -0.6003 0.1763 -3.405 0.00134 **
## woolB -0.4356 0.1763 -2.471 0.01709 *
## tensionM:woolB 0.6281 0.2493 2.519 0.01514 *
## tensionH:woolB 0.2221 0.2493 0.891 0.37749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.374 on 48 degrees of freedom
## Multiple R-squared: 0.3363, Adjusted R-squared: 0.2672
## F-statistic: 4.864 on 5 and 48 DF, p-value: 0.001116

To test if there is a statistically significant difference between medium and high
tensions for wool type B, we really need to test the following hypothesis:

$$H_0: (\mu + \alpha_2 + \beta_2 + (\alpha\beta)_{22}) - (\mu + \alpha_3 + \beta_2 + (\alpha\beta)_{32}) = 0$$
$$H_a: (\mu + \alpha_2 + \beta_2 + (\alpha\beta)_{22}) - (\mu + \alpha_3 + \beta_2 + (\alpha\beta)_{32}) \ne 0$$

This test reduces to testing if 𝛼2 − 𝛼3 + (𝛼𝛽)22 − (𝛼𝛽)32 = 0. Calculating this
difference from the estimated values in the summary table we have −0.6012 +
0.6003 + 0.6281 − 0.2221 = 0.4051, but we don't know if that is significantly
different from zero.
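We can at least check the arithmetic of the point estimate from the summary-table coefficients (the test of whether it differs from zero still requires its standard error):

```r
# tensionM - tensionH + tensionM:woolB - tensionH:woolB, from summary(model.2)
est <- -0.6012 - (-0.6003) + 0.6281 - 0.2221
est   # 0.4051, matching the rounded M B - H B estimate from emmeans
```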
In the main effects model, we were able to read off the necessary test using
emmeans. Fortunately, we can do the same thing here by looking at the contrasts
piece of the emmeans output, where we find the test M B - H B in the last row
of the contrasts.

emmeans(model.2, specs= pairwise~tension*wool)

## $emmeans
## tension wool emmean SE df lower.CL upper.CL
## L A 3.72 0.125 48 3.47 3.97
## M A 3.12 0.125 48 2.87 3.37
## H A 3.12 0.125 48 2.87 3.37
## L B 3.28 0.125 48 3.03 3.53
## M B 3.31 0.125 48 3.06 3.56
## H B 2.90 0.125 48 2.65 3.15
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## L A - M A 0.601196 0.176 48 3.410 0.0158
## L A - H A 0.600323 0.176 48 3.405 0.0160
## L A - L B 0.435567 0.176 48 2.471 0.1535
## L A - M B 0.408619 0.176 48 2.318 0.2071
## L A - H B 0.813794 0.176 48 4.616 0.0004
## M A - H A -0.000873 0.176 48 -0.005 1.0000
## M A - L B -0.165629 0.176 48 -0.939 0.9341
## M A - M B -0.192577 0.176 48 -1.092 0.8821
## M A - H B 0.212598 0.176 48 1.206 0.8319
## H A - L B -0.164756 0.176 48 -0.935 0.9355
## H A - M B -0.191704 0.176 48 -1.087 0.8840
## H A - H B 0.213471 0.176 48 1.211 0.8295
## L B - M B -0.026948 0.176 48 -0.153 1.0000
## L B - H B 0.378227 0.176 48 2.145 0.2823
## M B - H B 0.405175 0.176 48 2.298 0.2149
##
## Results are given on the log (not the response) scale.
## P value adjustment: tukey method for comparing a family of 6 estimates

The last call to emmeans gives us all the pairwise tests comparing the cell means.
If we don’t want to wade through all the other pairwise contrasts we could do
the following:

# If I want to not wade through all those contrasts and just grab the
# contrasts for wool type 'B' and tensions 'M' and 'H'
emmeans(model.2, pairwise~tension*wool, at=list(wool='B', tension=c('M','H')))

## $emmeans

## tension wool emmean SE df lower.CL upper.CL


## M B 3.31 0.125 48 3.06 3.56
## H B 2.90 0.125 48 2.65 3.15
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## M B - H B 0.405 0.176 48 2.298 0.0260
##
## Results are given on the log (not the response) scale.

What would happen if we just looked at the main effects? In the case where our
experiment is balanced with equal numbers of observations in each treatment
cell, we can interpret these differences as follows. Knowing that each cell in
our table has a different estimated mean, we could consider the average of all
the type A cells as the typical wool A. Likewise we could average all the cell
means for the wool B cells. Then we could look at the difference between those
two averages. In the balanced design, this is equivalent to removing the tension
term from the model and just looking at the difference between the average log
number of breaks.
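The averaging just described is easy to check by hand. Here is a quick sketch (mine, in base R, not from the text) for the tension L cells on the log scale; the value 3.50 matches the tension L estimate that emmeans reports.

```r
# Hand-check of the averaging described above: in the balanced warpbreaks
# data, the tension 'L' main-effect mean is the average of the two 'L'
# cell means (wool A and wool B), computed on the log scale.
L.cells <- with( subset(warpbreaks, tension == 'L'),
                 tapply( log(breaks), wool, mean ) )
L.cells           # the wool A and wool B cell means at tension L
mean( L.cells )   # 3.50 -- the tension 'L' value emmeans reports
```

The same check reproduces the wool A and wool B main-effect means as well.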

emmeans(model.2, specs= pairwise~tension )

## NOTE: Results may be misleading due to involvement in interactions

## $emmeans
## tension emmean SE df lower.CL upper.CL
## L 3.50 0.0882 48 3.32 3.68
## M 3.21 0.0882 48 3.04 3.39
## H 3.01 0.0882 48 2.83 3.19
##
## Results are averaged over the levels of: wool
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## L - M 0.287 0.125 48 2.303 0.0649
## L - H 0.489 0.125 48 3.925 0.0008
## M - H 0.202 0.125 48 1.622 0.2465
##
## Results are averaged over the levels of: wool
## Results are given on the log (not the response) scale.
## P value adjustment: tukey method for comparing a family of 3 estimates

emmeans(model.2, specs= pairwise~wool )

## NOTE: Results may be misleading due to involvement in interactions

## $emmeans
## wool emmean SE df lower.CL upper.CL
## A 3.32 0.072 48 3.17 3.46
## B 3.17 0.072 48 3.02 3.31
##
## Results are averaged over the levels of: tension
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B 0.152 0.102 48 1.495 0.1415
##
## Results are averaged over the levels of: tension
## Results are given on the log (not the response) scale.

Using emmeans, we can see the wool effect difference between types B and A is
−0.1522. We can calculate the mean number of log breaks for each wool type
and take the difference by the following:

warpbreaks %>%
group_by(wool) %>%
summarise( wool.means = mean(log(breaks)) ) %>%
summarise( diff(wool.means) )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1 x 1
## `diff(wool.means)`
## <dbl>
## 1 -0.152

In the unbalanced case taking the average of the cell means produces a different
answer than taking the average of the data. The emmeans package chooses to
take the average of the cell means.
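To see the distinction, here is a sketch (not from the text) that makes warpbreaks unbalanced by dropping a few rows; which rows to drop is my own arbitrary choice for illustration.

```r
# Sketch: make warpbreaks unbalanced by dropping the first five rows
# (all in the wool A / tension L cell), then compare the two answers.
unbalanced <- warpbreaks[ -(1:5), ]

# Average of the six cell means, split by wool -- what emmeans reports:
cell.means <- with( unbalanced, tapply( log(breaks), list(tension, wool), mean ) )
colMeans( cell.means )

# Simple average of the raw data -- no longer the same for wool A:
with( unbalanced, tapply( log(breaks), wool, mean ) )
```

Wool B is still balanced, so its two answers agree; for wool A they now differ.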

6.6 Exercises
As we have developed much of the necessary theory, we are moving into exercises that emphasize modeling decisions and interpretation. As such, grading of exercises will emphasize interpretation and justification of why modeling decisions were made. For any R output produced, be certain to discuss what is important. Feel free to not show exploratory work, but please comment on what you considered. Furthermore, as students progress, necessary analysis steps will not be listed and students will be expected to appropriately perform the necessary work and comment as appropriate. These exercises are leading to, at the end of the course, students being given a dataset and, with no prompts as to what an appropriate analysis is, being told to model the data appropriately.

1. In the faraway package, the data set rats has data on a gruesome experiment that examined the time till death of 48 rats when they were subjected to three different types of poison administered in four different manners (which they called treatments). We are interested in assessing which poison works the fastest as well as which administration method is most effective.
a. The response variable time needs to be transformed. To see this, we'll examine the diagnostic plots from the interaction model and there is clearly a problem with non-constant variance. We'll look at the Box-Cox family of transformations and see that $y^{-1}$ is a reasonable transformation.
data('rats', package='faraway')
model <- lm( time ~ poison * treat, data=rats)
# lindia::gg_diagnose(model) # All the plots...
lindia::gg_diagnose(model, plot.all=FALSE)[[4]] # just resid vs fitted
lindia::gg_boxcox(model)
rats <- rats %>% mutate( speed = time^(-1) )

b. Fit the interaction model using the transformed response. Create a graph of the data and the predicted values. Visually assess if you think the interaction is significant.
c. Perform an appropriate statistical test to see if the interaction is
statistically significant.
d. What do you conclude about the poisons and treatment (application)
types?
2. In the faraway package, the dataset butterfat has information about the percent of the milk that was butterfat (more is better) taken from $n = 100$ cows. There are 5 different breeds of cows and 2 different ages. We are interested in assessing if Age and Breed affect the butterfat content.
a. Graph the data. Do you think an interaction model is justified?

b. Perform an appropriate set of tests to select a model for predicting Butterfat.
c. Discuss your findings.
3. In the faraway package, the dataset alfalfa has information from a study
that examined the effect of seed inoculum, irrigation, and shade on alfalfa
yield. This data has 𝑛 = 25 observations.
a. Examine the help file for this dataset. Graph the data. What effects
seem significant?
b. Consider the main effects model with all three predictor variables.
Which effects are significant? Using the model you ultimately select,
examine the diagnostic plots. These all look fine, but it is useful to
see examples where everything is ok.
c. Consider the model with shade and inoculum and the interaction
between the two. Examine the anova table. Why does R complain
that the fit is perfect? Hint: Think about the degrees of freedom of
the model compared to the sample size.
d. Discuss your findings and the limitations of your investigation based
on data.
Chapter 7

Diagnostics

library(ggfortify)   # for autoplot for lm objects
library(emmeans)     # emmeans for pairwise contrasts
library(tidyverse)   # for dplyr, tidyr, ggplot2

We will be interested in analyzing whether or not our linear model is a good model and whether or not the data violate any of the assumptions that are required. In general we will be interested in three classes of assumption violations, and our diagnostic measures might be able to detect one or more of the following issues:

1. Unusual observations that contribute too much influence to the analysis. These few observations might drastically change the outcome of the model.
2. Model misspecification. Our assumption that $E[y] = X\beta$ might be wrong and we might need to include different covariates in the model to get a satisfactory result.
3. Error distribution. We have assumed that $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$, but autocorrelation, heteroscedasticity, and non-normality might be present.

Often problems with one of these can be corrected by transforming either the
explanatory or response variables.

7.1 Detecting Assumption Violations


Throughout this chapter I will use data created by Francis Anscombe that show how simple linear regression can be misused. In particular, these data sets will show how our diagnostic measures will detect various departures from the model assumptions.

The data are available in R as a data frame anscombe, which is loaded by default. The data consist of four datasets, each having the same fitted regression line $\hat{y} = 3 + 0.5\,x$, but the data are drastically different.

# The anscombe dataset has 8 columns - x1,x2,x3,x4,y1,y2,y3,y4
# and I want it to have 3 columns - Set, X, Y
Anscombe <- rbind(
  data.frame(x=anscombe$x1, y=anscombe$y1, set='Set 1'),
  data.frame(x=anscombe$x2, y=anscombe$y2, set='Set 2'),
  data.frame(x=anscombe$x3, y=anscombe$y3, set='Set 3'),
  data.frame(x=anscombe$x4, y=anscombe$y4, set='Set 4'))

# order them by their x values, and add an index column
Anscombe <- Anscombe %>%
  group_by(set) %>%        # Every subsequent action happens by dataset
  arrange(x,y) %>%         # sort them on the x-values and if tied, by y-value
  mutate( index = 1:n() )  # give each observation within a set, an ID number

# Make a nice graph
ggplot(Anscombe, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~set, scales='free') +
  stat_smooth(method="lm", formula=y~x, se=FALSE)

[Figure: scatterplots of the four Anscombe datasets (facets Set 1 through Set 4), each with its fitted regression line.]
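A quick sketch (not in the original) confirming the claim that all four sets share essentially the same fitted line, working directly from the raw anscombe columns:

```r
# Sketch: each of the four Anscombe sets yields (nearly) the same fitted line.
coefs <- sapply( 1:4, function(i){
  model <- lm( reformulate( paste0('x',i), paste0('y',i) ), data=anscombe )
  unname( coef(model) )   # c(intercept, slope)
})
rownames(coefs) <- c('intercept', 'slope')
colnames(coefs) <- paste('Set', 1:4)
round( coefs, 2 )   # every intercept is 3.00 and every slope is 0.50
```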

7.1.1 Measures of Influence


7.1.1.1 Standardized Residuals (aka Studentized)

Recall that we have
$$\hat{y} = X\hat{\beta} = X\left(X^T X\right)^{-1}X^T y = Hy$$
where the "Hat Matrix" is $H = X\left(X^T X\right)^{-1}X^T$, so that $\hat{y} = Hy$. The elements of $H$ can be quite useful in diagnostics. It can be shown that the variance of the $i$th residual is
$$Var\left(\hat{\epsilon}_i\right) = \sigma^2\left(1 - H_{ii}\right)$$
where $H_{ii}$ is the $i$th element of the main diagonal of $H$. This suggests that I could rescale my residuals to
$$\hat{\epsilon}_i^* = \frac{\hat{\epsilon}_i}{\hat{\sigma}\sqrt{1 - H_{ii}}}$$
which, if the normality and homoscedasticity assumptions hold, should behave as a $N(0,1)$ sample.
These rescaled residuals are called “studentized residuals”, though R typically
refers to them as “standardized”. Since we have a good intuition about the scale
of a standard normal distribution, the scale of standardized residuals will give
a good indicator if normality is violated.
There are actually two types of studentized residuals, typically called internal
and external among statisticians. The version presented above is the internal
version which can be obtained using the R function rstandard() while the
external version is available using rstudent(). Whenever you see R present
standardized residuals, they are talking about internally studentized residuals.
For sake of clarity, I will use the term standardized as well.
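As a sanity check, the internal version is easy to compute by hand; this sketch (my own, using Anscombe's first dataset) reproduces rstandard() from the rescaling formula:

```r
# Sketch: rstandard() is just the raw residual rescaled by
# sigma.hat * sqrt(1 - H_ii).
demo.model <- lm( y1 ~ x1, data=anscombe )
sigma.hat  <- summary(demo.model)$sigma
by.hand    <- resid(demo.model) / ( sigma.hat * sqrt( 1 - hatvalues(demo.model) ) )
all.equal( by.hand, rstandard(demo.model) )   # TRUE
```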

7.1.1.1.1 Example - Anscombe’s set 3


For the third dataset, the outlier is the ninth observation with $x_9 = 13$ and $y_9 = 12.74$. We calculate the standardized residuals using the function rstandard() and plot them

Set3 <- Anscombe %>% filter(set == 'Set 3')  # Just set 3
model <- lm(y ~ x, data=Set3)                # Fit the regression line
Set3$stdresid <- rstandard(model)            # rstandard() returns the standardized residuals

ggplot(Set3, aes(x=index, y=stdresid)) +     # make a plot
  geom_point() +
  labs(x='Observation Index',
       y='Standardized Residuals',
       title='Standardized Residuals vs Observation Index')
[Figure: Standardized Residuals vs Observation Index for set 3; one residual is near 3.]

and we notice that the outlier residual is really big. If the model assumptions
were true, then the standardized residuals should follow a standard normal dis-
tribution, and I would need to have hundreds of observations before I wouldn’t
be surprised to see a residual more than 3 standard deviations from 0.

7.1.1.2 Leverage

The extremely large standardized residual suggests that this data point is important, but we would like to quantify how important this observation actually is.
One way to quantify this is to look at the elements of $H$. Because
$$\hat{y}_i = \sum_{j=1}^n H_{ij}\, y_j$$
the $i$th row of $H$ is a vector of weights that tell us how influential a point $y_j$ is for calculating the predicted value $\hat{y}_i$. If I look at just the main diagonal of $H$, these are how much weight a point has on its own predicted value. As such, I can think of the $H_{ii}$ as the amount of leverage a particular data point has on the regression line. It can be shown that the leverages must satisfy $0 \le H_{ii} \le 1$ and that $\sum H_{ii} = p$.
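Both leverage properties are easy to verify numerically; a sketch (mine, refitting the same set 3 regression from the raw anscombe columns):

```r
# Sketch verifying the two leverage properties: every H_ii lies in [0, 1],
# and the leverages sum to p (here p = 2: intercept and slope).
lev.model <- lm( y3 ~ x3, data=anscombe )   # equivalent to the set 3 regression
h <- hatvalues( lev.model )
range(h)    # both endpoints inside [0, 1]
sum(h)      # 2 = p, the number of beta parameters
```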

Set3 <- Anscombe %>% filter( set == 'Set 3')
Set4 <- Anscombe %>% filter( set == 'Set 4')

model3 <- lm(y ~ x, data = Set3 )
model4 <- lm(y ~ x, data = Set4 )

X <- model.matrix(model3)
H <- X %*% solve( t(X) %*% X) %*% t(X)
round(H, digits=2)

## 1 2 3 4 5 6 7 8 9 10 11
## 1 0.32 0.27 0.23 0.18 0.14 0.09 0.05 0.00 -0.05 -0.09 -0.14
## 2 0.27 0.24 0.20 0.16 0.13 0.09 0.05 0.02 -0.02 -0.05 -0.09
## 3 0.23 0.20 0.17 0.15 0.12 0.09 0.06 0.04 0.01 -0.02 -0.05
## 4 0.18 0.16 0.15 0.13 0.11 0.09 0.07 0.05 0.04 0.02 0.00
## 5 0.14 0.13 0.12 0.11 0.10 0.09 0.08 0.07 0.06 0.05 0.05
## 6 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
## 7 0.05 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## 8 0.00 0.02 0.04 0.05 0.07 0.09 0.11 0.13 0.15 0.16 0.18
## 9 -0.05 -0.02 0.01 0.04 0.06 0.09 0.12 0.15 0.17 0.20 0.23
## 10 -0.09 -0.05 -0.02 0.02 0.05 0.09 0.13 0.16 0.20 0.24 0.27
## 11 -0.14 -0.09 -0.05 0.00 0.05 0.09 0.14 0.18 0.23 0.27 0.32

Fortunately there is already a function hatvalues() to compute these $H_{ii}$ values for me. We will compare the leverages from Anscombe's set 3 versus set 4.

Set3 <- Set3 %>% mutate(leverage = hatvalues(model3)) # add leverage columns
Set4 <- Set4 %>% mutate(leverage = hatvalues(model4))

ggplot( rbind(Set3,Set4), aes(x=index, y=leverage) ) +
  geom_point() +
  facet_grid( . ~ set )

[Figure: leverage vs observation index, faceted by Set 3 and Set 4.]

This leverage idea only picks out the potential for a specific value of 𝑥 to be
influential, but does not actually measure influence. It has picked out the issue
with the fourth data set, but does not adequately address the outlier in set 3.

7.1.1.3 Cook’s Distance

To attempt to measure the actual influence of an observation $\{y_i, x_i^T\}$ on the linear model, we consider the effect on the regression if we removed the observation and fit the same model. Let
$$\hat{y} = X\hat{\beta}$$
be the vector of predicted values, where $\hat{\beta}$ is created using all of the data, and let $\hat{y}_{(i)} = X\hat{\beta}_{(i)}$ be the vector of predicted values where $\hat{\beta}_{(i)}$ was estimated using all of the data except the $i$th observation. Letting $p$ be the number of $\beta_j$ parameters as usual, we define Cook's distance of the $i$th observation as
$$D_i = \frac{\left(\hat{y} - \hat{y}_{(i)}\right)^T\left(\hat{y} - \hat{y}_{(i)}\right)}{p\hat{\sigma}^2} = \frac{\sum_{j=1}^n\left(\hat{y}_j - \left(\hat{y}_{(i)}\right)_j\right)^2}{p\hat{\sigma}^2}$$
which boils down to saying if the predicted values have large changes when the $i$th element is removed, then the distance is big. It can be shown that this formula can be simplified to
$$D_i = \frac{\hat{\epsilon}_i^{*2}\, H_{ii}}{p\left(1 - H_{ii}\right)}$$
which expresses Cook's distance in terms of the $i$th studentized residual and the $i$th leverage.
Nicely, the R function cooks.distance() will calculate Cook’s distance.

Set3 <- Set3 %>% mutate(cooksd = cooks.distance(model3))
Set4 <- Set4 %>% mutate(cooksd = cooks.distance(model4))

# Note: The high leverage point in set 4 has a Cook's distance of Infinity.
ggplot(rbind(Set3,Set4), aes(x=index, y=cooksd)) +
  geom_point() +
  facet_grid(. ~ set) +
  labs(y="Cook's Distance")

## Warning: Removed 1 rows containing missing values (geom_point).

[Figure: Cook's Distance vs observation index, faceted by Set 3 and Set 4.]

Some texts will give a rule of thumb that points with Cook's distances greater than 1 should be considered influential, while other books claim a reasonable rule of thumb is $4/\left(n - p - 1\right)$, where $n$ is the sample size and $p$ is the number of parameters in $\beta$. My take on this is that you should look for values that are highly different from the rest of your data.
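As a sanity check (my own sketch, not from the text), cooks.distance() matches the simplified formula with the internally studentized residual squared:

```r
# Sketch: cooks.distance() agrees with
#   D_i = (rstandard_i)^2 * H_ii / ( p * (1 - H_ii) )
cd.model <- lm( y3 ~ x3, data=anscombe )   # equivalent to the set 3 regression
p <- length( coef(cd.model) )
h <- hatvalues( cd.model )
r <- rstandard( cd.model )
all.equal( cooks.distance(cd.model), r^2 * h / ( p * (1 - h) ) )   # TRUE
```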

7.1.2 Diagnostic Plots

After fitting a linear model in R, you have the option of looking at diagnostic plots that help to decide if any assumptions are being violated. We will step through each of the plots that are generated by the base function plot(model) or via ggplot2 using the package ggfortify.

In the package ggfortify there is a function that will calculate the diagnostic measures and add them to your dataset. This will simplify our graphing process.

Set1 <- Anscombe %>% filter(set == 'Set 1')
model <- lm( y ~ x, data=Set1)
Set1 <- fortify(model)    # add diagnostic measures to the dataset
Set1 %>% round(digits=3)  # show the dataset nicely

## y x .hat .sigma .cooksd .fitted .resid .stdresid
## 1 4.26 4 0.318 1.273 0.123 5.000 -0.740 -0.725
## 2 5.68 5 0.236 1.310 0.004 5.501 0.179 0.166
## 3 7.24 6 0.173 1.220 0.127 6.001 1.239 1.102
## 4 4.82 7 0.127 1.147 0.154 6.501 -1.681 -1.455
## 5 6.95 8 0.100 1.311 0.000 7.001 -0.051 -0.043
## 6 8.81 9 0.091 1.218 0.062 7.501 1.309 1.110
## 7 8.04 10 0.100 1.312 0.000 8.001 0.039 0.033
## 8 8.33 11 0.127 1.310 0.002 8.501 -0.171 -0.148
## 9 10.84 12 0.173 1.100 0.279 9.001 1.839 1.635
## 10 7.58 13 0.236 1.056 0.489 9.501 -1.921 -1.778
## 11 9.96 14 0.318 1.311 0.000 10.001 -0.041 -0.041

7.1.2.1 Residuals vs Fitted

In the simple linear regression the most useful plot to look at was the residuals
versus the 𝑥-covariate, but we also saw that this was similar to looking at the
residuals versus the fitted values. In the general linear model, we will look at
the residuals versus the fitted values or possibly the studentized residuals versus
the fitted values.

7.1.2.1.1 Polynomial relationships

To explore how this plot can detect non-linear relationships between $y$ and $x$, we will examine a data set from Ashton et al. (2007) that relates the length of a tortoise's carapace to the number of eggs laid in a clutch. The data are

Eggs <- data.frame(
  carapace = c(284,290,290,290,298,299,302,306,306,
               309,310,311,317,317,320,323,334,334),
  clutch.size = c(3,2,7,7,11,12,10,8,8,
                  9,10,13,7,9,6,13,2,8))
ggplot(Eggs, aes(x=carapace, y=clutch.size)) +
  geom_point()

[Figure: clutch.size vs carapace scatterplot.]

Looking at the data, it seems that we are violating the assumption that a linear
model is appropriate, but we will fit the model anyway and look at the residual
graph.

model <- lm( clutch.size ~ carapace, data=Eggs )
plot(model, which=1)  # Base R function: which=1 tells R to only make the first plot

[Figure: Residuals vs Fitted (base R plot) for lm(clutch.size ~ carapace); observations 2, 12, and 17 are labeled.]

lindia::gg_diagnose(model) # using lindia package

[Figure: lindia::gg_diagnose panel of diagnostic plots: Histogram of Residuals, Residual vs. carapace, Residual vs. Fitted Value, Normal-QQ Plot, Scale-Location Plot, Residual vs. Leverage, and Cook's Distance Plot (observations 12 and 17 stand out).]

lindia::gg_diagnose(model, plot.all=FALSE)[[3]] # using lindia package

[Figure: Residual vs. Fitted Value plot from lindia.]

autoplot(model, which=1) # same plot using ggplot2 and ggfortify package

[Figure: Residuals vs Fitted (autoplot); observations 12, 2, and 17 are labeled.]
The blue curve going through the plot is a smoother of the residuals. Ideally this should be a flat line and I should see no trend in this plot. Clearly there is a quadratic trend as larger tortoises have larger clutch sizes until some point where the extremely large tortoises start laying fewer (perhaps the extremely large tortoises are extremely old as well). To correct for this, we should fit a model that is quadratic in carapace length. We will create a new covariate, carapace.2, which is the square of the carapace length, and add it to the model. In general I could write the quadratic model as

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
and note that my model is still a linear model with respect to covariates $x$ and $x^2$ because I can still write the model as
$$y = X\beta + \epsilon = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

# add a new column that is carapace^2
Eggs2 <- Eggs %>% mutate( carapace.2 = carapace^2 )
model <- lm( clutch.size ~ carapace + carapace.2, data=Eggs2 )

# make R do it inside the formula... convenient
model <- lm( clutch.size ~ carapace + I(carapace^2), data=Eggs )

# Fit an arbitrary degree polynomial; I recommend this method for fitting the model!
model <- lm( clutch.size ~ poly(carapace, 2), data=Eggs )

# If you use poly() in the formula, you must use 'data=' here,
# otherwise you can skip it and R will do the right thing.
autoplot(model, which=1, data=Eggs)

[Figure: Residuals vs Fitted for the quadratic model; no remaining trend.]

Now our residual plot versus fitted values does not show any trend, suggesting
that the quadratic model is fitting the data well. Graphing the original data
along with the predicted values confirms this.

# add the fitted and CI lwr/upr columns to my dataset
Eggs <- Eggs %>%
  select( -matches('fit'), -matches('lwr'), -matches('upr') ) %>%
  cbind( predict(model, interval='confidence') )

ggplot(Eggs, aes(x=carapace)) +
  geom_ribbon( aes(ymin=lwr, ymax=upr), fill='red', alpha=.3) +
  geom_line(aes(y=fit), color='red') +
  geom_point(aes(y=clutch.size))
[Figure: clutch.size vs carapace with the quadratic fit and confidence band.]

7.1.2.1.2 Heteroskedasticity
The plot of residuals versus fitted values can detect heteroskedasticity (non-constant variance) in the error terms.

To illustrate this, we turn to another dataset in the Faraway book. The dataset airquality uses data taken from an environmental study that measured four variables, ozone, solar radiation, temperature and wind speed, for 153 consecutive days in New York. The goal is to predict the level of ozone using the weather variables.
We first graph all pairs of variables in the dataset.

data(airquality)
# pairs(~ Ozone + Solar.R + Wind + Temp, data=airquality)
airquality %>% select( Solar.R, Wind, Temp, Ozone) %>%
GGally::ggpairs()

[Figure: GGally::ggpairs scatterplot matrix of Solar.R, Wind, Temp, and Ozone. Correlations: Solar.R-Wind -0.057, Solar.R-Temp 0.276, Solar.R-Ozone 0.348, Wind-Temp -0.458, Wind-Ozone -0.602, Temp-Ozone 0.698.]

and notice that ozone levels are positively correlated with solar radiation and
temperature, and negatively correlated with wind speed. A linear relationship
with wind might be suspect as is the increasing variability in the response to
high temperature. However, we don’t know if those trends will remain after
fitting the model, because there is some covariance among the predictors.

model <- lm(Ozone ~ Solar.R + Wind + Temp, data=airquality)
autoplot(model, which=1)

[Figure: Residuals vs Fitted for the ozone model; observations 117, 30, and 62 are labeled.]

As we feared, we have both a non-constant variance and a non-linear relationship. A transformation of the $y$ variable might be able to fix our problem.

7.1.2.2 QQplots

If we are taking a sample of size $n = 10$ from a standard normal distribution, then I should expect that the smallest observation will be negative. Intuitively, you would expect the smallest observation to be near the 10th percentile of the standard normal, and likewise the second smallest should be near the 20th percentile.

This idea needs a little modification because the largest observation cannot be near the 100th percentile (because that is $\infty$). So we'll adjust the estimates to still be spaced at $1/n$ quantile increments, but starting at the $0.5/n$ quantile instead of the $1/n$ quantile. So the smallest observation should be near the 0.05 quantile, the second smallest should be near the 0.15 quantile, and the largest observation should be near the 0.95 quantile. I will refer to these as the theoretical quantiles.
[Figure: standard normal density with the theoretical quantiles $z_{0.05}, z_{0.15}, \ldots, z_{0.95}$ marked on the x-axis.]

I can then graph the theoretical quantiles vs my observed values and if they lie
on the 1-to-1 line, then my data comes from a standard normal distribution.

set.seed(93516) # make random sample in the next code chunk consistent run-to-run
n <- 10
data <- data.frame( observed = rnorm(n, mean=0, sd=1) ) %>%
  arrange(observed) %>%
  mutate( theoretical = qnorm( (1:n -.5)/n ) )

ggplot(data, aes(x=theoretical, y=observed) ) +
  geom_point() +
  geom_abline( intercept=0, slope=1, linetype=2, alpha=.7) +
  labs(title='Q-Q Plot: Observed vs Normal Distribution')

[Figure: Q-Q plot of the simulated sample: observed values vs theoretical normal quantiles, with the 1-to-1 line.]

In the context of a regression model, we wish to look at the residuals and see
if there are obvious departures from normality. Returning to the air quality
example, R will calculate the qqplot for us.

model <- lm(Ozone ~ Solar.R + Wind + Temp, data=airquality)
autoplot(model, which=2)

[Figure: Normal Q-Q plot of the standardized residuals; observations 117, 30, and 62 are labeled in the upper tail.]

In this case, we have a large number of residuals that are bigger than I would
expect them to be based on them being from a normal distribution. We could
further test this using the Shapiro-Wilks test and compare the standardized
residuals against a 𝑁 (0, 1) distribution.

shapiro.test( rstandard(model) )

##
## Shapiro-Wilk normality test
##
## data: rstandard(model)
## W = 0.9151, p-value = 2.819e-06

The tail of the distribution of observed residuals is far from what we expect to
see.

7.1.2.3 Scale-Location Plot

This plot is a variation on the fitted vs residuals plot, but the y-axis uses the square root of the absolute value of the standardized residuals. Supposedly this makes increasing variance easier to detect, but I'm not convinced.
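If you want to see exactly what is on each axis, the plot is easy to build by hand; a sketch (mine, using the ozone model and base graphics):

```r
# Sketch: construct the scale-location plot by hand.
model <- lm( Ozone ~ Solar.R + Wind + Temp, data=airquality )
plot( fitted(model), sqrt( abs( rstandard(model) ) ),
      xlab='Fitted Values',
      ylab='sqrt( |Standardized Residuals| )',
      main='Scale-Location, by hand' )
```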

7.1.2.4 Residuals vs Leverage (plus Cook’s Distance)

This plot lets the user examine which observations have a high potential for being influential (i.e. high leverage) versus how large the residual is. Because Cook's distance is a function of those two traits, we can also divide the graph up into regions by the value of Cook's Distance.

Returning to Anscombe’s third set of data, we see

model3 <- lm(y ~ x, data=Set3)
autoplot(model3, which=5)

[Figure: Residuals vs Leverage (autoplot); observations 10, 9, and 11 are labeled.]

that one data point (observation 10) has an extremely large standardized residual. This is one plot where I prefer what the base graphics in R does compared to the ggfortify version. The base version of R adds contour lines that mark where Cook's distance is 1/2 and 1.

plot(model3, which=5)
[Figure: Residuals vs Leverage (base R) with Cook's distance contours at 0.5 and 1; observation 10 stands out, and observations 9 and 11 are also labeled.]

7.2 Exercises
1. In the ANCOVA chapter, we examined the effect of dose of vitamin C on guinea pig tooth growth based on how the vitamin was delivered (orange juice vs a pill supplement).
a. Load the ToothGrowth data which is pre-loaded in base R.
b. Plot the data with dose level on the x-axis and tooth length growth
(len) on the y-axis. Color the points by supplement type (supp).
c. Is the data evenly distributed along the x-axis? Comment on the
wisdom of using this design.
d. Fit a linear model to these data and examine the diagnostic plots.
What stands out to you?
e. Log-transform the dose variable and repeat parts (c) and (d). Com-
ment on the effect of the log transformation.
2. The dataset infmort in the faraway package has information about infant
mortality from countries around the world. Be aware that this is an old data
set and does not necessarily reflect current conditions. More information
about the dataset can be found using ?faraway::infmort. We will be
interested in understanding how infant mortality is predicted by per capita
income, world region, and oil export status.
a. Plot the relationship between income and mortality. This can be
done using the command
data('infmort', package='faraway')
pairs(mortality ~., data=infmort)

What do you notice about the relationship between mortality and income?
b. Fit a linear model without any interaction terms with all three covariates as predictors of infant mortality. Examine the diagnostic plots. What stands out?

c. Examine the pairs plot with log(mortality), income, and log(income). Which should be used in our model, income or log(income)?
3. The dataset pressure in the datasets package gives empirical measurements of the vapor pressure of mercury at various temperatures. Fit a model with pressure as the response and temperature as the predictor using transformations to obtain a good fit. Feel free to experiment with what might be considered a ridiculously complicated model with a high degree polynomial. These models can most easily be fit using the poly(x, degree=p) function in the formula specification, where you swap out the covariate x and polynomial degree p.
a. Document your process of building your final model. Do not show graphs or computer output that is not relevant to your decision or that you do not wish to comment on.
b. Comment on the interpretability of your (possibly ridiculously complicated) model. Consider a situation where I'm designing a system where it is easy to measure temperature, but I am unable to measure pressure. Could we get by with just measuring temperature? What if I didn't know the Clausius-Clapeyron equation (which describes vapor pressure at various temperatures) and I'm trying to better understand how temperature and pressure are related?
4. We will consider the relationship between income and race using a subset
of employed individuals from the American Community Survey.
a. Load the EmployedACS dataset from the Lock5Data package.
b. Create a box plot showing the relationship between Race and Income.
c. Fit an ANOVA model to this data and consider the diagnostic plots
for the residuals. What do you notice?
Chapter 8

Data Transformations

library(ggfortify)   # for autoplot for lm objects
library(emmeans)     # emmeans for pairwise contrasts
library(tidyverse)   # for dplyr, tidyr, ggplot2

Transformations of the response variable and/or the predictor variables can drastically improve the model fit and can correct violations of the model assumptions. We might also create new predictor variables that are functions of existing variables. These include quadratic and higher order polynomial terms and interaction terms.

Often we are presented with data and we would like to fit a linear model to the data. Unfortunately the data might not satisfy all of the assumptions of a linear model. For the simple linear model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$, the necessary assumptions are (in order of importance):

1. The model contains all the appropriate covariates and no more.
2. Independent errors. (Hard to check this one!)
3. Errors have constant variance, no matter what the x-value (or equivalently the fitted value).
4. Errors are normally distributed.

In general, a transformation of the response variable can be used to address the 3rd and 4th assumptions, and adding new covariates to the model will be how we address deficiencies of assumption 1. Because of the interpretability properties we will develop here, log() transformations are very popular, if they are useful.


8.1 A review of $\log(x)$ and $e^x$

One of the most common transformations that is used on either the response $y$ or the covariates $x$ is the log() function. In this next section we will consider log() with base $e$. However, if you prefer $\log_2()$ or $\log_{10}()$, you may substitute $e$ with 2 or 10 everywhere.

In primary school you might have learned that the log() function looks like this:
[Figure: graph of log(x) for x from 0 to 8.]

Critical aspects to notice about $\log(x)$:

1. As $x \to 0$, $\log(x) \to -\infty$.
2. At $x = 1$ we have $\log(1) = 0$.
3. As $x \to \infty$, $\log(x) \to \infty$ as well, but at a much slower rate.
4. Even though $\log(x)$ is only defined for $x > 0$, the result can take on any real value, positive or negative.

The inverse function of $\log(x)$ is $e^x = \exp(x)$, where $e = 2.71828\ldots$, which looks like this:

[Figure: graph of exp(x) for x from -3 to 2.]

Critical aspects to notice about $e^x$:

1. As $x \to -\infty$, $e^x \to 0$.
2. At $x = 0$ we have $e^0 = 1$.
3. As $x \to \infty$, $e^x \to \infty$ as well, but at a much faster rate.
4. The function $e^x$ can be evaluated for any real number, but the result is always $> 0$.

Finally we have that 𝑒𝑥 and 𝑙𝑜𝑔(𝑥) are inverse functions of each other by the
following identity:
𝑥 = log (𝑒𝑥 )
and
𝑥 = 𝑒log(𝑥) if 𝑥 > 0

Also it is important to note that the log function has some interesting properties, in that it makes each operation "one operation easier": exponents become products, quotients become differences, and products become sums.

log(𝑎^𝑏) = 𝑏 log 𝑎
log(𝑎/𝑏) = log 𝑎 − log 𝑏
log(𝑎𝑏) = log 𝑎 + log 𝑏

One final aspect of exponents that we will utilize is that

𝑒𝑎+𝑏 = 𝑒𝑎 𝑒𝑏
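These identities are easy to verify numerically. A quick sanity check in R, using arbitrary values of a and b:

```r
a <- 3; b <- 7

# Each identity should hold up to floating point error
all.equal( log(a^b), b * log(a) )        # powers become products
all.equal( log(a/b), log(a) - log(b) )   # quotients become differences
all.equal( log(a*b), log(a) + log(b) )   # products become sums
all.equal( exp(a+b), exp(a) * exp(b) )   # sums become products
```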

The reason we like using a log() transformation is that it acts differently on large values than on small ones. In particular, for 𝑥 > 1, log(𝑥) makes all of the values smaller, but the effect on big values of 𝑥 is much more extreme. Consider the following, where most of the x-values are small but we have a few that are quite large. Those large values will have extremely high leverage and we'd like to reduce that.

[Figure: density histograms of x (strongly right-skewed, with values up to about 600) and of log(x) (much more symmetric).]

8.2 Transforming the Response

When the normality or constant variance assumption is violated, sometimes it is possible to transform the response to satisfy the assumption. Often times count data is analyzed as log(count) and weights are analyzed after taking a square root or cube root transform. Statistics involving income or other monetary values are usually analyzed on the log scale so as to reduce the leverage of high income observations.

[Figure: scatterplots of y vs x and of log(y) vs x, each with a fitted regression line and confidence band; the fit on the log scale is much better behaved.]

Clearly the model fit to the log transformed y-variable is a much better regres-
sion model. However, I would like to take the regression line and confidence
interval back to the original y-scale. This is allowed by doing the inverse func-
tion 𝑒log(𝑦)̂ .
For example, suppose we fit a linear model for income (𝑦) based on the amount of schooling the individual has received (𝑥). In this case, I don't really want to make predictions on the log(𝑦) scale, because (almost) nobody will understand the magnitude of the difference between predicting 5 vs 6.
Suppose the model is
log 𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜖
then we might want to give a prediction interval for an 𝑥0 value. The predicted
𝑙𝑜𝑔(𝑖𝑛𝑐𝑜𝑚𝑒) value is
log (𝑦0̂ ) = 𝛽0̂ + 𝛽1̂ 𝑥0
and we could calculate the appropriate predicted income as

𝑦̂₀ = 𝑒^(𝛽̂₀ + 𝛽̂₁𝑥₀) = 𝑒^log(𝑦̂₀)

Likewise if we had a confidence interval or prediction interval for log (𝑦0̂ ) of the
form (𝑙, 𝑢) then the appropriate interval for 𝑦0̂ is (𝑒𝑙 , 𝑒𝑢 ). Notice that while
(𝑙, 𝑢) might be symmetric about log (𝑦0̂ ), the back-transformed interval is not
symmetric about 𝑦0̂ .

model <- lm( log(y) ~ x, data)

data <- data %>%
  select( -matches('(fit|lwr|upr)')) %>%
  cbind( predict(model, newdata=., interval='confidence') )
data <- data %>% mutate(
  fit = exp(fit),
  lwr = exp(lwr),
  upr = exp(upr))

ggplot(data, aes(x=x)) +
  geom_ribbon( aes(ymin=lwr, ymax=upr), alpha=.6 ) +
  geom_line( aes(y=fit), color='blue' ) +
  geom_point( aes(y=y) ) +
  labs(y='y')

[Figure: y vs x with the back-transformed fitted curve and confidence band overlaid on the original scale.]

This back transformation on the 𝑦 ̂ values will be acceptable for any 1-to-1 trans-
formation we use, not just log(𝑦).
Unfortunately the interpretation of the regression coefficients 𝛽0̂ and 𝛽1̂ on the
un-transformed scale becomes more complicated. This is a very serious difficulty
and might sway a researcher from transforming their data.

8.2.1 Box-Cox Family of Transformations

The Box-Cox method is a popular way of determining what transformation to make. It is intended for responses that are strictly positive (because log 0 = −∞, and fractional powers of negative numbers give complex numbers, which we don't know how to address in regression). The transformation is defined as

𝑔(𝑦) = (𝑦^𝜆 − 1)/𝜆   if 𝜆 ≠ 0
𝑔(𝑦) = log 𝑦         if 𝜆 = 0

This transformation is a smooth family of transformations because

lim_{𝜆→0} (𝑦^𝜆 − 1)/𝜆 = log 𝑦

In the case that 𝜆 ≠ 0, a researcher will usually use the simpler transformation 𝑦^𝜆, because the subtraction and division do not change anything in a non-linear fashion. Thus for purposes of addressing the assumption violations, all we care about is the 𝑦^𝜆 term, and we prefer the simpler (i.e. more interpretable) transformation.
Finding the best transformation can be done by adding the 𝜆 parameter to
the model and finding the value that maximizes the log-likelihood function.
Fortunately, we don’t have to do this by hand, as the function boxcox() in the
MASS library will do all the heavy calculation for us.

data(gala, package='faraway')
g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)

# I don't like loading the MASS package because it includes a select() function
# that fights with dplyr::select(), so whenever I use a function in the MASS
# package, I just call it using the package::function() naming.
#
# #MASS::boxcox(g, lambda=seq(-2,2, by=.1)) # Set lambda range manually...
MASS::boxcox( g ) # With default lambda range.

[Figure: Box-Cox profile log-likelihood plotted against 𝜆 over (−2, 2), with the 95% confidence region for 𝜆 marked; the maximum occurs near 𝜆 = 1/4.]

The optimal transformation for these data would be 𝑦^(1/4) (the fourth root), but that is an extremely uncommon transformation. Instead we should pick the nearest "standard" transformation, which would suggest that we should use either the log 𝑦 or √𝑦 transformation.
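Rather than reading 𝜆 off the graph, we can ask boxcox() for its grid of 𝜆 values and log-likelihoods directly; a small sketch (plotit=FALSE suppresses the figure):

```r
data(gala, package='faraway')
g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)

# boxcox() returns a list with the lambda grid (x) and log-likelihood (y)
bc <- MASS::boxcox(g, plotit=FALSE)
bc$x[ which.max(bc$y) ]   # lambda with the highest log-likelihood; near 1/4 here
```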
Thoughts on the Box-Cox transformation:

1. In general, I prefer to use a larger-than-optimal model when picking a transformation and then go about the model building process. After a suitable model has been chosen, I'll double check that my transformation was appropriate given the model that I ended up with.

2. Outliers can have a profound effect on this method. If the "optimal" transformation is extreme (𝜆 = 5 or something silly) then you might have to remove the outliers and refit the transformation.
3. If the range of the response 𝑦 is small, then the method is not as sensitive.
4. These are not the only possible transformations. For example, for binary
data, the logit and probit transformations are common. In classical
non-parametric statistics, we take a rank transformation to the y-values.

8.3 Transforming the predictors

8.3.1 Polynomials of a predictor

Perhaps the most common transformation to make is a quadratic function in 𝑥. Often the relationship between 𝑥 and 𝑦 follows a curve and we want to fit a quadratic model

𝑦̂ = 𝛽̂₀ + 𝛽̂₁𝑥 + 𝛽̂₂𝑥²

and we should note that this is still a linear model because 𝑦̂ is a linear function of 𝑥 and 𝑥². As we have already seen, it is easy to fit the model: adding the column of 𝑥² values to the design matrix does the trick.
The difficult part comes in the interpretation of the parameter values. No longer is 𝛽̂₁ the increase in 𝑦 for every one unit increase in 𝑥. Instead the three parameters in my model interact in a complicated fashion. For example, the peak of the parabola is at −𝛽̂₁/(2𝛽̂₂), and whether the parabola is cup shaped vs dome shaped, along with its steepness, is controlled by 𝛽̂₂. Because my geometric understanding of degree 𝑞 polynomials relies on having all terms of degree 𝑞 or lower, whenever I include a covariate raised to a power, I should include all the lower powers as well.
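A quick sketch of fitting a quadratic in R, using simulated (hypothetical) data so the example is self-contained; I() protects the ^ from formula syntax, and poly() gives an equivalent fit with orthogonal polynomials:

```r
# Simulated data, just to illustrate the fitting mechanics
set.seed(42)
df <- data.frame(x = runif(50, 0, 10))
df$y <- 2 + 1.5*df$x - 0.2*df$x^2 + rnorm(50, sd=0.5)

m <- lm( y ~ x + I(x^2), data=df )    # adds the x^2 column to the design matrix
# m <- lm( y ~ poly(x, 2), data=df )  # equivalent fit via orthogonal polynomials

b <- coef(m)
-b['x'] / (2 * b['I(x^2)'])           # location of the parabola's peak, -b1/(2*b2)
```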

8.3.2 Log and Square Root of a predictor

Often the effect of a covariate is not linearly related to the response, but rather some function of the covariate is. For example the area of a circle is not linearly related to its radius, but it is linearly related to the radius squared:

𝐴𝑟𝑒𝑎 = 𝜋𝑟²

Similar situations might arise in biological settings, such as the volume of conducting tissue being related to the square of the diameter. Or perhaps an animal's metabolic requirements are related to some power of body length. In sociology, it is often seen that the utility of, say, $1000 drops off in a logarithmic fashion according to the person's income. To a graduate student, $1K is a big deal, but to a corporate CEO, $1K is just another weekend at the track.

Making a log transformation on any monetary covariate might account for the non-linear nature of "utility". Picking a good transformation for a covariate is quite difficult, but most fields of study have spent plenty of time thinking about these issues. When in doubt, look at scatterplots of the covariate vs the response and ask: what transformation would make the data fall onto a line?

8.3.3 Galapagos Example

To illustrate how to add a transformation of a predictor to a linear model in R, we will consider the Galapagos data in faraway.

data('gala', package='faraway')
# look at all the scatterplots
gala %>%
mutate(LogSpecies = log(Species)) %>%
dplyr::select(LogSpecies, Area, Elevation, Nearest, Scruz, Adjacent) %>%
GGally::ggpairs(upper=list(continuous='points'), lower=list(continuous='cor'))

## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2

[Figure: pairs plot of LogSpecies, Area, Elevation, Nearest, Scruz, and Adjacent. Correlations with LogSpecies: Area 0.429, Elevation 0.671, Nearest 0.130, Scruz −0.077, Adjacent 0.109; Area and Elevation are strongly correlated (0.754), as are Nearest and Scruz (0.615) and Elevation and Adjacent (0.536).]

Looking at these graphs, I think I should definitely transform Area and Adjacent, and I wouldn't object to doing the same to Elevation, Nearest and Scruz. Given the high leverages, a log transformation should be a good idea. One problem is that log(0) = −∞. A quick look at the data set summary:

gala %>%
dplyr::select(Species, Area, Elevation, Nearest,Scruz, Adjacent) %>%
summary()

## Species Area Elevation Nearest
## Min. : 2.00 Min. : 0.010 Min. : 25.00 Min. : 0.20
## 1st Qu.: 13.00 1st Qu.: 0.258 1st Qu.: 97.75 1st Qu.: 0.80
## Median : 42.00 Median : 2.590 Median : 192.00 Median : 3.05
## Mean : 85.23 Mean : 261.709 Mean : 368.03 Mean :10.06
## 3rd Qu.: 96.00 3rd Qu.: 59.237 3rd Qu.: 435.25 3rd Qu.:10.03
## Max. :444.00 Max. :4669.320 Max. :1707.00 Max. :47.40
## Scruz Adjacent
## Min. : 0.00 Min. : 0.03
## 1st Qu.: 11.03 1st Qu.: 0.52
## Median : 46.65 Median : 2.59
## Mean : 56.98 Mean : 261.10
## 3rd Qu.: 81.08 3rd Qu.: 59.24
## Max. :290.20 Max. :4669.32

reveals that Scruz has a zero value, and so a log transformation will result in a −∞. So let's take the square root of Scruz.

gala %>%
  mutate(LogSpecies = log(Species), LogElevation=log(Elevation), LogArea=log(Area),
         LogNearest=log(Nearest), SqrtScruz=sqrt(Scruz), LogAdjacent=log(Adjacent)) %>%
  dplyr::select(LogSpecies, LogElevation, LogArea, LogNearest, SqrtScruz, LogAdjacent) %>%
  GGally::ggpairs(upper=list(continuous='points'), lower=list(continuous='cor'))

[Figure: pairs plot of LogSpecies, LogElevation, LogArea, LogNearest, SqrtScruz, and LogAdjacent. Correlations with LogSpecies: LogElevation 0.746, LogArea 0.870, LogNearest −0.040, SqrtScruz −0.043, LogAdjacent 0.111; LogElevation and LogArea are correlated at 0.904.]

Looking at these graphs, it is clear that log(Elevation) and log(Area) are highly correlated and we should probably have one or the other, but not both, in the model.

m.c <- lm(log(Species) ~ log(Area) + log(Nearest) + sqrt(Scruz) + log(Adjacent), data=gala)
summary(m.c)$coefficients %>% round(digits=3) # more readable...

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.285 0.275 11.960 0.000
## log(Area) 0.402 0.043 9.443 0.000
## log(Nearest) -0.041 0.118 -0.351 0.728
## sqrt(Scruz) -0.049 0.045 -1.085 0.288
## log(Adjacent) -0.024 0.046 -0.529 0.602

We will remove all the parameters that appear to be superfluous, and perform
an F-test to confirm that the simple model is sufficient.

m.s <- lm(log(Species) ~ log(Area), data=gala)
anova(m.s, m.c)

## Analysis of Variance Table
##
## Model 1: log(Species) ~ log(Area)
## Model 2: log(Species) ~ log(Area) + log(Nearest) + sqrt(Scruz) + log(Adjacent)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 17.218
## 2 25 15.299 3 1.9196 1.0456 0.3897

Next we will look at the coefficients.

summary(m.s)

##
## Call:
## lm(formula = log(Species) ~ log(Area), data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5442 -0.4001 0.0941 0.5449 1.3752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9037 0.1571 18.484 < 2e-16 ***

## log(Area) 0.3886 0.0416 9.342 4.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7842 on 28 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7484
## F-statistic: 87.27 on 1 and 28 DF, p-value: 4.23e-10

The slope coefficient (0.3886) is the increase in log(Species) for every 1-unit increase in log(Area). Unfortunately that is not particularly convenient for interpretation, and we will address this in the next section of this chapter.
Finally, we might be interested in creating a confidence interval for the expected
number of tortoise species for an island with Area=50.

x0 <- data.frame(Area=50)
log.Species.CI <- predict(m.s, newdata=x0, interval='confidence')
log.Species.CI # Log(Species) scale

## fit lwr upr
## 1 4.423903 4.068412 4.779394

exp(log.Species.CI) # Species scale

## fit lwr upr
## 1 83.42122 58.46403 119.0322

Notice that on the species-scale, we see that the fitted value is not in the center
of the confidence interval.
To help us understand what the log transformations are doing, we can produce
a plot with the island Area on the x-axis and the expected number of Species on
the y-axis and hopefully that will help us understand the relationship between
the two.

library(ggplot2)
pred.data <- data.frame(Area=1:50)
pred.data <- pred.data %>%
cbind( predict(m.s, newdata=pred.data, interval='conf'))
ggplot(pred.data, aes(x=Area)) +
geom_line(aes(y=exp(fit))) +
geom_ribbon(aes(ymin=exp(lwr), ymax=exp(upr)), alpha=.2) +
ylab('Number of Species')

[Figure: expected Number of Species vs Area (1 to 50) with the back-transformed fitted curve and confidence band.]

8.4 Interpretation of log transformed variable coefficients

One of the most difficult issues surrounding transformed variables is that the
interpretation is difficult. Compared to taking the square root, log transforma-
tions are surprisingly interpretable on the original scale. Here we look at the
interpretation of log transformed variables.

To investigate the effects of a log transformation, we'll examine a dataset that predicts the writing scores of 𝑛 = 200 students using the gender, reading and math scores. This example was taken from the UCLA Statistical Consulting Group.

file <- 'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lgtrans.csv' # on the internet
file <- 'data-raw/lgtrans.csv'                                              # on my local computer
scores <- read.csv(file=file)
scores <- scores %>% rename(gender = female)

scores %>%
dplyr::select(write, read, math, gender) %>%
GGally::ggpairs( aes(color=gender),
upper=list(continuous='points'), lower=list(continuous='cor'))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: pairs plot of write, read, math, and gender, colored by gender. Correlations: write–read 0.597 (female 0.621, male 0.648); write–math 0.617 (female 0.675, male 0.627); read–math 0.662 (female 0.711, male 0.609).]

These data look pretty decent, and I'm not certain that I would do any transformation, but for the sake of having a concrete example with both continuous and categorical covariates, we will interpret effects on a student's writing score.

8.4.1 Log-transformed response, un-transformed covariates

We consider the model where we have transformed the response variable and
just an intercept term.
log 𝑦 = 𝛽0 + 𝜖

model <- lm(log(write) ~ 1, data=scores)
broom::tidy(model)

## # A tibble: 1 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.95 0.0137 288. 7.01e-263

We interpret the intercept as the mean of the log-transformed response values. We could back transform this to the original scale, 𝑦̂ = 𝑒^𝛽̂₀ = 𝑒^3.94835 = 51.85, as a typical value of write. To distinguish this from the usually defined mean of the write values, we will call this the geometric mean. Instead of calculating this by hand, we can have emmeans() do it for us.

emmeans(model, ~1) # Return y-hat value on the log-scale

## 1 emmean SE df lower.CL upper.CL
## overall 3.95 0.0137 199 3.92 3.98

##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95

emmeans(model, ~1, type='response') # Return y-hat value on the original scale

## 1 response SE df lower.CL upper.CL
## overall 51.8 0.71 199 50.5 53.3
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
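We can also verify the geometric-mean interpretation by hand: exponentiating the mean of the log values reproduces the back-transformed intercept (51.85 from above), and it differs from the ordinary arithmetic mean.

```r
exp( mean( log(scores$write) ) )  # geometric mean: matches exp(beta0-hat)
mean( scores$write )              # arithmetic mean, which is somewhat larger
```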

Next we examine how to interpret the model when a categorical variable is added to the model.

log 𝑦 = 𝛽₀ + 𝜖        if female
log 𝑦 = 𝛽₀ + 𝛽₁ + 𝜖   if male

model <- lm(log(write) ~ gender, data=scores)
broom::tidy(model)

## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.00 0.0179 223. 1.02e-239
## 2 gendermale -0.103 0.0266 -3.89 1.39e- 4

The intercept is now the mean of the log-transformed write responses for the females, and thus 𝑒^𝛽̂₀ = 𝑦̂_f, while the offset for males is the change in log(write) from the female group. Notice that for the males, we have

log 𝑦̂_m = 𝛽̂₀ + 𝛽̂₁
𝑦̂_m = 𝑒^(𝛽̂₀+𝛽̂₁) = 𝑒^𝛽̂₀ ∗ 𝑒^𝛽̂₁ = 𝑦̂_f ∗ (multiplier for males)

and therefore we see that males tend to have writing scores 𝑒^(−0.103) = 0.90 = 90% of the females'. Typically this sort of result would be reported as the males having a 10% lower writing score than the females.
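The point estimates on this multiplicative scale can also be obtained by exponentiating the fitted coefficients directly:

```r
# Back-transform the gender model's coefficients to multipliers:
# the intercept becomes the geometric mean for females, and the
# gendermale term becomes the male/female ratio of about 0.90
exp( coef( lm(log(write) ~ gender, data=scores) ) )
```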
Hand calculating these is challenging to do correctly, but as usual we can have emmeans calculate it for us.

# I used reverse pairwise to get the ratio as male/female instead of female/male
emmeans(model, revpairwise~gender, type='response') %>%
  .[['contrasts']]

## contrast ratio SE df t.ratio p.value
## male / female 0.902 0.024 198 -3.887 0.0001
##
## Tests are performed on the log scale

The model with a continuous covariate has a similar interpretation.

log 𝑦 = 𝛽₀ + 𝛽₂𝑥 + 𝜖        if female
log 𝑦 = 𝛽₀ + 𝛽₁ + 𝛽₂𝑥 + 𝜖   if male

We will use the reading score read to predict the writing score. Then 𝛽̂₂ is the predicted increase in log(write) for every 1-unit increase in read score. The interpretation of 𝛽̂₀ is now log 𝑦̂ when 𝑥 = 0, and therefore 𝑦̂ = 𝑒^𝛽̂₀ when 𝑥 = 0.

model <- lm(log(write) ~ gender + read, data=scores) # main effects model
broom::tidy(model)

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.41 0.0546 62.5 8.68e-132
## 2 gendermale -0.116 0.0210 -5.52 1.08e- 7
## 3 read 0.0113 0.00102 11.1 2.02e- 22

For females, we consider the difference in log 𝑦̂ for a 1-unit increase in 𝑥 and will interpret this on the original write scale. We have

log 𝑦̂_f = 𝛽̂₀ + 𝛽̂₂𝑥
𝑦̂_f = 𝑒^(𝛽̂₀+𝛽̂₂𝑥)

and therefore we consider 𝑒^𝛽̂₂ as the multiplicative increase in write score for a 1-unit increase in 𝑥, because of the following. Consider 𝑥₁ and 𝑥₂ = 𝑥₁ + 1. Then the ratio of predicted values is

𝑦̂₂/𝑦̂₁ = 𝑒^(𝛽̂₀+𝛽̂₂(𝑥+1)) / 𝑒^(𝛽̂₀+𝛽̂₂𝑥) = 𝑒^𝛽̂₂

For our writing scores example we have that 𝑒^𝛽̂₂ = 𝑒^0.0113 = 1.011, meaning there is an estimated 1% increase in write score for every 1-point increase in read score.

If we are interested in, say, a 20-unit increase in 𝑥, then that would result in a multiplicative increase of

𝑒^(𝛽̂₀+𝛽̂₂(𝑥+20)) / 𝑒^(𝛽̂₀+𝛽̂₂𝑥) = 𝑒^(20𝛽̂₂) = (𝑒^𝛽̂₂)^20

and for the writing scores we have

𝑒^(20𝛽̂₂) = (𝑒^𝛽̂₂)^20 = 1.0113^20 = 1.25

or a 25% increase in writing score for a 20-point increase in reading score.

# to make emmeans calculate this, we must specify a 1-unit or 20-unit increase
emmeans(model, pairwise ~ read, at=list(read=c(51,50)), type='response') %>%
  .[['contrasts']]

## contrast ratio SE df t.ratio p.value
## 51 / 50 1.011 0.001032 197 11.057 <.0001
##
## Results are averaged over the levels of: gender
## Tests are performed on the log scale

emmeans(model, pairwise ~ read, at=list(read=c(90,70)), type='response') %>%
  .[['contrasts']]

## contrast ratio SE df t.ratio p.value
## 90 / 70 1.25 0.0256 197 11.057 <.0001
##
## Results are averaged over the levels of: gender
## Tests are performed on the log scale

In short, we can interpret 𝑒^𝛽̂ᵢ as the multiplicative increase/decrease in the non-transformed response variable. Some students get confused by what is meant by a % increase or decrease in 𝑦.

• A 75% decrease in 𝑦 has a resulting value of (1 − 0.75)𝑦 = (0.25)𝑦
• A 75% increase in 𝑦 has a resulting value of (1 + 0.75)𝑦 = (1.75)𝑦
• A 100% increase in 𝑦 has a resulting value of (1 + 1.00)𝑦 = 2𝑦 and is a doubling of 𝑦.
• A 50% decrease in 𝑦 has a resulting value of (1 − 0.5)𝑦 = (0.5)𝑦 and is a halving of 𝑦.
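As a quick check of this arithmetic:

```r
y <- 100
y * (1 - 0.75)  # 75% decrease  -> 25
y * (1 + 0.75)  # 75% increase  -> 175
y * (1 + 1.00)  # 100% increase -> 200 (a doubling)
y * (1 - 0.50)  # 50% decrease  -> 50  (a halving)
```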

8.4.2 Un-transformed response, log-transformed covariate

We consider the model

𝑦 = 𝛽₀ + 𝛽₂ log 𝑥 + 𝜖

and consider two different values of 𝑥 (which we'll call 𝑥₁ and 𝑥₂, where we are considering the effect of moving from 𝑥₁ to 𝑥₂) and look at the difference between the predicted values 𝑦̂₂ − 𝑦̂₁.

𝑦̂₂ − 𝑦̂₁ = [𝛽̂₀ + 𝛽̂₂ log 𝑥₂] − [𝛽̂₀ + 𝛽̂₂ log 𝑥₁]
        = 𝛽̂₂ [log 𝑥₂ − log 𝑥₁]
        = 𝛽̂₂ log(𝑥₂/𝑥₁)

This means that so long as the ratio between the two x-values is constant, then
the change in 𝑦 ̂ is the same. So doubling the value of 𝑥 from 1 to 2 has the same
effect on 𝑦 ̂ as changing x from 50 to 100.
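We can check this numerically with any slope value; here we use 29.045, the slope on log(read) that we will fit below. The predicted change in 𝑦 depends only on the ratio of the two x-values:

```r
b2 <- 29.045   # slope on log(read) from the model fit below

b2 * ( log(2)   - log(1)  )  # doubling x from 1 to 2
b2 * ( log(100) - log(50) )  # doubling x from 50 to 100: identical change in y-hat
```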

model <- lm( write ~ gender + log(read), data=scores)
broom::tidy(model)

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -59.1 9.95 -5.94 1.28e- 8
## 2 gendermale -5.43 1.01 -5.36 2.29e- 7
## 3 log(read) 29.0 2.53 11.5 9.98e-24

# predict writing scores for three females,
# each with a reading score 50% larger than the previous one
predict(model, newdata=data.frame(gender=rep('female',3),
                                  read=c(40, 60, 90)))

## 1 2 3
## 48.06622 59.84279 71.61936

We should see a

29.045 log(1.5) = 11.78

difference in 𝑦̂ values between the first and second students, and likewise between the second and third.

emmeans(model, revpairwise~log(read), at=list(read=c(2,4,8)))

## $emmeans
## read emmean SE df lower.CL upper.CL
## 2 -41.66 8.21 197 -57.9 -25.46
## 4 -21.53 6.47 197 -34.3 -8.78
## 8 -1.39 4.72 197 -10.7 7.92
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 4 - 2 20.1 1.75 197 11.493 <.0001
## 8 - 2 40.3 3.50 197 11.493 <.0001
## 8 - 4 20.1 1.75 197 11.493 <.0001
##
## Results are averaged over the levels of: gender
## P value adjustment: tukey method for comparing a family of 3 estimates

8.4.3 Log-transformed response, log-transformed covariate

This combines the interpretations in the previous two sections. We consider

log 𝑦 = 𝛽₀ + 𝛽₂ log 𝑥 + 𝜖

and we again consider two 𝑥 values (again 𝑥₁ and 𝑥₂). We then examine the difference in the log 𝑦̂ values as

log 𝑦̂₂ − log 𝑦̂₁ = [𝛽̂₀ + 𝛽̂₂ log 𝑥₂] − [𝛽̂₀ + 𝛽̂₂ log 𝑥₁]
log(𝑦̂₂/𝑦̂₁) = 𝛽̂₂ log(𝑥₂/𝑥₁)
log(𝑦̂₂/𝑦̂₁) = log[(𝑥₂/𝑥₁)^𝛽̂₂]
𝑦̂₂/𝑦̂₁ = (𝑥₂/𝑥₁)^𝛽̂₂

This allows us to examine the effect of some arbitrary percentage increase in 𝑥 as a percentage increase in 𝑦.

model <- lm(log(write) ~ gender + log(read), data=scores)
broom::tidy(model)

## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.71 0.205 8.36 1.14e-14
## 2 gendermale -0.114 0.0209 -5.48 1.27e- 7
## 3 log(read) 0.581 0.0521 11.1 1.08e-22

which implies that for a 10% increase in read score, we should see a 1.10^0.581 = 1.056 multiplier in write score. That is to say, a 10% increase in reading score results in roughly a 5.6% increase in writing score.
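Checking that multiplier by hand:

```r
1.10^0.581   # multiplier on write for a 10% increase in read; about 1.056
```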

emmeans(model, pairwise~log(read), at=list(read=c(55,50)),
        var='log(read)', type='response')

## $emmeans
## read response SE df lower.CL upper.CL
## 55 53.8 0.594 197 52.6 54.9
## 50 50.9 0.535 197 49.8 51.9
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df t.ratio p.value
## 55 / 50 1.06 0.00525 197 11.148 <.0001
##
## Results are averaged over the levels of: gender
## Tests are performed on the log scale

For the Galapagos islands, we had

m.s <- lm(log(Species) ~ log(Area), data=gala)
broom::tidy(m.s)

## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.90 0.157 18.5 3.17e-17
## 2 log(Area) 0.389 0.0416 9.34 4.23e-10

emmeans(m.s, pairwise~Area, at=list(Area=c(400, 200)), type='response')

## $emmeans
## Area response SE df lower.CL upper.CL
## 400 187 43.7 28 116.0 302
## 200 143 30.2 28 92.7 221
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df t.ratio p.value
## 400 / 200 1.31 0.0377 28 9.342 <.0001
##
## Tests are performed on the log scale

and therefore a doubling of Area (i.e. the ratio 𝐴𝑟𝑒𝑎₂/𝐴𝑟𝑒𝑎₁ = 2) results in a 2^0.389 = 1.31 multiplier of the Species value. That is to say, doubling the island area increases the number of species by 31%.
In the table below, 𝛽 represents the group offset value, or the slope value associated with 𝑥. If we are in a model with multiple slopes, such as an ANCOVA model, then the 𝛽 term represents the slope of whatever group you are interested in.

Response   Explanatory         Term             Interpretation
log(𝑦)     Categorical         𝑒^𝛽              Switching from the reference group results in this multiplicative change on 𝑦.
log(𝑦)     Continuous 𝑥        𝑒^𝛽              A 1-unit change in 𝑥 results in this multiplicative change on 𝑦.
log(𝑦)     Continuous 𝑥        (𝑒^𝛽)^𝛿          A 𝛿-unit change in 𝑥 results in this multiplicative change on 𝑦.
𝑦          Continuous log(𝑥)   𝛽 log(𝑥₂/𝑥₁)     The proportional change in 𝑥 results in an additive change on 𝑦.
log(𝑦)     Continuous log(𝑥)   (𝑥₂/𝑥₁)^𝛽        The proportional change in 𝑥 results in this multiplicative change on 𝑦.

8.5 Exercises
1. In the ANCOVA chapter, we examined the effect of the dose of vitamin C on guinea pig tooth growth based on how the vitamin was delivered (orange juice vs a pill supplement).
a. Load the ToothGrowth data which is available in base R.
b. Plot the data with log dose level on the x-axis and tooth length
growth on the y-axis. Color the points by supplement type.
c. Fit a linear model using the log transformed dose.
d. Interpret the effect of doubling the dose on tooth growth for the OJ
and VC supplement groups.
2. We will consider the relationship between income and race using a subset
of employed individuals from the American Community Survey.
a. Load the EmployedACS dataset from the Lock5Data package.
b. Create a box plot showing the relationship between Race and Income.
c. Consider the boxcox family of transformations of Income. What transformation seems appropriate? Consider both the square-root and log transformations. While the race differences are not statistically significant in either case, there is an interesting shift in how the black, white, and other groups are related. Because there are people with zero income, we have to do something: we could either use a transformation like √𝑦, remove all the zero observations, or add a small value to the zero observations. We'll add 0.05 to the zero values, which represents the zero-income people receiving $50. Graph both the log and square root transformations. Does either completely address the issue? What about the cube-root (𝜆 = 1/3)?
d. Using your cube-root transformed Income variable, fit an ANOVA
model and evaluate the relationship between race and income uti-
lizing these data. Provide (and interpret) the point estimates, even
though they aren’t statistically significant. Importantly, we haven’t
accounted for many sources of variability such as education level and
job type. There is much more to consider than just this simple anal-
ysis.
3. The dataset Lock5Data::HomesForSale has a random sample of home prices in 4 different states. Consider a regression model predicting the Price as a function of Size, Bedrooms, and Baths.
a) Examine a scatterplot of Price and Size and justify a log transfor-
mation to both.
b) Build an appropriate model considering only main effects. Find any
observations that are unusual and note any decisions about including
correlated variables.
c) Calculate and interpret the estimated effect of a 10% increase in home
size on the price.
d) Calculate and interpret the difference between California and Penn-
sylvania in terms of average house price.
Correlated Covariates

library(tidyverse)

Interpretation with Correlated Covariates

The standard interpretation of the slope parameter is that 𝛽ⱼ is the amount of increase in 𝑦 for a one unit increase in the 𝑗th covariate, provided that all other covariates stay the same.
The difficulty with this interpretation is that covariates are often related, and
the phrase “all other covariates stayed the same” is often not reasonable. For
example, if we have a dataset that models the mean annual temperature of a lo-
cation as a function of latitude, longitude, and elevation, then it is not physically
possible to hold latitude, and longitude constant while changing elevation.
One common issue that makes interpretation difficult is that covariates can be highly correlated.
Perch Example: We might be interested in estimating the weight of a fish based off of its length and width. The dataset we will consider consists of fish caught from the same lake (Laengelmavesi) near Tampere in Finland. The following variables were observed:

Variable Interpretation
Weight Weight (g)
Length.1 Length from nose to beginning of Tail (cm)
Length.2 Length from nose to notch of Tail (cm)
Length.3 Length from nose to tip of tail (cm)
Height Maximal height as a percentage of Length.3
Width Maximal width as a percentage of Length.3
Sex 0=Female, 1=Male
Species Which species of perch (1-7)


We first look at the data and observe the expected relationship between length
and weight.

# File location on the internet or on my local computer.
file <- 'https://raw.githubusercontent.com/dereksonderegger/571/master/data-raw/Fish.csv'
file <- '~/github/571/data-raw/Fish.csv'

fish <- read.table(file, header=TRUE, skip=111, sep=',') %>%
  filter( !is.na(Weight) )

### generate a pairs plot in a couple of different ways...
# pairs(fish)
# pairs( Weight ~ Length.1 + Length.2 + Length.3 + Height + Width, data=fish )
# pairs( Weight ~ ., data=fish )

fish %>%
  dplyr::select(Weight, Length.1, Length.2, Length.3, Height, Width) %>%
  GGally::ggpairs(upper=list(continuous='points'),
                  lower=list(continuous='cor'))

(Pairs plot of Weight, Length.1, Length.2, Length.3, Height, and Width. The
three length measurements are almost perfectly correlated with one another
(correlations from 0.992 to 1.000) and each is strongly correlated with Weight
(0.916 to 0.924), while Height and Width are only weakly related to the lengths.)

Naively, we might consider the linear model with all the length effects present.

model <- lm(Weight ~ Length.1 + Length.2 + Length.3 + Height + Width, data=fish)
summary(model)

##
## Call:
## lm(formula = Weight ~ Length.1 + Length.2 + Length.3 + Height +
## Width, data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.22 -79.72 -39.88 92.63 344.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -724.539 77.133 -9.393 <2e-16 ***
## Length.1 32.389 45.134 0.718 0.4741
## Length.2 -9.184 48.367 -0.190 0.8497
## Length.3 8.747 16.283 0.537 0.5919
## Height 4.947 2.768 1.787 0.0759 .
## Width 8.636 6.972 1.239 0.2174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132.9 on 152 degrees of freedom
## Multiple R-squared: 0.8675, Adjusted R-squared: 0.8631
## F-statistic: 199 on 5 and 152 DF, p-value: < 2.2e-16

This is crazy. There is a negative relationship between Length.2 and Weight.
That does not make any sense unless you realize that this is the effect of
Length.2 assuming the other covariates are in the model and can be held
constant while changing the value of Length.2, which is obviously ridiculous.

If we remove the highly correlated covariates, then we see a much better behaved
model.

model <- lm(Weight ~ Length.2 + Height + Width, data=fish)
summary(model)

##
## Call:
## lm(formula = Weight ~ Length.2 + Height + Width, data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -306.14 -75.11 -36.45 89.54 337.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -701.0750 71.0438 -9.868 < 2e-16 ***
## Length.2 30.4360 0.9841 30.926 < 2e-16 ***
## Height 5.5141 1.4311 3.853 0.000171 ***
## Width 5.6513 5.2016 1.086 0.278974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132.3 on 154 degrees of freedom
## Multiple R-squared: 0.8669, Adjusted R-squared: 0.8643
## F-statistic: 334.2 on 3 and 154 DF, p-value: < 2.2e-16

When you have two variables in a model that are highly positively correlated,
you often find that one will have a positive coefficient and the other will be
negative. Likewise, if two variables are highly negatively correlated, the two
regression coefficients will often be the same sign.
In this case the sum of the three length coefficient estimates was approximately
31 in both models. With three highly correlated length variables, the second
coefficient could be negative and the third positive with approximately the same
magnitude, and we would get approximately the same fitted model as when the
second and third length variables are missing from the model. To see why, note
that the full model is

𝑦𝑖 = 𝛽0 + 𝛽1 𝐿1 + 𝛽2 𝐿2 + 𝛽3 𝐿3 + 𝜖𝑖

but if the covariates are highly correlated, then approximately 𝐿1 = 𝐿2 = 𝐿3 = 𝐿
and this equation could equally well be written as

𝑦𝑖 = 𝛽0 + (𝛽1 + 𝛽2 + 𝛽3 ) 𝐿 + 𝜖𝑖

or as

𝑦𝑖 = 𝛽0 + [(𝛽1 − 10) + 𝛽2 + (𝛽3 + 10)] 𝐿 + 𝜖𝑖

In general, you should be very careful with the interpretation of the regression
coefficients when the covariates are highly correlated.
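This sign-flipping behavior is easy to reproduce by simulation. The sketch below uses entirely made-up data (not part of the fish analysis) with two nearly identical covariates; the individual coefficients are unstable, but their sum and the fitted values are not:

```r
# Simulate two nearly identical covariates; only L1 actually drives y.
set.seed(42)
n  <- 100
L1 <- runif(n, 10, 40)
L2 <- L1 + rnorm(n, sd=0.1)      # almost a copy of L1
y  <- 5 + 3*L1 + rnorm(n, sd=2)

both <- lm(y ~ L1 + L2)  # the two slopes split, often with opposite signs
one  <- lm(y ~ L1)       # a single stable slope near 3

coef(both)
sum(coef(both)[2:3])     # the sum of the two slopes is still close to 3
coef(one)

# The fitted values from the two models are nearly identical.
head( cbind(fitted(both), fitted(one)) )
```

Rerunning with a different seed will often flip which of the two slopes is negative, while the sum and the fitted values barely change.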

Solutions

In general you have four ways to deal with correlated covariates.

1. Select one of the correlated covariates to include in the model. You could
   select the covariate with the highest correlation with the response, or you
   could use your scientific judgment as to which covariate is most appropriate.

2. You could create an index variable that combines the correlated covariates
into a single amalgamation variable. For example, we could create 𝐿 =
(𝐿1 + 𝐿2 + 𝐿3 )/3 to be the average of the three length measurements.

3. We could use some multivariate method such as Principal Components
   Analysis to inform us on how to make the index variable. We won’t
   address this here, though.

4. Just not worry about it. Remember, the only problem is with the
   interpretation of the parameters. If we only care about making accurate
   predictions within the scope of our data, then including highly correlated
   variables doesn’t matter. However, if you are making predictions away
   from the data, the flipped signs might be a big deal.
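Option 2 is easy to implement for the perch data. Assuming the fish data frame loaded at the start of the chapter, a sketch:

```r
# Combine the three nearly identical length measurements into one index
# variable and use it in place of the individual lengths.
fish <- fish %>%
  mutate( Length = (Length.1 + Length.2 + Length.3) / 3 )

model <- lm(Weight ~ Length + Height + Width, data=fish)
summary(model)$coefficients   # a single, well-behaved length coefficient
```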
Chapter 9

Variable Selection

library(tidyverse) # dplyr, tidyr, ggplot2, etc

Given a set of data, we are interested in selecting the best subset of predictors
for the following reasons:

1. Occam’s Razor tells us that from a list of plausible models or explanations,
   the simplest is usually the best. In the statistics sense, I want the smallest
   model that adequately explains the observed data patterns.

2. Unnecessary predictors add noise to the estimates of other quantities and
   will waste degrees of freedom, possibly increasing the estimate of σ̂².

3. We might have variables that are co-linear.

The problems that arise in the diagnostics of a model will often lead a researcher
to consider other models, for example to include a quadratic term to account for
curvature. The model building process is often an iterative procedure where we
build a model, examine the diagnostic plots and consider what could be added
or modified to correct issues observed.

9.1 Nested Models

Often one model is just a simplification of another and can be obtained by
setting some subset of 𝛽𝑖 values equal to zero. Those models can be adequately
compared by the F-test, which we have already made great use of.


We should be careful to note that we typically do not want to remove the main
covariate from the model if the model uses the covariate in a more complicated
fashion. For example, if my model is

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑥² + 𝜖

where 𝜖 ∼ 𝑁(0, σ²), then considering the simplification 𝛽1 = 0 and removing
the effect of 𝑥 is not desirable because that forces the parabola to be symmetric
about 𝑥 = 0. Similarly, if the model contains an interaction effect, then the
removal of the main effect drastically alters the interpretation of the interaction
coefficients and should be avoided. Oftentimes removing a lower complexity
term while keeping a higher complexity term results in unintended consequences
and is typically not recommended.
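Comparing a model against a nested simplification is done in R with anova(). The sketch below is purely illustrative (it uses the built-in mtcars data rather than an example from this chapter):

```r
# F-test of a nested simplification: does the quadratic term earn its keep?
small <- lm(mpg ~ hp,           data=mtcars)
big   <- lm(mpg ~ hp + I(hp^2), data=mtcars)

# H0: the extra coefficient is zero, i.e. the small model is adequate.
anova(small, big)
```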

9.2 Testing-Based Model Selection


Starting with a model that is likely too complex, consider a list of possible
terms to remove, removing each in turn and comparing the resulting model
to the starting model using an F-test. Whichever term has the highest p-value
is removed and the process is repeated until all remaining terms are significant.
This is often referred to as backward selection.
It should be noted that the cutoff value for significance here does not have to
be 𝛼 = 0.05. If prediction performance is the primary goal, then a more liberal
𝛼 level is appropriate.
Starting with a model that is likely too small, consider adding terms until there
are no more terms that when added to the model are significant. This is called
forward selection.
Stepwise selection is a hybrid between forward selection and backward elimination:
at every stage, a term is either added or removed.
Stepwise, forward, and backward selection are commonly used but there are
some issues.

1. Because of the “one-at-a-time” nature of the addition/deletion, the
   optimal model might not be found.

2. p-values should not be treated literally. Because the multiple comparisons
   issue is completely ignored, the p-values are lower than they should be if
   multiple comparisons were accounted for. As such, it is possible to sort
   through a huge number of potential covariates and find one with a low
   p-value simply by random chance. This is “data dredging” and is a serious
   issue.

3. As non-thinking algorithms, these methods ignore the science behind the
   data and might include two variables that are highly collinear or might
   ignore variables that are scientifically interesting.

9.2.1 Example - U.S. Life Expectancy

Using data from the Census Bureau we can look at the life expectancy as a
response to a number of predictors. One R function that is often convenient
to use is the update() function that takes an lm() object and adds or removes
things from the formula. The notation . ~ . means to leave the response and
all the predictors alone, while . ~ . + vnew will add the main effect of vnew
to the model.

data('state') # loads a matrix state.x77 and a vector of state abbreviations

# Convert from a matrix to a data frame with state abbreviations
state.data <- data.frame(state.x77, row.names=state.abb)
str(state.data)

## 'data.frame': 50 obs. of 8 variables:
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...

We should first look at the relationships among the response and the covariates.

state.data %>%
dplyr::select( Life.Exp, Population:Area ) %>%
GGally::ggpairs( upper=list(continuous='points'), lower=list(continuous='cor') )

(Pairs plot of Life.Exp, Population, Income, Illiteracy, Murder, HS.Grad, Frost,
and Area. Life.Exp is most strongly correlated with Murder (−0.781), Illiteracy
(−0.588), and HS.Grad (0.582), while Population and Area show strong right skew.)

I want to add a quadratic effect for HS.Grad rate and for Income. Also, we see
that Population and Area seem to have some high skew to their distributions,
so a log transformation might help. We’ll modify the data and then perform
the backward elimination method starting with the model with all predictors as
main effects.

state.data <- state.data %>%
  mutate( HS.Grad.2      = HS.Grad ^ 2,
          Income.2       = Income ^ 2,
          Log.Population = log(Population),
          Log.Area       = log(Area)) %>%
  dplyr::select( -Population, -Area ) # remove the original Population and Area covariates

# explicitly define my starting model
m1 <- lm(Life.Exp ~ Log.Population + Income + Illiteracy +
           Murder + HS.Grad + Frost + HS.Grad.2 + Income.2 + Log.Area, data=state.data)
#
# Define the same model, but using shorthand
# The '.' means everything else in the data frame
m1 <- lm( Life.Exp ~ ., data=state.data)
summary(m1)$coefficients %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.174 6.984 8.759 0.000
## Income 0.004 0.003 1.517 0.137
## Illiteracy 0.367 0.388 0.946 0.350
## Murder -0.304 0.049 -6.214 0.000
## HS.Grad -0.031 0.208 -0.151 0.881
## Frost -0.004 0.003 -1.068 0.292
## HS.Grad.2 0.001 0.002 0.394 0.696
## Income.2 0.000 0.000 -1.518 0.137
## Log.Population 0.191 0.150 1.273 0.211
## Log.Area 0.100 0.113 0.885 0.382

The signs make reasonable sense (higher murder rates decrease life expectancy)
but covariates like Income are not significant, which is surprising. The largest
p-value is HS.Grad. However, I don’t want to remove the lower-order graduation
term and keep the squared-term. So instead I will remove both of them since
they have the highest p-values. Notice that HS.Grad is correlated with Income
and Illiteracy.

# Remove the Graduation Rate terms from the model
m1 <- update(m1, .~. - HS.Grad - HS.Grad.2)
summary(m1)$coefficients %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.894 6.100 9.983 0.000
## Income 0.004 0.002 1.711 0.094
## Illiteracy 0.087 0.373 0.233 0.817
## Murder -0.318 0.048 -6.686 0.000
## Frost -0.006 0.003 -1.807 0.078
## Income.2 0.000 0.000 -1.581 0.121
## Log.Population 0.041 0.132 0.309 0.759
## Log.Area 0.206 0.103 1.995 0.053

# Next remove Illiteracy
m1 <- update(m1, .~. - Illiteracy)
summary(m1)$coefficients %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.779 4.717 13.098 0.000
## Income 0.004 0.002 1.872 0.068
## Murder -0.314 0.042 -7.423 0.000
## Frost -0.006 0.003 -2.345 0.024
## Income.2 0.000 0.000 -1.699 0.097
## Log.Population 0.041 0.130 0.314 0.755
## Log.Area 0.198 0.097 2.048 0.047

# And Log.Population...
m1 <- update(m1, .~. - Log.Population)
summary(m1)$coefficients %>% round(digits=3)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.413 4.523 13.577 0.000
## Income 0.004 0.002 2.257 0.029
## Murder -0.309 0.039 -7.828 0.000
## Frost -0.006 0.002 -2.612 0.012
## Income.2 0.000 0.000 -2.044 0.047
## Log.Area 0.200 0.096 2.091 0.042

The removal of Income.2 is a tough decision because the p-value is very close
to 𝛼 = 0.05 and might be left in if it makes model interpretation easier or if the
researcher feels a quadratic effect in income is appropriate (perhaps rich people
are too stressed?).

summary(m1)

##
## Call:
## lm(formula = Life.Exp ~ Income + Murder + Frost + Income.2 +
## Log.Area, data = state.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28858 -0.50631 -0.07242 0.49738 1.75839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.141e+01 4.523e+00 13.577 < 2e-16 ***
## Income 4.212e-03 1.867e-03 2.257 0.0290 *
## Murder -3.092e-01 3.950e-02 -7.828 7.14e-10 ***
## Frost -6.487e-03 2.483e-03 -2.612 0.0123 *
## Income.2 -4.188e-07 2.049e-07 -2.044 0.0470 *
## Log.Area 2.002e-01 9.576e-02 2.091 0.0424 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7349 on 44 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.7003
## F-statistic: 23.9 on 5 and 44 DF, p-value: 1.549e-11

We are left with a model that adequately explains Life.Exp but we should
be careful to note that just because a covariate was removed from the model
does not imply that it isn’t related to the response. For example, being a high
school graduate is highly correlated with not being illiterate, as is Income, and
a model using Illiteracy in their place shows that illiteracy is associated with
lower life expectancy, even though it is not as predictive.

m2 <- lm(Life.Exp ~ Illiteracy+Murder+Frost, data=state.data)
summary(m2)

##
## Call:
## lm(formula = Life.Exp ~ Illiteracy + Murder + Frost, data = state.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59010 -0.46961 0.00394 0.57060 1.92292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.556717 0.584251 127.611 < 2e-16 ***
## Illiteracy -0.601761 0.298927 -2.013 0.04998 *
## Murder -0.280047 0.043394 -6.454 6.03e-08 ***
## Frost -0.008691 0.002959 -2.937 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7911 on 46 degrees of freedom
## Multiple R-squared: 0.6739, Adjusted R-squared: 0.6527
## F-statistic: 31.69 on 3 and 46 DF, p-value: 2.915e-11

Notice that the 𝑅² values for both models are quite similar (0.7309 vs 0.6739),
but the first model, with the higher 𝑅², has more predictor variables. Which
model should I prefer? I can’t do an F-test because these models are not nested.

9.3 Criterion Based Procedures

9.3.1 Information Criterions

It is often necessary to compare models that are not nested. For example, I
might want to compare
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜖
vs
𝑦 = 𝛽 0 + 𝛽2 𝑤 + 𝜖

This comparison comes about naturally when doing forward model selection
and we are looking for the “best” covariate to add to the model first.
Akaike introduced his criterion (which he called “An Information Criterion”) as

𝐴𝐼𝐶 = −2 log L(β̂, σ̂ | data) + 2𝑝

where the first term decreases as the RSS decreases and the 2𝑝 term increases
as 𝑝 increases. Here L(β̂, σ̂ | data) is the likelihood function and 𝑝 is the number
of elements in the β̂ vector, and we regard a lower AIC value as better. Notice
the 2𝑝 term is essentially a penalty on adding additional covariates, so to lower
the AIC value a new predictor must lower the negative log-likelihood more than
it increases the penalty.
To convince ourselves that the first summand decreases with decreasing RSS in
the standard linear model, we examine the likelihood function

f(y | β, σ, X) = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ]
             = L(β, σ | y, X)

and we could re-write the log-likelihood as

log L(β̂, σ̂ | data) = −log( (2πσ̂²)^(n/2) ) − (1/(2σ̂²)) (y − Xβ̂)ᵀ(y − Xβ̂)
                   = −(n/2) log(2πσ̂²) − (1/(2σ̂²)) (y − Xβ̂)ᵀ(y − Xβ̂)
                   = −(1/2) [ n log(2πσ̂²) + (1/σ̂²) (y − Xβ̂)ᵀ(y − Xβ̂) ]
                   = −(1/2) [ n log(2π) + n log σ̂² + (1/σ̂²) RSS ]

It isn’t clear what we should do with the 𝑛 log(2𝜋) term in the log-likelihood.
There are some compelling reasons to ignore it and use only the remaining
terms, and there are reasons to use all of the terms. Unfortunately, statisticians
have not settled on one convention or the other, and different software packages
might therefore report different values for AIC.
As a general rule of thumb, if the difference in AIC values is less than two then
the models are not significantly different, differences between 2 and 4 AIC units
are marginally significant and any difference greater than 4 AIC units is highly
significant.
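In R we can check how AIC() is built from the log-likelihood. The sketch below is illustrative (it uses the built-in mtcars data); the same check works on any lm object:

```r
# Reconstruct AIC "by hand" from the log-likelihood.
fit <- lm(mpg ~ hp + wt, data=mtcars)

ll <- logLik(fit)         # log L(beta.hat, sigma.hat | data)
p  <- attr(ll, 'df')      # number of estimated parameters

-2*as.numeric(ll) + 2*p   # matches AIC(fit)
AIC(fit)
```

Note that R counts σ̂ as an estimated parameter in the `df` attribute, which is one of the convention differences mentioned above.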
Notice that while this allows us to compare models that are not nested, it does
require that the same data are used to fit both models. Because I could start
out with my data frame including both 𝑥 and 𝑥², (or more generally 𝑥 and 𝑓(𝑥)
for some function 𝑓()) you can regard a transformation of a covariate as “the
same data”. However, a transformation of the y-variable is not, and therefore
we cannot use AIC to compare the model log(y) ~ x versus the model y ~ x.
Another criterion that might be used is the Bayes Information Criterion (BIC),
which is

𝐵𝐼𝐶 = −2 log L(β̂, σ̂ | data) + 𝑝 log 𝑛

and this criterion punishes large models more than AIC does (because log 𝑛 > 2
for 𝑛 ≥ 8).
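R provides BIC() directly, and equivalently AIC() accepts a penalty multiplier k. A quick illustrative check (again with the built-in mtcars data):

```r
# BIC is just AIC with the penalty constant 2 replaced by log(n).
fit <- lm(mpg ~ hp + wt, data=mtcars)
n   <- nobs(fit)

BIC(fit)
AIC(fit, k=log(n))   # gives the same value as BIC(fit)
```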
The AIC value of a linear model can be found by calling AIC() on an lm()
object.

AIC(m1)

## [1] 118.6942

AIC(m2)

## [1] 124.2947

Because the AIC value for the first model is lower, we would prefer the first
model that includes both Income and Income.2 compared to model 2, which
was Life.Exp ~ Illiteracy+Murder+Frost.

9.3.2 Adjusted R-sq

One of the problems with 𝑅² is that it makes no adjustment for how many
parameters are in the model. Recall that 𝑅² was defined as

𝑅² = (RSS_S − RSS_C) / RSS_S = 1 − RSS_C / RSS_S

where the simple model is the intercept-only model. We can create an 𝑅²_adj
statistic that attempts to add a penalty for having too many parameters by
defining

𝑅²_adj = 1 − [ RSS_C / (𝑛 − 𝑝) ] / [ RSS_S / (𝑛 − 1) ]

With this adjusted definition, adding a variable to the model that has no
predictive power will decrease 𝑅²_adj.
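We can verify this definition against R's reported value. An illustrative sketch (with mtcars standing in for any dataset):

```r
# Compute the adjusted R^2 from the residual sums of squares.
fit <- lm(mpg ~ hp + wt, data=mtcars)

RSS.C <- sum( resid(fit)^2 )                        # complex model
RSS.S <- sum( (mtcars$mpg - mean(mtcars$mpg))^2 )   # intercept-only model
n <- nobs(fit)
p <- length(coef(fit))    # number of beta parameters (including intercept)

1 - ( RSS.C/(n-p) ) / ( RSS.S/(n-1) )   # matches summary(fit)$adj.r.squared
summary(fit)$adj.r.squared
```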

9.3.3 Example

Returning to the life expectancy data, we could start with a simple model and
add the covariates that result in the lowest AIC values. R makes this easy
with the function add1() which will take a linear model (which includes the
data frame that originally defined it) and will sequentially add all of the possible
terms that are not currently in the model and report the AIC values for each
model.

# Define the biggest model I wish to consider
biggest <- Life.Exp ~ Log.Population + Income + Illiteracy + Murder +
           HS.Grad + Frost + Log.Area + HS.Grad.2 + Income.2

# Define the model I wish to start with
m <- lm(Life.Exp ~ 1, data=state.data)

add1(m, scope=biggest) # what is the best addition to make?

## Single term additions
##
## Model:
## Life.Exp ~ 1
## Df Sum of Sq RSS AIC
## <none> 88.299 30.435
## Log.Population 1 1.054 87.245 31.835
## Income 1 10.223 78.076 26.283
## Illiteracy 1 30.578 57.721 11.179
## Murder 1 53.838 34.461 -14.609
## HS.Grad 1 29.931 58.368 11.737
## Frost 1 6.064 82.235 28.878
## Log.Area 1 1.042 87.257 31.842
## HS.Grad.2 1 27.414 60.885 13.848
## Income.2 1 7.464 80.835 28.020

Clearly the addition of Murder to the model results in the lowest AIC value,
so we will add Murder to the model. Notice the <none> row corresponds to
the model m which we started with, and it has RSS = 88.299. For each model
considered, R will calculate the RSS_C for the new model, calculate the
difference in RSS between the starting model and the more complicated model,
and display this difference in the Sum of Sq column.

m <- update(m, . ~ . + Murder) # add murder to the model
add1(m, scope=biggest)         # what should I add next?

## Single term additions
##
## Model:
## Life.Exp ~ Murder
## Df Sum of Sq RSS AIC
## <none> 34.461 -14.609
## Log.Population 1 2.9854 31.476 -17.140
## Income 1 2.4047 32.057 -16.226
## Illiteracy 1 0.2732 34.188 -13.007
## HS.Grad 1 4.6910 29.770 -19.925
## Frost 1 3.1346 31.327 -17.378
## Log.Area 1 1.4583 33.003 -14.771
## HS.Grad.2 1 4.4396 30.022 -19.505
## Income.2 1 1.8972 32.564 -15.441

There is a companion function to add1() that finds the best term to drop. It is
conveniently named drop1(), but here the scope parameter defines the smallest
model to be considered.

It would be nice if all of this work was automated. Again, R makes our life
easy and the function step() does exactly this. The set of models searched
is determined by the scope argument, which can be a list of two formulas with
components upper and lower, or it can be a single formula, or it can be blank.
The right-hand side of the lower component defines the smallest model to be
considered and the right-hand side of the upper component defines the largest
model to be considered. If scope is a single formula, it specifies the upper
component, and the lower model is taken to be the intercept-only model. If
scope is missing, the initial model is used as the upper model.

smallest <- Life.Exp ~ 1
biggest  <- Life.Exp ~ Log.Population + Income + Illiteracy +
            Murder + HS.Grad + Frost + Log.Area + HS.Grad.2 + Income.2
m <- lm(Life.Exp ~ Income, data=state.data)
stats::step(m, scope=list(lower=smallest, upper=biggest))

## Start: AIC=26.28
## Life.Exp ~ Income
##
## Df Sum of Sq RSS AIC
## + Murder 1 46.020 32.057 -16.226
## + Illiteracy 1 21.109 56.968 12.523
## + HS.Grad 1 19.770 58.306 13.684
## + Income.2 1 19.062 59.015 14.288
## + HS.Grad.2 1 17.193 60.884 15.847
## + Frost 1 3.188 74.889 26.199
## <none> 78.076 26.283
## + Log.Population 1 1.298 76.779 27.445
## + Log.Area 1 0.994 77.082 27.642
## - Income 1 10.223 88.299 30.435
##
## Step: AIC=-16.23
## Life.Exp ~ Income + Murder
##
## Df Sum of Sq RSS AIC
## + Frost 1 3.918 28.138 -20.745
## + Income.2 1 3.036 29.021 -19.200
## + HS.Grad 1 2.388 29.668 -18.097
## + Log.Population 1 2.371 29.686 -18.068
## + HS.Grad.2 1 2.199 29.857 -17.780
## <none> 32.057 -16.226
## + Log.Area 1 1.229 30.827 -16.181
## - Income 1 2.405 34.461 -14.609
## + Illiteracy 1 0.011 32.046 -14.242
## - Murder 1 46.020 78.076 26.283
##
## Step: AIC=-20.74
## Life.Exp ~ Income + Murder + Frost
##
## Df Sum of Sq RSS AIC
## + HS.Grad 1 2.949 25.189 -24.280
## + HS.Grad.2 1 2.764 25.375 -23.914
## + Log.Area 1 2.122 26.017 -22.664
## + Income.2 1 2.017 26.121 -22.465
## <none> 28.138 -20.745
## + Illiteracy 1 0.950 27.189 -20.461
## + Log.Population 1 0.792 27.347 -20.172
## - Income 1 3.188 31.327 -17.378
## - Frost 1 3.918 32.057 -16.226
## - Murder 1 46.750 74.889 26.199
##
## Step: AIC=-24.28
## Life.Exp ~ Income + Murder + Frost + HS.Grad
##
## Df Sum of Sq RSS AIC
## + Log.Population 1 2.279 22.911 -27.021
## + Income.2 1 1.864 23.326 -26.124
## - Income 1 0.182 25.372 -25.920
## <none> 25.189 -24.280
## + Log.Area 1 0.570 24.619 -23.425
## + HS.Grad.2 1 0.218 24.972 -22.714
## + Illiteracy 1 0.131 25.058 -22.541
## - HS.Grad 1 2.949 28.138 -20.745
## - Frost 1 4.479 29.668 -18.097
## - Murder 1 32.877 58.067 15.478
##
## Step: AIC=-27.02
## Life.Exp ~ Income + Murder + Frost + HS.Grad + Log.Population
##
## Df Sum of Sq RSS AIC
## - Income 1 0.011 22.921 -28.998
## <none> 22.911 -27.021
## + Income.2 1 0.579 22.331 -26.302
## + Log.Area 1 0.207 22.704 -25.475
## + Illiteracy 1 0.052 22.859 -25.134
## + HS.Grad.2 1 0.009 22.901 -25.042
## - Frost 1 2.107 25.017 -24.623
## - Log.Population 1 2.279 25.189 -24.280
## - HS.Grad 1 4.436 27.347 -20.172
## - Murder 1 33.706 56.616 16.214
##
## Step: AIC=-29
## Life.Exp ~ Murder + Frost + HS.Grad + Log.Population
##
## Df Sum of Sq RSS AIC
## <none> 22.921 -28.998
## + Log.Area 1 0.216 22.705 -27.471
## + Illiteracy 1 0.052 22.870 -27.111
## + Income.2 1 0.034 22.887 -27.073
## + HS.Grad.2 1 0.012 22.909 -27.024
## + Income 1 0.011 22.911 -27.021
## - Frost 1 2.214 25.135 -26.387
## - Log.Population 1 2.450 25.372 -25.920
## - HS.Grad 1 6.959 29.881 -17.741
## - Murder 1 34.109 57.031 14.578

##
## Call:
## lm(formula = Life.Exp ~ Murder + Frost + HS.Grad + Log.Population,
## data = state.data)
##
## Coefficients:
## (Intercept) Murder Frost HS.Grad Log.Population
## 68.720810 -0.290016 -0.005174 0.054550 0.246836

Notice that our model selected by step() is not the same model we obtained
when we started with the biggest model and removed things based on p-values.
The log-likelihood is only defined up to an additive constant, and there are
different conventional constants used. This is more annoying than anything
because all we care about for model selection is the difference between AIC
values of two models and the additive constant cancels. The only time it matters
is when you have two different ways of extracting the AIC values. Recall the
model we fit using the top-down approach was

# m1 was
m1 <- lm(Life.Exp ~ Income + Murder + Frost + Income.2, data = state.data)
AIC(m1)

## [1] 121.4293

and the model selected by the stepwise algorithm was

m3 <- lm(Life.Exp ~ Murder + Frost + HS.Grad + Log.Population, data = state.data)
AIC(m3)

## [1] 114.8959

Because step() and AIC() follow different conventions, the absolute values of
the AICs are different, but the difference between the two models is the same
no matter which function we use.

First we calculate the difference using the AIC() function:

AIC(m1) - AIC(m3)

## [1] 6.533434

and next we use add1() on both models to see the AIC values for each.

add1(m1, scope=biggest)

## Single term additions
##
## Model:
## Life.Exp ~ Income + Murder + Frost + Income.2
## Df Sum of Sq RSS AIC
## <none> 26.121 -22.465
## Log.Population 1 0.10296 26.018 -20.662
## Illiteracy 1 0.10097 26.020 -20.658
## HS.Grad 1 2.79527 23.326 -26.124
## Log.Area 1 2.36019 23.761 -25.200
## HS.Grad.2 1 2.79698 23.324 -26.127

add1(m3, scope=biggest)

## Single term additions
##
## Model:
## Life.Exp ~ Murder + Frost + HS.Grad + Log.Population
## Df Sum of Sq RSS AIC
## <none> 22.921 -28.998
## Income 1 0.010673 22.911 -27.021
## Illiteracy 1 0.051595 22.870 -27.111
## Log.Area 1 0.215741 22.706 -27.471
## HS.Grad.2 1 0.011894 22.909 -27.024
## Income.2 1 0.034356 22.887 -27.073

Using these results, we can calculate the difference in AIC values to be the same
as we calculated before:

−22.465 − (−28.998) = −22.465 + 28.998 = 6.533

9.4 Exercises

1. Consider the prostate data from the faraway package. The variable lpsa
   is a measurement of a prostate specific antigen, where higher levels are
   indicative of prostate cancer. Use lpsa as the response and all the other
   variables as predictors (no interactions). Determine the “best” model
   using:
   a. Backward elimination using the analysis of variance F-statistic as the
      criteria.
   b. Forward selection using AIC as the criteria.

2. Again from the faraway package, use the divusa dataset, which has divorce
   rates for each year from 1920-1996 along with other population information
   for each year. Use divorce as the response variable and all other variables
   as the predictors.
   a. Determine the best model using stepwise selection, starting from the
      intercept-only model with the most complex model being all main
      effects (no interactions). Use the F-statistic to determine significance.
      Note: add1(), drop1(), and step() allow an option of test='F' to use
      an F-test instead of AIC.
   b. Following the stepwise selection, comment on the relationship between
      the p-values used and the AIC differences observed. Do the AIC
      rules of thumb match the p-value interpretation?
Chapter 10

Mixed Effects Models

library(tidyverse)  # dplyr, tidyr, ggplot
library(stringr)    # string manipulation stuff
library(latex2exp)  # for LaTeX mathematical notation

library(ggfortify)  # For model diagnostic plots

library(lme4)       # Our primary analysis routine
library(lmerTest)   # A user friendly interface to lme4 that produces p-values
library(emmeans)    # For all of my pairwise contrasts
library(car)        # For bootstrap Confidence/Prediction Intervals

# library(devtools)
# install_github('dereksonderegger/dsData') # datasets I've made; only install once...
library(dsData)

The assumption of independent observations is often not supported and dependent
data arises in a wide variety of situations. The dependency structure could
be very simple such as rabbits within a litter being correlated and the litters
being independent. More complex hierarchies of correlation are possible. For
example we might expect voters in a particular part of town (called a precinct)
to vote similarly, and particular districts in a state tend to vote similarly as
well, which might result in a precinct / district / state hierarchy of correlation.
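In lme4 (loaded above), such nested grouping structures are written directly into the model formula with the / operator. A minimal sketch with simulated, made-up data just to show the syntax:

```r
# Toy data: 2 states, 3 districts per state, 4 precincts per district,
# 5 voters per precinct. The response is pure noise, so expect the
# estimated variance components to be near zero (and possibly a
# "singular fit" message).
set.seed(1)
votes <- expand.grid( state=c('A','B'), district=1:3, precinct=1:4, voter=1:5 )
votes$turnout <- rnorm( nrow(votes) )

# (1 | state/district/precinct) expands to random intercepts for state,
# district-within-state, and precinct-within-district.
model <- lmer( turnout ~ 1 + (1 | state/district/precinct), data=votes )
summary(model)
```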

10.1 Block Designs

Often there are covariates in the experimental units that are known to affect
the response variable and must be taken into account. Ideally an experimenter


can group the experimental units into blocks where the within block variance
is small, but the block to block variability is large. For example, in testing a
drug to prevent heart disease, we know that gender, age, and exercise levels
play a large role. We should partition our study participants into gender, age,
and exercise groups and then randomly assign the treatment (placebo vs drug)
within the group. This will ensure that we do not have a gender, age, and
exercise group that has all placebo observations.
Often blocking variables are not the variables that we are primarily interested
in, but must nevertheless be considered. We call these nuisance variables. We
already know how to deal with these variables by adding them to the model,
but there are experimental designs where we must be careful because the
experimental treatments are nested.
Example 1. An agricultural field study has three fields in which the researchers
will evaluate the quality of three different varieties of barley. Due to how they
harvest the barley, we can only create a maximum of three plots in each field. In
this example we will block on field since there might be differences in soil type,
drainage, etc from field to field. In each field, we will plant all three varieties so
that we can tell the difference between varieties without the block effect of field
confounding our inference. In this example, the varieties are nested within the
fields.

         Field 1     Field 2     Field 3
Plot 1   Variety A   Variety C   Variety B
Plot 2   Variety B   Variety A   Variety C
Plot 3   Variety C   Variety B   Variety A

Example 2. We are interested in how a mouse responds to five different materials inserted into subcutaneous tissue to evaluate the materials' use in medicine.
Each mouse can have a maximum of 3 insertions. Here we will block on the
individual mice because even lab mice have individual variation. We actually
are not interested in estimating the effect of the mice because they aren’t really
of interest, but the mouse block effect should be accounted for before we make
any inferences about the materials. Notice that if we only have one insertion
per mouse, then the mouse effect will be confounded with materials.

10.2 Randomized Complete Block Design (RCBD)

The dataset oatvar in the faraway library contains information about an exper-
iment on eight different varieties of oats. The area in which the experiment was
done had some systematic variability and the researchers divided the area up

into five different blocks in which they felt the area inside a block was uniform
while acknowledging that some blocks are likely superior to others for growing
crops. Within each block, the researchers created eight plots and randomly
assigned a variety to a plot. This type of design is called a Randomized Com-
plete Block Design (RCBD) because each block contains all possible levels of
the factor of primary interest.

data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x=block, color=variety)) +
geom_point(size=5) +
geom_line(aes(x=as.integer(block))) # connect the dots

[Figure: yield versus block, points colored and connected by variety]

While there is one unusual observation in block IV, there doesn’t appear to be
a blatant interaction. We will consider the interaction shortly. For the main
effects model of yield ~ block + variety we have 𝑝 = 12 parameters and 28
residual degrees of freedom because

𝑑𝑓𝜖 = 𝑛 − 𝑝
= 𝑛 − (1 + [(𝐼 − 1) + (𝐽 − 1)])
= 40 − (1 + [(5 − 1) + (8 − 1)])
= 40 − 12
= 28
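As a quick sanity check on this bookkeeping, we can rebuild a balanced 5 × 8 layout by hand and count the columns of the model matrix directly (a sketch that constructs the design rather than using the oatvar data):

```r
# Rebuild the balanced 5-block x 8-variety layout and count model
# matrix columns to verify the degrees-of-freedom calculation.
d <- expand.grid(block = factor(1:5), variety = factor(1:8))
X <- model.matrix(~ block + variety, data = d)
n <- nrow(X)   # 40 rows, one per plot
p <- ncol(X)   # 1 + (5 - 1) + (8 - 1) = 12 columns
n - p          # 28 residual degrees of freedom
```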

m1 <- lm( yield ~ block + variety, data=oatvar)


anova(m1)

## Analysis of Variance Table


##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## block 4 33396 8348.9 6.2449 0.001008 **


## variety 7 77524 11074.8 8.2839 1.804e-05 ***
## Residuals 28 37433 1336.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# plot(m1) # check diagnostic plots - they are fine...

Because this is an orthogonal design, the sums of squares don't change regardless of the order in which we add the factors; but if we removed one or two observations, they would.

In determining the significance of variety, the above F-value and p-value are correct. We have 40 observations (5 per variety), and after accounting for the model structure (including the extraneous blocking variable), we have 28 residual degrees of freedom.

But the F-value and p-value for testing if block is significant are nonsense! Imagine that variety didn't matter; then we would just have 8 replicate samples per block, but these aren't true replicates: they are what are called pseudoreplicates. Imagine taking a sample of n = 3 people and observing their height at 1000 different points in time during the day. You don't have 3000 data points for estimating the mean height in the population, you have 3. Unless we account for this, the inference for the block variable is wrong. In this case, we effectively have only one independent observation for each block, so we can't do any statistical inference at the block scale!

Fortunately, in this case we don't care about the blocking variable; including it in the model simply guards us in case there is a block effect, but we aren't interested in estimating it. If the only covariate we care about is the most deeply nested effect, then we can do the usual analysis, recognize that the p-value for the blocking variable is nonsense, and ignore it.

# Ignore any p-values regarding block, but I'm happy with the analysis for variety
letter_df <- emmeans(m1, ~variety) %>%
multcomp::cld(Letters=letters) %>%
dplyr::select(variety, .group) %>%
mutate(yield = 500)

ggplot(oatvar, aes(x=variety, y=yield)) +


geom_boxplot() +
geom_text( data=letter_df, aes(label=.group) )
[Figure: boxplots of yield by variety, annotated with compact letter display groups]

However, it would be pretty sloppy not to do the analysis correctly, because our blocking variable might be something we care about.

One correct way to model these data is with hierarchical models, which are created by introducing multiple error terms into the model. In many respects, the random effects structure provides an extremely flexible framework for expressing many of the traditional experimental designs, as well as many non-traditional designs, with the benefit of more easily assessing variability at each hierarchical level.
Mixed effects models combine what we call "fixed" and "random" effects.

Fixed effects: Unknown constants that we wish to estimate from the model and that could be similarly estimated in subsequent experimentation. The researcher is interested in these particular levels.

Random effects: Random variables sampled from a population, whose particular realizations cannot be observed again in subsequent experimentation. The researcher is not interested in these particular levels, but rather in how the levels vary from sample to sample.

For example, consider a rabbit study that examined the effect of diet on the growth of domestic rabbits, where we had 10 litters of rabbits and used the 3 most similar rabbits from each litter to test 6 different diets. Here, the 6 different diets are fixed effects because they are not randomly selected from a population, these exact same diets can be studied further, and these are the diets we are interested in. The litters of rabbits and the individual rabbits are randomly selected from populations, cannot be exactly replicated in future studies, and we are not interested in the individual litters but rather in the variability between individuals and between litters.
Often random effects are not of primary interest to the researcher, but must be considered. Blocking variables are random effects because they arise from a random sample of the possible blocks that are potentially available to the researcher.

Mixed effects models are models that have both fixed and random effects. We will first concentrate on understanding how to address a model with two sources of error, and then complicate the matter with fixed effects.

10.3 Review of Maximum Likelihood Methods

Recall that the likelihood function is the function that links the model parameters to the data; it is found by taking the probability density function and interpreting it as a function of the parameters instead of as a function of the data. Loosely, the probability density function tells us which outcomes are most probable given a set of parameter values: the higher the density, the higher the probability of seeing that value (or data in that region). The likelihood function turns that relationship around and tells us which parameter values are most likely to have generated the data we observed, with parameter values having a higher likelihood value being more "likely".
The likelihood function for an independent, identically distributed sample y_i ~ N(X_{i,·}β, σ²) can be written as a function of the parameters β and σ²:

$$L(\beta, \sigma^2 \mid y_1, \ldots, y_n) = \frac{1}{(2\pi)^{n/2}\,[\det(\Omega)]^{1/2}} \exp\left[-\frac{1}{2}(y - X\beta)^T \Omega^{-1} (y - X\beta)\right]$$

where the variance/covariance matrix is Ω = σ²Iₙ.


We can use this equation to find the maximum likelihood estimators either by taking the derivatives, setting them equal to zero, and solving for the parameters, or by using numerical methods. In the normal case, we can find the maximum likelihood estimators (MLEs) using the derivative trick and we find that

$$\hat{\beta}_{MLE} = \left(X^T \Omega^{-1} X\right)^{-1} X^T \Omega^{-1} y$$

and

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2$$

where ŷᵢ = X_{i·}β̂. Notice that this is not our usual estimator σ̂² = s², where s² is the sample variance. It turns out that the MLE of σ² is biased (the correction is to divide by n − 1 instead of n). This is normally not an issue if our sample size is large, but with a small sample the bias is not negligible.
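A small simulation (with made-up sample size and variance) illustrates this bias:

```r
# Simulate many small samples and compare the average of the MLE variance
# estimate (divide by n) with the unbiased estimate (divide by n-1).
# The true variance here is 4; all values are made up for illustration.
set.seed(42)
n <- 5; sigma2 <- 4
est <- replicate(10000, {
  y <- rnorm(n, mean = 10, sd = sqrt(sigma2))
  c(mle = sum((y - mean(y))^2) / n,   # MLE: divides by n, biased
    s2  = var(y))                     # var() divides by n-1, unbiased
})
rowMeans(est)   # mle averages near sigma2*(n-1)/n = 3.2; s2 averages near 4
```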
Notice that if we happened to know that μ = 0, then we could use

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n y_i^2$$

and this would be unbiased for σ².


In general (and not just in the normal case above) the Likelihood Ratio Test (LRT) provides a way for us to compare two nested models. Given m₀, which is a simplification of m₁, we can calculate the likelihood functions of the two models, L(θ₀) and L(θ₁), where θ₀ is the vector of parameters for the null model and θ₁ is the vector of parameters for the alternative. Let θ̂₀ be the maximum likelihood estimators for the null model and θ̂₁ be the maximum likelihood estimators for the alternative. Finally we consider the value of

$$D = -2 \log\left[\frac{L(\hat{\theta}_0)}{L(\hat{\theta}_1)}\right] = -2\left[\log L(\hat{\theta}_0) - \log L(\hat{\theta}_1)\right]$$

Under the null hypothesis that m₀ is the true model, D ~ χ²_{p₁−p₀}, where p₁ − p₀ is the difference in the number of parameters between the null and alternative models. That is to say, asymptotically D has a Chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom of the two models.

We can think of L(θ̂₀) as the maximization of the likelihood when some parameters are held constant (at zero) while all the other parameters are free to vary. But we are not required to hold them constant at zero; we could choose any value of interest and perform a LRT.
Because we often regard a confidence interval as the set of values that would
not be rejected by a hypothesis test, we could consider a sequence of possible
values for a parameter and figure out which would not be rejected by the LRT.
In this fashion we can construct confidence intervals for parameter values.
Unfortunately all of this hinges on the asymptotic distribution of 𝐷 and often
this turns out to be a poor approximation. In simple cases more exact tests can
be derived (for example the F-tests we have used prior) but sometimes nothing
better is currently known. Another alternative is to use resampling methods for
the creation of confidence intervals or p-values.
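To make the mechanics concrete, here is a hand-rolled LRT comparing two nested linear models; the built-in mtcars data and the particular models are purely illustrative:

```r
# A hand-rolled likelihood ratio test on two nested linear models,
# using the built-in mtcars data purely for illustration.
fit0 <- lm(mpg ~ wt,      data = mtcars)   # null model
fit1 <- lm(mpg ~ wt + hp, data = mtcars)   # alternative model
D  <- -2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1)))
df <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")
pchisq(D, df = df, lower.tail = FALSE)     # asymptotic p-value
```

This is the same computation that `anova()` performs when comparing nested models fit by maximum likelihood.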

10.4 1-way ANOVA with a random effect


We first consider the simplest model with two sources of variability, a 1-way
ANOVA with a random factor covariate
$$y_{ij} = \mu + \gamma_i + \epsilon_{ij}$$

where γᵢ ~ N(0, σ²_γ) and ϵᵢⱼ ~ N(0, σ²_ϵ) independently. This model could occur, for example, when looking at the adult weight of domestic rabbits, where the random effect is the effect of litter and we are interested in understanding how much variability there is between litters (σ²_γ) and how much variability there is within a litter (σ²_ϵ). Another example is the creation of computer chips. Here a single wafer of silicon is used to create several chips and we might have wafer-to-wafer variability and then, within a wafer, chip-to-chip variability.
First we should think about what the variances and covariances are for any two observations:

$$Var(y_{ij}) = Var(\mu + \gamma_i + \epsilon_{ij}) = Var(\mu) + Var(\gamma_i) + Var(\epsilon_{ij}) = 0 + \sigma^2_\gamma + \sigma^2_\epsilon$$

and Cov(y_{ij}, y_{ik}) = σ²_γ because the two observations share the same litter effect γᵢ. For two observations in different litters, the covariance is 0. These relationships induce a correlation on observations within the same litter of

$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\gamma + \sigma^2_\epsilon}$$

For example, suppose that we have I = 3 litters and in each litter we have J = 3 rabbits. Then the variance-covariance matrix is block-diagonal, with one 3 × 3 block per litter:

$$\Omega = \begin{bmatrix} A & 0 & 0 \\ 0 & A & 0 \\ 0 & 0 & A \end{bmatrix}
\qquad \text{where} \qquad
A = \begin{bmatrix}
\sigma^2_\gamma + \sigma^2_\epsilon & \sigma^2_\gamma & \sigma^2_\gamma \\
\sigma^2_\gamma & \sigma^2_\gamma + \sigma^2_\epsilon & \sigma^2_\gamma \\
\sigma^2_\gamma & \sigma^2_\gamma & \sigma^2_\gamma + \sigma^2_\epsilon
\end{bmatrix}$$

Substituting this new variance-covariance matrix into our likelihood function, we now have a likelihood function on which we can perform our usual MLE tricks.
In the more complicated situation where we have a full mixed effects model, we can write

$$y = X\beta + Z\gamma + \epsilon$$

where X is the design matrix for the fixed effects, β is the vector of fixed effect coefficients, Z is the design matrix for the random effects, γ is the vector of random effects with γᵢ ~ N(0, σ²_γ) independently, and ϵ is the vector of error terms with ϵᵢⱼ ~ N(0, σ²_ϵ) independently. Notice that in our rabbit case
$$Z = \begin{bmatrix}
1 & \cdot & \cdot \\ 1 & \cdot & \cdot \\ 1 & \cdot & \cdot \\
\cdot & 1 & \cdot \\ \cdot & 1 & \cdot \\ \cdot & 1 & \cdot \\
\cdot & \cdot & 1 \\ \cdot & \cdot & 1 \\ \cdot & \cdot & 1
\end{bmatrix}
\qquad
ZZ^T = \begin{bmatrix}
1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
1 & 1 & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1 & 1 & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & 1 & 1
\end{bmatrix}$$

which makes it easy to notice that

$$\Omega = \sigma^2_\gamma ZZ^T + \sigma^2_\epsilon I$$
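This structure is easy to verify numerically. The sketch below constructs Z and Ω for the I = 3, J = 3 rabbit example using hypothetical variance components (σ²_γ = 2 and σ²_ϵ = 1 are made up for illustration):

```r
# Construct Z and Omega for I = 3 litters of J = 3 rabbits each,
# with hypothetical variance components.
I <- 3; J <- 3
sigma2_gamma <- 2; sigma2_eps <- 1
Z     <- kronecker(diag(I), matrix(1, nrow = J, ncol = 1))    # 9 x 3 indicators
Omega <- sigma2_gamma * Z %*% t(Z) + sigma2_eps * diag(I * J)
Omega[1:4, 1:4]   # litter-1 block, plus one row/column from litter 2 (zeros)
sigma2_gamma / (sigma2_gamma + sigma2_eps)   # within-litter correlation rho = 2/3
```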

In practice we tend to have a relatively small number of blocks, and thus a small number of observations with which to estimate σ²_γ, which means that the bias of the MLE estimates will be sub-optimal. If we knew that Xβ = 0, we could use that fact and obtain an unbiased estimate of our variance parameters. Because X is known, we can find linear functions k such that kᵀX = 0. We can form a matrix K that represents all of these possible transformations, and we notice that

$$K^T y \sim N\left(K^T X\beta,\; K^T \Omega K\right) = N\left(0,\; K^T \Omega K\right)$$

and perform our maximization on this transformed set of data. Once we have our unbiased estimates of σ²_γ and σ²_ϵ, we can substitute these back into the untransformed likelihood function and find the MLEs for β. This process is called Restricted Maximum Likelihood (REML) and is generally preferred over the variance component estimates found by simply maximizing the regular likelihood function. As usual, if our experiment is balanced these complications aren't necessary, as the REML estimates of β are usually the same as the ML estimates.
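One way to construct such a K is from the QR decomposition of X: the columns of Q beyond the first p are orthogonal to the column space of X. The toy design below is purely illustrative:

```r
# Find a matrix K whose columns are orthogonal to the column space of X,
# so that t(K) %*% X = 0, via the QR decomposition (a toy 6 x 2 design).
X <- model.matrix(~ factor(rep(1:2, each = 3)))
Q <- qr.Q(qr(X), complete = TRUE)   # full 6 x 6 orthogonal matrix
K <- Q[, (ncol(X) + 1):nrow(X)]     # the remaining n - p = 4 columns
max(abs(t(K) %*% X))                # numerically zero
```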

Our first example comes from an experiment to test the paper brightness as
affected by the shift operator. The data has 20 observations with 4 different
operators. Each operator had 5 different observations made. The data set is
pulp in the package faraway. We will first analyze this using a fixed-effects one-
way ANOVA, but we will use a different model representation. Instead of using
the first operator as the reference level, we will use the sum-to-zero constraint
(to make it easier to compare with the output of the random effects model).

data('pulp', package='faraway')
ggplot(pulp, aes(x=operator, y=bright)) + geom_point()
[Figure: bright versus operator for the pulp data]

# set the contrasts to sum-to-zero constraint


options(contrasts=c('contr.sum', 'contr.poly'))
m <- lm(bright ~ operator, data=pulp)
summary(m)

##
## Call:
## lm(formula = bright ~ operator, data = pulp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.440 -0.195 -0.070 0.175 0.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.40000 0.07289 828.681 <2e-16 ***
## operator1 -0.16000 0.12624 -1.267 0.223
## operator2 -0.34000 0.12624 -2.693 0.016 *
## operator3 0.22000 0.12624 1.743 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.326 on 16 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.3359
## F-statistic: 4.204 on 3 and 16 DF, p-value: 0.02261

coef(m)

## (Intercept) operator1 operator2 operator3


## 60.40 -0.16 -0.34 0.22
10.4. 1-WAY ANOVA WITH A RANDOM EFFECT 177

The sum-to-zero constraint forces the operator effects to sum to zero, so we can find the value of the fourth operator as operator4 = −(−0.16 − 0.34 + 0.22) = 0.28.
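This arithmetic can be checked directly:

```r
# Recover the fourth sum-to-zero effect from the three reported ones.
op <- c(operator1 = -0.16, operator2 = -0.34, operator3 = 0.22)
operator4 <- -sum(op)
operator4   # 0.28
```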
To fit the random effects model we will use the package lmerTest, which is a nicer user interface to the package lme4. The reason we won't use lme4 directly is that the authors of lme4 refuse to calculate p-values. The reason for this is that in mixed models it is not always clear what the appropriate degrees of freedom are for the residuals, and therefore we don't know which t-distribution to compare the t-values to. In simple balanced designs the degrees of freedom can be calculated, but in complicated unbalanced designs the appropriate degrees of freedom are not known, and all proposed heuristic methods (including what is calculated by SAS) can fail spectacularly in certain cases. The authors of lme4 are adamant that until robust methods are developed, they prefer not to calculate any p-values. There are other packages that recognize the need for approximate p-values, and the package lmerTest provides reasonable answers that match what SAS calculates.

library(lmerTest)   # masks lme4::lmer with a version whose summary reports p-values
m2 <- lmer( bright ~ 1 + (1|operator), data=pulp )
summary(m2)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: bright ~ 1 + (1 | operator)
## Data: pulp
##
## REML criterion at convergence: 18.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.4666 -0.7595 -0.1244 0.6281 1.6012
##
## Random effects:
## Groups Name Variance Std.Dev.
## operator (Intercept) 0.06808 0.2609
## Residual 0.10625 0.3260
## Number of obs: 20, groups: operator, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 60.4000 0.1494 3.0000 404.2 3.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the estimate of the fixed effect (the overall mean) is the same in
the fixed-effects ANOVA and in the mixed model. However the fixed-effects ANOVA estimates the effect of each operator while the mixed model is interested in estimating the variance between operators. In the model statement the
(1|operator) denotes the random effect and this notation tells us to fit a model
with a random intercept term for each operator. Here the variance associated
with the operators is 𝜎𝛾2 = 0.068 while the “pure error” is 𝜎𝜖2 = 0.106. The col-
umn for standard deviation is not the variability associated with our estimate,
but is simply the square-root of the variance terms 𝜎𝛾 and 𝜎𝜖 . This was fit using
the REML method.
We might be interested in the estimated effect of each operator

ranef(m2)

## $operator
## (Intercept)
## a -0.1219403
## b -0.2591231
## c 0.1676679
## d 0.2133955
##
## with conditional variances for "operator"

These effects are smaller than the values we estimated in the fixed effects model due to the distributional assumption that penalizes large deviations from the mean. In general, the estimated random effects are of smaller magnitude than the effect sizes estimated using a fixed effects model.
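In fact, for a balanced one-way design this shrinkage is exactly multiplicative, and we can reproduce the predictions above from the sum-to-zero fixed-effect estimates and the variance components reported by summary(m2):

```r
# The random-effect predictions (BLUPs) are shrunken versions of the
# fixed-effect estimates. For a balanced one-way design the shrinkage factor
# is J*sigma2_gamma / (J*sigma2_gamma + sigma2_eps). The numbers below are
# copied from the model output shown earlier.
J <- 5                                     # observations per operator
sigma2_gamma <- 0.06808; sigma2_eps <- 0.10625
shrink <- J * sigma2_gamma / (J * sigma2_gamma + sigma2_eps)
fixed  <- c(a = -0.16, b = -0.34, c = 0.22, d = 0.28)   # sum-to-zero estimates
round(shrink * fixed, 4)   # -0.1219 -0.2591 0.1677 0.2134, matching ranef(m2)
```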

# reset the contrasts to the default


options(contrasts=c("contr.treatment", "contr.poly" ))

10.5 Blocks as Random Variables


Blocks are properties of experimental designs and usually we are not interested
in the block levels per se but need to account for the variability introduced by
them.
Recall the agriculture experiment in the dataset oatvar from the faraway pack-
age. We had 8 different varieties of oats and we had 5 different fields (which
we called blocks). Because of limitations on how we plant, we could only divide
the blocks into 8 plots and in each plot we planted one of the varieties.

data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x= variety)) +
geom_point() +
facet_wrap(~block, labeller=label_both)
[Figure: yield versus variety, faceted by block]

In this case, we don't really care about these particular fields (blocks) and would prefer to think of them as a random sample of fields that we might have used in our experiment. The analysis comparing the simple model (without variety) to the complex model should use ML, because the fixed effects differ between the two models and thus the K matrices used in the REML fits are different.

model.0 <- lmer( yield ~ (1|block), data=oatvar)


model.1 <- lmer( yield ~ variety + (1|block), data=oatvar)

# By default anova() will always refit the models using ML assuming you want to
# compare models with different fixed effects. Use refit=FALSE to suppress this.
anova(model.0, model.1)

## refitting model(s) with ML (instead of REML)

## Data: oatvar
## Models:
## model.0: yield ~ (1 | block)
## model.1: yield ~ variety + (1 | block)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model.0 3 446.94 452.01 -220.47 440.94
## model.1 10 421.67 438.56 -200.84 401.67 39.27 7 1.736e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that this is doing a Likelihood Ratio Test:

$$-2\log\left(\frac{L_0}{L_a}\right) = -2\left(\log(L_0) - \log(L_a)\right) = -2\left(-220.47 - (-200.84)\right) = 39.26$$

which matches the Chisq value of 39.27 in the table up to rounding of the displayed log-likelihoods.
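We can also recover the reported p-value from these two log-likelihoods and the 7 extra variety parameters (the small discrepancy with the table is rounding in the displayed log-likelihoods):

```r
# Recompute the LRT statistic and p-value reported by anova() by hand.
D <- -2 * (-220.47 - (-200.84))         # test statistic
pchisq(D, df = 7, lower.tail = FALSE)   # approximately 1.7e-06
```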

Recall that we don’t really like to trust Likelihood Ratio Tests because they
depend on the asymptotic distribution of the statistic (as sample size increases)
and that convergence is pretty slow. Another, and usually better option, is to
perform an F-test with the numerator degrees of freedom estimated using either
the Satterthwaite or Kenward-Roger methods. To do this, we’ll use the anova()
command with just a single model.

# Do an F test for the fixed effects using the similar degree of freedom
# approximations done by SAS
#
# anova(model.1, ddf='lme4') # don't return the p-value because we don't trust it!
# anova(model.1, ddf='Satterthwaite') # Use Satterthwaite
# anova(model.1, ddf='Kenward-Roger') # Use Kenward-Roger
anova(model.1) # default is Satterthwaite's degrees of freedom

## Type III Analysis of Variance Table with Satterthwaite's method


## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## variety 77524 11075 7 28 8.2839 1.804e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unsurprisingly given our initial thoughts about the data, it looks like variety
is a statistically significant covariate.
There is quite a bit of debate among statisticians about how to calculate the de-
nominator degrees of freedom and which method is preferred in different scenar-
ios. By default, lmerTest uses Satterthwaite’s method, but ‘Kenward-Roger’
is also allowed. In this case, the two methods produce the same estimated denominator degrees of freedom and the same p-value.
To consider if the random effect should be included in the model, we will turn
to the Likelihood Ratio test. The following examines all single term deletions
to the random effects structure. In our case, this is just considering removing
(1|block)

# Each line tests if a random effect can be removed or reduced to a simpler random effect.
# Something like (1|Group) will be tested to see if it can be removed.
# Something like (1+Trt | Group) will be tested to see if it can be reduced to (1 | Group).
ranova(model.1)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## yield ~ variety + (1 | block)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 10 -170.68 361.35
## (1 | block) 9 -175.08 368.16 8.8065 1 0.003002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, both the AIC and LRT suggest that the random effect of block is appro-
priate to include in the model.
Now that we have chosen our model, we can examine it.

summary(model.1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ variety + (1 | block)
## Data: oatvar
##
## REML criterion at convergence: 341.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7135 -0.5503 -0.1280 0.4862 2.1756
##
## Random effects:
## Groups Name Variance Std.Dev.
## block (Intercept) 876.5 29.61
## Residual 1336.9 36.56
## Number of obs: 40, groups: block, 5
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 334.40 21.04 15.26 15.894 6.64e-11 ***
## variety2 42.20 23.12 28.00 1.825 0.0787 .
## variety3 28.20 23.12 28.00 1.219 0.2328
## variety4 -47.60 23.12 28.00 -2.058 0.0490 *
## variety5 105.00 23.12 28.00 4.541 9.73e-05 ***
## variety6 -3.80 23.12 28.00 -0.164 0.8707
## variety7 -16.00 23.12 28.00 -0.692 0.4947
## variety8 49.80 23.12 28.00 2.154 0.0400 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) varty2 varty3 varty4 varty5 varty6 varty7
## variety2 -0.550
## variety3 -0.550 0.500
## variety4 -0.550 0.500 0.500
## variety5 -0.550 0.500 0.500 0.500
## variety6 -0.550 0.500 0.500 0.500 0.500
## variety7 -0.550 0.500 0.500 0.500 0.500 0.500
## variety8 -0.550 0.500 0.500 0.500 0.500 0.500 0.500
We start with the random effects. This section shows us the block-to-block variability (and its square root, the standard deviation) as well as the "pure error", labeled Residual, which is an estimate of the variability associated with two different observations planted within the same block (after the difference in variety is accounted for). Here we see that the block-to-block variability is only slightly smaller than the within-block variability.
Why do we care about this? This actually tells us quite a lot about the spatial
variability. Because yield is affected by soil nutrients, micro-climate, soil water
availability, etc, I expect that two identical seedlings planted in slightly different
conditions will have slightly different yields. By examining how the yield changes
over small distances (the residual within block variability) vs how it changes over
long distances (block to block variability) we can get a sense as to the scale at
which these background lurking processes operate.
Next we turn to the fixed effects. These are the offsets from the reference group, as we've typically worked with. Here we see that varieties 2, 5, and 8 are the best performers (relative to variety 1).

We are certain that there are differences among the varieties, so we should look at all of the pairwise contrasts among the variety levels. As usual we can use the package emmeans, which automates much of this (and uses the lmerTest-produced p-values for the tests).

LetterResults <- emmeans::emmeans( model.1, ~ variety) %>%


multcomp::cld(Letters=letters)
LetterResults

## variety emmean SE df lower.CL upper.CL .group


## 4 287 21 15.2 242 332 a
## 7 318 21 15.2 274 363 ab
## 6 331 21 15.2 286 375 ab
## 1 334 21 15.2 290 379 ab
## 3 363 21 15.2 318 407 b
## 2 377 21 15.2 332 421 bc
## 8 384 21 15.2 339 429 bc
## 5 439 21 15.2 395 484 c
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 8 estimates
## significance level used: alpha = 0.05

As usual we’ll join this information into the original data table and then make
a nice summary graph.
LetterResults <- LetterResults %>%


mutate(LetterHeight=500, .group = str_trim(.group))

oatvar %>%
mutate(variety = fct_reorder(variety, yield)) %>%
ggplot( aes(x=variety, y=yield)) +
geom_point(aes(color=block)) +
geom_text(data=LetterResults, aes(label=.group, y=LetterHeight))

[Figure: yield by variety (ordered by mean yield), points colored by block, annotated with compact letter display groups]

We'll consider a second example using data from the pharmaceutical industry. We are interested in 4 different processes (our treatment variable) used in the biosynthesis and purification of the drug penicillin. The biosynthesis requires corn steep liquor as a nutrient source for the fungus, and this nutrient source is quite variable. Each batch of the nutrient is referred to as a 'blend' and each blend is sufficient to create 4 runs of penicillin. We avoid confounding our biosynthesis methods with the blend by using a Randomized Complete Block Design and observing the yield of penicillin from each of the four methods (A, B, C, and D) in each blend.

data(penicillin, package='faraway')

ggplot(penicillin, aes(y=yield, x=treat)) +


geom_point() +
facet_wrap( ~ blend, ncol=5)
[Figure: yield versus treatment, faceted by blend]

It looks like there is definitely a Blend effect (e.g. Blend1 is much better than
Blend5) but it isn’t clear that there is a treatment effect.

model.0 <- lmer(yield ~ 1 + (1 | blend), data=penicillin)


model.1 <- lmer(yield ~ treat + (1 | blend), data=penicillin)
# anova(model.0, model.1) # Analysis using a LRT. Not my preference for analysis
anova(model.1) # Tests using an F-test with an approximated degrees of freedom

## Type III Analysis of Variance Table with Satterthwaite's method


## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## treat 70 23.333 3 12 1.2389 0.3387

It looks like we don’t have a significant effect of the treatments. Next we’ll
examine the simple model to understand the variability.

# Test if we should remove the (1|blend) term using either AIC or Likelihood Ratio Test
ranova(model.1)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## yield ~ treat + (1 | blend)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 6 -51.915 115.83
## (1 | blend) 5 -53.296 116.59 2.7629 1 0.09647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Blend looks marginally significant, but it might still be interesting to compare the variability between blends to the variability within blends.
# summary(model.0) # Shows everything...


summary(model.0)[['varcor']] # Just the sigma terms for each random term

## Groups Name Std.Dev.


## blend (Intercept) 3.4010
## Residual 4.4422

We see that there is more variability within blends (labeled Residual here) than between blends. If my job were to understand the variability and figure out how to improve production, this suggests that variability is introduced at both the blend level and the run level, but the run level is more important.

10.6 Nested Effects


When the levels of one factor vary only within the levels of another factor, that
factor is said to be nested. For example, when measuring the performance of
workers at several job locations, if the workers only work at one site, then the
workers are nested within site. If the workers work at more than one location,
we would say that workers are crossed with site.
We’ve already seen a number of nested designs when we looked at split plot
designs. Recall the AgData set that I made up that simulated an agricultural
experiment with 8 plots and 4 subplots per plot. We applied an irrigation
treatment at the plot level and a fertilizer treatment at the subplot level. I
actually have 5 replicate observations per subplot.
[Figure: layout of the AgData design. Plots 1-4 receive Low irrigation and plots 5-8 receive High irrigation; each plot is divided into subplots receiving Low or High fertilizer]
So all together we have 8 plots, 32 subplots, and 5 replicates per subplot. When
I analyze the fertilizer, I have 32 experimental units (the thing I have applied
my treatment to), but when analyzing the effect of irrigation, I only have 8
experimental units. In other words, I should have 8 random effects for plot, and
32 random effects for subplot.
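As a bookkeeping check, the sketch below counts the random-effect levels implied by (1|plot/subplot) for a hypothetical layout matching this description:

```r
# Count the nested random-effect levels implied by (1|plot/subplot):
# a hypothetical 8-plot layout with 4 subplots in each plot.
d <- expand.grid(plot = factor(1:8), subplot = factor(1:4))
nlevels(d$plot)                                        # 8 plot effects
nlevels(interaction(d$plot, d$subplot, drop = TRUE))   # 32 plot:subplot effects
```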
# The following model definitions are equivalent
model <- lmer(yield ~ Irrigation + Fertilizer + (1|plot) + (1|plot:subplot), data=AgData)
model <- lmer(yield ~ Irrigation + Fertilizer + (1|plot/subplot), data=AgData)
anova(model)
anova(model)

## Type III Analysis of Variance Table with Satterthwaite's method


## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Irrigation 3.4769 3.4769 1 5.9999 3.4281 0.1136
## Fertilizer 31.3824 31.3824 1 22.9999 30.9420 1.169e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we saw before, the effect of irrigation is not significant and the fertilizer effect
is highly significant. We’ll remove the irrigation covariate and refit the model.
We designed our experiment assuming that both plot and subplot were possibly
important. Many statisticians would argue that because that was how we
designed the experiment, we should necessarily keep that structure when
analyzing the data. This is particularly compelling considering that the Irrigation
and Fertilizer were applied on different scales. However, it is still fun to look at
the statistical significance of the random effects.

model <- lmer(yield ~ Fertilizer + (1|plot/subplot), data=AgData)


ranova(model)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## yield ~ Fertilizer + (1 | subplot:plot) + (1 | plot)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 5 -286.32 582.64
## (1 | subplot:plot) 4 -369.99 747.97 167.334 1 < 2.2e-16 ***
## (1 | plot) 4 -293.00 594.01 13.367 1 0.0002561 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each row (except <none>) tests the exclusion of one random effect: the
first row tests the 32 subplots and the second the 8 plots. Both are highly
statistically significant. To assess the practical significance of subplots and plots,
we need to look at the variance terms:

summary(model)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
## Formula: yield ~ Fertilizer + (1 | plot/subplot)
## Data: AgData
##
## REML criterion at convergence: 572.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.78714 -0.62878 -0.08602 0.64094 2.36353
##
## Random effects:
## Groups Name Variance Std.Dev.
## subplot:plot (Intercept) 5.345 2.312
## plot (Intercept) 8.854 2.975
## Residual 1.014 1.007
## Number of obs: 160, groups: subplot:plot, 32; plot, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 21.0211 1.2056 8.9744 17.436 3.14e-08 ***
## FertilizerHigh 4.6323 0.8328 23.0000 5.563 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## FertilzrHgh -0.345

Notice the plant-to-plant (Residual) standard deviation is roughly a third of the
noise associated with subplot-to-subplot or plot-to-plot differences. Finally, the
effect of increasing the fertilizer level is to increase yield by about 4.6.

A number of in-situ experiments looking at the addition of CO2 and warming on
landscapes have been done (typically called Free Air CO2 Enrichment (FACE)
experiments), and these are interesting from an experimental design perspective
because the number of replicates is limited: exposing plants to different CO2
levels outside a greenhouse is extraordinarily expensive. In the dsData
package, there is a dataset that is inspired by one of those studies.
The experimental units for the CO2 treatment will be called rings, and we have
nine rings. We have three treatments (A, B, C) which correspond to an elevated
CO2 treatment, an ambient CO2 treatment with all the fans, and a pure control.
For each ring we'll have some measure of productivity, with six replicate
observations per ring.

data("HierarchicalData", package = 'dsData')


head(HierarchicalData)

## Trt Ring Rep y


## 1 A 1 1 363.9684
## 2 A 1 2 312.0613
## 3 A 1 3 332.9916
## 4 A 1 4 320.0109
## 5 A 1 5 292.2656
## 6 A 1 6 315.8136

[Figure: 'Experiment Design and Results'. Response vs. Replicate (1-6) in one panel per ring (1-9), colored by treatment (A, B, C).]

We can easily fit this model using random effects for each ring.

model <- lmer( y ~ Trt + (1|Ring), data=HierarchicalData )


anova(model)

## Type III Analysis of Variance Table with Satterthwaite's method


## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Trt 10776 5388.1 2 6 5.7176 0.04075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To think about what is actually going on, it is helpful to consider the predicted
values from this model. As usual we will use the predict function, but now we
have the option of including the random effects or not.

First let's consider the predicted values if we completely ignore the Ring random
effect while making predictions.

HierarchicalData.Predictions <- HierarchicalData %>%


mutate( y.hat = predict(model, re.form= ~ 0), # don't include any random effects
y.hat = round( y.hat, digits=2),
my.text = TeX(paste('$\\hat{y}$ =', y.hat), output='character')) %>%
group_by(Trt, Ring) %>%
slice(1) # Predictions are the same for all replicates in the ring

HierarchicalData.Predictions %>%
head()

## # A tibble: 6 x 6
## # Groups: Trt, Ring [6]
## Trt Ring Rep y y.hat my.text
## <fct> <fct> <int> <dbl> <dbl> <chr>
## 1 A 1 1 364. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 2 A 2 1 269. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 3 A 3 1 321. 312. paste('','',hat(paste('y')),'',' = 311','. 78')
## 4 B 4 1 189. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
## 5 B 5 1 265. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
## 6 B 6 1 251. 235. paste('','',hat(paste('y')),'',' = 234','. 53')
[Figure: 'Treatment Only Predictions'. Ignoring the ring random effects, every ring within a treatment gets the same prediction: ŷ = 311.78 for treatment A (rings 1-3), ŷ = 234.53 for B (rings 4-6), and ŷ = 232.19 for C (rings 7-9).]

Now we consider the predicted values, but created using the Ring random effect.
These random effects provide for a slight perturbation up or down depending
on the quality of the Ring, but the sum of all 9 Ring effects is required to be 0.

ranef(model)

## $Ring
## (Intercept)
## 1 9.458738
## 2 -29.798425
## 3 20.339687
## 4 -40.532972
## 5 21.067503
## 6 19.465469
## 7 -21.814405
## 8 2.548287
## 9 19.266118
##
## with conditional variances for "Ring"

sum(ranef(model)$Ring)

## [1] -3.519407e-12

Also notice that the sum of the random effects within each treatment is zero! (Recall
rings 1-3 were treatment A, rings 4-6 treatment B, and rings 7-9 treatment C.)
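We can verify this numerically from the `ranef` output printed above (values copied by hand from that output):

```r
ring.effect <- c(  9.458738, -29.798425, 20.339687,   # rings 1-3, treatment A
                 -40.532972,  21.067503, 19.465469,   # rings 4-6, treatment B
                 -21.814405,   2.548287, 19.266118)   # rings 7-9, treatment C
trt <- rep(c('A','B','C'), each = 3)
tapply(ring.effect, trt, sum)   # each treatment's effects sum to 0 (up to rounding)
```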

HierarchicalData.Predictions <- HierarchicalData %>%


mutate( y.hat = predict(model, re.form= ~ (1|Ring)), # Include Ring Random effect
y.hat = round( y.hat, digits=2),
my.text = TeX(paste('$\\hat{y}$ =', y.hat), output='character'))

[Figure: 'Treatment and Ring Predictions'. With the ring random effects included, each ring gets its own prediction: ŷ = 321.24, 281.98, 332.12 for rings 1-3; ŷ = 194, 255.6, 254 for rings 4-6; ŷ = 210.37, 234.73, 251.45 for rings 7-9.]

We interpret the random effect of Ring as a perturbation to the expected value of
the response that you would expect based on the treatment alone.

We’ll now consider an example with a somewhat ridiculous amount of nesting.
We will consider an experiment run to test the consistency between laboratories.
A large jar of dried egg powder was fully homogenized and divided into a
number of samples, so the fat content of the samples should be the same.
Six laboratories were randomly selected and instructed to have two technicians
analyze what they thought were all different samples. Each technician
received two separate samples (labeled H and G) and was instructed to measure
each sample twice. So our hierarchy is that observations are nested within
samples, which are nested within technicians, which are nested within labs.

     Fat    Lab   Technician   Sample
1    0.62   I     one          G
2    0.55   I     one          G
3    0.34   I     one          H
4    0.24   I     one          H
…    …      …     …            …
45   0.18   VI    two          G
46   0.20   VI    two          G
47   0.26   VI    two          H
48   0.06   VI    two          H

Notice in the data, the technicians are always labeled as ‘one’ and ‘two’
regardless of the lab. Likewise the two samples given to each technician are
always labeled ‘G’ and ‘H’ even though the actual physical samples are different
for each technician.
In terms of notation, we will refer to the 6 labs as $L_i$ and the lab technicians
as $T_{ij}$, where $j$ is either 1 or 2 and doesn't uniquely identify the
technician unless we include the lab subscript as well. The sub-samples are
nested within the technicians and we denote them as $S_{ijk}$. Finally, our "pure"
error comes from the two observations on the same sample. So the model we wish to
fit is

$$y_{ijkl} = \mu + L_i + T_{ij} + S_{ijk} + \epsilon_{ijkl}$$

where $L_i \stackrel{iid}{\sim} N(0, \sigma_L^2)$, $T_{ij} \stackrel{iid}{\sim} N(0, \sigma_T^2)$, $S_{ijk} \stackrel{iid}{\sim} N(0, \sigma_S^2)$, and $\epsilon_{ijkl} \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$.
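One consequence of this model, implied by the independence assumptions above though not spelled out in the text, is that the nesting determines how strongly two measurements are correlated: two observations share the variance of every random effect they have in common.

$$\text{Var}(y_{ijkl}) = \sigma_L^2 + \sigma_T^2 + \sigma_S^2 + \sigma_\epsilon^2$$

$$\text{Cov}(y_{ijkl},\, y_{ijkl'}) = \sigma_L^2 + \sigma_T^2 + \sigma_S^2 \quad \text{(same sample)}$$

$$\text{Cov}(y_{ijkl},\, y_{ijk'l'}) = \sigma_L^2 + \sigma_T^2 \quad \text{(same technician, different samples)}$$

$$\text{Cov}(y_{ijkl},\, y_{ij'k'l'}) = \sigma_L^2 \quad \text{(same lab, different technicians)}$$

So two measurements on the same sample are the most correlated, while measurements from different labs are independent.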
We need a convenient way to tell lmer which factors are nested in which. We can
do this by creating data columns that make the interaction terms. For example,
there are 12 technicians (2 from each lab), but in our data frame we only see
two levels, so to create all 12 random effects we need to create an interaction
column (or tell lmer to create it and use it). Likewise there are 24 sub-samples
and 48 "pure" error terms.
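Those level counts can be sanity-checked by rebuilding the design grid; the sketch below mirrors the structure of the eggs data without loading it:

```r
grid <- expand.grid(Lab = c('I','II','III','IV','V','VI'),
                    Technician = c('one','two'),
                    Sample = c('G','H'),
                    measurement = 1:2)
nlevels( interaction(grid$Lab, grid$Technician) )               # 12 technicians
nlevels( interaction(grid$Lab, grid$Technician, grid$Sample) )  # 24 sub-samples
nrow(grid)                                                      # 48 observations
```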

data('eggs', package='faraway')
model <- lmer( Fat ~ 1 + (1|Lab) + (1|Lab:Technician) +
(1|Lab:Technician:Sample), data=eggs)
model <- lmer( Fat ~ 1 + (1|Lab/Technician/Sample), data=eggs)

eggs <- eggs %>%


mutate( yhat = predict(model, re.form=~0))
ggplot(eggs, aes(x=Sample, y=Fat)) +
geom_point() +
geom_line(aes(y=yhat, x=as.integer(Sample)), color='red') +
facet_grid(. ~ Lab:Technician) +
ggtitle('Average Value Only')
[Figure: 'Average Value Only'. Fat vs. Sample (G, H) in one panel per Lab:Technician combination, with the overall mean prediction shown as a flat red line.]

eggs <- eggs %>%


mutate( yhat = predict(model, re.form=~(1|Lab)))
ggplot(eggs, aes(x=Sample, y=Fat)) +
geom_point() +
geom_line(aes(y=yhat, x=as.integer(Sample)), color='red') +
facet_grid(. ~ Lab+Technician) +
ggtitle('Average With Lab Offset')

[Figure: 'Average With Lab Offset'. The red prediction line now shifts up or down by each lab's random effect.]

eggs <- eggs %>%


mutate( yhat = predict(model, re.form=~(1|Lab/Technician)))
ggplot(eggs, aes(x=Sample, y=Fat)) +
geom_point() +
geom_line(aes(y=yhat, x=as.integer(Sample)), color='red') +
facet_grid(. ~ Lab+Technician) +
ggtitle('Average With Lab + Technician Offset')
[Figure: 'Average With Lab + Technician Offset'. The red prediction line additionally shifts by each technician's random effect.]

eggs <- eggs %>%


mutate( yhat = predict(model, re.form=~(1|Lab/Technician/Sample)))
ggplot(eggs, aes(x=Sample, y=Fat)) +
geom_point() +
geom_line(aes(y=yhat, x=as.integer(Sample)), color='red') +
facet_grid(. ~ Lab+Technician) +
ggtitle('Average With Lab + Technician + Sample Offset')

[Figure: 'Average With Lab + Technician + Sample Offset'. The red prediction line additionally shifts by each sample's random effect.]

Now that we have an idea of how things vary, we can look at the 𝜎 terms.

summary(model)[['varcor']]

##  Groups                  Name        Std.Dev.
##  Sample:(Technician:Lab) (Intercept) 0.055359
##  Technician:Lab          (Intercept) 0.083548
##  Lab                     (Intercept) 0.076941
##  Residual                            0.084828
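Converting the printed standard deviations into variance proportions makes the comparison easier (the numbers below are copied by hand from the output above):

```r
sigma <- c(Sample = 0.055359, Technician = 0.083548,
           Lab = 0.076941, Residual = 0.084828)
round( sigma^2 / sum(sigma^2), 2 )  # residual and technician carry the largest shares
```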

It looks like there is still plenty of unexplained variability, but the next largest
source of variability is in the technician and also the lab. Is the variability lab-to-
lab large enough for us to convincingly argue that it is statistically significant?

ranova(model)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## Fat ~ (1 | Sample:(Technician:Lab)) + (1 | Technician:Lab) +
## (1 | Lab)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 5 32.118 -54.235
## (1 | Sample:(Technician:Lab)) 4 31.316 -54.632 1.60342 1 0.20542
## (1 | Technician:Lab) 4 30.740 -53.480 2.75552 1 0.09692 .
## (1 | Lab) 4 31.719 -55.439 0.79649 1 0.37215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It looks like the technician effect is at the edge of statistical significance, but the
lab-to-lab effect is smaller than the pure error and not statistically significant.
I'm not thrilled with the repeatability, but the technicians are a bigger concern
than the individual labs. These data aren't strong evidence of big differences
between labs, but I would need to know whether the size of the error is practically
important before we just pick the most convenient lab to send our samples to.

10.7 Crossed Effects


If two effects are not nested, we say they are crossed. In the penicillin example,
the treatments and blends were not nested and are therefore crossed.
An example is a Latin square experiment to look at the effects of abrasion on four
different material types (A, B, C, and D). We have a machine to do the abrasion
test with four positions, and we did 4 different machine runs. Our data looks
like the following setup:

run   Position: 1   Position: 2   Position: 3   Position: 4
1     C             D             B             A
2     A             B             D             C
3     D             C             A             B
4     B             A             C             D
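A quick check that the layout above really is a Latin square: every material appears exactly once in each run (row) and each position (column). This sketch hard-codes the table from the text:

```r
square <- matrix(c('C','D','B','A',
                   'A','B','D','C',
                   'D','C','A','B',
                   'B','A','C','D'),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(run = 1:4, position = 1:4))
rows.ok <- all(apply(square, 1, function(x) setequal(x, c('A','B','C','D'))))
cols.ok <- all(apply(square, 2, function(x) setequal(x, c('A','B','C','D'))))
c(rows.ok, cols.ok)   # both TRUE for a Latin square
```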

Our model can be written as

$$y_{ijk} = \mu + M_i + P_j + R_k + \epsilon_{ijk}$$

and we notice that the position and run effects are not nested within anything
else, and thus their subscripts have just a single index. Certainly the run
effect should be considered random as these four are a sample from all possible
runs, but what about the position variable? Here we consider that the machine
being used is a random selection from all possible abrasion machines, and any
position differences have likely developed over time and could be considered
a random sample of possible position effects. We'll regard both position and
run as crossed random effects.

data('abrasion', package='faraway')
ggplot(abrasion, aes(x=material, y=wear, color=position, shape=run)) +
geom_point(size=3)

[Figure: wear vs. material (A-D) for the abrasion data, with point color indicating position (1-4) and point shape indicating run (1-4).]

It certainly looks like the materials are different. I don’t think the run matters,
but position 2 seems to develop excessive wear compared to the other positions.

m <- lmer( wear ~ material + (1|run) + (1|position), data=abrasion)


anova(m)

## Type III Analysis of Variance Table with Satterthwaite's method


## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## material 4621.5 1540.5 3 6 25.151 0.0008498 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The material effect is statistically significant and we can figure out the pairwise
differences in the usual fashion.

emmeans(m, specs= pairwise~material)

## $emmeans
## material emmean SE df lower.CL upper.CL
## A 266 7.67 7.48 248 284
## B 220 7.67 7.48 202 238
## C 242 7.67 7.48 224 260
## D 230 7.67 7.48 213 248
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## A - B 45.8 5.53 6 8.267 0.0007
## A - C 24.0 5.53 6 4.337 0.0190
## A - D 35.2 5.53 6 6.370 0.0029
## B - C -21.8 5.53 6 -3.930 0.0295
## B - D -10.5 5.53 6 -1.897 0.3206
## C - D 11.2 5.53 6 2.033 0.2743
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 4 estimates

emmeans(m, specs= ~material) %>%


multcomp::cld(Letters=letters)

## material emmean SE df lower.CL upper.CL .group


## B 220 7.67 7.48 202 238 a
## D 230 7.67 7.48 213 248 ab
## C 242 7.67 7.48 224 260 b
## A 266 7.67 7.48 248 284 c
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05

So material D is in between materials B and C for abrasion resistance.

summary(m)[['varcor']]

##  Groups   Name        Std.Dev.
##  run      (Intercept)  8.1790
##  position (Intercept) 10.3471
##  Residual              7.8262

Notice that run and the pure error standard deviations have about the same
magnitude, but position is more substantial. Let's see what happens if we remove
the run random effect.

m2 <- lmer( wear ~ material + (1|position), data=abrasion) # Run is marginally significant

# anova(m, m2)            # This would be wrong because it would refit the models by default using ML
anova(m, m2, refit=FALSE) # refit=FALSE keeps the REML fits instead of refitting using ML

## Data: abrasion
## Models:
## m2: wear ~ material + (1 | position)
## m: wear ~ material + (1 | run) + (1 | position)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m2 6 115.30 119.94 -51.651 103.30
## m 7 114.26 119.66 -50.128 100.26 3.0459 1 0.08094 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m3 <- lmer( wear ~ material + (1|run), data=abrasion) # Position is mildly significant


anova(m, m3, refit=FALSE) #

## Data: abrasion
## Models:
## m3: wear ~ material + (1 | run)
## m: wear ~ material + (1 | run) + (1 | position)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m3 6 116.85 121.48 -52.425 104.85
## m 7 114.26 119.66 -50.128 100.26 4.5931 1 0.0321 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Now do a similar test, but all at once.


ranova(m)

## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## wear ~ material + (1 | run) + (1 | position)
##                npar  logLik    AIC    LRT Df Pr(>Chisq)
## <none>            7 -50.128 114.26
## (1 | run)         6 -51.651 115.30 3.0459  1    0.08094 .
## (1 | position)    6 -52.425 116.85 4.5931  1    0.03210 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that while position continues to be weakly statistically significant, the
run has dropped down to marginal significance.

10.8 Repeated Measures / Longitudinal Studies

In repeated measurement experiments, repeated observations are taken on each
subject. When those repeated measurements are taken over a sequence of time,
we call it a longitudinal study. Typically covariates are also observed at the
same time points and we are interested in how the response is related to the
covariates.

In this case the correlation structure is that observations on the same
person/object should be more similar than observations between two
people/objects. As a result, we need to account for repeated measures by
including the person/object as a random effect.

To demonstrate a longitudinal study we turn to the data set sleepstudy in
the lme4 library. Eighteen patients participated in a study in which they were
allowed only 3 hours of sleep per night and their reaction time in a specific test
was observed. On day zero (before any sleep deprivation occurred) their reaction
times were recorded and then the measurement was repeated on 9 subsequent
days.

data('sleepstudy', package='lme4')
ggplot(sleepstudy, aes(y=Reaction, x=Days)) +
facet_wrap(~ Subject, ncol=6) +
geom_point() +
geom_line()
[Figure: Reaction vs. Days for the sleepstudy data, one panel per subject (18 subjects), with points connected by lines.]

We want to fit a line to these data, but how should we do this? First we notice
that each subject has their own baseline for reaction time and the subsequent
measurements are relative to this, so it is clear that we should fit a model with
a random intercept.

m1 <- lmer( Reaction ~ Days + (1|Subject), data=sleepstudy)


summary(m1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
## Formula: Reaction ~ Days + (1 | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1786.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2257 -0.5529 0.0109 0.5188 4.2506
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 1378.2 37.12
## Residual 960.5 30.99
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 251.4051 9.7467 22.8102 25.79 <2e-16 ***
## Days 10.4673 0.8042 161.0000 13.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

ranova(m1)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## Reaction ~ Days + (1 | Subject)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 4 -893.23 1794.5
## (1 | Subject) 3 -946.83 1899.7 107.2 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To visualize how well this model fits our data, we will plot the predicted values,
which are lines with y-intercepts equal to the sum of the fixed intercept and
each subject's random intercept. The slope for each patient is assumed to be
the same and is approximately 10.47.
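To make the arithmetic concrete, here is how one subject's fitted line is assembled from the printed estimates; the subject offset of -5.2 is hypothetical, not taken from the fitted model:

```r
beta0 <- 251.4    # fixed intercept (from summary output)
beta1 <- 10.47    # fixed slope, shared by all subjects in this model
b0    <- -5.2     # hypothetical random intercept for one subject
Days  <- 0:9
y.hat <- (beta0 + b0) + beta1 * Days   # subject line: own intercept, common slope
y.hat[1:2]   # 246.2, then 246.2 + 10.47
```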

sleepstudy <- sleepstudy %>%


mutate(yhat = predict(m1, re.form=~(1|Subject)))
ggplot(sleepstudy, aes(y=Reaction, x=Days)) +
facet_wrap(~ Subject, ncol=6) +
geom_point() +
geom_line() +
geom_line(aes(y=yhat), color='red')

[Figure: observed Reaction vs. Days per subject, with the random-intercept model's fitted line overlaid in red.]

This isn’t too bad, but I would really like to have each patient have their own
slope as well as their own y-intercept. The random slope will be calculated as
a fixed effect of slope plus a random offset from that.

# Random effects for intercept and Slope


m2 <- lmer( Reaction ~ Days + ( 1+Days | Subject), data=sleepstudy)

sleepstudy <- sleepstudy %>%


mutate(yhat = predict(m2, re.form=~(1+Days|Subject)))
ggplot(sleepstudy, aes(y=Reaction, x=Days)) +
facet_wrap(~ Subject, ncol=6) +
geom_point() +
geom_line() +
geom_line(aes(y=yhat), color='red')

[Figure: observed Reaction vs. Days per subject, with the random-intercept-and-slope model's fitted line overlaid in red.]

This appears to fit the observed data quite a bit better, but it is useful to test
this.

# This is the first time the ranova() table has considered a reduction. Here we
# consider reducing the random term from (1+Days|Subject) to (1|Subject)
ranova(m2)

## ANOVA-like table for random-effects: Single term deletions


##
## Model:
## Reaction ~ Days + (1 + Days | Subject)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 6 -871.81 1755.6
## Days in (1 + Days | Subject) 4 -893.23 1794.5 42.837 2 4.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# We get the same analysis directly specifying which Simple/Complex model to compare
anova(m2, m1, refit=FALSE)

## Data: sleepstudy
## Models:
## m1: Reaction ~ Days + (1 | Subject)
## m2: Reaction ~ Days + (1 + Days | Subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m1 4 1794.5 1807.2 -893.23 1786.5
## m2 6 1755.6 1774.8 -871.81 1743.6 42.837 2 4.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that indeed the random effect for each subject in both y-intercept
and slope is a better model than just a random offset in y-intercept.
It is instructive to look at this example from the top down. First we plot the
population regression line.

sleepstudy <- sleepstudy %>%


mutate(yhat = predict(m2, re.form=~0))
ggplot(sleepstudy, aes(x=Days, y=yhat)) +
geom_line(color='red') + ylab('Reaction') +
ggtitle('Population Estimated Regression Curve') +
scale_x_continuous(breaks = seq(0,9, by=2))

[Figure: 'Population Estimated Regression Curve'. The fixed-effects regression line of Reaction on Days, rising from about 250 at Day 0 to roughly 345 at Day 9.]

sleepstudy <- sleepstudy %>%


mutate(yhat.ind = predict(m2, re.form=~(1+Days|Subject)))
ggplot(sleepstudy, aes(x=Days)) +
geom_line(aes(y=yhat), size=3) +
geom_line(aes(y=yhat.ind, group=Subject), color='red') +
scale_x_continuous(breaks = seq(0,9, by=2)) +
ylab('Reaction') + ggtitle('Person-to-Person Variation')
[Figure: 'Person-to-Person Variation'. The population line (black) with each subject's fitted line (red) scattered around it.]

ggplot(sleepstudy, aes(x=Days)) +
geom_line(aes(y=yhat)) +
geom_line(aes(y=yhat.ind, group=Subject), color='red') +
scale_x_continuous(breaks = seq(0,9, by=2)) +
ylab('Reaction') + ggtitle('Within Person Variation') +
facet_wrap(~ Subject, ncol=6) +
geom_point(aes(y=Reaction))

[Figure: 'Within Person Variation'. One panel per subject showing the population line (black), the subject's fitted line (red), and the observed points.]

Finally we want to go back and look at the coefficients for the complex model.

summary(m2)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
## Formula: Reaction ~ Days + (1 + Days | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 251.405 6.825 17.000 36.838 < 2e-16 ***
## Days 10.467 1.546 17.000 6.771 3.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138

10.9 Confidence and Prediction Intervals


As with the standard linear model, we often want to create confidence and
prediction intervals for a new observation or set of observations. Unfortunately,
there isn't a nice way to easily incorporate the uncertainty of the variance
components. Instead we have to rely on bootstrapping techniques to produce these
quantities. Fortunately the lme4 package provides a function that will handle
most of the looping required, but we have to describe to the program how to
create the bootstrap samples and, given a bootstrap sample, what statistics we
want to produce intervals for.
Recall that the steps of a bootstrap procedure are:

1. Generate a bootstrap sample.
2. Fit a model to that bootstrap sample.
3. From that model, calculate some statistic(s) you care about. This is the
   only step where the user needs to do any work to specify.
4. Repeat steps 1-3, many times, generating a bootstrap distribution of the
   statistics you care about.
5. From the bootstrap distribution generate confidence intervals for the value
   of interest.
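The recipe above can be sketched in a few lines of base R for the simplest possible case, a percentile bootstrap interval for a mean (toy data invented for this illustration):

```r
set.seed(42)
x <- c(4.1, 5.3, 6.0, 4.8, 5.5, 6.2, 5.0, 4.6)   # toy observations
boot.stat <- replicate(2000, {
  x.star <- sample(x, replace = TRUE)  # step 1: bootstrap sample
  mean(x.star)                         # steps 2-3: "fit" and compute the statistic
})                                     # step 4: repeat many times
quantile(boot.stat, c(0.025, 0.975))   # step 5: percentile interval
```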

Typically the bootstrap is used when we don't want to make any distributional
assumptions on the data. In that case, we sample with replacement from the
observed data to create the bootstrap data. But, if we don't mind making
distributional assumptions, then instead of re-sampling the data, we could sample
from the distribution with the observed parameters. In our sleep study example,
we have estimated a population intercept and slope of 251.4 and 10.5. But we
also have a subject intercept and slope random effect which we assumed to be
normally distributed, centered at zero, with estimated standard deviations
of 24.7 and 5.9. Then, given a subject's regression line, observations are
just normal (mean zero, standard deviation 25.6) perturbations from the line.
All of these numbers came from the summary(m2) output.
To create a bootstrap data simulating a new subject, we could do the following:
To create a bootstrap data simulating a new subject, we could do the following:

subject.intercept = 251.4 + rnorm( 1, mean = 0, sd=24.7)
subject.slope = 10.5 + rnorm( 1, mean = 0, sd=5.9) # slope random-effect sd is 5.9, not 10.5
c(subject.intercept, subject.slope)

## [1] 276.51147 16.71648

subject.obs <- data.frame(Days = 0:8) %>%


mutate( Reaction = subject.intercept + subject.slope*Days + rnorm(9, sd=25.6) )

ggplot(subject.obs, aes(x=Days, y=Reaction)) + geom_point()

[Figure: the simulated subject's nine Reaction observations plotted against Days.]

This approach is commonly referred to as a "parametric" bootstrap because we
are making some assumptions about the parameter distributions, whereas in a
"non-parametric" bootstrap we don't make any distributional assumptions. By
default, the bootMer function will

1) Perform a parametric bootstrap to create new bootstrap data sets, using
   the results of the initial model.
2) Create a bootstrap model by analyzing the bootstrap data using the same
model formula used by the initial model.
3) Apply some function you write to each bootstrap model. This function
takes in a bootstrap model and returns a statistic or vector of statistics.
4) Repeat steps 1-3 repeatedly to create the bootstrap distribution of the
statistics returned by your function.

10.9.1 Confidence Intervals

Now that we can generate bootstrap data sets, we need to fit a model to each
one and then grab the predictions from that model. At this point
we are creating a confidence interval for the response line of a randomly selected
person from the population. The lme4::bootMer function will create bootstrap
data sets and then send those into the lmer function.

ConfData <- data.frame(Days=0:8) # What x-values I care about

# Get our best guess as to the relationship between day and reaction time
ConfData <- ConfData %>%
mutate( Estimate = predict( m2, newdata = ConfData, re.form=~0) )

# A function to generate yhat from a model


myStats <- function(model.star){
out <- predict( model.star, newdata=ConfData, re.form=~0 )
return(out)
}

# bootMer generates new data sets, calls lmer on the data to produce a model,
# and then calls whatever function I pass in. It repeats this `nsim`
# number of times.
bootObj <- lme4::bootMer(m2, FUN=myStats, nsim = 1000 )

# using car::hist.boot, but because hist() is overloaded, I'm not able
# to call it directly. Unfortunately, it doesn't allow me to select the
# bias-corrected and accelerated method, but the percentile method is ok
# for visualization.
hist(bootObj, ci='perc')
[Figure: nine panels, one per predicted value (Days 0-8), each showing that value's bootstrap distribution with its normal density, kernel density, percentile 95% CI, and observed value.]

# Add the confidence interval values onto estimates data frame


CI <- confint( bootObj, level=0.95, ci='bca')
colnames(CI) <- c('lwr','upr')
ConfData <- cbind(ConfData, CI)

# Now for a nice graph!


ConfData %>%
ggplot(aes(x=Days)) +
geom_line(aes(y=Estimate), color='red') +
geom_ribbon(aes(ymin=lwr, ymax= upr), fill='salmon', alpha=0.2)
[Figure: the estimated population line (red) with its bootstrap confidence band (salmon) across Days 0-8.]

10.9.2 Prediction Intervals

For a prediction interval, we want to find the range of plausible observed values.
In this case, we want to use the bootstrap data, but we don't need to fit a model
at each bootstrap step. The simulate function from lme4 creates the bootstrap
dataset and doesn't send it on for more processing. It returns simulated response
values that are appropriately organized to be appended to the original dataset.

# # set up the structure of new subjects


PredData <- data.frame(Subject='new', Days=0:8) # Simulate a NEW patient

# Create a n x 1000 data frame


Simulated <- simulate(m2, newdata=PredData, allow.new.levels=TRUE, nsim=1000)

# squish the Subject/Day info together with the simulated and then grab the quantiles
# for each day
PredIntervals <- cbind(PredData, Simulated) %>%
gather('sim','Reaction', sim_1:sim_1000 ) %>% # go from wide to long structure
group_by(Subject, Days) %>%
summarize(lwr = quantile(Reaction, probs = 0.025),
upr = quantile(Reaction, probs = 0.975))

## `summarise()` regrouping output by 'Subject' (override with `.groups` argument)

# Plot the prediction and confidence intervals


ggplot(ConfData, aes(x=Days)) +
geom_line(aes(y=Estimate), color='red') +
geom_ribbon(aes(ymin=lwr, ymax= upr), fill='salmon', alpha=0.2) +
geom_ribbon(data=PredIntervals, aes(ymin=lwr, ymax=upr), fill='blue', alpha=0.2)

(Figure: the confidence band from before (salmon) together with the much wider prediction band (blue) for a new subject, across days 0 through 8.)

10.10 Exercises

1. An experiment was conducted to select the supplier of raw materials for


production of a component. The breaking strength of the component
was the objective of interest. Raw materials from four suppliers were
considered. In our factory, we have four operators that can only produce
one component per day. We utilized a Latin square design so that each
factory operator worked with a different supplier each day. The data set
is presented in the faraway package as breaking.

a. Explain why it would be natural to treat the operators and days as


random effects but the suppliers as fixed effects.
b. Inspect the data. Does anything seem weird? It turns out that the
person responsible for entering the data made an input error and
transposed the response values for rows 15 and 16. Create a graph
where the transposition is evident and then fix it. After your fix,
make sure that each day has all 4 suppliers and 4 operators.
c. Build a model to predict the breaking strength. Describe the varia-
tion from operator to operator and from day to day.
d. Test the significance of the supplier effect.
e. Is there a significant difference between the operators?

2. An experiment was performed to investigate the effect of ingestion of thy-


roxine or thiouracil. The researchers took 27 newborn rats and divided
them into three groups. The control group is ten rats that receive no ad-
dition to their drinking water. A second group of seven rats has thyroxine
added to their drinking water and the final set ten rats have thiouracil
added to their water. For each of five weeks, we take a body weight
measurement to monitor the rats’ growth. The data are available in the
faraway package as ratdrink. I suspect that we had 30 rats to begin with

and somehow three rats in the thyroxine group had some issue unrelated
to the treatment. The following R code might be helpful for the initial
visualization.

# we need to force ggplot to only draw lines between points for the same
# rat. If I haven't already defined some aesthetic that is different
# for each rat, then it will connect points at the same week but for different
# rats. The solution is to add an aesthetic that does the equivalent of the
# dplyr function group_by(). In ggplot2, this aesthetic is "group".
ggplot(ratdrink, aes(y=wt, x=weeks, color=treat)) +
geom_point(aes(shape=treat)) +
geom_line(aes(group=subject)) # play with removing the group=subject aesthetic...

a. Consider the model with an interaction between Treatment and Week


along with a random effect for each subject rat. Does the model with
a random offset in the y-intercept perform as well as the model with
random offsets in both the y-intercept and slope?
b. Next consider if you can simplify the model by removing the
interaction between Treatment and Week and possibly even the
Treatment main effect.

c. Comment on the effect of each treatment.

3. An experiment was conducted to determine the effect of recipe and baking


temperature on chocolate cake quality. For each recipe, 15 batches of cake
mix were prepared (so 45 batches total). Each batch was sufficient for
six cakes. Each of the six cakes was baked at a different temperature which
was randomly assigned. Several measures of cake quality were recorded of
which breaking angle was just one. The dataset is available in the faraway
package as choccake.

a. For the variables Temperature, Recipe, and Batch, which should be


fixed and which should be random?
b. Inspect the data. Create a graph to consider if there is an interac-
tion between Batch and Recipe. Describe the effect of Batch on the
BreakAngle. Hint make graphs with Batch number on the x-axis and
faceting on Temperature and Recipe.
c. Based on the graph in part (b), it appears that there is some main
effect of Batch, i.e. the effect of Batch does NOT change between
recipes. I would hypothesize that this is actually some “Time of
Day” effect going on. (I’d love to talk the scientist who did the
experiment about this). The 45 batch:recipe mixes need to be in
the model because each batter mix resulted in 6 different cakes. Make
a new variable called mix that has a unique value for each of the 45
recipe:batch levels. Then rename batch to be time. Should time
be fixed or random?

d. Build a mixed model using the fixed effects along with the random
effect of mix. Consider two-way interactions.
e. Using the model you selected, discuss the impact of the different
variance components.
Chapter 11

Binomial Regression

library(tidyverse) # dplyr, tidyr, ggplot2, etc...


library(emmeans)
library(pROC)
library(lmerTest)

The standard linear model assumes that the observed data is distributed
$$y = X\beta + \epsilon \quad\textrm{where}\quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

which can be re-written as

𝑦 ∼ 𝑁 (𝜇 = 𝑋𝛽, 𝜎2 𝐼)

and notably this assumes that the data are independent. This model has 𝐸 [𝑦] =
𝑋𝛽. This model is quite flexible and includes:

Model                        Predictor Type                 Response
Simple Linear Regression     1 Continuous                   Continuous Normal Response
1-way ANOVA                  1 Categorical                  Continuous Normal Response
2-way ANOVA                  2 Categorical                  Continuous Normal Response
ANCOVA                       1 Continuous, 1 Categorical    Continuous Normal Response

The general linear model expanded on the linear model and we allow the data


points to be correlated
𝑦 ∼ 𝑁 (𝑋𝛽, 𝜎2 Ω)
where we assume that Ω has some known form but may include some unknown
correlation parameters. This type of model includes our work with mixed models
and time series data.
The study of generalized linear models removes the assumption that the error
terms are normally distributed and allows the data to be distributed according
to some other distribution such as Binomial, Poisson, or Exponential. These
distributions are parameterized differently than the normal (instead of 𝜇 and
𝜎, we might be interested in 𝜆 or 𝑝). However, I am still interested in how my
covariates can be used to estimate my parameter of interest.
Critically, I still want to parameterize my covariates as 𝑋𝛽 because we understand how continuous and discrete covariates are added and interpreted, and what interactions between them mean. By keeping the 𝑋𝛽 part, we continue to build on the earlier foundations.

11.1 Binomial Regression Model


To remove a layer of abstraction, we will now consider the case of binary regression. In this model, the observations (which we denote by 𝑤𝑖) are zeros and ones which correspond to some binary observation, perhaps presence/absence of an animal in a plot, or the success or failure of a viral infection. Recall that we could model this as a 𝑊𝑖 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝𝑖) random variable.

𝑃 (𝑊𝑖 = 1) = 𝑝𝑖

𝑃 (𝑊𝑖 = 0) = (1 − 𝑝𝑖 )
which I can rewrite more formally, letting 𝑤𝑖 be the observed value, as

$$P(W_i = w_i) = p_i^{w_i}(1-p_i)^{1-w_i}$$

and the parameter that I wish to estimate and understand is the probability
of a success 𝑝𝑖 and usually I wish to know how my covariate data 𝑋𝛽 informs
these probabilities.
In the normal distribution case, we estimated the expected value of the response vector (𝜇) simply using 𝜇̂ = 𝑋𝛽̂, but this will not work for an estimate of 𝑝̂ because there is no constraint on 𝑋𝛽̂; nothing prevents it from being negative or greater than 1. Because we require the probability of success to be a number between 0 and 1, I have a problem.
Example: Suppose we are interested in the abundance of mayflies in a stream.
Because mayflies are sensitive to metal pollution, I might be interested in looking
at the presence/absence of mayflies in a stream relative to a pollution gradient.

Here the pollution gradient is measured in Cumulative Criterion Units (CCU:


CCU is defined as the ratio of the measured metal concentration to the hardness
adjusted chronic criterion concentration, and then summed across each metal)
where larger values imply more metal pollution.

data('Mayflies', package='dsData')
head(Mayflies)

## CCU Occupancy
## 1 0.05261076 1
## 2 0.25935617 1
## 3 0.64322010 1
## 4 0.90168941 1
## 5 0.97002630 1
## 6 1.08037011 1

ggplot(Mayflies, aes(x=CCU, y=Occupancy)) + geom_point()

(Figure: scatterplot of mayfly Occupancy (0/1) versus CCU.)

If I just fit a regular linear model to this data, we fit the following:

m <- lm( Occupancy ~ CCU, data=Mayflies )


Mayflies %>% mutate(yhat = predict(m)) %>%
ggplot(aes(x=CCU, y=Occupancy)) +
geom_point() +
geom_line(aes(y=yhat), color='red')

(Figure: Occupancy versus CCU with the fitted linear regression line in red, which drops below zero at large CCU values.)

which is horrible. First, we want the regression line to be related to the probability of occurrence, but it is giving negative values. Instead, we want it to tail off slowly and give more of a sigmoid-shaped curve, perhaps something more like the following:

(Figure: Occupancy versus CCU with a sigmoid-shaped curve that decays smoothly toward zero as CCU increases.)

We need a way to convert our covariate data 𝑦 = 𝑋𝛽 from something that can
take values from −∞ to +∞ to something that is constrained between 0 and 1
so that we can fit the model


$$w_i \sim \textrm{Bernoulli}\Bigg(\underbrace{g^{-1}\Big(\underbrace{y_i}_{\textrm{in } (-\infty,\,\infty)}\Big)}_{\textrm{in } [0,1]}\Bigg)$$

We use the notation 𝑦𝑖 = 𝑋 𝑖,⋅ 𝛽 to denote a single set of covariate values and
notice that this is unconstrained and can be in (−∞, +∞) while the parameter
of interest 𝑝𝑖 = 𝑔−1 (𝑦𝑖 ) is constrained to [0, 1]. When convenient, we will drop
the 𝑖 subscript while keeping the domain restrictions.
There are several options for the link function 𝑔−1 (⋅) that are commonly used.

1. Logit (log odds) transformation. The link function is

$$g(p) = \textrm{logit}(p) = \log\Big[\underbrace{\frac{p}{1-p}}_{\textrm{odds}}\Big] = y$$

with inverse

$$g^{-1}(y) = \textrm{ilogit}(y) = \frac{1}{1+e^{-y}} = p$$

and we think of $g(p)$ as the log odds function.
2. Probit transformation. The link function is 𝑔 (𝑝) = Φ−1 (𝑝) where Φ
is the standard normal cumulative distribution function and therefore
𝑔−1 (𝑋𝛽) = Φ (𝑋𝛽).
3. Complementary log-log transformation: 𝑔 (𝑝) = log [− log(1 − 𝑝)].

All of these functions will give a sigmoid shape with higher probability as 𝑦 increases and lower probability as it decreases. The logit and probit transformations have the nice property that if 𝑦 = 0 then 𝑔−1(0) = 1/2.
(Figure: "Inverse Logit() function", the sigmoid curve of p = ilogit(y) plotted for y = Xβ from −5 to 5.)

Usually the difference in inferences made using these different curves is relatively
small and we will usually use the logit transformation because its form lends
itself to a nice interpretation of my 𝛽 values. In these cases, a slope parameter
in our model will be interpreted as “the change in log odds for every one unit
change in the predictor.”
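The logit and inverse logit are simple enough to sketch directly in R (base R has neither built in; `faraway::logit` and `faraway::ilogit` behave the same way):

```r
# The logit link and its inverse, defined directly for illustration.
logit  <- function(p) log(p / (1 - p))   # maps (0,1) to (-Inf, Inf)
ilogit <- function(y) 1 / (1 + exp(-y))  # maps (-Inf, Inf) back to (0,1)

logit(0.5)           # 0: even odds map to y = 0
ilogit(0)            # 0.5
ilogit(logit(0.8))   # round-trips back to 0.8
```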
As in the mixed model case, there is no closed form solution for 𝛽̂ and instead we must rely on numerical methods to find the maximum likelihood estimates. To do this, we must derive the likelihood function.
If

$$w_i \overset{iid}{\sim} \textrm{Bernoulli}(p_i)$$

Then

$$\mathcal{L}(p\,|\,w) = \prod_{i=1}^{n} p_i^{w_i}(1-p_i)^{1-w_i}$$

and therefore, substituting $p_i = \textrm{ilogit}(X_i\beta)$, we have

$$\mathcal{L}(\beta\,|\,w) = \prod_{i=1}^{n} \big(\textrm{ilogit}(X_i\beta)\big)^{w_i}\big(1-\textrm{ilogit}(X_i\beta)\big)^{1-w_i}$$

and

$$\log\mathcal{L}(\beta\,|\,w) = \sum_{i=1}^{n} \log\Big\{\big(\textrm{ilogit}(X_i\beta)\big)^{w_i}\big(1-\textrm{ilogit}(X_i\beta)\big)^{1-w_i}\Big\}$$

Oftentimes it is more numerically stable to maximize the log-likelihood rather than the likelihood itself, because working on the log scale helps prevent machine under-flow when the value of the likelihood is very small. We will ignore that issue here and just assume that the function that performs the maximization is well designed to handle it.

We might have more than one response at a particular level of 𝑋. Let 𝑚𝑖 be the number of observations at a particular value of 𝑋, and 𝑤𝑖 be the number of successes at that value of 𝑋. In that case, 𝑤𝑖 is not a Bernoulli random variable but rather a binomial random variable. Note that the Bernoulli distribution is the special case of the binomial distribution with 𝑚𝑖 = 1.

# beta are the parameters


# w is the response in terms of 0,1 for this mayfly example
# X is the design matrix that we use for our covariates.
logL <- function(beta, w, X){
out <- dbinom( w, size=1, prob= faraway::ilogit(X%*%beta), log=TRUE ) %>% sum()
return(out)
}
NeglogL <- function(beta, w, X){ return( -logL(beta, w, X)) }

optim( par=c(0,0), # initial bad guesses


fn = NeglogL, # we want to minimize the -logL function, maximize logL
w=Mayflies$Occupancy, # parameters to pass to the logL function
X = model.matrix( ~ 1 + CCU, data=Mayflies ) ) # ditto

## $par
## [1] 5.100892 -3.050336
##
## $value

## [1] 6.324365
##
## $counts
## function gradient
## 71 NA
##
## $convergence
## [1] 0
##
## $message
## NULL

In general, we don’t want to have to specify the likelihood function by hand.


Instead we will specify the model using the glm function which accepts a model
formula as well as the distribution family that the data comes from.

head(Mayflies)

## CCU Occupancy
## 1 0.05261076 1
## 2 0.25935617 1
## 3 0.64322010 1
## 4 0.90168941 1
## 5 0.97002630 1
## 6 1.08037011 1

# The following defines a "success" as if the site is


# occupied. For binomial regression, it is best to always specify the response
# as two columns where the first is the number of successes and the second is the
# number of failures for the particular set of covariates.
#
# In our case, each data row denotes a single trial and occupancy is either 0 or 1.
# To get the number of failures, we just have to subtract from the number of trials
# at the given value of CCU, which is just 1.
m1 <- glm( cbind(Occupancy, 1-Occupancy) ~ CCU, data=Mayflies, family=binomial ) #ok!

For binomial response data, we need to know the number of successes and the number of failures at each level of our covariate. In this case it is quite simple because there is only one observation at each CCU level, so the number of successes is Occupancy and the number of failures is just 1-Occupancy. For binomial data, glm expects the response to be a two-column matrix where the first column is the number of successes and the second column is the number of failures. The default choice of link function for binomial data is the logit link, but the probit can be easily chosen as well using family=binomial(link=probit) in the call to glm(). If you only give a single response vector, it is assumed that the second column is to be calculated as 1-first.column.
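That equivalence between the two response specifications can be sketched with made-up 0/1 data (not the mayfly data):

```r
# With 0/1 data, the two-column (successes, failures) response and the bare
# 0/1 vector response give identical fits.
w <- c(0, 1, 0, 1, 1)
x <- 1:5
mA <- glm(cbind(w, 1 - w) ~ x, family = binomial)
mB <- glm(w ~ x, family = binomial)
all.equal(coef(mA), coef(mB))   # TRUE
```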

summary(m1)

##
## Call:
## glm(formula = cbind(Occupancy, 1 - Occupancy) ~ CCU, family = binomial,
## data = Mayflies)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.55741 -0.31594 -0.06553 0.08653 2.13362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.102 2.369 2.154 0.0313 *
## CCU -3.051 1.211 -2.520 0.0117 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.795 on 29 degrees of freedom
## Residual deviance: 12.649 on 28 degrees of freedom
## AIC: 16.649
##
## Number of Fisher Scoring iterations: 7

Notice that the summary table includes an estimate of the standard error of each $\hat\beta_j$, along with a standardized value and z-test calculated in the usual manner,

$$z_j = \frac{\hat\beta_j - 0}{StdErr(\hat\beta_j)}$$

but these only approximately follow a standard normal distribution (due to the CLT results for maximum likelihood estimators). We should regard the p-values given as approximate.
The sigmoid curve shown prior was the result of the logit model, and we can estimate the probability of occupancy for any value of CCU. Surprisingly, R does not have built-in logit and ilogit functions, but the faraway package does include them.

# Here are three ways to calculate the phat value for CCU = 1. The predict()
# function won't give you confidence intervals, however. So I prefer emmeans()
# new.df <- data.frame(CCU=1)
# predict(m1, newdata=new.df) %>% faraway::ilogit() # back transform to p myself

# predict(m1, newdata=new.df, type='response') # ask predict() to do it


emmeans(m1, ~CCU, at=list(CCU=1), type='response')

## CCU prob SE df asymp.LCL asymp.UCL


## 1 0.886 0.129 Inf 0.391 0.99
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale

# Finally a nice graph


# new.df <- data.frame( CCU=seq(0,5, by=.01) )
# yhat.df <- new.df %>% mutate(fit = predict(m1, newdata=new.df, type='response') )
yhat.df <- emmeans(m1, ~CCU, at=list(CCU=seq(0,5,by=.01)), type='response') %>%
as.data.frame()

ggplot(Mayflies, aes(x=CCU)) +
geom_ribbon(data=yhat.df, aes(ymin=asymp.LCL, ymax=asymp.UCL), fill='salmon', alpha=.4) +
geom_line( data=yhat.df, aes(y=prob), color='red') +
geom_point(aes(y=Occupancy)) +
labs(y='Probability of Occupancy', x='Heavy Metal Pollution (CCU)', title='Mayfly Example')

(Figure: "Mayfly Example", probability of occupancy versus heavy metal pollution (CCU), showing the fitted logistic curve in red with a salmon confidence band and the observed 0/1 points.)

Suppose that we were to give a predicted Presence/Absence class based on the 𝑝̂ value. Let's predict presence if the probability is greater than 0.5 and absence if the probability is less than 0.5.

Mayflies <- Mayflies %>%


mutate(phat = predict(m1, type='response')) %>%
mutate(chat = ifelse(phat >0.5, 1, 0))

# This is often called the "confusion matrix"


table(Truth = Mayflies$Occupancy, Prediction = Mayflies$chat)

## Prediction
## Truth 0 1
## 0 21 1
## 1 2 6

This scheme has mis-classified 3 observations, two cases where mayflies were
present but we predicted they would be absent, and one case where no mayflies
were detected but we predicted we would see them.
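The confusion matrix summaries can be computed directly; a small sketch using the counts copied from the table above:

```r
# Counts copied from the confusion matrix: rows are Truth, columns Prediction.
confusion <- matrix(c(21, 2, 1, 6), nrow = 2,
                    dimnames = list(Truth = c('0', '1'), Prediction = c('0', '1')))
accuracy <- sum(diag(confusion)) / sum(confusion)  # 27/30 = 0.9
accuracy
1 - accuracy                                       # misclassification rate: 3/30 = 0.1
```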

11.2 Measures of Fit Quality

11.2.1 Deviance

In the normal linear models case, we were very interested in the Sum of Squared
Error (SSE)
$$SSE = \sum_{i=1}^{n}(w_i - \hat{w}_i)^2$$

because it provided a mechanism for comparing the fit of two different models.
If a model had a very small SSE, then it fit the observed data well. We used
this as a basis for forming our F-test to compare nested models (some re-scaling
by the appropriate degrees of freedom was necessary, though).
We want an equivalent measure of goodness-of-fit for models that are non-
normal, but in the normal case, I would like it to be related to my SSE statistic.
The deviance of a model with respect to some data 𝑤 is defined by

$$D(w, \hat\theta_0) = -2\big[\log L(\hat\theta_0\,|\,w) - \log L(\hat\theta_S\,|\,w)\big]$$

where 𝜃0̂ are the fitted parameters of the model of interest, and 𝜃𝑆̂ are the
fitted parameters under a “saturated” model that has as many parameters as
it has observations and can therefore fit the data perfectly. Thus the deviance
is a measure of deviation from a perfect model and is flexible enough to handle
non-normal distributions appropriately.
Notice that this definition is very similar to what is calculated during the Likelihood Ratio Test. For any two nested models under consideration, the LRT can be formed by looking at the difference of their deviances:

$$LRT = D(w, \hat\theta_{simple}) - D(w, \hat\theta_{complex}) \;\overset{\cdot}{\sim}\; \chi^2_{df_{complex}-df_{simple}}$$
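For 0/1 (Bernoulli) data the saturated model fits every observation perfectly, so its log-likelihood is zero and the deviance reduces to −2 log L of the fitted model. A toy illustration (made-up data, not the mayfly example):

```r
# Toy 0/1 data: the residual deviance equals -2 * logLik because the
# saturated log-likelihood is exactly zero for Bernoulli responses.
w <- c(0, 1, 0, 1, 1)
x <- 1:5
m <- glm(w ~ x, family = binomial)
c(deviance = deviance(m), minus2logLik = -2 * as.numeric(logLik(m)))
```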

m0 <- glm( Occupancy ~ 1, data=Mayflies, family=binomial )


anova(m0, m1)

## Warning in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, :


## models with response '"cbind(Occupancy, 1 - Occupancy)"' removed because
## response differs from model 1

## Analysis of Deviance Table


##
## Model: binomial, link: logit
##
## Response: Occupancy
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 29 34.795

1 - pchisq( 22.146, df=1 )

## [1] 2.526819e-06

A convenient way to get R to calculate the LRT 𝜒2 p-value for you is to specify test='LRT' inside the anova function.

anova(m1, test='LRT')

## Analysis of Deviance Table


##
## Model: binomial, link: logit
##
## Response: cbind(Occupancy, 1 - Occupancy)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 29 34.795
## CCU 1 22.146 28 12.649 2.527e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The inference of this can be confirmed by looking at the AIC values of the two
models as well.

AIC(m0, m1)

## df AIC
## m0 1 36.79491
## m1 2 16.64873

11.2.2 Goodness of Fit

The deviance is a good way to measure if a model fits the data, but it is not the only method. Pearson's $X^2$ statistic is also applicable. This statistic takes the general form

$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of observations observed in category $i$ and $E_i$ is the number expected in category $i$. In our case we need to figure out what the categories are. Since we have both the number of successes and failures, we'll have two categories per observation $i$:

$$X^2 = \sum_{i=1}^{n}\left[\frac{(w_i - n_i\hat{p}_i)^2}{n_i\hat{p}_i} + \frac{\big((n_i - w_i) - n_i(1-\hat{p}_i)\big)^2}{n_i(1-\hat{p}_i)}\right] = \sum_{i=1}^{n}\frac{(w_i - n_i\hat{p}_i)^2}{n_i\hat{p}_i(1-\hat{p}_i)}$$

and the Pearson residual can be defined as

$$r_i = \frac{w_i - n_i\hat{p}_i}{\sqrt{n_i\hat{p}_i(1-\hat{p}_i)}}$$

These can be found in R via the following commands

sum( residuals(m1, type='pearson')^2 )

## [1] 14.92367
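The Pearson residuals can also be computed by hand from the definition; a sketch with toy 0/1 data, so that each 𝑛𝑖 = 1 (not the mayfly data):

```r
# Hand-computed Pearson residuals match residuals(, type = 'pearson') for a
# Bernoulli glm (data made up for the sketch).
w <- c(0, 0, 1, 0, 1, 1)
x <- 1:6
m <- glm(w ~ x, family = binomial)
phat <- predict(m, type = 'response')
r <- (w - phat) / sqrt(phat * (1 - phat))
all.equal(unname(r), unname(residuals(m, type = 'pearson')))   # TRUE
```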

Pearson’s 𝑋 2 statistic is quite similar to the deviance statistic

deviance(m1)

## [1] 12.64873

11.3 Confidence Intervals


Confidence intervals for the regression could be constructed using normal approximations for the parameter estimates. An approximate $100(1-\alpha)\%$ confidence interval for $\beta_i$ would be

$$\hat\beta_i \pm z_{1-\alpha/2}\, StdErr(\hat\beta_i)$$



but we know that this is not a good approximation: the normal approximation will not be good for small sample sizes, and it isn't clear what counts as "big enough". Instead we will use an inverted LRT to develop confidence intervals for the $\beta_i$ parameters.

We first consider the simplest case, where we have only an intercept and slope parameter. Below is a contour plot of the likelihood surface, where the shaded region is the region of the parameter space in which the pair $(\beta_0, \beta_1)$ would not be rejected by the LRT. This region is found by finding the maximum likelihood estimators $\hat\beta_0$ and $\hat\beta_1$, and then finding the set of $(\beta_0, \beta_1)$ pairs such that

$$-2\big[\log L(\beta_0,\beta_1) - \log L(\hat\beta_0,\hat\beta_1)\big] \le \chi^2_{df=1,\,0.95}$$

$$\log L(\beta_0,\beta_1) \ge \left(\frac{-1}{2}\right)\chi^2_{1,\,0.95} + \log L(\hat\beta_0,\hat\beta_1)$$

(Figure: "Log-Likelihood Surface", a contour plot over (β0, β1) with the region of pairs not rejected by the LRT shaded.)

Looking at just the $\beta_0$ axis, this translates into a confidence interval of $(1.63, 11.78)$. This method is commonly referred to as the "profile likelihood" interval because the interval is created by viewing the contour plot from one axis. The physical analogy is viewing a mountain range from afar and asking, "What parts of the mountain are higher than 8000 feet?"
This type of confidence interval is more robust than the normal approximation
and should be used whenever practical. In R, the profile likelihood confidence
interval for glm objects is available in the MASS library.

confint(m1) # using defaults

## Waiting for profiling to be done...

## 2.5 % 97.5 %
## (Intercept) 1.629512 11.781167
## CCU -6.446863 -1.304244

confint(m1, level=.95, parm='CCU') # Just the slope parameter

## Waiting for profiling to be done...

## 2.5 % 97.5 %
## -6.446863 -1.304244

11.4 Interpreting model coefficients


We first consider why we are dealing with odds $\frac{p}{1-p}$ instead of just $p$. They contain the same information, so the choice is somewhat arbitrary; however, we've been using probabilities for so long that it feels unnatural to switch to odds. There are two good reasons for the switch.

The first is that the odds $\frac{p}{1-p}$ can take on any value from $0$ to $\infty$, so part of our translation of $p$ to an unrestricted domain is already done.
The second is that it is easier to compare odds than to compare probabilities.
For example, (as of this writing) I have a three month old baby who is prone to
spitting up her milk.

• I think the probability that she will not spit up on me today is 𝑝1 = 0.10.
My wife disagrees and believes the probability is 𝑝2 = 0.01. We can look
at those probabilities and recognize that we differ in our assessment by
a factor of 10 because 10 = 𝑝1 /𝑝2 . If we had assessed the chance of her
spitting up using odds, I would have calculated 𝑜1 = 0.1/0.9 = 1/9. My
wife, on the other hand, would have calculated 𝑜2 = .01/.99 = 1/99. The
odds ratio of these is [1/9] / [1/99] = 99/9 = 11. This shows that she is
much more certain that the event will not happen and the multiplying
factor of the pair of odds is 11.

• But what if we were to consider the probability that my daughter will spit up? The probabilities assigned by me versus my wife are 𝑝1 = 0.9 and 𝑝2 = 0.99. How should I assess that, since our probabilities no longer differ by a factor of 10 (𝑝1/𝑝2 = 0.91 ≠ 10)? The odds ratio remains the same calculation, however. The odds I would give are 𝑜1 = .9/.1 = 9 versus my wife's odds 𝑜2 = .99/.01 = 99. The odds ratio is now 9/99 = 1/11, which gives the same information as in the case where we defined a success as my daughter not spitting up.
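The arithmetic in the two bullets can be checked directly:

```r
# Probability of NOT spitting up: my assessment vs my wife's.
p1 <- 0.10; p2 <- 0.01
o1 <- p1 / (1 - p1)                 # 1/9
o2 <- p2 / (1 - p2)                 # 1/99
o1 / o2                             # odds ratio: 11

# Redefining "success" as spitting up flips the odds ratio to its reciprocal.
q1 <- 1 - p1; q2 <- 1 - p2
(q1 / (1 - q1)) / (q2 / (1 - q2))   # 1/11
```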

To try to clear up the verbiage we’ll consider a few different cases:



Probability    Odds                        Verbiage
p = .95        95/5  = 19/1 = 19           19 to 1 odds for
p = .75        75/25 = 3/1  = 3             3 to 1 odds for
p = .50        50/50 = 1/1  = 1             1 to 1 odds
p = .25        25/75 = 1/3  = 0.33          3 to 1 odds against
p = .05        5/95  = 1/19 = 0.0526       19 to 1 odds against

Given a logistic regression model with two continuous covariates, then using the logit() link function we have

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

$$\frac{p}{1-p} = e^{\beta_0}\, e^{\beta_1 x_1}\, e^{\beta_2 x_2}$$

and we can interpret $\beta_1$ and $\beta_2$ as the increase in the log odds for every unit increase in $x_1$ and $x_2$. We could alternatively interpret a one unit change in $x_1$ as a multiplicative change of $e^{\beta_1}$ in the odds. That is to say, $e^{\beta_1}$ is the odds ratio of that change.
To investigate how to interpret these effects, we will consider an example of
the rates of respiratory disease of babies in the first year based on covariates of
gender and feeding method (breast milk, formula from a bottle, or a combination
of the two). The data, given as (number diseased) / (number of babies), are

                 Formula (f)    Breast Milk (b)    Breast Milk + Supplement (s)
Males (M)          77/458          47/494                 19/147
Females (F)        48/384          31/464                 16/127

We can fit the saturated model (6 parameters to fit 6 different probabilities) as

data('babyfood', package='faraway')
head(babyfood)

## disease nondisease sex food


## 1 77 381 Boy Bottle
## 2 19 128 Boy Suppl
## 3 47 447 Boy Breast
## 4 48 336 Girl Bottle
## 5 16 111 Girl Suppl
## 6 31 433 Girl Breast

m2 <- glm( cbind(disease,nondisease) ~ sex * food, family=binomial, data=babyfood )


summary(m2)

##
## Call:
## glm(formula = cbind(disease, nondisease) ~ sex * food, family = binomial,
## data = babyfood)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.59899 0.12495 -12.797 < 2e-16 ***
## sexGirl -0.34692 0.19855 -1.747 0.080591 .
## foodBreast -0.65342 0.19780 -3.303 0.000955 ***
## foodSuppl -0.30860 0.27578 -1.119 0.263145
## sexGirl:foodBreast -0.03742 0.31225 -0.120 0.904603
## sexGirl:foodSuppl 0.31757 0.41397 0.767 0.443012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6375e+01 on 5 degrees of freedom
## Residual deviance: 2.6401e-13 on 0 degrees of freedom
## AIC: 43.518
##
## Number of Fisher Scoring iterations: 3

Notice that the residual deviance is effectively zero with zero degrees of freedom
indicating we just fit the saturated model.
It is nice to look at the single term deletions to see if the interaction term could
be dropped from the model.

anova(m2, test='LRT')

## Analysis of Deviance Table


##
## Model: binomial, link: logit
##
## Response: cbind(disease, nondisease)
##

## Terms added sequentially (first to last)


##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 5 26.3753
## sex 1 5.4761 4 20.8992 0.01928 *
## food 2 20.1772 2 0.7219 4.155e-05 ***
## sex:food 2 0.7219 0 0.0000 0.69701
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given this, we will use the reduced model without the interaction and check whether we can reduce the model any more.

m1 <- glm( cbind(disease, nondisease) ~ sex + food, family=binomial, data=babyfood)


drop1(m1, test='Chi') # all single term deletions, using LRT

## Single term deletions


##
## Model:
## cbind(disease, nondisease) ~ sex + food
## Df Deviance AIC LRT Pr(>Chi)
## <none> 0.7219 40.240
## sex 1 5.6990 43.217 4.9771 0.02569 *
## food 2 20.8992 56.417 20.1772 4.155e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this we see that we cannot reduce the model any more and we will interpret
the coefficients of this model.

coef(m1, digits=5) # more accuracy

## (Intercept) sexGirl foodBreast foodSuppl


## -1.6127038 -0.3125528 -0.6692946 -0.1725424

We interpret the intercept term as the log odds that a male child fed only formula will develop a respiratory disease in their first year. With that, we could then calculate the probability of a male formula-fed baby developing respiratory disease using

$$-1.6127 = \log\left(\frac{p_{M,f}}{1-p_{M,f}}\right) = \textrm{logit}(p_{M,f})$$

thus

$$p_{M,f} = \textrm{ilogit}(-1.6127) = \frac{1}{1+e^{1.6127}} = 0.1662$$

We notice that the odds of respiratory disease are

$$\frac{p_{M,f}}{1-p_{M,f}} = \frac{0.1662}{1-0.1662} = 0.1993 = e^{-1.6127}$$

For a female child bottle-fed only formula, the probability of developing respiratory disease is

$$p_{F,f} = \frac{1}{1+e^{-(-1.6127-0.3126)}} = \frac{1}{1+e^{1.9253}} = 0.1273$$

and the associated odds are

$$\frac{p_{F,f}}{1-p_{F,f}} = \frac{0.1273}{1-0.1273} = 0.1458 = e^{-1.6127-0.3126}$$

so we can interpret $e^{-0.3126} = 0.7315$ as the multiplicative change in odds from male to female infants. That is to say, it is the odds ratio of the female infants to the males:

$$e^{-0.3126} = \frac{\left(\dfrac{p_{F,f}}{1-p_{F,f}}\right)}{\left(\dfrac{p_{M,f}}{1-p_{M,f}}\right)} = \frac{0.1458}{0.1993} = 0.7315$$
𝑀,𝑓

The interpretation here is that odds of respiratory infection for females is 73.1%
than that of a similarly feed male child and I might say that being female reduces
the odds of respiratory illness by 27% compared to male babies. Similarly we
can calculate the change in odds ratio for the feeding types:

exp( coef(m1) )

## (Intercept) sexGirl foodBreast foodSuppl


## 0.1993479 0.7315770 0.5120696 0.8415226

First we notice that the intercept term can be interpreted as the odds of infection for the reference group, and each of the offset terms is the odds ratio compared to the reference group. We see that breast milk along with formula carries only 84% of the odds of respiratory disease of a formula-only baby, and a breast-milk-fed child has only 51% of the odds of respiratory disease of the formula-fed baby. We can look at confidence intervals for the odds ratios as follows:

exp( confint(m1) )

## Waiting for profiling to be done...

## 2.5 % 97.5 %
## (Intercept) 0.1591988 0.2474333
## sexGirl 0.5536209 0.9629225
## foodBreast 0.3781905 0.6895181
## foodSuppl 0.5555372 1.2464312

We should be careful in drawing conclusions here because this was a retrospective
study: the decision to breast feed a baby versus feeding with formula is
inextricably tied to socio-economic status, and we should investigate whether the
effect measured is due to feeding method or some other lurking variable tied to
socio-economic status.
As usual, we don't want to calculate all these quantities by hand and would
prefer that emmeans do all the back-transformations for us.

emmeans(m1, ~ sex+food, type='response') # probabilities for each sex/treatment

## sex food prob SE df asymp.LCL asymp.UCL
## Boy Bottle 0.1662 0.0156 Inf 0.1379 0.1990
## Girl Bottle 0.1273 0.0143 Inf 0.1018 0.1580
## Boy Breast 0.0926 0.0111 Inf 0.0730 0.1168
## Girl Breast 0.0695 0.0093 Inf 0.0533 0.0901
## Boy Suppl 0.1437 0.0234 Inf 0.1036 0.1958
## Girl Suppl 0.1093 0.0194 Inf 0.0766 0.1536
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale

emmeans(m1, pairwise~ sex, type='response') # compare Boy / Girl

## $emmeans
## sex prob SE df asymp.LCL asymp.UCL
## Boy 0.1309 0.0111 Inf 0.1105 0.154
## Girl 0.0992 0.0103 Inf 0.0808 0.121
##
## Results are averaged over the levels of: food
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##

## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Boy / Girl 1.37 0.193 Inf 2.216 0.0267
##
## Results are averaged over the levels of: food
## Tests are performed on the log odds ratio scale

emmeans(m1, revpairwise~ sex, type='response') # compare Girl / Boy

## $emmeans
## sex prob SE df asymp.LCL asymp.UCL
## Boy 0.1309 0.0111 Inf 0.1105 0.154
## Girl 0.0992 0.0103 Inf 0.0808 0.121
##
## Results are averaged over the levels of: food
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Girl / Boy 0.732 0.103 Inf -2.216 0.0267
##
## Results are averaged over the levels of: food
## Tests are performed on the log odds ratio scale

emmeans(m1, pairwise ~ food, type='response') # compare Foods

## $emmeans
## food prob SE df asymp.LCL asymp.UCL
## Bottle 0.1457 0.01220 Inf 0.1233 0.1712
## Breast 0.0803 0.00877 Inf 0.0647 0.0993
## Suppl 0.1255 0.01994 Inf 0.0913 0.1700
##
## Results are averaged over the levels of: sex
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Bottle / Breast 1.953 0.299 Inf 4.374 <.0001
## Bottle / Suppl 1.188 0.244 Inf 0.839 0.6786
## Breast / Suppl 0.609 0.132 Inf -2.296 0.0564
##
## Results are averaged over the levels of: sex
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log odds ratio scale

emmeans(m1, revpairwise ~ food, type='response') # compare Foods

## $emmeans
## food prob SE df asymp.LCL asymp.UCL
## Bottle 0.1457 0.01220 Inf 0.1233 0.1712
## Breast 0.0803 0.00877 Inf 0.0647 0.0993
## Suppl 0.1255 0.01994 Inf 0.0913 0.1700
##
## Results are averaged over the levels of: sex
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df z.ratio p.value
## Breast / Bottle 0.512 0.0783 Inf -4.374 <.0001
## Suppl / Bottle 0.842 0.1730 Inf -0.839 0.6786
## Suppl / Breast 1.643 0.3556 Inf 2.296 0.0564
##
## Results are averaged over the levels of: sex
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log odds ratio scale

11.5 Prediction and Effective Dose Levels


To demonstrate the ideas in this section, we’ll use a toxicology study that exam-
ined insect mortality as a function of increasing concentrations of an insecticide.

data('bliss', package='faraway')

We first fit the logistic regression model and plot the results

m1 <- glm( cbind(alive, dead) ~ conc, family=binomial, data=bliss)

Given this, we want to develop a confidence interval for the probabilities by first
calculating using the following formula. As usual, we recall that the 𝑦 values
live in (−∞, ∞).
𝐶𝐼𝑦 ∶ 𝑦 ̂ ± 𝑧 1−𝛼/2 𝑆𝑡𝑑𝐸𝑟𝑟 (𝑦)̂
We must then convert this to the [0, 1] space using the ilogit() function.
𝐶𝐼𝑝 = ilogit (𝐶𝐼𝑦 )

probs <- data.frame(conc=seq(0,4,by=.1))


yhat <- predict(m1, newdata=probs, se.fit=TRUE) # list with two elements fit and se.fit
yhat <- data.frame( fit=yhat$fit, se.fit = yhat$se.fit)
probs <- cbind(probs, yhat)
head(probs)

## conc fit se.fit
## 1 0.0 2.323790 0.4178878
## 2 0.1 2.207600 0.4022371
## 3 0.2 2.091411 0.3868040
## 4 0.3 1.975221 0.3716158
## 5 0.4 1.859032 0.3567036
## 6 0.5 1.742842 0.3421035

probs <- probs %>% mutate(
phat = faraway::ilogit(fit),
lwr = faraway::ilogit( fit - 1.96 * se.fit ),
upr = faraway::ilogit( fit + 1.96 * se.fit ))
ggplot(bliss, aes(x=conc)) +
geom_point(aes(y=alive/(alive+dead))) +
geom_line(data=probs, aes(y=phat), color='red') +
geom_ribbon(data=probs, aes(ymin=lwr, ymax=upr), fill='red', alpha=.3) +
ggtitle('Bliss Insecticide Data') +
xlab('Concentration') + ylab('Proportion Alive')

[Figure: Bliss Insecticide Data — observed proportion alive versus concentration,
with the fitted probability curve (red line) and 95% confidence ribbon.]

Alternatively, we might want to do this calculation via emmeans.

emmeans::emmeans(m1, ~conc, at=list(conc=c(0:4)), type='response')

## conc prob SE df asymp.LCL asymp.UCL
## 0 0.9108 0.0339 Inf 0.8183 0.959
## 1 0.7617 0.0500 Inf 0.6507 0.846
## 2 0.5000 0.0518 Inf 0.3998 0.600
## 3 0.2383 0.0500 Inf 0.1542 0.349
## 4 0.0892 0.0339 Inf 0.0414 0.182
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale

The next thing we want to do is construct a confidence interval for the
concentration level that results in the death of $100p\%$ of the insects. Often
we are interested in the case of $p = 0.5$. This is often called LD50, which is
the lethal dose for 50% of the population. Using the link function, you can set
the $p$ value and solve for the concentration value to find

$$\hat{x}_p = \frac{\textrm{logit}\left(p\right) - \hat{\beta}_0}{\hat{\beta}_1}$$

which gives us a point estimate of LD(p). To get a confidence interval we need
to find the standard error of $\hat{x}_p$. Since this is a non-linear function of
$\hat{\beta}_0$ and $\hat{\beta}_1$, which are correlated, we must be careful in the
calculation. The actual calculation is done using the Delta Method approximation:
$$Var\left(g(\hat{\theta})\right) \approx g'(\hat{\theta})^T \, Var\left(\hat{\theta}\right)\, g'(\hat{\theta})$$
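A by-hand sketch of both calculations follows. The `beta` vector holds the (rounded) coefficients implied by the bliss fit above; the covariance matrix `V` used to exercise the standard-error function is a hypothetical placeholder, since in practice you would pass `vcov(m1)`.

```r
# Point estimate of LD(p): solve logit(p) = b0 + b1 * x for x
ld_hat <- function(p, beta){
  ( log(p/(1-p)) - beta[1] ) / beta[2]
}

# Delta-method standard error: gradient of ld_hat with respect to (b0, b1)
ld_se <- function(p, beta, V){
  grad <- c( -1 / beta[2],
             -( log(p/(1-p)) - beta[1] ) / beta[2]^2 )
  sqrt( as.numeric( t(grad) %*% V %*% grad ) )
}

beta <- c(2.3238, -1.1619)   # rounded coefficients from the fit above
ld_hat(0.5, beta)            # 2.0, the LD50 that dose.p() reports below

V <- diag( c(0.04, 0.09) )   # hypothetical covariance matrix
ld_se(0.5, c(2, -1), V)
```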

Fortunately we don’t have to do these calculations by hand and can use the
dose.p() function in the MASS package.

LD <- MASS::dose.p(m1, p=c(.25, .5, .75))
LD

## Dose SE
## p = 0.25: 2.945535 0.2315932
## p = 0.50: 2.000000 0.1784367
## p = 0.75: 1.054465 0.2315932

and we can use these to create approximate confidence intervals for these $\hat{x}_p$
values via
$$\hat{x}_p \pm z_{1-\alpha/2}\, StdErr\left(\hat{x}_p\right)$$

# Why did the MASS authors make LD a vector of the
# estimated values and have an additional attribute
# that contains the standard errors? Whatever, let's
# turn this into a conventional data.frame.
str(LD)
str(LD)

## 'glm.dose' Named num [1:3] 2.95 2 1.05
## - attr(*, "names")= chr [1:3] "p = 0.25:" "p = 0.50:" "p = 0.75:"
## - attr(*, "SE")= num [1:3, 1] 0.232 0.178 0.232
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "p = 0.25:" "p = 0.50:" "p = 0.75:"
## .. ..$ : NULL
## - attr(*, "p")= num [1:3] 0.25 0.5 0.75

CI <- data.frame(p = attr(LD,'p'),
                 Dose = as.vector(LD),
                 SE = attr(LD,'SE')) %>% # bundle the estimates and SEs together
mutate( lwr = Dose - qnorm(.975)*SE,
upr = Dose + qnorm(.975)*SE )
CI

## p Dose SE lwr upr
## 1 0.25 2.945535 0.2315932 2.4916207 3.399449
## 2 0.50 2.000000 0.1784367 1.6502705 2.349730
## 3 0.75 1.054465 0.2315932 0.6005508 1.508379

11.6 Overdispersion
In the binomial distribution, the variance is a function of the probability of
success and is
𝑉 𝑎𝑟 (𝑊 ) = 𝑛𝑝 (1 − 𝑝)
but there are many cases where we might be interested in adding an additional
dispersion parameter to the model. A common reason for overdispersion to
appear is that we might not have captured all the covariates that influence $p$.
We can do a quick simulation to demonstrate that additional variability in $p$
leads to additional variability overall.

N <- 1000
n <- 10
p <- .6
sim.data <- NULL
for( i in 1:N ){
  sim.data <- sim.data %>% rbind(data.frame(
    var = var( rbinom(N, size=n, prob=p) ),
    type = 'Standard'))
  overdispersed_p <- p + rnorm(N, mean=0, sd=.05)  # a different p for each draw
  sim.data <- sim.data %>% rbind(data.frame(
    var = var( rbinom(N, size=n, prob=overdispersed_p) ),
    type = 'OverDispersed'))
}
true.var <- p*(1-p)*n
ggplot(sim.data, aes(x=var, y=..density..)) +
geom_histogram(bins=30) +
geom_vline(xintercept = true.var, color='red') +
facet_grid(type~.) +
ggtitle('Histogram of Sample Variances')

[Figure: Histogram of Sample Variances — histograms of the simulated sample
variances for the OverDispersed and Standard cases, with the true variance
marked by a vertical red line.]

We see that the sample variances fall neatly about the true variance of 2.4 when
the data are generated with a constant value of $p$, but adding a small amount of
random noise to the parameter $p$ produces noticeably more variance in the samples.
The extra uncertainty in the probability of success results in extra variability
in the responses.
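The inflation can also be derived exactly with the law of total variance; the sketch below uses the same n = 10, p = 0.6, and SD of 0.05 as the simulation.

```r
# Var(W) = E[ Var(W|p) ] + Var( E[W|p] )
#        = n*( p*(1-p) - sd_p^2 ) + n^2 * sd_p^2   when p is random with sd sd_p
n <- 10; p <- 0.6; sd_p <- 0.05
v_const  <- n * p * (1 - p)                          # 2.4  : fixed p
v_random <- n * (p*(1-p) - sd_p^2) + n^2 * sd_p^2    # 2.625: random p
c(v_const, v_random)
```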
We can recognize when overdispersion is present by examining the deviance of
our model, because the deviance is approximately distributed
$$D\left(y, \hat{\theta}\right) \sim \chi^2_{df}$$
where $df$ is the residual degrees of freedom in the model. Because the $\chi^2_k$
distribution is the sum of $k$ independent, squared standard normal random
variables, it has expectation $k$ and variance $2k$. For binomial data with
reasonable group sizes (say, larger than 5), this approximation isn't too bad and
we can detect overdispersion. For binary responses, the approximation is quite
poor and we cannot detect overdispersion.
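As a quick sketch of the rule of thumb, using the 12 residual degrees of freedom that appear in the example below:

```r
df <- 12
qchisq(0.95, df)     # about 21: deviances far beyond this suggest overdispersion
1 - pchisq(30, df)   # p-value for a hypothetical observed deviance of 30
```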
The simplest approach for modeling overdispersion is to introduce an additional
dispersion parameter $\sigma^2$. This dispersion parameter may be estimated from
the Pearson statistic $X^2$ using
$$\hat{\sigma}^2 = \frac{X^2}{n-p}.$$
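Here $X^2$ is the sum of the squared Pearson residuals. A quick sketch on simulated data (not the trout egg data below) confirms that this estimator is exactly the dispersion that a quasibinomial fit reports:

```r
set.seed(42)
d <- data.frame(x = rep(1:4, each = 5))
d$y <- rbinom(20, size = 10, prob = plogis(0.5 * d$x - 1))

m  <- glm(cbind(y, 10 - y) ~ x, family = binomial,      data = d)
mq <- glm(cbind(y, 10 - y) ~ x, family = quasibinomial, data = d)

sigma2_hat <- sum( residuals(m, type = 'pearson')^2 ) / df.residual(m)
c(sigma2_hat, summary(mq)$dispersion)   # the two estimates agree
```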

With the addition of the overdispersion parameter to the model, the difference
between a simple and a complex model is no longer distributed $\chi^2$ and we must
use the following approximate F-statistic:
$$F = \frac{\left(D_{simple} - D_{complex}\right) / \left(df_{simple} - df_{complex}\right)}{\hat{\sigma}^2}$$

Using the F-test when the overdispersion parameter is 1 is a less powerful test
than the $\chi^2$ test, so we'll only use the F-test when the overdispersion
parameter must be estimated.
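The F-statistic itself is simple arithmetic; here is a sketch with made-up deviances and degrees of freedom:

```r
# Approximate F statistic for nested models with an estimated dispersion
F_stat <- function(D_simple, D_complex, df_simple, df_complex, sigma2){
  ( (D_simple - D_complex) / (df_simple - df_complex) ) / sigma2
}
F_stat(100, 64, 16, 12, 5.33)   # (36/4) / 5.33, about 1.69
```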
Example: We consider an experiment where at five different stream locations,
four boxes of trout eggs were buried and retrieved at four different times after
the original placement. The number of surviving eggs was recorded and the
eggs disposed of.

data(troutegg, package='faraway')
troutegg <- troutegg %>%
mutate( perish = total - survive) %>%
dplyr::select(location, period, survive, perish, total) %>%
arrange(location, period)

troutegg %>% arrange(location, period)

## location period survive perish total
## 1 1 4 89 5 94
## 2 1 7 94 4 98
## 3 1 8 77 9 86
## 4 1 11 141 14 155
## 5 2 4 106 2 108
## 6 2 7 91 15 106
## 7 2 8 87 9 96
## 8 2 11 104 18 122
## 9 3 4 119 4 123
## 10 3 7 100 30 130
## 11 3 8 88 31 119
## 12 3 11 91 34 125
## 13 4 4 104 0 104
## 14 4 7 80 17 97
## 15 4 8 67 32 99
## 16 4 11 111 21 132
## 17 5 4 49 44 93
## 18 5 7 11 102 113
## 19 5 8 18 70 88
## 20 5 11 0 138 138

We can first visualize the data

ggplot(troutegg, aes(x=period, y=survive/total, color=location)) +
  geom_point(aes(size=total)) +
  geom_line(aes(x=as.integer(period)))

[Figure: survival proportion (survive/total) by period for each location, with
point size indicating the total number of eggs per box.]

We can fit the additive logistic regression model (noting that the model with the
interaction of location and period would be saturated):

m <- glm(cbind(survive,perish) ~ location + period, family=binomial, data=troutegg)
summary(m)

##
## Call:
## glm(formula = cbind(survive, perish) ~ location + period, family = binomial,
##     data = troutegg)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.8305  -0.3650  -0.0303   0.6191   3.2434
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2287  -4.157 3.22e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.465  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
##
## Number of Fisher Scoring iterations: 5

The residual deviance seems quite large: with 12 residual degrees of freedom, the
deviance should be near 12 if the binomial model fit well. We can confirm that a
deviance of 64.5 is far larger than expected via:

1 - pchisq( 64.5, df=12 )

## [1] 3.372415e-09

We therefore estimate the overdispersion parameter

sigma2 <- sum( residuals(m, type='pearson') ^2 ) / 12
sigma2

## [1] 5.330358

and note that this is quite a bit larger than 1, which is what it should be in the
non-overdispersed setting. Using this we can now test the significance of the
effects of location and period.

drop1(m, scale=sigma2, test='F')


## Warning in drop1.glm(m, scale = sigma2, test = "F"): F test assumes
## 'quasibinomial' family

## Single term deletions
##
## Model:
## cbind(survive, perish) ~ location + period
##
## scale: 5.330358
##
##          Df Deviance    AIC F value    Pr(>F)
## <none>       64.495 157.03
## location  4 913.563 308.32  39.494 8.142e-07 ***
## period    3 228.566 181.81  10.176  0.001288 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and conclude that both location and period are significant predictors of trout
egg survivorship.
We could have avoided having to calculate 𝜎̂ 2 by hand by simply using the
quasibinomial family instead of the binomial.

m2 <- glm(cbind(survive,perish) ~ location + period,
          family=quasibinomial, data=troutegg)
summary(m2)

##
## Call:
## glm(formula = cbind(survive, perish) ~ location + period, family = quasibinomial,
## data = troutegg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8305 -0.3650 -0.0303 0.6191 3.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6358 0.6495 7.138 1.18e-05 ***
## location2 -0.4168 0.5682 -0.734 0.477315
## location3 -1.2421 0.5066 -2.452 0.030501 *
## location4 -0.9509 0.5281 -1.800 0.096970 .
## location5 -4.6138 0.5777 -7.987 3.82e-06 ***
## period7 -2.1702 0.5504 -3.943 0.001953 **
## period8 -2.3256 0.5609 -4.146 0.001356 **
## period11 -2.4500 0.5405 -4.533 0.000686 ***

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 5.330358)
##
## Null deviance: 1021.469 on 19 degrees of freedom
## Residual deviance: 64.495 on 12 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5

drop1(m2, test='F')

## Single term deletions
##
## Model:
## cbind(survive, perish) ~ location + period
## Df Deviance F value Pr(>F)
## <none> 64.50
## location 4 913.56 39.494 8.142e-07 ***
## period 3 228.57 10.176 0.001288 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# anova(m2, test='F')

While each of the time periods is different from the first, it looks like periods
7, 8, and 11 aren't different from each other. As usual, we turn to the emmeans
package to look at the pairwise differences between the periods.

emmeans(m2, pairwise~period, type='response')

## $emmeans
## period prob SE df asymp.LCL asymp.UCL
## 4 0.960 0.0177 Inf 0.907 0.984
## 7 0.735 0.0567 Inf 0.611 0.831
## 8 0.704 0.0618 Inf 0.571 0.809
## 11 0.677 0.0549 Inf 0.562 0.774
##
## Results are averaged over the levels of: location
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts

## contrast odds.ratio SE df z.ratio p.value
## 4 / 7 8.76 4.821 Inf 3.943 0.0005
## 4 / 8 10.23 5.740 Inf 4.146 0.0002
## 4 / 11 11.59 6.263 Inf 4.533 <.0001
## 7 / 8 1.17 0.475 Inf 0.383 0.9810
## 7 / 11 1.32 0.501 Inf 0.739 0.8814
## 8 / 11 1.13 0.431 Inf 0.327 0.9880
##
## Results are averaged over the levels of: location
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log odds ratio scale

emmeans(m2, ~period, type='response') %>% multcomp::cld(Letters=letters)

## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.677 0.0549 Inf 0.562 0.774 a
## 8 0.704 0.0618 Inf 0.571 0.809 a
## 7 0.735 0.0567 Inf 0.611 0.831 a
## 4 0.960 0.0177 Inf 0.907 0.984 b
##
## Results are averaged over the levels of: location
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log odds ratio scale
## significance level used: alpha = 0.05

Looking at this experiment, it appears there is an interaction between location
and period. If we fit a model with the interaction, it is the saturated model
(20 coefficients for 20 observations):

m3 <- glm( cbind(survive, perish) ~ period * location, family=binomial, data=troutegg)
summary(m3)

##
## Call:
## glm(formula = cbind(survive, perish) ~ period * location, family = binomial,
## data = troutegg)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)

## (Intercept)          2.8792     0.4596   6.265 3.74e-10 ***
## period7 0.2778 0.6869 0.404 0.68591
## period8 -0.7326 0.5791 -1.265 0.20582
## period11 -0.5695 0.5383 -1.058 0.29007
## location2 1.0911 0.8489 1.285 0.19870
## location3 0.5136 0.6853 0.749 0.45356
## location4 24.4707 51676.2184 0.000 0.99962
## location5 -2.7716 0.5044 -5.495 3.90e-08 ***
## period7:location2 -2.4453 1.0291 -2.376 0.01749 *
## period8:location2 -0.9690 0.9836 -0.985 0.32453
## period11:location2 -1.6468 0.9297 -1.771 0.07651 .
## period7:location3 -2.4667 0.8796 -2.804 0.00504 **
## period8:location3 -1.6169 0.7983 -2.025 0.04284 *
## period11:location3 -1.8388 0.7672 -2.397 0.01654 *
## period7:location4 -26.0789 51676.2184 -0.001 0.99960
## period8:location4 -25.8783 51676.2184 -0.001 0.99960
## period11:location4 -25.1154 51676.2184 0.000 0.99961
## period7:location5 -2.6125 0.7847 -3.329 0.00087 ***
## period8:location5 -0.7331 0.6696 -1.095 0.27354
## period11:location5 -27.1679 51597.7368 -0.001 0.99958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.0215e+03 on 19 degrees of freedom
## Residual deviance: 5.5196e-10 on 0 degrees of freedom
## AIC: 116.53
##
## Number of Fisher Scoring iterations: 22

Notice that we’ve fit the saturated model and we’ve resolved the overdispersion
problem because of that.
As usual, we can now look at the effect of period at each of the locations.

emmeans(m3, ~ period | location, type='response') %>% multcomp::cld()

## location = 1:
## period prob SE df asymp.LCL asymp.UCL .group
## 8 0.8953488 0.03300798 Inf 0.8109406 0.9446443 1
## 11 0.9096774 0.02302375 Inf 0.8532710 0.9457777 1
## 4 0.9468085 0.02314665 Inf 0.8785095 0.9776867 1
## 7 0.9591837 0.01998733 Inf 0.8962639 0.9845962 1
##

## location = 2:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.8524590 0.03210799 Inf 0.7779341 0.9050269 1
## 7 0.8584906 0.03385381 Inf 0.7784456 0.9128537 1
## 8 0.9062500 0.02974911 Inf 0.8295443 0.9504977 12
## 4 0.9814815 0.01297276 Inf 0.9289964 0.9953638 2
##
## location = 3:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.7280000 0.03980111 Inf 0.6434907 0.7987421 1
## 8 0.7394958 0.04023479 Inf 0.6533948 0.8104142 1
## 7 0.7692308 0.03695265 Inf 0.6891126 0.8336850 1
## 4 0.9674797 0.01599358 Inf 0.9165610 0.9877408 2
##
## location = 4:
## period prob SE df asymp.LCL asymp.UCL .group
## 8 0.6767677 0.04700674 Inf 0.5787856 0.7613551 1
## 7 0.8247423 0.03860215 Inf 0.7360186 0.8881766 12
## 11 0.8409091 0.03183558 Inf 0.7682755 0.8939194 2
## 4 1.0000000 0.00000007 Inf 0.0000000 1.0000000 12
##
## location = 5:
## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.0000000 0.00000005 Inf 0.0000000 1.0000000 12
## 7 0.0973451 0.02788552 Inf 0.0547290 0.1672733 1
## 8 0.2045455 0.04299929 Inf 0.1328383 0.3015023 1
## 4 0.5268817 0.05177260 Inf 0.4256954 0.6259069 2
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log odds ratio scale
## significance level used: alpha = 0.05

Notice that there isn’t a huge difference in period for locations 1-4, but in
location 5 things are very different.
We might consider that location really ought to be a random effect. Fortunately
lme4 supports the family option, although it will not accept quasi families, you
either fit a random effect or fit a quasibinomial. In either case, fitting the
full interaction model with period*location doesn’t work because we have a
saturated model

# additive model
m3 <- glmer(cbind(survive,perish) ~ period + (1|location),
family=binomial, data=troutegg)

summary(m3)

## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(survive, perish) ~ period + (1 | location)
## Data: troutegg
##
## AIC BIC logLik deviance df.resid
## 180.3 185.3 -85.2 170.3 15
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.2274 -0.3592 0.0027 0.6531 3.5841
##
## Random effects:
## Groups Name Variance Std.Dev.
## location (Intercept) 2.682 1.638
## Number of obs: 20, groups: location, 5
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1799 0.7594 4.187 2.82e-05 ***
## period7 -2.1545 0.2368 -9.097 < 2e-16 ***
## period8 -2.3085 0.2414 -9.564 < 2e-16 ***
## period11 -2.4324 0.2325 -10.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) perid7 perid8
## period7 -0.224
## period8 -0.225 0.732
## period11 -0.234 0.758 0.761

emmeans(m3, ~period, type='response') %>% multcomp::cld(Letters=letters)

## period prob SE df asymp.LCL asymp.UCL .group
## 11 0.679 0.1615 Inf 0.331 0.900 a
## 8 0.705 0.1546 Inf 0.358 0.911 a
## 7 0.736 0.1444 Inf 0.394 0.923 a
## 4 0.960 0.0291 Inf 0.844 0.991 b
##
## Confidence level used: 0.95

## Intervals are back-transformed from the logit scale
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log odds ratio scale
## significance level used: alpha = 0.05

# Interaction model
m4 <- glmer(cbind(survive,perish) ~ period + (1 + period|location),
family=binomial, data=troutegg)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0184817 (tol = 0.002, component 1)

#summary(m4)
#emmeans(m4, ~period, type='response') %>% multcomp::cld(Letters=letters)

Unfortunately, we aren’t able to fit the saturated model using random effects
because the numerical optimization function for finding MLE estimates failed
to converge.

11.7 ROC Curves


The dataset faraway::wbca comes from a study of breast cancer in Wisconsin.
There are 681 cases of potentially cancerous tumors, of which 238 are actually
malignant (i.e. cancerous). Determining whether a tumor is really malignant has
traditionally required an invasive surgical procedure. The purpose of this
study was to determine whether a new procedure called 'fine needle aspiration',
which draws only a small sample of tissue, could be effective in determining
tumor status.

data('wbca', package='faraway')

# clean up the data
wbca <- wbca %>%
mutate(Class = ifelse(Class==0, 'malignant', 'benign')) %>%
dplyr::select(Class, BNucl, UShap, USize)

# Fit the model where Malignant is considered a success
# model <- glm( I(Class=='malignant') ~ ., data=wbca, family='binomial' ) # emmeans hates this version

model <- wbca %>% mutate( Class = (Class == 'malignant') ) %>% # Clear what is success
  glm( Class ~ ., data=., family='binomial' )                  # and emmeans still happy

# Get the response values
# type='response' gives phat values which live in [0,1]
# type='link' gives the Xbeta values which live in (-infinity, infinity)
wbca <- wbca %>%
mutate(phat = predict(model, type='response'),
yhat = ifelse(phat > .5, 'malignant', 'benign'))

# Calculate the confusion matrix
table( Truth=wbca$Class, Predicted=wbca$yhat )

## Predicted
## Truth benign malignant
## benign 432 11
## malignant 15 223

As usual we can calculate the summary tables…

summary(model)

##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8890 -0.1409 -0.1409 0.0287 2.2284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.35433 0.54076 -11.751 < 2e-16 ***
## BNucl 0.55297 0.08041 6.877 6.13e-12 ***
## UShap 0.62583 0.17506 3.575 0.000350 ***
## USize 0.56793 0.15910 3.570 0.000358 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.39 on 680 degrees of freedom
## Residual deviance: 148.43 on 677 degrees of freedom
## AIC: 156.43
##
## Number of Fisher Scoring iterations: 7

From this table, we see that for a breast tumor, larger values of BNucl,
UShap, and USize imply a greater probability of it being malignant. So for a
tumor with

## BNucl UShap USize
## 1 2 1 2

We would calculate
$$X\hat{\beta} = \hat{\beta}_0 + 2\,\hat{\beta}_1 + 1\,\hat{\beta}_2 + 2\,\hat{\beta}_3
= -6.35 + 2(0.553) + 1(0.626) + 2(0.568) = -3.482$$
and therefore
$$\hat{p} = \frac{1}{1+e^{-X\hat{\beta}}} = \frac{1}{1+e^{3.482}} = 0.0297$$
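We can check this arithmetic directly, plugging in the coefficient estimates from the summary table above:

```r
b    <- c(-6.35433, 0.55297, 0.62583, 0.56793)  # coef(model) from the summary
eta  <- b[1] + 2*b[2] + 1*b[3] + 2*b[4]         # X beta-hat, about -3.487
phat <- 1 / (1 + exp(-eta))                     # about 0.0297
c(eta = eta, phat = phat)
```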

newdata = data.frame( BNucl=2, UShap=1, USize=2 )
predict(model, newdata=newdata)

## 1
## -3.486719

predict(model, newdata=newdata, type='response')

## 1
## 0.0296925

So for a tumor with these covariates, we would classify it as most likely benign.
In this medical scenario, where we have to decide whether to classify a tumor as
malignant or benign, we shouldn't treat the two misclassification errors as being
the same. If we incorrectly identify a tumor as malignant when it is not, the
patient will undergo a somewhat invasive surgery to remove the tumor. However,
if we incorrectly identify a tumor as benign, then the cancerous tumor will
likely grow and eventually kill the patient. While the first error is
regrettable, the second is far worse.
Given that reasoning, perhaps we shouldn't use the rule: if $\hat{p} \ge 0.5$,
classify as malignant. Instead perhaps we should use $\hat{p} \ge 0.3$ or
$\hat{p} \ge 0.05$.
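A small sketch with toy values (hypothetical phat values, not the wbca fits) shows how the two error types trade off as the cutoff moves:

```r
phat  <- c(0.02, 0.10, 0.45, 0.80, 0.97)
truth <- c('benign', 'benign', 'malignant', 'malignant', 'malignant')

errors <- function(cutoff){
  pred <- ifelse(phat > cutoff, 'malignant', 'benign')
  c(false_pos = sum(pred == 'malignant' & truth == 'benign'),
    false_neg = sum(pred == 'benign'    & truth == 'malignant'))
}
errors(0.5)    # the 0.45 tumor is called benign: one false negative
errors(0.05)   # everything suspicious is flagged: one false positive instead
```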
Whatever decision rule we make, we should consider how many of each type of
error we make. Consider the following confusion matrix:

             Predict Negative   Predict Positive   Total
Negative     True Neg (TN)      False Pos (FP)     N
Positive     False Neg (FN)     True Pos (TP)      P
Total        N*                 P*

where $P$ is the number of positive cases, $N$ is the number of negative cases,
$P^*$ is the number of observations predicted to be positive, and $N^*$ is the
number predicted to be negative.

Quantity              Formula   Synonyms
False Positive Rate   FP/N      Type I Error; 1 - Specificity
True Positive Rate    TP/P      Power; Sensitivity; Recall
Pos. Pred. Value      TP/P*     Precision

We can think of the True Positive Rate as the probability that a Positive case
will be correctly classified as a positive. Similarly a False Positive Rate is the
probability that a Negative case will be incorrectly classified as a positive.
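With the confusion matrix computed earlier (benign = negative, malignant = positive), these rates are one-liners:

```r
TN <- 432; FP <- 11; FN <- 15; TP <- 223   # counts from the table above
FPR <- FP / (FP + TN)   # false positive rate, about 0.025
TPR <- TP / (TP + FN)   # true positive rate (sensitivity), about 0.937
PPV <- TP / (TP + FP)   # positive predictive value (precision), about 0.953
c(FPR = FPR, TPR = TPR, PPV = PPV)
```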

I wish to examine the relationship between the False Positive Rate and the True
Positive Rate for any decision rule. So we could select a sequence of decision
rules, calculate the (FPR, TPR) pair for each, and then make a plot where we
connect the dots between the (FPR, TPR) pairs.

Of course we don’t want to have to do this by hand, so we’ll use the package
pROC to do it for us.

# Calculate the ROC information using pROC::roc()
myROC <- roc( wbca$Class, wbca$phat )

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

# make a nice plot using ggplot2 and pROC::ggroc()
ggroc( myROC )

[Figure: ROC curve for the model — sensitivity versus specificity, with
specificity decreasing from 1 to 0 along the x-axis.]

This looks pretty good; for an ideal classifier that makes perfect predictions,
the curve would be a perfect right angle at the bend.
Let's zoom in a little on the high specificity values (i.e. low false positive rates).

ggroc( myROC ) + xlim(1, .9)

[Figure: the same ROC curve zoomed in to specificities between 1.000 and 0.900.]

We see that if we want to correctly identify about 99% of malignant tumors, we
will have a false positive rate of about 1 - 0.95 = 0.05. So about 5% of benign
tumors would be incorrectly classified as malignant.
It is a little challenging to read the graph to see what the Sensitivity is for a
particular value of Specificity. To do this we’ll use another function because
the authors would prefer you to also estimate the confidence intervals for the
quantity. This is a case where bootstrap confidence intervals are quite effective.

ci(myROC, of='sp', sensitivities=.99)

## 95% CI (2000 stratified bootstrap replicates):
## se sp.low sp.median sp.high
## 0.99 0.9187 0.9571 0.9797

ci(myROC, of='se', specificities=.975)

## 95% CI (2000 stratified bootstrap replicates):
## sp se.low se.median se.high
## 0.975 0.8827 0.937 0.9958

One measure of how far we are from the perfect predictor is the area under the
curve. The perfect model would have an area under the curve of 1. For this
model the area under the curve is:

auc(myROC)

## Area under the curve: 0.9929

ci(myROC, of='auc')

## 95% CI: 0.9878-0.9981 (DeLong)

ci(myROC, of='auc', method='bootstrap')

## 95% CI: 0.9872-0.9973 (2000 stratified bootstrap replicates)

which seems pretty good. The Area Under the Curve (AUC) is often used as a way
of comparing the quality of binary classifiers.
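The AUC also has a handy probabilistic interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A toy sketch (made-up scores, not the wbca fits):

```r
pos <- c(0.90, 0.80, 0.70)   # phat for three positive (malignant) cases
neg <- c(0.20, 0.75, 0.10)   # phat for three negative (benign) cases

# fraction of (positive, negative) pairs correctly ordered, ties counted as half
auc <- mean( outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==") )
auc   # 8/9: only the (0.70, 0.75) pair is out of order
```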

11.8 Exercises

1. The dataset faraway::wbca comes from a study of breast cancer in Wisconsin.
   There are 681 cases of potentially cancerous tumors of which 238
   are actually malignant (i.e. cancerous). Determining whether a tumor is
   really malignant is traditionally determined by an invasive surgical
   procedure. The purpose of this study was to determine whether a new procedure
   called 'fine needle aspiration', which draws only a small sample of tissue,
   could be effective in determining tumor status.

a. Fit a binomial regression with Class as the response variable and the
other nine variables as predictors (for consistency among students,
define a success as the tumor being benign and remember that glm
wants the response to be a matrix where the first column is the
number of successes). Report the residual deviance and associated
degrees of freedom. Can this information be used to determine if this
model fits the data?
b. Use AIC as the criterion to determine the best subset of variables
using the step function.
c. Use the reduced model to give the estimated probability that a tumor
with associated predictor variables

newdata <- data.frame( Adhes=1, BNucl=1, Chrom=3, Epith=2, Mitos=1,
                       NNucl=1, Thick=4, UShap=1, USize=1)

is benign and give a confidence interval for your estimate.


d. Suppose that a cancer is classified as benign if 𝑝̂ > 0.5 and malignant
if 𝑝̂ ≤ 0.5. Compute the number of errors of both types that will be
made if this method is applied to the current data with the reduced
model. Hint: save the 𝑝̂ as a column in the wbca data frame and use
that to create a new column Est_Class which is the estimated class
(making sure it is the same encoding scheme as Class). Then use
dplyr functions to create a table of how many rows fall into each of
the four Class/Est_Class combinations.
e. Suppose we changed the cutoff to 0.9. Compute the number of errors
of each type in this case. Discuss the ethical issues in determining
the cutoff.
2. Aflatoxin B1 was fed to lab animals at various doses and the number
responding with liver cancer recorded and is available in the dataset
faraway::aflatoxin.
a. Build a model to predict the occurrence of liver cancer. Consider a
square-root transformation to the dose level.
b. Compute the ED50 level (effective dose level… same as LD50 but isn’t
confined to strictly lethal effects) and an approximate 95% confidence
interval.
3. The dataset faraway::pima comes from a study of adult female Pima
Indians living near Phoenix and contains 𝑛 = 752 observations
after the cases of missing data (obnoxiously coded as 0) were removed.
Testing positive for diabetes was the success (test) and the predictor
variables we will use are: pregnant, glucose, and bmi.
a. Remove the observations that have missing data (coded as a zero) for
either glucose or bmi. The researcher’s choice of using 0 to represent
missing data is a bad idea because 0 is a valid value for the number
of pregnancies, so assume a zero in the pregnant covariate is a true
value. The dplyr function filter could be used here.
b. Fit the logistic regression model for test with using the main effects
of glucose, bmi, and pregnant.
c. Produce a graphic that displays the relationship between the variables.
Notice I've done part (a) for you; assume that the model produced
in part (b) is named m. I also split the pregnancy and bmi values
into some logical groupings for the visualization. If you've never
used the cut function, go look it up because it is extremely handy.
pima <- pima %>% filter( bmi != 0, glucose != 0)
pima <- pima %>% mutate(
  phat = ilogit( predict(m) ),
  pregnant.grp = cut(pregnant, c(0,1,3,6,100), right=FALSE, labels=c('0','1-2','3-5','6+')),
  bmi.grp = cut(bmi, c(0,18,25,30,100), labels=c('Underweight','Normal','Overweight','Obese')) )
ggplot(pima, aes(y=test, x=glucose)) +
  geom_point() +
  geom_line(aes(y=phat), color='red') +
  facet_grid(bmi.grp ~ pregnant.grp)

d. Discuss the quality of your predictions based on the graphic above
and modify your model accordingly.

e. Give the probability of testing positive for diabetes for a Pima woman
who has had no pregnancies, a bmi of 28, and a glucose level of 110.
f. Give the odds that the same woman would test positive for diabetes.
g. How would her odds change if she were to have a child? That is to
say, what is the odds ratio for that change?
Appendix

Chapter 12

Block Designs

# packages for this chapter


library(tidyverse) # ggplot2, dplyr, etc...
library(emmeans) # TukeyLetters stuff

Often there are covariates in the experimental units that are known to affect
the response variable and must be taken into account. Ideally an experimenter
can group the experimental units into blocks where the within block variance
is small, but the block to block variability is large. For example, in testing a
drug to prevent heart disease, we know that gender, age, and exercise levels
play a large role. We should partition our study participants into gender, age,
and exercise groups and then randomly assign the treatment (placebo vs drug)
within the group. This will ensure that we do not have a gender, age, and
exercise group that has all placebo observations.
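This restricted randomization is easy to carry out with dplyr. The sketch below uses hypothetical participants and block sizes (none of this comes from a real study); the point is that sample() is applied separately within each block.

```r
library(tidyverse)
set.seed(76)   # arbitrary seed for reproducibility

# Hypothetical participants, two per gender/age/exercise block
participants <- expand.grid( gender   = c('F','M'),
                             age.grp  = c('young','old'),
                             exercise = c('low','high'),
                             rep      = 1:2 )

# Within each block, randomly assign half to placebo and half to the drug
participants <- participants %>%
  group_by(gender, age.grp, exercise) %>%
  mutate( treatment = sample( rep( c('placebo','drug'), length.out=n() ) ) ) %>%
  ungroup()

# Every block now contains both treatments
participants %>% count(gender, age.grp, exercise, treatment)
```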
Often blocking variables are not the variables that we are primarily interested
in, but must nevertheless be considered. We call these nuisance variables. We
already know how to deal with these variables by adding them to the model,
but there are experimental designs where we must be careful because the ex-
perimental treatments are nested.
Example 1. An agricultural field study has three fields in which the researchers
will evaluate the quality of three different varieties of barley. Due to how they
harvest the barley, we can only create a maximum of three plots in each field. In
this example we will block on field since there might be differences in soil type,
drainage, etc from field to field. In each field, we will plant all three varieties so
that we can tell the difference between varieties without the block effect of field
confounding our inference. In this example, the varieties are nested within the
fields.


           Field 1     Field 2     Field 3
Plot 1     Variety A   Variety C   Variety B
Plot 2     Variety B   Variety A   Variety C
Plot 3     Variety C   Variety B   Variety A

Example 2. We are interested in how a mouse responds to five different materials
inserted into subcutaneous tissue to evaluate the materials' use in medicine.
Each mouse can have a maximum of 3 insertions. Here we will block on the
individual mice because even lab mice have individual variation. We actually
are not interested in estimating the effect of the mice because they aren’t really
of interest, but the mouse block effect should be accounted for before we make
any inferences about the materials. Notice that if we only have one insertion
per mouse, then the mouse effect will be confounded with materials.

12.1 Randomized Complete Block Design (RCBD)

The dataset oatvar in the faraway library contains information about an exper-
iment on eight different varieties of oats. The area in which the experiment was
done had some systematic variability and the researchers divided the area up
into five different blocks in which they felt the area inside a block was uniform
while acknowledging that some blocks are likely superior to others for growing
crops. Within each block, the researchers created eight plots and randomly
assigned a variety to a plot. This type of design is called a Randomized Com-
plete Block Design (RCBD) because each block contains all possible levels of
the factor of primary interest.

data('oatvar', package='faraway')
ggplot(oatvar, aes(y=yield, x=block, color=variety)) +
geom_point(size=5) +
geom_line(aes(x=as.integer(block))) # connect the dots

[Figure: oat yield vs. block (I-V), points colored by variety 1-8 and connected across blocks]

While there is one unusual observation in block IV, there doesn’t appear to be
a blatant interaction. We will consider the interaction shortly. For the main
effects model of yield ~ block + variety we have 𝑝 = 12 parameters and 28
residual degrees of freedom because
$$\begin{aligned}
df_\epsilon &= n - p \\
            &= n - \left(1 + \left[(I-1) + (J-1)\right]\right) \\
            &= 40 - \left(1 + \left[(5-1) + (8-1)\right]\right) \\
            &= 40 - 12 \\
            &= 28
\end{aligned}$$

m1 <- lm( yield ~ block + variety, data=oatvar)


anova(m1)

## Analysis of Variance Table


##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## block 4 33396 8348.9 6.2449 0.001008 **
## variety 7 77524 11074.8 8.2839 1.804e-05 ***
## Residuals 28 37433 1336.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# plot(m1) # check diagnostic plots - they are fine...

Because this is an orthogonal design, the sums of squares don't change regardless
of which order we add the factors, but if we removed one or two observations,
they would.
In determining the significance of variety, the above F-value and p-value are
correct. We have 40 observations (5 per variety), and after accounting for the
model structure (including the extraneous blocking variable), we have 28 residual
degrees of freedom.
But the F-value and p-value for testing if block is significant are nonsense! Imagine
that variety didn't matter; then we would just have 8 replicate samples per block.
But these aren't true replicates, they are what is called pseudoreplicates. Imagine
taking a sample of 𝑛 = 3 people and observing their heights at 1000 different
points in time during the day. You don't have 3000 data points for estimating
the mean height in the population, you have 3. Unless we account for this,
the inference for the block variable is wrong. In this case, we only have one
observation for each block, so we can't do any statistical inference at the block
scale!
Fortunately, in this case we don't care about the blocking variable; including
it in the model simply guards us in case there is a block-to-block difference,
even though we aren't interested in estimating it. If the only covariate we care
about is the most deeply nested effect, then we can do the usual analysis,
recognize that the p-value for the blocking variable is nonsense, and ignore it.

# Ignore any p-values regarding block, but I'm happy with the analysis for variety
letter_df <- emmeans(m1, ~variety) %>%
multcomp::cld(Letters=letters) %>%
dplyr::select(variety, .group) %>%
mutate(yield = 500)

ggplot(oatvar, aes(x=variety, y=yield)) +


geom_boxplot() +
geom_text( data=letter_df, aes(label=.group) )

[Figure: boxplots of yield for varieties 1-8, annotated with Tukey grouping letters ab, bc, b, a, c, ab, ab, bc]

However, it would be pretty sloppy not to do the analysis correctly, because
our blocking variable might be something we care about. To make R do the
correct analysis, we have to denote the nesting. In this case we have block-
to-block errors, and then variability within blocks. To denote the nesting we
use the Error() function within our formula. By default, Error() just creates
independent error terms, but when we add a covariate, it adds the appropriate
nesting.

m3 <- aov( yield ~ variety + Error(block), data=oatvar)


summary(m3)

##
## Error: block
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 4 33396 8349
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## variety 7 77524 11075 8.284 1.8e-05 ***
## Residuals 28 37433 1337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that in our block level, there is no p-value to assess if the blocks are
different. This is because we don’t have any replication of the blocks. So our
analysis respects that blocks are present, but does not attempt any statistical
analyses on them.

12.2 Split-plot designs


There are plenty of experimental designs where we have levels of treatments
nested within each other for practical reasons. The literature often gives the
example of an agriculture experiment where we investigate the effect of irrigation
and fertilizer on the yield of a crop. However because our irrigation system can’t
be fine-tuned, we have plots with different irrigation levels and within each plot
we have perhaps four subplots that have the fertilizer treatment. To summarize,
Irrigation treatments were randomly assigned to plots, and fertilizer treatments
were randomly assigned to sub-plots.

## `summarise()` regrouping output by 'plot', 'subplot', 'Fertilizer' (override with `.groups` argument)

## # A tibble: 6 x 5
## # Groups: plot, subplot, Fertilizer [6]
## plot subplot Fertilizer Irrigation yield
## <fct> <fct> <fct> <fct> <dbl>
## 1 1 1 Low Low 20.2
## 2 1 2 High Low 24.4
## 3 1 3 Low Low 18.0
## 4 1 4 High Low 21.0
## 5 2 1 Low Low 23.2
## 6 2 2 High Low 26.5

[Figure: field layout of the 8 plots, each containing 4 subplots; each plot is labeled with its Irrigation level (High/Low) and each subplot is shaded by its Fertilizer level (Low/High)]

So all together we have 8 plots, and 32 subplots. When I analyze the fertilizer,
I have 32 experimental units (the thing I have applied my treatment to), but
when analyzing the effect of irrigation, I only have 8 experimental units.

I like to think of this setup as having some lurking variables that act at the plot
level (changes in aspect, maybe something related to what was planted prior)
and some lurking variables that act on a local subplot scale (maybe variation
in clay/silt/sand ratios). So even after I account for the Irrigation and Fertilizer
treatments, observations within a plot will be more similar to each other than
observations in two different plots.
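The two-level error structure can be made concrete by simulating data with both a plot-level and a subplot-level noise term. Everything below (effect sizes, noise levels, treatment layout) is made up purely to illustrate the idea; it is not the AgData generation code.

```r
library(tidyverse)
set.seed(8675309)   # arbitrary seed

# 8 plots, each with 4 subplots; irrigation varies by plot,
# fertilizer varies by subplot (a simplified, non-random layout)
sim <- expand.grid( plot = 1:8, subplot = 1:4 ) %>%
  mutate( Irrigation = ifelse( plot <= 4, 'Low', 'High' ),
          Fertilizer = ifelse( subplot %% 2 == 1, 'Low', 'High' ) )

# One shared noise value per plot (the lurking plot-level variables)
plot.noise <- data.frame( plot = 1:8, plot.effect = rnorm(8, sd=3) )

sim <- sim %>%
  left_join( plot.noise, by='plot' ) %>%
  mutate( yield = 20 + 2*(Irrigation == 'High') + 1*(Fertilizer == 'High') +
                  plot.effect +            # shared by every subplot in a plot
                  rnorm( n(), sd=1 ) )     # subplot-level noise

head(sim)
```

Because observations in the same plot share `plot.effect`, they remain correlated even after accounting for the treatments, which is exactly why the naive analysis overstates the degrees of freedom for irrigation.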

We can think about doing two separate analyses, one for the effect of irrigation,
and another for the effect of the fertilizer.

# AgData came from my data package, dsData, (however I did some summarization
# first.)

# To analyze Irrigation, average over the subplots first...


Irrigation.data <- AgData %>%
group_by(plot, Irrigation) %>%
summarise( yield = mean(yield)) %>%
as.data.frame() # the aov command doesn't like tibbles.

## `summarise()` regrouping output by 'plot' (override with `.groups` argument)



# Now do a standard analysis. I use the aov() command instead of lm()


# because we will shortly do something very tricky that can only be
# done with aov(). For the most part, everything is
# identical from what you are used to.
m <- aov( yield ~ Irrigation, data=Irrigation.data )
anova(m)

## Analysis of Variance Table


##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Irrigation 1 26.064 26.0645 3.4281 0.1136
## Residuals 6 45.619 7.6032

In this case we see that we have insufficient evidence to conclude that the
observed difference between the Irrigation levels is anything more than random
chance.
Next we can do the appropriate analysis for the fertilizer, recognizing that all
the p-values for the plot effects are nonsense and should be ignored.

m <- aov( yield ~ plot + Fertilizer, data=AgData )


summary(m)

## Df Sum Sq Mean Sq F value Pr(>F)


## plot 7 286.73 40.96 3.153 0.0173 *
## Fertilizer 1 0.43 0.43 0.033 0.8572
## Residuals 23 298.83 12.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ideally we wouldn't have to average over the nested observations, and we
wouldn't have the misleading p-values for the plots. To do this, we only have
to specify the nesting of the error terms and R will figure out the appropriate
degrees of freedom for the covariates.

# To do this right, we have to abandon the general lm() command and use the more
# specialized aov() command. The Error() part of the formula allows me to nest
# the error terms and allow us to do the correct analysis. The order of these is
# to start with the largest/highest level and then work down the nesting.
m2 <- aov( yield ~ Irrigation + Fertilizer + Error(plot/subplot), data=AgData )
summary(m2)

##
## Error: plot
## Df Sum Sq Mean Sq F value Pr(>F)
## Irrigation 1 104.3 104.26 3.428 0.114
## Residuals 6 182.5 30.41
##
## Error: plot:subplot
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 1 0.43 0.43 0.033 0.857
## Residuals 23 298.83 12.99

In the output, we see that the ANOVA table row for Fertilizer is the same in
both analyses. The sums-of-squares for Irrigation differ between the two
analyses (because of the averaging), but the F and p values are the same.
What would have happened if we had performed the analysis incorrectly and
had too many degrees of freedom for the Irrigation test?

bad.model <- aov( yield ~ Irrigation + Fertilizer, data=AgData)


anova(bad.model)

## Analysis of Variance Table


##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Irrigation 1 104.26 104.258 6.2818 0.01806 *
## Fertilizer 1 0.43 0.430 0.0259 0.87324
## Residuals 29 481.31 16.597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case we would have concluded that we have statistically significant
evidence that the Irrigation levels are different. Notice that the sums-of-squares
in this wrong analysis match up with the sums-of-squares in the correct design;
the only difference is that when we figure out the sum-of-squares for the
residuals, we split it into the different pools:

$$RSS_{total} = RSS_{Fertilizer} + RSS_{Irrigation}$$
$$481.31 \approx 298.83 + 182.50$$

When we want to infer whether the amount of noise explained by adding Irrigation
or Fertilizer is sufficiently large to justify their inclusion in the model, we
compare the sum-of-squares value to the RSS, but now we have to use the
appropriate pool.

A second example of a slightly more complex split plot is given in the package
MASS under the dataset oats. From the help file the data describes the following
experiment:

The yield of oats from a split-plot field trial using three varieties and
four levels of manurial treatment. The experiment was laid out in
6 blocks of 3 main plots, each split into 4 sub-plots. The varieties
were applied to the main plots and the manurial treatments to the
sub-plots.

This is a lot to digest, so let's unpack it. First we have 6 blocks and we'll replicate
the exact same experiment in each block. Within a block, we'll split it into three
sections, which we'll call plots (within the block). Finally, within each plot we'll
have 4 subplots.

We have 3 varieties of oats, and 4 levels of fertilizer (manure). To each set of
3 plots, we'll randomly assign the 3 varieties, and to each set of subplots, we'll
assign the fertilizers.

One issue that makes this confusing for students is that most texts get lazy
and don't define the blocks, plots, and sub-plots when there are no replicates in
a particular level. I prefer to be explicit about defining those.

data('oats', package='MASS')
oats <- oats %>% mutate(
Nf = ordered(N, levels = sort(levels(N))), # make manure an ordered factor
plot = as.integer(V), # plot
subplot = as.integer(Nf)) # sub-plot

As always we first create a graph to examine the data

oats <- oats %>% mutate(B_Plot = interaction(B, plot))


ggplot(oats, aes(x=Nf, y=Y, color=V)) +
facet_grid( B ~ plot, labeller=label_both) +
geom_point() +
geom_line(aes(x=as.integer(Nf)))

[Figure: oat yield (Y) vs. fertilizer level (Nf: 0.0cwt-0.6cwt), faceted by block (I-VI) and plot (1-3), colored by variety (Golden.rain, Marvellous, Victory)]

This graph also makes me think that variety doesn't matter and that it is unlikely
that there is an interaction between oat variety and fertilizer level, but we should
check.

# What makes sense to me


# m.c <- aov( Y ~ V * Nf + Error(B/plot/subplot), data=oats)

Unfortunately the above model isn't correct because R isn't smart enough to
understand that the levels of plot and subplot are exact matches to the Variety
and Fertilizer levels. As a result, if we fit the model above, the degrees of
freedom will be all wrong because there is too much nesting. So we have to
recognize that plot and subplot are actually Variety and Fertilizer.

m.c <- aov( Y ~ V * Nf + Error(B/V/Nf), data=oats)


summary(m.c)

##
## Error: B
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 5 15875 3175
##
## Error: B:V
## Df Sum Sq Mean Sq F value Pr(>F)
## V 2 1786 893.2 1.485 0.272
## Residuals 10 6013 601.3
##
## Error: B:V:Nf
## Df Sum Sq Mean Sq F value Pr(>F)
## Nf 3 20020 6673 37.686 2.46e-12 ***
## V:Nf 6 322 54 0.303 0.932
## Residuals 45 7969 177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sure enough the interaction term is not significant. We next consider the Variety
term.

m.s <- aov( Y ~ V + Nf + Error(B/V/Nf), data=oats)


summary(m.s)

##
## Error: B
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 5 15875 3175
##
## Error: B:V
## Df Sum Sq Mean Sq F value Pr(>F)
## V 2 1786 893.2 1.485 0.272
## Residuals 10 6013 601.3
##
## Error: B:V:Nf
## Df Sum Sq Mean Sq F value Pr(>F)
## Nf 3 20020 6673 41.05 1.23e-13 ***
## Residuals 51 8291 163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude by noticing that the Variety does not matter, but that the fertilizer
level is quite significant.

There are many other types of designs out there. For example, you might have 5
levels of a factor, but when you split your block into plots, you can only create
3 plots, so not every block will have every level of the factor. This is called a
Randomized Incomplete Block Design (RIBD).
You might have a design where you apply even more levels of nesting. Suppose
you have a greenhouse study with rooms where you can apply a temperature
treatment; within each room you have four tables and can apply a light treatment
to each table. Finally, within each table you have four trays and can apply a soil
treatment to each tray. This is a continuation of the split-plot design, and by
extending the nesting we can develop split-split-plot and split-split-split-plot
designs.
You might have 7 covariates, each with two levels (High, Low), and you want
to investigate how these influence your response while also allowing for second-
and third-order interactions. If you looked at every treatment combination you'd
have $2^7 = 128$ different treatment combinations, but perhaps you only have the
budget for a sample of 𝑛 = 32. How should you design your experiment? This
question is addressed by fractional factorial designs.
If your research interests involve designing experiments such as these, you should
consider taking an Experimental design course.

12.3 Exercises
1. ???
2. ???
3. ???
Chapter 13

Maximum Likelihood
Estimation

library(tidyverse) # dplyr, tidyr, ggplot2

Learning Outcomes

• Explain how the probability mass/density function 𝑓(𝑥|𝜃) indicates what


data regions are more probable.
• Explain how the likelihood function ℒ(𝜃|𝑥) is defined if we know the prob-
ability function.
• Explain how the likelihood function ℒ(𝜃|𝑥) is used to find the maximum
likelihood estimate of 𝜃.
• For a given sample of data drawn from a distribution, find the maximum
likelihood estimate for the distribution parameters using R.

13.1 Introduction

The goal of statistical modeling is to take data that has some general trend
along with some un-explainable variability, and say something intelligent about
the trend. For example, the simple regression model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad\textrm{where}\quad \epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)$$


[Figure: scatterplot of Response vs. Explanatory with fitted line, illustrating $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$]

There is a general increasing trend in the response (i.e. the $\beta_0 + \beta_1 x_i$ term) and
then some un-explainable noise (the $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$ part).
While it has been convenient to write the model in this form, it is also possible
to write the simple regression model as
$$y_i \stackrel{ind}{\sim} N\left( \beta_0 + \beta_1 x_i,\; \sigma^2 \right)$$

This model contains three parameters 𝛽0 , 𝛽1 , and 𝜎 but it certainly isn’t clear
how to estimate these three values. In this chapter, we’ll develop a mechanism
for taking observed data sampled from some distribution parameterized by some
𝛽, 𝜆, 𝜎, or 𝜃 and then estimating those parameters.

13.2 Distributions

Depending on what values the data can take on (integers, positive values) and
the shape of the distribution of values, we might chose to model the data using
one of several different distributions. Next we’ll quickly introduce the mathe-
matical relationship between the parameter and probable data values of several
distributions.

13.2.1 Poisson

The Poisson distribution is used to model the number of events that happen
in some unit of time or space, so the observed values can only be non-negative
integers. This distribution is parameterized by 𝜆, which represents the expected
number of events (defined as the average over an infinitely large number of
draws). Because 𝜆 represents the average number of events, the 𝜆 parameter
must be greater than or equal to 0.

The function that defines the relationship between the parameter 𝜆 and what
values are most probable is called the probability mass function when talking
about discrete random variables and probability density functions in the contin-
uous case. Either way, these functions are traditionally notated using 𝑓(𝑥).

$$f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} \quad\textrm{for } x \in \{0, 1, 2, \dots\}$$

[Figure: Poisson(λ = 3.5) probability mass function f(x|λ) for x = 0, ..., 10, with the mean λ marked]

The notation 𝑓(𝑥|𝜆) is read as “f given 𝜆” and is used to denote that this is a
function that describes which values of the data 𝑋 are most probable, and that
the function depends on the parameter value. This emphasizes that if we
were to change the parameter value (to say 𝜆 = 10), then a different set of data
values would be more probable. In the above example with 𝜆 = 3.5, the most
probable outcome is 3, but we aren't surprised if we observe a value of
𝑥 = 1, 2, or 4. However, from this graph, we see that 𝑥 = 10 or 𝑥 = 15 would
be highly improbable.
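These probabilities can be evaluated directly in R with dpois():

```r
# P(X = x) for x = 0..10 when lambda = 3.5
round( dpois(0:10, lambda=3.5), 3 )
# The largest probability occurs at x = 3, and the probabilities for
# values such as x = 10 are already tiny.
```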

13.2.2 Exponential

The Exponential distribution can be used to model events that take on a positive
real value and the distribution of values has some skewness. We will parame-
terize this distribution using 𝛽 as the mean of the distribution.

$$f(x|\beta) = \frac{1}{\beta} e^{-x/\beta} \quad\textrm{for } x > 0$$



[Figure: Exponential(β = 3.5) probability density function f(x|β) for x between 0 and 20]

In this distribution, the region near zero is the most probable outcome, and
larger observations are less probable.

13.2.3 Normal

The Normal distribution is extremely commonly used in statistics; it can be
used to model continuous variables and is parameterized by the center 𝜇 and
spread 𝜎 parameters.

$$f(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]$$
where $\exp[w] = e^w$ is just a notational convenience.
[Figure: Normal(μ = 6, σ = 1) density curve with μ and σ annotated]
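As with the Poisson, the density can be evaluated in R, here with dnorm():

```r
# N(6, 1) density at a few points; it is largest at the mean
dnorm( c(4, 5, 6, 7, 8), mean=6, sd=1 )
# dnorm(6, mean=6, sd=1) equals 1/sqrt(2*pi), approximately 0.399
```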

All of these distributions (and there are many, many more distributions commonly
used) have some mathematical function that defines how probable a
region of response values is, and that function depends on the parameters.
Importantly, 𝑋 regions with the highest 𝑓(𝑥|𝜃) are the most probable data values.

There are many additional mathematical details that go into these density func-
tions but the important aspect is that they tell us what data values are most
probable given some parameter values 𝜃.

13.3 Likelihood Function

As a researcher, I am not particularly interested in saying “If 𝜇 = 3 and 𝜎 = 2,
then I'm likely to observe approximately 95% of my data between −1 and 7.”
Instead, I want to make an inference about which values of 𝜇 and 𝜎 are the most
concordant with the observed data that I've collected. However, the probability
density function 𝑓(𝑥|𝜇, 𝜎) is still the mathematical link between the data and
the parameters, and we will continue to use that function, but we'll re-interpret
which quantities are known.

The Likelihood function is just the probability density (or mass) function 𝑓(𝑥|𝜃)
re-interpreted to be a function where the data is the known quantity and we are
looking to see what parameter values are consistent with the data.

13.3.1 Poisson

Suppose that we have observed a single data point drawn from a Poisson(𝜆) and
we don’t know what 𝜆 is. We first write down the likelihood function

$$\mathcal{L}(\lambda|x) = f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$$

If we have observed 𝑥 = 4, then ℒ(𝜆|𝑥 = 4) is the following function



[Figure: the Poisson likelihood ℒ(λ | x = 4) and log-likelihood log ℒ(λ | x = 4), both maximized at λ̂ = 4]

Our best estimate for 𝜆 is the value that maximizes ℒ(𝜆|𝑥). We could do this two
different ways. First, we could solve mathematically by taking the derivative,
setting it equal to zero, and solving for 𝜆. Often this process is made
mathematically simpler (and computationally more stable) by instead
maximizing the log of the likelihood function. This is equivalent because the
log function is monotonically increasing: if 𝑎 < 𝑏 then log(𝑎) < log(𝑏). It is
simpler because taking logs turns products into sums and reduces the need for
the chain rule while taking derivatives. We could also find the value of 𝜆 that
maximizes the likelihood using numerical methods. Again, because the log
function makes everything nicer, in practice we'll always maximize the log-likelihood.
Many optimization functions are designed to find function minimums, so to use
those we'll actually minimize the negative log-likelihood, which is simply
−1 ∗ log ℒ().
Numerical solvers are convenient, but are only accurate to the tolerance you
specify. In this case where 𝑥 = 4, the actual maximum likelihood value is
𝜆̂ = 4.

x <- 4
neglogL <- function(param){
  dpois(x, lambda=param) %>%
    log() %>%      # take the log
    prod(-1) %>%   # multiply by -1
    return()
}

# The optimize() function searches the interval [0, 20] for an extremum.
# By default it finds the minimum (though it has an option to find the
# maximum instead), so minimizing the negative log-likelihood is the
# same as maximizing the likelihood.
optimize(neglogL, interval=c(0,20) )

## $minimum
## [1] 3.999993
##
## $objective
## [1] 1.632876

But what if we have multiple observations from this Poisson distribution? If the
observations are independent, then the probability mass or probability density
functions 𝑓(𝑥𝑖 |𝜃) can just be multiplied together.

$$\mathcal{L}(\lambda|\mathbf{x}) = \prod_{i=1}^{n} f(x_i|\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$

So, suppose we have observed x = {4, 6, 3, 3, 2, 4, 3, 2}. We could maximize this
function using either calculus or numerical methods and discover that the
maximum occurs at 𝜆̂ = 𝑥̄ = 3.375.
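The calculus route is a short computation. Taking the log of the likelihood above and differentiating with respect to $\lambda$:

```latex
\begin{aligned}
\log \mathcal{L}(\lambda|\mathbf{x})
  &= -n\lambda + \left(\sum_{i=1}^{n} x_i\right)\log\lambda
     - \sum_{i=1}^{n}\log\left(x_i!\right) \\
\frac{d}{d\lambda}\log \mathcal{L}(\lambda|\mathbf{x})
  &= -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\end{aligned}
```

For the observed sample, $\bar{x} = 27/8 = 3.375$, matching the numerical answer below.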

[Figure: the Poisson likelihood ℒ(λ | x) and log-likelihood for the sample, both maximized at λ̂ = 3.375]

If we are using the log-likelihood, then the multiplication is equivalent to
summing, and we'll define and subsequently optimize our log-likelihood function
like this:

x <- c(4,6,3,3,2,4,3,2)
neglogL <- function(param){
dpois(x, lambda=param) %>%
log() %>% sum() %>% prod(-1) %>%
return()
}
optimize(neglogL, interval=c(0,20) )

## $minimum
## [1] 3.37499
##
## $objective
## [1] 13.85426

13.3.2 Exponential Example

We next consider data sampled from the exponential distribution. Recall that
the exponential distribution can be parameterized by a single parameter 𝛽,
which is the expectation of the distribution; the variance is 𝛽². We might
consider using as an estimator either the sample mean or the sample standard
deviation. It turns out that the sample mean is the maximum likelihood
estimator in this case. For a concrete example, suppose that we had observed
x = {15.6, 2.03, 9.12, 1.54, 3.69}.
The likelihood is

$$\mathcal{L}(\beta \mid \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\beta} e^{-x_i / \beta}$$

A bit of calculus will show that this is maximized at 𝛽̂ = 𝑥̄ = 6.396. We can verify this numerically by following the same process as before.
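That bit of calculus can be sketched as follows, again working with the log-likelihood, which shares the same maximizer:

```latex
\begin{align*}
\log \mathcal{L}(\beta \mid \mathbf{x})
  &= -n \log \beta - \frac{1}{\beta} \sum_{i=1}^{n} x_i \\
\frac{d}{d\beta} \log \mathcal{L}(\beta \mid \mathbf{x})
  &= -\frac{n}{\beta} + \frac{1}{\beta^2} \sum_{i=1}^{n} x_i = 0
  \quad \Longrightarrow \quad
  \hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\end{align*}
```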

x <- c(15.6, 2.03, 9.12, 1.54, 3.69)
mean(x)

## [1] 6.396

[Figure: the exponential likelihood ℒ(𝛽 | x) and log-likelihood logℒ(𝛽 | x) plotted against 𝛽, each with a vertical line marking the maximum at 𝛽̂.]

x <- c(15.6, 2.03, 9.12, 1.54, 3.69)


neglogL <- function(param){
dexp(x, rate=1/param, log = TRUE) %>% # We defined beta as 1/rate
sum() %>%
prod(-1) %>%
return()
}
optimize(neglogL, interval=c(0,30) )

## $minimum
## [1] 6.396004
##
## $objective
## [1] 14.27836

13.3.3 Normal

We finally consider the case where we have observations coming from a distribution that has multiple parameters. The normal distribution is parameterized by a mean 𝜇 and spread 𝜎. Suppose that we had observed 𝑥𝑖 ∼ 𝑁(𝜇, 𝜎2), independent and identically distributed, and saw x = {5, 8, 9, 7, 11, 9}.

As usual we can calculate the likelihood function

$$\mathcal{L}(\mu, \sigma \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2} \right]$$

Again using calculus, it can be shown that the maximum likelihood estimators
in this model are

$$\hat{\mu} = \bar{x} = 8.1667$$

$$\hat{\sigma}_{mle} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 } = 1.8634$$

which is somewhat unexpected because the typical estimator we use has a $\frac{1}{n-1}$ multiplier.

x <- c(5, 8, 9, 7, 11, 9)

xbar <- mean(x)
s2 <- sum( (x - xbar)^2 ) / length(x)   # MLE variance uses 1/n, not 1/(n-1)
s <- sqrt(s2)
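Note that the MLE spread estimate can also be obtained from R's built-in `sd()`, which uses the 1/(n−1) divisor, by a simple rescaling:

```r
x <- c(5, 8, 9, 7, 11, 9)
n <- length(x)

# sd() divides by n-1; multiplying by sqrt((n-1)/n) converts it to the
# 1/n maximum likelihood version
s_mle <- sd(x) * sqrt( (n-1)/n )
s_mle
```

This matches the value of about 1.8634 obtained from the direct 1/n calculation above.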

[Figure: contour plot of the normal likelihood ℒ(𝜇, 𝜎 | x) over 𝜇 and 𝜎, with the maximum marked at (𝜇̂, 𝜎̂𝑚𝑙𝑒).]

x <- c(5, 8, 9, 7, 11, 9)


neglogL <- function(param){
dnorm(x, mean=param[1], sd=param[2]) %>%
log() %>% sum() %>% prod(-1) %>%
return()
}

# Bivariate optimization uses the optim() function, which can only search
# for a minimum. The first argument is an initial guess to start the algorithm.
# As long as the starting point isn't totally unreasonable, the numerical
# algorithm should be fine.
optim(c(5,2), neglogL )

## $par
## [1] 8.166924 1.863391
##
## $value
## [1] 12.24802
##
## $counts
## function gradient
## 71 NA

##
## $convergence
## [1] 0
##
## $message
## NULL

13.4 Discussion
1. How could the numerical maximization happen? Assume we have a 1-dimensional parameter space, a reasonable starting estimate, and a function to be maximized that is continuous, smooth, and ≥ 0 for all 𝑥. While knowing the derivative function 𝑓′(𝑥) would allow us to be much more clever, let’s think about how to do the maximization by just evaluating 𝑓(𝑥) for different values of 𝑥.
2. Convince yourself that if 𝑥0 is the value of 𝑥 that maximizes 𝑓(𝑥), then it is also the value that maximizes log(𝑓(𝑥)). This rests on the idea that log() is a strictly increasing function.
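One concrete (if naive) evaluation-only scheme for item 1 is an iterative grid search: evaluate 𝑓 on a grid, keep the best grid point, then zoom in around it and repeat. The helper `grid_maximize` below is hypothetical, written just for this discussion; the sketch also illustrates item 2, since the same maximizer is found for 𝑓 and log(𝑓).

```r
# Evaluation-only maximization (no derivatives): evaluate f on a grid,
# keep the best grid point, then zoom in around it and repeat.
grid_maximize <- function(f, lower, upper, iters = 20, n = 50){
  for(i in 1:iters){
    xs      <- seq(lower, upper, length.out = n)
    best    <- xs[ which.max( f(xs) ) ]
    spacing <- (upper - lower) / (n - 1)
    lower   <- best - spacing    # for a smooth unimodal f, the true maximum
    upper   <- best + spacing    # lies within one grid spacing of 'best'
  }
  best
}

# The Poisson likelihood from the single-observation example (x = 4)
f <- function(lambda){ dpois(4, lambda = lambda) }
grid_maximize(f, 0, 20)                           # approximately 4

# Item 2: log() is strictly increasing, so the maximizer is unchanged
grid_maximize(function(lambda) log(f(lambda)), 0.1, 20)
```

Each zoom shrinks the search interval by a factor of roughly 2/(n − 1), so a handful of iterations already pins down the maximizer to high precision.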

13.5 Exercises
1. The 𝜒2 distribution is parameterized by its degrees of freedom parameter 𝜈, which corresponds to the mean of the distribution (𝜈 must be > 0). The density function 𝑓(𝑥|𝜈) can be accessed in R using dchisq(x, df=nu).
a) For different values of 𝜈, plot the density function 𝑓(𝑥|𝜈). You might consider 𝜈 = 5, 10, and 20. The valid range of 𝑥 values is [0, ∞), so select a reasonable upper limit for your plots.
b) Suppose that we’ve observed 𝑥 = {9, 7, 7, 6, 10, 7, 9}. Calculate the sample mean and standard deviation.
c) Graphically show that the maximum likelihood estimator of 𝜈 is 𝑥̄ =
7.857.
d) Show that the maximum likelihood estimator of 𝜈 is 𝑥̄ = 7.857 using
a numerical maximization function.
2. The Beta distribution is often used when dealing with data that are proportions. It is parameterized by two parameters, usually called 𝛼 and 𝛽 (both of which must be greater than zero). The mean of this distribution is given by

$$E[X] = \frac{\alpha}{\alpha + \beta}$$

while the spread of the distribution is inversely related to the magnitude of 𝛼 and 𝛽. The density function 𝑓(𝑥|𝛼, 𝛽) can be accessed in R using the dbeta(x, alpha, beta) function.

a) For different values of 𝛼 and 𝛽, plot the density function 𝑓(𝑥|𝛼, 𝛽). You might consider keeping the ratio between 𝛼 and 𝛽 constant and just increasing their magnitude. The valid range of 𝑥 values is [0, 1].
b) Suppose that we’ve observed 𝑥 = {0.25, 0.42, 0.45, 0.50, 0.55}. Cal-
culate the sample mean and standard deviation.
c) Graphically show that the maximum likelihood estimators of 𝛼 and
𝛽 are approximately 9.5 and 12.5.
d) Show that the maximum likelihood estimators of 𝛼 and 𝛽 are approximately 9.5 and 12.5 using a numerical maximization function.
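As a starting point for part (a) of these exercises, several density curves can be overlaid in a single plot. This is a minimal base-R sketch (the chosen 𝜈 values and x range are just examples; adjust them as appropriate):

```r
# Overlay chi-squared densities for several values of nu (exercise 1a).
# Each curve's mode sits near nu - 2, so larger nu shifts the peak right.
x <- seq(0, 40, length.out = 200)
plot( x, dchisq(x, df = 5),  type = "l", ylab = "f(x | nu)" )
lines(x, dchisq(x, df = 10), lty = 2)
lines(x, dchisq(x, df = 20), lty = 3)
```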
Project Appendix

STA 571 Course Project Timeline

This document presents a rough outline of the projects conducted in STA 571: Statistical Methods II. The outline gives when certain topics should be discussed, when decisions should be made on moving forward with projects, and what students need to be focused on. Proposal and final project rubrics will be provided in additional PDFs. As always, there is flexibility based on the way you run your course.

13.6 Weeks 1 – 4 (Project Feasibility)

Leading up to Exam 1, decisions will be made about whether the projects can be conducted effectively within the course. This requires students to determine what data set they want to analyze and what methods they think they would be interested in using.
1. For students with thesis/dissertation work, the idea is to allow them to use their own data (likely data they have collected) to advance the project.
2. Students who don’t have thesis/dissertation work, or who don’t yet have data from their own research, should be engaged in looking into a source of data and determining what type of analysis can be completed.
One very good source of datasets and problems is Kaggle.

13.6.1 WIBGIs

Project ideas can be motivated through “Wouldn’t it be great if …” statements, or WIBGIs.

Write out 3 – 5 ideas for a project to be done this semester. Think about what data you may have available or what data you may be interested in analyzing. If possible, try to write a statement of this form:

Wouldn’t it be great if [what you want to analyze] could be used to solve a problem using statistics/data science.

Example: Wouldn’t it be great if plasma thermogram data could be used to classify lupus using logistic regression.


The first four weeks should include students trying to create as many WIBGIs as possible. At the end of each homework assignment, I added an optional section entitled “Project Development” where I reminded the students that they need to produce at minimum 3 WIBGI statements.

The main concept here is to encourage exploration and broad thinking. It is okay if students write statements that are not feasible. It is also okay if they write statements that don’t conform to the example statement. The goal here is creativity. The idea is to begin generating ideas and determining sources of data; we are trying to encourage data exploration and thinking outside the box of their core studies.
