
MATH3091: Statistical Modelling II

Professor Sujit Sahu and Dr Chao Zheng

2021-2022, Semester 2
Contents

Preface

1 Preliminaries
1.1 Lecture 1: Introduction
1.1.1 Elements of statistical modelling
1.1.2 Regression models
1.1.3 Example data to be analysed

2 Likelihood Based Statistical Theory
2.1 Lecture 2: Likelihood function
2.1.1 The likelihood function
2.1.2 Maximum likelihood estimation
2.2 Lecture 3: Score function and Information matrix
2.2.1 Score function
2.2.2 Information matrix
2.3 Lecture 4: Likelihood based inference
2.3.1 Asymptotic distribution of the MLE
2.3.2 Quantifying uncertainty in parameter estimates
2.3.3 Comparing statistical models

3 Linear Models
3.1 Lecture 5: Linear Model Theory: Revision of MATH2010
3.1.1 The linear model
3.1.2 Examples of linear model structure
3.1.3 Maximum likelihood estimation
3.1.4 Properties of the MLE
3.1.5 Comparing linear models

4 Linear Mixed Models
4.1 Lecture 6: Introduction to Linear Mixed Models
4.1.1 Motivations of LMMs
4.1.2 Basics of Linear Mixed Models
4.2 Lecture 7: LMMs parameter estimation I
4.2.1 Estimation of β
4.2.2 "Estimation" of γ
4.3 Lecture 8: LMMs parameter estimation II
4.3.1 The log-likelihood function
4.3.2 Profile likelihood method
4.3.3 Restricted maximum likelihood method
4.4 Lecture 9: Statistical Inference of LMMs
4.4.1 Confidence intervals
4.4.2 Hypothesis testing

5 Generalised Linear Models
5.1 Lecture 10: The Exponential family
5.1.1 Regression models for non-normal data
5.1.2 The exponential family
5.2 Lecture 11: Components of a generalised linear model
5.2.1 The random component
5.2.2 The systematic (or structural) component
5.2.3 The link function
5.3 Lecture 12: Examples of generalised linear models
5.3.1 The linear model
5.3.2 Models for binary data
5.3.3 Models for count data
5.4 Lecture 13: Maximum likelihood estimation
5.5 Lecture 14: Confidence intervals
5.6 Lecture 15: Comparing generalised linear models
5.6.1 The likelihood ratio test
5.7 Lecture 16: Scaled deviance and the saturated model
5.8 Lecture 17: Models with unknown a(ϕ)
5.8.1 Residuals

6 Models for categorical data
6.1 Lecture 18: Contingency tables
6.2 Lecture 19: Log-linear models
6.3 Lecture 20: Multinomial sampling
6.3.1 Product multinomial sampling
6.4 Lecture 21: Interpreting log-linear models for two-way tables
6.5 Lecture 22: Interpreting log-linear models for multiway tables
6.5.1 Simpson's paradox
Preface

The pre-requisite module MATH2010: Statistical Modelling I covered in detail the theory of linear regression models, where explanatory variables are used to explain the variation in a response variable, which is assumed to be normally distributed.
However, in many practical situations the data are not appropriate for such
analysis. For example, the response variable may be binary, and interest
may be focused on assessing the dependence of the probability of ‘success’
on potential explanatory variables. Alternatively, the response variable may
be a count of events, and we may wish to infer how the rate at which events
occur depends on explanatory variables. Such techniques are important in
many disciplines such as finance, biology, social sciences and medicine.
The aim of this module is to cover the theory and application of what are
known as generalised linear models (GLMs). This is an extremely broad
class of statistical models, which incorporates the linear regression models
studied in MATH2010, but also allows binary and count response data to be
modelled coherently.

Chapter 1

Preliminaries

1.1 Lecture 1: Introduction


1.1.1 Elements of statistical modelling
Probability and statistics can be characterised as the study of variability.
In particular, statistical inference is the science of analysing statistical data,
viewed as the outcome of some random process, in order to draw conclusions
about that random process.
Statistical models help us to understand the random process by which observed data have been generated. This may be of interest in itself, but also allows us to make predictions and, perhaps most importantly, decisions contingent on our inferences concerning the process.
It is also important, as part of the modelling process, to acknowledge that
our conclusions are only based on a (potentially small) sample of possible
observations of the process and are therefore subject to error. The science of
statistical inference therefore involves assessment of the uncertainties associ-
ated with the conclusions we draw.
Probability theory is the mathematics associated with randomness and uncer-
tainty. We usually try to describe random processes using probability models.
Then, statistical inference may involve estimating any unspecified features of
a model, comparing competing models, and assessing the appropriateness of
a model; all in the light of observed data.


In order to identify 'good' statistical models, we require some principles on which to base our modelling procedures. In general, we have three requirements of a statistical model:
• Plausibility
• Parsimony
• Goodness of fit
The first of these is not a statistical consideration, and a subject-matter
expert usually needs to be consulted about this. For some objectives, like
prediction, it might be considered unimportant. Parsimony and goodness of
fit are statistical issues. Indeed, there is usually a trade-off between the two
and our statistical modelling strategies will take account of this.

1.1.2 Regression models


Many statistical models, and all the ones we shall deal with in MATH3091,
can be formulated as regression models.
In practical applications, we often distinguish between a response variable and a group of explanatory variables. The aim is to determine the pattern of dependence of the response variable on the explanatory variables. A regression model has the general form

response = function(structure and randomness)

The structural part of the model describes how the response depends on the explanatory variables, and the random part defines the probability distribution of the response. Together, they produce the response, and the statistical modeller's task is to 'separate' these out.

1.1.3 Example data to be analysed


1.1.3.1 nitric: Nitric acid
This data set relates to 21 successive days of operation of a plant oxidising
ammonia to nitric acid. The response yield is ten times the percentage of
ingoing ammonia that is lost as unabsorbed nitric acid (an indirect measure
of the yield). The aim here is to study how the yield depends on flow of air to
the plant (flow), temperature of the cooling water entering the absorption
tower (temp) and concentration of nitric acid in the absorbing liquid (conc).
These data will be analysed in worksheet 2 using multiple linear regression models.

1.1.3.2 birth: Weight of newborn babies


This data set contains weights of 24 newborn babies. There are two explana-
tory variables, sex (Sex) and gestational age in weeks (Age) together with
the response variable, birth weight in grams (Weight). The aim here is to
study how birth weight depends on sex and gestational age. This data set
will be analysed in worksheet 3 by using multiple linear regression models
including both categorical and continuous explanatory variables.

1.1.3.3 survival: Time to death


This data set, analysed in worksheet 4, contains survival times in 10 hour
units (time) of 48 rats each allocated to one of 12 combinations of 3 poisons
(poison) and 4 treatments (treatment). The aim here is to study how
survival time depends on the poison and the treatment, and to determine
whether there is an interaction between these two categorical variables.

1.1.3.4 beetle: Mortality from carbon disulphide


This data set represents the number of beetles exposed (exposed) and num-
ber killed (killed) in eight groups exposed to different doses (dose) of a
particular insecticide. Interest is focussed on how mortality is related to
dose. It seems sensible to model the number of beetles killed in each group
as a binomial random variable, with the probability of death depending on dose.
This will be discussed in worksheet 5.

1.1.3.5 shuttle: Challenger disaster


This data set concerns the 23 space shuttle flights before the Challenger
disaster. The disaster is thought to have been caused by the failure of a
number of O-rings, of which there were six in total. The data consist of
four variables, the number of damaged O-rings for each pre-Challenger flight
(n_damaged), together with the launch temperature in degrees Fahrenheit
(temp), the pressure at which the pre-launch test of O-ring leakage was carried
out (pressure) and the name of the orbiter (orbiter). The Challenger
launch temperature on 28th January 1986 was 31°F. The aim is to predict the probability of O-ring damage at the Challenger launch. This will be discussed in worksheet 6.

1.1.3.6 heart: Treatment for heart attack


This data set represents the results of a clinical trial to assess the effectiveness
of a thrombolytic (clot-busting) treatment for patients who have suffered
an acute myocardial infarction (heart attack). There are four categorical
explanatory variables, representing
• the site of infarction: anterior, inferior or other
• the time between infarction and treatment: ≤ 12 or > 12 hours
• whether the patient was already taking Beta-blocker medication prior
to the infarction, blocker: yes or no
• the treatment the patient was given: active or placebo.
For each combination of these categorical variables, the dataset gives the
total number of patients (n_patients), and the number who survived for
35 days (n_survived). The aim is to find out how these categorical variables
affect a patient’s chance of survival. These data will be analysed in worksheet
7.

1.1.3.7 accident: Road traffic accidents


This example concerns the number of road accidents (number) and the volume
of traffic (volume), on each of two roads in Cambridge (road), at various
times of day (time, taking values morning, midday or afternoon). We should
be able to answer questions like:
1. Is Mill Road more dangerous than Trumpington Road?
2. How does time of day affect the rate of road accidents?
These issues will be considered in worksheet 8.

1.1.3.8 lymphoma: Lymphoma patients


The lymphoma data set represents 30 lymphoma patients classified by sex
(Sex), cell type of lymphoma (Cell) and response to treatment (Remis). It
is an example of data which may be represented as a three-way (2 × 2 ×
2) contingency table. The aim here is to study the complex dependence
structures between the three classifying factors. This is taken up in worksheet 9.
Chapter 2

Likelihood Based Statistical Theory

2.1 Lecture 2: Likelihood function


Probability distributions like the binomial, Poisson and normal, enable us
to calculate probabilities, and other quantities of interest (e.g. expectations)
for a probability model of a random process. Therefore, given the model, we
can make statements about possible outcomes of the process.
Statistical inference is concerned with the inverse problem. Given outcomes
of a random process (observed data), what conclusions (inferences) can we
draw about the process itself?
We assume that the n observations of the response y = (y1 , . . . , yn )T are
observations of random variables Y = (Y1 , . . . , Yn )T , which have joint p.d.f.
fY (joint p.f. for discrete variables). We use the observed data y to make
inferences about fY .
We usually make certain assumptions about fY. In particular, we often assume that y1, . . . , yn are observations of independent random variables. Hence
$$f_Y(y) = f_{Y_1}(y_1) f_{Y_2}(y_2) \cdots f_{Y_n}(y_n) = \prod_{i=1}^n f_{Y_i}(y_i).$$

In parametric statistical inference, we specify a joint distribution fY for Y, which is known except for the values of parameters θ1, θ2, . . . , θp (sometimes denoted by θ). Then we use the observed data y to make inferences about θ1, θ2, . . . , θp. In this case, we usually write fY as fY(y; θ), to make explicit the dependence on the unknown θ.

2.1.1 The likelihood function


We often think of the joint density fY (y; θ) as a function of y for fixed θ,
which describes the relative probabilities of different possible values of y,
given a particular set of parameters θ. However, in statistical inference, we
have observed y1 , . . . , yn (values of Y1 , . . . , Yn ). Knowledge of the probability
of alternative possible realisations of Y is largely irrelevant. What we want
to know about is θ.
Our only link between the observed data y1 , . . . , yn and θ is through the
function fY (y; θ). Therefore, it seems sensible that parametric statistical
inference should be based on this function. We can think of fY (y; θ) as a
function of θ for fixed y, which describes the relative likelihoods of different
possible (sets of) θ, given observed data y1 , . . . , yn . We write

L(θ; y) = fY (y; θ)

for this likelihood, which is a function of the unknown parameter θ. For convenience, we often drop y from the notation, and write L(θ).
The likelihood function is of central importance in parametric statistical in-
ference. It provides a means for comparing different possible values of θ,
based on the probabilities (or probability densities) that they assign to the
observed data y1 , . . . , yn .

Notes
1. Frequently it is more convenient to consider the log-likelihood function
ℓ(θ) = log L(θ).
2. Nothing in the definition of the likelihood requires y1 , . . . , yn to be
observations of independent random variables, although we shall fre-
quently make this assumption.
3. Any factors which depend on y1 , . . . , yn alone (and not on θ) can be
ignored when writing down the likelihood. Such factors give no information about the relative likelihoods of different possible values of θ.

Example 2.1 (Bernoulli). y1, . . . , yn are observations of Y1, . . . , Yn, independent identically distributed (i.i.d.) Bernoulli(p) random variables. Here θ = (p) and the likelihood is
$$L(p) = \prod_{i=1}^n p^{y_i} (1 - p)^{1 - y_i} = p^{\sum_{i=1}^n y_i} (1 - p)^{n - \sum_{i=1}^n y_i}.$$

The log-likelihood is

ℓ(p) = log L(p) = nȳ log p + n(1 − ȳ) log(1 − p).

Example 2.2 (Normal). y1, . . . , yn are observations of Y1, . . . , Yn, i.i.d. N(µ, σ²) random variables. Here θ = (µ, σ²) and the likelihood is
$$\begin{aligned}
L(\mu, \sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right\} \\
&\propto (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right\}.
\end{aligned}$$

The log-likelihood is
$$\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.$$

2.1.2 Maximum likelihood estimation


One of the primary tasks of parametric statistical inference is estimation of
the unknown parameters θ1 , . . . , θp . Consider the value of θ which maximises
the likelihood function. This is the ‘most likely’ value of θ, the one which
makes the observed data ‘most probable’. When we are searching for an
estimate of θ, this would seem to be a good candidate.
We call the value of θ which maximises the likelihood L(θ) the maximum
likelihood estimate (MLE) of θ, denoted by θ̂. θ̂ depends on y, as different observed data samples lead to different likelihood functions. The corresponding function of Y is called the maximum likelihood estimator and is also denoted by θ̂.
Note that as θ = (θ1 , . . . , θp ), the MLE for any component of θ is given by
the corresponding component of θ̂ = (θ̂1 , . . . , θ̂p )T . Similarly, the MLE for
any function of parameters g(θ) is given by g(θ̂).
As log is a strictly increasing function, the value of θ which maximises L(θ)
also maximises ℓ(θ) = log L(θ). It is almost always easier to maximise ℓ(θ).
This is achieved in the usual way; finding a stationary point by differentiat-
ing ℓ(θ) with respect to θ1 , . . . , θp , and solving the resulting p simultaneous
equations. It should also be checked that the stationary point is a maximum.

Example 2.3 (Bernoulli). y1, . . . , yn are observations of Y1, . . . , Yn, i.i.d. Bernoulli(p) random variables. Here θ = (p) and the log-likelihood is
$$\ell(p) = n\bar{y}\log p + n(1 - \bar{y})\log(1 - p).$$
Differentiating with respect to p,
$$\frac{\partial}{\partial p}\ell(p) = \frac{n\bar{y}}{p} - \frac{n(1 - \bar{y})}{1 - p},$$
so the MLE p̂ solves
$$\frac{n\bar{y}}{\hat{p}} - \frac{n(1 - \bar{y})}{1 - \hat{p}} = 0.$$
Solving this for p̂ gives p̂ = ȳ. Note that
$$\frac{\partial^2}{\partial p^2}\ell(p) = -n\bar{y}/p^2 - n(1 - \bar{y})/(1 - p)^2 < 0$$
everywhere, so the stationary point is clearly a maximum.
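As a quick illustration (a minimal sketch added here, not part of the original text), the following R code checks this result numerically on simulated data; the data y and the function negloglik are illustrative assumptions rather than anything from the module.

set.seed(1)
y <- rbinom(50, size = 1, prob = 0.3)    # simulated Bernoulli data (illustrative)
negloglik <- function(p) -sum(dbinom(y, size = 1, prob = p, log = TRUE))
opt <- optimise(negloglik, interval = c(0.001, 0.999))
opt$minimum    # numerical MLE of p
mean(y)        # closed-form MLE, p-hat = y-bar

The two printed values agree to the tolerance of the optimiser, as the theory predicts.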

Example 2.4 (Normal). y1, . . . , yn are observations of Y1, . . . , Yn, i.i.d. N(µ, σ²) random variables. Here θ = (µ, σ²) and the log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.$$
Differentiating with respect to µ,
$$\frac{\partial}{\partial \mu}\ell(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) = \frac{n(\bar{y} - \mu)}{\sigma^2},$$
so (µ̂, σ̂²) solve
$$\frac{n(\bar{y} - \hat{\mu})}{\hat{\sigma}^2} = 0. \tag{2.1}$$
Differentiating with respect to σ²,
$$\frac{\partial}{\partial \sigma^2}\ell(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (y_i - \mu)^2,$$
so
$$-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2}\sum_{i=1}^n (y_i - \hat{\mu})^2 = 0. \tag{2.2}$$
Solving (2.1) and (2.2), we obtain µ̂ = ȳ and
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu})^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2.$$
Strictly, to show that this stationary point is a maximum, we need to show that the Hessian matrix (the matrix of second derivatives, with elements $[H(\theta)]_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j}\ell(\theta)$) is negative definite at θ = θ̂, that is, $a^T H(\hat{\theta}) a < 0$ for every a ≠ 0. Here
$$H(\hat{\mu}, \hat{\sigma}^2) = \begin{pmatrix} -\frac{n}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{n}{2(\hat{\sigma}^2)^2} \end{pmatrix},$$
which is clearly negative definite.
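As a minimal R sketch (an added illustration, not from the notes), the closed-form MLEs can be checked against a direct numerical maximisation of ℓ(µ, σ²); the simulated data and the log-scale parametrisation are assumptions made purely for illustration.

set.seed(2)
y <- rnorm(100, mean = 5, sd = 2)                  # simulated data (illustrative)
negloglik <- function(par) {                        # par = c(mu, log(sigma^2))
  mu <- par[1]; sigma2 <- exp(par[2])               # log scale keeps sigma^2 > 0
  -sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
c(fit$par[1], exp(fit$par[2]))                      # numerical (mu-hat, sigma^2-hat)
c(mean(y), mean((y - mean(y))^2))                   # closed-form MLEs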

2.2 Lecture 3: Score function and Information matrix
2.2.1 Score function
Let
$$u_i(\theta) \equiv \frac{\partial}{\partial \theta_i}\ell(\theta), \quad i = 1, \ldots, p,$$
and u(θ) ≡ [u1(θ), . . . , up(θ)]T. Then we call u(θ) the vector of scores or score vector. Where p = 1 and θ = (θ), the score is the scalar defined as
$$u(\theta) \equiv \frac{\partial}{\partial \theta}\ell(\theta).$$

The maximum likelihood estimate θ̂ satisfies

u(θ̂) = 0,

that is,

ui (θ̂) = 0, i = 1, . . . , p.

Note that u(θ) is a function of θ for fixed (observed) y. However, if we


replace y1 , . . . , yn in u(θ), by the corresponding random variables Y1 , . . . , Yn
then we obtain a vector of random variables U (θ) ≡ [U1 (θ), . . . , Up (θ)]T .

An important result in likelihood theory is that the expected score at the


true (but unknown) value of θ is zero:

Theorem 2.1. We have E[U (θ)] = 0, i.e. E[Ui (θ)] = 0, i = 1, . . . , p,


provided that

1. The expectation exists.


2. The sample space for Y does not depend on θ.

Proof. Our proof is for continuous y; in the discrete case, replace $\int$ by $\sum$. For each i = 1, . . . , p,
$$\begin{aligned}
E[U_i(\theta)] &= \int U_i(\theta) f_Y(y; \theta)\,dy \\
&= \int \frac{\partial}{\partial \theta_i}\ell(\theta)\, f_Y(y; \theta)\,dy \\
&= \int \frac{\partial}{\partial \theta_i}\log f_Y(y; \theta)\, f_Y(y; \theta)\,dy \\
&= \int \frac{\frac{\partial}{\partial \theta_i} f_Y(y; \theta)}{f_Y(y; \theta)}\, f_Y(y; \theta)\,dy \\
&= \int \frac{\partial}{\partial \theta_i} f_Y(y; \theta)\,dy \\
&= \frac{\partial}{\partial \theta_i} \int f_Y(y; \theta)\,dy \\
&= \frac{\partial}{\partial \theta_i} 1 = 0,
\end{aligned}$$
as required.

Note that the expectation here is taken with respect to the true density fY(y; θ) at the unknown true value of θ; otherwise the proof does not hold.

Example 2.5 (Bernoulli). y1 , . . . , yn are observations of Y1 , . . . , Yn , i.i.d.


Bernoulli(p) random variables. Here θ = (p) and
u(p) = nȳ/p − n(1 − ȳ)/(1 − p).
Since E[U (p)] = 0, we must have E[Ȳ ] = p (which we already know is
correct).

Example 2.6 (Normal). y1 , . . . , yn are observations of Y1 , . . . , Yn , i.i.d.


N (µ, σ 2 ) random variables. Here θ = (µ, σ 2 ) and

$$u_1(\mu, \sigma^2) = n(\bar{y} - \mu)/\sigma^2,$$
$$u_2(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (y_i - \mu)^2.$$
Since E[U(µ, σ²)] = 0, we must have E[Ȳ] = µ and $E\left[\frac{1}{n}\sum_{i=1}^n (Y_i - \mu)^2\right] = \sigma^2$.

2.2.2 Information matrix


Suppose that y1 , . . . , yn are observations of Y1 , . . . , Yn , whose joint p.d.f. L(θ)
is completely specified except for the values of p unknown parameters θ =
(θ1 , . . . , θp )T . Previously, we defined the Hessian matrix H(θ) to be the
matrix with components

$$[H(\theta)]_{ij} \equiv \frac{\partial^2}{\partial \theta_i \partial \theta_j}\ell(\theta), \quad i = 1, \ldots, p; \; j = 1, \ldots, p.$$

We call the matrix −H(θ) the observed information matrix. Where p = 1


and θ = (θ), the observed information is a scalar defined as

$$-H(\theta) \equiv -\frac{\partial^2}{\partial \theta^2}\ell(\theta).$$

As with the score, if we replace y1 , . . . , yn in H(θ), by the corresponding


random variables Y1 , . . . , Yn , we obtain a matrix of random variables. Then,
we define the expected information matrix or Fisher information matrix

[I(θ)]ij = Eθ (−[H(θ)]ij ) i = 1, . . . , p; j = 1, . . . , p.

Here Eθ means that the expectation is taken with respect to the distribution indexed by the value of θ at which the information is being evaluated.
An important result in likelihood theory is that the variance-covariance ma-
trix of the score vector (with respect to the θ) is equal to the expected
information matrix:

Theorem 2.2. We have Varθ [U (θ)] = I(θ), i.e.

Varθ [U (θ)]ij = [I(θ)]ij , i = 1, . . . , p, j = 1, . . . , p

provided that
1. The variance exists.
2. The sample space for Y does not depend on θ.
Proof. Our proof is for continuous y; in the discrete case, replace $\int$ by $\sum$. For each i = 1, . . . , p and j = 1, . . . , p,
$$\begin{aligned}
\mathrm{Var}_\theta[U(\theta)]_{ij} &= E_\theta[U_i(\theta) U_j(\theta)] \\
&= \int \frac{\partial}{\partial \theta_i}\ell(\theta)\, \frac{\partial}{\partial \theta_j}\ell(\theta)\, f_Y(y; \theta)\,dy \\
&= \int \frac{\partial}{\partial \theta_i}\log f_Y(y; \theta)\, \frac{\partial}{\partial \theta_j}\log f_Y(y; \theta)\, f_Y(y; \theta)\,dy \\
&= \int \frac{\frac{\partial}{\partial \theta_i} f_Y(y; \theta)}{f_Y(y; \theta)}\, \frac{\frac{\partial}{\partial \theta_j} f_Y(y; \theta)}{f_Y(y; \theta)}\, f_Y(y; \theta)\,dy \\
&= \int \frac{1}{f_Y(y; \theta)}\, \frac{\partial}{\partial \theta_i} f_Y(y; \theta)\, \frac{\partial}{\partial \theta_j} f_Y(y; \theta)\,dy.
\end{aligned}$$
Now
$$\begin{aligned}
[I(\theta)]_{ij} &= E_\theta\left[-\frac{\partial^2}{\partial \theta_i \partial \theta_j}\ell(\theta)\right] \\
&= -\int \frac{\partial^2}{\partial \theta_i \partial \theta_j}\log f_Y(y; \theta)\, f_Y(y; \theta)\,dy \\
&= -\int \frac{\partial}{\partial \theta_i}\left( \frac{\frac{\partial}{\partial \theta_j} f_Y(y; \theta)}{f_Y(y; \theta)} \right) f_Y(y; \theta)\,dy \\
&= -\int \left( \frac{\frac{\partial^2}{\partial \theta_i \partial \theta_j} f_Y(y; \theta)}{f_Y(y; \theta)} - \frac{\frac{\partial}{\partial \theta_i} f_Y(y; \theta)\, \frac{\partial}{\partial \theta_j} f_Y(y; \theta)}{f_Y(y; \theta)^2} \right) f_Y(y; \theta)\,dy \\
&= -\frac{\partial^2}{\partial \theta_i \partial \theta_j}\int f_Y(y; \theta)\,dy + \int \frac{1}{f_Y(y; \theta)}\, \frac{\partial}{\partial \theta_i} f_Y(y; \theta)\, \frac{\partial}{\partial \theta_j} f_Y(y; \theta)\,dy \\
&= \mathrm{Var}_\theta[U(\theta)]_{ij},
\end{aligned}$$

as required.

Example 2.7 (Bernoulli). y1 , . . . , yn are observations of Y1 , . . . , Yn , i.i.d.


Bernoulli(p) random variables. Here θ = (p) and
$$u(p) = \frac{n\bar{y}}{p} - \frac{n(1 - \bar{y})}{1 - p},$$
$$-H(p) = \frac{n\bar{y}}{p^2} + \frac{n(1 - \bar{y})}{(1 - p)^2},$$
$$I(p) = \frac{n}{p} + \frac{n}{1 - p} = \frac{n}{p(1 - p)}.$$

Example 2.8 (Normal). y1, . . . , yn are observations of Y1, . . . , Yn, i.i.d. N(µ, σ²) random variables. Here θ = (µ, σ²) and
$$u_1(\mu, \sigma^2) = \frac{n(\bar{y} - \mu)}{\sigma^2},$$
$$u_2(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (y_i - \mu)^2.$$
Therefore
$$-H(\mu, \sigma^2) = \begin{pmatrix} \frac{n}{\sigma^2} & \frac{n(\bar{y} - \mu)}{(\sigma^2)^2} \\ \frac{n(\bar{y} - \mu)}{(\sigma^2)^2} & \frac{1}{(\sigma^2)^3}\sum_{i=1}^n (y_i - \mu)^2 - \frac{n}{2(\sigma^2)^2} \end{pmatrix},$$
$$I(\mu, \sigma^2) = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2(\sigma^2)^2} \end{pmatrix}.$$

2.3 Lecture 4: Likelihood based inference


2.3.1 Asymptotic distribution of the MLE
Maximum likelihood estimation is an attractive method of estimation for a
number of reasons. It is intuitively sensible and usually reasonably straight-
forward to carry out. Even when the simultaneous equations we obtain by
differentiating the log-likelihood function are impossible to solve directly, so-
lution by numerical methods is usually feasible.
Perhaps the most compelling reason for considering maximum likelihood es-
timation is the asymptotic behaviour of maximum likelihood estimators.
Suppose that y1, . . . , yn are observations of independent random variables Y1, . . . , Yn, whose joint p.d.f. $f_Y(y; \theta) = \prod_{i=1}^n f_{Y_i}(y_i; \theta)$ is completely specified except for the values of an unknown parameter vector θ, and that θ̂ is the maximum likelihood estimator of θ.
Then, as n → ∞, the distribution of θ̂ tends to a multivariate normal distri-
bution with mean vector θ and variance covariance matrix I(θ)−1 .
Where p = 1 and θ = (θ), the distribution of the MLE θ̂ tends to N [θ, 1/I(θ)].
For ‘large enough n’, we can treat the asymptotic distribution of the MLE
as an approximation. The fact that E(θ̂) ≈ θ means that the maximum like-
lihood estimator is approximately unbiased for large samples. The variance
of θ̂ is approximately I(θ)−1 . It is possible to show that this is the smallest
possible variance of any unbiased estimator of θ (this result is called the
Cramér–Rao lower bound, which we do not prove here). Therefore the MLE
is the ‘best possible’ estimator in large samples (and therefore we hope also
reasonable in small samples, though we should investigate this case by case).

2.3.2 Quantifying uncertainty in parameter estimates


The usefulness of an estimate is always enhanced if some kind of measure
of its precision can also be provided. Usually, this will be a standard error,
an estimate of the standard deviation of the associated estimator. For the
maximum likelihood estimator θ̂, a standard error is given by

$$\mathrm{s.e.}(\hat{\theta}) = \frac{1}{I(\hat{\theta})^{1/2}},$$
and for a vector parameter θ,
$$\mathrm{s.e.}(\hat{\theta}_i) = [I(\hat{\theta})^{-1}]_{ii}^{1/2}, \quad i = 1, \ldots, p.$$

An alternative summary of the information provided by the observed data


about the location of a parameter θ and the associated precision is a confi-
dence interval.
The asymptotic distribution of the maximum likelihood estimator can be used
to provide approximate large sample confidence intervals. Asymptotically, θ̂i
has a $N(\theta_i, [I(\theta)^{-1}]_{ii})$ distribution, and we can find $z_{1-\alpha/2}$ such that
$$P\left( -z_{1-\alpha/2} \le \frac{\hat{\theta}_i - \theta_i}{[I(\theta)^{-1}]_{ii}^{1/2}} \le z_{1-\alpha/2} \right) = 1 - \alpha.$$
Therefore
$$P\left( \hat{\theta}_i - z_{1-\alpha/2}[I(\theta)^{-1}]_{ii}^{1/2} \le \theta_i \le \hat{\theta}_i + z_{1-\alpha/2}[I(\theta)^{-1}]_{ii}^{1/2} \right) = 1 - \alpha.$$
The endpoints of this interval cannot be evaluated because they also depend on the unknown parameter vector θ. However, if we replace I(θ) by its MLE I(θ̂), we obtain the approximate large sample 100(1 − α)% confidence interval
$$\left[ \hat{\theta}_i - z_{1-\alpha/2}[I(\hat{\theta})^{-1}]_{ii}^{1/2},\; \hat{\theta}_i + z_{1-\alpha/2}[I(\hat{\theta})^{-1}]_{ii}^{1/2} \right].$$
For α = 0.1, 0.05, 0.01, we have $z_{1-\alpha/2}$ = 1.64, 1.96, 2.58 respectively.

Example 2.9 (Bernoulli). If y1, . . . , yn are observations of Y1, . . . , Yn, i.i.d. Bernoulli(p) random variables, then asymptotically p̂ = ȳ has a N(p, p(1 − p)/n) distribution, and a large sample 95% confidence interval for p is
$$\begin{aligned}
& \left[ \hat{p} - 1.96\,[I(\hat{p})^{-1}]^{1/2},\; \hat{p} + 1.96\,[I(\hat{p})^{-1}]^{1/2} \right] \\
&= \left[ \hat{p} - 1.96\,[\hat{p}(1 - \hat{p})/n]^{1/2},\; \hat{p} + 1.96\,[\hat{p}(1 - \hat{p})/n]^{1/2} \right] \\
&= \left[ \bar{y} - 1.96\,[\bar{y}(1 - \bar{y})/n]^{1/2},\; \bar{y} + 1.96\,[\bar{y}(1 - \bar{y})/n]^{1/2} \right].
\end{aligned}$$
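For illustration, here is a minimal R sketch (not from the notes) computing this large-sample interval for simulated Bernoulli data; the sample size and the true p are assumptions made only for the example.

set.seed(3)
y <- rbinom(200, size = 1, prob = 0.4)   # simulated data (illustrative)
p_hat <- mean(y)
se <- sqrt(p_hat * (1 - p_hat) / length(y))
c(p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% confidence interval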

2.3.3 Comparing statistical models


If we have a set of competing probability models which might have generated
the observed data, we may want to determine which of the models is most
appropriate. In practice, we proceed by comparing models pairwise. Suppose
that we have two competing alternatives, $f_Y^{(0)}$ (model M0) and $f_Y^{(1)}$ (model M1) for fY, the joint distribution of Y1, . . . , Yn. Often H0 and H1 both take
the same parametric form, fY (y; θ) but with θ ∈ Θ(0) for H0 and θ ∈ Θ(1)
for H1 , where Θ(0) and Θ(1) are alternative sets of possible values for θ. In
the regression setting, we are often interested in determining which of a set
of explanatory variables have an impact on the distribution of the response.

2.3.3.1 Hypothesis testing


A hypothesis test provides one mechanism for comparing two competing sta-
tistical models. A hypothesis test does not treat the two hypotheses (models)
symmetrically. One hypothesis,
H0 : the data were generated from model M0 ,
is accorded special status, and referred to as the null hypothesis. The null
hypothesis is the reference model, and will be assumed to be appropriate
unless the observed data strongly indicate that H0 is inappropriate, and that
H1 : the data were generated from model M1 ,
(the alternative hypothesis) should be preferred. The fact that a hypothesis
test does not reject H0 should not be taken as evidence that H0 is true and
H1 is not, or that H0 is better supported by the data than H1 , merely that
the data does not provide sufficient evidence to reject H0 in favour of H1 .
A hypothesis test is defined by its critical region or rejection region, which
we shall denote by C. C is a subset of Rn and is the set of possible y which
would lead to rejection of H0 in favour of H1 , i.e.
• If y ∈ C, H0 is rejected in favour of H1 ;
• If y ̸∈ C, H0 is not rejected.
As Y is a random variable, there remains the possibility that a hypothesis
test will produce an erroneous result. We define the size (or significance level)
of the test
$$\alpha = \max_{\theta \in \Theta^{(0)}} P(Y \in C; \theta).$$

This is the maximum probability of erroneously rejecting H0 , over all possible


distributions for Y implied by H0 . We also define the power function
ω(θ) = P (Y ∈ C; θ)
It represents the probability of rejecting H0 for a particular value of θ. Clearly
we would like to find a test where ω(θ) is large for every θ ∈ Θ(1) \ Θ(0) ,
while at the same time avoiding erroneous rejection of H0 . In other words, a
good test will have small size, but large power.
The general hypothesis testing procedure is to fix α to be some small value
(often 0.05), so that the probability of erroneous rejection of H0 is limited.
In doing this, we are giving H0 precedence over H1. Given our specified α, we try to choose a test, defined by its rejection region C, to make ω(θ) as large as possible for θ ∈ Θ(1) \ Θ(0).

2.3.3.2 Likelihood ratio tests for nested hypotheses


Suppose that H0 and H1 both take the same parametric form, fY (y; θ) with
θ ∈ Θ(0) for H0 and θ ∈ Θ(1) for H1 , where Θ(0) and Θ(1) are alternative sets
of possible values for θ. A likelihood ratio test of H0 against H1 has a critical
region of the form
$$C = \left\{ y : \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} > k \right\} \tag{2.3}$$
where k is determined by α, the size of the test, so
$$\max_{\theta \in \Theta^{(0)}} P(y \in C; \theta) = \alpha.$$

Therefore, we will only reject H0 if H1 offers a distribution for Y1 , . . . , Yn


which makes the observed data much more probable than any distribution
under H0 . This is intuitively appealing and tends to produce good tests
(large power) across a wide range of examples.
In order to determine k in (2.3), we need to know the distribution of the
likelihood ratio, or an equivalent statistic, under H0 . In general, this will not
be available to us. However, we can make use of an important asymptotic
result.
First we notice that, as log is a strictly increasing function, the rejection
region is equivalent to
$$C = \left\{ y : 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right) > k' \right\}$$
where
$$\max_{\theta \in \Theta^{(0)}} P(y \in C; \theta) = \alpha.$$
Write
$$L_{01} \equiv 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right)$$

for the log-likelihood ratio test statistic. Provided that H0 is nested within
H1 , the following result provides a useful large-n approximation to the dis-
tribution of L01 .

Theorem 2.3. Suppose that H0: θ ∈ Θ(0) and H1: θ ∈ Θ(1), where Θ(0) ⊂ Θ(1). Let d0 = dim(Θ(0)) and d1 = dim(Θ(1)). Under H0, the distribution of L01 tends towards $\chi^2_{d_1 - d_0}$ as n → ∞.

Proof. First we note that in the case where θ is one-dimensional and θ = (θ), a Taylor series expansion of ℓ(θ) around the MLE θ̂ gives
$$\ell(\theta) = \ell(\hat{\theta}) + (\theta - \hat{\theta})U(\hat{\theta}) + \frac{1}{2}(\theta - \hat{\theta})^2 U'(\hat{\theta}) + \ldots$$
Now, U(θ̂) = 0, and if we approximate U′(θ̂) ≡ H(θ̂) by E[H(θ)] ≡ −I(θ), and also ignore higher order terms, we obtain
$$2[\ell(\hat{\theta}) - \ell(\theta)] = (\theta - \hat{\theta})^2 I(\theta).$$
As θ̂ is asymptotically N[θ, I(θ)−1], $(\theta - \hat{\theta})^2 I(\theta)$ is asymptotically $\chi^2_1$, and hence so is 2[ℓ(θ̂) − ℓ(θ)].
Similarly, it can be shown that when θ ∈ Θ, a multidimensional space, 2[ℓ(θ̂) − ℓ(θ)] is asymptotically $\chi^2_p$, where p is the dimension of Θ.
Now, suppose that H0 is true and θ ∈ Θ(0) and therefore θ ∈ Θ(1) . Further-
more, suppose that ℓ(θ) is maximised in Θ(0) by θ̂ (0) and is maximised in
Θ(1) by θ̂ (1) . Then

$$\begin{aligned}
L_{01} &\equiv 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right) \\
&= 2\log L(\hat{\theta}^{(1)}) - 2\log L(\hat{\theta}^{(0)}) \\
&= 2[\log L(\hat{\theta}^{(1)}) - \log L(\theta)] - 2[\log L(\hat{\theta}^{(0)}) - \log L(\theta)] \\
&= L_1 - L_0.
\end{aligned}$$

Therefore L1 = L01 + L0 and we know that, under H0, L1 has a $\chi^2_{d_1}$ distribution and L0 has a $\chi^2_{d_0}$ distribution. Furthermore, it is possible to show (although we will not do so here) that under H0, L01 and L0 are independent. It can also be shown that under H0 the difference L1 − L0 can be expressed as a quadratic form of normal random variables. Therefore, it follows that under H0, the log-likelihood ratio statistic L01 has a $\chi^2_{d_1 - d_0}$ distribution.

Example 2.10 (Bernoulli). y1 , . . . , yn are observations of Y1 , . . . , Yn , i.i.d.


Bernoulli(p) random variables. Suppose that we require a size α test of the
hypothesis H0 : p = p0 against the general alternative H1 : ‘p is unrestricted’
where α and p0 are specified.
Here θ = (p), Θ(0) = {p0 } and Θ(1) = (0, 1) and the log likelihood ratio
statistic is
$$L_{01} = 2n\bar{y}\log\left(\frac{\bar{y}}{p_0}\right) + 2n(1 - \bar{y})\log\left(\frac{1 - \bar{y}}{1 - p_0}\right).$$
As d1 = 1 and d0 = 0, under H0 , the log likelihood ratio statistic has an
asymptotic χ21 distribution. For a log likelihood ratio test, we only reject H0
in favour of H1 when the test statistic is too large (observed data are much
more probable under model H1 than under model H0 ), so in this case we
reject H0 when the observed value of the test statistic above is ‘too large’
to have come from a χ21 distribution. What we mean by ‘too large’ depends
on the significance level α of the test. For example, if α = 0.05, a common
choice, then we should reject H0 if the test statistic is greater than 3.84,
the 95% quantile of the χ21 distribution.
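The following minimal R sketch (an added illustration, not from the notes) computes L01 for simulated Bernoulli data and compares it with the χ²₁ critical value; the data and the value of p0 are assumptions made only for the example.

set.seed(4)
y <- rbinom(100, size = 1, prob = 0.55)   # simulated data (illustrative)
p0 <- 0.5
n <- length(y); ybar <- mean(y)
L01 <- 2 * n * ybar * log(ybar / p0) +
  2 * n * (1 - ybar) * log((1 - ybar) / (1 - p0))
L01 > qchisq(0.95, df = 1)                # TRUE means reject H0 at the 5% level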

2.3.3.3 Information criteria for model comparison


It is more difficult to use the likelihood ratio test of Section 2.3.3.2 to com-
pare two models if those models are not nested. An alternative approach
is to record some criterion measuring the quality of the model for each of a
candidate set of models, then choose the model which is the best according
to this criterion.
When we were estimating the unknown parameters θ of a model, we chose the
value which maximised the likelihood: that is, the value of θ that maximises
the probability of observing the data we actually saw. It is tempting to use
a similar system for choosing between two models, and to choose the model
which has the greater likelihood, under which the probability of seeing the
data we actually observed is maximised. However, if we do this we will always
end up choosing complicated models, which fit the observed data very closely,
but do not meet our requirement of parsimony.
For a given model depending on parameters θ ∈ Rp , let ℓ̂ = ℓ(θ̂) be the log-
likelihood function for that model evaluated at the MLE θ̂. It is not sensible
to choose between models by maximising ℓ̂ directly, and instead it is common to choose a model to maximise a criterion of the form

ℓ̂ − penalty,

where the penalty term will be large for complex models, and small for simple
models.
Equivalently, we may choose between models by minimising a criterion of the
form
−2ℓ̂ + penalty.
By convention, many commonly-used criteria for model comparison take this
form. For instance, the Akaike information criterion (AIC) is

AIC = −2ℓ̂ + 2p,

where p is the dimension of the unknown parameter in the candidate model,


and the Bayesian information criterion (BIC) is

BIC = −2ℓ̂ + log(n)p,

where n is the number of observations.
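As a minimal R sketch (not from the notes), AIC and BIC can be computed by hand from the maximised log-likelihood and compared with R's built-in functions; the built-in cars data set and the fitted model are used purely as an illustrative assumption.

fit <- lm(dist ~ speed, data = cars)      # cars is a built-in R data set
ll <- as.numeric(logLik(fit))
p <- attr(logLik(fit), "df")              # number of estimated parameters (incl. sigma^2)
n <- nobs(fit)
c(-2 * ll + 2 * p, AIC(fit))              # AIC by hand vs built-in
c(-2 * ll + log(n) * p, BIC(fit))         # BIC by hand vs built-in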


Chapter 3

Linear Models

3.1 Lecture 5: Linear Model Theory: Revision of MATH2010
3.1.1 The linear model
In practical applications, we often distinguish between a response variable
and a group of explanatory variables. The aim is to determine the pattern of
dependence of the response variable on the explanatory variables. We denote
the n observations of the response variable by y = (y1 , y2 , . . . , yn )T . These
are assumed to be observations of random variables Y = (Y1 , Y2 , . . . , Yn )T .
Associated with each yi is a vector xi = (xi1 , xi2 , . . . , xip )T of values of p
explanatory variables.
In a linear model, we assume that

$$\begin{aligned}
Y_i &= \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i \\
&= \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i \\
&= x_i^T \beta + \epsilon_i \\
&= [X\beta]_i + \epsilon_i, \quad i = 1, \ldots, n, \tag{3.1}
\end{aligned}$$
where $\epsilon_i \sim N(0, \sigma^2)$ independently,
$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

and β = (β1 , . . . , βp )T is a vector of fixed but unknown parameters describing


the dependence of Yi on xi . The four ways of describing the linear model in
(3.1) are equivalent, but the most economical is the matrix form

$$Y = X\beta + \epsilon, \tag{3.2}$$
where ϵ = (ϵ1, ϵ2, . . . , ϵn)T.
The n × p matrix X consists of known (observed) constants and is called the
design matrix. The ith row of X is xTi , the explanatory data corresponding
to the ith observation of the response. The jth column of X contains the n
observations of the jth explanatory variable.
The error vector ϵ has a multivariate normal distribution with mean vector 0
and variance covariance matrix σ 2 I, since Var(ϵi ) = σ 2 , and Cov(ϵi , ϵj ) = 0,
as ϵ1 , . . . , ϵn are independent of one another. It follows from (3.2) that the
distribution of Y is multivariate normal with mean vector Xβ and variance
covariance matrix σ 2 I, i.e. Y ∼ N (Xβ, σ 2 I).

3.1.2 Examples of linear model structure


Example 3.1 (The null model). If we do not include any variables xi in the
model, we have

Yi = β0 + ϵi , ϵi ∼ N (0, σ 2 ), i = 1, . . . , n,

so
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \beta = (\beta_0).$$
This is one (dummy) explanatory variable. In practice, this variable is present
in all models.

Example 3.2 (Simple linear regression). If we include a single variable xi


in the model, we might have

Yi = β0 + β1 xi + ϵi , ϵi ∼ N (0, σ 2 ) i = 1, . . . , n

so
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}.$$
There are two explanatory variables: the dummy variable and one ‘real’ vari-
able.

Example 3.3 (Polynomial regression). If we want to allow for a non-linear


impact of xi on the mean of Yi , we might model

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_{p-1} x_i^{p-1} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n,$$
so
$$X = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{p-1} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.$$
There are p explanatory variables: the dummy variable and one ‘real’ variable,
transformed to p − 1 variables.

Example 3.4 (Multiple regression). To include multiple explanatory vari-


ables, we might model

Yi = β0 +β1 xi1 +β2 xi2 +. . .+βp−1 xi p−1 +ϵi , ϵi ∼ N (0, σ 2 ), i = 1, . . . , n,

so
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1\,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2\,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n\,p-1} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.$$
There are p explanatory variables: the dummy variable and p − 1 ‘real’ vari-
ables.

Example 3.5 (One categorical explanatory variable). Suppose xi is a cate-


gorical variable, taking values in a set of k possible categories. For simplicity
of notation, we will give each category a number, and write xi ∈ {1, . . . , k}.
We wish to model

Yi = µxi + ϵi , ϵi ∼ N (0, σ 2 ), i = 1, . . . , n,

so that the mean of Yi is the same for all observations in the same category,
but differs for different categories.
We could rewrite this model to include an intercept, as

Yi = β0 + βxi + ϵi , ϵi ∼ N (0, σ 2 ), i = 1, . . . , n,

so that µj = β0 + βj , for j = 1, . . . , k. It is not possible to estimate all of


the β parameters separately, as they only affect the distribution through the
combination β0 + βj . Instead, we choose a reference category l, and set
βl = 0. The intercept term β0 then gives the mean for the reference category,
with βj giving the difference in mean between category j and the reference
category. In R, categorical variables are called factors, and by default the
reference category will be the first category when the names of the categories
(the levels of the factor) are sorted alphabetically.
We can rewrite the model as a form of multiple regression by first defining a
new explanatory variable zi

zi = (zi1 , . . . , zik )T ,

where
$$z_{ij} = \begin{cases} 1 & \text{if } x_i = j \\ 0 & \text{otherwise.} \end{cases}$$
zi is sometimes called the one-hot encoding of xi , as it contains precisely
one 1 (corresponding to the category xi ), and is 0 everywhere else. We then
have
Yi = β0 + β1 zi1 + β2 zi2 + . . . + βk zik + ϵi ,
so
$$X = \begin{pmatrix} 1 & z_{11} & z_{12} & \cdots & z_{1k} \\ 1 & z_{21} & z_{22} & \cdots & z_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & z_{n1} & z_{n2} & \cdots & z_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$
where each row of X will have two ones, and the remaining entries will be
zero.
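As a minimal R sketch (not from the notes), the design matrix R constructs for a categorical explanatory variable can be inspected with model.matrix(); the small data frame below is made up purely for illustration.

dat <- data.frame(y = c(5.1, 4.8, 6.2, 6.0, 7.3, 7.1),
                  group = factor(c("a", "a", "b", "b", "c", "c")))
model.matrix(y ~ group, data = dat)
# With the default treatment parametrisation, the intercept corresponds to the
# reference category "a", and groupb, groupc give differences from that category.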

Example 3.6 (Two categorical explanatory variables). Suppose we have


two categorical variables xi1 ∈ {1, . . . , k1 } and xi2 ∈ {1, . . . , k2 }. We might
consider a model

$$Y_i = \beta_0 + \beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n,$$
where
$$\beta = \left( \beta_0, \beta^{(1)}_1, \ldots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \ldots, \beta^{(2)}_{k_2} \right)^T,$$

where, as in Example 3.5, we choose reference categories l1 and l2 for each categorical variable, and set $\beta^{(1)}_{l_1} = \beta^{(2)}_{l_2} = 0$. The terms $\beta^{(1)}_j$ are called the main effects for the categorical variable $x_{i1}$, and $\beta^{(2)}_j$ are the main effects for $x_{i2}$.
We might also want to allow an interaction between xi1 and xi2 , letting

$$Y_i = \beta_0 + \beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \beta^{(1,2)}_{x_{i1}, x_{i2}} + \epsilon_i,$$
where
$$\beta = \left( \beta_0, \beta^{(1)}_1, \ldots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \ldots, \beta^{(2)}_{k_2}, \beta^{(1,2)}_{1,1}, \beta^{(1,2)}_{1,2}, \ldots, \beta^{(1,2)}_{k_1,k_2} \right)^T.$$
The terms $\beta^{(1,2)}_{j_1, j_2}$ are called the interaction effects. This model is equivalent to
$$Y_i = \mu_{x_{i1}, x_{i2}} + \epsilon_i,$$

allowing a different mean for each possible combination of categories. To


allow us to estimate the parameters, given reference categories l1 and l2 , we
set
$$\beta^{(1)}_{l_1} = \beta^{(2)}_{l_2} = 0; \quad \beta^{(1,2)}_{l_1, j} = 0, \; j = 1, \ldots, k_2; \quad \beta^{(1,2)}_{j, l_2} = 0, \; j = 1, \ldots, k_1.$$

As in Example 3.5, it is possible to rewrite the model with a design matrix


X, by using one-hot encoding of xi1 and xi2 .

3.1.3 Maximum likelihood estimation


The regression coefficients β1 , . . . , βp describe the pattern by which the re-
sponse depends on the explanatory variables. We use the observed data
y1 , . . . , yn to estimate this pattern of dependence.
The likelihood for a linear model is
$$L(\beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2 \right\}. \tag{3.3}$$
This is maximised with respect to (β, σ²) at
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
and
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left( y_i - x_i^T\hat{\beta} \right)^2.$$

The corresponding fitted values are


ŷ = X β̂ = X(X T X)−1 X T y
or
ŷi = xTi β̂, i = 1, . . . , n.

The residuals r = (r1, . . . , rn) are r = y − ŷ, or ri = yi − xTi β̂ for i = 1, . . . , n.


These residuals describe the variability in the observed responses y1 , . . . , yn
which has not been explained by the linear model. We call
$$D = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n \left( y_i - x_i^T\hat{\beta} \right)^2$$

the residual sum of squares or deviance for the linear model.
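The quantities above can be computed in R with lm(); the following minimal sketch (not from the notes) checks the closed-form β̂ against lm(), using the built-in cars data as an illustrative assumption.

fit <- lm(dist ~ speed, data = cars)      # cars is a built-in R data set
X <- model.matrix(fit); y <- cars$dist
solve(t(X) %*% X, t(X) %*% y)             # (X^T X)^{-1} X^T y
coef(fit)                                 # same values from lm()
c(sum(residuals(fit)^2), deviance(fit))   # residual sum of squares (deviance)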

3.1.4 Properties of the MLE


As Y is normally distributed, and β̂ = (X T X)−1 X T Y is a linear function
of Y , then β̂ must also be normally distributed. We have E(β̂) = β and
Var(β̂) = σ 2 (X T X)−1 , so
β̂ ∼ N (β, σ 2 (X T X)−1 ).
It is possible to prove (although we shall not do so here) that
$$\frac{D}{\sigma^2} \sim \chi^2_{n-p},$$
which implies that
$$E(\hat{\sigma}^2) = \frac{n - p}{n}\sigma^2,$$
so the maximum likelihood estimator is biased for σ² (although still asymptotically unbiased, as (n − p)/n → 1 as n → ∞). We often use the unbiased estimator of σ²
$$\tilde{\sigma}^2 = \frac{D}{n - p} = \frac{1}{n - p}\sum_{i=1}^n r_i^2.$$
The denominator n − p, the number of observations minus the number of linear coefficients in the model, is called the degrees of freedom of the model.
Therefore, we estimate the residual variance by the deviance divided by the
degrees of freedom.

3.1.5 Comparing linear models


If we have a set of competing linear models which might explain the depen-
dence of the response on the explanatory variables, we will want to determine
which of the models is most appropriate.
As described previously, we proceed by comparing models pairwise using a
likelihood ratio test. For linear models this kind of comparison is restricted
to situations where one of the models, H0 , is nested in the other, H1 . This
usually means that the explanatory variables present in H0 are a subset of
those present in H1 . In this case model H0 is a special case of model H1 ,
where certain coefficients are set equal to zero. We let θ represent the collec-
tion of linear parameters for model H1 , together with the residual variance
σ 2 , and let Θ(1) be the unrestricted parameter space for θ. Then Θ(0) is
the parameter space corresponding to model H0 , i.e. with the appropriate
coefficients constrained to zero.
We will assume that model H1 contains p linear parameters and model H0 a
subset of q < p of these. Without loss of generality, we can think of H1 as
the model
$$Y_i = \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n,$$

and H0 being the same model with

βq+1 = βq+2 = · · · = βp = 0.

Now, a likelihood ratio test of H0 against H1 has a critical region of the form
$$C = \left\{ y : \frac{\max_{(\beta, \sigma^2) \in \Theta^{(1)}} L(\beta, \sigma^2)}{\max_{(\beta, \sigma^2) \in \Theta^{(0)}} L(\beta, \sigma^2)} > k \right\}$$

where k is determined by α, the size of the test, so

$$\max_{\theta \in \Theta^{(0)}} P(y \in C; \beta, \sigma^2) = \alpha.$$

For a linear model,


$$L(\beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2 \right\}.$$
This is maximised with respect to (β, σ²) at β = β̂ and σ² = σ̂² = D/n. Therefore
$$\max_{\beta, \sigma^2} L(\beta, \sigma^2) = (2\pi D/n)^{-\frac{n}{2}} \exp\left\{ -\frac{n}{2D}\sum_{i=1}^n (y_i - x_i^T\hat{\beta})^2 \right\} = (2\pi D/n)^{-\frac{n}{2}} \exp\left\{ -\frac{n}{2} \right\}.$$

This form applies for both θ ∈ Θ(0) and θ ∈ Θ(1) , with only the model
changing. Let the deviances under models H0 and H1 be denoted by D0 and
D1 respectively. Then the critical region for the likelihood ratio test is of the
form
$$\frac{(2\pi D_1/n)^{-\frac{n}{2}}}{(2\pi D_0/n)^{-\frac{n}{2}}} > k,$$
so
$$\left(\frac{D_0}{D_1}\right)^{\frac{n}{2}} > k,$$
and
$$\left(\frac{D_0}{D_1} - 1\right)\frac{n - p}{p - q} > k'$$
for some k′. Rearranging,
$$\frac{(D_0 - D_1)/(p - q)}{D_1/(n - p)} > k'.$$

We refer to the left hand side of this inequality as the F -statistic. We reject
the simpler model H0 in favour of the more complex model H1 if F is ‘too
large’.
As we have required H0 to be nested in H1 , F ∼ Fp−q, n−p when H0 is true.
To see this, note that
$$\frac{D_0}{\sigma^2} = \frac{D_0 - D_1}{\sigma^2} + \frac{D_1}{\sigma^2}.$$
Furthermore, under H0, $D_1/\sigma^2 \sim \chi^2_{n-p}$ and $D_0/\sigma^2 \sim \chi^2_{n-q}$. It is possible to show (although we will not do so here) that under H0, (D0 − D1)/σ² and D1/σ² are independent. Therefore, from the properties of the chi-squared distribution, it follows that under H0, $(D_0 - D_1)/\sigma^2 \sim \chi^2_{p-q}$, and F ∼ Fp−q, n−p.
Therefore, the precise critical region can be evaluated given the size, α, of the test. We reject H0 in favour of H1 when
$$\frac{(D_0 - D_1)/(p - q)}{D_1/(n - p)} > k,$$
where k is the 100(1 − α)% quantile of the Fp−q, n−p distribution.
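In R, this comparison is carried out by anova() on two nested fits; a minimal sketch (not from the notes) follows, using the built-in mtcars data and these particular variables as illustrative assumptions.

fit0 <- lm(mpg ~ wt, data = mtcars)              # simpler model H0
fit1 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # larger model H1
anova(fit0, fit1)    # F = [(D0 - D1)/(p - q)] / [D1/(n - p)] and its p-value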


Chapter 4

Linear Mixed Models

In this chapter, we introduce linear mixed models (LMMs), which contain random effects. These provide a method for analysing complex datasets with features such as a multilevel/hierarchical structure, longitudinal measurements, or correlation/dependence, where the linear models of Chapter 3 cannot be applied. We will study both the general concepts and interpretation of LMMs and some simple theory.

4.1 Lecture 6: Introduction to Linear Mixed Models
4.1.1 Motivations of LMMs
In the statistical models mentioned earlier, it has generally been assumed that all observations are independent of each other. In particular, in the linear regression model, we assume that the error terms independently follow N(0, σ²), which leads to independent response random variables {Y1, Y2, . . . , Yn}.
However, in many practical data problems, such as clustered data or lon-
gitudinal data, the above assumption is not valid. We need to consider
more sophisticated linear models, in which observations are allowed to be
correlated. Linear mixed models, sometimes also referred to as linear mixed effects models, are among the most popular models that can incorporate correlations between responses. They can be viewed as an extension of linear regression for cross-sectional (panel) data, introducing random effects to account for within-group correlations.
We introduce some motivating examples to help you understand LMMs.
More mathematical details will be presented in subsequent sections.

Example 4.1 (Clustered data). In clustered data, each response is measured on a datapoint, and each datapoint belongs to a group (cluster). For example, consider the maths test scores of all Year 1 students in a primary school, where students are grouped by classroom. Each classroom forms a cluster, and it is more sensible to assume that the scores within each cluster are not independent but somehow correlated, because students in the same classroom take exactly the same courses.

Example 4.2 (Longitudinal data). In longitudinal data, each response is measured at several time points, and the number of time points is fixed. For example, consider the number of sales of different products in each month of the year 2021. Here we have 12 time points, and at each one we assume the sales of different products are correlated.

In linear regression models, all of the regression coefficients β = (β1, . . . , βp)T are fixed, i.e., they are the same for all observations x1, x2, . . . , xn. Therefore we call this a fixed effects model.
In contrast, if the regression coefficients are random variables, we call it a random effects model. A linear mixed model, by definition, is a linear model with both fixed effects and random effects. Usually it refers to a regression model in which data can be grouped according to several observed factors. Such groups can be clusters or data collected at the same time point, as shown in the above examples. The mixture of fixed and random effects allows us to make inference or prediction for a specific group, which is essential in some applications.
Random effects can be thought of as missing information on individual sub-
jects that, were it available, would be included in the statistical model. To
reflect our not knowing what values to use for the random effects, we model
them as a random sample from a distribution. In this way they induce correlation amongst repeated measurements on the same subjects, or amongst measurements in a cluster.

4.1.2 Basics of Linear Mixed Models


In this section, we present linear mixed models in a general form. Let
$$y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i}), \quad i = 1, 2, \ldots, m,$$
be the ni observed responses within group i, where m is the number of groups, so that the total number of observations is $n = \sum_{i=1}^m n_i$. As usual, yi = (yi1, yi2, . . . , yini) are assumed to be observations of random variables Yi = (Yi1, Yi2, . . . , Yini).

A general LMM is then as follows:

Yij = xTij β + uTij γi + ϵij , j = 1, . . . , ni ; i = 1, . . . , m, (4.1)

where, as in the linear model, xij is the p × 1 vector of explanatory variables corresponding to the j-th observation in the i-th group, associated with the fixed effects, and we write β = (β1, . . . , βp)T for the p × 1 parameter vector of fixed effects. We assume the random errors ϵij ∼ N(0, σ²) are all independent.
Unlike the linear model, here we also have uij, the q × 1 vector of explanatory variables corresponding to the j-th observation in the i-th group, which is associated with the random effects. Usually uij is either some new covariates or a subset of xij. We write γi = (γi1, . . . , γiq)T for the q × 1 vector of random effects in the i-th group.
As introduced earlier, unlike β, which contains the parameters of interest, γi is a vector of random coefficients. For the rest of this section, we assume that
$$\gamma_i \sim N(0, D),$$
where D is a q × q covariance matrix of the random effects. The distributional assumption on the coefficients γi resembles the specification of a prior in Bayesian analysis. Some of the computation in LMMs will also correspond to Bayesian methods, as we will see later.

Usually we assume the variance components in the LMM are indexed by (depend on) some parameters θ, which will be our prime target of statistical inference about the random effects. As a result we can write D = Dθ, and σ² is an element of θ.
Let ϵi = (ϵi1, . . . , ϵini)T represent the random errors within the i-th group, which are independent of γi. In LMMs, we also assume independence between groups, meaning that γ1, γ2, . . . , γm, ϵ1, ϵ2, . . . , ϵm are all independent.
The LMM in (4.1) incorporates two sources of randomness: within-group randomness and between-group randomness. Thus, it can be interpreted as a two-stage hierarchy:
• stage 1: specifies the within-group randomness, which is given by fixing
i and letting j = 1, . . . , ni ;
• stage 2: specifies the between-group randomness, which is given by
letting i = 1, . . . , m.

Example 4.3 (With-in group correlations in LMMs). We consider the fol-


lowing simple LMMs to illustrate the correlation introduced by the random
effect in the model:

Yij = β0 + γi + ϵij , i = 1, . . . , m; j = 1, . . . , ni ,

where the random effect and the error satisfy:

γi ∼ N (0, σγ2 ), ϵij ∼ N (0, σϵ2 ).

In this case, we have the variance parameters θ = (σγ, σϵ). This model only includes an intercept, with no covariates in the fixed effects. It can be shown that the correlation between the responses within the i-th group {Yi1, Yi2, . . . , Yini} is:

r^i_{jk} = corr(Yij, Yik) = Cov(γi + ϵij, γi + ϵik) / √[ Var(γi + ϵij) Var(γi + ϵik) ] = σγ2 / (σγ2 + σϵ2).

Thus, the random effect γi introduces correlation between the within-group responses. If there is no random effect (i.e., σγ = 0), there is no such correlation.

If the within-group randomness is much smaller than the between-group randomness, i.e., σϵ2 ≪ σγ2, the correlation r^i_{jk} becomes very high (close to 1).
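As a quick check of this formula (not part of the original derivation), the R sketch below simulates data from this random-intercept model with two observations per group and compares the empirical correlation to σγ2/(σγ2 + σϵ2); the variance values chosen are arbitrary.

```r
# Simulate Y_ij = beta_0 + gamma_i + eps_ij with two observations per group
# and check the within-group correlation.
set.seed(1)
m <- 5000
beta0 <- 10
sigma_gamma <- 2   # between-group standard deviation
sigma_eps   <- 1   # within-group (error) standard deviation

gamma_i <- rnorm(m, 0, sigma_gamma)             # one random effect per group
y1 <- beta0 + gamma_i + rnorm(m, 0, sigma_eps)  # first observation in group i
y2 <- beta0 + gamma_i + rnorm(m, 0, sigma_eps)  # second observation in group i

cor(y1, y2)                                     # empirical within-group correlation
sigma_gamma^2 / (sigma_gamma^2 + sigma_eps^2)   # theoretical value, here 0.8
```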

To present the i-th group of the LMM in matrix form, let Yi = (Yi1, Yi2, . . . , Yini)T, let Xi = (xi1, xi2, . . . , xini)T be an ni × p design matrix, and let Ui = (ui1, ui2, . . . , uini)T be an ni × q design matrix. Therefore, we can write

Yi = Xiβ + Uiγi + ϵi = Xiβ + ϵ∗i,  i = 1, . . . , m,

where ϵ∗i = Uiγi + ϵi can be regarded as the random “errors” in the i-th group. Note that ϵi ∼ N(0, σ2Ini). Letting Vi = UiDUiT + σ2Ini, it is straightforward to see that ϵ∗i ∼ N(0, Vi), which leads to the marginal model:

Yi ∼ N(Xiβ, Vi).

If we want to present all m groups in one big matrix formula, we can simply stack the group-level quantities:

Y = (Y1T, Y2T, . . . , YmT)T ∈ Rn,   X = (X1T, X2T, . . . , XmT)T ∈ Rn×p,

and

γ = (γ1T, γ2T, . . . , γmT)T ∈ Rmq,   ϵ = (ϵ1T, ϵ2T, . . . , ϵmT)T ∈ Rn.

Moreover, let U be the block-diagonal matrix

U = blockdiag(U1, U2, . . . , Um) ∈ Rn×mq,

whose i-th diagonal block is Ui and whose off-diagonal blocks are zero. Therefore, we can write the LMM for all groups as

Y = Xβ + Uγ + ϵ, (4.2)

where ϵ ∼ N(0, σ2In) and γ ∼ N(0, G), with

G = blockdiag(D, D, . . . , D) ∈ Rmq×mq,

the block-diagonal matrix with m copies of D along its diagonal.
Similarly, if we express ϵ∗ = U γ + ϵ, we can write (4.2) as

Y = Xβ + ϵ∗ ,

where ϵ∗ ∼ N(0, V) with V = UGUT + σ2In. We use y to denote the observations of the random vector Y.

As introduced earlier, the covariance matrices G and V depend on the parameters θ. We can further denote them as Gθ and Vθ; for simplicity, we omit the subscript where there is no ambiguity.

4.2 Lecture 7: LMMs parameter estimation I

Statistical inference for an LMM is typically based on maximum likelihood, and depends on whether the variance component parameters θ are known or not. In this lecture, let us first assume that θ is known, which means that σ2, D, G and V are all known.

4.2.1 Estimation of β
We rewrite the linear mixed model in equation (4.2) as:

Y = Xβ + ϵ∗ , ϵ∗ ∼ N (0, V ). (4.3)

It looks very similar to the linear model in Chapter 3. However, we cannot directly apply the MLE of β from Section 3.1.3 here, because V ≠ σ2In.
To solve this problem, note that

V −1/2 ϵ∗ ∼ N (0, V −1/2 V V −1/2 ) = N (0, In ).



Therefore, multiplying both sides of (4.3) by V −1/2, we have

V −1/2 Y = V −1/2 Xβ + V −1/2 ϵ∗. (4.4)

Letting Y ′ = V −1/2 Y, X ′ = V −1/2 X and ϵ′ = V −1/2 ϵ∗, we can rewrite equation (4.4) as:

Y ′ = X ′β + ϵ′,  ϵ′ ∼ N(0, In), (4.5)

which is an ordinary multiple linear regression with i.i.d. errors. As in Section 3.1.3, the MLE for β is

β̂ = (X ′T X ′)−1 X ′T y ′ = (XT V −1 X)−1 XT V −1 y. (4.6)

This is the BLUE (best linear unbiased estimator) of β, as obtained for the linear model, given knowledge of θ.

4.2.2 “Estimation” of γ
The coefficients γ are random coefficients, not parameters of interest. Nevertheless, we sometimes need an “estimator” γ̂ as an intermediate quantity in our statistical inference. To this end, note that

Cov(Y, γ) = Cov(Xβ + Uγ + ϵ, γ) = Cov(Uγ, γ) = UG.

As a result, the joint distribution of (Y, γ) is multivariate normal, with mean vector (Xβ, 0) and partitioned covariance matrix

( V     UG )
( GUT   G  ).

Before proceeding, we first present the following lemma.

Lemma 4.1. Suppose we partition the random vector X ∼ N(µ, Σ) into two random vectors X1 and X2, and partition its mean vector and covariance matrix in a corresponding manner, i.e.,

µ = (µ1T, µ2T)T,   Σ = ( Σ11  Σ12 ; Σ21  Σ22 ),

such that X1 ∼ N (µ1 , Σ11 ), X2 ∼ N (µ2 , Σ22 ) and Cov(Xi , Xj ) = Σij for
i, j ∈ {1, 2}.
The conditional distribution of X2 given known values for X1 = x1 is multi-
variate normal N (µX2 |x1 , ΣX2 |x1 ), where

µX2 |x1 = µ2 + Σ21 Σ−1


11 (x1 − µ1 ),
ΣX2 |x1 = Σ22 − Σ21 Σ−1
11 Σ12 .

Proof. Consider Z = X2 − Σ21 Σ11−1 X1, and note that

Cov(Z, X1) = Cov(X2, X1) − Cov(Σ21 Σ11−1 X1, X1)
           = Σ21 − Σ21 Σ11−1 Σ11
           = 0.

As zero covariance implies independence for jointly normal random vectors, Z and X1 are independent. Therefore

E(X2 | x1) = E(Z + Σ21 Σ11−1 x1 | x1)
           = E(Z | x1) + E(Σ21 Σ11−1 x1 | x1)
           = E(Z) + Σ21 Σ11−1 x1
           = µ2 + Σ21 Σ11−1 (x1 − µ1).

Similarly, we have

Var(X2 | x1) = Var(Z + Σ21 Σ11−1 x1 | x1)
            = Var(Z | x1)     (the term Σ21 Σ11−1 x1 is constant given x1)
            = Var(Z)          (since Z and X1 are independent)
            = Σ22 + Σ21 Σ11−1 Σ11 Σ11−1 Σ12 − 2 Σ21 Σ11−1 Σ12
            = Σ22 − Σ21 Σ11−1 Σ12,

which completes the proof.

Applying Lemma 4.1 implies that, conditional on Y = y,

γ | Y = y ∼ N( µγ|y, Σγ|y ).

Hence, given the value of β, we have

E(γ | Y = y) = µγ|y = GUT V −1 (y − Xβ).

This suggests the following estimator:

γ̂ = GUT V −1 (y − Xβ). (4.7)

When β is not given, we can replace β with its estimator β̂ derived in (4.6). It is straightforward to see that E(γ̂ | Y = y) = E(γ | Y = y). γ̂ is sometimes referred to as the maximum a posteriori (MAP) estimate, or the predicted random effects.

In the exercise class we will prove that the above estimators β̂ and γ̂ are also the joint maximisers of the log-likelihood of (Y T, γ T), with respect to β and γ, meaning that they are MLEs (given knowledge of θ).

Example 4.4 (Estimation of a simple 2-covariate LMM). Consider the fol-


lowing linear mixed model:
Yij = β1 + β2 xij + γ1i + γ2i xij + ϵij , i = 1, . . . , m; j = 1, . . . , ni ,
where ϵij ∼ N (0, σ 2 ) are random errors and (γ1i , γ2i )T ∼ N (0, I2 ) are two
random effects.
In this example, we have D = I2, G = I2m, and

Ui = Xi = ( (1, xi1), (1, xi2), . . . , (1, xini) )T, an ni × 2 matrix,

with U = blockdiag(X1, X2, . . . , Xm) ∈ Rn×2m. Therefore, we have

V = UGUT + σ2In = blockdiag( X1X1T + σ2In1, X2X2T + σ2In2, . . . , XmXmT + σ2Inm ).

Combining (4.6) and (4.7) gives the estimators

β̂ = (β̂1, β̂2)T = [ Σ_{i=1}^m XiT (XiXiT + σ2Ini)−1 Xi ]−1 Σ_{i=1}^m XiT (XiXiT + σ2Ini)−1 yi

and

γ̂i = (γ̂i1 , γ̂i2 )T = XiT (Xi XiT + σ 2 Ini )−1 (yi − Xi β̂).
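The following R sketch (not part of the notes) simulates data from the model of Example 4.4 with σ2 known and evaluates the formulas above; the group sizes and parameter values are purely illustrative.

```r
# Computing beta-hat (4.6) and gamma-hat_i (4.7) for Example 4.4,
# assuming sigma^2 is known; simulated data, illustrative values only.
set.seed(42)
m <- 10; n_i <- 5; sigma2 <- 1
beta <- c(1, 2)                             # true fixed effects

groups <- rep(1:m, each = n_i)
X <- cbind(1, rnorm(m * n_i))               # stacked design matrix (intercept, x)
gamma_true <- matrix(rnorm(2 * m), m, 2)    # gamma_i ~ N(0, I_2)
y <- as.vector(X %*% beta) +
  rowSums(X * gamma_true[groups, ]) +       # u_ij^T gamma_i, since U_i = X_i
  rnorm(m * n_i, sd = sqrt(sigma2))

# Accumulate the two sums appearing in the formula for beta-hat
A <- matrix(0, 2, 2); b <- rep(0, 2)
Vinv <- vector("list", m)
for (i in 1:m) {
  Xi <- X[groups == i, , drop = FALSE]
  Vinv[[i]] <- solve(Xi %*% t(Xi) + sigma2 * diag(n_i))  # V_i^{-1}
  A <- A + t(Xi) %*% Vinv[[i]] %*% Xi
  b <- b + t(Xi) %*% Vinv[[i]] %*% y[groups == i]
}
beta_hat <- solve(A, b)                     # equation (4.6)

# Predicted random effects: gamma-hat_i = X_i^T V_i^{-1} (y_i - X_i beta-hat)
gamma_hat <- t(sapply(1:m, function(i) {
  Xi <- X[groups == i, , drop = FALSE]
  as.vector(t(Xi) %*% Vinv[[i]] %*% (y[groups == i] - Xi %*% beta_hat))
}))
beta_hat
```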

4.3 Lecture 8: LMMs parameter estimation II

When θ is not known, we have to estimate it together with β using maximum likelihood and related estimation methods. In this section we re-introduce the subscript θ in the covariance matrices.

4.3.1 The log-likelihood function


The likelihood for the parameters β and θ of the linear mixed model is in principle based on the joint probability density function of Y = (Y1T, . . . , YmT)T, i.e., fY(y; β, θ). Since Y ∼ N(Xβ, Vθ), its joint probability density function is

(2π)−n/2 |Vθ|−1/2 exp[ −(y − Xβ)T Vθ−1 (y − Xβ)/2 ].

This likelihood expression looks simple; however, it requires the inverse of the n × n matrix Vθ. The computational cost is O(n3), which is very expensive. In practice we usually use an alternative approach.
From the standard properties of conditional densities, we have

f(y, γ; β, θ) = f(y | γ; β, θ) f(γ; β, θ).

Since Y | γ ∼ N(Xβ + Uγ, σ2In), we have

f(y | γ; β, θ) = (2πσ2)−n/2 exp[ −(y − Xβ − Uγ)T (y − Xβ − Uγ) / (2σ2) ].

Moreover, as γ ∼ N(0, Gθ), we have

f(γ; β, θ) = (2π)−mq/2 |Gθ|−1/2 exp( −γT Gθ−1 γ / 2 ).

As a result, consider evaluating the likelihood f(y; β, θ) by integrating out γ in f(y, γ; β, θ):

L(β, θ) = fY(y; β, θ) = ∫ f(y, γ; β, θ) dγ = ∫ exp[ log f(y, γ; β, θ) ] dγ.

Here we take an additional log and exponential in order to apply the following trick — consider a Taylor expansion of log f(y, γ; β, θ) about γ̂, where γ̂ is the estimator obtained in (4.7), which is also the maximiser of f(y, γ; β, θ), as proved in the exercise class. Hence we have

log f(y, γ; β, θ) = log f(y, γ̂; β, θ) + (1/2)(γ − γ̂)T [ ∂2 log f(y, γ; β, θ)/∂γ∂γT |γ=γ̂ ] (γ − γ̂)
                  = log f(y, γ̂; β, θ) − (1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 ) (γ − γ̂).

The first-order term vanishes because γ̂ maximises log f(y, γ; β, θ) over γ, and there are no further remainder terms because the higher order derivatives of log f(y, γ; β, θ) with respect to γ are exactly zero, since it is a polynomial of order 2 in γ. Hence, we arrive at

L(β, θ) = ∫ exp[ log f(y, γ̂; β, θ) − (1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 )(γ − γ̂) ] dγ
        = f(y, γ̂; β, θ) ∫ exp[ −(1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 )(γ − γ̂) ] dγ
        = f(y | γ̂; β, θ) f(γ̂; β, θ) ∫ exp[ −(1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 )(γ − γ̂) ] dγ.

Consider

(2π)−mq/2 | UT U/σ2 + Gθ−1 |1/2 exp[ −(1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 )(γ − γ̂) ],

which is the probability density function of N( γ̂, ( UT U/σ2 + Gθ−1 )−1 ), so it must integrate to 1. This implies:

∫ exp[ −(1/2)(γ − γ̂)T ( UT U/σ2 + Gθ−1 )(γ − γ̂) ] dγ = (2π)mq/2 | UT U/σ2 + Gθ−1 |−1/2. (4.8)

Combining the formulas for f(y | γ; β, θ) and f(γ; β, θ) with (4.8), we finally have

L(β, θ) = (2πσ2)−n/2 |Gθ|−1/2 | UT U/σ2 + Gθ−1 |−1/2 · exp( −γ̂T Gθ−1 γ̂ / 2 ) · exp[ −(y − Xβ − Uγ̂)T (y − Xβ − Uγ̂) / (2σ2) ]. (4.9)

This likelihood function only involves the inverse of Gθ, which is an mq × mq (block-diagonal) matrix, so the computational cost is much smaller.

Taking logs, the log-likelihood is

ℓ(β, θ) = −(y − Xβ − Uγ̂)T (y − Xβ − Uγ̂) / (2σ2) − γ̂T Gθ−1 γ̂ / 2
          − (1/2) log |Gθ| − (1/2) log | UT U/σ2 + Gθ−1 | − (n/2) log(2πσ2). (4.10)

4.3.2 Profile likelihood method


Unlike the MLE in the linear model, we do not have a closed-form solution for β̂ when θ is unknown. To solve the optimisation problem of (4.10) with respect to β and θ, we can update their values iteratively to seek the MLE by numerical methods.

For this specific problem, the usual approach is first to maximise the log-likelihood ℓ(β, θ) with respect to β for a given value of θ. The optimised solution β̂(θ) is obtained as a function of θ. We can then substitute this solution into the log-likelihood function as

ℓp(θ) = ℓ(β̂(θ), θ),

which is a function of θ only, and we then just need to optimise it with respect to θ. Once the maximiser θ̂p is found, we estimate β by β̂p = β̂(θ̂p). This simple method is called profile likelihood, as we have profiled out β in the log-likelihood function.

4.3.3 Restricted maximum likelihood method


Recall the MLE of σ2 in the linear model, which is

σ̂2 = (1/n) Σ_{i=1}^n (yi − xiT β̂)2.

We know this is a biased estimator, as E(σ̂2) = σ2 (n − p)/n, and in practice we often use the unbiased alternative

σ̃2 = (1/(n − p)) Σ_{i=1}^n (yi − xiT β̂)2.

This is a typical problem with MLEs of variance components, and it gets worse in LMMs as the number of fixed effects increases. Therefore, Restricted Maximum Likelihood (REML) was proposed to alleviate the problem. The idea is to include only the variance components in the REML criterion, while β, which parameterises the fixed effect terms in the LMM, is estimated in a second step.

To this end, we can think of treating β as a vector of random coefficients as well. This is similar to the Bayesian framework, where we can give a prior distribution to the parameters. Here we can apply the improper uniform prior distribution with f(β) ∝ 1 for all β ∈ Rp. As a result,

f(y; θ) = ∫ f(y, β; θ) dβ = ∫ f(y | β; θ) f(β) dβ = ∫ f(y; β, θ) dβ.

Using exactly the same techniques as in Section 4.3.1, we can obtain the restricted log-likelihood function ℓr(θ); details are omitted here. The resulting maximiser is denoted by θ̂r, and we can then calculate β̂r based on the value of θ̂r.

REML accounts for the loss of degrees of freedom from estimating the fixed effects, and results in less biased estimates of the random effect variances. The estimates of θ are invariant to the value of β, and are less sensitive to outliers in the data than the MLE.

In practice, we can use standard numerical methods, such as Newton-Raphson or an Expectation-Maximisation (EM) algorithm, to maximise the profile likelihood or the restricted likelihood and hence obtain the estimates. There are R packages ready to use for LMMs, such as lme4, which we will illustrate in the computer lab.
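As a brief preview of the computer lab, a random intercept-and-slope model such as Example 4.4 can be fitted with lme4 as sketched below; df is a hypothetical data frame with columns y, x and group.

```r
library(lme4)

fit_reml <- lmer(y ~ x + (1 + x | group), data = df, REML = TRUE)   # REML
fit_ml   <- lmer(y ~ x + (1 + x | group), data = df, REML = FALSE)  # ML (profile likelihood)

fixef(fit_reml)    # estimated fixed effects, beta-hat
ranef(fit_reml)    # predicted random effects, gamma-hat_i
VarCorr(fit_reml)  # estimated variance components, theta-hat
```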

4.4 Lecture 9: Statistical Inference of LMMs


4.4.1 Confidence intervals
First, let us consider the case where θ is known; hence D, σ2 and the covariance matrix V are fixed and known. Since Y ∼ N(Xβ, V), the estimator β̂ = (XT V −1 X)−1 XT V −1 Y is normally distributed with mean β and covariance given by

Cov(β̂) = (XT V −1 X)−1 XT V −1 Cov(Y) V −1 X (XT V −1 X)−1 = (XT V −1 X)−1.

As a result, the j-th diagonal element of the above matrix is σj2 = [(XT V −1 X)−1]jj. This equals Var(β̂j), which leads to

β̂j ± z1−α/2 √{ [(XT V −1 X)−1]jj } (4.11)

as the 100(1 − α)% confidence interval for βj.

If θ is not known, we can use one of its estimates, e.g. θ̂p or θ̂r, instead, which leads to an approximation to the covariance matrix, V(θ̂), and to the fixed effect coefficients, β̂(θ̂) = (XT V(θ̂)−1 X)−1 XT V(θ̂)−1 Y. Therefore, (4.11) becomes

β̂j(θ̂) ± z1−α/2 √{ [(XT V(θ̂)−1 X)−1]jj }, (4.12)

which gives an approximate 100(1 − α)% confidence interval for βj.
We would expect that [(XT V(θ̂)−1 X)−1]jj underestimates Var(β̂j), since the variation in θ̂ is not taken into account. A more sophisticated approach is to use Bayesian analysis, such as MCMC, to approximate this confidence interval.
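In R, the interval (4.12) can be computed from an lme4 fit; a minimal sketch using the hypothetical fit_reml object above:

```r
# Approximate 95% Wald confidence intervals for the fixed effects,
# using vcov() as the estimate of (X^T V(theta-hat)^{-1} X)^{-1}.
beta_hat <- fixef(fit_reml)
se <- sqrt(diag(as.matrix(vcov(fit_reml))))
cbind(lower = beta_hat - 1.96 * se, upper = beta_hat + 1.96 * se)
```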
Confidence intervals for θ rely on the large-sample theory that, approximately as n → ∞,

θ̂p ∼ N(θ, Îp−1) and θ̂r ∼ N(θ, Îr−1),

where Îp and Îr are the observed information matrices of the log profile likelihood and the log restricted likelihood, respectively, i.e.,

Îp = − ∂2ℓp(θ)/∂θ∂θT |θ̂p   and   Îr = − ∂2ℓr(θ)/∂θ∂θT |θ̂r.

Therefore we can build approximate confidence intervals for θ based on this result.

4.4.2 Hypothesis testing


For hypothesis testing, suppose we want to perform a test on the fixed effects β, such as:

H0 : Cβ = c against H1 : Cβ ≠ c,

where C is an r × p constant matrix with rank r, and c is an r-dimensional vector.
 
Since β̂ ∼ N(β, (XT V −1 X)−1), under the null hypothesis we have that

Cβ̂ − c ∼ N( 0, C(XT V −1 X)−1 CT ).

Hence

[ C(XT V −1 X)−1 CT ]−1/2 (Cβ̂ − c) ∼ N(0, Ir).

This implies

W = (Cβ̂ − c)T [ C(XT V −1 X)−1 CT ]−1 (Cβ̂ − c) ∼ χ2r.

If H1 is true, then [C(XT V −1 X)−1 CT]−1/2 (Cβ̂ − c) ∼ N( [C(XT V −1 X)−1 CT]−1/2 (Cβ − c), Ir ), so W follows a non-central chi-squared distribution whose distribution is shifted to the right, with non-centrality parameter (Cβ − c)T [C(XT V −1 X)−1 CT]−1 (Cβ − c).

Therefore, we can employ W as the test statistic for H0 against H1. This is the so-called Wald test. We reject H0 if W > χ2r,1−α, where χ2r,1−α is the 100(1 − α)% quantile of the χ2r distribution.
Again, if θ is unknown, we can use its estimate θ̂p, and replace V and β̂ in the above expressions with V(θ̂p) and β̂(θ̂p), respectively. Note that the REML method cannot be used to compare models with different fixed effect structures, because ℓr(θ) is not comparable between models with different fixed effects.

Example 4.5 (Comparing models). Suppose we want to compare two linear mixed models that differ in their fixed effects β, i.e., one of the models, H0, is nested

in the other, H1 . Again, we assume model H1 contains p linear parameters


β1 , β2 , . . . , βp and model H0 contains a subset of q < p of these, i.e., β1 , . . . , βq .
Therefore our hypothesis is:

H0 : βq+1 = βq+2 = · · · = βp = 0 against H1 : βq+1 , βq+2 , . . . , βp are not all 0

We can simply apply the above Wald test by setting

C = ( 0(p−q)×q  Ip−q ),

the (p − q) × p matrix whose first q columns are zero and whose last p − q columns form the identity matrix Ip−q, and c = 0, with r = p − q.
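A minimal R sketch of this Wald test, using the hypothetical lme4 fit from earlier (p = 2) and testing H0: β2 = 0, so that q = 1 and r = 1:

```r
# Wald test of H0: C beta = c with C = (0, 1), c = 0.
C  <- matrix(c(0, 1), nrow = 1)
c0 <- 0
beta_hat <- fixef(fit_ml)          # use the ML fit when comparing fixed effects
Vbeta    <- as.matrix(vcov(fit_ml))

W <- t(C %*% beta_hat - c0) %*%
     solve(C %*% Vbeta %*% t(C)) %*%
     (C %*% beta_hat - c0)
pchisq(as.numeric(W), df = nrow(C), lower.tail = FALSE)   # p-value from chi^2_r
```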
Chapter 5

Generalised Linear Models

5.1 Lecture 10: The Exponential family


5.1.1 Regression models for non-normal data
The linear model of Chapter 3 assumes each response Yi ∼ N (µi , σ 2 ), where
the mean µi depends on explanatory variables through µi = xTi β. For many
types of data, this assumption of normality of the response may not be jus-
tified. For instance, we might have
• a binary response (Yi ∈ {0, 1}), for instance representing whether or
not a patient recovers from a disease. A natural model is that Yi ∼
Bernoulli(pi ), and we might want to model how the ‘success’ probability
pi depends on explanatory variables xi .
• a count response (Yi ∈ {0, 1, 2, 3, . . .}), for instance representing the number of customers arriving at a shop. A natural model is that Yi ∼ Poisson(λi), and we might want to model how the rate λi depends on explanatory variables.
In Section 5.1.2, we define the exponential family, which includes the
Bernoulli and Poisson distributions as special cases. In a generalised
linear model, the response distribution is assumed to be a member of the
exponential family.
To complete the specification of a generalised linear model, we will need
to model how the parameters of the response distribution (e.g. the success


probability pi or the rate λi ) depend on explanatory variables xi . We need


to do this in a way which respects constraints on the possible values which
these parameters may take: for instance, we should not model pi = xTi β
directly, as we need to enforce pi ∈ [0, 1].

5.1.2 The exponential family


A probability distribution is said to be a member of the exponential family if its probability density function (or probability function, if discrete) can be written in the form

fY(y; θ, ϕ) = exp( [yθ − b(θ)]/a(ϕ) + c(y, ϕ) ). (5.1)

The parameter θ is called the natural or canonical parameter. The parameter


ϕ is usually assumed known. If it is unknown then it is often called the
nuisance parameter.
The density (5.1) can be thought of as a likelihood resulting from a single observation y. Then the log-likelihood is

ℓ(θ, ϕ) = [yθ − b(θ)]/a(ϕ) + c(y, ϕ)

and the score is

u(θ) = ∂ℓ(θ, ϕ)/∂θ = [y − ∂b(θ)/∂θ]/a(ϕ) = [y − b′(θ)]/a(ϕ).

The Hessian is

H(θ) = ∂2ℓ(θ, ϕ)/∂θ2 = −[∂2b(θ)/∂θ2]/a(ϕ) = −b′′(θ)/a(ϕ),

so the expected information is

I(θ) = E[−H(θ)] = b′′(θ)/a(ϕ).
From the properties of the score function in Section 2.2.1, we know that E[U(θ)] = 0. Therefore

E[ (Y − b′(θ))/a(ϕ) ] = 0,

so E[Y ] = b′ (θ). We often denote the mean by µ, so µ = b′ (θ).


Furthermore,

Var[U(θ)] = Var[ (Y − b′(θ))/a(ϕ) ] = Var[Y]/a(ϕ)2,

as b′(θ) and a(ϕ) are constants (not random variables). We also know from Section 2.2.2 that Var[U(θ)] = I(θ). Therefore

Var[Y] = a(ϕ)2 Var[U(θ)] = a(ϕ)2 I(θ) = a(ϕ)b′′(θ),
and hence the mean and variance of a random variable with probability
density function (or probability function) of the form (5.1) are b′ (θ) and
a(ϕ)b′′ (θ) respectively.
The variance is the product of two functions; b′′ (θ) depends on the canonical
parameter θ (and hence µ) only and is called the variance function (V (µ) ≡
b′′ (θ)); a(ϕ) is sometimes of the form a(ϕ) = σ 2 /w where w is a known weight
and σ 2 is called the dispersion parameter or scale parameter.

Example 5.1 (Normal distribution). Suppose Y ∼ N(µ, σ2). Then

fY(y; µ, σ2) = (2πσ2)−1/2 exp[ −(y − µ)2/(2σ2) ],  y ∈ R; µ ∈ R
            = exp( [yµ − µ2/2]/σ2 − (1/2)[ y2/σ2 + log(2πσ2) ] ).

This is in the form (5.1), with θ = µ, b(θ) = θ2/2, a(ϕ) = σ2 and

c(y, ϕ) = −(1/2)[ y2/a(ϕ) + log(2πa(ϕ)) ].

Therefore

E(Y) = b′(θ) = θ = µ,
Var(Y) = a(ϕ)b′′(θ) = σ2,

and the variance function is V(µ) = 1.

Example 5.2 (Poisson distribution). Suppose Y ∼ Poisson(λ). Then

fY(y; λ) = exp(−λ) λy / y!,  y ∈ {0, 1, . . .}; λ ∈ R+
         = exp( y log λ − λ − log y! ).

This is in the form (5.1), with θ = log λ, b(θ) = exp θ, a(ϕ) = 1 and c(y, ϕ) = −log y!. Therefore

E(Y) = b′(θ) = exp θ = λ,
Var(Y) = a(ϕ)b′′(θ) = exp θ = λ,

and the variance function is V(µ) = µ.

Example 5.3 (Bernoulli distribution). Suppose Y ∼ Bernoulli(p). Then

fY(y; p) = py (1 − p)1−y,  y ∈ {0, 1}; p ∈ (0, 1)
         = exp( y log[p/(1 − p)] + log(1 − p) ).

This is in the form (5.1), with θ = log[p/(1 − p)], b(θ) = log(1 + exp θ), a(ϕ) = 1 and c(y, ϕ) = 0. Therefore

E(Y) = b′(θ) = exp θ / (1 + exp θ) = p,
Var(Y) = a(ϕ)b′′(θ) = exp θ / (1 + exp θ)2 = p(1 − p),

and the variance function is V(µ) = µ(1 − µ).

Example 5.4 (Binomial distribution). Suppose Y ∗ ∼ Binomial(n, p). Here,


n is assumed known (as usual) and the random variable Y = Y ∗ /n is taken
as the proportion of successes, so

fY(y; p) = (n choose ny) pny (1 − p)n(1−y),  y ∈ {0, 1/n, 2/n, . . . , 1}; p ∈ (0, 1)
         = exp( [ y log(p/(1 − p)) + log(1 − p) ] / (1/n) + log (n choose ny) ).

This is in the form (5.1), with θ = log[p/(1 − p)], b(θ) = log(1 + exp θ), a(ϕ) = 1/n and c(y, ϕ) = log (n choose ny). Therefore

E(Y) = b′(θ) = exp θ / (1 + exp θ) = p,
Var(Y) = a(ϕ)b′′(θ) = (1/n) exp θ / (1 + exp θ)2 = p(1 − p)/n,

and the variance function is V(µ) = µ(1 − µ). Here, we can write a(ϕ) ≡ σ2/w, where the scale parameter σ2 = 1 and the weight w is n, the binomial denominator.

5.2 Lecture 11: Components of a generalised linear model

5.2.1 The random component
As in a linear model, the aim is to determine the pattern of dependence
of a response variable on explanatory variables. We denote the n observa-
tions of the response by y = (y1 , y2 , . . . , yn )T . In a generalised linear model
(GLM), these are assumed to be observations of independent random vari-
ables Y = (Y1 , Y2 , . . . , Yn )T , which take the same distribution from the expo-
nential family. In other words, the functions a, b and c and usually the scale
parameter ϕ are the same for all observations, but the canonical parameter
θ may differ. Therefore, we write
!
yi θi − b(θi )
fYi (yi ; θi , ϕi ) = exp + c(yi , ϕi )
a(ϕi )

and the joint density for Y = (Y1, Y2, . . . , Yn)T is

fY(y; θ, ϕ) = Π_{i=1}^n fYi(yi; θi, ϕi)
            = exp( Σ_{i=1}^n [yiθi − b(θi)]/a(ϕi) + Σ_{i=1}^n c(yi, ϕi) ), (5.2)

where θ = (θ1 , . . . , θn )T is the collection of canonical parameters and ϕ =


(ϕ1 , . . . , ϕn )T is the collection of nuisance parameters (where they exist).
Note that for a particular sample of observed responses, y = (y1 , y2 , . . . , yn )T ,
(5.2) is the likelihood function L(θ, ϕ) for θ and ϕ.

5.2.2 The systematic (or structural) component


Associated with each yi is a vector xi = (xi1 , xi2 , . . . , xip )T of p explanatory
variables. In a generalised linear model, the distribution of the response
variable Yi depends on xi through the linear predictor ηi where

ηi = β1xi1 + β2xi2 + · · · + βpxip = Σ_{j=1}^p xijβj = xiT β = [Xβ]i,  i = 1, . . . , n, (5.3)

where, as with a linear model, X is the n × p matrix with rows x1T, x2T, . . . , xnT, so that the (i, j)-th element of X is xij,

and β = (β1 , . . . , βp )T is a vector of fixed but unknown parameters describing


the dependence of Yi on xi . The four ways of describing the linear predictor
in (5.3) are equivalent, but the most economical is the matrix form

η = Xβ. (5.4)

Again, we call the n × p matrix X the design matrix. The ith row of X is xTi ,
the explanatory data corresponding to the ith observation of the response.
The jth column of X contains the n observations of the jth explanatory
variable.
As for the linear model in Section 3.1.2, this structure allows quite general
dependence of the linear predictor on explanatory variables. For instance, we
can allow non-linear dependence of ηi on a variable xi through polynomial
regression (as in Example 3.3), or include categorical explanatory variables
(as in Examples 3.5 and 3.6).

5.2.3 The link function


For specifying the pattern of dependence of the response variable on the
explanatory variables, the canonical parameters θ1 , . . . , θn in (5.2) are not of
direct interest. Furthermore, we have already specified that the distribution
of Yi should depend on xi through the linear predictor ηi . It is the parameters
β1 , . . . , βp of the linear predictor which are of primary interest.
The link between the distribution of Y and the linear predictor η is provided
by the link function g,

ηi = g(µi ), i = 1, . . . , n,

where µi ≡ E(Yi ), i = 1, . . . , n. Hence, the dependence of the distribution


of the response on the explanatory variables is established as

g(E[Yi ]) = g(µi ) = ηi = xTi β, i = 1, . . . , n.

In principle, the link function g can be any one-to-one differentiable function.


However, we note that ηi can in principle take any value in R (as we make no
restriction on possible values taken by explanatory variables or model param-
eters). However, for some exponential family distributions µi is restricted.
For example, for the Poisson distribution µi ∈ R+ ; for the Bernoulli distribu-
tion µi ∈ (0, 1). If g is not chosen carefully, then there may exist a possible xi
and β such that ηi ̸= g(µi ) for any possible value of µi . Therefore, ‘sensible’
choices of link function map the set of allowed values for µi onto R.

Recall that for a random variable Y with a distribution from the exponential
family, E(Y ) = b′ (θ). Hence, for a generalised linear model

µi = E(Yi ) = b′ (θi ), i = 1, . . . , n.

Therefore

θi = (b′)−1(µi),  i = 1, . . . , n,

and as g(µi) = ηi = xiT β, then

θi = (b′)−1(g−1[xiT β]),  i = 1, . . . , n. (5.5)

Hence, we can express the joint density (5.2) in terms of the coefficients β,
and for observed data y, this is the likelihood L(β) for β. As β is our
parameter of real interest (describing the dependence of the response on the
explanatory variables) this likelihood will play a crucial role.
Note that considerable simplification is obtained in (5.5) if the functions g and (b′)−1 are identical. Then

θi = xiT β,  i = 1, . . . , n,

and the resulting likelihood is

L(β) = exp( Σ_{i=1}^n [yi xiT β − b(xiT β)]/a(ϕi) + Σ_{i=1}^n c(yi, ϕi) ).

The link function

g(µ) ≡ (b′)−1(µ)

is called the canonical link function. Under the canonical link, the canonical parameter is equal to the linear predictor.
The canonical link functions are:

Distribution   b(θ)               b′(θ) ≡ µ              θ ≡ (b′)−1(µ)        Link                     Name
Normal         θ2/2               θ                      µ                    g(µ) = µ                 Identity
Poisson        exp θ              exp θ                  log µ                g(µ) = log µ             Log
Binomial       log(1 + exp θ)     exp θ/(1 + exp θ)      log[µ/(1 − µ)]       g(µ) = log[µ/(1 − µ)]    Logit

5.3 Lecture 12: Examples of generalised linear models

5.3.1 The linear model
The linear model considered in Chapter 3 is also a generalised linear model.
We assume Y1 , . . . , Yn are independent normally distributed random variables,
so that Yi ∼ N (µi , σ 2 ). We have seen in Example 5.1 that the normal
distribution is a member of the exponential family.
The explanatory variables enter a linear model through the linear predictor
ηi = xTi β, i = 1, . . . , n.

The link between E(Y ) = µ and the linear predictor η is through the (canon-
ical) identity link function
µi = ηi , i = 1, . . . , n.

5.3.2 Models for binary data


In binary regression, we assume either Yi ∼ Bernoulli(pi ), or Yi ∼
binomial(ni , pi ), where ni are known. The objective is to model the success
probability pi as a function of the explanatory variables xi . We have seen
in Examples 5.3 and 5.4 that the Bernoulli and binomial distributions are
members of the exponential family.
When the canonical (logit) link is used, we have

logit(pi) = log[ pi/(1 − pi) ] = ηi = xiT β.

This implies

pi = exp(ηi)/(1 + exp(ηi)) = 1/(1 + exp(−ηi)).

The function F(η) = 1/(1 + exp(−η)) is the cumulative distribution function (cdf) of a distribution called the logistic distribution.
The cumulative distribution functions of other distributions are also com-
monly used to generate link functions for binary regression. For example, if
we let
pi = Φ(xTi β) = Φ(ηi ),

where Φ(·) is the cdf of the standard normal distribution, then we get the
link function
g(µ) = g(p) = Φ−1 (µ) = η,

which is called the probit link.
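In R, logit and probit binary regressions can be fitted with glm(); a sketch assuming a hypothetical data frame dat with a 0/1 response y and covariate x:

```r
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"),  data = dat)
fit_probit <- glm(y ~ x, family = binomial(link = "probit"), data = dat)

summary(fit_logit)                     # coefficients and standard errors
predict(fit_logit, type = "response")  # fitted probabilities p-hat_i
```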

5.3.3 Models for count data


If Yi represent counts of the number of times an event occurs in a fixed time
(or a fixed region of space), we might model Yi ∼ Poisson(λi ). We have seen
in Example 5.2 that the Poisson distribution is a member of the exponential
family.

With the canonical (log) link, we have

log λi = ηi = xTi β,

or
λi = exp{ηi } = exp{xTi β}.

This model is often called a log-linear model.

Now suppose that Yi represents a count of the number of events which occur
in a given region i, for instance the number of times a particular drug is
prescribed on a given day, in a district i of a country. We might want to
model the prescription rate per patient in the district, λ∗i. Write Ni for the number of patients registered in district i, often called the exposure of observation i. We model Yi ∼ Poisson(Ni λ∗i), where

log λ∗i = xTi β.

Equivalently, we may write the model as Yi ∼ Poisson(λi ), where

log λi = log Ni + xTi β,

(since λi = Ni λ∗i , so log λi = log Ni +log λ∗i ). The log-exposure log Ni appears
as a fixed term in the linear predictor, without any associated parameter.
Such a fixed term is called an offset.
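In R, the log-exposure enters the model through offset(); a sketch assuming a hypothetical data frame districts with columns count, N and x:

```r
fit_rate <- glm(count ~ x + offset(log(N)),
                family = poisson(link = "log"), data = districts)
coef(fit_rate)   # beta-hat; no coefficient is estimated for log(N)
```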

5.4 Lecture 13: Maximum likelihood estimation
The regression coefficients β1 , . . . , βp describe the pattern by which the re-
sponse depends on the explanatory variables. We use the observed data
y1 , . . . , yn to estimate this pattern of dependence.

As usual, we maximise the log-likelihood function which, from (5.2), can be


written

ℓ(β, ϕ) = Σ_{i=1}^n [yiθi − b(θi)]/a(ϕi) + Σ_{i=1}^n c(yi, ϕi) (5.6)

and depends on β through

θi = (b′)−1(µi),
µi = g−1(ηi),
ηi = xiT β = Σ_{j=1}^p xijβj,  i = 1, . . . , n.
i=1

To find β̂, we consider the scores

uk(β) = ∂ℓ(β, ϕ)/∂βk,  k = 1, . . . , p,

and then find β̂ to solve uk(β̂) = 0 for k = 1, . . . , p.

From (5.6)


uk(β) = ∂ℓ(β, ϕ)/∂βk
      = ∂/∂βk Σ_{i=1}^n [yiθi − b(θi)]/a(ϕi) + ∂/∂βk Σ_{i=1}^n c(yi, ϕi)
      = Σ_{i=1}^n ∂/∂βk { [yiθi − b(θi)]/a(ϕi) }
      = Σ_{i=1}^n ∂/∂θi { [yiθi − b(θi)]/a(ϕi) } (∂θi/∂µi)(∂µi/∂ηi)(∂ηi/∂βk)
      = Σ_{i=1}^n [yi − b′(θi)]/a(ϕi) (∂θi/∂µi)(∂µi/∂ηi)(∂ηi/∂βk),  k = 1, . . . , p,

where

∂θi/∂µi = [∂µi/∂θi]−1 = 1/b′′(θi),
∂µi/∂ηi = [∂ηi/∂µi]−1 = 1/g′(µi),
∂ηi/∂βk = ∂/∂βk Σ_{j=1}^p xijβj = xik.

Therefore

uk(β) = Σ_{i=1}^n [yi − b′(θi)] xik / [a(ϕi) b′′(θi) g′(µi)] = Σ_{i=1}^n (yi − µi) xik / [Var(Yi) g′(µi)],  k = 1, . . . , p, (5.7)

which depends on β through µi ≡ E(Yi) and Var(Yi), i = 1, . . . , n.
In theory, we solve the p simultaneous equations uk (β̂) = 0, k = 1, . . . , p to
evaluate β̂. In practice, these equations are usually non-linear and have no
analytic solution. Therefore, we rely on numerical methods to solve them.

First, we note that the Hessian and Fisher information matrices can be derived directly from (5.7):

[H(β)]jk = ∂2ℓ(β, ϕ)/∂βj∂βk = ∂uk(β)/∂βj.

Therefore

[H(β)]jk = ∂/∂βj Σ_{i=1}^n (yi − µi) xik / [Var(Yi) g′(µi)]
         = − Σ_{i=1}^n (∂µi/∂βj) xik / [Var(Yi) g′(µi)] + Σ_{i=1}^n (yi − µi) ∂/∂βj { xik / [Var(Yi) g′(µi)] }

and

[I(β)]jk = Σ_{i=1}^n (∂µi/∂βj) xik / [Var(Yi) g′(µi)] − Σ_{i=1}^n (E[Yi] − µi) ∂/∂βj { xik / [Var(Yi) g′(µi)] }
         = Σ_{i=1}^n (∂µi/∂βj) xik / [Var(Yi) g′(µi)]
         = Σ_{i=1}^n xij xik / [Var(Yi) g′(µi)2],

where the last step uses ∂µi/∂βj = (∂µi/∂ηi)(∂ηi/∂βj) = xij/g′(µi).

Hence we can write

I(β) = XT W X, (5.8)

where X is the n × p design matrix with rows x1T, . . . , xnT, W = diag(w) is the n × n diagonal matrix with diagonal entries w1, . . . , wn, and

wi = 1/[Var(Yi) g′(µi)2],  i = 1, . . . , n.
The Fisher information matrix I(β) depends on β through µ and
Var(Yi ), i = 1, . . . , n.
We notice that the score in (5.7) may now be written as

uk(β) = Σ_{i=1}^n (yi − µi) xik wi g′(µi) = Σ_{i=1}^n xik wi zi,  k = 1, . . . , p,

where zi = (yi − µi) g′(µi), i = 1, . . . , n. Therefore

u(β) = XT W z. (5.9)

One possible method to solve the p simultaneous equations u(β̂) = 0 that


give β̂ is the (multivariate) Newton-Raphson method.
If β (m) is the current estimate of β̂ then the next estimate is

β (m+1) = β (m) − H(β (m) )−1 u(β (m) ). (5.10)

In practice, an alternative to Newton-Raphson replaces H(β(m)) in (5.10) with E[H(β(m))] ≡ −I(β(m)). Therefore, if β(m) is the current estimate of β̂ then the next estimate is

β (m+1) = β (m) + I(β (m) )−1 u(β (m) ). (5.11)

The resulting iterative algorithm is called Fisher scoring. Notice that if we


substitute (5.8) and (5.9) into (5.11) we get

β (m+1) = β (m) + [X T W (m) X]−1 X T W (m) z (m)


= [X T W (m) X]−1 [X T W (m) Xβ (m) + X T W (m) z (m) ]
= [X T W (m) X]−1 X T W (m) [Xβ (m) + z (m) ]
= [X T W (m) X]−1 X T W (m) [η (m) + z (m) ],

where η (m) , W (m) and z (m) are all functions of β (m) .


Note that this is a weighted least squares equation, that is, β(m+1) minimises the weighted sum of squares

(η + z − Xβ)T W (η + z − Xβ) = Σ_{i=1}^n wi (ηi + zi − xiT β)2

as a function of β, where w1, . . . , wn are the weights and η + z is called the adjusted dependent variable. Therefore, the Fisher scoring algorithm proceeds as follows.
1. Choose an initial estimate β (m) for β̂ at m = 0.
2. Evaluate η (m) , W (m) and z (m) at β (m) .
3. Calculate

β (m+1) = [X T W (m) X]−1 X T W (m) [η (m) + z (m) ].

4. If ||β (m+1) − β (m) || > ϵ, for some prespecified (small) tolerance ϵ then
set m → m + 1 and go to 2.
5. Use β (m+1) as the solution for β̂.
As this algorithm involves iteratively minimising a weighted sum of squares,
it is sometimes known as iteratively (re)weighted least squares.
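The sketch below (not from the notes) implements this Fisher scoring / IRLS loop in R for a Poisson GLM with the canonical log link, for which Var(Yi) = µi and g′(µi) = 1/µi, so wi = µi and zi = (yi − µi)/µi. The simulated data are illustrative, and glm() is used only to check the answer.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))   # simulated Poisson responses

beta <- c(0, 0)                              # initial estimate
for (m in 1:25) {
  eta <- as.vector(X %*% beta)
  mu  <- exp(eta)                            # inverse of the log link
  w   <- mu                                  # w_i = 1 / (Var(Y_i) g'(mu_i)^2)
  z   <- (y - mu) / mu                       # z_i = (y_i - mu_i) g'(mu_i)
  beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * (eta + z))))
  converged <- sum(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}
beta
coef(glm(y ~ x, family = poisson))           # should agree with beta
```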
Notes

1. Recall that the canonical link function is g(µ) = (b′)−1(µ), and with this link ηi = g(µi) = θi. Then

1/g′(µi) = ∂µi/∂ηi = ∂µi/∂θi = b′′(θi),  i = 1, . . . , n.

Therefore Var(Yi) g′(µi) = a(ϕi), which does not depend on β, and hence

∂/∂βj { xik / [Var(Yi) g′(µi)] } = 0

for all j = 1, . . . , p. It follows that H(β) = −I(β) and, for the canonical link, Newton-Raphson and Fisher scoring are equivalent.
2. The linear model is a generalised linear model with identity link, ηi =
g(µi ) = µi and Var(Yi ) = σ 2 for all i = 1, . . . , n. Therefore wi =

[Var(Yi )g ′ (µi )2 ]−1 = σ −2 and zi = (yi − µi )g ′ (µi ) = yi − ηi for i =


1, . . . , n. Hence z + η = y and W = σ −2 I, neither of which depend
on β. So the Fisher scoring algorithm converges in a single iteration
to the usual least squares estimate.
3. Estimation of an unknown scale parameter σ 2 is discussed later. A
common (to all i) σ 2 has no effect on β̂.

5.5 Lecture 14: Confidence intervals


Recall from Section 2.3 that the maximum likelihood estimator β̂ is asymptotically normally distributed with mean β (it is asymptotically unbiased) and variance-covariance matrix I(β)−1. For ‘large enough n’ we treat this distribution as an approximation.

Therefore, standard errors (estimated standard deviations) are given by

s.e.(β̂i) = √{ [I(β̂)−1]ii } = √{ [(XT Ŵ X)−1]ii },  i = 1, . . . , p,

where the diagonal matrix Ŵ = diag(ŵ) is evaluated at β̂, that is ŵi = [V̂ar(Yi) g′(µ̂i)2]−1, where µ̂i and V̂ar(Yi) are evaluated at β̂ for i = 1, . . . , n. Furthermore, if Var(Yi) depends on an unknown scale parameter, then this too must be estimated in the standard error.

The asymptotic distribution of the maximum likelihood estimator can be used to provide approximate large-sample confidence intervals. For given α we can find z1−α/2 such that

P( −z1−α/2 ≤ (β̂i − βi)/√{ [I(β)−1]ii } ≤ z1−α/2 ) = 1 − α.

Therefore

P( β̂i − z1−α/2 √{ [I(β)−1]ii } ≤ βi ≤ β̂i + z1−α/2 √{ [I(β)−1]ii } ) = 1 − α.

The endpoints of this interval cannot be evaluated because they also depend on the unknown parameter vector β. However, if we replace I(β) by its MLE I(β̂), we obtain the approximate large-sample 100(1 − α)% confidence interval

[ β̂i − s.e.(β̂i) z1−α/2 , β̂i + s.e.(β̂i) z1−α/2 ].

For α = 0.10, 0.05, 0.01, we have z1−α/2 = 1.64, 1.96, 2.58, respectively.
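In R, these Wald intervals can be obtained from a fitted GLM; a sketch using the hypothetical logistic fit from Section 5.3.2:

```r
beta_hat <- coef(fit_logit)
se <- sqrt(diag(vcov(fit_logit)))     # s.e.(beta-hat_i) from (X^T W-hat X)^{-1}
cbind(lower = beta_hat - 1.96 * se, upper = beta_hat + 1.96 * se)

confint.default(fit_logit)            # the same Wald intervals, built in
```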

5.6 Lecture 15: Comparing generalised linear models

5.6.1 The likelihood ratio test
If we have a set of competing generalised linear models which might explain
the dependence of the response on the explanatory variables, we will want
to determine which of the models is most appropriate. Recall that we have
three main requirements of a statistical model; plausibility, parsimony and
goodness of fit, of which parsimony and goodness of fit are statistical issues.

As with linear models, we proceed by comparing models pairwise using a


likelihood ratio test. This kind of comparison is restricted to situations where
one of the models, H0 , is nested in the other, H1 . Then the asymptotic
distribution of the log likelihood ratio statistic under H0 is a chi-squared
distribution with known degrees of freedom.

For generalised linear models, ‘nested’ means that H0 and H1 are

1. based on the same exponential family distribution, and


2. have the same link function, but
3. the explanatory variables present in H0 are a subset of those present
in H1 .

We will assume that model H1 contains p linear parameters and model H0 a


subset of q < p of these. Without loss of generality, we can think of H1 as
the model
X
p
ηi = xij βj i = 1, . . . , n
j=1

and H0 is the same model with

βq+1 = βq+2 = · · · = βp = 0.

Then model H0 is a special case of model H1 , where certain coefficients are set
equal to zero, and therefore Θ(0) , the set of values of the canonical parameter
θ allowed by H0 , is a subset of Θ(1) , the set of values allowed by H1 .

Now, the log likelihood ratio statistic for a test of H0 against H1 is



L01 ≡ 2 log[ maxθ∈Θ(1) L(θ) / maxθ∈Θ(0) L(θ) ] = 2 log L(θ̂ (1)) − 2 log L(θ̂ (0)), (5.12)

where θ̂ (1) and θ̂ (0) follow from b′ (θ̂i ) = µ̂i , g(µ̂i ) = η̂i , i = 1, . . . , n where η̂ for
each model is the linear predictor evaluated at the corresponding maximum
likelihood estimate for β. Here, we assume that a(ϕi ), i = 1, . . . , n are
known; unknown a(ϕ) is discussed in Section 5.8.
Recall that we reject H0 in favour of H1 when L01 is ‘too large’ (the observed
data are much more probable under H1 than H0 ). To determine a threshold
value k for L01 , beyond which we reject H0 , we set the size of the test α and
use the result of Section 2.3.3.2 that, because H0 is nested in H1 , L01 has
an asymptotic chi-squared distribution with p − q degrees of freedom. For
example, if α = 0.05, we reject H0 in favour of H1 when L01 is greater than
the 95 point of the χ2p−q distribution.
Note that setting up our model selection procedure in this way is consistent
with our desire for parsimony. The simpler model is H0 , and we do not
reject H0 in favour of the more complex model H1 unless the data provide
convincing evidence for H1 over H0 , that is unless H1 fits the data significantly
better.
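In R, the likelihood ratio test between nested GLMs is provided by anova(); a sketch using the hypothetical binary-response data from Section 5.3.2:

```r
fit0 <- glm(y ~ 1, family = binomial, data = dat)   # H0: intercept only
fit1 <- glm(y ~ x, family = binomial, data = dat)   # H1: includes x

anova(fit0, fit1, test = "Chisq")   # L01 compared to a chi-squared distribution
```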

5.7 Lecture 16: Scaled deviance and the saturated model
Consider a model where β is n-dimensional, and therefore η = Xβ. Assum-
ing that X is invertible, then this model places no constraints on the linear
predictor η = (η1 , . . . , ηn ). It can take any value in Rn . Correspondingly the
means µ and the canonical parameters θ are unconstrained. The model is of
dimension n and can be parameterised equivalently using β, η, µ or θ. Such
a model is called the saturated model.
As the canonical parameters θ are unconstrained, we can calculate their
maximum likelihood estimates θ̂ directly from their likelihood (5.2) (without
first having to calculate β̂)

ℓ(θ) = Σ_{i=1}^n [yiθi − b(θi)]/a(ϕi) + Σ_{i=1}^n c(yi, ϕi). (5.13)

We obtain θ̂ by first differentiating with respect to θ1, . . . , θn to give

∂ℓ(θ)/∂θk = [yk − b′(θk)]/a(ϕk),  k = 1, . . . , n.

Therefore b′ (θ̂k ) = yk , k = 1, . . . , n, and it follows immediately that µ̂k =


yk , k = 1, . . . , n. Hence the saturated model fits the data perfectly, as the
fitted values µ̂k and observed values yk are the same for every observation
k = 1, . . . , n.
The saturated model is rarely of any scientific interest in its own right. It is
highly parameterised, having as many parameters as there are observations.
This goes against our desire for parsimony in a model. However, every other
model is necessarily nested in the saturated model, and a test comparing a
model H0 against the saturated model HS can be interpreted as a goodness
of fit test. If the saturated model, which fits the observed data perfectly,
does not provide a significantly better fit than model H0 , we can conclude
that H0 is an acceptable fit to the data.
The log likelihood ratio statistic for a test of H0 against HS is, from (5.12)

L0s = 2 log L(θ̂ (s) ) − 2 log L(θ̂ (0) ),

where θ̂ (s) follows from b′ (θ̂) = µ̂ = y and θ̂ (0) is a function of the corre-
sponding maximum likelihood estimate for β = (β1 , . . . , βq )T . Under H0 , L0s
has an asymptotic chi-squared distribution with n − q degrees of freedom.
Therefore, if L0s is ‘too large’ (for example, larger than the 95% point of the
χ2n−q distribution) then we reject H0 as a plausible model for the data, as it
does not fit the data adequately.
The degrees of freedom of model H0 is defined to be the degrees of freedom
for this test, n − q, the number of observations minus the number of linear
parameters of H0 . We call L0s the scaled deviance (R calls it the residual
deviance) of model H0 .
From (5.12) and (5.13) we can write the scaled deviance of model H0 as

L0s = 2 Σ_{i=1}^n { yi[θ̂i(s) − θ̂i(0)] − [b(θ̂i(s)) − b(θ̂i(0))] } / a(ϕi), (5.14)

which can be calculated using the observed data, provided that a(ϕi), i = 1, . . . , n, are known.

Notes
1. The log likelihood ratio statistic (5.12) for testing H0 against a non-
saturated alternative H1 can be written as

L01 = 2 log L(θ̂ (1) ) − 2 log L(θ̂ (0) )


= [2 log L(θ̂ (s) ) − 2 log L(θ̂ (0) )] − [2 log L(θ̂ (s) ) − 2 log L(θ̂ (1) )]
= L0s − L1s . (5.15)

Therefore the log likelihood ratio statistic for comparing two nested
models is the difference of their deviances. Furthermore, as p − q =
(n − q) − (n − p), the degrees of freedom for the test is the difference
in degrees of freedom of the two models.
2. The asymptotic theory used to derive the distribution of the log like-
lihood ratio statistic under H0 does not really apply to the goodness
of fit test (comparison with the saturated model). However, for bino-
mial or Poisson data, we can proceed as long as the relevant binomial
or Poisson distributions are likely to be reasonably approximated by
normal distributions (i.e. for binomials with large denominators or
Poissons with large means). However, for Bernoulli data, we cannot
use the scaled deviance as a goodness of fit statistic in this way.
3. An alternative goodness of fit statistic for a model H0 is Pearson’s X 2
given by
X2 = Σ_{i=1}^n (yi − µ̂i(0))2 / V̂ar(Yi). (5.16)
X2 is small when the squared differences between the observed and fitted values (scaled by variance) are small. Hence, large values of X2 correspond to poorly fitting models. In fact, X2 and L0s are asymptotically

equivalent and under H0 , X 2 , like L0s , has an asymptotic chi-squared


distribution with n − q degrees of freedom. However, the asymptotics
associated with X 2 are often more reliable for small samples, so if there
is a discrepancy between X 2 and L0s , it is usually safer to base a test
of goodness of fit on X 2 .
4. Although the deviance for a model is expressed in (5.14) in terms of
the maximum likelihood estimates of the canonical parameters, it is
more usual to express it in terms of the maximum likelihood estimates
µ̂i , i = 1, . . . , n of the mean parameters. For the saturated model,
these are just the observed values yi , i = 1, . . . , n, and for the model
of interest, H0 , we call them the fitted values. Hence, for a particular
generalised linear model, the scaled deviance function describes how
discrepancies between the observed and fitted values are penalised.

Example 5.5 (Poisson). Suppose Yi ∼ Poisson(λi ), i = 1, . . . , n. Recall


from Section 5.1.2 that θ = log λ, b(θ) = exp θ, µ = b′ (θ) = exp θ and
Var(Y ) = a(ϕ)V (µ) = 1 · µ. Therefore, by (5.14) and (5.16)

L0s = 2 Σ_{i=1}^n { yi[log µ̂i(s) − log µ̂i(0)] − [µ̂i(s) − µ̂i(0)] }
    = 2 Σ_{i=1}^n { yi log( yi/µ̂i(0) ) − yi + µ̂i(0) }

and

X2 = Σ_{i=1}^n (yi − µ̂i(0))2 / µ̂i(0).

Example 5.6 (Binomial). Suppose niYi ∼ Binomial(ni, pi), i = 1, . . . , n. Recall from Section 5.1.2 that θ = log[p/(1 − p)], b(θ) = log(1 + exp θ), µ = b′(θ) = exp θ/(1 + exp θ) and Var(Y) = a(ϕ)V(µ) = µ(1 − µ)/n. Therefore, by (5.14) and (5.16)

L0s = 2 Σ_{i=1}^n ni { yi[ log( µ̂i(s)/(1 − µ̂i(s)) ) − log( µ̂i(0)/(1 − µ̂i(0)) ) ] + [ log(1 − µ̂i(s)) − log(1 − µ̂i(0)) ] }
    = 2 Σ_{i=1}^n { ni yi log( yi/µ̂i(0) ) + ni(1 − yi) log[ (1 − yi)/(1 − µ̂i(0)) ] }

and

X2 = Σ_{i=1}^n ni (yi − µ̂i(0))2 / [ µ̂i(0)(1 − µ̂i(0)) ].
Bernoulli data are binomial with ni = 1, i = 1, . . . , n.

5.8 Lecture 17: Models with unknown a(ϕ)


The theory of Section 5.6 has assumed that a(ϕ) is known. This is the case
for both the Poisson distribution (a(ϕ) = 1) and the binomial distribution
(a(ϕ) = 1/n). Neither the scaled deviance (5.14) nor Pearson X 2 statistic
(5.16) can be evaluated unless a(ϕ) is known. Therefore, when a(ϕ) is not
known, we cannot use the scaled deviance as a measure of goodness of fit,
or to compare models using (5.15). For such models, there is no equivalent
goodness of fit test, but we can develop a test for comparing nested models.
Here we assume that a(ϕi ) = σ 2 /mi , i = 1, . . . , n where σ 2 is a common un-
known scale parameter and m1 , . . . , mn are known weights. (A linear model
takes this form, as Var(Yi ) = σ 2 , i = 1, . . . , n, so mi = 1, i = 1, . . . , n.)
Under this assumption

L0s = (2/σ2) Σ_{i=1}^n { mi yi[θ̂i(s) − θ̂i(0)] − mi[b(θ̂i(s)) − b(θ̂i(0))] } = D0s/σ2, (5.17)

where D0s is defined to be twice the sum above, which can be calculated using the observed data. We call D0s the deviance of the model.
In order to test nested models H0 and H1 as set up in Section 5.6.1, we
calculate the test statistic

F = [L01/(p − q)] / [L1s/(n − p)] = [(L0s − L1s)/(p − q)] / [L1s/(n − p)]
  = [(D0s/σ2 − D1s/σ2)/(p − q)] / [(D1s/σ2)/(n − p)] = [(D0s − D1s)/(p − q)] / [D1s/(n − p)]. (5.18)

This statistic does not depend on the unknown scale parameter σ 2 , so can be
calculated using the observed data. Asymptotically, if H0 is true, we know
that L01 ∼ χ2p−q and L1s ∼ χ2n−p . Furthermore, L01 and L1s are independent
(not proved here) so F has an asymptotic Fp−q,n−p distribution. Hence, we
compare nested generalised linear models by calculating F and rejecting H0
in favour of H1 if F is too large (for example, greater than the 95% point of
the relevant F distribution).

The dependence of the maximum likelihood equations u(β̂) = 0 on σ 2 (where


u is given by (5.7)) can be eliminated by multiplying through by σ 2 . However,
inference based on the maximum likelihood estimates, as described in Section
5.5, does require knowledge of σ 2 . This is because asymptotically Var(β̂) is
the inverse of the Fisher information matrix I(β) = X T W X, and this
depends on wi = 1/[Var(Yi) g′(µi)2], where Var(Yi) = a(ϕi)b′′(θi) = σ2 b′′(θi)/mi here.
Therefore, to calculate standard errors and confidence intervals, we need
to supply an estimate σ̂ 2 of σ 2 . Generally, we do not use the maximum
likelihood estimate. Instead, we notice that, from (5.17), L0s = D0s /σ 2 , and
we know that asymptotically, if model H0 is an adequate fit, L0s has a χ2n−q
distribution. Hence
E(L0s) = E(D0s/σ2) = n − q  ⇒  E[ D0s/(n − q) ] = σ2.

Therefore the deviance of a model divided by its degrees of freedom is an


asymptotically unbiased estimator of the scale parameter σ 2 . Hence σ̂ 2 =
D0s /(n − q).
An alternative estimator of σ 2 is based on the Pearson X 2 statistic. As
Var(Y ) = a(ϕ)V (µ) = σ 2 V (µ)/m here, then from (5.16)

X2 = (1/σ2) Σ_{i=1}^n mi (yi − µ̂i(0))2 / V(µ̂i(0)). (5.19)

Again, if H0 is an adequate fit, X2 has an asymptotic chi-squared distribution with n − q degrees of freedom, so

σ̂2 = (1/(n − q)) Σ_{i=1}^n mi (yi − µ̂i(0))2 / V(µ̂i(0))

is an alternative asymptotically unbiased estimator of σ2. This estimator tends to be more reliable in small samples.
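A sketch in R of both estimators of σ2 and of the F test (5.18), using a hypothetical Gamma GLM (a family with unknown scale) fitted to a hypothetical data frame dat2:

```r
fit0 <- glm(y ~ x,     family = Gamma(link = "log"), data = dat2)   # H0
fit1 <- glm(y ~ x + z, family = Gamma(link = "log"), data = dat2)   # H1

deviance(fit1) / df.residual(fit1)                            # deviance-based estimate of sigma^2
sum(residuals(fit1, type = "pearson")^2) / df.residual(fit1)  # Pearson-based estimate of sigma^2

anova(fit0, fit1, test = "F")   # the F statistic of (5.18)
```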

Example 5.7 (Normal). Suppose Yi ∼ N(µi, σ2), i = 1, . . . , n. Recall from Section 5.1.2 that θ = µ, b(θ) = θ2/2, µ = b′(θ) = θ and Var(Y) = a(ϕ)V(µ) = σ2 · 1, so mi = 1, i = 1, . . . , n. Therefore, by (5.17),

D0s = 2 Σ_{i=1}^n { yi[µ̂i(s) − µ̂i(0)] − [ µ̂i(s)2/2 − µ̂i(0)2/2 ] } = Σ_{i=1}^n [yi − µ̂i(0)]2, (5.20)

which is just the residual sum of squares for model H0. Therefore, we estimate σ2 for a normal GLM by the residual sum of squares for the model divided by its degrees of freedom. From (5.19), the estimate for σ2 based on X2 is identical.

5.8.1 Residuals
Recall that for linear models, we define the residuals to be the differences between the observed and fitted values, yi − µ̂i(0), i = 1, . . . , n. From (5.20) we
notice that both the scaled deviance and Pearson X 2 statistic for a normal
GLM are the sum of the squared residuals divided by σ 2 . We can generalise
this to define residuals for other generalised linear models in a natural way.
For any GLM we define the Pearson residuals to be

riP = (yi − µ̂i(0)) / V̂ar(Yi)1/2,  i = 1, . . . , n.

Then, from (5.16), X2 is the sum of the squared Pearson residuals.

For any GLM we define the deviance residuals to be

riD = sign(yi − µ̂i(0)) ( 2{ yi[θ̂i(s) − θ̂i(0)] − [b(θ̂i(s)) − b(θ̂i(0))] } / a(ϕi) )1/2,  i = 1, . . . , n,

where sign(x) = 1 if x > 0 and −1 if x < 0. Then, from (5.14), the scaled deviance, L0s, is the sum of the squared deviance residuals.
When a(ϕ) = σ 2 /m and σ 2 is unknown, as in Section 5.8, the residuals are
based on (5.17) and (5.19), and the expressions above need to be multiplied
through by σ 2 to eliminate dependence on the unknown scale parameter.
Therefore, for a normal GLM the Pearson and deviance residuals are both equal to the usual residuals, yi − µ̂i(0), i = 1, . . . , n.
Residual plots are most commonly of use in normal linear models, where they
provide an essential check of the model assumptions. This kind of check is
less important for a model without an unknown scale parameter as the scaled
deviance provides a useful overall assessment of fit which takes into account
most aspects of the model.
However, when data have been collected in serial order, a plot of the deviance
or Pearson residuals against the order may again be used as a check for
potential serial correlation.
Otherwise, residual plots are most useful when a model fails to fit (scaled
deviance is too high). Then, examining the residuals may give an indication
of the reason(s) for lack of fit. For example, there may be a small number of
outlying observations.
A plot of deviance or Pearson residuals against the linear predictor should
produce something that looks like a random scatter. If not, then this may
be due to incorrect link function, wrong scale for an explanatory variable, or
perhaps a missing polynomial term in an explanatory variable.
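In R, both kinds of residuals are available from residuals(); a sketch of the residuals-against-linear-predictor plot described above, using the hypothetical Poisson fit from Section 5.3.3:

```r
dev_res <- residuals(fit_rate, type = "deviance")   # deviance residuals
plot(predict(fit_rate, type = "link"), dev_res,
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(h = 0, lty = 2)

pear_res <- residuals(fit_rate, type = "pearson")   # Pearson residuals
```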
Chapter 6

Models for categorical data

6.1 Lecture 18: Contingency tables


A particularly important application of generalised linear models is the anal-
ysis of categorical data. Here, the data are observations of one or more cate-
gorical variables on each of a number of units (often individuals). Therefore,
each of the units are cross-classified by the categorical variables.
For example, the job dataset gives the job satisfaction and income band of
901 individuals from the 1984 General Social Survey, which is summarised
in Table 6.1.

Job Satisfaction
Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat.
<6000 20 24 80 82
6000-15000 22 38 104 125
15000-25000 13 28 81 113
>25000 7 18 54 92

Table 6.1: A contingency table of the job dataset.

A cross-classification table like this is called a contingency table. This is a


two-way table, as there are two classifying variables. It might also be described
as a 4 × 4 contingency table (as each of the classifying variables takes one of
four possible levels).


Job Satisfaction
Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat. Sum
<6000 20 24 80 82 206
6000-15000 22 38 104 125 289
15000-25000 13 28 81 113 235
>25000 7 18 54 92 171
Sum 62 108 319 412 901

Table 6.2: A contingency table of the job dataset, including one-way mar-
gins.

Each position in a contingency table is called a cell and the number of indi-
viduals in a particular cell is the cell count.

Partial classifications derived from the table are called margins. For a two-
way table these are often displayed in the margins of the table, as in Table
6.2. These are one-way margins as they represent the classification of items
by a single variable; income group and job satisfaction respectively.

The lymphoma dataset gives information about 30 patients, classified by cell


type of lymphoma, sex, and response to treatment, as shown in Table 6.3.
This is an example of a three-way contingency table. It is a 2 × 2 × 2 or 23
table.

Remission
Cell Type Sex No Yes
Female 3 1
Diffuse
Male 12 1
Female 2 6
Nodular
Male 1 4

Table 6.3: A contingency table of the lymphoma dataset.

For multiway tables, higher order margins may be calculated. For example,
for lymphoma, the two-way Cell type/Sex margin is shown in Table 6.4.

Sex
Cell Type Female Male
Diffuse 4 13
Nodular 8 5

Table 6.4: The two-way Cell type/Sex margin for the lymphoma dataset.

6.2 Lecture 19: Log-linear models


We can model contingency table data using generalised linear models. To do
this, we assume that the cell counts are observations of independent Poisson
random variables. This is intuitively sensible as the cell counts are non-
negative integers (the sample space for the Poisson distribution). Therefore,
if the table has n cells, which we label 1, . . . , n, then the observed cell counts
y1 , . . . , yn are assumed to be observations of independent Poisson random
variables Y1 , . . . , Yn . We denote the means of these Poisson random variables
by µ1 , . . . , µn . The canonical link function for the Poisson distribution is the
log function, and we assume this link function throughout. A generalised
linear model for Poisson data using the log link function is called a log-linear
model, as we have already seen in Section 5.3.3.
The explanatory variables in a log-linear model for contingency table data are
the cross-classifying variables, or factors. As usual with categorical variables,
we can include interactions in the model as well as just main effects (see
Example 3.6). Such a model will describe how the expected count in each
cell depends on the classifying variables, and any interactions between them.
Interpretation of these models will be discussed further in Section 6.4.
Table 6.5 shows the original data structure of the job dataset. It provides
exactly the same data as the contingency table in Table 6.1, but in a different
format, sometimes called long format. A log-linear model is just a Poisson
GLM, where the response variable is Count, and Income and Satisfaction
are explanatory variables.
Table 6.6 shows the lymphoma dataset in long format. Again, a log-linear model for the contingency table (Table 6.3) is just a Poisson GLM for these data, where in this case the response variable is Count, and Cell, Sex and Remis are explanatory variables.

Income Satisfaction Count


<6000 Very Dissatisfied 20
<6000 A Little Dissatisfied 24
<6000 Moderately Satisfied 80
<6000 Very Satisfied 82
6000-15000 Very Dissatisfied 22
6000-15000 A Little Dissatisfied 38
6000-15000 Moderately Satisfied 104
6000-15000 Very Satisfied 125
15000-25000 Very Dissatisfied 13
15000-25000 A Little Dissatisfied 28
15000-25000 Moderately Satisfied 81
15000-25000 Very Satisfied 113
>25000 Very Dissatisfied 7
>25000 A Little Dissatisfied 18
>25000 Moderately Satisfied 54
>25000 Very Satisfied 92

Table 6.5: The job dataset.

Cell Sex Remis Count


Nodular Male No 1
Nodular Male Yes 4
Nodular Female No 2
Nodular Female Yes 6
Diffuse Male No 12
Diffuse Male Yes 1
Diffuse Female No 3
Diffuse Female Yes 1

Table 6.6: The lymphoma dataset.
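A log-linear analysis of the job data in long format (Table 6.5) can be sketched in R as follows, assuming a data frame job with factor columns Income and Satisfaction and a numeric column Count:

```r
fit_indep <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)  # independence
fit_inter <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)  # adds interaction

anova(fit_indep, fit_inter, test = "Chisq")   # does the interaction improve the fit?
```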



6.3 Lecture 20: Multinomial sampling


Although the assumption of Poisson distributed observations is convenient for
the purposes of modelling, it might not be a realistic assumption, because of
the way in which the data have been collected. Frequently, when contingency
table data are obtained, the total number of observations (the grand total, the
sum of all the cell counts) is fixed in advance. In this case, no individual cell
count can exceed the prespecified fixed total, so the assumption of Poisson
sampling is invalid as the sample space is bounded. Furthermore, with a
fixed total, the observations can no longer be observations of independent
random variables.
For example, consider the lymphoma contingency table from Table 6.3. It
may be that these data were collected over a fixed period of time, and that
in that time there happened to be 30 patients. In this case, the Poisson
assumption is perfectly valid. However, it may have been decided at the
outset to collect data on 30 patients, in which case the grand total is fixed
at 30, and the Poisson assumption is not valid.
When the grand total is fixed, a more appropriate distribution for the cell
counts is the multinomial distribution. The multinomial distribution is the
distribution of cell counts arising when a prespecified total of N items are
each independently assigned to one of $n$ cells, where the probability of being
classified into cell $i$ is $p_i$, $i = 1, \ldots, n$, so $\sum_{i=1}^n p_i = 1$. The probability
function for the multinomial distribution is
$$
f_Y(y; p) = P(Y_1 = y_1, \ldots, Y_n = y_n)
          = \begin{cases}
              N! \prod_{i=1}^n \dfrac{p_i^{y_i}}{y_i!} & \text{if } \sum_{i=1}^n y_i = N \\[4pt]
              0 & \text{otherwise.}
            \end{cases} \tag{6.1}
$$

The binomial is the special case of the multinomial with two cells (n = 2).
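As a small numerical illustration (a hypothetical example, not part of the notes), the probability in (6.1) can be evaluated directly, here for the lymphoma counts of Table 6.3 under the artificial assumption that all eight cell probabilities equal 1/8.

```python
# Evaluate the multinomial probability (6.1) for the observed lymphoma counts,
# assuming (purely for illustration) equal cell probabilities p_i = 1/8.
from scipy.stats import multinomial

y = [3, 1, 12, 1, 2, 6, 1, 4]   # cell counts from Table 6.3; they sum to N = 30
p = [1 / 8] * 8                 # hypothetical cell probabilities
print(multinomial.pmf(y, n=sum(y), p=p))
```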
We can still use a log-linear model for contingency table data when the data
have been obtained by multinomial sampling. We model $\log \mu_i = \log(N p_i)$,
$i = 1, \ldots, n$, as a linear function of explanatory variables. However, such a
model must preserve $\sum_{i=1}^n \mu_i = N$, the grand total which is fixed in advance.

From (6.1), the log-likelihood for a multinomial log-linear model is
$$
\ell(\mu) = \sum_{i=1}^n y_i \log \mu_i - N \log N - \sum_{i=1}^n \log y_i! + \log N!.
$$
Therefore, the maximum likelihood estimates $\hat{\mu}$ maximise $\sum_{i=1}^n y_i \log \mu_i$
subject to the constraints $\sum_{i=1}^n \mu_i = N = \sum_{i=1}^n y_i$ (multinomial sampling)
and $\log \mu = X\beta$ (imposed by the model).
For a Poisson log-linear model,
$$
L(\mu) = \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}.
$$
Therefore,
$$
\ell(\mu) = -\sum_{i=1}^n \mu_i + \sum_{i=1}^n y_i \log \mu_i - \sum_{i=1}^n \log y_i! \tag{6.2}
$$
$$
\ell(\mu) = -\sum_{i=1}^n \exp(\log \mu_i) + \sum_{i=1}^n y_i \log \mu_i - \sum_{i=1}^n \log y_i!. \tag{6.3}
$$

Now any Poisson log-linear model with an intercept can be expressed as
$$
\log \mu_i = \alpha + \text{other terms depending on } i, \quad i = 1, \ldots, n,
$$
so that
$$
\frac{\partial}{\partial \alpha} \ell(\mu) = -\sum_{i=1}^n \exp(\log \mu_i) + \sum_{i=1}^n y_i. \tag{6.4}
$$
Setting this derivative to zero at the maximum gives
$$
\sum_{i=1}^n \hat{\mu}_i = \sum_{i=1}^n y_i. \tag{6.5}
$$

From (6.2) and (6.5), we notice that at $\alpha = \hat{\alpha}$ the log-likelihood takes the form
$$
\ell(\mu) = -\sum_{i=1}^n y_i + \sum_{i=1}^n y_i \log \mu_i - \sum_{i=1}^n \log y_i!.
$$

Hence, when we maximise the log-likelihood for a Poisson log-linear model
with intercept with respect to the other log-linear parameters, we are
maximising $\sum_{i=1}^n y_i \log \mu_i$ subject to the constraints
$\sum_{i=1}^n \mu_i = \sum_{i=1}^n y_i$ from (6.5) and $\log \mu = X\beta$ (imposed by the model).
Therefore, the maximum likelihood estimates for multinomial log-linear
parameters are identical to those for Poisson log-linear parameters.
Furthermore, the maximised log-likelihoods for both the Poisson and the
multinomial models take the form $\sum_{i=1}^n y_i \log \hat{\mu}_i$, up to additive
constants which do not involve the parameters, as functions of the log-linear
parameter estimates. Therefore any inferences based on maximised
log-likelihoods (such as likelihood ratio tests) will be the same.
Therefore, we can analyse contingency table data using Poisson log-linear
models, even when the data have been obtained by multinomial sampling.
All that is required is that the Poisson model contains an intercept term.
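A quick numerical check of (6.5) (a sketch reusing the `lymphoma` data frame and the imports from the Python example in Section 6.2) is to fit any Poisson log-linear model with an intercept and confirm that the fitted values sum to the grand total.

```python
# With an intercept in the model, the fitted means sum to the observed grand
# total, as required by (6.5), so multinomial-based inferences carry over.
fit = smf.glm("Count ~ Cell + Sex + Remis", data=lymphoma,
              family=sm.families.Poisson()).fit()
print(fit.fittedvalues.sum(), lymphoma["Count"].sum())   # both equal 30
```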

6.3.1 Product multinomial sampling


Sometimes margins other than just the grand total may be prespecified. For
example, consider the lymphoma contingency table in Table 6.3. It may have
been decided at the outset to collect data on 18 male patients and 12 fe-
male patients. Alternatively, perhaps the distribution of both the Sex and
Cell type of the patients was fixed in advance as in Table 6.4. In cases
where a margin is fixed by design, the data consist of a number of fixed total
subgroups, defined by the fixed margin. Neither Poisson nor multinomial
sampling assumptions are valid. The appropriate distribution combines a
separate, independent multinomial for each subgroup. For example, if just
the Sex margin is fixed as above, then the appropriate distribution for mod-
elling the data is two independent multinomials, one for males with N = 18
and one for females with N = 12. Each of these multinomials has four cells,
representing the cross-classification of the relevant patients by Cell Type and
Remission. Alternatively, if it is the Cell type/Sex margin which has been
fixed, then the appropriate distribution is four independent two-cell multi-
nomials (binomials) representing the remission classification for each of the
four fixed-total patient subgroups.
When the data are modelled using independent multinomials, then the joint
distribution of the cell counts Y1 , . . . , Yn is the product of terms of the same
form as (6.1), one for each fixed-total subgroup. We call this distribution
a product multinomial. Each subgroup has its own fixed total. The full joint
density is a product of n terms, as before, with each cell count appearing
exactly once.

For example, if the Sex margin is fixed for lymphoma, then the product
multinomial distribution has the form
$$
f_Y(y; p) = \begin{cases}
  N_m! \displaystyle\prod_{i=1}^4 \frac{p_{mi}^{y_{mi}}}{y_{mi}!} \;
  N_f! \displaystyle\prod_{i=1}^4 \frac{p_{fi}^{y_{fi}}}{y_{fi}!}
    & \text{if } \sum_{i=1}^4 y_{mi} = N_m \text{ and } \sum_{i=1}^4 y_{fi} = N_f \\[6pt]
  0 & \text{otherwise,}
\end{cases}
$$
where $N_m$ and $N_f$ are the two fixed marginal totals (18 and 12 respectively),
$y_{m1}, \ldots, y_{m4}$ are the cell counts for the Cell type/Remission
cross-classification for males and $y_{f1}, \ldots, y_{f4}$ are the corresponding
cell counts for females. Here $\sum_{i=1}^4 p_{mi} = \sum_{i=1}^4 p_{fi} = 1$,
$E(Y_{mi}) = N_m p_{mi}$, $i = 1, \ldots, 4$, and $E(Y_{fi}) = N_f p_{fi}$, $i = 1, \ldots, 4$.

Using similar results to those used in Section 6.3 (but not proved here),
we can analyse contingency table data using Poisson log-linear models, even
when the data has been obtained by product multinomial sampling. However,
we must ensure that the Poisson model contains a term corresponding to the
fixed margin (and all marginal terms). Then, the estimated means for the
specified margin are equal to the corresponding fixed totals.

For example, for the lymphoma dataset, for inferences obtained using a Pois-
son model to be valid when the Sex margin is fixed in advance, the Poisson
model must contain the Sex main effect (and the intercept). For inferences
obtained using a Poisson model to be valid when the Cell type/Sex margin
is fixed in advance, the Poisson model must contain the Cell type/Sex inter-
action, and all marginal terms (the Cell type main effect, the Sex main effect
and the intercept).

Therefore, when analysing product multinomial data using a Poisson
log-linear model, certain terms must be present in any model we fit. If they are
removed, the inferences would no longer be valid.
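For example (a sketch reusing the `lymphoma` data frame and the imports from the Python example in Section 6.2), a Poisson model containing the Sex main effect reproduces the fixed Sex margin.

```python
# Fit a Poisson model containing the Sex main effect (here Cell*Remis + Sex)
# and check that the fitted Sex margin equals the fixed totals (12 F, 18 M).
fit = smf.glm("Count ~ Cell * Remis + Sex", data=lymphoma,
              family=sm.families.Poisson()).fit()
fitted = lymphoma.assign(mu=fit.fittedvalues)
print(fitted.groupby("Sex")["mu"].sum())   # Female 12, Male 18
```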

6.4 Lecture 21: Interpreting log-linear models for two-way tables
Log-linear models for contingency tables enable us to determine important
properties concerning the joint distribution of the classifying variables. In
particular, the form of our preferred log-linear model for a table will have
implications for how the variables are associated.
Each combination of the classifying variables occurs exactly once in a con-
tingency table. Therefore, the model with the highest order interaction (be-
tween all the variables) and all marginal terms (all other interactions) is the
saturated model. The implication of this model is that every combination of
levels of the variables has its own mean (probability) and that there are no
relationships between these means (no structure). The variables are highly
dependent.
To consider the implications of simpler models, we first consider a two-way
r × c table where the two classifying variables R and C have r and c levels
respectively. The saturated model R ∗ C implies that the two variables are
associated. If we remove the RC interaction, we have the model R + C,

$$
\log \mu_i = \alpha + \beta_R(r_i) + \beta_C(c_i), \quad i = 1, \ldots, n,
$$

where $n = rc$ is the total number of cells in the table. Because of the
equivalence of Poisson and multinomial sampling, we can think of each cell
mean µi as equal to N pi where N is the total number of observations in the
table, and pi is a cell probability. As each combination of levels of R and
C is represented in exactly one cell, it is also convenient to replace the cell
label i by the pair of labels j and k representing the corresponding levels of
R and C respectively. Hence

$$
\log p_{jk} = \alpha + \beta_R(j) + \beta_C(k) - \log N, \quad j = 1, \ldots, r, \; k = 1, \ldots, c.
$$

Therefore

$$
P(R = j, C = k) = \exp[\alpha + \beta_R(j) + \beta_C(k) - \log N], \quad j = 1, \ldots, r, \; k = 1, \ldots, c,
$$

so

$$
1 = \sum_{j=1}^r \sum_{k=1}^c \exp[\alpha + \beta_R(j) + \beta_C(k) - \log N]
  = \frac{1}{N} \exp[\alpha] \sum_{j=1}^r \exp[\beta_R(j)] \sum_{k=1}^c \exp[\beta_C(k)].
$$

Furthermore

$$
P(R = j) = \sum_{k=1}^c \exp[\alpha + \beta_R(j) + \beta_C(k) - \log N]
         = \frac{1}{N} \exp[\alpha] \exp[\beta_R(j)] \sum_{k=1}^c \exp[\beta_C(k)], \quad j = 1, \ldots, r,
$$

and

$$
P(C = k) = \sum_{j=1}^r \exp[\alpha + \beta_R(j) + \beta_C(k) - \log N]
         = \frac{1}{N} \exp[\alpha] \exp[\beta_C(k)] \sum_{j=1}^r \exp[\beta_R(j)], \quad k = 1, \ldots, c.
$$

Therefore, multiplying these two marginal probabilities and using the
normalisation identity above (the expression in braces below equals 1),
$$
P(R = j) P(C = k)
  = \frac{1}{N} \exp[\alpha] \exp[\beta_R(j)] \exp[\beta_C(k)]
    \left\{ \frac{1}{N} \exp[\alpha] \sum_{j'=1}^r \exp[\beta_R(j')] \sum_{k'=1}^c \exp[\beta_C(k')] \right\}
  = P(R = j, C = k), \quad j = 1, \ldots, r, \; k = 1, \ldots, c.
$$

Absence of the interaction R∗C in a log-linear model implies that R and C are
independent variables. Absence of main effects is generally less interesting,
and main effects are typically not removed from a log-linear model.
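This independence result can be checked numerically. The sketch below (hypothetical Python code using pandas and statsmodels) fits the additive model Income + Satisfaction to the job data of Table 6.5 and confirms that each fitted cell probability equals the product of the corresponding marginal probabilities.

```python
# Under the additive log-linear model, fitted cell probabilities factorise
# into the product of the estimated marginal probabilities.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

income = ["<6000", "6000-15000", "15000-25000", ">25000"]
satisf = ["Very Dissatisfied", "A Little Dissatisfied",
          "Moderately Satisfied", "Very Satisfied"]
job = pd.DataFrame({
    "Income": [i for i in income for _ in satisf],
    "Satisfaction": satisf * 4,
    "Count": [20, 24, 80, 82, 22, 38, 104, 125,
              13, 28, 81, 113, 7, 18, 54, 92],
})

fit = smf.glm("Count ~ Income + Satisfaction", data=job,
              family=sm.families.Poisson()).fit()

N = job["Count"].sum()
p_hat = fit.fittedvalues / N                            # fitted cell probabilities
p_row = job.groupby("Income")["Count"].sum() / N        # estimated P(Income)
p_col = job.groupby("Satisfaction")["Count"].sum() / N  # estimated P(Satisfaction)

# Largest discrepancy between fitted P(cell) and P(Income) * P(Satisfaction):
diff = job.assign(p=p_hat).apply(
    lambda r: abs(r["p"] - p_row[r["Income"]] * p_col[r["Satisfaction"]]),
    axis=1)
print(diff.max())   # numerically zero
```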

6.5 Lecture 22: Interpreting log-linear models for multiway tables
In multiway tables, absence of a two-factor interaction does not necessarily
mean that the two variables are independent. For example, consider the
lymphoma dataset, with 3 binary classifying variables Sex (S), Cell type (C)
and Remission (R). After comparing the fit of several possible models, we
find that a reasonable log-linear model for these data is R ∗ C + C ∗ S.
Hence the interaction between remission and sex is absent. The fitted cell
probabilities from this log-linear model are shown in Table 6.7.

                           Remission
Cell Type   Sex            No       Yes
Diffuse     Female     0.1176    0.0157
Diffuse     Male       0.3824    0.0510
Nodular     Female     0.0615    0.2051
Nodular     Male       0.0385    0.1282

Table 6.7: Fitted probabilities of each cell in the lymphoma dataset.

The estimated probabilities for the two-way Sex/Remission margin (together
with the corresponding one-way margins) are shown in Table 6.8.

                 Remission
Sex           No      Yes    Sum
Female    0.1792   0.2208    0.4
Male      0.4208   0.1792    0.6
Sum          0.6      0.4    1.0

Table 6.8: Fitted marginal probabilities for the lymphoma dataset.

It can immediately be seen that this model does not imply independence of
R and S, as P̂ (R, S) ̸= P̂ (R)P̂ (S). What the model R ∗ C + C ∗ S implies
is that R is independent of S conditional on C, that is

P (R, S|C) = P (R|C)P (S|C).



Another way of expressing this is

P (R|S, C) = P (R|C),

that is, the probability of each level of R given a particular combination
of S and C, does not depend on which level S takes. Table 6.9 shows the
estimated conditional probabilities for the lymphoma data. The probability
of remission depends only on a patient’s cell type, and not on their sex.

                           Remission
Cell Type   Sex            No       Yes   P̂(R|S, C)
Diffuse     Female     0.1176    0.0157        0.12
Diffuse     Male       0.3824    0.0510        0.12
Nodular     Female     0.0615    0.2051        0.77
Nodular     Male       0.0385    0.1282        0.77

Table 6.9: Fitted probabilities of each cell and the conditional probability of
remission in the lymphoma dataset.
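To make this concrete (a sketch reusing the `lymphoma` data frame and the imports from the Python example in Section 6.2), the fitted probabilities in Tables 6.7 and 6.9 can be reproduced, and the conditional independence checked, as follows.

```python
# Fit the model R*C + C*S and compute fitted cell probabilities; the
# conditional probability of remission should then depend on Cell only.
fit = smf.glm("Count ~ Remis * Cell + Cell * Sex", data=lymphoma,
              family=sm.families.Poisson()).fit()
probs = lymphoma.assign(p=fit.fittedvalues / lymphoma["Count"].sum())

cond = (probs[probs["Remis"] == "Yes"].set_index(["Cell", "Sex"])["p"]
        / probs.groupby(["Cell", "Sex"])["p"].sum())
print(cond)   # about 0.12 for Diffuse and 0.77 for Nodular, for both sexes
```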

In general, if we have an r-way contingency table with classifying variables
X1 , . . . , Xr , then a log-linear model which does not contain the X1 ∗ X2
interaction (and therefore by the principle of marginality contains no inter-
action involving both X1 and X2 ) implies that X1 and X2 are conditionally
independent given X3 , . . . , Xr , that is

P (X1 , X2 |X3 , . . . , Xr ) = P (X1 |X3 , . . . , Xr )P (X2 |X3 , . . . , Xr ).

The proof of this is similar to the proof in the two-way case. Again, an
alternative way of expressing conditional independence is

P (X1 |X2 , X3 , . . . , Xr ) = P (X1 |X3 , . . . , Xr )

or
P (X2 |X1 , X3 , . . . , Xr ) = P (X2 |X3 , . . . , Xr ).

Although for the lymphoma dataset R and S are conditionally independent
given C, they are not marginally independent. Using the marginal cell prob-
abilities from Table 6.8, we find that the probability of remission is 0.30 for
men and 0.55 for women. Male patients have a much lower probability of
remission. The reason for this is that, although R and S are not directly
associated, they are both associated with C. Observing the estimated values
it is clear that patients with nodular cell type have a greater probability of
remission, and furthermore, that female patients are more likely to have this
cell type than males. Hence women are more likely to have remission than
men.
Suppose the factors for a three-way table are X1 , X2 and X3 . We can list
all possible models and their implications for the conditional independence
structure (a sketch comparing these fits for the lymphoma data follows the list):
1. Model 1 containing the terms X1 , X2 , X3 . All factors are mutually
independent.
2. Model 2 containing the terms X1 ∗ X2 , X3 . The factor X3 is jointly
independent of X1 and X2 .
3. Model 3 containing the terms X1 ∗ X2 , X2 ∗ X3 . The factors X1 and X3
are conditionally independent given X2 .
4. Model 4 containing the terms X1 ∗ X2 , X2 ∗ X3 , X1 ∗ X3 . There is
no conditional independence structure. This is the model without the
highest order interaction term.
5. Model 5 containing X1 ∗X2 ∗X3 . This is the saturated model. No more
simplification of dependence structure is possible.
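As a sketch of how such a comparison might be carried out in practice (reusing the `lymphoma` data frame and the imports from the Python example in Section 6.2), one representative model of each type can be fitted and compared by residual deviance.

```python
# Fit one model of each type (taking X1 = Remis, X2 = Cell, X3 = Sex) and
# report the residual deviance and degrees of freedom for each.
formulas = {
    "mutual independence":      "Count ~ Remis + Cell + Sex",
    "joint independence":       "Count ~ Remis * Cell + Sex",
    "conditional independence": "Count ~ Remis * Cell + Cell * Sex",
    "no three-way interaction": "Count ~ (Remis + Cell + Sex) ** 2",
    "saturated":                "Count ~ Remis * Cell * Sex",
}
for name, formula in formulas.items():
    fit = smf.glm(formula, data=lymphoma,
                  family=sm.families.Poisson()).fit()
    print(f"{name:26s} deviance = {fit.deviance:7.3f}, df = {fit.df_resid:g}")
```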

6.5.1 Simpson’s paradox


Conditional and marginal association of two variables can therefore often
appear somewhat different. Sometimes, the association can be ‘reversed’ so
that what looks like a positive association marginally, becomes a negative
association conditionally. This is known as Simpson’s paradox.
In 1972-74, a survey of women was carried out in an area of Newcastle. A
follow-up survey was carried out 20 years later. Among the variables observed
in the initial survey was whether or not the individual was a smoker and
among those in the follow-up survey was whether the individual was still
alive, or had died in the intervening period. A summary of the responses is
shown in Table 6.10.
Looking at this table, it appears that the non-smokers had a greater probabil-
ity of dying. However, there is an important extra variable to be considered,

Smoker    Dead   Alive   P̂(Dead | Smoker)
Yes        139     443               0.24
No         230     502               0.31

Table 6.10: Number of respondents dead or alive at follow up, by smoking
status.

Age     Smoker   Dead   Alive   P̂(Dead | Age, Smoker)
18–34   Yes         5     174                     0.03
18–34   No          6     213                     0.03
35–44   Yes        14      95                     0.13
35–44   No          7     114                     0.06
45–54   Yes        27     103                     0.21
45–54   No         12      66                     0.15
55–64   Yes        51      64                     0.44
55–64   No         40      81                     0.33
65–74   Yes        29       7                     0.81
65–74   No        101      28                     0.78
75–     Yes        13       0                     1
75–     No         64       0                     1

Table 6.11: Number of respondents dead or alive at follow up, by smoking
status and age.

related to both smoking habit and mortality – age (at the time of the initial
survey). When we consider this variable, we get Table 6.11. Conditional on
every age at outset, it is now the smokers who have a higher probability of
dying. The marginal association is reversed in the table conditional on age,
because mortality (obviously) and smoking are associated with age. There
are proportionally many fewer smokers in the older age-groups (where the
probability of death is greater).
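These calculations are easy to reproduce (hypothetical Python code using pandas; the counts are those of Table 6.11).

```python
# Reproduce the marginal death rates of Table 6.10 and the age-conditional
# rates of Table 6.11, illustrating the reversal (Simpson's paradox).
import pandas as pd

ages = ["18-34", "35-44", "45-54", "55-64", "65-74", "75+"]
dead = {"Yes": [5, 14, 27, 51, 29, 13], "No": [6, 7, 12, 40, 101, 64]}
alive = {"Yes": [174, 95, 103, 64, 7, 0], "No": [213, 114, 66, 81, 28, 0]}
smoke = pd.DataFrame(
    [{"Age": age, "Smoker": s, "Dead": dead[s][i], "Alive": alive[s][i]}
     for s in ["Yes", "No"] for i, age in enumerate(ages)])

# Marginally, smokers appear to have the lower death rate ...
marginal = smoke.groupby("Smoker")[["Dead", "Alive"]].sum()
print(marginal["Dead"] / (marginal["Dead"] + marginal["Alive"]))

# ... but conditionally on age, smokers do worse in every age group.
smoke["rate"] = smoke["Dead"] / (smoke["Dead"] + smoke["Alive"])
print(smoke.pivot(index="Age", columns="Smoker", values="rate"))
```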
When making inferences about associations between variables, it is important
that all other variables which are relevant are considered. Marginal inferences
may lead to misleading conclusions.
