MATH3091
2021-2022, Semester 2
Contents

Preface

1 Preliminaries
  1.1 Lecture 1: Introduction
    1.1.1 Elements of statistical modelling
    1.1.2 Regression models
    1.1.3 Example data to be analysed

3 Linear Models
  3.1 Lecture 5: Linear Model Theory: Revision of MATH2010
    3.1.1 The linear model
    3.1.2 Examples of linear model structure
    3.1.3 Maximum likelihood estimation
    3.1.4 Properties of the MLE
    3.1.5 Comparing linear models
Chapter 1
Preliminaries
Chapter 2

Likelihood Based Statistical Theory
The likelihood function is
\[
L(\theta; y) = f_Y(y; \theta).
\]
Notes
1. Frequently it is more convenient to consider the log-likelihood function
ℓ(θ) = log L(θ).
2. Nothing in the definition of the likelihood requires y1 , . . . , yn to be
observations of independent random variables, although we shall fre-
quently make this assumption.
3. Any factors which depend on y1, . . . , yn alone (and not on θ) can be ignored when writing down the likelihood. Such factors give no information about θ.
Example. Suppose y1, . . . , yn are observations of independent N(µ, σ²) random variables. The likelihood is
\[
\begin{aligned}
L(\mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}(y_i - \mu)^2 \right) \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) \\
&\propto (\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right).
\end{aligned}
\]
The log-likelihood is
\[
\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.
\]
Differentiating with respect to µ gives
\[
\frac{\partial}{\partial\mu}\,\ell(\mu, \sigma^2) = \frac{n(\bar y - \mu)}{\sigma^2},
\]
so (µ̂, σ̂²) solve
\[
\frac{n(\bar y - \hat\mu)}{\hat\sigma^2} = 0. \tag{2.1}
\]
Similarly,
\[
\frac{\partial}{\partial\sigma^2}\,\ell(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(y_i - \mu)^2,
\]
so
\[
-\frac{n}{2\hat\sigma^2} + \frac{1}{2(\hat\sigma^2)^2}\sum_{i=1}^{n}(y_i - \hat\mu)^2 = 0. \tag{2.2}
\]
Equation (2.1) gives µ̂ = ȳ, and substituting this into (2.2) gives
\[
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2.
\]
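As a quick numerical check, the closed-form MLEs above can be compared with a direct numerical maximisation of the log-likelihood in R. This is only a sketch: the data y are simulated purely for illustration, and the log-parameterisation of σ² is just a convenient way to keep the optimiser in the valid region.

# Simulated data, purely for illustration
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)
n <- length(y)

# Closed-form MLEs: mu-hat = sample mean, sigma2-hat uses divisor n
mu_hat     <- mean(y)
sigma2_hat <- sum((y - mu_hat)^2) / n

# Numerical check: maximise the log-likelihood directly
negloglik <- function(par) {
  mu <- par[1]; sigma2 <- exp(par[2])   # log-parameterisation keeps sigma2 > 0
  -sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
c(fit$par[1], exp(fit$par[2]))          # should be close to c(mu_hat, sigma2_hat)
c(mu_hat, sigma2_hat)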
and u(θ) ≡ [u_1(θ), . . . , u_p(θ)]^T. Then we call u(θ) the vector of scores or score vector. When p = 1 and θ = (θ), the score is the scalar defined as
\[
u(\theta) \equiv \frac{\partial}{\partial\theta}\,\ell(\theta).
\]
At its maximum, the log-likelihood has zero gradient, so the MLE θ̂ satisfies
\[
u(\hat\theta) = 0,
\]
that is,
\[
u_i(\hat\theta) = 0, \qquad i = 1, \dots, p.
\]
An important property of the score vector is that it has expectation zero when the expectation is taken at the true value of θ: E[U(θ)] = 0.

Proof. Our proof is for continuous y; in the discrete case, replace integrals by sums. For each i = 1, . . . , p,
\[
\begin{aligned}
E[U_i(\theta)] &= \int U_i(\theta)\, f_Y(y; \theta)\, dy \\
&= \int \frac{\partial}{\partial\theta_i}\,\ell(\theta)\, f_Y(y; \theta)\, dy \\
&= \int \frac{\partial}{\partial\theta_i}\log f_Y(y; \theta)\, f_Y(y; \theta)\, dy \\
&= \int \frac{\frac{\partial}{\partial\theta_i} f_Y(y; \theta)}{f_Y(y; \theta)}\, f_Y(y; \theta)\, dy \\
&= \int \frac{\partial}{\partial\theta_i} f_Y(y; \theta)\, dy \\
&= \frac{\partial}{\partial\theta_i}\int f_Y(y; \theta)\, dy \\
&= \frac{\partial}{\partial\theta_i}\, 1 = 0,
\end{aligned}
\]
as required.
Here the expectation is taken with respect to the true density f_Y(y; θ), evaluated at the unknown true value of θ; otherwise the proof does not hold.
The Hessian matrix H(θ) of the log-likelihood has entries
\[
[H(\theta)]_{ij} \equiv \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\,\ell(\theta), \qquad i = 1, \dots, p;\ j = 1, \dots, p,
\]
and the expected (Fisher) information matrix I(θ) has entries
\[
[I(\theta)]_{ij} = E_\theta\bigl(-[H(\theta)]_{ij}\bigr), \qquad i = 1, \dots, p;\ j = 1, \dots, p.
\]
Here E_θ indicates that the expectation is taken with respect to the distribution of Y at the same value of θ as that at which the matrix is evaluated.
An important result in likelihood theory is that the variance-covariance matrix of the score vector (with respect to θ) is equal to the expected information matrix,
\[
\operatorname{Var}_\theta[U(\theta)] = I(\theta),
\]
provided that
1. the variance exists, and
2. the sample space for Y does not depend on θ.
Proof. Our proof is for continuous y; in the discrete case, replace integrals by sums. For each i = 1, . . . , p and j = 1, . . . , p, since E[U(θ)] = 0,
\[
\operatorname{Var}_\theta[U(\theta)]_{ij} = E[U_i(\theta)U_j(\theta)]
= \int \frac{\partial}{\partial\theta_i}\log f_Y(y; \theta)\,\frac{\partial}{\partial\theta_j}\log f_Y(y; \theta)\, f_Y(y; \theta)\, dy.
\]
Now
\[
\begin{aligned}
[I(\theta)]_{ij} &= E_\theta\left[-\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\,\ell(\theta)\right] \\
&= -\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_Y(y; \theta)\, f_Y(y; \theta)\, dy \\
&= -\int \frac{\partial}{\partial\theta_i}\left[\frac{\frac{\partial}{\partial\theta_j} f_Y(y; \theta)}{f_Y(y; \theta)}\right] f_Y(y; \theta)\, dy \\
&= -\int \left[\frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} f_Y(y; \theta)}{f_Y(y; \theta)} - \frac{\frac{\partial}{\partial\theta_i} f_Y(y; \theta)\,\frac{\partial}{\partial\theta_j} f_Y(y; \theta)}{f_Y(y; \theta)^2}\right] f_Y(y; \theta)\, dy \\
&= -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\int f_Y(y; \theta)\, dy + \int \frac{1}{f_Y(y; \theta)}\,\frac{\partial}{\partial\theta_i} f_Y(y; \theta)\,\frac{\partial}{\partial\theta_j} f_Y(y; \theta)\, dy \\
&= \operatorname{Var}_\theta[U(\theta)]_{ij},
\end{aligned}
\]
as required: the first term is zero, because the integral of the density is 1 and does not depend on θ, and the remaining term is exactly the integral displayed above.
For the N(µ, σ²) example above, the components of the score are
\[
u_1(\mu, \sigma^2) = \frac{n(\bar y - \mu)}{\sigma^2}, \qquad
u_2(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(y_i - \mu)^2.
\]
Therefore
\[
-H(\mu, \sigma^2) =
\begin{pmatrix}
\dfrac{n}{\sigma^2} & \dfrac{n(\bar y - \mu)}{(\sigma^2)^2} \\[2ex]
\dfrac{n(\bar y - \mu)}{(\sigma^2)^2} & \dfrac{1}{(\sigma^2)^3}\displaystyle\sum_{i=1}^{n}(y_i - \mu)^2 - \dfrac{n}{2(\sigma^2)^2}
\end{pmatrix}
\]
and
\[
I(\mu, \sigma^2) =
\begin{pmatrix}
\dfrac{n}{\sigma^2} & 0 \\[1ex]
0 & \dfrac{n}{2(\sigma^2)^2}
\end{pmatrix}.
\]
\[
\text{s.e.}(\hat\theta) = \frac{1}{I(\hat\theta)^{1/2}}.
\]
Asymptotically, each component θ̂_i has a N(θ_i, [I(θ)^{-1}]_{ii}) distribution, and we can find z_{1-α/2} such that
\[
P\left( -z_{1-\alpha/2} \le \frac{\hat\theta_i - \theta_i}{[I(\theta)^{-1}]_{ii}^{1/2}} \le z_{1-\alpha/2} \right) = 1 - \alpha.
\]
Therefore
\[
P\left( \hat\theta_i - z_{1-\alpha/2}\,[I(\theta)^{-1}]_{ii}^{1/2} \le \theta_i \le \hat\theta_i + z_{1-\alpha/2}\,[I(\theta)^{-1}]_{ii}^{1/2} \right) = 1 - \alpha.
\]
The endpoints of this interval cannot be evaluated, because they also depend on the unknown parameter vector θ. However, if we replace I(θ) by I(θ̂), the information evaluated at the MLE, we obtain the approximate large-sample 100(1 − α)% confidence interval
\[
\left[ \hat\theta_i - z_{1-\alpha/2}\,[I(\hat\theta)^{-1}]_{ii}^{1/2},\ \ \hat\theta_i + z_{1-\alpha/2}\,[I(\hat\theta)^{-1}]_{ii}^{1/2} \right].
\]
For α = 0.1, 0.05, 0.01, z_{1−α/2} = 1.64, 1.96, 2.58 respectively.
For example, for a sample of independent Bernoulli(p) observations the MLE is p̂ = ȳ, and an approximate 95% confidence interval for p is
\[
\left[ \hat p - 1.96\,[\hat p(1-\hat p)/n]^{1/2},\ \hat p + 1.96\,[\hat p(1-\hat p)/n]^{1/2} \right]
= \left[ \bar y - 1.96\,[\bar y(1-\bar y)/n]^{1/2},\ \bar y + 1.96\,[\bar y(1-\bar y)/n]^{1/2} \right].
\]
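This interval is straightforward to compute in R. A minimal sketch, using simulated Bernoulli data (the sample size and success probability below are purely illustrative), and taking the 0.975 quantile from qnorm rather than hard-coding 1.96:

set.seed(2)
y <- rbinom(100, size = 1, prob = 0.3)    # illustrative Bernoulli observations
n <- length(y)
p_hat <- mean(y)                          # MLE of p
se    <- sqrt(p_hat * (1 - p_hat) / n)    # large-sample standard error
z     <- qnorm(0.975)                     # approximately 1.96 for a 95% interval
c(lower = p_hat - z * se, upper = p_hat + z * se)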
where k is chosen such that
\[
\max_{\theta \in \Theta^{(0)}} P(y \in C; \theta) = \alpha.
\]
Write
\[
L_{01} \equiv 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right)
\]
for the log-likelihood ratio test statistic. Provided that H0 is nested within
H1 , the following result provides a useful large-n approximation to the dis-
tribution of L01 .
Proof. First we note that in the case where θ is one-dimensional and θ = (θ),
a Taylor series expansion of ℓ(θ) around the MLE θ̂ gives
\[
\ell(\theta) = \ell(\hat\theta) + (\theta - \hat\theta)\,U(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)^2\,U'(\hat\theta) + \dots
\]
\[
\begin{aligned}
L_{01} &\equiv 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right) \\
&= 2\log L(\hat\theta^{(1)}) - 2\log L(\hat\theta^{(0)}) \\
&= 2\left[\log L(\hat\theta^{(1)}) - \log L(\theta)\right] - 2\left[\log L(\hat\theta^{(0)}) - \log L(\theta)\right] \\
&= L_1 - L_0.
\end{aligned}
\]
It can be shown (although we will not do so here) that under H0, L01 and L0 are independent. It can also be shown that under H0 the difference L1 − L0 can be expressed as a quadratic form of normal random variables. Therefore, it follows that under H0 the log-likelihood ratio statistic L01 has an asymptotic χ²_{d1−d0} distribution.
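In practice, once L01 has been computed from the two maximised log-likelihoods, the p-value comes from the χ² distribution with d1 − d0 degrees of freedom. A minimal R sketch; the numerical values below are illustrative only:

ell1 <- -120.4    # illustrative maximised log-likelihood under H1
ell0 <- -124.9    # illustrative maximised log-likelihood under H0
d1 <- 5; d0 <- 3  # illustrative numbers of free parameters

L01 <- 2 * (ell1 - ell0)                      # log-likelihood ratio statistic
pchisq(L01, df = d1 - d0, lower.tail = FALSE) # p-value: reject H0 if this is small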
end up choosing complicated models, which fit the observed data very closely,
but do not meet our requirement of parsimony.
For a given model depending on parameters θ ∈ Rp , let ℓ̂ = ℓ(θ̂) be the log-
likelihood function for that model evaluated at the MLE θ̂. It is not sensible
to choose between models by maximising ℓ̂ directly; instead it is common
to choose the model which maximises a criterion of the form
ℓ̂ − penalty,
where the penalty term will be large for complex models, and small for simple
models.
Equivalently, we may choose between models by minimising a criterion of the
form
−2ℓ̂ + penalty.
By convention, many commonly-used criteria for model comparison take this form. For instance, the Akaike information criterion (AIC) is
\[
\text{AIC} = -2\hat\ell + 2p,
\]
where p is the number of parameters in the model; we prefer the model with the smaller AIC.
Chapter 3

Linear Models
\[
Y_i = x_i^T\beta + \epsilon_i = [X\beta]_i + \epsilon_i, \qquad i = 1, \dots, n. \tag{3.1}
\]
Y = Xβ + ϵ. (3.2)
where ϵ = (ϵ1 , ϵ2 , . . . , ϵn )T .
The n × p matrix X consists of known (observed) constants and is called the
design matrix. The ith row of X is xTi , the explanatory data corresponding
to the ith observation of the response. The jth column of X contains the n
observations of the jth explanatory variable.
The error vector ϵ has a multivariate normal distribution with mean vector 0
and variance covariance matrix σ 2 I, since Var(ϵi ) = σ 2 , and Cov(ϵi , ϵj ) = 0,
as ϵ1 , . . . , ϵn are independent of one another. It follows from (3.2) that the
distribution of Y is multivariate normal with mean vector Xβ and variance
covariance matrix σ 2 I, i.e. Y ∼ N (Xβ, σ 2 I).
Yi = β0 + ϵi , ϵi ∼ N (0, σ 2 ), i = 1, . . . , n,
so
\[
X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad \beta = (\beta_0).
\]
There is one explanatory variable: the dummy variable. In practice, this variable is present in all models.
Yi = β0 + β1 xi + ϵi , ϵi ∼ N (0, σ 2 ) i = 1, . . . , n
so
\[
X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}.
\]
There are two explanatory variables: the dummy variable and one ‘real’ vari-
able.
so
\[
X = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^{p-1}
\end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.
\]
There are p explanatory variables: the dummy variable and one ‘real’ variable,
transformed to p − 1 variables.
so
\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.
\]
There are p explanatory variables: the dummy variable and p − 1 ‘real’ vari-
ables.
\[
Y_i = \mu_{x_i} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \dots, n,
\]
so that the mean of Yi is the same for all observations in the same category,
but differs for different categories.
We could rewrite this model to include an intercept, as
\[
Y_i = \beta_0 + \beta_{x_i} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \dots, n,
\]
zi = (zi1 , . . . , zik )T ,
where
\[
z_{ij} = \begin{cases} 1 & \text{if } x_i = j, \\ 0 & \text{otherwise.} \end{cases}
\]
zi is sometimes called the one-hot encoding of xi , as it contains precisely
one 1 (corresponding to the category xi ), and is 0 everywhere else. We then
have
Yi = β0 + β1 zi1 + β2 zi2 + . . . + βk zik + ϵi ,
so
\[
X = \begin{pmatrix}
1 & z_{11} & z_{12} & \cdots & z_{1k} \\
1 & z_{21} & z_{22} & \cdots & z_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & z_{n1} & z_{n2} & \cdots & z_{nk}
\end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},
\]
where each row of X will have two ones, and the remaining entries will be
zero.
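In R, design matrices of the forms above are produced by model.matrix. Note that by default a factor is expanded with a baseline (treatment) constraint rather than the full one-hot coding, which avoids the linear dependence between the intercept column and the k indicator columns. A small sketch with made-up data:

# Illustrative data: one numeric covariate and one categorical variable with 3 levels
dat <- data.frame(x = c(1.2, 0.7, 2.3, 1.9),
                  g = factor(c("a", "b", "c", "a")))

model.matrix(~ x, data = dat)                        # intercept column plus x
model.matrix(~ poly(x, 2, raw = TRUE), data = dat)   # polynomial regression columns
model.matrix(~ g, data = dat)                        # intercept + (k - 1) dummy columns
model.matrix(~ g - 1, data = dat)                    # full one-hot coding, no intercept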
\[
Y_i = \beta_0 + \beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \dots, n,
\]
where
\[
\beta = \left( \beta_0, \beta^{(1)}_1, \dots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \dots, \beta^{(2)}_{k_2} \right)^T,
\]
\[
Y_i = \beta_0 + \beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \beta^{(1,2)}_{x_{i1}, x_{i2}} + \epsilon_i,
\]
where
\[
\beta = \left( \beta_0, \beta^{(1)}_1, \dots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \dots, \beta^{(2)}_{k_2}, \beta^{(1,2)}_{1,1}, \beta^{(1,2)}_{1,2}, \dots, \beta^{(1,2)}_{k_1,k_2} \right)^T.
\]
The terms β^{(1,2)}_{j1,j2} are called the interaction effects. This model is equivalent to
\[
Y_i = \mu_{x_{i1}, x_{i2}} + \epsilon_i,
\]
βq+1 = βq+2 = · · · = βp = 0.
Now, a likelihood ratio test of H0 against H1 has a critical region of the form
\[
C = \left\{ y : \frac{\max_{(\beta,\sigma^2) \in \Theta^{(1)}} L(\beta, \sigma^2)}{\max_{(\beta,\sigma^2) \in \Theta^{(0)}} L(\beta, \sigma^2)} > k \right\},
\]
where k is chosen such that
\[
\max_{(\beta,\sigma^2) \in \Theta^{(0)}} P(y \in C; \beta, \sigma^2) = \alpha.
\]
We have
\[
\max_{\beta,\sigma^2} L(\beta, \sigma^2)
= (2\pi D/n)^{-\frac{n}{2}}\exp\left( -\frac{n}{2D}\sum_{i=1}^{n}\left( y_i - x_i^T\hat\beta \right)^2 \right)
= (2\pi D/n)^{-\frac{n}{2}}\exp\left( -\frac{n}{2} \right),
\]
where D = Σ_{i=1}^n (y_i − x_i^Tβ̂)² is the deviance (residual sum of squares) of the model.
This form applies for both θ ∈ Θ(0) and θ ∈ Θ(1) , with only the model
changing. Let the deviances under models H0 and H1 be denoted by D0 and
D1 respectively. Then the critical region for the likelihood ratio test is of the
form
\[
\frac{(2\pi D_1/n)^{-\frac{n}{2}}}{(2\pi D_0/n)^{-\frac{n}{2}}} > k,
\]
so
\[
\left( \frac{D_0}{D_1} \right)^{\frac{n}{2}} > k,
\]
and
\[
\left( \frac{D_0}{D_1} - 1 \right)\frac{n-p}{p-q} > k',
\]
that is,
\[
\frac{(D_0 - D_1)/(p - q)}{D_1/(n - p)} > k'.
\]
We refer to the left hand side of this inequality as the F -statistic. We reject
the simpler model H0 in favour of the more complex model H1 if F is ‘too
large’.
As we have required H0 to be nested in H1 , F ∼ Fp−q, n−p when H0 is true.
To see this, note that
\[
\frac{D_0}{\sigma^2} = \frac{D_0 - D_1}{\sigma^2} + \frac{D_1}{\sigma^2}.
\]
Furthermore, under H0, D1/σ² ∼ χ²_{n−p} and D0/σ² ∼ χ²_{n−q}. It is possible to show (although we will not do so here) that under H0, (D0 − D1)/σ² and D1/σ² are independent. Therefore, from the properties of the chi-squared distribution, it follows that under H0, (D0 − D1)/σ² ∼ χ²_{p−q}, and F has an F_{p−q, n−p} distribution.
Therefore, the precise critical region can be evaluated given the size, α, of
the test. We reject H0 in favour of H1 when
\[
F = \frac{(D_0 - D_1)/(p - q)}{D_1/(n - p)} > k,
\]
where k is the 100(1 − α)% point of the F_{p−q, n−p} distribution.
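In R, this F-test for comparing nested linear models is carried out by anova applied to two lm fits; the deviances D0 and D1 are the residual sums of squares of the two models. A sketch with simulated data (generated so that the extra covariate has no real effect):

set.seed(3)
x1 <- rnorm(40); x2 <- rnorm(40)
y  <- 1 + 2 * x1 + rnorm(40)     # data generated without an x2 effect

fit0 <- lm(y ~ x1)               # H0: simpler model (q parameters)
fit1 <- lm(y ~ x1 + x2)          # H1: more complex model (p parameters)

anova(fit0, fit1)                # F = ((D0 - D1)/(p - q)) / (D1/(n - p))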
Chapter 4

Linear Mixed Models

In this chapter, we introduce linear mixed models (LMMs), which extend linear models with random effects. LMMs are a method for analysing complex datasets that contain features such as a multilevel or hierarchical structure, longitudinal measurements, or within-group correlation or dependence, where the linear models of Chapter 3 cannot be applied. We will study both the general concepts and interpretation of LMMs and some simple theory.
where, as in the linear model, x_ij is the p × 1 vector of explanatory variables corresponding to the j-th observation in the i-th group, associated with the fixed effects, and β = (β_1, . . . , β_p)^T is the p × 1 parameter vector of fixed effects. We assume the random errors ϵ_ij ∼ N(0, σ²) are all independent.
Unlike in the linear model, we also have u_ij, a q × 1 vector of explanatory variables for the j-th observation in the i-th group, associated with the random effects. Usually u_ij consists either of new covariates or of a subset of x_ij. We write γ_i = (γ_i1, . . . , γ_iq)^T for the q × 1 vector of random effects in the i-th group.
As introduced earlier, unlike β, which is a vector of fixed parameters of interest, γ_i is a vector of random coefficients. For the rest of this section, we assume that
γ_i ∼ N(0, D),
Yij = β0 + γi + ϵij , i = 1, . . . , m; j = 1, . . . , ni ,
In this case, the variance parameters are θ = (σ_γ, σ_ϵ). This model includes only an intercept, with no covariates in the fixed effects. It can be shown that the correlation between any two responses within the i-th group {Y_i1, Y_i2, . . . , Y_{in_i}} is
\[
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \frac{\sigma_\gamma^2}{\sigma_\gamma^2 + \sigma_\epsilon^2}, \qquad j \ne k.
\]
Thus, the random effect γ_i introduces correlation between within-group responses. If there is no random effect (i.e., σ_γ = 0), there is no such correlation.
Yi = Xi β + Ui γi + ϵi
= Xi β + ϵ∗i , i = 1, . . . , m,
Yi ∼ N (Xi β, Vi ).
If we want to present all m groups in one matrix formula, we can simply write
\[
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_m \end{pmatrix} \in \mathbb{R}^{n}, \qquad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix} \in \mathbb{R}^{n \times p},
\]
and
\[
\gamma = \begin{pmatrix} \gamma_1 \\ \vdots \\ \gamma_m \end{pmatrix} \in \mathbb{R}^{mq}, \qquad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_m \end{pmatrix} \in \mathbb{R}^{n}.
\]
Moreover, let
\[
U = \begin{pmatrix}
U_1 & 0_{n_1 \times q} & \cdots & 0_{n_1 \times q} \\
0_{n_2 \times q} & U_2 & & \\
\vdots & & \ddots & \\
0_{n_m \times q} & & & U_m
\end{pmatrix} \in \mathbb{R}^{n \times mq},
\]
so that
Y = Xβ + U γ + ϵ, (4.2)
Writing ϵ* = Uγ + ϵ, we have Y = Xβ + ϵ*.
4.2.1 Estimation of β
We rewrite the linear mixed model in equation (4.2) as:
Y = Xβ + ϵ∗ , ϵ∗ ∼ N (0, V ). (4.3)
Premultiplying (4.3) by V^{-1/2} gives the transformed model
\[
Y' = X'\beta + \epsilon', \qquad \epsilon' \sim N(0, I_n), \tag{4.5}
\]
where Y' = V^{-1/2}Y and X' = V^{-1/2}X.
The least squares estimate in the transformed model is
\[
\hat\beta = (X'^T X')^{-1} X'^T y' = (X^T V^{-1} X)^{-1} X^T V^{-1} y. \tag{4.6}
\]
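Equation (4.6) is the generalised least squares estimator and can be computed directly whenever V is known. A minimal sketch with made-up matrices; in practice V would be built from estimated variance components rather than chosen by hand:

# Illustrative quantities only
set.seed(4)
n <- 6
X <- cbind(1, rnorm(n))                      # n x p design matrix
V <- diag(n) + 0.5 * tcrossprod(rep(1, n))   # an illustrative positive-definite V
y <- rnorm(n)

Vinv     <- solve(V)
beta_hat <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)  # (X'V^-1 X)^-1 X'V^-1 y
beta_hat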
4.2.2 “Estimation” of γ
The random coefficients γ are not parameters of interest. Nevertheless, we sometimes need an "estimator" γ̂ as an intermediate quantity in our statistical inference. To this end, note that
As a result,
\[
\begin{pmatrix} Y \\ \gamma \end{pmatrix} \sim N\left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix}, \begin{pmatrix} V & UG \\ GU^T & G \end{pmatrix} \right).
\]
Recall the following property of the multivariate normal distribution. Suppose X is partitioned as X = (X_1^T, X_2^T)^T such that X_1 ∼ N(µ_1, Σ_11), X_2 ∼ N(µ_2, Σ_22) and Cov(X_i, X_j) = Σ_ij for i, j ∈ {1, 2}. The conditional distribution of X_2 given known values for X_1 = x_1 is multivariate normal, N(µ_{X_2|x_1}, Σ_{X_2|x_1}), where
\[
\mu_{X_2|x_1} = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1), \qquad
\Sigma_{X_2|x_1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.
\]
Similarly, writing Z = X_2 − Σ_21Σ_11^{-1}X_1 (which is independent of X_1), we have
\[
\begin{aligned}
\operatorname{Var}(X_2 \mid x_1) &= \operatorname{Var}\left( Z + \Sigma_{21}\Sigma_{11}^{-1}x_1 \mid x_1 \right) \\
&= \operatorname{Var}(Z \mid x_1) + \operatorname{Var}\left( \Sigma_{21}\Sigma_{11}^{-1}x_1 \mid x_1 \right) + 2\,\Sigma_{21}\Sigma_{11}^{-1}\operatorname{Cov}(Z, x_1 \mid x_1) \\
&= \operatorname{Var}(Z \mid x_1) \\
&= \operatorname{Var}(Z) \\
&= \Sigma_{22} + \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} - 2\,\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \\
&= \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},
\end{aligned}
\]
When β is not given, we can replace it with the estimator β̂ derived in (4.6). It is straightforward to see that E(γ̂ | Y = y) = E(γ | Y = y); γ̂ is sometimes referred to as the maximum a posteriori (MAP) estimate, or the predicted random effects.
In the exercise class we will prove that the above estimators β̂ and γ̂ are also the joint maximisers of the log-likelihood of (Y^T, γ^T) with respect to β and γ, meaning that they are MLEs (when θ is known).
Therefore, we have
\[
V = UGU^T + \sigma^2 I_n =
\begin{pmatrix}
X_1X_1^T + \sigma^2 I_{n_1} & 0_{n_1 \times n_2} & \cdots & 0_{n_1 \times n_m} \\
0_{n_2 \times n_1} & X_2X_2^T + \sigma^2 I_{n_2} & & \\
\vdots & & \ddots & \\
0_{n_m \times n_1} & & & X_mX_m^T + \sigma^2 I_{n_m}
\end{pmatrix}
\]
and
\[
\hat\gamma_i = (\hat\gamma_{i1}, \hat\gamma_{i2})^T = X_i^T\left( X_iX_i^T + \sigma^2 I_{n_i} \right)^{-1}\left( y_i - X_i\hat\beta \right).
\]
Here we take an additional log and exponential in order to apply the following trick: consider the Taylor expansion of log f(y, γ; β, θ) about γ̂, where γ̂ is the estimator obtained in (4.7), which is also the maximiser of f(y, γ; β, θ), as we proved in the exercise class. Hence we have
\[
\begin{aligned}
\log f(y, \gamma; \beta, \theta)
&= \log f(y, \hat\gamma; \beta, \theta) + \frac{1}{2}(\gamma - \hat\gamma)^T \left.\frac{\partial^2 \log f(y, \gamma; \beta, \theta)}{\partial\gamma\,\partial\gamma^T}\right|_{\hat\gamma} (\gamma - \hat\gamma) \\
&= \log f(y, \hat\gamma; \beta, \theta) - \frac{1}{2}(\gamma - \hat\gamma)^T \left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma).
\end{aligned}
\]
There are no further remainder terms, because the higher-order derivatives of log f(y, γ; β, θ) with respect to γ are exactly zero: it is a polynomial of order 2 in γ. Hence, we arrive at
\[
\begin{aligned}
L(\beta, \theta) &= \int \exp\left[ \log f(y, \hat\gamma; \beta, \theta) - \frac{1}{2}(\gamma - \hat\gamma)^T\left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma) \right] d\gamma \\
&= f(y, \hat\gamma; \beta, \theta) \int \exp\left[ -\frac{1}{2}(\gamma - \hat\gamma)^T\left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma) \right] d\gamma \\
&= f(y \mid \hat\gamma; \beta, \theta)\, f(\hat\gamma; \beta, \theta) \int \exp\left[ -\frac{1}{2}(\gamma - \hat\gamma)^T\left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma) \right] d\gamma.
\end{aligned}
\]
Consider
\[
(2\pi)^{-mq/2} \left| \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right|^{1/2} \exp\left[ -\frac{1}{2}(\gamma - \hat\gamma)^T\left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma) \right],
\]
which is the probability density function of a multivariate normal distribution with mean γ̂ and variance (U^TU/σ² + G_θ^{-1})^{-1}, so it must integrate to 1. This implies:
\[
\int \exp\left[ -\frac{1}{2}(\gamma - \hat\gamma)^T\left( \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right)(\gamma - \hat\gamma) \right] d\gamma
= \frac{(2\pi)^{mq/2}}{\left| \dfrac{U^TU}{\sigma^2} + G_\theta^{-1} \right|^{1/2}}. \tag{4.8}
\]
Combining the formulas for f(y | γ; β, θ) and f(γ; β, θ) with (4.8), we finally have
\[
L(\beta, \theta) = (2\pi\sigma^2)^{-n/2}\, |G_\theta|^{-1/2} \left| \frac{U^TU}{\sigma^2} + G_\theta^{-1} \right|^{-1/2}
\exp\left( -\frac{\hat\gamma^T G_\theta^{-1}\hat\gamma}{2} \right)
\exp\left[ -\frac{(y - X\beta - U\hat\gamma)^T(y - X\beta - U\hat\gamma)}{2\sigma^2} \right]. \tag{4.9}
\]
We know this is a biased estimator, as E(σ̂²) = ((n − p)/n)σ², and in practice we often use the unbiased alternative
\[
\tilde\sigma^2 = \frac{1}{n-p}\sum_{i=1}^{n}\left( y_i - x_i^T\hat\beta \right)^2.
\]
\[
f(y; \theta) = \int f(y, \beta; \theta)\, d\beta = \int f(y \mid \beta; \theta)\, f(\beta)\, d\beta = \int f(y; \beta, \theta)\, d\beta.
\]
Using exactly the same techniques as in Section 4.3.1, we can obtain the restricted log-likelihood function ℓ_r(θ); the details are omitted here. The resulting maximiser is denoted θ̂_r, and we can then calculate β̂_r based on the value of θ̂_r.
REML accounts for the loss of degrees of freedom caused by estimating the fixed effects, and so gives less biased estimates of the random-effects variances. The REML estimates of θ are invariant to the value of β, and are less sensitive to outliers in the data than the ML estimates.
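In the lme4 package (assuming it is installed), the choice between ML and REML is controlled by the REML argument of lmer: ML fits are needed when comparing models with different fixed effects, while REML is usually preferred when reporting variance components. The data frame dat and its columns below are hypothetical.

# A sketch assuming lme4 and a hypothetical data frame `dat` with y, x and group
library(lme4)

fit_reml <- lmer(y ~ x + (1 | group), data = dat, REML = TRUE)   # REML (the default)
fit_ml   <- lmer(y ~ x + (1 | group), data = dat, REML = FALSE)  # maximum likelihood

VarCorr(fit_reml)   # REML variance components (less biased)
VarCorr(fit_ml)     # ML variance components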
where Î_p and Î_r are the observed information matrices of the log profile likelihood and the log restricted likelihood respectively, i.e.,
\[
\hat I_p = -\left. \frac{\partial^2 \ell_p(\theta)}{(\partial\theta)^2} \right|_{\hat\theta_p}
\quad\text{and}\quad
\hat I_r = -\left. \frac{\partial^2 \ell_r(\theta)}{(\partial\theta)^2} \right|_{\hat\theta_r}.
\]
Hence
\[
\left[ C(X^TV^{-1}X)^{-1}C^T \right]^{-1/2}\left( C\hat\beta - c \right) \sim N(0, I_r).
\]
This implies
\[
W = \left( C\hat\beta - c \right)^T \left[ C(X^TV^{-1}X)^{-1}C^T \right]^{-1}\left( C\hat\beta - c \right) \sim \chi^2_r.
\]
If H1 is true, then [C(X^TV^{-1}X)^{-1}C^T]^{-1/2}(Cβ̂ − c) has a normal distribution with mean [C(X^TV^{-1}X)^{-1}C^T]^{-1/2}(Cβ − c) and identity variance, so the distribution of W is shifted to the right, by an amount that grows with (Cβ − c)^T[C(X^TV^{-1}X)^{-1}C^T]^{-1}(Cβ − c).
Therefore, we can employ W as the test statistic for testing H0 against H1. This is the so-called Wald test. We reject H0 if W > χ²_{r,1−α}, where χ²_{r,1−α} is the 100(1 − α)% point of the χ²_r distribution.
Again, if θ is unknown, we can use its estimate θ̂_p, replacing V and β̂ in the expressions above with V(θ̂_p) and β̂(θ̂_p), respectively. Note that the REML method cannot be used to compare models with different fixed-effect structures, because ℓ_r(θ) is not comparable between models with different fixed effects.
For testing whether the last p − q fixed effects are zero, as in Section 3.1.5, we take C to be the matrix selecting those coefficients and c = 0, with r = p − q.
Chapter 5

Generalised Linear Models
and hence the mean and variance of a random variable with probability
density function (or probability function) of the form (5.1) are b′ (θ) and
a(ϕ)b′′ (θ) respectively.
The variance is the product of two functions; b′′ (θ) depends on the canonical
parameter θ (and hence µ) only and is called the variance function (V (µ) ≡
b′′ (θ)); a(ϕ) is sometimes of the form a(ϕ) = σ 2 /w where w is a known weight
and σ 2 is called the dispersion parameter or scale parameter.
\[
\begin{aligned}
f_Y(y; \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}(y - \mu)^2 \right), \qquad y \in \mathbb{R};\ \mu \in \mathbb{R} \\
&= \exp\left( \frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2}\left[ \frac{y^2}{\sigma^2} + \log(2\pi\sigma^2) \right] \right).
\end{aligned}
\]
This is in the form (5.1), with θ = µ, b(θ) = θ²/2, a(ϕ) = σ² and c(y, ϕ) = −½[y²/σ² + log(2πσ²)].
Therefore
E(Y ) = b′ (θ) = θ = µ,
Var(Y ) = a(ϕ)b′′ (θ) = σ 2
and the variance function is
V (µ) = 1.
\[
\begin{aligned}
f_Y(y; \lambda) &= \frac{\exp(-\lambda)\lambda^y}{y!}, \qquad y \in \{0, 1, \dots\};\ \lambda \in \mathbb{R}^{+} \\
&= \exp\left( y\log\lambda - \lambda - \log y! \right).
\end{aligned}
\]
This is in the form (5.1), with θ = log λ, b(θ) = exp θ, a(ϕ) = 1 and c(y, ϕ) =
− log y!. Therefore
E(Y ) = b′ (θ) = exp θ = λ,
Var(Y ) = a(ϕ)b′′ (θ) = exp θ = λ
and the variance function is
V (µ) = µ.
This is in the form (5.1), with θ = log(p/(1 − p)), b(θ) = log(1 + exp θ), a(ϕ) = 1 and c(y, ϕ) = 0. Therefore
\[
E(Y) = b'(\theta) = \frac{\exp\theta}{1 + \exp\theta} = p, \qquad
\operatorname{Var}(Y) = a(\phi)\,b''(\theta) = \frac{\exp\theta}{(1 + \exp\theta)^2} = p(1 - p),
\]
and the variance function is
\[
V(\mu) = \mu(1 - \mu).
\]
For the binomial distribution, where Y is the observed proportion of successes out of n trials,
\[
\begin{aligned}
f_Y(y; p) &= \binom{n}{ny} p^{ny}(1 - p)^{n(1-y)}, \qquad y \in \left\{ 0, \tfrac{1}{n}, \tfrac{2}{n}, \dots, 1 \right\};\ p \in (0, 1) \\
&= \exp\left( \frac{y\log\frac{p}{1-p} + \log(1 - p)}{1/n} + \log\binom{n}{ny} \right).
\end{aligned}
\]
This is in the form (5.1), with θ = log(p/(1 − p)), b(θ) = log(1 + exp θ), a(ϕ) = 1/n and c(y, ϕ) = log\binom{n}{ny}. Therefore
\[
E(Y) = b'(\theta) = \frac{\exp\theta}{1 + \exp\theta} = p, \qquad
\operatorname{Var}(Y) = a(\phi)\,b''(\theta) = \frac{1}{n}\,\frac{\exp\theta}{(1 + \exp\theta)^2} = \frac{p(1 - p)}{n},
\]
and the variance function is
\[
V(\mu) = \mu(1 - \mu).
\]
Here, we can write a(ϕ) ≡ σ 2 /w where the scale parameter σ 2 = 1 and the
weight w is n, the binomial denominator.
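The variance functions derived above are exactly what R's family objects use internally, so they are easy to check directly. A minimal sketch; the µ values are illustrative:

mu <- c(0.2, 0.5, 0.8)       # illustrative mean values

gaussian()$variance(mu)       # V(mu) = 1 for every mu
poisson()$variance(mu)        # V(mu) = mu
binomial()$variance(mu)       # V(mu) = mu * (1 - mu)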
\[
\begin{aligned}
f_Y(y; \theta, \phi) &= \prod_{i=1}^{n} f_{Y_i}(y_i; \theta_i, \phi_i) \\
&= \exp\left( \sum_{i=1}^{n} \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + \sum_{i=1}^{n} c(y_i, \phi_i) \right) \tag{5.2}
\end{aligned}
\]
\[
\eta_i = x_i^T\beta = [X\beta]_i, \qquad i = 1, \dots, n, \tag{5.3}
\]
or, in matrix form,
\[
\eta = X\beta. \tag{5.4}
\]
Again, we call the n × p matrix X the design matrix. The ith row of X is xTi ,
the explanatory data corresponding to the ith observation of the response.
The jth column of X contains the n observations of the jth explanatory
variable.
As for the linear model in Section 3.1.2, this structure allows quite general
dependence of the linear predictor on explanatory variables. For instance, we
can allow non-linear dependence of ηi on a variable xi through polynomial
regression (as in Example 3.3), or include categorical explanatory variables
(as in Examples 3.5 and 3.6).
ηi = g(µi ), i = 1, . . . , n,
Recall that for a random variable Y with a distribution from the exponential
family, E(Y ) = b′ (θ). Hence, for a generalised linear model
µi = E(Yi ) = b′ (θi ), i = 1, . . . , n.
Therefore
\[
\theta_i = {b'}^{-1}(\mu_i), \qquad i = 1, \dots, n,
\]
and as g(µ_i) = η_i = x_i^Tβ, then
\[
\theta_i = {b'}^{-1}\left( g^{-1}\left[ x_i^T\beta \right] \right), \qquad i = 1, \dots, n. \tag{5.5}
\]
Hence, we can express the joint density (5.2) in terms of the coefficients β,
and for observed data y, this is the likelihood L(β) for β. As β is our
parameter of real interest (describing the dependence of the response on the
explanatory variables) this likelihood will play a crucial role.
Note that considerable simplification is obtained in (5.5) if the functions g and {b'}^{-1} are identical. Then
\[
\theta_i = x_i^T\beta, \qquad i = 1, \dots, n,
\]
The link between E(Y ) = µ and the linear predictor η is through the (canon-
ical) identity link function
µi = ηi , i = 1, . . . , n.
where Φ(·) is the cdf of the standard normal distribution, then we get the
link function
g(µ) = g(p) = Φ−1 (µ) = η,
log λi = ηi = xTi β,
or
λi = exp{ηi } = exp{xTi β}.
Now suppose that Y_i represents a count of the number of events which occur in a given region i, for instance the number of times a particular drug is prescribed on a given day in district i of a country. We might want to model λ*_i, the prescription rate per patient in the district. Write N_i for the number of patients registered in district i, often called the exposure of observation i. We model Y_i ∼ Poisson(N_iλ*_i), where
\[
\log\lambda^*_i = x_i^T\beta, \qquad\text{so that}\qquad \log\lambda_i = \log N_i + x_i^T\beta
\]
(since λ_i = N_iλ*_i, so log λ_i = log N_i + log λ*_i). The log-exposure log N_i appears as a fixed term in the linear predictor, without any associated parameter. Such a fixed term is called an offset.
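In R, an offset can be included either with an offset() term in the model formula or via the offset argument of glm. A sketch; the data frame dat and its columns counts, N and x are hypothetical:

# A minimal sketch; `counts`, `N` and `x` are hypothetical columns of `dat`
fit <- glm(counts ~ x + offset(log(N)),
           family = poisson(link = "log"), data = dat)

# Equivalent: glm(counts ~ x, family = poisson, offset = log(N), data = dat)
coef(fit)   # the offset has a fixed coefficient of 1, so no parameter is reported for it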
\[
\ell(\beta, \phi) = \sum_{i=1}^{n} \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + \sum_{i=1}^{n} c(y_i, \phi_i) \tag{5.6}
\]
\[
u_k(\beta) = \frac{\partial}{\partial\beta_k}\,\ell(\beta, \phi), \qquad k = 1, \dots, p.
\]
From (5.6)
\[
\begin{aligned}
u_k(\beta) &= \frac{\partial}{\partial\beta_k}\,\ell(\beta, \phi) \\
&= \frac{\partial}{\partial\beta_k}\sum_{i=1}^{n} \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + \frac{\partial}{\partial\beta_k}\sum_{i=1}^{n} c(y_i, \phi_i) \\
&= \sum_{i=1}^{n} \frac{\partial}{\partial\beta_k}\left[ \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} \right] \\
&= \sum_{i=1}^{n} \frac{\partial}{\partial\theta_i}\left[ \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} \right] \frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_k} \\
&= \sum_{i=1}^{n} \frac{y_i - b'(\theta_i)}{a(\phi_i)}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_k}, \qquad k = 1, \dots, p,
\end{aligned}
\]
where
\[
\frac{\partial\theta_i}{\partial\mu_i} = \left[ \frac{\partial\mu_i}{\partial\theta_i} \right]^{-1} = \frac{1}{b''(\theta_i)}, \qquad
\frac{\partial\mu_i}{\partial\eta_i} = \left[ \frac{\partial\eta_i}{\partial\mu_i} \right]^{-1} = \frac{1}{g'(\mu_i)}, \qquad
\frac{\partial\eta_i}{\partial\beta_k} = \frac{\partial}{\partial\beta_k}\sum_{j=1}^{p} x_{ij}\beta_j = x_{ik}.
\]
Therefore
\[
u_k(\beta) = \sum_{i=1}^{n} \frac{y_i - b'(\theta_i)}{a(\phi_i)}\,\frac{x_{ik}}{b''(\theta_i)\,g'(\mu_i)}
= \sum_{i=1}^{n} \frac{y_i - \mu_i}{\operatorname{Var}(Y_i)}\,\frac{x_{ik}}{g'(\mu_i)}, \qquad k = 1, \dots, p, \tag{5.7}
\]
which depends on β through µ_i ≡ E(Y_i) and Var(Y_i), i = 1, . . . , n.
In theory, we solve the p simultaneous equations uk (β̂) = 0, k = 1, . . . , p to
evaluate β̂. In practice, these equations are usually non-linear and have no
analytic solution. Therefore, we rely on numerical methods to solve them.
First, we note that the Hessian and Fisher information matrices can be de-
rived directly from (5.7).
\[
[H(\beta)]_{jk} = \frac{\partial^2}{\partial\beta_j\,\partial\beta_k}\,\ell(\beta, \phi) = \frac{\partial}{\partial\beta_j}\,u_k(\beta).
\]
Therefore
\[
\begin{aligned}
[H(\beta)]_{jk} &= \frac{\partial}{\partial\beta_j}\sum_{i=1}^{n} \frac{y_i - \mu_i}{\operatorname{Var}(Y_i)}\,\frac{x_{ik}}{g'(\mu_i)} \\
&= \sum_{i=1}^{n} \frac{-\frac{\partial\mu_i}{\partial\beta_j}}{\operatorname{Var}(Y_i)}\,\frac{x_{ik}}{g'(\mu_i)}
 + \sum_{i=1}^{n} (y_i - \mu_i)\,\frac{\partial}{\partial\beta_j}\left[ \frac{x_{ik}}{\operatorname{Var}(Y_i)\,g'(\mu_i)} \right]
\end{aligned}
\]
and
\[
\begin{aligned}
[I(\beta)]_{jk} &= \sum_{i=1}^{n} \frac{\frac{\partial\mu_i}{\partial\beta_j}}{\operatorname{Var}(Y_i)}\,\frac{x_{ik}}{g'(\mu_i)}
 - \sum_{i=1}^{n} (E[Y_i] - \mu_i)\,\frac{\partial}{\partial\beta_j}\left[ \frac{x_{ik}}{\operatorname{Var}(Y_i)\,g'(\mu_i)} \right] \\
&= \sum_{i=1}^{n} \frac{\frac{\partial\mu_i}{\partial\beta_j}\,x_{ik}}{\operatorname{Var}(Y_i)\,g'(\mu_i)} \\
&= \sum_{i=1}^{n} \frac{x_{ij}x_{ik}}{\operatorname{Var}(Y_i)\,g'(\mu_i)^2}.
\end{aligned}
\]
\[
I(\beta) = X^TWX, \tag{5.8}
\]
where
\[
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad
W = \operatorname{diag}(w) = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & w_n \end{pmatrix}
\]
and
\[
w_i = \frac{1}{\operatorname{Var}(Y_i)\,g'(\mu_i)^2}, \qquad i = 1, \dots, n.
\]
The Fisher information matrix I(β) depends on β through µ and
Var(Yi ), i = 1, . . . , n.
We notice that the score in (5.7) may now be written as
\[
u_k(\beta) = \sum_{i=1}^{n} (y_i - \mu_i)\,x_{ik}\,w_i\,g'(\mu_i) = \sum_{i=1}^{n} x_{ik}\,w_i\,z_i, \qquad k = 1, \dots, p,
\]
where
\[
z_i = (y_i - \mu_i)\,g'(\mu_i), \qquad i = 1, \dots, n.
\]
Therefore
\[
u(\beta) = X^TWz. \tag{5.9}
\]
4. If ||β (m+1) − β (m) || > ϵ, for some prespecified (small) tolerance ϵ then
set m → m + 1 and go to 2.
5. Use β (m+1) as the solution for β̂.
As this algorithm involves iteratively minimising a weighted sum of squares,
it is sometimes known as iteratively (re)weighted least squares.
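The iteration can be written as a short loop. The sketch below is a bare-bones illustration only (it omits the step-halving and other numerical safeguards that glm.fit uses); it assumes a design matrix X, a response vector y and an R family object fam supplying the link and variance functions, and it uses the standard working response η + (y − µ)g′(µ), i.e. η plus the quantity z defined above.

# Bare-bones IRLS sketch; X, y and fam (e.g. binomial()) are assumed given
irls <- function(X, y, fam, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (m in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- fam$linkinv(eta)
    gp  <- 1 / fam$mu.eta(eta)             # g'(mu) = d eta / d mu
    w   <- 1 / (fam$variance(mu) * gp^2)   # working weights (a(phi) taken constant)
    z   <- eta + (y - mu) * gp             # working response
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted least squares step
    if (sum(abs(beta_new - beta)) < tol) return(drop(beta_new))
    beta <- drop(beta_new)
  }
  drop(beta)
}

For example, irls(cbind(1, x), y, binomial()) should agree closely with coef(glm(y ~ x, family = binomial)) when both converge.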
Notes
1. Recall that the canonical link function is g(µ) = {b'}^{-1}(µ), and with this link η_i = g(µ_i) = θ_i. Then
\[
\frac{1}{g'(\mu_i)} = \frac{\partial\mu_i}{\partial\eta_i} = \frac{\partial\mu_i}{\partial\theta_i} = b''(\theta_i), \qquad i = 1, \dots, n.
\]
Therefore Var(Y_i)g'(µ_i) = a(ϕ_i), which does not depend on β, and hence
\[
\frac{\partial}{\partial\beta_j}\left[ \frac{x_{ik}}{\operatorname{Var}(Y_i)\,g'(\mu_i)} \right] = 0
\]
for all j = 1, . . . , p. It follows that H(β) = −I(β) and, for the canonical link, Newton-Raphson and Fisher scoring are equivalent.
2. The linear model is a generalised linear model with identity link, η_i = g(µ_i) = µ_i, and Var(Y_i) = σ² for all i = 1, . . . , n. Therefore w_i = 1/σ² for every i, the weights do not depend on β, and the algorithm converges to the usual least squares estimate in a single step.
The endpoints of this interval cannot be evaluated, because they also depend on the unknown parameter vector β. However, if we replace I(β) by I(β̂), the information evaluated at the MLE, we obtain the approximate large-sample 100(1 − α)% confidence interval
\[
\left[ \hat\beta_i - \text{s.e.}(\hat\beta_i)\,z_{1-\alpha/2},\ \ \hat\beta_i + \text{s.e.}(\hat\beta_i)\,z_{1-\alpha/2} \right].
\]
For α = 0.10, 0.05, 0.01, z_{1−α/2} = 1.64, 1.96, 2.58, respectively.
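In R, the estimated standard errors come from the inverse information at β̂, which summary.glm reports; the Wald intervals above can be assembled directly, or obtained with confint.default (plain confint profiles the likelihood instead). A sketch assuming some fitted glm object fit:

# Assuming `fit` is a fitted glm object
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # vcov(fit) is the inverse information at beta-hat
z   <- qnorm(0.975)

cbind(lower = est - z * se, upper = est + z * se)  # approximate 95% Wald intervals
confint.default(fit)                               # the same intervals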
βq+1 = βq+2 = · · · = βp = 0.
Then model H0 is a special case of model H1 , where certain coefficients are set
equal to zero, and therefore Θ(0) , the set of values of the canonical parameter
θ allowed by H0 , is a subset of Θ(1) , the set of values allowed by H1 .
\[
\begin{aligned}
L_{01} &\equiv 2\log\left( \frac{\max_{\theta \in \Theta^{(1)}} L(\theta)}{\max_{\theta \in \Theta^{(0)}} L(\theta)} \right) \\
&= 2\log L(\hat\theta^{(1)}) - 2\log L(\hat\theta^{(0)}), \tag{5.12}
\end{aligned}
\]
where θ̂ (1) and θ̂ (0) follow from b′ (θ̂i ) = µ̂i , g(µ̂i ) = η̂i , i = 1, . . . , n where η̂ for
each model is the linear predictor evaluated at the corresponding maximum
likelihood estimate for β. Here, we assume that a(ϕi ), i = 1, . . . , n are
known; unknown a(ϕ) is discussed in Section 5.8.
Recall that we reject H0 in favour of H1 when L01 is ‘too large’ (the observed
data are much more probable under H1 than H0 ). To determine a threshold
value k for L01 , beyond which we reject H0 , we set the size of the test α and
use the result of Section 2.3.3.2 that, because H0 is nested in H1 , L01 has
an asymptotic chi-squared distribution with p − q degrees of freedom. For
example, if α = 0.05, we reject H0 in favour of H1 when L01 is greater than the 95% point of the χ²_{p−q} distribution.
Note that setting up our model selection procedure in this way is consistent
with our desire for parsimony. The simpler model is H0 , and we do not
reject H0 in favour of the more complex model H1 unless the data provide
convincing evidence for H1 over H0 , that is unless H1 fits the data significantly
better.
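In R, this comparison of nested GLMs is carried out by anova applied to the two fits, with test = "Chisq" requesting the χ²_{p−q} reference distribution. A sketch with simulated binary data (generated so that x2 has no real effect):

set.seed(5)
x1 <- rnorm(200); x2 <- rnorm(200)
p  <- plogis(-0.5 + x1)                      # data generated without an x2 effect
y  <- rbinom(200, size = 1, prob = p)

fit0 <- glm(y ~ x1,      family = binomial)  # H0
fit1 <- glm(y ~ x1 + x2, family = binomial)  # H1

anova(fit0, fit1, test = "Chisq")            # L01 = difference of residual deviances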
\[
\ell(\theta) = \sum_{i=1}^{n} \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + \sum_{i=1}^{n} c(y_i, \phi_i). \tag{5.13}
\]
\[
\frac{\partial}{\partial\theta_k}\,\ell(\theta) = \frac{y_k - b'(\theta_k)}{a(\phi_k)}, \qquad k = 1, \dots, n.
\]
where θ̂ (s) follows from b′ (θ̂) = µ̂ = y and θ̂ (0) is a function of the corre-
sponding maximum likelihood estimate for β = (β1 , . . . , βq )T . Under H0 , L0s
has an asymptotic chi-squared distribution with n − q degrees of freedom.
Therefore, if L0s is ‘too large’ (for example, larger than the 95% point of the χ²_{n−q} distribution) then we reject H0 as a plausible model for the data, as it does not fit the data adequately.
The degrees of freedom of model H0 is defined to be the degrees of freedom
for this test, n − q, the number of observations minus the number of linear
parameters of H0 . We call L0s the scaled deviance (R calls it the residual
deviance) of model H0 .
From (5.12) and (5.13) we can write the scaled deviance of model H0 as
\[
L_{0s} = 2\sum_{i=1}^{n} \frac{ y_i\left[ \hat\theta^{(s)}_i - \hat\theta^{(0)}_i \right] - \left[ b(\hat\theta^{(s)}_i) - b(\hat\theta^{(0)}_i) \right] }{a(\phi_i)}, \tag{5.14}
\]
which can be calculated using the observed data, provided that a(ϕ_i), i = 1, . . . , n, are known.
Notes
1. The log likelihood ratio statistic (5.12) for testing H0 against a non-
saturated alternative H1 can be written as
\[
L_{01} = L_{0s} - L_{1s}. \tag{5.15}
\]
Therefore the log likelihood ratio statistic for comparing two nested
models is the difference of their deviances. Furthermore, as p − q =
(n − q) − (n − p), the degrees of freedom for the test is the difference
in degrees of freedom of the two models.
2. The asymptotic theory used to derive the distribution of the log like-
lihood ratio statistic under H0 does not really apply to the goodness
of fit test (comparison with the saturated model). However, for bino-
mial or Poisson data, we can proceed as long as the relevant binomial
or Poisson distributions are likely to be reasonably approximated by
normal distributions (i.e. for binomials with large denominators or
Poissons with large means). However, for Bernoulli data, we cannot
use the scaled deviance as a goodness of fit statistic in this way.
3. An alternative goodness of fit statistic for a model H0 is Pearson's X², given by
\[
X^2 = \sum_{i=1}^{n} \frac{\left( y_i - \hat\mu^{(0)}_i \right)^2}{\widehat{\operatorname{Var}}(Y_i)}. \tag{5.16}
\]
X² is small when the squared differences between observed and fitted values (scaled by variance) are small. Hence, large values of X² correspond to poorly fitting models. In fact, X² and L0s are asymptotically equivalent.
For example, for Poisson data (where a(ϕ_i) = 1 and the saturated model has µ̂^{(s)}_i = y_i),
\[
\begin{aligned}
L_{0s} &= 2\sum_{i=1}^{n}\left\{ y_i\left[ \log\hat\mu^{(s)}_i - \log\hat\mu^{(0)}_i \right] - \left[ \hat\mu^{(s)}_i - \hat\mu^{(0)}_i \right] \right\} \\
&= 2\sum_{i=1}^{n}\left( y_i\log\frac{y_i}{\hat\mu^{(0)}_i} - y_i + \hat\mu^{(0)}_i \right)
\end{aligned}
\]
and
\[
X^2 = \sum_{i=1}^{n} \frac{\left( y_i - \hat\mu^{(0)}_i \right)^2}{\hat\mu^{(0)}_i}.
\]
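Both statistics are easy to extract or recompute from a fitted Poisson GLM. A sketch with simulated counts; here the fitted model is correct, so both statistics should be roughly comparable with the residual degrees of freedom:

set.seed(6)
x <- rnorm(100)
y <- rpois(100, lambda = exp(0.5 + 0.3 * x))   # illustrative Poisson data

fit <- glm(y ~ x, family = poisson)
mu0 <- fitted(fit)

deviance(fit)                                           # scaled deviance L0s (phi = 1)
2 * sum(y * log(ifelse(y == 0, 1, y / mu0)) - (y - mu0)) # the same quantity, by hand
sum((y - mu0)^2 / mu0)                                  # Pearson's X^2
df.residual(fit)                                        # n - q, for comparison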
For binomial data (where a(ϕ_i) = 1/n_i and the saturated model has µ̂^{(s)}_i = y_i),
\[
\begin{aligned}
L_{0s} &= 2\sum_{i=1}^{n} n_i y_i\left[ \log\frac{\hat\mu^{(s)}_i}{1 - \hat\mu^{(s)}_i} - \log\frac{\hat\mu^{(0)}_i}{1 - \hat\mu^{(0)}_i} \right]
 + 2\sum_{i=1}^{n} n_i\left[ \log(1 - \hat\mu^{(s)}_i) - \log(1 - \hat\mu^{(0)}_i) \right] \\
&= 2\sum_{i=1}^{n}\left[ n_i y_i\log\left( \frac{y_i}{\hat\mu^{(0)}_i} \right) + n_i(1 - y_i)\log\left( \frac{1 - y_i}{1 - \hat\mu^{(0)}_i} \right) \right]
\end{aligned}
\]
and
\[
X^2 = \sum_{i=1}^{n} \frac{n_i\left( y_i - \hat\mu^{(0)}_i \right)^2}{\hat\mu^{(0)}_i\left( 1 - \hat\mu^{(0)}_i \right)}.
\]
Bernoulli data are binomial with ni = 1, i = 1, . . . , n.
\[
L_{0s} = \frac{2}{\sigma^2}\sum_{i=1}^{n}\left\{ m_i y_i\left[ \hat\theta^{(s)}_i - \hat\theta^{(0)}_i \right] - m_i\left[ b(\hat\theta^{(s)}_i) - b(\hat\theta^{(0)}_i) \right] \right\} = \frac{1}{\sigma^2}D_{0s}, \tag{5.17}
\]
where D_{0s} is defined to be twice the sum above, which can be calculated using the observed data. We call D_{0s} the deviance of the model.
In order to test nested models H0 and H1 as set up in Section 5.6.1, we calculate the test statistic
\[
F = \frac{(D_0 - D_1)/(p - q)}{D_1/(n - p)}. \tag{5.18}
\]
This statistic does not depend on the unknown scale parameter σ 2 , so can be
calculated using the observed data. Asymptotically, if H0 is true, we know
that L01 ∼ χ2p−q and L1s ∼ χ2n−p . Furthermore, L01 and L1s are independent
(not proved here) so F has an asymptotic Fp−q,n−p distribution. Hence, we
compare nested generalised linear models by calculating F and rejecting H0
in favour of H1 if F is too large (for example, greater than the 95% point of the relevant F distribution).
Similarly, Pearson's statistic becomes
\[
X^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n} \frac{m_i\left( y_i - \hat\mu^{(0)}_i \right)^2}{V(\hat\mu^{(0)}_i)}. \tag{5.19}
\]
For a normal GLM (identity link, m_i = 1), the deviance is
\[
D_{0s} = 2\sum_{i=1}^{n}\left\{ y_i\left[ \hat\mu^{(s)}_i - \hat\mu^{(0)}_i \right] - \left[ \tfrac{1}{2}\hat\mu^{(s)2}_i - \tfrac{1}{2}\hat\mu^{(0)2}_i \right] \right\} = \sum_{i=1}^{n}\left[ y_i - \hat\mu^{(0)}_i \right]^2, \tag{5.20}
\]
which is just the residual sum of squares for model H0 . Therefore, we estimate
σ 2 for a normal GLM by its residual sum of squares for the model divided
by its degrees of freedom. From (5.19), the estimate for σ 2 based on X 2 is
identical.
5.8.1 Residuals
Recall that for linear models, we define the residuals to be the differences between the observed and fitted values, y_i − µ̂^{(0)}_i, i = 1, . . . , n. From (5.20) we
notice that both the scaled deviance and Pearson X 2 statistic for a normal
GLM are the sum of the squared residuals divided by σ 2 . We can generalise
this to define residuals for other generalised linear models in a natural way.
For any GLM we define the Pearson residuals to be
\[
r^P_i = \frac{y_i - \hat\mu^{(0)}_i}{\widehat{\operatorname{Var}}(Y_i)^{1/2}}, \qquad i = 1, \dots, n.
\]
Then, from (5.16), X 2 is the sum of the squared Pearson residuals.
Similarly, we define the deviance residuals to be
\[
r^D_i = \operatorname{sign}\left( y_i - \hat\mu^{(0)}_i \right)\sqrt{d_i}, \qquad i = 1, \dots, n,
\]
where d_i is the contribution of the i-th observation to the scaled deviance, and sign(x) = 1 if x > 0 and −1 if x < 0. Then, from (5.14), the scaled deviance, L0s, is the sum of the squared deviance residuals.
When a(ϕ) = σ 2 /m and σ 2 is unknown, as in Section 5.8, the residuals are
based on (5.17) and (5.19), and the expressions above need to be multiplied
through by σ 2 to eliminate dependence on the unknown scale parameter.
Therefore, for a normal GLM the Pearson and deviance residuals are both equal to the usual residuals, y_i − µ̂^{(0)}_i, i = 1, . . . , n.
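Both kinds of residuals are available from residuals() for any fitted GLM. A sketch, assuming a fitted glm object fit such as the Poisson fit above:

# Assuming `fit` is a fitted glm object
r_pearson  <- residuals(fit, type = "pearson")
r_deviance <- residuals(fit, type = "deviance")

sum(r_pearson^2)    # Pearson's X^2
sum(r_deviance^2)   # the scaled deviance

plot(predict(fit, type = "link"), r_deviance,
     xlab = "linear predictor", ylab = "deviance residual")  # should look like random scatter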
Residual plots are most commonly of use in normal linear models, where they
provide an essential check of the model assumptions. This kind of check is
less important for a model without an unknown scale parameter as the scaled
deviance provides a useful overall assessment of fit which takes into account
most aspects of the model.
However, when data have been collected in serial order, a plot of the deviance
or Pearson residuals against the order may again be used as a check for
potential serial correlation.
Otherwise, residual plots are most useful when a model fails to fit (scaled
deviance is too high). Then, examining the residuals may give an indication
of the reason(s) for lack of fit. For example, there may be a small number of
outlying observations.
A plot of deviance or Pearson residuals against the linear predictor should
produce something that looks like a random scatter. If not, then this may
be due to incorrect link function, wrong scale for an explanatory variable, or
perhaps a missing polynomial term in an explanatory variable.
Chapter 6

Models for Categorical Data

Job Satisfaction
Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat.
<6000 20 24 80 82
6000-15000 22 38 104 125
15000-25000 13 28 81 113
>25000 7 18 54 92

Table 6.1: A contingency table of the job dataset.
Job Satisfaction
Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat. Sum
<6000 20 24 80 82 206
6000-15000 22 38 104 125 289
15000-25000 13 28 81 113 235
>25000 7 18 54 92 171
Sum 62 108 319 412 901
Table 6.2: A contingency table of the job dataset, including one-way mar-
gins.
Each position in a contingency table is called a cell and the number of indi-
viduals in a particular cell is the cell count.
Partial classifications derived from the table are called margins. For a two-
way table these are often displayed in the margins of the table, as in Table
6.2. These are one-way margins as they represent the classification of items
by a single variable; income group and job satisfaction respectively.
Remission
Cell Type Sex No Yes
Diffuse Female 3 1
Diffuse Male 12 1
Nodular Female 2 6
Nodular Male 1 4
For multiway tables, higher order margins may be calculated. For example,
for lymphoma, the two-way Cell type/Sex margin is shown in Table 6.4.
Sex
Cell Type Female Male
Diffuse 4 13
Nodular 8 5
Table 6.4: The two-way Cell type/Sex margin for the lymphoma dataset.
\[
f_Y(y; p) = P(Y_1 = y_1, \dots, Y_n = y_n)
= \begin{cases}
N!\displaystyle\prod_{i=1}^{n} \dfrac{p_i^{y_i}}{y_i!} & \text{if } \sum_{i=1}^{n} y_i = N, \\[2ex]
0 & \text{otherwise.}
\end{cases} \tag{6.1}
\]
The binomial is the special case of the multinomial with two cells (n = 2).
We can still use a log-linear model for contingency table data when the data
have been obtained by multinomial sampling. We model log µi = log(N pi ),
i = 1, . . . , n as a linear function of explanatory variables. However, such a
model must preserve Σ_{i=1}^n µ_i = N, the grand total, which is fixed in advance.
Therefore,
\[
\ell(\mu) = -\sum_{i=1}^{n}\mu_i + \sum_{i=1}^{n} y_i\log\mu_i - \sum_{i=1}^{n}\log y_i! \tag{6.2}
\]
\[
\phantom{\ell(\mu)} = -\sum_{i=1}^{n}\exp(\log\mu_i) + \sum_{i=1}^{n} y_i\log\mu_i - \sum_{i=1}^{n}\log y_i!, \tag{6.3}
\]
so that
\[
\frac{\partial}{\partial\alpha}\,\ell(\mu) = -\sum_{i=1}^{n}\exp(\log\mu_i) + \sum_{i=1}^{n} y_i \tag{6.4}
\]
\[
\Rightarrow\quad \sum_{i=1}^{n}\hat\mu_i = \sum_{i=1}^{n} y_i. \tag{6.5}
\]
The joint distribution is then a product of multinomial distributions, each of the same form as (6.1), one for each fixed-total subgroup. We call this distribution a product multinomial. Each subgroup has its own fixed total. The full joint density is a product of n terms, as before, with each cell count appearing exactly once.
For example, if the Sex margin is fixed for lymphoma, then the product multinomial distribution has the form
\[
f_Y(y; p) = \begin{cases}
N_m!\displaystyle\prod_{i=1}^{4}\frac{p_{mi}^{y_{mi}}}{y_{mi}!}\; N_f!\displaystyle\prod_{i=1}^{4}\frac{p_{fi}^{y_{fi}}}{y_{fi}!}
& \text{if } \sum_{i=1}^{4} y_{mi} = N_m \text{ and } \sum_{i=1}^{4} y_{fi} = N_f, \\[2ex]
0 & \text{otherwise,}
\end{cases}
\]
where N_m and N_f are the two fixed marginal totals (18 and 12 respectively), y_{m1}, . . . , y_{m4} are the cell counts for the Cell type/Remission cross-classification for males and y_{f1}, . . . , y_{f4} are the corresponding cell counts for females. Here Σ_{i=1}^4 p_{mi} = Σ_{i=1}^4 p_{fi} = 1, E(Y_{mi}) = N_m p_{mi}, i = 1, . . . , 4, and E(Y_{fi}) = N_f p_{fi}, i = 1, . . . , 4.
Using similar results to those used in Section 6.3 (but not proved here),
we can analyse contingency table data using Poisson log-linear models, even
when the data has been obtained by product multinomial sampling. However,
we must ensure that the Poisson model contains a term corresponding to the
fixed margin (and all marginal terms). Then, the estimated means for the
specified margin are equal to the corresponding fixed totals.
For example, for the lymphoma dataset, for inferences obtained using a Pois-
son model to be valid when the Sex margin is fixed in advance, the Poisson
model must contain the Sex main effect (and the intercept). For inferences
obtained using a Poisson model to be valid when the Cell type/Sex margin
is fixed in advance, the Poisson model must contain the Cell type/Sex inter-
action, and all marginal terms (the Cell type main effect, the Sex main effect
and the intercept).
Therefore
\[
P(R = j, C = k) = \frac{\mu_{jk}}{N} = \exp[\alpha + \beta_R(j) + \beta_C(k) - \log N],
\]
so
\[
\begin{aligned}
1 &= \sum_{j=1}^{r}\sum_{k=1}^{c}\exp[\alpha + \beta_R(j) + \beta_C(k) - \log N] \\
&= \frac{1}{N}\exp[\alpha]\sum_{j=1}^{r}\exp[\beta_R(j)]\sum_{k=1}^{c}\exp[\beta_C(k)].
\end{aligned}
\]
Furthermore,
\[
P(R = j) = \sum_{k=1}^{c}\exp[\alpha + \beta_R(j) + \beta_C(k) - \log N]
= \frac{1}{N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^{c}\exp[\beta_C(k)], \qquad j = 1, \dots, r,
\]
and
\[
P(C = k) = \sum_{j=1}^{r}\exp[\alpha + \beta_R(j) + \beta_C(k) - \log N]
= \frac{1}{N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^{r}\exp[\beta_R(j)], \qquad k = 1, \dots, c.
\]
Therefore
\[
P(R = j)\,P(C = k) = \frac{1}{N}\exp[\alpha]\exp[\beta_R(j)]\exp[\beta_C(k)] \times 1
= P(R = j, C = k), \qquad j = 1, \dots, r;\ k = 1, \dots, c.
\]
Absence of the interaction R∗C in a log-linear model implies that R and C are
independent variables. Absence of main effects is generally less interesting,
and main effects are typically not removed from a log-linear model.
Remission
Cell Type Sex No Yes
Diffuse Female 0.1176 0.0157
Diffuse Male 0.3824 0.0510
Nodular Female 0.0615 0.2051
Nodular Male 0.0385 0.1282
Remission
Sex No Yes Sum
Female 0.1792 0.2208 0.4
Male 0.4208 0.1792 0.6
Sum 0.6 0.4 1.0
It can immediately be seen that this model does not imply independence of
R and S, as P̂ (R, S) ̸= P̂ (R)P̂ (S). What the model R ∗ C + C ∗ S implies
is that R is independent of S conditional on C, that is
P (R|S, C) = P (R|C),
Remission
Cell Type Sex No Yes P̂(R|S, C)
Diffuse Female 0.1176 0.0157 0.12
Diffuse Male 0.3824 0.0510 0.12
Nodular Female 0.0615 0.2051 0.77
Nodular Male 0.0385 0.1282 0.77
The proof of this is similar to the proof in the two-way case. Again, an alternative way of expressing conditional independence is
\[
P(X_1, X_2 \mid X_3, \dots, X_r) = P(X_1 \mid X_3, \dots, X_r)\,P(X_2 \mid X_3, \dots, X_r),
\]
or
\[
P(X_2 \mid X_1, X_3, \dots, X_r) = P(X_2 \mid X_3, \dots, X_r).
\]
The estimated marginal probability of remission is about 0.30 for men and 0.55 for women. Male patients have a much lower probability of
remission. The reason for this is that, although R and S are not directly
associated, they are both associated with C. Observing the estimated values
it is clear that patients with nodular cell type have a greater probability of
remission, and furthermore, that female patients are more likely to have this
cell type than males. Hence women are more likely to have remission than
men.
Suppose the factors for a three-way table are X1, X2 and X3. We can list all possible models and the implications for the conditional independence structure (an R sketch of fitting these models follows the list):
1. Model 1 containing the terms X1 , X2 , X3 . All factors are mutually
independent.
2. Model 2 containing the terms X1 ∗ X2 , X3 . The factor X3 is jointly
independent of X1 and X2 .
3. Model 3 containing the terms X1 ∗ X2 , X2 ∗ X3 . The factors X1 and X3
are conditionally independent given X2 .
4. Model 4 containing the terms X1 ∗ X2 , X2 ∗ X3 , X1 ∗ X3 . There is
no conditional independence structure. This is the model without the
highest order interaction term.
5. Model 5 containing X1 ∗X2 ∗X3 . This is the saturated model. No more
simplification of dependence structure is possible.
There is, however, a third variable related to both smoking habit and mortality: age (at the time of the initial
survey). When we consider this variable, we get Table 6.11. Conditional on
every age at outset, it is now the smokers who have a higher probability of
dying. The marginal association is reversed in the table conditional on age,
because mortality (obviously) and smoking are associated with age. There
are proportionally many fewer smokers in the older age-groups (where the
probability of death is greater).
When making inferences about associations between variables, it is important
that all other variables which are relevant are considered. Marginal inferences
may lead to misleading conclusions.