Stat Modelling Notes
Rajen D. Shah
[email protected]
Introduction
This course is largely about analysing data composed of observations that come in the form of
pairs
(y1 , x1 ), . . . , (yn , xn ). (0.0.1)
Our aim will be to infer an unknown regression function relating the values yi to the xi , which
may be p-dimensional vectors xi = (xi1 , . . . , xip )T . The yi are often called the response, target
or dependent variable; the xi are known as predictors, covariates, independent variables or
explanatory variables. Below are some examples of possible responses and covariates.
Response            Covariates
House price         Numbers of bedrooms, bathrooms; plot area; year built; location
Weight loss         Type of diet plan; type of exercise regime
Short-sightedness   Parents' short-sightedness; hours spent watching TV or reading books
First note that in each of the examples above, it would be hopeless to attempt to find a
deterministic function that gives the response for every possible set of values of the covariates.
Instead, it makes sense to think of the data-generating mechanism as being inherently random,
with perhaps a deterministic function relating average values of the responses to values of the
covariates.
We model the responses yi as realisations of random variables Yi . Depending on how the
data were collected, it may seem appropriate to also treat the xi as random. However, in such
cases we usually condition on the observed values of the explanatory variables. To aid intuition,
it may help to imagine a hypothetical sequence of repetitions of the ‘experiment’ that was
conducted to produce the data with the xi , i = 1, . . . , n held fixed, and think of the dataset at
hand as being one of the many elements of such a sequence.
In the course Principles of Statistics, theory was developed for data that were i.i.d. In our
setting here, this assumption is not appropriate: the distributions of Yi and Yj may well be
different if xi ≠ xj. In fact what we are interested in is how the distributions of the Yi differ.
However, we will still usually assume that the data are at least independent. It turns out that
with this assumption of independence, much of the theory from Principles of Statistics can be
applied, with little modification.
In this course we will study some of the most popular and important statistical models for
data of the form (0.0.1). We begin with the linear model, which you will have met in Statistics
IB.
Contents

1 Linear models
  1.1 Ordinary least squares (OLS)
    1.1.1 Orthogonal projections
    1.1.2 Analysis of OLS
  1.2 Normal errors
    1.2.1 Maximum likelihood estimation
    1.2.2 The multivariate normal distribution and related distributions
    1.2.3 Inference for the normal linear model
    1.2.4 ANOVA and ANCOVA
    1.2.5 Model selection
    1.2.6 Model checking
Chapter 1
Linear models
The linear model takes the form

    Y = Xβ + ε,

where Y = (Y1, . . . , Yn)^T, X is the n × p matrix with rows x1^T, . . . , xn^T, β = (β1, . . . , βp)^T, ε = (ε1, . . . , εn)^T,
and the εi are to be considered as random errors that satisfy
(A1) E(εi ) = 0,
(A2) Var(εi) = σ²,
(A3) Cov(εi, εj) = 0 for i ≠ j.
A word on models. It is important to recognise that this, or indeed any, statistical model is a mathematical object and cannot really be thought of as a 'true' representation of reality. Nevertheless, statistical models can be useful representations of reality: though the model may be wrong, it can still be used to answer questions of interest and help inform decisions.
To include an intercept term, we may take X to have i-th row (1, xi^T), so that the first column of X is a column of 1's.
To include quadratic terms, we may take
X to have i-th row (1, xi^T, x_{i1}², · · · , x_{ip}²).
The resulting model will not be linear in the xi , but it is still a linear model because it is linear
in β.
Least squares
Under assumptions (A1)–(A3), a sensible way to estimate β is using OLS. This gives the estimate

    β̂ := (X^T X)^{−1} X^T Y,

which minimises ‖Y − Xb‖² over b ∈ R^p, provided the n by p matrix X has full column rank (i.e. r(X) = p) so that X^T X is invertible
(see example sheet). The fitted values, Ŷ := X β̂ are then given by X(X T X)−1 X T Y . Let
P := X(X^T X)^{−1} X^T. Then P is known as the 'hat' matrix because it puts the hat on Y. In fact
it is an orthogonal projection on to the column space of X. To discuss this further, we recall
some facts about projections from linear algebra.
Given a subspace V ⊆ R^n, its orthogonal complement is V^⊥ := {w ∈ R^n : w^T v = 0 for all v ∈ V}.
(iii) Π2 = Π = ΠT , so Π is idempotent and symmetric. The former is clear from the definition.
To see that Π is symmetric observe that for all u1 , u2 ∈ Rn ,
(iv) Orthonormal bases of V and V ⊥ are eigenvectors of Π with eigenvalues 1 and 0 respectively.
Therefore we can form the eigendecomposition Π = U D U^T where U is an orthogonal
matrix with columns as eigenvectors of Π and D is a diagonal matrix of corresponding
eigenvalues.
(v) r(Π) = dim(V). Also, by the eigendecomposition above, tr(Π) = tr(U D U^T) = tr(D) = dim(V) = r(Π).
Note that the matrix P = X(X T X)−1 X T defined earlier is the orthogonal projection on to
the column space of X. Indeed, P Xb = Xb and if w is orthogonal to the column space of X, so
X T w = 0, then P w = 0. Also, our derivation of P Y as the linear combination of columns of X
that is closest in Euclidean distance to Y reveals another property of orthogonal projections: if
Π is an orthogonal projection on to V , then for any v ∈ Rn , Πv is the closest point on V to the
vector v—in other words
    Πv = argmin_{u∈V} ‖v − u‖².
Recall that for random vectors Z1 ∈ R^{n1} and Z2 ∈ R^{n2}, Cov(Z1, Z2) is the n1 × n2 matrix with (i, j)-th entry Cov(Z_{1,i}, Z_{2,j}), and

    Corr(Z1, Z2)_{ij} := Cov(Z1, Z2)_{ij} / √{Var(Z_{1,i}) Var(Z_{2,j})}.
For any constants a1 ∈ Rn1 and a2 ∈ Rn2 , Cov(Z1 + a1 , Z2 + a2 ) = Cov(Z1 , Z2 ). Also recall that
for any d by n1 matrix A and any constant vector m ∈ Rn1 , as expectation is a linear operator,
E(m + AZ1 ) = m + AE(Z1 ).
We can show that the vector of residuals, ε̂ := Y − Ŷ = (I − P)Y, is uncorrelated with the fitted values Ŷ: indeed, Cov{(I − P)Y, P Y} = (I − P) Var(Y) P^T = σ²(I − P)P = 0.
Here is another way to think of the OLS coefficients that can offer further insight. Let us
write Xj for the j th column of X, and X−j for the n × (p − 1) matrix formed by removing the
j th column from X. Define P−j as the orthogonal projection on to the column space of X−j .
Proposition 1. Let Xj⊥ := (I − P−j )Xj , so Xj⊥ is the orthogonal projection of Xj on to the
orthogonal complement of the column space of X−j . Then
    β̂_j = (X_j^⊥)^T Y / ‖X_j^⊥‖².
Proof. Note that Y = P Y + (I − P)Y and (X_j^⊥)^T(I − P)Y = 0, since X_j^⊥ = (I − P_{−j})X_j lies in the column space of X, so

    (X_j^⊥)^T Y / ‖X_j^⊥‖² = (X_j^⊥)^T X (X^T X)^{−1} X^T Y / ‖X_j^⊥‖².

Moreover, X_j^⊥ is orthogonal to every column of X_{−j}, so

    (X_j^⊥)^T X = (0 · · · 0  (X_j^⊥)^T X_j  0 · · · 0),

with the only non-zero entry in the j-th position, and (X_j^⊥)^T X_j = (X_j^⊥)^T(X_j^⊥ + P_{−j}X_j) = ‖X_j^⊥‖². Hence (X_j^⊥)^T X (X^T X)^{−1} X^T Y = ‖X_j^⊥‖² β̂_j, and the result follows.
We see that Var(β̂_j) = σ² ‖X_j^⊥‖^{−2}. Thus if X_j is closely aligned to the column space of X_{−j}, the variance of β̂_j will be large. In particular, if a pair of variables are highly correlated with each other, the variances of the estimates of the corresponding coefficients will be large.
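As a quick numerical sanity check (not part of the notes; the simulated data and variable names below are purely illustrative), Proposition 1 is easy to verify directly:

```python
# Check that the OLS coefficient of X_j equals the regression of Y on the part
# of X_j orthogonal to the remaining columns (Proposition 1).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # full OLS estimate

j = 2
X_minus_j = np.delete(X, j, axis=1)
P_minus_j = X_minus_j @ np.linalg.solve(X_minus_j.T @ X_minus_j, X_minus_j.T)
X_j_perp = (np.eye(n) - P_minus_j) @ X[:, j]          # (I - P_{-j}) X_j

beta_j_via_perp = X_j_perp @ Y / (X_j_perp @ X_j_perp)
print(np.allclose(beta_hat[j], beta_j_via_perp))      # True
```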
We can measure the quality of a regression procedure by its mean-squared prediction error
(MSPE). This is defined here as
    (1/n) E(‖Xβ − Xβ̂‖²).

Note that Xβ̂ = P Y = Xβ + P ε, so

    E(‖Xβ − Xβ̂‖²) = E(‖P ε‖²) = σ² tr(P) = σ² p.

Thus

    (1/n) E(‖Xβ − Xβ̂‖²) = σ² p/n.
More is true. Note that β̂ is unbiased, as

    E(β̂) = (X^T X)^{−1} X^T E(Y) = (X^T X)^{−1} X^T Xβ = β.

Further,

    Var(β̂) = (X^T X)^{−1} X^T Var(ε) {(X^T X)^{−1} X^T}^T = σ² (X^T X)^{−1}.   (1.1.2)
In fact it is the best linear unbiased estimator (BLUE): for any other unbiased estimator β̃ that is linear in Y, we have that Var(β̃) − Var(β̂) is positive semi-definite. In particular this means that given a new observation x* ∈ R^p, we can estimate the regression function at x* optimally in the sense that E{(x*^T β − x*^T β̂)²} ≤ E{(x*^T β − x*^T β̃)²}.
Theorem 2 (Gauss–Markov). Under (A1)–(A3) OLS is BLUE.
Suppose our data y are modelled as a realisation of a random variable with density (or probability mass function) f(y; θ) for some unknown θ ∈ Θ ⊆ R^d, where Θ is the parameter space. The likelihood function is a function of θ for each fixed y given by

    L(θ) := L(θ; y) = c(y) f(y; θ),

where c(y) is an arbitrary constant of proportionality. We form an estimate θ̂ by choosing the θ which maximises the likelihood. Often it is easier to work with the log-likelihood, defined by

    ℓ(θ) := log L(θ).
If we assume that the errors εi in our linear model have N (0, σ 2 ) distributions, we see that
the log-likelihood for (β, σ 2 ) is
    ℓ(β, σ²) = −(n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − x_i^T β)².
The maximiser of this over β is precisely the least squares estimator (X T X)−1 X T Y . Maximum
likelihood does much more than simply give us another interpretation of OLS here. It allows us
to perform inference: that is, construct confidence intervals for parameters and perform hypothesis tests. Before moving on to this topic, we review some facts about the multivariate normal
distribution.
Note that for example, the vector of residuals from the normal linear model (I − P )ε ∼
Nn (0, σ 2 (I − P )) but it does not have a density of the form given above as I − P is not
invertible.
Proposition 3. If Z1 and Z2 are jointly normal (i.e. (Z1, Z2) has a multivariate normal distribution) and Cov(Z1, Z2) := E[{Z1 − E(Z1)}{Z2 − E(Z2)}^T] = 0, then Z1 and Z2 are independent.
Proof. Let Z̃1 and Z̃2 be independent and have the same distributions as Z1 and Z2 respectively.
Then the mean and variance of the random variables (Z̃1 , Z̃2 ) and (Z1 , Z2 ) are identical and they
are both multivariate normal (the former is multivariate normal because sums of independent
normal random variables are normal). Since a multivariate normal distribution is uniquely determined by its mean and variance, we must have (Z̃1, Z̃2) =_d (Z1, Z2), whence Z1 and Z2 are independent.
χ2 distribution
d
We say Z has a χ2 distribution on k degrees of freedom, and write Z ∼ χ2k if Z = Z12 + · · · + Zk2
i.i.d.
where Z1 , . . . , Zk ∼ N (0, 1).
Proposition 4. Let Π be an n by n orthogonal projection with rank k, and let ε ∼ Nn (0, σ 2 I).
Then kΠεk2 ∼ σ 2 χ2k .
Proof. As Π is an orthogonal projection, we may form its eigendecomposition Π = U D U^T, where U is an orthogonal matrix and D is diagonal with entries in {0, 1}, exactly k of which equal 1. Then ‖Πε‖² = ε^T U D U^T ε = ‖D U^T ε‖², and since U^T ε ∼ N_n(0, σ² I), this is a sum of the squares of k independent N(0, σ²) random variables, i.e. σ²χ²_k.
Student’s t distribution
We say Z has a t distribution on k degrees of freedom, and write Z ∼ t_k, if

    Z =_d Z_1 / √(Z_2/k),

where Z_1 and Z_2 are independent N(0, 1) and χ²_k random variables respectively.
Multivariate t distribution
This is a generalisation of the Student’s t distribution above. We say Z has a p-dimensional
multivariate t distribution on k degrees of freedom, and write Z ∼ tk (µ, Σ) if
    Z =_d µ + Z_1 / √(Z_2/k),

where Z_1 and Z_2 are independent N_p(0, Σ) and χ²_k random variables respectively. It can be shown that when Σ is invertible, Z has density

    f(z) := {Γ((k + p)/2) / (Γ(k/2)(kπ)^{p/2})} |Σ|^{−1/2} {1 + (1/k)(z − µ)^T Σ^{−1}(z − µ)}^{−(k+p)/2},

where the gamma function Γ satisfies Γ(m) = (m − 1)! for integers m ≥ 1.
F distribution
We say Z has an F distribution on k and l degrees of freedom, and write Z ∼ F_{k,l}, if

    Z =_d (Z_1/k) / (Z_2/l),

where Z_1 and Z_2 are independent and follow χ²_k and χ²_l distributions respectively.
Notation. We will denote the upper α-points of the χ2k , tk and Fk,l distributions by χ2k (α),
tk (α) and Fk,l (α) respectively. (So, for example, if Z ∼ χ2k then P{Z ≥ χ2k (α)} = α. As the tk
distribution is symmetric, if Z ∼ tk , then P{−tk (α/2) ≤ Z ≤ tk (α/2)} = 1 − α.)
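These upper α-points are easily computed numerically; the following sketch (not part of the notes, which do not prescribe any software) uses scipy.stats quantile functions:

```python
# Upper alpha-points: q satisfies P(Z >= q) = alpha, i.e. q is the (1 - alpha)
# quantile of the distribution.
from scipy import stats

alpha, k, l = 0.05, 3, 10
chi2_point = stats.chi2.ppf(1 - alpha, df=k)        # chi^2_k(alpha)
t_point    = stats.t.ppf(1 - alpha, df=k)           # t_k(alpha)
f_point    = stats.f.ppf(1 - alpha, dfn=k, dfd=l)   # F_{k,l}(alpha)
print(chi2_point, t_point, f_point)
```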
Informal summary
Distribution of σ̂ 2
The maximum likelihood estimate of σ² is

    σ̂² := (1/n) ‖Y − Xβ̂‖² = (1/n) ‖(I − P)Y‖² = (1/n) ‖(I − P)ε‖².
We already know that the fitted values P Y and residuals (I − P )Y are uncorrelated. But
(P Y, (I − P )Y ) is a linear transformation of the multivariate normal Y , so P Y and (I − P )Y
must be independent. Therefore β̂ = (X T X)−1 X T P Y and σ̂ 2 are independent. Proposition 4
shows that σ̂ 2 ∼ σ 2 χ2n−p /n. Note that E(σ̂ 2 ) = (n − p)σ 2 /n, so σ̂ 2 is a biased estimator of σ 2 .
Let

    σ̃² := n σ̂²/(n − p) = ‖Y − Xβ̂‖²/(n − p) ∼ σ² χ²_{n−p}/(n − p),
so σ̃ 2 is now an unbiased estimator of σ 2 .
Now that we know the joint distribution of (β̂, σ̃ 2 ), it is rather easy to construct confidence
sets for β.
The quantity

    (β̂ − β)/σ̃,

whose numerator satisfies (β̂ − β)/σ ∼ N_p(0, (X^T X)^{−1}) and whose denominator satisfies σ̃/σ ∼ √(χ²_{n−p}/(n − p)) independently of the numerator, is a pivot, that is, its distribution does not depend on β or σ². In fact it has a t^{(p)}_{n−p}(0, (X^T X)^{−1}) distribution. For example, observe that

    (β̂_j − β_j) / √(σ̃² (X^T X)^{−1}_{jj}) ∼ t_{n−p},
so a (1 − α)-confidence interval for βj is given by
    ( β̂_j − √(σ̃² (X^T X)^{−1}_{jj}) t_{n−p}(α/2),  β̂_j + √(σ̃² (X^T X)^{−1}_{jj}) t_{n−p}(α/2) ) =: C_j(α).
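The interval C_j(α) is straightforward to compute; a minimal sketch (not from the notes, simulated data purely for illustration):

```python
# (1 - alpha) confidence interval for beta_j in the normal linear model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, alpha = 50, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_tilde2 = resid @ resid / (n - p)              # unbiased variance estimate
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)       # t_{n-p}(alpha/2)

j = 1
se_j = np.sqrt(sigma_tilde2 * XtX_inv[j, j])
print(beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j)
```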
We can use these distributional results to test hypotheses about individual coefficients, for example

    H0 : β_j = β_{0,j} against H1 : β_j ≠ β_{0,j}   (reject at level α if β_{0,j} ∉ C_j(α)),

and joint hypotheses such as

    H0 : β = β_0 against H1 : β ≠ β_0.
Prediction intervals
Given a new observation x*, we can easily form a confidence interval for x*^T β, the regression function at x*, by noting that x*^T(β̂ − β) ∼ N(0, σ² x*^T(X^T X)^{−1} x*) independently of σ̃², so

    x*^T(β̂ − β) / √(σ̃² x*^T(X^T X)^{−1} x*) ∼ t_{n−p}.
A (1 − α)-level prediction interval for x∗ is a random interval I depending only on Y such
that Pβ,σ2 (Y ∗ ∈ I) = 1 − α where Y ∗ := x∗ T β + ε∗ and ε∗ ∼ N (0, σ 2 ) independently of
ε1 , . . . , εn . This will be wider than the confidence interval for x∗ T β as it must take into account
the additional variability of ε*. Indeed Y* − x*^T β̂ ∼ N(0, σ²{1 + x*^T(X^T X)^{−1} x*}) independently of σ̃², so

    (Y* − x*^T β̂) / √(σ̃²{1 + x*^T(X^T X)^{−1} x*}) ∼ t_{n−p}.
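A minimal sketch (not from the notes; simulated data) of computing this prediction interval, which differs from the confidence interval only through the extra "1 +" term:

```python
# (1 - alpha)-level prediction interval at a new covariate vector x_star.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, alpha = 50, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma_tilde2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)

x_star = np.array([1.0, 0.3, -0.7])
centre = x_star @ beta_hat
# the "1 +" accounts for the variability of the new error epsilon*
half_width = t_crit * np.sqrt(sigma_tilde2 * (1 + x_star @ XtX_inv @ x_star))
print(centre - half_width, centre + half_width)
```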
The Bayesian normal linear model
So far we have treated β and σ 2 as unknown but fixed quantities. We have constructed estima-
tors of these quantities and tried to understand how we expect them to vary under hypothetical
repetitions of the experiment used to generate the data (with the design matrix fixed). This is
a frequentist approach to inference.
A Bayesian approach instead treats unknown parameters as random variables, and examines
their distribution conditional on the data observed. To fix ideas, suppose we have posited that
the density of the r.v. representing our data Y ∈ Rn conditional on a parameter vector θ is
p(y|θ). In addition to this statistical model, the Bayesian method requires that we agree on a
marginal distribution for θ, p(θ). This can represent prior information about the parameters that
is known before any of the data has been analysed, and hence it is called the prior distribution.
Inference about θ is based on the posterior distribution, p(θ|y), which satisfies
p(y|θ) p(θ)
p(θ|y) = .
p(y)
Taking the mean or mode of the posterior distribution gives point estimates for θ. Note that
in order to determine the posterior p(θ|y), we only need knowledge of the right-hand side up to
multiplication by an arbitrary function of y, in particular it suffices to consider p(y|θ)p(θ). To
recover p(θ|y) we simply multiply by
    ( ∫ p(y|θ') p(θ') dθ' )^{−1},
or alternatively we may be able to spot the form of the density for θ and find the normalising
constant that way.
In contrast to frequentist confidence sets, using p(θ|y), we can construct sets S such that
the posterior probability of {θ ∈ S} is at least 1 − α. These are known as credible sets.
In the context of the Bayesian linear model, it is convenient to work with the precision ω :=
σ −2 rather than the variance. A commonly used prior for the parameters (β, ω) is p(β, ω) = ω −1 .
This is not a density since it does not have a finite integral. Nevertheless, the posterior resulting
from this prior is a genuine density, and inference based on this posterior has many similarities
with inference in the frequentist context. To see this we first recall the gamma distribution.
If Z has density

    f(z; a, b) = b^a z^{a−1} e^{−bz} / Γ(a)   for z ≥ 0 and a, b > 0,

we write Z ∼ Γ(a, b) and say Z has a gamma distribution with shape a and rate b. We note,
for future use, that since the gamma density integrates to 1, we must have that
    ∫_0^∞ z^{a−1} e^{−bz} dz = Γ(a)/b^a.   (1.2.1)
    p(y|β, ω) ∝ [ω^{(n−p)/2} exp{−ω‖(I − P)y‖²/2}] × [ω^{p/2} exp{−ω(β − β̂)^T X^T X(β − β̂)/2}],

where the first factor is proportional, as a function of ω, to a Γ((n − p)/2 + 1, ‖(I − P)y‖²/2) density, and the second, for fixed ω, is proportional to an N_p(β̂, ω^{−1}(X^T X)^{−1}) density in β.
Then, multiplying by the prior p(β, ω) ∝ ω^{−1}, we see that the posterior is a product of gamma and normal densities: informally,

    p(β, ω | y) ∝ ω^{n/2 − 1} exp{−ω‖(I − P)y‖²/2} × exp{−ω(β − β̂)^T X^T X(β − β̂)/2}.

Thus β | ω, Y ∼ N_p(β̂, ω^{−1}(X^T X)^{−1}). Compare this to the distribution of β̂ in the frequentist
setting. The marginal posterior for β can be obtained by integrating out ω in the joint posterior
above. Rather than performing the integration directly, we note that as a function of ω alone,
the joint posterior is of the form
    ω^{A−1} exp(−ωB),

where A = n/2 and B = ½{‖(I − P)y‖² + ‖X(β − β̂)‖²}. Thus by (1.2.1), the marginal posterior for β satisfies

    p(β|y) ∝ ∫_0^∞ ω^{A−1} exp(−ωB) dω
           ∝ B^{−A}
           ∝ {1 + ‖X(β − β̂)‖² / ‖(I − P)y‖²}^{−{(n−p)/2 + p/2}}
           ∝ {1 + (1/(n − p)) (β − β̂)^T (σ̃^{−2} X^T X)(β − β̂)}^{−{(n−p)/2 + p/2}},
which we recognise as proportional to the density of a t^{(p)}_{n−p}(β̂, σ̃²(X^T X)^{−1}) distribution. Thus

    (β − β̂)/σ̃ | Y ∼ t^{(p)}_{n−p}(0, (X^T X)^{−1}),

similarly to the frequentist case, though here it is β rather than β̂ that is random. From this we see that

    (β_j − β̂_j) / √(σ̃² (X^T X)^{−1}_{jj}) | Y ∼ t_{n−p},   and   ‖X(β − β̂)‖² / (p σ̃²) | Y ∼ F_{p,n−p},
so the frequentist confidence regions described in earlier sections can also be thought of as
Bayesian credible regions, when the prior p(β, ω) ∝ ω −1 is used.
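The conjugate structure above also makes Monte Carlo sampling from the posterior straightforward. The sketch below (not from the notes; simulated data) uses the facts, which follow from the joint posterior displayed above, that ω | Y ∼ Γ((n − p)/2, ‖(I − P)y‖²/2) and β | ω, Y ∼ N_p(β̂, ω^{−1}(X^T X)^{−1}):

```python
# Direct sampling from the posterior under the prior p(beta, omega) ~ 1/omega.
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)                 # ||(I - P)y||^2

n_draws = 5000
# numpy's gamma is parametrised by scale = 1/rate
omega = rng.gamma(shape=(n - p) / 2, scale=2 / rss, size=n_draws)
beta_draws = np.array([rng.multivariate_normal(beta_hat, XtX_inv / w)
                       for w in omega])
# 95% equal-tailed credible interval for beta_1; compare with the frequentist
# t_{n-p}-based confidence interval.
print(np.percentile(beta_draws[:, 1], [2.5, 97.5]))
```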
Suppose now that the design matrix and coefficient vector are partitioned as X = (X_0 X_1) and β = (β_0^T, β_1^T)^T, where X_0 is n by p_0 and X_1 is n by p − p_0, and correspondingly β_0 ∈ R^{p_0} and β_1 ∈ R^{p−p_0}. We are interested in testing

    H0 : β_1 = 0 against
    H1 : β_1 ≠ 0.
One sensible way of proceeding is to construct a generalised likelihood ratio test. Recall that
given an n-vector Y , assumed to have density f (y; θ) for some unknown θ ∈ Θ, the likelihood
ratio test for testing
    H0 : θ ∈ Θ_0 against
    H1 : θ ∉ Θ_0,

where Θ_0 ⊂ Θ, rejects the null hypothesis for large values of w_LR defined by

    w_LR(H0) = 2 log{ sup_{θ'∈Θ} L(θ') / sup_{θ'∈Θ_0} L(θ') } = 2{ sup_{θ'∈Θ} ℓ(θ') − sup_{θ'∈Θ_0} ℓ(θ') }.
Let us apply the generalised likelihood ratio test to the problem of assigning significance
to groups of variables in the linear model. Write β̌0 and σ̌ 2 for the MLEs of the vector of
regression coefficients and the variance respectively under the null hypothesis (i.e. when the
model is Y = X0 β0 + ε with ε ∼ Nn (0, σ 2 I)).
We have

    w_LR(H0) = −n log(σ̂²) − ‖Y − Xβ̂‖²/σ̂² + n log(σ̌²) + ‖Y − X_0β̌_0‖²/σ̌²
             = −n log{ ‖(I − P)Y‖² / ‖(I − P_0)Y‖² },

where P_0 denotes the orthogonal projection on to the column space of X_0.
To determine the right cutoff for an α-level test, we need to obtain the distribution of (a
monotone function of the) argument of the logarithm under the null hypothesis, that is, the
distribution of
    ‖(I − P_0)ε‖² / ‖(I − P)ε‖².
By dividing top and bottom by σ 2 , we see that the distribution of the quantity above doesn’t
depend on any unknown parameters. To find its distribution we argue as follows. Write
I − P0 = (I − P ) + (P − P0 ).
Now since the columns of P and P_0 are in the column space of X, (I − P)(P − P_0) = 0, so

    ‖(I − P_0)ε‖² = ‖(I − P)ε‖² + ‖(P − P_0)ε‖²,

whence

    ‖(I − P_0)ε‖² / ‖(I − P)ε‖² = 1 + ‖(P − P_0)ε‖² / ‖(I − P)ε‖².
Also, Cov{(I − P)ε, (P − P_0)ε} = σ²(I − P)(P − P_0) = 0, and since the pair ((I − P)ε, (P − P_0)ε) is multivariate normal (being the image of a multivariate normal vector under a linear map), we know that (I − P)ε and (P − P_0)ε are independent. Hence ‖(I − P)ε‖² and ‖(P − P_0)ε‖² are independent. We know that ‖(I − P)ε‖²/σ² ∼ χ²_{n−p}. It turns out that ‖(P − P_0)ε‖²/σ² ∼ χ²_{p−p_0}.
This follows from Proposition 4 and the fact that P − P0 is an orthogonal projection with
rank p − p0 . Indeed, it is certainly symmetric, and
(P − P0 )2 = P − P P0 − P0 P + P0 = P − P0 ,
the final equality following from P0 P = P0T P T = (P P0 )T = P0T = P0 . Thus P − P0 is an
orthogonal projection, so we know
r(P − P0 ) = tr(P − P0 ) = tr(P ) − tr(P0 ) = r(P ) − r(P0 ) = p − p0 .
Finally, we may conclude that
    { (1/(p − p_0)) ‖(P − P_0)ε‖² } / { (1/(n − p)) ‖(I − P)ε‖² } ∼ F_{p−p_0, n−p}.
In summary, we can perform a generalised likelihood ratio test for
H0 : β1 = 0 against
H1 : β1 6= 0
at level α by comparing the test statistic
    { (1/(p − p_0)) ‖(P − P_0)Y‖² } / { (1/(n − p)) ‖(I − P)Y‖² }
to Fp−p0 ,n−p (α) and rejecting for large values of the test statistic.
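A minimal sketch (not from the notes; simulated data in which H0 is false) of computing this F statistic directly from the two projections:

```python
# F-test of H0: beta_1 = 0 via the projections P and P0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p0, p = 80, 2, 4
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, p0 - 1))])
X1 = rng.normal(size=(n, p - p0))
X = np.column_stack([X0, X1])
Y = X @ np.array([1.0, 2.0, 1.5, -1.0]) + rng.normal(size=n)

def proj(A):
    # orthogonal projection on to the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

P, P0 = proj(X), proj(X0)
num = np.sum(((P - P0) @ Y) ** 2) / (p - p0)
den = np.sum(((np.eye(n) - P) @ Y) ** 2) / (n - p)
F_stat = num / den
print(F_stat, stats.f.ppf(0.95, dfn=p - p0, dfd=n - p))  # reject if F_stat is larger
```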
Suppose, for example, that the k-th participant on exercise regime j has weight loss Y_{jk}, and that we model the Y_{jk} as independent with Y_{jk} = µ_j + ε_{jk}, for j = 1, . . . , J and k = 1, . . . , n_j. This type of model is known as a one-way analysis of variance (ANOVA). If all the n_j were equal, it would be called a balanced one-way ANOVA.
An alternative parametrisation is

    Y_{jk} = µ + α_j + ε_{jk},

where µ is the baseline or mean effect and α_j is the effect of the j-th regime in relation to the baseline.
Notice that the parameters (µ, α_1, . . . , α_J) are not identifiable since, for example, replacing µ with µ + c and each α_j with α_j − c gives the same model for every c ∈ R. To make the model identifiable, one option is to constrain α_1 = 0. This is known as a corner point constraint and is the default in R. It makes it easier to test for differences from the control. Another option is to use a sum-to-zero constraint: Σ_{j=1}^J n_j α_j = 0. Note that the particular constraints used do not affect the fitted values in any way.
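As a quick illustration (not from the notes; the group sizes and effects below are invented), one can check numerically that two different parametrisations give identical fitted values, since the corresponding design matrices span the same column space:

```python
# Fitted values of a one-way ANOVA are invariant to the choice of constraint.
import numpy as np

rng = np.random.default_rng(5)
J, n_j = 3, 10
groups = np.repeat(np.arange(J), n_j)
y = np.array([0.0, 1.0, 3.0])[groups] + rng.normal(size=J * n_j)

D = (groups[:, None] == np.arange(J)).astype(float)   # group indicator matrix

# corner-point constraint alpha_1 = 0: intercept plus indicators of groups 2..J
X_corner = np.column_stack([np.ones(J * n_j), D[:, 1:]])
# sum-to-zero constraint (equal group sizes): parametrise alpha_J = -sum(others)
X_sum = np.column_stack([np.ones(J * n_j), D[:, :-1] - D[:, [-1]]])

def fitted(X):
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(fitted(X_corner), fitted(X_sum)))   # True
```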
If each of the subjects in our hypothetical experiment also went on one of I different diets, then writing Y_{ijk} now to mean the weight loss of the k-th participant of exercise regime j and diet i, we might model the Y_{ijk} as independent with

    Y_{ijk} = µ + β_i + α_j + ε_{ijk}.
This model is called an additive two-way ANOVA because it assumes that the effects of the
different factors are additive. The model is over-parametrised and as before, constraints must
be imposed on the parameters to ensure identifiability. By default, R uses the corner point
constraints α1 = β1 = 0.
If the contribution of one of the exercise regimes to the response was not the same for all the different types of diets, it may be more appropriate to use the model

    Y_{ijk} = µ + β_i + α_j + γ_{ij} + ε_{ijk},

which includes interaction terms γ_{ij} between diet and exercise regime.
    Var(β̂_{0,j}) = σ² / ‖(I − P_{0,−j})X_j‖² ≤ σ² / ‖(I − P_{−j})X_j‖² = Var(β̂_j),
for j = 1, . . . , p0 (see example sheet). Here P0,−j is the orthogonal projection on to the column
space of X0,−j , the matrix formed by removing the j th column from X0 .
It is thus useful to check whether a model formed from a smaller set of variables can ad-
equately explain the data observed. Another advantage of selecting the right model is that it
allows one to focus on variables of interest.
Coefficient of determination
One popular measure of the goodness of fit of a linear model is the coefficient of determination
or R2 . It compares the residual sum of squares (RSS) under the model in question to a minimal
model containing just an intercept, and is defined by
    R² := { ‖Y − Ȳ 1_n‖² − ‖(I − P)Y‖² } / ‖Y − Ȳ 1_n‖²,
where 1n is an n-vector of 1’s. The interpretation of R2 is as the proportion of the total variation
in the data explained by the model. It takes values between 0 and 1 with higher values indicating
a better fit. The R2 will always increase if variables are added to the model. The adjusted R2 ,
R̃2 defined by
    R̃² := 1 − (1 − R²)(n − 1)/(n − p)
can be motivated by analogy with the F statistic, and takes account of the number of parameters.
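A minimal sketch (not from the notes; simulated data) computing both quantities directly from their definitions:

```python
# R^2 and adjusted R^2 from residual and total sums of squares.
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.5, 0.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
rss = np.sum((Y - X @ beta_hat) ** 2)                 # ||(I - P)Y||^2
tss = np.sum((Y - Y.mean()) ** 2)                     # ||Y - Ybar 1_n||^2

R2 = (tss - rss) / tss
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)
print(R2, R2_adj)
```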
AIC
Another approach to measuring the fit of a model is Akaike’s Information Criterion (AIC). We
will describe AIC in a more general setting than the normal linear model, since it will be used
when assessing the fit of generalised linear models which will be introduced in the next chapter.
Suppose that our data (Y1 , xT1 ), . . . , (Yn , xTn ) are generated with Yi independent conditional
on the design matrix X whose rows are the xTi . Suppose that given xi , the true pdf of Yi is
gxi and from a model F := {(fxi (·; θ))ni=1 , θ ∈ Θ ⊆ Rp } the corresponding maximum likelihood
fitted pdf is fxi (·; θ̂). One measure of the quality of fˆxi (·) := fxi (·; θ̂) as an estimate of the true
density g_{x_i} is the Kullback–Leibler divergence, K(g_{x_i}, f̂_{x_i}), defined as

    K(g_{x_i}, f̂_{x_i}) := ∫_{−∞}^{∞} [log{g_{x_i}(y)} − log{f̂_{x_i}(y)}] g_{x_i}(y) dy.
One can show via Jensen's inequality that the average K̄ := (1/n) Σ_{i=1}^n K(g_{x_i}, f̂_{x_i}) ≥ 0, with equality if and only if each g_{x_i} = f̂_{x_i} (almost surely). Thus if K̄ is low, we have a good fit. Given a collection of different fitted densities for the data, it is therefore desirable to select that which minimises K̄. This is equivalent to minimising
    K̃ := −(1/n) Σ_{i=1}^n ∫_{−∞}^{∞} log{f̂_{x_i}(y)} g_{x_i}(y) dy = −(1/n) Σ_{i=1}^n E_{Y_i^* ∼ g_{x_i}}[ log{f̂_{x_i}(Y_i^*)} | Y_i ].
Of course, we cannot compute K̄ or K̃ from the data since this requires knowledge of gxi
for i = 1, . . . , n. However, it can be shown that it is possible to estimate E(K̃) (where the
expectation is over the randomness in the f̂_{x_i}). Akaike's information criterion (AIC), defined as

    AIC := −2ℓ(θ̂; Y) + 2p,

satisfies E(AIC)/n ≈ 2E(K̃) for large n, provided the true densities g_{x_i}, i = 1, . . . , n are
contained in the model F.
In the normal linear model where X is n by p with full column rank, AIC amounts to

    n log(2πσ̂²) + n + 2(p + 1);

thus the best set of variables to use according to the AIC method is determined by minimising n log(σ̂²) + 2p, or equivalently n log(σ̂) + p, across all candidate models.
Fact: if Z ∼ χ²_k with k > 2 then E(Z^{−1}) = (k − 2)^{−1}. Since σ̂² and ‖Xβ − Xβ̂‖² = ‖P ε‖² are independent, the second expectation in the display above equals

    n(n + p) / (n − p − 2),

leading to the criterion

    n log(2πσ̂²) + n(n + p)/(n − p − 2).
The corrected information criterion, AICc , is given by
    AIC_c = n log(2πσ̂²) + n (1 + p/n) / (1 − (p + 2)/n).
Note that

    n (1 + p/n)/(1 − (p + 2)/n) = n { 1 + 2 ((p + 1)/n) / (1 − (p + 2)/n) }
                                = n + 2(p + 1) / (1 − (p + 2)/n).
Thus when p/n is small, AICc ≈ AIC in the case of the normal linear model.
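A minimal sketch (not from the notes; simulated data, with the variance parameter counted as in the expressions above) of computing AIC and AICc for a normal linear model:

```python
# AIC and AICc for the normal linear model, using sigma_hat^2 = RSS / n.
import numpy as np

def aic_normal_linear(Y, X):
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    sigma_hat2 = np.sum((Y - X @ beta_hat) ** 2) / n
    loglik = -0.5 * n * np.log(2 * np.pi * sigma_hat2) - 0.5 * n
    aic = -2 * loglik + 2 * (p + 1)                   # p coefficients + variance
    aicc = n * np.log(2 * np.pi * sigma_hat2) + n * (1 + p / n) / (1 - (p + 2) / n)
    return aic, aicc

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)
print(aic_normal_linear(Y, X))      # smaller values indicate a better trade-off
```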
Orthogonality
One way to use the above model selection criteria is to fit each of the 2^{p−1} submodels that can
be created using our design matrix (assuming we include an intercept every time and the first
column of X is a column of 1’s) and pick the one that seems best based on our criterion of
choice. However, if p is reasonably large, this becomes a very computationally intensive task.
One situation where such an approach is feasible is when the columns of X are orthogonal.
Indeed, more generally, if X can be partitioned as X = (X0 X1 ) with the vector of coefficients
correspondingly partitioned as β = (β0T , β1T )T , we say that β0 and β1 are orthogonal sets of
parameters if X0T X1 = 0. Then
    β̂ = { (X_0 X_1)^T (X_0 X_1) }^{−1} (X_0 X_1)^T Y,

and since X_0^T X_1 = 0 the matrix (X_0 X_1)^T(X_0 X_1) is block diagonal with blocks X_0^T X_0 and X_1^T X_1, so (in block form)

    β̂ = ( (X_0^T X_0)^{−1} X_0^T Y ;  (X_1^T X_1)^{−1} X_1^T Y ) = ( β̂_0 ; β̂_1 ).
If all the columns of X are orthogonal, we can easily find the best fitting model (in terms of the
RSS) with p_0 variables. We simply order the ‖β̂_j X_j‖² = (X_j^T Y)²/‖X_j‖² (excluding the intercept term) in decreasing order, and pick variables corresponding to the first p_0 terms. This works
because letting XS for S ⊆ {1, . . . , p} be the matrix formed from the columns of X indexed by
S, and writing PS for the projection on to the column space of XS ,
    ‖(I − P_S)Y‖² = ‖Y − Σ_{j∈S} β̂_j X_j‖² = ‖Y‖² − Σ_{j∈S} ‖β̂_j X_j‖².
Exact orthogonality is of course unlikely to occur unless we have designed the design matrix X
ourselves, either through choosing the values of the original covariates, or through transforming
them in particular ways. A very common example of the latter is mean-centring each variable
before adding an intercept term, so the intercept coefficient is then orthogonal to the rest of the
coefficients.
Forward selection.
1. Start with a minimal model (for example, one containing just an intercept) and call this S0.
2. Add to the current model the predictor variable that reduces the residual sum of squares the most.
3. Continue step 2 until all predictor variables have been chosen or until a large number of
predictor variables has been selected. This produces a sequence of sub-models S0 ⊂ S1 ⊂
S2 ⊂ · · · .
4. Pick a model from the sequence of models created using either AIC or R2 based criteria
(or something better!).
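A minimal sketch (not from the notes; the helper names are illustrative) of forward selection driven by the residual sum of squares:

```python
# Forward selection by RSS; the intercept (first column of X) is always kept.
import numpy as np

def rss(Y, X, cols):
    Xs = X[:, cols]
    beta = np.linalg.lstsq(Xs, Y, rcond=None)[0]
    return np.sum((Y - Xs @ beta) ** 2)

def forward_selection(Y, X):
    p = X.shape[1]
    selected, remaining, path = [0], list(range(1, p)), []
    while remaining:
        # add the variable giving the largest drop in RSS
        best = min(remaining, key=lambda j: rss(Y, X, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))   # nested sequence S_1, S_2, ...
    return path

rng = np.random.default_rng(8)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
Y = X @ np.array([1.0, 0.0, 3.0, 0.0, -2.0, 0.0]) + rng.normal(size=n)
print(forward_selection(Y, X))        # a final model can then be chosen by AIC
```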
An alternative is:
Backward selection.
1. Fit the largest model available (i.e. include all predictors) and call this S0 .
2. Exclude the predictor variable whose removal from the current model decreases the resid-
ual sum of squares the least.
3. Continue step 2 until all predictor variables have been removed (or a large number of
predictor variables have been removed). This produces a sequence of submodels S0 ⊃
S1 ⊃ S2 ⊃ · · · .
(A1) E(εi ) = 0. If this is false, the coefficients in the linear model need to be interpreted with
care. Furthermore, our estimate of σ 2 will tend to be inflated and F -tests may lose power
though they will have the correct size (see example sheet).
(A2) Var(εi ) = σ 2 . This assumption of constant variance is called homoscedasticity, and its
violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption
means the least squares estimates are not as efficient as they could be, and furthermore
hypothesis tests and confidence intervals need not have their nominal levels and coverages
respectively. If the variances of the errors are known up to an unknown multiplicative
constant, weighted least squares can be used (see example sheet).
(A3) Cov(εi, εj) = 0 for i ≠ j: the errors are uncorrelated. When data are ordered in time
or space, this assumption is often violated. As with heteroscedasticity, the standard
inferential techniques can give misleading results.
(A4) The errors εi are normally distributed. Though the confidence intervals and hypothesis
tests we have studied rest on the assumption of normality, arguments based on the central
limit theorem can be used to show that even when the errors are not normally distributed,
provided (A1–A3) are satisfied, inferences are still asymptotically valid under reasonable
conditions.
A useful way of assessing whether the assumptions above are satisfied is to analyse the
residuals ε̂ := (I − P )Y arising from the model fit. This is usually done graphically rather than
through formal tests. An advantage of the graphical approach is that we can look for many
different signs for departures from the assumptions simultaneously. One potential issue is that
it may not always be clear what indicates a genuine violation of assumptions compared to the
natural variation that one should expect even if the assumptions held.
Note that under (A1), E(ε̂) = 0. It is common to plot the residuals against the fitted values
Ŷi , and also against each of the variables in the design matrix (including those not in the current
model). If (A1) holds, there should not be an obvious trend in the mean of the residuals.
Under (A2) and (A3), Var(ε̂) = σ²(I − P). Define the studentised residuals to be

    η̂_i := ε̂_i / (σ̃ √(1 − p_i)),   where p_i := P_ii,  i = 1, . . . , n.
Provided σ̃ is a good estimate of σ, the variance of η̂_i should be approximately 1. A standard check of the validity of (A2) involves plotting √|η̂_i| against the fitted values.
If (A1–A4) hold, then we’d expect the η̂i to look roughly like an i.i.d. sample from a N (0, 1)
distribution since
    η̂_i ≈ ε̂_i / (σ √(1 − p_i))

and so

    Cov(η̂_i, η̂_j) ≈ −P_ij / √((1 − p_i)(1 − p_j)),
for i ≠ j. When n ≫ p we expect this covariance to be close to 0 because

    (1/n²) Σ_{i,j} P_ij² = (1/n²) tr(P^T P) = (1/n²) tr(P) = (1/n²) r(P) = p/n²,

which is small.
Variable transformations
We have already discussed how predictors may be transformed so that models that are nonlinear
in the original data (but linear in the parameter β) still fall within the linear model framework.
Sometimes it can also be helpful to transform the response so that it fits the linear model.
Consider, for example, a model with multiplicative errors such as Y_i = exp(x_i^T β + ε_i). If we make the transformation Y_i ↦ log(Y_i) we will have a linear model in the logged response.
The Box–Cox family of transformations is given by

    y ↦ y^{(λ)} := (y^λ − 1)/λ   if λ ≠ 0,
                   log(y)        if λ = 0.
Typically one plots the log-likelihood of the transformed data (y_1^{(λ)}, . . . , y_n^{(λ)}) as a function of λ
and then selects a value of λ which lies close to the λ that maximises the log-likelihood, and
still gives a model with interpretable parameters.
Unusual observations
Often we may find that though the bulk of our data satisfy the assumptions (A1–A4) and fit the
model well, there are a few observations that do not. These are called outliers. It is important
to detect these so that they can be excluded when fitting the model, if necessary. A more subtle
way in which an observation can be unusual is if it is unusual in the predictor space i.e. it has
an unusual x value; it is this we discuss first.
The value pi := Pii is called the leverage of the ith observation. It measures the contribution
that Yi makes to the fitted value Ŷi. It can be shown that 0 ≤ pi ≤ 1. Since Var(ε̂_i) = σ²(1 − p_i),
values of pi close to 1 force the regression line (or plane) to pass very close to Yi .
The idea of leverage is about the potential for an observation to have a large effect on the
fit; if the observation does not have an unusual response value, it is possible that removing the
observation will change the estimated regression coefficients very little. However in this case,
the R2 and the results of an F -test with the null hypothesis as the intercept only model may
still change a lot.
The relationship Σ_{i=1}^n p_i = tr(P) = p motivates a rule of thumb that says the influence of the i-th observation may be of concern if p_i > 3p/n. When the design matrix consists of just a single variable and a column of 1's representing an intercept term (as the first column), it can be shown that

    p_i = 1/n + (X_{i2} − X̄_2)² / Σ_{k=1}^n (X_{k2} − X̄_2)²,

where X̄_2 := n^{−1} Σ_{k=1}^n X_{k2}.
Cook’s distance. The Cook’s distance Di of the observation (Yi , xi ) is defined as
    D_i := (1/p) ‖X(β̂_{(−i)} − β̂)‖² / σ̃²,

where β̂_{(−i)} is the OLS estimate of β when omitting observation (Y_i, x_i).
The interpretation of Cook’s distance is that if Di = Fp,n−p (α) then omitting the ith data
point moves the m.l.e. of β to the edge of the (1 − α)-level confidence set for β.
Note that we do not need to fit n + 1 linear models to compute all of the Cook’s distances,
since in fact
    D_i = (1/p) · {p_i/(1 − p_i)} · η̂_i²   (see example sheet).
Thus Cook's distance combines the studentised residuals with the leverage as a measure
of influence. A rule of thumb is that we should be concerned about the influence of (Yi , xi ) if
Di > Fp,n−p (0.5).
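A minimal sketch (not from the notes; simulated data with one observation pushed far out in the predictor space) computing leverages, studentised residuals and Cook's distances:

```python
# Leverage p_i, studentised residuals and Cook's distance from the hat matrix.
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.normal(size=n)
x[0] = 8.0                                            # a high-leverage point
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
n, p = X.shape

P = X @ np.linalg.solve(X.T @ X, X.T)                 # hat matrix
leverage = np.diag(P)                                 # p_i = P_ii
resid = Y - P @ Y
sigma_tilde2 = np.sum(resid ** 2) / (n - p)
student = resid / np.sqrt(sigma_tilde2 * (1 - leverage))
cooks = student ** 2 * leverage / ((1 - leverage) * p)

print(leverage[0], 3 * p / n)                         # rule of thumb p_i > 3p/n
print(cooks[0])
```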
Chapter 2
(i) The random component: Y1, . . . , Yn are independent normal random variables, with Yi having mean µi and variance σ².
(ii) The systematic component: a linear predictor ηi := xi^T β formed from the covariates.
(iii) The link between the random and systematic components: µi = ηi.
Of course this is an unnecessarily wasteful way to write out the linear model, but it is suggestive of generalisations.
GLMs extend linear models in (i) and (iii) above, allowing different classes of distributions
for the response variables and allowing a more general link:
ηi = g(µi )
Why is this a useful endeavour? We could just work with a particular family of distributions
for the response that is useful for our own purposes, and develop algorithms for estimating
parameters and theory for the distributions of our estimates (just as we did for the normal
linear model). However, if we work in a more general framework, we may be able to
formulate inference procedures and develop computational techniques that are applicable for a
number of families of distributions.
We begin our quest for such a general framework with the concept of an exponential family.
We motivate the idea by starting with a single density or probability mass function f0 (y),
y ∈ Y ⊆ R. Rather than always writing “density or probability mass function”, we will use
the term “model function” to mean either a density function or p.m.f. (Of course, those of you
who attended Probability and Measure will know that p.m.f.’s are just densities with respect
to counting measure, so we could equally well use “density” throughout).
We will require that f0 be a non-degenerate model function, that is if Y has model function
f0 then Var(Y ) > 0. For example, f0 (y) might be the uniform density on the unit interval
Y = [0, 1], or might be the probability mass function taking the value 1/2 at each point of Y = {0, 1}.
We can generate a whole family of model functions based on f0 via exponential tilting:
    f(y; θ) = e^{yθ} f_0(y) / ∫ e^{y'θ} f_0(y') dy',   y ∈ Y.
We can only consider values of θ for which the integral in the denominator is finite. Note that
the denominator is precisely the moment generating function of f0 evaluated at θ. Let us briefly
recall some facts about moment generating functions before proceeding.
The moment and cumulant generating functions. The moment generating function
(m.g.f.) of a random variable, or equivalently its model function, is M (t) := E(etY ). The
cumulant generating function (c.g.f.) is the logarithm of the m.g.f.: K(t) := log(M (t)). The set
of values where these functions are finite is an interval containing 0. If this contains an open
interval about 0, then we have the series expansions

    M(t) = Σ_{r=0}^∞ E(Y^r) t^r / r!,
    K(t) = Σ_{r=0}^∞ κ_r t^r / r!,
where κ_r is known as the r-th cumulant. Standard theory about power series tells us that the cumulants can be recovered by differentiating K and evaluating at 0; in particular κ_1 = E(Y) and κ_2 = Var(Y). Writing Θ for the set of θ at which the denominator in the definition of f(y; θ) is finite, the family
{f (y; θ) : θ ∈ Θ}
is called the natural exponential family (of order 1) generated by f0 , and is an example of
an exponential family. With a different generating model function f0 , we can get a different
exponential family.
The parameter θ is called the natural parameter and Θ is called the natural parameter space. Note that we may write f(y; θ) = e^{θy − K(θ)} f_0(y), where K is the c.g.f. of f_0, so ∫_Y e^{θy − K(θ)} f_0(y) dy = 1 for all θ ∈ Θ.
The mean of f (y; θ) is of course related to the parameter θ, and it is often useful to
reparametrise the family of model functions in terms of their means. To discuss this, let us
first find the mean and variance of f (y; θ), i.e. the first and second cumulants.
The m.g.f. of f(·; θ), M(t; θ), is

    M(t; θ) = ∫_Y e^{ty} e^{θy − K(θ)} f_0(y) dy
            = e^{K(θ+t) − K(θ)} ∫_Y e^{(θ+t)y − K(θ+t)} f_0(y) dy
            = e^{K(θ+t) − K(θ)},   for θ, θ + t ∈ Θ.
Thus if Y has model function f(y; θ), its c.g.f. is K(t; θ) := K(θ + t) − K(θ), so

    E_θ(Y) = d/dt K(t; θ)|_{t=0} = K'(θ),    Var_θ(Y) = d²/dt² K(t; θ)|_{t=0} = K''(θ).
It can be shown that since f0 was assumed to be non-degenerate, so must be every f (y; θ).
Then
µ(θ) := E_θ(Y) = K'(θ) satisfies µ'(θ) = K''(θ) > 0,
so µ is a smooth, strictly increasing function from Θ to M := {µ(θ) : θ ∈ Θ} (M for ‘mean
space’), with inverse function θ := θ(µ). This leads to the mean value parametrisation:
f (y; µ) = eθ(µ)y−K(θ(µ)) f0 (y), y ∈ Y, µ ∈ M.
The function V : M → (0, ∞) defined by V (µ) = Varθ(µ) (Y ) = K 00 (θ(µ)) is called the variance
function.
Examples.
1. Let f_0 = φ, the standard normal density. Then M(θ) = e^{θ²/2}, θ ∈ R, so K(θ) = θ²/2. Thus the natural exponential family generated by the standard normal density is

    f(y; θ) = e^{θy − θ²/2} (1/√(2π)) e^{−y²/2} = (1/√(2π)) e^{−(y−θ)²/2},   y ∈ R, θ ∈ R.
This is the N (θ, 1) family. Clearly µ(θ) = θ, θ(µ) = µ, M = R and V (µ) = 1, as can be
verified by taking derivatives of K(θ).
2. Let f_0 denote the Pois(1) p.m.f.:

    f_0(y) = e^{−1}/y!,   y ∈ {0, 1, . . .}.

Then

    M(θ) = Σ_{r=0}^∞ e^{θr} e^{−1}/r! = exp(e^θ − 1).

Thus with exponential tilting, we get

    f(y; θ) = e^{θy − exp(θ)}/y! = (e^θ)^y exp(−e^θ)/y!,   y ∈ {0, 1, . . .}, θ ∈ R.
This is the Pois(eθ ) family of distributions. The mean function is µ = eθ with inverse
θ = log(µ), and the variance function, V (µ) = µ; the mean space is M = (0, ∞).
Technical conditions. Why did we impose the technical conditions that the set of values
where the c.g.f. of f0 is finite, Θ, is an open interval containing 0? Note that then given any
θ ∈ Θ, {t : θ + t ∈ Θ} = Θ − θ is an open interval containing 0. Thus the result we have shown, that K(θ + t) − K(θ) is the c.g.f. of f(·; θ), is valid for all t ∈ Θ − θ, and as this has a power series expansion, we can recover the cumulants by taking derivatives and evaluating at 0.
A family of model functions of the form

    f(y; θ, σ²) = a(σ², y) exp[ (1/σ²){θy − K(θ)} ],   y ∈ Y, θ ∈ Θ, σ² ∈ Φ ⊆ (0, ∞),   (2.3.1)

where
• a(σ 2 , y) is a known positive function (c.f. f0 (y) that generated the exponential family),
• Θ is an open interval,
and in addition the model functions are non-degenerate, is called an exponential dispersion
family (of order 1). The parameter σ 2 is called the dispersion parameter. (Note many authors
simply call the family of model functions in (2.3.1) an example of an exponential family.)
Let K(·; θ, σ 2 ) be the c.g.f. of the model function f (y; θ, σ 2 ) in (2.3.1). It can be shown (see
example sheet) that the c.g.f. of the density in (2.3.1) is
    K(t; θ, σ²) = {K(σ²t + θ) − K(θ)} / σ²,
for θ + σ 2 t ∈ Θ. Since the set of values where K(·; θ, σ 2 ) is finite contains an open interval
about 0, if Y has model function (2.3.1) then E_{θ,σ²}(Y) = K'(θ) and Var_{θ,σ²}(Y) = σ² K''(θ).
As before, we may define µ(θ) := K 0 (θ). Since Varθ,σ2 (Y ) > 0 (by non-degeneracy of the
model functions), K 00 (θ) > 0, so we can define an inverse function to µ, θ(µ). Further define
M := {µ(θ) : θ ∈ Θ} and variance function V : M → (0, ∞) given by V (µ) := K 00 (θ(µ))
(though now the variance of the model function is actually σ 2 V (µ)).
Examples.
1. Consider the family N (ν, τ 2 ) where ν ∈ R and τ 2 ∈ (0, ∞). We may write the densities
as
    f(y; ν, τ²) = (1/√(2πτ²)) exp(−y²/(2τ²)) exp{ (1/τ²)(νy − ν²/2) },

showing that this family is an exponential dispersion family with θ = ν, µ(θ) = θ and σ² = τ². Of course we know that µ(θ) = θ and V(µ) = 1, but we can also check this by differentiating K(θ) = θ²/2.
2. Let Z ∼ Bin(n, p). Then Y := Z/n has p.m.f.

    f(y; p) = \binom{n}{ny} p^{ny} (1 − p)^{n(1−y)},   y ∈ {0, 1/n, 2/n, . . . , 1}.
Consider the family of p.m.f.'s of the form above with p ∈ (0, 1) and n ∈ N. To show this is an exponential dispersion family, we write

    f(y; p) = \binom{n}{ny} exp[ ny log{p/(1 − p)} + n log(1 − p) ]
            = \binom{1/σ²}{y/σ²} exp[ {yθ − log(1 + e^θ)}/σ² ],
with σ 2 = 1/n, θ = log{p/(1 − p)} and K(θ) = log(1 + eθ ). To find the mean function
µ(θ), we differentiate K:

    µ(θ) = d/dθ log(1 + e^θ) = e^θ/(1 + e^θ)   (= p),

with inverse θ(µ) = log{µ/(1 − µ)}. Differentiating once more we see that

    V(µ) = {(1 + e^{θ(µ)}) e^{θ(µ)} − (e^{θ(µ)})²} / (1 + e^{θ(µ)})²
         = {e^{θ(µ)}/(1 + e^{θ(µ)})} {1 − e^{θ(µ)}/(1 + e^{θ(µ)})}
         = µ(1 − µ).
Here M = (0, 1) and Φ = {1/n : n ∈ N}.
3. Consider the gamma family of densities,
    f(y; α, λ) = λ^α y^{α−1} e^{−λy} / Γ(α)   for y > 0 and α, λ > 0.
It is not immediately clear how to write this in exponential dispersion family form, so let us
take advantage of the fact that we know the mean and variance of a gamma distribution. If
Y has the gamma density then Eα,λ (Y ) = α/λ and Varα,λ (Y ) = α/λ2 . If this family were
an exponential dispersion family then µ = α/λ and σ 2 V (µ) = α/λ2 . It is not clear what
we should take as σ 2 . However, the y α−1 term would need to be absorbed by the a(y, σ 2 )
in the definition of the EDF. Thus we can try taking σ 2 as a function of α alone. What
function must this be? Imagine that α = λ, so σ 2 V (µ) = σ 2 × constant ∝ 1/λ = 1/α.
Thus we must have σ 2 = α−1 (or some constant multiple of it). In the new parametrisation
where α = σ −2 and λ = (µσ 2 )−1
    f(y; µ, σ²) = y^{σ^{−2} − 1} exp{−y/(σ²µ)} / {(σ²µ)^{σ^{−2}} Γ(σ^{−2})}
                = [ y^{σ^{−2} − 1} / {(σ²)^{σ^{−2}} Γ(σ^{−2})} ] exp[ (1/σ²){−y/µ − log(µ)} ]
                = [ y^{σ^{−2} − 1} / {(σ²)^{σ^{−2}} Γ(σ^{−2})} ] exp[ (1/σ²){yθ − K(θ)} ],
where θ(µ) = −µ−1 and K(θ) = log(−θ−1 ). We found the variance function to be
V(µ) = µ², and both M and Φ are (0, ∞).
2.4 Generalised linear models
Having finally defined the concept of an exponential dispersion family, we can now define what
a generalised linear model is. A generalised linear model for observations (Y1 , x1 ), . . . , (Yn , xn )
is defined by the following properties.
1. Y1 , . . . , Yn are independent, each Yi having model function in the same exponential dis-
persion family of the form
    f(y; θ_i, σ_i²) = a(σ_i², y) exp[ (1/σ_i²){θ_i y − K(θ_i)} ],   y ∈ Y, θ_i ∈ Θ, σ_i² ∈ Φ ⊆ (0, ∞),
with σi2 = σ 2 ai where a1 , . . . , an are known and ai > 0, though σ 2 may be unknown. Note
that the functions a and K must be fixed for all i.
2. The mean µi of the ith observation and the ith component of the linear predictor ηi := xTi β
are linked by the equation
g(µi ) = ηi , i = 1, . . . , n,
where g is a strictly increasing, twice differentiable function called the link function.
Using the canonical link function can simplify some calculations. With g the canonical link
function, θ(µi ) = xTi β, so we have log-likelihood
    ℓ(β, σ²; y_1, . . . , y_n) = Σ_{i=1}^n (1/(σ²a_i)){y_i x_i^T β − K(x_i^T β)} + Σ_{i=1}^n log{a(σ²a_i, y_i)}.
One feature of the log-likelihood above that makes it particularly easy to maximise over β is that the Hessian is negative semi-definite, so the log-likelihood is a concave function of β (for any fixed σ²). We have

    ∂ℓ(β, σ²)/∂β = Σ_{i=1}^n {x_i/(σ²a_i)} {y_i − K'(x_i^T β)},
    ∂²ℓ(β, σ²)/∂β∂β^T = −Σ_{i=1}^n {x_i x_i^T/(σ²a_i)} K''(x_i^T β),

and K'' > 0. This in particular means that as a function of β, the log-likelihood cannot have
multiple local maxima. Indeed, we know that if a maximiser of the log-likelihood, β̂ exists, it
must satisfy
    ∂ℓ(β, σ²)/∂β |_{β=β̂} = 0.   (2.4.1)
However due to concavity of the log-likelihood, the converse is also true: if β̂ satisfies (2.4.1)
then it must maximise the log-likelihood. Indeed, for any β_0, consider the function

    f(t) := ℓ(β̂ + t(β_0 − β̂), σ²),   t ∈ [0, 1].

Note that f(0) = ℓ(β̂, σ²) and f(1) = ℓ(β_0, σ²). A Taylor expansion of f about 0 gives us
    f(1) = f(0) + f'(0) + ½ f''(t)
for some t ∈ [0, 1] (note this is a Taylor expansion with a “mean-value” form of the remainder).
Noting that f 0 (0) = 0 by assumption,
    f(1) − f(0) = ½ (β_0 − β̂)^T [ ∂²ℓ(β, σ²)/∂β∂β^T |_{β=β̃} ] (β_0 − β̂) ≤ 0,

where β̃ := β̂ + t(β_0 − β̂), so ℓ(β_0, σ²) ≤ ℓ(β̂, σ²) as required.
2.5 Inference
Having generalised the normal linear model, how do we compute maximum likelihood estimators
and how can we perform inference (i.e. construct confidence sets, perform hypothesis tests)?
These tasks were fairly simple in the normal linear model setting since the maximum likelihood
estimator had an explicit form. In our more general setting, this will not (necessarily) be the
case. Despite this, we can still perform inference and compute m.l.e.’s, but approximations
must be involved in both of these tasks. We first turn to the problem of inference.
2.5.1 The score function
Consider data (Y1 , xT1 ), . . . , (Yn , xTn ) with the Yi independent given the xi , and suppose Y =
(Y1 , . . . , Yn )T has density in
    { f(y; θ), y ∈ Y^n : θ ∈ Θ ⊆ R^d } = { Π_{i=1}^n f_{x_i}(y_i; θ), y_i ∈ Y : θ ∈ Θ ⊆ R^d }.
We will review some theory associated with maximum likelihood estimators in this setting.
Here we simply aim to sketch out the main results; for a rigorous treatment see your Principles
of Statistics notes (or borrow someone’s). In particular, we do not state all the conditions
required for the results to be true (broadly known as “regularity conditions”), but they will all
be satisfied for the generalised linear model setting to which we wish to apply the results.
Let θ̂ be the maximum likelihood estimator of θ (assuming it exists and is unique). If
we cannot write down the explicit form of θ̂ as a function of the data, in order to study its
properties, we must argue from what we do know about the m.l.e.—the fact that it maximises
the likelihood, or equivalently the log-likelihood. This means θ̂ satisfies
    ∂ℓ(θ; Y)/∂θ |_{θ=θ̂} = 0,
where
    ℓ(θ; Y) = log f(Y; θ) = Σ_{i=1}^n log f_{x_i}(Y_i; θ).
We call the vector of partial derivatives of the log-likelihood the score function, U(θ; Y):

    U_r(θ; Y) := ∂ℓ(θ; Y)/∂θ_r.
Two key features of the score function are that provided the order of differentiation w.r.t. a
component of θ and integration over the sample space Y n may be interchanged,
1. Eθ {U (θ; Y )} = 0,
2. Var_θ{U(θ; Y)} = −E_θ{ ∂²ℓ(θ; Y)/∂θ∂θ^T }.
To see the first property, note that for r = 1, . . . , d,
    E_θ{U_r(θ; Y)} = ∫_{Y^n} [∂ log{f(y; θ)}/∂θ_r] f(y; θ) dy
                   = ∫_{Y^n} ∂f(y; θ)/∂θ_r dy
                   = ∂/∂θ_r ∫_{Y^n} f(y; θ) dy = ∂(1)/∂θ_r = 0.
The variance of the score, i(θ) := Var_θ{U(θ; Y)}, is known as the Fisher information. It can be thought of as a measure of how hard it is
estimate θ when it is the true parameter value. A related quantity is the observed information
matrix, j(θ) defined by
    j(θ) := −∂²ℓ(θ; Y)/∂θ∂θ^T.
Note that i(θ) = Eθ (j(θ)).
Example. Consider our friend the normal linear model: Y = Xβ + ε, ε ∼ N_n(0, σ²I). Then, in block form,

    i(β, σ²) = ( σ^{−2} X^T X   0 ;   0   n σ^{−4}/2 ).
Note that writing i−1 (β) for the top left p × p sub-matrix of i−1 (β, σ 2 ) (the matrix inverse of
i(β, σ 2 )), we have that Var(β̂) = i−1 (β).
In fact we have the following result.
Theorem 5 (Cramér–Rao lower bound). Let θ̃ be an unbiased estimator of θ. Then under
regularity conditions,
Varθ (θ̃) − i−1 (θ)
is positive semi-definite.
Proof. We only sketch the proof when d = 1. By the Cauchy–Schwarz inequality,

    Cov_θ{θ̃, U(θ)}² ≤ Var_θ(θ̃) Var_θ{U(θ)} = Var_θ(θ̃) i(θ).

As E{U(θ)} = 0 and θ̃ is unbiased,

    Cov_θ{θ̃, U(θ)} = ∫ θ̃(y) {∂f(y; θ)/∂θ} dy = ∂/∂θ E_θ(θ̃) = 1,

so Var_θ(θ̃) ≥ i^{−1}(θ).
*Convergence of random variables*. We say a sequence of random variables Z1 , Z2 , . . .
with corresponding distribution functions F1 , F2 , . . . converges in distribution to a random vari-
able Z with distribution function F, and write Z_n →_d Z, if F_n(x) → F(x) at all x where F is
continuous.
A sequence of random vectors Zn ∈ Rk converges in distribution to a continuous random
vector Z ∈ Rk when
P(Zn ∈ B) → P(Z ∈ B)
for all (Borel) sets B for which δB := cl(B) \ int(B) has P(Z ∈ δB) = 0.
For example, the multidimensional central limit theorem (CLT) states that if Z1 , Z2 , . . . are
i.i.d. random vectors in Rk with positive definite variance Σ and mean µ ∈ Rk , then writing
Z̄^{(n)} for (1/n) Σ_{i=1}^n Z_i, we have

    √n (Z̄^{(n)} − µ) →_d N_k(0, Σ).
Theorem 8. Assume that the Fisher information matrix when there are n observations, i(n) (θ)
(where we have made the dependence on n explicit) satisfies i(n) (θ)/n → I(θ) for some positive
definite matrix I. Then denoting the maximum likelihood estimator of θ when there are n
observations by θ̂(n) , under regularity conditions we have
    √n (θ̂^{(n)} − θ) →_d N_d(0, I^{−1}(θ)).
A short-hand and informal version of writing this (which is fine for this course) is that

    θ̂ ∼ AN_d(θ, i^{−1}(θ)),

to be read "θ̂ is asymptotically normal with mean θ and variance i^{−1}(θ)".
*Sketch of proof*. Here is a sketch of the proof when d = 1 and our data are i.i.d. rather
than simply independent.
A Taylor expansion of the score function about the true parameter value θ gives

    0 = U(θ̂) = U(θ) − j(θ)(θ̂ − θ) + Rem_n(θ).

Since E(U(θ)) = 0 and Var(U(θ)) = i(θ) = n i_1(θ), where i_1(θ) is the Fisher information of the first observation, by the CLT we have

    U(θ)/√n →_d N(0, i_1(θ)).

Provided also that Rem_n(θ)/√n →_p 0, we have

    (j(θ)/n) √n(θ̂ − θ) = U(θ)/√n + Rem_n(θ)/√n →_d N(0, i_1(θ)).

But by the WLLN, j(θ)/n →_p i_1(θ) as n → ∞, so by Slutsky's lemma (no. 3),

    √n(θ̂ − θ) →_d N(0, i_1^{−1}(θ)).
Relevance of the result. How are we to use this result? The first issue is that as the true
parameter θ is unknown, so is i−1 (θ). However, provided that i−1 (θ) is a continuous function
of θ, we may estimate this well with i^{−1}(θ̂), and we can show that, for example,

    (θ̂_j − θ_j) / √{(i^{−1}(θ̂))_{jj}} →_d N(0, 1),

justifying the asymptotic (1 − α)-level confidence interval

    ( θ̂_j − z_{α/2} √{(i^{−1}(θ̂))_{jj}},  θ̂_j + z_{α/2} √{(i^{−1}(θ̂))_{jj}} ),

where z_α is the upper α-point of N(0, 1). The coverage of this confidence interval tends to 1 − α as n → ∞. Similarly, an asymptotic (1 − α)-level confidence set for θ is given by

    { θ' : (θ̂ − θ')^T i(θ̂) (θ̂ − θ') ≤ χ²_d(α) }.
Another issue is that we never have an infinite amount of data. What does the asymptotic
result have to say when we have maybe 100 observations? From a purely logical point of view,
it says absolutely nothing. You will have had it drilled into you long ago in Analysis I that
even the first trillion terms of a sequence have nothing to do with its limiting behaviour. On
the other hand, we can be more optimistic and hope that n = 100 is large enough for the finite
sample distribution of θ̂ to be close to the limiting distribution. Performing simulations can help
justify this optimism and give us values of n for which we can expect the limiting arguments to
apply.
Wilks’ theorem. The result on asymptotic normality of maximum likelihood estimators
allows us to construct confidence intervals for individual components of θ and hence perform
hypothesis tests of the form H0 : θj = 0, H1 : θj 6= 0. Now suppose we wish to test
H0 : θ ∈ Θ0 against
H1 : θ ∉ Θ_0
where Θ0 ⊂ Θ, the full parameter space, and Θ0 is of lower dimension than Θ. The precise
meaning of dimension when Θ0 and Θ are not affine spaces (i.e. a translation of a subspace) but
rather general manifolds would require us to go into the realm of differential geometry, which we
won’t do here. Perhaps the most important case of interest is when θ = (θ0T , θ1T )T and θ0 ∈ Rd0
with Θ = Rd , and we are testing
H0 : θ0 = 0 against
H1 : θ0 6= 0.
Wilks' theorem gives the asymptotic distribution of the likelihood ratio statistic

    w_LR(H0) = 2 log{ sup_{θ'∈Θ} L(θ') / sup_{θ'∈Θ_0} L(θ') } = 2{ sup_{θ'∈Θ} ℓ(θ') − sup_{θ'∈Θ_0} ℓ(θ') }.
Theorem 9 (Wilks' theorem). Suppose that H0 is true. Then, under regularity conditions,

    w_LR(H0) →_d χ²_k,

where k is the difference between the dimensions of Θ and Θ_0 (so k = d_0 in the setting above).
*Sketch of proof* We only sketch the proof when the null hypothesis is simple so Θ0 = {θ0 },
and when the data Y1 , Y2 , . . . are i.i.d. rather than just independent. A Taylor expansion of `(θ0 )
centred at the (unrestricted) maximum likelihood estimate θ̂ gives
    ℓ(θ_0) = ℓ(θ̂) + (θ̂ − θ_0)^T U(θ̂) − ½ (θ̂ − θ_0)^T j(θ̂)(θ̂ − θ_0) + Rem_n(θ̂).
Using U(θ̂) = 0 and provided that Rem_n(θ̂) →_p 0,

    2{ℓ(θ̂) − ℓ(θ_0)} = (θ̂ − θ_0)^T j(θ̂)(θ̂ − θ_0) − 2 Rem_n(θ̂) →_d χ²_k

under H0 by Slutsky's theorem, provided (θ̂ − θ_0)^T j(θ̂)(θ̂ − θ_0) →_d χ²_k.
Note that the likelihood ratio test in conjunction with Wilks’ theorem can also be used to
test whether individual components of θ are 0. Unlike the analogous situation in the normal
linear model where the F -test for an individual variable is equivalent to the t-test, here tests
based on asymptotic normality of θ̂ and the likelihood ratio test will in general be different; usually the likelihood ratio test is to be preferred, though it may require more computation to calculate the test statistic.
The asymptotic results we have studied then show that
β̂ ∼ ANp (β, i−1 (β)).
This (along with continuity of i−1 (β)) justifies the following asymptotic (1 − α)-level confidence
set for βj
    ( β̂_j − z_{α/2} √{i^{−1}(β̂)}_{jj},  β̂_j + z_{α/2} √{i^{−1}(β̂)}_{jj} ),
where zα is the upper α-point of N (0, 1). To test H0 : βj = 0 against H1 : βj 6= 0, we can reject
H0 if the confidence interval above excludes 0 i.e. if
    |β̂_j| / √{i^{−1}(β̂)}_{jj} > z_{α/2}.
Now suppose β is partitioned as β = (β_0^T, β_1^T)^T where β_0 ∈ R^{p_0} and we wish to test H0 : β_1 = 0 against H1 : β_1 ≠ 0. Write β̌_0 for the m.l.e. of β_0 under the null model, and assume for the moment that the dispersion parameter σ² is known (as is the case for the Poisson and binomial models, the two most important generalised linear models). Write ℓ̃(µ, σ²) for ℓ(β, σ²) regarded as a function of the mean vector µ, so that

    ℓ̃(µ, σ²) = Σ_{i=1}^n (1/(σ²a_i))[ y_i θ(µ_i) − K{θ(µ_i)} ] + Σ_{i=1}^n log{a(σ²a_i, y_i)}
2.6 Computation
We have seen how despite the maximum likelihood estimator β̂ of β in a generalised linear model
not having an explicit form (except in special cases such as the normal linear model), we can
show that asymptotically the m.l.e. has rather attractive properties and we can still perform
inference that is asymptotically valid. How are we to compute β̂ when all we know about it is
the fact that it satisfies
∂`(β, σ 2 )
0= =: U (β̂)? (2.6.1)
∂β β=β̂
Here, with a slight abuse of notation, we have written U (β) for the first p components of
U (β, σ 2 ); similarly let us write j(β) and i(β) for the top left p × p submatrix of j(β, σ 2 ) and
i(β, σ 2 ) respectively.
If U were linear in β, we should be able to solve the system of linear equations in (2.6.1) to
find β̂. Though in general U won’t be a linear function, given that it is differentiable (recall that
the link function g is required to be twice differentiable), an application of Taylor’s theorem
shows that it is at least locally linear:

    U(β) ≈ U(β_0) − j(β_0)(β − β_0)

for β close to β_0. If we managed to find a β_0 close to β̂, the fact that U(β̂) = 0 suggests approximating β̂ by the solution of

    0 = U(β_0) − j(β_0)(β − β_0)

in β, i.e.

    β_0 + j^{−1}(β_0) U(β_0),
where we have assumed that j(β0 ) is invertible. This motivates the following iterative algorithm
(the Newton–Raphson algorithm): starting with an initial guess at β̂, β̂0 , at the mth iteration
we update
β̂m = β̂m−1 + j −1 (β̂m−1 )U (β̂m−1 ). (2.6.2)
A potential issue with this algorithm is that j(β̂m−1 ) may be singular or close to singular and
thus make the algorithm unstable. The method of Fisher scoring replaces j(β̂m−1 ) with i(β̂m−1 )
which is always positive definite (subject to regularity conditions) and generally better behaved.
Fisher scoring may not necessarily converge to β̂ but almost always does. We terminate the
algorithm when successive iterations produce negligible difference.
Let us examine this procedure in more detail. It can be shown (see example sheet) that the
score function and Fisher information matrix have entries
    U_j(β) = Σ_{i=1}^n (y_i − µ_i) X_{ij} / {σ_i² V(µ_i) g'(µ_i)},   j = 1, . . . , p,
    i_{jk}(β) = Σ_{i=1}^n X_{ij} X_{ik} / [σ_i² V(µ_i) {g'(µ_i)}²],   j, k = 1, . . . , p.
Choosing the canonical link g(µ) = θ(µ) simplifies Uj (β) and ijk (β) since g 0 (µ) = 1/V (µ). Let
W (µ) be the n × n diagonal matrix with ith diagonal entry
    W_{ii}(µ) := 1 / [ a_i V(µ_i) {g'(µ_i)}² ].
Further let Ỹ(µ) ∈ R^n be the vector with i-th component Ỹ_i(µ) := g'(µ_i)(Y_i − µ_i). Then, writing W = W(µ) and Ỹ = Ỹ(µ) with µ = µ(β),

    U(β) = σ^{−2} X^T W Ỹ,    i(β) = σ^{−2} X^T W X.
Let us set W_m := W(µ̂_m) and Ỹ_m := Ỹ(µ̂_m), where µ̂_m is the fitted mean vector after the m-th iteration, i.e. g(µ̂_{m,i}) = x_i^T β̂_m =: η̂_{m,i}. (Note here the subscript m is not indexing different components of a single vector Ỹ but different vectors Ỹ_m.) Then we see that the Fisher scoring update, i.e. (2.6.2) with j replaced by i, may be written as

    β̂_m = β̂_{m−1} + i^{−1}(β̂_{m−1}) U(β̂_{m−1}) = (X^T W_{m−1} X)^{−1} X^T W_{m−1} Z_{m−1},   where Z_m := Ỹ_m + η̂_m.
See example sheet 1 for the final equality. Thus the sequence of approximations to β̂ are given
by iterative weighted least squares (IWLS) of the adjusted dependent variable Zm−1,i on X with
weights given by the diagonal entries of Wm−1 .
With this formulation, we can start with an initial guess of µ̂ rather than one of β̂. An obvious choice for this initial guess µ̂_0 is the response y, although a small adjustment such as µ̂_{0,i} = max{y_i, δ} for some small δ > 0 may be necessary if g(µ) = log(µ), for example, to avoid problems when y_i = 0.
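As an illustration (not from the notes), here is a minimal IWLS sketch for logistic regression with n_i = 1, where a_i = 1, σ² = 1, V(µ) = µ(1 − µ), and the canonical logit link gives W_ii = µ_i(1 − µ_i) and Ỹ_i = (Y_i − µ_i)/{µ_i(1 − µ_i)}:

```python
# Iterative weighted least squares (Fisher scoring) for a logistic regression.
import numpy as np

def iwls_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                                # linear predictor
        mu = 1 / (1 + np.exp(-eta))                   # fitted means
        W = mu * (1 - mu)                             # IWLS weights
        z = eta + (y - mu) / W                        # adjusted dependent variable
        # weighted least squares of z on X with weights W
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(10)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
print(iwls_logistic(X, y))                            # close to true_beta
```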
Chapter 3
2. g(µ) = Φ−1 (µ) where Φ is the c.d.f. of the standard normal distribution (so Φ−1 is the
quantile function of the standard normal) gives the probit link.
3. g(µ) = log{µ/(1 − µ)} is the logit link. This is the canonical link function for the GLM.
The probit link gives an interesting latent variable interpretation of the model. Consider the
case where ni = 1. Imagine that there exists a Y ∗ ∈ Rn such that
Y ∗ = Xβ ∗ + ε
where ε ∼ Nn (0, σ 2 I). Suppose we do not observe Y ∗ but instead, only see Y ∈ {0, 1}n with
ith component given by
Yi = 1{Yi∗ >0} .
Then we see that
µi := P(Yi = 1) = P(xTi β* + εi > 0) = Φ(xTi β*/σ),
so Y follows the probit model with coefficient vector β = β*/σ.
The models generated by the other two link functions also have latent variable interpretations.
Of the three link functions, by far the most popular is the logit link. This is partly because
it is the canonical link, and so simplifies some calculations, but perhaps more importantly, the
coefficients from a model with logit link (a logistic regression model) are easy to interpret. The
value eβj gives the multiplicative change in the odds µi /(1 − µi ) for a unit increase in the value
of the j th variable, keeping the values of all other variables fixed. To see this note that
µi/(1 − µi) = exp(Σ_{j=1}^p Xij βj) = Π_{j=1}^p (e^{βj})^{Xij}.
Figure 3.1: The graphs of three commonly used link functions for binomial regression: g(µ) plotted against µ for the logistic, probit and complementary log-log links.
3.1.2 *A classification view of logistic regression*
In the case where ni = 1 for all i, logistic regression can be thought of as a classification
procedure. The response value of each observation is then either 0 or 1, and so divides the
observations into two classes. Having fit a logistic regression to some data which we shall call
the training data, we can then predict responses (class labels) for new data for which we only
have the covariate values. We can do this by applying the function Ĉτ below to each new
observation:
Ĉτ (x) := 1{π̂(x)≥τ } ,
where
π̂(x) := exp(xᵀβ̂)/{1 + exp(xᵀβ̂)}
and β̂ is the m.l.e. of β based on the training data. The value τ is a threshold and should be set
according to how bad predicting a class label of 1 when it is in fact 0 is, compared to predicting
a class label of 0 when it is in fact 1.
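In code, the classifier might look as follows (a sketch: `beta_hat` stands for the coefficient vector fitted on the training data and `tau` for the chosen threshold; both names are illustrative).

    import numpy as np

    def pi_hat(x, beta_hat):
        """Estimated probability pi_hat(x) that the label is 1 under the logistic model."""
        eta = x @ beta_hat
        return np.exp(eta) / (1.0 + np.exp(eta))

    def C_tau(x, beta_hat, tau):
        """The classifier C_tau: predict 1 if and only if pi_hat(x) >= tau."""
        return (pi_hat(x, beta_hat) >= tau).astype(int)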
If in addition to our training data, we have another set of labelled data, we can plot the
proportion of class 1 observations correctly classified against the proportion of class 0 observa-
tions incorrectly classified using Ĉτ , for different values of τ . This set of data is known as a
test set. As τ varies between 0 and 1, the points plotted trace out what is known as a Receiver
Operating Characteristic (ROC) curve. This gives a visual representation of how good a clas-
sifier our model is, and can serve as a way of comparing different classifiers. A classifier with
ROC curve always above that of another classifier is certainly to be preferred. However, when
ROC curves of classifiers cross, no classifier uniformly dominates the other. In these cases, a
common measure of performance is the area under the ROC curve (AUC). If in a particular
application, there is a certain probability of incorrectly classifying a class 0 observation that can
be tolerated (say 5%), and the chance of incorrectly classifying a class 1 observation is to be
minimised subject to this error tolerance, then ROC curves should be compared at the relevant
point.
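A rough sketch of how the ROC curve and AUC might be computed on a test set, given estimated probabilities `p` for the test observations and their true labels `y_test` (both hypothetical arrays). The true positive rate is the proportion of class 1 observations correctly classified and the false positive rate is the proportion of class 0 observations incorrectly classified.

    import numpy as np

    def roc_points(p, y_test, taus=np.linspace(0, 1, 201)):
        """False and true positive rates of C_tau as tau varies over [0, 1]."""
        p, y_test = np.asarray(p), np.asarray(y_test)
        fpr, tpr = [], []
        for tau in taus:
            pred = (p >= tau).astype(int)
            tpr.append(np.mean(pred[y_test == 1]))   # class 1 correctly classified
            fpr.append(np.mean(pred[y_test == 0]))   # class 0 incorrectly classified
        return np.array(fpr), np.array(tpr)

    def auc(fpr, tpr):
        """Area under the ROC curve by the trapezium rule."""
        order = np.argsort(fpr)
        f, t = fpr[order], tpr[order]
        return np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2)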
Of course all these comparisons are contingent on the particular test set used. Given a
collection of data, it is advisable to randomly split it into training and test sets several times
and average the ROC curves produced by each of the splits. Suppose the training sets are all
of size ntr , say. The average ROC curve is then a measure of the average performance of the
classification procedure when it is fed ntr observations where, thinking of the covariates now
as (realisations of) random variables, this average is over the joint distribution of response and
covariates.
with
α := log(π1/π0) − (1/2)(µ1 + µ0)ᵀΣ⁻¹(µ1 − µ0)
β := Σ⁻¹(µ1 − µ0).
Thus the log odds of the posterior class probabilities is precisely of the form needed for the
logistic regression model to be correct.
Typically if it is known that the data generating process is (3.1.1), then a classifier is
formed by replacing the population parameters π1 , µ0 , µ1 and Σ in (3.1.2) with estimates,
and then classifying to the class with the largest posterior probability. This gives Fisher’s
linear discriminant analysis (LDA), which you will have already met if you took Principles of
Statistics.
The logistic regression model is more general in that it makes fewer assumptions. It does
not specify the distribution of the covariates and instead treats them as fixed (i.e. it conditions
on them). When the mixture of Gaussians model in (3.1.1) is correct, one can expect LDA to
perform better. However, when (3.1.1) is not satisfied, logistic regression may be preferred.
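For concreteness, plug-in estimates of α and β in (3.1.2) could be computed as in the sketch below, where `X0` and `X1` are assumed to hold the class 0 and class 1 training observations as rows (illustrative names, not part of the notes).

    import numpy as np

    def lda_parameters(X0, X1):
        """Plug-in estimates of alpha and beta in (3.1.2) using sample proportions,
        sample means and the pooled sample covariance matrix."""
        n0, n1 = len(X0), len(X1)
        pi0, pi1 = n0 / (n0 + n1), n1 / (n0 + n1)
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        S0 = (X0 - mu0).T @ (X0 - mu0)
        S1 = (X1 - mu1).T @ (X1 - mu1)
        Sigma = (S0 + S1) / (n0 + n1 - 2)            # pooled covariance estimate
        beta = np.linalg.solve(Sigma, mu1 - mu0)      # Sigma^{-1} (mu1 - mu0)
        alpha = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ beta
        return alpha, beta

Classifying to class 1 whenever α̂ + xᵀβ̂ > 0 then reproduces the rule of assigning to the class with the larger estimated posterior probability.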
The deviance decomposes as D(y; µ̂) = Σ_{i=1}^n D(yi; µ̂i), where D(yi; µ̂i) is the ith summand in the definition of D(y; µ̂), so
D(yi; µ̂i) = (2/ai)[yi{θ(yi) − θ(µ̂i)} − {K(θ(yi)) − K(θ(µ̂i))}].
In binomial regression (and also Poisson regression), one can sometimes test a particular
model against the saturated model, i.e. test
H0 : g(µi) = xTi β, i = 1, . . . , n, for some β, against the saturated model in which the µi are unrestricted.
In this case,
wLR(H0) = {D(y; µ̂) − D(y; y)}/σ² = D(y; µ̂)/σ²,
but standard asymptotic theory no longer ensures that this converges in distribution to a χ²n−p distribution. Nevertheless, other asymptotic arguments can sometimes be used to justify referring the likelihood ratio statistic to χ²n−p, for instance when
• Yi ∼ Bin(ni, µi)/ni with ni large, or
• Yi ∼ Pois(µi) with µi large.
This is because in these cases the individual Yi are approximately normally distributed. Such asymptotics are known as small dispersion asymptotics.
3.2 Poisson regression
We have seen how binomial regression can be appropriate when the responses are proportions
(including the important case when the proportions are in {0, 1} i.e. the classification scenario).
Now we consider count data, e.g. the number of texts you receive each day, or the number of terrorist attacks that occur in a country each week. Another example where count data arise is the following: imagine conducting an (online) survey where perhaps you ask people to enter their college and their voting intentions. The survey may be live for a fixed amount of time and then you can collect together the data into a two-way contingency table:
College      Labour      Conservative      Liberal Democrats      Other
Trinity
⋮
When the responses are counts, it may be sensible to model them as realisations of Poisson
random variables. A word of caution though. A Poisson regression model entails a particular
relationship between the mean and variance of the responses: if Yi ∼ Pois(µi ), then Var(Yi ) = µi .
In many situations we may find this assumption is violated. Nevertheless, the Poisson regression
model can often be a reasonable approximation.
If the probability of occurrence of an event in a given time interval is proportional to the length of that time interval and independent of the occurrence of other events, then the number of events in any specified time interval will be Poisson distributed. Wikipedia lists a number of
situations where Poisson data arise naturally:
• ...
The Poisson regression model assumes that our data (Y1 , x1 ), . . . , (Yn , xn ) ∈ {0, 1, . . .} × Rp
have Y1, . . . , Yn independent with Yi ∼ Pois(µi), µi > 0. An example sheet question asks you to verify that {Pois(µ) : µ ∈ (0, ∞)} is an exponential dispersion family with dispersion parameter σ² = 1. In line with the GLM framework, we assume the µi are related to the
covariates through g(µi ) = xTi β for a link function g.
By far the most commonly used link function is the log link—this also happens to be the
canonical link. In fact the Poisson regression model is often called the log-linear model. We
only consider the log link here. Two reasons for the popularity of the log link are:
• {log(µ) : µ ∈ (0, ∞)} = R. The parameter space for β is then simply Rp and no restrictions
are needed.
• Interpretability: if
µi = exp(Σ_{j=1}^p Xij βj) = Π_{j=1}^p (e^{βj})^{Xij},
then we see that eβj is the multiplicative change in the expected value of the response for
a unit increase in the j th variable.
In the next practical class we’ll look at data from the English Premier League and attempt
to model the home and away scores Yijh and Yija when team i is home to team j as independent
Poisson random variables with respective means
Here ∆ represents the home advantage (we expect it to be greater than 0) and αi and βi the
offensive and defensive strengths of team i.
so
ℓ(β) = −Σ_{i=1}^n exp(xTi β) + Σ_{i=1}^n yi xTi β.
Let us consider the case where we have an intercept term. We can either say that the first
column of the design matrix X is a column of 1’s, or we can include it explicitly in the model.
In the latter case we take
log(µi ) = α + xTi β,
so the log-likelihood is
ℓ(α, β) = −Σ_{i=1}^n exp(α + xTi β) + Σ_{i=1}^n yi(α + xTi β).
so
D(y; µ̂) = 2Σ_{i=1}^n yi log(yi/µ̂i) − 2Σ_{i=1}^n (yi − µ̂i) = 2Σ_{i=1}^n yi log(yi/µ̂i),
when an intercept term is included (the score equation for the intercept gives Σ_{i=1}^n µ̂i = Σ_{i=1}^n yi, so the second sum vanishes).
Write yi = µ̂i + δi, so we have that Σ_{i=1}^n δi = 0. Then, by a Taylor expansion, assuming that δi/µ̂i is small for each i,
D(y; µ̂) = 2Σ_{i=1}^n (µ̂i + δi) log(1 + δi/µ̂i)
        ≈ 2Σ_{i=1}^n (δi + δi²/µ̂i − δi²/(2µ̂i))
        = Σ_{i=1}^n (yi − µ̂i)²/µ̂i,
which is Pearson's χ² statistic.
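A quick numerical check of this approximation, with made-up counts and hypothetical fitted values chosen so that Σi µ̂i = Σi yi (as holds when an intercept is included):

    import numpy as np

    y = np.array([12.0, 9.0, 15.0, 7.0, 11.0])
    mu_hat = np.array([11.2, 9.6, 14.3, 7.7, 11.2])     # hypothetical fitted values; sums agree

    deviance = 2 * np.sum(y * np.log(y / mu_hat))        # D(y; mu_hat) when sum(y - mu_hat) = 0
    pearson = np.sum((y - mu_hat) ** 2 / mu_hat)         # Pearson's chi-squared statistic
    print(deviance, pearson)                             # the two statistics are nearly equal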
Sometimes the counts Yi are recorded over known exposures ti (for example, lengths of time or numbers at risk), and it is the rate per unit exposure that we wish to relate to the covariates. Modelling this rate as exp(xTi β) gives
µi = ti exp(xTi β),
i.e. log(µi) = log(ti) + xTi β. This is the usual Poisson regression but with an offset of log(ti). Since the log(ti) are known constants, they can be readily incorporated into the estimation procedure.
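For instance, using the GLM interface in the statsmodels package (the data below are simulated purely for illustration; the `offset` argument adds log(ti) to the linear predictor with its coefficient fixed at one):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    t = rng.uniform(1, 5, size=200)                      # known exposures
    x = rng.normal(size=200)
    y = rng.poisson(t * np.exp(0.5 + 0.3 * x))           # mu_i = t_i exp(0.5 + 0.3 x_i)

    X = sm.add_constant(x)                               # intercept plus covariate
    fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t)).fit()
    print(fit.params)                                    # roughly (0.5, 0.3)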
The counts in a two-way or three-way contingency table may be written as
{Yij : i = 1, . . . , I, j = 1, . . . , J}, or
{Yijk : i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , K}
respectively.
Consider the example of the online survey that aimed to cross-classify individuals according
to their college and voting intentions. This data could be presented as a two-way contingency
table. If we also recorded people’s gender, for example, we would have a three-way contingency
table. A sensible model for these data is that the numbers of individuals falling into the ij th cells, Yij, are independent Pois(µij).
Suppose we happened to end up with n = 400 forms filled. We could also imagine a situation
where rather than accepting all the survey responses that happened to arrive in a given time,
we fix the number of submissions to consider in advance, so we keep the survey live until we
have 400 forms filled. In this case a multinomial model may be more appropriate.
Recall that a random vector Z = (Z1 , . . . , Zm ) is said to have a multinomial distribution
with parameters n and p1, . . . , pm, written Z ∼ Multi(n; p1, . . . , pm), if Σ_{i=1}^m pi = 1 and
P(Z1 = z1, . . . , Zm = zm) = {n!/(z1! · · · zm!)} p1^{z1} · · · pm^{zm},
for zi ∈ {0, . . . , n} with z1 + · · · + zm = n.
In the second data collection scenario described above, only the overall total n = 400 was
fixed, so we might model
(Y11, . . . , YIJ) ∼ Multi(n; p11, . . . , pIJ),
where
pij = µij / Σ_{i=1}^I Σ_{j=1}^J µij.
At first sight, this second model might seem to fall outside the GLM framework as the
responses Yij are not independent (adding up to n).
However, the following result suggests an alternative approach. Recall the fact that if Z1 , Z2
are independent with Zi ∼ Pois(µi ), then Z1 + Z2 ∼ Pois(µ1 + µ2 ). Obviously induction gives
a similar result for any finite collection of independent Poisson random variables.
Suppose then that Z1, . . . , Zm are independent with Zi ∼ Pois(µi), and let S := Σ_i Zi ∼ Pois(Σ_j µj). It follows that provided Σ_i zi = n,
Pµ1,...,µm(Z1 = z1, . . . , Zm = zm | S = n) = exp(−Σ_j µj) Π_i (µi^{zi}/zi!) / [exp(−Σ_j µj) (Σ_j µj)^n / n!]
                                           = {n!/(z1! . . . zm!)} p1^{z1} . . . pm^{zm},
where pi = µi / Σ_j µj for i = 1, . . . , m.
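This identity is easy to check numerically; here is a throwaway sketch using scipy, with arbitrary means and counts:

    import numpy as np
    from scipy.stats import poisson, multinomial

    mu = np.array([2.0, 5.0, 3.0])                 # arbitrary Poisson means
    z = np.array([1, 4, 2])                        # counts with total n
    n = z.sum()

    joint = np.prod(poisson.pmf(z, mu))            # P(Z_1 = z_1, ..., Z_m = z_m)
    cond = joint / poisson.pmf(n, mu.sum())        # divide by P(S = n)

    p = mu / mu.sum()
    print(cond, multinomial.pmf(z, n=n, p=p))      # the two probabilities agree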
Multinomial likelihood
First consider the multinomial likelihood obtained if we suppose that
(Y11, . . . , YIJ) ∼ Multi(n; p11, . . . , pIJ),
where
pij = µij / Σ_{i=1}^I Σ_{j=1}^J µij,
and
log(µij) = α + xTij β.
Thus
pij = exp(xTij β) / Σ_{i=1}^I Σ_{j=1}^J exp(xTij β).
Here the explanatory variables xij will depend on the particular model being fit.
[Consider the “colleges and voting intentions” example. Each of the 400 submitted survey
forms can be thought of as realisations of i.i.d. random variables Zl , l = 1, . . . , 400, taking values
in the collection of categories {Trinity, . . .} × {Labour, Conservative, . . .}. If we assume that the
two components of the Zl are independent, then we may write
pij = P(Zl1 = collegei , Zl2 = partyj ) = P(Zl1 = collegei )P(Zl2 = partyj ) = qi rj , (3.2.1)
for some qi, rj ≥ 0, i = 1, . . . , I, j = 1, . . . , J, with Σ_{i=1}^I qi = Σ_{j=1}^J rj = 1. To parametrise this in terms of β, we can take the xij to encode the row and column of cell (i, j), so that log(µij) is additive in i and j, as in the independence example considered below.]
The multinomial log-likelihood is then, up to an additive constant,
ℓm(β|n) = Σ_{i,j} yij log(pij) = Σ_{i,j} yij xTij β − n log(Σ_{i,j} exp(xTij β)),
where we have emphasised the fact that the likelihood is based on the conditional distribution of the counts yij given the total n.
Poisson likelihood
Now consider the Poisson model, but where Σ_{i,j} yij = n. With log(µij) = α + xTij β, we have log-likelihood
ℓP(α, β) = −Σ_{i,j} µij(α, β) + Σ_{i,j} yij log{µij(α, β)}
         = −Σ_{i,j} exp(α + xTij β) + Σ_{i,j} yij(α + xTij β)
         = −exp(α) Σ_{i,j} exp(xTij β) + Σ_{i,j} yij xTij β + nα.
Now let us reparametrise (α, β) ↦ (τ, β) where
τ = Σ_{i,j} µij = exp(α) Σ_{i,j} exp(xTij β).
We have
ℓP(τ, β) = Σ_{i,j} yij xTij β − n log(Σ_{i,j} exp(xTij β)) + {n log(τ) − τ}
         = ℓm(β|n) + ℓP(τ).
To maximise the log-likelihood above, we can maximise over β and τ separately. Thus if β ∗ is
the m.l.e. from the multinomial model, and β̂ is the m.l.e. from the Poisson model, we see that
(assuming the m.l.e.’s are unique) β ∗ = β̂. Several equivalences of the multinomial and Poisson
models emerge from this fact.
• The deviances from the Poisson model and the multinomial model are the same.
• The fitted values from both models are the same. Indeed, in the multinomial model, the
fitted values are
np̂ij := n exp(xTij β̂) / Σ_{i=1}^I Σ_{j=1}^J exp(xTij β̂),
while in the Poisson model they are
µ̂ij := τ̂ exp(xTij β̂) / Σ_{i=1}^I Σ_{j=1}^J exp(xTij β̂).
But recall that since we have included an intercept term in the Poisson model,
n = Σ_{i,j} yij = Σ_{i,j} µ̂ij = τ̂,
and hence np̂ij = µ̂ij.
Summary. Multinomial models can be fitted using Poisson log-linear models provided that an intercept is included in the Poisson model. The Poisson models used to mimic multinomial models are known as surrogate Poisson models.
Under the independence hypothesis (3.2.1), the surrogate Poisson model takes
log(µij) = µ + ai + bj,
where to ensure identifiability, we enforce the corner point constraints a1 = b1 = 0. Thus there are 1 + (I − 1) + (J − 1) = I + J − 1 parameters. Provided the cell counts yij are large enough, small dispersion asymptotics can be used to justify comparing the deviance or Pearson's χ² statistic to a χ²IJ−(I+J−1) = χ²(I−1)(J−1) distribution.
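As a small illustration (a sketch with made-up counts, again using statsmodels): fitting the surrogate Poisson model log(µij) = µ + ai + bj to a 2 × 3 table, the fitted values sum to n because an intercept is included, and the residual deviance can be referred to a χ² distribution on (I − 1)(J − 1) = 2 degrees of freedom.

    import numpy as np
    import statsmodels.api as sm

    # Made-up 2 x 3 contingency table, flattened row by row
    y = np.array([20, 30, 10, 25, 35, 15], dtype=float)
    n = y.sum()

    # Design for log(mu_ij) = mu + a_i + b_j with corner point constraints a1 = b1 = 0
    row2 = np.array([0, 0, 0, 1, 1, 1])
    col2 = np.array([0, 1, 0, 0, 1, 0])
    col3 = np.array([0, 0, 1, 0, 0, 1])
    X = np.column_stack([np.ones(6), row2, col2, col3])

    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    print(fit.fittedvalues.sum(), n)     # equal, because an intercept is included
    print(fit.deviance)                  # compare with chi-squared on (2-1)(3-1) = 2 d.f.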
3.2.5 Test for homogeneity of rows
Consider the following example. In a flu vaccine trial, patients were randomly allocated to one
of two groups. The first received a placebo, the other the vaccine. The levels of antibody after
six weeks were:
We are interested in the homogeneity of the different rows: is there a different response from
the vaccine group? Here the row totals were fixed before the responses were observed. We can
thus model the responses in each row as having a multinomial distribution.
If ni, i = 1, . . . , I, denotes the sum of the ith row, we model the response in the ith row, Yi = (Yi1, . . . , YiJ), as Multi(ni; pi1, . . . , piJ), independently across rows. The corresponding surrogate Poisson model has
log(µij) = µ + ai + bj,
which is the same as for the independence example. For identifiability we may take a1 = b1 = 0. Here, the ai are playing the role of intercepts for each row.
Consider again that the table is constructed from i.i.d. random variables Z1 , . . . , Zn taking
values in the categories
{1, . . . , I} × {1, . . . , J} × {1, . . . , K}.
Let us write Z1 = (A, B, C). Note that pijk = P(A = i, B = j, C = k). There are now eight
hypotheses concerning independence which may be of interest. Broken into four classes, they
are:
1. H1 : pijk = αi βj γk for all i, j, k. Summing over j and k we see that αi = P(A = i). Thus this model corresponds to A, B and C being mutually independent.
2. H2 : pijk = αi βjk for all i, j, k. As before we see that αi = P(A = i), and summing over
i we get βjk = P(B = j, C = k). This corresponds to saying A is independent of (B, C).
Two other hypotheses are obtained by permutation of A, B, C.
3. H3 : pijk = βij γik for all i, j, k. If we denote summing over an index with a ‘+’, so for example
   pi++ := Σ_{j,k} pijk = Σ_{j,k} βij γik = βi+ γi+,
   we see that
   P(B = j, C = k | A = i) = pijk/pi++ = (βij γik)/(βi+ γi+) = (βij/βi+)(γik/γi+),
so B and C are conditionally independent given A. Two other hypotheses are obtained
by permuting A, B, C.
4. H4 : pijk = αjk βik γij for all i, j, k. This hypothesis cannot be expressed as a conditional independence statement, but means there are no three-way interactions; the corresponding log-linear forms are sketched below.
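For reference, these hypotheses correspond to the following log-linear models for the surrogate Poisson means (a sketch, using an obvious notation for the interaction terms and omitting identifiability constraints):
H1 : log(µijk) = µ + ai + bj + ck,
H2 : log(µijk) = µ + ai + bj + ck + (bc)jk,
H3 : log(µijk) = µ + ai + bj + ck + (ab)ij + (ac)ik,
H4 : log(µijk) = µ + ai + bj + ck + (ab)ij + (ac)ik + (bc)jk,
i.e. H4 contains all two-way interactions but no three-way interaction.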