AN INTRODUCTION TO GENERALIZED LINEAR MODELS
SECOND EDITION

Annette J. Dobson

CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton   London   New York   Washington, D.C.
Library of Congress Cataloging-in-Publication Data

Dobson, Annette J., 1945-


An introduction to generalized linear models / Annette J. Dobson.—2nd ed.
p. cm.— (Chapman & Hall/CRC texts in statistical science series)
Includes bibliographical references and index.
ISBN 1-58488-165-8 (alk. paper)
1. Linear models (Statistics) I. Title. II. Texts in statistical science.

QA276 .D589 2001


519.5′35—dc21 2001047417

This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.

Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2002 by Chapman & Hall/CRC

No claim to original U.S. Government works


International Standard Book Number 1-58488-165-8
Library of Congress Card Number 2001047417
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
Contents

Preface

1 Introduction
1.1 Background
1.2 Scope
1.3 Notation
1.4 Distributions related to the Normal distribution
1.5 Quadratic forms
1.6 Estimation
1.7 Exercises

2 Model Fitting
2.1 Introduction
2.2 Examples
2.3 Some principles of statistical modelling
2.4 Notation and coding for explanatory variables
2.5 Exercises

3 Exponential Family and Generalized Linear Models


3.1 Introduction
3.2 Exponential family of distributions
3.3 Properties of distributions in the exponential family
3.4 Generalized linear models
3.5 Examples
3.6 Exercises

4 Estimation
4.1 Introduction
4.2 Example: Failure times for pressure vessels
4.3 Maximum likelihood estimation
4.4 Poisson regression example
4.5 Exercises

5 Inference
5.1 Introduction
5.2 Sampling distribution for score statistics



5.3 Taylor series approximations
5.4 Sampling distribution for maximum likelihood estimators
5.5 Log-likelihood ratio statistic
5.6 Sampling distribution for the deviance
5.7 Hypothesis testing
5.8 Exercises

6 Normal Linear Models


6.1 Introduction
6.2 Basic results
6.3 Multiple linear regression
6.4 Analysis of variance
6.5 Analysis of covariance
6.6 General linear models
6.7 Exercises

7 Binary Variables and Logistic Regression


7.1 Probability distributions
7.2 Generalized linear models
7.3 Dose response models
7.4 General logistic regression model
7.5 Goodness of fit statistics
7.6 Residuals
7.7 Other diagnostics
7.8 Example: Senility and WAIS
7.9 Exercises

8 Nominal and Ordinal Logistic Regression


8.1 Introduction
8.2 Multinomial distribution
8.3 Nominal logistic regression
8.4 Ordinal logistic regression
8.5 General comments
8.6 Exercises

9 Count Data, Poisson Regression and Log-Linear Models


9.1 Introduction
9.2 Poisson regression
9.3 Examples of contingency tables
9.4 Probability models for contingency tables
9.5 Log-linear models
9.6 Inference for log-linear models
9.7 Numerical examples
9.8 Remarks
9.9 Exercises



10 Survival Analysis
10.1 Introduction
10.2 Survivor functions and hazard functions
10.3 Empirical survivor function
10.4 Estimation
10.5 Inference
10.6 Model checking
10.7 Example: remission times
10.8 Exercises

11 Clustered and Longitudinal Data


11.1 Introduction
11.2 Example: Recovery from stroke
11.3 Repeated measures models for Normal data
11.4 Repeated measures models for non-Normal data
11.5 Multilevel models
11.6 Stroke example continued
11.7 Comments
11.8 Exercises

Software

References



Preface

Statistical tools for analyzing data are developing rapidly so that the 1990
edition of this book is now out of date.
The original purpose of the book was to present a unified theoretical and
conceptual framework for statistical modelling in a way that was accessible
to undergraduate students and researchers in other fields. This new edition
has been expanded to include nominal (or multinomial) and ordinal logistic
regression, survival analysis and analysis of longitudinal and clustered data.
Although these topics do not fall strictly within the definition of generalized
linear models, the underlying principles and methods are very similar and
their inclusion is consistent with the original purpose of the book.
The new edition relies on numerical methods more than the previous edition
did. Some of the calculations can be performed with a spreadsheet while others
require statistical software. There is an emphasis on graphical methods for
exploratory data analysis, visualizing numerical optimization (for example,
of the likelihood function) and plotting residuals to check the adequacy of
models.
The data sets and outline solutions of the exercises are available on the
publisher’s website:
www.crcpress.com/us/ElectronicProducts/downandup.asp?mscssid=
I am grateful to colleagues and students at the Universities of Queensland
and Newcastle, Australia, for their helpful suggestions and comments about
the material.
Annette Dobson



1
Introduction
1.1 Background
This book is designed to introduce the reader to generalized linear models;
these provide a unifying framework for many commonly used statistical tech-
niques. They also illustrate the ideas of statistical modelling.
The reader is assumed to have some familiarity with statistical principles
and methods. In particular, understanding the concepts of estimation, sam-
pling distributions and hypothesis testing is necessary. Experience in the use
of t-tests, analysis of variance, simple linear regression and chi-squared tests of
independence for two-dimensional contingency tables is assumed. In addition,
some knowledge of matrix algebra and calculus is required.
The reader will find it necessary to have access to statistical computing
facilities. Many statistical programs, languages or packages can now perform
the analyses discussed in this book. Often, however, they do so with a different
program or procedure for each type of analysis so that the unifying structure
is not apparent.
Some programs or languages which have procedures consistent with the
approach used in this book are: Stata, S-PLUS, Glim, Genstat and SYSTAT.
This list is not comprehensive as appropriate modules are continually
being added to other programs.
In addition, anyone working through this book may find it helpful to be able
to use mathematical software that can perform matrix algebra, differentiation
and iterative calculations.

1.2 Scope
The statistical methods considered in this book all involve the analysis of
relationships between measurements made on groups of subjects or objects.
For example, the measurements might be the heights or weights and the ages
of boys and girls, or the yield of plants under various growing conditions.
We use the terms response, outcome or dependent variable for measure-
ments that are free to vary in response to other variables called explanatory
variables or predictor variables or independent variables - although
this last term can sometimes be misleading. Responses are regarded as ran-
dom variables. Explanatory variables are usually treated as though they are
non-random measurements or observations; for example, they may be fixed
by the experimental design.
Responses and explanatory variables are measured on one of the following
scales.
1. Nominal classifications: e.g., red, green, blue; yes, no, do not know, not
applicable. In particular, for binary, dichotomous or binomial variables
there are only two categories: male, female; dead, alive; smooth leaves,
serrated leaves. If there are more than two categories the variable is called
polychotomous, polytomous or multinomial.
2. Ordinal classifications in which there is some natural order or ranking be-
tween the categories: e.g., young, middle aged, old; diastolic blood pressures
grouped as ≤ 70, 71-90, 91-110, 111-130, ≥131mm Hg.
3. Continuous measurements where observations may, at least in theory, fall
anywhere on a continuum: e.g., weight, length or time. This scale includes
both interval scale and ratio scale measurements – the latter have a
well-defined zero. A particular example of a continuous measurement is
the time until a specific event occurs, such as the failure of an electronic
component; the length of time from a known starting point is called the
failure time.
Nominal and ordinal data are sometimes called categorical or discrete
variables and the numbers of observations, counts or frequencies in each
category are usually recorded. For continuous data the individual measure-
ments are recorded. The term quantitative is often used for a variable mea-
sured on a continuous scale and the term qualitative for nominal and some-
times for ordinal measurements. A qualitative, explanatory variable is called
a factor and its categories are called the levels for the factor. A quantitative
explanatory variable is sometimes called a covariate.
Methods of statistical analysis depend on the measurement scales of the
response and explanatory variables.
This book is mainly concerned with those statistical methods which are
relevant when there is just one response variable, although there will usually
be several explanatory variables. The responses measured on different subjects
are usually assumed to be statistically independent random variables although
this requirement is dropped in the final chapter which is about correlated
data. Table 1.1 shows the main methods of statistical analysis for various
combinations of response and explanatory variables and the chapters in which
these are described.
The present chapter summarizes some of the statistical theory used through-
out the book. Chapters 2 to 5 cover the theoretical framework that is common
to the subsequent chapters. Later chapters focus on methods for analyzing
particular kinds of data.
Chapter 2 develops the main ideas of statistical modelling. The modelling
process involves four steps:
1. Specifying models in two parts: equations linking the response and explana-
tory variables, and the probability distribution of the response variable.
2. Estimating parameters used in the models.
3. Checking how well the models fit the actual data.
4. Making inferences; for example, calculating confidence intervals and testing
hypotheses about the parameters.



Table 1.1 Major methods of statistical analysis for response and explanatory variables measured on various scales and chapter references for this book.

Response (chapter)       Explanatory variables        Methods

Continuous               Binary                       t-test
(Chapter 6)              Nominal, >2 categories       Analysis of variance
                         Ordinal                      Analysis of variance
                         Continuous                   Multiple regression
                         Nominal & some continuous    Analysis of covariance
                         Categorical & continuous     Multiple regression

Binary                   Categorical                  Contingency tables,
(Chapter 7)                                           logistic regression
                         Continuous                   Logistic, probit & other
                                                      dose-response models
                         Categorical & continuous     Logistic regression

Nominal with             Nominal                      Contingency tables
>2 categories            Categorical & continuous     Nominal logistic regression
(Chapters 8 & 9)

Ordinal                  Categorical & continuous     Ordinal logistic regression
(Chapter 8)

Counts                   Categorical                  Log-linear models
(Chapter 9)              Categorical & continuous     Poisson regression

Failure times            Categorical & continuous     Survival analysis (parametric)
(Chapter 10)

Correlated responses     Categorical & continuous     Generalized estimating equations,
(Chapter 11)                                          multilevel models


The next three chapters provide the theoretical background. Chapter 3 is
about the exponential family of distributions, which includes the Normal,
Poisson and binomial distributions. It also covers generalized linear models
(as defined by Nelder and Wedderburn, 1972). Linear regression and many
other models are special cases of generalized linear models. In Chapter 4
methods of estimation and model fitting are described.
Chapter 5 outlines methods of statistical inference for generalized linear
models. Most of these are based on how well a model describes the set of data.
For example, hypothesis testing is carried out by first specifying alternative
models (one corresponding to the null hypothesis and the other to a more
general hypothesis). Then test statistics are calculated which measure the
‘goodness of fit’ of each model and these are compared. Typically the model
corresponding to the null hypothesis is simpler, so if it fits the data about
as well as a more complex model it is usually preferred on the grounds of
parsimony (i.e., we retain the null hypothesis).
Chapter 6 is about multiple linear regression and analysis of vari-
ance (ANOVA). Regression is the standard method for relating a continuous
response variable to several continuous explanatory (or predictor) variables.
ANOVA is used for a continuous response variable and categorical or qualita-
tive explanatory variables (factors). Analysis of covariance (ANCOVA) is
used when at least one of the explanatory variables is continuous. Nowadays
it is common to use the same computational tools for all such situations. The
terms multiple regression or general linear model are used to cover the
range of methods for analyzing one continuous response variable and multiple
explanatory variables.
Chapter 7 is about methods for analyzing binary response data. The most
common one is logistic regression which is used to model relationships
between the response variable and several explanatory variables which may
be categorical or continuous. Methods for relating the response to a single
continuous variable, the dose, are also considered; these include probit anal-
ysis which was originally developed for analyzing dose-response data from
bioassays. Logistic regression has been generalized in recent years to include
responses with more than two nominal categories (nominal, multinomial,
polytomous or polychotomous logistic regression) or ordinal categories
(ordinal logistic regression). These new methods are discussed in Chapter
8.
Chapter 9 concerns count data. The counts may be frequencies displayed
in a contingency table or numbers of events, such as traffic accidents, which
need to be analyzed in relation to some ‘exposure’ variable such as the number
of motor vehicles registered or the distances travelled by the drivers. Mod-
elling methods are based on assuming that the distribution of counts can be
described by the Poisson distribution, at least approximately. These methods
include Poisson regression and log-linear models.
Survival analysis is the usual term for methods of analyzing failure time
data. The parametric methods described in Chapter 10 fit into the framework
of generalized linear models although the probability distribution assumed for
the failure times may not belong to the exponential family.
Generalized linear models have been extended to situations where the re-
sponses are correlated rather than independent random variables. This may
occur, for instance, if they are repeated measurements on the same subject
or measurements on a group of related subjects obtained, for example, from
clustered sampling. The method of generalized estimating equations
(GEE’s) has been developed for analyzing such data using techniques analo-
gous to those for generalized linear models. This method is outlined in Chapter
11 together with a different approach to correlated data, namely multilevel
modelling.
Further examples of generalized linear models are discussed in the books
by McCullagh and Nelder (1989), Aitkin et al. (1989) and Healy (1988). Also
there are many books about specific generalized linear models such as Hos-
mer and Lemeshow (2000), Agresti (1990, 1996), Collett (1991, 1994), Diggle,
Liang and Zeger (1994), and Goldstein (1995).

1.3 Notation
Generally we follow the convention of denoting random variables by upper
case italic letters and observed values by the corresponding lower case letters.
For example, the observations y1 , y2 , ..., yn are regarded as realizations of the
random variables Y1 , Y2 , ..., Yn . Greek letters are used to denote parameters
and the corresponding lower case roman letters are used to denote estimators
and estimates; occasionally the hat symbol ˆ is used for estimators or estimates.
For example, the parameter β is estimated by β̂ or b. Sometimes these con-
ventions are not strictly adhered to, either to avoid excessive notation in cases
where the meaning should be apparent from the context, or when there is a
strong tradition of alternative notation (e.g., e or ε for random error terms).
Vectors and matrices, whether random or not, are denoted by bold face
lower and upper case letters, respectively. Thus, y represents a vector of ob-
servations
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$
or a vector of random variables
$$\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix},$$
β denotes a vector of parameters and X is a matrix. The superscript T is
used for a matrix transpose or when a column vector is written as a row, e.g.,
$\mathbf{y} = [Y_1, \ldots, Y_n]^T$.



The probability density function of a continuous random variable Y (or the
probability mass function if Y is discrete) is referred to simply as a proba-
bility distribution and denoted by
f (y; θ)
where θ represents the parameters of the distribution.
We use dot (·) subscripts for summation and bars (¯) for means; thus
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\, y_{\cdot}\,.$$
The expected value and variance of a random variable Y are denoted by
E(Y) and var(Y) respectively. Suppose the random variables Y1, ..., Yn are inde-
pendent with E(Yi) = µi and var(Yi) = σi² for i = 1, ..., n. Let the random
variable W be a linear combination of the Yi's
$$W = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n, \qquad (1.1)$$
where the ai's are constants. Then the expected value of W is
$$E(W) = a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n \qquad (1.2)$$
and its variance is
$$\mathrm{var}(W) = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2. \qquad (1.3)$$
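As a quick sanity check (not part of the original text), equations (1.2) and (1.3) can be verified by simulation. The sketch below assumes Python with numpy; the constants, means and variances are purely illustrative values.

```python
import numpy as np

# Minimal simulation check of equations (1.2) and (1.3), assuming numpy.
# The coefficients a_i, means mu_i and variances sigma_i^2 are illustrative.
rng = np.random.default_rng(0)

a = np.array([1.0, -2.0, 0.5])        # constants a_1, a_2, a_3
mu = np.array([1.0, 2.0, -1.0])       # E(Y_i) = mu_i
sigma2 = np.array([3.0, 5.0, 2.0])    # var(Y_i) = sigma_i^2

# Theoretical moments of W = sum a_i Y_i
expected_mean = a @ mu                # equation (1.2)
expected_var = (a**2) @ sigma2        # equation (1.3)

# Empirical moments from independent Normal samples
samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, 3))
w = samples @ a
print(expected_mean, w.mean())        # should agree closely
print(expected_var, w.var())
```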

1.4 Distributions related to the Normal distribution


The sampling distributions of many of the estimators and test statistics used
in this book depend on the Normal distribution. They do so either directly be-
cause they are derived from Normally distributed random variables, or asymp-
totically, via the Central Limit Theorem for large samples. In this section we
give definitions and notation for these distributions and summarize the re-
lationships between them. The exercises at the end of the chapter provide
practice in using these results which are employed extensively in subsequent
chapters.

1.4.1 Normal distributions


1. If the random variable Y has the Normal distribution with mean µ and
variance σ², its probability density function is
$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{1}{2}\,\frac{(y-\mu)^2}{\sigma^2}\right].$$
We denote this by Y ∼ N(µ, σ²).


2. The Normal distribution with µ = 0 and σ 2 = 1, Y ∼ N (0, 1), is called the
standard Normal distribution.



3. Let Y1 , ..., Yn denote Normally distributed random variables with Yi ∼
N (µi , σi2 ) for i = 1, ..., n and let the covariance of Yi and Yj be denoted by
cov(Yi , Yj ) = ρij σi σj ,

where ρij is the correlation coefficient for Yi and Yj . Then the joint distri-
bution of the Yi's is the multivariate Normal distribution with mean
vector µ = [µ1, ..., µn]ᵀ and variance-covariance matrix V with diagonal
elements σi² and non-diagonal elements ρij σi σj for i ≠ j. We write this as
y ∼ N(µ, V), where y = [Y1, ..., Yn]ᵀ.
4. Suppose the random variables Y1 , ..., Yn are independent and Normally dis-
tributed with the distributions Yi ∼ N (µi , σi2 ) for i = 1, ..., n. If
$$W = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n,$$
where the ai's are constants, then W is also Normally distributed, so that
$$W = \sum_{i=1}^{n} a_i Y_i \sim N\!\left(\sum_{i=1}^{n} a_i\mu_i,\; \sum_{i=1}^{n} a_i^2\sigma_i^2\right)$$
by equations (1.2) and (1.3).

1.4.2 Chi-squared distribution


1. The central chi-squared distribution with n degrees of freedom is de-
fined as the sum of squares of n independent random variables Z1 , ..., Zn
each with the standard Normal distribution. It is denoted by
$$X^2 = \sum_{i=1}^{n} Z_i^2 \sim \chi^2(n).$$
In matrix notation, if $\mathbf{z} = [Z_1, \ldots, Z_n]^T$ then $\mathbf{z}^T\mathbf{z} = \sum_{i=1}^{n} Z_i^2$, so that $X^2 = \mathbf{z}^T\mathbf{z} \sim \chi^2(n)$.
2. If X 2 has the distribution χ2 (n), then its expected value is E(X 2 ) = n and
its variance is var(X 2 ) = 2n.
3. If Y1 , ..., Yn are independent Normally distributed random variables each
with the distribution $Y_i \sim N(\mu_i, \sigma_i^2)$, then
$$X^2 = \sum_{i=1}^{n} \left(\frac{Y_i - \mu_i}{\sigma_i}\right)^2 \sim \chi^2(n) \qquad (1.4)$$

because each of the variables Zi = (Yi − µi ) /σi has the standard Normal
distribution N (0, 1).
4. Let Z1 , ..., Zn be independent random variables each with the distribution
N (0, 1) and let Yi = Zi + µi , where at least one of the µi ’s is non-zero.
Then the distribution of
$$\sum Y_i^2 = \sum (Z_i + \mu_i)^2 = \sum Z_i^2 + 2\sum Z_i\mu_i + \sum \mu_i^2$$
has larger mean n + λ and larger variance 2n + 4λ than χ²(n), where $\lambda = \sum \mu_i^2$. This is called the non-central chi-squared distribution with n degrees of freedom and non-centrality parameter λ. It is denoted by χ²(n, λ).
5. Suppose that the Yi's are not necessarily independent and the vector y = [Y1, . . . , Yn]ᵀ has the multivariate Normal distribution y ∼ N(µ, V), where the variance-covariance matrix V is non-singular and its inverse is V⁻¹. Then
$$X^2 = (\mathbf{y} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \sim \chi^2(n). \qquad (1.5)$$

6. More generally if y ∼ N(µ, V) then the random variable yT V−1 y has the
non-central chi-squared distribution χ2 (n, λ) where λ = µT V−1 µ.
7. If $X_1^2, \ldots, X_m^2$ are m independent random variables with the chi-squared
   distributions $X_i^2 \sim \chi^2(n_i, \lambda_i)$, which may or may not be central, then their
   sum also has a chi-squared distribution with $\sum n_i$ degrees of freedom and
   non-centrality parameter $\sum \lambda_i$, i.e.,
   $$\sum_{i=1}^{m} X_i^2 \sim \chi^2\!\left(\sum_{i=1}^{m} n_i,\; \sum_{i=1}^{m} \lambda_i\right).$$

This is called the reproductive property of the chi-squared distribution.


8. Let y ∼ N(µ, V), where y has n elements but the Yi ’s are not independent
so that V is singular with rank k < n and the inverse of V is not uniquely
defined. Let V− denote a generalized inverse of V. Then the random vari-
able yT V− y has the non-central chi-squared distribution with k degrees of
freedom and non-centrality parameter λ = µT V− µ.
For further details about properties of the chi-squared distribution see Rao
(1973, Chapter 3).
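Result (1.5) can also be checked by simulation. The sketch below (not from the text) assumes Python with numpy and uses an illustrative mean vector and covariance matrix; it compares the empirical mean and variance of the quadratic form with the values n and 2n expected for a χ²(n) variable.

```python
import numpy as np

# Simulation sketch of result (1.5), assuming numpy: for y ~ N(mu, V) with
# non-singular V, (y - mu)^T V^{-1} (y - mu) has a chi-squared distribution
# with n degrees of freedom.  mu and V below are illustrative values only.
rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
V_inv = np.linalg.inv(V)

y = rng.multivariate_normal(mu, V, size=200_000)
d = y - mu
x2 = np.einsum('ij,jk,ik->i', d, V_inv, d)   # quadratic form for each sample

# A chi-squared variable with n = 3 degrees of freedom has mean 3, variance 6.
print(x2.mean(), x2.var())
```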

1.4.3 t-distribution
The t-distribution with n degrees of freedom is defined as the ratio of two
independent random variables. The numerator has the standard Normal dis-
tribution and the denominator is the square root of a central chi-squared
random variable divided by its degrees of freedom; that is,
$$T = \frac{Z}{(X^2/n)^{1/2}} \qquad (1.6)$$
where Z ∼ N (0, 1), X 2 ∼ χ2 (n) and Z and X 2 are independent. This is
denoted by T ∼ t(n).

1.4.4 F-distribution
1. The central F-distribution with n and m degrees of freedom is defined
as the ratio of two independent central chi-squared random variables, each
divided by its degrees of freedom,
$$F = \frac{X_1^2}{n} \bigg/ \frac{X_2^2}{m} \qquad (1.7)$$
where $X_1^2 \sim \chi^2(n)$, $X_2^2 \sim \chi^2(m)$, and $X_1^2$ and $X_2^2$ are independent. This is
denoted by F ∼ F(n, m).
2. The relationship between the t-distribution and the F-distribution can be
derived by squaring the terms in equation (1.6) and using definition (1.7)
to obtain
$$T^2 = \frac{Z^2}{1} \bigg/ \frac{X^2}{n} \sim F(1, n), \qquad (1.8)$$
that is, the square of a random variable with the t-distribution, t(n), has
the F-distribution, F (1, n).
3. The non-central F-distribution is defined as the ratio of two indepen-
dent random variables, each divided by its degrees of freedom, where the
numerator has a non-central chi-squared distribution and the denominator
has a central chi-squared distribution, i.e.,
$$F = \frac{X_1^2}{n} \bigg/ \frac{X_2^2}{m}$$
where $X_1^2 \sim \chi^2(n, \lambda)$ with $\lambda = \boldsymbol{\mu}^T \mathbf{V}^{-1} \boldsymbol{\mu}$, $X_2^2 \sim \chi^2(m)$, and $X_1^2$ and $X_2^2$ are independent. The mean of a non-central F-distribution is larger than the mean of the central F-distribution with the same degrees of freedom.
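The relationship (1.8) between the t- and F-distributions is easy to confirm numerically. The short check below (not from the text) assumes Python with scipy.stats; the degrees of freedom and probability level are arbitrary choices.

```python
from scipy import stats

# Numerical check of result (1.8), assuming scipy: the square of the
# two-sided t(n) cut-off equals the upper F(1, n) quantile at the same level.
n = 7
p = 0.95

t_quantile = stats.t.ppf(0.5 + p / 2, df=n)   # two-sided t cut-off
f_quantile = stats.f.ppf(p, dfn=1, dfd=n)     # upper F(1, n) cut-off

print(t_quantile**2, f_quantile)              # these should agree
```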

1.5 Quadratic forms


1. A quadratic form is a polynomial expression in which each term has
   degree 2. Thus $y_1^2 + y_2^2$ and $2y_1^2 + y_2^2 + 3y_1y_2$ are quadratic forms in y1 and
   y2, but $y_1^2 + y_2^2 + 2y_1$ or $y_1^2 + 3y_2^2 + 2$ are not.
2. Let A be a symmetric matrix
   $$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
   where $a_{ij} = a_{ji}$; then the expression $\mathbf{y}^T\mathbf{A}\mathbf{y} = \sum_i \sum_j a_{ij} y_i y_j$ is a quadratic
   form in the yi's. The expression $(\mathbf{y} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{y} - \boldsymbol{\mu})$ is a quadratic form
   in the terms $(y_i - \mu_i)$ but not in the yi's.
3. The quadratic form yT Ay and the matrix A are said to be positive defi-
nite if yT Ay > 0 whenever the elements of y are not all zero. A necessary
and sufficient condition for positive definiteness is that all the determinants
$$|A_1| = a_{11}, \quad |A_2| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}, \quad |A_3| = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}, \; \ldots, \; |A_n| = \det \mathbf{A}$$
are all positive.
4. The rank of the matrix A is also called the degrees of freedom of the
quadratic form Q = yT Ay.
5. Suppose Y1, ..., Yn are independent random variables each with the Normal
   distribution N(0, σ²). Let $Q = \sum_{i=1}^{n} Y_i^2$ and let Q1, ..., Qk be quadratic
   forms in the Yi's such that
   $$Q = Q_1 + \cdots + Q_k$$
   where Qi has mi degrees of freedom (i = 1, . . . , k). Then Q1, ..., Qk are
   independent random variables and
   $$Q_1/\sigma^2 \sim \chi^2(m_1), \quad Q_2/\sigma^2 \sim \chi^2(m_2), \quad \ldots, \quad Q_k/\sigma^2 \sim \chi^2(m_k)$$
   if and only if
   $$m_1 + m_2 + \cdots + m_k = n.$$

This is Cochran’s theorem; for a proof see, for example, Hogg and Craig
(1995). A similar result holds for non-central distributions; see Chapter 3
of Rao (1973).
6. A consequence of Cochran's theorem is that the difference of two independent random variables, $X_1^2 \sim \chi^2(m)$ and $X_2^2 \sim \chi^2(k)$, also has a chi-squared distribution
   $$X^2 = X_1^2 - X_2^2 \sim \chi^2(m - k)$$
   provided that X² ≥ 0 and m > k.

1.6 Estimation
1.6.1 Maximum likelihood estimation
Let $\mathbf{y} = [Y_1, \ldots, Y_n]^T$ denote a random vector and let the joint probability
density function of the Yi's be
$$f(\mathbf{y}; \boldsymbol{\theta})$$
which depends on the vector of parameters $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_p]^T$.
The likelihood function L(θ; y) is algebraically the same as the joint
probability density function f (y; θ) but the change in notation reflects a shift
of emphasis from the random variables y, with θ fixed, to the parameters θ
with y fixed. Since L is defined in terms of the random vector y, it is itself a
random variable. Let Ω denote the set of all possible values of the parameter
vector θ; Ω is called the parameter space. The maximum likelihood
estimator of θ is the value θ̂ which maximizes the likelihood function, that is
$$L(\hat{\boldsymbol{\theta}}; \mathbf{y}) \geq L(\boldsymbol{\theta}; \mathbf{y}) \quad \text{for all } \boldsymbol{\theta} \text{ in } \Omega.$$
Equivalently, θ̂ is the value which maximizes the log-likelihood function



l(θ; y) = log L(θ; y), since the logarithmic function is monotonic. Thus
$$l(\hat{\boldsymbol{\theta}}; \mathbf{y}) \geq l(\boldsymbol{\theta}; \mathbf{y}) \quad \text{for all } \boldsymbol{\theta} \text{ in } \Omega.$$
Often it is easier to work with the log-likelihood function than with the likelihood function itself.
Usually the estimator θ̂ is obtained by differentiating the log-likelihood function with respect to each element θj of θ and solving the simultaneous equations
$$\frac{\partial l(\boldsymbol{\theta}; \mathbf{y})}{\partial \theta_j} = 0 \quad \text{for } j = 1, \ldots, p. \qquad (1.9)$$

It is necessary to check that the solutions do correspond to maxima of
l(θ; y) by verifying that the matrix of second derivatives
$$\frac{\partial^2 l(\boldsymbol{\theta}; \mathbf{y})}{\partial \theta_j\, \partial \theta_k}$$
evaluated at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ is negative definite. For example, if θ has only one element
θ this means it is necessary to check that
$$\left[\frac{\partial^2 l(\theta; \mathbf{y})}{\partial \theta^2}\right]_{\theta = \hat{\theta}} < 0.$$

It is also necessary to check if there are any values of θ at the edges of the
parameter space Ω that give local maxima of l(θ; y). When all local maxima
have been identified, the value of θ̂ corresponding to the largest one is the
maximum likelihood estimator. (For most of the models considered in this
book there is only one maximum and it corresponds to the solution of the
equations ∂l/∂θj = 0, j = 1, ..., p.)
An important property of maximum likelihood estimators is that if g(θ)
is any function of the parameters θ, then the maximum likelihood estimator
of g(θ) is g(θ̂). This follows from the definition of θ̂. It is sometimes called
the invariance property of maximum likelihood estimators. A consequence
is that we can work with a function of the parameters that is convenient
for maximum likelihood estimation and then use the invariance property to
obtain maximum likelihood estimates for the required parameters.
In principle, it is not necessary to be able to find the derivatives of the
likelihood or log-likelihood functions or to solve equation (1.9) if θ̂ can be
found numerically. In practice, numerical approximations are very important
for generalized linear models.
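As an illustration of this numerical route (a sketch only, not from the text, assuming Python with scipy and made-up count data), a one-parameter Poisson log-likelihood of the kind used in the next example can be maximized directly, without solving equation (1.9) analytically.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Hypothetical count data; in practice y would hold the observed responses.
y = np.array([2, 5, 3, 4, 6, 1, 3])

def neg_log_likelihood(theta):
    # Poisson log-likelihood l(theta; y) = sum(y_i log theta - theta - log y_i!),
    # negated because the optimizer minimizes.
    return -np.sum(y * np.log(theta) - theta - gammaln(y + 1))

# Maximize l by minimizing -l over theta > 0.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), method='bounded')
print(result.x, y.mean())   # the numerical maximizer should equal the sample mean
```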
Other properties of maximum likelihood estimators include consistency, suf-
ficiency, asymptotic efficiency and asymptotic normality. These are discussed
in books such as Cox and Hinkley (1974) or Kalbfleisch (1985, Chapters 1 and
2).



1.6.2 Example: Poisson distribution
Let Y1 , ..., Yn be independent random variables each with the Poisson distri-
bution
$$f(y_i; \theta) = \frac{\theta^{y_i} e^{-\theta}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots$$
with the same parameter θ. Their joint distribution is
$$f(y_1, \ldots, y_n; \theta) = \prod_{i=1}^{n} f(y_i; \theta) = \frac{\theta^{y_1} e^{-\theta}}{y_1!} \times \frac{\theta^{y_2} e^{-\theta}}{y_2!} \times \cdots \times \frac{\theta^{y_n} e^{-\theta}}{y_n!} = \frac{\theta^{\sum y_i}\, e^{-n\theta}}{y_1!\, y_2! \cdots y_n!}.$$
This is also the likelihood function L(θ; y1, ..., yn). It is easier to use the log-
likelihood function
$$l(\theta; y_1, \ldots, y_n) = \log L(\theta; y_1, \ldots, y_n) = \left(\sum y_i\right)\log\theta - n\theta - \sum \log y_i!.$$
To find the maximum likelihood estimate θ̂, use
$$\frac{dl}{d\theta} = \frac{1}{\theta}\sum y_i - n.$$
Equate this to zero to obtain the solution
$$\hat{\theta} = \sum y_i / n = \bar{y}.$$
Since $d^2l/d\theta^2 = -\sum y_i/\theta^2 < 0$, l has its maximum value when θ = θ̂, confirming that ȳ is the maximum likelihood estimate.

1.6.3 Least Squares Estimation


Let Y1 , ..., Yn be independent random variables with expected values µ1 , ..., µn
respectively. Suppose that the µi ’s are functions of the parameter vector that
we want to estimate, β = [β1, ..., βp]ᵀ, p < n. Thus
$$E(Y_i) = \mu_i(\boldsymbol{\beta}).$$
The simplest form of the method of least squares consists of finding the
estimator β̂ that minimizes the sum of squares of the differences between the Yi's
and their expected values
$$S = \sum \left[Y_i - \mu_i(\boldsymbol{\beta})\right]^2.$$
Usually β̂ is obtained by differentiating S with respect to each element βj
of β and solving the simultaneous equations
$$\frac{\partial S}{\partial \beta_j} = 0, \qquad j = 1, \ldots, p.$$
Of course it is necessary to check that the solutions correspond to minima
(i.e., the matrix of second derivatives is positive definite) and to identify the
global minimum from among these solutions and any local minima at the
boundary of the parameter space.
Now suppose that the Yi ’s have variances σi2 that are not all equal. Then it
may be desirable to minimize the weighted sum of squared differences
$$S = \sum w_i \left[Y_i - \mu_i(\boldsymbol{\beta})\right]^2$$
where the weights are $w_i = (\sigma_i^2)^{-1}$. In this way, the observations which are
less reliable (that is, the Yi ’s with the larger variances) will have less influence
on the estimates.
More generally, let y = [Y1 , ..., Yn ]T denote a random vector with mean vec-
tor µ = [µ1, ..., µn]ᵀ and variance-covariance matrix V. Then the weighted
least squares estimator is obtained by minimizing
$$S = (\mathbf{y} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{y} - \boldsymbol{\mu}).$$
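For the linear special case µ(β) = Xβ, minimizing S has the familiar closed-form solution β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. The sketch below is not from the text; it assumes Python with numpy and uses purely illustrative values for X, V and y to compute this weighted least squares estimate.

```python
import numpy as np

# Weighted least squares in the linear special case mu(beta) = X beta,
# assuming numpy.  X, V and y below are illustrative values only.
# Minimizing S = (y - X beta)^T V^{-1} (y - X beta) gives
# beta_hat = (X^T V^{-1} X)^{-1} X^T V^{-1} y.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])              # design matrix: intercept and slope
y = np.array([1.1, 2.9, 5.2, 6.8])      # observed responses
V = np.diag([1.0, 1.0, 4.0, 4.0])       # larger variances receive smaller weight

V_inv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
print(beta_hat)
```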

1.6.4 Comments on estimation


1. An important distinction between the methods of maximum likelihood and
least squares is that the method of least squares can be used without mak-
ing assumptions about the distributions of the response variables Yi be-
yond specifying their expected values and possibly their variance-covariance
structure. In contrast, to obtain maximum likelihood estimators we need
to specify the joint probability distribution of the Yi ’s.
2. For many situations maximum likelihood and least squares estimators are
identical.
3. Often numerical methods rather than calculus may be needed to obtain
parameter estimates that maximize the likelihood or log-likelihood function
or minimize the sum of squares. The following example illustrates this
approach.

1.6.5 Example: Tropical cyclones


Table 1.2 shows the number of tropical cyclones in Northeastern Australia
for the seasons 1956-7 (season 1) to 1968-9 (season 13), a period of fairly
consistent conditions for the definition and tracking of cyclones (Dobson and
Stewart, 1974).

Table 1.2 Numbers of tropical cyclones in 13 successive seasons.


Season: 1 2 3 4 5 6 7 8 9 10 11 12 13
No. of cyclones 6 5 4 6 6 3 12 7 4 2 6 7 4

Let Yi denote the number of cyclones in season i, where i = 1, . . . , 13. Suppose
the Yi's are independent random variables with the Poisson distribution
[Figure 1.1 Graph of the log-likelihood l∗ (vertical axis, 35 to 55) against θ (horizontal axis, 3 to 8), showing the location of the maximum likelihood estimate for the data in Table 1.2 on tropical cyclones.]

with parameter θ. From Example 1.6.2, θ̂ = ȳ = 72/13 = 5.538. An alternative approach would be to find numerically the value of θ that maximizes the log-likelihood function. The component of the log-likelihood function due to yi is
$$l_i = y_i \log\theta - \theta - \log y_i!.$$
The log-likelihood function is the sum of these terms
$$l = \sum_{i=1}^{13} l_i = \sum_{i=1}^{13} \left(y_i \log\theta - \theta - \log y_i!\right).$$
Only the first two terms in the brackets involve θ and so are relevant to the optimization calculation, because the term $\sum_{1}^{13} \log y_i!$ is a constant. To plot the log-likelihood function (without the constant term) against θ, for various values of θ, calculate $(y_i \log\theta - \theta)$ for each yi and add the results to obtain $l^{*} = \sum (y_i \log\theta - \theta)$. Figure 1.1 shows l∗ plotted against θ.
Clearly the maximum value is between θ = 5 and θ = 6. This can provide a starting point for an iterative procedure for obtaining θ̂. The results of a simple bisection calculation are shown in Table 1.3. The function l∗ is first calculated for approximations θ(1) = 5 and θ(2) = 6. Then subsequent approximations θ(k) for k = 3, 4, ... are the average values of the two previous estimates of θ with the largest values of l∗ (for example, $\theta^{(6)} = \tfrac{1}{2}\left(\theta^{(5)} + \theta^{(3)}\right)$). After 7 steps this process gives θ̂ ≈ 5.54, which is correct to 2 decimal places.

Table 1.3 Successive approximations to the maximum likelihood estimate of the mean number of cyclones per season.

 k     θ(k)      l∗
 1     5         50.878
 2     6         51.007
 3     5.5       51.242
 4     5.75      51.192
 5     5.625     51.235
 6     5.5625    51.243
 7     5.5313    51.24354
 8     5.5469    51.24352
 9     5.5391    51.24360
10     5.5352    51.24359
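The search summarized in Table 1.3 is easy to reproduce. The sketch below is not from the text; it assumes Python with numpy, uses the cyclone counts of Table 1.2, and applies the criterion l∗ defined above.

```python
import numpy as np

# Reproducing the bisection-type search of Table 1.3, assuming numpy.
# The counts are the data of Table 1.2; l_star omits the constant sum(log y_i!).
y = np.array([6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4])

def l_star(theta):
    return np.sum(y * np.log(theta) - theta)   # l* = sum(y_i log theta - theta)

# Start from theta = 5 and theta = 6; repeatedly average the two values of
# theta seen so far that give the largest l*.
trials = [5.0, 6.0]
for _ in range(10):
    best_two = sorted(trials, key=l_star)[-2:]
    trials.append(sum(best_two) / 2)

print(trials[-1])   # successive approximations settle near 5.54 (cf. Table 1.3)
print(y.mean())     # analytic maximum likelihood estimate 72/13 = 5.538
```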

1.7 Exercises
1.1 Let Y1 and Y2 be independent random variables with
Y1 ∼ N (1, 3) and Y2 ∼ N (2, 5). If W1 = Y1 + 2Y2 and W2 = 4Y1 − Y2 what
is the joint distribution of W1 and W2 ?
1.2 Let Y1 and Y2 be independent random variables with
Y1 ∼ N (0, 1) and Y2 ∼ N (3, 4).

(a) What is the distribution of $Y_1^2$?
(b) If $\mathbf{y} = \begin{bmatrix} Y_1 \\ (Y_2 - 3)/2 \end{bmatrix}$, obtain an expression for $\mathbf{y}^T\mathbf{y}$. What is its distribution?
(c) If $\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}$ and its distribution is y ∼ N(µ, V), obtain an expression for $\mathbf{y}^T\mathbf{V}^{-1}\mathbf{y}$. What is its distribution?
1.3 Let the joint distribution of Y1 and Y2 be N(µ, V) with
$$\boldsymbol{\mu} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{V} = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}.$$

(a) Obtain an expression for $(\mathbf{y} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{y} - \boldsymbol{\mu})$. What is its distribution?
(b) Obtain an expression for $\mathbf{y}^T \mathbf{V}^{-1} \mathbf{y}$. What is its distribution?


1.4 Let Y1 , ..., Yn be independent random variables each with the distribution
N(µ, σ²). Let
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

(a) What is the distribution of Ȳ?
(b) Show that $S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} (Y_i - \mu)^2 - n(\bar{Y} - \mu)^2\right]$.
(c) From (b) it follows that $\sum (Y_i - \mu)^2/\sigma^2 = (n-1)S^2/\sigma^2 + (\bar{Y} - \mu)^2\, n/\sigma^2$.
    How does this allow you to deduce that Ȳ and S² are independent?
(d) What is the distribution of $(n-1)S^2/\sigma^2$?
(e) What is the distribution of $\dfrac{\bar{Y} - \mu}{S/\sqrt{n}}$?



Table 1.4 Progeny of light brown apple moths.
Progeny Females Males
group
1 18 11
2 31 22
3 34 27
4 33 29
5 27 24
6 33 29
7 28 25
8 23 26
9 33 38
10 12 14
11 19 23
12 25 31
13 14 20
14 4 6
15 22 34
16 7 12

1.5 This exercise is a continuation of the example in Section 1.6.2 in which


Y1 , ..., Yn are independent Poisson random variables with the parameter θ.

(a) Show that E(Yi ) = θ for i = 1, ..., n.


(b) Suppose θ = e^β. Find the maximum likelihood estimator of β.
(c) Minimize $S = \sum\left(Y_i - e^{\beta}\right)^2$ to obtain a least squares estimator of β.
1.6 The data in Table 1.4 are the numbers of females and males in the progeny of
16 female light brown apple moths in Muswellbrook, New South Wales,
Australia (from Lewis, 1987).

(a) Calculate the proportion of females in each of the 16 groups of progeny.


(b) Let Yi denote the number of females and ni the number of progeny in
each group (i = 1, ..., 16). Suppose the Yi ’s are independent random
variables each with the binomial distribution
$$f(y_i; \theta) = \binom{n_i}{y_i}\, \theta^{y_i} (1 - \theta)^{n_i - y_i}.$$
Find the maximum likelihood estimator of θ using calculus and evaluate
it for these data.
(c) Use a numerical method to estimate θ̂ and compare the answer with the
one from (b).
