
Springer Series in Statistics

Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger

Springer Science+Business Media, LLC


Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Bolfarine/Zacks: Prediction Theory for Finite Populations.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear Models.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations 1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I: Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II: Three or More Crops.
Fienberg/Hoaglin/Kruskal/Tanur (Eds.): A Statistical Model: Frederick Mosteller's Contributions to Statistics, Science and Public Policy.
Fisher/Sen: The Collected Works of Wassily Hoeffding.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, 2nd edition.
Gourieroux: ARCH Models and Financial Applications.
Grandell: Aspects of Risk Theory.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hartigan: Bayes Theory.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal Parameter Estimation.
Huet/Bouvier/Gruet/Jolivet: Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS Examples.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume III.
Küchler/Sørensen: Exponential Families of Stochastic Processes.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts, 2nd edition.
Longford: Models for Uncertainty in Educational Testing.
Miller, Jr.: Simultaneous Statistical Inference, 2nd edition.
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of the Federalist Papers.
Parzen/Tanabe/Kitagawa: Selected Papers of Hirotugu Akaike.
Politis/Romano/Wolf: Subsampling.

(continued after index)


Anthony Atkinson Marco Riani

Robust Diagnostic
Regression Analysis

With 192 Illustrations

" Springer
Anthony Atkinson
Department of Statistics
London School of Economics
London WC2A 2AE
UK
a.c.atkinson@lse.ac.uk

Marco Riani
Dipartimento di Economia (Sezione di Statistica)
Università di Parma
43100 Parma
Italy
mriani@unipr.it

Library of Congress Cataloging-in-Publication Data


Atkinson, A.C. (Anthony Curtis)
Robust diagnostic regression analysis / Anthony Atkinson, Marco Riani.
p. cm.-(Springer texts in statistics)
Includes bibliographical references and indexes.
ISBN 978-1-4612-7027-0 ISBN 978-1-4612-1160-0 (eBook)
DOI 10.1007/978-1-4612-1160-0
1. Regression analysis. 2. Robust statistics. I. Riani, Marco. II. Title. III.
Series.

QA278.2.A85 2000
519.5'36-dc21 00-026154

Printed on acid-free paper.

© 2000 Springer Science+Business Media New York


Originally published by Springer-Verlag New York, Inc. in 2000
Softcover reprint of the hardcover 1st edition 2000
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC), except for brief
excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the
former are not especially identified, is not to be taken as a sign that such names, as understood by
the Trade Marks and Merchandise Marks Act, may be accordingly used freely by anyone.

Production managed by A. Orrantia; manufacturing supervised by Jerome Basma.


Electronic copy prepared from the authors' Latex2e files by Bartlett Press, Inc., Marietta, GA.

9 8 7 6 5 4 3 2 1

ISBN 978-1-4612-7027-0
For Basia

to Fabia
Preface

This book is about using graphs to understand the relationship between a


regression model and the data to which it is fitted. Because of the way in
which models are fitted, for example, by least squares, we can lose infor-
mation about the effect of individual observations on inferences about the
form and parameters of the model. The methods developed in this book
reveal how the fitted regression model depends on individual observations
and on groups of observations. Robust procedures can sometimes reveal
this structure, but downweight or discard some observations. The novelty
in our book is to combine robustness and a "forward" search through the
data with regression diagnostics and computer graphics. We provide easily
understood plots that use information from the whole sample to display the
effect of each observation on a wide variety of aspects of the fitted model.
This bald statement of the contents of our book masks the excitement
we feel about the methods we have developed based on the forward search.
We are continuously amazed, each time we analyze a new set of data,
by the amount of information the plots generate and the insights they
provide. We believe our book uses comparatively elementary methods to
move regression in a completely new and useful direction.
We have written the book to be accessible to students and users of
statistical methods, as well as for professional statisticians. Because statis-
tics requires mathematics, computing and data, we give an elementary
outline of the theory behind the statistical methods we employ. The pro-
gramming was done in GAUSS , with graphs for publication prepared in
S-Plus. We are now developing S-Plus functions and have set up a web site
http://stat.econ.unipr.it/riani/ar which includes programs and the
data. As our work on the forward search grows, we hope that the material
on the website will grow in a similar manner.
The first chapter of this book contains three examples of the use of the
forward search in regression. We show how single and multiple outliers
can be identified and their effect on parameter estimates determined. The
second chapter gives the theory of regression, including deletion diagnostics,
and describes the forward search and its properties.
Chapter Three returns to regression and analyzes four further examples.
In three of these a better model is obtained if the response is transformed,
perhaps by regression with the logarithm of the response, rather than with
the response itself. The transformation of a response to normality is the
subject of Chapter Four which includes both theory and examples of data
analysis. We use this chapter to illustrate the deleterious effect of outliers
on methods based on deletion of single observations.
Chapter Four ends with an example of transforming both sides of a
regression model. This is one example of the nonlinear models that are the
subject of Chapter Five. The sixth chapter is concerned with generalized
linear models. Our methods are thus extended to the analysis of data from
contingency tables and to binary data.
The theoretical material is complemented by exercises. We give references
to the statistical literature, but believe that our book is reasonably self-
contained. It should serve as a textbook for courses on applied regression
and generalized linear models, even if the emphasis in such courses is not
on the forward search.
This book is concerned with data in which the observations are inde-
pendent and in which the response is univariate. A companion volume,
coauthored with Andrea Cerioli and tentatively called Robust Diagnostic
Data Analysis, is under active preparation. This will cover topics in the
analysis of multivariate data including regression, transformations, princi-
pal components analysis, discriminant analysis, clustering and the analysis
of spatial data.
The writing of this book, and the research on which it is based, has been
both complicated and enriched by the fact that the authors are separated
by half of Europe. Our travel has been supported by the Italian Ministry
for Scientific Research, by the Staff Research Fund of the London School of
Economics and, also at the LSE, by STICERD (The Suntory and Toyota
International Centres for Economics and Related Disciplines). The develop-
ment of S-Plus functions was supported by Doug Martin of MathSoft Inc.
Kjell Konis helped greatly with the programming. We are grateful to our
numerous colleagues for their help in many ways. In England we especially
thank Dr Martin Knott at the London School of Economics, who has been
an unfailingly courteous source of help with both statistics and computing.
In Italy we thank Professor Sergio Zani of the University of Parma for his
insightful comments and continuing support and Dr Aldo Corbellini of the
same university who has devoted time, energy and skill to the creation of
our web site. Luigi Grossi and Fabrizio Laurini read the text with great
care and found some mistakes. We would like to be told about any others.
Anthony Atkinson's visits to Italy have been enriched by the warm hospi-
tality of Giuseppina and Luigi Riani. To all our gratitude and thanks.

Anthony Atkinson
a.c.atkinson@lse.ac.uk
www.lse.ac.uk/experts/
Marco Riani
mriani@unipr.it
stat.econ.unipr.it/riani

London and Parma, February 2000


Contents

Preface vii

1 Some Regression Examples 1


1.1 Influence and Outliers 1
1.2 Three Examples . . . . . 2
1.2.1 Forbes' Data .. . 2
1.2.2 Multiple Regression Data. 5
1.2.3 Wool Data . . . . . . . 9
1.3 Checking and Building Models . . 14

2 Regression and the Forward Search 16


2.1 Least Squares . . . . . . . . . . 16
2.1.1 Parameter Estimates .. 16
2.1.2 Residuals and Leverage. 18
2.1.3 Formal Tests. 19
2.2 Added Variables . . . . . . . . . 20
2.3 Deletion Diagnostics . . . . . . 22
2.3.1 The Algebra of Deletion 22
2.3.2 Deletion Residuals . . . 23
2.3.3 Cook's Distance . . . . . 24
2.4 The Mean Shift Outlier Model. 26
2.5 Simulation Envelopes . . . 27
2.6 The Forward Search . . . . 28
2.6.1 General Principles. 28
2.6.2 Step 1: Choice of the Initial Subset . . . 31
2.6.3 Step 2: Adding Observations During the Forward Search . . . 32
2.6.4 Step 3: Monitoring the Search 33
2.6.5 Forward Deletion Formulae 34
2.7 Further Reading . 35
2.8 Exercises 36
2.9 Solutions 37

3 Regression 43
3.1 Hawkins' Data . 43
3.2 Stack Loss Data 50
3.3 Salinity Data 62
3.4 Ozone Data 67
3.5 Exercises 73
3.6 Solutions. 74

4 Transformations to Normality 81
4.1 Background 81
4.2 Transformations in Regression 82
4.2.1 Transformation of the Response 82
4.2.2 Graphics for Transformations 86
4.2.3 Transformation of an Explanatory Variable 87
4.3 Wool Data. 88
4.4 Poison Data 95
4.5 Modified Poison Data . 98
4.6 Doubly Modified Poison Data: An Example of Masking 101
4.7 Multiply Modified Poison Data-More Masking 104
4.7.1 A Diagnostic Analysis 104
4.7.2 A Forward Analysis . 106
4.7.3 Other Graphics for Transformations. 108
4.8 Ozone Data 110
4.9 Stack Loss Data . . . 111
4.10 Mussels' Muscles: Transformation of the Response. 116
4.11 Transforming Both Sides of a Model. 121
4.12 Shortleaf Pine 124
4.13 Other Transformations and Further Reading 127
4.14 Exercises 128
4.15 Solutions. . . . . 129

5 Nonlinear Least Squares 136


5.1 Background 137
5.1.1 Nonlinear Models 137
5.1.2 Curvature 141
5.2 The Forward Search. . . . . . . . . . . 148


5.2.1 Parameter Estimation . . . . . 148
5.2.2 Monitoring the Forward Search 150
5.3 Radioactivity and Molar Concentration of Nifedipene 151
5.4 Enzyme Kinetics 154
5.5 Calcium Uptake . . . . . . 159
5.6 Nitrogen in Lakes . . . . . 164
5.7 Isomerization of n-Pentane 170
5.8 Related Literature. 173
5.9 Exercises 174
5.10 Solutions... . .. 176

6 Generalized Linear Models 179


6.1 Background . . . . . . . 180
6.1.1 British Train Accidents. 180
6.1.2 Bliss's Beetle Data 181
6.1.3 The Link Function . . . 181
6.2 The Exponential Family . . . . 185
6.3 Mean, Variance, and Likelihood 185
6.3.1 One Observation . . . . 185
6.3.2 The Variance Function 186
6.3.3 Canonical Parameterization 188
6.3.4 The Likelihood . . . . . . 188
6.4 Maximum Likelihood Estimation 189
6.4.1 Least Squares . . . . . . . 189
6.4.2 Weighted Least Squares 190
6.4.3 Newton's Method for Solving Equations 190
6.4.4 Fisher Scoring. 191
6.4.5 The Algorithm 192
6.5 Inference . . . . . . . . 194
6.5.1 The Deviance 194
6.5.2 Estimation of the Dispersion Parameter 197
6.5.3 Inference About Parameters 197
6.6 Checking Generalized Linear Models 198
6.6.1 The Hat Matrix 198
6.6.2 Residuals . . . . . . .. . 198
6.6.3 Cook's Distance .. .. . 200
6.6.4 A Goodness of Link Test 200
6.6.5 Monitoring the Forward Search 201
6.7 Gamma Models . . . . . . . . . 202
6.8 Car Insurance Data . . . . . . . 204
6.9 Dielectric Breakdown Strength . 209
6.10 Poisson Models . . . . . . . 221
6.11 British Train Accidents . . . 222
6.12 Cellular Differentiation Data 226
6.13 Binomial Models . . . 230


6.14 Bliss's Beetle Data .. 232
6.15 Mice with Convulsions 234
6.16 Toxoplasmosis and Rainfall . 238
6.16.1 A Forward Analysis. 238
6.16.2 Comparison with Backwards Methods 245
6.17 Binary Data. . . . . . . . . . . . . . . . . . 246
6.17.1 Introduction: Vasoconstriction Data . 246
6.17.2 The Deviance . . . . . . . . . . . . . 248
6.17.3 The Forward Search for Binary Data 249
6.17.4 Perfect Fit . . . . . . . . . . . . . . . 250
6.18 Theory: The Effect of Perfect Fit and the Arcsine Link 253
6.19 Vasoconstriction Data and Perfect Fit. 256
6.20 Chapman Data . . . . . . . . . . . 259
6.21 Developments and Further Reading 265
6.22 Exercises 267
6.23 Solutions . . . . . . . . . . . . . . . 268

A Data
Bibliography 311

Author Index 319

Subject Index 323


Tables of Data

A.1 Forbes' data on air pressure in the Alps and the boiling point of water . . . 278
A.2 Multiple regression data showing the effect of masking .. 279
A.3 Wool data: number of cycles to failure of samples of worsted yarn in a 3³ experiment . . . 281
A.4 Hawkins' data simulated to baffle data analysts . . . . .. 282
A.5 Brownlee's stack loss data on the oxidation of ammonia. The response is ten times the percentage of ammonia escaping up a stack, or chimney . . . 285
A.6 Salinity data. Measurements on water in Pamlico Sound, North Carolina . . . 286
A.7 Ozone data: ozone concentration at Upland, CA as a
function of eight meteorological variables. . . . . . . . . .287
A.8 Box and Cox poison data. Survival times in 10-hour units of animals in a 3 × 4 factorial experiment. Each cell in the table includes both the observation number and the response . . . 289
A.9 Mussels data from Cook and Weisberg. The response is the
mass of the edible portion of the mussel . . . . . . . ... 290
A.10 Shortleaf pine. The response is the volume of the tree, x1 the girth and x2 the height . . . 292
A.11 Radioactivity and the molar concentration of nifedipene . . . 294
A.12 Enzyme kinetics data. The response is the initial velocity of
the reaction . . . . . . . . .. . . . . . . . . .. . . . . . 295
A.13 Calcium data. Calcium uptake of cells suspended in a
solution of radioactive calcium. . . . . . . . . . . ... . . 296
A.14 Nitrogen concentration in American lakes . . . . . . ... 297


A.15 Reaction rate for the catalytic isomerization of n-pentane to
isopentane . . . . . . . . . . . . . . .. . . . . . . . . . . 298
A.16 Car insurance data from McCullagh and Nelder. The response is the average claim, in £. Also given are observation number and m, the number of claims in each category . . . 299
A.17 Dielectric breakdown strength in kilovolts from a 4 x 8
factorial experiment . . . . . . . . . . . . . . . . . . . 300
A.18 Deaths in British Train Accidents. . . . . . . . . . . . 302
A.19 Number of cells showing differentiation in a 4² experiment . . . 304
A.20 Bliss's beetle data on the effect of an insecticide . . . . . 304
A.21 Number of mice with convulsions after treatment with insulin . . . 305
A.22 Toxoplasmosis incidence and rainfall in 34 cities in El
Salvador .. . . . . . . . . . . . . . . . . . . . . . . . . . 306
A.23 Finney's data on vasoconstriction in the skin of the fingers 307
A.24 Chapman's data on the incidence of heart disease as a function of age, cholesterol concentration and weight . . . 308
1
Some Regression Examples

1.1 Influence and Outliers


Regression analysis is the most widely used technique for fitting models to
data. This book is not confined to regression, but we use three examples of
regression to introduce our general ideas.
When a regression model is fitted by least squares, the estimated pa-
rameters of the fitted model depend on a few statistics aggregated over all
the data. If some of the observations are different in some way from the
bulk of the data, the fitting process may disguise the differences , forcing
all observations into the same straightjacket. It is the purpose of this book
to describe a series of powerful general methods for detecting and inves-
tigating observations that differ from the bulk of the data. These may be
individual observations that do not belong to the general model, that is
outliers. Or there may be a subset of the data that is systematically dif-
ferent from the majority. We are concerned not only with identification of
such observations, but also with the effect that they have on parameter
estimates and on inferences about models and their suitability.
In our first example there is just one outlier, which is easily detected by
plots of residuals. In slightly more complicated examples the outlier may not
be obvious from residual plots - it is said to be "masked." A single masked
outlier is easily detected by the methods of deletion diagnostics in which one
observation at a time is deleted , followed by the calculation of new residuals
and parameter estimates. The formulae for the effect of deletion are given
in §2.3. With two outliers, pairs of observations can be deleted, and the
process can be extended to the deletion of several observations at a time.


A difficulty both for computation and interpretation is the explosion of the
number of combinations to be considered. An alternative is the repeated
application of single deletion methods. We call such methods "backwards";
they start from a fit to all the data and delete one observation at a time. The
size of the subset of observations used in fitting decreases as the method
proceeds. Our second example shows how such a backwards procedure can
fail in the presence of masking.
Instead we advocate a "forward" procedure in which the basic idea is to
order the observations by their closeness to the fitted model. We start with
a fit to very few observations and then successively fit to larger subsets. The
starting point is found by fitting to a large number of small subsets, using
methods from robust statistics to determine which subset fits best. We
then order all observations by closeness to this fitted model; for regression
models the residuals determine closeness. For multivariate models, which
are the subject of a second book, we use Mahalanobis distances. The subset
size is increased by one and the model refitted to the observations with the
smallest residuals for the increased subset size. Usually one observation
enters, but sometimes two or more enter the subset as one or more leave.
The process continues with increasing subset sizes until, finally, all the
data are fitted . As a result of this forward search we have an ordering of
the observations by closeness to the assumed model.
The ordering of the observations we achieve takes us from a very robust
fit to, for regression, ordinary least squares. If the model and data agree,
the robust and least squares fits will be similar, as will be the parame-
ter estimates and residuals from the two fits. But often the estimates and
residuals of the fitted model change appreciably during the forward search.
We monitor the changes in these quantities and in various statistics, such
as score tests for transformation, as we move forward through the data,
adding one observation at a time. As we show, this forward procedure pro-
vides a wealth of information not only for outlier detection but, much more
importantly, on the effect of each observation on aspects of inference about
the model. The details of the procedure are in Chapter 2. We start in this
chapter with three examples to show some of the principles of our method.
We follow the examples with a few comments on types of departures from,
and failures of, models.

1.2 Three Examples


1.2.1 Forbes' Data
The data in Appendix Table A.1, described in detail by Weisberg (1985,
pp. 2-4), are used by him to introduce the ideas of regression analysis.
There are 17 observations on the boiling point of water in °F at different

Figure 1.1. Forbes' data: scatter plot of 100 × log pressure against boiling point.
There is a suggestion of one outlier

pressures, obtained from measurements at a variety of elevations in the


Alps. The purpose of the original experiment was to allow prediction of
pressure from boiling point, which is easily measured, and so to provide an
estimate of altitude. The higher the altitude, the lower the pressure and
the consequent boiling point, which is why it is claimed that no Scotsman
would eat porridge cooked on the top of Ben Nevis.
Weisberg gives values of both pressure and 100 × log(pressure) as
possible response. We consider only the latter, so that the variables are:

x: boiling point, °F
y: 100 × log(pressure).

The data are plotted in Figure 1.1. A quick glance at the plot shows there
is a strong linear relationship between log (pressure) and boiling point. A
slightly longer glance reveals that one of the points lies slightly off the line.
Linear regression of y on x yields a t value for the regression of 54.45, clear
evidence of the significance of the relationship.
Two plots of the least squares residuals e are often used to check fitted
models. Figure 1.2(left) shows a plot of residuals against fitted values ŷ.
This clearly shows one outlier, observation 12. The normal plot of the stu-
dentized residuals, Figure 1.2(right), is an almost straight line from which
the large residual for observation 12 is clearly distanced. It is clear that
observation 12 is an outlier.
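The figures in this book were prepared in S-Plus; purely as an illustration, the sketch below draws the two checking plots just described, residuals against fitted values and a normal probability plot, for a straight-line fit in Python. The data here are simulated (they are not Forbes' data), with one shifted case standing in for an outlier such as observation 12, and all names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(195, 212, 17)                  # invented boiling-point-like values
y = 0.9 * x - 42 + rng.normal(0, 0.1, size=17)
y[11] += 1.0                                   # plant a single outlier (case 12)

X = np.column_stack([np.ones_like(x), x])      # carriers: intercept and x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
e = y - yhat                                   # least squares residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))
ax1.scatter(yhat, e)                           # residuals against fitted values
ax1.set_xlabel("Predicted values"); ax1.set_ylabel("Residuals")
stats.probplot(e, dist="norm", plot=ax2)       # normal probability plot
plt.tight_layout(); plt.savefig("residual_checks.png")
```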
Now that observation 12 has been identified as different, two strategies
can be followed. One is to delete it from the data and to refit to the re-

Figure 1.2. Forbes' data: (left) least squares residuals e against predicted values
ŷ, showing that observation 12 is an outlier; (right) normal plot of the studentized
residuals, with 90% simulation envelope, confirming that observation 12 is indeed
an outlier

maining 16 observations. The other is to try to find out whether there is


something different about observation 12. The simplest place to start is to
check data entry, then transcription, working back, in the case of survey
or laboratory data to the primary source of data recording. Taking further
observations may also be a possibility, although there may be systematic
differences between the new and old observations, so that blocking effects
need to be watched.
The plots show a single outlier. It is not however clear whether this out-
lier is important. How does its presence change the inferences drawn from
the data, such as the t test for regression, or the estimates of the parame-
ters? If observation 12 is deleted, are there now other outliers, which were
previously masked? To answer such questions we could proceed backwards
through the data, deleting observation 12, refitting and comparing new and
old parameter estimates and redrawing plots such as Figures 1.2(left) and
(right). Our forward method allows us to answer all such questions from a
single search through the data.
We start with a least squares fit to two observations, robustly chosen
as described in the next chapter. From this fit we calculate the residuals
for all 17 observations and next fit to the three observations with smallest
residuals. In general we fit to a subset of size m, order the residuals and
take as the next subset the m + 1 cases with smallest residuals. This gives
a forward search through the data, ordered by closeness to the model. We
expect that the last observations to enter the search will be those which are
furthest from the model and so may cause changes once they are included in
the subset used for fitting. In our search through Forbes' data, the outlying
observation 12 was the last to enter the search.
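A minimal sketch of this forward search, written by us in Python rather than the authors' GAUSS, and with one simplification: the initial subset is chosen here as the best of many randomly sampled least squares fits judged by the median squared residual, a stand-in for the least median of squares fit described in Chapter 2. At each step the subset of size m + 1 consists of the m + 1 cases with smallest squared residuals from the fit to the current subset, and the parameter estimates and residual mean square are recorded.

```python
import numpy as np

def forward_search(X, y, n_start=1000, rng=None):
    """Crude forward search for a linear regression model (illustration only)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Step 1: choose an initial subset of size p as the best of many random
    # subsets; the book uses least median of squares, approximated here by
    # the median squared residual of an ordinary least squares fit.
    best, best_crit = None, np.inf
    for _ in range(n_start):
        idx = rng.choice(n, size=p, replace=False)
        b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        crit = np.median((y - X @ b) ** 2)
        if crit < best_crit:
            best, best_crit = idx, crit
    subset, history = best, []
    # Steps 2 and 3: refit to the current subset, record what is of interest,
    # then take the m + 1 cases closest to the fitted model as the next subset.
    for m in range(p, n + 1):
        b, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        e2 = (y - X @ b) ** 2                     # squared residuals for all n cases
        s2 = e2[subset].sum() / max(m - p, 1)     # residual mean square
        history.append((m, b.copy(), s2))
        if m < n:
            subset = np.argsort(e2)[: m + 1]      # m + 1 smallest squared residuals
    return history
```

With Forbes' data, X would be the 17 × 2 matrix of an intercept and boiling point, and the recorded residual mean square would jump at the final step when observation 12 enters.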
For each value of m from 2 to 17 we calculate quantities such as the resid-
uals and the parameter estimates and see how they change. Figure 1.3(left)

Figure 1.3. Forbes ' data: parameter estimates from the forward search: (left)
slope and intercept So
and Sl
(the values are virtually unaffected by the outlying
observation 12); (right) the value of the estimate of (72 increases dramatically
when observation 12 is included in the last step of the search

is a plot of the values of the parameter estimates during the forward search.
The values are extremely stable, reflecting the closeness of all observa-
tions to the straight line. The introduction of observation 12 at the end of
the search causes virtually no change in the position of the line. However,
Figure 1.3(right) shows that introduction of observation 12 causes a huge
increase in s², the residual mean square estimate of the error variance σ².
The information from these plots about observation 12 confirms and quan-
tifies that from the scatterplot of Figure 1.1: observation 12 is an outlier,
but the observation is at the centre of the data, so that its exclusion or in-
clusion has a small effect on the estimated parameters. The plots also show
that all other observations agree with the overall model. This is also the
conclusion from Figure 1.4 which shows the residuals during the forward
search. Throughout the search, all cases have small residuals, apart from
case 12 which is outlying from all fitted subsets. Even when it is included
in the last step of the search, its residual only decreases slightly.
Our analysis shows that Forbes' data have a simple structure - there
is one outlying observation, 12, that is not influential for the estimates
of the parameters of the linear model. Inclusion of this observation does
however cause the estimate s² to increase from 0.0128 to 0.1436 with a
corresponding decrease in the t statistic for regression from 180.73 to 54.45.
We now consider a much more complicated example for which the forward
search again illuminates the structure of the data.

1.2.2 Multiple Regression Data


With one explanatory variable it is not difficult to understand the structure
of the data. In the previous example the outlier was obvious, as would

Figure 1.4. Forbes' data: forward plot of least squares residuals scaled by the
final estimate of σ. Observation 12 is an outlier during the whole of this stable
forward search

have been a point of high leverage. Patterns in the outliers from the linear
regression might also have indicated the need for a transformation of y or for
the inclusion of further terms, for example, a quadratic, in the linear model.
However the forward search provides a powerful method of understanding
the effect of groups of observations on inferences. In the following analysis
we focus on the presence of a group of outliers and their effect on the t
test of one parameter. This complicated structure is clearly revealed by the
forward search.
Table A.2 gives 60 observations on a response y with the values of three
explanatory variables. The scatterplot matrix of the data in Figure 1.5
shows y increasing with each of x1, x2 and x3. The plot of residuals against
fitted values, Figure 1.6(left), shows no obvious pattern, unlike that of
Figure 1.2. The largest residual is that of case 43. However the normal
probability plot of Figure 1.6(right) shows that this residual lies within the
simulation envelope. The finer detail of this plot hints at some structure,
but it is not clear what. There is thus no clear indication that the data are
not homogeneous and well behaved.
Evidence of the structure of the data is clearly shown in Figure 1.7,
the scaled squared residuals from the forward search. This fascinating plot
reveals the presence of six masked outliers. The left-hand end of the plot
gives the residuals from the least median of squares estimates found by
sampling 1,000 subsets of size p = 4. From the most extreme residual
downwards, the cases giving rise to the outliers are 9, 30, 31, 38, 47 and
21. When all the data are fitted the largest 10 residuals belong to, in order,

Figure 1.5. Multiple regression data: scatterplot matrix of response and three
variables


Figure 1.6. Multiple regression data: (left) least squares residuals e against fitted
values ŷ; (right) normal QQ plot of studentized residuals

Figure 1.7. Multiple regression data: forward plot of squared least squares residuals
scaled by the final estimate of σ. Six masked outliers are evident in the earlier
stages of the search, but the largest residual at the end of the search belongs to
the nonoutlying observation 43

cases 43, 51, 2, 47, 31, 9, 38, 29, 7 and 48. The first outlier to be included
in this list produces the fourth largest residual and only four outliers are
included at all.
The assessment of the importance of these outliers can be made by con-
sidering the behaviour of the parameter estimates and of the related t
statistics. Apart from β1 all remain positive with t values around 10 or
greater during the course of the forward search. We therefore concentrate
on the behaviour of t1, the t statistic for β1. The values for the last 20 steps
of the forward search are plotted in Figure 1.8(left). The general downwards
trend is typical of plots of t statistics from the forward search. It is caused
by the increasing value of s², Figure 1.8(right), as observations with larger
residuals are entered during the search. This figure also indicates the pres-
ence of some outliers by the unsmooth behaviour in the last three steps. If
the data can be ordered in agreement with the model, the curve is typically
monotonic.
An important feature in the interpretation of Figure 1.8(left) is the two
upward jumps in the value of the statistic. The first results from the in-
clusion of observation 43 when m = 54, giving a t value of 2.25, evidence,
significant at the 3% level, of a positive value for β1. Thereafter the out-
liers enter the subset, with observation 43 leaving when m = 58, as two
outliers enter. When m = 59 the value of the statistic has decreased to
−1.93, close to evidence for a negative value of the parameter. Reintroduc-

Figure 1.8. Multiple regression data: (left) the t statistic for β1 during the forward
search and (right) the increase in the estimate of σ²; in both figures the jumps
in the curves are caused by the inclusion of outliers

tion of observation 43 in the last step of the search results in a value of


-1.26, indicating that β1 may well be zero. It is therefore important that
the outliers be identified.
This example shows very clearly the existence of masked outliers, which
would not be detected by the backward procedures of customary regression
diagnostics, which would indicate the importance of observation 43. How-
ever the forward plot of residuals in Figure 1.7 clearly indicates a structure
that is hidden in the conventional plot of residuals in Figure 1.6(right).

1.2.3 Wool Data


In this example we show the effect of the ordering of the data during the
forward search on the estimates of regression coefficients and the error
variance as well as on a score statistic for transformation of the response.
Table A.3, taken from Box and Cox (1964) , gives the number of cycles
to failure of a worsted yarn under cycles of repeated loading. The results
are from a single 3³ factorial experiment. The three factors and their levels
are:

x1: length of test specimen (25, 30, 35 cm)
x2: amplitude of loading cycle (8, 9, 10 mm)
x3: load (40, 45, 50 g).

The number of cycles to failure ranges from 90, for the shortest specimen
subject to the most severe conditions, to 3,636 for observation 19 which
comes from the longest specimen subjected to the mildest conditions. In
their analysis Box and Cox (1964) recommend that the data be fitted after

Figure 1.9. Wool data: (left) least squares residuals e against fitted values ŷ;
(right) normal QQ plot of studentized residuals

the log transformation of y. We start with an analysis of the untransformed


data, to show the information provided by the forward search.
Figure 1.9(left) is a plot of residuals against fitted values when a first-
order model in the three factors is fitted to the data. It has a curved shape
with increasing variability at the right-hand end of the plot, typical ev-
idence of the need for a transformation. Similar evidence is provided by
the normal plot of residuals in Figure 1.9(right). Here the curved shape is
a reflection of the skewed distribution of the residuals. To investigate the
impact of individual cases on the fit, we turn to the forward search.
The forward plot of residuals is given in Figure 1.10; in this plot we give
the scaled residuals themselves, rather than the squared values. It is typical
of such plots that the residuals in the early stages are far from symmetrical:
only the residuals of the m observations in the subset are constrained to
sum to zero. For most of the search the four largest residuals are for cases
19, 20, 21 and 22. Since the data are in standard order for a three-level
factorial , these consecutive case numbers suggest some systematic failure
of the model. In fact these are the four largest observations, arising when the
first factor is at its highest level and, for the three largest, the second factor
is at its lowest. Such extreme observations are likely to provide evidence
for a transformation.
Figure 1.11(left) is a plot, as the forward search proceeds, of the approx-
imate score statistic for transformation which is described and exemplified
in detail in Chapter 4. The null distribution is approximately normal. If
the data do not need transformation the values should lie within the 99%
limits of ±2.58 shown on the plot. However, the value of the statistic trends
steadily downward, indicating that the evidence for a transformation is not
confined to just the last few large observations, but that there are contri-
butions from all cases. The negative value of the statistic indicates that a
transformation such as the log or the reciprocal should be considered.

Figure 1.10. Wool data: forward plot of least squares residuals scaled by the final
estimate of σ. The three largest residuals can be directly related to the levels of
the factors

"
~
0
0
~ u;>
0
0
u; U")

u; ~
~ 0 '"
~ '";"
0
"
(/)
U")
0
0
0
'";" 0
U")

0
C)I 0

10 15 20 25 5 10 15 20 25
Subset size m Subset size m

Figure 1.11. Wool data: (left) score test for transformation during the forward
search and (right) the increasing value of the estimate s²

Figure 1.12. Wool data: (left) the multiple correlation coefficient R2 during the
forward search and (right) the values of the parameter estimates

Other forward plots indicate the way in which the model changes as more
cases are introduced. The forward plot of s² in Figure 1.11(right) increases
dramatically towards the end, whereas that of R2, Figure 1.12(left), de-
creases to around 0.8 for part of the search, with a final value of 0.729.
Further evidence of a relationship that changes with the search is given by
the forward plot of estimated coefficients in Figure 1.12(right). Initially the
values are stable, but later they start to diverge.
These plots are to be contrasted with those from the forward search for
the transformed data when the response is log y. The plot of residuals,
Figure 1.13, suggests that perhaps cases 24 and 27 are outlying. But what
effect do they have on inferences drawn from the data? Figure 1.14(left), the
forward plot of the approximate score statistic for transformation, shows
the logarithmic transformation as acceptable; the cases giving rise to large
residuals, which enter at the end of the search, have no effect whatsoever
on the value of the statistic. The plot of the parameter estimates in Fig-
ure 1.14(right) shows how stable the estimates of the parameters are during
the forward search. The value of s², Figure 1.15(left), increases towards the
end of the search as cases with larger residuals enter. The same pattern,
in reverse, is shown by Figure 1.15(right) for R2 , which decreases in a
smooth fashion as the later cases enter the subset. Despite the decrease,
the value of R2 is now 0.966 for all cases, a great increase from 0.729 for
the untransformed data.
In one sense these last four plots are noninformative. If an interesting
diagnostic plot is one that reveals some unexpected or unexplained feature
of the data, these are boring. However they serve as prototypes of the plots
that we expect to see when model and data agree.

Figure 1.13. Transformed wool data: forward plot of least squares residuals for
log y scaled by the final estimate of σ. Are observations 24 and 27 outlying?


Figure 1.14. Transformed wool data: (left) score test for transformation during
the forward search, showing that the log transformation is satisfactory and (right)
the extremely stable values of the parameter estimates

Figure 1.15. Transformed wool data: (left) the increasing value of the estimate
s² during the forward search and (right) the smoothly decreasing value of the
squared multiple correlation coefficient R²

1.3 Checking and Building Models


Departures from models can be isolated or systematic. The analyses of data
in this chapter provide examples of both kinds. In the analysis of Forbes'
data we detected a single outlier, an isolated departure that did not ap-
preciably affect inferences about the model to be fitted to the data. In the
analysis of the wool data on the original scale there was a systematic fail-
ure of the model involving nearly all observations. This was rectified by
a suitable transformation of the response. The remaining analysis, that of
the regression data, shows a more complicated interplay between isolated
and systematic departures: a group of outliers, an isolated departure, led
to changes in inference about the coefficient of x1, and so potentially to a
systematic failure of the model. This interplay between isolated and sys-
tematic departures is often the hardest form of model failure to detect. It
is one which is often clearly revealed by procedures based on the forward
search.
Many assumptions are made when fitting a regression model. Some are
listed below. As many as possible should be critically examined for any set
of data. One group of questions concerns the linear model.

• Does the linear model include all relevant explanatory variables? A


potential difficulty is that not all the relevant variables may have been
recorded. A relatively simple example is data collected in time order,
when a time trend can be included in the model. A more difficult
case is when two or more batches of raw material have been used,
but the only indication is some systematic pattern in the residuals or
parameter estimates.
• Does the linear model contain any irrelevant variables? There are
several standard methods for removing variables from models, usually
described as variable selection.
• Are the variables in the right form, or should they be transformed? In
our analysis of Forbes' data we regressed log pressure on temperature.
Brown (1993, p. 3) observes that the Clausius-Clapeyron equation in-
dicates that the reciprocal of absolute temperature is linearly related
to log pressure. Over the range of the data the two models are not
easily distinguished. But the difference could become important if the
model were to be used for extrapolation. Methods for the choice of a
transformation are the subject of Chapter 4.
• Are there sufficient terms in the model, or are extra carriers needed,
for example, quadratic or interaction terms in the variables already
in the model?
The linear model is only part of a regression model. Even if a regression
model is appropriate, there are also a number of questions that need to be
answered about the errors.
• Do the errors have common variance? If not, weighted least squares
may be appropriate, for example, if the observations are averages of
varying numbers of readings.
• Are the errors approximately normal? If not , can they be transformed
to approximate normality by the methods described in Chapter 4?
• Are the errors independent? If not , are time series methods
appropriate?
If the errors are not normal but, for example, binomial, the regression model
will need replacing by some other member of the family of generalized linear
models. Then, in addition to the choice of the linear predictor, the choice
of a suitable link function also needs to be investigated and scrutinized.
This rich family of models forms the subject of Chapter 6.
Examples of many of these choices arise in successive chapters and we
show how the forward search provides information to guide model building.
Some references to standard procedures are given at the end of Chapter 2.
2
Regression and the Forward Search

The basic algebra of least squares is presented in the first section of the
chapter, followed by that for added variables, which is used in the con-
struction of some score tests for regression models, particularly that for
transformations in Chapter 4. Related results are needed for testing the
goodness of the link in a generalized linear model, Chapter 6. Several of
the quantities monitored during the forward search come from considering
the effect of deletion of an observation. Deletion diagnostics are described
in §2.3 of the chapter and, in §2.4, related to the mean shift outlier model.
Simulation envelopes are described in §2.5 and the forward search is de-
fined and discussed in §2.6. The chapter concludes with some suggestions
for further reading.

2.1 Least Squares


2.1.1 Parameter Estimates
In linear regression models, such as those used in the first chapter, there
are n observations on a continuous response y. The expected value of
the response E(Y) is related to the values of p known constants by the
relationship
    E(Y) = Xβ.    (2.1)
In this formulation, Y is the n × 1 vector of responses, X is the n × p matrix
of known constants and β is a vector of p unknown parameters. If some
columns of X, the matrix of carriers or independent variables, are produced


by a random mechanism, we argue conditionally on the observed values and
so ignore the random element. Usually we assume that X is of full rank
p. In polynomial regression the p carriers are functions of k explanatory
variables. All examples in Chapter 1 were for first-order polynomials in
which p = k + 1.
The model for the ith of the n observations can be written in several
ways as, for example,
    y_i = η(x_i, β) + ε_i = x_i^T β + ε_i = β_0 + Σ_{j=1}^{p-1} β_j x_{ij} + ε_i.    (2.2)

Under "second-order" assumptions the errors Ei have zero mean, constant


variance (j2 and are uncorrelated. That is,

i =J
(2.3)
i=!=j ,
conditions on only the first two moments of the Ei . We assume in the
regression chapters that, in addition, the errors are normally distributed.
The least squares estimates β̂ of the parameters β minimize the sum of
squares

    S(β) = (y − Xβ)^T (y − Xβ)    (2.4)

and so satisfy the relationship

    X^T (y − X β̂) = 0,    (2.5)

which yields the normal equations

    X^T X β̂ = X^T y.    (2.6)

The least squares estimates are therefore

    β̂ = (X^T X)^{-1} X^T y,    (2.7)

a linear combination of the observations, which will be normally distributed
if the observations are.
These estimates have been found by minimizing the sum of squares S(β).
The minimized value is the residual sum of squares

    S(β̂) = (y − X β̂)^T (y − X β̂)
         = y^T y − y^T X (X^T X)^{-1} X^T y
         = y^T {I_n − X (X^T X)^{-1} X^T} y,    (2.8)

where I_n is the n × n identity matrix. When, as in this chapter, the
dimension of the matrix is obvious we write I instead of I_n.
The vector of n predictions from the fitted model is

    ŷ = X β̂ = X (X^T X)^{-1} X^T y = Hy,    (2.9)
where the projection matrix H is called the hat matrix because it "puts
the hats on y" . It has an important role in the algebra of least squares. For
example, let the ith residual be e_i = y_i − ŷ_i, so that the vector of residuals
is

    e = y − ŷ = y − X β̂ = (I − H)y.    (2.10)
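As a numerical illustration of (2.6)-(2.10), and not code from the book, the following Python/NumPy fragment forms the least squares estimate, the hat matrix, the fitted values and the residuals for simulated data; the idempotency of H provides a quick check.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # carriers
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations (2.6)-(2.7)
H = X @ np.linalg.solve(X.T @ X, X.T)              # hat matrix, equation (2.9)
y_hat = H @ y                                      # fitted values
e = y - y_hat                                      # residuals (2.10)
S = e @ e                                          # residual sum of squares (2.8)
assert np.allclose(H @ H, H), "H should be idempotent (Exercise 2.1)"
```

Solving the normal equations via np.linalg.solve avoids forming (X^T X)^{-1} explicitly; in practice a QR factorization is the more stable choice for ill-conditioned carriers.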

2.1.2 Residuals and Leverage


There are several different residuals that may be of importance. Where
confusion can occur we call e the least squares residuals. The residual sum
of squares can also be written explicitly in terms of H, since, from (2.8),

    S(β̂) = Σ_{i=1}^{n} e_i² = y^T (I − H) y.    (2.11)

Comparison of (2.10) and (2.11) indicates that H, and I − H, are
idempotent; that is, H^T H = H (Exercise 2.1).
In addition to providing estimates of the parameters β of the linear
model, least squares also provides the residual mean square estimate of
the variance σ² as

    s² = S(β̂)/(n − p).    (2.12)
Two further sets of residuals can be defined using this estimate of σ². The
values of the scaled residuals e_i/s do not depend on the value of σ². But,
like the least squares residuals, they do not all have the same variance.
Since the observations y are independent with constant variance σ², the
variance of the linear combination l = By is B B^T σ², where B is a matrix
of constants. Thus for the least squares residuals (2.10),

    var e = (I − H)(I − H)^T σ² = (I − H) σ²,    (2.13)

since (I − H) is idempotent. If we let the ith diagonal element of H be
h_i, it follows that var e_i = (1 − h_i) σ². The studentized residuals are then
defined as

    r_i = e_i / √{s²(1 − h_i)}.    (2.14)

The studentized residuals are widely used in model checking as we have


in the normal probability plot of the residuals for Forbes' data in Fig-
ure 1.2(right). However, although they all have the same variance, they are
not independent, nor do they follow a Student's t distribution. That this
is unlikely comes from supposing that e_i is the only large residual, when
s² ≈ e_i²/(n − p), so that the maximum value of the squared studentized
residual is bounded. Cook and Weisberg (1982, p. 19) show that r_i²/(n − p)
has a beta distribution.
The quantity h_i also occurs in the variance of the fitted values. From
(2.9),

    var ŷ = H H^T σ² = H σ²,    (2.15)

so that the variance of ŷ_i is σ² h_i. The value of h_i is called the leverage of
the ith observation. Since
    h_i = x_i^T (X^T X)^{-1} x_i        (i = 1, ..., n),    (2.16)

it follows that

    Σ_{i=1}^{n} h_i = tr(H) = tr{X (X^T X)^{-1} X^T} = tr{(X^T X)^{-1} X^T X} = p,    (2.17)

so that the average value of h_i is p/n, with 0 ≤ h_i ≤ 1 (Exercise 2.3). A
large value indicates high leverage. For such points the variance of ŷ_i will,
from (2.15), be close to σ², indicating that the fit is mostly determined by
the value of y_i. Likewise, from (2.13), the variance of the residual will be
small (Exercise 2.8). The effect of this local fit can be that an extra term
in a model may be included solely to give a good fit for Yi, with a small
residual. Inspection of plots of least squares, or even studentized, residuals
may not indicate how influential this observation is for the fitted model.
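The leverages and studentized residuals of this subsection are easily computed from the diagonal of the hat matrix; the sketch below is our illustration, not the authors' code, and uses the trace identity (2.17) as a check.

```python
import numpy as np

def leverage_and_studentized(X, y):
    """Leverages h_i and studentized residuals r_i (2.14) from a least squares fit."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
    h = np.diag(H)                               # leverages h_i
    e = y - H @ y                                # least squares residuals
    s2 = e @ e / (n - p)                         # residual mean square (2.12)
    r = e / np.sqrt(s2 * (1.0 - h))              # studentized residuals (2.14)
    assert np.isclose(h.sum(), p)                # trace of H equals p, equation (2.17)
    return h, r
```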

2.1.3 Formal Tests


To test the terms of the model we use t tests. Since, from (2.7), β̂ is a linear
function of the observations,

    var β̂ = σ² (X^T X)^{-1}.    (2.18)

If we let the kth diagonal element of (X^T X)^{-1} be v_k, the t test for testing
that β_k = 0 is

    t_k = β̂_k / √(s_ν² v_k),        k = 1, ..., p,    (2.19)

where s_ν² is an estimate of σ² on ν degrees of freedom. If β_k = 0, t_k has
a t distribution on ν degrees of freedom. We often use the residual mean
square estimate of σ² (2.12) so that ν = n − p.
These individual t tests are the signed square roots of the F tests from
the difference in the sums of squares when β_k is and is not included in
the model. If the explanatory variables are correlated, dropping x_k from
the model may cause appreciable change in the significance and even the
signs of the remaining t statistics. An example in the next chapter arises
in the analysis of the Ozone Data. Table 3.1 shows how dramatically the t
statistics can change as nonsignificant variables are deleted from the model.
The t statistics can also change dramatically during the forward search as
increasingly large subsets of observations are fitted.
In addition to the t tests we also find it helpful to look at the evolution


of the value of the squared multiple correlation coefficient R2 as we did for
the wool data in Figures 1.12(left) and 1.15(right). If the total corrected
sum of squares of the observations is

    S_0 = Σ_{i=1}^{n} (y_i − ȳ)²,

where ȳ = Σ y_i / n, the squared multiple correlation coefficient is defined as

    R² = {S_0 − S(β̂)} / S_0.    (2.20)
A value near one indicates that a large proportion of the total sum of
squares has been explained by the regression. However a large value, while
encouraging, says nothing about the contribution of particular groups of
observations to various aspects of the fit, such as the importance of specific
parameters.
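A short illustrative fragment (ours, not from the book) computing the t statistics (2.19), with the residual mean square on n − p degrees of freedom, and the squared multiple correlation coefficient (2.20).

```python
import numpy as np

def t_stats_and_R2(X, y):
    """t statistics (2.19) and squared multiple correlation R^2 (2.20)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    s2 = e @ e / (n - p)                          # residual mean square on n - p df
    t = beta / np.sqrt(s2 * np.diag(XtX_inv))     # t_k = beta_k / sqrt(s^2 v_k)
    S0 = np.sum((y - y.mean()) ** 2)              # total corrected sum of squares
    R2 = (S0 - e @ e) / S0                        # (2.20)
    return t, R2
```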

2.2 Added Variables


The added variable plot provides a method, in some circumstances, of
assessing the impact of individual observations on estimates β̂_k of single pa-
rameters in a multiple regression model. The starting point is to fit a model
including all variables except the one of interest, the "added" variable. The
plot is based on residuals of the response and of the added variable. The
added variable can be replaced by a "constructed" variable derived from
the data. We use this form for the score test for transformations derived in
Chapter 4. For the moment we concentrate on regression variables.
We extend the regression model to include an extra explanatory variable,
the added variable w, so that (2.1) becomes
    E(Y) = Xβ + wγ,    (2.21)

where γ is a scalar. The least squares estimate γ̂ can be found explicitly
from the normal equations for this partitioned model

    X^T X β̂ + X^T w γ̂ = X^T y    (2.22)

and

    w^T X β̂ + w^T w γ̂ = w^T y.    (2.23)

If the model without γ can be fitted, (X^T X)^{-1} exists and (2.22) yields

    β̂ = (X^T X)^{-1} X^T y − (X^T X)^{-1} X^T w γ̂.    (2.24)
Substitution of this value into (2.23) leads, after rearrangement, to

    γ̂ = w^T (I − H) y / {w^T (I − H) w} = w^T A y / (w^T A w).    (2.25)
Since A = (I − H) is idempotent, γ̂ can be expressed in terms of the two
sets of residuals

    e = y* = (I − H) y = A y

and

    w* = (I − H) w = A w    (2.26)

as

    γ̂ = w*^T e / (w*^T w*).    (2.27)
Thus γ̂ is the coefficient of linear regression through the origin of the resid-
uals e on the residuals of the new variable w, both after regression on the
variables in X.
Because the slope of this regression is γ̂, a plot of e against w* is often used
as a visual assessment of the evidence for a regression and for the assessment
of the contribution of individual observations to the relationship. Such a
plot is called an added variable plot or a constructed variable plot if w is not
a straightforward explanatory variable. However the plot is one of residuals
against residuals. As we have already argued, points of high leverage tend
to have small residuals. Thus, if something important to the regression
happens at a leverage point, it will often not show on the plot. Examples,
for the constructed variable for transformation of the response , are given
by Cook and Wang (1983) and by Atkinson (1985 , §12.3). Instead of the
plot, these authors suggest looking at the effect of individual observations
on the t test for γ.
To calculate the t statistic requires the variance of γ̂. Since, like any
least squares estimate in a linear model, γ̂ is a linear combination of the
observations, it follows from (2.25) that

    var γ̂ = σ² w^T A^T A w / (w^T A w)² = σ² / (w^T A w) = σ² / (w*^T w*).    (2.28)

Calculation of the test statistic also requires the residual mean square s_w²,
the estimate of σ² from regression on X and w, given by (Exercise 2.6)

    (n − p − 1) s_w² = y^T y − β̂^T X^T y − γ̂ w^T y
                     = y^T A y − (y^T A w)² / (w^T A w).    (2.29)
The t statistic for testing that γ = 0 is then

    t_w = γ̂ / √{s_w² / (w^T A w)}.    (2.30)
If w is the explanatory variable x_k, (2.30) is an alternative way of writing
the usual t test (2.19). But the advantage of (2.30) is that the effect of
individual observations on this statistic can be found by using the methods
of deletion diagnostics in which explicit formulae are found for the effect

of deleting, or adding, one or several observations. For t_w they are given in


compact form by Atkinson (1986). We do not use such formulae for the score
statistic, but instead track its value during the forward search. Examples,
for the constructed variable for transformation, were given in our analysis
of the wool data in Chapter 1, for example, in Figure 1.11 (left). However,
since such formulae are useful in deriving some diagnostic quantities that
we do calculate during the search, we now give a brief outline of some
important ideas in deletion diagnostics.

2.3 Deletion Diagnostics


2.3.1 The Algebra of Deletion
If an observation is deleted and the regression model refitted, the param-
eter estimates, residual sum of squares and residuals will all change. For
the regression models of this chapter there are exact expressions for these
changes, so that the explicit deletion of individual observations and re-
peated refitting of the model are not necessary. The methods, collectively
known as deletion diagnostics, use these exact formulae to examine the im-
portance of individual observations to inferences about the fit. For example,
it was argued above that an observation at a point of high leverage would
often have a small least squares residual and that the values of the studen-
tized residuals were bounded. However, if the observation is deleted and
the model refitted to n - 1 observations, the resulting residual will be large
if the fitted model changes appreciably when the observation is deleted.
These effects are calculated precisely using quantities from the single fit to
all n observations.
The expressions for the effect of deletion are based on a matrix relation-
ship often called the Sherman- Morrison-Woodbury formula. References to
its history are given at the end of the chapter. In the next section we show
how the combination of the mean shift outlier model and the results on
added variables of the previous section may be combined to provide the
same results with greatly reduced algebraic effort. But for now we use this
matrix relationship to give exact expressions for the effect of deletion of
a specified set of observations. However, as the number of observations to
be deleted increases, there is a combinatorial explosion of the number of
deleted subsets to be considered, which can lead to difficulties in interpre-
tation. An alternative to deleting several observations at once is to work
"backwards" through the data, repeatedly deleting single egregious obser-
vations. As is evident from the forward plot of the residuals for the multiple
regression data in Figure 1.7, such backwards methods can fail in the pres-
ence of masked outliers. Although we do not use such methods, the algebra
of deletion is important in deriving some quantities we monitor during the
forward search. We begin with the Sherman- Morrison-Woodbury formula.

Let A be a square p x p matrix and let U and V be matrices of dimension
p x m. Then it is easy to verify that (Exercise 2.11)

    (A - UV^T)^{-1} = A^{-1} + A^{-1}U(I_m - V^T A^{-1}U)^{-1}V^T A^{-1},   (2.31)

where it is assumed that all necessary inverses exist. For regression we
let A = X^T X. The ith row of X is x_i^T. Deletion of this row leaves the
matrix X_{(i)}, where the subscripted i in parentheses is to be read as "with
observation i deleted". With this definition

    X_{(i)}^T X_{(i)} = X^T X - x_i x_i^T.

It then follows from (2.31) that

    (X_{(i)}^T X_{(i)})^{-1} = (X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T (X^T X)^{-1}/(1 - h_i).   (2.32)

The new inverse thus depends on the old inverse (X^T X)^{-1}, on x_i and on
the leverage measure h_i.
The vector of parameter estimates after deletion of observation i is \hat{\beta}_{(i)},
defined by

    \hat{\beta}_{(i)} = (X_{(i)}^T X_{(i)})^{-1}(X^T y - x_i y_i).

Use of (2.32) shows (Exercise 2.12) that this reduces to the simple form

    \hat{\beta}_{(i)} - \hat{\beta} = -(X^T X)^{-1} x_i e_i/(1 - h_i).   (2.33)

In addition to quantities from the fit to all n observations, (2.33) also needs
the least squares residual e_i as well as the leverage measure h_i. No further
quantities are needed for calculation of any other deletion statistics. For
example, to find s_{(i)}^2, we rewrite (2.11) as

    (n - p)s^2 = y^T y - \hat{\beta}^T X^T y,

so that (Exercise 2.10)

    (n - p - 1)s_{(i)}^2 = y^T y - y_i^2 - \hat{\beta}_{(i)}^T(X^T y - x_i y_i)
                        = (n - p)s^2 - e_i^2/(1 - h_i).                (2.34)

The change in residual sum of squares on deletion of observation i is thus
e_i^2/(1 - h_i).
We now use these relationships to find simple expressions for two
quantities that are particularly informative in the forward search.
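The updating formulae (2.32)-(2.34) are easily checked against an explicit refit. The sketch below is illustrative only, assuming arrays X and y are available (for instance those simulated above); it is not the authors' code.

import numpy as np

def deletion_check(X, y, i):
    """Compare the closed-form deletion quantities with an explicit refit."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                                  # least squares residuals
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)       # leverages h_i
    s2 = e @ e / (n - p)
    beta_i = beta - XtX_inv @ X[i] * e[i] / (1 - h[i])              # equation (2.33)
    s2_i = ((n - p) * s2 - e[i] ** 2 / (1 - h[i])) / (n - p - 1)    # equation (2.34)
    Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)               # brute-force refit
    beta_refit, rss, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    assert np.allclose(beta_i, beta_refit)
    assert np.allclose(s2_i, rss[0] / (n - p - 1))
    return beta_i, s2_i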

2.3.2 Deletion Residuals


The first quantity comes from considering whether deletion of the ith ob-
servation has a marked effect on prediction at x_i. In particular, does the
observed y_i agree with the prediction \hat{y}_{(i)} obtained when the ith observation
is excluded from fitting? Since y_i and \hat{y}_{(i)} are independent the difference

    y_i - \hat{y}_{(i)} = y_i - x_i^T \hat{\beta}_{(i)}

has variance

    var(y_i - \hat{y}_{(i)}) = \sigma^2\{1 + x_i^T(X_{(i)}^T X_{(i)})^{-1} x_i\}.   (2.35)

To estimate \sigma^2 we use the estimate s_{(i)}^2, which is also independent of y_i.
The test for agreement of the observed and predicted values is

    r_i^* = \frac{y_i - x_i^T \hat{\beta}_{(i)}}{s_{(i)}\sqrt{1 + x_i^T(X_{(i)}^T X_{(i)})^{-1} x_i}},   (2.36)

which, when the ith observation comes from the same population as the
other observations, has a t distribution on (n-p-1) degrees of freedom. The
deletion results given above make it possible to simplify (2.36) to obtain
(Exercise 2.7)

    r_i^* = \frac{y_i - \hat{y}_i}{s_{(i)}\sqrt{1 - h_i}}.             (2.37)

We call r_i^* the deletion residual. Comparison with (2.14) shows that the
deletion residual differs from the studentized residual only in the estimate
of \sigma employed. This is enough to ensure that r_i^* has the unbounded t
distribution as opposed to the scaled beta distribution of r_i (§2.1.2). In
most applications the difference between the two residuals will be slight,
the difference being most acute if there is a single outlier at a point of high
leverage.
In the forward search we also consider the effect of adding observations.
Given a parameter estimate \hat{\beta}, a design matrix X and an estimate s^2 of
\sigma^2, we add an observation y_i at x_i. Then the test for agreement between
observation and prediction (2.36) becomes

    d_i = \frac{y_i - x_i^T \hat{\beta}}{s\sqrt{1 + x_i^T(X^T X)^{-1} x_i}},   (2.38)

which is the usual t test for the agreement of a new datapoint with a
previous set of observations. Since, apart from replications, x_i is not a row
of X, the quantity x_i^T(X^T X)^{-1} x_i may be greater than one. The power
of the test is improved by using s^2 as the estimate of error variance, rather
than updating for the new observation.
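As a quick illustration, the deletion residuals (2.37) and the test (2.38) for a new observation can both be computed from a single fit. A minimal numpy sketch, assuming arrays X and y and a hypothetical candidate point (x_new, y_new); the function names are our own, not the authors'.

import numpy as np

def deletion_residuals(X, y):
    """Deletion residuals r_i^* of (2.37) from a single least squares fit."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ XtX_inv @ X.T @ y
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    s2 = e @ e / (n - p)
    s2_i = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)   # deletion estimates (2.34)
    return e / np.sqrt(s2_i * (1 - h))

def new_point_test(X, y, x_new, y_new):
    """t test (2.38) for agreement of a new observation with the fitted model."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ beta) ** 2) / (n - p)
    return (y_new - x_new @ beta) / np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))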

2.3.3 Cook's Distance


A large deletion residual r_i^* indicates that the predicted value at x_i depends
strongly on the observed y_i. But it does not directly provide information
on the dependence of the parameter estimates on y_i.
The simplest form of information on the effect of y_i on the parameter
estimates is to monitor the changes in the individual estimates \hat{\beta}_k as the
search progresses. But we also find it informative to have an overall measure

of change in the vector \hat{\beta}. This is provided by Cook's distance, which can
be derived from the confidence region for \beta.
A confidence region at level 100(1 - \alpha)% for the parameter vector \beta is
given by those values of the parameter for which

    (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p s^2 F_{p,\nu,\alpha},   (2.39)

where s^2 is an estimate of \sigma^2 on \nu degrees of freedom and F_{p,\nu,\alpha} is the
100\alpha% point of the F distribution on p and \nu degrees of freedom. Cook
(1977) proposed the statistic

    D_i = (\hat{\beta}_{(i)} - \hat{\beta})^T X^T X (\hat{\beta}_{(i)} - \hat{\beta})/(p s^2)   (2.40)

for detecting influential observations. Large values of D_i indicate observa-
tions that are influential on joint inferences about all the linear parameters
in the model. A suggestive alternative form for D_i is

    D_i = (\hat{y}_{(i)} - \hat{y})^T(\hat{y}_{(i)} - \hat{y})/(p s^2),   (2.41)

where the vector of predictions \hat{y}_{(i)} = X\hat{\beta}_{(i)}. One interpretation is that D_i
measures the sum of squared changes in the predictions when observation
i is not used in estimating \beta.
A more convenient form for D_i follows from substitution for \hat{\beta}_{(i)} - \hat{\beta} from
(2.33), which yields

    D_i = \frac{h_i}{p(1 - h_i)} r_i^2,                                (2.42)

with r_i the studentized residual defined in (2.14). This form for D_i shows
that influence depends on h_i, so that appreciable outliers with low lever-
age can be expected to have little effect on the parameter estimates. Our
analysis of Forbes' data supports this interpretation. That observation 12 is
clearly an outlier is shown by all the residual plots. It is the last observation
to be included in the forward search. Figure 1.3(left) shows that there is
no detectable change in the slope of the regression line and little change in
the value of the intercept when observation 12 is included. For these data
h_{12} = 0.0596 compared with an average value of 2/17 = 0.118.
If one of the observations is an outlier, the estimate s^2 will be too large,
except when the outlying observation is deleted. Atkinson (1982) therefore
suggested replacing s^2 by the deletion estimate s_{(i)}^2. In addition the square
root of D_i was taken, to give a residual-like quantity, and the statistic scaled
by the average leverage p/n. The resulting modified Cook statistic is

    C_i = \left\{\frac{n - p}{p}\right\}^{1/2}\left\{\frac{h_i e_i^2}{(1 - h_i)^2 s_{(i)}^2}\right\}^{1/2}
        = \left\{\frac{n - p}{p}\,\frac{h_i}{1 - h_i}\right\}^{1/2} |r_i^*|.   (2.43)
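Both D_i of (2.42) and the modified statistic of (2.43) use only quantities from the single fit to all n observations. A short illustrative numpy sketch, under the same assumptions as the earlier fragments; it is not the authors' code.

import numpy as np

def cook_distances(X, y):
    """Cook's distance (2.42) and the modified Cook statistic (2.43)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ XtX_inv @ X.T @ y
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1 - h))                            # studentized residuals
    s2_i = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)
    r_star = e / np.sqrt(s2_i * (1 - h))                     # deletion residuals (2.37)
    D = h * r ** 2 / (p * (1 - h))                           # equation (2.42)
    C = np.sqrt((n - p) / p * h / (1 - h)) * np.abs(r_star)  # equation (2.43)
    return D, C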

2.4 The Mean Shift Outlier Model


In this section we sketch how the mean shift outlier model can be used to
obtain deletion results using the relationships for added variables derived
in §2.2.
Formally the model is similar to that of (2.21). We write

    E(Y) = X\beta + d\phi,                                             (2.44)

where the n x 1 vector d is all zeroes apart from a single one in the ith
position and \phi is a scalar parameter. Observation i therefore has its own
parameter and, when the model is fitted, the residual for observation i
will be zero; fitting (2.44) thus yields the same residual sum of squares as
deleting observation i and refitting.
To show this equivalence requires some properties of d. Since it is a vector
with one nonzero element equal to one, it extracts elements from vectors
and matrices, for example:

    d^T y = y_i,   X^T d = x_i,   d^T A d = 1 - h_i   and   d^T A y = e_i.   (2.45)

Then, from (2.25),

    \hat{\phi} = d^T A y/(d^T A d) = e_i/(1 - h_i).                    (2.46)

If the parameter estimate in the mean shift outlier model is denoted \hat{\beta}_d, it
follows from (2.24) that

    \hat{\beta}_d = (X^T X)^{-1} X^T y - (X^T X)^{-1} X^T d\hat{\phi},

so that, from (2.46),

    \hat{\beta}_d = \hat{\beta} - (X^T X)^{-1} x_i e_i/(1 - h_i).      (2.47)

Comparison of (2.47) with (2.33) shows that \hat{\beta}_d = \hat{\beta}_{(i)}, confirming the
equivalence of deletion and a single mean shift outlier.
The expression for the change in residual sum of squares comes from
(2.29). If the new estimate of \sigma^2 is s_d^2 we have immediately that

    (n - p - 1)s_d^2 = y^T A y - (y^T A d)^2/(d^T A d)
                     = (n - p)s^2 - e_i^2/(1 - h_i),                   (2.48)

which is (2.34).
The mean shift outlier model likewise provides a simple method of finding
the effect of multiple deletion. We first need to extend the results on added
variables in §2.2 to the addition of m variables, so that W is an n x m
matrix and \gamma an m x 1 vector of parameters. We then apply these results
to the mean shift outlier model

    E(Y) = X\beta + D\phi,

with D a matrix that has a single one in each of its columns, which are oth-
erwise zero, and m rows with one nonzero element. These m entries specify
the observations that are to have individual parameters or, equivalently,
are to be deleted.
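The equivalence of single-case deletion and the mean shift outlier model is easy to demonstrate numerically: appending the dummy column d to X and refitting by least squares reproduces \hat{\beta}_{(i)}, \hat{\phi} = e_i/(1 - h_i) and the deleted residual sum of squares. A minimal sketch, again with assumed arrays X and y rather than anything from the book:

import numpy as np

def mean_shift_fit(X, y, i):
    """Fit E(Y) = X beta + d phi, with d the ith unit vector, as in (2.44)."""
    n = len(y)
    d = np.zeros(n)
    d[i] = 1.0                                    # ith unit vector
    Xd = np.column_stack([X, d])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    beta_d, phi_hat = coef[:-1], coef[-1]
    rss = np.sum((y - Xd @ coef) ** 2)            # equals the deletion RSS in (2.34)
    return beta_d, phi_hat, rss

Here beta_d agrees with the deletion estimate of (2.47) and phi_hat with e_i/(1 - h_i) of (2.46).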

2.5 Simulation Envelopes


We have already seen some examples of the use of normal plots of residuals
to assess whether the error distribution is at least approximately normal. In
our analysis of Forbes' data the plots of Figure 1.2 clearly showed a single
outlier. In the normal plots of residuals for the multiple regression data,
Figure 1.6(right), and for the untransformed wool data, Figure 1.9(right),
it was not so clear whether the plots were sufficiently straight to confirm
agreement between data and model. This decision was helped by adding a
simulation envelope to the plot.
95% pointwise envelopes can be calculated by simulating 39 sets of n ob-
servations from the standard normal distribution. Each set of observations
is then regressed on the X matrix from the data to give a set of residuals.
Each set of residuals could be plotted to give an idea of the variability to
be expected in such normal plots. Instead the largest and smallest values
of the curves at each plotting position are used to provide a simulation
envelope for possible shapes of the curve. At each point the probability
that the curve from a normal sample falls outside the envelope is approxi-
mately 5%, since there is a probability of 2.5% = 1/40 in either tail of the
distribution. The approximation arises because there will be some random
error in the position of the envelope due to sampling. If it is of importance,
the probability that at least one residual falls outside this pointwise 95%
envelope would have to be found by a further simulation.
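A sketch of this construction, simulating 39 standard normal samples and regressing each on the X matrix of the data, might look as follows; the use of studentized residuals and the function name are illustrative assumptions rather than a prescription from the text.

import numpy as np

def normal_plot_envelope(X, n_sim=39, seed=0):
    """Pointwise envelope for ordered studentized residuals from simulated N(0,1) data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    ordered = np.empty((n_sim, n))
    for k in range(n_sim):
        y = rng.standard_normal(n)       # parameters and sigma^2 are irrelevant here
        e = y - H @ y
        s2 = e @ e / (n - p)
        ordered[k] = np.sort(e / np.sqrt(s2 * (1 - h)))
    return ordered.min(axis=0), ordered.max(axis=0)   # lower and upper envelopes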
For nonnormal models it is usually necessary to simulate from the fitted
model. But with plots of residuals from normal data, the distribution of the
residuals is independent of the parameters of the linear model. Furthermore,
with the studentized or deletion residuals plotted here, the values are also
independent of the value of \sigma^2. It therefore suffices to fit directly to samples
from the standard normal distribution. The same argument holds for the
distribution of Cook's distance which, as (2.42) shows, can be written as a
function of studentized residuals.
95% envelopes with less random variation can be found by taking, for
example, the third largest and smallest from a sample of 119 envelopes.
Since 3/120 = 0.025 and 117/120 = 0.975, the desired limits are obtained.
Examples are given by Dempster et al. (1984). We find approximate 90%
envelopes by taking the 5th and 95th largest observations from a sim-
ulated sample of 100. Our envelope for the multiple regression data in
Figure 1.6(right) shows that the largest residual is not exceptionally large,

although there is a slight indication of some aberrant structure in the centre


of the plot. The interpretation of Figure 1.9(right) for the untransformed
wool data is comparatively straightforward and aided by the envelope: the
plot is definitely not straight, consisting almost of two straight lines. There
are too many small negative residuals and, correspondingly, too few posi-
tive ones, some of which are large. A skewed distribution of the residuals
is indicated.
Simulation envelopes can likewise be calculated and displayed at each
stage of the forward search, to explore the effect of fitting to just a subset
of the data. However sometimes the effect of including just a subset provides
a clear message as it does in Figure 3.2(right) where least median of squares
residuals are plotted at the beginning of the forward search.

2.6 The Forward Search


2.6.1 General Principles
So far in this chapter we have considered quantities that can be monitored
during the forward search. We now describe the search itself.
If the values of the parameters of the model were known, there would
be no difficulty in detecting the outliers, which would have large residuals.
The difficulty arises because the outliers are included in the data used to
estimate the parameters, which can then be badly biased. Most methods for
outlier detection therefore seek to divide the data into two parts, a larger
"clean" part and the outliers. The clean data are then used for parameter
estimation. Our method follows this prescription, our emphasis being on
parameter estimation once some of the data, including the outliers, have
been removed for the purpose of parameter estimation.
The simplest example of this division of the data into two parts is in
the use of single deletion diagnostics, such as those described above, where
the division is into one potential outlier and the rest of the data. Standard
books on regression diagnostics, such as Cook and Weisberg (1982), Atkin-
son (1985) and Chatterjee and Hadi (1988) include formulae for multiple
deletion diagnostics, extending the results of §2.3, in which a small number,
perhaps two or three, of potential outliers are considered at once. But the
combinatorial explosion of the number of cases that have to be considered
is a severe drawback of such backwards working.
Many methods for the detection of multiple outliers therefore use very
robust methods to sort the data into a clean part and potential outliers.
Our method starts from least median of squares (LMS).
For the linear regression model E(Y) = X\beta of (2.1), with X of rank
p, let b be any estimate of \beta. With n observations the residuals from this
estimate are e_i(b) = y_i - x_i^T b, (i = 1, ..., n). The LMS estimate \hat{\beta}_{LMS} is the
value of b minimizing the median of the squared residuals e_i^2(b). Thus \hat{\beta}_{LMS}

minimizes the scale estimate

    e_{[med]}^2(b),                                                    (2.49)

where e_{[k]}^2(b) is the kth ordered squared residual. In order to allow for
estimation of the parameters of the linear model the median is taken as

    med = [(n + p + 1)/2],                                             (2.50)

the integer part of (n + p + 1)/2.
The parameter estimate satisfying (2.49) has, asymptotically, a break-
down point of 50%. Thus, for large n, almost half the data can be outliers,
or come from some other model and LMS will still provide an unbiased
estimate of the regression line. This is the maximum breakdown that can
be tolerated. For a higher proportion of outliers there is no longer a model
that fits the majority of the data.
The very robust behaviour of the LMS estimate is in stark contrast to
that of the least squares estimate \hat{\beta} minimizing (2.4), which can be written
as

    S(b) = \sum_{i=1}^{n} e_i^2(b).                                    (2.51)

Only one outlier needs to be moved towards infinity to cause an arbitrarily
large change in the estimate \hat{\beta}: the breakdown point of \hat{\beta} is zero. As we saw
in the forward plots of residuals in Chapter 1, such as Figure 1.7 for the
multiple regression data, the LMS estimates at the beginning of the search
can be very different from the least squares ones at the end, when outliers
are present.
The definition of \hat{\beta}_{LMS} in (2.49) gives no indication of how to find such
a parameter estimate. Since the surface to be minimized has many local
minima, approximate methods are used. Rousseeuw (1984) finds an ap-
proximation to \hat{\beta}_{LMS} by searching only over elemental sets, that is, subsets
of p observations, taken at random. We follow this procedure. Depending
on the dimension of the problem we find the starting point for the forward
search either by sampling 1,000 subsets or by exhaustively evaluating all
subsets. We take as our initial subset that yielding the minimum value in
(2.49), so obtaining an outlier free start for our forward search.
In this resampling algorithm the model is fitted to m = p observations,
when the remaining n - p observations can be tested to see if any outliers
are present. A similar resampling algorithm can be used for the detection of
multivariate outliers using the minimum volume ellipsoid (Rousseeuw and
van Zomeren 1990). The resulting parameter estimates are very robust,
but are defined by the algorithm that produces them, a crucial distinction
with standard statistical methods such as least squares. For example, LMS
estimates could be found by searching over larger subsamples, perhaps
with m = p + 1 or p + 2. The disadvantage is that there is an increased

probability that the subsamples will contain outliers. However Woodruff


and Rocke (1994) show that such estimators, while remaining very robust,
have lower variance than those based on smaller subsets. They are therefore
more reliable when used in outlier detection procedures.
In the forward search, such larger subsamples of outlier free observations
are found by starting from small subsets and incrementing them with ob-
servations that have small residuals, and so are unlikely to be outliers. The
method was introduced by Hadi (1992) for the detection of outliers from
a fit using approximately half the observations. Different versions of the
method are described by Hadi and Simonoff (1993), Hadi and Simonoff
(1994) and by Atkinson (1994a). In this literature the emphasis is on us-
ing the forward search to find a single set of parameter estimates and of
outliers. These are determined by the point at which the algorithm stops,
which may be either deterministic or data dependent. The emphasis in this
book is very different: at each stage of the forward search we use informa-
tion such as parameter estimates and residual plots to guide us to a suitable
model.
Suppose at some stage in the forward search the set of m observations
used in fitting is S_*^{(m)}. Fitting to this subset is by least squares (for regres-
sion models) yielding the parameter estimates \hat{\beta}_m^*. From these parameter
estimates we can calculate a set of n residuals e_{i,S_*^{(m)}} and we can also esti-
mate \sigma^2. Suppose that the subset S_*^{(m)} is clear of outliers. There will then
be n - m observations not used in fitting that may contain outliers. We
do not seek to identify these outliers by a formal test. Our interest is in
the evolution, as m goes from p to n, of quantities such as the residuals,
and test statistics which we plotted in Chapter 1, together with Cook's
distance and other diagnostic quantities. We also look at the sequence of
parameter estimates \hat{\beta}_m^* and related t statistics. We monitor changes that
occur, which can always be associated with the introduction of a particular
group of observations, in practice almost always one observation, into the
subset S_*^{(m)} used for fitting. Interpretation of these changes is complemented
by examination of changes in the forward plot of residuals.
Remark 1: The search starts with the approximate LMS estimator found
by sampling subsets of size p. Let this be \hat{\beta}_p^* and let the least squares
estimator at the end of the search be \hat{\beta}_n^* = \hat{\beta}. In the absence of outliers
and systematic departures from the model

    E(\hat{\beta}_p^*) = E(\hat{\beta}) = \beta,

that is, both parameter estimates are unbiased estimators of the same quan-
tity. The same property holds for the sequence of estimates \hat{\beta}_m^* produced
in the forward search. Therefore, in the absence of outliers, we expect both
parameter estimates and residuals to remain sensibly constant during the
forward search. We saw in the examples of Chapter 1 that this was so.

In particular, the parameter estimates for the transformed wool data in


Figure 1.14(right) were extremely stable.
Remark 2: Now suppose there are k outliers. Starting from a clean subset,
the forward procedure will include these towards the end of the search,
usually in the last k steps. Until these outliers are included, we expect that
the conditions of Remark 1 will hold and that residual plots and parameter
estimates will remain sensibly constant until the outliers are incorporated
in the subset used for fitting. We saw an example of this phenomenon in the
forward plot of the residuals for the untransformed wool data, Figure 1.10,
where the residual pattern is initially stable, but changes appreciably at
the end of the search.
Remark 3: If there are indications that the regression data should be
transformed, it is important to remember that outliers in one transformed
scale may not be outliers in another scale. If the data are analyzed using
the wrong transformation, the k outliers may enter the search well before
the end.
The forward search algorithm is made up of three steps: the first concerns
the choice of an initial subset, the second refers to the way in which we
progress in the forward search and the third relates to the monitoring of
the statistics during the progress of the search. In the following subsections
we consider these three aspects separately.

2.6.2 Step 1: Choice of the Initial Subset


We now give a formal definition of the algorithm used to find the LMS
estimator.
If the model contains p parameters, our forward search algorithm starts
with the selection of a subset of p units. Observations in this subset are
intended to be outlier free. Let Z = (X, y), so that Z is n x (p + 1). If n is
moderate and p << n, the choice of the initial subset can be performed by
exhaustive enumeration of all \binom{n}{p} distinct p-tuples
S_{i_1,...,i_p}^{(p)} \equiv \{z_{i_1}, ..., z_{i_p}\}, where z_{i_l}^T is the i_l th row of Z, for
1 \le i_1, ..., i_p \le n and i_j \ne i_{j'}. Specifically, let l = [i_1, ..., i_p] and let
e_{i,S_l^{(p)}} be the least squares residual for unit i given observations in S_l^{(p)}.
We take as our initial subset the p-tuple S_*^{(p)} which satisfies

    e_{[med],S_*^{(p)}}^2 = \min_l e_{[med],S_l^{(p)}}^2,              (2.52)

where e_{[k],S_l^{(p)}}^2 is the kth ordered squared residual among e_{i,S_l^{(p)}}^2,
i = 1, ..., n, and, as in (2.50), med is the integer part of (n + p + 1)/2.
If \binom{n}{p} is too large, we use instead some large number of samples, for
example, 1,000.
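A compact numpy illustration of this step is given below. The sampling scheme, the names and the handling of singular elemental sets are our own practical assumptions, not part of the formal definition above.

import numpy as np
from math import comb
from itertools import combinations

def initial_subset(X, y, n_samples=1000, seed=0):
    """Choose a starting p-subset by (approximate) least median of squares."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2                          # equation (2.50)
    if comb(n, p) <= n_samples:                     # exhaustive enumeration of p-tuples
        candidates = combinations(range(n), p)
    else:                                           # otherwise sample elemental sets
        candidates = (rng.choice(n, size=p, replace=False) for _ in range(n_samples))
    best, best_crit = None, np.inf
    for subset in candidates:
        sub = np.asarray(subset)
        try:
            b = np.linalg.solve(X[sub], y[sub])     # exact fit to p observations
        except np.linalg.LinAlgError:
            continue                                # skip singular elemental sets
        crit = np.sort((y - X @ b) ** 2)[med - 1]   # med-th ordered squared residual
        if crit < best_crit:
            best, best_crit = sub, crit
    return best                                     # the subset attaining (2.52)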

2.6.3 Step 2: Adding Observations During the Forward Search


Given a subset of dimension m \ge p, say S_*^{(m)}, the forward search moves to
dimension m+1 by selecting the m+1 units with the smallest squared least
squares residuals, the units being chosen by ordering all squared residuals
e_{i,S_*^{(m)}}^2, i = 1, ..., n.
The forward search estimator \hat{\beta}_{FS} is defined as a collection of least
squares estimators in each step of the forward search; that is,

    \hat{\beta}_{FS} = (\hat{\beta}_p^*, ..., \hat{\beta}_n^*).        (2.53)
In most moves from m to m+1 just one new unit joins the subset. It may
also happen that two or more units join S_*^{(m)} as one or more leave. However
our experience is that such an event is quite unusual, only occurring when
the search includes one unit that belongs to a cluster of outliers. At the
next step the remaining outliers in the cluster seem less outlying and so
several may be included at once. Of course, several other units then have
to leave the subset.
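One complete search, recording the least squares estimates and residuals at each step, can then be sketched as follows. This is a simplified illustration of the description above, reusing the initial_subset fragment sketched earlier; it is not the authors' implementation.

import numpy as np

def forward_search(X, y, start):
    """Run the forward search from an initial subset of size p to the full sample."""
    n, p = X.shape
    subset = list(start)
    estimates, residuals = {}, {}
    for m in range(p, n + 1):
        b, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        e = y - X @ b                      # residuals for all n units
        estimates[m], residuals[m] = b, e
        if m < n:                          # move to dimension m + 1:
            order = np.argsort(e ** 2)     # units with the smallest squared residuals
            subset = list(order[: m + 1])
    s2_final = residuals[n] @ residuals[n] / (n - p)
    scaled = {m: e / np.sqrt(s2_final) for m, e in residuals.items()}
    return estimates, scaled               # residuals scaled as in §2.6.4

A forward plot of residuals is then simply the trajectory of scaled[m][i] against m for each unit i.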
The search that we use avoids, in the first steps, the inclusion of outliers
and provides a natural ordering of the data according to the specified null
model. Note that in this approach we use a highly robust method and at
the same time least squares (that is, fully efficient) estimators. The zero
breakdown point of least squares estimators, in the context of the forward
search, does not turn out to be disadvantageous. The introduction of atyp-
ical (influential) observations is signaled by sharp changes in the curves
that monitor parameter estimates, t tests, or any other statistic at every
step. In this context, the robustness of the method does not derive from
the choice of a particular estimator with a high breakdown point, but from
the progressive inclusion of the units into a subset which, in the first steps,
is outlier free. As a bonus of the suggested procedure, the observations can
be naturally ordered according to the specified null model and it is possible
to know how many of them are compatible with a particular specification.
Furthermore, the suggested approach enables us to analyze the inferential
effect of the atypical units (outliers) on the results of statistical analyses.
Remark 1: The method is not sensitive to the method used to select an
initial subset, provided unmasked outliers are not included at the start.
For example, the least median of squares criterion (2.49) for regression
can be replaced by that of least trimmed squares (LTS). This criterion
provides estimators with better properties than LMS estimators. The LTS
estimator is found by minimizing the sum of the smallest h squared residuals

    S_h(b) = \sum_{i=1}^{h} e_{[i]}^2(b),                              (2.54)

for some h with [(n + p + 1)/2] \le h < n. The rate of convergence of LTS
estimates is n^{-1/2} as opposed to n^{-1/3} for LMS. But, for datasets of the size

considered in this book, there seems to be little difference in the abilities of


the two methods to detect outliers and so to provide a clean starting point
for the forward search.
What is important in our procedure is that the initial subset is either free
of outliers or contains unmasked outliers which are immediately removed
by the forward procedure. The search is often able to recover from a start
that is not very robust. An example, for regression, is given by Atkinson
and Mulira (1993) and for spatial data by Cerioli and Riani (1999).
Remark 2: Forward searches allowing for the variances of the residuals
are employed by Hadi and Simonoff (1993) and by Atkinson (1994a), who
use studentized residuals, whereas we use raw residuals. Our comparisons
show that although the choice of residual has a slight effect on the forward
search, particularly at the beginning, the search using raw residuals is the
more stable, in that usually only one observation is added at a time, rather
than several being interchanged. For monitoring the effect of individual
observations on statistics and parameter estimates, it is helpful to be able
to connect particular effects with particular observations. Of course, both
methods respond to a cluster of outliers with multiple exchanges.

2.6.4 Step 3: Monitoring the Search


Step 2 of the forward search is repeated until all units are included in the
subset. If just one observation enters S_*^{(m)} at each move, the algorithm
provides an ordering of the data according to the specified null model,
with observations furthest from it joining the subset at the last stages of
the procedure.
Remark 1: The estimate of \sigma^2 does not remain constant during the
forward search as observations are sequentially selected that have small
residuals. Thus, even in the absence of outliers, the residual mean square
estimate s_*^2(m) < s_*^2(n) = s^2 for m < n. The smooth increase of s_*^2(m) with
m for the transformed wool data in Figure 1.15(left) is typical of what we
expect when the data agree with the model and are correctly ordered by
the forward search.
One of the most important plots monitors all residuals at each step of
the forward search (for example, Figure 1.7). Large values of the residuals
among cases not in the subset indicate the presence of outliers, as do non-
smooth changes in the value of the residual sum of squares. Because of the
strong dependence of s_*^2(m) on m, we standardize all residuals by the final
root mean square estimate s^2, as we did, for example, in the forward plot
of residuals for Forbes' data in Figure 1.4.
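The monitored quantities s_*^2(m) and R^2 can be extracted from the same fits. A brief illustrative fragment, assuming a dictionary subsets that maps each m to the units of S_*^{(m)} (the forward_search sketch above is easily extended to return it); none of the names come from the book.

import numpy as np

def monitor_s2_r2(X, y, subsets):
    """s^2(m) and R^2 for each forward-search subset."""
    p = X.shape[1]
    out = {}
    for m, sub in subsets.items():
        Xm, ym = X[sub], y[sub]
        b, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
        rss = np.sum((ym - Xm @ b) ** 2)
        tss = np.sum((ym - ym.mean()) ** 2)
        s2_m = rss / (m - p) if m > p else 0.0   # exact fit when m = p
        out[m] = (s2_m, 1 - rss / tss)
    return out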

2.6.5 Forward Deletion Formulae


Cook's distance, and the other diagnostic quantities of §2.3, measure the
effect of deletion of a single observation and so may be liable to mask-
ing when several outliers are present. The forward search overcomes this
masking, with abrupt changes in parameter estimates indicating influential
observations, which can be detected through the monitoring of a "forward
version" of the Cook statistic D_i. From the original definition in (2.40) this
is given by

    D_m = (\hat{\beta}_{m-1}^* - \hat{\beta}_m^*)^T (X_{S_*^{(m)}}^T X_{S_*^{(m)}})(\hat{\beta}_{m-1}^* - \hat{\beta}_m^*)/(p s_{S_*^{(m)}}^2),
        m = p+1, ..., n,                                               (2.55)

where X_{S_*^{(m)}} is the m x p matrix that contains the m rows of the matrix
X for the units in the subset.
To calculate the modified Cook statistic requires the leverages of the
units. These leverages are themselves a useful tool for the detection of
influential observations. We plot, for every unit, the leverage h_{i,S_*^{(m)}}, as
soon as that unit joins the subset:

    h_{i,S_*^{(m)}} = x_i^T(X_{S_*^{(m)}}^T X_{S_*^{(m)}})^{-1} x_i,   i \in S_*^{(m)},   m = p, ..., n.   (2.56)

At the start of the search we have only p observations, each of which has
leverage one. The leverages decrease thereafter. An example of such be-
haviour is in Figure 3.11, which shows a forward plot of the leverages for a
four-parameter model for Brownlee's stack loss data.
The forward version of the modified Cook distance (2.43) can, from
(2.56), be calculated as

    C_{mi} = \left\{\frac{m - p}{p}\right\}^{1/2}\left\{\frac{h_{i,S_*^{(m)}}\, e_{i,S_*^{(m)}}^2}{(1 - h_{i,S_*^{(m)}})^2\, s_{S_*^{(m-1)}}^2}\right\}^{1/2}
        for i \notin S_*^{(m-1)} and i \in S_*^{(m)},                  (2.57)

where m = p+1, ..., n.
Two further useful plots for outlier detection are those that monitor the
minimum deletion residual among the units not belonging to the subset

    r_{[m+1]}^* = \min_{i \notin S_*^{(m)}} \frac{|e_{i,S_*^{(m)}}|}{\sqrt{s_{S_*^{(m)}}^2\{1 + x_i^T(X_{S_*^{(m)}}^T X_{S_*^{(m)}})^{-1} x_i\}}},
        m = p+1, ..., n-1,                                             (2.58)

and the maximum studentized residual in the subset

    r_{[m]} = \max_{i \in S_*^{(m)}} \frac{|e_{i,S_*^{(m)}}|}{\sqrt{s_{S_*^{(m)}}^2(1 - h_{i,S_*^{(m)}})}},
        m = p+1, ..., n.                                               (2.59)

Both indices run from p + 1 since this number of observations is the min-
imum allowing estimation of \sigma^2. If one or more atypical observations are
present in the data, the plot of r_{[m+1]}^* against m will show a peak in the
step prior to the inclusion of the first outlier. On the other hand, the plot

that monitors r_{[m]} shows a sharp increase when the first outlier joins S_*^{(m)}.
Both plots may show a subsequent decrease, due to the effect of masking, as
further outliers enter the subset. Examples of these forward plots of resid-
uals are in Figure 3.6, with a forward plot of the modified Cook distance
in Figure 3.5.
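For a single step of the search both monitoring statistics are cheap to compute, as in the following sketch; the function name and interface are illustrative assumptions.

import numpy as np

def monitoring_residuals(X, y, subset):
    """Minimum deletion residual outside the subset (2.58) and
    maximum studentized residual inside it (2.59) for one step."""
    m, p = len(subset), X.shape[1]
    in_sub = np.zeros(len(y), dtype=bool)
    in_sub[list(subset)] = True
    Xm, ym = X[in_sub], y[in_sub]
    XtX_inv = np.linalg.inv(Xm.T @ Xm)
    b = XtX_inv @ Xm.T @ ym
    e = y - X @ b                                   # residuals for all units
    s2 = np.sum(e[in_sub] ** 2) / (m - p)
    lev = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # x_i^T (X_S^T X_S)^{-1} x_i
    r_del = np.abs(e[~in_sub]) / np.sqrt(s2 * (1 + lev[~in_sub]))
    r_stu = np.abs(e[in_sub]) / np.sqrt(s2 * (1 - lev[in_sub]))
    return r_del.min(), r_stu.max()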

2.7 Further Reading


Of the many books on regression , the approach of Weisberg (1985) is clos-
est to that sketched here. A more recent treatment , including robustness
and diagnostics, is given by Ryan (1997) . The books of Cook and Weisberg
(1982) and Atkinson (1985) contain fuller discussions of regression diag-
nostics, as also do Belsley et al. (1980) and Chatterjee and Hadi (1988). A
precursor of the forward search is described in a paper by Kianifard and
Swallow (1989) who order the data by a fit using all observations and then
use this ordering to calculate recursive residuals for outlier identification.
The absence of a robust fit means that their procedure fails when mask-
ing is present. The monograph of Rousseeuw and Leroy (1987) describes
very robust methods in the service of outlier detection. Cook and Weisberg
(1994a) introduce numerous graphical methods for building and checking
models that are very different from those in the other books. A major
emphasis is on the choice of a correct linear model through the use of di-
mension reduction developed from the sliced inverse regression of Li (1991).
The theory of these graphical methods is developed in Cook (1994). Cook
and Weisberg (1999) combine the graphical methods with an introduction
to regression.
Statistical applications of the Sherman- Morrison- Woodbury formula ap-
peared in Bartlett (1951) and Plackett (1950), at much the same time
as the formula itself was noted (Sherman and Morrison 1949; Woodbury
1950). A history of related algebra is given by Henderson and Searle (1981).
As we indicated in §2.4, use of the mean shift outlier model provides a
more straightforward derivation of the requisite deletion formulae for least
squares.

2.8 Exercises
Exercise 2.1 Show that the matrix H = X(X^T X)^{-1} X^T is (a) symmetric
and (b) idempotent (§2.1).
Exercise 2.2 Show that if a matrix H is idempotent I - H is also
idempotent (§ 2.1).
Exercise 2.3 Show that (§2.1):
(a) 0 \le h_i \le 1;
(b) -0.5 \le h_{ij} \le 0.5 for all j \ne i, where h_i and h_{ij} are, respectively, the
ith diagonal element and the ijth element of the hat matrix H.
Exercise 2.4 Show that in the case of simple regression of Y on a constant
term and a single explanatory variable x the ith diagonal element of H is
equal to (§2.1):

    h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{t=1}^{n}(x_t - \bar{x})^2},   i = 1, ..., n.

Exercise 2.5 Prove that (§2.1):

(a) h_{jk(i)} = h_{jk} + (h_{ik} h_{ij})/(1 - h_i), where h_{jk(i)} is the jkth element of the
matrix H excluding the ith row of matrix X; namely:

    h_{jk(i)} = x_j^T(X_{(i)}^T X_{(i)})^{-1} x_k;                     (2.60)

(b) h_i is nonincreasing in n;
(c) if m observations (i_1, ..., i_m) are equal,

    h_{i_1(i_2,...,i_m)} = \frac{h_{i_m}}{1 - (m - 1)h_{i_m}},

and h_{i_j} \le 1/m, j = 1, ..., m.


Exercise 2.6 Prove equation (2.29) (§2.2).
Exercise 2.7 Show that (§2.3):

    r_i^* = \frac{y_i - x_i^T \hat{\beta}_{(i)}}{s_{(i)}\sqrt{1 + x_i^T(X_{(i)}^T X_{(i)})^{-1} x_i}} = \frac{y_i - \hat{y}_i}{s_{(i)}\sqrt{1 - h_i}}.   (2.61)
Exercise 2.8 Show that the quantity h_i/(1 - h_i), the ratio of the variance of
the ith predicted value (var(\hat{y}_i) = \sigma^2 h_i) to the variance of the ith ordinary
residual (var(e_i) = \sigma^2(1 - h_i)), can be interpreted as the ratio of the part
of \hat{y}_i due to y_i to the part due to the predicted value x_i^T \hat{\beta}_{(i)}; that is, show
that (§2.3):

    \hat{y}_i = (1 - h_i)x_i^T \hat{\beta}_{(i)} + h_i y_i.

Exercise 2.9 Show that (n - p - 1)r_i^2/(n - p - r_i^2) \sim F_{1,n-p-1} (§2.3).


Exercise 2.10 Prove equation (2.34) (§2.3).
Exercise 2.11 Verify equation (2.31) (§2.3).
Exercise 2.12 Prove equation (2.33) (§2.3).
Exercise 2.13 Show that r_i^* can be interpreted as the t statistic for testing
the significance of the ith unit vector d_i in the following model (mean shift
outlier model) (§2.4):

    E(Y) = X\beta + d\theta.                                           (2.62)

2.9 Solutions
Exercise 2.1
(a) H^T = (X(X^T X)^{-1} X^T)^T = X(X^T X)^{-1} X^T = H.
(b) HH = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T.

Exercise 2.2
(I - H)(I - H) = I + H^2 - 2H = I + H - 2H = I - H.

Exercise 2.3
(a) The ith diagonal element of H can be written as:

    h_i = \sum_{j=1}^{n} h_{ij}^2 = h_i^2 + \sum_{j \ne i} h_{ij}^2,

from which it follows that 0 \le h_i \le 1 for all i.
(b)

    h_i = h_i^2 + h_{ij}^2 + \sum_{k \ne i,j} h_{ik}^2,                 (2.63)

from which it follows that h_{ij}^2 \le h_i(1 - h_i). Since 0 \le h_i \le 1, it must be
that -0.5 \le h_{ij} \le 0.5.

Exercise 2.4
In the case of simple regression, the matrix X has the structure:

    X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.   (2.64)

In this case X^T X and (X^T X)^{-1} are equal, respectively, to:

    X^T X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix},
    (X^T X)^{-1} = \frac{1}{n\sum_{i=1}^n (x_i - \bar{x})^2}
        \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}.

The ith diagonal element of H is therefore

    h_i = \frac{\sum_{t=1}^n x_t^2 - 2x_i\sum_{t=1}^n x_t + n x_i^2}{n\sum_{t=1}^n (x_t - \bar{x})^2}.

Adding and subtracting n\bar{x}^2 in the numerator we obtain:

    h_i = \frac{1}{n} + \frac{n\bar{x}^2 - 2x_i\sum_{t=1}^n x_t + n x_i^2}{n\sum_{t=1}^n (x_t - \bar{x})^2}
        = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{t=1}^n (x_t - \bar{x})^2}.

Thus in simple regression h_i will be large if x_i is far removed from the bulk
of other points in the data.

Exercise 2.5
(a) We start from identity (2.32):

    (X_{(i)}^T X_{(i)})^{-1} = (X^T X - x_i x_i^T)^{-1}
        = (X^T X)^{-1} + \frac{(X^T X)^{-1} x_i x_i^T (X^T X)^{-1}}{1 - h_i}.

Premultiplying this equation by x_j^T and postmultiplying by x_k we
immediately obtain

    h_{jk(i)} = h_{jk} + \frac{h_{ik} h_{ij}}{1 - h_i}.                (2.65)

(b) If j = k identity (2.65) becomes:

    h_{j(i)} = h_j + \frac{h_{ij}^2}{1 - h_i}.                         (2.66)

Given that the second term on the right-hand side of equation (2.66) is
positive we have that h_{j(i)} \ge h_j.
(c) If the i_1 th and the i_2 th row of X are identical, equation (2.66) reduces
to:

    h_{i_1(i_2)} = \frac{h_{i_2}}{1 - h_{i_2}}.                        (2.67)

If 3 rows of matrix X (i_1, i_2, i_3) are identical:

    h_{i_1(i_2,i_3)} = \frac{h_{i_3}}{1 - 2h_{i_3}}.                   (2.68)

More generally, if m rows of matrix X (i_1, i_2, ..., i_m) are equal:

    h_{i_1(i_2,...,i_m)} = \frac{h_{i_m}}{1 - (m - 1)h_{i_m}}.         (2.69)

In order to prove that if m observations are equal h_{i_r} \le 1/m, r = 1, ..., m,
it is enough to notice that equation (2.69) is monotonically increasing in
[0, 1/m] and is exactly equal to 1 when h_{i_m} = 1/m.

Exercise 2.6

    (n - p - 1)s_w^2 = (y - X\hat{\beta} - w\hat{\gamma})^T(y - X\hat{\beta} - w\hat{\gamma})
        = y^T y - y^T X\hat{\beta} - y^T w\hat{\gamma}
          - \hat{\beta}^T X^T y + \hat{\beta}^T X^T X\hat{\beta} + \hat{\beta}^T X^T w\hat{\gamma}
          + \hat{\gamma} w^T X\hat{\beta} - \hat{\gamma} w^T y + \hat{\gamma}^2 w^T w
        = y^T y - y^T X\hat{\beta} - y^T w\hat{\gamma}
          + \hat{\beta}^T(-X^T y + X^T X\hat{\beta} + X^T w\hat{\gamma})   [= 0, equation (2.22)]
          + \hat{\gamma}(w^T X\hat{\beta} - w^T y + \hat{\gamma} w^T w)    [= 0, equation (2.23)]
        = y^T y - \hat{\beta}^T X^T y - \hat{\gamma} w^T y.

Now, using expressions (2.24) and (2.25),

    (n - p - 1)s_w^2 = y^T y - \{y^T X(X^T X)^{-1} - \hat{\gamma} w^T X(X^T X)^{-1}\}X^T y - \frac{w^T A y}{w^T A w} w^T y
        = y^T(I_n - H)y + \frac{w^T A y\, w^T H y - w^T A y\, w^T y}{w^T A w}
        = y^T A y - \frac{w^T A y\, w^T(I_n - H)y}{w^T A w}
        = y^T A y - \frac{(w^T A y)^2}{w^T A w}.

Exercise 2.7
From equation (2.33) we have:

    y_i - x_i^T\hat{\beta}_{(i)} = e_i + \frac{h_i}{1 - h_i}e_i = \frac{e_i}{1 - h_i}.

Using equation (2.32),

    1 + x_i^T(X_{(i)}^T X_{(i)})^{-1} x_i = 1 + h_i + \frac{h_i^2}{1 - h_i} = \frac{1}{1 - h_i}.

Using these results we can write

    r_i^* = \frac{y_i - x_i^T\hat{\beta}_{(i)}}{s_{(i)}\sqrt{1 + x_i^T(X_{(i)}^T X_{(i)})^{-1} x_i}}
          = \frac{e_i/(1 - h_i)}{s_{(i)}\sqrt{1/(1 - h_i)}}
          = \frac{y_i - \hat{y}_i}{s_{(i)}\sqrt{1 - h_i}}.

Exercise 2.8
Using equation (2.33) we can write:

    \hat{y}_i = x_i^T\hat{\beta} = x_i^T\left\{\frac{e_i}{1 - h_i}(X^T X)^{-1} x_i + \hat{\beta}_{(i)}\right\}
        = \frac{e_i}{1 - h_i}h_i + x_i^T\hat{\beta}_{(i)}
        = h_i(y_i - x_i^T\hat{\beta}_{(i)}) + x_i^T\hat{\beta}_{(i)}
        = (1 - h_i)x_i^T\hat{\beta}_{(i)} + h_i y_i.

Exercise 2.9
From equation (2.34) we can write:

    (n - p - 1)s_{(i)}^2 = (n - p)s^2 - s^2 r_i^2.

Rearranging we obtain:

    \frac{s}{s_{(i)}} = \sqrt{\frac{n - p - 1}{n - p - r_i^2}}.

Using this result and the identity in equation (2.61):

    r_i^{*2} = \frac{(n - p - 1)r_i^2}{n - p - r_i^2}.

If we recall that r_i^* has a t distribution on (n - p - 1) degrees of freedom
the result follows immediately.
2.9. Solutions 41

Exercise 2.10

    (n - p - 1)s_{(i)}^2 = y^T y - y_i^2 - \hat{\beta}_{(i)}^T(X^T y - x_i y_i).

Using equation (2.33),

    \hat{\beta}_{(i)}^T(X^T y - x_i y_i) = \hat{\beta}^T X^T y - \hat{y}_i y_i - \frac{e_i}{1 - h_i}(\hat{y}_i - h_i y_i),

so that

    (n - p - 1)s_{(i)}^2 = y^T y - \hat{\beta}^T X^T y - y_i^2 + \hat{y}_i y_i + \frac{e_i}{1 - h_i}(\hat{y}_i - h_i y_i)
        = (n - p)s^2 - \frac{e_i\{y_i(1 - h_i) - \hat{y}_i + h_i y_i\}}{1 - h_i}
        = (n - p)s^2 - \frac{e_i^2}{1 - h_i}.

This result is obtained much more easily using the mean shift outlier model.

Exercise 2.11
We must show that the product of (A - UV^T) with the right-hand side of
(2.31) gives the identity matrix.

    (A - UV^T)\{A^{-1} + A^{-1}U(I_m - V^T A^{-1}U)^{-1}V^T A^{-1}\}
        = I_p + U(I_m - V^T A^{-1}U)^{-1}V^T A^{-1} - UV^T A^{-1}
          - UV^T A^{-1}U(I_m - V^T A^{-1}U)^{-1}V^T A^{-1}
        = I_p - UV^T A^{-1} + U(I_m - V^T A^{-1}U)(I_m - V^T A^{-1}U)^{-1}V^T A^{-1}
        = I_p - UV^T A^{-1} + UV^T A^{-1} = I_p.

Exercise 2.12
We have:

    \hat{\beta}_{(i)} = (X_{(i)}^T X_{(i)})^{-1}(X^T y - x_i y_i).

Using equation (2.32),

    \hat{\beta}_{(i)} = \{(X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T(X^T X)^{-1}/(1 - h_i)\}(X^T y - x_i y_i)
        = \hat{\beta} + (X^T X)^{-1} x_i \frac{\hat{y}_i - (1 - h_i)y_i - h_i y_i}{1 - h_i}
        = \hat{\beta} + (X^T X)^{-1} x_i \frac{\hat{y}_i - y_i + h_i y_i - h_i y_i}{1 - h_i}
        = \hat{\beta} - (X^T X)^{-1} x_i \frac{e_i}{1 - h_i},

which is (2.33).

Exercise 2.13
Let H_0: E(Y) = X\beta and H_1: E(Y) = X\beta + d\theta. Under the normality
assumption the F statistic for testing H_0 versus H_1 is

    F = \frac{\{SS(e_0) - SS(e_1)\}/1}{SS(e_1)/(n - p - 1)},           (2.70)

where SS(e_j) is the residual sum of squares under the hypothesis H_j, j =
0, 1. Using the identity in equation (2.34) we find that

    F = \frac{e_i^2/(1 - h_i)}{s_{(i)}^2} = r_i^{*2}.                  (2.71)

Note that since F \sim F_{1,n-p-1}, r_i^* = F^{1/2} has a t distribution on n - p - 1
degrees of freedom, that is, t_{n-p-1}.
3
Regression

In this chapter we exemplify some of the theory of Chapter 2 for four sets
of data. We start with some synthetic data that were designed to con-
tain masked outliers and so provide difficulties for least squares diagnostics
based on backwards deletion. We show that the data do indeed present
such problems, but that our procedure finds the hidden structure.
The analysis of such data sets is much clearer than that of real data
where, often, ambiguities remain even after the most careful analysis. Our
other examples in this chapter are of increasing complexity and of increas-
ing number of variables. One complication is the choice of a suitable linear
model and the relationship between a misspecified model and our diag-
nostics. A second complication is that the last three examples also involve
the transformation of the response, combined with the choice of the linear
model. We choose the transformation in an informal manner. Our more
structured examples of choice of a transformation are left to Chapter 4.

3.1 Hawkins' Data


Table A.4 contains 128 cases, the last column being y and the other eight a
set of explanatory variables. The scatterplot matrix of the data in Figure 3.1
does not reveal an interpretable structure; there seems to be no relationship
between y and seven of the eight explanatory variables, the exception being
x_8. Some structure is however suggested by residual plots.


Figure 3.1. Hawkins' data: scatter plot matrix. The only apparent structure
involving the response is the relationship between y and x_8


Figure 3.2. Hawkins' data: normal plots of residuals. The least squares residuals
(left) seem to indicate six outliers and a nonnormal structure; there are 86 zero
LMS residuals (right)

The normal plot of least squares residuals in Figure 3.2(left) shows a cu-
riously banded symmetrical pattern, with six apparent outliers. The data
would seem not to be normal, but it is hard to know what interpretation to
put on this structure. For some kinds of data such patterns indicate that
the wrong class of models has been fitted. One of the generalized linear
models with nonnormal errors described in Chapter 6 might be appropri-
ate. Here we continue with regression and look at the normal plot of LMS
residuals. Figure 3.2(right) shows (on counting) that 86 residuals are virtu-
ally zero, with three groups of almost symmetrical outliers from the model.
Our forward search provides a transition between these two figures. More
helpfully, it enables us to monitor changes in residuals and parameter es-
timates and their significance as the apparent outliers are included in the
subset used for fitting.
Figure 3.3 is the forward plot of squared residuals, scaled as described in
§2.6.4 by the final estimate of \sigma^2. This shows three groups of residuals, the
fourth group, the 86 smallest, being so small as to lie on the y axis of the
plot. From m = 87 onwards, the 24 cases with the next smallest residuals in
Figure 3.2(right) enter the subset. The growth in the subset causes changes
in the other two groups of residuals; in particular, the most extreme obser-
vations become less so. After m = 110, the second group of outliers begins
to enter the subset and all residuals decrease. By the end of the process,
the six largest outliers, cases 19, 21, 46, 73, 94 and 111, still form a distinct
group, arguably more marked in Figure 3.3 than in Figure 3.2(left), which
is a normal plot of the residuals when m = n. At the end of the search, the
other groups of outliers are mixed together and masked.
The plot of residuals from the forward search reveals the structure of the
data. It is however not clear how the groups of outlying observations change
the fitted model. This is revealed by Figure 3.4(left), which shows how the
estimated coefficients change during the forward search. The values are
constant until m = 87, after which they mostly decline to zero, apart from


Figure 3.3. Hawkins' data: forward plot of scaled squared residuals. The three
groups of outliers are clearly shown, as is the effect of masking of some outliers
at the end of the search

the estimate of \beta_0, which oscillates wildly. Such changes in parameter esti-
mates, very different from those for Figure 1.14(right) for the transformed
wool data, are an indication of outliers or of a misspecified model.
The t statistics for the parameters are in Figure 3.4(right). Initially, when
the error is close to zero, the statistics are very large and off the scale of
the plot. As groups of observations with larger variance are introduced,
the statistics decrease until, at the end of the search, there is only one
significant term, that for regression on x_8, which was suggested by the
scatterplots of Figure 3.1.
Several other plots also serve to show that there are three groups of
outliers. Three are similar in appearance. Figure 3.5 shows the modified
Cook distances (2.57), which reflect the changes in parameter estimates as
the forward search progresses. The three peaks show the effect of the large
changes due to the initial inclusion of each group of observations. After
a few observations in the group have been included, further changes in
the parameter estimates become relatively unimportant and so the values
of the distances again become small. Figures 3.6(top) and (bottom) show
similar patterns, but in plots of the residuals. Figure 3.6(top) shows the
maximum studentized residual in the subset used for fitting (2.59). This
will be large when one or two outliers are included in the subset. Finally in
this group of three plots, Figure 3.6(bottom) shows the minimum deletion
residual at each stage (2.58), where the minimization is over those cases
not yet in the subset. The three peaks in the figure show the distance of the
nearest observation from the model that has been fitted so far. The first
peak is the largest because the variance of the first 86 cases is so small.
The declining shape of each peak is caused by the increase in s^2 as outliers
3.1. Hawkins' Data 47


Figure 3.4. Hawkins' data: forward plots of (left) parameter estimates and (right)
the t statistics. The outliers have an extreme effect on the parameter estimates


Figure 3.5. Hawkins' data: forward plot of modified Cook's distance. The first
large peak is caused by the introduction of the first outlier when m = 87. The
other two peaks indicate the first inclusion of outliers from the other two groups.
The decay of the curves is due to masking


Figure 3.6. Hawkins' data: forward plot of (top) the maximum studentized resid-
ual in the subset used for fitting (2.59) and (bottom) the minimum deletion
residual outside the subset (2.58). The effects of the three groups of outliers are
evident

are introduced during the search, which reduces the size of the deletion
residuals. At the end of the peaks there is nothing remarkable about the
values of the deletion residuals.
In this example the two plots of the residuals and that of the modified
Cook distances are very similar in structure. In other examples, not only
may the plot of the Cook distances be different from that of the residual
plots, but the two residual plots may also be distinct. These plots are one
way in which the forward search reveals the masked nature of the outliers.
Another is from forward residual plots such as Figure 3.3.
Another, very different, plot also serves to show the groups of outliers.
Figure 3.7 gives the behaviour of the estimate of \sigma^2 during the forward
search. Initially it is close to zero. Then the inclusion of each set of outliers
causes an increase in the estimate. The resulting plot is virtually in the form
of four line segments, one for each group of observations. The monotone
form of the plot indicates that the observations have been correctly sorted
by the forward search. Although, in this example, this plot produces no
new information about the grouping of outliers, in general we find such
plots unexpectedly helpful in determining the adequacy of models - jumps
or regions of small increase of s^2 can usually be directly interpreted as
evidence of a particular failure of the model. As we show, forward plots of
R^2 can likewise be surprisingly helpful.
Our analysis has led to the detection of four separate groups in the data.
However division of the observations into these groups does not reveal any
new structure. Figure 3.8 is the scatterplot matrix for the 86 observations
forming the group with smallest variance. The structure is not markedly


Figure 3.7. Hawkins' data: forward plot of s^2. Initially the estimate is virtually
zero. The four line segments comprising the plot correspond to the four groups
of observations


Figure 3.8. Hawkins' data: scatterplot matrix of the 86 nonoutlying observations.


The plot is very similar to Figure 3.1 for all the data


Figure 3.9. Hawkins' data: scatter plot of y against x_8. The filled squares are
the six largest outliers, the small crosses the 86 observations that first enter the
forward search

different from that of Figure 3.1. Finally, in Figure 3.9, we repeat one
panel of the scatterplot matrix of Figure 3.1, with the six largest outliers
highlighted. It can be seen that their effect on the relationship of y with
x_8 is likely to be mainly on the residual variance, rather than on the slope
of the fitted line. The plot of the t statistics in Figure 3.4(right) quantifies
this impression.
The clear nature of the outlier structure of these data is in sharp contrast
to that of the remaining examples of this chapter, where problems of which
terms to include in the linear model, and whether to transform the response,
are intertwined with those of the detection of outliers and of influential
observations.

3.2 Stack Loss Data


Table A.5 gives the stack loss data taken from Brownlee (1965, p. 454).
There are observations from 21 days of operation of a plant for the oxidation
of ammonia as a stage in the production of nitric acid. The variables are:
x_1: air flow
x_2: cooling water inlet temperature
x_3: 10 x (acid concentration - 50)
y: stack loss; 10 times the percentage of ingoing ammonia escaping
unconverted.
The air flow (x_1) measures the rate of operation of the plant. The nitric
oxides produced are absorbed in a countercurrent absorption tower; x_2


Figure 3.10. Stack loss data: first-order model, forward plot of scaled squared
residuals. Observations 1, 3, 4 and 21 have large residuals for most of the search

is the inlet temperature of cooling water circulating through coils in this


tower and x_3 is proportional to the concentration of acid in the tower.
Small values of the response correspond to efficient absorption of the nitric
oxides.
These data have been much analyzed as a testbed for methods of robust
regression and outlier detection. A summary of many analyses is given by
Atkinson (1985, p. 266). One set of conclusions is that observations 1, 3, 4
and 21 are outliers.
We begin our analysis by fitting a model in the three explanatory vari-
ables to the untransformed data. The resulting forward plot of squared
residuals is shown in Figure 3.10. Observations 1, 3, 4 and 21 do indeed
stand out as having large residuals, particularly at the beginning of the
search. To investigate the leverage of the observations we give in Figure 3.11
a plot of the values from (2.56) calculated during the forward search. For
each value of m we calculate only the leverages of the observations in the
subset. Initially m is equal to four; there are four parameters and each ob-
servation has leverage one. Thereafter the leverages of these observations
decline. The last five observations to enter from m = 17 are 2, 1, 3, 4 and
21. Of these, observations 2, 1 and 21 have relatively high leverages, al-
though that of observation 17, which is in the initial subset, is highest for
most of the search.
On its own, leverage is not particularly informative about the behaviour
of the units, since it does not depend on the observed value of y. One
aspect of the effect of these observations on the fitted model is shown in
Figure 3.12, which shows the behaviour of the t statistics for the coefficients
during the forward search. As we saw in a much more dramatic form for


Figure 3.11. Stack loss data: first-order model, forward plot of leverage. As well
as observations 1, 2 and 21, observation 17 has high leverage


Figure 3.12. Stack loss data: first-order model, t values for the coefficients. From
m = 7 onwards, that for \beta_3 is never significant


Figure 3.13. Stack loss data: first-order model in x_1 and x_2, forward plot of
leverage. The highest curve for most of the search is for observation 12

Hawkins' data in Figure 3.4(right), the statistics decrease in value as the


search progresses and observations further from the initial fitted model are
introduced. Throughout the search the statistic for the coefficient of X3 in
Figure 3.12 is nonsignificant. We therefore drop X3 from the model and
refit.
Removing a nonsignificant variable from the equation has a negligible
effect on the fitted values: the new forward plot of residuals is similar to
Figure 3.10 and so is not shown here. However the leverages may change,
since they are a function only of the x values, and do so here. Figure 3.13
shows that observation 17 is no longer one of high leverage and that obser-
vations 1 and 2 have the same highest leverage at the end of the search,
slightly higher than that of observation 21. That all is not well with this
model is indicated by several plots, for example, that of R^2 in Figure 3.14,
which shows a local maximum at m = 17.
One method of improving the model is to include second-order terms in
the two explanatory variables. Not all are needed: the resulting model has a
significant term in x_1 x_2 (t = 2.69 on 16 degrees of freedom, p = 1.6%) and
an almost significant one in x_1^2 (t = -1.97), both of which we include. The
forward plot of the squared residuals, Figure 3.15, shows that observations
1, 2, 3 and 4 all have comparatively high residuals, until m = 18. The end
of the plot reveals the degree of masking of these outliers. The residual for
observation 21 is small. This suggests that much of the evidence for the
extra terms may be coming from observation 21, which is nOw in the initial
subset. Further evidence that observation 21 is very important for the fit
is that it has a high leverage throughout the forward search, as is shown in
the plot of Figure 3.16. The final leverage of the observation is 0.869.
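The t values quoted for the second-order terms come from an ordinary least squares fit. As an illustration, a minimal Python sketch of the calculation is given below; the arrays y, x1 and x2 for the stack loss data are assumed to be already loaded, and the function name is ours.

import numpy as np

def ols_t_stats(X, y):
    """t statistics for the columns of X (X must include the constant column)."""
    n, p = X.shape
    G = np.linalg.inv(X.T @ X)
    beta = G @ X.T @ y
    s2 = np.sum((y - X @ beta) ** 2) / (n - p)    # residual mean square
    return beta / np.sqrt(s2 * np.diag(G))

# second-order model: constant, x1, x2, x1*x2, x1^2 (x3 already dropped)
# X2 = np.column_stack([np.ones_like(y), x1, x2, x1 * x2, x1 ** 2])
# print(ols_t_stats(X2, y))   # the x1*x2 and x1^2 entries should match the values quoted above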

Figure 3.14. Stack loss data: first-order model in x1 and x2, forward plot of R2. The local maximum at m = 17 indicates disagreement between model and data


Figure 3.15. Stack loss data: second-order model with terms in x1, x2, x1x2 and x1^2. Forward plot of scaled squared residuals. Observations 1, 2, 3 and 4 initially have large residuals, but there is appreciable masking at the end of the search. Observation 21 has a small residual throughout

Figure 3.16. Stack loss data: second-order model, forward plot of leverages.
Observation 21 always has a leverage close to one

That the resulting model is misspecified is shown by the value of -5.724 for the score test for transformation. Figure 3.17 shows that this evidence is caused by the last four observations to enter the search. Again these are 4, 2 and 1, with 3 the last to enter.
Since the response variable is 10 × the percentage of NH3 escaping unconverted, the values cannot be negative. It is therefore likely that a transformation will be needed. The log transformation is suggested by the statistic based on all the data.
We start with the same five-parameter model with log y as response, so that x1x2 and x1^2 are again included and there are no terms involving x3. However, with this transformation, the extra terms depend crucially
on a few observations. The forward plot of residuals in Figure 3.18 shows
that, on this transformed scale, observations 4 and 21 have large residuals.
Figure 3.19 shows that the last two cases to enter the forward search (again
4 and finally 21) are providing nearly all the evidence for any nonzero
coefficients in the model. The forward plot of the modified Cook statistics,
Figure 3.20, confirms the impression of the appreciable effect of the last
two cases on the parameter estimates. If these cases are correct then they
are highly informative.
An alternative is to investigate models with response log y, but without the second-order terms. The forward plot of the t statistics, which is not given, shows that the three coefficients of the first-order model in x1 and x2 are all significant when all 21 observations are fitted and that this result is not sensitive to the last few observations to enter the forward search.

Figure 3.17. Stack loss data: second-order model, score test for transformation.
The evidence for a transformation depends on only the last four observations to
enter


Figure 3.18. Stack loss data: second-order model with log y , forward plot of scaled
residuals, which are large for observations 4 and 21

Figure 3.19. Stack loss data: second-order model with log y, forward plot of t statistics. The bands define the 1% region. The evidence for the second-order terms (x1^2 and x1x2) comes from the last two observations to enter, 4 and 21


Figure 3.20. Stack loss data: second-order model with log y, forward plot of
modified Cook statistic. Confirmation of the importance of observations 4 and 21

Figure 3.21. Stack loss data: first-order model in x1 and x2, with response log y. Forward plot of R2. The local maximum at m = 18 is evidence that all is still not well

However the plot of R2 during the forward search, Figure 3.21, has a local maximum at m = 18, an indication that there is still some model mis-specification. The forward plot of the residuals in Figure 3.22 is more stable than those we saw earlier, but shows much change in the ordering of the residuals towards the end of the search. In addition, the last two observations to enter the forward search (again 4 and 21) affect the score statistic for the transformation in Figure 3.23. If they are deleted there is evidence to reject the log transformation.
Since the score statistic is a little high, we try a lesser transformation and take λ = 0.5, the square root transformation, again with the first-order
model. This transformation is satisfactory for all the data, Figure 3.24, as
measured by the score statistic, and the coefficients in Figure 3.25 are
slightly more significant than those for the logged response, which are not
plotted. The plot of R2 , Figure 3.26, compares favourably with Figures 3.14
and 3.21, showing that the search has ordered the data correctly according
to the model. But the well-behaved curve shows that the inclusion of cases
4 and 21 causes the value of R2 to drop from 0.977 to 0.921 (the value for
the logged response with all 21 observations is slightly lower at 0.903). The
forward plot of residuals , Figure 3.27, suggests that these two cases may be
outliers from this first-order model. However the forward plot of residuals is
appreciably more stable than any others we have seen, including Figure 3.18
(for the logged response and a second-order model), which is similar, for
most of the search, apart from the dramatic change for observation 21 at
the end of the search.
As a result of the forward analysis we are able to identify the crucial
role of cases 4 and 21. If they are accepted, then a second-order model
with logged response is appropriate, yielding an R2 value of 0.950.

Figure 3.22. Stack loss data: first-order model in x1 and x2, with response log y. Forward plot of scaled residuals. There is still some change towards the end of the search


Figure 3.23. Stack loss data: first-order model and log y, forward plot of score test
for transformations. Deletion of observations 4 and 21 would lead to rejection of
the log transformation

Figure 3.24. Stack loss data: first-order model in x1 and x2, and √y, forward plot of the score test for transformation. The square root transformation is acceptable throughout the search


Figure 3.25. Stack loss data: first-order model and √y, forward plot of the t tests for the parameters. All are now significantly different from zero

Figure 3.26. Stack loss data: first-order model and √y, forward plot of R2. The data and model now agree


Figure 3.27. Stack loss data: first-order model and √y, forward plot of scaled residuals. This stable plot suggests that observations 4 and 21 may be outliers, but that they are not influential for the fitted model

If they are rejected, a first-order model with the square root of y as response is a good model. The choice between these two depends on nonstatistical considerations. Ideally, more data would be obtained to resolve the status of these two anomalous or influential cases.
We return to the analysis of these data in the next chapter and show
how our systematic approach to the analysis of transformations leads to
a sharper version of these conclusions in that other transformations can
be confidently rejected and the relation between linear model and trans-
formation can be elucidated. However both analyses exhibit the power of
the forward search in distinguishing between outliers and influential obser-
vations and the variety of tools it makes available for model criticism and
building.

3.3 Salinity Data


These data, listed in Table A.6, are taken from Ruppert and Carroll (1980).
There are 28 observations on the salinity of water in the spring in Pamlico
Sound, North Carolina. Analysis of the data was originally undertaken as
part of a project for forecasting the shrimp harvest. The response is the
biweekly average of salinity. There are three explanatory variables: the
salinity in the previous two-week time period, a dummy variable for the
time period during March and April and the river discharge. Thus the
variables are:

x1: salinity lagged two weeks
x2: trend, a dummy variable for the time period
x3: water flow, that is, river discharge
y: biweekly average salinity.

Use of lagged salinity as one of the explanatory variables in a multiple regression model means that we ignore the errors of measurement in this variable. Because the readings are taken over only six two-week periods, the value of lagged salinity x1i is not necessarily equal to y_{i-1}.
The data seem to include one outlier. This could either be omitted, or
changed to agree with the rest of the data. We make this change and
use the forward search to show that the "corrected" observation is not
in any way outlying or influential. The ensuing analysis is comparatively
straightforward, although again a transformation of the response may be
appropriate, but with further implications for appropriate models.
We start with a first-order model and the original data. The forward plot
of residuals, Figure 3.28, shows case 16 as an outlier with a huge residual
for most of the search. It is the last case to enter the subset: when it does so,
the residual is no longer outstandingly large. This behaviour is explained
by the forward plot of leverages, Figure 3.29. This shows that when case 16

Figure 3.28. Salinity data: forward plot of scaled residuals. Observation 16 has a
very large residual for most of the search

enters, it does so with a leverage of 0.547; a small least squares residual is then not surprising. A strange feature of the plot is that case 15 has high leverage up to subset size m = 13, but then leaves the subset until m = 26.
The cause of the large residual for case 16 is evident from the scatterplot matrix in Figure 3.30. It is clear that the value of x3 for case 16 of 33.443 is out of line with the other values of x3 and y. One possibility is to delete
case 16 and repeat the analysis. We follow an alternative, assuming that the
value is a misprint for 23.443. In the absence of any external knowledge this
might seem dangerous. A discussion of various strategies is given on page
49 of Atkinson (1985). But our purpose here is to show how our forward
method enables us to check whether the modified case 16 is in any way
influential or outlying.
We now repeat the analysis for the corrected data; the residuals are
plotted in Figure 3.31. The correction of observation 16 has little effect
on the residuals for the remaining observations: the most extreme positive
residuals are still those from observations 5, 23 and 24. Likewise the most
extreme negative values are from observations 8 and 1. Case 16 does not
figure as interesting in the plot.
In this model the t statistic for β2 is a nonsignificant -1.17. The forward plot of the t statistics in Figure 3.32(left) shows that the significance of this variable is marginal for much of the forward search, being most significant between m = 13 and m = 18. This is the initial period of decline of the three large positive residuals in Figure 3.31. In fact the scatter plots in Figure 3.30 do not show any overall relationship between y and x2, compared with that for x1 and x3.

Figure 3.29. Salinity data: forward plot of leverages. Observation 16 is the last to enter the search, with a high leverage


Figure 3.30. Salinity data: scatterplot matrix. The value of x3 for observation 16 seems suspect

Figure 3.31. Corrected salinity data: forward plot of scaled residuals . Observation
16 is not evident


Figure 3.32. Corrected salinity data: (left) forward plot of t statistics; β2 is of marginal significance; (right) scatter plot of salinity y against lagged salinity, x1

Figure 3.33. Corrected salinity data: model in x1 and x3, forward plot of scaled residuals

If we drop x2 and refit we obtain a model with two terms that are highly significant at the end of the search: t1 = 10.17 and t3 = -5.30. The plot of forward residuals in Figure 3.33 is appreciably more stable than the plot of Figure 3.31. The three largest absolute residuals are for observations 9, 15 and 17. It is interesting to see where these observations are on a scatterplot of y against x1. As Figure 3.32(right) shows, they lie on the outside of a
band of observations with a clear relationship between the two variables.
Whether fitting is by least median of squares, as at the beginning of the
forward search, or by least squares, as at the end, they will always have
large residuals. The stability in the forward plot of residuals in Figure 3.33
confirms this interpretation.
Finally we consider transformation of the data. Since the response, salin-
ity, is a nonnegative quantity, a transformation is physically plausible. For
the three-variable model the approximate score test for transformation is
-1.61. As Atkinson (1985, p. 122) shows, this value depends critically on
observation 3. If it is deleted the value is -2.50, evidence of the need for a
transformation. But if observation 5 is then deleted, the statistic becomes
a nonsignificant -1.148. For the model with just x1 and x3 there is no evidence of the need for a transformation. For all observations the score statistic equals -1.465. When observation 3 is deleted it is -1.77. There are two reasons for not pursuing this analysis further.
One reason is that observations 3 and 5 are the two smallest, and so can be expected to be informative about transformations. The other, more important, reason is that the values of x1 are the values of y lagged by one period. A sensible approach is then to consider joint transformation of y and x1 with the same parameter. We give an example of a joint transformation of the response and an explanatory variable for data on mussels in §4.10.

3.4 Ozone Data


The data in Table A.7 are the first 80 observations on a series of measure-
ments of ozone concentration and meteorological variables in California,
starting from the beginning of 1976. The full set of 300 observations were
used by Breiman and Friedman (1985) when introducing the ACE algo-
rithm for finding transformations in multiple regression. The data are given
in the supporting material for Cook and Weisberg (1994a); (diskette or web
site), with a scatter plot of two variables on p. 25, in which it is clear that
ozone concentration is related to daily temperature. As Cook and Weisberg
comment, it is not clear whether the temperature is the daily average or
maximum. In fact, Breiman and Friedman (1985) are not explicit about
the properties of the data. The following list of variables is copied from
their Appendix C.

x1: Sandburg airforce base temperature, °C
x2: inversion base height (feet)
x3: Daggett pressure gradient (mm Hg)
x4: visibility (miles)
x5: Vandenburg 500 millibar height (m)
x6: humidity (percent)
x7: inversion base temperature, °F
x8: wind speed (mph)
y: Upland ozone concentration, ppm (maximum one hour average).

All measurements were taken in the region of Upland, California, East of Los Angeles. The complete data contain several values of 93 for x1, which, despite the fact that such data are often collected because of a concern with global warming, presumably must be °F, not °C.
In this example we show how information from the forward search can
be combined with the standard methods of model building.
We begin by regressing ozone concentration on the eight explanatory
variables. Figure 3.34 shows the forward plot of the residuals - the pattern
is stable, with some large residuals, particularly cases 53 and 71. These
are the last two cases to enter the forward search and give large values
for the minimum deletion residual just before they enter. They also cause
appreciable increases in the plots of the measures of kurtosis and nonnor-
mality of the residuals. There is thus some evidence of either the presence
of outliers or systematic nonnormality. The plot of leverages, which we do
not include, shows no evidence of leverage points. The two units with the
highest leverages are 33 and 58; while the first is included in the initial
subset, the other enters much later.
The plot of modified Cook distances from the forward search, Fig-
ure 3.35(left), shows no structure; there is no evidence of any cases being
influential for the estimation of the parameters of the linear model.

Figure 3.34. Ozone data: forward plot of scaled residuals


Figure 3.35. Ozone data: (left) forward plot of modified Cook distances; (right)
score test for transformation of the response

Figure 3.36. Logged ozone data: studentized residuals against day of the year.
There is a clear upward time trend

A surprising feature of the fitted model is that none of the t tests for the
coefficients are significant at the end of the search, the most extreme value
being -1.32, with an R2 value of 0.430. One reason for this seemingly poor
fit may, of course, be that there is no relationship between ozone concen-
tration and the eight measured variables. Another may be that some of
the variables are highly correlated, so that the coefficients are poorly de-
termined, with large variances and correspondingly small t values. There
is some evidence for this in the value of R2, which is not approximately
zero. Another, not exclusive, possibility is that there is some systematic
misspecification of the model. In fact, the plot of the score statistic for
transformation of the response, Figure 3.35(right), indicates that , after half
the cases have been included in the forward search, there is evidence of the
need for a transformation. The significance of the statistic grows steadily,
there again being no evidence of any especially influential observations.
There is thus at least one systematic failure of the model.
As we see in Chapter 4, an appropriate transformation is found by taking
log y as the response. We repeat the regression and check once more for
outliers and influential observations. The forward search still does not reveal
any strange features. The score statistic for the transformation lies within
the bounds ±2.58 throughout, although it tends to be in the lower half of
the region. However the data are in time order, so it is sensible to check
whether there is a misspecification due to the presence of a time trend.
Figure 3.36 shows the residuals against observation number. There is a clear
upward trend, so we include a term in observation number in the model.
This new term has a t value of 6.60 and R2 = 0.696.

Table 3.1. Ozone data: selection of variables using t statistics

Model Number      1      2       3       4       5       6       7
Response          y      y   log y   log y   log y   log y   log y
Terms
Constant      -0.08  -1.90   -2.64   -2.69   -2.80   -4.22   -4.32
Time                  6.19    6.60    6.64    6.81    7.25    7.08
x1             1.29   0.68    1.03    0.98    1.00
x2            -0.77  -0.96   -1.72   -2.50   -2.77   -2.88   -3.45
x3             1.23  -0.74   -0.80   -0.71
x4            -0.90  -2.06   -1.69   -1.69   -1.57   -1.64
x5             0.06   1.80    2.80    2.96    3.10    5.01    5.06
x6             1.00   2.12    1.78    1.83    1.75    1.71    2.13
x7             0.80   0.26   -0.37
x8            -1.32  -1.30   -1.98   -2.05   -2.08   -2.05   -2.19
R2            0.430  0.632   0.696   0.696   0.693   0.689   0.678

The reduction in the residual sum of squares caused by this new term increases the significance of the t tests for the other terms. However, as the results in Table 3.1 show, only that for x5 is significant at the 5% level.
Since the data appear well behaved, we now use a standard backwards model-building procedure to find a sensible linear model. Once we have found a model, we then check it using the forward search. Backwards elimination, using the values of the t statistics from refitting each time a variable is removed, leads to successive elimination of x7, x3, x1 and x4. The final model thus contains an intercept, the time trend, and x2, x5, x6 and x8, all terms being significant at the 5% level. The value of R2 is 0.678, hardly decreased by the omission of four variables. The details of the t statistics are in Table 3.1.
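The backwards procedure is easily coded. The Python sketch below is ours, not part of any package; it assumes the columns of X carry names in a list, the protected terms (here the constant and the time trend) are a choice rather than a rule, and the threshold of 2.0 is a rough stand-in for the 5% point of the t distribution.

import numpy as np

def backward_elimination(X, y, names, t_crit=2.0, keep=("constant", "time")):
    """Drop, one at a time, the candidate term with the smallest |t| until all
    remaining candidates have |t| >= t_crit; return surviving names and t values."""
    cols = list(range(X.shape[1]))
    while True:
        Xc = X[:, cols]
        n, p = Xc.shape
        G = np.linalg.inv(Xc.T @ Xc)
        beta = G @ Xc.T @ y
        s2 = np.sum((y - Xc @ beta) ** 2) / (n - p)
        t = beta / np.sqrt(s2 * np.diag(G))
        candidates = [(abs(t[i]), c) for i, c in enumerate(cols) if names[c] not in keep]
        if not candidates or min(candidates)[0] >= t_crit:
            return [names[c] for c in cols], t
        cols.remove(min(candidates)[1])        # refit without the weakest term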
We now repeat the forward search to check that all cases support this
model. Figure 3.37 shows the forward plot of residuals, which is very stable,
with cases 65 and 31 showing the largest residuals for most of the search.
There is still no evidence of any high leverage points, Figure 3.38. This
result is to be expected; although it is possible that removal of carriers
can cause appreciable changes in leverage, as it did in our analysis of the
stack loss data, the requisite structure of the data is unlikely. The plot of
the score statistic, Figure 3.39(left), shows that all data agree with the log
transformation. The final plot, Figure 3.39(right) , is of the t statistics for
those terms included in the final model. They show the typical decline in
absolute value as cases are introduced that are increasingly less close to
the model. But the curves are smooth, giving no sign of any exceptionally
influential cases.

Figure 3.37. Logged ozone data, final model (x2, x5, x6, x8 and time trend): forward plot of scaled residuals, showing appreciable stability


Figure 3.38. Logged ozone data, final model (x2, x5, x6, x8 and time trend): forward plot of leverages. There are no observations with unduly high leverage

Figure 3.39. Logged ozone data, final model (x2, x5, x6, x8 and time trend): (left) score test for transformation; the log transformation is acceptable; (right) forward plot of t statistics for parameters; there are no effects of outliers (the central band is at ±2.58)

We also monitored a large number of other quantities, such as the modified Cook's distance, normality and kurtosis plots, and the values of R2 and s2. None showed any features indicative of the influence of individual cases.
Further models could be tried that would include interactions between
some of the variables. Such models could be checked, in the manner ex-
emplified here, by use of the forward search. It is a feature of the ACE
specification of Breiman and Friedman (1985) that the models are natu-
rally additive, so that their paper does not consider models with interaction
terms. Although they analyzed all 300 observations, their analysis and ours
have two interesting points of contact.
The results of Breiman and Friedman's analysis are presented graphically
on their page 588. The left panel shows the transformation for ozone, which
is slightly concave, close to an affine transformation of log y. The second
is their function of day of the year, which rises sharply to a peak near day
120 and then declines more slowly to reach the initial value towards the
end of the year, which is outside the range of the data. Over the period we
have studied, the first 80 days, the increase in this function is almost linear
and so matches our linear regression on day of the year. Over a longer time
period we would expect to have to use a quadratic polynomial in time, or
perhaps a cubic to satisfy the asymmetry noted by Breiman and Friedman.
If data from several years were to be analysed, the set of functions chosen
for the effect of time should be cyclical. Sines and cosines are a natural first
choice.
We return briefly to the analysis of these data in Chapter 4, where we
employ our systematic approach to the selection of a power transformation.

3.5 Exercises
Exercise 3.1 Find the least median of squares estimate of location for the
two samples reported in Table 3.2 (§3.1):

Table 3.2. Two samples of observations

Sample 1 Sample 2
192 192
134 134
124 124
128 128
201 1201
120 120
186 186
204 1204

Exercise 3.2 Hawkins et al. (1984) give some synthetic regression data with n = 75 and three explanatory variables. The purpose is to illustrate the problems outliers at leverage points can cause for least squares. A robust analysis of the data is given by Rousseeuw and Leroy (1987, p. 93).
Regress y on the three explanatory variables using least squares. What do you see from a normal plot of the residuals? Also try index plots of residuals and leverages. What do you conclude? Now try other least squares methods for model checking. What does a robust analysis add (§3.1)?
Exercise 3.3 What do you think the forward plot of leverages looks like
when a first-order model is fitted to the wool data (§3.1)?
Exercise 3.4 Estimate the parameters for Hawkins ' data in §3.1 using
only observations 11, 15, 37, 51, 58, 70, 96, 109, 114, 97, which were
among the first to enter in our forward search. You should now be able to
divide the data into the groups shown in Figure 3.3. How does the QQ plot
of residuals change as each successive group is included in the fit? Comment
on the differences between Figure 3.3 and the forward plot of the residuals
in Figure 3.40 (§3.1).
Exercise 3.5 Using the mean shift outlier model, or otherwise, calculate
the F test for the hypothesis that observations 4 and 21 are outlying when
the stack loss data, using only variables 1 and 2, are analyzed with response √y. Comment on the significance level of your test (§3.2).
Exercise 3.6 Table 3.3 gives demographic data about 49 countries taken from Gunst and Mason (1980, p. 358). The variables are

Figure 3.40. Hawkins data: forward residual plot

x1 = INFD = Infant deaths per 1000 live births
x2 = PHYS = Number of inhabitants per physician
x3 = DENS = Population per square kilometer
x4 = AGDS = Population per 1000 hectares of agricultural land
x5 = LIT = Percentage literate of population aged 15 and over
x6 = HEID = Students enrolled in higher education per 100,000 population
y = GNP = Gross national product per capita.
Compare the index plot of leverage for all observations with those when
observations 17 and 39 are separately deleted.
What are the most important variables (§3.4)?
Exercise 3.7 In §3.4 models were found for the ozone data using backwards selection on t statistics. This search traces one path through the 2^8 models found by including all combinations of variables. Compare the results of model building with those obtained using a best subsets regression routine, such as that in the statistical package Minitab.
We also took time as another variable. What happens if this variable is included in the selection procedure (§3.4)?

3.6 Solutions
Exercise 3.1
The least median of squares estimate of location is the midpoint of the shortest half (Rousseeuw and Leroy 1987, p. 169).

Table 3.3. Demographic data


Country INFD PHYS DENS AGDS LIT HEID GNP
1 19.5 860 21 98.5 856 1316
2 37.5 695 84 1720 98.5 546 670
3 60.4 3000 548 7121 91.1 24 200
4 35.4 819 301 5257 96.7 536 1196
5 67.1 3900 3 192 74 27 235
6 45.1 740 72 1380 85 456 365
7 27.3 900 2 257 97.5 645 1947
8 127.9 1700 11 1164 80.1 257 379
9 78.9 2600 24 948 79.4 326 357
10 29.9 1400 62 1042 60.5 78 467
11 31 620 108 1821 97.5 398 680
12 23.7 830 107 1434 98.5 570 1057
13 76.3 5400 127 1497 39.4 89 219
14 21 1600 13 1512 98.5 529 794
15 27.4 1014 83 1288 96.4 667 943
16 91.9 6400 36 1365 29.4 135 189
17 41.5 3300 3082 98143 57.5 176 272
18 47.6 650 108 1370 97.5 258 490
19 22.4 840 2 79 98.5 445 572
20 225 5200 138 2279 19.3 220 73
21 30.5 1000 40 598 98.5 362 550
22 48.7 746 164 2323 87.5 362 516
23 58.7 4300 143 3410 77 42 316
24 37.7 930 254 7563 98 750 306
25 31.5 910 123 2286 96.5 36 1388
26 68.9 6400 54 2980 38.4 475 356
27 38.3 980 1041 8050 57.6 142 377
28 69.5 4500 352 4711 51.8 14 225
29 77.7 1700 18 296 50 258 262
30 16.5 900 346 4855 98.5 923 836
31 22.8 700 9 170 98.5 839 1310
32 71.7 2800 10 824 38.4 110 160
33 20.2 946 11 3420 98.5 258 1130
34 54.8 3200 15 838 65.7 371 329
35 74.7 1100 96 1411 95 351 475
36 77.5 1394 100 1087 55.9 272 224
37 52.4 2200 271 4030 81 1192 563
38 75.7 788 78 1248 89 226 360
39 32.3 2400 2904 108214 50 437 400
40 43.5 1000 61 1347 87 258 293
41 16.6 1089 17 1705 98.5 401 1380
42 21.1 765 133 2320 98.5 398 1428
43 30.5 1500 305 10446 54 329 161
44 45.4 2300 168 4383 73.8 61 423
45 24.1 935 217 2677 98.5 460 1189
46 26.4 780 20 399 98 1983 2577
47 35 578 10 339 95 539 600
48 33.8 798 217 3631 98.5 528 927
49 100 1637 73 1215 77 524 265

Figure 3.41. Hawkins et al (1984) data: (left) residuals against fitted values and
(right) QQ plot of residuals

The shortest half of a sample of n observations is defined by the smallest of the differences

y_h - y_1,  y_{h+1} - y_2,  ...,  y_n - y_{n-h+1},

where h = [n/2] + 1 and y_1 ≤ y_2 ≤ ... ≤ y_n are the ordered observations.
In our first sample the intervals are 186 - 120 = 66, 192 - 124 = 68,
201 - 128 = 73 and 204 - 134 = 70. The shortest half is 66 and the least
median of squares estimate of location is 0.5(120 + 186) = 153. This is also
the least median of squares estimate of location of the second sample. Note
that the presence of the two outliers in the second sample (1201 and 1204)
does not affect the LMS estimate of location.
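The calculation is short enough to code directly. The following Python sketch of the shortest-half rule, using the two samples of Table 3.2, is ours; the function name is illustrative only.

import numpy as np

def lms_location(y):
    """Least median of squares estimate of location: midpoint of the shortest half."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    h = n // 2 + 1                       # h = [n/2] + 1
    widths = y[h - 1:] - y[:n - h + 1]   # y_{i+h-1} - y_i for each contiguous half
    i = int(np.argmin(widths))
    return 0.5 * (y[i] + y[i + h - 1])

sample1 = [192, 134, 124, 128, 201, 120, 186, 204]
sample2 = [192, 134, 124, 128, 1201, 120, 186, 204]
print(lms_location(sample1), lms_location(sample2))   # 153.0 for both samples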

Exercise 3.2
The QQ plot of studentized residuals in Figure 3.41 (right) shows that there
are four outlying observations. The other panel, of residuals against fitted
values, again shows these four observations, as well as a further group of
10 outliers. The scatterplot matrix in Figure 3.42 clearly shows the three
groups of observations.
It seems that robust estimation adds little here to the identification of
outliers, although a forward search makes it possible to monitor the effect
of these outliers on, for example, t statistics.

Exercise 3.3
As Figure 3.43 shows, the factorial structure of the data becomes apparent
as the search progresses. At the end of the search the four leverage values
come from groups of points with respectively 0, 1, 2 and 3 nonzero coordi-
nates. The smallest leverage, 1/n, is for the centrepoint of the design. See Farrell et al. (1967) or Atkinson and Donev (1992, p. 129).

Figure 3.42. Hawkins et al (1984) data: scatterplot matrix. The three groups of
observations can be clearly identified


Figure 3.43. Wool data: forward plot of leverage



Exercise 3.4
The QQ plot of residuals from the fit using these 10 observations is similar
to that for the LMS residuals in Figure 3.2 (why?). As successive groups of
observations are introduced there is a gradual change to the least squares
residuals in the left-hand panel and the groups merge. The forward plot
of residuals in Figure 3.40 shows the four groups of residuals more clearly
than does the plot of squared residuals in Figure 3.3, which emphasizes the
largest residuals.

Exercise 3.5
The mean shift outlier model leads to the F test comparing residual sums
of squares with and without the two observations. Numerically, the F test
is
F = [{SS(0.5, 21) - SS(0.5, 19)}/2] / {SS(0.5, 19)/16} = {(1.9666 - 0.5396)/2} / (0.5396/16) = 21.16,
where SS(0.5, 19) is the residual sum of squares when the square root of
the response is taken and observations 4 and 21 are removed. SS(0.5,21)
is the residual sum of squares using all the units.
The significance level of this result is very high: F2,16(0.001) = 10.97 and
we observe 21.16. However the forward procedure has found two observa-
tions that give a large reduction in the residual sum of squares when they
are deleted. If the true value of the significance mattered, some allowance
should be made for the effect of selecting the observations, either by using
the Bonferroni inequality, for example, Cook and Prescott (1981), or by
simulation. However here the effect of the two observations is so large that
such refinements are not required.
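The arithmetic of the test is reproduced by the short calculation below, using the residual sums of squares quoted above.

# Mean shift outlier model: F test for deleting observations 4 and 21
ss_all = 1.9666        # residual SS, sqrt(y) response, all 21 observations
ss_del = 0.5396        # residual SS with observations 4 and 21 removed
F = ((ss_all - ss_del) / 2) / (ss_del / 16)
print(round(F, 2))     # 21.16, to be compared with F_{2,16}(0.001) = 10.97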
Exercise 3.6
Figure 3.44 gives three index plots of leverages. Observation 27 has the
highest leverage, but the pair of observations 17 and 39 have, together
with observation 20, the next highest leverages, almost 0.5. If either of the
pair of observations is deleted the other has a leverage close to one, as the
lower two plots of the figure show.
The reason for this behaviour is clear from the scatter plots of Figure 3.45. The two units are close together and remote from all others in their values of x3 and x4. Their leverages therefore illustrate the theoretical results of Exercise 2.5. The countries concerned are Hong Kong and Singapore.
Table 3.4 shows the results of the backward selection of variables using t statistics. On this untransformed scale only x5 and x6 are included in the final model.
Exercise 3.7
In the Minitab output below the response C12 is logged ozone concentration, C11 is day and C2 to C9 correspond to x1 to x8.

Figure 3.44. Demographic data: index plots of leverage. The top panel is for all the data. The other two panels show what happens when observations 17 and 39, respectively, are deleted


Figure 3.45. Demographic data: three scatter diagrams showing why units 17 and
39 have high leverage

Table 3.4. Demographic data: selection of variables using t statistics (untransformed response)

Model number        1        2        3        4        5
x1            -1.7322  -1.7307  -1.7499  -1.5825
x2             0.2727
x3            -0.4620  -0.5039  -0.7917
x4             0.2830   0.3159
x5             1.4877   1.7063   1.7122   2.1927   3.7345
x6             4.1474   4.1843   4.3011   4.3346   4.3923
R2             0.584    0.583    0.582    0.576    0.553

The models with 6 or 5 variables, plus a constant (p parameters in all), seem best, as judged by the value of Cp (Weisberg 1985, p. 216), which should be near to p. However inspection of the t values for the six-variable model leads to dropping x4 and to the five-variable model indicated in the output.
This is the same model as we obtained in Table 3.1 by backwards
elimination. So in this case, both methods give the same answer.
Best Subsets Regression
Response is C12

Vars  R-Sq  Adj R-Sq   C-p        s   C2 C3 C4 C5 C6 C7 C8 C9 C11
  1   35.0      34.2  73.7  0.46147   x
  1   34.7      33.8  74.5  0.46268   x
  2   54.9      53.7  29.9  0.38691   X X
  2   52.3      51.0  36.0  0.39800   X X
  3   63.8      62.3  11.4  0.34899   x X X
  3   60.0      58.4  20.1  0.36669   XXX
  4   66.0      64.2   8.4  0.34049   XX X X
  4   65.8      64.0   8.8  0.34137   X X XX
  5   67.8      65.6   6.3  0.33364   X XX XX
  5   67.7      65.5   6.5  0.33417   X XX XX
  6   68.9      66.4   5.6  0.32992   X XXX XX
  6   68.3      65.7   7.1  0.33322   XX XX XX
  7   69.3      66.4   6.6  0.32990   XX XXX XX
  7   69.1      66.1   7.1  0.33094   XXXXX XX
  8   69.6      66.1   8.1  0.33102   XXXXX X XX
  8   69.3      65.9   8.6  0.33220   XX XXXXXX
  9   69.6      65.7  10.0  0.33306   XXXXXXXXX
4 Transformations to Normality

4.1 Background
Several analyses in this book have been improved by using a transformation
of the response, rather than the original response itself, in the analysis of
the data. For the introductory example of the wool data in Chapter 1, the
normal plot of residuals in Figure 1.9 is improved by working with log y
rather than y (Figure 4.2). The transformation improves the approximate
normality of the errors. The transformation also improves the homogeneity
of the errors. The plot of residuals against fitted values for the original data,
also given in Figure 1.9, showed the variance of the residuals increasing with
fitted value. The same plot for log y, given in Figure 4.2, shows no such
increase.
Two analyses in Chapter 3 also showed the advantages of transformation
of the response. For the stack loss data a simpler model was obtained when
the square root of y was used as the transformation. There was no need for
interactions - an additive model sufficed. We also saw advantages in work-
ing with the log of ozone concentration, rather than with the concentration
itself.
There are physical reasons why a transformation might be expected to
be helpful in these examples. In two of the sets of data, the response is a
concentration and in the third it is the number of cycles to failure. All are
nonnegative variables and so cannot be subject to additive errors of con-
stant variance. In this chapter we analyze such data using the parametric
family of power transformations introduced by Box and Cox (1964). A full

discussion, including deletion diagnostics, is given by Atkinson (1985). For


some data, alternative models are provided by the family of generalized
linear models discussed by McCullagh and Nelder (1989). These models
are the subject of Chapter 6.
A consequence of embedding the various transformations in a single para-
metric family is that standard methods, such as regression, can be used.
We then have to hand the well-understood methods of model building and
diagnostic analysis described in Chapter 2. However, the resulting inferen-
tial methods use aggregate statistics. For example, in regression the test
of the null hypothesis of a particular transformation is the likelihood ratio
statistic, a function of the residual sums of squares of normalized trans-
formed data. Unfortunately the estimated transformation and related test
statistic may be sensitive to the presence of one, or several, outliers. We
use the forward search to see how the estimates and statistics evolve as we
move through the ordered data. We have already shown some examples in
Chapters 1 and 3. As we demonstrate, influential observations may only
be evident for some transformations of the data. Since observations that
appear as outlying in untransformed data may not be outlying once the
data have been transformed, and vice versa, we employ the forward search
on data subject to various transformations, as well as on untransformed
data.
We begin with a description of the Box and Cox family of transformations
and derive the score tests we use for transformation of the response and of
an explanatory variable. An important graphical method is the "fan plot"
which monitors the behaviour of the score statistic during the forward
search. The examples illustrate both transformation of the response and,
in §4.10, transformations for both the response and explanatory variables.
The last sections of the chapter extend the technique to transformation
of both sides of a model, appropriate when the relationship between the
response and the model should not be destroyed by transformation. The
example given is of the volume of tree trunks modelled as cones.

4.2 Transformations in Regression


4.2.1 Transformation of the Response
For transformation of just the response y in the linear regression model,
Box and Cox (1964) analyze the normalized power transformation

z(λ) = (y^λ - 1) / (λ ẏ^(λ-1))     λ ≠ 0
z(λ) = ẏ log y                     λ = 0,                    (4.1)

where the geometric mean of the observations is written as ẏ = exp(Σ log y_i / n). The model fitted is multiple regression with response z(λ);

that is,

z(λ) = Xβ + ε.                    (4.2)

When λ = 1, there is no transformation: λ = 1/2 is the square root transformation, λ = 0 gives the log transformation and λ = -1 the reciprocal. These are the most widely used transformations, frequently supported
by some empirical reasoning. For example, measurements of concentration
often have a standard deviation proportional to the mean, so that the vari-
ance of the logged response is approximately constant (Exercise 4.1). For
this form of transformation to be applicable, all observations need to be
positive. For it to be possible to detect the need for a transformation the
ratio of largest to smallest observation should not be too close to one. A
similar requirement applies to the transformation of explanatory variables.
The purpose of the analysis is to find an estimate of λ for which the errors in the z(λ) (4.2) are, at least approximately, normally distributed with constant variance and for which a simple linear model adequately describes the data. This is achieved by finding the maximum likelihood estimate of λ, assuming a normal theory linear regression model.
Once a value of λ has been decided upon, the analysis is the same as that using the simple power transformation

(,X.) = { (yA-1)/'x' ,X.~O (4.3)


y 10gy,X. = 0 .

However the difference between the two transformations is vital when a value of λ is being found to maximize the likelihood, since allowance has to be made for the effect of transformation on the magnitude of the observations.
observations.
The likelihood of the transformed observations relative to the original
observations y is

where the Jacobian

(4.4)

allows for the change of scale of the response due to transformation


(Exercise 4.2).
A simpler, but identical, form for the likelihood is found by working with the normalized transformation, defined in general as

z(λ) = y(λ) / J^(1/n),

for which the Jacobian is one. The likelihood is therefore now

(2πσ²)^(-n/2) exp{-(z(λ) - Xβ)^T (z(λ) - Xβ) / (2σ²)},                    (4.5)

a standard normal theory likelihood for the response z(λ). For the power transformation (4.3),

∂y_i(λ)/∂y_i = y_i^(λ-1),

so that

log J = (λ - 1) Σ log y_i = n(λ - 1) log ẏ.
The maximum likelihood estimates of the parameters are found in two stages. For fixed λ the likelihood (4.5) is maximized by the least squares estimates

β̂(λ) = (X^T X)^(-1) X^T z(λ),

with the residual sum of squares of the z(λ),

R(λ) = z(λ)^T (I - H) z(λ) = z(λ)^T A z(λ).                    (4.6)

Division of (4.6) by n yields the maximum likelihood estimator of σ² as σ̂²(λ) = R(λ)/n. Replacement of this estimate by the mean square estimate s²(λ), in which n is replaced by (n - p), does not affect the development that follows.
For fixed λ we find the loglikelihood maximized over both β and σ² by substitution of β̂(λ) and s²(λ) into (4.5). If an additive constant is ignored this partially maximized, or profile, loglikelihood of the observations is

Lmax(λ) = -(n/2) log{R(λ)/(n - p)}                    (4.7)
so that λ̂ minimizes R(λ). To repeat what has already been stressed, it is important that R(λ) in (4.7) is the residual sum of squares of the z(λ), a normalized transformation with the physical dimension of y for any λ (Exercise 4.2). Comparisons of residual sums of squares of the simple power transformation y(λ) are misleading. Let

S(λ) = y(λ)^T (I - H) y(λ)                    (4.8)

be the residual sum of squares of the unnormalized y(λ). Suppose, for example, that the observations are of order 10^3; the residual sum of squares S(1) will be of order 10^6, whereas, when λ = -1, the reciprocal transformation, the observations and S(-1) will be of order 10^-6. However relatively well the models for λ = 1 and λ = -1 explain the data, S(-1) will be very much smaller than S(1). Comparison of these two residual sums of squares will therefore indicate that the reciprocal transformation is to be preferred. This bias is avoided by the use of R(λ) in (4.7), since the magnitude of z(λ) does not depend on λ.
For inference about the transformation parameter λ, Box and Cox suggest likelihood ratio tests using (4.7), that is, the statistic

T_LR = 2{Lmax(λ̂) - Lmax(λ0)} = n log{R(λ0)/R(λ̂)}.                    (4.9)
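A minimal Python sketch of (4.1), (4.7) and (4.9) follows; it is ours, it assumes a positive response vector y and a design matrix X containing the constant, and a simple grid search stands in for a proper numerical maximization.

import numpy as np

def z_lambda(y, lam):
    """Normalized power transformation z(lambda) of (4.1)."""
    gm = np.exp(np.mean(np.log(y)))                   # geometric mean of y
    return gm * np.log(y) if lam == 0 else (y**lam - 1) / (lam * gm**(lam - 1))

def profile_loglik(y, X, lam):
    """L_max(lambda) of (4.7), up to an additive constant."""
    z = z_lambda(y, lam)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    R = np.sum((z - X @ beta) ** 2)                   # residual SS of z(lambda)
    return -(n / 2) * np.log(R / (n - p))

def lr_test(y, X, lam0, grid=np.linspace(-1, 1, 81)):
    """T_LR of (4.9): twice the gap between maximized and hypothesized loglikelihoods."""
    lmax = max(profile_loglik(y, X, lam) for lam in grid)
    return 2 * (lmax - profile_loglik(y, X, lam0))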

A disadvantage of this likelihood ratio test is that a numerical maximization is required to find the value of λ̂. For regression models a computationally simpler alternative test is the approximate score statistic derived by Taylor series expansion of (4.1) as

z(λ) ≈ z(λ0) + (λ - λ0) ∂z(λ)/∂λ |_{λ=λ0} = z(λ0) + (λ - λ0) w(λ0),                    (4.10)

which only requires calculations at the hypothesized value λ0. In (4.10) w(λ0) is the constructed variable for the transformation. Differentiation of z(λ) for the normalized power transformation yields (Exercise 4.3)

w(λ) = ∂z(λ)/∂λ = y^λ log y / (λ ẏ^(λ-1)) - {(y^λ - 1)/(λ ẏ^(λ-1))}{1/λ + log ẏ}.                    (4.11)

The combination of (4.10) and the regression model y = x^T β + ε leads to the model

z(λ0) = x^T β - (λ - λ0) w(λ0) + ε
      = x^T β + γ w(λ0) + ε,                    (4.12)

where γ = -(λ - λ0). The approximate score statistic for testing the transformation Tp(λ0) is the t statistic for regression on w(λ0) in (4.12). This can either be calculated directly from the regression in (4.12), or from the formulae for added variables in §2.2 in which multiple regression on x is adjusted for the inclusion of an additional variable. The t test for γ = 0 in (2.30) is then the test of the hypothesis λ = λ0. To make explicit the dependence of both numerator and denominator of the test statistic on λ we can write our special case of (2.27) as
we can write our special case of (2.27) as
*T * *T *
-yeA) = W (A) Z (A)/{W (A) W (An.
The approximate score test for transformations is thus, from (2.30),

J S~(A)/ {WT(A)Aw(A)}
-yeA)
(4.13)
JS~(A)/{';/(A) :v (An
The negative sign arises because in (4.12) "( = -(A - AD). The mean square
estimate of (72 can, from (2.29), be written in the form

(n - p - l)S~(A) =Z*T (A) Z* (A) - (z*T (A) W


*
(A))2/{W*T (A) W* (An.

These formulae show how γ̂ is the coefficient for regression of the residuals z* on the residuals w*, both being the residuals from regression on X. If, as is usually the case, X contains a constant, any constant in w(λ) can be disregarded in the construction of w* (Exercise 4.3). Under these conditions (4.11) becomes

w(λ) = y^λ{log(y/ẏ) - 1/λ} / (λ ẏ^(λ-1)).                    (4.14)

It is then straightforward that

w(1) = y{log(y/ẏ) - 1}.

Calculations when λ = 0 require the use of l'Hôpital's rule in (4.11) to obtain (Exercise 4.3)

w(0) = ẏ log y (0.5 log y - log ẏ).

These are the two most frequently occurring values in the analysis of data: either no transformation, the starting point for most analyses, or the log transformation. For other values of λ the constructed variables are found by evaluation of (4.14). Because Tp(λ) is the t test for regression on -w(λ), large positive values of the statistic mean that λ0 is too low and that a higher value should be considered.
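A sketch of the calculation, coded directly as the t test for the constructed variable in (4.12) with the sign reversed as just described, is given below; it is ours, and it assumes positive y and a design matrix X containing the constant.

import numpy as np

def score_test(y, X, lam):
    """Approximate score statistic T_p(lambda) of (4.13)."""
    gm = np.exp(np.mean(np.log(y)))
    if lam == 0:
        z = gm * np.log(y)
        w = gm * np.log(y) * (0.5 * np.log(y) - np.log(gm))
    else:
        z = (y**lam - 1) / (lam * gm**(lam - 1))
        w = y**lam * (np.log(y / gm) - 1 / lam) / (lam * gm**(lam - 1))
    Xa = np.column_stack([X, w])                 # add the constructed variable
    n, p = Xa.shape
    G = np.linalg.inv(Xa.T @ Xa)
    beta = G @ Xa.T @ z
    s2 = np.sum((z - Xa @ beta) ** 2) / (n - p)
    t_w = beta[-1] / np.sqrt(s2 * G[-1, -1])     # t statistic for gamma in (4.12)
    return -t_w                                  # minus sign: gamma = -(lambda - lambda_0)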

4.2.2 Graphics for Transformations


In both Chapters 1 and 3 we showed examples of forward plots of the score statistic Tp(λ). The extension of this plot to tests for five values of λ, using five separate searches through the data, is our main tool in the analysis of transformations. We show in this chapter the wealth of information this plot can provide on transformations, outliers and influential observations. We contrast this information with that which can be obtained from two other plots and show how they are affected by masking.
The first is the added variable plot for the constructed variable w(λ), which is the scatter plot of the residuals z*(λ) against the residuals w*(λ). From the expansion in (4.12)

γ = -(λ - λ0) = λ0 - λ.

If there is no regression, γ̂ ≈ 0 and the value of λ0 is acceptable. The constructed variable plot will then show no trend. If the value of λ0 is too high, the plot will show a positive trend. This is often seen at the beginning of an analysis when the hypothesis of no transformation is explored, the positive slope indicating that the data should be transformed. On the other hand, if the data are overtransformed, the plot will show a negative slope and a higher value of λ is indicated.

Numerous examples of constructed variable plots are given by Atkinson (1985), who uses them to establish whether all the observations support a proposed transformation. He also shows, as we do in §4.5, that the plots may reveal a single influential outlier. We then show that as the number of outliers increases the plot can fail to indicate any such observations.
The second of the two plots we consider, in addition to the forward plot of the score statistic, is the inverse fitted value plot described in Chapter 10 of Cook and Weisberg (1994a) and more fully in Cook and Weisberg (1994b). As we did in (4.2), we assume that, for some transformation t(y), the data obey a linear model so that

t(y) = η + ε = x^T β + ε.                    (4.15)

The linear predictor η is estimated by least squares regression of the untransformed observations y, giving η̂ = β̂^T x. The "inverse" scatter plot of η̂ against y, the plot {y, η̂}, then indicates the form of the transformation t(y). The plot is augmented by including the curve of fitted values from the regression of η̂ on a proposed transformation t0(y). A good agreement between the scatter plot and the fit to t0(y) indicates that a suitable transformation has been found.
This plot is extremely easy to calculate and often indicates a suitable transformation to a simple additive linear model, one of the three aims of the Box-Cox transformation. We study how the plot behaves in the presence of outliers and of appreciable observational error. The calculations are further simplified by using the transformation

t(y) = y^λ       λ ≠ 0
t(y) = log y     λ = 0,

which gives the same numerical results as the Box-Cox transformation in (4.1) or (4.3), although lacking the mathematical property of continuity at λ = 0.
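The plot is easily produced. The following Python sketch, which needs matplotlib, is ours; it regresses the untransformed y on X to obtain η̂ and superimposes the fit of η̂ on a candidate transformation t0(y), the choice of candidate being left to the user.

import numpy as np
import matplotlib.pyplot as plt

def inverse_fitted_plot(y, X, lam0=0.0):
    """Inverse fitted value plot: eta_hat against y, with the fit of eta_hat
    on the candidate transformation t0(y) drawn through the points."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    eta_hat = X @ beta                                  # fitted linear predictor
    t0 = np.log(y) if lam0 == 0 else y ** lam0          # candidate t0(y)
    T = np.column_stack([np.ones_like(y), t0])
    curve = T @ np.linalg.lstsq(T, eta_hat, rcond=None)[0]
    order = np.argsort(y)
    plt.plot(y, eta_hat, 'o')
    plt.plot(y[order], curve[order], '-')
    plt.xlabel('y')
    plt.ylabel('fitted values')
    plt.show()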

4.2.3 Transformation of an Explanatory Variable


If instead of transforming y it is desired to transform one of the explanatory variables x_k, the model is

y = Σ_{j≠k} β_j x_j + β_k x_k^λ + ε.                    (4.16)

Taylor series expansion about λ0 yields the linearized model

y = Σ_{j≠k} β_j x_j + β_k x_k^(λ0) + β_k(λ - λ0) x_k^(λ0) log x_k + ε.                    (4.17)

The test of the hypothesis λ = λ0 in (4.17) is equivalent to testing for regression on the constructed variable x_k^(λ0) log x_k when a term in x_k^(λ0) is already in the model (Box and Tidwell 1962).
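A hedged sketch of this test is given below; the function and argument names are ours. It adds the constructed variable to a regression that already contains x_k^(λ0) and returns its t statistic; X_other is assumed to hold the remaining columns, including the constant.

import numpy as np

def box_tidwell_t(y, X_other, xk, lam0=1.0):
    """t statistic for the constructed variable x_k^{lam0} log x_k of (4.17)."""
    xk_l = xk ** lam0
    w = xk_l * np.log(xk)                        # constructed variable
    Xa = np.column_stack([X_other, xk_l, w])
    n, p = Xa.shape
    G = np.linalg.inv(Xa.T @ Xa)
    beta = G @ Xa.T @ y
    s2 = np.sum((y - Xa @ beta) ** 2) / (n - p)
    return beta[-1] / np.sqrt(s2 * G[-1, -1])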

Figure 4.1. Wool data: profile loglikelihood Lmax(λ) (4.7), showing the narrow 95% confidence interval for λ

In §4.10 we consider transformation of both the response and of one of the explanatory variables. If we are testing for the same transformation of both y and x_k we can use one constructed variable for testing the common value of λ for both transformations. The combination of (4.12) and (4.17) yields the constructed variable

w_B(λ0) = β̂_k x_k^(λ0) log x_k - w(λ0).                    (4.18)

In (4.18) β̂_k is the estimate from the regression (4.16) including the variable x_k^(λ0). Further details are given by Atkinson (1985, §8.4).

4.3 Wool Data


To begin we return to the data on worsted yarn from Box and Cox (1964),
which were presented in Chapter 1. The analysis of Box and Cox suggested
the log transformation. We use the example to calibrate our procedure,
showing that it does not produce spurious quantities with well-behaved
data. We also compare our procedure with the other graphical methods
mentioned above in §4.2.2.
Figure 4.1 is a plot of the profile loglikelihood Lmax(λ), (4.7). It provides very strong evidence for the log transformation, with the maximum likelihood estimate λ̂ equal to -0.059. The horizontal line on the plot at a value of Lmax(λ̂) - 3.84/2 cuts the curve of the profile loglikelihood at -0.183 and 0.064, providing an approximate 95% confidence region for λ.

Figure 4.2. Transformed wool data: residual plots for log y: (left) least squares
residuals against fitted values; (right) normal QQ plot of studentized residuals

This plot, depending as it does solely on the value of the residual sum of squares R(λ), is of course totally uninformative about the contribution of individual observations to the transformation.
At the start of our analysis of the wool data we showed, in Figure 1.9, a
plot of the residuals of the untransformed data against fitted values. This
showed a strong relationship between the two , which indicated the need
for a transformation. We show the same plot in Figure 4.2, but for the
data after the logarithmic transformation. The plots are now much better
behaved, especially the plot of residuals against fitted values in the left
panel, which is without structure, as it should be. There is perhaps one too
large negative residual, which however lies within the simulation envelope
of the QQ plot of the studentized residuals in the right panel. This plot is
also much better behaved than its counterpart in Figure 1.9, being much
more nearly a straight line; one observation, number 5, does lie slightly
outside the envelope, but it is not clear what to do about it, nor is it clear
what effect this observation has on the estimated transformation. We now
consider the results of our forward searches for these data.
In Chapter 1 we made separate analyses for the original data and for
log y and showed forward plots of the score statistic Tp(λ) for each, based
on different searches for the two values of λ. In this chapter we base our
analyses on five values of λ: -1, -0.5, 0, 0.5 and 1. In all examples these
are sufficient to indicate a satisfactory transformation. We perform five
separate searches. The data are transformed and a starting point is found
for each forward search, which then proceeds independently for each λ using the transformed data. In this example we found the five initial subsets
by exhaustive search of all subsets, although this detail does not affect our
general results. Figure 4.3 shows the values of the approximate score statistic Tp(λ) as the subset size m increases. The central horizontal bands on
the figure are at ±2.58, containing 99% of a standard normal distribution.
For obvious reasons, we refer to this kind of forward plot as a fan plot.
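A sketch of the two ingredients of the fan plot, the approximate score statistic and the forward search that orders the data for each λ, is given below. The code is illustrative only: the constructed variable is obtained by numerical differentiation of z(λ) rather than from the analytical form of w(λ0), the starting subset is taken from a least squares fit rather than from the robust choice we use, and the sign of the statistic is fixed so that positive values point towards values of λ above the null, in agreement with the fan plots shown here.

import numpy as np

def boxcox_z(y, lam):
    # Normalized Box-Cox transformation, as in the previous sketch.
    gm = np.exp(np.mean(np.log(y)))
    if abs(lam) < 1e-8:
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

def score_statistic(y, X, lam0, h=1e-4):
    # Approximate score test of H0: lambda = lam0.  The constructed
    # variable w = dz/dlambda at lam0 is found by a central difference and
    # added to the regression of z(lam0) on X; the sign of the t statistic
    # is chosen so that positive values suggest lambda above lam0.
    z = boxcox_z(y, lam0)
    w = (boxcox_z(y, lam0 + h) - boxcox_z(y, lam0 - h)) / (2.0 * h)
    Xw = np.column_stack([X, w])
    n, p = Xw.shape
    beta, *_ = np.linalg.lstsq(Xw, z, rcond=None)
    resid = z - Xw @ beta
    s2 = resid @ resid / (n - p)
    cov = s2 * np.linalg.pinv(Xw.T @ Xw)
    return -beta[-1] / np.sqrt(cov[-1, -1])

def fan_plot_values(y, X, lambdas=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    # One crude forward search per lambda: start from the cases best
    # fitted by least squares and grow the subset by smallest squared
    # residuals, recording the score statistic at each subset size m.
    n, p = X.shape
    out = {}
    for lam in lambdas:
        z = boxcox_z(y, lam)
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        subset = np.argsort((z - X @ beta) ** 2)[: p + 5]
        values = []
        for m in range(p + 5, n + 1):
            values.append((m, score_statistic(y[subset], X[subset], lam)))
            beta, *_ = np.linalg.lstsq(X[subset], z[subset], rcond=None)
            subset = np.argsort((z - X @ beta) ** 2)[: m + 1]
        out[lam] = values
    return out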

Figure 4.3. Wool data: fan plot, the forward plot of Tp(λ) for five values of λ. The
curve for λ = -1 is uppermost; log y is indicated

Initially, apart from the very beginning when results may be unstable,
there is no evidence against any transformation. When the subset size m =
15 (56% of the data), λ = 1 is rejected. The next rejections are λ =
0.5 at 67% and -1 at 74%. The value of λ = 0 is supported not only
by all the data, but also by our sequence of subsets. The observations
added during the search depend on the transformation. In general, if the
data require transformation and are not transformed, or are insufficiently
transformed, large observations will appear as outliers. Conversely, if the
data are overtransformed, small observations will appear as outliers. This
is exactly what happens here. For λ = 1 and λ = 0.5, working back from
m = 27, the last cases to enter the subset are 19, 20 and 21, which are
the three largest observations. Conversely, for λ = -1 and λ = -0.5 case
9 is the last to enter, preceded by 8 and 7, which are the three smallest
observations. Since the data are in standard order for a 3^3 factorial, the
patterns of these numbers indicate a systematic failure of the model. For
the log transformation, which produces normal errors, there is no particular
pattern to the order in which the observations enter the forward search.
Similar results are obtained if Tp(λ) is replaced by the signed square
root of the likelihood ratio test (4.9). In the absence of outliers and highly
influential observations, the fan plot of the score statistic evolves smoothly
with m: there are no jumps or dramatic reorderings of the curves. More
quantitatively, the method allows assessment of the proportion of the data
supporting a particular transformation, information not available from
other methods of analysis.
There are a number of details in the combination of the forward search
with statistics about transformation that need to be decided in any implementation of the method.

Figure 4.4. Wool data: forward plot of Tp(λ) for five values of λ, one search on
untransformed data

We show results from a single forward search
from a carefully selected starting point. The alternative of using several
searches from random starting points seems to yield similar results. More
important is whether the selection of the subset and the forward search are
carried out on untransformed data or on each individual value of λ. Figure 4.4 shows the results of calculating the score statistic for the five values
of λ, as in Figure 4.3, but ordering the data by a search on untransformed
data.
The two plots become similar towards the end of the search (of course
they are identical when m = n), but are different at the beginning: because
the data are chosen to agree with the hypothesis of no transformation,
other transformations, such as the log, are initially rejected. An interesting
feature of Figure 4.4 is the addition of the observation when m = 10.
This has very different effects on the five score statistics (Exercise 4.6).
Comparison of the two figures shows that individual searches give more
informative plots and are to be preferred.
There is also the question of what measures of the need for a transforma-
tion should be calculated. Instead of the approximate score statistic we can
monitor the likelihood. Figure 4.5 shows the evolution of the five values of
Lmax(λ) during the searches that also produced the fan plot of Figure 4.3.
This plot shows that, throughout the search, the logarithmic transformation is to be preferred. At the end of the search the five loglikelihood values
correspond to five points from the loglikelihood curve of Figure 4.1. It is
however not quite straightforward to quantify this difference as the comparison is between the loglikelihood at two (or more) specified values of λ,
rather than, as in the likelihood ratio test of (4.9), between the loglikelihood
at λ̂ and at a specified value. For (4.9) standard likelihood arguments give

Figure 4.5. Wool data: Lmax(λ) for five values of λ during the forward search.
The uppermost curve is for λ = 0

the asymptotic chi-squared distribution of the statistic, whereas the comparison of two specified values involves the comparison of members of two
separate families, for which there are no standard results. Some references
are given in §6.21.
It is also possible to monitor the estimate of λ during the forward search.
Again we need to determine the transformation used to order the data
during the search; a curve for each of the five values of λ used to construct
the fan plot is one possibility. However curves of parameter estimates are
of limited value without error limits. We find approximate 95% confidence
intervals from the loglikelihood at each stage of the forward search. The
resulting plots, for λ = 1 and 0, are given in Figure 4.6, in which the vertical
scales for the two panels are the same. The left panel for λ = 1 shows that,
from m = 13 onwards, the value of one is rejected. On the contrary, apart
from some initial fluctuations, Figure 4.6(right) shows that the value of
zero is accepted, in line with our other analyses.
There are three points to be made about plots such as Figure 4.6. One
is that the confidence limits are surprisingly tight at the beginning of the
search, especially for λ = 0. Since the data have been ordered to agree
with the transformation used in the search, which is the correct transformation, model and data agree and the likelihood falls off quickly. We have
already seen the effect of this ordering on estimation of s², for example,
in Figure 1.8, where the value was initially small and the t value for a
linear parameter correspondingly high. The confidence intervals for λ = 1
are broader because this is not the correct transformation, so that model
and data do not agree so well. Of course, at the end of the search, when all
data are included, the parameter estimate and confidence region are identi-

Figure 4.6. Wool data: forward plot of λ̂ with 95% likelihood intervals from two
searches, (left) λ = 1; (right) λ = 0

cal for any ordering. The second point is that the plot requires a numerical
search for calculation of the maximum likelihood estimate λ̂ at each stage
of the search, combined with two further numerical searches to find the
ends of the confidence region. This procedure is computationally intensive
when compared with calculation of the approximate score test Tp(λ), which
only requires noniterative calculations at the null value. The third point
is that the fan plot is appreciably easier to interpret than Figure 4.6. We
accordingly use plots of score statistics, rather than of parameter estimates.
Finally, we supplement our discussion of the forward search by consider-
ing information available from the graphical methods described in §4.2.2.
Figure 4.7 is the constructed variable plot when λ = 1. With its positive
slope, the plot shows clear evidence of the need for a transformation, evidence which seems to be supported by all the data. The most influential
points seem to be observations 20 and 19, which are the two largest observations, and 9, 8, 7 and 6, which are the four smallest. The sequential nature
of these sets of numbers reflects that the data are from a factorial experiment and are presented in standard order. The constructed variable plot
for λ = 0 is in Figure 4.8. There is no trend in the plot and the transformation seems entirely acceptable. The residuals from the six observations
that were extreme in the previous plot now lie within the general cloud of
points.
Although the constructed variable plots give the same indications as all
other plots about the satisfactory nature of the log transformation, they
do not supply direct evidence on the influence of the individual observa-
tions on the selection of this transformation. This is provided in Figure 4.9
which gives index plots of the deletion values of Tp(λ) for three values of λ.
These are calculated by deleting each observation in turn and recalculating
the statistic, rather than using the approximate formulae coming from the
deletion results in §2.3. The plots for the three values of λ are on the same

Figure 4.7. Wool data: constructed variable plot for λ = 1. The clear slope in the
plot indicates that a transformation is needed. The largest observations are 19
and 20: the labelled points in the centre of the plot have the four smallest values
of y


Figure 4.8. Wool data: constructed variable plot for λ = 0. The absence of trend
indicates that the log transformation is satisfactory
Figure 4.9. Wool data: index plots of deletion values of Tp(λ) with 95% intervals
for λ = -0.5, 0 and 0.5. No evidence against log y

vertical scale. They show that deletion of individual observations has little
effect on the values of the statistics: λ = 0.5 and -0.5 are still firmly rejected. Likewise, individual deletions have virtually no effect on the value
of Tp(0): all values remain close to zero.
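The deletion values plotted in Figure 4.9 are, as stated above, obtained by direct recomputation rather than from deletion formulae. A minimal sketch, reusing the score_statistic helper from the earlier sketch (the function names are ours):

import numpy as np

def deletion_score_values(y, X, lam0, score_statistic):
    # Drop each observation in turn and recompute the approximate score
    # statistic Tp(lam0) on the remaining n - 1 cases.
    n = len(y)
    idx = np.arange(n)
    return np.array([score_statistic(y[idx != i], X[idx != i], lam0)
                     for i in range(n)])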
The last plot we consider in our analysis of the wool data is the inverse
response plot in Figure 4.10. The plot of ŷ against y is given twice. In Fig-
ure 4.10(left) we have imposed the best fitting straight line, which clearly
fits very badly. But, in the right-hand panel the fitted curve is log y, which
fits very well. The visual impression of this plot is once more to support
the log transformation.

4.4 Poison Data


We begin by analyzing the poison data from Box and Cox (1964), which,
like the wool data, are well behaved: there are no outliers or influential
observations that cannot be reconciled with the greater part of the data by
a suitable transformation. Our fan plot and the other graphical procedures
all clearly indicate the reciprocal transformation. We then consider a series
of modifications of the data in which an increasing number of outliers is
introduced. We show that the fan plot reveals the structure in all instances,
but that the answers from other procedures, such as those in §4.2.2, can be
unhelpful, or even misleading.
The data, given in Table A.8, are the times to death of animals in a 3 x 4
factorial experiment with four observations at each factor combination. All
our analyses use an additive model, that is, without interactions, so that

Figure 4.10. Wool data: inverse fitted value plots with fitted curves for (left) no
transformation and (right) the log transformation

Table 4.1. Poison data: last six observations to enter the five separate searches
and numbers of six largest observations

        λ = -1   -0.5     0    0.5     1    Largest
   m                  Observations
  43       27      44    14     43    28       13
  44       28      37    28     28    43       15
  45       37      28    37     14    17       17
  46       44       8    17     17    14       42
  47       11      20    20     42    42       14
  48        8      42    42     20    20       20

p = 6, as did Box and Cox (1964) when finding the reciprocal transforma-
tion. The implication is that the model should be additive in death rate,
not in time to death.
Our analysis is again based on five values of λ: -1, -0.5, 0, 0.5 and
1. The fan plot of the values of the approximate score statistic Tp(λ) for
each search as the subset size m increases is given in Figure 4.11 and shows
that the reciprocal transformation is acceptable, as is the inverse square root
transformation (λ = -0.5). Table 4.1 gives the last six observations to enter
each forward search. We first consider the ordering of the data achieved by
these forward searches and then discuss Figure 4.11 in more detail.
In addition to the ordering of the data by the search, Table 4.1 also gives
the numbers for the six largest observations. The table shows that, for λ
= 0.5 and 1, observation 20, the largest observation, is the last to enter
the set used for fitting. It is the last but one (m = 47) to enter for λ = 0
or -0.5 and is not in the last six for λ = -1. Similarly, the four largest
observations are the last four to enter for λ = 1 and 0.5, but the number

Figure 4.11. Poison data: fan plot, the forward plot of Tp(λ) for five values of λ. The
curve for λ = -1 is uppermost: both λ = -1 and λ = -0.5 are acceptable

decreases as λ decreases. For λ = -1 all the large observations enter earlier
in the search than m = 43. However the next but last observation to enter
is 11, which is the smallest. These results, which parallel those for the
wool data, are both gratifying and surprising. With a simple sample it is
the large observations that would suggest a transformation to λ less than
one. Since these observations may not be in agreement with the model,
they should enter the search for λ = 1 at the end. Likewise, the smallest
values would tend to suggest a transformation above the inverse. If a correct
transformation has been found, small and large observations should both
enter the search throughout, including at the end. They do so here for
λ = -0.5. It is however surprising that these results for a random sample
still hold when we fit a linear model to the data.
The table shows the different order for the different searches. We now
return to the fan plot of the score statistic in Figure 4.11. Initially, for small
subset sizes, there is no evidence against any transformation. During the
whole forward search there is never any evidence against either λ = -1
or λ = -0.5 (for all the data λ̂ = -0.75). The log transformation is also
acceptable until the last four observations are included by the forward
search. As the table shows, these include some of the largest observations in
order. The plot shows how evidence against the log transformation depends
critically on this last 8% of the data. Evidence that some transformation
is needed is spread throughout the data, less than half of the observations
being sufficient to reject the hypothesis that λ = 1. There are no jumps in
this curve, just an increase in evidence against λ = 1 as each observation
is introduced into the subset. As we show, the relative smoothness of the
curves reflects the lack of outliers and exceptionally influential cases.

Figure 4.12. Modified poison data: fan plot, the forward plot of Tp(λ) for five values
of λ. The curve for λ = -1 is uppermost: the effect of the outlier is evident in
making λ = 0 appear acceptable at the end of the search

4.5 Modified Poison Data


For the introduction of a single outlier into the poison data we follow An-
drews (1971) and change observation 8, one of the readings for Poison II,
group A, from 0.23 to 0.13. This is not one of the larger observations so the
change does not create an outlier in the scale of the original data. The effect
on the estimated transformation of all the data is however to replace the
reciprocal with the logarithmic transformation: λ̂ = -0.15. And, indeed,
the fan plot of the score statistics from the forward searches in Figure 4.12
shows that, at the end of the forward search, the final acceptable value of
λ is 0, with -0.5 on the boundary of the acceptance region.
But, much more importantly, Figure 4.12 clearly reveals the altered observation and the differing effect it has on the five searches. Initially the
curves are the same as those of Figure 4.11. But for λ = 1 there is a jump
due to the introduction of the outlier when m = 41 (85% of the data),
which provides evidence for higher values of λ. For other values of λ the
outlier is included further on in the search. When λ = 0.5 the outlier comes
in at m = 46, giving a jump to the score statistic in favour of this value
of λ. For the other values of λ the outlier is the last value to be included.
Inclusion of the outlier has the largest effect on the inverse transformation.
It is clear from the figure how this one observation is causing an appreciable
change in the evidence for a transformation.
Figure 4.12 is the analogue of Figure 4.3 for the wool data: both are
based on five separate searches through the data. We now consider the
analogue of Figure 4.4 in which there is one search on untransformed data
to order the observations. The resulting plot is in Figure 4.13. The outlier

Figure 4.13. Modified poison data: forward plot of Tp(λ) for five values of λ, one
search on untransformed data. The outlier enters well before the end of the search

now, of course, enters at the same position in all five calculations of Tp(λ).
Because a small observation has been made smaller, the outlier has its
greatest effect on the tests for λ = -1. But the effect of its introduction
is clear for all five test statistics. Although this figure is helpful in the
identification of an influential outlier, it is nothing like as useful as the fan
plot of Figure 4.12 in understanding which is the correct transformation.
When, as in Figure 4.12, the data are approximately correctly transformed,
which they are for λ = -1, -0.5 and 0, observation 8 enters at the end of
the search. As the value of λ becomes more remote from the correct value,
so the outlier enters earlier in the search.
We now compare the clear information given by the fan plot with that
which can be obtained from other graphical methods. Figure 4.14 gives constructed variable plots for three values of λ: -1, 0 and 1. For λ = 0 there is
a clear indication of the importance of observation 8. There is a cloud of 26
points with an upward trend, and the remote point of observation 8 which
is causing the estimate of slope to be near zero. Deletion of this observation can be expected to change the estimated transformation, although
by how much cannot be determined from this plot. The plot for λ = -1
seems to show that there is evidence that the reciprocal transformation has
overtransformed the data, although what the effect is of observation 8 is
not clear. Likewise the panel for λ = 1 indicates that the data should be
transformed. On this plot observation 8 seems rather less important. One
conclusion from these plots is that it is helpful to look at a set of values of
λ when using constructed variable plots, just as it is in the fan plot.
As a third graphical aid to choosing a transformation we give in Figure 4.15 Cook and Weisberg's inverse fitted value plot for four values of λ.
The values of y and of the fitted values ŷ are the same in all four plots.

Figure 4.14. Modified poison data: constructed variable plot for three values of λ.
The effect of observation 8 on the estimated transformation is clearest for λ = 0


Figure 4.15. Modified poison data: inverse fitted value plots with fitted curves for
four values of λ. Is λ = -0.5 best?

What differs is the fitted curve. Because the data consist of four observations at each factor combination, patterns of four identical values of ŷ
are evident in the plot. These are more widely dispersed (in the horizontal
direction, since this is an inverse plot) for larger values of ŷ. The difference
in dispersion makes it rather difficult to judge the plots by eye: the lower
values are best fitted, perhaps, by the reciprocal transformation and the
higher values by the log. The value of λ = 0.5 is clearly an inadequate
transformation: the fitted line is not sufficiently curved. The plots thus indicate, in a rather general way, what is an appropriate transformation, but
they do not indicate the importance of observation 8. However, due to the
replication, the variance for the group of observations including 8 can be
seen to be rather out of line with the general relationship between mean
and variance.

Figure 4.16. Doubly modified poison data: index plots of deletion values of Tp(λ)
with 99% intervals for four values of λ; the log transformation is indicated

4.6 Doubly Modified Poison Data: An Example of Masking
The simplest example of masking is when one outlier hides the effect of
another, so that neither is evident, even when single deletion diagnostics
are used. As an example we further modify the poison data. In addition to
the previous modification, we also change observation 38 (Poison I, group
D) from 0.71 to 0.14.
For the five values of λ used in the fan plot the five values of the
approximate score test for the transformation are:

    λ        -1    -0.5      0     0.5      1
    Tp(λ)  10.11   4.66   0.64   -3.06  -7.27

It seems clear that the data support the log transformation and that all
other transformations are firmly rejected. To show how diagnostics based
on the deletion of single observations fail to break the masking of the two
outliers, we give in Figure 4.16 index plots of the deletion values of the
Tp(λ), calculated directly from the data with each case deleted in turn. Also
given on the panels, where possible, are lines at ±2.58, corresponding to 1%
significance, assuming the statistics have a standard normal distribution.
The four panels have also been plotted with the same vertical scale. For λ =
-1 the statistics range from 7.22 to 10.7, so that the inverse transformation
is firmly rejected. For λ = -0.5 the range is 2.54 to 4.88, evidence for
rejection of this value. For the log transformation all values lie well within

Figure 4.17. Doubly modified poison data: fan plot, the forward plot of Tp(λ) for
five values of λ. The curve for λ = -1 is uppermost; the effect of the two outliers
is clear

the significance band and the transformation continues to be acceptable.
Finally, for λ = 1, the hypothesis of no transformation is rejected.
The conclusion of this analysis is that there is no reason not to accept the
log transformation. Two observations, 8 and 38, cause some changes in the
score statistic, but in neither case is this effect statistically significant. It is
least for the log transformation which could be the only one examined in
great detail, since all others considered here are rejected by the aggregate
statistics Tp(A). To break this masking we need to order the observations
by their closeness to the proposed transformation model and note the effect
of introducing each observation.
The effect of the two outliers is clearly seen in the fan plot Figure 4.17.
The plot also reveals the differing effect the two altered observations have on
the five searches. Initially the curves are similar to those of the original data
shown in Figure 4.11. The difference is greatest for λ = -1 where addition
of the two outliers at the end of the search causes the statistic to jump
from an acceptable 1.08 to 10.11. The effect is similar, although smaller,
for λ = -0.5. It is most interesting however for the log transformation.
Towards the end of the search this statistic is trending downwards, below
the acceptable region. But addition of the last two observations causes a
jump in the value of the statistic to a nonsignificant value. The incorrect
log transformation is now acceptable.
For these three values of λ the outliers are the last two observations to
be included in the search. They were created by introducing values that
are too near zero when compared with the model fitted to the rest of the
data. For the log transformation, and more so for the reciprocal, such values
become extreme and so have an appreciable effect on the fitted model. For
the other values of λ the outliers are included earlier in the search. The
effect is most clearly seen when λ = 1; the outliers come in at m = 40 and

Figure 4.18. Doubly modified poison data: constructed variable plots for two
values of λ. There seem to be two outliers, or are there three?

46, giving upward jumps to the score statistic in favour of this value of λ.
For the remaining value of 0.5 one of the outliers is the last value to be
included.
Although the single case deletion diagnostics of Figure 4.16 fail to reveal
the two outliers, they are revealed by the constructed variable plots in much
the same way as they were for the singly modified data in §4.5. Plots of the
constructed variables for λ = 0 and 0.5 are given in Figure 4.18. For λ = 0
it seems clear that there are two outliers, observations 8 and 38, which
will influence the choice of transformation. These two observations need to
be deleted and the transformation parameter reestimated. The results for
λ = 0.5 are less clear: deletion of observations 8 and 38 would indicate that
the data need to be transformed to a lower value of λ, although there
is no evidence whether the value of λ should be -0.5, -1, or some other
value. But deletion of 14, 20 and 42, the three largest observations, would
suggest a higher transformation, or perhaps no transformation at all.
The conclusions from the constructed variable plots are, as before, less
sharp than those from the fan plot which clearly reveals not only the masked
outliers, but also their effect on the estimated transformation. Although the
outliers were not revealed by the single case deletion methods exhibited in
Figure 4.16, they could be found by looking at all 48 × 47/2 = 1,128 pairs of
deletions. But if there were three outliers, 17,296 triples would have to be
investigated, and even more if there were four outliers. The problem with
this procedure is, perhaps, not the amount of computation but, rather, the
difficulty in interpreting the mass of computer output. Use of the fan plot
from the forward search reveals the outliers and their effect on inference in
one analysis.
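The counts quoted above are binomial coefficients and are easily checked:

from math import comb
print(comb(48, 2), comb(48, 3), comb(48, 4))   # 1128 17296 194580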

4.7 Multiply Modified Poison Data - More Masking
As our last example of the effect of masking on procedures for finding a
transformation to normality, we modify the original data by making small
changes in four observations in such a way that single deletion methods
fail. However we show that not only does our forward search identify the
outliers, it also makes clear their inferential effect.
We proceed in three stages: first we give an example of a standard di-
agnostic analysis, using single deletion techniques, and show the problems
it encounters. We then use the forward search and interpret the fan plot.
Finally we look at the behaviour of the graphical procedures of §4.2.2.

4.7.1 A Diagnostic Analysis


In order to create masked outliers, four observations have been made
smaller. The results of the previous sections suggest that these modifi-
cations will have little effect when the data are analysed on the original
scale, but will be very evident when λ = -1 and so should influence the
transformation away from -1 towards one. As we show, this is indeed what
happens. The way in which observations 6, 9, 10 and 11 were modified to
create the four outliers is specified in Table 4.2.
For these data the maximum likelihood estimate λ̂ is 0.274 for an additive
model without interactions. The values of the approximate score test for
transformation for the five standard values of λ are:

    λ        -1    -0.5      0     0.5      1
    Tp(λ)  22.08  10.01   2.87   -2.29  -8.41

All transformations considered are therefore rejected, although neither the
log nor the square root transformation is strongly rejected. Accordingly a
value in between these two is tried which is close to λ̂. Since Tp(1/3) =
-0.59, 1/3 is an acceptable value for λ. The next step is to check for the
influence of individual observations on the values λ = 0, 1/3 and 0.5. Figure 4.19 shows index plots of the deletion values of Tp(λ), calculated
from the data with each case deleted in turn, rather than using an approx-

Table 4.2. Multiply modified poison data: the four modified observations

Observation Original Modified

6 0.29 0.14
9 0.22 0.08
10 0.21 0.07
11 0.18 0.06

Figure 4.19. Multiply modified poison data: plots of deletion Tp(λ) for three
values of λ; the value of 1/3 is indicated

imation to the effect of deletion. Also given on the panels, where possible,
are lines at ±1.96, corresponding to 5% significance, assuming the statistics
have a standard normal distribution. The three panels have been plotted
with the same vertical scale. For λ = 0 deletion of observation 11 reduces
the value of the statistic, but it is still larger than 2, suggesting that the
log transformation remains unlikely. For λ = 1/3 the largest effect is from
the deletion of observation 20. Whether or not it is included, the third root
transformation is supported by the data. It is however an unusual transformation, except for volumes. The plot for λ = 0.5 shows that if observations
20 or 42 are deleted the square root transformation is acceptable. If, as a
result of this information, it were decided to move to λ = 1/2, this would
take the analysis even further from the value appropriate to the majority
of the data.
The next step in a standard analysis is to look at the distribution of
residuals to see whether there is any evidence of outliers. Three QQ plots
are exhibited for λ = 1/3. As well as that for the least squares residuals we
show residuals from two robust fits when different sized subsets are used in
fitting the model during the forward search. In all plots the residuals have
been scaled by the overall estimate of σ² from the end of the forward search.
This scaling is not important when looking for evidence of outliers, since
the important feature is the shape of the plot. Figure 4.20(left panel) shows
scaled residuals from an LMS fit to an elemental set, found by searching
over 10,000 randomly selected subsets of size p = 6. The plot shows the
typically long-tailed distribution which comes from a very robust fit. Figure 4.20(middle panel) shows what happens as we move along the forward
search until a subset of size m = 27 is used for fitting. The plot still has a

Figure 4.20. Multiply modified poison data: normal QQ plots of scaled residuals
at three points in the forward search

long-tailed shape, from which it might be concluded that there were many
outliers, or none. However, when all the data are fitted by least squares,
as shown in Figure 4.20(right panel), there is no evidence of any particular
outliers. The effect of individual deletions on the estimated transformation
has already been investigated in Figure 4.19, so it is not necessary to con-
sider again the importance of observations 11 and 20, which are the most
extreme observations when m = n in Figure 4.20, as they also are when
m = 6 and m = 27.
The conclusion from this analysis is that the transformation λ = 1/3 is
reasonable. This traditional approach gives no indication of the effect of the
four outliers. The example shows that, if data are analyzed on the wrong
transformation scale, even the application of very robust methods such as
LMS fails to highlight the outliers and influential observations.
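The LMS fit to elemental sets used for the left panel of Figure 4.20 can be outlined as follows (a sketch only, not the implementation used here, and the function name is ours): elemental subsets of size p are drawn at random, each gives an exact fit, and the coefficients kept are those minimising the median squared residual over all n observations.

import numpy as np

def lms_elemental(y, X, n_subsets=10000, seed=None):
    # Approximate least median of squares: search over randomly chosen
    # elemental subsets of size p and keep the best exact fit.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])    # exact fit to p cases
        except np.linalg.LinAlgError:
            continue                                   # singular subset: skip
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit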

4.7.2 A Forward Analysis


The fan plot of the forward score statistics for transformation is in Fig-
ure 4.21. In general it is very different from the plot for the original data,
Figure 4.11, and is, indeed, more complicated than the plots for the singly
modified data, Figure 4.12, and for the doubly modified data, Figure 4.17.
Some curves are within the bounds for most subsets and then increase
rapidly at the end: others go outside the 1% boundary, only to return at
the end. Both forms of behaviour are associated with influential outliers.
For λ = -1 the curve lies well within the boundary, with no particular
pattern, until m = 45. Addition of the last four observations, which are the
four outliers, causes a rapid increase in the value of the score test from 1.16
to 22.1 and provides strong evidence against λ = -1. It is interesting that
the observation included in this search when m = 44 is number 8, the last
to be included for the original data and this value of λ. The behaviour of
the curve for λ = -0.5 is similar, but much less extreme. The four outliers
are again included at the end, causing the statistic to increase from -1.36
to 10.0. The curve for λ = 0 first goes below the boundary but inclusion

Figure 4.21. Multiply modified poison data: fan plot, the forward plot of Tp(λ) for
five values of λ. The curve for λ = -1 is uppermost: the differing effects of the
four modified observations are evident

of the four contaminated observations, once more in the last four steps
of the forward search, brings it above the upper threshold. The statistic
for λ = 0.5 spends much more of the central part of the search outside
the lower boundary. As we have seen, the final value of Tp(0.5) is -2.29.
But for values of m between 22 and 37 the curve lies approximately on or
below the boundary. The inclusion of units 9, 10 and 11 at m = 38, 39 and
40 increases the value of the score statistic from -2.65 to 1.89. From this
step onwards the curve decreases monotonically, except at m = 43 when
inclusion is of unit 6, the first modified unit to be included. It is interesting
that, in this scale, the four contaminated observations are not extreme and
so do not enter in the last steps of the forward search. But the forward plot
enables us to detect their appreciable effect on the score statistic.
The indication of this plot is that one possible model for these data takes
λ = -1 for the greater part of the data, with four outliers. To confirm this
suggestion we look at the plot that monitors the scaled residuals during
the forward search. This is shown, for λ = -1, in Figure 4.22. This plot
beautifully indicates the structure of the data. On this scale there are the
four outliers, observations 6, 9, 10 and 11, which enter in the last four steps
of the forward search. Until this point the pattern of residuals remains
remarkably constant, as we argued in §2.6.1 that it should. The pattern
only changes appreciably in the last four or five steps, when the outliers
and observation 8 are introduced.
The results of the forward search in Figures 4.21 and 4.22 clearly show
the masked outliers and their effects, which were not revealed by the single
case deletion methods exhibited in Figure 4.19 and the residual plots for
A = 1/3 of Figure 4.20. The comparison of these sets of figures exhibits the
power of our method in the presence of influential masked multiple outliers.

Figure 4.22. Multiply modified poison data, λ = -1: forward plot of scaled
residuals clearly revealing the four modified observations

4.7.3 Other Graphics for Transformations


For a last look at the multiply modified poison data we look again at the
diagnostic plots discussed in §4.2.2. Figure 4.23 shows the inverse fitted
value plot for these data and two values of λ: 1/3 and -1.
The value of 1/3 was selected by the diagnostic analysis of §4.7.1. The
left-hand panel of the figure indicates that this is a satisfactory value: the
curve seems to pass nicely through the centre of the points, apart perhaps
from the four lowest. Certainly the fit is much better than that for the
correct value of λ, -1, which is unsatisfactory at the top and the bottom
of the plot and seems much too curved. The plot is thus misled by the
masked outliers.
The constructed variable plot is equally unhelpful in extracting the struc-
ture of the data. Figure 4.24 shows the constructed variable plot for λ =
1/3. There is no obvious pattern: if the three points in the upper right-
hand corner are removed, there would seem to be a downward slope and
the indication is that the data have been overtransformed. On the other
hand, if the three points in the bottom right-hand corner are deleted, there
would be a slight upward trend and a further transformation is suggested.
Since the two groups are in approximate equilibrium, it is not clear that
the value of 1/3 is unsatisfactory. Certainly the plots for λ = -0.5 and -1
in Figure 4.25 indicate that these values of λ are both quite unsatisfactory.
There is a strong slope to both plots, with no obvious pattern of outliers.
The conclusion from this series of analyses of increasingly corrupted
versions of the poison data is that both inverse fitted value plots and con-
structed variable plots retain their usefulness in the presence of one or
two outliers, but that they are not useful in the presence of masked mul-
tiple outliers. Only the forward search, by starting from a small, carefully

Figure 4.23. Multiply modified poison data: inverse fitted value plots for two
values of λ. The inverse transformation is quite unacceptable


Figure 4.24. Multiply modified poison data: constructed variable plot for λ = 1/3

Figure 4.25. Multiply modified poison data: constructed variable plots for
λ = -0.5 and -1. Particularly for λ = -1, all observations seem to support
transformation to a higher value

selected subset of clean observations, is able to break such masking and
reveal the true structure of the data.

4.8 Ozone Data


We now consider some examples in which the number and nature of any
influential observations are not known. At the end of Chapter 3 we built a
model for the ozone data with log y as response and used the forward search
to check several properties of the model such as residuals and t tests. The
result was a reduced model from which four variables had been excluded.
We continue to work with this model.
The left-hand panel of Figure 3.39 showed the evolution of the score
test to check the log transformation, which was acceptable, with no no-
ticeably influential observations. It is possible that other transformations
might also be acceptable. But the fan plot of the score statistics from the
forward search for five values of λ in Figure 4.26 shows that only the log
transformation is acceptable. The fan shape of the plot is similar to that for
the wool data in Figure 4.3: there is no evidence of any particularly influential observations. For λ = 1 the last observations to enter are the largest:
working backwards these are 53, 71, 54 and 52. For λ = -0.5 or -1 the
last observations to enter are the smallest whereas for λ = 0 or 0.5, the last
five observations contain both small and large values, with the last two for
λ = 0.5 being the two largest. Although the fan plot shows that deletion of
a few extreme observations would lead to the acceptability of both 0.5 and
-0.5 as values of λ, evidence for a transformation is typically concentrated
in the extreme observations, as it was for the wool data, so these dele-

Figure 4.26. Ozone data, final model: fan plot, the forward plot of Tp(λ) for five
values of λ. The curve for λ = -1 is uppermost; log y is indicated

tions would not lead to an improved analysis. On the contrary, they would
increase the standard errors of the estimated parameters. An additional
argument for the log transformation is that, when the data are correctly
transformed in the absence of outliers, the magnitude of the untransformed
observations is not reflected in the order in which they enter the search.
The fan plot for the full model for the ozone data is similar to that in
Figure 4.26 except that, in line with the results of Table 3.1, the values
of the statistics are smaller. We do not give the figure here, but instead
give two diagnostic plots for the transformed data in Figure 4.27. The left
panel is the constructed variable plot which shows no particular pattern,
although the two largest observations 53 and 71 are evident. However these
observations did not enter the plot at the end, so they agree with the
proposed transformation. The inverse fitted value plot is in the right-hand
panel. The vertical patterning visible is due to rounding of the values of the
response so that many cases have the same value of y. The fitted curve of
log y appears to pass well through most of the data. However, as with the
left panel, the two largest observations stand slightly apart. The largest,
observation 53, enters the subset in the forward search for the logged data
two observations from the end. We see from the fan plot, Figure 4.26, that
this observation causes a very small change in Tp(0) when it enters. This
is another example of the way in which the fan plot enables us to quantify
impressions from other plots.

4.9 Stack Loss Data


In §3.2 we found, by a comparatively lengthy process, that a suitable model
for the stack loss data was to regress √y on two variables, x1 and x2.

Figure 4.27. Logged ozone data, final model: (left) constructed variable and
(right) inverse fitted value plots for λ = 0

We now show how our forward procedure leads straightforwardly to this


conclusion and then exhibit some diagnostic plots. We stress that with
n = 21, p = 3 or 4 and four, or perhaps five, potential outliers, very little
information is being used to support a large number of inferences. Hence,
perhaps, the plethora of models for which plausible arguments have been
found.
We start with Figure 4.28 which is the fan plot of the score statistics.
The figure shows that the square root transformation, λ = 0.5, is supported
by all the data, with the absolute value of the statistic always less than
1.5. However the evidence for all other transformations depends on which
observations have been deleted: the log transformation is rejected when
some of the suspected outliers are introduced into the data although it is
acceptable for all the data; λ = 1 is rejected as soon as any of the suspected
outliers are present.
Figure 4.28 was produced using a model including x1, x2 and x3. We
have already shown that x3 can be dropped from the model. The forward
plot of the t statistics without x3 is in Figure 3.25. The fan plot of the score
statistics when x3 is dropped is very similar to Figure 4.28 so we do not
give it here. The most important feature is that, again, λ = 0.5 is the only
transformation for which the score statistic remains within the boundary
throughout the search, the absolute value never exceeding 1.12.
Since these data have been so frequently analyzed and the effects of
particular observations have been much discussed, we list in Table 4.3 the
observations entering in the last five steps of the forward search for the five
values of λ. We also give the corresponding values of the score statistics.
The hypothesis of no transformation (λ = 1) is rejected as soon as any
of the suspected outliers enter the model, which they do in the order 2,
1, 3, 4 and 21. Most of these observations also enter the other searches

Figure 4.28. Stack loss data, all three variables: fan plot, the forward plot of Tp(λ)
for five values of λ. The curve for λ = -1 is uppermost: √y is indicated

Table 4.3. Stack loss data: last five observations to enter the five separate searches
and the score statistics for transformation at each stage (linear model with
variables 1 and 2 only)

       λ = -1  -0.5    0   0.5    1       -1   -0.5     0    0.5      1
  m            Observation                     Score Statistic
  17       19     4   20    13    2     3.34   0.32  1.99  -0.81  -2.11
  18       15     3    1    20    1     4.64   2.80  3.23  -0.42  -4.08
  19       17    21    2     2    3     5.41   1.95  3.38  -0.07  -4.12
  20       18     1   21     4    4     5.80   3.08  1.43  -0.11  -2.81
  21       16     2    4    21   21     6.67   3.75  1.33  -0.97  -3.61

in the final stages. However they do not affect the conclusion that the
square root transformation is acceptable. The log transformation is not
acceptable when observations 4 and 21 are deleted, but is acceptable when
all observations are included, a form of masking which may have misled
Atkinson (1985). The transformation λ = -0.5 is not acceptable and the
reciprocal transformation is clearly rejected.
An interesting feature of Table 4.3 is that for all except one value of λ
many of the same observations appear as outliers. We have found a similar
feature in other analyses of transformations with influential observations
or outliers, where there appear to be just two different analyses, depending
on the value of λ. A feature of these particular data is that observations
1, 2, 3 and 4 (although not 21) are the largest observations. The smallest
observation is 16, with 15, 17 and 18 next smallest, followed by 19. These
are the last observations to enter the forward search when λ = -1. We

Figure 4.29. Stack loss data, x1, x2 and √y: normal QQ plots of scaled residuals
for (left) m = 12, (middle) m = 19 and (right) m = 21

have already discussed this phenomenon in the context of the poison data
and of the ozone data.
The results of Table 4.3 show that, whether or not observations 4 and 21
are treated as outliers, the square root transformation is accepted. We saw
in the previous chapter that the evidence from the forward plot of residuals
in Figure 3.27 showed that these observations are outliers and that, until
m = 19, the plot of residuals is very stable. If it were important to test
whether observations 4 and 21 were outliers on the square root scale, the
mean shift outlier model could be used to provide an F test based on the
fit for m = 19. Instead we look at diagnostic plots for m = 19 and m = 21
(Exercise 3.5). To justify this choice of values of m, we show in Figure 4.29
QQ plots for the residuals at three stages of the forward search for λ = 0.5.
The left panel shows the scaled residuals for m = 12. As the forward plot
of residuals showed, observations 4 and 21 appear outlying, as they do for
m = 19 in the middle panel. The right panel shows the residuals when all
observations are fitted and so is the plot for the residuals at the end of the
forward search. Despite the clear pattern of residuals in the forward plot,
in Figure 3.27, this plot is not so easy to interpret: the inclusion of the
two potential outliers has caused some masking of their properties.
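For completeness, the F test from the mean shift outlier model mentioned above amounts to adding a dummy variable for each suspected case and testing the dummies jointly. A sketch (the helper name and the use of scipy are ours; suspect holds 0-based row indices, so for the stack loss data y would be the square-root response, X a constant plus x1 and x2, and suspect = [3, 20] for observations 4 and 21):

import numpy as np
from scipy import stats

def mean_shift_F(y, X, suspect):
    # Mean shift outlier model: one indicator column per suspected case;
    # the F test compares residual sums of squares with and without them.
    n, p = X.shape
    D = np.zeros((n, len(suspect)))
    for j, i in enumerate(suspect):
        D[i, j] = 1.0
    def rss(M):
        beta, *_ = np.linalg.lstsq(M, y, rcond=None)
        return np.sum((y - M @ beta) ** 2)
    rss0, rss1 = rss(X), rss(np.column_stack([X, D]))
    q = len(suspect)
    F = ((rss0 - rss1) / q) / (rss1 / (n - p - q))
    return F, stats.f.sf(F, q, n - p - q)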
A feature of the QQ plots of Figure 4.29 is that the plots for m = 12 and
m = 19 are very similar. This could be inferred from the forward plot of
residuals in Figure 3.27, which also shows that we could expect the plots
for m = 19 and m = 21 to be very different. Accordingly we look at the two
diagnostic plots for transformations calculated for m = 19, that is without
using observations 4 and 21 in the fit, and for m = 21. For m = 19 we use
the estimated parameter values to calculate predictions and residuals for
all 21 observations.
We begin with the constructed variable plot in Figure 4.30(right), for
m = 21, which shows a horizontal scatter of points with two, those for
observations 4 and 21, rather separate from the rest, but balancing each
other in their effect on the regression. Figure 4.30(left) shows that, when
the two observations are deleted, their predicted values move away from the

Figure 4.30. Stack loss data, x1, x2 and √y: constructed variable plots (left)
m = 19 and (right) m = 21


Figure 4.31. Stack loss data, x1, x2 and √y: inverse fitted value plots for (left)
m = 19 and (right) m = 21

residuals of the other observations which form a pleasingly unstructured set:
all observations support the square root transformation. Similar remarks
apply to the inverse fitted value plots in Figure 4.31. In the right panel,
for m = 21, the square root curve fits all observations except 4 and 21. If
these are deleted and predicted, as in the left panel, they move away from
the curve which fits the remaining points even better.
We have dwelt on the analysis of the stack loss data in part because of
the frequent occurrence of the data in discussions of regression diagnostics
and robustness. But also in part because we feel that the analysis given here
shows how powerfully the forward search answers questions about influence
and outliers.

4.10 Mussels' Muscles: Transformation of the Response and of an Explanatory Variable
In the examples of the previous sections consideration was only given to
transformation of the response. But in many analyses of data both the
response and some of the explanatory variables have to be transformed. We
consider one such example in which transformations have been suggested
separately for the response and one of the explanatory variables. We show
the importance of jointly considering transformation of both variables and
use our method to discover which cases are influential in determining the
transformations.
Cook and Weisberg (1994a, p. 161) analyze data on 82 mussels from
New Zealand. The data are in Table A.9. The response M is the muscle
mass, the edible portion of the mussel in grams. There are four predictors,
all measurements of the mussel shell: W, H and L are the width, height
and length of the shell in millimetres and S is the mass of the shell, in
grams. The relationships between M, W, H and L appear linear but that
between S and the other predictors seems curved. In their book Cook and
Weisberg emphasize the importance of linear structure and linear spaces.
To find these they use inverse regression which also motivated the inverse
plots of §4.2.2 for finding a transformation. They accordingly start their
analysis by considering two methods for finding a transformation of S that
is linearly related to the other predictors. Inspection of the scatterplot
matrix of all variables as S is transformed indicates the transformation
S^1/3. Since S is a volume this has the effect of providing four predictors
with the same units of measurement, namely, length. Alternatively they
consider the Box-Cox transformation to give linear regression of S^λ on W,
H and L. For this purpose S^0.2 is the preferred transformation. They then
consider transformation of M in a regression model with predictors W, H,
L and S^0.2, which leads to the response M^1/3. However cases 8, 25 and 48
seem to be outlying.

Figure 4.32. Mussels' muscles: scatterplot of the response M, muscle mass, against
the explanatory variable S, shell mass. Deletion of the 10 observations marked +
produces an approximately linear homoscedastic relationship between M and S
for which there is no evidence of a transformation

As this outline shows, Cook and Weisberg's analysis proceeds in two
stages: the transformation of one of the predictors S followed by transfor-
mation of the response M. We consider instead the simultaneous choice of
the transformation of M and S in the regression of M on all four variables.
The scatterplot of M against S in Figure 4.32 shows increasing variability
as M increases, a strong indication of the need for a transformation. But it
also shows a nearly linear relationship between M and S, so that the two
variables may well need a similar transformation.
We let λ1 be the transformation parameter for the response M and λ2
be that for the predictor S. We use our forward searches to monitor three
score statistics based on constructed variables: that for λ1, which uses the
same variable as the statistic of previous sections, that for λ2, which uses
the variable given by (4.17) and, if we are interested in whether λ1 = λ2,
the t test for the single constructed variable for transformation of both variables given by (4.18). These are called TM(λ1, λ2), TS(λ1, λ2) and TMS(λ).
Except in the case of equality of the λ, there are two constructed variables,
both of which are included in the regression when the t statistics are calculated. We found that we needed 2,000 random sets to provide a stable
start to the forward search; for fewer subsets the plots were unstable at low
values of m, although identical for subset sizes m > 65, that is, for the last
20% of the data.
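In outline, the two t tests can be computed as follows. This is only a sketch: the constructed variables are obtained by numerical differentiation rather than from (4.12) and (4.17), the variable for S absorbs the scaling by β̂k into its coefficient, the argument names (x_other for W, H and L) are ours, and neither the signs nor the exact scalings necessarily match those used for the figures.

import numpy as np

def joint_transformation_tests(y, x_other, s, lam1, lam2, h=1e-4):
    # t statistics for the constructed variables of the response y (here M)
    # and of the predictor s (here S), both added to the regression of
    # z(lam1) on a constant, x_other and s transformed by lam2.
    def bc(v, lam):                       # simple Box-Cox transform
        return np.log(v) if abs(lam) < 1e-8 else (v**lam - 1.0) / lam

    def bc_z(v, lam):                     # normalized transform of the response
        gm = np.exp(np.mean(np.log(v)))
        return gm * np.log(v) if abs(lam) < 1e-8 else \
            (v**lam - 1.0) / (lam * gm**(lam - 1.0))

    z = bc_z(y, lam1)
    w_resp = (bc_z(y, lam1 + h) - bc_z(y, lam1 - h)) / (2.0 * h)
    w_pred = (bc(s, lam2 + h) - bc(s, lam2 - h)) / (2.0 * h)
    X = np.column_stack([np.ones(len(y)), x_other, bc(s, lam2), w_resp, w_pred])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.pinv(X.T @ X)))
    return {"response": beta[-2] / se[-2], "predictor": beta[-1] / se[-1]}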
We begin with untransformed M; that is λ1 = 1. The plot of TMS(1)
in Figure 4.33 decreases monotonically from m = 67 until the end of the
forward search, with the exception of a jump when m changes from 73
to 74. Table 4.4 reports in order the units that are included in the last
10 steps of the forward search. Up to m = 72 the score test is above the

Figure 4.33. Mussels' muscles: score statistics for power transformations of M and
S as the subset size m increases. Test for the same transformation for M and S,
TMS(1); test for transformation of M only, TM(1, 0.2); and test for transformation
of S only, TS(1, 0.2). Transformation of M is needed as well as that of S

lower boundary of -2.58. Removal of the 10 observations in the second
column of Table 4.4, which are marked with crosses in the scatterplot of
Figure 4.32, indeed produces a linear homoscedastic relationship between
M and S. Among these ten observations numbers 24, 21, 16, 34, 2, 10, 11
and 29 (in order working backwards from m = 82) form a group above
the general trend: only two, 39 and 8, lie below, with observation 8 at low
values of both M and S. It is the inclusion of this observation that causes
the jump in the value of the score statistic at m = 74. Inclusion of the
remaining cases causes an increase in the evidence for a transformation.
The large negative value of -8.71 for T_MS(1) indicates that both values
of λ should be less than 1. Figure 4.33 also shows plots of T_M(1, 0.2) and
T_S(1, 0.2), which are the tests for transformation of each variable when
S has already been transformed to S^{0.2}. The score for S does not show
evidence of the need for further transformation. But the plot of T_M(1, 0.2)
is similar to that of T_MS(1). For several values of m below 74 the value of
T_M(1, 0.2) is more extreme than that of T_MS(1), indicating that, once S has
been transformed, the evidence for transformation of M is increased. But
inclusion of observation 8 when m = 74 causes a larger jump in T_M(1, 0.2)
(and indeed in T_S(1, 0.2)) than in T_MS(1). As the third column of Table
4.4 shows, observation 8 then leaves the subset, being reintroduced when
m = 76. Thereafter the values of the two score statistics are very similar.
There is no doubt that transforming S without transforming M is broadly
rejected.
We next consider the transformation suggested by Cook and Weisberg,
that is, (1/3, 0.2). Figure 4.34 is a scatterplot of M^{1/3} against S^{0.2}. Com-
pared with Figure 4.32 this plot shows a virtually constant error variance.
The score statistics T_M(1/3, 0.2) and T_S(1/3, 0.2) in Figure 4.35 are

Table 4.4. Mussels' muscles: units included in the last 10 steps of the forward
search for various null transformations

Subset                           λ =
Size m    (1, 1)   (1, 0.2)   (1/3, 0.2)   (1/3, 1/3)   (0.2, 0.2)

  73        29        10          11           11            2
  74         8         8          10            2           44
  75        11      1,29           2           10           10
  76        10         8          25           25           34
  77        39        23          34           34           21
  78         2        16          21           21           16
  79        34        34          16           16           25
  80        16        21          48           24           24
  81        21        24          24           48           48
  82        24         2           8            8            8

remarkably similar to each other, until close to the end of the search. For
m = 75, 78 and 79 T_M(1/3, 0.2) is below the lower boundary, while
T_S(1/3, 0.2) always lies inside. This divergence shows that the two con-
structed variables are responding to different aspects of the data. Although
the transformation (1/3, 0.2) is supported by all the data, the plot shows
that it would be rejected except for the last three observations added. These
are, again working backwards, 8, 24 and 48, two of the three observations
identified by Cook and Weisberg as influential. However our analysis is also
informative about observations included not just at the end of the search.
Figure 4.35 shows that from m = 67 onwards nearly every unit added is
causing a decrease in the value of the score statistics. The three exceptions
are shown by heavy lines in the figure corresponding to the inclusion, from
largest m, of observations 8, 48 and 25, all of the three observations noted
as influential by Cook and Weisberg. The effect of cases 25 and 48 is to
bring the value of the score statistic back above the lower boundary. The
inclusion of observation 8 forces the test statistics to be positive. Apart from
these observations the statistic for transforming the response is responding
to the correlation between M and S. If S is transformed with too low a
value of λ2, a lower value of λ1 is indicated.
Finally we consider the third root transformation for both variables.
This has the physically appealing property that both volumes have been
converted to the dimension of length, which is that of the other three vari-
ables. Figure 4.36 shows the plot of TMs(1/3) which stays within the limits
for the whole of the forward search - the increase in the statistic in the
last two steps being caused by observations 8 and 48. As Figure 4.37 con-
firms, these are outlying observations, whatever reasonable transformation
we take. The effect of observation 25 is no longer evident. Also given in


Figure 4.34. Mussels' muscles: scatterplot of transformed variables M^{1/3} against
S^{0.2}. The 10 observations marked + are the last to be added in the forward search
for these parameter values summarized in Table 4.4


Figure 4.35. Mussels' muscles: score statistics for power transformations of M
and S as the subset size m increases. Transformation of M only, T_M(1/3, 0.2);
transformation of S only, T_S(1/3, 0.2). The upward jumps, marked in bold on the
plot of T_M(1/3, 0.2), are due to the inclusion of observations 25, 48 and 8


Figure 4.36. Mussels' muscles: score statistics for the same power transformation
of M and S as the subset size m increases. T_MS(1/3); T_MS(0.2). The score test
for the third root transformations lies throughout within the 99% limits

the plot is T_MS(0.2), which behaves very similarly, except that the effect
of observations 8 and 48 is to cause the transformation to be rejected in
favour of higher values of λ. Although the statistics in Figures 4.35 and
4.36 are calculated from different searches, Table 4.4 shows that the units
included in the last 10 steps are virtually identical. The main difference is
the order.
Our sequence of transformations has produced plots of increasing
smoothness as better transformations are found. But the analysis of jumps,
that is, of nonsmoothnesses, in the central part of the forward search,
can highlight important cases: these are not outliers with respect to
the transformation being used, but contain information about a suitable
transformation. For example, case 48 is not an outlier if the data are un-
transformed, and so is not present in the last steps of the forward search.
However its inclusion causes the increase in the value of T_MS(1), visible in
Figure 4.33, as m goes from 50 to 51.
Our analysis of this example shows how our forward method provides a
link between the evidence for a transformation and the scatter plots of the
data. As a result of this we are led to a physically meaningful transformation
and the identification of two outliers, together with knowledge of their effect
on the estimated transformations.

4.11 Transforming Both Sides of a Model


The best models obtained as a result of a transformation often have some
physical interpretation. An example is the reciprocal transformation of the
poison data in §4.4 which yielded a simple model for the death rate of
the animals. In this section we suppose that there is already a physically


Figure 4.37. Mussels' muscles: scatterplot of transformed variables M^{1/3} against
S^{1/3}. The 10 observations marked + are the last to be added in the forward
search for these parameter values summarized in Table 4.4. Observations 8 and
48 are revealed as outliers on this scale

meaningful model linking the response to the explanatory variables, but
that statistical considerations, such as an increase of variance with the
mean, suggest that the response should be transformed. However, if the
response on its own is transformed, the physical interpretation of the model
will be lost. Therefore both sides of the model need to be subjected to the
same transformation.
In the next section we consider data on the volume y of trees as a func-
tion of their diameter x1 and height x2. The example illustrates several
theoretical points.
There are a number of models that have a clearer interpretation than an
arbitrary polynomial in diameter and height, such as might be found by
variable selection or other standard techniques of model building. As one
possibility, a 1/3 power transformation of the response yields the first-order
model

    y^{1/3} = β0 + β1 x1 + β2 x2 + ε,    (4.19)

in which both sides have the dimension of length. An alternative is the
logged model

    log y = β0 + β1 log x1 + β2 log x2 + ε    (4.20)

which is dimensionless. A third possibility, which we explore further, is to
view the tree trunks as cones, suggesting the regression model

    y = α x1^2 x2 + ε = αv + ε = η + ε,    (4.21)

where v = x1^2 x2 is proportional to the volume of a cone and both sides have
the dimension of volume. If the conical model (4.21) holds exactly, α =
π/12. Taking logs in (4.21) gives theoretical values for the parameters in

(4.20). If the data indicate a transformation of the response, the functional
relationship is preserved by transforming both sides of (4.21).
Transformation of both sides of a regression equation is described in
Chapter 4 of Carroll and Ruppert (1988). In general the method re-
quires the use of nonlinear least squares. Although, for a one-parameter
model such as (4.21) nothing more complicated than linear least squares is
required, we initially treat the general case.
We transform both sides of the linear regression model

    E(Y) = η = x^T β,

with the normalized Box-Cox transformation (4.1), to obtain

    z(λ) = { (y^λ - 1)/(λẏ^{λ-1}) = (η^λ - 1)/(λẏ^{λ-1}) + ε    λ ≠ 0
           { ẏ log y = ẏ log η + ε                              λ = 0,    (4.22)
where, as before, the geometric mean of the observations is written as ẏ.
The maximum likelihood estimator of λ is again the value minimizing R(λ),
the residual sum of squares of the z(λ): transformation on the right-hand
side of (4.22) has no effect on the scale of the observations and so does not
enter the Jacobian of the transformation.
For fixed λ, estimation of the parameters of the linear predictor η in
(4.22) does not depend on whether the response is z(λ) or the nonnor-
malized y(λ) (4.3). Multiplication of both sides of (4.22) by λẏ^{λ-1} and
simplification leads to the model

    y^λ = η^λ + λẏ^{λ-1}ε.    (4.23)

When β is a vector of parameters, minimizing the residual sum of squares
in (4.23) is a nonlinear least squares problem of the kind discussed in the
next chapter. But when β is a scalar, as in (4.21), the simplified model
(4.23) becomes

    y^λ = (αv)^λ + λẏ^{λ-1}ε = δv^λ + λẏ^{λ-1}ε.    (4.24)
Now let q = y^λ and u = v^λ. Model (4.24) then reduces to the simple form

    q = δu + ε                       λ ≠ 0
    q = log y = log δ + u + ε        λ = 0.    (4.25)

For general λ this model is regression through the origin and the residual
sum of squares R(λ) is found by dividing the residual sum of squares of q
by (λẏ^{λ-1})^2. For the log model there is no regression, only correction by a
constant.
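As a concrete illustration of this calculation, the sketch below (ours, with hypothetical names) profiles R(λ) for the conical model, taking v = x1^2 x2 and assuming λ ≠ 0; the value of λ minimizing R(λ) on a grid approximates the maximum likelihood estimate.

import numpy as np

def tbs_profile_rss(y, v, lam):
    """R(lambda) for the one-parameter transform-both-sides model (4.25), lam != 0."""
    gm = np.exp(np.mean(np.log(y)))                  # geometric mean of the responses
    q, u = y**lam, v**lam
    delta = (q @ u) / (u @ u)                        # regression through the origin
    rss_q = np.sum((q - delta * u)**2)
    return rss_q / (lam * gm**(lam - 1))**2          # rescale as described in the text

# e.g. profile over a grid, with v = x1**2 * x2 for the conical model:
# lams = [l for l in np.linspace(-1, 1, 41) if l != 0]
# R = [tbs_profile_rss(y, v, l) for l in lams]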
Calculation of the score test for transformation requires the constructed
variable found by Taylor series expansion of (4.22) about λ0. To find this
variable let

    k(λ) = λẏ^{λ-1}.

Then, in (4.22),

    z(λ) = (y^λ - 1)/k(λ)

and the derivative (4.11) from Taylor expansion of z(λ) is written

    ∂z(λ)/∂λ = {y^λ log y - (y^λ - 1)(1/λ + log ẏ)}/k(λ).

Likewise

    ∂η(λ)/∂λ = {η^λ log η - (η^λ - 1)(1/λ + log ẏ)}/k(λ).

The constructed variable for the transform both sides model (4.22) is found
as the difference of these two, since they occur on different sides of the
equation, and is

    w_BS(λ) = y^λ log y - η^λ log η - (y^λ - η^λ)(1/λ + log ẏ).    (4.26)

In (4.26) the multiplicative constant k(λ) has been ignored since scaling
a regression variable does not affect the value of the t statistic for that
variable.
The general constructed variable (4.26) simplifies for the one-parameter
model (4.25), being written in terms of q = y^λ and δu = η^λ, provided λ ≠ 0.
Of course, δ is not known but is estimated by b, so that η^λ is replaced by
bu = q̂ to give the constructed variable

    w_BS(λ) = (q log q - q̂ log q̂)/λ - (q - q̂)(1/λ + log ẏ).    (4.27)

When λ = 0 similar reasoning leads to the variable

    w_BS(0) = (q^2 - q̂^2)/2 - (q - q̂) log ẏ.
Evidence of regression between the residuals z(λ) from a fitted model
in which both sides have been transformed and the residuals w_BS(λ) is
evidence of the need for a different transformation. Atkinson (1994b) gives
examples of the use of this variable in a diagnostic analysis of data on tree
volumes for which we give a forward analysis in the next section.
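A sketch of this score test for the one-parameter model is given below (again with names of our own choosing): the constructed variable (4.27) is added as a second column to the regression of q on u and the t statistic for its coefficient is returned. The forward-search versions of the test simply repeat the calculation for each subset size.

import numpy as np

def tbs_score_test(y, v, lam):
    """t statistic for the constructed variable (4.27), one-parameter model, lam != 0."""
    n = len(y)
    gm = np.exp(np.mean(np.log(y)))
    q, u = y**lam, v**lam
    delta = (q @ u) / (u @ u)                        # least squares through the origin
    qhat = delta * u                                 # assumed positive, as q and u are
    w = (q * np.log(q) - qhat * np.log(qhat)) / lam - (q - qhat) * (1 / lam + np.log(gm))
    A = np.column_stack([u, w])                      # model term plus constructed variable
    coef = np.linalg.lstsq(A, q, rcond=None)[0]
    resid = q - A @ coef
    s2 = resid @ resid / (n - 2)
    cov = s2 * np.linalg.inv(A.T @ A)
    return coef[1] / np.sqrt(cov[1, 1])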

4.12 Shortleaf Pine


Table A.10 contains 70 observations on the volume in cubic feet of shortleaf
pine, from Bruce and Schumacher (1935), together with x1, the girth of each
tree, that is, the diameter at breast height, in inches and x2, the height of
the tree in feet. The girth and, to a lesser extent the height, are easily
measured, but it is the volume of usable timber that determines the value
of a tree. The aim is therefore to find a formula for predicting volume from
the other two measurements.


Figure 4.38. Shortleaf pine: fan plot of score statistics for transforming both sides
of the conical model. The logarithmic transformation is indicated. There are no
influential observations

The trees are arranged in the table from small to large, so that one
indication of a systematic failure of a model would be the presence of
anomalies relating to the smallest or largest observations. To investigate
transformations for these data we use the conical model (4.21) with six
transformations: the usual five values plus λ = 1/3, which had a special
interpretation in (4.19). Figure 4.38 is a fan plot of the score statistics
which, unlike the other plots in this chapter, uses the constructed variable
w_BS defined in (4.27). The forward search orders the residuals y^λ - η̂^λ.
The plot shows that the log transformation is supported by all the data. All
other values are rejected, including 1/3, which has no special dimensional
significance when both sides are transformed. The smooth curves in the
plot do not reveal any highly influential observations.
The forward plot of residuals from the log transformation is Figure 4.39.
The pattern of residuals is very stable, with four slightly large residuals
throughout, the largest belonging to observation 53, which is the last to be
included in the forward search. The resulting model is of the form

    log y - log(x1^2 x2) = δ + ε.

Our analysis shows no evidence of any departure from this model.
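Fitting this final model is elementary, since δ is estimated by the mean of log y - log(x1^2 x2). A minimal sketch, with hypothetical array names for the tree measurements:

import numpy as np

def fit_logged_cone(diam, height, vol):
    """delta_hat and residuals for log(vol) = log(diam^2 * height) + delta + error."""
    r = np.log(vol) - np.log(diam**2 * height)
    return r.mean(), r - r.mean()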
There is a long history of the use of such models in forest mensuration.
Spurr (1952) gives 1804 as the date of the first construction of a table
relating volume to diameter and height. The definitive description of the
logarithmic formula found here by statistical means is by Schumacher and
Hall (1933), who analyze data for nine species. Bruce and Schumacher
(1935) give, in part, an introduction to multiple regression for workers in
forestry based on equations for tree volume, especially the logarithmic one
found here. The book discusses in detail many of the difficulties that arise
in trying to establish such equations.


Figure 4.39. Shortleaf pine: forward plot of the scaled residuals from the log-
arithmic model when both sides are transformed. A very stable pattern of
residuals

One difficulty is that trees change shape over their lifetimes. The trunk
of a young tree may be nearly conical, but a mature pine under certain con-
ditions is virtually cylindrical. The parameter δ in (4.24) will then change
with age and so with tree size. There is no evidence of any such drift
here: for the logarithmic transformation large and small observations en-
ter throughout the forward search. Only for untransformed data does the
largest tree enter last. Another difficulty arises in the measurement of the
volume of the trunk of each tree, which is often not a smooth geometric
shape but may be highly irregular, as are the trunks of many European
oaks. Even a conical trunk will have to be truncated as there will be a
minimum diameter for the provision of useful timber. Furthermore, how
should the trees for measurement be sampled?
These problems were also discussed by Spurr (1952) who was reduced to
the defeatist position that the problems can only be avoided by modelling
stands of single species trees all of the same age. Hakkila (1989) stresses
that there is more to trees than trunks, particularly if all woody material
is to be used for paper or fuel chips. Hakkila's plot (p. 16) of the dry mass
of Appalachian hardwood trees against the square of diameter at breast
height shows the need for the variance stabilizing effect of the logarithmic
transformation. The collection of papers edited by Ranneby (1982) contains
survey papers on forest biometry and on the errors in prediction arising
from estimated volume residuals.
Developments in statistical methodology for the models considered here
are presented by Fairley (1986) and Shih (1993), who discusses deletion
diagnostics for the transform both sides model.

4.13 Other Transformations and Further Reading


Atkinson (1985) gives examples of the use of transformations for other kinds
of data, for example, proportions, which involve no new principles. The use
of transformation after the addition of a constant is however problematic,
unless the value of the constant is known beforehand.
In this shifted power transformation, y in the normalized power trans-
formation (4.1) is replaced by q = y + μ, with μ a second parameter to be
estimated. We then require y_i + μ > 0 for all observations. The range of
the observations now depends on the value of μ, leading to a nonregular
estimation problem. The residual sum of squares of the normalized trans-
formed q has a global minimum of zero as μ → -y_min, where y_min is the
smallest observation, although the examples in Atkinson et al. (1991) show
that local minima may also exist. They use a grouped likelihood to obtain
estimates of the transformation parameter, although the estimate can de-
pend critically on the grouping interval. Some references to the literature
on nonregular estimation and to the shifted power transformation are in
Atkinson et al. (1991).
We have used a score statistic derived from a simple Taylor series expan-
sion of the normalized power transformation. A test statistic with better
asymptotic properties was introduced by Lawrance (1987). The compar-
isons reported in Atkinson and Lawrance (1989) suggest that monitoring
Lawrance's statistic in the forward search would yield results similar to
those given here.

4.14 Exercises
Exercise 4.1 Given a sample of observations for which

    var(Y_i) ∝ {E(Y_i)}^{2α} = μ^{2α},

use a Taylor series expansion to find a variance stabilizing transformation
g(y) such that var{g(Y_i)} is approximately constant. What happens when
α = 1 (§4.2)?
Exercise 4.2 Find the Jacobian (4.4) for the power transformation (4.3).
The physical dimension of the sample mean is the same as that of an ob-
servation. Use a dimensional argument to justify comparison of R(λ) for
different λ (§4.2).
Exercise 4.3 Derive the expression for w(λ) (4.11). Explain why the nor-
mal equations of linear least squares lead to the simplification of w(λ) in
(4.14). Verify that z(0) is as given in (4.1) and find w(0) (§4.2).
Exercise 4.4 The folded power transformation is defined as:

    y(λ) = {y^λ - (1 - y)^λ}/λ,    0 ≤ y ≤ 1.    (4.28)

See what happens when λ → 0, obtain the normalized form and find the
constructed variable for the transformation when λ = 1 and 0.
For what kind of data would this transformation be suitable? What
happens for data near 0 or near 1 (§4.2)?
Exercise 4.5 Suggest a transformation for percentages and describe its
properties (§4.2).
Exercise 4.6 The fan plot of Figure 4.4 shows distinct related patterns
at m = 10 and m = 24. What kind of observation causes each of these
patterns (§4.3)?
Exercise 4.7 Analyze the wool data using a second-order model and the
"standard" five values of λ. For each λ obtain the QQ plot of the residuals,
the plot of residuals against fitted values and the constructed variable plot
for the transformation. What transformation is indicated?
How does the F test for the second-order terms change with λ (§4.3)?
Exercise 4.8 The poison data have four observations at each combination
of factors, so that an estimate of σ^2 can be calculated from the within cells
sum of squares. Use this estimate to calculate the lack of fit sum of squares
for the untransformed data. How does the F test for lack of fit vary with λ
(§4.3)?
Exercise 4.9 Table 3.3 gave some demographic data about 49 countries
taken from Gunst and Mason (1980, p. 358). In an exercise of Chapter 3 you
were asked to find the most important explanatory variables for the demographic data.

Repeat your model building exercise with y^{-0.5} as the response. Compare
the answers to those you obtained earlier (§4.8).
Exercise 4.10 Figure 3.44 showed plots of leverages for the demographic
data and Figure 3.45 showed how the leverage points were generated by the
values of x3 and x4. Construct a leverage plot using only variables 1, 5 and
6. What are the units with the largest leverage? If the data are analyzed
with response y^{-0.5}, what is the effect of these units on R^2 and on the t
statistic for x6 (§4.8)?

4.15 Solutions
Exercise 4.1
Using a first-order Taylor expansion about μ

    g(Y_i) ≈ g(μ) + (Y_i - μ)g'(μ).

Consequently

    var{g(Y_i)} ≈ {g'(μ)}^2 var(Y_i)
              ≈ {g'(μ)}^2 μ^{2α}.

Now for var{g(Y_i)} to be approximately constant, g(Y_i) must be chosen so
that

    g'(μ) ∝ μ^{-α}.

So that, on integration,

    g(μ) ∝ μ^{1-α}    if α ≠ 1
    g(μ) ∝ log μ      if α = 1,

since the constant does not matter. For example, if the standard deviation
of a variable is proportional to the mean (α = 1) a logarithmic transfor-
mation (the base is irrelevant) will give a constant variance. If the variance
is proportional to the mean (α = 1/2), the square root transformation will
give a constant variance and so on. Table 4.5 reports the transformation
required to stabilize the variance for different values of α.
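A small simulation, not from the book, illustrates the α = 1 case: when the standard deviation is proportional to the mean, the standard deviation of the logged values is roughly constant.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 5.0, 25.0, 125.0])
y = mu * (1 + 0.1 * rng.standard_normal((2000, 4)))   # sd proportional to the mean
print(y.std(axis=0))                                  # grows roughly in proportion to mu
print(np.log(y).std(axis=0))                          # roughly constant, near 0.1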

Exercise 4.2
The Jacobian of the transformation is the determinant of the matrix

    J = | ∂y1(λ)/∂y1   ∂y1(λ)/∂y2   ...   ∂y1(λ)/∂yn |
        | ∂y2(λ)/∂y1   ∂y2(λ)/∂y2   ...   ∂y2(λ)/∂yn |
        |    ...           ...       ...      ...     |
        | ∂yn(λ)/∂y1   ∂yn(λ)/∂y2   ...   ∂yn(λ)/∂yn |

Table 4.5. Transformations to constant variance when the variance depends on
the mean

α       var[Y] = kμ^{2α}     Transformation
0       k                    Y
1/2     kμ                   √Y
1       kμ^2                 log Y
3/2     kμ^3                 1/√Y
2       kμ^4                 1/Y

The matrix is diagonal, since ∂yi(λ)/∂yj = 0 for i ≠ j, with ith diagonal
element yi^{λ-1}, so

    J = ∏_{i=1}^n |yi^{λ-1}| = ẏ^{n(λ-1)}.
For linear models including a constant we can ignore the -1 in the nu-
merator of z(λ). We also ignore the λ in the denominator and consider
only the dimension of y^λ/ẏ^{λ-1}. The geometric mean has the same dimen-
sion as the arithmetic mean and as y, so the dimension of z(λ) is that of y.
The same is true for z(0) since changing the scale of measurement of the y
merely adds a constant to this z. Therefore the response in the regression
model has the dimension of y whatever the value of λ. Sums of squares can
therefore be directly compared.
See also Bickel and Doksum (1981) and Box and Cox (1982).

Exercise 4.3

    w(λ) = dz(λ)/dλ = {λẏ^{λ-1} y^λ log y - (ẏ^{λ-1} + λẏ^{λ-1} log ẏ)(y^λ - 1)} / (λẏ^{λ-1})^2
         = y^λ log y/(λẏ^{λ-1}) - {(y^λ - 1)/(λẏ^{λ-1})}(1/λ + log ẏ).    (4.29)

The normal equations (2.5) can be expressed in terms of residuals as

    X^T e = 0,

so if the model contains a constant, it follows that Σ e_i = 0. Since we
require the residuals of the constructed variables in, for example, the con-
structed variable plot, addition of a constant to w(λ) will leave the residuals
unchanged.

To verify the expression for z(0) requires the use of l'Hôpital's rule, which
we exemplify for the limit of w(λ), a similar, but more complicated, opera-
tion. To find w(0) we rewrite equation (4.29) in a form that allows the use
of l'Hôpital's rule. We obtain

    dz(λ)/dλ = {λ y^λ log y - (y^λ - 1) - λ log ẏ (y^λ - 1)} / (λ^2 ẏ^{λ-1}).

Application of l'Hôpital's rule yields

    w(0) = lim_{λ→0} {log y (y^λ + λ y^λ log y) - y^λ log y - log ẏ (y^λ - 1) - λ y^λ log ẏ log y}
                     / {2λẏ^{λ-1} + λ^2 ẏ^{λ-1} log ẏ}.

Dividing the numerator and denominator by λ we obtain

    w(0) = lim_{λ→0} {y^λ log^2 y - (y^λ - 1)λ^{-1} log ẏ - y^λ log ẏ log y}
                     / {2ẏ^{λ-1} + λẏ^{λ-1} log ẏ}.

Now letting λ → 0

    w(0) = ẏ(0.5 log^2 y - log y log ẏ)
         = ẏ log y (0.5 log y - log ẏ).

Exercise 4.4
When λ → 0 applying l'Hôpital's rule shows that the folded power
transformation reduces to the logit transformation:

    y(0) = log{y/(1 - y)}.

In order to obtain the normalized version we must divide (4.28) by J^{1/n}.
In this case the Jacobian is J = ∏_{i=1}^n {y_i^{λ-1} + (1 - y_i)^{λ-1}}. The normalized
response variable is thus

    z(λ) = {y^λ - (1 - y)^λ}/{λG(λ)}      (λ ≠ 0)
    z(0) = log{y/(1 - y)} G^{-1}(0)       (λ = 0),

where

    G(λ) = (∏_{i=1}^n g_i)^{1/n} = [∏_{i=1}^n {y_i^{λ-1} + (1 - y_i)^{λ-1}}]^{1/n},

the geometric mean of y_i^{λ-1} + (1 - y_i)^{λ-1}, and G^{-1}(0) is the geometric mean
of y_i(1 - y_i). Note that in this transformation the geometric mean depends
on λ and has to be calculated afresh from all the n observations for each
value of λ. This detail further implies that computing the expression for
the constructed variable w(λ) = ∂z/∂λ requires the appropriate expression
for ∂G(λ)/∂λ. If we let

    Q = Σ_{i=1}^n (1/g_i) ∂g_i/∂λ,
i=l '

    ∂z/∂λ = [{y^λ log y - (1 - y)^λ log(1 - y)} λG(λ)
             - {y^λ - (1 - y)^λ}{G(λ) + λG(λ)Q/n}] / {λ^2 G(λ)^2}.    (4.30)

Now with p = y/(1 - y) we obtain

    ∂z/∂λ = {(1 - y)^λ / (λG(λ))} {p^λ log y - log(1 - y) - (p^λ - 1)(1/λ + Q/n)}.

Then

    w(1) = 0.5{y log y - (1 - y) log(1 - y)} - (y - 0.5){1 + 0.5 log G^{-1}(0)}

and, after applying l'Hôpital's rule to equation (4.30),

    w(0) = G^{-1}(0) [{log^2 y - log^2(1 - y)}/2 - (1/n) log{y/(1 - y)} Q(0)],

with

    Q(0) = Σ {(1 - y_i) log y_i + y_i log(1 - y_i)},

which is suitable for the transformation of proportions. As y → 0 we
obtain the Box-Cox transformation; whereas, for y near 1, the power
transformation is of 1 - y.

Exercise 4.5
One possibility is a "folded" transformation, similar to that in Exercise 4.4,
but now

    y(λ) = {y^λ - (100 - y)^λ}/λ,    0 ≤ y ≤ 100.    (4.31)

The geometric mean and constructed variable change in a similar manner.

Exercise 4.6
At m = 10 observation 24 enters, a relatively small observation that has
its largest effect on the curve for λ = -1. Conversely, when m = 24,
observation 22 enters, a large observation having the greatest effect on the
plot for λ = 1.

Exercise 4.7
The plots of residuals against fitted values in Figure 4.40 for λ = 1 and
Figure 4.41 for λ = 0, suggest the log transformation, although the QQ
plot of residuals for λ = 0 is less good than that in Figure 4.2 for the
first-order model. The constructed variable plots in Figure 4.42 not only
indicate rejection of λ = 1, but also suggest a value slightly below zero for
λ̂.


Figure 4.40. Wool data, second-order model: no transformation


Figure 4.41. Wool data, second-order model: log transformation


Figure 4.42. Wool data, second-order model: added variable plots for λ = 1 and λ = 0



Table 4.6. Poison data: F test for lack of fit for different values of λ

λ        F test
1        1.87
0.5      1.62
0        1.22
-0.5     0.92
-1       1.09

The F tests for the second-order terms are: for λ = 1,

    {(5480980.89 - 1256680.56)/6} / (1256680.56/17) = 9.52

and for λ = 0,

    {(0.792518 - 0.638709)/6} / (0.638709/17) = 0.68.

The first-order model is indicated on the log scale, but for untransformed
data the second-order terms are indicated, since F_{6,17,0.99} = 4.10. Here a
more complicated linear model is, to an inadequate extent, compensating
for the lack of fit due to the need for a transformation. Note that, for λ = 0,
the calculations have been done without normalizing the transformation:
for such comparisons for fixed λ the normalization is irrelevant.

Exercise 4.8
Table 4.6 indicates that, for the poison data, the lack of fit is smallest
between λ = -0.5 and λ = -1. The maximum likelihood estimate is -0.75.
However the test has very low power, not rejecting λ = 1. More information
would be found by plotting the individual estimates of σ^2 from the 12 cells
and looking for patterns that change with λ.

Exercise 4.9
Table 4.7 shows that variable selection for the demographic data depends on
the transformation used. While there can be no universally best procedure,
it is advisable to find a good transformation before removing variables from
the model. Any model believed to be final should be checked as to whether
the transformation still holds.

Exercise 4.10
The units that have a leverage much bigger than the others are 20 and
46. If they are removed R^2 passes from 0.793 to 0.691 and t_6 = -1.7333
becomes nonsignificant.

Table 4.7. Demographic data: selection of variables using t statistics (transformed
response y^{-0.5}). Compare with the results of Table 3.4

Model number       1          2          3          4
x1               4.7616     4.7957     4.8337     5.0383
x2              -0.3950
x3               0.6146     0.6736     0.3897
x4              -0.5457    -0.5951
x5              -3.6060    -4.3471    -4.3581    -4.9287
x6              -2.2613    -2.2586    -2.3686    -2.3989
R^2              0.796      0.796      0.794      0.793
5
Nonlinear Least Squares

In this chapter we extend our methods based on the forward search to re-
gression models that are nonlinear in the parameters. Estimation is still by
least squares although now iterative methods have to be used to find the
parameter values minimizing the residual sum of squares. Even with nor-
mally distributed errors, the parameter estimates are not exactly normally
distributed and contours of the sum of squares surfaces are not exactly
ellipsoidal. The consequent inferential problems are usually solved by lin-
earization of the model by Taylor series expansion, in effect ignoring the
nonlinear aspects of the problem. The next section gives an outline of this
material, booklength treatments of which are given by Bates and Watts
(1988) and by Seber and Wild (1989). Both books describe the use of
curvature measures to assess the effect of nonlinearity on approximate in-
ferences using the linearized model. Since we find it informative to monitor
measures of curvature during the forward search, we present a summary
of the theory in §5.1.2. Ratkowsky (1983) uses measures of curvature to
find parameter transformations that reduce curvature and so improve the
performance of nonlinear least squares fitting routines.
There follows some material more specific to the forward search. We
briefly touch on parameter estimation, since parameters have continually to
be updated during the forward search. We then outline differences between
the forward search for linear and nonlinear models. These differences are
not so much in the search itself as in the calculation of the quantities, such
as deletion residuals which we monitor.
The examples begin with one in which inference is virtually indistinguish-
able from that for a linear model and move through a series of examples in

which nonlinearity is of increasing importance and has an increasing effect
on inferences drawn from the data. The general conclusions are similar to
those in earlier chapters for the linear model: deletion methods may fail
in the presence of multiple departures from the assumed model, whereas
the forward search highlights outliers and groups of observations that are
influential for the fitted model.

5.1 Background
5.1.1 Nonlinear Models
In this section we describe nonlinear regression models and give some
examples. We compare and contrast linear and nonlinear models.
The model for the ith of the n observations was written in (2.2) as

    y_i = η(x_i, β) + ε_i.    (5.1)
For the linear models of Chapter 2 we could then write

    η(x_i, β) = x_i^T β = β0 + Σ_{j=1}^{p-1} βj x_ij,    (5.2)

models that are linear in the parameters β. So, for example,

    η(x_i, β) = β0 + β1 x_i + β2 x_i^2

is a linear model, as is

since the parameters enter linearly. However

    η(x_i, β) = β1 e^{β2 x_i}    (5.3)

is a nonlinear model, since β2 enters nonlinearly.
The statistical importance of the lack of linearity depends upon the
way in which the errors ε_i affect the observations. If the errors satisfy the
"second-order" assumptions of Chapter 2, that is,

    E(ε_i) = 0,    E(ε_i ε_j) = { σ^2    i = j
                                 { 0      i ≠ j,    (5.4)

are normally distributed and are also additive, as in (5.1), the maximum
likelihood estimates of the two parameters in the model (5.3) are also the
least squares estimates minimizing

    S(β) = Σ_{i=1}^n {y_i - β1 e^{β2 x_i}}^2.    (5.5)

For linear models, differentiation of this expression yields the linear normal
equations (2.6) which can be solved explicitly to give the estimates β̂. But
for nonlinear models, differentiation leads to sets of nonlinear equations,
which require iterative numerical methods for their solution. As an example,
the nonlinear model (5.3) yields the pair of equations

    Σ_{i=1}^n e^{β̂2 x_i} {y_i - β̂1 e^{β̂2 x_i}} = 0
    Σ_{i=1}^n β̂1 x_i e^{β̂2 x_i} {y_i - β̂1 e^{β̂2 x_i}} = 0.    (5.6)

The second of these equations can be simplified by cancellation of β̂1 to
yield

    Σ_{i=1}^n x_i e^{β̂2 x_i} {y_i - β̂1 e^{β̂2 x_i}} = 0.

The equations are thus linear in β̂1, which occurs linearly in (5.3), but
nonlinear in β̂2. Numerical solution of such equations is, in general, not
appreciably easier than minimization of the sum of squares function (5.5).
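The special structure can nevertheless be exploited directly: for fixed β2 the estimate of β1 is available in closed form, so S(β) can be minimized by a one-dimensional search over β2. The sketch below is our own, not from the book, and uses a simple grid.

import numpy as np

def profile_fit_exponential(x, y, b2_grid):
    """Minimize S(beta) for y = b1*exp(b2*x) + error over a grid of b2 values."""
    best = None
    for b2 in b2_grid:
        f = np.exp(b2 * x)
        b1 = (f @ y) / (f @ f)                  # closed-form estimate of the linear parameter
        rss = np.sum((y - b1 * f)**2)
        if best is None or rss < best[2]:
            best = (b1, b2, rss)
    return best                                 # (b1_hat, b2_hat, minimized sum of squares)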
However it is not the difficulty in numerical calculation of least squares es-
timates that is the most important difference between linear and nonlinear
least squares. It is the lack of exact inferences even when the errors are
normally distributed.
For linear least squares the estimates β̂ (2.7) are linear functions of the
observations. If the errors are normally distributed, so are the estimates,
with the consequences of t tests for individual parameters, F tests for
groups of parameters and ellipsoidal confidence regions in the parameter
space of the form

    S(β) - S(β̂) ≤ p s^2 F_{p,ν,1-α},    (5.7)

where s^2 is an estimate of σ^2 on ν degrees of freedom. The ellipsoidal shape
of these regions is a consequence of the ellipsoidal contours of the sum of
squares surface.
For nonlinear least squares, explicit formulae cannot usually be found
for the parameter estimates. The estimates are not linear combinations of
the observations, so that they will not be exactly normally distributed,
even if the observations are. Any distributional results for test statistics
will therefore only be asymptotic, increasing in accuracy as the number of
observations increases. In addition, the sum of squares contours may be far
from ellipsoidal. Such contours are often described as being banana shaped,
but some of the figures in Chapter 6 of Bates and Watts (1988) are even
worse than this, showing regions that extend to infinity in one direction.
To find confidence regions based on these contours is computationally com-
plicated and is not usually attempted. Instead inference is made using a

linearized form of the model, which yields approximate confidence regions
which are ellipsoidal.
Taylor series expansion of the nonlinear model (5.1) about the value β0
yields

    y_i ≈ η(x_i, β0) + (β - β0)^T f_i^0 + ε_i,    (5.8)

where f_i^0 is the vector of partial derivatives of the model for observation i
with respect to the p parameters evaluated at β0, that is,

    f_i^0 = ∂η(x_i, β)/∂β |_{β=β0}.    (5.9)

For the linear model η_i = x_i^T β, the partial derivatives f_i are equal to x_i
and the procedures of Chapter 2 are obtained (Exercise 5.1).
It is convenient to write the linearized model (5.8) in matrix form. If we
let

    z_i = y_i - η(x_i, β0)    and    γ = β - β0,    (5.10)

the linearized model is

    Z^0 = F^0 γ + ε,    (5.11)

where F^0 is the n x p matrix with ith row f_i^{0T} and

    Z^0 = (z_1, ..., z_n)^T,

a vector of random variables. The superscripts in (5.11) emphasize the de-
pendence of the linearized extended design matrix F on the parameter value
used in the linearization. As we show, this dependence can produce unex-
pected results when the parameter estimate changes due to the introduction
of outliers in the forward search.
The linearized model (5.11) suggests the Gauss-Newton method for find-
ing the least squares parameter estimates β̂ by iteratively solving for the
least squares estimate in the linearized model and updating, giving the
iteration

    γ^{k+1} = (F^{kT} F^k)^{-1} F^{kT} z^k

and

    β^{k+1} = β^k + γ^{k+1},    k = 0, 1, ....    (5.12)

Convergence occurs when γ^{k+1} is below some tolerance. However, like other
numerical procedures based on Newton's method, (5.12) may diverge. A
brief description of some algorithms for nonlinear least squares is given in
the next section.
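For the exponential model (5.3) the iteration takes the following form; the sketch is ours and omits any step-length control, so, as noted above, it may diverge from a poor starting value.

import numpy as np

def gauss_newton_exponential(x, y, beta0, n_iter=20):
    """Gauss-Newton iterations (5.12) for y = b1*exp(b2*x) + error."""
    b1, b2 = beta0
    for _ in range(n_iter):
        eta = b1 * np.exp(b2 * x)
        F = np.column_stack([np.exp(b2 * x),            # derivative with respect to b1
                             b1 * x * np.exp(b2 * x)])  # derivative with respect to b2
        z = y - eta
        gamma = np.linalg.solve(F.T @ F, F.T @ z)
        b1, b2 = b1 + gamma[0], b2 + gamma[1]
    return b1, b2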
As for the linear model of previous chapters, the least squares parameter
estimate β̂ gives a vector of residuals e and an estimate s^2 of σ^2. We

denote the extended design matrix for the linearized model by F̂ so that
the asymptotic variance covariance matrix of β̂ from the linearized model
is

    var(β̂) = s^2 (F̂^T F̂)^{-1}.    (5.13)

The corresponding hat matrix is

    Ĥ = F̂(F̂^T F̂)^{-1} F̂^T,    (5.14)

although we may often omit the "hat" and write Ĥ as H if no confu-
sion can occur. The approximate 100(1 - α)% confidence region from the
linearization consists of those values of β for which

    (β - β̂)^T F̂^T F̂ (β - β̂) ≤ p s^2 F_{p,ν,1-α}.    (5.15)

There are two approximations involved in (5.15). The first is that the ellip-
soidal confidence regions that it generates may not follow closely the sum
of squares contours given by the likelihood region (5.7). The second is that
the content of the region may not be exactly 100(1 - α)%.
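In practice the linearization is used through s^2(F̂^T F̂)^{-1}: given F̂ evaluated at β̂ and the residual vector, approximate standard errors, and hence rough intervals of the form β̂_j ± t s.e.(β̂_j), follow at once. A minimal sketch of ours:

import numpy as np

def linearized_inference(F_hat, resid):
    """Approximate covariance matrix and standard errors from (5.13)."""
    n, p = F_hat.shape
    s2 = resid @ resid / (n - p)
    cov = s2 * np.linalg.inv(F_hat.T @ F_hat)
    return cov, np.sqrt(np.diag(cov))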
A geometrical interpretation of the difference between linear and non-
linear least squares is helpful in interpreting the results on curvature in
the next section. If the vector observation y is considered as a point in n-
dimensional space, a linear model with p explanatory variables defines the
expectation plane, a p-dimensional subspace of this n-dimensional space.
Least squares estimation finds the nearest point on this subspace to y. The
least squares estimate /3 is therefore the foot of the perpendicular from y
to the expectation plane. Confidence regions for (3 are the locus of points
where the distance from y to the plane is constant. This distance is of
course greater than the perpendicular distance to /3. Such loci are formed
by the intersection of a cone and the plane and are circular. As Figure 5.1
indicates, the lines of constant parameter values on the expectation plane
are in general neither perpendicular nor do they have the same scale in the
different parameters. The circular intersection of the cone and subspace
therefore becomes ellipsoidal in the space of the parameters.
For nonlinear models changing parameter values likewise generate an ex-
pectation surface, but this is no longer planar. The least squares estimate is
again the point on the expectation surface nearest to y. The linearized fitted
model from iteration of (5.11) is E(Z) = F̂γ, the linearization producing
a planar approximation to the expectation surface which is the tangent
plane to the surface at β̂. The approximate confidence region for β from
the linearization (5.15) is the locus of points formed by the intersection of
the cone of constant distances from y on this tangent plane. The contours
of constant increase of sum of squares that form the likelihood-based confi-
dence region are more difficult to calculate. They consist of the intersection
of lines of constant length from y to the true expectation surface. In general
the resulting intersection will not lie in a linear subspace. It may well have

a complicated shape both in the n-dimensional observation space and in
the space of the parameters.
If the exact distribution of the parameter estimates, or the exact content
of the confidence region, is of importance, simulation may be used to evalu-
ate these for a particular model and set of parameter values. In general the
adequacy of the linear approximation to the model depends on the nonlin-
earity of the sum of squares surface, which can be captured through the
properties of the higher-order derivatives of the model. These derivatives ,
which were ignored in (5.8), are the subject of the next section. They are
zero for models that are exactly linear.
We conclude this section with some comments on the relationship
between nonlinear models and the form of the error distribution.
The nonlinear model introduced in (5.3) was combined with additive
errors to give the statistical model of the ith observation as

    y_i = β1 e^{β2 x_i} + ε_i.    (5.16)

The resulting least squares equations for estimation of the parameters in
(5.6) were linear in β1 but nonlinear in β2. But suppose that instead of
additive errors, there were multiplicative errors E_i with a log normal dis-
tribution. Such errors might be appropriate for data in which the response
cannot be negative. The model is then

    y_i = β1 e^{β2 x_i} E_i.    (5.17)

On taking logs we obtain

    log y_i = log β1 + β2 x_i + ε_i,    (5.18)

with ε_i = log E_i having a normal distribution. Model (5.18) is not in fact
linear in β1, but can be linearized by putting a new linear parameter γ0 =
log β1. The only nonlinearity in this model is due to the parameterisation.
In passing we note that this transformation is a specific instance of the
method of transforming both sides of a model which was the subject of
§4.11.
The important point for the next section is that it is possible to distin-
guish between nonlinearity that can be reduced, or in this case, removed,
by reparameterization and the nonlinearity that is inherent in the model.
But this nonlinearity depends both on the model and on the errors. If (5.3)
is fitted by regression of log y on x, it is clear from the derivation of (5.18)
that multiplicative errors with a particular variance structure are being
assumed.

5.1.2 Curvature
Before giving a mathematical definition of curvature for nonlinear models
we show examples of the expectation surface which was introduced at the

Figure 5.1. Linear model: portion of the expectation plane in the response space,
with lines generated by some values of the parameters β1 and β2

end of the previous section. The function η(x, β) when β varies forms a
p-dimensional expectation surface in the n-dimensional response space. If
η(x, β) = Xβ this surface is a linear subspace of the response space. For lin-
ear models the expectation surface is often called the "expectation plane."
For example, consider a linear model with just three cases and suppose
that the design matrix X is

0.26]
1.78 . (5.19)
2.00

In this case the expectation function η(β) defines a two-dimensional
expectation plane in the three-dimensional response space.
Figure 5.1 shows a portion of the expectation plane in the response space.
This figure also shows the parameter lines corresponding to the values
β1 = -2, -1.5, ..., 1.5 and β2 = 0, 1, 2, ..., 5. For this linear model straight,
parallel and equispaced lines in the parameter space have images in the
expectation plane that are straight, parallel and equispaced.
The two vectors corresponding to the two columns of the matrix X are
not orthogonal (the angle between them is about 50°) and have unequal
lengths (Exercise 5.3). This implies that unit squares on the parameter
plane map to parallelograms on the expectation plane. For linear models
we leave it as an exercise to show that the Jacobian of the transformation
from the parameter plane to the expectation surface is constant and equal
to |X^T X|^{1/2}. Thus, for linear models, regions of fixed size in the parameter
plane map to regions of fixed size in the expectation surface. Unfortunately,
this relationship no longer holds when we consider nonlinear models.


Figure 5.2. Nonlinear cooling model: one-dimensional expectation surface in the
three-dimensional response space. Points corresponding to β = 0, 0.1, 0.2, ..., ∞
are marked with crosses

To help understand the characteristics of the expectation surface for
nonlinear models we consider the model

    η(x_i, β) = 60 + 70 e^{-βx_i},    β ≥ 0.    (5.20)

This equation is a particular example of Newton's law of cooling when the
ambient temperature is 60 degrees and the initial temperature of the object
being cooled is 70 degrees higher. The predicted temperature at time x_i
is then η(x_i, β). Bates and Watts (1988, page 268) give some data from
a historical experiment on the cooling of a cannon barrel due to Count
Rumford and describe the experiment on p. 33. On page 44 they use two
observations from this example to introduce the geometry of nonlinear least
squares. We extend this by supposing we have three observations, one each
at x_1 = 4, x_2 = 40 and x_3 = 10. Figure 5.2 shows the one-dimensional ex-
pectation surface obtained by substituting values for β in the range [0, ∞)
and then plotting the points in the three-dimensional response space. In
this case, in contrast to what happens for linear models, the expectation
surface is curved and of finite extent. In addition, Figure 5.2 clearly shows
that points with equal spacing on the parameter line map to points with
unequal spacing on the expectation surface. Generally, in nonlinear mod-
els straight parallel and equispaced lines in the parameter space map to
lines that are neither straight, nor parallel, nor equispaced. Furthermore,
in nonlinear models the Jacobian determinant is not constant and this im-
plies that unit squares on the parameter plane map to nonconstant and
irregularly shaped areas on the expectation surface. The nonlinearity of
the expectation surface involves two aspects: the intrinsic bending of the


Figure 5.3. Nonlinear cooling model: one-dimensional expectation sur-
face in the three-dimensional response space. Points corresponding to
τ = -∞, ..., -4, -3.5, ..., -1, ..., +∞ are marked with crosses

curve and the unequal spacing of the values on the expectation surface. It
is interesting to analyze what happens when we reparameterize the model.
If we set τ = log10 β, equation (5.20) can be rewritten as

    η(x_i, τ) = 60 + 70 e^{-x_i 10^τ}.

Figure 5.3 shows the plot of the new expectation surface after the repa-
rameterization. The expectation curve is identical to that of Figure 5.2, but
now the spacing of the values of τ is much more uniform in the centre of the
expectation surface. This simple example shows the different characteris-
tics of the two aspects of curvature: the curving of the expectation surface
which does not depend on the parameterization used (this aspect is called
"intrinsic curvature"), and the second aspect which reflects how equally
spaced values in the parameter space map to unequally spaced values in
the response space. This second aspect depends on the parameterization
used and therefore is called "parameter effects curvature." If intrinsic cur-
vature is high the model is highly nonlinear and the linear tangent plane
approximation is not appropriate. High parameter effects curvature, on the
other hand, can often be corrected by an appropriate reparameterization
of the model.
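The two parameterizations are easily compared numerically. The sketch below, ours rather than the book's, evaluates the expectation vector at the three sampling times of the text for equally spaced values of β and of τ and prints the distances between successive points on the expectation surface; the β spacing is very uneven, the τ spacing much less so.

import numpy as np

x = np.array([4.0, 40.0, 10.0])                      # sampling times used in the text

def eta(beta):
    return 60 + 70 * np.exp(-beta * x)               # point on the expectation surface

pts_beta = np.array([eta(b) for b in np.arange(0.1, 1.01, 0.1)])
pts_tau = np.array([eta(10.0**t) for t in np.arange(-1.0, 0.01, 0.1)])
print(np.linalg.norm(np.diff(pts_beta, axis=0), axis=1))   # unequal spacing in beta
print(np.linalg.norm(np.diff(pts_tau, axis=0), axis=1))    # more uniform spacing in tau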
In the previous section we obtained linear approximation inference re-
gions from a first-order Taylor series approximation to the expectation
surface evaluated at β̂. Geometrically equation (5.15) assumes that around
β̂ we can replace the expectation surface by the tangent plane. This local
approximation is appropriate only if η(x, β) is fairly flat in that neighbour-
hood, which in turn is true only if in the region of interest straight, parallel,

equispaced lines in the parameter space map into nearly straight, parallel,
equispaced lines in the expectation surface. To determine how planar is
the expectation surface and how uniform the parameter lines are on the
tangent plane we can use second derivatives of the expectation function.
If η(β) and β are one-dimensional, then the first derivative of η(β) gives
the slope of the curve, while the second derivative gives the rate of change
of the curve, which is related to the idea of curvature. Intuitively, since in
linear models second- and higher-order derivatives are zero, it seems logical
to measure nonlinearity by investigating the second-order derivatives of the
expectation function.
More formally, if β is close to β̂ we have the quadratic approximation

    η(x_i, β) - η(x_i, β̂) ≈ (β - β̂)^T f̂_i + (1/2)(β - β̂)^T F̂_i (β - β̂),    (5.21)

where, as in (5.9), f̂_i is the p x 1 vector of partial derivatives of the model
for observation i with respect to the p parameters, but now evaluated at
β̂; that is,

    f̂_i = ∂η(x_i, β)/∂β |_{β=β̂}.

Likewise F̂_i is a p x p symmetric matrix of second derivatives, with element
r, s for the ith observation defined as

    f̂_{irs} = ∂^2 η(x_i, β)/∂β_r ∂β_s,    r, s = 1, ..., p.

If we let b_r = β_r - β̂_r, equation (5.21) can be rewritten as

    η(x_i, β) - η(x_i, β̂) ≈ (β - β̂)^T f̂_i + (1/2) Σ_{r=1}^p Σ_{s=1}^p b_r b_s f̂_{irs}.    (5.22)

To calculate the curvature of the expectation surface it is more convenient
to use a vector notation in which vectors are of length n so that we have,
for example, a vector of derivatives of all observations with respect to a
single parameter. Let such a vector be f_r, which is then the rth column of
the linearized extended design matrix F̂ that appeared in equations (5.13)
to (5.15). Similarly if we let the n x 1 vector f_rs = (f̂_{1rs}, ..., f̂_{nrs})^T and

    η(x, β) = {η(x_1, β), ..., η(x_n, β)}^T,

we can express (5.21) in vector form as

    η(x, β) - η(x, β̂) ≈ Σ_{r=1}^p f_r b_r + (1/2) Σ_{r=1}^p Σ_{s=1}^p b_r b_s f_rs.    (5.23)

Bates and Watts (1988) call the vectors f_r "velocity vectors" because they
give the rate of change of η with respect to each parameter. As a conse-

quence, the vectors f_rs are called acceleration vectors because they give
the rate of change of the velocity vectors with respect to the parameters.
From the first-order Taylor series expansion used in the preceding section
we know that the velocity vectors form the tangent plane to the expecta-
tion surface at the point β̂. The validity of the tangent plane approximation
depends on the relative magnitude of the elements of the vectors f_rs, which
contain the quadratic terms, to the velocity vectors, which contain the
linear terms. To assess this magnitude, the acceleration vectors can use-
fully be divided into two parts. One, f_rs^T, lies in the tangent plane and is
informative about parameter-effects curvature. The other part of the accel-
eration vectors, f_rs^N, is normal to the tangent plane. The division uses the
projectors

    Ĥ = F̂(F̂^T F̂)^{-1} F̂^T    and    I_n - Ĥ

to split the n x 1 vectors f_rs that contain the quadratic terms into the two
orthogonal vectors

    f_rs^T = Ĥ f_rs    (5.24)
    f_rs^N = (I_n - Ĥ) f_rs,    (5.25)

with, of course,

    f_rs = f_rs^T + f_rs^N.

The projection on the tangent plane is given by f_rs^T, whereas f_rs^N is normal
to the tangent plane.
The extent to which the acceleration vectors lie outside the tangent plane
measures the degree of deviation of the expectation surface from a plane
and therefore the nonplanarity of the expectation surface. In other words,
the vectors f_rs^N measure the intrinsic nonlinearity of the expectation surface
which is independent of the parameterization used. The projections of the
acceleration vectors in the tangent plane (f_rs^T) measure the degree of non-
uniformity of the parameter lines on the tangent plane and so depend on
the parameterization used.
In order to evaluate parameter effects curvature and the intrinsic
curvature we can use the ratios

    ||Σ_{r=1}^p Σ_{s=1}^p f_rs^T b_r b_s|| / ||Σ_{r=1}^p f_r b_r||^2    (5.26)

    ||Σ_{r=1}^p Σ_{s=1}^p f_rs^N b_r b_s|| / ||Σ_{r=1}^p f_r b_r||^2,    (5.27)

where by the symbol ||z|| we mean the Euclidean norm of z; that is, the
square root of the sum of squares of the elements of z: ||z|| = √(Σ_{i=1}^n z_i^2).
If we want to measure the curvatures in a direction specified by some
vector h = (h_1, ..., h_p)^T we can replace b_r by h_r (r = 1, 2, ..., p) in
equations (5.26) and (5.27). These curvatures can be standardized to be
comparable between different models and sets of data, using the dimen-
sions of the derivatives. Both f_r and f_rs have the same dimension as the
response, so the numerators of the curvature measures are of dimension
response and the denominators of dimension (response)^2. The curvatures
are therefore measured in units of (response)^{-1} and may be made scale free
through multiplication by the factor s.
It is possible to show that the geometric interpretation of intrinsic cur-
vature is as the reciprocal of the radius of the hypersphere that best
approximates the expectation surface in the direction h. Given that the
sum of squares contour {y - η(x, β)}^T {y - η(x, β)} bounding a nom-
inal 1 - α region in the tangent plane coordinates is a hypersphere of
radius √(p s^2 F_{p,ν,1-α}), multiplication of the curvatures in equations (5.26)
and (5.27) by the factor s√p gives values that can be compared with the
percentage points of 1/√F_{p,ν,1-α}.
Bates and Watts (1988, Ch. 7) suggest maximizing the two curvature
measures with respect to h, rescaling them by the factor s√p and then
comparing the obtained values with the percentage points of 1/√F_{p,ν,1-α}.
The maximizing values of h are found by numerical search as described by
Bates and Watts (1980, §2.5).
During the forward search in order to evaluate the degree of curvature
of the model we monitor the quantities

    γ^T_max = s√n max_h ||Σ_{r=1}^p Σ_{s=1}^p f_rs^T h_r h_s|| / ||Σ_{r=1}^p f_r h_r||^2    (5.28)

    γ^N_max = s√n max_h ||Σ_{r=1}^p Σ_{s=1}^p f_rs^N h_r h_s|| / ||Σ_{r=1}^p f_r h_r||^2.    (5.29)

A major disadvantage of the ratios (5.28) and (5.29) is that they mea-
sure the worst possible curvature in any direction. Therefore, sometimes
γ^T_max may be spuriously high even if the model is satisfactory. On the other
hand Cook and Witmer (1985) give examples of data sets with γ^T_max below
1/√F_{p,ν,α} but with an unsatisfactory tangent plane approximation. The
problem of the lack of a precise threshold with which to compare the re-
sulting statistics is not so important in the context of the forward search,
because we are interested in monitoring the effect on the curvatures of the
introduction of each observation. Furthermore, as we saw in the earlier
chapters, the units that are included in the last steps of the forward search
often form particular clusters. In this way we can easily monitor the effects
on the degree of curvature of such clusters of observations.
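The projection step is easy to code once the first and second derivative arrays are available. The sketch below is ours and evaluates the unscaled ratios (5.26) and (5.27) for one chosen direction h; the maximization over h and the rescaling by factors such as s√n are not attempted here.

import numpy as np

def curvature_ratios(F, G, h):
    """Unscaled parameter-effects and intrinsic curvature in direction h.

    F : n x p array of first derivatives (velocity vectors) at beta_hat.
    G : n x p x p array of second derivatives (acceleration vectors).
    """
    P = F @ np.linalg.inv(F.T @ F) @ F.T             # projector onto the tangent plane
    accel = np.einsum('irs,r,s->i', G, h, h)         # sum over r, s of f_rs h_r h_s
    vel = F @ h
    denom = vel @ vel                                # squared norm of the velocity term
    tang = P @ accel                                 # component in the tangent plane
    norm_part = accel - tang                         # component normal to the tangent plane
    return np.linalg.norm(tang) / denom, np.linalg.norm(norm_part) / denom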

5.2 The Forward Search


5.2.1 Parameter Estimation
The parameter estimates minimizing the sum of squares S(β) can be found
by use of one of the many techniques for numerical minimization such as
quasi-Newton methods. Descriptions are given by Seber and Wild (1989,
Ch. 14). Or a method can be employed that uses some of the least squares
structure of the problem. Here we briefly outline the Marquardt-Levenberg
algorithm (Levenberg 1944; Marquardt 1963).
The Gauss-Newton iteration was given in (5.12) as

    β^{k+1} = β^k + γ^{k+1},

where

    γ^{k+1} = (F^{kT} F^k)^{-1} F^{kT} z^k.
This procedure can fail to converge if in any iteration the sum of squares
increases; that is, S(β^{k+1}) > S(β^k). If there is an increase a search
can be performed in the Gauss-Newton direction, yielding the parameter
correction

    γ^{k+1} = α_k (F^{kT} F^k)^{-1} F^{kT} z^k,    (5.30)

with the step length α_k < 1 such that the iteration does make progress, that
is, so that S(β^{k+1}) < S(β^k). Either a rough line search can be performed,
or, starting with α_0 = 1, the value of α_k can be continually halved until
the sum of squares does decrease.
An alternative is the method of steepest descent. The sum of squares can
be written approximately as

    (5.31)

where, from (5.10),

    z^k = y - η(x, β^k).

Differentiation of z^k with respect to the parameters yields the matrix of
partial derivatives F^k (5.11), so the gradient direction for S(β^k) is -F^{kT} z^k.
The parameter correction to decrease S(β^k) is thus

    γ^{k+1} = α_k F^{kT} z^k,    (5.32)

where, again, α_k is a step length to be determined numerically.
where, again, (Xk is a step length to be determined numerically.
Although the steepest descent algorithm (5.32) will converge, conver-
gence can be very slow, whereas the Gauss- Newton algorithm converges
speedily once it is in the neighbourhood of a region near the minimum
where the linear approximation to the model is good. The Marquardt-
Levenberg algorithm combines the two methods by taking the parameter
correction
(5.33)

When λ = 0, (5.33) is the Gauss-Newton algorithm. As λ → ∞ steepest descent is obtained. In using (5.33) it is customary to standardize F^k so that the diagonal elements of F^{kT} F^k are all one. As a result of the standardization, changes in the scalar λ are interpretable as equal changes in the search direction in all coordinates of the linearized model.
The fundamental idea of the algorithm is to choose a value of λ_0 large enough that the initial iterations move in the steepest descent direction and then gradually to change to Gauss-Newton by decreasing λ. If there is progress at iteration k, that is S(β^{k+1}) < S(β^k), put

  λ_{k+1} = λ_k / v        (v > 1).

If however S(β^{k+1}) > S(β^k), return to β^k and repeat the step (5.33) with vλ_k, v²λ_k, etc. until improvement occurs, which it must for λ sufficiently large, unless a minimum has been reached. A value of two is often used for v.
A difficulty in the application of this algorithm is that a line search involving α_k can be included for any λ_k. A strategy that seems to work well is to start with the full step, α_k = 1. If this gives an increase in the sum of squares, α_k can be reduced and the sum of squares recalculated. If several reductions in α_k fail to bracket the minimum, λ_k should be increased and the process repeated from the full step length. Successful searches should lead to a decrease in the value of λ.
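A sketch of the complete Marquardt-Levenberg strategy just described, with the column standardization of F^k and the factor v = 2 taken from the text; the convergence rule, the upper limit on λ and the function names are illustrative assumptions:

# Sketch of the Marquardt-Levenberg iteration (5.33) with the lambda-update
# rule described in the text: divide lambda by v after a successful step,
# multiply by v (repeatedly) after a failure. resid and jac are as before.
import numpy as np

def marquardt_levenberg(beta, resid, jac, lam=1.0, v=2.0,
                        max_iter=100, tol=1e-8):
    for _ in range(max_iter):
        z = resid(beta)
        F = jac(beta)
        scale = np.sqrt((F ** 2).sum(axis=0))       # standardize columns of F
        Fs = F / scale
        while True:
            A = Fs.T @ Fs + lam * np.eye(F.shape[1])
            gamma = np.linalg.solve(A, Fs.T @ z) / scale
            z_new = resid(beta + gamma)
            if z_new @ z_new < z @ z:               # progress: accept and shrink lambda
                beta = beta + gamma
                lam /= v
                break
            lam *= v                                # otherwise retry with larger lambda
            if lam > 1e12:
                return beta                         # give up: presumably near a minimum
        if np.linalg.norm(gamma) < tol * (1 + np.linalg.norm(beta)):
            return beta
    return beta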
A general method like this can be combined with methods taking advantage of any special structure in the model. Sometimes it is possible to partition the parameters into a group that occurs linearly in the model and a group that occurs nonlinearly. Numerical search is then only necessary in the lower-dimensional space of the nonlinear parameters. For example, the model (5.3) is such that, for known values of β2, the model is linear in β1. The parameter estimates can then be found by a numerical search over values of β2, the corresponding value of β1 being found by solution of a linear least squares problem.
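A sketch of this partitioned strategy, using as an illustration the model η = β1 exp(−β2 x), which is linear in β1 for fixed β2 (this particular form, the search interval and the function names are assumptions made only for the sketch, not necessarily the book's model (5.3)):

# Sketch of profiling out a linear parameter: for each trial value of the
# nonlinear parameter b2, the conditionally optimal b1 is found by linear
# least squares, and only b2 is searched numerically. The exponential form
# and the search interval are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

def profile_fit(x, y):
    def rss(b2):
        g = np.exp(-b2 * x)                 # regressor for the linear part
        b1 = (g @ y) / (g @ g)              # conditional least squares for b1
        return ((y - b1 * g) ** 2).sum()
    res = minimize_scalar(rss, bounds=(1e-6, 10.0), method='bounded')
    b2 = res.x
    g = np.exp(-b2 * x)
    b1 = (g @ y) / (g @ g)
    return b1, b2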
Our numerical requirements fall into two parts. It is not always easy to
achieve convergence of numerical methods for the randomly selected subsets
of p observations, one of which provides the starting point for the forward
search. We attack this problem by brute force , using in succession the
numerical optimization algorithms provided by GAUSS until one of them
yields convergence. As a first algorithm we use steepest descent, followed
by several quasi-Newton algorithms and finishing with a form of conjugate
gradient algorithm.
Once the forward search is under way we order the observations from the subset of size m using the linear approximation at the parameter estimate β̂*_m. We then use this estimate as a starting point in finding β̂*_{m+1}, the estimate for the subset of size m + 1.
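One step of this part of the forward search might be sketched as follows, assuming a generic fitting routine fit_nls(x, y, start) and a function eta(x, beta) for the model; both names, and the simple ordering rule used here, are illustrative assumptions:

# Sketch of one forward search step for nonlinear regression: fit to the
# current subset with the previous estimate as starting value, order all n
# observations by squared residuals from that fit, and take the m+1 closest
# as the next subset. fit_nls, eta and the data arrays are assumed names.
import numpy as np

def forward_step(subset, beta_m, x, y, eta, fit_nls):
    beta_m = fit_nls(x[subset], y[subset], start=beta_m)   # warm start
    resid2 = (y - eta(x, beta_m)) ** 2
    order = np.argsort(resid2)                             # closest first
    m_new = len(subset) + 1
    new_subset = np.sort(order[:m_new])                    # subset of size m+1
    return new_subset, beta_m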

5.2.2 Monitoring the Forward Search


In the examples of linear regression we found it informative to look at the
evolution of residuals, leverage, parameter estimates and t statistics during
the search. For nonlinear regression we also look at these quantities, now
derived from the linear approximation (5.8). In addition, useful information
can be obtained from the evolution of the value of s² and of the squared multiple correlation coefficient R2.
In (2.20) R2 was defined as the proportion of the corrected sum of squares
explained by the model. This is a proper definition if it is known that
the observations do not have zero mean, so that a sensible null model is
estimated by fitting a constant to the data. However most nonlinear models
do not include a constant, so that the appropriate definition of R2 is as the
proportion of the total sum of squares explained by the model. If the total
sum of squares of the observations is

  S_T = Σ_{i=1}^n y_i²,

the squared multiple correlation coefficient is now defined as

  R2 = {S_T − S(β̂)}/S_T.        (5.34)

A value near one indicates that a large proportion of the total sum of
squares has been explained by the nonlinear regression.
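A small sketch of the two versions of R2 discussed here, the uncorrected form (5.34) and the corrected form appropriate for linear models (the array names are illustrative):

# Sketch: the two squared multiple correlation coefficients discussed in the
# text, computed from the observations y and the residuals y - eta(x, beta_hat)
# of the fitted nonlinear model. Variable names are illustrative.
import numpy as np

def r2_uncorrected(y, residuals):
    # proportion of the TOTAL sum of squares explained, as in (5.34)
    return 1.0 - (residuals ** 2).sum() / (y ** 2).sum()

def r2_corrected(y, residuals):
    # proportion of the CORRECTED sum of squares explained (linear-model form)
    return 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()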
For the detection of outliers and influential observations we again monitor residuals and leverages, but there are now some differences in the method of calculation. A consequence of the dependence of the design matrix F̂_{S*(m)} on the parameter estimate β̂*_m is that some of the deletion results used in §2.6.5 in the derivation of forward deletion formulae now hold only approximately.
As before we denote by S*(m) the subset of size m used for parameter estimation in the forward search. The parameter estimate is β̂*_m and the design matrix is F̂_{S*(m)}, with ith row f̂^T_{i,S*(m)}. The leverage is then written

  h_{i,S*(m)} = f̂^T_{i,S*(m)} (F̂^T_{S*(m)} F̂_{S*(m)})^{-1} f̂_{i,S*(m)}.

One residual we monitor is the maximum studentized residual in the subset

  r^nls_[m] = max |r^nls_{i,S*(m)}|   for i ∈ S*(m),   m = p+1, ..., n,        (5.35)

where the studentized residual r^nls_{i,S*(m)} is defined as

  r^nls_{i,S*(m)} = {y_i − η(x_i, β̂*_m)} / {s_{S*(m)} √(1 − h_{i,S*(m)})}.        (5.36)

In linear regression we also monitored the minimum deletion residual among the units not belonging to the subset. In nonlinear regression, in order to compute the minimum deletion residual r*_[m+1] we use equation (2.36) adapted to the nonlinear case

  r*_[m+1] = min r*_{i,S*(m)}   for i ∉ S*(m),   m = p+1, ..., n−1,        (5.37)

where r*_{i,S*(m)} is given by

  r*_{i,S*(m)} = {y_i − η(x_i, β̂*_m)} / {s_{S*(m)} √(1 + h_{i,S*(m)})}   for i ∉ S*(m).        (5.38)

In linear regression in order to compute the deletion residual for unit i


we can use either the fundamental definition in (2.36) or the more easily
calculated form (2.37). This identity no longer holds in the nonlinear case
because of the dependence of the ith row of the matrix F on the estimated
parameters.
Likewise, in order to monitor the Cook distance D_{m,i} we calculate

  D_{m,i} = {r^nls_{i,S*(m)}}² h_{i,S*(m)} / {p (1 − h_{i,S*(m)})}   for i ∉ S*(m−1) but i ∈ S*(m).        (5.39)
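The quantities (5.35)-(5.39) can be computed from the linear approximation at β̂*_m as in the following sketch; F_hat is the matrix of partial derivatives at the current estimate, e the corresponding residual vector and in_subset an indicator of S*(m), all illustrative names:

# Sketch of the monitored quantities under the linear approximation at the
# current estimate: leverages, studentized residuals for units in the subset,
# deletion residuals for units outside it, and the approximate Cook distance.
# F_hat (n x p) and the residual vector e are evaluated at the current
# estimate; in_subset is a boolean mask for S*(m). Names are illustrative.
import numpy as np

def forward_diagnostics(F_hat, e, in_subset):
    Fm = F_hat[in_subset]
    G = np.linalg.inv(Fm.T @ Fm)
    h = np.einsum('ij,jk,ik->i', F_hat, G, F_hat)      # leverages h_i
    p = F_hat.shape[1]
    m = in_subset.sum()
    s2 = (e[in_subset] ** 2).sum() / (m - p)           # s^2 from the subset
    r_stud = e[in_subset] / np.sqrt(s2 * (1.0 - h[in_subset]))
    r_del = e[~in_subset] / np.sqrt(s2 * (1.0 + h[~in_subset]))
    max_stud = np.abs(r_stud).max()                    # (5.35)
    min_del = np.abs(r_del).min()                      # (5.37)-(5.38)
    cook = r_stud ** 2 * h[in_subset] / (p * (1.0 - h[in_subset]))  # (5.39)
    return max_stud, min_del, cook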
Finally, in order to calculate confidence intervals for the response we need to extend the definition of leverage to any point x, at which the vector of partial derivatives in the linearized model is f̂_{S*(m)}(x). If we write

  h_{S*(m)}(x) = f̂^T_{S*(m)}(x) (F̂^T_{S*(m)} F̂_{S*(m)})^{-1} f̂_{S*(m)}(x),

an approximate 100(1 − α)% confidence interval for the fitted response, based on the linearization, is given by

  η(x, β̂*_m) ± s √{p F_{p,n−p,1−α} h_{S*(m)}(x)},        (5.40)

where F_{p,n−p,1−α} is, as before, the relevant percentage point of the F distribution.
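A sketch of the interval (5.40), again based on the linearization; the function and array names are illustrative, and the form of the half-width follows the reconstruction of (5.40) given above:

# Sketch of the approximate 100(1-alpha)% band (5.40) for the fitted response.
# x_eta is the fitted value at the point x, f_x the vector of partial
# derivatives at x, F_hat the derivative matrix for the fitted observations
# and s2 the residual mean square. Names are illustrative.
import numpy as np
from scipy.stats import f as f_dist

def response_band(x_eta, f_x, F_hat, s2, n, p, alpha=0.01):
    G = np.linalg.inv(F_hat.T @ F_hat)
    h_x = f_x @ G @ f_x                          # leverage at the point x
    half_width = np.sqrt(s2 * p * f_dist.ppf(1 - alpha, p, n - p) * h_x)
    return x_eta - half_width, x_eta + half_width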

5.3 Radioactivity and Molar Concentration of Nifedipene
We start with an example in which the behaviour of the quantities moni-
tored during the forward search is practically indistinguishable from those
in the examples of linear regression in Chapters 1 and 3. However there
seems to be one outlier that, while not important to inferences, does make
the fitted model more curved.

Figure 5.4. Molar data: forward plot of scaled residuals

Bates and Watts (1988, p. 306, 307) give data relating molar concentration of nifedipene (NIF) to radioactivity counts y_i in rat heart tissue tagged with radioactive nitrendipene. They propose fitting the four-parameter logistic model

  (5.41)

where

  x_i = log10(NIF concentration).

We follow St Laurent and Cook (1993) and consider only the 16 observa-
tions for tissue sample 2. These data are in Table A.11. The data consist
of replicate observations at eight values of x. For observations 1 and 2 the
NIF concentration is zero, so that Xi = -00. Again following St Laurent
and Cook we take the value as -27 in our calculations, a value also used
in our plot of the data in Figure 5.7. Then, provided /34 > 0, (5.41) for
these observations becomes 'T](Xi' (3) = /31 + /32. These two observations give
values of Yi around 5,000. The minimum value, observation 15, is 1,433.
Thus, although the data are counts, there should be no problem in treating
them as continuous observations.
Figure 5.4 shows a forward plot of the scaled residuals from an initial
subset consisting of observations 5, 7, 9 and 13. These residuals are remark-
ably stable, with little indication of any outliers. Observation 12 has the
largest residual and is the last to enter the forward search, but its residual
is hardly changed by its inclusion; this observation seems to agree with the
model fitted to the remaining data. The forward plot of leverages, which we

Figure 5.5. Molar data: (left) scaled estimated beta coefficients and (right) t statistics

do not give here, also fails to reveal any observations appreciably different
from the majority of the data.
The indication of the lack of several influential observations is supported by the plot in Figure 5.5 which shows the parameter estimates and their associated t statistics during the forward search. Given that the coefficients have very different scales, we have divided each curve by its maximum (minimum for β3 because its values are negative). From Figure 5.5(left), the estimates of β1, β2 and β3 can be seen to be virtually constant. The inclusion of observation 12 seems to have a nonnegligible effect only on the estimate of β4. The plots of the t statistics show the declining shape, without appreciable jumps, that follows from the stability of the parameter estimates coupled with the increase in the estimate s² during the forward search. The effect of the inclusion of observation 12 is to halve the values of the first three t statistics and to reduce t4 from 3.25 to 2.2, a value no longer significant at the 1% level.
Further information about the effect of observation 12 is given by Figure
5.6(left) which shows forward plots of the maximum studentized residual
amongst the observations in the subset and, in the right panel, the value
of s². Both curves show an increase when observation 12 is introduced.
The studentized residual achieves its maximum value and there is a large
increase in the value of s². These plots both support the conclusion from
the others that there is no complicated structure of influential observations
or outliers to be unravelled and that observation 12 is outlying.
Finally, in Figure 5.7 we show a plot of the data together with the fitted
model with and without observation 12. A slightly strange feature of these
data, shared with those from the other tissue samples given by Bates and
Watts, is that they show a slight initial rise. The large residual of observa-
tion 12 is not immediately apparent, because it falls in a part of the curve
that is decreasing rapidly. Including this observation has a noticeable effect

Figure 5.6. Molar data: forward plots of (left) the maximum studentized residual in the subset and (right) s²

on the shape of the fitted model and also on the two measures of curva-
ture: the parameter effects curvature increases from 1.33 to 3.09 and the
intrinsic curvature from 0.68 to 0.95. Although, with just one explanatory
variable, the outlying nature of observation 12 is not hard to detect, our
forward procedure both provides evidence of the effect of this observation
on a variety of inferences and establishes that this remote observation is
the only one that has a significant effect on any inferences about the model.

5.4 Enzyme Kinetics


We now look at a slightly more complicated data set, which we use in part
to recall the bridge that the forward search provides between least median
of squares and least squares. It also provides a further example of the way in
which models may be linearized by rearrangement if the error distribution
is ignored.
The data in Table A.12 are from an enzyme kinetics study with two
variables, substrate concentration x and inhibitor concentration I. The
response is the initial velocity of the reaction. The data were analyzed by
Ruppert et al. (1989) using a transformation of both sides of the model
combined with weighted least squares and by Stromberg (1993) combining
nonlinear least squares and very robust regression. Lee and Fung (1997)
used deletion of four observations found from a combination of least squares
and Stromberg's results on least median of squares.
The data are in the form of the results of a factorial experiment with
substrate Xi at five levels and inhibitor at four levels. Since one of the
observations is missing, there are 19 observations in all. We number the

Figure 5.7. Molar data: observed and fitted values with (continuous line) and without (dashed line) observation 12. The inclusion of this observation reduces the curvature of the fitted model. The two points plotted at abscissa −27 correspond to log concentrations of −∞

observations downward by column, noting that, since Stromberg was using


a Lisp-based package, his numbering runs from 0 to 18.
The model fitted by Stromberg was

  y_i = β0 x_i / (β_{1I} + x_i) + ε_i        (I = 1, ..., 4),        (5.42)

so that there is an individual coefficient β_{1I} for each of the four inhibitors.
Figure 5.8 shows the residuals during the forward search. For most of the
search the pattern is stable, with observation 5 the last to enter, preceded by
14 and, before that, 19. These three observations have the largest residuals
until observation 19 enters the subset. These three residuals are therefore
those that appear large in the normal plot of residuals from LMS fitting
such as that given by Stromberg. The normal plot of least squares residuals
corresponds, of course, to the residuals at the end of the search in Figure
5.8. In this figure the largest positive residual is that for observation 15.
The normal plot is again given by Stromberg.
Figure 5.9 gives the forward plot of the leverages h_i, calculated using
the least squares estimates of the parameters at each stage in the forward
search. Observations 5 and 19 have high leverage. But they do not have
much effect on the fitted model, as is shown by the behaviour at the end
of the forward plots in Figure 5.10.
Figure 5.10(left) shows the maximum studentized residual among the units in the subset used for fitting. The increase when the penultimate

Figure 5.8. Kinetics data: forward plot of scaled residuals

'"ci ---------\
1 \

Q) (!)
ci . 19
e
Cl

Q)
>
Q)
...J
~ ............... _._---""
ci
"'"''-----
.
C\J
--"
"'"'
ci

0
"'------------ ---
-----------~~:-~~~~
ci

4 6 8 10 12 14 16 18
Subset size m

Figure 5.9. Kinetics data: forward plot of leverages



Figure 5.10. Kinetics data: forward plots of (left) maximum studentized residual in the subset and (right) two values of R2: the continuous line is for nonlinear models (5.34)

observation, case 14, enters the subset is noticeable: there is then a slight
decrease when the last observation, case 5, enters. The plot reinforces the
separate nature of 5 and 14 which was shown in Figure 5.8. An interpreta-
tion of the importance of these observations comes from the forward plots
of the two values of R2 shown in Figure 5.10( right). The upper value is that
appropriate for nonlinear least squares problems, calculated as in (5.34) us-
ing the total sum of squares, rather than using the corrected sum of squares
as the base. The figure shows how this value of R2 declines at the end of
the search to a value of 0.995. Also given is the curve for the values ap-
propriate for a linear model, calculated using the corrected sum of squares.
This lower curve decreases to 0.983. Whichever value of R2 is used, it is
clear that the last observations are not causing severe degradation of the
model.
Our final plot from the forward search is of t statistics in Figure 5.11. The four parameters β_{1I} all have very similar values, between 6.5 and 8 at the end of the search: the common parameter β0 is much more precisely estimated. Overall the curve shows the gentle downward drift which we
associate with such plots. There is no evidence, at the end of the search or
elsewhere, of an appreciable effect of any outliers. The data seem again to
be well behaved and the parameters well estimated.
There are three further points about the analysis of these data. One is
that we have taken different values of the parameter 131 for each inhibitor,
while using a common value 130. An alternative, potentially requiring fewer
parameters, is to use a value of 131 that depends directly on the inhibitor
concentration. We leave the exploration of this idea to the exercises.
The second point is that the data could be analyzed by rearranging the
model to be linear. We conclude our discussion of this example by looking
at such a rearrangement, one instance of which was introduced at the end

Figure 5.11. Kinetics data: forward plot of t statistics

of §5.1.1 for a different model. To start we assume that there is not only a different parameter β_{1I} for each level of inhibitor, but that the parameter β0 also varies with I. Let the parameter for level I be β_{0(I)}. If the errors can be ignored the model (5.42) becomes, for group I,

  y_i = β_{0(I)} x_i / (β_{1I} + x_i).        (5.43)

This can be rearranged to yield the model

  1/y_i = 1/β_{0(I)} + {β_{1I}/β_{0(I)}} (1/x_i).        (5.44)

Thus estimates of the two parameters at each inhibitor level can be found by regression of 1/y_i on 1/x_i. If a common value β0 is assumed, linear regression can again be used, but now involving all observations to give the estimate of 1/β0.
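A sketch of this linearizing fit, regressing 1/y on 1/x within each inhibitor group and recovering the parameters of (5.44) from the intercept and slope (array and group names are illustrative):

# Sketch: fit the rearranged model (5.44) by regressing 1/y on 1/x within
# each inhibitor group, then recover beta_0(I) and beta_1I from the
# intercept and slope. x, y and the group labels are illustrative arrays.
import numpy as np

def fit_reciprocal(x, y, group):
    estimates = {}
    for g in np.unique(group):
        xr = 1.0 / x[group == g]
        yr = 1.0 / y[group == g]
        slope, intercept = np.polyfit(xr, yr, 1)   # 1/y = intercept + slope * (1/x)
        beta0 = 1.0 / intercept                    # beta_0(I) = 1/intercept
        beta1 = slope * beta0                      # beta_1I = slope * beta_0(I)
        estimates[g] = (beta0, beta1)
    return estimates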
However the parameters are estimated, a plot of 1/y_i against 1/x_i should be an approximately straight line if the model holds. As the plot of Figure 5.12 shows, this does seem to be the case. However the plot shows the effect of transformation in generating leverage points from low values of the substrate concentration x_i. A particularly dangerous point is observation 16, which is both outlying and a leverage point for the highest inhibitor level.
This observation would have an appreciable influence on the parameter es-
timates if least squares were used on the rearranged model. If the errors
are additive and of constant variance in the original form of the model
(5.42), better parameter estimates are obtained by the use of nonlinear
least squares. And, indeed, our analysis shows that observation 16 is not

Figure 5.12. Kinetics data: plots of observations and fitted least squares lines for the model rearranged to be linear (eq. 5.44) for each inhibitor concentration I (right-hand axis)

influential for the nonlinear model. We suggest in the exercises a method


of exploring the effect of the linearizing rearrangement on the distribution
of the parameter estimates.
The final comment is on the difference between the conclusions of our
forward analysis and the backwards diagnostic analysis of Lee and Fung
(1997). Diagnostic procedures start from the least squares fit. As Figure
5.8 shows, the largest residuals at the end of the search are 5, 14 and 15
and these are, indeed, three of the four observations whose importance is
investigated by Lee and Fung (1997). Our forward method has focused
rather on observations 5, 14 and 19. As we have shown, none of these has
a marked effect on the fit of the model to the data.

5.5 Calcium Uptake


We now move on to the consideration of three examples in which our pro-
cedure highlights model inadequacies. The first of these examples uses data
on calcium uptake analyzed by Rawlings et al. (1998, pp. 501- 507).
The data are in Table A.13. There are 27 observations of calcium uptake
y consisting of three readings at each of nine times. The focus of the analysis
by Rawlings et al. (1998) is on numerical procedures for the calculation of
least squares estimates and on tests of hypotheses of the parameters. One
of their conclusions is that an appropriate model is a Weibull growth model

Figure 5.13. Calcium data: forward plot of scaled residuals

combined with additive errors; that is,


(5.45)
where t is time in minutes.
Figure 5.13 is a forward plot of the residuals from fitting this model,
which shows a strange downward trend in many of the more extreme resid-
uals during the last third of the forward search. In the first two thirds of
the search there is a preponderance of positive residuals.
The plot of the estimated parameters in Figure 5.14(left) shows a corre-
sponding upward trend in two of the parameter estimates towards the end
of the search.
The approximate Cook distances in Figure 5.14(right) show an upward
trend. This is not the pattern associated with the introduction of a group
of outliers, which tend to enter the search together and to cause a jump
in the plot, followed by a swift decline. Here the indication is more of the
sequential addition of observations that consistently change the model in a
way to be identified.
Figure 5.15 shows how the curvatures change during the search. The top
panel shows that the parameter effects curvature increases gradually as
observations are included. More surprising is Figure 5.15(bottom), show-
ing that the intrinsic curvature, having increased steadily, decreases in the
last third of the search. These indications of a systematic change in the
relationship between model and data towards the end of the search are
revealed and explained if we look at plots of the fitted model and data
during the search. Figure 5.16(left) shows the fitted model and all obser-
vations when m = 20, which is just before the changes begin in the earlier

Figure 5.14. Calcium data: forward plots of (left) estimated parameters and (right) Cook's distance

Figure 5.15. Calcium data: forward plots of the two curvature measures: (top) parameter effects curvature; (bottom) intrinsic curvature

Figure 5.16. Calcium data: observations and fitted curve: (left) m = 20; (right) m = n; 99% confidence intervals calculated from equation (5.40)

Figure 5.17. Calcium data: forward plots of (left) maximum studentized residual in the subset and (right) minimum deletion residual not in the subset used for fitting

forward plots. The seven observations still to enter are numbered, as they
are in Figure 5.13. These clearly all lie above and away from the fitted
model, with the last three observations to enter, 22, 19 and 23 being most
extreme. The other half of the plot shows the fit at the end of the search
(m = 27). The fitted curve has moved up to accommodate the final group
of observations and is less near its horizontal asymptote at high times, as
well as being apparently concave, which it was not at small times for m =
20.
This gradual change in the shape of the fitted curve explains the patterns
seen in the earlier plots. The large residuals in the earlier part of Figure
5.13 are for the last observations to enter. The only one not to decrease
is for observation 4. As Figure 5.16 shows, this observation is the only

member of the group taken at a low value of t. It is therefore little affected


by the change in the asymptote of the fitted curve. The gradual upwards
movement of the curve for larger t does indeed cause the positive residuals
for observations 17, 19, 22, 23, 24 and 26 to decrease, as is evident in
Figure 5.13. Figure 5.14 shows the corresponding changes in the estimated
parameters and the related Cook statistic. The move of the model towards
concavity at low values of time as the last observations are included is a
move towards a more nearly linear form, which is reflected in the reduction
in intrinsic curvature shown in Figure 5.15(bottom). The parameter effects
curvature is related to the parameterisation of the model, not to the form
it is representing, and so is, as Figure 5.15(top) shows, unaffected by this
change.
A comparison of the plots of the fitted models for m = 20 and m = 27 in
Figure 5.16 seems to suggest that the observations at higher values of t are
less reliable than those for lower values. An overall reflection of this is the
decrease in the value of R2 to 0.970 at the end of the search, compared with
0.985 when m = 20, although both values are large enough to suggest good
agreement between model and data. More specifically, the observations for
the penultimate time seem jointly somewhat high. This is reflected in the
approximate 99% confidence intervals for the expected value of y, calculated
according to (5.40) which are also plotted in Figure 5.16. The addition of
the last seven observations causes the nearly parallel curves at the right-
hand end of the plot to be replaced by curves that open out like the bell
of a trumpet.
The change in the parameter estimates during the last stages of the
forward search is gradual and consistent as is caught, for example, by the
plot of Cook's distance in Figure 5.14. This gradual change is not revealed
by the residual plots of Figure 5.17. The left panel shows the maximum
studentized residual amongst the observations included in the subset and
the right panel the minimum deletion residual among the observations not
in the subset. Both plots are uneventful, indicating the absence of outliers at
any stage in the search. This finding agrees with the nature of the forward
plot of residuals in Figure 5.13 where there is no evidence of any single
large residual. We suggest, in the Exercises, some further exploration of
the data to see if more structure can be found or explained.
An interesting methodological feature of this analysis is the strong rela-
tionship between the diagnostic plots, especially, but not only, the forward
plot of residuals in Figure 5.13 and the two plots of the fitted model for
m = 20 and m = n in Figure 5.16. It was possible to explain the features
in the forward diagnostic plots by reference to plots of the observations,
once they had been divided into two groups by the forward search. Such
simplicity of interpretation is harder to find when there is more than one
explanatory variable, so that simple plots of fitted models are no longer
available in two dimensions.

Figure 5.18. Lakes data: scatterplot matrix

5.6 Nitrogen in Lakes


In this example there seem to be two clear outliers. Interest in the example
is on the effect of the introduction of these outliers on inferential quantities,
such as t tests.
The data are given in Table A.14. There are 29 observations on the amount of nitrogen in US lakes. The variables are:

x1: average influent nitrogen concentration
x2: water retention time
y: mean annual nitrogen concentration.

The scatterplot matrix in Figure 5.18 shows that there may be a linear relationship between y and x1 with two outlying observations 10 and 23. The plot of y against x2 reveals in addition that observations 2 and 22 are likely to be important.
The data were analyzed by Stromberg (1993) using the model

  y_i = x_{1i} / (1 + β1 x_{2i}^{β2}) + ε_i.        (5.46)
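As an illustration, model (5.46) could be fitted by nonlinear least squares as in the following sketch; the starting values are arbitrary assumptions and not those used in the analyses reported here:

# Sketch: fit the lakes model (5.46), y = x1 / (1 + b1 * x2**b2) + error,
# by nonlinear least squares. Starting values and array names are
# illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def lakes_model(X, b1, b2):
    x1, x2 = X
    return x1 / (1.0 + b1 * x2 ** b2)

# x1, x2, y assumed to be numpy arrays holding the 29 observations:
# popt, pcov = curve_fit(lakes_model, (x1, x2), y, p0=[1.0, 0.5])
# approximate standard errors from the linearization:
# se = np.sqrt(np.diag(pcov))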
The forward plot of residuals, Figure 5.19, clearly shows these two outliers.
In addition it shows that inclusion of observation 10 causes an apprecia-
ble reduction in the residual for observation 23. The two observations are

Figure 5.19. Lakes data: forward plot of scaled residuals

presumably outlying in a similar way. Stromberg (1993) finds the same two outliers from a plot of least squares residuals against x_{1i}. The forward plot also shows that the residual for observation 2 increases when observations 10 and 23 are included. Although it is not straightforward to interpret scatterplot matrices when nonlinear models are being fitted, the theoretical relationship between y and x1 in (5.46) is linear. The scatterplot matrix in Figure 5.18 suggests that the model will find it difficult to accommodate observation 2 as well as 10 and 23.
The forward plot of leverages in Figure 5.20 shows that when observation 10 enters, it is a leverage point. However inclusion of the extreme observation 23 (h_23 = 0.81) makes observation 10 much less extreme in X space. Reference to the data shows that x_{1,23} = 34.319 and x_{1,10} = 18.053, with the range of the other values being 0.890 to 5.548. Clearly these two are very atypical values for x1. Figure 5.20 also shows that the fourth observation that seems important from the scatterplot, 22, has high leverage for most of the search. In the search shown here this observation was included in the initial basic subset. Other searches, in which it was not in the initial subset, resulted in its early inclusion during the forward search. Such searches gave results indistinguishable from those described here.
An important difference between linear and nonlinear least squares is
exemplified in Figure 5.21. The left panel shows the parameter estimates,
which change appreciably when the last two observations are included.
Indeed β̂1 increases more than fourfold. However the t statistics in the
right panel show only a relatively small change. We now consider conditions
under which such changes are possible in nonlinear models, but impossible
for linear models.

Figure 5.20. Lakes data: forward plot of leverages

Figure 5.21. Lakes data: forward plots of (left) parameter estimates and (right) t statistics

Figure 5.22. Lakes data: forward plots of Cook's distance, R2, maximum studentized residual in the subset and minimum deletion residual among observations not in the subset

The values of the t statistics depend in general on three components: the parameter estimates, the square roots of the diagonal elements of (F^T F)^{-1} and the estimate s of the error standard deviation. In a linear model a nearly constant statistic with a great change in parameter value would only be possible if the value of s² increased by a large amount. But here the increase is from 0.715 when m = 27 to 1.61 at the end of the search, not a dramatic increase. The explanation lies in the behaviour of (F^T F)^{-1}.
For a linear model, addition of an observation causes the diagonal elements of X^T X to increase, or at least not to decrease, so that the variance of the estimated parameter decreases, apart from any effect of the estimate s². But here

  (F^T F)_{m=26} = (  24.40   −14.13 )
                   ( −14.13    35.34 )

and

  (F^T F)_{m=28} = (   0.79    −0.68 )
                   (  −0.68    16.16 ).

The addition of observations has caused all elements of the matrix to de-
crease. This behaviour is explained by the effect of the parameter estimates
on F, the elements of which may change appreciably as the parameter
estimates change. So here the two outlying observations cause an appre-
ciable change in the parameter estimates, but less marked changes in the t
statistics.
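A sketch of how the approximate t statistics are assembled from these three components under the linearization at the estimate (names are illustrative):

# Sketch: approximate standard errors and t statistics in nonlinear least
# squares, built from the parameter estimates, s^2 and the diagonal of
# (F'F)^{-1} evaluated at the estimate. Names are illustrative.
import numpy as np

def nls_t_statistics(beta_hat, F_hat, residuals):
    n, p = F_hat.shape
    s2 = (residuals ** 2).sum() / (n - p)
    cov = s2 * np.linalg.inv(F_hat.T @ F_hat)   # linear-theory covariance matrix
    se = np.sqrt(np.diag(cov))
    return beta_hat / se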

Figure 5.23. Lakes data: fitted response surface when m = n − 2. Observations 10 and 23 are clearly outlying


Figure 5.24. Lakes data: fitted response surface when m = n. The surface is more curved than it is in Figure 5.23

Other effects of these two observations are shown in Figure 5.22. Both
have very large Cook's distances and their inclusion causes R2 to decrease to
0.696. The maximum studentized residual among observations in the subset
shows a pattern typical of a pair of outliers: there is one large value followed
by one that is slightly smaller, due to the masking effect of observation 10 on
observation 23. This effect is seen even more dramatically in the last panel
of Figure 5.22, where the minimum deletion residual among observations
not in the subset is more than four for observation 10 before it enters. The
value for observation 23 at the end of the search is nearer three.
It is clear that these outliers have a large effect on some aspects of the
model. This can also be seen by plotting the fitted surface of the model as
a function of x1 and x2. Figure 5.23 shows the surface at step n − 2 and
Figure 5.24 shows the same surface, but when all n observations are used
in fitting. The two outliers are clearly visible in the first figure as being
remote from the surface. However in Figure 5.24 observation 10 appears
close to the fitted surface, as would observation 23 if the fitted surface were
extended to the remote region of this observation.
Such three-dimensional plots are hard to interpret in two dimensions.
The structure is seen much more clearly if the plots are either rotated
or jittered, so that the eye fabricates the illusion of a three-dimensional
object. What is clear from these figures is that addition of the last two
observations has resulted in an appreciable increase in the curvature of the
fitted surface. This is reflected in the measures of curvature: the parameter
effects curvature increases from 0.24 to 0.42 and, more importantly for the
difference in the figures, the intrinsic curvature increases from 0.14 when
m = 27 to 0.35 when m = 29.
Figure 5.21 shows another effect of the two outliers. When m = 27 the t statistic for β2 is 1.85, increasing to 4.32 when the two outliers are included. In a linear regression model the implication would be that x2 should be dropped from the model, but here the interpretation is less clear. Since β2 is the power to which x2 is raised, the implication is that we could consider a small value of β2. In the Box-Cox transformation of Chapter 4 the value λ = 0 led to the log transformation. Accordingly we here try log x2. This value leads to so large an increase in the residual sum of squares for the 27 observations that log x2 has to be rejected as a variable. In fact, this transformation led to difficulties in obtaining convergence for the least squares estimation of the parameters from the 27 observations, perhaps because there are some values of x2 close to zero which become extreme when logged.
There remain the two outliers themselves. They correspond to units for which the two values of x1 are about 10 times the values of this variable for the other units. It is therefore possible that they have been caused by decimal points being written in the wrong place. We suggest in the Exercises that the analysis be repeated with these two x values replaced by one-tenth of their values. Although such manipulation of the data is one statistical

Figure 5.25. Pentane data: forward plot of scaled residuals

method of perhaps producing outlier free data, the resulting data may not
represent the physical situation under study. It may be, for example, that
the lakes really are highly polluted. One way to resolve such questions is by
inspection of the records from which the data were transcribed. Another is
comparison with other recorded measurements on the same lakes.

5.7 Isomerization of n-Pentane


As a last example of a nonlinear model we show how the forward search
immediately exhibits a failure in the model. The data, from Carr (1960) ,
given in Table A.15, are from an experiment on the catalytic isomerization
of n-pentane to iso-pentane in the presence of hydrogen. There are 24
observations and the variables are:

x1 partial pressure of hydrogen
x2 partial pressure of n-pentane
x3 partial pressure of iso-pentane
y rate of disappearance of n-pentane.

The rate model studied in the statistical literature is

  y_i = β1 β3 (x_{2i} − x_{3i}/1.632) / (1 + β2 x_{1i} + β3 x_{2i} + β4 x_{3i}) + ε_i.        (5.47)
Figure 5.25 is the forward plot of the scaled residuals when this model is
fitted to the data. It is steady until near the end of the search. For most

Figure 5.26. Pentane data: forward plots of three parameter estimates and of t statistics

of the search the distribution of residuals appears a little skewed, although


there is no evidence of skewness at the end of the search. This pattern
suggests that the data should perhaps be transformed. However we do not
here pursue this idea, which involves the techniques of the previous chapter.
Instead we look at a plot highly informative about the model.
Figure 5.26(right) shows the forward plot of the t statistics for the four
parameters. The revealing and unexpected feature of the figure is that three
of the four parameters have indistinguishable t values for the last half of the
search. The values are not very different anywhere in the search. A partial
explanation is to be found in the linear theory approximate correlation
matrix of the parameter estimates:

   1
  −0.805    1
  −0.840    0.998    1
  −0.790    0.998    0.995    1.

The estimates of β2, β3 and β4 are clearly extremely highly correlated, as Figure 5.26(left) shows, and so will have the very similar t values seen in the figure.
For a linear model with these correlations it would be customary to drop
one or two of the variables as we did in the analysis of the ozone data in
Chapter 3. But, with nonlinear models such as this, it is not always as
obvious, just as it was not obvious in the previous section, how the model
should be simplified. The only variable that can be dropped on its own
is x1. But, since the model comes from the mechanism of the reaction,
simplified nonlinear models should come from consideration of simplified
reaction mechanisms.
These data have been subject to several analyses in the statistical liter-
ature, both in the journal Technometrics and in the books of Bates and

Watts (1988) and of Seber and Wild (1989). Carr's original analysis rear-
ranged the model to be linear. After rearrangement the response is y_i^{-1}, for
which the variance is not constant. Box and Hill (1974) allow for this inho-
mogeneity by using weighted linear least squares for parameter estimation,
with the weights chosen to allow for the inhomogeneity of variance.
Pritchard et al. (1977) comment that heteroscedasticity may be intro-
duced inadvertently by the data analyst who transforms a nonlinear model
to a linear form, which is what has happened with Carr's analysis. They
perform two analyses using nonlinear least squares on the original data,
one using weighting to allow for heteroscedasticity and the other being
unweighted. They find no evidence of variance inhomogeneity when the
data are analyzed without weights. They also discuss rearrangement of the
model. If the errors in (5.47) are ignored the model may be written
1 (32 1 (34
(X2i - X3i/1.632)/Yi = (31(33 + (31(33 Xli + (31 X2i + (31(33 X3i· (5.48)

It is straightforward to reparameterise this as a linear model with response proportional to 1/y_i, which is the form used by Carr. An alternative is to divide both sides by (x_{2i} − x_{3i}/1.632), giving a second linear model. Pritchard et al. (1977) comment that, although this form is supposedly used by Box and Hill (1974), their numerical results are appropriate for (5.48).
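A sketch of the reparameterization of (5.48) as a linear model, recovering the original parameters from the linear coefficients; the array names are illustrative and, as the discussion above makes clear, the error structure implied by this rearrangement is not that of additive errors in (5.47):

# Sketch: fit the rearranged model (5.48) by ordinary least squares and
# recover the original parameters from the linear coefficients
# c0 = 1/(b1*b3), c1 = b2/(b1*b3), c2 = 1/b1, c3 = b4/(b1*b3).
# Array names are illustrative.
import numpy as np

def fit_rearranged(x1, x2, x3, y):
    response = (x2 - x3 / 1.632) / y
    X = np.column_stack([np.ones_like(x1), x1, x2, x3])
    c, *_ = np.linalg.lstsq(X, response, rcond=None)
    c0, c1, c2, c3 = c
    b1 = 1.0 / c2
    b3 = 1.0 / (b1 * c0)
    b2 = c1 / c0
    b4 = c3 / c0
    return b1, b2, b3, b4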
Analysis of models rearranged in linear form and transformations of the
resulting response for these data are discussed by Seber and Wild (1989 ,
pp. 77- 79). Bates and Watts (1988, pp. 55- 58) give the results of fitting
the nonlinear model (5.47) together with plots of the data and of fitted
values and residuals. The matrix of plots of joint confidence regions for
pairs of parameters, based on the linear approximation (5.10), exhibit the effects of the high correlations between parameter estimates. The three plots for the pairs (β2, β3), (β2, β4) and (β3, β4) are virtually identical, the high correlation resulting in long, thin, diagonal ellipses. Likewise the three plots for β1 against the other estimates are also virtually indistinguishable
from each other. An important feature of these approximate confidence
regions is that they include negative values of the parameters which , to
be physically meaningful, must be positive. The ellipses are therefore poor
approximations to the contours of the sums of squares surfaces (5.7) which
yield likelihood regions for the parameters. That the linear approximation
may be poor is indicated by the value of 140.1 for the parameter effects
curvature at the end of the search. The results of §5.1.2 indicate that this
value should be compared with the percentage points of 1/√F_{4,20}. Any
value greater than one will therefore be significant and 140.1 very strongly
so.
A final point about these data concerns experimental design. The param-
eter estimates are highly correlated because the design is poor. In addition
to inferential problems this poor design leads to numerical difficulties when

estimating the parameters. This is the only example of those in this chapter
in which we had trouble with convergence of the routines both for nonlinear
least squares and for calculation of the curvatures. Pritchard and Bacon
(1977) show how a design giving more precise estimates of the parame-
ters for the same number of points, 24, as in Carr's data, can be found by
sequential construction of a D-optimum design for the linearized model.
Such designs minimize the volume of the asymptotic confidence region for
the parameters (5.15). Pritchard et al. (1977) comment that Carr's original design was a central composite design in the space of the process variables X. Such a design would be good, although not optimum, for predicting the behaviour of the response using a low-order polynomial model in the process variables. However good design for the parameters of a nonlinear model requires designs that are good in the space of the partial derivatives F. D-optimum designs for nonlinear models arising in chemical kinetics are described in Atkinson and Donev (1992, Chapter 18).

5.8 Related Literature


Other than the books and papers already cited, there is surprisingly lit-
tle literature on robustness and nonlinear least squares. Stromberg and
Ruppert (1992) were the first to apply least median of squares in the non-
linear context. Computational issues are discussed in Stromberg (1993). St
Laurent and Cook (1993) explored how the nonlinearity of the model may
affect the leverage and analyzed the relationships between leverage and
local influence.

5.9 Exercises
Exercise 5.1 Show for the linear model η_i = x_i^T β that the partial derivatives f_i given by (5.9) are equal to x_i. What is the implication for least squares estimation (§5.1)?
Exercise 5.2 Figure 5.1 shows a plot of the expectation plane. Add a data point to this plot and sketch the position of the estimate β̂ and of a confidence region for β. Show that, in general, the confidence region will be elliptical in the parameter space (§5.1).
Exercise 5.3 Compute the angle between the two vectors (columns) of the
matrix X defined in equation (5.19). What are the implications of the non-
orthogonality of the two vectors for the parameter lines on the expectation
plane (§5.1)?
Exercise 5.4 Compute the Jacobian of the transformation from the pa-
rameter plane to the expectation surface when the matrix X is defined as in
equation (5.19). Give the general expression of the Jacobian for the multiple
linear regression model E(Y) = Xβ (§5.1).
Exercise 5.5 A certain chemical reaction can be described by the nonlinear
model:

(5.49)

where y is the fraction of original material remaining, x1 is the reaction time in minutes, and x2 is the temperature in degrees Kelvin. The data
are in Table 5.1 and were taken from Hunter and Atkinson (1965). They
originally appeared in Srinivasan and Levi (1963) and can also be found in
Draper and Smith (1998). See if you can find any outliers. (You may use the preliminary estimate β_p = (0.01, 5000) if you wish (§5.4)).
Exercise 5.6 Try to guess the appearance for the calcium and lakes data
of the curve that monitors the average of the absolute values of the deletion
residuals among the units which do not belong to the subset (§5.5 and §5.6).

Exercise 5.7 Table 5.2 contains data on biochemical oxygen demand


(more details can be found in Bates and Watts 1988, p. 270). The model considered appropriate for these data is:

  y_i = β1{1 − exp(−β2 t_i)} + ε_i,

where y is the biochemical oxygen demand and t is time in days. Find the least squares estimates, s², the estimated response function and the 95% confidence band for the response (§5.5).

Table 5.1. Bicyclo hexane data for Exercise 5.5

x1    x2    y        x1    x2    y        x1    x2    y
120 600 0.900 60 620 0.795 45 631 0.688
60 600 0.949 60 620 0.800 40 631 0.717
60 612 0.886 60 620 0.790 30 631 0.802
120 612 0.785 30 620 0.883 45 631 0.695
120 612 0.791 90 620 0.712 15 639 0.808
60 612 0.890 150 620 0.576 30 639 0.655
60 620 0.787 60 620 0.802 90 639 0.309
30 620 0.877 60 620 0.802 25 639 0.689
15 620 0.938 60 620 0.804 60 639 0.437
60 620 0.782 60 620 0.794 60 639 0.425
45 620 0.827 60 620 0.804 30 639 0.638
90 620 0.696 60 620 0.799 30 639 0.659
150 620 0.582 30 631 0.764

Exercise 5.8 In (5.34) a definition was given of the squared multiple correlation coefficient for a nonlinear model. For the lakes data of §5.6 this has the value 0.696, with the residual sum of squares being 43.392. Calculate the customary value of R2 for a linear model (2.20) for these data. Explain your answer (§5.7).

Exercise 5.9 Find the maximum likelihood estimate of the Box- Cox trans-
formation of the response for the data on the isomerization of n-pentane
(§5.7).

There are no solutions for the remaining exercises.

Exercise 5.10 Repeat the sketch of Exercise 5.2 for the expectation surface
of a nonlinear model, for example, Figure 5.2. Include the tangent plane
approximation and sketch both the approximate confidence interval and the
interval given by a constant increase in the residual sum of squares.

Exercise 5.11 In our analysis of the data on enzyme kinetics in §5.4 it


was assumed that there was a different value of the parameter β1 for each inhibitor. Find the four estimates, together with the inhibitor levels. Does this pattern suggest a model in which β1 is a function of inhibitor level?
Analyze the data using this model.

Exercise 5.12 In §5.4 it was shown how the model for the kinetics data
could be rearranged to be linear. Find the distribution of errors in the
original model (5.42) for which the rearrangement is appropriate.

Table 5.2. BOD data for Exercise 5.7

Time Biochemical Oxygen Demand


x y

1 8.3
2 10.3
3 19.0
4 16.0
5 15.6
7 19.8

Use simulation to find the distribution of the estimates from linear least
squares in the rearranged model, if the errors in the original model are
additive and normal with constant variance.
Can you find other linearizations? What assumptions do they lead to
about the errors (Ruppert et al. 1989)?

Exercise 5.13 The data on calcium uptake analyzed in §5.5 were treated
as nonlinear regression data. However the data consist of three replicate
experiments at seven sets of experimental conditions. Seven estimates of
pure error are therefore available, unaffected by the lack of fit of the model,
although they will be affected by any outliers.
Is there any evidence that the error variance increases with increasing
time? If there were such evidence how would you analyze the data? How is
your answer affected by the existence of one negative observation?

Exercise 5.14 Repeat the analysis of the lakes data with the outlying values of x1 adjusted as suggested at the end of §5.6.

5.10 Solutions
Exercise 5.1
For a linear model ∂η_i/∂β_j = x_{ij}. Since the derivative does not depend on the parameter values, iterative methods of parameter estimation are not required.

Exercise 5.2
See Bates and Watts (1988, page 19).

Exercise 5.3
The cosine of the angle (α) between the two vectors can be computed as

Figure 5.27. Average absolute deletion residual for observations not in the subset for (left) calcium and (right) lakes data

follows:

  cos α = 4.04 / √(3 × 7.236) = 0.867.

The angle between the two vectors is therefore about 180 × 0.522/π = 30°. This means that the parameter lines on the expectation plane are not at right angles as they are on the tangent plane. As Figure 5.1 shows, unit squares on the parameter plane map to parallelograms on the expectation plane.

Exercise 5.4
The Jacobian of the transformation geometrically is equal to the area of the
parallelogram that corresponds to a unit square on the parameter plane.
From computational geometry the area is:

IIxIIIIIX211 sin aIIxlllllx21 1Jl - cos2 a


=

= VXfXIXfx2 - (XfX2)2 = VIXTXI.


This implies that the Jacobian from the parameter plane to the expectation
plane is constant and equal to JIXT XI. The Jacobian for equation (5.19)
is equal to 2.32.

Exercise 5.5
The data do not seem to contain any outliers.

Exercise 5.6
Figure 5.27 shows: (1) that the curve on the right is always higher since
outliers are present, (2) the upward jump in the right panel when the
outliers are included and (3) a partial masking effect in the last step of the
right panel.

Figure 5.28. BOD data: observations, fitted curve and 95% inference band

Exercise 5.7
The least squares estimates are β̂ = (19.143, 0.5311)^T with s² = 6.498 on four degrees of freedom. The estimated response function and the 95% confidence band are plotted in Figure 5.28. It is interesting to notice that the band has zero width when t = 0, widens up to t ≈ 3, narrows around
t = 4 and then widens again. Compare this plot with Figure 5.16 and with
a plot for quadratic regression through the origin.

Exercise 5.8
If a constant is included R2 = 0.020. If the constant is not included R2 =
-1.657. If the linear regression model does not contain a constant R2 is no
longer forced to lie within the interval [0,1].

Exercise 5.9
λ̂ = 0.72.
6
Generalized Linear Models

In all examples in previous chapters it was assumed that the errors of


observation were either normally distributed or, in Chapter 4, could be
made approximately so by transformation. This chapter extends the class
of models for the forward search to include generalized linear models. We
give examples in which the errors of observation have the gamma distri-
bution. For this continuous distribution the results are similar to those for
the normal distribution. We also give examples of discrete data from the
Poisson distribution and from the binomial. Interest again is in the rela-
tionship between the distribution of the response and the values of one
or more explanatory variables. The distribution which is most unlike the
normal is that for binary data, that is, binomial observations with one trial
at each combination of factors.
The structure of the chapter is as follows. We first introduce two exam-
ples of discrete data and discuss the properties of appropriate models. The
following five sections introduce the exponential family of distributions and
the related generalized linear models. In §6.4 maximum likelihood estima-
tion of the parameters in the models is shown to be a form of iterative
weighted least squares, a result found useful in the two following sections
on inference and model checking in the generalized linear model.
The remainder of the chapter considers the three main classes of model
in some detail. The gamma model is explored in §6.7. The two succeeding
sections each contain the analysis of a set of data for which the gamma
distribution may be appropriate. A similar structure is followed for the
Poisson distribution, in which the theory of §6.10 is followed by two anal-
yses of data. Models for binomial data are discussed in §6.13, followed by

three examples of increasing complexity. The special case of binary data is


investigated in §§6.17 and 6.18. Particular problems arise because the data
can be fitted exactly for much of the forward search. The final two analyses
of data are in §§6.19 and 6.20. The chapter ends with some suggestions for
further reading.

6.1 Background
We give two examples of discrete data in which the distribution depends
on the levels of one or more factors.

6.1.1 British Train Accidents


Table A.18 gives data on train accidents in Britain in which there was at
least one death. The data are simplified from Evans (2000) who gives more
details. The variables are:

x1: Date of the accident - month and year
x2: Type of rolling stock:
    1. Mark 1. An obsolescent form of passenger train in which doors are closed by slamming
    2. More modern passenger trains with automatic doors
    3. Goods (freight) trains
x3: Annual train kilometres
y: Number of deaths in the accident.

The focus of interest is the relationship between the observed response y


and the vector of explanatory variables x. If y were normally distributed,
we would start by fitting a regression model with linear predictor

  η(x) = β0 + Σ_{j=1}^3 β_j x_j,        (6.1)

allowing for x2 as a factor with three levels, rather than a single explanatory variable. The normal theory regression model (2.2) was written in terms of a linear predictor as

  y = η(x) + ε.

In the generalized linear model for Poisson data the mean of the Poisson distribution again depends on the linear predictor, but the errors are no longer additive.

6.1.2 Bliss's Beetle Data


The data in Table A.20 result from subjecting eight groups of around 60
beetles to eight different doses of insecticide. The number of beetles killed
was recorded. (The data were originally given by Bliss 1935 and are re-
ported in many text books, e.g. Flury 1997, p. 526). The resulting data are
binomial with variables:

x_i: logarithm of the dose of insecticide
n_i: number of insects exposed to dose x_i
R_i: number of insects dying at dose x_i.

At dose level Xi the model is that the observations are binomially dis-
e
tributed, with parameter i . Interest is in whether there is a relationship
e
between the probability of success i and the dose level - the data clearly
show some relationship - and if so what is the form of that relationship.
Here the definitions of "success" and "failure" are a matter of point of
view: success for the beetle in surviving is a failure for the experimenter
who administered the insecticide. A starting point is a model with linear
predictor
η(xi) = β0 + β1 xi.    (6.2)

By analogy with linear regression we could consider the model

θi = η(xi) = β0 + β1 xi,    (6.3)

but the θi are probabilities so that it is necessary that 0 ≤ θi ≤ 1. Instead, for binomial data, we use models of the form

θi = ψ{η(xi)},    (6.4)

with the inverse link function ψ such that 0 ≤ θ ≤ 1. One family of functions with this property is the cumulative distribution functions of univariate distributions. The plot of the proportions of successes Ri/ni in Figure 6.1 does indeed have such a form. If, for example, ψ is the cumulative normal distribution, probit analysis of binomial data results. Instead of the inverse link function ψ, the theory of generalized linear models is developed in terms of the equivalent link function g = ψ⁻¹.

6.1.3 The Link Function


The link function g(μ) = η relates the mean μ to the linear predictor η.
We first discuss suitable links for binomial and Poisson data and then list
some useful combinations of link function and error distribution.
For binomial data let the proportion of successes be Yi = Ri/ni. Then if we let E(Y) = θ = μ and η = βᵀx, the linear predictor η and the mean μ are related by the link function

g(μ) = η.    (6.5)

Figure 6.1. Bliss's beetle data: proportion of deaths increasing with log dose
For binomial data we require a link such that the mean lies between zero
and one. A widely used link for binomial data that satisfies this property
is the logistic link

log{μ/(1 − μ)} = log{θ/(1 − θ)} = η = βᵀx.    (6.6)

Rearrangement gives the expression for the probabilities θ as

θ = exp(η)/{1 + exp(η)} = exp(βᵀx)/{1 + exp(βᵀx)}.    (6.7)

Comparison with (6.4) shows that the inverse link ψ for the logistic link is the cumulative distribution function of the logistic distribution. Whatever the values of β and x, the probabilities θ lie between zero and one.
For Poisson observations we require a link such that the mean μ cannot be less than zero. An often used link is the log link

log(μ) = η = βᵀx,    (6.8)

so that μ = exp(βᵀx) cannot be negative. This model yields not only Poisson
regression models but also log linear models for the analysis of contingency
tables.
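As an informal illustration of (6.6)-(6.8), the sketch below (a Python fragment with invented linear predictor values, not material from the text) evaluates the logistic and log inverse links over a range of η and confirms that the implied means lie in (0, 1) and (0, ∞) respectively.

```python
import numpy as np

def inv_logit(eta):
    # Inverse of the logistic link (6.7): theta = exp(eta) / (1 + exp(eta)).
    return np.exp(eta) / (1.0 + np.exp(eta))

def inv_log(eta):
    # Inverse of the log link (6.8): mu = exp(eta), always positive.
    return np.exp(eta)

# Illustrative linear predictor values (not from the text).
eta = np.linspace(-5.0, 5.0, 11)

theta = inv_logit(eta)   # binomial means: strictly between 0 and 1
mu = inv_log(eta)        # Poisson means: strictly positive

print(np.all((theta > 0) & (theta < 1)))  # True
print(np.all(mu > 0))                     # True
```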
These models for binomial and Poisson data are both special cases of
generalized linear models, a family that generalizes the normal theory lin-
ear model of the earlier chapters on linear regression. The generalization
extends both the random and systematic parts of linear models:

Generalization
• Distribution. Members of the one-parameter exponential family: for
regression models the normal distribution;

• Systematic Part. Link function g(μ) = η relating the mean vector and linear predictor: for regression the identity link μ = η.

In addition to the normal, gamma, Poisson and binomial distributions,


the one-parameter exponential family includes the inverse Gaussian distri-
bution, not often used in the analysis of data, and some special cases of the
negative binomial distribution (McCullagh and Nelder 1989, p. 373). Fit-
ting the models is by maximum likelihood, the iterative method for which
is described in §6.4. This section concludes with more material on link
functions.
We give several links in Table 6.1. The first part of the table gives links for
continuous and Poisson data. These include the inverse or reciprocal link,
the log link and the identity. Two parametric generalizations of these links
are the power link, which includes the inverse and identity as special cases, and a link based on the Box and Cox parametric family of transformations
used in Chapter 4. This link incorporates the other three as special cases.
Both of these more general links include a parameter A which has to be
estimated. Usually analyses for a few values of A are compared, rather
than the transformation parameter being estimated. In this way we avoid
numerical maximization of the likelihood. The practice is similar to that
on which the fan plots of Chapter 4 were based. The important difference
from that chapter is that there the data y were being transformed, whereas
here it is the relationship between the mean of Y and the linear predictor
that is transformed. The goodness of fit of the link can be ascertained not
only by use of the parametric families of links in the table, but also by use
of a general goodness of link test introduced in §6.6.4.
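A minimal sketch of the two parametric link families just mentioned is given below, assuming the Box-Cox form of Table 6.1, g(μ) = (μ^λ − 1)/λ, with the log link as the limit λ = 0; the values of λ and μ are purely illustrative.

```python
import numpy as np

def box_cox_link(mu, lam):
    # Box-Cox link of Table 6.1: (mu^lam - 1)/lam, with log(mu) as the limit lam -> 0.
    mu = np.asarray(mu, dtype=float)
    if lam == 0.0:
        return np.log(mu)
    return (mu ** lam - 1.0) / lam

def box_cox_inverse(eta, lam):
    # Inverse link of Table 6.1: (lam*eta + 1)^(1/lam), or exp(eta) when lam = 0.
    eta = np.asarray(eta, dtype=float)
    if lam == 0.0:
        return np.exp(eta)
    return (lam * eta + 1.0) ** (1.0 / lam)

mu = np.array([0.5, 1.0, 2.0, 5.0])      # illustrative means
for lam in (-1.0, -0.5, 0.0, 0.5, 1.0):  # the five values compared in later analyses
    eta = box_cox_link(mu, lam)
    # the inverse link recovers the original means
    print(lam, np.allclose(box_cox_inverse(eta, lam), mu))
```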
The second part of the table lists links for binomial data. The logistic
link arises naturally from the form of the binomial likelihood. The probit
and complementary log log links have inverse link functions which are the cumulative distribution functions of the standard normal distribution Φ(η) and the extreme value distribution 1 − exp{−exp(η)}. The arcsine link (as
we show in §6.18) is especially useful in the analysis of binary data.
For both discrete and continuous data we find that a forward plot of the
score statistic for a goodness of link test is helpful in diagnosing modelling
and data errors.
Some of the combinations of error distribution and link function are
well known, such as normal errors and the identity link for regression.
Table 6.2 gives the best-known combinations. We use these and others in
our examples.

Table 6.1. The most usual link functions: for the probit link Φ is the cumulative distribution function of the standard normal distribution

Name                      Link g(μ) = η            Inverse Link μ = g⁻¹(η)
Identity                  μ                        η
Log                       log μ                    exp η
Inverse or Reciprocal     1/μ                      1/η
Power Family              μ^λ                      η^{1/λ}
Box-Cox                   (μ^λ − 1)/λ              (λη + 1)^{1/λ}
Logit or Logistic         log{μ/(1 − μ)}           exp η/(1 + exp η)
Probit                    Φ⁻¹(μ)                   Φ(η)
Complementary log log     log{−log(1 − μ)}         1 − exp(−exp η)
Arcsine (0 ≤ μ ≤ 1)       sin⁻¹(2μ − 1)            0.5(1 + sin η) if −π/2 ≤ η ≤ π/2; 1 if η ≥ π/2; 0 if η ≤ −π/2

Table 6.2. Names of the most widely used combinations of distribution and link function

Distribution     Link        Name
Normal           Identity    Regression and Analysis of Variance
Gamma            Inverse     Inverse Polynomials
Poisson          Log         Loglinear Models for Contingency Tables
Binomial         Logit       Logistic Regression
Binomial         Probit      Probit Analysis

6.2 The Exponential Family


The major consequence of the use of generalized linear models is that a
single formulation can be given of the maximum likelihood estimation of
parameters and of inference about models. Least squares estimation in re-
gression is replaced by iteratively reweighted least squares and the analysis
of variance by the analysis of "deviance," §6.5.1. In this section we in-
troduce the exponential family of distributions and explore the properties
of the likelihood. These are employed in the next section in deriving the
algorithm for maximum likelihood estimation.
We consider the density f(y; θ, φ) defined by

log f(y; θ, φ) = {y b(θ) + c(θ)}/φ + d(y, φ),   −∞ < y < +∞,  φ > 0.    (6.9)

If φ is known, this is the one-parameter exponential family. As we see later, φ, the dispersion parameter, is equal to the variance σ² for the normal distribution. Although in regression applications σ² is usually not known, knowledge of its value is irrelevant for least squares estimation of the parameters β in the linear predictor. The same is true in general for (6.9), an estimate of φ only being required for inferences such as tests of hypotheses.
We leave it to the exercises to show that the four distributions of Table 6.2
can indeed be written in this form.

6.3 Mean, Variance, and Likelihood


6.3.1 One Observation
The loglikelihood for a single observation is

l(θ, φ; y) = log f(y; θ, φ).    (6.10)

Under the standard regularity conditions that allow the interchange of the order of differentiation and integration, the expectation

E(∂l/∂θ) = 0.    (6.11)

These conditions are those under which the Cramér-Rao lower bound for the variance of maximum likelihood estimators holds. The most frequent violation occurs when the range of the observations depends upon θ, which is not the case for (6.9). Derivations of (6.11) and the result for second
derivatives (6.14) are to be found in many textbooks, for example, Casella
and Berger (1990, p. 309).

Application of (6.11) to the exponential family (6.9) yields

E[{Y b′(θ) + c′(θ)}/φ] = {μ b′(θ) + c′(θ)}/φ = 0,    (6.12)

where we use b′(θ) to denote ∂b(θ)/∂θ and b″(θ) for the second derivative. We thus obtain an expression for the expected value of Y as

E(Y) = μ = −c′(θ)/b′(θ).    (6.13)


To obtain an expression for the variance of Y requires the relationship

r
between second derivatives (Exercise 6.2)

E ( ;;;) +E (;~ = O. (6.14)

From (6.13) c′(θ) = −μ b′(θ), so that the derivative of (6.10) can be written

∂l/∂θ = b′(θ)(y − μ)/φ.    (6.15)
Then in (6.14)

E(∂l/∂θ)² = E{b′(θ)(Y − μ)/φ}² = {b′(θ)/φ}² var Y    (6.16)

and

E(∂²l/∂θ²) = {μ b″(θ) + c″(θ)}/φ,

so that

var Y = −{φ/b′(θ)}² {μ b″(θ) + c″(θ)}/φ.    (6.17)

This equation can be written in an informative way by substitution of μ from (6.13). If in addition (6.13) is differentiated to give a relationship between the derivatives of b(θ) and c(θ) it follows that (Exercise 6.3)

var Y = φ (∂μ/∂θ) {1/b′(θ)}.    (6.18)

6.3.2 The Variance Function


The relationship between mean and variance is an important character-
istic of a statistical model. For the regression models of earlier chapters
the variance was constant, independent of the mean. One of the indica-
tions in Chapter 4 of the need for a transformation was the existence of
a relationship between the value of the observations and their variance.
Another well-known relationship is that for the Poisson distribution, where

Table 6.3. Variance functions for four generalized linear models

Distribution          Variance Function V(μ)     Dispersion Parameter φ
Normal                1                          σ²
Gamma                 μ²                         1/α
Inverse Gaussian      μ³                         σ²
Poisson               μ                          1
Binomial              μ(1 − μ)/n                 1

the equality of the mean and the variance is the basis for a test of the Pois-
son assumption. The mean variance relationship for all generalized linear
models is obtained by rewriting (6.18). Let
{1/b′(θ)} ∂μ/∂θ = V(μ) = V,    (6.19)

the variance function, a function solely of μ. Then

var Y = φ V(μ).    (6.20)
Variance functions and dispersion parameters are given in Table 6.3 for
the generalized linear models of this chapter. For both the Poisson and
the binomial distributions the dispersion parameter has the value one. The
dispersion parameter for the gamma distribution is found by writing the
density as

f(y; α, μ) = {1/Γ(α)} (α/μ)^α y^{α−1} exp(−αy/μ),   y > 0,    (6.21)

where Γ(α) = ∫₀^∞ u^{α−1} e^{−u} du. In this form E(Y) = μ and var Y = μ²/α, in agreement with the results of Table 6.3. Derivation of the result for the
inverse Gaussian distribution is left to the exercises.
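The variance functions of Table 6.3 translate directly into code; the fragment below is an illustration added here (not part of the original text), encoding them as simple Python functions and using (6.20) to give var Y = φV(μ).

```python
import numpy as np

# Variance functions V(mu) of Table 6.3 (binomial written for a proportion
# based on n trials, so that var Y = mu(1 - mu)/n with phi = 1).
variance_function = {
    "normal": lambda mu: np.ones_like(np.asarray(mu, dtype=float)),
    "gamma": lambda mu: np.asarray(mu, dtype=float) ** 2,
    "inverse_gaussian": lambda mu: np.asarray(mu, dtype=float) ** 3,
    "poisson": lambda mu: np.asarray(mu, dtype=float),
    "binomial": lambda mu, n=1: np.asarray(mu, dtype=float) * (1 - np.asarray(mu, dtype=float)) / n,
}

def var_y(mu, family, phi=1.0, **kwargs):
    # Equation (6.20): var Y = phi * V(mu).
    return phi * variance_function[family](mu, **kwargs)

print(var_y(2.0, "poisson"))            # 2.0
print(var_y(2.0, "gamma", phi=0.5))     # 0.5 * 4 = 2.0
print(var_y(0.3, "binomial", n=10))     # 0.3 * 0.7 / 10 = 0.021
```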
At first glance, Table 6.3 seems to provide an extremely restricted fam-
ily of models for mean-variance relationships. A richer family is found by
specifying not the complete distribution, but just the relationship between
the mean and the variance, an extension of the second-order assumptions
for regression models of (2.3). The resulting quasilikelihood models are de-
scribed by Firth (1991) in a chapter that provides a more wide-ranging
introduction to generalized linear models than that given here.
A second departure from the variances listed in Table 6.3 is overdispersion for Poisson and binomial data, also described by Firth among others, in which the form of V(μ) seems correct, but the estimated dispersion parameter is appreciably greater than one. This phenomenon can be gen-
erated by an extra source of variation in the data beyond that included in
the generalized linear model, perhaps arising from a compound distribu-
tion. For example, in Bliss's beetle data, the number of insects Ri dying

out of batches of size ni is modelled as having a binomial distribution with a parameter θi, described by a generalized linear model with linear predictor βᵀxi, where the parameters are constant over all groups. But if each
group came from a different batch of eggs, the parameters might vary in
an unmodelled way over the batches by being sampled from some distri-
bution. The resulting observations would then show overdispersion. That
is, the variance would be greater than that expected if the common model
held (Exercise 6.4). To an extent this description includes the solution to
the problem, which is that the overdispersion is a result of an inadequate
model. In the data analyzed in §6.16 seeming overdispersion, or relation-
ships between the mean and variance other than those given in Table 6.3,
are removed by correct modelling and the detection of outliers. The family
of relationships provided by Table 6.3 is in fact richer than it might seem
and allows for the analysis of a wide variety of data.
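The compound mechanism just described is easy to mimic by simulation. In the sketch below (illustrative numbers only, not Bliss's data) the success probability of each batch is itself sampled from a beta distribution with the same mean before binomial counts are drawn; the variance of the observed proportions then clearly exceeds the binomial value μ(1 − μ)/n.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials = 60          # insects per batch (roughly the batch size in Bliss's experiment)
n_batches = 10000      # many simulated batches at a single dose
mu = 0.4               # common mean success probability

# Binomial sampling with a fixed probability: no overdispersion.
y_fixed = rng.binomial(n_trials, mu, size=n_batches) / n_trials

# Compound sampling: each batch has its own probability drawn from a
# beta distribution with mean mu, mimicking unmodelled batch-to-batch variation.
a, b = 8.0, 12.0                       # beta(8, 12) has mean 0.4
theta = rng.beta(a, b, size=n_batches)
y_mixed = rng.binomial(n_trials, theta, size=n_batches) / n_trials

binomial_var = mu * (1 - mu) / n_trials
print(binomial_var)          # theoretical binomial variance, about 0.0040
print(y_fixed.var())         # close to the theoretical value
print(y_mixed.var())         # appreciably larger: overdispersion
```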

6.3.3 Canonical Parameterization


The parameterization of the exponential family model (6.9) is not unique. We require the maximum likelihood estimate θ̂. But if the model is reparameterized with new parameters ψ = h(θ), the maximum likelihood estimate becomes ψ̂ = h(θ̂) and inferences based on the likelihood, for example, likelihood ratio tests of terms in the linear predictor, will not be changed. The parameterization can therefore be chosen for mathematical or numerical convenience. In the canonical parameterization θ = b(θ). The model is usually written, for example, by McCullagh and Nelder (1989), as
log f(y; θ, φ) = {yθ + c(θ)}/φ + d(y, φ).    (6.22)

In this parameterization the calculation of the mean and variance of Y (§6.3.1) is greatly simplified. Since b(θ) = θ, b′(θ) = 1 and b″(θ) = 0. Then

μ = −c′(θ)  and  var Y = −φ c″(θ).    (6.23)

6.3.4 The Likelihood


For generalized linear models, as for regression, the observations are independent. The loglikelihood of the n independent observations is then, from (6.9),

L(θ, φ; y) = Σ_{i=1}^{n} [{yi b(θi) + c(θi)}/φ + d(yi, φ)].    (6.24)

Since μi = −c′(θi)/b′(θi), θi is defined by the linear model xiᵀβ through the link function g(μi) = ηi. As with regression, the purpose is to find a linear predictor with few terms that fully describes the data. Again we use
linear predictor with few terms that fully describes the data. Again we use

the forward search to detect influential observations, groups of outliers and


inadequate models.

6.4 Maximum Likelihood Estimation


Maximum likelihood estimation for the parameters β in the linear predictor of a generalized linear model usually requires an iterative numerical procedure. The exception is normal theory regression. In this section the iterative algorithm is shown to be a form of weighted least squares. As well as being computationally convenient, the form means that diagnostic results from regression can be applied, with little modification, to the extended class of models. We first recall some regression results that are extended to weighted least squares. Newton's method for solving equations and Fisher's method of scoring are combined with these results to yield the iterative reweighted least squares algorithm.

6.4.1 Least Squares


The matrix form of the regression model is
E(Y) = Xβ.

In the derivation of the algorithm it is sometimes more convenient to work with elements of matrices, so that the model is written

E(Yi) = xiᵀβ = Σ_{j=1}^{p} xij βj.

The least squares estimator β̂ = (β̂1, ..., β̂s, ..., β̂p)ᵀ is defined by

XᵀX β̂ = Xᵀy.

The rth equation (r = 1, ..., p) can be written as

Σ_{s=1}^{p} Σ_{i=1}^{n} xir xis β̂s = Σ_{i=1}^{n} xir yi

or

Σ_s (Σ xr xs) β̂s = Σ xr y,    (6.25)

since

(XᵀX)rs = Σ_{i=1}^{n} xir xis = Σ xr xs
when the dependence on i is suppressed.



6.4.2 Weighted Least Squares


With independent errors of constant variance, var(Y) = σ²In, where In is the n × n identity matrix. If the variances are not the same let

Y = Xβ + ε*,   var(Y) = σ²W⁻¹,

with ε* = (ε1*, ..., εn*)ᵀ. Since the observations are independent, W is a diagonal matrix of weights W = diag(w1, ..., wn) which, in our case, are calculated by the fitting algorithm.
In weighted least squares a transformation is made to the variables W^{1/2}y, where W^{1/2} = diag(√w1, ..., √wn), which have constant variance σ², and least squares is applied to these variables. The model becomes

W^{1/2}y = W^{1/2}Xβ + ε,

where the εi = √wi εi* satisfy the second-order assumptions (2.3). The normal equations are now

XᵀWX β̂ = XᵀWy,

that is,

Σ_{s=1}^{p} Σ_{i=1}^{n} wi xir xis β̂s = Σ_{i=1}^{n} wi xir yi

or

Σ_s (Σ w xr xs) β̂s = Σ w xr y.    (6.26)

The solution to the equations is

β̂ = (XᵀWX)⁻¹XᵀWy,    (6.27)

with

var β̂ = σ²(XᵀWX)⁻¹.    (6.28)

6.4.3 Newton's Method for Solving Equations


In Chapter 5 we described several algorithms for the iterative estimation
of parameters by nonlinear least squares. These were based on a lineariza-
tion of the model, which is also the basis of Newton's method for solving
equations.
The maximum likelihood estimators β̂ maximize the likelihood or equivalently the loglikelihood L(β) and so are the solution of the p equations

∂L(β)/∂β |_{β=β̂} = 0.
If we let the derivatives

U(β) = ∂L(β)/∂β,   the score function,

and

∂²L(β)/∂β∂βᵀ = −J(β),   the observed information,    (6.29)

the maximum likelihood estimators satisfy

U(β̂) = 0.    (6.30)

Taylor series expansion of (6.30) around β yields

U(β̂) = 0 ≈ U(β) − J(β)(β̂ − β).    (6.31)

The iteration to find β̂ is therefore

0 ≈ U(βᵏ) − J(βᵏ)(βᵏ⁺¹ − βᵏ),   k = 0, 1, ...,

that is,

J(βᵏ)βᵏ⁺¹ = U(βᵏ) + J(βᵏ)βᵏ

or

βᵏ⁺¹ = J⁻¹(βᵏ)U(βᵏ) + βᵏ,    (6.32)

where βᵏ denotes the estimated value of β at the kth iteration.

6.4.4 Fisher Scoring


In Fisher scoring the observed information J(β) in (6.32) is replaced by the expected information I(β) where

I(β) = E{J(β)} = −E{∂²L(β)/∂β∂βᵀ},    (6.33)

so that the iteration (6.32) becomes

I(βᵏ)βᵏ⁺¹ = U(βᵏ) + I(βᵏ)βᵏ    (6.34)

or

βᵏ⁺¹ = I⁻¹(βᵏ)U(βᵏ) + βᵏ.    (6.35)

For the normal theory linear model

∂L(β)/∂β = Xᵀ(y − Xβ)/σ²   and   ∂²L(β)/∂β∂βᵀ = −XᵀX/σ²,

which is not a function of random variables. Therefore the observed and expected information are the same; that is, I(β) = J(β). As a consequence the algorithms (6.35) and (6.32) converge at the first iteration and, as we know, iterative methods are not needed for linear least squares.

6.4.5 The Algorithm


We now have all the parts that are needed to use Fisher scoring to maximize the loglikelihood L(β) given by (6.24) and to show that the resulting algorithm is a form of iterative weighted least squares. One of the main mathematical tools is the chain rule of differentiation. From (6.30) we need to find an expression for the components of the score

U(β) = (U1, ..., Up)ᵀ = (∂L/∂β1, ..., ∂L/∂βp)ᵀ,

which we write

∂L/∂βj = (∂L/∂θ)(dθ/dμ)(dμ/dη)(∂η/∂βj).    (6.36)
The definition of the variance function in (6.19) yields

dμ/dθ = V(μ) b′(θ) = V b′(θ).

Also

η = Σ_j βj xj,

so

∂η/∂βj = xj.

An expression for ∂l/∂θ for a single observation is given in (6.15). Then, using the chain rule (6.36),

∂l/∂βj = {b′(θ)(y − μ)/φ} {1/(V b′(θ))} (dμ/dη) xj.
To relate the algorithm to least squares let the "quadratic weight" w be defined by

w⁻¹ = (dη/dμ)² V.    (6.37)

Then, the score for the sample is

∂L/∂βj = Σ_{i=1}^{n} (wi/φ)(yi − μi)(dηi/dμi) xij.    (6.38)
If the subscript i is suppressed we write

∂L/∂βj = Σ (w/φ)(y − μ)(dη/dμ) xj.

The maximum likelihood equation for βj is thus

φUj = Σ w(y − μ)(dη/dμ) xj = 0,   j = 1, ..., p.    (6.39)
For the iterative solution of these equations we require the expected information Irs = {I(β)}rs. As (6.39) shows, the value of φ does not affect

the estimate of the parameters β. For notational simplicity we take φ = 1, when the expected information matrix is found from further differentiation of (6.39). But E(y − μ) = 0 and, since y is an observed value, ∂y/∂βs = 0. Also

∂μ/∂βs = (dμ/dη)(∂η/∂βs) = (dμ/dη) xs,

whence

Irs = Σ w xr xs,    (6.40)

a matrix of weighted sums of squares and products.


We nOw substitute this information matrix in the Fisher scoring iteration.
From (6.34) this yields, for the rth equation,

(6.41 )

with

(6.42)
s

(6.43)

the last equality following from the definition of the linear predictor, with
T)f the estimated linear predictor at iteration k. To find the p equations
for the parameter estimates we substitute for Ur in (6.41) from (6.39) and
obtain

(6.44)

Comparison with (6.26) shows that this is weighted least squares with weights wi and "working" response

z = η + (y − μ) dη/dμ.    (6.45)
The quadratic weights were defined in (6.37) as

w = V⁻¹ (dμ/dη)².
Both the weights and the working response depend on the parameter estimate βᵏ. The iteration is started, where possible, by putting μ̂ = y. For zero observations in Poisson or gamma models we put μ̂ = 0.1. Similar adjustments for binomial starting values are given, for example, on page 117 of McCullagh and Nelder (1989). Usually about five iterations are required to obtain satisfactory parameter estimates.
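A minimal sketch of the algorithm is given below for a Poisson model with the log link (so that dμ/dη = μ, V(μ) = μ and hence w = μ); the data are simulated purely for illustration and the code is our own, not a reproduction of any analysis in the text.

```python
import numpy as np

def irls_poisson_log(X, y, n_iter=10):
    """Iteratively reweighted least squares for a Poisson GLM with log link.

    At each step: mu = exp(eta), weights w = mu (since w = (dmu/deta)^2 / V(mu)),
    working response z = eta + (y - mu)/mu (equation (6.45)), and beta solves
    the weighted least squares equations (6.44).
    """
    n, p = X.shape
    mu = np.where(y > 0, y, 0.1)          # starting values: mu_hat = y, 0.1 for zeros
    eta = np.log(mu)
    beta = np.zeros(p)
    for _ in range(n_iter):
        w = mu                             # quadratic weights (6.37)
        z = eta + (y - mu) / mu            # working response (6.45)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        eta = X @ beta
        mu = np.exp(eta)
    return beta

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true))

print(irls_poisson_log(X, y))   # close to [1.0, 0.5] after a few iterations
```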

6.5 Inference
6.5.1 The Deviance
In regression differences in residual sums of squares are used to test the significance of one or more terms in the linear model. Suppose that the hypothesis is that a specified s of the elements of β are zero. The residual sum of squares under this hypothesis is accordingly S(β̂s0). If the hypothesis is true the difference in residual sums of squares S(β̂s0) − S(β̂) is distributed as σ²χ²_s. Usually σ² is estimated by s² and the scaled differences {S(β̂s0) − S(β̂)}/s² are displayed as an analysis of variance table.
The generalization considered in this section is to the analysis of deviance in
which likelihood ratio tests are expressed as differences in scaled deviances.
Let the maximized loglikelihood of the observations for the linear predictor be L(β̂) and the loglikelihood under the hypothesis that, again, s of the elements of β are zero be L(β̂s0). Then, asymptotically the loglikelihood ratio

2{L(β̂) − L(β̂s0)} ∼ χ²_s.    (6.46)

For normal theory regression the result reduces to the distribution of the scaled difference in residual sums of squares {S(β̂s0) − S(β̂)}/σ² and so is exact. For other distributions the approximation to the distribution of (6.46) improves as the number of observations increases. The distributional result (6.46) also holds for the more general hypothesis that s linear combinations of β are constrained to have specified values, the only difference being that β̂s0 is now the maximum likelihood estimate satisfying these constraints.

The deviance D(β̂), the extension of the residual sum of squares, is defined as φ times the loglikelihood ratio test for comparing the model with parameters β in the linear predictor to the saturated model for which the parameter estimates β̂max are such that the fitted means μ̂i equal the observations yi; that is,

D(β̂) = 2φ{L(β̂max) − L(β̂)}.    (6.47)

For a linear regression model the deviance D(β̂) reduces to the residual sum of squares S(β̂). To test the goodness of fit of the regression model when σ² is known, we can use the scaled sum of squares S(β̂)/σ². Likewise for testing hypotheses about generalized linear models we use the scaled deviance Dsc(β̂) defined as

Dsc(β̂) = D(β̂)/φ,    (6.48)

which, from (6.47), is a likelihood ratio test. If the linear model contains p parameters, the goodness of fit test based on the scaled deviance compares Dsc(β̂) with the χ² distribution on n − p degrees of freedom. In general the distributional result is again asymptotic, a word requiring careful interpretation: for gamma or Poisson data we mean that n → ∞, while for binomial data we require that each ni → ∞. For binary data each ni = 1 and the value of the deviance (Exercise 6.7) becomes completely uninformative about the goodness of fit of the model.
The scaled deviance is most useful not as an absolute measure of goodness of fit but for comparing nested models. In this case the reduction in scaled deviance Dsc(β̂s0) − Dsc(β̂) is asymptotically distributed as χ²_s. The χ² approximation is usually quite accurate for differences in scaled deviances even if it is inaccurate for the scaled deviances themselves. Likelihood ratio tests of parameters in the analysis of deviance depend upon differences in scaled deviances, rather than on differences in deviances. However, since these tests are commonly used for Poisson and binomial data where the scale parameter is one, the scaled and unscaled deviances are identical for these distributions. Perhaps as a result, the literature is not always clear as to whether the deviances being discussed are the scaled deviances Dsc(β̂) or the deviances D(β̂) which do not depend on φ. When we want to stress that we are using the value of a Poisson or binomial deviance to indicate the fit of a model, we sometimes refer to the residual deviance. We compare this deviance with that from the null model (that is, one in which the linear predictor only contains a constant). The difference between the null deviance and the residual deviance is called the explained deviance. These relationships are summarized in Table 6.4.
To find an expression for the deviance for the exponential family model with loglikelihood (6.24) let the vector parameter β̂ correspond to individual parameters θ̂i, with β̂max corresponding to parameters θ̂imax. Then

Table 6.4. Summary of models, deviances and likelihoods

Symbol                              Meaning
Null model                          Model which only contains a constant (one parameter)
Current model                       Model with p parameters
Saturated model                     Model for which μ̂i = yi, i = 1, ..., n (model with n parameters)
L(β̂)                                Loglikelihood of the current model
L(β̂null)                            Loglikelihood of the null model
L(β̂max)                             Loglikelihood of the saturated model
Likelihood ratio                    2{L(β̂) − L(β̂s0)}
Deviance (of the current model)     2φ{L(β̂max) − L(β̂)}
Deviance of the null model          2φ{L(β̂max) − L(β̂null)}
Deviance explained                  2φ{L(β̂) − L(β̂null)}
Scaled deviance                     Deviance/φ

from (6.47)

D(β̂) = 2 Σ_{i=1}^{n} {yi b(θ̂imax) + c(θ̂imax) − yi b(θ̂i) − c(θ̂i)},    (6.49)

which is a function neither of d(yi, φ) nor, more importantly, of φ. We leave it as an exercise to show (Exercise 6.5) that this reduces to the residual sum of squares for the regression model. The deviances for other distributions will be derived when we come to analyze data from each family.

6.5.2 Estimation of the Dispersion Parameter


In regression we estimated σ² by the residual mean square estimate s² = S(β̂)/(n − p) rather than using the maximum likelihood estimator with divisor n. The analogous estimator for generalized linear models is

φ̂ = D(β̂)/(n − p).

McCullagh and Nelder (1989, p. 296) comment that for the gamma distribution this estimate is extremely sensitive to rounding errors in very small observations since it includes a term in log yi. They recommend instead the moment estimator

φ̂ = Σ_{i=1}^{n} {(yi − μ̂i)/μ̂i}² / (n − p),    (6.50)
which can be related to the observed value of Pearson's chi-squared
goodness of fit statistic.
The chi-squared statistic was introduced by Pearson for Poisson distributed responses of contingency tables (see §6.10) as

X² = Σ_{i=1}^{n} (yi − μ̂i)²/μ̂i = Σ_{i=1}^{n} (yi − μ̂i)²/V(μ̂i),    (6.51)

in the notation of this chapter. Since, for the gamma distribution, V(μ) = μ², we can rewrite (6.50) as

φ̂ = X²/(n − p).    (6.52)

In the forward search we monitor both the deviance D(β̂) and the dispersion estimate φ̂.
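For a gamma fit the two estimators of φ discussed above are easily computed from the fitted means; a short sketch follows, with illustrative values of y and μ̂ that are not data from the book.

```python
import numpy as np

def gamma_deviance_dispersion(y, mu_hat, p):
    # phi_hat = D(beta_hat)/(n - p), using the gamma deviance (6.70).
    dev = 2.0 * np.sum(-np.log(y / mu_hat) + (y - mu_hat) / mu_hat)
    return dev / (len(y) - p)

def pearson_dispersion(y, mu_hat, p):
    # Moment estimator (6.50)/(6.52): phi_hat = X^2/(n - p) with V(mu) = mu^2.
    x2 = np.sum(((y - mu_hat) / mu_hat) ** 2)
    return x2 / (len(y) - p)

# Illustrative responses and fitted means from some gamma fit with p parameters.
y = np.array([0.8, 1.6, 2.3, 3.9, 5.2, 7.5])
mu_hat = np.array([1.0, 1.5, 2.5, 4.0, 5.0, 7.0])
p = 2

print(gamma_deviance_dispersion(y, mu_hat, p))
print(pearson_dispersion(y, mu_hat, p))
```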

6.5.3 Inference About Parameters


The analysis of deviance gives an analogue of the analysis of variance for
testing hypotheses about groups of parameters. For analogues of t tests of
individual parameters we find the variance of our parameter estimates via
the iteratively reweighted least squares fitting algorithm.

For linear regression the variance of the parameter estimates (2.18) was var β̂ = σ²(XᵀX)⁻¹ with the estimated standard error of the rth element of β̂ being

estimated s.e.(β̂r) = (s² vrr)^{1/2},

where vrr is the rth diagonal element of (XᵀX)⁻¹.
For generalized linear models, since estimation is by weighted least squares,

var β̂ = φ(XᵀWX)⁻¹,

estimated by φ̂(XᵀWX)⁻¹, so that

estimated s.e.(β̂r) = (φ̂ vrr)^{1/2},    (6.53)

where now vrr is the rth diagonal element of (XᵀWX)⁻¹. This formula applies for the gamma distribution. For the Poisson and binomial distributions we calculate t statistics and confidence intervals using the theoretical value of one for the dispersion parameter.

6.6 Checking Generalized Linear Models


6.6.1 The Hat Matrix
The quantities used to check generalized linear models are similar in outline
to those used in linear regression. In the forward search we again typically
monitor such quantities as residuals, leverages, Cook's distance and score
tests. Since we use an iterative method of parameter estimation, as we did in
Chapter 5 for nonlinear models, the standard deletion formulae of Chapter
2 no longer hold exactly. There is therefore a choice between "one-step"
methods based on deletion formulae and quantities estimated by iteration
after deletion.
Several quantities are generated by consideration of the weighted least
squares fitting algorithm. For example, the hat matrix now becomes

H = W^{1/2} X (XᵀWX)⁻¹ Xᵀ W^{1/2}.    (6.54)
Since W is a diagonal matrix of nonnegative weights, W 1 / 2 is found as the
elementwise square root of W.

6.6.2 Residuals
Three residuals can be defined by analogy with least squares, for which
they are all identical. We use two of them.

Pearson Residuals
The simple definition of least squares residuals is that in (2.10) as ei = yi − ŷi, the difference between what is observed and what is predicted. This definition for generalized linear models leads, when allowance is made for the dependence of the variance of Y on the mean, to the Pearson residual

rPi = (yi − μ̂i)/√V(μ̂i),    (6.55)

where, as in (6.20), var Y = φ V(μ). The name for the residual arises since

Σ_{i=1}^{n} r²Pi = X²

for the Poisson distribution, for which φ = 1. Here, as in (6.51), X² is the observed value of the generalized form of Pearson's chi-squared goodness of fit statistic with the appropriate variance function.
The Pearson residual can be studentized, as the least squares residual was in (2.14), and is

r′Pi = (yi − μ̂i) / {φ̂ V(μ̂i)(1 − hi)}^{1/2},    (6.56)
where hi is the diagonal element of matrix H defined in equation (6.54).

Deviance Residuals
In regression the residual sum of squares is the sum of squares of the least squares residuals, S(β̂) = Σ ei². The deviance, which generalizes the residual sum of squares, is likewise the sum of n quantities, so we can write formally

D(β̂) = Σ di²,    (6.57)

even though the deviance components di² are not the squares of simple quantities. They are however nonnegative, so the deviance residual can be defined as

rdi = sign(yi − μ̂i) di    (6.58)
and the studentized deviance residual as

r′di = rdi / {φ̂(1 − hi)}^{1/2} = sign(yi − μ̂i) di / {φ̂(1 − hi)}^{1/2}.    (6.59)
In general, the distribution of the deviance residuals is more nearly normal
than that of the Pearson residuals although, for discrete data with only a
few small values of Y, neither is close to the normal distribution. The most
extreme example, as we show in Figure 6.51, is for binary data, where the
residual can only have one of two values, depending on whether a zero or
a one was observed. This comment indicates that residual plots may be

less useful for generalized linear models for discrete data than they are for
normal data.
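As a concrete illustration, the sketch below computes Pearson and deviance residuals for a Poisson fit, for which V(μ) = μ, φ = 1 and the deviance component is di² = 2{yi log(yi/μ̂i) − (yi − μ̂i)}; the responses and fitted means are invented, not taken from the text.

```python
import numpy as np

def poisson_residuals(y, mu_hat):
    """Pearson and deviance residuals for a Poisson fit (phi = 1)."""
    # Pearson residual (6.55): (y - mu_hat)/sqrt(V(mu_hat)) with V(mu) = mu.
    r_pearson = (y - mu_hat) / np.sqrt(mu_hat)

    # Deviance component: d_i^2 = 2{y log(y/mu_hat) - (y - mu_hat)},
    # with y log(y/mu_hat) taken as 0 when y = 0.
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
    d2 = 2.0 * (ylogy - (y - mu_hat))
    r_deviance = np.sign(y - mu_hat) * np.sqrt(d2)   # equation (6.58)
    return r_pearson, r_deviance

# Illustrative counts and fitted means.
y = np.array([0.0, 1.0, 3.0, 5.0, 12.0])
mu_hat = np.array([0.5, 1.8, 2.6, 6.1, 9.4])

rp, rd = poisson_residuals(y, mu_hat)
print(rp)
print(rd)
print(np.sum(rp ** 2))   # Pearson's X^2
print(np.sum(rd ** 2))   # the residual deviance
```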

Deletion Residuals
A third residual can be defined by the effect of deletion. For the regression model the exact change in the residual sum of squares when the ith observation is deleted is given by (2.34) as ei²/(1 − hi). For generalized linear models Williams (1987) shows that a one-step approximation to the change in deviance on deletion yields a deletion residual that is a linear combination of one-step approximations to the effect of deletion on the Pearson and deviance residuals (McCullagh and Nelder 1989, p. 398).

6.6.3 Cook's Distance


Cook's distance was introduced in §2.3.3 as a function of the difference in residual sums of squares S(β̂(i)) − S(β̂) for all the data when the parameter estimate is changed by deletion of observation i. For generalized linear models the change in likelihood gives Cook's distance as the scaled likelihood difference
(6.60)
Exact calculation of (6.60) requires n + 1 maximizations of the likelihood. A more easily calculated approximation is found by using the approximate change in the parameter estimate on deletion which, from the weighted least squares fitting algorithm, is

β̂(i) − β̂ = −(XᵀWX)⁻¹ xi wi rPi /(1 − hi),
analogous to (2.33). The resulting approximate Cook's distance for
generalized linear models is

(6.61)

a function of the Pearson residual and other quantities all known from a
single fit of the model.
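A sketch of these single-fit quantities is given below: the leverages are the diagonal of the GLM hat matrix (6.54), and the approximate distance is evaluated in the commonly used one-step form based on the studentized Pearson residual and the leverage (an assumption of this sketch; it need not coincide exactly with the expression in (6.61)). All numerical inputs are illustrative.

```python
import numpy as np

def glm_hat_diag(X, w):
    # Diagonal of H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}, equation (6.54).
    WX = X * w[:, None]
    XtWX_inv = np.linalg.inv(X.T @ WX)
    return w * np.einsum('ij,jk,ik->i', X, XtWX_inv, X)

def approx_cook_distance(y, mu_hat, X, w, V_mu, phi=1.0):
    """One-step approximate Cook distance: (r'_P)^2 * h / {p (1 - h)}.

    This particular form is an assumption of the sketch; it uses only the
    studentized Pearson residual (6.56) and the leverage from (6.54).
    """
    p = X.shape[1]
    h = glm_hat_diag(X, w)
    r_stud = (y - mu_hat) / np.sqrt(phi * V_mu * (1.0 - h))
    return r_stud ** 2 * h / (p * (1.0 - h))

# Illustrative Poisson-type inputs: V(mu) = mu, weights w = mu for the log link.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
mu_hat = np.exp(X @ np.array([0.5, 0.3]))
y = rng.poisson(mu_hat).astype(float)

d = approx_cook_distance(y, mu_hat, X, w=mu_hat, V_mu=mu_hat)
print(np.round(d, 3))
```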

6.6.4 A Goodness of Link Test


The choice of a link, like the choice of a transformation of the response,
is guided both by the physical nature of the data and by statistical con-
siderations, such as plots and the results of tests. Here we describe an
approximate score test for the link function which, although quite general,
is particularly useful for the analysis of gamma and binomial data, where
the link is often not obvious in advance of the analysis.

Suppose that the link used in fitting the data is g(μ) when the true link is g*(μ) = η. Let h(η) = g{g*⁻¹(η)}. Then

g(μ) = g{g*⁻¹(η)} = h(η).

If the fitted link is correct, h(η) = η. Otherwise h(η) will be nonlinear in η. So we need to test whether g(μ) is a linear function of η. Taylor series expansion around zero yields

g(μ) = h(η) ≈ h(0) + h′(0)η + h″(0)(η²/2) + ...
            ≈ a + b xᵀβ + γη²,    (6.62)

where a, b and γ are scalars. Since β is to be estimated (6.62) becomes

g(μ) = xᵀβ + γη²,    (6.63)

provided that the fitted model contains a constant. The test of the goodness of the link then reduces to testing whether in (6.63) γ = 0.
The test statistic is calculated in two stages. In the first the model is fitted with link g(μ), yielding an estimated linear predictor η̂ with iterative weights Ŵ and estimated dispersion parameter φ̂. Following the prescription in (6.63) the linear predictor is extended to include the variable η̂². The model is then refitted to give a t test for γ. However the refitting is without iteration, so that the parameters of the linear predictor are reestimated with the weights Ŵ found previously. Likewise the dispersion estimate is not adjusted for inclusion of the extra explanatory variable. As we show in Figure 6.2 and in several other figures, monitoring the resulting t test is sometimes a powerful method of detecting inadequacies in the model.
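A sketch of the two-stage calculation is given below, written for a Poisson model with the log link so that the first-stage fit can reuse the IRLS step of §6.4.5; the data are simulated and the helper names are our own, not routines from the text.

```python
import numpy as np

def wls(X, z, w):
    # Weighted least squares: beta = (X'WX)^{-1} X'Wz, covariance factor (X'WX)^{-1}.
    WX = X * w[:, None]
    XtWX_inv = np.linalg.inv(X.T @ WX)
    beta = XtWX_inv @ (X.T @ (w * z))
    return beta, XtWX_inv

def goodness_of_link_t(X, y, n_iter=10):
    """Approximate goodness of link t statistic for a Poisson log-link fit.

    Stage 1: IRLS fit with the log link, giving eta_hat, the weights W and the
    working response z. Stage 2: add eta_hat^2 as an extra column and refit by
    a single weighted least squares step with the stage-1 weights; the t
    statistic for its coefficient gamma is returned (phi = 1 for Poisson).
    """
    # Stage 1: IRLS for the Poisson log link (w = mu, z = eta + (y - mu)/mu).
    mu = np.where(y > 0, y, 0.1)
    eta = np.log(mu)
    for _ in range(n_iter):
        w = mu
        z = eta + (y - mu) / mu
        beta, _ = wls(X, z, w)
        eta = X @ beta
        mu = np.exp(eta)

    # Stage 2: extend the linear predictor with eta_hat^2, no further iteration.
    w = mu
    z = eta + (y - mu) / mu
    X_ext = np.column_stack([X, eta ** 2])
    beta_ext, cov = wls(X_ext, z, w)
    gamma, se_gamma = beta_ext[-1], np.sqrt(cov[-1, -1])   # phi = 1
    return gamma / se_gamma

# Simulated example: data generated with a log link, so the statistic should be small.
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(0.8 + 0.4 * X[:, 1])).astype(float)
print(goodness_of_link_t(X, y))
```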

6.6.5 Monitoring the Forward Search


The forward search for generalized linear models is similar to that for regression, except that we replace squared least squares residuals with squared deviance residuals, that is, deviance components di² (6.57). Then, given a subset of dimension m ≥ p, say S*(m), the forward search moves to dimension m + 1 by selecting the m + 1 units with the smallest values of the di², units being chosen by ordering all deviance components d²i,S*(m). The search starts by randomly selecting subsets of size p and chooses the one for which the median deviance component is smallest.
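In outline, the search can be coded as below. The sketch is written for a Poisson log-link model so that it is self-contained; the fitting and deviance routines, the number of initial random subsets and the simulated data are all our own illustrative choices, not the implementation used for the analyses in this book.

```python
import numpy as np

def poisson_irls(X, y, n_iter=10):
    # IRLS fit of a Poisson log-link model on the given units; returns beta.
    mu = np.where(y > 0, y, 0.1)
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = mu
        z = eta + (y - mu) / mu
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        eta = np.clip(X @ beta, -20.0, 20.0)   # guard against overflow on tiny subsets
        mu = np.exp(eta)
    return beta

def deviance_components(y, mu):
    # Squared deviance residuals d_i^2 for the Poisson distribution.
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * (ylogy - (y - mu))

def forward_search(X, y, n_start=200, seed=0):
    """Skeleton of the forward search, ordering units by their d_i^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Initial subset: among random subsets of size p, keep the one whose
    # median deviance component over all n units is smallest.
    best_subset, best_med = None, np.inf
    for _ in range(n_start):
        s = rng.choice(n, size=p, replace=False)
        try:
            beta = poisson_irls(X[s], y[s])
        except np.linalg.LinAlgError:
            continue
        med = np.median(deviance_components(y, np.exp(np.clip(X @ beta, -20, 20))))
        if med < best_med:
            best_med, best_subset = med, s

    # Forward steps: fit on the current subset, then move to the m + 1 units
    # with the smallest deviance components computed from that fit.
    subset, history = best_subset, []
    while len(subset) < n:
        beta = poisson_irls(X[subset], y[subset])
        history.append((len(subset), beta))
        d2 = deviance_components(y, np.exp(np.clip(X @ beta, -20, 20)))
        subset = np.argsort(d2)[: len(subset) + 1]
    history.append((n, poisson_irls(X, y)))
    return history

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(0.5 + 0.6 * X[:, 1])).astype(float)
print(forward_search(X, y)[-1])   # parameter estimates from the full data
```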
As it was for linear and nonlinear regression models, so also for generalized linear models is it informative during the search to look at the evolution of leverages h_{i,S*(m)}, parameter estimates β̂*m, Cook's distances Dm, and t statistics. For the forward search the leverage (6.54) for unit i, with i ∈ S*(m), is the ith diagonal element of the matrix

W^{1/2}_{S*(m)} X_{S*(m)} (Xᵀ_{S*(m)} W_{S*(m)} X_{S*(m)})⁻¹ Xᵀ_{S*(m)} W^{1/2}_{S*(m)}.    (6.64)

Cook's distance (6.60) is

(β̂*m−1 − β̂*m)ᵀ (Xᵀ_{S*(m)} W_{S*(m)} X_{S*(m)}) (β̂*m−1 − β̂*m) / (p φ̂_{S*(m)})   (m = p + 1, ..., n),    (6.65)

where X_{S*(m)} is the m × p matrix that contains the m rows of the matrix X which correspond to the units forming the subset, and W_{S*(m)} is the diagonal matrix of the weights in the final iteration of the fitting algorithm.
The forward version of the modified Cook distance is given for regression models in (2.57). For generalized linear models the corresponding expression uses deviance residuals and is

Cmi = {(m − p)/p}^{1/2} [ h_{i,S*(m)} d²_{i,S*(m)} / {φ̂_{S*(m−1)} (1 − h_{i,S*(m)})²} ]^{1/2}
      for i ∉ S*(m−1) and i ∈ S*(m),    (6.66)

where again m = p + 1, ..., n.


In linear regression in order to determine whether the response variable
had to be transformed, we monitored the score statistic, often presented as
a fan plot. In generalized linear models to validate a particular link function
we correspondingly monitor the goodness of link test introduced in §6.6.4.
As indicated in §6.6.2, we use deviance residuals, because they take less extreme values than Pearson residuals, especially for binomial data when the fitted probabilities are close to 1 or to 0. In our examples standardization of the residuals to allow for the effect of leverage on variance had no observable effect. In a manner analogous to that for linear regression, we monitored the maximum deviance residual in the subset; that is,

max_{i ∈ S*(m)} |d_{i,S*(m)}|,   m = p + 1, ..., n.    (6.67)

As in the former chapters, these plots give complementary information


about the structure of the data. For example, as we see in our second
example of binomial data, a considerable change in deviance residuals is
likely to produce big changes in the curves of the forward plots of the t
statistics for the parameters. Such changes are usually accompanied by a
peak in the value of Cook's distance and high leverage for the unit that
joins the subset.

6.7 Gamma Models


The theory of the previous sections is now applied to data modelled by
particular members of the family of generalized linear models. We start
with gamma models, suitable for nonnegative continuous observations with

a skew distribution. The skewness reduces as the scale parameter decreases, that is, as the parameter α increases.
For the gamma distribution, written as in (6.21), the loglikelihood of a single observation (6.22) can be written

l(μ, α; y) = −α log μ − αy/μ + d(y, α).

Then, for the estimated means μ̂ given by a model with specified linear predictor and link function resulting in parameter estimates β̂, the loglikelihood is

l(β̂, α; y) = −α log μ̂ − αy/μ̂ + d(y, α).    (6.68)
The loglikelihood for the saturated model is

l(β̂max, α; y) = −α log y − α + d(y, α),    (6.69)

found by replacing μ̂ by y. The scaled deviance for the sample is found by summing twice the difference between (6.69) and (6.68) as

Dsc(β̂) = 2α Σ_{i=1}^{n} {−log(yi/μ̂i) + (yi − μ̂i)/μ̂i}.

Given that φ = 1/α, the deviance (6.47) is therefore

D(β̂) = 2 Σ_{i=1}^{n} {−log(yi/μ̂i) + (yi − μ̂i)/μ̂i}.    (6.70)

The second term in this deviance is usually identically zero.
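Equation (6.70) is straightforward to evaluate; the sketch below does so for illustrative fitted values (not data from the text) and also verifies that a saturated fit with μ̂i = yi has deviance exactly zero.

```python
import numpy as np

def gamma_deviance(y, mu_hat):
    # Gamma deviance (6.70): D = 2 * sum{ -log(y_i/mu_i) + (y_i - mu_i)/mu_i }.
    y, mu_hat = np.asarray(y, dtype=float), np.asarray(mu_hat, dtype=float)
    return 2.0 * np.sum(-np.log(y / mu_hat) + (y - mu_hat) / mu_hat)

# Illustrative positive responses and fitted means.
y = np.array([0.7, 1.9, 3.2, 4.4, 8.1])
mu_hat = np.array([1.0, 2.0, 3.0, 5.0, 7.5])

print(gamma_deviance(y, mu_hat))   # deviance of the fitted model
print(gamma_deviance(y, y))        # saturated model: exactly 0
```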


It is not always obvious which link should be used when fitting gamma
models. The reciprocal link
TJ = f.l- 1
has the mathematical property that it is the canonical link, that is, the link
for which the sufficient statistics are linear functions of the data. However
it is not such that f.l is positive for all values of TJ. This useful property
holds for the log link, which often provides a satisfactory model, although
the identity link is also sometimes found to fit the data. In our analysis of
data on dielectric breakdown strength in §6.9 we find that several members
of the power family of links are indicated, although none of them maps TJ
into a set of nonnegative values.

6.8 Car Insurance Data


Our first example of the use of a generalized linear model emphasizes the
relationship with earlier procedures for the normal distribution. It is of the
car insurance data analyzed by McCullagh and Nelder (1989, p. 298). The
data, given in Table A.16, are for privately owned, comprehensively insured
cars in the UK. There are three factors that affect the frequency and size
of claims:

Factor 1: Policy holder's age (PA) with eight levels: 17-20, 21-24, 25-29,
30-34, 35-39, 40-49, 50-59, 60+;
Factor 2: Car (vehicle) group (VG) with four levels: A, B, C, D;
Factor 3: Car (vehicle) age (VA) with four levels: 0-3, 4-7, 8-9, 10+.

The response is the average claim. The numbers of claims m_ijk are also given in Table A.16.
The data are thus in the form of the results of an 8 × 4 × 4 factorial, but there are five cells for which there are no observations. The total number of observations is therefore 123. We parameterize the factors by using indicator variables for each level except the first, the parameterization employed by McCullagh and Nelder. Like them we also fit a first-order model in the factors - there is no evidence of any need for interaction terms. Since the responses are average claims, we weight the data by the number of observations m_ijk forming each average.
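To make the parameterization concrete, the fragment below shows one way of coding the three factors as indicator variables for each level except the first and of carrying the cell counts as weights; the small arrays are invented placeholders, not the data of Table A.16.

```python
import numpy as np

def factor_dummies(levels, n_levels):
    # Indicator variables for each level except the first.
    levels = np.asarray(levels)
    return np.column_stack([(levels == k).astype(float) for k in range(2, n_levels + 1)])

# Invented placeholder records: (policyholder age level, car group level,
# car age level, number of claims m_ijk, average claim y) -- not Table A.16.
pa = np.array([1, 3, 5, 8, 2, 7])      # 8 levels
vg = np.array([1, 2, 4, 3, 2, 1])      # 4 levels
va = np.array([1, 1, 3, 4, 2, 2])      # 4 levels
m = np.array([20, 5, 12, 3, 40, 7])    # cell counts, used as weights
y = np.array([250.0, 180.0, 320.0, 400.0, 210.0, 290.0])

# First-order linear predictor: a constant plus 7 + 3 + 3 indicator variables.
X = np.column_stack([
    np.ones(len(y)),
    factor_dummies(pa, 8),
    factor_dummies(vg, 4),
    factor_dummies(va, 4),
])
print(X.shape)   # (6, 14): an intercept and 13 factor parameters
# A weighted gamma fit would then use m as prior weights in the IRLS of Section 6.4.5.
```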
To start, we explore the Box-Cox family of links for the five λ values −1, −0.5, 0, 0.5 and 1, calculating for each the goodness of link test introduced in §6.6.4. Figure 6.2 is the resulting equivalent of the fan plot for transformations, but is instead a series of forward plots for the goodness of link test. The test statistics are well behaved throughout the search, as the figure shows: both λ = −1 and λ = −0.5 seem completely acceptable, a conclusion in agreement with that from Figure 11.1 on page 377 of McCullagh and Nelder (1989). When λ = −1 the final value of the goodness of link test is 0.37 and the maximum absolute value during the search is 1.63. In the remaining part of the analysis we stay with the canonical (reciprocal) link.
Figure 6.3 shows the forward plot of the deviance residuals from a search
using 50,000 subsets to find the least median of deviances fit. This well-
behaved plot shows that, for all of the search, the most distant observation
is 18, which is the one identified by McCullagh and Nelder from the fit
to all the data as having the largest residual. Several residuals decrease in
magnitude at the end of the search but there is no evidence of any masking.
We next consider the forward plot of the leverage in Figure 6.4. There is
no information here about observations of exceptional leverage, although
there are two points to be made about the consequences of the factorial
structure of the data. The first is that, for normal regression, the leverage

Figure 6.2. Car insurance data: goodness of link tests for five values of λ

Figure 6.3. Car insurance data: forward plot of deviance residuals



Figure 6.4. Car insurance data: forward plot of leverage

structure depends solely on the values of the factors for those observa-
tions in the subset. But, for a generalized linear model, the leverage (6.54)
depends also on the observations through the parameter estimates and
weights. Effects similar to those for nonlinear least squares in Figure 5.20
are therefore a possibility. There are no such dramatic changes in Figure 6.4,
although the plot does show the effect of the factorial structure in the near
vertical decreases in leverage, caused by the introduction of some factorial
points. This structure is absent from leverage plots where the explanatory
variables have a more random structure.
The next plot shows the deviance and estimate of the dispersion parameter during the search. Since we are searching using deviance residuals, the plot of the deviance in Figure 6.5(left) is smoother than that in Figure 6.5(right), which shows the evolution of φ̂ estimated from the value of Pearson's X² (6.52). The data appear correctly ordered, with none of the jumps and decreases in deviance associated with masked outliers.
The next pair of plots, both in Figure 6.6, show the parameter estimates, which are stable throughout the search, and the t statistics. These parameter estimates on single degrees of freedom have been plotted to show the factor to which each belongs. The values of the β̂ do not indicate any influential observations. The values of the t statistics decrease in magnitude during the forward search as do those for normal data, due to the increasing value of the estimate of the dispersion parameter as the search progresses. The only new feature is the occasional upward jumps in the t statistics. Since, as we have already seen, the parameter estimates and the value of φ̂ behave smoothly, these jumps relate to the changes in leverage that are shown in Figure 6.4, which result from the factorial structure of the data.

Figure 6.5. Car insurance data: (left) deviance and (right) dispersion parameter φ̂ estimated from Pearson's X²

Figure 6.6. Car insurance data: forward plots of (left) parameter estimates and (right) t statistics. The upward jumps in the t statistics result from the factorial structure of the data


Figure 6.7. Car insurance data: (left) approximate modified Cook distance and
(right) maximum deviance residual during the forward search

Table 6.5. Car Insurance Data: the last five stages of the forward search, showing
the increase in deviance
m      Observation Number i     y      Dispersion φ̂     Deviance     Deviance Difference
119    36                       264    0.830            79.3         7.7
120    96                       196    0.901            88.4         9.1
121    73                       233    0.986            98.8         10.4
122    62                       129    1.066            110.9        12.1
123    18                       420    1.208            124.7        13.8

The final plot, Figure 6.7 again stresses the well-behaved nature of the
data. Given in the left panel are approximate modified Cook distances cal-
culated using (6.66). This shows that the observations entering at the end of
the search cause slightly more change in the parameter estimates than those
entering earlier. The parameter estimates themselves in Figure 6.6(left)
show how slight is this effect. Figure 6.7(right) gives the maximum abso-
lute deviance residual in the subset. This shows once more how the more
extreme observations enter towards the end of the search. There are no
surprises.
Finally we consider the deviance in the last few stages of the search.
Table 6.5 lists the last five observations to enter the search, together with
the deviance and the estimated dispersion parameter. As the last column of
the table shows, there is a steady upward trend in the increase in deviance
as each observation is added. The final observation to be included is 18,
which was revealed as the most extreme in the forward plot of the residuals.
If this observation is not included, the estimated dispersion parameter is
1.066, close to one for the exponential distribution.

Our analysis thus shows that the data are well fitted by a model close
to the exponential and that there are no influential observations, although
one observation is somewhat remote. However it has no significant effect on
inferences. A final comment is that our parameter estimates at the end of
the search agree with those of McCullagh and Nelder (1989, p. 299) except that the estimate for level 2 of the vehicle age factor should be 366 not 336,
an easily generated outlier.

6.9 Dielectric Breakdown Strength


If the analysis of the previous section was so bland as to suggest that there
are no new features in the analysis of generalized linear models, the data
analyzed in this section are sufficiently complicated in structure that we fail
to find a simple model. Our analysis does however show how the forward
search indicates model inadequacies.
The data, in Table A.17, originally from Nelson (1981), are for the performance degradation from accelerated tests. The response, dielectric breakdown strength in kilovolts, is measured at the points of a factorial design, with factors time and temperature:

x1: time (weeks) with eight levels
x2: temperature (°C) with four levels
y: dielectric breakdown strength in kilovolts.

Since the response is nonnegative with many small values around one
and a maximum of 18.5, some form of model other than linear regression is
likely to be needed. The original analysis used the logged response together
with a nonlinear model in temperature. Here we follow the suggestion of
G. Smyth of the University of Queensland (www.maths.uq.edu.au/rvgks
/data/general/dialectr.html) and explore generalized linear models,
using the gamma distribution. Some previous analyses have treated the
two continuous explanatory variables as factors. We instead treat them as
variables and try to find a model with few parameters. The data are in the
form of an 8 x 4 factorial with four observations per cell. As we did in the
analysis of the Box and Cox poison data in §4.4, we ignore the presence of
the replicate observations. Here these could provide a test of the goodness
of fit of the models using the residual deviance from a saturated two-way
model with interactions to estimate the dispersion parameter regardless of
the linear model.
We begin with plots of the data. Since there is a factorial structure we
replace the scatterplot matrix with scatter plots using symbols to show
the levels of the factor not plotted. Figure 6.8 is a plot of y (strength) against time x1. The responses for the highest temperature (represented

co 0

~i ~
0
<0
§ 0
0
0
0
B
"<t 0

§ B B
N ~ 8
.c $" 2 + 6
enc
!!?
0
2
0 ~
6
0
~
U5 co +
+
~
<0
fr !:.
!:.
* +
!:.
"<t

o 10 20 30 40 50 60
Time

Figure 6.8. Dielectric data: scatterplot of observations . Symbols represent


temperature levels

by triangles Δ) are by far the lowest readings at high time and lie away from the rest of the observations. Because the plot is very congested for low values of time we show in Figure 6.9 the plot of y against log x1. The patterns for the four temperatures as a function of time are revealed as rather different: for the lowest temperature, represented by squares (□), the line of points is virtually horizontal, showing little effect, on breakdown, of time at this level of temperature. For the next higher level, plotted as circles (○), there is a slight and roughly linear downward trend. For the third level (+) the response is dropping rapidly towards the end, whereas for the highest temperature (Δ) an asymptote seems to have been reached around response values of one.
Figure 6.10 shows two plots against the other factor, temperature. The increasing spread to the right in Figure 6.10(left) indicates that we need to include some interaction in the model. We repeat this plot in Figure 6.10(right) with the readings for different times systematically separated by increasing the temperature readings by 3° for each increase in x1. This plot shows how the groups of observations for high temperatures and high times (located in the lower right-hand corner of the graph) lie away from the rest of the data. It is not clear from the plots whether a simpler structure and better model will be obtained by using x1 as a variable, or its logarithm. We use log x1.
The interaction structure visible in the plots suggests that it may be hard to find a satisfactory linear predictor for the data. To try to see what kind of model might be satisfactory we start by fitting a linear predictor with linear terms in log x1 and x2 using a gamma model with the reciprocal link.

Figure 6.9. Dielectric data: scatterplot of observations against log(time). Symbols represent temperature levels

Figure 6.10. Dielectric data: (left) scatterplot of observations against temperature. Symbols represent time. Observations (right) have been separated to exhibit the effect of time


Figure 6.11. Dielectric data, reciprocal link: forward plot of deviance residuals

Figure 6.11 shows the forward plot of the deviance residuals and Figure 6.12
the forward plot of the score statistics for the goodness of link test. Neither
is satisfactory: the plot of residuals shows several remote groups of residuals
throughout the forward search and the final value of the score statistic,
using the Box-Cox link, is -8.26.
We first try to find a satisfactory link and then consider the linear pre-
dictor. The strategy is similar to that in Chapter 4 where a satisfactory
transformation was found first, before checking the linear model. Table 6.6
gives the value of the goodness of link statistic for five values of λ calcu-
lated using the Box-Cox link. This gives the same numerical values as the

Table 6.6. Dielectric Breakdown Strength: goodness of link tests for some values of λ using the Box-Cox link

λ        Link Test
2        -0.55
1        -8.82
0.5      -9.09
0        -8.58
-0.5     -8.33
-1       -8.26

Figure 6.12. Dielectric data, reciprocal link: goodness of link test

power link, except for the sign of the statistic which is informative about
the direction of departure.
The table suggests that we should consider a link with λ = 2. However
the forward plot of this goodness of link statistic in Figure 6.13 shows that,
although the value of the statistic may be acceptable for all the data, it is
not so earlier on, having a maximum absolute value of 5.58 when m = 115.
The forward plot of deviance residuals, Figure 6.14, shows that around this
extreme value there is a rapid change in the values of some residuals (units
125 to 128). We do not print the forward plot of leverages, but it also
helps to explain Figure 6.13: with this link some observations of very high
leverage enter at the end of the search and affect the value of the link test.
We therefore reject λ = 2 and follow our second course of action, which is to try to build a better linear predictor. For this we return to our results for λ = −1, the reciprocal link often being found to be satisfactory for gamma
data.
The discussion of the scatter plots such as Figure 6.9 suggested that
interactions and second-order terms would be needed. A full second-order model in log x1 and x2 including an interaction term on one degree of freedom has a deviance of 15.15 as opposed to 23.64 for the first-order model. Although this is an appreciable reduction for three extra degrees of freedom, the forward plot of residuals is similar to Figure 6.11, still showing
groups of negative outliers. We try to accommodate these outliers by fitting
dummy variables for the individual groups. The hope is that we can explain
a few groups individually and that then the rest of the observations will be
fitted by the second-order model.


Figure 6.13. Dielectric data, Box-Cox link with λ = 2: goodness of link test


Figure 6.14. Dielectric data, Box-Cox link with λ = 2: forward plot of deviance
residuals


Figure 6.15. Dielectric data, reciprocal link with second-order model and two
dummy variables: forward plot of deviance residuals

The lowest group of residuals in Figure 6.11 contains observations 109 to


112 and 125 to 128. These are the eight smallest observations, plotted with
triangles at the two highest times for the highest temperature in Figure 6.9.
The next group of observations giving low residuals in Figure 6.9 contains
observations 93 to 96, which are the four with the next lowest response
values. The addition of the two dummy variables for these groups causes
a dramatic reduction of the deviance to 2.S6. As Figure 6.15, the forward
plot of residuals, shows, the dummy variables have accommodated the two
groups. However a third group is now in evidence. These are observations
61 to 64 and 77 to SO, the next two groups of observations for the highest
temperature which show clearly in Figure 6.9 with response values around
6. If we include a third dummy variable for these observations, the deviance
is reduced to 1.54. Residual plots from this model seem virtually featureless,
so we now return to consideration of an appropriate link function.
Figure 6.16 shows a series of forward plots of the goodness of link test for λ = -1, -0.5, 0, 0.5 and 1.
This plot shows that the two links which cannot be rejected are λ = 0 and λ = 0.5. Given that the curve for λ = 0.5 seems to be much more centred around 0 than that associated with λ = 0, we use the square root link for the rest of the analysis.
The first step is to see whether the linear predictor can be simplified.
Figure 6.17 shows the forward plot of the t statistics for λ = 0.5. This
shows the typical contracting shape arising from estimation of the disper-
Figure 6.16. Dielectric data, Box-Cox link with second-order model and three dummy variables: goodness of link test for five values of λ

sion parameter. The smallest of the nonsignificant t values at the end of the
search is for x2^2, the quadratic term in temperature. We therefore exclude this variable from the linear predictor
and rerun the forward search for an eight-variable predictor. All variables
are now significant, so we present a full analysis of this final model.
The final model has a linear predictor including three dummy vari-
ables and a full second-order model with interaction in temperature and
log(time), except for the quadratic term in temperature. Figure 6.18 is the
forward plot of the residuals, which is in general well behaved throughout.
The highest residual is for unit 125, which is the largest observation at
the highest level of both explanatory variables and one of the highest in
its group of eight observations. The last observation to enter the search
is 111. In some other searches we performed this observation entered the
search earlier than here and left again before entering again at the end of
the search, giving slightly different plots. The most negative residuals in
Figure 6.18 reflect the replicated structure of the data and come from the
smallest of four observations in particular cells of the factorial.
The forward plot of the leverages, Figure 6.19, shows horizontal lines
of high leverage that arise from the dummy variables, for which the
coefficients are determined by only a few observations, either four or
eight. The parameter estimates are in Figure 6.20(left) and are stable
during most of the search, trending slightly towards the end. The t statis-
tics in Figure 6.20(right) confirm that all variables are now significant.
Figure 6.21(left) shows the forward estimate of the deviance and, on the
right, the estimated dispersion parameter. Both show the smooth upward
trend associated with data correctly ordered by the forward search. The
Figure 6.17. Dielectric data, Box-Cox link with λ = 0.5 and a second-order model including three dummy variables: forward plot of t statistics

Figure 6.18. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of deviance residuals
Figure 6.19. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of leverage, showing effect of replicated factorial structure

Figure 6.20. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plots of (left) estimated coefficients and (right) their t statistics
Figure 6.21. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plots of (left) the deviance and (right) the scale estimate φ̂

goodness of link test for this reduced model is in Figure 6.22. The plot lies
within the 5% limits throughout. The most noticeable feature towards the
end of the search is the effect of observation 111, causing an upward jump
when it is introduced at the end of the search. However, in general, compar-
ison with Figure 6.16 shows that dropping one term from the predictor has
not affected the link test. As a pendant to this analysis, Figure 6.23 shows
that observation 111, entering at the end of the search, has a large residual
in the forward plot of the maximum absolute residual in the subset.
In fact, observation 111 comes from the group of observations from the
next to longest time and highest temperature. It is therefore included in
the same group of eight observations as 125 to 128 as Figure 6.24 shows.
The dummy variable for this group of eight should therefore perhaps be
split from that for the group for the longest time and the forward search
repeated. We do not do this here, but instead consider what our analysis
has achieved.
The forward search revealed the groups of observations that do not agree
with the model fitted to the rest of the data. The search also showed the
effect of these observations on the selected link and linear predictor. The
result is a model in which the five groups of observations for the highest
temperature (61-64, 77-80, 93-96, 109-112 and 125-128) and log(time)
greater than 2 (Figure 6.24) are modelled separately from the rest of the
data. The implication is that simple linear models are not adequate to
describe these data.
There are many other possible models. A simple alternative to that ex-
plored here is to work with time rather than its logarithm. Another is to
consider normal theory regression: our final dispersion estimate is 0.012,
implying an index for the gamma distribution around 80; the distribution
is thus close to normal. Another possibility is to consider a nonlinear model
Figure 6.22. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of goodness of link test

Figure 6.23. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of the maximum absolute deviance residual in the subset; observation 111 enters last
Figure 6.24. Dielectric data: dummy variables. In this version of Figure 6.9 the observations modelled using the three dummy variables are shown with filled symbols (horizontal axis: log(time))

in time to fit the exponential decay indicated in Figure 6.8, with the rate a
function of temperature. Such a generalized nonlinear model is outside the
models described in this book, although similar diagnostic methods based
on the forward search would apply.

6.10 Poisson Models


The second important instance of generalized linear models is provided by the Poisson distribution, for which the probability function is

f(y; μ) = e^{-μ} μ^y / y!,    μ > 0,   y = 0, 1, . . . ,

which is used to model counts with no specified upper bound. The loglikelihood of a single observation (6.22) is

l(μ; y) = -μ + y log μ + d(y).


For the Poisson distribution the dispersion parameter is equal to one.
With estimated means fi, from parameter estimates 13, the loglikelihood
is

l( j3;y) = -fi, + ylogfi, + d(y)


and that for the saturated model is
l((3max; y) = -y + y log y + dey) .

Then the deviance for the sample is found, as in §6.7, by differencing and summing twice the difference over the sample to be

D(β̂) = 2 Σ_{i=1}^{n} { y_i log(y_i/μ̂_i) - (y_i - μ̂_i) }.   (6.71)

The second term in this deviance, Σ(y_i - μ̂_i), is identically zero for linear predictors including a constant. The deviance will be small for models with good agreement between y_i and μ̂_i. Since the scale parameter φ = 1, the value of the deviance can be used to provide a measure of the adequacy of the model, tested by comparison with a chi-squared distribution on n - p degrees of freedom. If the estimated means μ̂ are "small," the distribution of the deviance may need to be checked by simulation.
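A small computational sketch may make (6.71) concrete; the counts and fitted means below are invented illustrative values, not any of the data sets analyzed in this chapter, and numpy is assumed to be available.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson residual deviance 2*sum{ y*log(y/mu) - (y - mu) }, with 0*log(0) = 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # guard the logarithm where y = 0: that term contributes zero
    first_term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(first_term - (y - mu))

# invented counts and fitted means, purely for illustration
y_obs = np.array([1, 3, 0, 7, 2])
mu_hat = np.array([1.4, 2.8, 0.6, 5.9, 1.8])
print(poisson_deviance(y_obs, mu_hat))
```

The second term Σ(y_i - μ̂_i) is retained in the function even though it vanishes whenever the linear predictor contains a constant.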
We examine two examples of Poisson data: the data on train accidents,
in which we have one factor and one continuous explanatory variable, and
data on cellular differentiation which are the results of a 4 x 4 factorial
experiment with quantitative factors. We do not specifically consider con-
tingency tables, that is, Poisson data with qualitative factors. References
to the large literature on this subject are given at the end of the chapter.
The most straightforward analysis of contingency tables is concerned with
discovering whether there is a relationship between the factors. If there is
no relationship an additive model is appropriate, with cell means μ_i estimated from the product of the marginal distributions of observations over each factor. Pearson's chi-squared goodness of fit test compares the predictions from this model with those from the saturated model, giving the test statistic

X² = Σ_{i=1}^{n} (y_i - μ̂_i)²/μ̂_i = Σ_{i=1}^{n} (O_i - E_i)²/E_i.   (6.72)

In the last expression in (6.72) the O_i are the observed y_i and the E_i are the expected values. Pearson's statistic provides an alternative to the use of the deviance, which is sometimes called G² in the literature on contingency tables. Both statistics can be used for overall testing of models where the μ̂_i come from fitting more complicated models than the product of the estimated marginal distributions. We leave it to Exercise 6.6 to establish the relationship between the two statistics.
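As a companion sketch, Pearson's statistic (6.72) can be computed in the same way and set beside the deviance G² from the same fit; the observed and expected counts below are again invented for illustration only.

```python
import numpy as np

def pearson_chi2(observed, expected):
    """Pearson's X^2 = sum (O_i - E_i)^2 / E_i."""
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    return np.sum((o - e) ** 2 / e)

# invented cell counts and fitted expected values
O = np.array([12, 5, 9, 3])
E = np.array([10.2, 6.1, 8.4, 4.3])
print(pearson_chi2(O, E))   # to be compared with the deviance G^2 from the same fit
```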
A distinction between the Poisson model and the other generalized linear
models examined in this chapter is that the link function is not often in
question. In our examples we use the (canonical) log link log(μ) = η.

6.11 British Train Accidents


The Poisson generalized linear model was introduced in §6.1.1 using some
data on British train accidents, which we now analyze. The important
difference from the preceding analyses of gamma data is that the dispersion
parameter does not have to be estimated, since it is one for Poisson data.
The deviance therefore provides a test of the goodness of fit of the model.
We use Poisson models with a log link, so that μ = exp η(x). However the distribution of the number of accidents will also depend on the amount of traffic on the railway system, measured by billions of train kilometres and also given in Table A.18. For year t let the value be m_t. The mean number of accidents will be proportional to m_t, so the model for year t becomes

μ = m_t exp η(x) = exp{log m_t + η(x)}.   (6.73)

The dependence of accidents on traffic is thus modelled by adding to the linear predictor a term log m_t with known coefficient one. Such a term with a known coefficient is called an offset (McCullagh and Nelder 1989, p. 423).
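A minimal sketch of a fit of this kind, assuming the statsmodels GLM interface, is given below; the death counts, train-kilometres and rolling-stock codes are placeholder numbers, not the values of Table A.18, and only the handling of the offset is the point of the example.

```python
import numpy as np
import statsmodels.api as sm

# placeholder data: deaths per accident-year, traffic in billions of train-km,
# and a three-level factor for type of rolling stock (coded 0, 1, 2)
deaths = np.array([1, 2, 1, 5, 1, 3, 2, 7])
train_km = np.array([0.43, 0.42, 0.44, 0.41, 0.40, 0.39, 0.38, 0.37])
stock = np.array([0, 0, 1, 2, 1, 2, 0, 2])

# dummy variables for levels 1 and 2 of the factor, plus a constant
X = sm.add_constant(np.column_stack([(stock == 1).astype(float),
                                     (stock == 2).astype(float)]))

# log(traffic) enters the linear predictor as an offset with known coefficient one
fit = sm.GLM(deaths, X, family=sm.families.Poisson(),
             offset=np.log(train_km)).fit()
print(fit.params)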
A complication which we ignore in our analysis is that the data do not
include figures on accidents in which there were no deaths. From a practi-
cal point of view, there are difficulties in defining such accidents. From a
statistical point of view we need to explore the effect of ignoring such zero
truncation in the estimation of Poisson parameters.
We first fitted a model in which the linear predictor included the offset
and the year as well as a factor with three levels for the type of rolling
stock. To demonstrate the difference between residual plots for continuous
data and those for discrete data we give in Figure 6.25 a plot of deviance
residuals against fitted value when the three largest observations (units 13,
23 and 63) are excluded.
This plot shows appreciable structure, which is unrelated to whether the
model fits well. First, the fitted values fall into three groups corresponding
successively to nonpassenger (goods) trains, post-Mark 1 passenger trains
and Mark 1 trains. The fitted values for these groups are slightly smeared by
the effect of calendar year. Also visible in the plot are a series of decreasing
curves: the lowest is for the residuals for all observations which are one,
the one above for all observations equal to two, and so on. Such banding is
typical of residual plots for discrete data with few different observed values.
We can expect that this structure may also affect some forward plots of
residuals.
The forward analysis of all the data showed that time was not significant,
so we removed the variable and used only the single factor for train type.
The deviance residuals from the forward search are shown in Figure 6.26.
The three large residuals for units 63, 13 and 23 are for observation values
of 49, 35 and 13, the three largest, the others all being 10 or less. These
are the last three observations to enter the forward search. A less obvious
feature of the plot is the slight curvature starting around m = 29 which
corresponds to the successive inclusion of 10 units of Mark 1 stock for all of
which there was one death. After this the order of entry of the data reflects
the size of the response.
Figure 6.25. Train data: residuals against fitted values showing the pattern induced by the discrete values of the response. Numbers correspond to the value of the response. The vertical lines divide the observations into three nonoverlapping groups (nonpassenger, post-Mark 1 and Mark 1 stock). The higher risk from Mark 1 stock is evident

Figure 6.26. Train data: forward plot of deviance residuals. The residuals from the three largest observations are evident
Figure 6.27. Train data: goodness of link test, again showing the effect of the largest observations

The model seems inadequate for all the data and the inadequacy shows
in the plots from the forward search. Figure 6.27, for example, is the for-
ward plot of the goodness of link test. Here the effect of the inclusion of
the last three observations to enter the search (23 , 13 and 63) is highly
evident. Inclusion of unit 23 causes the value of the test to become slightly
negative while units 13 and 63 cause an upward jump and bring the value
of the statistic much beyond the 1% limits. As a final set of plots showing
the inadequacy of the model, Figure 6.28 presents the deviance and the
estimated dispersion parameter. Both are smooth curves, showing that the
data are well ordered by the search, although the values are very large
towards the end of the search. Since, if the Poisson model holds, φ equals one, we have an absolute test of the adequacy of the model by comparing the deviance with the χ² distribution on m - 3 degrees of freedom. The deviance is below the 95% point of this distribution until m = n - 2 = 65, when at 102.8 it exceeds the 99.9% point of χ²_62. Although this distribution
is asymptotic and may not hold exactly for small numbers of counts, there
is no doubt that the Poisson model is inadequate. This is not surprising
since, for example, we do not have data on accidents in which there are no
fatalities , so that a zero-truncated distribution is required. A more plausible
model would be a compound Poisson process, in which accidents happen
in a Poisson process, but the number killed, given that there has been an
accident , has a zero-truncated distribution. It would then be of interest
to determine the relationship between the factors of both the number of
accidents and their severity.
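The comparison of the deviance with its asymptotic chi-squared reference can be reproduced with a one-line lookup; the deviance of 102.8 on 62 degrees of freedom is the value quoted above, and scipy is assumed to be available.

```python
from scipy import stats

deviance, df = 102.8, 62
print(stats.chi2.sf(deviance, df))         # upper-tail probability of the observed deviance
print(stats.chi2.ppf([0.95, 0.999], df))   # 95% and 99.9% points of chi-squared on 62 df
```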
Figure 6.28. Train data: (left) deviance and (right) dispersion parameter φ̂. The largest observations cause significant lack of fit

6.12 Cellular Differentiation Data


In Table A.19 we give data on the number of cells showing differentiation
after treatment with two immuno-activating agents. The variables are:

x1: Dose of tumor necrosis factor (U/ml) with four levels (0, 1, 10 and 100), TNF
x2: Dose of Interferon-γ (U/ml) with four levels (0, 4, 20 and 100), IFN
y: Number of cells exhibiting differentiation.

The analysis of Piegorsch et al. (1988) focused on the presence of positive interaction, "synergism" in medical terminology. This is also the main
emphasis of the analysis of Fahrmeir and Tutz (1994, p. 36), who fitted a
Poisson model. We also begin by looking at the evidence for interaction. We
then note that both sets of authors treated the factor levels as qualitative,
although they are quantitative doses. In our analysis we then look for a
simple model in the quantitative variables.
We start our analysis with the seven-parameter model in which there
is no interaction and both variables are treated as categorical. Since a
cubic polynomial can be fitted exactly through four points, this model
is identical to that in which a cubic polynomial is fitted to each dose. If
interaction is present the deviance will be too large and can be tested, since,
for Poisson data, ¢> = 1. With only 16 observations and discrete data there
may be problems in the interpretation of some plots, particularly those
based on residuals. Despite this, Figure 6.29, which is the forward plot of
residuals from a fit with the log link, is easily understood. Observation
16, that at the highest level of both factors, has a large negative residual
until it enters at the last stage of the forward search. Figure 6.30(left)
shows the forward plot of the deviance and (right) the forward plot of the
Figure 6.29. Cellular data: forward plot of deviance residuals. Observation 16 has a large negative residual

Figure 6.30. Cellular data: (left) deviance and (right) modified Cook distance. Introduction of observation 16 at the end of the search has a dramatic effect on both statistics
Figure 6.31. Cellular data: scatter diagrams of y against transformations of the two doses (left: log(dose of TNF + 1), points labelled by IFN level; right: log(dose of IFN + 1), points labelled by TNF level). Points corresponding to the same level of the second factor have been connected by lines. If no interaction is present the slope of the lines is similar

modified Cook distance. Both show the dramatic effect of the introduction
of observation 16: the deviance, for example, increases from 4.39 to 23.03.
Thus, without observation 16, testing the deviance against a chi-squared
distribution on eight degrees of freedom shows that the model fits well
and there is no evidence of any interaction. With observation 16 included
the model does not fit adequately so any evidence for interaction depends
on this observation alone. Another piece of evidence that observation 16
is influential for the model is that the goodness of link test changes from
-0.59 to -3.26 when the observation is introduced at the end of the search.
Otherwise the log link is in agreement with the data.
To try to understand this effect, we show scatterplots of a transformation
of the data in Figure 6.31. For both variables the lowest dose levels are 0
and the highest 100. It is usually found , as it is for Bliss's beetle data
introduced in §6.1.2, that simple models are obtained by working with log
dose. If, as here, one of the dose levels is zero, it is customary to add one to
the dose before taking logs, thus working with variables w_j = log(1 + x_j), j = 1, 2. These are the variables used in the scatter plots of Figure 6.31,
which show the values of y increasing with both explanatory variables. In
the absence of interaction the changes in response with the levels of the
factor in the plot should be the same for all levels of the second factor
represented by symbols. The additive structure therefore results in moving
the dose response relationships up or down for the levels of the second
factor. Changes in shape indicate interactions. Compared with plots for
normal data, those here for Poisson data are made more difficult to interpret
by the dependence of the variance on the magnitude of the observations.
Such an additive structure appears to be present in Figure 6.31 , except
that the left panel indicates that the highest observation, 16, may be too
low for additivity. We have already seen that this is the unit with a large
Figure 6.32. Cellular data, observation 16 deleted: (left) deviance, showing lack of fit throughout the search, and (right) t statistics; the interaction term is not needed

negative residual until it enters at the last step of the forward search,
another indication that the value is too low. We therefore consider the
effect of arbitrarily adding 100 to the observation. Since the observation
enters in the last step of the forward search, nothing changes until the end
of the search, when the residual deviance becomes 6.08 rather than 23.03.
This new value is in agreement with the model without interaction and
with the rest of the data.
We now try to find a simpler model by regression on the "log" doses
w1 and w2. For the reasons given above, we leave aside observation 16.
Inspection of Figure 6.31 does not inspire much hope that it will be possible
to find a simple regression model for the data. Although the responses
increase with the variables, the increase is generally of a "dog-legged" form
which requires a cubic model, rather than a linear or quadratic one. The
main exception is the upper set of responses in Figure 6.31(left) which,
with the suspect observation 16, form a nice straight line.
Our numerical results confirm our initial pessimism. The final deviance
for a model for 15 observations with first-order terms and their interac-
tion is 26.1, so that there is evidence of lack of fit of this simple model.
The forward plot of the deviance, Figure 6.32(left), shows that the lack of
fit extends through a large part of the data: there are no outstandingly
large contributions to the deviance, just a steady increase in value as each
observation is added. The plot of t statistics in Figure 6.32(right) is very
different in form from those for normal and gamma models in which es-
timation of the dispersion parameter caused the plot to contract as the
search progressed. Here, since φ is taken as one throughout the search, the
t statistics either remain approximately constant or increase with m . This
particular plot shows great stability during the search - again there are no
influential observations. There is also no evidence of any interaction.

We repeated this analysis for a second-order model with interaction, six parameters in all, for which the residual deviance was 24.20, a nonsignificant
decrease when compared with the value of 26.1 for the first-order model.
If observation 16 is introduced into the second-order model the deviance is
25.71. This observation is not remote from this unsatisfactory model.
The main conclusions are that there is no evidence of any interaction,
provided observation 16 can be rejected, and that, although independent,
neither dose response curve can be represented by a low-order polynomial.
These conclusions are strengthened by the suspicions that surround the
larger observations. Although Fahrmeir and Tutz (1994) analyze the data
as Poisson, they state that the counts are out of 200 cells examined. The
binomial model might therefore seem to be appropriate. However, if the
number of counts is small, the Poisson approximation to the binomial with
small θ would make the choice between these two models unimportant.
Unfortunately the five largest counts are 193, 180, 171 , 128 and 102, which
are not small in this context. Piegorsch et al. (1988) are less dogmatic about
sample size, saying that the numbers of cells examined at each treatment
combination are unknown, but seem never to have been greater than 250.
Our experience in modifying observation 16 to 293, a value which still gave
a large negative residual, suggests that, for this treatment combination at
least, the number of cells examined may have been much greater. Otherwise
our plots and the value of the deviance suggest that the Poisson assumption
holds for these data.

6.13 Binomial Models


In binomial data the observation Ri is the number of successes out of ni
independent trials with individual probabilities of success ()i. The distribu-
tion has two interesting features for the development of methods for the
analysis of data: it is that which is least like the normal, especially as the n_i → 1, and it is the distribution for which the link is most in question. In
this section we give general results about the deviance of the distribution
and compare several link functions. There follow three sections in which
we present examples of the forward analysis of binomial data. The special
difficulties arising from binary data, that is, when all ni = 1, are described
in §6.17.1.
The binomial probability function is

f(r; θ) = {n!/(r!(n - r)!)} θ^r (1 - θ)^{n-r},   r = 0, 1, . . . , n,

with E(R) = nθ and var(R) = nθ(1 - θ). The loglikelihood of a single observation (6.22) is

l(θ; r) = r log θ + (n - r) log(1 - θ) + d(n, r).

As for the Poisson distribution, here the dispersion parameter is also equal
to one.
It is convenient to rewrite the distribution in terms of the variable

y = r/n,   with   E(Y) = θ and var(Y) = θ(1 - θ)/n,

when the loglikelihood for a single observation becomes

l(θ; y) = ny log θ + n(1 - y) log(1 - θ) + d(n, y).

With estimated probabilities θ̂ from parameter estimates β̂, the loglikelihood for a single observation is

l(β̂; y) = ny log θ̂ + n(1 - y) log(1 - θ̂) + d(n, y)

and that for the saturated model is

l(β_max; y) = ny log y + n(1 - y) log(1 - y) + d(n, y).

Then the deviance for the sample is found, as in §6.7, by differencing and summing over the sample to be

D(β̂) = 2 Σ_{i=1}^{n} { n_i y_i log(y_i/θ̂_i) + n_i (1 - y_i) log[(1 - y_i)/(1 - θ̂_i)] }.   (6.74)

The deviance will be small for models with good agreement between y_i and θ̂_i. Since the dispersion parameter φ = 1, the value of the deviance can be used to provide a measure of the adequacy of the model, tested by comparison with a chi-squared distribution on n - p degrees of freedom provided the n_i are "large". If the numbers n_i in the groups are "small", the distribution of the deviance may need to be checked by simulation. For the limiting case of binary data, when all n_i = 1, discussed in §6.17.1, this deviance is uninformative about the fit of the model (Exercise 6.7). Intuitively, for the saturated model for ungrouped binary responses (n_i = 1), as n increases, the number of estimated parameters tends to infinity, since one parameter is fitted to each observation. On the contrary, for binomial observations only one parameter is fitted to each proportion R_i/n_i in the saturated model. Thus, when n_i → ∞ the number of fitted parameters remains constant.
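The grouped binomial deviance (6.74) can also be written directly as a short function, with the usual convention that terms with y_i = 0 or y_i = 1 contribute zero to the corresponding logarithm; the counts and fitted probabilities below are invented illustrative values.

```python
import numpy as np

def binomial_deviance(r, n, theta):
    """Deviance 2*sum{ n*y*log(y/theta) + n*(1-y)*log((1-y)/(1-theta)) }, y = r/n."""
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    theta = np.asarray(theta, dtype=float)
    y = r / n
    # replace the argument of each logarithm by 1 where its multiplier is zero
    t1 = np.where(r > 0, n * y * np.log(np.where(r > 0, y, 1.0) / theta), 0.0)
    t2 = np.where(r < n, n * (1 - y) * np.log(np.where(r < n, 1 - y, 1.0) / (1 - theta)), 0.0)
    return 2.0 * np.sum(t1 + t2)

print(binomial_deviance([6, 13, 18], [59, 60, 62], [0.09, 0.22, 0.30]))
```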
We now compare several link functions for binomial data. Table 6.1
included four often used for modelling binomial data. Figure 6.33 shows
how they vary in the relationship they give between linear predictor and
probability. In all panels, the continuous line represents the logit link.
For the symmetrical links (probit and arcsine) the curves have been
rescaled to agree not only when the probability equals 0.5, but also for
probabilities 0.1 and 0.9. The first panel shows how very close are the
probit and logit links. Chambers and Cox (1967) calculate that several
thousand observations would be needed in order to tell these two apart.
The fourth panel emphasizes the short tails of the arcsine link: outside a
certain range the probabilities are identically zero and one.
Figure 6.33. Comparison of links for binomial data to the logit link, represented by a continuous line (panels: probit, complementary log log, log log and arcsine links; horizontal axes: linear predictor). Symmetrical links have been scaled to agree with the logit at probabilities of 0.1 and 0.9

The other two panels of Figure 6.33 show two asymmetric links. The
complementary log log link was defined in Table 6.1 as g(μ) = log{-log(1 - μ)}. Applying this link to 1 - y gives the log log link for which g(μ) = log{-log(μ)}. We find both links useful in the analysis of data. Because
of the similarity of the probit and logit links it does not , usually, matter
which is fitted. The logit link has the advantage of easier interpretation
through interpretation of the logit as a log of odds.
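To make the comparison of Figure 6.33 concrete, the links discussed in the text can be evaluated at a few probabilities; this is only a sketch, with the probit computed from the standard normal quantile function in scipy, and the probabilities chosen arbitrarily.

```python
import numpy as np
from scipy import stats

def logit(mu):   return np.log(mu / (1 - mu))
def probit(mu):  return stats.norm.ppf(mu)
def cloglog(mu): return np.log(-np.log(1 - mu))   # complementary log log
def loglog(mu):  return np.log(-np.log(mu))       # cloglog applied to 1 - mu

mu = np.array([0.05, 0.1, 0.5, 0.9, 0.95])
for name, g in [("logit", logit), ("probit", probit),
                ("cloglog", cloglog), ("loglog", loglog)]:
    print(name, np.round(g(mu), 3))
```

The symmetry of the logit and probit about μ = 0.5, and the asymmetry of the two log log links, can be read directly from the printed values.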

6.14 Bliss's Beetle Data


Our first application of binomial models is to the analysis of Bliss's beetle
data, Table A.20, which were used in §6.1.2 as one introduction to gener-
alized linear models. With approximately 60 beetles in each of the eight
groups there are none of the problems resulting from small values of ni that
occur most dramatically with binary data.
We start by exploring possible link functions. Figure 6.34 shows plots of
absolute deviance residuals from forward searches for three models in which
the explanatory variable is log (dose) and the three links are the logit, probit
and complementary log log. The observations are numbered from the lowest
dose level to the highest. For the logit link observations 1 and 2 are the last
two to be included in the forward search. The crossing of the lines at the
end of the plot in Figure 6.34( top) shows that the inclusion of observations
1 and 2 seems noticeably to affect the ordering of the residuals. With the
probit link units 3 and 4 (the last two to be included) seem to be different
Figure 6.34. Bliss's beetle data: absolute values of deviance residuals as the subset size increases: (top) logit, (middle) probit and (bottom) complementary log log links

from the rest of the data: they are badly predicted by models in which
they are not included. On the other hand, the residuals from the forward
search with the complementary log log link show no such behaviour; all
residuals are smaller than two throughout, and relatively constant. Since
the scale parameter is not estimated, it is possible to make such absolute
comparisons of the residuals across different models, even if they come from
different link families .
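Fits of this kind are easy to reproduce; the sketch below assumes the statsmodels GLM interface and its Logit, Probit and CLogLog link classes (class names may differ between statsmodels versions), and uses widely reproduced values of Bliss's beetle mortality data purely as placeholders rather than quoting Table A.20.

```python
import numpy as np
import statsmodels.api as sm

# placeholder values in the spirit of Bliss's data: killed, group size, log10(dose)
killed = np.array([6, 13, 18, 28, 52, 53, 61, 60])
total = np.array([59, 60, 62, 56, 63, 59, 62, 60])
logdose = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])

X = sm.add_constant(logdose)
resp = np.column_stack([killed, total - killed])   # successes, failures

for link in (sm.families.links.Logit(),
             sm.families.links.Probit(),
             sm.families.links.CLogLog()):
    fit = sm.GLM(resp, X, family=sm.families.Binomial(link=link)).fit()
    print(type(link).__name__, round(fit.deviance, 2))
```

Because the dispersion parameter is one, the printed deviances (and the deviance residuals in fit.resid_deviance) are directly comparable across the three link families, which is the point made in the text.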
Figure 6.35 shows a forward plot of the goodness of link test, the order
of introduction of the observations being different for the three links. For
the logit and probit links these plots show evidence of lack of fit at the
5% level, which is indicated by the statistic going outside the bounds in
the plot. Although it is inclusion of the last two observations that causes
the values of the statistic to become significant, it is clear from the steady
upward trend of the plots that lack of fit is due to all observations. The
plot for the complementary log log link shows no evidence of any departure
from this model. This plot also shows that unit 5, which is the one with
the biggest residual for the complementary log log link and the last to be
included in this forward search, has no effect on the t value for the goodness
of link test.
This analysis shows that, of the three links considered, only the comple-
mentary log log link is satisfactory. The plots of fitted values in Figure 6.36
relate this finding to individual observations. The upper pair of plots show
Figure 6.35. Bliss's beetle data: forward plot of the goodness of link test. Only the complementary log log link is satisfactory

the fitted dose response curve for the logistic model, both at the beginning
and at the end of the forward search. When m = 2 observations 1 and 2
are badly fitted by this symmetrical link. At the end of the search these
two lower observations are better fitted, but observations 7 and 8 are now
less well fitted . The complementary log log link, the fitted dose response
curves for which are shown in the lower two panels, are not symmetrical
and can fit both the higher and lower dose levels for these data. The sym-
metrical probit link gives fitted curves very similar to those for the logistic
model. We now consider another relatively simple example in which there
are advantages in using an asymmetrical link.

6.15 Mice with Convulsions


Lindsey (1995) gives a detailed diagnostic analysis of data on mice with
convulsions. The data are on his page 69 and were previously analyzed by
Finney (1978) who used probit analysis.
The binomial data in Table A.21 record the number of mice with con-
vulsions as a function of the dose of insulin, which is prepared in two ways.
The variables are:

x1: dose of insulin
x2: preparation type.

There are two types of preparation, with nine levels of dose for the stan-
dard preparation and five levels for the test preparation. Since there are
Figure 6.36. Bliss's beetle data: actual and fitted values against log(dose), showing that the fit of the complementary log log link does not change appreciably with m. Panels: logit link at m = 2 and m = n (top); complementary log log link at m = 2 and m = n (bottom)

between 30 and 40 mice in each group, the value of the deviance should be
a useful overall measure of the goodness of fit of the models.
We start by considering the three models analyzed by Lindsey using log(dose). The first is a logistic model in the two variables, for which the
residual deviance is 8.79. The second is a model with the complementary
log log link and, again, both variables, for which the deviance is a slightly
larger 12.87. Because the link is not symmetrical, it matters whether we
take the proportion of mice with convulsions as the response or, as we do
for the third model, the proportion with no convulsions. For this log log
model the residual deviance is 4.688. There are 14 - 3 = 11 residual degrees
of freedom, so it therefore seems that all models fit adequately, with the
log log link fitting slightly better than the others. We now use the forward
search to elucidate the reasons for this ordering of the models.
Figure 6.37 shows forward plots of the deviance residuals for the logistic
model. For this, as for the other two models, the plot is very stable. For the
logistic model observation one consistently has the most extreme residual.
This observation is that with the lowest dose level, so low that there are
no convulsions. It is the last to enter the forward search. In the analysis of
Bliss's data on beetles we saw that it was observations at extreme values
of the explanatory variables that carried information about the correctness
of the link. For the complementary log log link, the forward residual plot
Figure 6.37. Mice: deviance residuals from the logistic model

(Figure 6.38) shows that both observations 1 and 10 have relatively large
absolute deviance residuals. Observation 10 is that with the lowest dose
level for the group receiving the test preparation and is the second to last
to enter the forward search, observation 1 again being last. It seems that
this new link has accentuated the failings of the model at low dose levels.
Figure 6.39 shows the plot for the log log link. There are no observations
that consistently have large residuals, although the last two to enter the
search are again observations 10 and 1.
Figure 6.40 shows the plot of the goodness of link tests for the models
during the three forward searches for the three links. At the end of the
searches the values are -1.353, -2.324 and 0.247. There is thus some ev-
idence that the complementary log log model has an unsatisfactory link.
The power of this test on one degree of freedom for a specific departure is
to be contrasted with the value of 12.87 for the residual deviance, for which
the significance level is 30.2%.
The plot shows that some of the evidence for lack of fit comes from
observations 10 and 1, the last two to enter the search. A similar pattern,
but less pronounced, can be seen in the last step of the plot for the logistic
link, when observation 1 enters at the end of the search. The plot for the
log log model if anything shows an increase in support for the link when
the last two observations enter.
A final plot is Figure 6.41 which shows the growth of the deviance during
the forward search. All are smooth curves, showing that the data have
been correctly ordered. But those for the logistic link and, especially, for
Figure 6.38. Mice: deviance residuals from the complementary log log model

Figure 6.39. Mice: deviance residuals from the log log model
Figure 6.40. Mice data: forward plots of the goodness of link tests for the three models

the complementary log log link, show marked increases at the end of the search, due to observation 1 or to observations 10 and 1.
Other plots, such as that for the t statistics for the parameters, are not
shown here. They are smooth and well behaved, showing no evidence of
observations influential for the parameters of the linear predictor. However
the stable form of the forward plots of residuals, with persistent extreme
values may be evidence of systematic departures from the model. As is to
be expected on general statistical grounds, the specific goodness of link test
on one degree of freedom provides a more powerful test for link departures
than does the general test using the residual deviance on several degrees
of freedom. Finally, for this example, by considering the anomalous obser-
vations in the context of the data, extra information has been gained: here
that low dose levels are responsible for the departures from some of the
links.

6.16 Toxoplasmosis and Rainfall


6.16.1 A Forward Analysis
Efron (1978) presents data, which we give in Table A.22 , on the proportion
of subjects aged between 11 and 15 testing positive for toxoplasmosis, as a
function of annual rainfall x (in mm) in 34 cities in El Salvador. He fits a
Figure 6.41. Mice data: forward plot of the deviances for the three models

binomial model with logit link to a cubic in standardized rainfall: that is,

log{θ/(1 - θ)} = β_0 + β_1 z + β_2 z² + β_3 z³,   (6.75)

where z denotes the standardized value of x. This fit has some strange features: the cubic term in the model (β_3) is significant at the 1% level, with a t value of 3.35. The linear term (β_1) is also significant at the 1% level, with the quadratic term (β_2) significant at the 5% level; the constant term (β_0) is not significant. However the relationship does not explain all the variation in the data: the residual deviance is 62.63 on 30 degrees of freedom, significant evidence of lack of fit at a level of 0.043%, if asymptotic theory is an adequate guide.
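A sketch of fitting the cubic (6.75) is given below, assuming the statsmodels GLM interface; the positive counts, group sizes and rainfalls are invented placeholders rather than the values of Table A.22, so the output will not reproduce the deviance of 62.63 quoted above.

```python
import numpy as np
import statsmodels.api as sm

# invented placeholder data: positives, sample sizes, annual rainfall in mm
positive = np.array([2, 3, 1, 3, 2, 8, 7, 15, 4, 6])
n = np.array([4, 5, 2, 8, 6, 12, 11, 30, 9, 10])
rain = np.array([1735., 1800., 1750., 2000., 1973., 1936., 2077., 1780., 1830., 1920.])

z = (rain - rain.mean()) / rain.std()              # standardized rainfall
X = sm.add_constant(np.column_stack([z, z**2, z**3]))
fit = sm.GLM(np.column_stack([positive, n - positive]), X,
             family=sm.families.Binomial()).fit()  # logit is the default link
print(fit.params)
print(fit.deviance)   # residual deviance on (number of groups - 4) degrees of freedom
```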
Figure 6.42 is the plot of residuals from the forward search. This figure
clearly shows that four units (34, 14, 19 and 23) have very large negative
residuals until m = 30. But some of these signs change when m = n = 34:
units 23 and 34 have positive deviance residuals of 1.39 and 0.13, whereas
unit 19 has a small negative deviance residual of -0.37. In the upper part
of Figure 6.42 we can detect three units (30, 27 and 29) that show deviance
residuals always above 3 in the central part of the forward search. However,
in the last step of the forward search unit 29 has a deviance residual that
is equal to only 0.22. We may thus expect some problems for backwards
methods due to masking. These are shown in the next part of this section.
Finally, in the plot of Figure 6.42 we can see two units (7 and 21) whose
negative residuals (less than -2) remain virtually constant in all steps of
the forward search.
Figure 6.42. Toxoplasmosis data with logistic link: deviance residuals as the subset size increases in the forward search

To interpret the remaining plots we give in Table 6.7 the order in which
the forward search causes the observations to enter the fit. Also given is
the estimate of the dispersion parameter φ. Of course, for binomial data, we hope for a value around one.

Table 6.7. Last steps of the forward search - subset size, observation introduced and estimate of dispersion parameter φ

m            34     33     32     31     30     29     28     27     26
Obs.         14     34     19     23     30     21     27     29      7
φ̂_s^{(m)}   1.94   1.73   1.76   1.72   1.64   1.33   1.05   0.92   0.76

We notice that the observations entering at the end of the search are
precisely those identified as different by the forward plot of residuals.
The plot of the goodness of link test, in Figure 6.43(left), shows that when observations 14, 34 and 19 are excluded, the statistic is almost significant at the 5% level, having a value of -1.92. Adding the last three observations causes the plot to move in quite a different direction, the final value being 1.58. The plot of the deviance in Figure 6.43(right) shows that
the forward search has ordered the data up to m = 30, but that the last
four observations seem to be outliers - the smooth shape of the curve is lost.
The influence of the last four observations is shown by the
plot of the Cook statistic in Figure 6.44(left). The implication of the peak at
m = 31 is that addition of observation 23 causes a significant change in the
Figure 6.43. Toxoplasmosis data: (left) goodness of link test and (right) residual deviance, showing effect of the last four observations to enter

Figure 6.44. Toxoplasmosis data: (left) Cook's distance showing the effect of including observation 23 when m = 31 and (right) the t statistics for the individual parameters

parameter values. The values of the statistic for larger m are small because
the introduction of the remaining observations reinforces the change in
the parameter values signalled by the Cook statistic. These changes are
most easily seen by looking at the plot of the individual t statistics in
Figure 6.44(right). The statistics for the linear and cubic terms remain
sensibly unchanged for most of the search. But those for the other two
terms change sign and become less significant in the last five steps of the
search.
These results have a straightforward interpretation if we go back to the
data as plotted in Figure 6.45. The solid line shows the cubic fit using all
data; the line with short dashes shows a cubic fit without observations 23,
19, 14 and 34 (the last four in the forward search). These four observations
form a group with the highest rainfall and are clearly all influencing the
Figure 6.45. Toxoplasmosis data: proportion testing positive versus rainfall (mm) for 34 cities in El Salvador (logit link). o = observed proportion. Solid line: fitted cubic using all the observations (m = 34); short dashes: fitted cubic when m = 30; long dashes: fitted cubic when m = 29

shape of the cubic curve in the same way, lessening the curvature at the
second point of inflection. The first of the four to be included is 23. Once
it has been included the other points do not greatly change the shape of
the curve, which explains the values of the Cook statistic in Figure 6.44.
When all are included observation 34 is virtually on the fitted curve. But
when m = n - 4 this observation has a deviance residual of -12.9. This
dramatic change can be seen in Figure 6.42. The last observation to be
considered is 30, which enters immediately before this group of four. The
effect resulting from its additional deletion, shown by the curve with long
dashes in Figure 6.45, is to reduce the curvature of the fitted cubic model.
It may seem surprising that observations 5 and 10 do not have a similar
effect, but they are for 2 and 10 subjects, whereas observation 30 is from
75.
Deletion of these five observations has other beneficial effects. The resid-
ual deviance is 36.42 on 25 degrees of freedom, still perhaps some evidence
of lack of fit, if asymptotic theory is a good guide, but a decided improve-
ment on the previous value. Deletion of one further observation gives a
value of 1.05 for φ̂_s^{(28)}, removing any evidence of that overdispersion which
caused Firth (1991) to wonder whether the model was appropriate. Of
course, to remove observations solely to achieve a small deviance is not
likely to lead to a correct model for the data. But our results show how
many aspects of model building and criticism come together once the obser-
vations have been ordered by the forward search. As one further example,
the t statistics for the parameters in Figure 6.44(right) are reasonably stable
up to m = 29.

Table 6.8. Toxoplasmosis data: the last stages of the forward search, with residual
deviances, for the logistic and complementary log log links.

Subset      Logistic                        Complementary log log
Size m      Obs. Entering    Deviance       Obs. Entering    Deviance
26           7               18.97          21               22.67
27          29               23.55          24               26.57
28          27               27.75          27               34.25
29          21               36.42          20               43.80
30          30               45.96          13               46.23
31          23               50.18          23               50.63
32          19               53.03          19               53.06
33          34               54.20          34               54.26
34          14               62.63          14               62.43

In the earlier analyses in this chapter we compared analyses using the logistic link with those using the complementary log log. We do the same
here, but find little difference between the two. However interpretation of
the complementary log log analysis does reinforce some of the points about
the interpretation of changes in forward plots.
Table 6.8 compares the last nine steps of the forward search for the two
links. For the last four steps the residual deviances are indistinguishable as
the four observations in the group at the high values of rainfall are entered.
But, for the two preceding steps the complementary log log link has an
appreciably higher value of the deviance. Forward plots help to interpret
this.
The plot of the goodness of link test for the model, Figure 6.46(left),
shows fluctuations as observation 20 is introduced, as does the plot of the
Cook statistics in Figure 6.46(right). This now has two peaks, compared
with the one of Figure 6.44. The later peak for both corresponds to the
effect of observation 23, the first of the observations with a high value of x.
It might then be expected that 20 is the first of a group of two observations
which are introduced into the search. The plot of the data and fitted model
with complementary log log link in Figure 6.47 shows that this is the case:
observations 20 and 13 have similar x values, the two lowest rainfalls.
Other plots for the complementary log log link show similar fluctuations
towards the end. For both links our analysis identified 4, 5 or 6 observa-
tions that are distorting the model. Once these are removed the terms in
the model are more significant; the parameter estimates hardly fluctuate
during the forward search and the smooth increase in the residual deviance
indicates agreement between model and data. A further appealing feature
Figure 6.46. Toxoplasmosis data with complementary log log link: (left) goodness of link test and (right) Cook's distance showing the effect of including observation 20 when m = 29 and then 23 when m = 31

Figure 6.47. Toxoplasmosis data with complementary log log link: proportion testing positive versus rainfall for 34 cities in El Salvador. o = observed proportion. Solid line: fitted cubic using all the observations (m = 34); short dashes: fitted cubic when m = 30; long dashes: fitted cubic when m = 28
Figure 6.48. Toxoplasmosis data, logistic link: normal plot of deletion residuals against quantiles of the standard normal, with simulation envelope

of our analysis is that the observations form a clear subset when the data
are appropriately plotted.

6.16.2 Comparison with Backwards Methods


The structure we have discussed was found in the data by a forward search.
It is important and informative to compare our results with those of Lee and
Fung (1997) who used a method derived in part from conventional back-
wards diagnostics. A typical starting point for this approach is the normal
plot of deletion residuals shown in Figure 6.48, which are the signed square
roots of the changes in deviance as each observation is deleted in turn.
In order, the five observations with largest absolute residuals are 27, 30,
14, 21 and 28. These observations were investigated by Lee and Fung for
outlyingness and influence. As the plot of the related deviance residuals
in Figure 6.42 shows, there is appreciable evidence of masking, shown by
the rapid change in the plot in the final stages of the forward search. Fig-
ure 6.48 corresponds to a normal plot of the residuals in Figure 6.42 when
m = 34, with allowance made for individual leverages. It is therefore not at
all surprising that Lee and Fung failed to identify the group of four influ-
ential observations that we found for high rainfall. In contrast, the points
they investigated form no particular pattern in plots of the data. A final
feature of Figure 6.48 that is of interest is that the plot on its own looks
curved, but arguably not too far from what might be expected. However,
superimposition of the simulated envelope from 100 simulations shows that
there is some systematic lack of fit , with too many negative residuals. While
the backward method alerts us to the fact that something might be wrong,
it does not provide any suggestions as to what needs improving.
Finally, we discuss Figure 6.49 which shows the leverage in each step of
the forward search. The last four units to enter (23, 19, 34 and 14) each
have high individual leverage when the other units of the group are absent.
Figure 6.49. Toxoplasmosis data, logistic link: curves of the leverage in each step of the forward search. The leverage for the unit that joins the subset in the last step is denoted with a filled square

However, the presence of one unit of the group causes the second unit to
enter to have reduced leverage. Thus, as this plot clearly shows, units 23
and 19 are included with a leverage equal to 0.86 and 0.39, respectively.
In the final step, however, their leverage is simply equal to 0.12 and 0.08.
Unit 34 comes in with a leverage equal to 0.77, much bigger than that for
observation 14 (the last to enter). This explains why the curves for these
two units in Figure 6.42 cross in step n-3 = 31. These comments show that
an analysis of leverage at the last step can be highly misleading if there is a
cluster of outliers. The results also agree with those for the Cook distances
in Figure 6.44, where the inclusion of observation 23 has the largest effect.

It seems to us that a comparison of the forward and backward analyses provides a strong confirmation of the power of our procedure.

6.17 Binary Data


6.17.1 Introduction: Vasoconstriction Data
Binary data are binomial data with each n_i = 1. Superficially the analysis is similar to that for binomial data with larger values of n_i. However the
binary nature of the response leads to problems that do not arise for bino-
mial data. We begin with a straightforward analysis of some binary data

Figure 6.50. Vasoconstriction data: scatter plot showing regions of zero response
o and unit response •. Observations 4 and 18 are surrounded by zero responses
and observation 24 is on the boundary of the two response classes. Observations
17 and 32 are well within the correct regions for their response

and then discuss features that make the analysis different from our earlier
analyses of binomial data.
The vasoconstriction data are in Table A.23. There are n = 39 readings
on the occurrence (20 times) or nonoccurrence of vasoconstriction and two
explanatory variables, the volume (x1) and the rate (x2) of air inspired,
that is, breathed in. The data, from Finney (1947), were used by Pregibon
(1981) to illustrate his diagnostics for logistic regression. Other analyses are
given, for example, by Aitkin et al. (1989, pp. 168-179), and by Fahrmeir
and Tutz (1994, p. 128).
Figure 6.50 is a scatterplot of the data against the explanatory variables
which also shows the two classes of response. Observations 4 and 18 are in
an area of otherwise zero responses. Pregibon found that these two observa-
tions were not well fitted by a linear predictor in the logs of the explanatory
variables. They do indeed stand out in the plot of the deviance residuals
against fitted values in Figure 6.51. However what is most striking about
the figure is its structure: it consists of two decreasing bands of points,
the upper being the residuals for observations with value one, the lower
for those equal to zero. This plot is more extreme than Figure 6.25 for the
Poisson distributed train data in which there were also bands for each value
of the discrete response. Both are quite distinct from plots of residuals for
normal data.
It is interesting to identify the observations with extreme residuals in
Figure 6.51 with their positions in the scatter plot of Figure 6.50. Obser-
vations 4 and 18 stand out clearly in both figures. In addition, observation
24 is the zero response with highest expected value and is on the edge

Figure 6.51. Vasoconstriction data: deviance residuals against fitted values, that
is, estimated probabilities π̂. The plot is quite unlike those from normal theory
regression. Inspection of Figure 6.50 shows that observations 4 and 18, with
response 1, are in a region of zero response

of the region of zero responses. Apart from these observations it looks as
if the data may be divided into two groups by a line in the plane of the
variables so that responses of zero and one are completely divided. Such
a model would fit perfectly. Observations 17 and 32 are at the other ex-
treme. They perhaps look outlying in the scatter plot of Figure 6.50, but
they are in fact in regions of the explanatory variables where the responses
are, respectively, firmly one and zero.
As a last introductory plot we look at the estimated probabilities. Fig-
ure 6.52 shows the data and the fitted model using the logistic link plotted
against the linear predictor. The relationships with the residual plot (Fig-
ure 6.51) and scatter plots (Figure 6.50) are evident. Observations 4 and
18 are the two observations with unit response and the lowest value of the
linear predictor. Observation 24 is the value with zero response and the
highest value of the predictor. The residual deviance is 29.33 on 36 degrees
of freedom, which might be taken to suggest agreement of the model and
data.
We now consider some of the implications of this example for the general
analysis of binary data.

6.17.2 The Deviance


Figure 6.52. Vasoconstriction data: observations and the estimated probabilities
π̂ from the fitted logistic model. There is appreciable overlap in the groups with
zero and unit response

The test of the residual deviance of the vasoconstriction data seemed to
indicate that the logistic model fitted adequately. However the residual
deviance from a model fitted to binary data is completely noninformative
about the fit of the model (McCullagh and Nelder 1989, p. 121; see also
our Exercise 6.7). Despite the failure of the residual deviance, differences
in deviance are still available to test the effect of the addition of terms to
the linear predictor. We could, for example, test the fit of the first-order
model by adding quadratic and interaction terms in the two variables.
This causes a reduction in deviance of 2.19 to be compared with the chi-
squared distribution on three degrees of freedom. No inadequacy of the
model is evident. Testing of such nested models using reductions in deviance
together with the inspection of plots of fitted models therefore provide
important ways of checking a proposed model. A corollary is that, if for
binomial data many of the ni are small, the distribution of the residual
deviance may be far from the nominal X2 value. If the value of the deviance
for binomial data with some small ni is important, its distribution should
be checked by simulation.
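As an illustration of testing by differences in deviance, the sketch below fits the first-order and the augmented logistic models and refers the drop in deviance to a chi-squared distribution. It is only a sketch: the statsmodels package and the array names volume, rate and y are our assumptions, not part of the analysis above.

```python
# Sketch: testing nested binary-data models by the reduction in deviance.
import numpy as np
import statsmodels.api as sm
from scipy import stats

def fit_logistic(X, y):
    """Fit a binomial GLM with the default logit link."""
    return sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()

def deviance_test(y, X_small, X_big):
    """Reduction in deviance between nested models and its chi-squared p-value."""
    small = fit_logistic(X_small, y)
    big = fit_logistic(X_big, y)
    drop = small.deviance - big.deviance
    df = X_big.shape[1] - X_small.shape[1]
    return drop, stats.chi2.sf(drop, df)

# First-order model in log(volume) and log(rate) against the model with added
# quadratic and interaction terms (three extra degrees of freedom):
# x1, x2 = np.log(volume), np.log(rate)
# X1 = np.column_stack([x1, x2])
# X2 = np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])
# drop, pval = deviance_test(y, X1, X2)
```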

6.17.3 The Forward Search for Binary Data


In general the numbers of zero and one responses will not be equal. Sup-
pose that zeroes predominate. Then a fit just to observations with a zero
response will exactly fit enough of the data to give a value of zero for the
residual used in assessing the least median of deviances fit. We therefore
need to modify the search to avoid including only observations of one kind.

This requires modification both of the initial subset and of the progress of
the search.
We modify the initial subset so that it is constrained to include at least
one observation of each type. A perfect fit to one kind of observation is
thus avoided at the beginning of the search. To maintain a balance of
both kinds of response during the search we balance the ratio of zeroes
and ones in the subset so that it is as close as possible to the ratio in the
complete set of n observations. Given a subset of size m we fit the model and
then separately order the observations with zero response and those with
response equal to one. From these two lists of observations we then take the
m0 smallest squared residuals from those with zero response and the m1
smallest squared residuals with unit response such that m0 + m1 = m + 1 and
the ratio m0/m1 is as close as possible to the ratio n0/n1 in the whole set of
observations, where n0 + n1 = n. In the vasoconstriction data the numbers
of zero and one responses are as equal as they can be for n = 39. After the
initial stages the forward search, therefore, alternately adds observations
with zero and unit responses. The quantities we monitor are those described
in §6.6.5 with one exception. Given that in binary data y = 0 or y = 1, we
found it useful to monitor separately the maximum residual for the units
whose response is equal to one and those whose response is equal to zero.
More precisely, in every step we monitored

r[m]^q = max | r_{i,S*^(m)} |   for i ∈ S*^(m) with y_i = q,   q = 0, 1 and m = p + 1, ..., n.     (6.76)
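One way of realizing the balanced selection step described above is sketched below; the function name and the rounding rule used to fix m1 are our own choices, not a prescription from the text.

```python
# Sketch of the balanced subset-selection step: given squared residuals for all
# n units from the fit on the current subset, choose the m+1 units for the next
# step so that the ratio of zero to unit responses stays close to n0/n1.
import numpy as np

def balanced_subset(squared_resid, y, m_new):
    """Indices of the m_new units forming the next subset of the forward search.

    squared_resid : squared (deviance) residuals from the current fit
    y             : 0/1 responses
    m_new         : size of the new subset (m + 1 in the text)
    """
    n = len(y)
    n1 = int(y.sum())
    # target number of unit responses, kept between 1 and m_new - 1
    m1 = int(round(m_new * n1 / n))
    m1 = min(max(m1, 1), m_new - 1)
    m0 = m_new - m1
    ones = np.where(y == 1)[0]
    zeros = np.where(y == 0)[0]
    # order each response class separately by squared residual
    keep1 = ones[np.argsort(squared_resid[ones])[:m1]]
    keep0 = zeros[np.argsort(squared_resid[zeros])[:m0]]
    return np.concatenate([keep0, keep1])
```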

6.17.4 Perfect Fit


Perfect fit to both zero and unit responses together occurs when one of the
explanatory variables Xj divides the data into non-overlapping groups with
all responses zero in one group and all one in the other. More generally it
occurs when there is a direction βᵀx in which such division can occur as it
does for the vasoconstriction data.
In discussing Figure 6.50 we commented that it appeared as if the removal
of observations 4, 18 and 24 would cause a division of the data into two
groups separated by a straight line. In Figure 6.53 we show the results
of fitting a linear logistic model to the vasoconstriction data when these
three observations are excluded from the fit. The residual deviance is zero
to within numerical accuracy (a point discussed later) and the model fits
perfectly. In the figure the three excluded observations are crossed through.
Comparison with Figure 6.52 shows that a linear predictor has been found
which causes separation of the two groups. It is also noteworthy that the
numerical value of the linear predictor is much larger than it was, so that
all estimated probabilities are indistinguishable from zero or one.
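Whether a separating direction of this kind exists can also be checked directly, since strict separation is a linear programming feasibility problem. The sketch below makes that check; the formulation is our own illustration and is not part of the fitting procedure described here.

```python
# Sketch: the data are strictly separable if some (b, b0) satisfies
# s_i (x_i'b + b0) >= 1 for all i, with s_i = +1 when y_i = 1 and -1 when
# y_i = 0. Feasibility of this system can be checked with a linear program.
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """True if a hyperplane strictly separates the y = 0 and y = 1 points."""
    s = np.where(y == 1, 1.0, -1.0)
    A_ub = -s[:, None] * np.column_stack([X, np.ones(len(y))])   # -s_i [x_i, 1]
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.status == 0   # a feasible point means a separating direction exists

# For the vasoconstriction data this check should fail for all 39 observations
# but succeed once units 4, 18 and 24 are removed.
```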

Figure 6.53. Vasoconstriction data: an example of perfect fit. Observations 4, 18
and 24 have been deleted. Comparison with Figure 6.52 shows there is no longer
any overlap between the two groups of responses

It is clear from the figure that the slope of the fitted model between the
two groups of observations can be arbitrarily large without significantly
affecting the fit of the model. The phenomenon is not restricted to binary
data, but becomes increasingly unlikely for binomial data as the numbers
of observations per group ni increase. For such perfect fits McCullagh and
Nelder (1989, p. 117) comment that although the parameter estimates go to
infinity, the estimated probabilities are well behaved and calculation of the
deviance converges. They do not mention the t statistics for the parameter
estimates in the linear predictor, which are shown by Hauck and Donner
(1977) to approach zero as the fit improves, even though the parameter
estimates themselves go to infinity. Venables and Ripley (1997, p. 237)
describe this property of the t statistics as "little-known." It is however
related to the asymptotic properties of the Wald test for the difference
β̂k − βk.
In multiple regression we stated in §2.1.3 that the t test for βk = 0 was
the signed square root of the F test from the difference in the residual sum
of squares when 13k is and is not included in the model. The standard large
sample result is the asymptotic equivalence of the likelihood ratio test for
a parameter and the square of the Wald test, which calculates the ratio
of β̂k to its standard error. This is the t test that we have used in this
chapter for, for example, our goodness of link test. In some cases this t test
for individual parameters can be very sensitive to the parameterization of
the model, when the likelihood ratio and Wald tests can give very different

Figure 6.54. Vasoconstriction data: plot of deviances showing perfect fit. The
residual deviances are zero for most of the forward search: (left) balanced search
and (right) unbalanced search

results. A discussion in the context of generalized linear models is given by
Væth (1985) and Mantel (1987).
One way in which the problem can arise is when we are at, or near, a
perfect fit. Such perfect fits may be encountered during the forward search,
as they are in the vasoconstriction data. We saw in Figure 6.51 that obser-
vations 4, 18 and 24 had the most extreme residuals. We would therefore
expect that, in the absence of masking, they would be the last three obser-
vations to enter the forward search and, indeed, this is what happens when
we use a balanced search. As we saw from Figure 6.53, the absence of these
three observations leads to a perfect fit and so a deviance close to zero. The
plot of the deviance from the balanced forward search in Figure 6.54(left)
shows that the deviance is virtually zero until these three observations en-
ter. Also shown in the figure is the deviance from an unbalanced search.
Since we do not force balance in this search, the zero observations, in this
case, all enter first, followed by those for which the response is one. The
last five observations to enter are 39, 31, 29, 18 and 4, the five observations
with largest positive residuals in Figure 6.51. There is also a perfect fit in
this search. Although it is not clear from Figure 6.54(right), the perfect fit
is broken when the first of these five observations enters (Exercise 6.11).
However, as the figure does show, the first large upward jump occurs when
the third of these observations enters.
In the remainder of this chapter we focus our analysis of the vasoconstric-
tion data on the discrepancy between the t tests and the likelihood ratio
test based on the deviance. In using the forward search for data that are
near to a perfect fit we identify the few observations which are not perfectly
fitted, which enter at or near to the end of the forward search, depending
on whether the search is balanced. As these observations are included the
values of the t statistics for the parameters of the linear predictor increase

from zero as the perfect fit is destroyed. An unexpected result is the high
correlation of the statistics for different variables.
In the next section we consider the rate of convergence to zero of the
statistics as the parameter estimates go to infinity. The key is an analysis
of the limiting behaviour of the weights in the iterative fitting of generalized
linear models. In particular we analyse the arcsine link and show that the
t statistics for this link converge more slowly to zero than they do for the
logistic and complementary log log links. We argue that the new link should
also produce larger t statistics near the perfect fit, which statistics will
therefore be more in agreement with the deviance. They should also have
a reduced correlation. Analysis of our examples confirms and quantifies
the extent of improvement that can be obtained using the arcsine link as
opposed to the logistic or complementary log log. The arcsine link also
seems to give better agreement between t statistics and deviances in an
example in which the fit is far from perfect.

6.18 Theory: The Effect of Perfect Fit and the Arcsine Link

The arcsine link was defined in §6.1.3 as

π = (1 + sin η)/2   if −π/2 ≤ η ≤ π/2
π = 1               if η ≥ π/2                                     (6.77)
π = 0               if η ≤ −π/2

where we write π rather than θ to stress that we are considering probabil-
ities and want to avoid confusion with 3.14159.... Since the magnitudes
of those values of η for which |η| > π/2 do not affect the estimated prob-
abilities, we can expect that the link may be robust to extreme values of
the linear predictor caused by being close to a perfect fit. As we show, a
consequence is that use of the arcsine link gives better agreement between
t tests and those based on differences of deviance. To establish this result
we analyze the implications of a perfect fit on several link functions.
Maximum likelihood estimates of the parameters β in generalized linear
models can, as we showed in §6.4, be obtained by iterative weighted least
squares. For binomial data the weights are given by

W = diag { (dπi/dηi)² / (πi(1 − πi)) }.     (6.78)

The behaviour of these weights, especially for large values of |βk|, is central
to our analysis. In the vasoconstriction data the perfect fit was obtained as
a linear combination of the parameters went to infinity. In such cases we
assume that a linear transformation of the carriers Xij has been made such

that some new parameter can, without loss of generality, be considered as
going to infinity.
The links most used for binary regression models are the logit, probit
and complementary log log (cloglog). Since, as Figure 6.33 showed, the
logit link is very close to the probit, we only consider the logit link. In the
logit and complementary log log links a perfect fit implies that at least one
|βk| → ∞. But, when |βk| → ∞ every unit affected by βk is exactly fitted
with probability 0 or 1. As we showed in the previous section, the occurrence
of perfect fit in binary regression models is not just of theoretical interest.
It is therefore important to have a link in which units fitted perfectly can
coexist with units whose estimated probability lies inside the interval (0,1).
The arcsine link satisfies this requirement since we can have a perfect fit
even if βk does not tend to infinity because it is sufficient that |ηi| ≥ π/2
for all i = 1, ... ,n. Thus, in the arcsine link, the presence of a unit that is
perfectly fitted by the model is compatible with the presence of units that
are not so fitted. We can therefore expect that the arcsine link will behave
better than the logit and complementary log log links when the model is
close to a perfect fit. In the remaining part of the section we formalize this
intuition by proving that the rate of convergence to zero of the t statistic
from the arcsine link is slower than those from the logit and complementary
log log links.
The t statistic for variable k (k = 1, ..., p) can be expressed as

tk = Σ_{i=1}^n aki zi / √{(XᵀWX)⁻¹_kk},     (6.79)

where aki is the ith element of the kth row of the matrix (XᵀWX)⁻¹XᵀW
and (XᵀWX)⁻¹_kk is element (k, k) of the matrix (XᵀWX)⁻¹. In a situation
of perfect fit the elements aki remain bounded and the rate of convergence
to infinity of the denominator simply depends on the weights. Substitution
in equation (6.79) of the expression for Zi found in (6.45), followed by
consideration of only those terms which determine the rate of convergence
of the t statistic, shows that we can write the statistic as

tk ≈ Σ_{i=1}^n {ηi + (yi − πi) dηi/dπi} / √{Σ_{i=1}^n πi(1 − πi)(dηi/dπi)²}.     (6.80)
Using the appropriate weights for each link (given in Table 6.9) we can
analyze the rate of convergence to zero of the t statistics for the different
links.
The results are reported in Table 6.10. Several points arise:
1. The rates of convergence for the t statistics for the logit and the
arcsine links do not depend on whether η → +∞ or η → −∞ (since
the links are symmetrical). However, for the complementary log log
link we have two different rates of convergence;

Table 6.9. Weights for different links

Logit:    exp(ηi) / {1 + exp(ηi)}²
Cloglog:  exp(−exp(ηi)) exp(2ηi) / {1 − exp(−exp(ηi))}
Arcsine:  1 if |ηi| ≤ π/2;  0 otherwise

2. While these rates of convergence for the logistic and complementary
log log links can be derived simply by letting η → ∞ in equation
(6.80), a more delicate analysis is required for the arcsine link. In the
theoretical case in which η → ∞ the numerator of (6.80) also goes
to ∞ and so the t statistic for the arcsine link tends to ∞ (with rate
√n). However we are likely to observe a perfect fit in practice with
the arcsine link when π/2 < |η| < +∞, so that the numerator does
not go to ∞. Then the t statistic tends to zero with rate 1/√n;

3. The t statistics from the arcsine link tend to zero at a slower rate than
those based on the other two links. This implies that, in a situation
close to a perfect fit, the t statistics associated with the arcsine link
are to be trusted more than those based on the logit or complementary
log log links. We present some numerical evidence that this is so.

Table 6.10. Rate of convergence of t statistics for different links (η → ∞)

Logit η → ±∞:      o(√{η² / exp(|η|)})
Cloglog η → +∞:    o(√{η² exp(2η) / exp(exp(η))})
Cloglog η → −∞:    O(√{η² exp(η)})
Arcsine η → ±∞:    O(1/√n) if |η| > π/2 and |η| < ∞;   O(√n) if |η| → ∞

Other aspects of our analysis of the weights also have practical impli-
cations. The rate of convergence 1/√n of the t statistics associated with
the arcsine link simply depends on the existence of the threshold and not
on the particular characteristic of this link. This follows since if π̂ = 1 or
π̂ = 0 when |η| > γ (for some threshold γ < ∞), dπ̂/dη = 0 for |η| > γ and
the denominator of equation (6.80) goes to ∞ with speed √n.

Table 6.11. Vasoconstriction data. Logistic link: t tests and deviances for models
fitted to the original data and with observations 4 and 18 changed from 1 to 0

                Log(Volume)   Log(Rate)   Deviance Explained
                t1            t2
Original Data   2.78          2.48        24.81
Modified Data   1.70          1.80        46.47

In this chapter among all possible links with a threshold we have cho-
sen the arcsine link due to the structure of the weights. As emphasized by
Pregibon (1981 , page 712), the weights in fitting generalized linear models
(6.37) are not fixed, as in weighted least squares regression, but are deter-
mined by the fit. Table 6.9 shows that in a situation in which no unit is
fitted perfectly the weight given by the arcsine link to each unit is constant
and equal to one. As equation (6.27) clearly shows, the matrix of weights
affects all variables in the columns of X in the same way. The presence of
a few dominant weights will therefore tend to cause correlation among the
t statistics for the βk. By using the arcsine link, with its constant value of
the weights away from a perfect fit, we expect to reduce this effect. There
is thus an advantage to be expected from this link even far from a perfect
fit. A final point about the weights for the arcsine link in Table 6.9 is that
they are similar to those which occur in robust estimation of location us-
ing redescending M estimators described, for example, by Hoaglin et al.
(1983).
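The weights of Table 6.9 are easily written down as functions of the linear predictor, which makes the contrast between the links concrete: the logit and complementary log log weights collapse towards zero as |η| grows, while the arcsine weight is constant at one inside its threshold. A minimal sketch:

```python
# Sketch: the iterative weights of Table 6.9 as functions of eta, written
# directly from w = (d pi/d eta)^2 / {pi (1 - pi)}.
import numpy as np

def weight_logit(eta):
    return np.exp(eta) / (1.0 + np.exp(eta)) ** 2

def weight_cloglog(eta):
    return np.exp(-np.exp(eta)) * np.exp(2.0 * eta) / (1.0 - np.exp(-np.exp(eta)))

def weight_arcsine(eta):
    # constant weight inside the threshold, zero outside it
    return np.where(np.abs(eta) <= np.pi / 2, 1.0, 0.0)

eta = np.array([0.0, 2.0, 5.0, 10.0])
# weight_logit(eta) and weight_cloglog(eta) shrink rapidly towards zero,
# while weight_arcsine(eta) is 1 for |eta| <= pi/2 and 0 beyond it.
```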

6.19 Vasoconstriction Data and Perfect Fit


The plot of the vasoconstriction data in Figure 6.50 showed cases 4 and 18,
both with one as response, surrounded by cases with zero as response. It is
to be expected that if these two observations are switched from one to zero
the fit of the model will improve. We begin our numerical investigation
of the effect of near perfect fit on t values and deviances by comparing
analyses of the original data with data modified by making this exchange.
Table 6.11 shows that modifying the data has caused an appreciable in-
crease in the deviance explained by the model from 24.81 to 46.47. Since
this is not a residual deviance but the difference between the residual de-
viance for the fitted model and the null model with just a constant, the
values do have a meaning. Despite this increase in explained deviance, the
effect of the modification is to reduce the values of both t statistics. Thus
the information from the t statistics is in conflict with that from the ex-
plained deviance. If the two variables were orthogonal we would expect that
the sum of the squared t values would be close to the value of the explained

Figure 6.55. Vasoconstriction data: the effect of perfect fit on the model with a
logistic link; (left) t statistics; (right) deviance residuals in the last four steps of
the forward search

deviance. This is more nearly so for the original data. A subsidiary point
is that, for each fit, the two t statistics have very similar values.
In these calculations, as in all others on binary data, we have standard-
ized the explanatory variables to have zero mean and unit variance. In
addition the constant term in the model is taken as 1/√n. These scalings
are not important for t values, other than that of the intercept, but are
important when we come to compare parameter estimates.
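A minimal sketch of this scaling, assuming the carriers are held in a numpy array X:

```python
# Sketch: standardized carriers with a 1/sqrt(n) constant column prepended.
import numpy as np

def scaled_design(X):
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([np.full(n, 1.0 / np.sqrt(n)), Z])
```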
Figure 6.55(left) shows a plot of the values of the t statistics during the
forward search through the vasoconstriction data. As with other plots of t
statistics for binary data, we have added 95% confidence intervals. Because
we have balanced the search the last three observations to enter are 18,
24 and 4. The plot shows that without these observations the t values are
effectively zero: the data give a perfect fit, the parameter estimates can
be arbitrarily large and the probabilities are estimated as being close to
zero or one. The actual values of the parameter estimates depend upon the
convergence criterion used. We have used a value of 10^-8 for the change in
the deviance. The actual value of the deviance for the fitted model, which
theoretically is zero, increases from this value of 10^-8 to 10^-5 during this
forward search and the values of the t statistics increase slightly, as shown.
For a numerically perfect fit they would be zero.
Figure 6.55(right) shows the deviance residuals for the last four steps of
the forward search. When m = 36 there is a perfect fit and all residuals of
observations in this subset are zero; only those outside (4, 18 and 24) are
nonzero, with large values. As soon as the perfect fit is destroyed, that is,
when m = 37, the deviance increases and there are many nonzero residuals.
In fact the major change in the pattern of residuals occurs at this point, the
increasing number of appreciably negative residuals explaining the increase
in deviance in Figure 6.54(left).
Figure 6.56 shows how the numerical values of the parameter estimates,
in the presence of perfect fit, depend on the value of the convergence cri-

Figure 6.56. Vasoconstriction data, perfect fit: dependence of the magnitude of
the parameter estimates on the convergence criterion. The parameter estimates
on the right are roughly √2 times those on the left

terion. In Figure 6.56(left) the tolerance on the deviance is 10^-4. In the
right-hand panel it is 10^-8 and the parameter estimates for perfect fit are
approximately multiplied by √2. If the convergence criterion were a toler-
ance on the estimates, rather than on the deviance, we would expect the
parameter estimates to be doubled.
Comparison of Figure 6.55(left) with Figure 6.56 shows that once there
is no longer perfect fit, the value of the parameter estimates decreases
markedly at the same time as the t values move away from zero. The
important features of Figure 6.55 are the jump from approximately zero to
around 2.5 in the values of the t statistics as soon as the model no longer
fits perfectly, the almost constant value of the statistics and the similarity
in values of those for the two explanatory variables. We note that it would
be difficult to identify the importance of the three observations by the use
of standard techniques of regression diagnostics.
We now briefly reanalyse the vasoconstriction data using the arcsine
link. Table 6.12 repeats the calculations of Table 6.11. Comparison of the
two tables shows that the behaviour of the two t statistics is in line with
our theoretical results. For all the data both statistics increase markedly
in value although the deviance explained by the model does not change
appreciably. There is thus improved agreement between inferences based
on the t statistics and that using the explained deviance. For the nearly
perfect fit of the model after the data have been modified, all values in
the table are slightly larger than those in Table 6.11, but not enough to
overcome the inferential problems with near perfect fit.
The similarity in t values for the two variables when the logistic link is
used with all the data suggested to Aitkin et al. (1989, p. 176) that only
one variable was required. The difference between the two t values when
the arcsine link is fitted calls this conclusion into question. However, fitting
the sum of the variables (which are logged, so that the variable is the log of
the product) using the arcsine link gives a t value of 4.23 and an explained

Table 6.12. Vasoconstriction data. Arcsine link: t tests and explained deviances
for models fitted to the original data and with observations 4 and 18 changed
from 1 to 0

                Log(Volume)   Log(Rate)   Deviance Explained
                t1            t2
Original Data   3.61          2.79        24.94
Modified Data   1.75          1.93        46.87

deviance of 24.76, as good a fit as the two-variable model of Table 6.12.


With a single explanatory variable the value of the square of the t statistic,
in this case 17.86, should be close to the explained deviance. It is certainly
much closer than it is for the logistic link, for which the square of the t
value, 2.82, is only 7.94, far from the explained deviance of 24.52.

6.20 Chapman Data


These data (Table A.24) on 200 men taken from the Los Angeles Heart
Study, extensively analyzed by Christensen (1990), come from a study on
heart disease by Dr Chapman. The variables are:

x1: age (in years)
x2: systolic blood pressure (millimetres of mercury)
x3: diastolic blood pressure (millimetres of mercury)
x4: cholesterol (milligrams per dl)
x5: height (inches)
x6: weight (pounds)
y: coronary incident (1 if an incident had occurred in the previous 10
years; 0 otherwise).

There are only 26 nonzero responses, for those who had a "coronary
incident" in the last 10 years, and six explanatory variables. The data are
thus far from balanced. Christensen concludes that a model with the three
variables x1, x4 and x6 is adequate to describe the data. The values of
the three t statistics are 2.54, 1.82 and 2.12, with the model explaining a
deviance of 19.1. There is thus not the large discrepancy between t statistics
and deviance that was present in the vasoconstriction data.
Figure 6.57 is a scatter plot of the data showing the two classes of re-
sponse. The predominance of zero responses is evident. Even if we can see
a slight positive connection between the incidence of heart attack and the
three variables, the plot shows that the two responses are intermingled. It

Figure 6.57. Chapman data: scatterplot matrix of data showing zero o and unit
• responses

seems unlikely that we shall find a direction in the three-dimensional space
of the explanatory variables in which, for all the data, the two classes of re-
sponse will be separated. However the forward search finds such directions
until more than half the data are included in the subset.
To see the effect on the fitted model of near perfect fit during part of
the forward search we give in Figure 6.58 the result of a forward search
using the logistic link. Figure 6.58(top) shows the deviance explained by
the model and the bottom shows the t statistics for the four parameters.
These t values show that, using a search which is balanced for numbers
of one and zero responses, we have a perfect fit up to m = 139. The plot
of the t statistics is surprisingly like that in Figure 6.55. As soon as there
is a lack of perfect fit the t statistics, apart from that for the constant,
move close to their values for all 200 observations. The behaviour of the
explained deviance requires clarification.
During the period of perfect fit the explained deviance is the difference
between the residual deviance for the model with just a constant and the
residual deviance from the fitted model. But this latter residual is zero,
so the explained deviance is solely the residual deviance from the constant
model. The estimate of the mean is

ir = mdm,
where ml is the number of unit responses in the subset of size m. The resid-
ual deviance for this model is easily calculated from the general expression
Figure 6.58. Chapman data: forward search with the logit link, showing the effect
of perfect fit up to m = 139; (top) explained deviance; (bottom) t statistics

for the binomial deviance (6.74). Since all ni = 1 and all Yi are zero or one,
we find that
D(β̂) = 2(m log m − m1 log m1 − m0 log m0).     (6.81)
The deviance therefore just counts the number of zero and unit responses
in the subset. But, with a balanced search, these are determined by the
requirement of balance, independently of any other function of the data.
The resulting pattern in Figure 6.58(top) slopes up gently for the addition
of units with zero response, with a series of larger steps upward when
unit responses are included. Once the perfect fit has been broken, the plot
begins to decline as the nonzero residual deviance for the fitted model is
subtracted. The steady downward slope means that there are no highly
influential observations or outliers causing large increases in this residual
deviance. The small jumps in the explained deviance are again caused by
the inclusion of observations with a unit response. The difference between
t tests and deviance is greatest when the data are near a perfect fit. The
values of t1 and of t6 are highly correlated, being virtually indistinguishable
over the range of m for which the fit is not perfect.
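The count-based expression for the null-model deviance above is easy to verify numerically; the sketch below compares it with the deviance computed observation by observation for a split of 26 unit and 174 zero responses, the proportions in the Chapman data.

```python
# Quick check: for binary data fitted with a constant only, the residual
# deviance depends on the sample solely through the counts m0 and m1.
import numpy as np

def null_deviance_counts(m0, m1):
    """2(m log m - m1 log m1 - m0 log m0) with m = m0 + m1."""
    m = m0 + m1
    return 2.0 * (m * np.log(m) - m1 * np.log(m1) - m0 * np.log(m0))

def null_deviance_direct(y):
    """-2 log-likelihood ratio computed observation by observation."""
    p = y.mean()
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.r_[np.ones(26), np.zeros(174)]
assert np.isclose(null_deviance_counts(174, 26), null_deviance_direct(y))
```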
Straightforward use of the arcsine link with Chapman's data leads to a
model with one less variable than the three used above and by Christensen
with the logistic link. For all six potential variables, three of the six t
statistics when fitting the arcsine link have absolute values less than one.
Backwards elimination using these values one variable at a time, in a similar
way to that summarized in Table 3.1, leads to the three-variable linear
predictor used above but with t values t1 = 2.86, t4 = 1.35 and t6 = 1.84,
which are more dispersed than the values for the logistic link, which are
Figure 6.59. Chapman data: forward plot of deviance residuals in the last stages
of the search using an arcsine link and a two-variable model

repeated at the top of Table 6.13. We therefore drop variable 4 to obtain a
model with t values of 3.63 and 2.10 for the coefficients of x1 and x6 which
explains a deviance of 16.66.
We now check this model. Figure 6.59 is the plot of deviance residuals
from the last stages of the forward search. For each value of the subset
size m the values of all 200 deviance residuals are plotted. The extreme
values, near ±6, plotting as horizontal lines, are for observations for which
the linear predictor lies outside ±π/2. As m increases there are fewer such
observations and the residuals decrease. At the end of the search there is
a cluster of residuals with small negative values which are those from the
174 zero responses. The observations with response 1 form a group with
a minimum value of 1.34. The structure in which all the residuals from
y = 1 are comparatively large results from the unbalanced nature of the
data. This is a second example (Figure 6.25 was the first) of a structure
in the residuals that comes from the discrete nature of the observations
and does not necessarily reflect a failure of the model. Such patterns can be
misleading (Christensen 1990, p. 261).
Observation 86 enters at m = 194. As the plot shows, its residual has
started to decrease before this, but the observation seems to be outlying,
even at the end of the search. It is the last observation to enter with a unit
response, but does not enter at m = 200 because we are using a balanced
search. If we delete observation 86, the new residual plot, Figure 6.60,
shows more regular behaviour for all other observations. The two t statistics
increase to 3.98 and 2.68 and the deviance explained by the model to 20.33
from 16.66. Since there are so few responses equal to one, each of them is
likely to have an appreciable effect on inferences. However, once observation
86 has been removed, the t value of X4 is 1.86 and the deviance explained
Figure 6.60. Chapman data without observation 86: forward plot of deviance
residuals in the last stages of the search using an arcsine link and a two-variable
model

by the three-variable model rises to 24.41. To make comparisons with the
logistic link we use this three-variable model, the t values for which are
logistic link we use this three-variable model, the t values for which are
given in the lower part of Table 6.13.
The first comparison is that the t values for the arcsine link are again
more dispersed than those for the logistic, the values for the most significant
value, t1, being respectively 3.27 and 2.54, as shown by the first column
of Table 6.13. Figure 6.61 is a forward plot of these t values. Comparison
with the forward plot for the logistic link in Figure 6.58(bottom) not only
shows larger t values for the arcsine link, but also shows that the correlation
between t1 and t6, although still high, has indeed been reduced by using
the arcsine link rather than the logistic.
Finally we investigate the effect of near perfect fit on the two links by a
simulation based on the fitted models, but in which the slope of the linear
predictor is increased. Interest is in comparing the values of the t statistics
with that of the explained deviance. We use a parameter ψ to increase the
slope. If the linear predictor for observation i is ηi = β0 + xiᵀβ we simulate
from ηi(ψ) = β0 + ψ xiᵀβ, with ψ ≥ 1. Ten thousand sets of 200 or 199 binary
observations were simulated for a range of values of ψ and the t statistics
and deviances explained were calculated. As the value of ψ increased the
distributions of the t statistics became more skewed with occasional outliers
due to nonconvergence. We therefore summarize the distributions in Table
6.13 using medians.
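A sketch of this simulation for the logistic link is given below; the fitted values beta0, beta and the design matrix X are assumed to be available from a previous fit, and the replicate count shown is illustrative rather than the ten thousand used above.

```python
# Sketch: multiply the slope of a fitted linear predictor by psi, regenerate
# binary responses, refit, and record median t statistics over replicates.
import numpy as np
import statsmodels.api as sm

def simulate_t_medians(X, beta0, beta, psi, n_rep=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xc = sm.add_constant(X)
    eta = beta0 + psi * (X @ beta)
    prob = 1.0 / (1.0 + np.exp(-eta))
    tvals = []
    for _ in range(n_rep):
        y = rng.binomial(1, prob)
        try:
            fit = sm.GLM(y, Xc, family=sm.families.Binomial()).fit()
        except Exception:        # skip replicates where the fit breaks down
            continue
        tvals.append(fit.tvalues[1:])     # drop the intercept
    return np.median(np.array(tvals), axis=0)
```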
Each row of the table shows that the t values initially increase with ψ
and then decrease as a perfect fit is approached. They are thus in line
with the results for comparison of two proportions using the logistic link
in Table 1 of Hauck and Donner (1977). But our results extend this table

Figure 6.61. Chapman data: t statistics using the arcsine link and the
three-variable model. Compare with Figure 6.58

Table 6.13. Chapman data. Values of t statistics and deviances explained from
simulations that move towards a perfect fit as ψ increases

Logistic Link (200 observations)

                     ψ (Multiple of Linear Predictor)
           Observed     1      2      5     10     20     50
t1             2.54   2.69   4.81   5.63   4.27   3.37   1.93
t4             1.82   1.91   3.64   4.76   4.10   3.40   2.09
t6             2.12   2.21   4.08   4.66   4.23   3.49   1.87
Explained
Deviance       19.0   20.3   70.8  176.2  221.4  248.5  273.0

Arcsine Link (199 observations)

                     ψ (Multiple of Linear Predictor)
           Observed     1      2      5     10     20     50
t1             3.27   3.34   5.77   7.46   5.80   4.85   2.47
t4             1.86   1.91   3.49   5.57   5.30   4.81   2.63
t6             2.57   2.80   4.57   6.19   5.66   4.79   2.16
Explained
Deviance       24.4   28.4   73.7  178.4  223.7  248.35 273.0

to the comparison of two links. As our theory predicts, the t values from
the arcsine link decrease less rapidly with increasing ψ than those from
the logistic link. They are also larger, and so more nearly correspond to
the values of the deviances. In making this comparison it is the sum of
squares of the t statistics that needs to be compared against the explained
deviances.

6.21 Developments and Further Reading


Many other books, such as Morgan (1992) and Venables and Ripley (1997),
give a short introduction to generalized linear models. A more advanced
short treatment is Firth (1991). The standard reference is McCullagh and
Nelder (1989) with a booklength introduction by Dobson (1990). Books
on data analysis with generalized linear models include Fahrmeir and Tutz
(1994) and Lindsey (1997). There is a much larger literature on the analysis
of contingency tables, a topic we briefly mentioned in §6.10 as a special case
of Poisson models: for example, Fienberg (1980), Agresti (1990) and, more
briefly, Agresti (1996). The use of graphical models is described in Chapter
4 of Lauritzen (1996).
In the theory of generalized linear modelling as described in this chapter,
it has been assumed that the dispersion parameter φ is known. When it
is estimated, any inferential effects of estimation are ignored. To test hy-
potheses about the value of φ requires the calculation of likelihood ratio
tests that are more complicated than those for the linear predictor, since
the term d(y, φ) does not cancel in the difference of loglikelihoods when
different values of φ are of interest. An example would be to test whether
the car insurance data were indeed exponentially distributed. A more com-
plicated inferential problem is to test which of two families of models best
fits the data. In the dielectric data, although we fitted a gamma model,
there was an indication that a normal model might be suitable. In general
neither model is a special case of the other, so that ordinary likelihood
ratio tests do not apply. Methods for finding tests of nonnested hypotheses
are described by Cox (1962) and Atkinson (1970). A booklength treatment
of related problems in econometrics is Godfrey (1991). A simple Monte
Carlo test for comparing lognormal and gamma models is exemplified by
Atkinson (1985, page 244).
In recent years, there have been several attempts to develop robust al-
gorithms for generalized linear models. Stefanski et al. (1986) proposed
bounded influence estimators that minimize certain functionals of the
asymptotic covariance matrix. Bedrick and Hill (1990) developed tests for
single and multiple outliers assuming a logistic slippage model. Morgen-
thaler (1992) explored the consequences of replacing the L2 norm by the
L1 norm in the derivation of quasilikelihoods. Christmann (1994) suggested

transformation of the data for large strata, followed by the application of
the least median of squares algorithm to the transformed data.
Finally, we have a few comments on extensions of our forward search.
For large sample sizes, perhaps when n > 1,000, slight variations of the
method can be employed. For example, after choosing the best subset of
dimension p we can obtain the k units (e.g., k = n/2) with the smallest
deviance residuals. Since inferentially important events usually occur in the
last third of the search, the forward search estimator and the monitoring of
the statistics can start from step k. There is some risk of a loss of robustness
that can be reduced by a technique related to that of Woodruff and Rocke
(1994) which provides a faster algorithm for large data sets. The data are
initially divided into smaller subgroups and initial subsets found for each
group which are then amalgamated to provide the initial subset for all the
data.
Of the forms of data discussed in this book, binary data provide the
greatest challenge to the forward search. In addition to the balanced search
of §6.17.3 we also tried several other algorithms for maintaining balance
during the search, for example, one in which balance was forced only in the
final steps. We found that the form used here best allowed the introduction
and removal of observations from the subset.
In the analysis of binary data our discussion is in terms of binary re-
sponses. But the example used by Hauck and Donner (1977) is for the
comparison of two binomial populations. It is thus the explanatory vari-
able that is binary. Problems of near perfect fit in binomial, rather than
binary, data are most likely to occur with designed experiments with factors
at few levels. Our analysis applies also to such data structures.

6.22 Exercises
Exercise 6.1 Show that the normal, gamma, Poisson, inverse Gaussian
and binomial distributions can be written as in equation (6.9) and find for
each distribution b(θ), c(θ), φ and d(y, φ) (§6.2).
The density function of the inverse Gaussian distribution is:

f(y; μ, σ) = (2πσ²y³)^(-1/2) exp{ −(y − μ)² / (2yσ²μ²) }.
Exercise 6.2 Prove the following identities (§6.3):

E(dl/dθ) = 0   and   E(d²l/dθ²) = −E{(dl/dθ)²}.

Exercise 6.3 Starting from equation (6.17), show that varY can be written
as in equation (6.18) (§6.3).
Exercise 6.4 Given data on n proportions, yi = Ri/ni, i = 1, ..., n, sup-
pose that the response probability for the ith observation (πi) is a random
variable with mean θi and variance φθi(1 − θi) where φ is an unknown scale
parameter.
(a) What is the effect on E(Ri) and var(Ri) of the assumption of random
variability in the response probabilities?
(b) Is it possible with binary data to estimate the overdispersion
parameter φ (§6.3)?
Exercise 6.5 Show that the deviance for normal data is equal to the
residual sum of squares (§6.5).
Exercise 6.6 For Poisson observations show the asymptotic equivalence
between the deviance and Pearson's Chi-squared statistic (eq. (6.51); §6.10).
Exercise 6.7 The deviance cannot be used as a summary measure of the
goodness of fit of a model for binary data. Prove this statement for the
logistic model (§6.13).
Exercise 6.8 Suppose you have fitted a linear logistic model with two co-
variates x1 and x2. Holding x1 fixed, what is the effect of a unit change
in x2 on the following scales: (a) log odds, (b) odds and (c) the probability
scale (§6.13)?
Exercise 6.9 In dose response models it may be important to know the
estimated dose that produces a specified value of the response probability.
The dose which is expected to produce a response in 50% of the exposed
subjects is usually called the median effective dose or ED50. When the

response is death this quantity is renamed median lethal dose or LD50.


Similarly, LD90 is the dose associated with a theoretical percentage of deaths
equal to 90%. In the Bliss beetle data the final parameter estimates for logit,
probit and cloglog links are the following.

           β̂0        β̂1
logit    -60.77     34.30
probit   -34.97     19.75
cloglog  -39.60     22.05

Given that the explanatory variable was log10(dose), find the LD50 and
LD90 for the above three models (§6.13).

Exercise 6.10 Suppose that R ~ B(m, π) and that m is large. Show that
the random variable

Z = arcsin(√(R/m))

has approximate first two moments (§6.13):

E Z ≈ arcsin(√π) − (1 − 2π) / {8m√(π(1 − π))}

var Z ≈ 1/(4m).

Exercise 6.11 In the unbalanced forward search for the vasoconstriction


data the last five observations to enter are 39, 31, 29, 18 and 4. All of them
have y = 1. The perfect fit is broken when the first of these observations
enters. Can you guess where units 39, 31 and 29 are located in Figure 6.50?
Why is the perfect fit broken when observation 39 enters (§6.17)?

6.23 Solutions
Exercise 6.1
(a) Normal distribution

f(y; μ, σ²) = (2πσ²)^(-1/2) exp{ −(1/2)((y − μ)/σ)² }

log f(y; μ, σ²) = (yμ − μ²/2)/σ² − (1/2) log 2πσ² − y²/(2σ²).

From the last expression we obtain:

b(θ) = μ,   c(θ) = −μ²/2,   φ = σ²,   d(y, φ) = −(1/2) log 2πσ² − y²/(2σ²).
2
(b) Gamma distribution

f(y; α, μ) = (α/μ)^α y^(α−1) e^(−αy/μ) / Γ(α)

log f(y; α, μ) = α log α − α log μ + (α − 1) log y − αy/μ − log Γ(α)
             = {−y(1/μ) − log μ}/(1/α) + α log α + (α − 1) log y − log Γ(α)

so that

b(θ) = −1/μ,   c(θ) = −log μ,   φ = 1/α,   d(φ, y) = α log α + (α − 1) log y − log Γ(α).


(c) Poisson distribution

f(y; μ) = μ^y e^(−μ) / y!

log f(y; μ) = y log μ − μ − log y!

so that

b(θ) = log μ,   c(θ) = −μ,   φ = 1,   d(φ, y) = −log y!
(d) Inverse Gaussian distribution

log f(y; μ, σ) = {−y/(2μ²) + 1/μ}/σ² − 1/(2yσ²) − (1/2) log 2πσ²y³

so that

b(θ) = −1/(2μ²),   c(θ) = 1/μ,   φ = σ²,   d(φ, y) = −1/(2yσ²) − (1/2) log 2πσ²y³.
(e) Binomial distribution

f(y; μ) = (n choose y) μ^y (1 − μ)^(n−y)

log f(y; μ) = log (n choose y) + y log μ + (n − y) log(1 − μ)
           = y log{μ/(1 − μ)} + n log(1 − μ) + log (n choose y)

so that

b(θ) = log{μ/(1 − μ)},   c(θ) = n log(1 − μ),   φ = 1,   d(φ, y) = log (n choose y).
Exercise 6.2
The standard conditions assume the exchangeability of integration and
differentiation. Then, for the first identity,

E(dl/dθ) = ∫ {d log f(y, θ)/dθ} f(y, θ) dy
         = ∫ {df(y, θ)/dθ} dy
         = (d/dθ) ∫ f(y, θ) dy
         = (d/dθ) 1
         = 0.

In order to prove the second identity we start from the following equation:

(d/dθ) ∫ {d log f(y, θ)/dθ} f(y, θ) dy = 0.

Differentiating both sides we obtain:

∫ {d² log f(y, θ)/dθ²} f(y, θ) dy + ∫ {d log f(y, θ)/dθ}{df(y, θ)/dθ} dy = 0,

that is,

∫ {d² log f(y, θ)/dθ²} f(y, θ) dy + ∫ {d log f(y, θ)/dθ}² f(y, θ) dy = 0.

Therefore:

E{d² log f(y, θ)/dθ²} + E{d log f(y, θ)/dθ}² = 0

E(d²l/dθ²) = −E(dl/dθ)².

Exercise 6.3
From equation (6.13)

E(Y) = μ = −c′(θ)/b′(θ).     (6.82)

Consequently

∂μ/∂θ = −{c″(θ)b′(θ) − b″(θ)c′(θ)}/{b′(θ)}².     (6.83)

Substituting the expression for μ in equation (6.17) we obtain:

var Y = −φ {c″(θ)b′(θ) − b″(θ)c′(θ)} / [{b′(θ)}² b′(θ)].

Use of the expression for ∂μ/∂θ found in equation (6.83) immediately gives

var Y = (φ/b′(θ)) ∂μ/∂θ.

Exercise 6.4
(a) Given a particular value of πi, the observed number of successes for the
ith unit (Ri) has a binomial distribution with mean and variance:

E(Ri|πi) = ni πi and var(Ri|πi) = ni πi(1 − πi).

Now using standard results from conditional probability theory:

E(Ri) = E E(Ri|πi) = E(ni πi) = ni E(πi) = ni θi
var(Ri) = E var(Ri|πi) + var{E(Ri|πi)}.

We obtain:

E var(Ri|πi) = E{ni πi(1 − πi)}
            = ni {E πi − var πi − (E πi)²}
            = ni {θi − φθi(1 − θi) − θi²}
            = ni {θi(1 − θi)(1 − φ)}

and

var{E(Ri|πi)} = var{ni πi} = ni² φθi(1 − θi).

We therefore find that:

var(Ri) = ni θi(1 − θi){1 + (ni − 1)φ}.     (6.84)

We conclude that if there is variation in the response probabilities (that is,
if φ > 0) the variance of Ri exceeds the variance under binomial sampling
by a factor of {1 + (ni − 1)φ}. In other words, variation among the response
probabilities leads to overdispersion; that is, to a variance of the observed
number of successes which is greater than it would have been if the response
probabilities did not vary at random.
(b) With (ungrouped) binary data ni = 1 for all values of i. Thus the
expression for the variance in equation (6.84) reduces to θi(1 − θi), which is
exactly the variance of a binary response variable. This implies that binary
data provide no information about the overdispersion parameter φ.

Exercise 6.5
For normal data the log likelihood of a single observation can be written

l(μ, y) = −(1/2) log(2πσ²) − (y − μ)²/(2σ²).

Setting μ = y gives the maximum achievable log likelihood:

l(y, y) = −(1/2) log(2πσ²)

so that the deviance is:

D(β̂) = 2σ²{L(y, y) − L(μ̂, y)} = Σ_{i=1}^n (yi − μ̂i)².

We conclude that with normal data the deviance is identical to the residual
sum of squares.

Exercise 6.6
Let (y − μ)/μ = ε so that

y − μ = με,   y = μ(1 + ε)   and   y/μ = 1 + ε.

Now consider the squared deviance residual for unit i as a function of ε (for
simplicity we drop the subscript i)

r²(ε) = 2{y log(y/μ) − (y − μ)} = 2{μ(1 + ε) log(1 + ε) − με}.

Differentiation yields

dr²/dε = 2μ log(1 + ε)   and   d²r²/dε² = 2μ/(1 + ε).

Taylor series expansion around ε = 0 gives:

r² ≈ 0 + 2μ log(1 + ε)|ε=0 ε + {2μ/(1 + ε)}|ε=0 ε²/2 ≈ με² = (y − μ)²/μ.

Thus we can state that asymptotically

D(β̂) = 2 Σ_{i=1}^n {yi log(yi/μ̂i) − (yi − μ̂i)} ≈ X² = Σ_{i=1}^n (yi − μ̂i)²/μ̂i.
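A small numerical illustration of this asymptotic agreement, with arbitrary counts and fitted means:

```python
# For Poisson counts the deviance and Pearson's X^2 are close when the
# fitted means are not too small (example values chosen arbitrarily).
import numpy as np

y = np.array([12.0, 7.0, 30.0, 18.0])
mu = np.array([11.0, 8.5, 27.0, 20.0])
dev = 2 * np.sum(y * np.log(y / mu) - (y - mu))
pearson = np.sum((y - mu) ** 2 / mu)
print(dev, pearson)    # similar magnitudes
```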

Exercise 6.7
The expression for the deviance of binary data can be obtained by putting
ni = 1 in equation (6.74); that is,

D(β̂) = 2 Σ_{i=1}^n { yi log(yi/θ̂i) + (1 − yi) log((1 − yi)/(1 − θ̂i)) }.

Remembering that yi = 0 or yi = 1 we have yi log yi = (1 − yi) log(1 − yi) =
0. D(β̂) can be rewritten as

D(β̂) = −2 Σ_{i=1}^n { yi log θ̂i + (1 − yi) log(1 − θ̂i) }
     = −2 Σ_{i=1}^n { yi log(θ̂i/(1 − θ̂i)) + log(1 − θ̂i) }.

In matrix notation we can write:

D(β̂) = −2 { yᵀη̂ + Σ_{i=1}^n log(1 − θ̂i) }
     = −2 { yᵀXβ̂ + Σ_{i=1}^n log(1 − θ̂i) },     (6.85)

where y = (y1, ..., yn)ᵀ and η̂ = Xβ̂ = (η̂1, ..., η̂n)ᵀ. Now, in the case of
linear logistic models, wi = θi(1 − θi), φ = 1 and ∂ηi/∂θi = 1/{θi(1 − θi)}
so that equation (6.38) becomes

∂L/∂βj = Σ_{i=1}^n θi(1 − θi)(yi − θi) · 1/{θi(1 − θi)} · xij
       = Σ_{i=1}^n (yi − θi) xij,    j = 1, ..., p.

In matrix notation we can write:

∂L/∂β = Xᵀ(y − θ),

where θ = (θ1, ..., θn)ᵀ. Moreover if β̂ is the maximum likelihood estimator

Xᵀy = Xᵀθ̂,

from which it follows that:

yᵀXβ̂ = θ̂ᵀXβ̂ = θ̂ᵀη̂.

Using this identity we can rewrite equation (6.85) as

D(β̂) = −2 { θ̂ᵀη̂ + Σ_{i=1}^n log(1 − θ̂i) }.

This expression shows that the deviance depends on the binary observations
yi only through the fitted probabilities θ̂i and so it cannot tell us anything
about the agreement between the observations and their fitted probabilities.
In other words: given β̂, D(β̂) has a conditionally degenerate distribution
and cannot be used to evaluate the goodness of fit of a model. The result
is also true for the other links.

Exercise 6.8
Given that:

log{θ/(1 − θ)} = β0 + β1x1 + β2x2,     (6.86)

holding x1 fixed the effect of a unit change in x2 is to increase the log odds
by an amount β2.
In terms of the odds, equation (6.86) can be rewritten as

θ/(1 − θ) = exp(β0 + β1x1 + β2x2).     (6.87)

Consequently, the effect of a unit change in x2 is to increase the odds
multiplicatively by a factor exp(β2).
On the probability scale

θ = exp(β0 + β1x1 + β2x2) / {1 + exp(β0 + β1x1 + β2x2)}.     (6.88)

On this scale the effect on θ of a unit change in x2 depends on the values
of x1 and x2 and therefore is much more complicated. The derivative of θ
with respect to x2 is

∂θ/∂x2 = θ(1 − θ)β2.

Thus, given that the maximum of θ(1 − θ) is obtained when θ = 0.5 we
can state that a small change in x2 measured on the probability scale has
a larger effect on θ if θ is near 0.5 than if θ is near 0 or 1.
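A small numerical illustration of the three scales, with arbitrary parameter values:

```python
# Illustrative values only (not from any data set in the text).
import numpy as np

b0, b1, b2, x1, x2 = -1.0, 0.5, 0.8, 1.0, 0.0

def prob(x2_value):
    eta = b0 + b1 * x1 + b2 * x2_value
    return np.exp(eta) / (1 + np.exp(eta))

p0, p1 = prob(x2), prob(x2 + 1)
print(np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0)))   # log odds change = b2
print((p1 / (1 - p1)) / (p0 / (1 - p0)))               # odds ratio = exp(b2)
print(p1 - p0)                                          # probability change, depends on x1 and x2
```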

Exercise 6.9
We start with the logit model. Given that log{0.5/(1 − 0.5)} = 0, the dose
for which π = 0.5 (ED50) satisfies the equation

β̂0 + β̂1 ED50 = 0

so that the ED50 = −β̂0/β̂1. If loge(dose) rather than dose is used as
explanatory variable ED50 is estimated by exp(−β̂0/β̂1).
Similarly, the ED90 must be estimated from the equation

log{0.9/(1 − 0.9)} = β̂0 + β̂1 ED90,

so that

ED90 = (2.1972 − β̂0)/β̂1.

Estimates of the ED50 and ED90 can be obtained similarly under a probit
or cloglog model. When loge(dose) is used as an explanatory variable for
the probit model we obtain:

ED50 = exp(−β̂0/β̂1),   ED90 = exp{(1.2816 − β̂0)/β̂1}.

Finally for the cloglog

ED50 = exp{(−0.3665 − β̂0)/β̂1},   ED90 = exp{(0.8340 − β̂0)/β̂1}.

If, as in the exercise, logarithms of dose are taken to base 10, exp(·) in the
above expressions needs to be replaced by 10^(·). Making this adjustment
and using the estimated values of the parameters reported in the text of
the exercise, we obtain the following table.

ED50 ED90
logit 59.12 68.51
probit 58.97 68.47
cloglog 60.16 68.19

The models agree more closely in the upper tail than they do in the centre
of the distribution.
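The table can be reproduced directly from the reported estimates; a sketch:

```python
# Dose levels giving pi = 0.5 and pi = 0.9, remembering that the
# explanatory variable is log10(dose).
import numpy as np

estimates = {"logit": (-60.77, 34.30), "probit": (-34.97, 19.75), "cloglog": (-39.60, 22.05)}
# value of the linear predictor at pi = 0.5 and pi = 0.9 for each link
cut = {"logit": (0.0, np.log(0.9 / 0.1)),
       "probit": (0.0, 1.2816),
       "cloglog": (np.log(np.log(2.0)), np.log(-np.log(0.1)))}

for link, (b0, b1) in estimates.items():
    ed50 = 10 ** ((cut[link][0] - b0) / b1)
    ed90 = 10 ** ((cut[link][1] - b0) / b1)
    print(link, round(ed50, 2), round(ed90, 2))
```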

Exercise 6.10
Expanding arcsin(√(R/m)) in a Taylor series around π up to second order:

Z = arcsin(√(R/m))
  ≈ arcsin(√π) + [1/{2√(π(1 − π))}](R/m − π)
    − (1/2!)[(1 − 2π)/{4(π(1 − π))^(3/2)}](R/m − π)².

Taking the expectation and the variance of both sides the result
immediately follows.
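A quick Monte Carlo check of these approximations, with an arbitrary choice of m and π:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 50, 0.3
z = np.arcsin(np.sqrt(rng.binomial(m, p, size=200_000) / m))

approx_mean = np.arcsin(np.sqrt(p)) - (1 - 2 * p) / (8 * m * np.sqrt(p * (1 - p)))
print(z.mean(), approx_mean)        # sample mean vs. approximation
print(z.var(), 1 / (4 * m))         # sample variance vs. 1/(4m)
```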

Figure 6.62. Vasoconstriction data: scatter plot showing zero o and unit • re-
sponses. Without units 4, 18, 29, 31 and 39 the two regions are completely
separated

Exercise 6.11
As Figure 6.62 shows, without units 4, 18, 29, 31 and 39 there is a line in
the space that completely separates the two groups of observations. This
ceases when unit 39 (the first of the five to enter) is included.
Appendix A
Data

Table A.I. Forbes' data on air pressure in the Alps and the boiling point of water

Observation    Boiling        100 ×
Number         Point (°F)     Log(Pressure)
1 194.5 131.79
2 194.3 131.79
3 197.9 135.02
4 198.4 135.55
5 199.4 136.46
6 199.9 136.83
7 200.9 137.82
8 201.1 138.00
9 201.4 138.06
10 201.3 138.05
11 203.6 140.04
12 204.6 142.44
13 209.5 145.47
14 208.6 144.34
15 210.7 146.30
16 211.9 147.54
17 212.2 147.80

Table A.2. Multiple regression data showing the effect of masking

y x1 x2 x3 x4

1 0.2133 -2.3080 -8.5132 -10.6651


2 0.9413 1.2048 -4.1313 5.2739
3 -0.6165 -1.0754 -5.4988 -2.2908
4 -1.2154 1.4384 -4.7902 2.8050
5 0.3678 1.6125 -2.8339 7.5527
6 0.0366 0.5840 -5.8016 0.4213
7 0.1636 0.7677 -3.9292 1.8518
8 -1.4562 0.3064 -7.1359 -4.0885
9 -2.6535 -2.1420 -7.6303 -7.0537
10 -1.2722 0.2899 -5.0166 -0.8915
11 -1.2276 -1.4133 -7.6792 -7.0021
12 1.3087 0.5436 -2.9362 4.4775
13 -0.1036 -2.2270 -7.7164 -10.7953
14 -0.2724 0.2107 -3.1560 5.3512
15 -0.3896 -3.1353 -8.7172 -14.5927
16 0.6659 0.8877 -3.2162 5.8618
17 2.1210 1.6520 -1.1933 11.1076
18 -0.9035 -1.1082 -5.7122 -4.6404
19 -0.8800 -1.7030 -7.0756 -8.3006
20 -0.4792 -1.4409 -7.1334 -7.3699
21 -1.9608 -1.9674 -7.0908 -6.2175
22 -1.9271 0.5333 -5.8178 -1.5325
23 -0.4970 -1.2314 -7.8755 -9.2648
24 -1.0759 -1.0643 -5.4215 -2.9707
25 -1.3846 0.8026 -5.1148 0.1859
26 2.1650 -1.6213 -3.0335 0.5601
27 -0.9351 0.1970 -4.5964 2.2223
28 0.1378 -1.1782 -5.8880 -3.6547
29 -2.5223 -1.6597 -9.4292 -13.6328
30 -0.6787 -1.9986 -4.9259 -1.4192

Table A.2. Multiple regression data (concluded)

y Xl X2 X3 X4

31 -1.8665 -1.7153 -6.5110 -3.9751


32 0.2733 0.4637 -3.9374 2.2070
33 2.5053 -0.2927 -3.3827 2.9221
34 -0.6474 0.8221 -4.1466 2.9272
35 -0.2099 -0.7850 -6.1856 -3.8248
36 -0.1594 -2.4325 -6.8908 -7.3217
37 -0.3642 -0.8784 -4.4361 -1.0826
38 -1.7083 -2.0538 -6.9914 -5.7308
39 0.0828 1.6929 -3.0457 5.2853
40 -0.6833 -0.0002 -6.4165 -2.8162
41 0.0098 -0.2040 -4.5454 -0.5628
42 0.8122 -0.3706 -3.0402 4.2226
43 -0.7197 0.5228 -2.8731 2.8370
44 -0.4476 -0.0371 -6.8418 -4.2108
45 -1.2557 -0.2973 -6.9266 -5.0248
46 1.2623 -2.7807 -7.0192 -9.6272
47 -1.9520 -1.5694 -7.0712 -4.9137
48 -0.6370 -0.1024 -5.1515 -2.0070
49 -1.7872 1.6853 -5.6793 1.8649
50 -0.6200 1.5364 -2.1320 9.4976
51 -0.4671 2.4089 -4.7093 6.2493
52 -1.5115 0.4483 -5.4337 -0.7360
53 -0.5081 0.9826 -4.8532 2.1675
54 1.6276 -0.0213 -2.7575 4.9091
55 -3.0987 0.9362 -7.9089 -5.4528
56 -1.1715 -0.9129 -8.0691 -8.3308
57 -1.1799 0.8266 -5.8699 -0.6863
58 0.5283 -1.5339 -5.7189 -3.8938
59 -1.8800 -0.7472 -8.4265 -8.6413
60 -0.1453 0.0606 -5.7231 -0.8855

Table A.3. Wool data: number of cycles to failure of samples of worsted yarn in a 3³ experiment

Factor Levels Cycles to


Observation Xl X2 X3 Failure
1 -1 -1 -1 674
2 -1 -1 0 370
3 -1 -1 1 292
4 -1 0 -1 338
5 -1 0 0 266
6 -1 0 1 210
7 -1 1 -1 170
8 -1 1 0 118
9 -1 1 1 90
10 0 -1 -1 1414
11 0 -1 0 1198
12 0 -1 1 634
13 0 0 -1 1022
14 0 0 0 620
15 0 0 1 438
16 0 1 -1 442
17 0 1 0 332
18 0 1 1 220
19 1 -1 -1 3636
20 1 -1 0 3184
21 1 -1 1 2000
22 1 0 -1 1568
23 1 0 0 1070
24 1 0 1 566
25 1 1 -1 1140
26 1 1 0 884
27 1 1 1 360

Table A.4. Hawkins' data simulated to baffle data analysts

Observation Number   X1   X2   X3   X4   X5   X6   X7   X8   y
1 -15 -10 -14 -8 2 -4 -10 59 8.88
2 9 0 8 -8 18 8 -18 74 12.18
3 -3 4 10 0 16 -14 6 49 5.75
4 -19 6 12 -16 8 -6 4 95 11.75
5 -3 0 6 4 -8 22 -16 57 10.52
6 11 -32 -38 10 -16 -2 10 97 10.57
7 11 2 0 18 -18 12 4 27 1.70
8 -11 32 38 -10 16 2 -10 62 5.31
9 -3 -2 -16 -12 -6 -8 -10 56 8.51
10 9 14 30 12 6 12 0 60 1.21
11 -3 -6 -2 4 -8 -6 -6 43 3.36
12 -9 12 12 -12 26 -8 -8 53 8.26
13 5 -24 -36 -2 -6 4 -4 72 10.14
14 -11 16 8 -14 8 -10 10 67 -0.58
15 -3 8 -4 -16 18 -16 2 24 7.10
16 7 0 18 6 -2 8 4 61 -0.63
17 9 18 16 -4 8 10 -4 68 5.87
18 11 -6 4 10 16 2 2 7 -0.25
19 -1 12 -4 -6 2 -4 -14 10 -9.45
20 -7 16 12 -2 10 4 -24 58 8.93
21 1 -12 4 6 -2 4 14 76 18.88
22 -3 -20 -10 16 -18 12 8 69 4.01
23 -11 -14 -20 2 -26 -12 22 78 8.48
24 13 -2 4 20 0 14 -14 6 -0.16
25 -21 12 10 0 0 6 -6 43 7.26
26 -1 6 8 -8 -10 -16 18 49 1.69
27 1 8 20 -6 8 14 -10 2 -4.46
28 -1 8 10 10 0 -2 -10 49 3.36
29 5 -10 -14 18 -18 8 14 67 7.53
30 7 4 4 -10 0 6 0 68 3.91
31 3 16 24 0 16 -10 -4 77 6.73
32 15 10 14 8 -2 4 10 1 -2.91
33 5 -28 -22 14 -8 6 0 97 8.80
34 -5 -10 -2 -6 8 -18 10 1 1.80
35 -13 -2 10 -4 -2 -12 18 7 -2.40
36 7 -16 -12 2 -10 -4 24 94 6.25
37 -7 0 -18 -6 2 -8 -4 89 15.60
38 -1 -20 -20 2 2 12 -14 28 1.06
39 -3 12 6 8 -18 -4 8 92 9.99
40 -9 0 -8 8 -18 -8 18 94 2.10
41 -3 -16 -24 0 -16 10 4 7 1.63
42 -9 -14 -30 -12 -6 -12 0 11 5.84
43 7 -14 -10 20 0 10 -4 1 -2.30

Table A.4. Hawkins' data (continued)

Observation Number   X1   X2   X3   X4   X5   X6   X7   X8   y
44 7 6 6 8 10 20 -28 1 1.42
45 -5 -6 -16 -22 10 -20 6 93 2.67
46 7 -10 -24 4 2 8 -8 38 -6.93
47 -3 10 18 0 16 14 -4 16 0.75
48 -15 8 -6 -4 -8 -2 4 96 14.31
49 -5 8 6 -2 -2 -16 24 23 2.93
50 3 -8 4 16 -18 16 -2 68 2.06
51 3 2 16 12 6 8 10 89 5.97
52 -11 -2 0 -18 18 -12 -4 88 9.78
53 11 -2 -10 -6 18 0 -2 73 10.20
54 -15 4 8 12 -10 0 8 80 8.90
55 -5 10 14 -18 18 -8 -14 84 7.55
56 3 -4 -10 0 -16 14 -6 80 7.11
57 5 4 -6 6 -8 -10 0 98 12.60
58 -9 -18 -16 4 -8 -10 4 19 2.80
59 5 10 2 6 -8 18 -10 79 5.88
60 -11 12 22 2 6 -8 14 21 3.38
61 -9 2 0 -8 2 0 -20 94 7.10
62 -3 24 26 -12 26 -4 -18 69 4.43
63 11 -16 -8 14 -8 10 -10 31 9.47
64 17 -20 -24 10 -16 2 0 59 4.92
65 -1 14 18 10 0 26 -20 31 2.44
66 -15 24 24 0 0 10 -16 29 2.03
67 13 2 -10 4 2 12 -18 73 10.35
68 3 -10 -18 0 -16 -14 4 48 5.65
69 -17 -6 -18 -10 -16 -6 8 81 2.02
70 9 -16 -22 -12 10 -4 2 25 3.45
71 1 -6 -8 8 10 16 -18 58 8.94
72 3 22 32 0 16 18 -14 25 9.69
73 13 -4 2 2 -10 0 14 24 13.81
74 -7 2 4 10 0 22 -10 44 2.66
75 3 0 -6 -4 8 -22 16 83 2.55
76 9 -2 0 8 -2 0 20 49 5.61
77 17 10 4 -6 18 4 -12 33 3.21
78 13 -18 -26 16 -8 2 6 6 3.41
79 15 -24 -24 0 0 -10 16 22 3.95
80 1 6 -2 -22 10 -16 -4 14 2.28
81 -7 -4 -4 10 0 -6 0 78 10.65
82 -9 20 8 -4 -8 2 -6 28 5.70
83 -17 10 12 -6 -8 6 -12 82 7.35
84 -9 -12 -8 4 -8 18 -6 75 6.69
85 21 -12 -10 0 0 -6 6 90 6.01
86 9 -12 -12 12 -26 8 8 40 1.01

Table A.4. Hawkins' data (concluded)

Observation Number   X1   X2   X3   X4   X5   X6   X7   X8   y
87 -13 2 -4 -20 0 -14 14 94 10.14
88 1 2 12 -6 8 -14 0 6 -2.33
89 3 20 10 -16 18 -12 -8 12 4.05
90 23 2 2 6 8 -2 2 1 -0.90
91 -1 -8 -20 6 -8 -14 10 61 10.72
92 -3 -22 -32 0 -16 -18 14 30 -2.72
93 11 14 20 -2 26 12 -22 2 -0.52
94 -7 10 24 -4 -2 -8 8 53 16.00
95 -13 18 26 -16 8 -2 -6 23 -0.55
96 -1 -6 2 22 -10 16 4 57 4.77
97 -5 28 22 -14 8 -6 0 14 2.27
98 -9 16 22 12 -10 4 -2 91 8.13
99 5 2 6 -2 26 8 -12 95 7.36
100 19 -6 -12 16 -8 6 -4 67 4.71
101 7 -2 -4 -10 0 -22 10 9 2.93
102 -1 -2 -12 6 -8 14 0 5 3.42
103 -23 -2 -2 -6 -8 2 -2 58 6.78
104 15 -8 6 4 8 2 -4 97 4.97
105 -7 -6 -6 -8 -10 -20 28 18 0.47
106 17 6 18 10 16 6 -8 8 7.64
107 1 -8 -10 -10 0 2 10 23 4.90
108 11 -12 -22 -2 -6 8 -14 87 6.91
109 1 -14 -18 -10 0 -26 20 58 6.46
110 5 -8 -6 2 2 16 -24 76 6.94
111 -13 4 -2 -2 10 0 -14 9 -8.69
112 -17 -10 -4 6 -18 -4 12 89 11.03
113 -5 -4 6 -6 8 10 0 70 4.18
114 -11 2 10 6 -18 0 2 81 5.16
115 -7 14 10 -20 0 -10 4 82 8.70
116 -5 24 36 2 6 -4 4 98 6.83
117 9 12 8 -4 8 -18 6 25 3.27
118 17 -10 -12 6 8 -6 12 9 1.71
119 3 -12 -6 -8 18 4 -8 86 7.78
120 15 -4 -8 -12 10 0 -8 11 0.20
121 -17 20 24 -10 16 -2 0 59 6.86
122 1 20 20 -2 -2 -12 14 91 12.06
123 3 6 2 -4 8 6 6 62 7.10
124 -5 -2 -6 2 -26 -8 12 91 11.21
125 9 -20 -8 4 8 -2 6 87 5.79
126 5 6 16 22 -10 20 -6 92 15.30
127 -11 6 -4 -10 -16 -2 -2 64 7.33
128 3 -24 -26 12 -26 4 18 53 7.76

Table A.5. Brownlee's stack loss data on the oxidation of ammonia. The response
is ten times the percentage of ammonia escaping up a stack, or chimney

Cooling Water
Inlet Acid Stack
Observation Air Flow Temperature Concentration Loss
Number Xl X2 X3 Y
1 80 27 89 42
2 80 27 88 37
3 75 25 90 37
4 62 24 87 28
5 62 22 87 18
6 62 23 87 18
7 62 24 93 19
8 62 24 93 20
9 58 23 87 15
10 58 18 80 14
11 58 18 89 14
12 58 17 88 13
13 58 18 82 11
14 58 19 93 12
15 50 18 89 8
16 50 18 86 7
17 50 19 72 8
18 50 19 79 8
19 50 20 80 9
20 56 20 82 15
21 70 20 91 15

Table A.6. Salinity data. Measurements on water in Pamlico Sound, North Carolina

Lagged Water
Observation Salinity Trend Flow Salinity
Number Xl X2 X3 Y
1 8.2 4 23.005 7.6
2 7.6 5 23.873 7.7
3 4.6 0 26.417 4.3
4 4.3 1 24.868 5.9
5 5.9 2 29.895 5.0
6 5.0 3 24.200 6.5
7 6.5 4 23.215 8.3
8 8.3 5 21.862 8.2
9 10.1 0 22.274 13.2
10 13.2 1 23.830 12.6
11 12.6 2 25.144 10.4
12 10.4 3 22.430 10.8
13 10.8 4 21.785 13.1
14 13.1 5 22.380 12.3
15 13.3 0 23.927 10.4
16 10.4 1 33.443 10.5
17 10.5 2 24.859 7.7
18 7.7 3 22.686 9.5
19 10.0 0 21.789 12.0
20 12.0 1 22.041 12.6
21 12.1 4 21.033 13.6
22 13.6 5 21.005 14.1
23 15.0 0 25.865 13.5
24 13.5 1 26.290 11.5
25 11.5 2 22.932 12.0
26 12.0 3 21.313 13.0
27 13.0 4 20.769 14.1
28 14.1 5 21.393 15.1

Table A.7. Ozone data: ozone concentration at Upland, CA as a function of eight meteorological variables

Observation Number   X1   X2   X3   X4   X5   X6   X7   X8   Ozone Concentration (ppm)
1 40 2693 -25 250 5710 28 47.66 4 3
2 45 590 -24 100 5700 37 55.04 3 5
3 54 1450 25 60 5760 51 57.02 3 5
4 35 1568 15 60 5720 69 53.78 4 6
5 45 2631 -33 100 5790 19 54.14 6 4
6 55 554 -28 250 5790 25 64.76 3 4
7 41 2083 23 120 5700 73 52.52 3 6
8 44 2654 -2 120 5700 59 48.38 3 7
9 54 5000 -19 120 5770 27 48.56 8 4
10 51 111 9 150 5720 44 63.14 3 6
11 51 492 -44 40 5760 33 64.58 6 5
12 54 5000 -44 200 5780 19 56.30 6 4
13 58 1249 -53 250 5830 19 75.74 3 4
14 61 5000 -67 200 5870 19 65.48 2 7
15 64 5000 -40 200 5840 19 63.32 5 5
16 67 639 1 150 5780 59 66.02 4 9
17 52 393 -68 10 5680 73 69.80 5 4
18 54 5000 -66 140 5720 19 54.68 4 3
19 54 5000 -58 250 5760 19 51.98 3 4
20 58 5000 -26 200 5730 26 51.98 4 4
21 69 3044 18 150 5700 59 52.88 5 5
22 51 3641 23 140 5650 70 47.66 5 6
23 53 111 -10 50 5680 64 59.54 3 9
24 59 597 -52 70 5820 19 70.52 5 6
25 64 1791 -15 150 5810 19 64.76 5 6
26 63 793 -15 120 5790 28 65.84 3 11
27 63 531 -38 40 5800 32 75.92 2 10
28 62 419 -29 120 5820 19 75.74 5 7
29 63 816 -7 6 5770 76 66.20 8 12
30 54 3651 62 30 5670 69 49.10 3 9
31 36 5000 70 100 5590 76 37.94 3 2
32 31 5000 28 200 5410 64 32.36 6 3
33 30 1341 18 60 5350 62 45.86 7 3
34 36 5000 0 350 5480 72 38.66 9 2
35 42 3799 -18 250 5600 76 45.86 7 3
36 37 5000 32 350 5490 72 38.12 11 3
37 41 5000 -1 300 5560 72 37.58 10 4
38 46 5000 -30 300 5700 32 45.86 3 6
39 51 5000 -8 300 5680 50 45.50 5 8
40 55 2398 21 200 5700 86 53.78 4 6

Table A.7. Ozone data (concluded): ozone concentration at Upland, CA

Observation Number   X1   X2   X3   X4   X5   X6   X7   X8   Ozone Concentration (ppm)
41 41 5000 51 100 5650 61 36.32 5 4
42 41 4281 42 250 5610 62 41.36 5 3
43 49 1161 27 200 5730 66 52.88 5 7
44 45 2778 2 200 5770 68 55.76 5 11
45 55 442 26 40 5770 82 58.28 3 13
46 41 5000 -30 300 5690 21 42.26 8 6
47 45 5000 -53 300 5700 19 43.88 3 5
48 51 5000 -43 300 5730 19 49.10 11 4
49 53 5000 7 300 5690 19 49.10 7 4
50 50 5000 24 300 5640 68 42.08 5 6
51 60 1341 19 150 5720 63 59.18 6 10
52 54 1318 2 150 5740 54 64.58 3 15
53 53 885 -4 80 5740 47 67.10 3 23
54 53 360 3 40 5740 56 67.10 3 17
55 44 3497 73 40 5670 61 49.46 7 7
56 40 5000 73 80 5550 74 40.10 10 2
57 30 5000 44 300 5470 46 29.30 7 3
58 25 5000 39 200 5320 45 27.50 11 3
59 40 5000 -12 140 5530 43 33.62 3 4
60 45 5000 -2 140 5600 21 39.02 3 6
61 51 5000 30 140 5660 57 42.08 7 7
62 48 3608 24 100 5580 42 39.38 5 7
63 45 5000 38 140 5510 50 32.90 5 6
64 47 5000 56 200 5530 61 35.60 5 3
65 43 5000 66 120 5620 61 34.34 9 2
66 49 613 -27 300 5690 60 59.72 0 8
67 56 334 -9 300 5760 31 64.40 4 12
68 53 567 13 150 5740 66 61.88 3 12
69 61 488 -20 2 5780 53 64.94 5 16
70 63 531 -15 50 5790 42 71.06 2 9
71 70 508 7 70 5760 60 66.56 3 24
72 57 1571 68 17 5700 82 56.30 4 13
73 35 721 28 140 5680 57 55.40 4 8
74 52 505 -49 140 5720 21 67.28 5 10
75 59 377 -27 300 5720 19 73.22 5 8
76 67 442 -9 200 5730 32 75.74 4 9
77 57 902 54 250 5710 77 60.44 5 10
78 42 1381 4 60 5720 71 56.30 4 14
79 55 5000 -16 100 5710 19 50.00 3 9
80 40 5000 38 150 5600 45 46.94 6 11

Table A.8. Box and Cox poison data. Survival times in 10-hour units of animals in a 3 × 4 factorial experiment. Each cell in the table includes both the observation number and the response

Treatment Poison
1 2 3 4 A I
0.31 0.45 0.46 0.43

5 6 7 8 A II
0.36 0.29 0.40 0.23

9 10 11 12 A III
0.22 0.21 0.18 0.23
13 14 15 16 B I
0.82 1.10 0.88 0.72
17 18 19 20 B II
0.92 0.61 0.49 1.24
21 22 23 24 B III
0.30 0.37 0.38 0.29
25 26 27 28 C I
0.43 0.45 0.63 0.76
29 30 31 32 C II
0.44 0.35 0.31 0.40
33 34 35 36 C III
0.23 0.25 0.24 0.22

37 38 39 40 D I
0.45 0.71 0.66 0.62
41 42 43 44 D II
0.56 1.02 0.71 0.38
45 46 47 48 D III
0.30 0.36 0.31 0.33

Table A.9. Mussels data from Cook and Weisberg. The response is the mass of
the edible portion of the mussel

Number W=Width H=Height L=Length S=Shell Mass y=Mass


1 318 68 158 345 47
2 312 56 148 290 52
3 265 46 124 167 27
4 222 38 104 67 13
5 274 51 143 238 31
6 216 35 99 68 14
7 217 34 109 75 15
8 202 32 96 54 4
9 272 44 119 128 23
10 273 49 123 150 32
11 260 48 135 117 30
12 276 47 133 190 26
13 270 50 126 160 24
14 280 52 130 212 31
15 262 50 134 208 31
16 312 61 120 235 42
17 220 34 94 52 9
18 212 32 102 74 13
19 196 28 85 42 7
20 226 38 104 69 13
21 284 61 134 268 50
22 320 60 137 323 39
23 331 60 140 359 47
24 276 46 126 167 40
25 186 30 92 33 5
26 213 35 98 51 12
27 291 47 130 170 26
28 298 54 137 224 32
29 287 55 140 238 40
30 230 40 106 68 16
31 293 57 135 208 33
32 298 48 135 167 28
33 290 47 134 187 28
34 282 52 135 191 42
35 221 37 104 58 15
36 287 54 135 180 27
37 228 46 129 188 33
38 210 33 107 65 14
39 308 58 131 299 29
40 265 48 124 159 26
41 270 44 124 145 25

Table A.9. Mussels data (concluded)

Number W=Width H=Height L=Length S=Shell Mass y=Mass


42 208 33 99 54 9
43 277 45 123 129 18
44 241 39 110 104 23
45 219 38 105 66 13
46 170 27 87 24 6
47 150 21 75 19 6
48 132 20 65 10 1
49 175 30 86 36 8
50 150 22 69 18 5
51 162 25 79 20 6
52 252 47 124 133 22
53 275 48 131 179 24
54 224 36 107 69 13
55 211 33 100 59 11
56 254 46 126 120 18
57 234 37 114 72 17
58 221 37 108 74 15
59 167 27 80 27 7
60 220 36 106 52 14
61 227 35 118 76 14
62 177 25 83 25 8
63 230 47 112 125 18
64 288 46 132 138 24
65 275 54 127 191 29
66 273 42 120 148 21
67 246 37 110 90 17
68 250 43 115 120 17
69 290 48 131 203 34
70 226 35 111 64 16
71 269 45 121 124 22
72 267 48 121 153 24
73 263 48 123 151 19
74 217 36 104 68 13
75 188 33 93 51 10
76 152 25 76 19 5
77 227 38 112 88 15
78 216 25 110 53 12
79 242 45 112 61 12
80 260 44 123 133 24
81 196 35 101 68 15
82 220 36 105 64 16

Table A.10. Shortleaf pine data. The response is the volume of the tree, X1 the girth and X2 the height

Number Xl X2 Y
1 4.6 33 2.2
2 4.4 38 2.0
3 5.0 40 3.0
4 5.1 49 4.3
5 5.1 37 3.0
6 5.2 41 2.9
7 5.2 41 3.5
8 5.5 39 3.4
9 5.5 50 5.0
10 5.6 69 7.2
11 5.9 58 6.4
12 5.9 50 5.6
13 7.5 45 7.7
14 7.6 51 10.3
15 7.6 49 8.0
16 7.8 59 12.1
17 8.0 56 11.1
18 8.1 86 16.8
19 8.4 59 13.6
20 8.6 78 16.6
21 8.9 93 20.2
22 9.1 65 17.0
23 9.2 67 17.7
24 9.3 76 19.4
25 9.3 64 17.1
26 9.8 71 23.9
27 9.9 72 22.0
28 9.9 79 23.1
29 9.9 69 22.6
30 10.1 71 22.0
31 10.2 80 27.0
32 10.2 82 27.0
33 10.3 81 27.4
34 10.4 75 25.2
35 10.6 75 25.5

Table A.10. Shortleaf pine data (concluded)

Number Xl X2 Y
36 11.0 71 25.8
37 11.1 81 32.8
38 11.2 91 35.4
39 11.5 66 26.0
40 11.7 65 29.0
41 12.0 72 30.2
42 12.2 66 28.2
43 12.2 72 32.4
44 12.5 90 41.3
45 12.9 88 45.2
46 13.0 63 31.5
47 13.1 69 37.8
48 13.1 65 31.6
49 13.4 73 43.1
50 13.8 69 36.5
51 13.8 77 43.3
52 14.3 64 41.3
53 14.3 77 58.9
54 14.6 91 65.6
55 14.8 90 59.3
56 14.9 68 41.4
57 15.1 96 61.5
58 15.2 91 66.7
59 15.2 97 68.2
60 15.3 95 73.2
61 15.4 89 65.9
62 15.7 73 55.5
63 15.9 99 73.6
64 16.0 90 65.9
65 16.8 90 71.4
66 17.8 91 80.2
67 18.3 96 93.8
68 18.3 100 97.9
69 19.4 94 107.0
70 23.4 104 163.5

Table A.11. Radioactivity and the molar concentration of nifedipene

Number   log10(NIF Concentration)   Total Counts for 5 × 10^-10 Molar NTD Additive

1 (0) 4403
2 (0) 5042
3 -11 5259
4 -11 5598
5 -10 4868
6 -10 4796
7 -9 3931
8 -9 4503
9 -8 2588
10 -8 3089
11 -7 2084
12 -7 3665
13 -6 2149
14 -6 2216
15 -5 1433
16 -5 1926
(0) indicates NIF concentration = 0.

Table A.12. Enzyme kinetics data. The response is the initial velocity of the
reaction

Number Substrate Inhibitor Concentration (I) Initial Velocity


S 0 3 10 30
1 Sl 25 0 0 0 0.0328
2 S2 50 0 0 0 0.0510
3 S3 100 0 0 0 0.0697
4 S4 200 0 0 0 0.0934
5 S5 400 0 0 0 0.0924
6 Sl 0 25 0 0 0.0153
7 S2 0 50 0 0 0.0327
8 S3 0 100 0 0 0.0536
9 S4 0 200 0 0 0.0716
10 S5 0 400 0 0 0.0904
11 Sl 0 0 25 0 0.0087
12 S2 0 0 50 0 0.0146
13 S3 0 0 100 0 0.0231
14 S4 0 0 200 0 0.0305
15 S5 0 0 400 0 0.0658
16 Sl 0 0 0 25 0.0039
17 S3 0 0 0 100 0.0094
18 S4 0 0 0 200 0.0175
19 S5 0 0 0 400 0.0398

Table A.13. Calcium data. Calcium uptake of cells suspended in a solution of radioactive calcium.

Number Time Calcium


t y
(min.) (nmoles/ mt)
1 0.45 0.34170
2 0.45 -0.00438
3 0.45 0.82531
4 1.30 1.77967
5 1.30 0.95384
6 1.30 0.64080
7 2.40 1.75136
8 2.40 1.27497
9 2.40 1.17332
10 4.00 3.12273
11 4.00 2.60958
12 4.00 2.57429
13 6.10 3.17881
14 6.10 3.00782
15 6.10 2.67061
16 8.05 3.05959
17 8.05 3.94321
18 8.05 3.43726
19 11.15 4.80735
20 11.15 3.35583
21 11.15 2.78309
22 13.15 5.13825
23 13.15 4.70274
24 13.15 4.25702
25 15.00 3.60407
26 15.00 4.15029
27 15.00 3.42484

Table A.14. Nitrogen concentration in American lakes

Number NIN TW TN
Xl X2 Y
1 5.548 0.137 2.590
2 4.896 2.499 3.770
3 1.964 0.419 1.270
4 3.586 1.699 1.445
5 3.824 0.605 3.290
6 3.111 0.677 0.930
7 3.607 0.159 1.600
8 3.557 1.699 1.250
9 2.989 0.340 3.450
10 18.053 2.899 1.096
11 3.773 0.082 1.745
12 1.253 0.425 1.060
13 2.094 0.444 0.890
14 2.726 0.225 2.755
15 1.758 0.241 1.515
16 5.011 0.099 4.770
17 2.455 0.644 2.220
18 0.913 0.266 0.590
19 0.890 0.351 0.530
20 2.468 0.027 1.910
21 4.168 0.030 4.010
22 4.810 3.400 1.745
23 34.319 1.499 1.965
24 1.531 0.351 2.555
25 1.481 0.082 0.770
26 2.239 0.518 0.720
27 4.204 0.471 1.730
28 3.463 0.036 2.860
29 1.727 0.721 0.760

X1: average influent nitrogen concentration.
X2: water retention time.
y: mean annual nitrogen concentration.

Table A.15. Reaction rate for the catalytic isomerization of n-pentane to isopentane

Run Partial Pressures (psia) Rate


Number Xl X2 X3 Y
1 205.8 90.9 37.1 3.541
2 404.8 92.9 36.3 2.397
3 209.7 174.9 49.4 6.694
4 401.6 187.2 44.9 4.722
5 224.9 92.7 116.3 0.593
6 402.6 102.2 128.9 0.268
7 212.7 186.9 134.4 2.797
8 406.2 192.6 134.9 2.451
9 133.3 140.8 87.6 3.196
10 470.9 144.2 86.9 2.021
11 300.0 68.3 81.7 0.896
12 301.6 214.6 101.7 5.084
13 297.3 142.2 10.5 5.686
14 314.0 146.7 157.1 1.193
15 305.7 142.0 86.0 2.648
16 300.1 143.7 90.2 3.303
17 305.4 141.1 87.4 3.054
18 305.2 141.5 87.0 3.302
19 300.1 83.0 66.4 1.271
20 106.6 209.6 33.0 11.648
21 417.2 83.9 32.9 2.002
22 251.0 294.4 41.5 9.604
23 250.3 148.0 14.7 7.754
24 145.1 291.0 50.2 11.590

X1: partial pressure of hydrogen.
X2: partial pressure of n-pentane.
X3: partial pressure of iso-pentane.
y: rate of disappearance of n-pentane.

Table A.16. Car insurance data from McCullagh and Nelder. The response is the
average claim, in £. Also given are observation number and m, the number of
claims in each category

Vehicle age (VA)


PA VG 0-3 4-7 8-9 10+
N. £ m N. £ m N. £ m N. £ m
17-20 A 1 289 8 2 282 8 3 133 4 4 160 1
B 5 372 10 6 249 28 7 288 1 8 11 1
C 9 189 9 10 288 13 11 179 1
D 12 763 3 13 850 2
21-24 A 14 302 18 15 194 31 16 135 10 17 166 4
B 18 420 59 19 243 96 20 196 13 21 135 3
C 22 268 44 23 343 39 24 293 7 25 104 2
D 26 407 24 27 320 18 28 205 2
25-29 A 29 268 56 30 285 55 31 181 17 32 110 12
B 33 275 125 34 243 172 35 179 36 36 264 10
C 37 334 163 38 274 129 39 208 18 40 150 8
D 41 383 72 42 305 50 43 116 6 44 636 1
30-34 A 45 236 43 46 270 53 47 160 15 48 110 12
B 49 259 179 50 226 211 51 161 39 52 107 19
C 53 340 197 54 260 125 55 189 30 56 104 9
D 57 400 104 58 349 55 59 147 8 60 65 2
35-39 A 61 207 43 62 129 73 63 157 21 64 113 14
B 65 208 191 66 214 219 67 149 46 68 137 23
C 69 251 210 70 232 131 71 204 32 72 141 8
D 73 233 119 74 325 43 75 207 4
40-49 A 76 254 90 77 213 98 78 149 35 79 98 22
B 80 218 380 81 209 434 82 172 97 83 110 59
C 84 239 401 85 250 253 86 174 50 87 129 15
D 88 387 199 89 299 88 90 325 8 91 137 9
50-59 A 92 251 69 93 227 120 94 172 42 95 98 35
B 96 196 366 97 229 353 98 164 95 99 132 45
C 100 268 310 101 250 148 102 175 33 103 152 13
D 104 391 105 105 228 46 106 346 10 107 167 1
60+ A 108 264 64 109 198 100 110 167 43 111 114 53
B 112 224 228 113 193 233 114 178 73 115 101 44
C 116 269 183 117 258 103 118 227 20 119 119 6
D 120 385 62 121 324 22 122 192 6 123 123 6

Table A.17. Dielectric breakdown strength in kilovolts from a 4 × 8 factorial experiment

Number Xl X2 Y Number Xl X2 Y
1 1 180 15.0 44 4 250 13.5
2 1 180 17.0 45 4 275 10.0
3 1 180 15.5 46 4 275 11.5
4 1 180 16.5 47 4 275 11.0
5 1 225 15.5 48 4 275 9.5
6 1 225 15.0 49 8 180 15.0
7 1 225 16.0 50 8 180 15.0
8 1 225 14.5 51 8 180 15.5
9 1 250 15.0 52 8 180 16.0
10 1 250 14.5 53 8 225 13.0
11 1 250 12.5 54 8 225 10.5
12 1 250 11.0 55 8 225 13.5
13 1 275 14.0 56 8 225 14.0
14 1 275 13.0 57 8 250 12.5
15 1 275 14.0 58 8 250 12.0
16 1 275 11.5 59 8 250 11.5
17 2 180 14.0 60 8 250 11.5
18 2 180 16.0 61 8 275 6.5
19 2 180 13.0 62 8 275 5.5
20 2 180 13.5 63 8 275 6.0
21 2 225 13.0 64 8 275 6.0
22 2 225 13.5 65 16 180 18.5
23 2 225 12.5 66 16 180 17.0
24 2 225 12.5 67 16 180 15.3
25 2 250 12.5 68 16 180 16.0
26 2 250 12.0 69 16 225 13.0
27 2 250 11.5 70 16 225 14.0
28 2 250 12.0 71 16 225 12.5
29 2 275 13.0 72 16 225 11.0
30 2 275 11.5 73 16 250 12.0
31 2 275 13.0 74 16 250 12.0
32 2 275 12.5 75 16 250 11.5
33 4 180 13.5 76 16 250 12.0
34 4 180 17.5 77 16 275 6.0
35 4 180 17.5 78 16 275 6.0
36 4 180 13.5 79 16 275 5.0
37 4 225 12.5 80 16 275 5.5
38 4 225 12.5 81 32 180 12.5
39 4 225 15.0 82 32 180 13.0
40 4 225 13.0 83 32 180 16.0
41 4 250 12.0 84 32 180 12.0
42 4 250 13.0 85 32 225 11.0
43 4 250 12.0 86 32 225 9.5

Table A.17. Dielectric breakdown strength (concluded)

Number Xl X2 Y Number Xl X2 Y
87 32 225 11.0 108 48 250 7.9
88 32 225 11.0 109 48 275 1.2
89 32 250 11.0 110 48 275 1.5
90 32 250 10.0 111 48 275 1.0
91 32 250 10.5 112 48 275 1.5
92 32 250 10.5 113 64 180 13.0
93 32 275 2.7 114 64 180 12.5
94 32 275 2.7 115 64 180 16.5
95 32 275 2.5 116 64 180 16.0
96 32 275 2.4 117 64 225 11.0
97 48 180 13.0 118 64 225 11.5
98 48 180 13.5 119 64 225 10.5
99 48 180 16.5 120 64 225 10.0
100 48 180 13.6 121 64 250 7.2
101 48 225 11.5 122 64 250 7.5
102 48 225 10.5 123 64 250 6.7
103 48 225 13.5 124 64 250 7.6
104 48 225 12.0 125 64 275 1.5
105 48 250 7.0 126 64 275 1.0
106 48 250 6.9 127 64 275 1.2
107 48 250 8.8 128 64 275 1.2

X1: Time (weeks).
X2: Temperature (°C).
y: Dielectric breakdown strength in kilovolts.

Table A.18. Deaths in British Train Accidents.

Number Month Year Rolling Stock mt y


1 9 97 2 0.436 7
2 8 96 2 0.424 1
3 3 96 3 0.424 1
4 1 95 2 0.426 1
5 10 94 1 0.419 5
6 6 94 1 0.419 2
7 7 91 1 0.439 4
8 1 91 1 0.439 2
9 8 90 2 0.431 1
10 4 89 3 0.436 1
11 3 89 1 0.436 2
12 3 89 1 0.436 5
13 12 88 1 0.443 35
14 11 88 2 0.443 1
15 10 87 1 0.397 4
16 9 86 2 0.414 1
17 9 86 1 0.414 2
18 4 86 3 0.414 1
19 3 86 2 0.414 1
20 12 84 2 0.389 3
21 12 84 1 0.389 1
22 10 84 2 0.389 3
23 7 84 2 0.389 13
24 2 84 3 0.389 2
25 12 83 1 0.401 1
26 2 83 2 0.401 1
27 12 82 1 0.372 1
28 12 81 1 0.417 4
29 12 81 2 0.417 1
30 11 80 3 0.430 2
31 3 80 1 0.430 1
32 10 79 1 0.426 5
33 4 79 1 0.426 7

Rolling stock 1: Mark 1 train.


Rolling stock 2: Post-Mark 1 train.
Rolling stock 3: Non-passenger.
mt: Amount of traffic on the railway system (billions of train km) .
y: Fatalities.

Table A.18. Deaths in British Train Accidents (concluded)

Number Month Year Rolling Stock mt y


34 2 79 1 0.426 1
35 12 78 1 0.430 1
36 12 78 1 0.430 3
37 9 77 1 0.425 2
38 11 76 3 0.426 1
39 1 76 3 0.426 2
40 10 75 2 0.436 1
41 8 75 3 0.436 2
42 6 75 1 0.436 6
43 1 75 2 0.436 1
44 10 74 3 0.452 1
45 6 74 1 0.452 1
46 12 73 1 0.433 10
47 8 73 1 0.433 5
48 4 73 3 0.433 1
49 9 72 3 0.431 1
50 6 72 1 0.431 6
51 12 71 3 0.444 3
52 10 71 3 0.444 1
53 7 71 1 0.444 2
54 2 71 1 0.444 1
55 5 70 1 0.452 2
56 12 69 2 0.447 1
57 5 69 1 0.447 1
58 5 69 1 0.447 6
59 4 69 2 0.447 2
60 3 69 1 0.447 2
61 1 69 1 0.447 4
62 9 68 1 0.449 2
63 11 67 1 0.459 49
64 9 67 1 0.459 1
65 7 67 1 0.459 7
66 3 67 1 0.459 5
67 2 67 1 0.459 9

Table A.19. Number of cells showing differentiation in a 4² experiment

Number Xl X2 Y
1 0 0 11
2 0 4 18
3 0 20 20
4 0 100 39
5 1 0 22
6 1 4 38
7 1 20 52
8 1 100 69
9 10 0 31
10 10 4 68
11 10 20 69
12 10 100 128
13 100 0 102
14 100 4 171
15 100 20 180
16 100 100 193
X1: Dose of TNF (U/ml).
X2: Dose of IFN (U/ml).
y: Number of cells differentiating.

Table A.20 . Bliss's beetle data on the effect of an insecticide

Number Dose Killed Total

1 49.09 6 59
2 52.99 13 60
3 56.91 18 62
4 60.84 28 56
5 64.76 52 63
6 68.69 53 59
7 72.61 61 62
8 76.54 60 60

Table A.21. Number of mice with convulsions after treatment with insulin

Number Dose Preparation With Convulsions Total

1 3.4 0 0 33
2 5.2 0 5 32
3 7.0 0 11 38
4 8.5 0 14 37
5 10.5 0 18 40
6 13.0 0 21 37
7 18.0 0 23 31
8 21.0 0 30 37
9 28.0 0 27 30
10 6.5 1 2 40
11 10.0 1 10 30
12 14.0 1 18 40
13 21.5 1 21 35
14 29.0 1 27 37
Preparation: 0 = Standard, 1 = Test.

Table A.22. Toxoplasmosis incidence and rainfall in 34 cities in El Salvador

Observation Number   Rain (mm)   Number of Cases   Total No. of Children
1 1735 2 4
2 1936 3 10
3 2000 1 5
4 1973 3 10
5 1750 2 2
6 1800 3 5
7 1750 2 8
8 2077 7 19
9 1920 3 6
10 1800 8 10
11 2050 7 24
12 1830 0 1
13 1650 15 30
14 2200 4 22
15 2000 0 1
16 1770 6 11
17 1920 0 1
18 1770 33 54
19 2240 4 9
20 1620 5 18
21 1756 2 12
22 1650 0 1
23 2250 8 11
24 1796 41 77
25 1890 24 51
26 1871 7 16
27 2063 46 82
28 2100 9 13
29 1918 23 43
30 1834 53 75
31 1780 8 13
32 1900 3 10
33 1976 1 6
34 2292 23 37

Table A.23. Finney's data on vasoconstriction in the skin of the fingers

Number Volume Xl Rate X2 y


1 3.70 0.825 1
2 3.50 1.090 1
3 1.25 2.500 1
4 0.75 1.500 1
5 0.80 3.200 1
6 0.70 3.500 1
7 0.60 0.750 0
8 1.10 1.700 0
9 0.90 0.750 0
10 0.90 0.450 0
11 0.80 0.570 0
12 0.55 2.750 0
13 0.60 3.000 0
14 1.40 2.330 1
15 0.75 3.750 1
16 2.30 1.640 1
17 3.20 1.600 1
18 0.85 1.415 1
19 1.70 1.060 0
20 1.80 1.800 1
21 0.40 2.000 0
22 0.95 1.360 0
23 1.35 1.350 0
24 1.50 1.360 0
25 1.60 1.780 1
26 0.60 1.500 0
27 1.80 1.500 1
28 0.95 1.900 0
29 1.90 0.950 1
30 1.60 0.400 0
31 2.70 0.750 1
32 2.35 0.030 0
33 1.10 1.830 0
34 1.10 2.200 1
35 1.20 2.000 1
36 0.80 3.330 1
37 0.95 1.900 0
38 0.75 1.900 0
39 1.30 1.625 1

y: 0 = nonoccurrence; 1 = occurrence.

Table A.24. Chapman's data on the incidence of heart disease as a function of age, cholesterol concentration and weight

Number Xl X4 X6 Y Number Xl X4 X6 Y
1 44 254 190 0 44 56 428 171 1
2 35 240 216 0 45 53 334 166 0
3 41 279 178 0 46 47 278 121 0
4 31 284 149 0 47 30 264 178 0
5 61 315 182 1 48 64 243 171 1
6 61 250 185 0 49 31 348 181 0
7 44 298 161 0 50 35 290 162 0
8 58 384 175 0 51 65 370 153 1
9 52 310 144 0 52 43 363 164 0
10 52 337 130 0 53 53 343 159 0
11 52 367 162 0 54 58 305 152 1
12 40 273 175 0 55 67 365 190 1
13 49 273 155 0 56 53 307 200 0
14 34 314 156 0 57 42 243 147 0
15 37 243 151 0 58 43 266 125 0
16 63 341 168 0 59 52 341 163 0
17 28 245 185 0 60 68 268 138 0
18 40 302 225 0 61 64 261 108 0
19 51 302 247 1 62 46 378 142 0
20 33 386 146 0 63 41 279 212 0
21 37 312 170 1 64 58 416 188 0
22 33 302 161 0 65 50 261 145 0
23 41 394 167 0 66 45 332 144 0
24 38 358 198 0 67 59 337 158 0
25 52 336 162 0 68 56 365 154 0
26 31 251 150 0 69 59 292 148 0
27 44 322 196 1 70 47 304 155 0
28 31 281 130 0 71 43 341 154 0
29 40 336 166 1 72 37 317 184 0
30 36 314 178 0 73 27 296 140 0
31 42 383 187 0 74 44 390 167 0
32 28 360 148 0 75 41 274 138 0
33 40 369 180 0 76 33 355 169 0
34 40 333 172 0 77 29 225 186 0
35 35 253 141 0 78 24 218 131 0
36 32 268 176 0 79 36 298 160 0
37 31 257 154 0 80 23 178 142 0
38 52 474 145 0 81 47 341 218 1
39 45 391 159 1 82 26 274 147 0
40 39 248 181 0 83 45 285 161 0
41 40 520 169 1 84 41 259 245 0
42 48 285 160 1 85 55 266 167 0
43 29 352 149 0 86 34 214 139 1
y : 0 = nonoccurrence; 1 = occurrence.

Table A.24. Chapman's data on heart disease (continued)

Number Xl X4 X6 Y Number Xl X4 X6 Y
87 51 267 150 0 130 51 286 134 0
88 58 256 175 0 131 37 260 188 0
89 51 273 123 0 132 28 252 149 0
90 35 348 174 0 133 44 336 175 0
91 34 322 192 0 134 35 216 126 0
92 26 267 140 0 135 41 208 165 0
93 25 270 195 0 136 29 352 160 0
94 44 280 144 0 137 46 346 155 0
95 57 320 193 0 138 55 259 140 0
96 67 320 134 0 139 32 290 181 0
97 59 330 144 0 140 40 239 178 0
98 62 274 179 0 141 61 333 141 0
99 40 269 III 0 142 29 173 143 0
100 52 269 164 0 143 52 253 139 0
101 28 135 168 0 144 25 156 136 0
102 34 403 175 0 145 27 156 150 0
103 43 294 173 0 146 27 208 185 0
104 38 312 158 0 147 53 218 185 0
105 45 311 154 0 148 42 172 161 0
106 26 222 214 0 149 64 357 180 0
107 35 302 176 0 150 27 178 198 0
108 51 269 262 0 151 55 283 128 1
109 55 311 181 0 152 33 275 177 0
110 45 286 143 0 153 58 187 224 0
111 69 370 185 1 154 51 282 160 0
112 58 403 140 0 155 37 282 181 0
113 64 244 187 0 156 47 254 136 0
114 70 353 163 0 157 49 273 245 0
115 27 252 164 0 158 46 328 187 0
116 53 453 170 0 159 40 244 161 1
117 28 260 150 0 160 26 277 190 0
118 29 269 141 0 161 28 195 180 0
119 23 235 135 0 162 23 206 165 0
120 40 264 135 0 163 52 327 147 0
121 53 420 141 0 164 42 246 146 0
122 25 235 148 0 165 27 203 182 0
123 63 420 160 1 166 29 185 187 0
124 48 277 180 1 167 43 224 128 0
125 36 319 157 0 168 34 246 140 0
126 28 386 189 1 169 40 227 163 0
127 57 353 166 0 170 28 229 144 0
128 39 344 175 0 171 30 214 150 0
129 52 210 172 1 172 34 206 137 0

Table A.24. Chapman's data on heart disease (concluded)

Number Xl X4 X6 Y
173 26 173 141 0
174 34 248 141 0
175 35 222 190 0
176 34 230 167 0
177 45 219 159 0
178 47 239 157 0
179 54 258 170 0
180 30 190 132 0
181 29 252 155 0
182 48 253 178 0
183 37 172 168 0
184 43 320 159 1
185 31 166 160 0
186 48 266 165 0
187 34 176 194 0
188 42 271 191 1
189 49 295 198 0
190 50 271 212 1
191 42 259 147 0
192 50 178 173 1
193 60 317 206 0
194 27 192 190 0
195 29 187 181 0
196 29 238 143 0
197 49 283 149 0
198 49 264 166 0
199 50 264 176 0
200 31 193 141 0
Bibliography

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.


Agresti, A. (1996). An Introduction to Categorical Data Analysis. New
York: Wiley.
Aitkin, M., D. Anderson, B. Francis, and J. Hinde (1989). Statistical
Modelling in GLIM. Oxford: Oxford University Press.
Andrews, D. F. (1971). Sequentially designed experiments for screening
out bad models with F tests. Biometrika 58,427-432.
Atkinson, A. C. (1970). A method for discriminating between models
(with discussion). Journal of the Royal Statistical Society, Series
B 32, 323-353.
Atkinson, A. C. (1982). Regression diagnostics, transformations and con-
structed variables (with discussion). Journal of the Royal Statistical
Society, Series B 44, 1-36.
Atkinson, A. C. (1985). Plots, Transformations, and Regression. Oxford:
Oxford University Press.
Atkinson, A. C. (1986). Diagnostic tests for transformations.
Technometrics 28, 29-37.
Atkinson, A. C. (1994a). Fast very robust methods for the detection of
multiple outliers. Journal of the American Statistical Association 89,
1329-1339.

Atkinson, A. C. (1994b). Transforming both sides of a tree. American


Statistician 48, 307- 313.
Atkinson, A. C. and A. N. Donev (1992). Optimum Experimental
Designs. Oxford: Oxford University Press.
Atkinson, A. C. and A. J. Lawrance (1989). A comparison of asymptot-
ically equivalent tests of regression transformation. Biometrika 76,
223- 229.
Atkinson, A. C. and H.-M. Mulira (1993). The stalactite plot for the
detection of multivariate outliers. Statistics and Computing 3, 27- 35.
Atkinson, A. C., L. R. Pericchi, and R. L. Smith (1991). Grouped like-
lihood for the shifted power transformation. Journal of the Royal
Statistical Society, Series B 53, 473- 482.
Bartlett, M. (1951) . An inverse matrix adjustment arising in discriminant
analysis. Annals of Mathematical Statistics 22, 107-111.
Bates, D. M. and D. G. Watts (1980). Relative curvature measures
of nonlinearity (with discussion) . Journal of the Royal Statistical
Society, Series B 42, 1- 25.
Bates, D. M. and D. G. Watts (1988) . Nonlinear Regression Analysis
and Its Applications. New York: Wiley.
Bedrick, E. J. and J. R. Hill (1990). Outlier tests for logistic regression.
Biometrika 77, 815-827.
Belsley, D. A., E . Kuh, and R. E. Welsch (1980). Regression Diagnostics.
New York: Wiley.
Bickel, P. J. and K. A. Doksum (1981) . An analysis of transformations
revisited. Journal of the American Statistical Association 76, 296-
311.
Bliss, C. I. (1935). The calculation of the dosage-mortality curve. Annals
of Applied Biology 22, 134- 167.
Box, G. E. P. and D. R. Cox (1964). An analysis of transformations (with
discussion). Journal of the Royal Statistical Society, Series B 26,
211-246.
Box, G. E. P. and D. R. Cox (1982). An analysis of transformations re-
visited, rebutted. Journal of the American Statistical Association 77,
209- 210.
Box, G. E. P. and W. J. Hill (1974) . Correcting inhomogeneity of variance
with power transformation weighting. Technometrics 16, 385- 389.
Box, G. E. P. and P. W. Tidwell (1962). Transformations of the
independent variables. Technometrics 4, 531- 550.

Breiman, L. and J. H. Friedman (1985). Estimating optimal transforma-


tions for multiple regression and transformation (with discussion).
Journal of the American Statistical Association 80, 580- 619.
Brown, P. J. (1993). Measurement, Regression, and Calibration. Oxford:
Oxford University Press.
Brownlee, K. A. (1965). Statistical Theory and Methodology in Science
and Engineering (2nd edition). New York: Wiley.
Bruce, D. and F. X. Schumacher (1935). Forest Mensuration. New York:
McGraw- Hill.
Carr, N. L. (1960). Kinetics of catalytic isomerisation of n-pentane.
Industrial and Engineering Chemistry 52, May, 391- 396.
Carroll, R. J. and D. Ruppert (1988). Transformation and Weighting in
Regression. London: Chapman and Hall.
Casella, G. and R. L. Berger (1990). Statistical Inference. New York:
Springer- Verlag.
Cerioli, A. and M. Riani (1999). The ordering of spatial data and the de-
tection of multiple outliers. Journal of Computational and Graphical
Statistics 8, 239- 258.
Chambers, E. A. and D. R. Cox (1967). Discrimination between
alternative binary response models. Biometrika 54, 573- 578.
Chatterjee, S. and A. S. Hadi (1988). Sensitivity Analysis in Linear
Regression. New York: Wiley.
Christensen, R. (1990). Log-Linear Models. New York: Springer-Verlag.
Christmann, A. (1994). Least median of weighted squares in logistic
regression with large strata. Biometrika 81, 413- 417.
Cook, R. D. (1977). Detection of influential observations in linear
regression. Technometrics 19, 15- 18.
Cook, R. D. (1994) . Regression Graphics. New York: Wiley.
Cook, R. D. and P. Prescott (1981). Approximate significance levels for
detecting outliers in linear regression. Technometrics 23, 59- 64.
Cook, R. D. and P. Wang (1983) . Transformations and influential cases
in regression. Technometrics 25, 337- 343.
Cook, R. D. and S. Weisberg (1982). Residuals and Influence in
Regression. London: Chapman and Hall.
Cook, R. D. and S. Weisberg (1994a). An Introduction to Regression
Graphics. New York: Wiley.

Cook, R. D. and S. Weisberg (1994b). Transforming a response variable


to linearity. Biometrika 81, 731- 737.
Cook, R. D. and S. Weisberg (1999). Applied Regression Including
Computing and Graphics. New York: Wiley.
Cook, R. D. and J. A. Witmer (1985). A note on parameter-effects
curvature. Journal of the American Statistical Association 80,
872- 878.
Cox, D. R. (1962). Further results on tests of separate families of
hypotheses. Journal of the Royal Statistical Society, Series B 24,
406- 424.
Dempster, A.P., M. R. Selwyn, C. M. Patel, and A. J. Roth (1984). Sta-
tistical and computational aspects of mixed model analysis. Applied
Statistics 33, 203- 214.
Dobson, A. (1990) . An Introduction to Generalized Linear Models.
London: Chapman and Hall.
Draper, N. R. and H. Smith (1998). Applied Regression Analysis (3rd
Edition). New York: Wiley.
Efron, B. (1978). Regression and ANOVA with zero- one data: mea-
sures of residual variation. Journal of the American Statistical
Association 73, 113-121.
Evans, A. W. (2000). Fatal train accidents on Britain's main line
railways. Journal of the Royal Statistical Society, Series A 163,
99- 119.
Fahrmeir, L. and G. Tutz (1994). Multivariate Statistical Modelling
Based on Generalized Linear Models. Berlin: Springer-Verlag.
Fairley, D. (1986). Cherry trees with cones? American Statistician 40,
138-139.
Farrell, R. H. , J. Kiefer, and A. Walbran (1967). Optimum multivariate
designs. In Proceedings of the Fifth Berkeley Symposium, Vol. 1, pp.
113- 138. Berkeley: University of California Press.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data
(2nd edition). Cambridge, MA: M.I.T. Press.
Finney, D. J. (1947). The estimation from individual records of the
relationship between dose and quantal response. Biometrika 34 ,
320-334.
Finney, D. J. (1978) . Statistical Methods in Biological Assay. London:
Griffin.

Firth, D. (1991). Generalized linear models. In D. V. Hinkley, N. Reid,


and E. J . Snell (Eds.), Statistical Theory and Modelling, pp. 55-82.
London: Chapman and Hall.
Flury, B. (1997). A First Course in Multivariate Statistics. New York:
Springer-Verlag.
Godfrey, L. G. (1991). Misspecijication Tests in Econometrics.
Cambridge: Cambridge University Press.
Gunst, R. F. and R. L. Mason (1980). Regression Analysis and Its
Application. New York: Dekker.
Hadi, A. S. (1992). Identifying multiple outliers in multivariate data.
Journal of the Royal Statistical Society, Series B 54, 761- 771.
Hadi, A. S. and J. S. Simonoff (1993). Procedures for the identification of
multiple outliers in linear models. Journal of the American Statistical
Association 88, 1264-1272.
Hadi, A. S. and J. S. Simonoff (1994). Improving the estimation and
outlier identification properties of the least median of squares and
minimum volume ellipsoid estimators. Parisankhyan Sammikkha 1,
61-70.
Hakkila, P. (1989). Utilization of Residual Forest Biomass. Berlin:
Springer-Verlag.
Hauck, W. H. and A. Donner (1977) . Wald's test as applied to hypotheses
in logit analysis. Journal of the American Statistical Association 72,
851- 853.
Hawkins, D. M., D. Bradu, and G. V. Kass (1984). Location of
several outliers in multiple-regression data using elemental sets.
Technometrics 26, 197-208.
Henderson, H. V. and S. R. Searle (1981). On deriving the inverse of a
sum of matrices. SIAM Review 23, 53-60.
Hoaglin, D. C., F. Mosteller, and J. W. Tukey (1983). Understanding
Robust and Exploratory Data Analysis. New York: Wiley.
Hunter, W. G. and A. C. Atkinson (1965). Planning experiments for
fundamental process characterization. Technical report, Department
of Statistics, University of Wisconsin, Madison, WI.
Kianifard, F. and W. H. Swallow (1989). Using recursive residuals, calcu-
lated on adaptively-ordered observations, to identify outliers in linear
regression. Biometrics 45, 571- 585.
Lauritzen, S. L. (1996). Graphical Models. Oxford: Oxford University
Press.

Lawrance, A. J. (1987). The score statistic for regression transformation.


Biometrika 74, 275-289.
Lee, A. H. and W. K. Fung (1997). Confirmation of multiple outliers in
generalized linear and nonlinear regressions. Computational Statistics
and Data Analysis 25, 55- 65.
Levenberg, K. (1944). A method for the solution of certain nonlinear
problems in least squares. Quarterly of Applied Mathematics 2, 164-
168.
Li, K. C. (1991). Sliced inverse regression for dimension reduction (with
discussion). Journal of the American Statistical Association 86, 316-
342.
Lindsey, J. K. (1995). Modelling Frequency and Count Data. Oxford:
Oxford University Press.
Lindsey, J. K. (1997). Data Analysis with Generalized Linear Models.
New York: Springer-Verlag.
Mantel, N. (1987). Understanding Wald's test for exponential families.
The American Statistician 41, 147- 148.
Marquardt, D. W. (1963). An algorithm for the estimation of non-linear
parameters. SIAM Journal 11, 431-441.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd
edition). London: Chapman and Hall.
Morgan, B. J. T. (1992). Analysis of Quantal Response Data. London:
Chapman and Hall.
Morgenthaler, S. (1992). Least-absolute-deviations fits for generalized
linear models. Biometrika 79, 747-754.
Nelson, W. (1981). The analysis of performance-degradation data. IEEE
Transactions on Reliability R-30, 149- 155.
Piegorsch, W. W., C. R. Weinberg, and B. H. Margolin (1988). Explor-
ing simple independent action in multifactor tables of proportions.
Biometrics 44, 595- 603.
Plackett, R. L. (1950). Some theorems in least squares. Biometrika 37,
149- 157.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of
Statistics 9, 705-724.
Pritchard, D. J. and D. W. Bacon (1977). Accounting for
heteroscedasticity in experimental design. Technometrics 19,
109- 115.

Pritchard, D. J., J. Downie, and D. W. Bacon (1977) . Further considera-


tion of heteroscedasticity in fitting kinetic models. Technometrics 19,
227- 236.
Ranneby, P. (Ed.) (1982). Statistics in Theory and Practice (Essays in
Honour of Bertin Matern). Umea: Swedish University of Agricultural
Science.
Ratkowsky, D. A. (1983). Nonlinear Regression Modelling: A Unified
Practical Approach. New York: Dekker.
Rawlings, J. 0., S. G. Pantula, and D. A. Dickey (1998). Applied
Regression Analysis: A Research Tool. New York: Springer- Verlag.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of
the American Statistical Association 79, 871- 880.
Rousseeuw, P. J. and A. M. Leroy (1987). Robust Regression and Outlier
Detection. New York: Wiley.
Rousseeuw, P. J. and B. C. van Zomeren (1990). Unmasking multivari-
ate outliers and leverage points. Journal of the American Statistical
Association 85, 633- 639.
Ruppert, D. and R. J. Carroll (1980). Trimmed least squares esti-
mation in the linear model. Journal of the Amer-ican Statistical
Association 75, 828- 838.
Ruppert , D., N. Cressie, and R. J. Carroll (1989). A transforma-
tion/weighting model for estimating Michaelis-Menten parameters.
Biometrics 45, 637- 656.
Ryan, T. P. (1997). Modern Regression Methods. New York: Wiley.
Schumacher, F. X. and F. d. S. Hall (1933). Logarithmic expression of
timber tree volume. Journal of Agricultural Research 45, 741- 756.
Seber, G. A. F. and C. J. Wild (1989). Nonlinear Regression. New York:
Wiley.
Sherman, J. and W. J. Morrison (1949). Adjustment of an inverse matrix
corresponding to changes in the elements of a given column or a
given row of the original matrix (abstract). Annals of Mathematical
Statistics 20, 621.
Shih, J.-Q. (1993). Regression transformation diagnostics in transform-
both-sides model. Statistics and Probability Letters 16, 411- 420.
Spurr, S. H. (1952). Forest Inventory. New York: Ronald.
Srinivasan, R. and A. A. Levi (1963). Kinetics of the thermal isomerisa-
tion of bicyclo hexane. Journal of the American Chemical Society 85,
3363-3365.

St Laurent, R. T. and R. D. Cook (1993). Leverage, local influence and


curvature in nonlinear regression. Biometrika 80, 99-106.
Stefanski, L. A., R. J. Carroll, and D. Ruppert (1986). Opti-
mally bounded score functions for generalized linear models with
applications to logistic regression. Biometrika 73, 413-424.
Stromberg, A. (1993). Computation of high breakdown nonlinear regres-
sion parameters. Journal of the American Statistical Association 88,
237- 244.
Stromberg, A. J. and D. Ruppert (1992). Breakdown in nonlinear
regression. Journal of the American Statistical Association 87,
991- 997.
Væth, M. (1985). On the use of Wald's test in exponential families.
International Statistical Review 53, 199-214.
Venables, W. N. and B. D. Ripley (1997). Modern Applied Statistics with
S-Plus (2nd edition). New York: Springer-Verlag.
Weisberg, S. (1985). Applied Linear Regression (2nd edition). New York:
Wiley.
Williams, D. A. (1987). Generalized linear model diagnostics using the
deviance and single case deletions. Applied Statistics 36, 181-191.
Woodbury, M. (1950). Inverting modified matrices. Technical Report 42,
Statistical Techniques Research Group, Princeton University.
Woodruff, D. and D. M. Rocke (1994). Computable robust estimation of
multivariate location and shape in high dimension using compound
estimators. Journal of the American Statistical Association 89, 888-
896.
Author Index

Agresti, A., 265, 311 Carr, N. L., 170, 313


Aitkin, M., 247, 258, 311 Carroll, R. J. , 62, 123, 154, 176,
Anderson, D. , 247, 258, 311 265, 313, 317, 318
Andrews, D. F., 98, 311 Casella, G., 185, 313
Atkinson, A. C., 21, 22, 25, 28, 30, Cerioli, A. , 33, 313
33, 35, 51, 63, 66, 76, 82, Chambers, E. A., 231, 313
87, 88, 113, 124, 127, 173, Chatterjee, S., 28, 35, 313
174, 265,311,312, 315 Christensen, R., 259, 262, 313
Christmann, A. , 265, 313
Bacon, D. W , 172, 173, 316, 317 Cook, R. D. , 18, 21, 25, 28, 35, 67,
Bartlett, M., 35, 312 78, 87, 116, 147, 152, 173,
Bates, D. M. , 136, 138, 143, 145, 313, 314, 318
147, 151, 171 , 172, 176, 312 Cox, D. R. , 9, 81 , 82, 88, 95, 96,
Bedrick, E. J. , 265, 312 130, 231 , 265, 312-314
Belsley, D. A., 35, 312 Cressie, N., 154, 176, 317
Berger, R. L. , 185, 313
Bickel, P. J., 130,312
Bliss, C. I., 181, 312 Dempster, A. P. , 27, 314
Box, G. E. P. , 9, 81, 82, 87, 88, 95, Dickey, D. A., 159, 317
96, 130, 172, 312 Dobson, A., 265, 314
Bradu, D. , 73, 315 Doksum, K. A. , 130, 312
Breiman, L., 67, 72, 313
Brown, P. J., 15, 313 Donner, A., 251, 263, 266, 315
Brownlee, K. A., 50, 313 Downie, J. , 172, 173, 317
Bruce, D., 124, 125, 313 Draper, N. R., 174, 314

Efron, B. , 238, 314 Mantel, N. , 252, 316


Evans, A. W., 180, 314 Margolin, B. H, 226, 230, 316
Marquardt, D. W. , 148, 316
Fahrmeir, L., 226, 230, 247, 265, Mason, R L., 73, 128, 315
314 McCullagh, P., 82, 183, 188, 194,
Fairley, D., 126, 314 197, 200, 204, 209, 223, 249,
Farrell, R H., 76, 314 251, 265, 316
Fienberg, S. E ., 265, 314 Morgan, B. J . T. , 265 , 316
Finney, D. J., 234, 247, 314 Morgenthaler, S., 265, 316
Firth, D., 187, 265, 315 Morrison, W . J., 35, 317
Flury, B. , 181 , 315 Mosteller, F. , 256, 315
Francis, B., 247, 258, 311 Mulira, H.-M., 33, 312
Friedman, J . H. , 67, 72, 313
Fung, W. K., 154, 159, 245, 316 Nelder, J. A., 82, 183, 188, 194,
197, 200, 204, 209, 223, 249,
Godfrey, L. G., 265, 315 251, 265, 316
Gunst, R F., 73, 128, 315 Nelson, W., 209, 316

Pantula, S. G., 159, 317


Hadi, A. S., 28, 30, 33, 35, 313, 315
Patel, C. M., 27, 314
Hakkila, P., 126, 315
Pericchi, L. R , 127, 312
Hall, F. dos S. , 125, 317
Piegorsch, W. W., 226, 230, 316
Hauck, W. H., 251, 263, 266, 315
Plackett, R L. , 35, 316
Hawkins, D. M. , 73, 315
Pregibon, D., 247, 256, 316
Henderson, H. V., 35, 315
Prescott, P., 78, 313
Hill, J. R, 265, 312
Pritchard, D. J ., 172, 173, 316, 317
Hill, W. J. , 172, 312
Hinde, J., 247, 258, 311 Ranneby, P., 126, 317
Hinkley, D. V. , 187, 265, 315 Ratkowsky, D. A. , 136, 317
Hoaglin, D. C., 256, 315 Rawlings, J. 0 ., 159, 317
Hunter, W. G., 174,315 Reid , N., 187, 265, 315
Riani, M., 33, 313
Kass, G. V. , 73, 315 Ripley, B. D., 251, 265, 318
Kianifard , F. , 35, 315 Rocke, D. M., 30, 266, 318
Kiefer, J., 76, 314 Roth, A. J., 27, 314
Kuh, E., 35, 312 Rousseeuw, P. J., 29, 35, 73, 74,
317
Lauritzen, S. L., 265, 315 Ruppert, D, 173, 318
Lawrance, A. J ., 127, 312, 316 Ruppert , D. , 62, 123, 154, 176,
Lee, A. H., 154, 159, 245, 316 265, 313, 317, 318
Leroy, A. M. , 35, 73, 74, 317 Ryan, T. P. , 35, 317
Levenberg, K , 148, 316
Levi, A. A., 174, 317 Schumacher, F. X., 124, 125, 313,
Li, K C., 35, 316 317
Lindsey, J. K, 234, 265, 316 Searle, S. R, 35, 315

Seber, G. A. F ., 136, 148, 172, 317 Tutz, G., 226, 230, 247, 265, 314
Selwyn, M. R, 27, 314
Sherman, J. , 35, 317 van Zomeren, B. C., 29, 317
Shih, J -. Q. , 126, 317 Venables, W. N., 251 , 265, 318
Simonoff, J. S., 30, 33, 315 Væth, M., 252, 318
Smith, H., 174, 314
Smith, R L., 127, 312 Walbran, A., 76, 314
Snell, E. J. , 187, 265, 315 Wang, P.C., 21, 313
Spurr, S. H., 125, 126, 317 Watts, D. G., 136, 138, 143, 145,
Srinivasan, R, 174, 317 147,151,171,172, 176,312
Stefanski, L. A. , 265, 318 Weinberg, C. R, 226, 230, 316
Stromberg, A., 154, 164, 165, 173, Weisberg, S., 2, 18, 28, 35, 67, 80,
318 87, 116, 313, 314, 318
Stromberg, A. J., 173,318 Welsch, R E., 35 , 312
Swallow, W. H., 35, 315 Wild, C. J. , 136, 148, 172, 317
St Laurent, R T. , 152, 173, 318 Williams, D. A., 200, 318
Witmer, J. A., 147, 314
Tidwell, P. W., 87, 312 Woodbury, M., 35, 318
Tukey, J. W ., 256, 315 Woodruff, D.L., 30, 266, 318
Subject Index

absolute comparison of models, boring plots, 12, 89


233 Box and Cox link, 184, 204
acceleration vector, 146 Box and Cox transformation,
added variable, 20 82-85, 123, 183
analysis of deviance, 185, 194
arcsine link, 184, 231, 253-256, 261 canonical parameterization, 188
assumptions in regression, 14, 17 chain rule, 192
asymmetrical link, 234, 235 combinatorial explosion, 103
asymptotic approximations, 140, complementary log log link, 184,
194, 195, 267 232
failure, 251 compound Poisson process, 225
rate of convergence, 254 confidence interval for .x, 92
asymptotics, 138 constructed variable, 20, 85, 87
both sides, 124
backwards deletion, 2, 4, 245 goodness of link, 201
balanced search, 250, 260 plot for transformation, 86, 93,
banana-shaped contour, 138 99, 103, 108
beautiful plot, 107 contingency tables, 182, 222
binary data, 246-265 convergence criterion, 257
correlated t statistics, 253, 258, Cook's distance, 25, 27
263 forward form, 34
binomial data, 181 generalized linear model, 200,
links, 231 202
binomial distribution, 181, 230 modified, 202
binomial models, 230-232 modified, 25

forward form, 34 enzyme kinetics, 154-159, 295


nonlinear Forbes' data, 2-5, 278
forward form, 151 Hawkins' data, 43-50, 282- 284
cumulative distribution function, isomerization of n-pentane,
181 170- 173, 298
curvature, 141, 143, 145 mice with convulsions, 234-238,
intrinsic, 144, 146 305
forward form, 147 multiple regression data, 5-9,
parameter effects, 144, 146 279- 280
forward form, 147 mussels' muscles, 116-121,
290- 291
dataset, see example nitrogen in lakes, 164- 170, 297
deletion diagnostics, 1, 27 ozone data, 67-72, 110- 111,
failure, 105, 159, 258 287-288
deletion estimates, 23 poison data, 95-97, 289
deletion score test, 93, 101, 105 doubly modified, 101- 103
deviance, 194-197 modified, 98-100
binary data, 249, 261 multiply modified, 104- 110
binomial, 231 radioactivity and molar
gamma, 203 concentration of nifedipene,
Poisson, 222 151- 154,294
discrete data, 179 salinity data, 62-66, 286
discriminant analysis, 259, 276 short leaf pine, 124- 126, 292-293
dispersion parameter, 185, 187 stack loss data, 50- 62, 111- 116,
estimation, 197, 206 285
dummy variable, 213- 219 toxoplasmosis and rainfall,
effect on leverage, 216 238- 246, 306
vasoconstriction data, 246- 248,
ED50, 267, 274 256- 259, 307
elemental set, 29, 106 wool data, 9-12, 88- 95, 281
example expectation plane, 140, 142
bicyclo hexane data, 174 expectation surface, 140, 143, 146,
biochemical oxygen demand, 174 175
Bliss's beetle data, 181, 232-234, expected information, 191, 192
304 experimental design, 172
British train accidents, 180, explained deviance, 195
222-225, 302- 303 exponential distribution, 209
calcium uptake, 159- 163, 296 exponential family, 185-186, 188
car insurance data, 204-208, 299 likelihood, 188, 195
cellular differentiation data,
226-230, 304 factorial structure, 90, 204, 266
Chapman data, 259- 265, effect on leverage, 76, 206
308- 310 scatter plot, 209
dielectric breakdown strength, fan plot, 89, 96, 98, 102, 204
209-221, 300- 301 Fisher scoring, 191, 193

folded power transformation, 128, inappropriate response surface


131 design, 173
forest mensuration, 125 indistinguishable t values, 171
forward plot induced heteroscedasticity, 172
S2, 12,48 initial subset, 31
t statistics, 8 insufficient transformation, 90
generalized linear model, 229 interaction plot, 210, 228
nonlinear residuals, 164 inverse fitted value plot, 87, 96, 99,
and systematic failure, 160, 163 108
Cook's distance, 163 inverse Gaussian distribution, 183,
curvature, 160 267
deviance, 236 inverse link function, 181, 184
deviance residuals, 204, 226 inverse regression, 116
binary data, 262 iteratively reweighted least
dispersion parameter, 206 squares, 185, 192, 253
goodness of link test, 204, 215, starting values, 194
225
leverage, 51
l'Hopital's rule, 86
nonlinear residuals, 152
R2 lack of fit sum of squares, 128, 134
large data sets, 266
nonlinear, 157
LD50, 268
R 2 ,12
least median of deviances, 204
residuals, 5, 8, 33, 45
least median of squares, 28, 73, 105
stability, 70, 126
forward search, 2, 4, 28-35, 189, least trimmed squares, 32
197, 201 leverage, 19, 53
binary data, 249 forward form, 34
recovery from poor start, 33 nonlinear, 150
generalized linear model, 201
c 2 , 222, 267, 272 likelihood ratio test for
gamma distribution, 187, 197 transformation, 84
gamma models, 202 linear predictor, 87, 180, 181, 193
Gauss-Newton algorithm, 139 linearized model, 139, 190
generalization of regression, 182 decreasing information, 167
generalized linear model, 182 silly confidence regions, 172
geometric mean, 82, 123, 130, 131 link function, 181-183
geometry of least squares, 140 log link, 182, 184
goodness of link test, 183, 200, log log link, 232
204, 213 logistic distribution, 182
logistic link, 182, 184
hat matrix, 18, 36, 198 logistic nonlinear regression, 152,
generalized linear model, 201 155
nonlinear model, 140 logistic regression, 184
logit link, 184
identity link, 184 long tailed, 105, 114

M estimators, 256 straightforward interpretation,


Mahalanobis distance, 2 241
Mallows C p , 80 overdispersion, 187, 242, 267, 271
Marquardt-Levenberg algorithm, overtransformation, 90
148
masking, 1, 22, 101, 239, 245 parameter estimates
maximum likelihood, 83, 84, 183, F tests, 19
189 t statistics, 8, 19, 21
mean shift outlier model, 26, 37, for generalized linear model,
78 198
model building, 70, 80 highly correlated, 171
model failure least squares, 17
isolated, 14, 219 of (12, 33, 85
systematic, 10, 14, 90, 125, 160 within cells, 128
moving forward, 32, 149, 201 , 250 stable, 5, 30, 206
parametric link function, 183
negative binomial distribution, 183 partially linear model, 149
Newton's law of cooling, 143 Pearson's chi-squared test , 197,
Newton's method, 139, 190 199, 222, 267
nonlinear least squares, 137 perfect fit , 248, 250- 253
nonlinear model, 137 in forward search, 252
error distribution, 141 similar t values, 258
nonnested hypotheses, 265 value of deviance, 257
nonnormal errors, 179 physically meaningful
nonobvious link, 200, 203 transformation, 96, 121
normal equations, 193 Poisson data, 180, 182
linear, 17 Poisson distribution, 180, 186, 221
nonlinear, 138 Poisson models, 221-222
partitioned, 20 power family link, 184, 203
normalized power transformation, probit analysis, 184
82 pro bit link, 184, 231
profile loglikelihood, 84, 88
observed information, 191
obvious link, 222 QQ plot, 45, 76, 105, 114
offset, 223 quadratic weight, 192, 194
ordering data, 9, 32, 91, 240 quasilikelihood, 187
effect of factorial , 90, 93
effect of link, 233, 243 R 2 ,20
effect of transformation, 31, 96, nonlinear model, 150, 157
112,117 radius of curvature, 147
outlier, 46 rearrangement to linearity, 157,
increase in curvature, 169 172,175
masked,6 reciprocal link, 184
reaction to, 3, 63, 170, 229 residual deviance, 195
single, 98 residuals

deletion, 23, 36, 200 simultaneous choice, 117


nonlinear, 151 steepest descent, 148
deviance, 199, 202 structured residual plot, 224, 247
studentized, 199 synergism, 226
forward forms, 34
least median of squares, 28, 45, tangent plane, 146
105, 155 tangent plane approximation, 144
least squares, 18 transform both sides, 121-124,
Pearson, 199 141, 154
studentized, 199 transformation
scaled, 18 of x, 87
studentized, 18 of x and y, 88, 116
nonlinear, 150 ofy, 9, 12, 66, 81-82
robustness, 29, 32, 253, 265 plot of residuals, 10, 28, 31, 89

saturated model, 195 uninformative deviance, 195, 231,


scaled deviance, 195, 203 249, 267, 274
scatterplot matrix, 6, 43
and outliers, 164 variable selection, 70, 78, 135, 171
score function, 190 variance function, 186-188
score test for transformation, 10, variance stabilizing transformation,
20, 55, 69, 85, 89 128, 130, 268
Lawrance's, 127 velocity vector, 145
Scotland,3
separate families, 265 Wald test, 251
shape of tree, 126 relationship with likelihood, 256
Sherman-Morrison-Woodbury Weibull growth model, 159
formula, 22, 35 weighted least squares, 190, 194
shifted power transformation, 127 working response, 194
simple power transformation, 83,
87 zero value of t statistics, 251, 254
simulation envelope, 4, 27, 245 almost zero, 257
simulation inference, 176, 249, 263,
265